[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/francisco-ortin/data-science-course/blob/main/statistics/missing.ipynb)
[![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

# Treating missing values

In this notebook, you will have to treat missing values of a subset of the housing dataset.

You must apply deletion and (single column) imputation/attribution to the dataset, so that it will no longer contain missing values.

In [2]:
# make sure the required packages are installed
%pip install numpy matplotlib pandas --quiet
# if running in colab, install the required packages and copy the necessary files
directory='data-science-course/statistics'
if get_ipython().__class__.__module__.startswith('google.colab'):
    !git clone --depth 1 https://github.com/francisco-ortin/data-science-course.git  2>/dev/null
    !cp --update {directory}/*.py .
    !mkdir -p img data
    !cp {directory}/img/* img/.
    !cp {directory}/data/* data/.

Note: you may need to restart the kernel to use updated packages.


## Loading the dataset

Let's start by loading the dataset.

In [7]:
import pandas as pd
data = pd.read_csv('data/missing_housing.csv')
number_of_instances = len(data)
print(f'Total number of instances: {number_of_instances:,}.')

Total number of instances: 20,818.


## Showing the number of missing values per column

Show the number of missing values per column in the dataset.

In [9]:
print("Number of missing values per column:")
columns_with_20_percent_missing_values = []
columns_with_50_percent_missing_values = []
columns_with_a_few_missing_values = []
for column in data.columns:
    number_of_missing_instances = data[column].isnull().sum()
    missing_values_percentage = number_of_missing_instances / number_of_instances * 100
    print(f'\t{column}: {number_of_instances:,}\t\t{number_of_missing_instances/number_of_instances*100:.2f}%')
    if 20 <= missing_values_percentage < 50:
        columns_with_20_percent_missing_values.append(column)
    if missing_values_percentage >= 50:
        columns_with_50_percent_missing_values.append(column)
    if number_of_missing_instances > 0 and missing_values_percentage < 20:
        columns_with_a_few_missing_values.append(column)
print(f'Columns with between 20% and 50% missing values: {columns_with_20_percent_missing_values}')
print(f'Columns with more than 50% missing values: {columns_with_50_percent_missing_values}')
print(f'Columns with missing values, less than 20%: {columns_with_a_few_missing_values}')

Number of missing values per column:
	Id: 20,818		0.00%
	Sold Price: 20,818		0.00%
	Type: 20,818		0.00%
	Year built: 20,818		1.95%
	Cooling: 20,818		86.67%
	Lot: 20,818		87.48%
	Bathrooms: 20,818		7.25%
	Total interior livable area: 20,818		5.39%
	Total spaces: 20,818		1.66%
	Region: 20,818		0.01%
	Parking features: 20,818		9.75%
	Annual tax amount: 20,818		9.51%
	Listed Price: 20,818		0.00%
	Last Sold Price: 20,818		38.71%
	City: 20,818		0.00%
	State: 20,818		0.00%
Columns with between 20% and 50% missing values: ['Last Sold Price']
Columns with more than 50% missing values: ['Cooling', 'Lot']
Columns with missing values, less than 20%: ['Year built', 'Bathrooms', 'Total interior livable area', 'Total spaces', 'Region', 'Parking features', 'Annual tax amount']


## Deletion

Delete the missing values you think deletion is the most appropriate method. Justify your choices.

In [5]:
# write your code here



## Single-column imputation/attribution

Analyze all the columns with missing values and decide which imputation/attribution method is more appropriate for each one of them. Then, apply the chosen method to the dataset. Do it *only* for single column imputation/attribution. For multiple column imputation/attribution, just include them in the cell at the end of the notebook (you do not need to implement it).

In [6]:
# write your code here

 

## Multiple-column imputation/attribution

Write below a list of columns that you think must be used a method that requires multiple columns for the imputation. Include, for each column, an indication of the columns you think are required for the imputation. Indicate if it is a regression or classification imputation. Justify your choices. 

*Write your answer here*

