[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/francisco-ortin/data-science-course/blob/main/statistics/missing.ipynb)
[![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

# Treating missing values

In this notebook, you will treat the missing values in a subset of the housing dataset.

You must apply deletion and single-column imputation/attribution to the dataset so that it no longer contains missing values.

In [1]:
# make sure the numpy package is installed
%pip install numpy --quiet
repo='data-science-course'
module='statistics'
# if running in colab, install the required packages and copy the necessary files
if get_ipython().__class__.__module__.startswith('google.colab'):
    import os
    if not os.path.exists(repo):
        !git clone --filter=blob:none --sparse https://github.com/francisco-ortin/data-science-course.git  2>/dev/null
        !cd {repo} && git sparse-checkout init --cone && git sparse-checkout set {module}  2>/dev/null
    !cp --update {repo}/{module}/*.py . 2>/dev/null
    !mkdir -p img data
    !mv {repo}/{module}/img/* img/.  2>/dev/null
    !mv {repo}/{module}/data/* data/.  2>/dev/null

Note: you may need to restart the kernel to use updated packages.


## Loading the dataset

Let's start by loading the dataset.

In [2]:
import pandas as pd
data = pd.read_csv('data/missing_housing.csv')
number_of_instances = len(data)
print(f'Total number of instances: {number_of_instances:,}.')
print(f"Shape of the dataset: {data.shape}.")
data.head(10)

Total number of instances: 20,818.
Shape of the dataset: (20818, 16).


,Id,Sold Price,Type,Year built,Cooling,Lot,Bathrooms,Total interior livable area,Total spaces,Region,Parking features,Annual tax amount,Listed Price,Last Sold Price,City,State
0,0,3825000,SingleFamily,1969.0,"Multi-Zone, Central AC, Whole House / Attic Fan",1.0,0.0,1.0,0.0,Los Altos,"Garage, Garage - Attached, Covered",12580.0,4198000,,Los Altos,CA
1,1,505000,SingleFamily,1926.0,"Wall/Window Unit(s), Evaporative Cooling, See ...",4047.0,2.0,872.0,1.0,Los Angeles,"Detached Carport, Garage",6253.0,525000,328000.0,Los Angeles,CA
2,2,140000,SingleFamily,1958.0,,9147.0,3.0,1152.0,0.0,Strawberry,,468.0,180000,,Strawberry,CA
3,3,1775000,SingleFamily,1947.0,Central Air,,3.0,2612.0,0.0,Culver City,"Detached Carport, Driveway, Garage - Two Door",20787.0,1895000,1500000.0,Culver City,CA
4,4,1175000,VacantLand,,,,,,,Creston,,,1595000,900000.0,Creston,CA
5,5,221000,SingleFamily,1905.0,Window Unit(s),3576.0,2.0,1311.0,0.0,Stockton,Carport,2531.0,224900,200000.0,Stockton,CA
6,6,1589000,Unknown,1926.0,Central Air,,,,0.0,Los Angeles,"Driveway, Garage",19220.0,1599000,500000.0,Los Angeles,CA
7,7,480000,SingleFamily,2005.0,Other,1771149.6,2.0,2519.0,4.0,Taylorsville,"Carport, Garage - Attached, Covered",,499000,,Taylorsville,CA
8,8,1590000,Condo,2001.0,,,3.0,1601.0,1.0,San Francisco,"Attached, Enclosed, Garage Door Opener, Interi...",13793.0,1650000,,San Francisco,CA
9,9,1275000,SingleFamily,1973.0,,66211.2,2.0,2123.0,0.0,Aptos,"Garage, Garage - Attached",1909.0,1050000,,Aptos,CA


## Showing the number of missing values per column

Show the number of missing values per column in the dataset.

In [3]:
print("Percentage of missing values per column:")
columns_with_20_percent_missing_values = []
columns_with_50_percent_missing_values = []
columns_with_a_few_missing_values = []
columns_with_no_missing_values = []
for column in data.columns:
    number_of_missing_instances = data[column].isnull().sum()
    missing_values_percentage = number_of_missing_instances / number_of_instances * 100
    # report the share of missing values in this column
    print(f'\t{column}: {missing_values_percentage:.2f}%')
    # classify the column by its share of missing values
    if 20 <= missing_values_percentage < 50:
        columns_with_20_percent_missing_values.append(column)
    elif missing_values_percentage >= 50:
        columns_with_50_percent_missing_values.append(column)
    elif number_of_missing_instances > 0 and missing_values_percentage < 20:
        columns_with_a_few_missing_values.append(column)
    elif number_of_missing_instances == 0:
        columns_with_no_missing_values.append(column)
print(f'Columns with between 20% and 50% missing values: {columns_with_20_percent_missing_values}.')
print(f'Columns with more than 50% missing values: {columns_with_50_percent_missing_values}.')
print(f'Columns with missing values, less than 20%: {columns_with_a_few_missing_values}.')
print(f'Columns with no missing values: {columns_with_no_missing_values}.')

Percentage of missing values per column:
	Id: 0.00%
	Sold Price: 0.00%
	Type: 0.00%
	Year built: 1.95%
	Cooling: 86.67%
	Lot: 87.48%
	Bathrooms: 7.25%
	Total interior livable area: 5.39%
	Total spaces: 1.66%
	Region: 0.01%
	Parking features: 9.75%
	Annual tax amount: 9.51%
	Listed Price: 0.00%
	Last Sold Price: 38.71%
	City: 0.00%
	State: 0.00%
Columns with between 20% and 50% missing values: ['Last Sold Price'].
Columns with more than 50% missing values: ['Cooling', 'Lot'].
Columns with missing values, less than 20%: ['Year built', 'Bathrooms', 'Total interior livable area', 'Total spaces', 'Region', 'Parking features', 'Annual tax amount'].
Columns with no missing values: ['Id', 'Sold Price', 'Type', 'Listed Price', 'City', 'State'].
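
The same per-column summary can also be computed without an explicit loop. The following is a minimal sketch on a small, made-up DataFrame (`toy` and its values are assumptions for illustration, not the housing data): `isnull().sum()` counts the missing values per column and `isnull().mean()` gives their share.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    'Id': [0, 1, 2, 3],
    'Year built': [1969.0, np.nan, 1958.0, 1947.0],
    'Cooling': [np.nan, np.nan, np.nan, 'Central Air'],
})

# Count and percentage of missing values per column, most affected first
summary = pd.DataFrame({
    'missing': toy.isnull().sum(),
    'percentage': toy.isnull().mean() * 100,
}).sort_values('percentage', ascending=False)
print(summary)
```

Applied to `data`, the same two calls should give one row per column of the housing dataset.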


## Deletion

Delete the missing values for which you think deletion is the most appropriate method. Justify your choices.
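
As a reminder of the two usual forms of deletion, here is a minimal sketch on a made-up DataFrame. The frame `toy`, the 50% cut-off in `COLUMN_THRESHOLD`, and the choice of `Region` for row-wise deletion are assumptions for illustration only; which columns and rows to delete in the housing dataset is for you to decide and justify.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration
toy = pd.DataFrame({
    'Sold Price': [3825000, 505000, 140000, 1775000, 1175000],
    'Cooling': [np.nan, np.nan, np.nan, 'Central Air', np.nan],              # 80% missing
    'Region': ['Los Altos', 'Los Angeles', None, 'Culver City', 'Creston'],  # 20% missing
})

# Column-wise deletion: drop columns whose share of missing values reaches a threshold
COLUMN_THRESHOLD = 0.5  # assumed cut-off, not prescribed by the exercise
missing_share = toy.isnull().mean()
columns_to_drop = missing_share[missing_share >= COLUMN_THRESHOLD].index.tolist()
reduced = toy.drop(columns=columns_to_drop)

# Row-wise deletion: drop the rows that lack a value in the selected columns
cleaned = reduced.dropna(subset=['Region'])
print(columns_to_drop, len(cleaned))
```

Column-wise deletion discards a whole feature, so it is usually reserved for columns that are mostly empty; row-wise deletion discards instances, so it is usually reserved for columns where only a handful of values are missing.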

In [4]:
# write your code here



## Single-column imputation/attribution

Analyze all the columns with missing values and decide which imputation/attribution method is the most appropriate for each of them. Then, apply the chosen method to the dataset. Do this *only* for single-column imputation/attribution. For columns that need multiple-column imputation/attribution, just list them in the cell at the end of the notebook (you do not need to implement it).
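
A minimal sketch of single-column imputation on a made-up DataFrame follows. The frame `toy` and the pairing of methods with columns (median for a numeric column, mode for a categorical one) are assumptions for illustration; other single-column choices, such as the mean or a constant category, may suit a given column better.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration
toy = pd.DataFrame({
    'Bathrooms': [2.0, 3.0, np.nan, 2.0, 10.0],
    'Parking features': ['Garage', None, 'Carport', 'Garage', 'Driveway'],
})

imputed = toy.copy()
# Numeric column: the median is less sensitive to outliers than the mean
imputed['Bathrooms'] = imputed['Bathrooms'].fillna(imputed['Bathrooms'].median())
# Categorical column: fill with the most frequent value (mode)
most_frequent = imputed['Parking features'].mode()[0]
imputed['Parking features'] = imputed['Parking features'].fillna(most_frequent)
print(imputed)
```

Each imputed value here depends only on the other values of the same column, which is what makes the method single-column.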

In [5]:
# write your code here

 

## Multiple-column imputation/attribution

List below the columns that you think require an imputation method that uses multiple columns. For each column, indicate which other columns are needed for the imputation and whether it is a regression or a classification imputation. Justify your choices.
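
You do not need to implement this part, but for reference, regression imputation can be sketched as follows. The frame `toy`, its values, and the choice of predictor and target columns are assumptions for illustration and are not meant as the answer to the exercise. A straight line is fitted with `np.polyfit` on the rows where the target is known, and the fitted line then predicts the missing targets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values and the predictor/target pairing are made up
toy = pd.DataFrame({
    'Listed Price': [100.0, 200.0, 300.0, 400.0, 500.0],
    'Annual tax amount': [10.0, 20.0, np.nan, 40.0, np.nan],
})

# Fit target = slope * predictor + intercept on the rows where the target is known
known = toy.dropna(subset=['Annual tax amount'])
slope, intercept = np.polyfit(known['Listed Price'], known['Annual tax amount'], deg=1)

# Predict the target only where it is missing
missing_mask = toy['Annual tax amount'].isnull()
regression_imputed = toy.copy()
regression_imputed.loc[missing_mask, 'Annual tax amount'] = (
    slope * toy.loc[missing_mask, 'Listed Price'] + intercept
)
print(regression_imputed)
```

Classification imputation follows the same pattern with a classifier in place of the fitted line, predicting a missing category from the other columns.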

*Write your answer here*

