# Treating missing values

In this notebook, you will have to treat missing values of a subset of the housing dataset.

You must apply deletion and (single column) imputation/attribution to the dataset, so that it will no longer contain missing values.

In [82]:
%%capture
# make sure the required packages are installed
%pip install numpy matplotlib pandas

## Loading the dataset

Let's start by loading the dataset.

In [83]:
import pandas as pd
data = pd.read_csv('data/missing_housing.csv')

## Showing the number of missing values per column

Show the number of missing values per column in the dataset.

In [84]:
# write your code here

# print the number of missing values per column
number_of_instances = len(data)
print(f'Total number of instances: {number_of_instances}')
print("Number of missing values per column:")
columns_with_60_percent_missing_values = []
columns_with_missing_values = []
for column in data.columns:
    number_of_missing_instances = data[column].isnull().sum()
    missing_values_percentage = number_of_missing_instances / number_of_instances * 100
    print(f'\t{column}: {number_of_instances}    {number_of_missing_instances/number_of_instances*100:.2f}%')
    if missing_values_percentage > 60:
        columns_with_60_percent_missing_values.append(column)
    if number_of_missing_instances > 0 and missing_values_percentage <= 60:
        columns_with_missing_values.append(column)
print(f'Columns with more than 60% missing values: {columns_with_60_percent_missing_values}')
print(f'Columns with missing values: {columns_with_missing_values}')

Total number of instances: 20818
Number of missing values per column:
	Id: 20818    0.00%
	Sold Price: 20818    0.00%
	Type: 20818    0.00%
	Year built: 20818    1.95%
	Cooling: 20818    86.67%
	Lot: 20818    87.48%
	Bathrooms: 20818    7.25%
	Total interior livable area: 20818    5.39%
	Total spaces: 20818    1.66%
	Region: 20818    0.01%
	Parking features: 20818    9.75%
	Annual tax amount: 20818    9.51%
	Listed Price: 20818    0.00%
	Last Sold Price: 20818    38.71%
	City: 20818    0.00%
	State: 20818    0.00%
Columns with more than 60% missing values: ['Cooling', 'Lot']
Columns with missing values: ['Year built', 'Bathrooms', 'Total interior livable area', 'Total spaces', 'Region', 'Parking features', 'Annual tax amount', 'Last Sold Price']


## Deletion

Delete the missing values you have identified.

In [85]:
# write your code here

# drop columns with more than 60% missing values
data = data.drop(columns=columns_with_60_percent_missing_values)
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20818 entries, 0 to 20817
Data columns (total 14 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Id                           20818 non-null  int64  
 1   Sold Price                   20818 non-null  int64  
 2   Type                         20818 non-null  object 
 3   Year built                   20413 non-null  float64
 4   Bathrooms                    19308 non-null  float64
 5   Total interior livable area  19696 non-null  float64
 6   Total spaces                 20473 non-null  float64
 7   Region                       20816 non-null  object 
 8   Parking features             18788 non-null  object 
 9   Annual tax amount            18838 non-null  float64
 10  Listed Price                 20818 non-null  int64  
 11  Last Sold Price              12760 non-null  float64
 12  City                         20818 non-null  object 
 13  State           

## Single-column imputation/attribution

Analyze all the columns with missing values and decide which imputation/attribution method is more appropriate for each one of them. Then, apply the chosen method to the dataset. Do it for single column imputation/attribution. For multiple column imputation/attribution, just include them in the cell at the end of the notebook.

In [86]:
# write your code here

# Year built
# It is an ordinal value. No way we can infer the year of construction from other columns. We'll use the median value.
median_year_of_construction = data['Year built'].median()
data['Year built'] = data['Year built'].fillna(median_year_of_construction)
print(f"Previous median of year: {median_year_of_construction}. New median: {data['Year built'].median()}")
print(f"Number of instances with year of construction missing: {data['Year built'].isnull().sum()}.")

# Region
# show the row with missing region
print("Row with missing region:")
print(data[data['Region'].isnull()][['Id', 'City', 'Region']])
# We can infer the region from the city. Let's find the citi 
city = data[data['Region'].isnull()]['City'].values[0]  # get the city name (all the values are the same)
print(f"City of the missing region: {city}")
region = data[(data['City'] == city) & (data['Region'].notnull())]['Region'].values[0]
print(f"Region of the missing region: {region}")
# imputation
data['Region'] = data['Region'].fillna(region)
print(f"Number of instances with region missing: {data['Region'].isnull().sum()}.")


# Parking features
# it is a nominal value. We cannot infer it. We'll use the mode.
mode_parking_features = data['Parking features'].mode()[0]
data['Parking features'] = data['Parking features'].fillna(mode_parking_features)
print(f"Number of instances with parking features missing: {data['Parking features'].isnull().sum()}.")


print("Columns with missing values that require imputation with multiple variables:")
for column in data.columns:
    if data[column].isnull().sum() > 0:
        print(f'\t{column}.')

Previous median of year: 1964.0. New median: 1964.0
Number of instances with year of construction missing: 0.
Row with missing region:
      Id     City Region
964  964  Isleton    NaN
969  969  Isleton    NaN
City of the missing region: Isleton
Region of the missing region: Isleton Region
Number of instances with region missing: 0.
Number of instances with parking features missing: 0.
Columns with missing values that require imputation with multiple variables:
	Bathrooms.
	Total interior livable area.
	Total spaces.
	Annual tax amount.
	Last Sold Price.


## Multiple-column imputation/attribution

Write below a list of columns that you think must be used a method that requires multiple columns for the imputation. Include, for each column, an indication of the columns required for the imputation. Indicate if it is a regression or classificiation imputation. Justify your choices. 

*Write your answer here*

1. Bathrooms. Classification imputation. We can infer the number of bathrooms from the total interior livable area, total spaces, and listed price.
2. Total interior livable area. Regression imputation. We can infer the total interior livable area from the total spaces, the number of bedrooms, and the listed price.
3. Total spaces. Classification imputation. We can infer the total spaces from the total interior livable area, the number of bedrooms, and the listed price.
4. Annual tax amount. Regression imputation. We can infer the annual tax amount from the listed price and the lot.
5. Last sold price. Regression imputation. We can infer the last sold price from probably all (most) of the features.