# C. Pre-processing<a id = 'C_Pre-processing'></a>

# 1. Contents<a id='1._Contents'></a>
* [C. Pre-processing](#C_Pre-processing)
* [2. Objective](#2_Objective)
    * [2.1 Tasks](#2.1_Tasks)
* [3. Imports](#3_Imports)
* [4. Load Data](#4._Load_Data)
* [5. Revising and Creating New Features](#5._Revising_and_Creating_New_Features)
    * [5.1 Revising the yearBuilt feature](#5.1_Revising_the_yearBuilt_features)
    * [5.2 Creating a new feature for state rent clusters](#5.2_Creating_a_new_feature_for_state_rent_clusters)
* [6. Create a Dataframe for the features and target variable](#5.6._Create_a_Dataframe_for_the_features_and_target_variable)
* [7. Develop an Imputation Strategy](#7._Develop_an_Imputation_Strategy)
* [8. Create Train-Test Split](#8._Create_Train-Test_Split)
* [9. Create One Hot Encoding Object](#9._Create_One_Hot_Encoding_Object)
* [10. Create Scaling Object](#10._Create_Scaling_Object)

# 2. Objective<a id = '2_Objective'></a>

- Prepare data for modelling

## 2.1 Tasks<a id = '2.1_Tasks'></a>

1. Find features with missing values
2. Determine the relevance of each feature for modelling and drop the features that are not important
3. Impute numerical missing values
4. Convert categorical variables and datetime variables to numerical variables
5. Impute categorical missing values
6. Scale all data
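
As an illustration of the first task, the following is a minimal sketch of how features with missing values could be listed. The `toy` dataframe and the `missing_summary` name are hypothetical stand-ins, not part of this notebook; the same calls should work on the `rentals` dataframe loaded below.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the rentals data
toy = pd.DataFrame({
    'serviceCharge': [245.0, np.nan, 200.0, 215.0],
    'heatingType': pd.Series(['central_heating', np.nan, 'gas_heating', np.nan], dtype=object),
    'balcony': [False, False, True, True],
})

# Count and share of missing values per column
missing = toy.isna().sum()
missing_summary = pd.DataFrame({
    'n_missing': missing,
    'pct_missing': (missing / len(toy) * 100).round(1),
})

# Keep only the columns that actually have gaps, worst first
missing_summary = missing_summary[missing_summary['n_missing'] > 0]
missing_summary = missing_summary.sort_values('n_missing', ascending=False)
print(missing_summary)
```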

# 3. Imports<a id = '3_Imports'></a>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.model_selection import train_test_split
from datetime import datetime
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder, OrdinalEncoder
from sklearn_pandas import DataFrameMapper
from sklearn.pipeline import FeatureUnion
from collections import defaultdict

# 4. Load Data<a id='4._Load_Data'></a>

In [2]:
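#Build the path with os.path.join so it does not depend on Windows-style backslashes
read = pd.read_csv(os.path.join('..', 'data', 'interim', 'rentals_process.csv'), parse_dates = ['yearConstructed', 'date'])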

In [3]:
rentals = read.copy()

In [4]:
rentals.shape

(266220, 37)

In [5]:
rentals.head()

Unnamed: 0,state,serviceCharge,heatingType,telekomTvOffer,newlyConst,balcony,telekomUploadSpeed,yearConstructed,firingTypes,hasKitchen,...,sqm_per_room,area_km2,population_2019,population_per_km2,gdp_per_capita_2018,hdi,total_state_listings,total_state_sqm,listings_per_1000capita,listings_per_100sqm
0,Nordrhein_Westfalen,245.0,central_heating,ONE_YEAR_FREE,False,False,10.0,1965-01-01,oil,False,...,21.5,34085,17932651,526,39678,0.936,62069,4615660.71,3.461228,1.344748
1,Nordrhein_Westfalen,95.0,self_contained_central_heating,ONE_YEAR_FREE,False,False,40.0,1953-01-01,gas,False,...,24.0,34085,17932651,526,39678,0.936,62069,4615660.71,3.461228,1.344748
2,Nordrhein_Westfalen,200.0,central_heating,ONE_YEAR_FREE,False,False,40.0,1951-01-01,oil,False,...,30.86,34085,17932651,526,39678,0.936,62069,4615660.71,3.461228,1.344748
3,Nordrhein_Westfalen,215.0,gas_heating,ONE_YEAR_FREE,True,True,2.4,2018-01-01,gas,False,...,29.0,34085,17932651,526,39678,0.936,62069,4615660.71,3.461228,1.344748
4,Nordrhein_Westfalen,121.0,central_heating,ONE_YEAR_FREE,False,True,40.0,1914-01-01,gas,False,...,26.0,34085,17932651,526,39678,0.936,62069,4615660.71,3.461228,1.344748


In [6]:
rentals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266220 entries, 0 to 266219
Data columns (total 37 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   state                    266220 non-null  object        
 1   serviceCharge            259490 non-null  float64       
 2   heatingType              221782 non-null  object        
 3   telekomTvOffer           233938 non-null  object        
 4   newlyConst               266220 non-null  bool          
 5   balcony                  266220 non-null  bool          
 6   telekomUploadSpeed       233208 non-null  float64       
 7   yearConstructed          209427 non-null  datetime64[ns]
 8   firingTypes              209806 non-null  object        
 9   hasKitchen               266220 non-null  bool          
 10  cellar                   266220 non-null  bool          
 11  rent                     266220 non-null  float64       
 12  livingSpace     

# 5. Revising and Creating New Features<a id='5._Revising_and_Creating_New_Features'></a>

## 5.1 Revising the yearBuilt feature<a id='5.1_Revising_the_yearBuilt_features'></a>

In [7]:
#Extract the construction year from yearConstructed into a new numeric feature, yearBuilt
rentals['yearBuilt'] = rentals.yearConstructed.dt.year
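
`yearConstructed` contains missing dates (209427 non-null values out of 266220), and this is why `yearBuilt` appears as a float such as `1965.0` in later outputs. A small sketch of that behaviour on hypothetical values:

```python
import pandas as pd

# Hypothetical dates; the None entry stands in for a missing construction date
years = pd.Series(pd.to_datetime(['1965-01-01', None, '2018-01-01']))

# .dt.year returns NaN for missing dates, so the result is a float column
year_built = years.dt.year
print(year_built)
```

Because the gaps survive as `NaN`, the new feature still needs to be imputed along with the other numerical features.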

## 5.2 Creating a new feature for state rent clusters<a id='5.2_Creating_a_new_feature_for_state_rent_clusters'></a> 

In [8]:
#Create three lists of states according to their median rent seen during exploratory data analysis
high_rent = ['Berlin', 'Hamburg']
medium_rent = ['Baden_Württemberg', 'Bayern', 'Hessen']
low_rent = ['Schleswig_Holstein', 'Thüringen', 'Niedersachsen', 'Saarland', 'Bremen', 'Sachsen',
            'Mecklenburg_Vorpommern', 'Sachsen_Anhalt', 'Brandenburg', 'Nordrhein_Westfalen', 'Rheinland_Pfalz']

In [9]:
#Create a dictionary for mapping state to rent cluster
rent_mapping = {}
for state in high_rent:
    rent_mapping[state] = 'high_rent_state'
    
for state in medium_rent:
    rent_mapping[state] = 'medium_rent_state'

for state in low_rent:
    rent_mapping[state] = 'low_rent_state'

rent_mapping

{'Berlin': 'high_rent_state',
 'Hamburg': 'high_rent_state',
 'Baden_Württemberg': 'medium_rent_state',
 'Bayern': 'medium_rent_state',
 'Hessen': 'medium_rent_state',
 'Schleswig_Holstein': 'low_rent_state',
 'Thüringen': 'low_rent_state',
 'Niedersachsen': 'low_rent_state',
 'Saarland': 'low_rent_state',
 'Bremen': 'low_rent_state',
 'Sachsen': 'low_rent_state',
 'Mecklenburg_Vorpommern': 'low_rent_state',
 'Sachsen_Anhalt': 'low_rent_state',
 'Brandenburg': 'low_rent_state',
 'Nordrhein_Westfalen': 'low_rent_state',
 'Rheinland_Pfalz': 'low_rent_state'}

In [10]:
#Create new feature - state_rent_class - for each observation
rentals['state_rent_class'] = rentals['state'].replace(rent_mapping)
rentals.loc[:, ['state', 'rent', 'state_rent_class']].sample(5)

Unnamed: 0,state,rent,state_rent_class
44410,Nordrhein_Westfalen,795.0,low_rent_state
13704,Nordrhein_Westfalen,499.0,low_rent_state
208738,Bayern,1040.0,medium_rent_state
248529,Berlin,1100.0,high_rent_state
41907,Nordrhein_Westfalen,1500.0,low_rent_state
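
One caveat of `replace` is that a state missing from the mapping passes through unchanged and ends up as its own category. The sketch below, on hypothetical data (`rent_mapping_demo`, `states` and the made-up state `'Atlantis'` are illustrative only), shows how `Series.map` could be used instead to surface such states:

```python
import pandas as pd

# Hypothetical, shortened mapping and data for illustration only
rent_mapping_demo = {
    'Berlin': 'high_rent_state',
    'Bayern': 'medium_rent_state',
    'Sachsen': 'low_rent_state',
}
states = pd.Series(['Berlin', 'Bayern', 'Sachsen', 'Atlantis'], dtype=object)

# replace() leaves values that are not in the mapping untouched
replaced = states.replace(rent_mapping_demo)

# map() turns values that are not in the mapping into NaN, which is easy to check
mapped = states.map(rent_mapping_demo)
unmapped = states[mapped.isna()].unique().tolist()
print(unmapped)
```

If `unmapped` is empty for the real data, both approaches give the same result.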


# 6. Create a Dataframe for the features and target variable<a id='5.6._Create_a_Dataframe_for_the_features_and_target_variable'></a> 

In [11]:
#Create a features dataframe: drop the high-cardinality (zip_code), low-variance (telekomTvOffer, telekomUploadSpeed)
#and redundant (yearConstructed) features, as well as the target variable (rent)
X = rentals.drop(columns = ['rent', 'yearConstructed', 'telekomUploadSpeed', 'telekomTvOffer', 'zip_code'])
X.head()

Unnamed: 0,state,serviceCharge,heatingType,newlyConst,balcony,firingTypes,hasKitchen,cellar,livingSpace,condition,...,population_2019,population_per_km2,gdp_per_capita_2018,hdi,total_state_listings,total_state_sqm,listings_per_1000capita,listings_per_100sqm,yearBuilt,state_rent_class
0,Nordrhein_Westfalen,245.0,central_heating,False,False,oil,False,True,86.0,well_kept,...,17932651,526,39678,0.936,62069,4615660.71,3.461228,1.344748,1965.0,low_rent_state
1,Nordrhein_Westfalen,95.0,self_contained_central_heating,False,False,gas,False,True,60.0,well_kept,...,17932651,526,39678,0.936,62069,4615660.71,3.461228,1.344748,1953.0,low_rent_state
2,Nordrhein_Westfalen,200.0,central_heating,False,False,oil,False,False,123.44,first_time_use_after_refurbishment,...,17932651,526,39678,0.936,62069,4615660.71,3.461228,1.344748,1951.0,low_rent_state
3,Nordrhein_Westfalen,215.0,gas_heating,True,True,gas,False,True,87.0,first_time_use,...,17932651,526,39678,0.936,62069,4615660.71,3.461228,1.344748,2018.0,low_rent_state
4,Nordrhein_Westfalen,121.0,central_heating,False,True,gas,False,True,65.0,well_kept,...,17932651,526,39678,0.936,62069,4615660.71,3.461228,1.344748,1914.0,low_rent_state


In [12]:
#Create a dataframe with the target variable
y = rentals[['rent']]
y.head()

Unnamed: 0,rent
0,595.0
1,300.0
2,950.0
3,972.6
4,329.0


# 7. Develop an Imputation Strategy<a id='7._Develop_an_Imputation_Strategy'></a>

In [13]:
# Review datatypes
X.dtypes

state                              object
serviceCharge                     float64
heatingType                        object
newlyConst                           bool
balcony                              bool
firingTypes                        object
hasKitchen                           bool
cellar                               bool
livingSpace                       float64
condition                          object
interiorQual                       object
petsAllowed                        object
lift                                 bool
typeOfFlat                         object
noRooms                           float64
thermalChar                       float64
numberOfFloors                    float64
garden                               bool
district                           object
town_municipality                  object
date                       datetime64[ns]
rent_per_sqm                      float64
sqm_per_room                      float64
area_km2                          

In [14]:
#Create a list of names for categorical features
cat_cols = X.select_dtypes('object').columns.to_list()
cat_cols

['state',
 'heatingType',
 'firingTypes',
 'condition',
 'interiorQual',
 'petsAllowed',
 'typeOfFlat',
 'district',
 'town_municipality',
 'state_rent_class']

In [15]:
#Create a list of names for numerical features
num_cols = X.select_dtypes(['float64', 'int64']).columns.to_list()
num_cols

['serviceCharge',
 'livingSpace',
 'noRooms',
 'thermalChar',
 'numberOfFloors',
 'rent_per_sqm',
 'sqm_per_room',
 'area_km2',
 'population_2019',
 'population_per_km2',
 'gdp_per_capita_2018',
 'hdi',
 'total_state_listings',
 'total_state_sqm',
 'listings_per_1000capita',
 'listings_per_100sqm',
 'yearBuilt']

In [16]:
#Use median imputation for numerical features
imputations = []
imputations.extend([([numeric_col], SimpleImputer(missing_values = np.nan, strategy = 'median')) \
                                for numeric_col in num_cols])
#Use 'most_frequent' imputation for categorical features
imputations.extend([([cat_col], SimpleImputer(strategy = 'most_frequent')) for cat_col in cat_cols])
imputations

[(['serviceCharge'], SimpleImputer(strategy='median')),
 (['livingSpace'], SimpleImputer(strategy='median')),
 (['noRooms'], SimpleImputer(strategy='median')),
 (['thermalChar'], SimpleImputer(strategy='median')),
 (['numberOfFloors'], SimpleImputer(strategy='median')),
 (['rent_per_sqm'], SimpleImputer(strategy='median')),
 (['sqm_per_room'], SimpleImputer(strategy='median')),
 (['area_km2'], SimpleImputer(strategy='median')),
 (['population_2019'], SimpleImputer(strategy='median')),
 (['population_per_km2'], SimpleImputer(strategy='median')),
 (['gdp_per_capita_2018'], SimpleImputer(strategy='median')),
 (['hdi'], SimpleImputer(strategy='median')),
 (['total_state_listings'], SimpleImputer(strategy='median')),
 (['total_state_sqm'], SimpleImputer(strategy='median')),
 (['listings_per_1000capita'], SimpleImputer(strategy='median')),
 (['listings_per_100sqm'], SimpleImputer(strategy='median')),
 (['yearBuilt'], SimpleImputer(strategy='median')),
 (['state'], SimpleImputer(strategy='mos

In [17]:
#Create pipeline element that enables imputation of missing values
df_imputation = DataFrameMapper(imputations, input_df = True, df_out = True)
df_imputation

DataFrameMapper(df_out=True, drop_cols=[],
                features=[(['serviceCharge'], SimpleImputer(strategy='median')),
                          (['livingSpace'], SimpleImputer(strategy='median')),
                          (['noRooms'], SimpleImputer(strategy='median')),
                          (['thermalChar'], SimpleImputer(strategy='median')),
                          (['numberOfFloors'],
                           SimpleImputer(strategy='median')),
                          (['rent_per_sqm'], SimpleImputer(strategy='m...
                           SimpleImputer(strategy='most_frequent')),
                          (['petsAllowed'],
                           SimpleImputer(strategy='most_frequent')),
                          (['typeOfFlat'],
                           SimpleImputer(strategy='most_frequent')),
                          (['district'],
                           SimpleImputer(strategy='most_frequent')),
                          (['town_municipality'],
      

# 8. Create Train-Test Split<a id='8._Create_Train-Test_Split'></a>

In [18]:
#Split X and y into training and test sets (70% / 30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 77)
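
The imputation objects are only defined in this notebook, not fitted. The sketch below shows one way they could be fitted after the split so that the medians and most frequent values come from the training set only. It uses scikit-learn's `ColumnTransformer` instead of the `DataFrameMapper` used above, and all data and names (`toy`, `target`, `imputer`) are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data with gaps in one numerical and one categorical feature
toy = pd.DataFrame({
    'livingSpace': [86.0, 60.0, np.nan, 87.0, 65.0, 40.0, np.nan, 100.0, 55.0, 70.0],
    'condition': pd.Series(['well_kept', np.nan, 'well_kept', 'first_time_use', 'well_kept',
                            np.nan, 'well_kept', 'first_time_use', 'well_kept', 'well_kept'],
                           dtype=object),
})
target = pd.Series([595.0, 300.0, 950.0, 972.6, 329.0, 250.0, 800.0, 1100.0, 400.0, 500.0])

X_tr, X_te, y_tr, y_te = train_test_split(toy, target, test_size=0.3, random_state=77)

# Median for the numerical column, most frequent value for the categorical column
imputer = ColumnTransformer([
    ('num', SimpleImputer(strategy='median'), ['livingSpace']),
    ('cat', SimpleImputer(strategy='most_frequent'), ['condition']),
])

# Learn the fill values on the training set only, then reuse them on the test set
X_tr_imp = imputer.fit_transform(X_tr)
X_te_imp = imputer.transform(X_te)
```

Fitting on the training set alone keeps information from the test set out of the fill values.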

# 9. Create One Hot Encoding Object<a id='9._Create_One_Hot_Encoding_Object'></a>

In [19]:
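#Create the one-hot encoder; categories not seen during fitting are encoded as all zeros instead of raising an error
onehot = OneHotEncoder(handle_unknown = 'ignore')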

# 10. Create Scaling Object<a id='10._Create_Scaling_Object'></a>

In [20]:
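#Create the scaler; with_mean = False scales without centering, which keeps sparse one-hot output sparse
scaler = StandardScaler(with_mean = False)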
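
The notebook does not state why `with_mean = False` is used. A plausible reason is that `OneHotEncoder` returns a sparse matrix by default, which `StandardScaler` cannot center. The following sketch, on hypothetical data, illustrates how the two objects interact:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical single categorical column
cats = np.array([['Berlin'], ['Bayern'], ['Berlin'], ['Sachsen']], dtype=object)

# One-hot encoding returns a sparse matrix by default
onehot_demo = OneHotEncoder(handle_unknown='ignore')
encoded = onehot_demo.fit_transform(cats)

# Scaling without centering accepts sparse input and keeps it sparse
scaled = StandardScaler(with_mean=False).fit_transform(encoded)

# A category unseen during fitting becomes a row of zeros instead of an error
unseen = onehot_demo.transform(np.array([['Hamburg']], dtype=object))

# Centering a sparse matrix is rejected by StandardScaler
try:
    StandardScaler(with_mean=True).fit(encoded)
    centering_failed = False
except ValueError:
    centering_failed = True

print(encoded.shape, unseen.sum(), centering_failed)
```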