# Pre-processing

# Objective

- Prepare data for modelling

## Tasks

1. Find features with missing values
2. Determine relevance of feature for modelling. Drop features you feel are not important
3. Impute numerical missing values
4. Convert categorical variables and datetime variables to numerical variables
5. Impute categorical missing values
6. Scale all data

# Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Load Data

In [2]:
read = pd.read_csv('../data/interim/rentals_complete.csv')

In [3]:
rentals = read.copy()

In [4]:
rentals.shape

(267690, 36)

In [5]:
rentals.head()

Unnamed: 0,state,serviceCharge,newlyConst,balcony,hasKitchen,baseRent,livingSpace,condition,interiorQual,lift,...,medianServiceCharge_ct,medianThermalChar_ct,medianPictureCount_ct,medianNoRooms_muni,medianLivingSpace_muni,medianServiceCharge_muni,medianPictureCount_muni,medianNoRooms_zip,medianLivingSpace_zip,medianServiceCharge_zip
0,Nordrhein_Westfalen,245.0,No,No,No,595.0,86.0,well_kept,normal,No,...,121.0,126.0,8.0,3.0,78.0,161.62,9.0,3.0,78.0,150.0
1,Nordrhein_Westfalen,141.0,No,Yes,No,579.0,70.95,unknown,unknown,No,...,121.0,126.0,8.0,3.0,78.0,161.62,9.0,3.0,78.0,150.0
2,Nordrhein_Westfalen,141.0,No,Yes,No,569.0,70.95,unknown,unknown,No,...,121.0,126.0,8.0,3.0,78.0,161.62,9.0,3.0,78.0,150.0
3,Nordrhein_Westfalen,322.16,Yes,Yes,Yes,1398.6,115.58,first_time_use,sophisticated,Yes,...,121.0,126.0,8.0,3.0,78.0,161.62,9.0,3.0,78.0,150.0
4,Nordrhein_Westfalen,112.92,No,No,Yes,725.62,51.1,mint_condition,sophisticated,Yes,...,121.0,126.0,8.0,3.0,78.0,161.62,9.0,3.0,78.0,150.0


In [6]:
rentals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 267690 entries, 0 to 267689
Data columns (total 36 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   state                      267690 non-null  object 
 1   serviceCharge              257456 non-null  float64
 2   newlyConst                 267690 non-null  object 
 3   balcony                    267690 non-null  object 
 4   hasKitchen                 267690 non-null  object 
 5   baseRent                   267670 non-null  float64
 6   livingSpace                267690 non-null  float64
 7   condition                  267690 non-null  object 
 8   interiorQual               267690 non-null  object 
 9   lift                       267690 non-null  object 
 10  typeOfFlat                 267690 non-null  object 
 11  zip_code                   267690 non-null  int64  
 12  noRooms                    267690 non-null  float64
 13  city_town                  26

# Reviewing missing values

In [7]:
#Check columns with missing values
rentals.columns[rentals.isnull().any()]

Index(['serviceCharge', 'baseRent', 'description', 'facilities',
       'medianServiceCharge_muni', 'medianServiceCharge_zip'],
      dtype='object')
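
Before choosing an imputation strategy, it can help to see how many values are missing in each column, not just which columns are affected. The following is a minimal sketch on a made-up frame (`toy` and its values are assumptions, not the rentals data); the same calls should work on `rentals`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `rentals`; the values are made up for illustration
toy = pd.DataFrame({
    'serviceCharge': [245.0, np.nan, 141.0, np.nan],
    'baseRent': [595.0, 579.0, np.nan, 725.62],
    'livingSpace': [86.0, 70.95, 70.95, 51.1],
})

# Count missing values per column and keep only the columns that have any
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]

# Share of rows affected, as a percentage
missing_share = (missing_counts / len(toy) * 100).round(1)

print(pd.DataFrame({'n_missing': missing_counts, 'pct_missing': missing_share}))
```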

We will impute the median rather than the mean for the missing serviceCharge and baseRent values, because the median is less affected by outliers. For medianServiceCharge_muni and medianServiceCharge_zip we will impute the mean, because these columns already hold medians and are unlikely to contain outliers.

In [8]:
#Replace missing values for serviceCharge and baseRent with median
rentals.loc[rentals.serviceCharge.isnull(), 'serviceCharge'] = rentals.serviceCharge.median()
rentals.loc[rentals.baseRent.isnull(), 'baseRent'] = rentals.baseRent.median()
#Replace missing values for medianServiceCharge_muni and medianServiceCharge_zip with their means
rentals.loc[rentals.medianServiceCharge_muni.isnull(), 'medianServiceCharge_muni'] = rentals.medianServiceCharge_muni.mean()
rentals.loc[rentals.medianServiceCharge_zip.isnull(), 'medianServiceCharge_zip'] = rentals.medianServiceCharge_zip.mean()
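
To illustrate why the median is the safer fill value when outliers are present, here is a small sketch with invented rents (the numbers are assumptions chosen to include one extreme listing). It uses `Series.fillna`, which is equivalent to the `.loc` assignments above.

```python
import numpy as np
import pandas as pd

# Made-up rents with one extreme listing; not taken from the rentals data
rent = pd.Series([500.0, 550.0, 600.0, 650.0, 20000.0, np.nan])

# The single outlier pulls the mean far above a typical rent
mean_fill = rent.fillna(rent.mean())

# The median ignores how extreme the outlier is
median_fill = rent.fillna(rent.median())

print('filled with mean:  ', mean_fill.iloc[-1])    # 4460.0
print('filled with median:', median_fill.iloc[-1])  # 600.0
```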

In [9]:
# We will drop description and facilities for now
rentals.drop(columns = ['description', 'facilities'], inplace = True)

In [10]:
#Verify that there are no missing values
rentals.isnull().any().sum()

0

# Encode categorical variables

We will use dummy (one-hot) encoding for the categorical variables. First, however, we have to drop the features with a large number of categories: city / town, municipality and zip code. We have already captured relevant information about these features in aggregate columns such as their median livingSpace.

In [11]:
#Drop categorical features with many unique values
rentals.drop(columns = ['city_town', 'municipality', 'zip_code'], inplace = True)

In [12]:
rentals.columns

Index(['state', 'serviceCharge', 'newlyConst', 'balcony', 'hasKitchen',
       'baseRent', 'livingSpace', 'condition', 'interiorQual', 'lift',
       'typeOfFlat', 'noRooms', 'gdp_per_capita_2018', 'hdi_2018',
       'medianLivingSpace_state', 'medianServiceCharge_state',
       'medianThermalChar_state', 'medianHeatingCosts_state',
       'medianPictureCount_state', 'no_Listings_per_100people',
       'medianLivingSpace_ct', 'medianServiceCharge_ct',
       'medianThermalChar_ct', 'medianPictureCount_ct', 'medianNoRooms_muni',
       'medianLivingSpace_muni', 'medianServiceCharge_muni',
       'medianPictureCount_muni', 'medianNoRooms_zip', 'medianLivingSpace_zip',
       'medianServiceCharge_zip'],
      dtype='object')

In [13]:
#Encode categorical features
rentals_df = pd.get_dummies(rentals)

In [14]:
rentals_df.shape

(267690, 74)
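
The sketch below shows the same two steps on a made-up frame: check how many distinct values each categorical column has, drop the ones above a threshold, then dummy-encode the rest. The frame `toy`, the list `cat_cols` and the threshold `max_categories` are assumptions for illustration only; the notebook itself picked the columns to drop by hand.

```python
import pandas as pd

# Toy frame; values are invented for illustration
toy = pd.DataFrame({
    'state': ['Berlin', 'Hessen', 'Hessen'],
    'balcony': ['Yes', 'No', 'Yes'],
    'city_town': ['Berlin', 'Frankfurt', 'Kassel'],
    'livingSpace': [86.0, 70.95, 51.1],
})

# Categorical columns are listed explicitly here (an assumption of this sketch)
cat_cols = ['state', 'balcony', 'city_town']
cardinality = toy[cat_cols].nunique()

# Hypothetical threshold: drop columns with more categories than this
max_categories = 2
too_many = cardinality[cardinality > max_categories].index.tolist()

# Each remaining categorical column becomes one indicator column per category
encoded = pd.get_dummies(toy.drop(columns=too_many))

print(too_many)
print(encoded.columns.tolist())
```

Every category adds a column, which is why high-cardinality features such as zip codes would inflate the feature matrix considerably.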

# Create a Dataframe for the features and target variable

In [15]:
X = rentals_df.drop(columns = ['baseRent'])
y = rentals_df.baseRent
X_features = X.columns

In [16]:
X_features

Index(['serviceCharge', 'livingSpace', 'noRooms', 'gdp_per_capita_2018',
       'hdi_2018', 'medianLivingSpace_state', 'medianServiceCharge_state',
       'medianThermalChar_state', 'medianHeatingCosts_state',
       'medianPictureCount_state', 'no_Listings_per_100people',
       'medianLivingSpace_ct', 'medianServiceCharge_ct',
       'medianThermalChar_ct', 'medianPictureCount_ct', 'medianNoRooms_muni',
       'medianLivingSpace_muni', 'medianServiceCharge_muni',
       'medianPictureCount_muni', 'medianNoRooms_zip', 'medianLivingSpace_zip',
       'medianServiceCharge_zip', 'state_Baden_Württemberg', 'state_Bayern',
       'state_Berlin', 'state_Brandenburg', 'state_Bremen', 'state_Hamburg',
       'state_Hessen', 'state_Mecklenburg_Vorpommern', 'state_Niedersachsen',
       'state_Nordrhein_Westfalen', 'state_Rheinland_Pfalz', 'state_Saarland',
       'state_Sachsen', 'state_Sachsen_Anhalt', 'state_Schleswig_Holstein',
       'state_Thüringen', 'newlyConst_No', 'newlyConst_Yes', 'b

# Scaling the features

In [17]:
scaler = StandardScaler(with_mean = False)
X_scaled = scaler.fit_transform(X)

In [18]:
X_scaled[:1]

array([[ 2.76549704,  2.67693789,  4.07245698,  5.32533454, 68.61803019,
         9.77585149,  5.9431358 , 10.88529221, 13.23630657, 11.63627771,
         0.72906719,  7.72816355,  3.92932394,  7.48335937,  5.89461833,
         5.94689157,  6.29448499,  4.08547482,  4.0282086 ,  6.00296784,
         6.18214023,  3.62621026,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  2.36124829,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  3.72039186,  0.        ,
         2.05666549,  0.        ,  2.10778982,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  2.31626557,
         0.        ,  2.17236613,  0.        ,  0.        ,  0.        ,
         2.35800584,  0.        ,  0.        ,  3.10683919,  0.        ,
         0.        ,  0.        ,  0.        ,  0. 
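
With `with_mean = False`, `StandardScaler` divides each column by its standard deviation without subtracting the mean, so the zeros in the dummy columns stay zero, as in the output above. The following sketch checks this on a small made-up array (`X_toy` is an assumption, not the rentals data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One numeric column and two dummy-style columns; values are invented
X_toy = np.array([[50.0, 1.0, 0.0],
                  [70.0, 0.0, 1.0],
                  [90.0, 1.0, 0.0]])

scaler = StandardScaler(with_mean=False)
X_toy_scaled = scaler.fit_transform(X_toy)

# Without centering, scaling is just division by the column standard deviation
# (StandardScaler uses the population standard deviation, like np.std's default)
manual = X_toy / X_toy.std(axis=0)

print(np.allclose(X_toy_scaled, manual))
```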

# Create Train-Test Split

In [19]:
#Split X and y into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.3, random_state = 77)
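
Note that this notebook scales `X` before splitting, so the scaler has seen the test rows. An alternative ordering, shown here only as a sketch on synthetic data, is to split first and fit the scaler on the training rows alone. The arrays `X_toy` and `y_toy` are assumptions generated for illustration; with a dataset as large as this one the difference is likely to be small.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and target; not the rentals data
rng = np.random.default_rng(77)
X_toy = rng.normal(loc=100.0, scale=20.0, size=(200, 3))
y_toy = X_toy @ np.array([5.0, 2.0, 1.0]) + rng.normal(scale=10.0, size=200)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=77
)

# ... then learn the scaling from the training rows only
scaler = StandardScaler(with_mean=False)
X_tr_scaled = scaler.fit_transform(X_tr)

# The test rows are transformed with the training statistics
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```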

# Baseline model

We will use a linear regression algorithm to model the relationship between base rent and the features. We will use the r-squared score and the mean absolute error to assess the model's performance. The r-squared score indicates how much of the variation in base rent is explained by the model, while the mean absolute error gives a realistic sense of the model's typical error in euros.
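
To make the two metrics concrete, the sketch below computes them by hand on a few made-up rents and compares the results with scikit-learn's `r2_score` and `mean_absolute_error`. The arrays `y_true` and `y_pred` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Invented actual and predicted rents in euros
y_true = np.array([500.0, 700.0, 900.0, 1100.0])
y_pred = np.array([550.0, 650.0, 1000.0, 1000.0])

# R-squared: 1 minus (unexplained variation / total variation around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Mean absolute error: average size of the prediction error, in euros
mae_manual = np.mean(np.abs(y_true - y_pred))

print(r2_manual, r2_score(y_true, y_pred))               # 0.875 0.875
print(mae_manual, mean_absolute_error(y_true, y_pred))   # 75.0 75.0
```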

In [20]:
#Instantiate linear regression model
lm = LinearRegression()

In [21]:
#Fit the model to the data
lm.fit(X_train, y_train);

In [22]:
#Determine predictions for train and the test set
y_train_pred = lm.predict(X_train)
y_test_pred = lm.predict(X_test)

In [23]:
#Review performance of model on the training set
r2_score(y_train, y_train_pred), mean_absolute_error(y_train, y_train_pred)

(0.735007318521693, 147.1256859213482)

In [24]:
#Review performance of model on the test set
r2_score(y_test, y_test_pred), mean_absolute_error(y_test, y_test_pred)

(0.7285429457365407, 147.26695580709028)

# Summary

After pre-processing the data for modelling, our baseline linear regression achieves an r2-score of about 0.73 (72.9%) and a mean absolute error of about 147 euros on the test set. The model appears informative and could be improved upon with other algorithms and modelling techniques. On the training set, the r2-score was 73.5% and the MAE was also about 147 euros. The closeness of the training and test scores suggests that there is minimal overfitting.

Going forward, we will attempt to improve on these scores by:
- Choosing select features
- Using other modelling algorithms
- Tuning hyperparameters

# Saving

In [25]:
import pickle
filepath = '../data/interim/'
#Objects to save, keyed by file name
to_save = {
    'lm_base.pkl': lm,
    'X_test_base.pkl': X_test,
    'y_test_base.pkl': y_test,
    'X_train_base.pkl': X_train,
    'y_train_base.pkl': y_train,
    'X_features_base.pkl': X_features,
    'X_scaled.pkl': X_scaled,
    'y.pkl': y,
}
for name, obj in to_save.items():
    #Use a context manager so each file is closed after writing
    with open(filepath + name, 'wb') as f:
        pickle.dump(obj, f)
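
A later notebook can read these objects back with `pickle.load`. The sketch below shows the round trip on a stand-in object written to a temporary directory; `payload` and the file name `example_base.pkl` are hypothetical and used only to keep the example self-contained.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be lm, X_test, and so on
payload = {'features': ['serviceCharge', 'livingSpace'], 'note': 'toy'}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_base.pkl')  # hypothetical file name

    # Write the object, closing the file when done
    with open(path, 'wb') as f:
        pickle.dump(payload, f)

    # Read it back in binary mode
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == payload)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.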