# <b>1 <span style='color:#F76241'>|</span> Some good ol' fashioned data munging</b>
 
<font size="9">A</font>lright! It's time to start preparing the data for machine learning algorithms. No matter the machine learning system, there is _always_ a need to perform some kind of preprocessing. For the housing dataset, we are going to:

- Scale the numerical features so they are on a comparable scale
- Handle missing values
- Handle the categorical column (ideally by converting it to numerical values)

First, let's retrieve the relevant code from the previous notebook:

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split 

df = pd.read_csv("data/housing.csv")

Now we need to deal with missing values. If you recall, in the **housing** dataset, the **total_bedrooms** column has missing values. There are several ways to approach this:

1. Get rid of the rows with missing values for **total_bedrooms**.
2. Get rid of the whole attribute.
3. Set the missing values to some value (zero, the mean, the median, etc.). This is called imputation.

Option 3 is generally the recommended one. For this showcase, however, I'm going to keep things simple and just drop the records with missing values.

In [2]:
df = df.dropna()
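
For comparison, option 3 (imputation) could look roughly like the sketch below. It is not what this notebook does: it uses a small made-up dataframe (`toy`, with invented numbers) instead of the housing data, and it assumes the median as the fill value via sklearn's `SimpleImputer`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for the housing dataframe
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Learn the median of each column, then fill the missing entries with it
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

print(toy_imputed)
```

In a real workflow the imputer would be fitted on the training split only and then applied to the test split, for the same reason given for the scaler further down.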

In [3]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

In [4]:
print(df_train.shape)
df_train.head()

(16346, 10)


| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 17727 | -121.8 | 37.32 | 14.0 | 4412.0 | 924.0 | 2698.0 | 891.0 | 4.7027 | 227600.0 | <1H OCEAN |
| 2057 | -119.63 | 36.64 | 33.0 | 1036.0 | 181.0 | 620.0 | 174.0 | 3.4107 | 110400.0 | INLAND |
| 6453 | -118.06 | 34.12 | 25.0 | 3891.0 | 848.0 | 1848.0 | 759.0 | 3.6639 | 248100.0 | INLAND |
| 4619 | -118.31 | 34.07 | 28.0 | 2362.0 | 949.0 | 2759.0 | 894.0 | 2.2364 | 305600.0 | <1H OCEAN |
| 15266 | -117.27 | 33.04 | 27.0 | 1839.0 | 392.0 | 1302.0 | 404.0 | 3.55 | 214600.0 | NEAR OCEAN |


Next, we need to separate the target (**y values**) from the features in both the training and test dataframes:

In [5]:
X_train = df_train.drop(["median_house_value"],axis=1)
y_train = df_train['median_house_value']

X_test = df_test.drop(["median_house_value"],axis=1)
y_test = df_test["median_house_value"]

In [6]:
X_train.shape

(16346, 9)

In [7]:
y_train.shape

(16346,)

In [8]:
X_test.shape

(4087, 9)

In [9]:
y_test.shape

(4087,)

When it comes to text columns, there are numerous approaches as well. For this example, we're just going to exclude them. Realistically, though, they should be converted into numerical values and used for prediction.

In [10]:
X_train = X_train.drop(columns=["ocean_proximity"])
X_test = X_test.drop(columns=["ocean_proximity"])

The last preprocessing step we'll perform is feature scaling. This puts every numerical feature on a comparable scale, because features with widely differing ranges tend to trip up many machine learning algorithms. We will use sklearn's **StandardScaler** class to do this; it standardizes each column to zero mean and unit variance.

It's important to fit the scaler on the training data, and then **only** transform the test data, so that no information from the test set leaks into the preprocessing, as shown below:

In [11]:
from sklearn.preprocessing import StandardScaler 

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
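
To make the fit/transform distinction concrete, here is a small sketch on invented numbers (the `toy_*` names are hypothetical and separate from the notebook's variables). It checks that `StandardScaler` subtracts the training mean and divides by the training standard deviation, and that the test row is scaled with those same training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
toy_test = np.array([[4.0, 400.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std, then scales
test_scaled = toy_scaler.transform(toy_test)        # reuses the training mean and std

# The same computation by hand (population standard deviation, ddof=0)
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)
manual_train = (toy_train - train_mean) / train_std
manual_test = (toy_test - train_mean) / train_std

print(np.allclose(train_scaled, manual_train))
print(np.allclose(test_scaled, manual_test))
```

After scaling, both training columns have mean 0 and standard deviation 1. The test row does not, which is expected: it is measured against the training distribution.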