# Getting our data ready to be used with machine learning

In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Data doesn't always come ready to use with a Scikit-Learn machine learning model.

### Three main things we have to do: 
1. Split the data into features and labels (usually `X` & `y`)
2. Convert non-numerical values to numerical values (also called feature encoding)
3. Fill (also called imputing) or disregard missing values

In [5]:
heart_disease = pd.read_csv("resources/heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


# 1. Split the data into features and labels (usually X & y)

In [7]:
# Features
X = heart_disease.drop("target", axis=1)
X

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3


Nice! Looks like our dataset has 303 samples with 13 features (13 columns).

Let's check out the labels.

In [8]:
# Label
y = heart_disease["target"]
y

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64

303 labels with values of 0 (no heart disease) and 1 (heart disease).

Now let's split our data into training and test sets. We'll use a 70/30 split (70% of samples for training and 30% of samples for testing).

In [9]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.3)

In [10]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((212, 13), (91, 13), (212,), (91,))

In [11]:
X.shape

(303, 13)

In [12]:
len(heart_disease)

303
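As a quick check on the arithmetic: with `test_size=0.3`, Scikit-Learn rounds the test set size up, so 303 samples become 91 test samples and 212 training samples. The sketch below reproduces this on placeholder arrays. The zero-filled data and `random_state=42` are assumptions for illustration (passing `random_state` makes the split repeatable).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the heart disease features/labels
X_demo = np.zeros((303, 13))
y_demo = np.zeros(303)

# random_state fixes the shuffle so the same rows land in each set on every run
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

print(X_tr.shape, X_te.shape)
```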

# 2. Make sure that the data is numerical

Computers love numbers.

So one thing you'll often have to make sure of is that your datasets are in numerical form.

This even goes for datasets which contain non-numerical features that you may want to include in a model.

For example, if we were working with a car sales dataset, how might we turn features such as `Make` and `Colour` into numbers?

Let's figure it out.

First, we'll import the [car-sales-extended.csv](https://github.com/balaji1974/python-and-machinelearning/blob/main/09%20-%20scikit-learn/resources/car-sales-extended.csv) dataset.

In [14]:
car_sales = pd.read_csv("resources/car-sales-extended.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [15]:
len(car_sales)

1000

We can check the data type of each column with `.dtypes`.

In [16]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

Notice the `Make` and `Colour` features are of `dtype=object` (they're strings), whereas the rest of the columns are of `dtype=int64`.

If we want to use the `Make` and `Colour` features in our model, we'll need to figure out how to turn them into numerical form.

In [17]:
# Split into X/y
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

In [18]:
# Split into training and test
X_train, X_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   test_size=0.2)

Now let's try and build a model on our car_sales data.

In [19]:
# Build a machine learning model
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()

In [20]:
model.fit(X_train, y_train)

ValueError: could not convert string to float: 'Honda'

We get a `ValueError` (some of the data is in string format rather than numerical format).

    ValueError: could not convert string to float: 'Honda'.
    
Machine learning models prefer to work with numbers rather than text.

So we'll have to convert the non-numerical features into numbers first.

The process of turning categorical features into numbers is often referred to as <b>encoding</b>.

Scikit-Learn has a fantastic in-depth guide on [encoding categorical features](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features).

But let's look at one of the most straightforward ways to turn categorical features into numbers, one-hot encoding.

In machine learning, [one-hot encoding](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features) gives a value of `1` to the category a sample belongs to and a value of `0` to all the other categories.

For example, let's say we had five samples and three car make options: Honda, Toyota, and BMW.

And our samples were:

1. Honda
2. BMW
3. BMW
4. Toyota
5. Toyota

If we were to one-hot encode these, it would look like:

| Sample | Honda | Toyota | BMW |
|--------|-------|--------|-----|
| 1      | 1     | 0      | 0   |
| 2      | 0     | 0      | 1   |
| 3      | 0     | 0      | 1   |
| 4      | 0     | 1      | 0   |
| 5      | 0     | 1      | 0   |

Notice how each row has a `1` in the column for its own make and a `0` in every other column.
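The table can be reproduced in a few lines. This is only a sketch of the idea, using `pd.get_dummies()` (covered later in this section) on the five example samples; the variable names are made up for illustration.

```python
import pandas as pd

# The five samples from the example above
makes = pd.Series(["Honda", "BMW", "BMW", "Toyota", "Toyota"], name="Make")

# One column per category: 1 where the sample matches that category, 0 elsewhere
one_hot_demo = pd.get_dummies(makes, dtype=int)

# get_dummies sorts categories alphabetically, so reorder to match the table
one_hot_demo = one_hot_demo[["Honda", "Toyota", "BMW"]]
one_hot_demo.index = range(1, 6)  # label the samples 1 to 5
print(one_hot_demo)
```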

We can use the following steps to one-hot encode our dataset:

1. Import `sklearn.preprocessing.OneHotEncoder` to one-hot encode our features and `sklearn.compose.ColumnTransformer` to target the specific columns of our DataFrame to transform.
2. Define the categorical features we'd like to transform.
3. Create an instance of the OneHotEncoder.
4. Create an instance of ColumnTransformer and feed it the transforms we'd like to make.
5. Fit the instance of the ColumnTransformer to our data and transform it with the fit_transform(X) method.

Note: In Scikit-Learn, the term "transformer" is often used to refer to something that transforms data.

In [75]:
X.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3


In [76]:
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                   remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

> Note: You might be wondering why we treated `Doors` as a categorical variable, which is a good question considering `Doors` is already numerical. The answer is that `Doors` could be either numerical or categorical. I've decided to go with categorical since, where I'm from, the number of doors often defines a different category of car. For example, you can shop for 4-door cars or 5-door cars (which always confused me, since where's the 5th door?). However, you could experiment with treating this value as numerical or categorical, training a model on each, and then seeing how each model performs.

Looks like our samples are all numerical. What did our data look like previously?

In [77]:
X.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3


It seems `OneHotEncoder` and `ColumnTransformer` have turned all of our data samples into numbers.

Let's view the transformed data as a DataFrame.

In [78]:
pd.DataFrame(transformed_X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


In [79]:
# Let's refit the model
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_X,
                                                    y, 
                                                    test_size=0.2)

model.fit(X_train, y_train)

In [80]:
model.score(X_test, y_test)

0.3235867221569877

Another way we can numerically encode data is directly with pandas.

We can use the `pandas.get_dummies()` (or `pd.get_dummies()` for short) function and pass it the columns we'd like to encode.

In return, we'll get a one-hot encoded version of those columns.

Let's use `pd.get_dummies()` to turn our categorical variables into one-hot encoded variables.

In [81]:
# Another way to do it, with pd.get_dummies()
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]], dtype=int)
dummies

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,0,1,0,0,0,0,0,0,1
1,5,1,0,0,0,0,1,0,0,0
2,4,0,1,0,0,0,0,0,0,1
3,4,0,0,0,1,0,0,0,0,1
4,3,0,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...
995,4,0,0,0,1,1,0,0,0,0
996,3,0,0,1,0,0,0,0,0,1
997,4,0,0,1,0,0,1,0,0,0
998,4,0,1,0,0,0,0,0,0,1


Nice!

Notice how there's a new column for each categorical option (e.g. `Make_BMW`, `Make_Honda`, etc).

But notice how it left the `Doors` column as it was?

This is because `Doors` is already numeric, so for `pd.get_dummies()` to encode it, we can change it to type `object`.

By default, `pd.get_dummies()` also returns the new columns as bools (`True` or `False`).

We got the returned values as `0` or `1` above by setting `dtype=int`.
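Another way to include `Doors` without changing its dtype is to name the columns to encode explicitly: when `columns` is given, `pd.get_dummies()` encodes those columns whatever their dtype. A minimal sketch on a made-up sample (the `demo` DataFrame is an assumption, not the real dataset):

```python
import pandas as pd

demo = pd.DataFrame({
    "Make": ["Honda", "BMW", "Nissan"],
    "Colour": ["White", "Blue", "Blue"],
    "Doors": [4, 5, 3],  # numeric, so it is skipped by default
})

# columns= forces Doors to be encoded too; dtype=int gives 0/1 rather than True/False
dummies_demo = pd.get_dummies(demo, columns=["Make", "Colour", "Doors"], dtype=int)
print(dummies_demo.columns.tolist())
```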

We've now turned our data into a fully numeric form using Scikit-Learn and pandas.

Now you might be wondering...

<b>Should you use Scikit-Learn or pandas for turning data into numerical form?</b>

And the answer is either.

But as a rule of thumb:

* If you're performing <b>quick data analysis and running small modelling experiments</b>, use pandas as it's generally quite fast to get up and running.
* If you're performing <b>a larger scale modelling experiment</b> or would like to put your <b>data processing steps into a production pipeline</b>, I'd recommend leaning towards Scikit-Learn, specifically a [Scikit-Learn Pipeline](https://scikit-learn.org/stable/modules/compose.html#pipeline) (chaining together multiple estimator/modelling steps).
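For reference, here is a minimal sketch of what such a pipeline could look like. The toy data, the step names (`"preprocess"`, `"model"`), and `handle_unknown="ignore"` are assumptions chosen for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the car sales data (values are made up)
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota", "Honda", "Nissan", "BMW"],
    "Colour": ["White", "Blue", "White", "Red", "Blue", "Black"],
    "Odometer (KM)": [35431, 192714, 154365, 42652, 181577, 99000],
    "Doors": [4, 5, 4, 4, 3, 5],
    "Price": [15323, 19943, 13434, 23883, 14043, 21000],
})
X_toy, y_toy = toy.drop("Price", axis=1), toy["Price"]

# Encoding and modelling live in one object, so the same steps run at fit and predict time
pipeline = Pipeline([
    ("preprocess", ColumnTransformer(
        [("one_hot", OneHotEncoder(handle_unknown="ignore"), ["Make", "Colour", "Doors"])],
        remainder="passthrough")),
    ("model", RandomForestRegressor(n_estimators=10, random_state=42)),
])

pipeline.fit(X_toy, y_toy)
print(pipeline.predict(X_toy.head(2)))
```

Because the encoder is part of the pipeline, calling `pipeline.predict()` on new rows applies the encoding learned during `fit()` automatically.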

Since we've turned our data into numerical form, how about we try and fit our model again?

Let's recreate a train/test split, except this time we'll use `dummies` instead of `X` (note that `dummies` doesn't include the `Odometer (KM)` column).

In [82]:
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(dummies,
                                                    y, 
                                                    test_size=0.2)

model.fit(X_train, y_train)

In [83]:
model.score(X_test, y_test)

0.2920788538619685

# 3. What if there were missing values?

Holes in the data mean holes in the patterns your machine learning model can learn.

Many machine learning models don't work well or produce errors when they're used on datasets with missing values.

A missing value can appear as a blank, as a `NaN` or something similar.

There are two main options when dealing with missing values:
1. <b>Fill them with some given or calculated value (imputation)</b> - For example, you might fill missing values of a numerical column with the mean of all the other values. The practice of calculating or figuring out how to fill missing values in a dataset is called imputing. For a great resource on <b>imputing</b> missing values, I'd recommend referring to the [Scikit-Learn user guide](https://scikit-learn.org/stable/modules/impute.html).
2. <b>Remove them</b> - If a row or sample has missing values, you may opt to remove them from your dataset completely. However, this potentially results in using less data to build your model.

> <b>Note</b>: Dealing with missing values differs from problem to problem, meaning there's no 100% best way to fill missing values across datasets and problem types. It will often take careful experimentation and practice to figure out the best way to deal with missing values in your own datasets.
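The two options can be sketched on a tiny made-up DataFrame (the values below are illustrative only):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Odometer (KM)": [35431.0, np.nan, 84714.0, 154365.0],
    "Price": [15323.0, 19943.0, np.nan, 13434.0],
})

# Option 1: impute, here with the mean of the non-missing values in each column
imputed = demo.fillna(demo.mean())

# Option 2: remove every row that has at least one missing value
dropped = demo.dropna()

print(len(imputed), len(dropped))
```

Imputing keeps all four rows, while dropping leaves only the two complete ones, which is the trade-off described above.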

To practice dealing with missing values, let's import a version of the `car_sales` dataset with several missing values (namely [car-sales-extended-missing-data.csv](https://github.com/balaji1974/python-and-machinelearning/blob/main/09%20-%20scikit-learn/resources/car-sales-extended-missing-data.csv)).

In [182]:
# Import car sales missing data
car_sales_missing = pd.read_csv("resources/car-sales-extended-missing-data.csv")
car_sales_missing.head(10)

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
5,Honda,Red,42652.0,4.0,23883.0
6,Toyota,Blue,163453.0,4.0,8473.0
7,Honda,White,,4.0,20306.0
8,,White,130538.0,4.0,9374.0
9,Honda,Blue,51029.0,4.0,26683.0


Notice the `NaN` value in row 7 for the `Odometer (KM)` column, which means pandas has detected a missing value there.

However, if your dataset is large, it's unlikely you're going to go through it sample by sample to find the missing values.

Luckily, pandas has a method called `pd.DataFrame.isna()` which is able to detect missing values.

Let's try it on our DataFrame.

In [86]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

Hmm... seems there are about 50 missing values per column.
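Counts are often easier to judge as a share of the rows. A small sketch on a made-up DataFrame (not the real dataset) showing a few variations on `isna()`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Make": ["Honda", None, "Toyota", "BMW"],
    "Odometer (KM)": [35431.0, 192714.0, np.nan, np.nan],
})

missing_count = demo.isna().sum()               # missing values per column
missing_share = demo.isna().mean()              # fraction of rows missing, per column
rows_with_gaps = demo.isna().any(axis=1).sum()  # rows with at least one missing value

print(missing_share)
```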

How about we try and split the data into features and labels, then convert the categorical data to numbers, then split the data into training and test and then try and fit a model on it (just like we did before)?

In [87]:
# Create X & y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]
print(f"Number of missing y values: {y.isna().sum()}")

Number of missing y values: 50

Now we can convert the categorical columns into one-hot encodings (just as before).

In [88]:
# Let's try and convert our data to numbers
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                   remainder="passthrough")

transformed_X = transformer.fit_transform(X)
# Recent versions of Scikit-Learn don't raise an error here: OneHotEncoder treats NaN as its own category
transformed_X

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 4000 stored elements and shape (1000, 16)>

In [89]:
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [90]:
car_sales_missing["Doors"].value_counts()

Doors
4.0    811
5.0     75
3.0     64
Name: count, dtype: int64

In [91]:
car_sales_missing["Doors"].isna().sum()

50

## 3.1 Option 1: Fill missing data with some other data (imputation) using Pandas

Let's see how we might fill missing values with pandas.

For categorical values, one of the simplest ways is to fill the missing fields with the string `"missing"`.

We could do this for the `Make` and `Colour` features.

As for the `Doors` feature, we could use `"missing"` or we could fill it with the most common option of `4`.

With the `Odometer (KM)` feature, we can use the mean value of all the other values in the column.

And finally, for those samples which are missing a `Price` value, we can remove them. Since `Price` is the target value, removing these samples probably causes less harm than imputing them (though you could design an experiment to test this).

In summary:

| Column/Feature |	Fill missing value with         |
|----------------|----------------------------------|
| Make	         | "missing"                        |
| Colour	     | "missing"                        |
| Doors	         | 4 (most common value)            |
| Odometer (KM)	 | mean of Odometer (KM)            |
| Price (target) | NA, remove samples missing Price |

> <b>Note</b>: The practice of filling missing data with given or calculated values is called imputation. And it's important to remember there's no perfect way to fill missing data (unless it's with data that should've actually been there in the first place). The methods we're using are only one of many. The techniques you use will depend heavily on your dataset. A good place to look would be searching for "data imputation techniques".

Let's start with the `Make` column.

We can use the pandas method `fillna("missing")` and assign the result back to the column to fill all the missing values with the string `"missing"`.

In [93]:
# Fill the "Make" column with the value "missing"
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")

# Fill the "Colour" column with the value "missing"
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

# Fill the "Odometer (KM)" column with the mean of the non-missing values
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())

# Fill the "Doors" column with 4 (the most common value, which is also the median and mode of the column)
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)

In [94]:
# Check our dataframe again
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64
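The per-column fills can also be written as one call, since `DataFrame.fillna()` accepts a dictionary mapping column names to fill values. A minimal sketch on made-up data (the `demo` DataFrame and `fill_values` name are assumptions for illustration):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Make": ["Honda", np.nan, "Toyota"],
    "Colour": [np.nan, "Blue", "White"],
    "Odometer (KM)": [30000.0, np.nan, 50000.0],
    "Doors": [4.0, 4.0, np.nan],
})

# One fill value per column, mirroring the summary table
fill_values = {
    "Make": "missing",
    "Colour": "missing",
    "Odometer (KM)": demo["Odometer (KM)"].mean(),
    "Doors": 4,
}
filled = demo.fillna(fill_values)
print(filled)
```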

Finally, we can remove the rows that are missing the target value `Price`.

><b>Note</b>: Another option would be to impute the `Price` value with the mean, median, or some other calculated value (such as by using similar cars to estimate the price); however, to keep things simple and prevent introducing too many fake labels to the data, we'll remove the samples missing a `Price` value.

We can remove rows with missing values in place from a pandas DataFrame with the `pandas.DataFrame.dropna(inplace=True)` method.

In [95]:
# Remove rows with missing Price value
car_sales_missing.dropna(inplace=True)

There should be no more missing values!

In [96]:
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [97]:
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

Can we fit a model now?

Let's try!

We've already created the features and labels (`X` and `y`) in the cell above.

Then we'll convert categorical variables into numbers via one-hot encoding.

Then we'll split the data into training and test sets just like before.

Finally, we'll try to fit a `RandomForestRegressor()` model to the newly filled data.

In [98]:
# Let's try and convert our data to numbers
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                   remainder="passthrough")

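# Caution: car_sales_missing still contains the "Price" column, so the target ends up
# among the features here. Pass X (which excludes "Price") to avoid this.
transformed_X = transformer.fit_transform(car_sales_missing)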
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        3.54310e+04, 1.53230e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        1.92714e+05, 1.99430e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        8.47140e+04, 2.83430e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        6.66040e+04, 3.15700e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.15883e+05, 4.00100e+03],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.48360e+05, 1.27320e+04]])

In [99]:
# Let's refit the model
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_X,
                                                    y, 
                                                    test_size=0.2)

model.fit(X_train, y_train)

In [100]:
model.score(X_test, y_test)

0.9998421058539825

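Filling the missing values with pandas worked: our model can be fit to the data without errors.

Be suspicious of the near-perfect score, though. The transformer above was fit on `car_sales_missing`, which still contains the `Price` column (it's the last column of the `transformed_X` output), so the model was given the target as a feature. Passing `X` to `transformer.fit_transform()` instead keeps `Price` out of the features and gives an honest score.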
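To see how much a target column among the features can inflate a score, here's a minimal sketch on synthetic data. The arrays, the `holdout_score` helper, and the `random_state` values are all made up for illustration and aren't part of the car sales example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic data: three features that carry no information about the target
features = rng.normal(size=(300, 3))
target = rng.normal(loc=15000, scale=5000, size=300)

# "Leaky" features: the same columns plus the target itself
leaky_features = np.column_stack([features, target])

def holdout_score(X, y):
    """Fit a random forest on 80% of the data and return R^2 on the other 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestRegressor(random_state=42).fit(X_tr, y_tr)
    return model.score(X_te, y_te)

clean_score = holdout_score(features, target)
leaky_score = holdout_score(leaky_features, target)
print(f"without target: {clean_score:.2f}, with target: {leaky_score:.2f}")
```

The uninformative features alone score poorly, while adding the target as a column pushes the score close to 1.0, so a near-perfect result is a good prompt to check what went into the features.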

## 3.2 Option 2: Filling missing data and transforming categorical data with Scikit-Learn
Note: This section is different to the video. The video shows filling and transforming the entire dataset (`X`). Although the techniques are correct, it's best to fill and transform the training and test sets separately (as shown in the code below).

The main takeaways:

* Split your data first (into train/test)
* Fill/transform the training and test sets separately, using values learned from the training set

In [201]:
car_sales_missing = pd.read_csv("resources/car-sales-extended-missing-data.csv")
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [203]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

To begin, we'll remove the rows which are missing a Price value.

In [205]:
# Drop the rows with no labels
car_sales_missing.dropna(subset=["Price"], inplace=True)
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

Now there are no rows missing a Price value.

Since we don't have to fill any Price values, let's split our data into features (X) and labels (y).

We'll also split the data into training and test sets.

In [207]:
# Split into X & y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

# Split data into train and test
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2)

In [209]:
# Check missing values
X.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
dtype: int64

> <b>Note</b>: We've split the data into train & test sets here first to perform filling missing values on them separately. This is best practice as the test set is supposed to emulate data the model has never seen before. For categorical variables, it's generally okay to fill values across the whole dataset. However, for numerical variables, you should <b>only fill values on the test set that have been computed from the training set</b>.

Training and test sets created!

Let's now set up a few instances of `SimpleImputer()` to fill various missing values.

We'll use the following strategies and fill values:

* For the categorical columns (`Make`, `Colour`): `strategy="constant", fill_value="missing"` (fill the missing samples with a consistent value of `"missing"`).
* For the `Doors` column: `strategy="constant", fill_value=4` (fill the missing samples with a consistent value of `4`).
* For the numerical column (`Odometer (KM)`): `strategy="mean"` (fill the missing samples with the mean of that column).

Note: There are more strategies and fill options in the `SimpleImputer()` documentation.
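As one example of another strategy, `strategy="most_frequent"` fills with the most common value in each column, which avoids hard-coding `4` for `Doors`. A minimal sketch on made-up values (the `doors` array is an assumption for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up Doors column with one missing entry
doors = np.array([[4.0], [4.0], [5.0], [np.nan], [3.0]])

mode_imputer = SimpleImputer(strategy="most_frequent")
filled_doors = mode_imputer.fit_transform(doors)

print(mode_imputer.statistics_)  # the value learned during fit
print(filled_doors.ravel())
```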

In [211]:
# Fill missing values with Scikit-Learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with 'missing' & numerical values with mean
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="mean")

Imputers created!

Now let's define which columns we'd like to impute on.

Why?

Because we'll need these shortly (I'll explain in the next text cell).

In [213]:
# Define columns
cat_features = ["Make", "Colour"]
door_feature = ["Doors"]
num_features = ["Odometer (KM)"]

Columns defined!

Now how might we transform our columns?

Hint: we can use the `sklearn.compose.ColumnTransformer` class from Scikit-Learn, in a similar way to what we did before to get our data to all numeric values.

That's the reason we defined the columns we'd like to transform.

So we can use the `ColumnTransformer()` class.

`ColumnTransformer()` takes as input a list of tuples in the form `(name_of_transform, transformer_to_use, columns_to_transform)` specifying which columns to transform and how to transform them.

For example:

imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features)
])

In this case, the variables in the tuple are:

* `name_of_transform` = `"cat_imputer"`
* `transformer_to_use` = `cat_imputer` (the instance of `SimpleImputer()` we defined above)
* `columns_to_transform` = `cat_features` (the list of categorical features we defined above).

Let's expand upon this by extending the example.

In [220]:
# Create an imputer (something that fills missing data)
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features),
    ("door_imputer", door_imputer, door_feature),
    ("num_imputer", num_imputer, num_features)
])

The next step is to fit our `ColumnTransformer()` instance `(imputer)` to the training data and transform the testing data.

In other words we want to:

1. Learn the imputation values from the training set.
2. Fill the missing values in the training set with the values learned in 1.
3. Fill the missing values in the testing set with the values learned in 1.

Why this way?

In our case, we're not calculating many variables (except the mean of the `Odometer (KM)` column), however, remember that the test set should always remain as unseen data.

So <b>when filling values in the test set, only use values calculated or imputed from the training set</b>.

We can achieve steps 1 & 2 simultaneously with the `ColumnTransformer.fit_transform()` method (fit = find the values to fill, transform = fill them).

And then we can perform step 3 with the `ColumnTransformer.transform()` method (we only want to transform the test set, not learn different values to fill).
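The difference between the two methods can be seen on a small made-up example (the odometer readings below are invented for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up odometer readings, one missing value in each set
train_km = np.array([[100.0], [200.0], [np.nan]])
test_km = np.array([[np.nan], [400.0]])

mean_imputer = SimpleImputer(strategy="mean")

filled_train = mean_imputer.fit_transform(train_km)  # learns mean = 150 from the training set
filled_test = mean_imputer.transform(test_km)        # reuses 150, never computes a test mean

print(mean_imputer.statistics_)
print(filled_test.ravel())
```

Calling `fit_transform()` on the test set instead would have filled the gap with the test set's own mean (400 here), which is exactly the kind of information the model shouldn't have access to.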

In [224]:
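# Fill train and test values separately
filled_X_train = imputer.fit_transform(X_train)  # learn the fill values from the training set and fill it
filled_X_test = imputer.transform(X_test)  # fill the test set with the values learned from the training set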

# Check filled X_train
filled_X_train

array([['Honda', 'White', 4.0, 71934.0],
       ['Toyota', 'Red', 4.0, 162665.0],
       ['Honda', 'White', 4.0, 42844.0],
       ...,
       ['Toyota', 'White', 4.0, 196225.0],
       ['Honda', 'Blue', 4.0, 133117.0],
       ['Honda', 'missing', 4.0, 150582.0]], dtype=object)

Let's now turn our filled_X_train and filled_X_test arrays into DataFrames to inspect their missing values.

In [232]:
# Get our transformed data arrays back into DataFrames
car_sales_filled_train = pd.DataFrame(filled_X_train, 
                                      columns=["Make", "Colour", "Doors", "Odometer (KM)"])

car_sales_filled_test = pd.DataFrame(filled_X_test, 
                                     columns=["Make", "Colour", "Doors", "Odometer (KM)"])
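The column order passed to `pd.DataFrame()` matters here: `ColumnTransformer` returns columns in the order its transformers are listed, not in the order of the input DataFrame. A minimal sketch on a made-up two-row DataFrame (names prefixed with `demo` are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Input order: Make, Colour, Odometer (KM), Doors
demo = pd.DataFrame({
    "Make": ["Honda", np.nan],
    "Colour": ["White", "Blue"],
    "Odometer (KM)": [1000.0, np.nan],
    "Doors": [np.nan, 5.0],
})

demo_imputer = ColumnTransformer([
    ("cat_imputer", SimpleImputer(strategy="constant", fill_value="missing"), ["Make", "Colour"]),
    ("door_imputer", SimpleImputer(strategy="constant", fill_value=4), ["Doors"]),
    ("num_imputer", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])

# Output order follows the transformer list: Make, Colour, Doors, Odometer (KM)
filled_demo = demo_imputer.fit_transform(demo)
print(filled_demo)
```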

And is there any missing data in the test/train set?

In [236]:
# Check missing data in training set
car_sales_filled_train.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [238]:
# Check missing data in test set
car_sales_filled_test.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

What about the original?

In [241]:
# Check to see the original... still missing values
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

No more missing values in the filled training and test sets (the original DataFrame is unchanged)!

But wait...

Is our data all numerical?

In [248]:
car_sales_filled_train.head()

Unnamed: 0,Make,Colour,Doors,Odometer (KM)
0,Honda,White,4.0,71934.0
1,Toyota,Red,4.0,162665.0
2,Honda,White,4.0,42844.0
3,Honda,White,4.0,195829.0
4,Honda,Blue,4.0,219217.0


Looks like our `Make` and `Colour` columns are still strings.

Let's one-hot encode them along with the `Doors` column to make sure they're numerical, just as we did previously.

In [252]:
# Now let's one hot encode the features with the same code as before 
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", 
                                 one_hot, 
                                 categorical_features)],
                                 remainder="passthrough")

# Fit the encoder on the training set, then apply it to the test set
transformed_X_train = transformer.fit_transform(car_sales_filled_train)
transformed_X_test = transformer.transform(car_sales_filled_test)

# Check transformed and filled X_train
transformed_X_train.toarray()

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 7.19340e+04],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.62665e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 4.28440e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.96225e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.33117e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.50582e+05]])
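One reason to fit the encoder on the training set only and reuse it on the test set: an encoder fitted separately on the test set can produce a different set of columns. A minimal sketch with made-up makes; `handle_unknown="ignore"` is an assumption added for this illustration so that a category unseen during training doesn't raise an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_makes = np.array([["Honda"], ["BMW"], ["Toyota"]])
test_makes = np.array([["Honda"], ["Nissan"]])  # "Nissan" never appears in training

# Fit on the training set only; unseen categories become an all-zero row
encoder = OneHotEncoder(handle_unknown="ignore")
encoded_train = encoder.fit_transform(train_makes).toarray()
encoded_test = encoder.transform(test_makes).toarray()

# Fitting a second encoder on the test set gives a different number of columns
refit_test = OneHotEncoder().fit_transform(test_makes).toarray()

print(encoded_train.shape, encoded_test.shape, refit_test.shape)
```

With a shared encoder the training and test arrays have the same three columns, whereas the refitted version has two, which a model trained on three columns couldn't use.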

Now our data is:

1. All numerical  
2. No missing values  

Let's try to fit a model!  

In [111]:
# Now we've transformed X, let's see if we can fit a model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()

# Make sure to use transformed (filled and one-hot encoded X data)
model.fit(transformed_X_train, y_train)
model.score(transformed_X_test, y_test)

0.25366332156443805

In [112]:
# Check length of transformed data (filled and one-hot encoded)
# vs. length of original data
len(transformed_X_train.toarray())+len(transformed_X_test.toarray()), len(car_sales)

(950, 1000)

You might have noticed this result is different to before (and that 950 samples remain rather than 1000, since we dropped the 50 rows missing a `Price`).

Why do you think this is?

It's because we've created our training and testing sets differently.

We split the data into training and test sets before filling the missing values.

Previously, we did the reverse: we filled missing values before splitting the data into training and test sets.

Filling before splitting can lead to information from the testing set leaking into the training set.

Remember, one of the most important concepts in machine learning is making sure your model doesn't see any testing data before evaluation.

We'll keep practicing but for now, some of the main takeaways are:

* Keep your training and test sets separate.
* Most datasets you come across won't be in a form ready to use with machine learning models straight away, and some may take more preparation than others.
* For most machine learning models, your data has to be numerical. This will involve converting whatever you're working with into numbers. This process is often referred to as `feature engineering` or `feature encoding`.
* Some machine learning models aren't compatible with missing data. The process of filling missing data is referred to as `data imputation`.