# Getting Data Ready

Getting data ready for a scikit-learn model usually involves three main steps:

1. Splitting the data into features and labels (usually `X` and `y`)
2. Converting non-numerical values to numerical values (also called feature encoding)
3. Filling (also called imputing) or disregarding missing values

- [Splitting the data](#splitting-the-data)
- [Making sure the data is all numerical](#making-sure-the-data-is-all-numerical)
  - [One-Hot Encoding](#one-hot-encoding)
  - [Column Transformer](#column-transformer)
- [Filling missing data](#filling-missing-data)
  - [Filling missing data with pandas](#fill-missing-data-with-pandas)
  - [Filling missing data with Scikit-learn](#filling-missing-data-with-scikit-learn)
    - [Steps](#steps)


In [1]:
# Importing packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

## Splitting the Data

Splitting data is a crucial step in any scikit-learn workflow because it helps avoid biased evaluation of a machine learning model. Imagine training a model on all of the available data, then using that same data to measure its performance. The model might appear to do very well on this task, but that's because it's basically memorizing the data it was trained on.

This can lead to **overfitting**, where the model performs well on the training data but struggles with new, unseen data. Splitting the data into separate training and test sets ensures the model is evaluated on data it hasn't seen before. This provides a more realistic assessment of the model's ability to generalize to new situations. By training the model on the training set and then testing its performance on the test set, you get a more accurate picture of how well it will perform in real-world scenarios.
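As a rough illustration of the gap described above, the sketch below trains a decision tree on synthetic data and scores it both on the data it was trained on and on a held-out test set. The synthetic dataset from `make_classification`, the `DecisionTreeClassifier` model, and all parameter values are assumptions chosen for this example; they are not part of the heart disease workflow used in this section.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y), so memorizing is not enough
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     flip_y=0.2, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

# An unconstrained tree can memorize the training set
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_tr, y_tr)

# Score on data the model has seen and on data it has not
train_score = tree.score(X_tr, y_tr)
test_score = tree.score(X_te, y_te)
print(f'Train accuracy: {train_score:.2f}')
print(f'Test accuracy:  {test_score:.2f}')
```

The training score tends to be noticeably higher than the test score here, which is why the test score is the one worth reporting.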

Splitting a dataset consists of a few steps:


1. Load the dataset. As before, read the dataset with the appropriate function from pandas.


In [2]:
# Importing heart disease DF
heart_disease = pd.read_csv('../datasets/heart-disease.csv')

heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


2. Separate the target from the data. In this step the data is split into two parts: the features, which make up the bulk of the data (**_X_**), and the target value to be predicted (**_y_**).


In [3]:
# Separating target from the data
X = heart_disease.drop('target', axis=1)
y = heart_disease['target']

3. Split the data and target into training and test sets. The final step is to divide both **_X_** and **_y_** into a training set and a test set. The test set is the smaller part, usually around $20$% of the data; the rest is used for training.


In [4]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [5]:
# Checking the sizes of each set of data
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))
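Because `train_test_split` shuffles the rows before splitting, two calls usually produce different sets. The following is a minimal sketch, assuming a small made-up DataFrame, of how the `random_state` parameter makes the split repeatable. The `toy` data and variable names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, one feature, one binary target
toy = pd.DataFrame({'feature': range(10),
                    'target': [0, 1] * 5})
X_toy = toy.drop('target', axis=1)
y_toy = toy['target']

# The same random_state gives the same split on every call
first = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
second = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

X_toy_train, X_toy_test, y_toy_train, y_toy_test = first
print(X_toy_train.shape, X_toy_test.shape)  # (8, 1) (2, 1)
print(X_toy_test.index.equals(second[1].index))  # True
```

The notebook cells in this section call `np.random.seed(42)` instead, which serves a similar purpose as long as the cells run in the same order.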

## Making sure the data is all numerical

Scikit-learn algorithms are designed to work with numerical data. This is because they rely on mathematical operations like addition, subtraction, multiplication, and distance calculations to analyze features and make predictions. Text data, categorical data (like colors or sizes), and dates all require special handling before feeding them into scikit-learn models.

If you leave your data in these non-numerical formats, the algorithms won't be able to understand the relationships between the features and the target variable. This can lead to nonsensical results or errors during training.
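One way to find out which columns need converting is to inspect the column dtypes. The sketch below assumes a small hand-written DataFrame (`cars`, hypothetical) with the same kind of columns as the car sales data, and uses pandas' `select_dtypes` to list the non-numerical ones.

```python
import pandas as pd

# Hypothetical sample with the same kind of columns as the car sales data
cars = pd.DataFrame({
    'Make': ['Honda', 'BMW', 'Toyota'],
    'Colour': ['White', 'Blue', 'White'],
    'Odometer (KM)': [35431, 192714, 154365],
    'Doors': [4, 5, 4],
})

# Columns whose dtype is not numeric need encoding before modelling
non_numeric = cars.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)  # ['Make', 'Colour']
```

A numeric dtype does not always mean a column is a quantity: the examples in this section treat `Doors` as a category even though it is stored as a number.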


In [6]:
# Importing dataset with non-numerical data
car_sales = pd.read_csv('../datasets/car-sales-extended.csv')
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [7]:
# Split into X/y
X = car_sales.drop('Price', axis=1)
y = car_sales['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

To convert the non-numerical columns into numbers, scikit-learn offers two tools.

### One-Hot Encoding

The first is the `OneHotEncoder` class, which applies **_one-hot encoding_**. One-hot encoding is a technique used in machine learning to convert categorical variables, like "shirt size" (small, medium, large) or "fruit" (apple, banana, orange), into a numerical format that scikit-learn models can understand. It does this by creating a new binary feature for each unique category within the original variable.

Each new feature is assigned a value of 1 if the data point belongs to that category and 0 otherwise. This essentially creates a new "one-hot" vector for each data point, where only one element is active at a time. This allows scikit-learn models to treat these categorical variables like numerical features and learn the relationships between them and the target variable.
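A minimal sketch of the idea, assuming a made-up "shirt size" column (the `shirts` DataFrame is hypothetical). Each row of the result has a single 1 in the column that corresponds to its category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column
shirts = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium']})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
encoded = encoder.fit_transform(shirts).toarray()

# Categories are stored in sorted order: large, medium, small
print(encoder.categories_[0])
print(encoded)
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
```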

### Column Transformer

The second is the `ColumnTransformer`, a scikit-learn tool for handling mixed data types in a machine learning workflow. Imagine a dataset containing numerical values like age and income, alongside categorical data like job title or city. Without it, each data type would need its own separate transformation step.

The `ColumnTransformer` simplifies this process by allowing you to define different transformers for different subsets of columns. You can specify transformations for numerical columns (like scaling) and categorical columns (like one-hot encoding) within a single transformer object. This streamlines your workflow, keeps your code organized, and ensures each data type is transformed appropriately before feeding it into your machine learning model.

A `ColumnTransformer` takes a few parameters, but the most important one is `transformers`. This parameter takes a list of tuples of the form `(name, transformer, columns)`:

- `name` is a user-defined name for the transformer; it helps identify the purpose of the transformation applied to that specific set of columns.
- `transformer` is the actual scikit-learn transformer to be applied, such as a `StandardScaler` (numerical scaling), a `OneHotEncoder`, or a custom transformer.
- `columns` specifies the columns in the data that the transformer should be applied to; this can be a list of column names or their corresponding indexes.
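The sketch below combines two transformers in a single `ColumnTransformer`: a `StandardScaler` for the numerical columns and a `OneHotEncoder` for a categorical one. The `people` DataFrame, its column names, and the transformer names `'scale'` and `'one_hot'` are made up for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type data
people = pd.DataFrame({
    'age': [25, 35, 45, 55],
    'income': [30000, 45000, 60000, 75000],
    'city': ['Lisbon', 'Porto', 'Lisbon', 'Faro'],
})

preprocessor = ColumnTransformer(
    [
        # (name, transformer, columns)
        ('scale', StandardScaler(), ['age', 'income']),
        ('one_hot', OneHotEncoder(), ['city']),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

transformed = preprocessor.fit_transform(people)
print(transformed.shape)  # (4, 5): 2 scaled columns + 3 city columns
```

Columns that are not listed in any tuple are dropped by default; passing `remainder='passthrough'` keeps them unchanged instead.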


In [8]:
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Selecting categories by name
categorical_features = ['Make', 'Colour', 'Doors']

# Creating one-hot-encoder
one_hot = OneHotEncoder()

# Creating the transformer
transformer = ColumnTransformer(
    [('one_hot', one_hot, categorical_features)],
    remainder='passthrough',
)

# Fitting the data into the transformer
transformed_X = transformer.fit_transform(X)

In [9]:
# Fitting the model
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

X_train, X_test, y_train, y_test = train_test_split(transformed_X,
                                                    y,
                                                    test_size=0.2)

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.3235867221569877

## Filling missing data

There are two main ways of dealing with missing data:

1. Fill them with some value (imputation)
2. Remove the samples with missing data altogether

Filling missing data is crucial in a scikit-learn workflow because many machine learning algorithms simply cannot handle data with missing values. These missing entries, often represented by _NaN_ (Not a Number), can disrupt the calculations and logic these algorithms rely on. Leaving them unaddressed can lead to errors during training or nonsensical results.

Filling, or imputing, the missing data allows you to utilize the full potential of your dataset and train more robust models. By replacing missing values with appropriate estimates (like mean, median, or mode), you ensure the algorithms can process the data effectively and learn the underlying relationships between features. This can significantly improve the accuracy and generalizability of a machine learning model.
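As a small illustration of why missing values have to be handled first, the sketch below tries to fit a `LinearRegression` on a column that contains a `NaN`, then fills the gap with the column mean and fits again. The model choice and the toy data (`X_nan`, `y_nan`) are assumptions for this example only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical feature with one missing value
X_nan = pd.DataFrame({'odometer': [10.0, 20.0, np.nan, 40.0]})
y_nan = pd.Series([1.0, 2.0, 3.0, 4.0])

# LinearRegression does not accept NaN values, so this fit raises
try:
    LinearRegression().fit(X_nan, y_nan)
    fit_failed = False
except ValueError as error:
    fit_failed = True
    print('Fit failed:', type(error).__name__)

# Fill the gap with the column mean, after which the fit succeeds
X_nan_filled = X_nan.fillna(X_nan['odometer'].mean())
model_filled = LinearRegression().fit(X_nan_filled, y_nan)
print(X_nan_filled.isna().sum().sum())  # 0
```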


### Fill missing data with pandas

Both techniques can be applied with pandas before the data is passed to a scikit-learn model.


In [10]:
# Importing dataset with missing data
car_sales_missing = pd.read_csv(
    '../datasets/car-sales-extended-missing-data.csv')
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [11]:
# Finding how many missing values there are
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

To fill the missing data, use the `fillna` method on pandas DataFrames.

To drop the rows that contain missing (`NaN`) values instead, use `dropna`.
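The difference between the two methods, and the effect of the `subset` parameter of `dropna`, can be sketched on a small made-up DataFrame (`sales` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a feature and in the target
sales = pd.DataFrame({
    'Doors': [4.0, np.nan, 4.0, 3.0],
    'Price': [15000.0, 20000.0, np.nan, 14000.0],
})

# Drop every row that has at least one missing value
no_missing = sales.dropna()

# Drop only the rows where the target is missing
with_target = sales.dropna(subset=['Price'])

# Fill the remaining gaps column by column
filled = with_target.fillna({'Doors': 4})

print(len(no_missing), len(with_target))  # 2 3
print(filled['Doors'].tolist())  # [4.0, 4.0, 3.0]
```

Neither call uses `inplace=True` here, so `sales` itself is left unchanged and each result is a new DataFrame.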


In [12]:
# Filling the columns with values
car_sales_missing.fillna({'Make': 'missing'}, inplace=True)

car_sales_missing.fillna({'Colour': 'missing'}, inplace=True)

car_sales_missing.fillna(
    {'Odometer (KM)': car_sales_missing["Odometer (KM)"].mean()}, inplace=True)

car_sales_missing.fillna({'Doors': 4}, inplace=True)

# Removing the rows without a target
car_sales_missing.dropna(inplace=True)

# Checking how many missing values there are
car_sales_missing.isna().sum()


Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [13]:
# Splitting the data
X_missing = car_sales_missing.drop('Price', axis=1)
y_missing = car_sales_missing['Price']

In [14]:
# Converting categorical values to numbers

# Setting the categories
categorical_features = ['Make', 'Colour', 'Doors']

# Creating an instance of a one-hot encoder
one_hot = OneHotEncoder()

# Creating transformer
transformer = ColumnTransformer([('one_hot', one_hot, categorical_features)],
                                remainder='passthrough',
                                sparse_threshold=0)

# Fitting the transformer into the data and showing it
transformed_X_missing = transformer.fit_transform(X_missing)
transformed_X_missing

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [17]:
# Split data into training and test sets
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_X_missing,
                                                    y_missing,
                                                    test_size=0.2)

# Fit and score a model
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.22011714008302485

### Filling missing data with Scikit-Learn

Much like pandas, Scikit-learn offers tools for preprocessing data, including filling missing values.


In [18]:
# Re-instantiating dataset
car_sales_missing = pd.read_csv(
    '../datasets/car-sales-extended-missing-data.csv')
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [19]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [20]:
# Drop the rows with no labels
car_sales_missing.dropna(subset=['Price'], inplace=True)
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [21]:
# Split into X and y
X = car_sales_missing.drop('Price', axis=1)
y = car_sales_missing['Price']

np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train

Unnamed: 0,Make,Colour,Odometer (KM),Doors
986,Honda,White,71934.0,4.0
297,Toyota,Red,162665.0,4.0
566,Honda,White,42844.0,4.0
282,Honda,White,195829.0,4.0
109,Honda,Blue,219217.0,4.0
...,...,...,...,...
106,Toyota,,218803.0,4.0
277,BMW,Blue,245427.0,5.0
904,Toyota,White,196225.0,4.0
450,Honda,Blue,133117.0,


#### Steps

1. The first step in preprocessing with Scikit-learn is to create transformers for the intended data.

If all of the data should be transformed in the same way, a single transformer is enough; otherwise, each kind of change should be made by its own transformer. A simple transformer for missing values can be created with the `SimpleImputer` class. This class receives a `strategy` parameter, which sets the imputation strategy to use. The strategies can be:

- `mean`, which replaces the missing values with the mean of each column; it only works with numerical data.
- `median`, which replaces the missing values with the median of each column; it also only works with numerical data.
- `most_frequent`, which replaces the missing values with the most frequent value of each column; it can be used with both numerical and string values.
- `constant`, which replaces the missing values with the value passed in the `fill_value` parameter; it can also be used with both numerical and string values.

2. The second step is to define which columns are to be changed by each transformer.

3. The third step is to use `ColumnTransformer` to apply each transformer to its columns.

4. The final step is to call `fit_transform` on the `ColumnTransformer`, which fits the imputers and returns the filled data.
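To see what each strategy does, the sketch below applies all four to the same made-up numerical column and records the value that ends up in the missing slot. The `column` array and the `fill_value=0` choice are assumptions for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numerical column with one missing value (row 2)
column = np.array([[1.0], [2.0], [np.nan], [2.0], [10.0]])

filled_values = {}
for strategy in ['mean', 'median', 'most_frequent']:
    demo_imputer = SimpleImputer(strategy=strategy)
    filled = demo_imputer.fit_transform(column)
    # Row 2 held the missing value, so it shows what was imputed
    filled_values[strategy] = float(filled[2, 0])

# The constant strategy needs the replacement value in fill_value
constant_imputer = SimpleImputer(strategy='constant', fill_value=0)
filled_values['constant'] = float(constant_imputer.fit_transform(column)[2, 0])

print(filled_values)
# {'mean': 3.75, 'median': 2.0, 'most_frequent': 2.0, 'constant': 0.0}
```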


In [22]:
# Fill missing values with Scikit-Learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with 'missing' and numerical values with mean
cat_imputer = SimpleImputer(strategy='constant', fill_value='missing')
door_imputer = SimpleImputer(strategy='constant', fill_value=4)
num_imputer = SimpleImputer(strategy='mean')

# Define columns
cat_features = ['Make', 'Colour']
door_feature = ['Doors']
num_feature = ['Odometer (KM)']

# Create an imputer (something that fills missing data)
imputer = ColumnTransformer([
    ('cat_imputer', cat_imputer, cat_features),
    ('door_imputer', door_imputer, door_feature),
    ('num_features', num_imputer, num_feature),
])

filled_X = imputer.fit_transform(X)
filled_X


array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)

In [23]:
# Converting the transformed data into a DF
car_sales_filled = pd.DataFrame(
    filled_X, columns=['Make', 'Colour', 'Doors', 'Odometer'])
car_sales_filled.head()

Unnamed: 0,Make,Colour,Doors,Odometer
0,Honda,White,4.0,35431.0
1,BMW,Blue,5.0,192714.0
2,Honda,White,4.0,84714.0
3,Toyota,White,4.0,154365.0
4,Nissan,Blue,3.0,181577.0


In [24]:
# Finding the amount of Na data in the DF
car_sales_filled.isna().sum()

Make        0
Colour      0
Doors       0
Odometer    0
dtype: int64
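The cells above fit the imputer on all of `X`. A common variation, sketched here under the assumption that the fill values should be learned from the training split only, is to call `fit_transform` on the training data and `transform` on the test data. The toy DataFrame and every `toy_` name below are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical numerical features with a few gaps
toy_X = pd.DataFrame({
    'Odometer (KM)': [35431.0, np.nan, 84714.0, 154365.0, 181577.0,
                      np.nan, 66604.0, 215883.0, 248360.0, 71934.0],
    'Doors': [4.0, 5.0, np.nan, 4.0, 3.0, 4.0, np.nan, 4.0, 4.0, 4.0],
})
toy_y = pd.Series(range(10))

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42)

toy_imputer = ColumnTransformer([
    ('num_imputer', SimpleImputer(strategy='mean'), ['Odometer (KM)']),
    ('door_imputer', SimpleImputer(strategy='constant', fill_value=4),
     ['Doors']),
])

# Learn the fill values from the training split only...
filled_train = toy_imputer.fit_transform(toy_X_train)
# ...then reuse them, unchanged, on the test split
filled_test = toy_imputer.transform(toy_X_test)

print(filled_train.shape, filled_test.shape)  # (8, 2) (2, 2)
print(np.isnan(filled_train).sum(), np.isnan(filled_test).sum())  # 0 0
```

With this approach the test rows have no influence on the mean used for imputation, so the test score is less likely to be optimistic.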

After dealing with all of the missing (`NaN`) data, the next step is to convert the non-numerical data into numerical data, as described in the [Column Transformer](#column-transformer) section.


In [25]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

categorical_features = ['Make', 'Colour', 'Doors']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot', one_hot, categorical_features)],
                                remainder='passthrough')

transformed_X = transformer.fit_transform(car_sales_filled)

Finally, fit and score a model on the transformed data.


In [26]:
# Fitting data into model
np.random.seed(42)

X_train, X_test, y_train, y_test = train_test_split(transformed_X,
                                                    y,
                                                    test_size=0.2)

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.21990196728583944