# Data cleaning and preparation
We have split our data into train and test sets, but have not made any other modifications. To make the data fit for machine learning, we need to:
* Handle missing, corrupt or incorrect data
* Do feature normalization

Let's start by looking at how much missing data we have:

In [23]:
import pandas as pd
penguins_train = pd.read_csv('data/penguins_train.csv')

In [24]:
penguins_train.isna().sum()

species              0
island               0
bill_length_mm       2
bill_depth_mm        2
flipper_length_mm    2
body_mass_g          2
sex                  9
dtype: int64

We need to decide what to do with the missing values. Here we choose to throw out any records that have missing values for the numerical features. Records that only lack a value for `sex` are kept.

In [25]:
numerical_features = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
penguins_train_nona = penguins_train.dropna(subset=numerical_features)
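To see what `dropna(subset=...)` does and does not remove, here is a minimal sketch on a small made-up table. The values and the `toy_*` names are purely illustrative and are not taken from the penguins data.

```python
import pandas as pd

# Made-up toy data for illustration only (not the penguins dataset).
toy = pd.DataFrame({
    "bill_length_mm": [39.1, None, 46.5, 50.0],
    "body_mass_g": [3750.0, None, 4500.0, 5200.0],
    "sex": ["male", None, None, "female"],
})
toy_numeric = ["bill_length_mm", "body_mass_g"]

# Drop only the rows where at least one numerical feature is missing.
toy_nona = toy.dropna(subset=toy_numeric)

# Row 1 is removed. Row 2 is kept, even though its `sex` value is missing.
remaining_missing = toy_nona.isna().sum()
print(toy_nona)
print(remaining_missing)
```

Because only the numerical columns are listed in `subset`, a categorical column such as `sex` can still contain missing values afterwards.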

In [26]:
penguins_train_nona.to_csv('data/penguins_train_nona.csv', index=False)

It could also be that we have corrupt data, leading for example to outliers in the data set. The pair plot in the previous episode could have hinted at this. For this dataset, we don't need to do anything about outliers.
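If you want a numerical check in addition to the pair plot, one common rule of thumb flags values that lie more than 1.5 times the interquartile range outside the quartiles. The sketch below is only an illustration of that rule on made-up values. The helper `iqr_outlier_mask` and the factor `k=1.5` are our own assumptions, not part of this lesson's workflow.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask that is True for values outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up body masses in grams; the last value mimics a data-entry error.
toy_mass = pd.Series([3500.0, 3700.0, 3900.0, 4100.0, 4300.0, 43000.0])
mask = iqr_outlier_mask(toy_mass)

# Show only the values that the rule flags as suspicious.
print(toy_mass[mask])
```

A flagged value is not necessarily wrong, so it is worth inspecting such records before deciding to remove them.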

## Feature normalization
As we saw in the pair plot, the magnitudes of the different features are not directly comparable with each other. Some of the features are in mm, others in grams, and those in mm measure different properties.

Most machine learning algorithms treat all features together as one multi-dimensional space. For calculations in this space to make sense, the features should be comparable to each other, i.e. they should be scaled to similar ranges. There are two options for scaling:
- Normalization (min-max scaling: rescale each feature to the range 0 to 1)
- Standardization (subtract the mean and divide by the standard deviation)

In this case, we choose min-max scaling, because we do not know much about the distribution of our features.
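To make the two options concrete, the sketch below applies both formulas by hand to a small made-up series. The values are illustrative and not from the penguins data. For standardization we assume the population standard deviation (`ddof=0`), which is the convention scikit-learn's `StandardScaler` follows.

```python
import pandas as pd

# Made-up values for illustration only.
x = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-max scaling: the smallest value maps to 0, the largest to 1.
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization: the result has mean 0 and standard deviation 1.
standardized = (x - x.mean()) / x.std(ddof=0)

print(min_max.tolist())
print(standardized.round(3).tolist())
```

Min-max scaling gives every feature the same fixed range, while standardization gives every feature the same mean and spread but no fixed bounds.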

In [27]:
import sklearn.preprocessing
min_max_scaler = sklearn.preprocessing.MinMaxScaler()

In [28]:
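# Work on a copy so the unscaled data stays available for comparison
penguins_train_scaled = penguins_train_nona.copy()
# Fit the scaler on the training data and rescale each numerical feature to the range [0, 1]
penguins_train_scaled[numerical_features] = min_max_scaler.fit_transform(penguins_train_scaled[numerical_features])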
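Note that `fit_transform` learns the minimum and maximum from the data it is given. A usual practice, sketched below on made-up data, is to fit the scaler on the training set only and then apply the same fitted scaler to the test set with `transform`. The `toy_*` frames and values are hypothetical stand-ins, not the lesson's actual files.

```python
import pandas as pd
import sklearn.preprocessing

toy_features = ["flipper_length_mm", "body_mass_g"]

# Hypothetical train and test data, made up for illustration.
toy_train = pd.DataFrame({
    "flipper_length_mm": [180.0, 200.0, 220.0],
    "body_mass_g": [3000.0, 4000.0, 5000.0],
})
toy_test = pd.DataFrame({
    "flipper_length_mm": [190.0, 230.0],
    "body_mass_g": [3500.0, 2500.0],
})

# Learn the minimum and maximum from the training data only.
toy_scaler = sklearn.preprocessing.MinMaxScaler()
toy_train_scaled = toy_train.copy()
toy_train_scaled[toy_features] = toy_scaler.fit_transform(toy_train[toy_features])

# Reuse the fitted scaler on the test data: transform, not fit_transform.
toy_test_scaled = toy_test.copy()
toy_test_scaled[toy_features] = toy_scaler.transform(toy_test[toy_features])

print(toy_test_scaled)
```

Test values that fall outside the range seen during training end up below 0 or above 1, which is expected when the scaler is reused this way.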

### Exercise
Compare the distributions of the numerical features before and after scaling. What do you notice?

In [29]:
penguins_train_nona.describe()

| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| count | 273.0 | 273.0 | 273.0 | 273.0 |
| mean | 44.133333 | 17.185348 | 201.098901 | 4206.501832 |
| std | 5.345417 | 1.98674 | 14.140879 | 791.673893 |
| min | 33.1 | 13.1 | 172.0 | 2700.0 |
| 25% | 39.6 | 15.7 | 190.0 | 3550.0 |
| 50% | 44.9 | 17.3 | 197.0 | 4050.0 |
| 75% | 48.6 | 18.7 | 214.0 | 4750.0 |
| max | 59.6 | 21.5 | 231.0 | 6300.0 |


In [30]:
penguins_train_scaled.describe()

| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| count | 273.0 | 273.0 | 273.0 | 273.0 |
| mean | 0.416352 | 0.486351 | 0.493202 | 0.418473 |
| std | 0.201714 | 0.236517 | 0.239676 | 0.219909 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.245283 | 0.309524 | 0.305085 | 0.236111 |
| 50% | 0.445283 | 0.5 | 0.423729 | 0.375 |
| 75% | 0.584906 | 0.666667 | 0.711864 | 0.569444 |
| max | 1.0 | 1.0 | 1.0 | 1.0 |
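One way to check what the scaler did is to apply the min-max formula by hand to the `body_mass_g` statistics from the two `describe()` outputs above. The sketch below copies those reported numbers into dictionaries. The variable names are our own, and the tolerance only accounts for the rounding in the printed tables.

```python
# Statistics copied from the body_mass_g column of the two describe() outputs.
raw = {"mean": 4206.501832, "std": 791.673893, "min": 2700.0,
       "25%": 3550.0, "50%": 4050.0, "75%": 4750.0, "max": 6300.0}
scaled_reported = {"mean": 0.418473, "std": 0.219909, "min": 0.0,
                   "25%": 0.236111, "50%": 0.375, "75%": 0.569444, "max": 1.0}

lo, hi = raw["min"], raw["max"]

scaled_computed = {}
for name, value in raw.items():
    if name == "std":
        # A spread is only rescaled; shifting by the minimum does not affect it.
        scaled_computed[name] = value / (hi - lo)
    else:
        # Locations (mean, min, quartiles, max) are shifted and rescaled.
        scaled_computed[name] = (value - lo) / (hi - lo)

for name in raw:
    print(f"{name:>4}: computed {scaled_computed[name]:.6f}, reported {scaled_reported[name]}")
```

The computed values agree with the reported ones. Every feature now runs from 0 to 1, while the shape of each distribution, such as the relative position of its quartiles, is unchanged.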


In [31]:
penguins_train_scaled.to_csv('data/penguins_train_scaled.csv', index=False)