# Data cleaning and preparation
We have split our data into train and test sets, but have not made any other modifications. To make our data fit for machine learning, we need to:
* Handle missing, corrupt or incorrect data
* Do feature normalization

Let's start by looking at how much missing data we have:

In [3]:
import pandas as pd
weather_train = pd.read_csv('data/weather_train.csv')

In [4]:
weather_train.isna().sum().sum()

0

We have no missing values so we can continue.
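
Had the total been non-zero, a per-column breakdown would show where the gaps are. The following is a minimal sketch on a small made-up frame; the column names and values in `toy` are illustrative assumptions and do not come from the weather data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a dataset with gaps (values are made up)
toy = pd.DataFrame({
    "temp_mean": [11.2, np.nan, 9.8, 14.1],
    "humidity": [0.75, 0.81, np.nan, np.nan],
    "pressure": [1.017, 1.020, 1.015, 1.018],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Keep only the columns that actually contain missing values
columns_with_gaps = missing_per_column[missing_per_column > 0]
print(columns_with_gaps)
```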

It could also be that we have corrupt data, leading for example to outliers in the data set. The pair plot in the previous episode could have hinted at this. For this dataset, we don't need to do anything about outliers.
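
For other datasets, one common rule of thumb flags values that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to a made-up series. The data and the factor 1.5 are assumptions for illustration, and this is not a step this lesson performs on the weather data.

```python
import pandas as pd

# Hypothetical measurements with one suspicious value (made-up data)
values = pd.Series([10.1, 11.3, 9.8, 10.7, 10.2, 45.0, 10.9, 9.5])

# Interquartile range: distance between the 25th and 75th percentiles
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Values outside these fences are flagged as potential outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

A flagged value is only a candidate: whether it is a measurement error or a genuine extreme needs to be judged with knowledge of the data.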

## Feature normalization
As we saw in the pair plot, the magnitudes of the different features are not directly comparable with each other. Some of the features are in mm, others in degrees Celsius, and the scales are different.

Most machine learning algorithms treat all features together in one multi-dimensional space. For calculations in this space to make sense, the features should be comparable to each other, i.e. they should be scaled. There are two common options for scaling:
- Normalization (min-max scaling): rescale each feature to the range [0, 1]
- Standardization: rescale each feature to zero mean and unit variance

In this case, we choose min-max scaling, because we do not know much about the distribution of our features. If we know that (some) features have a normal distribution, it makes more sense to do standardization.
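
To make the difference between the two options concrete, here is a minimal NumPy sketch that applies both formulas to a single made-up feature `x`. The array values are an assumption for illustration only.

```python
import numpy as np

# Hypothetical feature values (made up for illustration)
x = np.array([2.0, 4.0, 6.0, 10.0])

# Min-max scaling: the smallest value maps to 0 and the largest to 1
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean, then divide by the standard deviation
x_standard = (x - x.mean()) / x.std()

print(x_minmax)
print(x_standard.mean(), x_standard.std())
```

After min-max scaling the values are bounded by 0 and 1. After standardization they are centred on 0 with a standard deviation of 1, but they have no fixed bounds.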

In [1]:
import sklearn.preprocessing
min_max_scaler = sklearn.preprocessing.MinMaxScaler()

In [5]:
# All columns except the first one (MONTH) are features to scale
feature_names = weather_train.columns[1:]

In [6]:
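# Work on a copy so the unscaled data stays available for comparison
weather_train_scaled = weather_train.copy()
# Learn each feature's minimum and maximum, then rescale it to [0, 1]
weather_train_scaled[feature_names] = min_max_scaler.fit_transform(weather_train[feature_names])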
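
`fit_transform` learns each feature's minimum and maximum from the data it is given. A common practice is to reuse those fitted values on held-out data with `transform`, rather than fitting the scaler again. The sketch below shows this on two small made-up frames; `train`, `test` and their values are hypothetical stand-ins for the real splits.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical train and test splits (made-up values)
train = pd.DataFrame({"temp": [0.0, 10.0, 20.0], "humidity": [0.5, 0.7, 0.9]})
test = pd.DataFrame({"temp": [5.0, 25.0], "humidity": [0.6, 0.4]})

scaler = MinMaxScaler()

# Fit on the training data only, then scale it
train_scaled = pd.DataFrame(scaler.fit_transform(train), columns=train.columns)

# Reuse the training minimum and maximum for the test data
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns)
print(test_scaled)
```

Test values that fall outside the range seen during training end up below 0 or above 1. That is expected, since the scaler only knows the training range.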

### Exercise
Compare the distributions of the numerical features before and after scaling. What do you notice?

In [13]:
weather_train.describe()

| | MONTH | BASEL_cloud_cover | BASEL_humidity | BASEL_pressure | BASEL_global_radiation | BASEL_precipitation | BASEL_sunshine | BASEL_temp_mean | BASEL_temp_min | BASEL_temp_max | ... | STOCKHOLM_temp_min | STOCKHOLM_temp_max | TOURS_wind_speed | TOURS_humidity | TOURS_pressure | TOURS_global_radiation | TOURS_precipitation | TOURS_temp_mean | TOURS_temp_min | TOURS_temp_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | ... | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 |
| mean | 6.438642 | 5.48564 | 0.754778 | 1.017806 | 1.290052 | 0.263773 | 4.572193 | 11.236292 | 7.29282 | 15.624543 | ... | 5.272063 | 11.545039 | 3.871671 | 0.799987 | 1.016433 | 1.286057 | 0.225927 | 12.137859 | 7.947128 | 16.329504 |
| std | 3.461575 | 2.346835 | 0.102952 | 0.008106 | 0.942928 | 0.522791 | 4.340088 | 6.850607 | 6.102385 | 8.113162 | ... | 7.435019 | 8.944256 | 1.581145 | 0.102116 | 0.008825 | 0.910704 | 0.458577 | 5.815459 | 5.230505 | 6.934869 |
| min | 1.0 | 0.0 | 0.46 | 0.9882 | 0.06 | 0.0 | 0.0 | -7.6 | -13.0 | -5.5 | ... | -19.7 | -13.0 | 0.8 | 0.48 | 0.9762 | 0.05 | 0.0 | -3.5 | -6.9 | -1.4 |
| 25% | 3.0 | 4.0 | 0.68 | 1.0132 | 0.49 | 0.0 | 0.4 | 6.5 | 3.225 | 9.4 | ... | -0.075 | 4.225 | 2.8 | 0.73 | 1.0114 | 0.49 | 0.0 | 8.0 | 4.3 | 11.3 |
| 50% | 6.0 | 6.0 | 0.76 | 1.0173 | 1.03 | 0.0 | 3.5 | 11.25 | 7.45 | 15.4 | ... | 5.5 | 11.0 | 3.6 | 0.81 | 1.0167 | 1.125 | 0.02 | 11.85 | 8.3 | 15.9 |
| 75% | 9.0 | 7.0 | 0.84 | 1.0225 | 1.99 | 0.3 | 7.8 | 16.9 | 12.0 | 22.375 | ... | 11.5 | 18.8 | 4.9 | 0.88 | 1.0218 | 1.9475 | 0.22 | 16.875 | 11.875 | 21.4 |
| max | 12.0 | 8.0 | 0.98 | 1.0406 | 3.52 | 4.11 | 15.3 | 26.8 | 18.7 | 34.5 | ... | 20.4 | 31.8 | 9.6 | 0.98 | 1.0383 | 3.56 | 3.8 | 26.2 | 21.2 | 34.4 |


In [12]:
weather_train_scaled.describe()

| | MONTH | BASEL_cloud_cover | BASEL_humidity | BASEL_pressure | BASEL_global_radiation | BASEL_precipitation | BASEL_sunshine | BASEL_temp_mean | BASEL_temp_min | BASEL_temp_max | ... | STOCKHOLM_temp_min | STOCKHOLM_temp_max | TOURS_wind_speed | TOURS_humidity | TOURS_pressure | TOURS_global_radiation | TOURS_precipitation | TOURS_temp_mean | TOURS_temp_min | TOURS_temp_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | ... | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 | 766.0 |
| mean | 6.438642 | 0.685705 | 0.566881 | 0.565003 | 0.355506 | 0.064178 | 0.298836 | 0.547567 | 0.640152 | 0.528114 | ... | 0.622745 | 0.54788 | 0.349054 | 0.639974 | 0.647869 | 0.352153 | 0.059454 | 0.526527 | 0.528368 | 0.495238 |
| std | 3.461575 | 0.293354 | 0.197984 | 0.154692 | 0.272523 | 0.1272 | 0.283666 | 0.199146 | 0.192504 | 0.202829 | ... | 0.185412 | 0.199649 | 0.179676 | 0.204231 | 0.142106 | 0.25946 | 0.120678 | 0.195807 | 0.186139 | 0.193711 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 3.0 | 0.5 | 0.423077 | 0.477099 | 0.124277 | 0.0 | 0.026144 | 0.409884 | 0.51183 | 0.3725 | ... | 0.489401 | 0.384487 | 0.227273 | 0.5 | 0.566828 | 0.125356 | 0.0 | 0.387205 | 0.398577 | 0.354749 |
| 50% | 6.0 | 0.75 | 0.576923 | 0.555344 | 0.280347 | 0.0 | 0.228758 | 0.547965 | 0.64511 | 0.5225 | ... | 0.628429 | 0.535714 | 0.318182 | 0.66 | 0.652174 | 0.306268 | 0.005263 | 0.516835 | 0.540925 | 0.48324 |
| 75% | 9.0 | 0.875 | 0.730769 | 0.65458 | 0.557803 | 0.072993 | 0.509804 | 0.712209 | 0.788644 | 0.696875 | ... | 0.778055 | 0.709821 | 0.465909 | 0.8 | 0.7343 | 0.540598 | 0.057895 | 0.686027 | 0.668149 | 0.636872 |
| max | 12.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
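
One way to check what scaling changes and what it leaves alone is to compare summary statistics programmatically. The sketch below does this for a small made-up frame; the column names and values in `raw` are assumptions for illustration. Min-max scaling is a linear rescaling, so the range changes while the shape of each distribution (here measured by its skewness) stays the same.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical unscaled features on different scales (made-up values)
raw = pd.DataFrame({
    "precipitation": [0.0, 0.0, 0.1, 0.3, 4.1],
    "temp_mean": [-7.6, 6.5, 11.3, 16.9, 26.8],
})
scaled = pd.DataFrame(MinMaxScaler().fit_transform(raw), columns=raw.columns)

# After scaling, every feature spans the same range
print(scaled.min().tolist(), scaled.max().tolist())

# The shape of each distribution is unchanged: skewness is identical
same_shape = np.allclose(raw.skew(), scaled.skew())
print(same_shape)
```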


We save the scaled data for use in the next notebook:

In [7]:
weather_train_scaled.to_csv('data/weather_train_scaled.csv', index=False)