# 02 - End-to-End Machine Learning Project

The goal of this exercise is to develop an understanding of the flow of a machine learning project from beginning to end. In addition, it introduces the methods most frequently used when solving an ML task. Since these projects are almost always structured the same way, we will go through the process step by step.

<div class="alert alert-block alert-info"> This exercise requires basic skills in Python and working with Jupyter Notebooks. If you are not familiar with these two topics, take a look at the Python introduction notebook, which you can find in the GitHub repository of this course.<br><br>
To solve the following exercises, it is also recommended to read chapter 2 of the book in advance.</div>

**Task**: In this exercise, we want to predict the height of a person based on weight and gender. For this we need to load the dataset first. 

In [None]:
# Run this cell to import the following modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_theme(style="whitegrid")

<h2 style="color:blue" align="left"> 1. Get the Data </h2>

The [`read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function loads the data from the CSV file into a pandas DataFrame. With [`DataFrame.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) the first five rows of the DataFrame are output.

In [None]:
df = pd.read_csv('dataset/weight-height.csv')
df.head()

As we can see, the dataset consists of 3 features. One row describes one person. The feature _Gender_ defines the sex of the respective person, the feature _Height_ defines the height measured in inches and the feature _Weight_ defines the weight measured in pounds. 

Before examining the data, we convert it to the metric system so that the values are more familiar to us.

### Converting to Metric System

In order not to overwrite the original data, the data set is copied to the variable `dataset`.

In [None]:
dataset = df.copy()

Now the first column to be converted is _Height_. To convert the data from inches to centimeters, the column must be multiplied by a factor of $2.54$.

In [None]:
conv_factor_height = 2.54
dataset['Height'] = round(df['Height'] * conv_factor_height, 1)
dataset.head()

<div class="alert alert-block alert-success"><b>Task</b><br> 
    Convert the column <i>Weight</i> from pounds to kilograms and save it in the <i>Weight</i> column of the DataFrame dataset. Have a look at the example above.
</div>

In [None]:
# Write your code here
conv_factor_weight = 0.4536  # 1 lb = 0.45359237 kg
dataset['Weight'] = round(dataset['Weight'] * conv_factor_weight, 1)

In [None]:
dataset.head()
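As a quick plausibility check of the two conversion factors, the following sketch converts a small table by hand. The values in `sample` are hypothetical and not taken from the dataset; the factors are the exact definitions of the inch (2.54 cm) and the pound (0.45359237 kg).

```python
import pandas as pd

# Hypothetical sample values, only for illustration
sample = pd.DataFrame({"Height": [70.0, 65.0], "Weight": [180.0, 130.0]})

INCH_TO_CM = 2.54          # exact by definition
POUND_TO_KG = 0.45359237   # exact by definition

converted = sample.copy()
converted["Height"] = (sample["Height"] * INCH_TO_CM).round(1)
converted["Weight"] = (sample["Weight"] * POUND_TO_KG).round(1)
print(converted)
```

`Series.round()` is used here instead of the built-in `round()`; both give the same result on a pandas Series.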

<h2 style="color:blue" align="left"> 2. Discover and Visualize the Data to Gain Insights </h2>

### Overview

To get a quick overview of the data, pandas offers many built-in functions.

The property [`DataFrame.shape`](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.shape.html) returns a tuple with the number of rows and columns of the DataFrame.

In [None]:
dataset.shape

The data set consists of a total of 10,000 datapoints.

The [`DataFrame.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) method prints information about a DataFrame, including the index dtype and columns, non-null values and memory usage.

In [None]:
dataset.info()

The property [`DataFrame.dtypes`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html) returns a Series with the data type of each column.

In [None]:
dataset.dtypes

To see how many of the people in the dataset are male or female, pandas provides the method [`Series.value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html). It returns a Series containing the counts of unique values.

In [None]:
dataset['Gender'].value_counts()

We can see that there are as many women as men in the data set. This is desirable, because otherwise the model could become biased toward one of the two genders.
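To judge the balance independently of the dataset size, relative frequencies can be more convenient than absolute counts. The sketch below uses a hypothetical `gender` Series (not the course dataset) and the `normalize` parameter of `value_counts()`.

```python
import pandas as pd

# Hypothetical gender column, only for illustration
gender = pd.Series(
    ["Male", "Female", "Female", "Male", "Male", "Female"], name="Gender"
)

counts = gender.value_counts()                 # absolute counts per category
shares = gender.value_counts(normalize=True)   # relative frequencies, sum to 1

print(counts)
print(shares)
```

A share of 0.5 for each category corresponds to a perfectly balanced column.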

### Barplot

This can also be solved graphically.

<div class="alert alert-block alert-success"><b>Task</b><br> 
Use the <a href=https://seaborn.pydata.org/generated/seaborn.countplot.html>countplot()</a> method of the module seaborn to plot the count of Males and Females as barplot.
</div>

In [None]:
# Write your code here
sns.countplot(x=dataset["Gender"])

### Boxplot

Now that we have examined the number and distribution of the data points, we can look at the measurements themselves. For this, pandas has a very useful built-in method called [`DataFrame.describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html). It generates descriptive statistics of the data.

In [None]:
dataset.describe()

We see that the average height is 168 cm and the average weight is 73 kg. We can also see the min and max values for each measurement, which allows you to check the plausibility of the data at a glance. In addition, we have the quantiles, which give you an overview of the distribution of the data.

All these statistics can also be seen in a boxplot. Below you see the boxplot for the feature _Height_.

In [None]:
plt.figure(figsize=(7,6))
plt.title('Boxplot Height', fontsize=14);
sns.boxplot(x='Gender', y='Height', data=dataset);
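The elements of a boxplot can be derived directly from the quartiles. The following sketch computes them for a hypothetical `heights` Series, assuming seaborn's default whisker rule of 1.5 times the interquartile range (IQR); points outside these fences are drawn as individual outliers.

```python
import pandas as pd

# Hypothetical height values in cm, only for illustration
heights = pd.Series([150.0, 160.0, 165.0, 168.0, 170.0, 175.0, 180.0, 210.0])

q1, median, q3 = heights.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1                       # height of the box
lower_fence = q1 - 1.5 * iqr        # points below are drawn as outliers
upper_fence = q3 + 1.5 * iqr        # points above are drawn as outliers

outliers = heights[(heights < lower_fence) | (heights > upper_fence)]
print(q1, median, q3, iqr)
print(outliers.tolist())
```

The whiskers themselves end at the most extreme data points that still lie inside the fences.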

<div class="alert alert-block alert-success"><b>Task</b><br> 
    Do the boxplot for the feature <i>Weight</i>.
</div>

In [None]:
# Write Your Code Here
plt.figure(figsize=(7,6))
plt.title('Boxplot Weight', fontsize=14);
sns.boxplot(x='Gender', y='Weight', data=dataset);

### Histogram
Another good way to show the distribution of the data is to create the histograms.

<div class="alert alert-block alert-success"><b>Task</b><br> 
Create the histograms for the features weight and height in the dataset. The function used to create this kind of plot can be found in the second chapter of the book written by Géron.
</div>

In [None]:
# Write Your Code Here
sns.displot(dataset, binwidth=5, height=6)

### Correlation Matrix

We used the last few cells to evaluate each feature individually. Now we can have a look at the correlations between the two features. For this purpose pandas provides the function [`DataFrame.corr()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html). It computes the pairwise correlation of the columns.

In [None]:
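# select the numeric columns; recent pandas versions raise an error for text columns
dataset[['Height', 'Weight']].corr()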

As we can see, the correlation between weight and height is very strong (>0.9). Because we have only two numeric features, the matrix is simple and easy to read. The more features the data set has, the larger the correlation matrix becomes, and a graphical representation of the matrix is then useful.
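By default, `DataFrame.corr()` computes the Pearson correlation coefficient, $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$. The sketch below reproduces this value by hand for a small, hypothetical table `data` (not the course dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, only for illustration
data = pd.DataFrame({
    "Height": [160.0, 165.0, 170.0, 175.0, 180.0],
    "Weight": [55.0, 62.0, 66.0, 74.0, 80.0],
})

# Deviations from the column means
x = data["Height"] - data["Height"].mean()
y = data["Weight"] - data["Weight"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

r_pandas = data.corr().loc["Height", "Weight"]
print(r_manual, r_pandas)
```

Both values should agree up to floating-point precision.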

<div class="alert alert-block alert-success"><b>Task</b><br> 
Visualize the correlation matrix using the function <a href=https://seaborn.pydata.org/generated/seaborn.heatmap.html>heatmap()</a> of seaborn. 
</div>

In [None]:
# Write Your Code Here
sns.heatmap(dataset[['Height', 'Weight']].corr(), vmin=-1, vmax=1, annot=True)

### Scatter plot

We can see in the correlation matrix that there is a high positive correlation between the weight and the height of a person. To examine the relationship between these features, we can plot each data point and see how the points are distributed.

<div class="alert alert-block alert-success"><b>Task</b><br> 
Create a scatter plot of the dataset. Plot the Height on the x-axis and the Weight on the y-axis.
</div>

In [None]:
# Write Your Code Here
sns.scatterplot(data=dataset, x='Height', y='Weight')

We can see in the plot that there is a linear relationship between the weight and the height. This knowledge can be used later.

It is also interesting to color the data points belonging to women and men differently. For this, the dataset is split up by gender.

<div class="alert alert-block alert-success"><b>Task</b><br> 
Create a scatter plot of the dataset, where the data points are colored by their gender. Plot the Height on the x-axis and the Weight on the y-axis.
</div>

In [None]:
males=dataset[dataset['Gender']=='Male']
females=dataset[dataset['Gender']=='Female']
fig,ax = plt.subplots()
# Write Your Code Here
sns.scatterplot(data=dataset, x='Height', y='Weight', hue='Gender')

As we can see, the data points for men and women do not follow the same course. If we try to draw a straight line through the data for men and for women, we find that the line for men is a bit steeper. This means that it is important to train the model not only on the weight, but to use the gender as an additional feature.
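The difference in steepness can be quantified by fitting one straight line per gender. The sketch below does this with `np.polyfit` on synthetic data; the coefficients used to generate `demo` are made up for illustration and are not estimates from the course dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data with made-up coefficients, only for illustration
n = 200
height_m = rng.normal(175, 7, n)
height_f = rng.normal(162, 7, n)
demo = pd.DataFrame({
    "Gender": ["Male"] * n + ["Female"] * n,
    "Height": np.concatenate([height_m, height_f]),
    "Weight": np.concatenate([
        1.1 * height_m - 108 + rng.normal(0, 3, n),
        0.9 * height_f - 84 + rng.normal(0, 3, n),
    ]),
})

# Fit one straight line Weight = slope * Height + intercept per gender
slopes = {}
for gender, group in demo.groupby("Gender"):
    slope, intercept = np.polyfit(group["Height"], group["Weight"], deg=1)
    slopes[gender] = slope
    print(f"{gender}: slope={slope:.2f}, intercept={intercept:.1f}")
```

Applied to the real data, the same loop over `dataset.groupby('Gender')` would give the two slopes visible in the scatter plot.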

<h2 style="color:blue" align="left"> 3. Prepare the Data for Machine Learning Algorithms </h2>

### Converting Categorical Variables to Numeric using OneHotEncoder

The Gender column is a text-based feature, but most ML algorithms cannot deal with this kind of data type. Therefore, we have to transform it into numerical features. As in chapter 2 of the book, we use the `OneHotEncoder` of the module scikit-learn.

In [None]:
from sklearn.preprocessing import OneHotEncoder

<div class="alert alert-block alert-success"><b>Task</b><br> 
Transform the Feature <i>Gender</i> into two new binary Features <i>Male</i> and <i>Female</i> and save it in the variable dataset_with_gender. Do not forget to drop the <i>Gender</i> column.
</div>

In [None]:
ohe = OneHotEncoder(dtype=int)
dataset_with_gender = dataset.copy()
# Write Your Code Here
# create one hot encoding array
genders = ohe.fit_transform(dataset_with_gender[['Gender']])
# create columns for encoded values in dataframe
dataset_with_gender[ohe.categories_[0]] = genders.toarray()
# drop column "Gender"
dataset_with_gender = dataset_with_gender.drop('Gender', axis=1)
dataset_with_gender.head(3)
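For quick exploration, a similar encoding can be sketched with `pd.get_dummies()`. This is an alternative to the `OneHotEncoder`, not the approach used in the book; the rows in `demo` are hypothetical.

```python
import pandas as pd

# Hypothetical rows, only for illustration
demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Height": [178.0, 162.5, 165.1],
    "Weight": [82.0, 58.3, 61.0],
})

# One binary column per category, sorted alphabetically: Female, Male
dummies = pd.get_dummies(demo["Gender"], dtype=int)
encoded = pd.concat([demo.drop(columns="Gender"), dummies], axis=1)
print(encoded)
```

Unlike `get_dummies()`, a fitted `OneHotEncoder` remembers the categories it has seen, which makes it the safer choice when new data has to be transformed later.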

### Replace Missing Values

In a real life dataset, you will often face missing data in the dataset you are using. In this dataset the data is complete, so we can feel lucky.

In [None]:
dataset_with_gender.isnull().sum()
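If a dataset does contain missing values, one common option is to replace them with the median of the respective column. The sketch below uses scikit-learn's `SimpleImputer` on a hypothetical table `demo` with one missing weight.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing weight, only for illustration
demo = pd.DataFrame({
    "Height": [160.0, 170.0, 180.0, 175.0],
    "Weight": [55.0, np.nan, 80.0, 72.0],
})

imputer = SimpleImputer(strategy="median")   # learns the median of each column
filled = pd.DataFrame(imputer.fit_transform(demo), columns=demo.columns)
print(filled)
```

As with any transformer, the imputer should be fitted on the training set only and then applied to the test set.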

### Train-Test-Split

In order to evaluate the trained model, we must split the dataset into a training set and a test set. For this, we use the function [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from scikit-learn.

In [None]:
from sklearn.model_selection import train_test_split

In general, assigning 20% of the data to the test set is a good choice, but this depends on the number of data points. If you have a very large dataset, a smaller portion is often sufficient for the test set.

In [None]:
X = dataset_with_gender.drop('Height',axis=1) # features
y = dataset_with_gender['Height'] # target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

`train_test_split()` splits the data into the two subsets randomly. This can cause trouble, as the number of male and female data points can become imbalanced. This would lead to a bias toward the respective gender, which is to be avoided.

In [None]:
X_train['Female'].value_counts()

<div class="alert alert-block alert-success"><b>Task</b><br> 
Find a way to split the dataset with an equal number of male and female data points in the test and training sets. Have a look at the second chapter of the book.
</div>

In [None]:
dataset_with_gender.drop('Height',axis=1)

In [None]:
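# stratified shuffle split: keeps the male/female ratio equal in both subsets

from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
stratified_X = dataset_with_gender.drop('Height',axis=1)
stratified_y = dataset_with_gender['Height']

# stratify on the gender labels
for train_idx, test_idx in sss.split(stratified_X, dataset['Gender']):
    X_train, X_test = stratified_X.iloc[train_idx], stratified_X.iloc[test_idx]
    y_train, y_test = stratified_y.iloc[train_idx], stratified_y.iloc[test_idx]

X_train['Female'].value_counts(), X_test['Female'].value_counts()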
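One possible way to preserve the gender ratio in both subsets is the `stratify` parameter of `train_test_split()`. The sketch below demonstrates it on a hypothetical, perfectly balanced table `demo`; it is an illustration under these assumptions, not necessarily the solution intended by the book.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced dataset, only for illustration
demo = pd.DataFrame({
    "Gender": ["Male"] * 50 + ["Female"] * 50,
    "Weight": range(100),
})

# stratify keeps the class proportions of the given labels in both subsets
train, test = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["Gender"]
)

print(train["Gender"].value_counts())
print(test["Gender"].value_counts())
```

With 50 rows per gender and a 20% test set, each subset contains the same number of male and female rows.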

### Feature Scaling

Because we use features in different units of measurement, the data needs to be scaled. In this case, we use the [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from scikit-learn.

<div class="alert alert-block alert-success"><b>Task</b><br> 
Use the StandardScaler sc to scale the datasets X_train and X_test and save the scaled data in the variables X_train_scaled and X_test_scaled. It's important to fit the scaler only once, on the training data.
</div>

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_scaled = []
X_test_scaled = []
# Write Your Code Here

# fit the scaler once, on the training data only
sc.fit(X_train)
# transform
X_train_scaled = sc.transform(X_train)
X_test_scaled = sc.transform(X_test)

print('X_train_scaled:\n', X_train_scaled)
print('X_test_scaled:\n', X_test_scaled)
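What the `StandardScaler` does can be reproduced by hand: it subtracts the mean and divides by the standard deviation, both learned from the training set. The arrays `X_train_demo` and `X_test_demo` below are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical weights in kg, only for illustration
X_train_demo = np.array([[60.0], [70.0], [80.0], [90.0]])
X_test_demo = np.array([[75.0], [100.0]])

# The scaler learns mean and standard deviation from the training set only
scaler = StandardScaler().fit(X_train_demo)

# Same formula applied by hand, using the training statistics
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)   # population standard deviation (ddof=0)
manual = (X_test_demo - mean) / std

print(scaler.transform(X_test_demo).ravel())
print(manual.ravel())
```

Because the test set is transformed with the training statistics, a test value equal to the training mean (75 kg here) maps to 0.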

<h2 style="color:blue" align="left"> 4. Select and Train a Model </h2>

### Linear Regression

After we have analyzed and prepared the data, we can now train our model. Since we have already shown a linear relationship between the height and the weight, we will use linear regression in this case. For this we will use the [`LinearRegression()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) of scikit-learn.

In [None]:
from sklearn.linear_model import LinearRegression

<div class="alert alert-block alert-success"><b>Task</b><br> 
Fit your model using the X_train_scaled and y_train. Evaluate the model with the X_test_scaled and y_test by computing the Mean Squared Error in the next cell. You can use the plot below to visualize your model.
</div>

In [None]:
lin_reg = LinearRegression()
# Write Your Code Here
reg = lin_reg.fit(X_train_scaled, y_train)
lin_reg_prediction = lin_reg.predict(X_test_scaled)
score = reg.score(X_test_scaled, y_test)
print('Linear regression prediction size:\t', lin_reg_prediction.size)
print('Linear regression score:\t\t', score)
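The value returned by `score()` of a `LinearRegression` is the coefficient of determination $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$. The sketch below recomputes it by hand on hypothetical data (`X_demo`, `y_demo`).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical one-feature data, only for illustration
X_demo = np.array([[50.0], [60.0], [70.0], [80.0], [90.0]])
y_demo = np.array([155.0, 163.0, 168.0, 177.0, 181.0])

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

ss_res = np.sum((y_demo - pred) ** 2)              # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)     # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_demo, y_demo))
```

An $R^2$ close to 1 means that the model explains most of the variance of the target.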

**Mean Squared Error**

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
# Write Your Code Here
mse = mean_squared_error(y_test, lin_reg_prediction)
print('Mean squared error:\t', mse) # unit: squared centimeters
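The mean squared error is the average of the squared differences between the true and the predicted values, so its unit is the squared unit of the target. Taking the square root (RMSE) brings it back to the original unit, which is often easier to interpret. The values in `y_true` and `y_pred` below are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted heights in cm, only for illustration
y_true = np.array([170.0, 160.0, 180.0, 175.0])
y_pred = np.array([172.0, 158.0, 177.0, 175.0])

mse_manual = np.mean((y_true - y_pred) ** 2)   # unit: cm^2
rmse = np.sqrt(mse_manual)                     # unit: cm

print(mse_manual, mean_squared_error(y_true, y_pred), rmse)
```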

**Plot Model**

In [None]:
plt.figure(figsize=(7,6))
sns.scatterplot(data=X_train, x='Weight', y=y_train);
plt.plot(X_train['Weight'], lin_reg.predict(X_train_scaled), c='orange');

The orange line represents the predictions of your model. As you can see, it is not a perfect straight line. This is because the model was not trained on the two-dimensional dataset (Weight, Height), but on a four-dimensional one (Weight, Height, Male, Female). The model is therefore a hyperplane in a four-dimensional space, and this plot shows only its projection onto the two-dimensional Height-Weight plane.
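One way to see this is to hold the gender fixed and predict along the weight axis: a linear model then produces one straight line per gender, and the two lines are parallel. The sketch below illustrates this on synthetic data with made-up coefficients; `X_demo`, `grid`, `line_m` and `line_f` are hypothetical names, and scaling is omitted for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data with made-up coefficients, only for illustration
n = 100
weight = rng.uniform(50, 100, 2 * n)
male = np.array([1] * n + [0] * n)
height = 130 + 0.4 * weight + 8 * male + rng.normal(0, 2, 2 * n)

X_demo = pd.DataFrame({"Weight": weight, "Female": 1 - male, "Male": male})
model = LinearRegression().fit(X_demo, height)

# Predict along the weight axis while holding the gender fixed
grid = np.linspace(50, 100, 5)
line_m = model.predict(pd.DataFrame({"Weight": grid, "Female": 0, "Male": 1}))
line_f = model.predict(pd.DataFrame({"Weight": grid, "Female": 1, "Male": 0}))

print(line_m - line_f)   # constant offset: two parallel lines
```

Plotting the predictions for all training points in one `plt.plot()` call jumps back and forth between these two lines, which explains the jagged appearance.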