# Python beginners course - Level 2 - Scikit (Machine Learning)
Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop a conventional algorithm for effectively performing the task.

In machine learning, we have a set of features and a target and we are aiming at finding the relationship between the features and the target. In our example, the features are the properties of the house (number of rooms/size/...) and some demographic information (neighborhood information), and our target is the house price. We are thus trying to figure out what the relation is between the properties of the house, and the price.

Generally, machine learning consists of the following steps:
1. Gathering data
2. Preparing that data (imputing missing values/aggregating/feature extraction/test-train split)
3. Choosing a model
4. Training (fit your model on the training data)
5. Evaluation (see how well the model fits the test data)
6. Hyperparameter tuning
7. Prediction (predict on new data)

Steps 2 through 5 will be elaborated on in more detail below. Steps 6 and 7 are only touched on briefly.


[Source](https://en.wikipedia.org)

In [None]:
import pandas as pd

data = pd.read_csv('../data/boston_dataset.csv')
data.head(3)

### 2. Preparing the data

Data preparation is the process of transforming raw data so that data scientists and analysts can run it through machine learning algorithms to uncover insights or make predictions.

The data preparation process can be complicated by issues such as:

1. **Missing or incomplete records.** It is difficult to get every data point for every record in a dataset. Missing data sometimes appear as empty cells or a particular character, such as a question mark. 
2. **Improperly formatted data.** Data sometimes needs to be extracted into a different format or location. A good way to address this is to consult domain experts or join data from other sources.
3. **The need for techniques such as feature engineering.** Even if all of the relevant data is available, the data preparation process may require techniques such as feature engineering to generate additional content that will result in more accurate, relevant models. For example, you might want to extract the 'month of year', 'week of month' or 'day of month' from a date.
4. **Splitting the data.** In order to assess the performance of a model, we need data that the model has not seen during training. Therefore, the dataset needs to be split into a training set and a test set. The model will be trained on the training set, and then the performance will be assessed on the test set.
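
As a rough illustration of the first and third points, here is a minimal sketch on a made-up table. The column names `sale_date` and `rooms`, the values, and the choice to fill gaps with the median are all assumptions for this example; they are not part of the housing dataset used in this notebook.

```python
import pandas as pd

# Hypothetical raw records; '?' marks a missing value
raw = pd.DataFrame({
    'sale_date': ['2019-01-15', '2019-02-03', '2019-02-28'],
    'rooms': ['5', '?', '7'],
})

# Missing records: convert to numbers ('?' becomes NaN), then fill with the median
raw['rooms'] = pd.to_numeric(raw['rooms'], errors='coerce')
raw['rooms'] = raw['rooms'].fillna(raw['rooms'].median())

# Feature engineering: extract the month and the day of the month from a date
raw['sale_date'] = pd.to_datetime(raw['sale_date'])
raw['month_of_year'] = raw['sale_date'].dt.month
raw['day_of_month'] = raw['sale_date'].dt.day

print(raw)
```

Filling with the median is only one possible strategy; depending on the data, dropping the incomplete rows may be the better choice.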

In our dataset, the first three steps have already been performed. It thus only remains for us to do the train-test split.

Remember from earlier that we can do slicing on our dataframes:

In [None]:
first_5_rows = data[0:5]
first_5_rows.head()

In [None]:
data.shape

**Exercise:** 
1. Split the data (506 rows) into two parts: `data_train` and `data_test`. `data_train` contains the first 400 rows, and `data_test` the remaining 106 rows. 
2. Print the size of each of them.

In [None]:
data_train = data.iloc[___]
data_test  = data.iloc[___]
    
print('The size of data_train = ', data_train.___)
print('The size of data_test  = ', data_test.___)

Now that we have split the rows, we also want to split the features from the target that we are trying to predict. We can do this very easily with the datasets we just created:

In [23]:
X_train = data_train[['LSTAT', 'RM']]
y_train = data_train['MEDV']

**Exercise:**
1. Do the same, but now for the test set (split features and target).

In [None]:
X_... = data_...[[...]]
y_... = data_...[...]

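We have now manually split the data into a training set and a test set, and split the target from the features. However, we as programmers are lazy and do not want to do this manually... So let's use a library that can do this for us! Note that `train_test_split` does not do exactly the same as our manual split: by default it shuffles the rows before splitting, and with `test_size=0.20` it puts 20% of the rows in the test set. The `random_state` argument fixes the shuffle, so the split is the same every time the cell is run.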

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data[['LSTAT', 'RM']], data['MEDV'], test_size=0.20, random_state=42)

print(f'The training set contains {X_train.shape[0]} samples')
print(f'The test set contains {X_test.shape[0]} samples')
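
To see the effect of shuffling, here is a small sketch on a hypothetical array of 10 row numbers (not the housing data). It assumes the `shuffle` parameter of `train_test_split`: with `shuffle=False` the function behaves like the manual slicing from the exercise above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the row numbers of a hypothetical 10-row dataset
rows = np.arange(10)

# Default behaviour: the rows are shuffled before they are split
train_shuffled, test_shuffled = train_test_split(rows, test_size=0.2, random_state=42)

# shuffle=False keeps the original order: first rows for training, last rows for testing
train_ordered, test_ordered = train_test_split(rows, test_size=0.2, shuffle=False)

print('Shuffled test rows:', test_shuffled)
print('Ordered test rows: ', test_ordered)
```

Shuffling is usually the safer default, because the rows of a dataset are often stored in some meaningful order.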

### 3. Choosing a model
In our dataset, we have a set of features and we are trying to predict the house price (our target). This is a classic example of a regression problem, where we are trying to find the relation between features and a target. 

To illustrate regression, imagine we only have 1 feature and our target. Visualizing our data, it could look something like the blue dots in the figure below. The challenge would then be to find a function (red line) that fits our data best, which in this case is a straight line (Linear Regression).

<img src="../assets/linear-regression.png" alt="drawing" width="400"/>

Choosing the model that does this best is not trivial. Usually, many different types of models are trained. For each of them, it is then checked how well they fit the data, and the one with the best performance is used. However, since the process of training and checking the performance is often very costly, data scientists try to find information in the dataset that hints at using a specific type of model.
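
The one-feature illustration above can be tried out in a few lines. This is only a sketch: the numbers are made up, and it uses NumPy's `polyfit` for the fit instead of the `sklearn` model that is introduced in the next section.

```python
import numpy as np

# Hypothetical data: one feature x and a target y that roughly follows a straight line
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Least-squares fit of a straight line: y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

print(f'slope = {slope:.2f}, intercept = {intercept:.2f}')
```

The fitted `slope` and `intercept` describe the red line: for any new value of `x`, the predicted target is `slope * x + intercept`.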

### 4. Training
For our problem, we will be training two different types of models:
- Linear regression model
- Random forest model (used for example in the Quick-Pay algorithm)

The de facto standard library for making these models in Python is scikit-learn (imported as `sklearn`). The main reason for this is (as we will see) that it is **very** easy to make and train machine learning models using this library.

In [None]:
# Import the model we want to use from sklearn
from sklearn.linear_model import LinearRegression

# Create an estimator (not officially a model yet, since we haven't trained it yet)
linear_regression_estimator = LinearRegression()

# Fit the estimator, now we have a model!
linear_regression_model = linear_regression_estimator.fit(X_train, y_train)

That's it! We now have a linear regression model with which we can predict our target.
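
To get a feeling for what the fitted model has learned, the sketch below trains a linear regression on hypothetical data for which we know the true relation (`y = 3*x1 - 2*x2 + 5`) and then inspects the fitted `coef_` and `intercept_` attributes. The data is made up for illustration and is unrelated to the housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features (two columns) and a target that follows y = 3*x1 - 2*x2 + 5
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 3 * X[:, 0] - 2 * X[:, 1] + 5

# Fit the estimator on the hypothetical data
model = LinearRegression().fit(X, y)

# The learned weights (one per feature) and the constant term
print('Coefficients:', model.coef_)
print('Intercept:   ', model.intercept_)

# Predict the target for a new, unseen sample
prediction = model.predict(np.array([[6.0, 2.0]]))
print('Prediction:  ', prediction)
```

Because this toy data contains no noise, the model should recover the coefficients and the intercept almost exactly. On real data the fit will only be approximate.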

Let's do the same, but now for a Random Forest model.

**Exercise:**
1. Create a Random Forest estimator
2. Fit the estimator to create the model

In [None]:
# Import the Random Forest model from sklearn
from sklearn.ensemble import RandomForestRegressor

# Create an estimator
random_forrest_estimator = ___

# Fit the estimator to create a model
random_forrest_model = ___

### 5. Evaluation
Now that we have our trained machine learning models, we want to know which one of them has the best performance. In order to be able to assess the performance of a model, we need to define how we are going to measure performance. The metrics we will try out are:
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)

Let's imagine again that we have 1 feature and our target, and that we have fit a linear regression model to our data, as shown in the figure below. We then can calculate our metrics as follows:
- MAE: calculate the vertical distance from each point to our model (red line), that is, the absolute difference between the actual value and the predicted value. Add all of these distances, and divide by the number of points we have (in this case 7). [More info](https://en.wikipedia.org/wiki/Mean_absolute_error).
- MSE: calculate the same vertical distance from each point to our model (red line), square these values and add all of them, and divide by the number of points we have (in this case 7). [More info](https://en.wikipedia.org/wiki/Mean_squared_error).
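
The two recipes above can be written out by hand in a few lines. The sketch below uses four made-up actual and predicted values (not taken from the figure or from the housing dataset).

```python
import numpy as np

# Hypothetical actual values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 13.0])

# The distance from each point to the model
errors = y_true - y_pred

# MAE: average of the absolute distances
mae = np.mean(np.abs(errors))

# MSE: average of the squared distances
mse = np.mean(errors ** 2)

print('MAE =', mae)
print('MSE =', mse)
```

Compare how much each of the four points contributes to the MAE and to the MSE.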

<img src="../assets/MSE.png" alt="drawing" width="300"/>

Many more metrics exist. The choice often depends on the type of problem that is being solved. For example, if you do not want to have large errors, the MSE metric is preferred.

**Challenge:** Why is the MSE metric preferred for problems where large errors are not wanted?

Let's now calculate the MAE and MSE metrics for the linear regression model. In the code below, we first import the two functions that do this for us: `mean_absolute_error` and `mean_squared_error` respectively.

Then, the predictions of the target `MEDV` are calculated on the test set that we defined. Using the real values and the predicted values, the metrics can then be calculated and printed.

In [None]:
# Import the metrics that we want to use
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Calculate our predictions for the linear regression model
linear_regression_predictions = linear_regression_model.predict(X_test)

# Calculate the metrics for the linear regression predictions
linear_regression_MAE = mean_absolute_error(y_test, linear_regression_predictions)
linear_regression_MSE = mean_squared_error(y_test, linear_regression_predictions)

# Print the values of the metrics
print('The MAE for the linear regression model = ', linear_regression_MAE)
print('The MSE for the linear regression model = ', linear_regression_MSE)

**Exercise:** 
1. Calculate the predictions for the Random Forest model
2. Calculate the metrics for the Random Forest predictions

In [None]:
# Calculate our predictions for the Random Forest model
random_forrest_predictions = random_forrest_model.___

# Calculate the metrics for the Random Forest predictions
random_forrest_MAE = ___
random_forrest_MSE = ___

# Print the values of the metrics
print('The MAE for the Random Forest model = ', ___)
print('The MSE for the Random Forest model = ', ___)

**Exercise:** Which of the two models performs best? Why?

Of course, for mathematicians it is nice to see the performance of a model expressed as a number. However, it is much more intuitive to visualize the model.

**Exercise:** Add the Random Forest predictions to the plot.

In [None]:
# Import the plotting library and set the figure size
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=[16,6])

# Plot the original data
plt.plot(y_test.values, 'k', label='Original data')

# Plot the linear regression predictions
plt.plot(linear_regression_predictions, label='Linear Regression')

# Plot the Random Forest predictions
plt.plot(___)

# Show a legend in the plot
plt.legend()
plt.show()

### 6. Hyperparameter tuning

In this course, we trained a linear regression model and a Random Forest model. Most machine learning models have settings (called hyperparameters) that are chosen before training and that can be changed in order to improve the performance. Since this is quite an advanced topic, we chose not to cover it in detail in this course. However, there is plenty of documentation online that nicely describes the concept of hyperparameter tuning. The interested reader is referred to external sources such as:

[Hyperparameter explanation](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)) <br>
[Hyperparameter optimization](https://medium.com/criteo-labs/hyper-parameter-optimization-algorithms-2fe447525903)
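
As a starting point, the sketch below shows how hyperparameters are passed when creating an estimator and how to list them with `get_params()`. The values chosen here for `n_estimators` (the number of trees) and `max_depth` (the maximum depth of each tree) are arbitrary examples, not recommended settings.

```python
from sklearn.ensemble import RandomForestRegressor

# Hyperparameters are set when the estimator is created (example values only)
estimator = RandomForestRegressor(n_estimators=50, max_depth=3, random_state=0)

# get_params() returns all hyperparameters of the estimator as a dictionary
params = estimator.get_params()

print('Number of trees:   ', params['n_estimators'])
print('Maximum tree depth:', params['max_depth'])
print('All hyperparameters:', sorted(params))
```

Whether a different value actually helps can only be judged by retraining the model and comparing the metrics on the test set.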

**Challenge:** Change the hyperparameters of the Random Forest model (and retrain the model) such that the MSE on the test set decreases.

In [None]:
# Create an estimator
random_forrest_estimator = RandomForestRegressor(...)

# Fit the estimator to create a model
random_forrest_model = ...

### 7. Prediction

Now that we have a trained model, we have a mapping between our features and the target (house price). Once a new house comes on the market, we do not need any external information to determine the price. We can simply determine its features (number of rooms/total area/...) and predict the price using our model.
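
A minimal sketch of this last step is shown below. To keep the example self-contained, it trains a small linear regression model on made-up numbers. The column names match the ones used in this notebook, but the training values and the new house are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data with the same column names as in this notebook
train = pd.DataFrame({
    'LSTAT': [5.0, 10.0, 15.0, 20.0, 25.0],
    'RM':    [7.0, 6.2, 6.4, 5.6, 5.1],
    'MEDV':  [35.0, 28.0, 22.0, 17.0, 12.0],
})
model = LinearRegression().fit(train[['LSTAT', 'RM']], train['MEDV'])

# A new house comes on the market: we only know its features (made-up values)
new_house = pd.DataFrame({'LSTAT': [12.0], 'RM': [6.3]})

# Predict the price with the trained model
predicted_price = model.predict(new_house)
print('Predicted MEDV for the new house:', predicted_price[0])
```

The new data must contain the same feature columns, in the same order, as the data the model was trained on.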