# Data Science Nigeria: Introductory Machine Learning

![](../Images/banner.jpeg)

## INTRODUCTION TO REGRESSION


## Course Overview 

Upon completion of this study unit, you should be able to:

- Explain in general terms how a regression algorithm works

- List types of regression algorithms

- Build a linear regression model using scikit-learn

- Evaluate the performance of regression models


Remember that **Machine Learning Models/Algorithms** allow computers to automate tasks that would otherwise take manual effort, time, and resources. They learn how to interpret data in order to provide insight to humans. **Machine Learning performance improves with experience.**


## Regression 
Regression is a set of processes used to estimate the relationship between variables.

### Examples of tasks that can be solved using regression

Before a regression model can be trained on a dataset, the label must be a continuous variable, not a discrete one.
* Predicting salary from years of experience
* Determining glucose level from the age of patients
* Predicting a student's grade based on total study time
* Predicting examination scores based on students' test scores
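As a rough illustration of the first task in the list, the sketch below fits a straight line to a handful of made-up salary figures. The numbers and variable names are hypothetical and do not come from any dataset used in this unit.

```python
import numpy as np

# Hypothetical data: years of experience and salary (in thousands of naira)
years = np.array([1, 2, 3, 4, 5])
salary = np.array([70, 90, 110, 130, 150])

# Fit a straight line of the form: salary = slope * years + intercept
slope, intercept = np.polyfit(years, salary, deg=1)

# Use the fitted line to estimate the salary for 6 years of experience
predicted_salary = slope * 6 + intercept
print(slope, intercept, predicted_salary)
```

Because the label (salary) can take any value in a range, this is a regression task rather than a classification task.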

### Regression Machine Learning Model using Mama Tee restaurant dataset

The objective of this regression task is to predict the amount of tip (gratuity, in Nigerian naira) given to a food server at Mama Tee restaurant based on total_bill, gender, smoker (whether there is a smoker in the party or not), day (day of the week of the party), time (time of day, lunch or dinner), and size (size of the party).

**Label**: The label for this problem is tip.
    
**Features**: There are $6$ features and they include total bill, gender, smoker, day, time, and size.

We plan to use the following regression models (regressors) to predict the amount of tip that will be given during a particular party in the restaurant:

- Ordinary Least Squares (OLS)

- Support Vector Machine (SVM)

- Extreme Gradient Boosting (XGBoost)

- Decision Tree

- Random Forest

# Import Python modules

We need to import some packages that will enable us to explore the data and build machine learning models.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas_profiling import ProfileReport # Please install pandas profiling if you don't have it installed already


In [None]:
tip = pd.read_csv("../Data/tips.csv")

tip.head(10)

In [None]:
tip.shape

We can use pandas_profiling to do some data exploration before training our models.

In [None]:
tip.profile_report()

## Relationship with categorical variables

## tip vs. gender

In [None]:
sns.boxplot(x = "gender", y = "tip", data = tip)

plt.ylabel("Amount of tip");

The amount of tip given by both genders is almost the same, although some men gave extremely large tips.
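A numeric summary can complement the box plot. The sketch below uses a tiny made-up table (not the Mama Tee data) to show how one extreme tip pulls the mean upward while the median, which is the line drawn inside each box, barely moves.

```python
import pandas as pd

# Hypothetical tips (in naira); the last value is an unusually large tip
toy_tips = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Male"],
    "tip": [200, 150, 300, 250, 1000],
})

# Median and mean tip for each gender
summary = toy_tips.groupby("gender")["tip"].agg(["median", "mean"])
print(summary)
```

The same `groupby` call can be applied to the real `tip` DataFrame, assuming it has `gender` and `tip` columns as described above.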

## tip vs. smoker

In [None]:
sns.boxplot(x = "smoker", y = "tip", data = tip)

plt.ylabel("Amount of tip");

Smokers and non-smokers gave almost the same amount of tip.

## tip vs. time

In [None]:
sns.boxplot(x = "time", y = "tip", data = tip)

plt.ylabel("Amount of tip");

The box plot shows how the amount of tip compares between lunch and dinner parties.

# Model building

After getting some insight into the data, we can now prepare the data for machine learning modelling.

- Importing the evaluation and data-splitting utilities

In [None]:
from sklearn import metrics # For model evaluation

from sklearn.model_selection import train_test_split # To divide the data into training and test set

# Data Preprocessing 
- Separating features and the label from the data

Now is the time to build machine learning models for the task of predicting the amount of tip that would be given for any party in the restaurant. Therefore, we shall separate the set of features (X) from the label (y).

In [None]:
tip.head(4)

In [None]:
# split data into features and target

X = tip.drop(["tip"], axis="columns") # dropping the label variable (tip) from the data

y = tip["tip"]

In [None]:
X.head()

In [None]:
y.head()

Since the label is continuous, this is a regression task.

- One-hot encoding

As discussed in Part 3, we need to create a one-hot encoding for all the categorical features in the data because some algorithms cannot work with categorical data directly. They require all input and output variables to be numeric. In this case, we will create a one-hot encoding for gender, smoker, day and time by using `pd.get_dummies()`.

In [None]:
pd.get_dummies(X)

We now save this result of one-hot encoding into X.

In [None]:
X = pd.get_dummies(X)

In [None]:
X.head()
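To see what one-hot encoding does on a small scale, the sketch below applies `pd.get_dummies()` to a made-up three-row table. The values are hypothetical; only the column names mirror some of those in the tips data.

```python
import pandas as pd

# Hypothetical rows with one numeric and two categorical features
toy = pd.DataFrame({
    "total_bill": [1500.0, 2300.0, 900.0],
    "gender": ["Male", "Female", "Male"],
    "day": ["Sat", "Sun", "Sat"],
})

# Each categorical column becomes one indicator column per category;
# numeric columns are left unchanged
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

Each new column is named after the original column and one of its categories, for example `gender_Male`, and marks the rows that belong to that category.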

- Split the data into training and test set

We will split our dataset (features (X) and label (y)) into training and test data by using the `train_test_split()` function from scikit-learn. The training set will be $80\%$ of the data while the test set will be $20\%$. The `random_state` is set to 1234 so that all of us get the same split of the data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state= 1234)

We now have the pair of training data `(X_train, y_train)` and test data `(X_test, y_test)`
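The effect of `test_size` and `random_state` can be checked on a small made-up array. This is only a sketch: `X_toy` and `y_toy` are hypothetical stand-ins for the real features and label.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# An 80/20 split of 10 rows gives 8 training rows and 2 test rows
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1234)
print(X_tr.shape, X_te.shape)

# Repeating the split with the same random_state selects the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1234)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```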

- Model training

We will use the training data to build the model, and then use the test data to make predictions and evaluate them.

## Linear Regression

Let's train a linear regression model with our training data. We need to import `LinearRegression` from the `sklearn.linear_model` module.

In [None]:
# Fitting Linear Regression to the Training set

from sklearn.linear_model import LinearRegression

We now create an object of class `LinearRegression` and train it on the training data.

In [None]:
linearmodel = LinearRegression()

linearmodel.fit(X_train, y_train)

`linearmodel.fit()` trained the linear regression model. The model is now ready to predict the unknown label using only the features from the test data (`X_test`).

In [None]:
linearmodel.predict(X_test)

Let's save the prediction result into `linearmodel_prediction`. This is what the model predicted for us.

In [None]:
linearmodel_prediction = linearmodel.predict(X_test)
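To get a feel for what a fitted linear regression model has learned, it can help to look at its `coef_` and `intercept_` attributes. The sketch below uses made-up data generated from a known rule, so the learned values can be compared with the true ones; `X_toy`, `y_toy` and `toy_model` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features and a label built from the rule y = 3*x1 + 2*x2 + 5
X_toy = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]])
y_toy = 3 * X_toy[:, 0] + 2 * X_toy[:, 1] + 5

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

# One coefficient per feature, plus the intercept
print(toy_model.coef_, toy_model.intercept_)

# A prediction is the weighted sum of the features plus the intercept
new_prediction = toy_model.predict(np.array([[6, 2]]))
print(new_prediction)
```

On real data such as the tips dataset the relationship is not exact, so the model only approximates the label and the predictions carry some error.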

### Model evaluation

<center><img src="../Images/MSE.png" style="width: 300px; height:300px"/></center>


<center><img src="../Images/RMSE.png" style="width: 500px; height:200px"/></center>
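For reference, the two metrics can also be written out in symbols. The notation here is an assumption and may differ from the images: $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of test examples.

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad \text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

Because of the square root, the RMSE is expressed in the same unit as the label, which here is naira.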

Since the prediction is continuous, we measure how far each prediction is from the actual value. Let's check the error for each prediction.

In [None]:
y_test - linearmodel_prediction 

A positive difference means the prediction is lower than the actual value, while a negative difference means the prediction is higher than the actual value. Let's now summarize these errors using the Root Mean Squared Error (RMSE).

In [None]:
MSE = metrics.mean_squared_error(y_test, linearmodel_prediction)

In [None]:
MSE

We now take the square root of the Mean Squared Error to get the value of the RMSE.

In [None]:
np.sqrt(MSE)

Therefore, the RMSE for the linear regression is 142.1316828752442.
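The calculation can be traced step by step on a few made-up numbers. This sketch uses hypothetical `actual` and `predicted` arrays and checks that squaring and averaging the errors by hand matches `metrics.mean_squared_error`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual and predicted values
actual = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

# Positive error: prediction too low; negative error: prediction too high
errors = actual - predicted

# Square the errors and average them to get the MSE
mse_by_hand = np.mean(errors ** 2)
mse_sklearn = metrics.mean_squared_error(actual, predicted)

# The RMSE is the square root of the MSE
rmse = np.sqrt(mse_sklearn)
print(mse_by_hand, mse_sklearn, rmse)
```

Squaring removes the sign of each error, so predictions that are too high and too low cannot cancel each other out.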

## Random Forest Model

Let's train a Random Forest model with our training data. We need to import the model from the `sklearn.ensemble` module.

In [None]:
from sklearn.ensemble import RandomForestRegressor

randomforestmodel = RandomForestRegressor()

randomforestmodel.fit(X_train, y_train)

`randomforestmodel.fit()` trained the Random Forest model on the training data. The model is now ready to predict the unknown label using only the features from the test data (`X_test`).

In [None]:
randomforestmodel_prediction = randomforestmodel.predict(X_test)

In [None]:
MSE = metrics.mean_squared_error(y_test, randomforestmodel_prediction)

In [None]:
MSE

We now take the square root of the Mean Squared Error to get the value of the RMSE.

In [None]:
np.sqrt(MSE)

Therefore, the RMSE for the Random Forest model is 160.3155113080993.
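A Random Forest is built with some randomness, so the RMSE you get may differ slightly from the value above each time the model is retrained. One possible way to make the result repeatable is to pass a `random_state`, as in this sketch on synthetic data. The data comes from `make_regression` and is not the tips dataset, and the value 1234 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, used only for illustration
X_toy, y_toy = make_regression(n_samples=100, n_features=4, noise=10.0, random_state=0)

# Two forests built with the same random_state
forest_a = RandomForestRegressor(n_estimators=20, random_state=1234).fit(X_toy, y_toy)
forest_b = RandomForestRegressor(n_estimators=20, random_state=1234).fit(X_toy, y_toy)

# With the same seed, both forests should give the same predictions
same_predictions = np.allclose(forest_a.predict(X_toy), forest_b.predict(X_toy))
print(same_predictions)
```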

## Extreme Gradient Boost (XGBoost) Model

Let's train an XGBoost model with our training data. We need to import the XGBoost model from the xgboost module.

In [None]:
from xgboost import XGBRegressor # Please install the xgboost library if you don't have it installed already

xgboostmodel = XGBRegressor()

xgboostmodel.fit(X_train, y_train)

`xgboostmodel.fit()` trained the XGBoost model on the training data. The model is now ready to predict the unknown label using only the features from the test data (`X_test`).

In [None]:
xgboostmodel_prediction = xgboostmodel.predict(X_test)

You can call on `xgboostmodel_prediction` to see the prediction.

In [None]:
MSE = metrics.mean_squared_error(y_test, xgboostmodel_prediction)

In [None]:
MSE

We now take the square root of the Mean Squared Error to get the value of the RMSE.

In [None]:
np.sqrt(MSE)

Therefore, the RMSE for the XGBoost model is 171.0289233753799.

## Support Vector Machine (SVM)

Let's train a Support Vector Machine model with our training data. We need to import the Support Vector Machine regressor (`SVR`) from the `sklearn.svm` module.

In [None]:
from sklearn.svm import SVR

SVMmodel = SVR()

SVMmodel.fit(X_train, y_train)

`SVMmodel.fit()` trained the Support Vector Machine on the training data. The model is now ready to predict the unknown label using only the features from the test data (`X_test`).

In [None]:
SVMmodel_prediction = SVMmodel.predict(X_test)

You can call on `SVMmodel_prediction` to see what has been predicted.

In [None]:
MSE = metrics.mean_squared_error(y_test, SVMmodel_prediction)

In [None]:
MSE

We now take the square root of the Mean Squared Error to get the value of the RMSE.

In [None]:
np.sqrt(MSE)

Therefore, the RMSE for the Support Vector Machine is 140.90188181480886.


## Decision Tree 

Let's train a Decision Tree model with our training data. We need to import the Decision Tree model from the `sklearn.tree` module.

In [None]:
from sklearn.tree import DecisionTreeRegressor

decisiontree =  DecisionTreeRegressor()

decisiontree.fit(X_train, y_train)

`decisiontree.fit()` trained the Decision Tree on the training data. The model is now ready to predict the unknown label using only the features from the test data (`X_test`).

In [None]:
decisiontree_prediction = decisiontree.predict(X_test)

You can call on `decisiontree_prediction` to see what has been predicted.

In [None]:
MSE = metrics.mean_squared_error(y_test, decisiontree_prediction)

In [None]:
MSE

We now take the square root of the Mean Squared Error to get the value of the RMSE.

In [None]:
np.sqrt(MSE)

Therefore, the RMSE for the Decision Tree is 215.84571333501313.
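Every model above goes through the same three steps: fit, predict, and compute the RMSE. One possible way to avoid repeating those cells is a small helper function. The sketch below is only an illustration: `fit_and_score` is a hypothetical name, and the data is synthetic (from `make_regression`) rather than the tips dataset.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def fit_and_score(model, X_train, X_test, y_train, y_test):
    """Train the model, predict on the test set and return the RMSE."""
    model.fit(X_train, y_train)
    prediction = model.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, prediction))


# Synthetic data standing in for the real features and label
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1234)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
}

# Compute one RMSE per model
scores = {name: fit_and_score(model, X_tr, X_te, y_tr, y_te) for name, model in models.items()}
print(scores)
```

The same helper could be called with `X_train`, `X_test`, `y_train` and `y_test` from this unit to score any scikit-learn regressor.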

## Models Summary

![](../Image/RMSE.png)

Having trained all five (5) models, we can see that the best model for predicting the amount of tip that would be given for a given party in the restaurant is the one with the lowest RMSE, which is the Support Vector Machine (SVM).
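The comparison can also be done in code. The sketch below collects the RMSE values reported in this unit into a pandas Series and sorts them; the name `rmse_scores` is a hypothetical choice.

```python
import pandas as pd

# RMSE values reported above for each model
rmse_scores = pd.Series({
    "Linear Regression": 142.1316828752442,
    "Random Forest": 160.3155113080993,
    "XGBoost": 171.0289233753799,
    "Support Vector Machine": 140.90188181480886,
    "Decision Tree": 215.84571333501313,
})

# Sort from the lowest (best) to the highest (worst) RMSE
ranking = rmse_scores.sort_values()
best_model = rmse_scores.idxmin()
print(ranking)
print(best_model)
```

The values for the tree-based models may change slightly if you retrain them, so the ranking on your machine could differ from this one.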

## Class Activity 

## Importing Scikit-learn Module

Use the following models to predict the amount of tip that would be given for a given party in the restaurant. Your teacher has also included how to import those models for you.

* **K Nearest Neighbor**: `from sklearn.neighbors import KNeighborsRegressor`

* **Ridge Regression**: `from sklearn.linear_model import Ridge`

* **Gradient Boosting Regressor**: `from sklearn.ensemble import GradientBoostingRegressor`

Which of the three (3) models is the best in terms of RMSE?