# Scikit Learn Tutorial #6 - Regression Algorithms

<table align="left"><td>
  <a target="_blank"  href="https://colab.research.google.com/github/TannerGilbert/Tutorials/blob/master/Scikit-Learn-Tutorial/6.%20Regression%20Algorithms.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab
  </a>
</td><td>
  <a target="_blank"  href="https://github.com/TannerGilbert/Tutorials/blob/master/Scikit-Learn-Tutorial/6.%20Regression%20Algorithms.ipynb">
    <img width=32px src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
</td></table>

![Scikit Learn Logo](http://scikit-learn.org/stable/_static/scikit-learn-logo-small.png)

## Loading in Datasets

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [2]:
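le = LabelEncoder()
# The file has no header row, so column names are passed explicitly; the last column (ERP in the UCI description) is used as the target 'label'
computer_hardware = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/cpu-performance/machine.data', header=None, names=['vendor', 'model', 'MYCT', 'MMIN', 'MMAX', 'CACH', 'CHMIN', 'CHMAX', 'PRP', 'label'])
# Encode the string columns as integers; fit_transform refits the encoder for each column
computer_hardware['vendor'] = le.fit_transform(computer_hardware['vendor'])
computer_hardware['model'] = le.fit_transform(computer_hardware['model'])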
computer_hardware.head()

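| | vendor | model | MYCT | MMIN | MMAX | CACH | CHMIN | CHMAX | PRP | label |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 29 | 125 | 256 | 6000 | 256 | 16 | 128 | 198 | 199 |
| 1 | 1 | 62 | 29 | 8000 | 32000 | 32 | 8 | 32 | 269 | 253 |
| 2 | 1 | 63 | 29 | 8000 | 32000 | 32 | 8 | 32 | 220 | 253 |
| 3 | 1 | 64 | 29 | 8000 | 32000 | 32 | 8 | 32 | 172 | 253 |
| 4 | 1 | 65 | 29 | 8000 | 16000 | 32 | 8 | 16 | 132 | 132 |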


In [3]:
parkinsons = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/parkinsons_updrs.data')
parkinsons.rename(columns={'motor_UPDRS':'label'}, inplace=True)
parkinsons.head()

| | subject# | age | sex | test_time | label | total_UPDRS | Jitter(%) | Jitter(Abs) | Jitter:RAP | Jitter:PPQ5 | ... | Shimmer(dB) | Shimmer:APQ3 | Shimmer:APQ5 | Shimmer:APQ11 | Shimmer:DDA | NHR | HNR | RPDE | DFA | PPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 72 | 0 | 5.6431 | 28.199 | 34.398 | 0.00662 | 3.4e-05 | 0.00401 | 0.00317 | ... | 0.23 | 0.01438 | 0.01309 | 0.01662 | 0.04314 | 0.01429 | 21.64 | 0.41888 | 0.54842 | 0.16006 |
| 1 | 1 | 72 | 0 | 12.666 | 28.447 | 34.894 | 0.003 | 1.7e-05 | 0.00132 | 0.0015 | ... | 0.179 | 0.00994 | 0.01072 | 0.01689 | 0.02982 | 0.011112 | 27.183 | 0.43493 | 0.56477 | 0.1081 |
| 2 | 1 | 72 | 0 | 19.681 | 28.695 | 35.389 | 0.00481 | 2.5e-05 | 0.00205 | 0.00208 | ... | 0.181 | 0.00734 | 0.00844 | 0.01458 | 0.02202 | 0.02022 | 23.047 | 0.46222 | 0.54405 | 0.21014 |
| 3 | 1 | 72 | 0 | 25.647 | 28.905 | 35.81 | 0.00528 | 2.7e-05 | 0.00191 | 0.00264 | ... | 0.327 | 0.01106 | 0.01265 | 0.01963 | 0.03317 | 0.027837 | 24.445 | 0.4873 | 0.57794 | 0.33277 |
| 4 | 1 | 72 | 0 | 33.642 | 29.187 | 36.375 | 0.00335 | 2e-05 | 0.00093 | 0.0013 | ... | 0.176 | 0.00679 | 0.00929 | 0.01819 | 0.02036 | 0.011625 | 26.126 | 0.47188 | 0.56122 | 0.19361 |


## Comparing Models

Because the syntax is almost the same for every model, it won't be the focus of this tutorial. Instead, we will look at which models are available and when to use each one, without going too deep into the theory behind them.

### Linear Regression

In statistics, linear regression is a linear approach to modelling the relationship between a dependent variable (y) and one or more independent variables (X). The relationship is modeled with linear predictor functions whose unknown parameters are estimated from the data. Linear Regression is one of the most popular algorithms in Machine Learning because of its relative simplicity and well-known properties.  
![Linear Regression](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/438px-Linear_regression.svg.png)
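
As a minimal sketch of "parameters estimated from the data", the snippet below fits `LinearRegression` to synthetic points generated from a known line. The slope of 3, the intercept of 2, the noise level, and the `X_demo`/`y_demo` names are made up for illustration and are not part of the tutorial's datasets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative assumption): y = 3*x + 2 plus small Gaussian noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + 2.0 + rng.normal(0, 0.5, size=200)

lin_reg = LinearRegression()
lin_reg.fit(X_demo, y_demo)

# The estimated parameters should land close to the true slope and intercept
slope = lin_reg.coef_[0]
intercept = lin_reg.intercept_
print('slope:', slope, 'intercept:', intercept)
```

The fitted `coef_` and `intercept_` are the "unknown model parameters" from the description above; with more noise or fewer samples they would drift further from the true values.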

### Ridge and Lasso Regression

Ridge and Lasso Regression are extensions of Linear Regression that add a penalty term to the loss function to prevent overfitting. The difference between the two is the form of that penalty. Ridge Regression uses L2 regularization, a "squared penalty" on the coefficients, whilst Lasso Regression uses L1 regularization, an "absolute value penalty".

L1 Regularization:
![L1 Regularization](https://firebasestorage.googleapis.com/v0/b/programmingwithgilbert.appspot.com/o/Videos%2FScikit%20Learn%20Tutorials%2FScikit%20Learn%20Tutorial%20%236%20-%20Regression%20Algorithms%2FL1%20Regularization.PNG?alt=media&token=a20aa53c-2b6b-4c7a-8c12-b015bafc9a84)

L2 Regularization:
![L2 Regularization](https://firebasestorage.googleapis.com/v0/b/programmingwithgilbert.appspot.com/o/Videos%2FScikit%20Learn%20Tutorials%2FScikit%20Learn%20Tutorial%20%236%20-%20Regression%20Algorithms%2FL2%20Regularization.PNG?alt=media&token=2d71c368-32b6-4da4-861d-42424eff152d)

Differences between properties:
![Differences between properties](https://i.stack.imgur.com/m3otE.png) <sup>https://i.stack.imgur.com/m3otE.png</sup>
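
One practical consequence of the two penalties can be sketched on synthetic data: the L1 penalty tends to set the coefficients of uninformative features to exactly zero, while the L2 penalty only shrinks them. The data below (ten features, of which only the first two affect the target) and the `alpha=0.5` setting are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
# Only the first two features influence the target (illustrative assumption)
y_demo = 4.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge = Ridge(alpha=0.5).fit(X_demo, y_demo)

# Count coefficients that were driven to exactly zero by each model
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print('Lasso zero coefficients:', n_zero_lasso)
print('Ridge zero coefficients:', n_zero_ridge)
```

Note that `alpha` is not directly comparable between the two estimators, because scikit-learn scales their loss functions differently. The point of the sketch is only the sparsity pattern.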

### Support Vector Machine

Support Vector Machines can also be used for regression; this version is called Support Vector Regression (SVR). SVR keeps the main features of an SVM and uses the same principles as SVM classification, with only a few minor changes. 
![SVR](http://scikit-learn.org/0.18/_images/sphx_glr_plot_svm_regression_001.png)
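
The text above does not list the changes, but one of them is that SVR ignores errors smaller than a margin `epsilon` (the epsilon-insensitive loss), so only points on or outside that tube become support vectors. The sketch below, on a made-up noisy sine curve, compares a narrow and a wide tube. The data and the `epsilon` values are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic noisy sine curve (illustrative assumption)
rng = np.random.default_rng(0)
X_demo = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=100)

# Same kernel and C, different tube widths
svr_narrow = SVR(kernel='rbf', C=1.0, epsilon=0.01).fit(X_demo, y_demo)
svr_wide = SVR(kernel='rbf', C=1.0, epsilon=0.5).fit(X_demo, y_demo)

# support_ holds the indices of the training points used as support vectors
n_sv_narrow = len(svr_narrow.support_)
n_sv_wide = len(svr_wide.support_)
print('support vectors (epsilon=0.01):', n_sv_narrow)
print('support vectors (epsilon=0.5):', n_sv_wide)
```

A wider tube tolerates more error without penalty, so fewer training points end up defining the model.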

### Decision Tree

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. DTs are simple to understand and easy to visualise, and they require very little data preparation.
![](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/CART_tree_titanic_survivors.png/240px-CART_tree_titanic_survivors.png)
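
To make "simple decision rules" concrete, the sketch below fits a shallow `DecisionTreeRegressor` to a made-up step function and prints the learned rules with `export_text`. The data, the feature name `x`, and `max_depth=2` are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic step function (illustrative assumption): 1 below x=5, 10 from x=5 upward
X_demo = np.arange(0, 10, 0.5).reshape(-1, 1)
y_demo = np.where(X_demo[:, 0] < 5, 1.0, 10.0)

tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X_demo, y_demo)

# Print the learned if/else rules as plain text
rules = export_text(tree, feature_names=['x'])
print(rules)

# Each prediction is the mean target value of the leaf the sample falls into
pred_low, pred_high = tree.predict([[2.0], [8.0]])
print(pred_low, pred_high)
```

No scaling or encoding of `x` was needed here, which is the "very little data preparation" mentioned above.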

### Comparison on our Datasets

In [4]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import LinearSVR
from sklearn.tree import DecisionTreeRegressor

models = [
    ('LR', LinearRegression()),
    ('L', Lasso()),
    ('R', Ridge()),
    ('SVR', LinearSVR()),
    ('DT', DecisionTreeRegressor()),
]

In [5]:
from sklearn.model_selection import train_test_split
import numpy as np

for dataset_name, dataset in [('computer_hardware', computer_hardware), ('parkinsons', parkinsons)]:
    X = np.array(dataset.drop(['label'], axis=1))
    y = np.array(dataset['label'])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    for name, model in models:
        model.fit(X_train, y_train)
        # For regressors, score() returns the coefficient of determination (R^2), not a classification accuracy
        r2 = model.score(X_test, y_test)
        print(dataset_name, name, r2)

computer_hardware LR 0.9438611618299355
computer_hardware L 0.9439215152431054
computer_hardware R 0.9438612466623202
computer_hardware SVR 0.5074912025922851
computer_hardware DT 0.8749070046048776
parkinsons LR 0.906854752090983
parkinsons L 0.8983382060322663
parkinsons R 0.9049789120180209
parkinsons SVR 0.90064727152137
parkinsons DT 0.9987385127588977
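
The numbers printed by the `score` call in the loop above are R² values (1.0 is a perfect fit), since that is what `score` returns for scikit-learn regressors. The sketch below checks this against `r2_score` on synthetic data from `make_regression`, which stands in for the tutorial's datasets. It also wraps `LinearSVR` in a `StandardScaler` pipeline, because feature scaling is commonly recommended for SVM-based models. Whether scaling explains the lower `LinearSVR` score on `computer_hardware` is an assumption that is not verified here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

# Synthetic stand-in data (illustrative assumption, not the tutorial's datasets)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Standardize the features before fitting the linear SVR
scaled_svr = make_pipeline(StandardScaler(), LinearSVR(max_iter=10000, random_state=42))
scaled_svr.fit(X_train, y_train)

# score() on a regressor (or a pipeline ending in one) is the R^2 of its predictions
score_value = scaled_svr.score(X_test, y_test)
r2_value = r2_score(y_test, scaled_svr.predict(X_test))
print(score_value, r2_value)
```

To try this on the tutorial's data, the pipeline could be added to the `models` list in place of the plain `LinearSVR()`.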
