In [None]:
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['savefig.dpi'] = 300

# plt.style.use(['dark_background'])
import numpy as np
np.random.seed(42)
import sklearn
%matplotlib inline

import warnings
warnings.filterwarnings(action='once')


# Machine Learning Refresher

- What is Machine Learning  

    ``` "Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed" ```  
    ~ Arthur Samuel  
    
- Types of Machine Learning 
    - Supervised Learning
    - Unsupervised Learning
    - Semi-supervised Learning
    - Reinforcement Learning




# Supervised Learning

- Type of learning where the training set has targets/labels: for each sample, the desired output value is provided alongside its features, so the model can learn the mapping from features to output
- Major types of Supervised Learning:
    - Regression
    - Classification
- Some significant Supervised Learning Algorithms:
    - Linear Regression
    - Logistic Regression
    - Support Vector Machines

# Unsupervised Learning
- Unsupervised Learning is a system of learning wherein the training data is unlabeled
- Some significant Unsupervised Learning Algorithms:
    - K-means clustering 
    - Hierarchical Clustering 

# Linear Regression

- The simplest approach to Supervised learning  
- The simplest way to relate two variables would be a linear relationship  

## Simple Linear Regression
- Predicting the output `Y` from a single predictor variable `X`
- Mathematically, this is expressed by: $$ y \approx \beta_0 + \beta_1 X  $$
- $\beta_0$: Intercept and  $\beta_1$: Slope / Coefficient 
- Analogous to y = b + mx 

In [None]:
## Generate a random dataset
np.random.seed(42)
X0 = np.random.rand(100, 1)
y0 = 7 + 3 * X0

## Plot the dataset
plt.plot(X0, y0, 'ro')  # red circles; 'bo' plus color='r' defined the color twice
plt.xlabel('X_0', fontsize= 13)
plt.ylabel('y_0', fontsize= 13)
plt.title('Points in the 2D space')
plt.show()

The goal is to **find the best fit for the given line equation** 

## Methods to find the 'Best Fit'

### Linear Regression Prediction Model (Multiple Linear Regression)

$$ h(\theta) \approx  \sum_j \theta_j x_{j} $$

$ h(\theta) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n $  
$ h(\theta) = \theta^T \cdot x  $  
<br>
Here:  
$ h(\theta) $: Predicted value for given input vector [ $x_1$ to $x_n$ ]
<br>
$ \theta $: Parameter vector [$ \theta_0 \,\,  \theta_1 \,\, \theta_2 \,\, ... \,\, \theta_n $] 
<br>
$ x_0 : 1$
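
A minimal sketch of the vectorized prediction, assuming a made-up parameter vector `theta` and a small input matrix `X_demo` (both hypothetical, chosen only for illustration). Prepending a column of ones plays the role of $x_0 = 1$, so the intercept $\theta_0$ becomes part of the dot product:

```python
import numpy as np

# Hypothetical parameters: theta_0 = 1 (intercept), theta_1 = 2, theta_2 = 3
theta = np.array([1.0, 2.0, 3.0])

# Two samples with two features each (illustrative values)
X_demo = np.array([[1.0, 1.0],
                   [0.0, 2.0]])

# Prepend x_0 = 1 to every sample so the intercept is handled by the dot product
X_demo_b = np.c_[np.ones((X_demo.shape[0], 1)), X_demo]

# One prediction per row: h = theta_0 + theta_1 * x_1 + theta_2 * x_2
h = X_demo_b.dot(theta)
print(h)
```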

### Cost Function - MSE
MSE cost function for Linear Regression Model
$$ J =  MSE(h_{\theta}) = \frac{1}{2m} \sum_{i=1}^{m}(h({\theta})^{(i)} - y^{(i)})^2 $$
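
A small sketch of this cost function in NumPy, on a tiny hand-made dataset (the helper name `half_mse_cost` and the data are assumptions for illustration). Note that the $\frac{1}{2m}$ factor makes $J$ half of the usual mean squared error; the extra $\frac{1}{2}$ does not change which $\theta$ minimizes it:

```python
import numpy as np

def half_mse_cost(theta, X_b, y):
    """J = 1/(2m) * sum((h - y)^2), with h = X_b . theta."""
    m = len(y)
    residuals = X_b.dot(theta) - y
    return float((residuals ** 2).sum() / (2 * m))

# Tiny illustrative dataset generated by y = 1 + 2x (no noise);
# the first column of X_b_demo is x_0 = 1
X_b_demo = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_demo = np.array([1.0, 3.0, 5.0])

cost_true = half_mse_cost(np.array([1.0, 2.0]), X_b_demo, y_demo)  # generating parameters
cost_zero = half_mse_cost(np.array([0.0, 0.0]), X_b_demo, y_demo)  # all-zero parameters
print(cost_true, cost_zero)
```

The generating parameters give a cost of zero on noiseless data, while any other parameter vector gives a strictly larger value.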

### Normal Equation
- The optimal parameter vector $ \theta $ that minimizes the cost function $ MSE(X, h_{\theta}) $ can be evaluated using a mathematical equation directly. This is called the Normal Equation  
$$ \hat{\theta} = (X^TX)^{-1} \cdot X^T \cdot y $$

Here:  
$\hat{\theta}$: Optimal value of the parameter vector  
$ X $: Input matrix (one row per sample, with $x_0 = 1$ prepended)  
$y$: Vector of the target values [ $y^{(1)}$ to $y^{(m)}$ ]
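
A short derivation sketch of where the Normal Equation comes from, assuming $X^TX$ is invertible. Writing the cost in matrix form and setting its gradient to zero gives:

$$ J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y) $$

$$ \nabla_{\theta} J = \frac{1}{m} X^T (X\theta - y) = 0 \;\Rightarrow\; X^TX\,\theta = X^T y \;\Rightarrow\; \hat{\theta} = (X^TX)^{-1} X^T y $$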

In [None]:
# Implementation of the Normal Equation
## Generate a random dataset
np.random.seed(42)
X = 3 * np.random.rand(75, 1)
y = 2 + 5 * X + np.random.randn(75, 1) # 𝑌 ≈ 𝛽0 + 𝛽1⋅𝑋 + 𝜖 -> y = 2 + 5x + noise

In [None]:
## Plot the dataset
plt.plot(X, y, 'ro')  # red circles; 'bo' plus color='r' defined the color twice
plt.xlabel('X', fontsize= 13)
plt.ylabel('y', fontsize= 13)
plt.title('Randomly generated dataset')
plt.show()

In [None]:
## Compute 𝜃_hat
X_b = np.c_[np.ones((75, 1)), X]
# print(X_b.shape)
theta_hat = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y) 
theta_hat

In [None]:
# Using the `?` utility in Jupyter Notebook
# np.ones?

In [None]:
# Check the optimal parameter value 
theta_hat # Should conform with 𝛽0= 2, 𝛽1= 5

In [None]:
# svd_theta = np.linalg.pinv(X_b).dot(y) # pinv(X_b) is the pseudoinverse of X_b, computed using SVD; it equals (𝑋𝑇⋅𝑋)^-1⋅𝑋𝑇 when 𝑋𝑇⋅𝑋 is invertible
# svd_theta

In [None]:
y_pred = X_b.dot(theta_hat)

## Plot the fit line
plt.plot(X, y, 'ro')  # red circles; 'bo' plus color='r' defined the color twice
plt.plot(X, y_pred, color= 'b')
# plt.axis([0, 3, 0, 15])
plt.legend(('train points', 'predictions'))
plt.xlabel('X')
plt.ylabel('y / y_pred')
plt.show()

### Gradient Descent

- An optimization method for unconstrained optimization problems. 
- Often preferred over the Normal Equation when the number of features or instances is large
- Steps to perform gradient descent:
    0. Choose Loss function(J), learning rate ($\alpha$)
    1. Initialize parameters with random values (Random initialization)  
    2. **Repeat steps 3 through 5 until convergence**:
    3. Calculate the gradient of the cost function w.r.t. the parameter $\nabla J$
    4. Substitute the parameter values in the gradient
    5. Calculate the new parameter values: $\theta_{i_{new}} = \theta_{i_{old}} - \alpha \nabla J$  
    NB: $\alpha \nabla J$ is also known as the step for descent  
    
    $ h(\theta) = \theta_0 + \theta_1 x_1 $  
    $ J = \frac{1}{2m} \sum_{i=1}^{m}(h({\theta})^{(i)} - y^{(i)})^2 $
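
The steps above can be sketched as batch gradient descent in NumPy. This is a minimal illustration, not a tuned implementation: the synthetic data, the learning rate `alpha = 0.1`, and the fixed budget of 2000 iterations (used in place of a real convergence test) are all assumptions.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 2 + 5x + noise
rng = np.random.default_rng(42)
X_gd = 3 * rng.random((75, 1))
y_gd = 2 + 5 * X_gd + rng.standard_normal((75, 1))
X_gd_b = np.c_[np.ones((75, 1)), X_gd]  # add x_0 = 1

alpha = 0.1          # step 0: learning rate (assumed value)
n_iterations = 2000  # fixed iteration budget instead of a convergence test
m = len(X_gd_b)

theta = rng.standard_normal((2, 1))  # step 1: random initialization
for _ in range(n_iterations):
    # steps 3-4: gradient of J = 1/(2m) * sum((h - y)^2) at the current theta
    gradient = X_gd_b.T.dot(X_gd_b.dot(theta) - y_gd) / m
    # step 5: move against the gradient
    theta = theta - alpha * gradient

# Least-squares reference solution for comparison
theta_ref, *_ = np.linalg.lstsq(X_gd_b, y_gd, rcond=None)
print(theta.ravel(), theta_ref.ravel())
```

With a learning rate that is too large the updates diverge, and with one that is too small the loop needs many more iterations, so `alpha` usually has to be tuned per problem.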

![GD](bell.png)
<a href="http://sebastianraschka.com/images/faq/closed-form-vs-gd/ball.png"><p style="text-align:center">Source</p></a>

# Scikit-Learn

![Scikit-Learn](scikit-learn-logo-notext.png)

- [Scikit-Learn](https://scikit-learn.org/stable/index.html#), better known as `sklearn`, is one of the most prominent ML libraries used in Python 
<br><br>
- Simple and efficient tools for predictive data analysis 
<br><br>
- Accessible to everybody, and reusable in various contexts 
<br><br>
- Built on NumPy, SciPy, and matplotlib 
<br><br>
- Open source, commercially usable - BSD license 


### Installation
<br><br>
**Via pip**: `pip install --upgrade scikit-learn` 
<br><br>
**Via conda**: `conda install scikit-learn`

## Implementations -  Closed Form and Gradient Descent

In [None]:
# Import the package
import sklearn

In [None]:
## Closed Form
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression() 
lin_reg.fit(X, y)  # fit() returns the fitted estimator itself, not an error value
lin_reg.intercept_, lin_reg.coef_

In [None]:
lin_reg.predict(X)

In [None]:
## SGD
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter= 100)
sgd_reg.fit(X, y.ravel())  # SGDRegressor expects a 1-D target array

In [None]:
sgd_reg.intercept_, sgd_reg.coef_

## Full Fledged example

#### sklearn datasets
The [sklearn.datasets](https://scikit-learn.org/stable/datasets/index.html) package embeds some small toy datasets 
<br><br>
##### Boston Housing Dataset

![Boston Housing Dataset](boston.png)


In [None]:
from sklearn import datasets
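# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2
housing_ds = datasets.load_boston()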
housing_ds


In [None]:
features = housing_ds['feature_names']
features

In [None]:
import pandas as pd
housing_df = pd.DataFrame(housing_ds['data'], columns= features)
print(f'Housing dataset shape: {housing_df.shape}')
housing_df.head()

In [None]:
target = housing_ds['target']
print(type(target))
target.shape

In [None]:
from sklearn.model_selection import train_test_split 
#include random_state
X_train, X_test, y_train, y_test = train_test_split(housing_df, target, test_size = .15, random_state= 42) #default is 0.25
print(f'Train data shape: {X_train.shape}')
print(f'Test data shape: {X_test.shape}')

### Ordinary Linear Regression

In [None]:
lr = LinearRegression() # Ordinary least squares Linear Regression.
lr.fit(X_train, y_train)
print(f'Intercept: {lr.intercept_} \n Coefficient vector: {lr.coef_}')

In [None]:
lr_y_pred = lr.predict(X_test)
comp = list(zip(y_test, lr_y_pred))
comp_df = pd.DataFrame(data= comp, columns= ['y_true', 'y_predicted'])
comp_df.head()

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
lr_mse = mean_squared_error(y_test, lr_y_pred)
lr_rmse = np.sqrt(lr_mse)
lr_rmse

### R2_score
- Also known as the coefficient of determination
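- $ R2\_score = 1 - \frac{u}{v}$, where:
    - $u = \sum_{i=1}^m (y^{(i)} - \hat y^{(i)})^2$ (residual sum of squares)
    - $v =  \sum_{i=1}^m (y^{(i)} - \bar y)^2$ (total sum of squares)
- The best possible score is 1; a model that always predicts $\bar y$ scores 0, and a model that does worse than that scores below 0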
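
A small sketch that computes the score by hand from `u` and `v` and compares it with `sklearn.metrics.r2_score`. The helper `r2_manual` and the toy arrays are assumptions for illustration; the second prediction shows that the score can drop below 0 for a model that does worse than predicting the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_hat):
    u = ((y_true - y_hat) ** 2).sum()          # residual sum of squares
    v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
    return 1 - u / v

y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
good_pred = np.array([2.5, 5.5, 7.0, 9.0])  # close to the targets
bad_pred = np.array([9.0, 7.0, 5.0, 3.0])   # worse than predicting the mean

r2_good = r2_manual(y_true_demo, good_pred)
r2_bad = r2_manual(y_true_demo, bad_pred)
print(r2_good, r2_score(y_true_demo, good_pred))
print(r2_bad, r2_score(y_true_demo, bad_pred))
```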

In [None]:
lr_r2_score = r2_score(y_test, lr_y_pred)
lr_r2_score

In [None]:
def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=10)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    
    # Plot RMSE
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
    
    plt.legend(loc="upper right", fontsize=14)   
    
    plt.xlabel("Training set size", fontsize=14) 
    plt.ylabel("RMSE", fontsize=14) 

In [None]:
plot_learning_curves(lr, X_train, y_train)
plt.axis([0, 80, 0, 80])                         
plt.show()                                      

### Stochastic Gradient Descent Linear Regression

In [None]:
sgd = SGDRegressor(max_iter= 1000, eta0= 0.3) # eta0 raised from its default of 0.01; the features are not scaled here
sgd.fit(X_train, y_train)

In [None]:
## Bleep-Bloop!!!
sgd_y_pred = sgd.predict(X_test)
comp = list(zip(y_test, sgd_y_pred))
comp_df = pd.DataFrame(data= comp, columns= ['y_true', 'y_predicted'])
comp_df.head()

In [None]:
mse = mean_squared_error(y_test, sgd_y_pred)
mse

In [None]:
# Scaling is important for algorithms like SGD

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

sgds = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3))
sgds.fit(X_train, y_train.ravel())
# print(f'Intercept: {sgds.intercept_} \n Coefficient vector: {sgds.coef_}')

In [None]:
#StandardScaler?

The standard score of a sample `x` is calculated as:  

$$ z = \frac{(x - u)}{s} $$  

where `u` is the mean of the training samples,
and `s` is the standard deviation of the training samples.

- Scaling is applied to each feature independently
- After scaling, each feature of the training set has mean 0 and unit variance
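
A minimal sketch that applies the formula by hand and checks it against `StandardScaler`. The array `X_raw` is made up for illustration; `np.std` with its default settings computes the population standard deviation, which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values)
X_raw = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

u = X_raw.mean(axis=0)  # per-feature mean
s = X_raw.std(axis=0)   # per-feature (population) standard deviation
z_manual = (X_raw - u) / s

z_sklearn = StandardScaler().fit_transform(X_raw)
print(z_manual)
print(np.allclose(z_manual, z_sklearn))
```

In a train/test setting, `u` and `s` come from the training set only and are reused on the test set, which is what wrapping the scaler in a pipeline takes care of.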

In [None]:
# Stochastic Gradient Descent implementation
sgds_y_pred = sgds.predict(X_test)
comp = list(zip(y_test, sgds_y_pred))
comp_df = pd.DataFrame(data= comp, columns= ['y_true', 'y_predicted'])
comp_df.head()

In [None]:
sgds_mse = mean_squared_error(y_test, sgds_y_pred)
sgds_rmse = np.sqrt(sgds_mse)
sgds_rmse

In [None]:
sgds_r2_score = r2_score(y_test, sgds_y_pred)
sgds_r2_score

In [None]:
plot_learning_curves(sgds, X_train, y_train)
plt.axis([0, 80, 0, 100])                         
plt.show()  

## Variants - Regularized Models

### Ridge Regression
- Ridge Regression is a regularized version of Linear Regression. 
- The cost function is modified by adding a regularization term equal to $ \frac{1}{2} \alpha \sum_{i=1}^{n} \theta_i^2 $  
 $$ J = MSE(\theta)+\frac{1}{2} \alpha \sum_{i=1}^{n} \theta_i^2 $$
- NB: the sum starts at $i = 1$, so the intercept $\theta_0$ is not regularized
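
A small sketch of the effect of the regularization strength, on synthetic data assumed purely for illustration: as `alpha` grows, the norm of the fitted coefficient vector shrinks. Note that scikit-learn's `Ridge` scales its penalty differently from the formula above, so its `alpha` values are not directly comparable to the $\alpha$ used here.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumed): three features with known coefficients plus noise
rng = np.random.default_rng(0)
X_r = rng.standard_normal((50, 3))
y_r = X_r.dot(np.array([4.0, -2.0, 1.0])) + rng.standard_normal(50)

alphas = (0.01, 1.0, 100.0)
norms = []
for a in alphas:
    coef = Ridge(alpha=a).fit(X_r, y_r).coef_
    norms.append(np.linalg.norm(coef))  # size of the coefficient vector
    print(a, np.round(coef, 3), norms[-1])
```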


### Implementation with sklearn

In [None]:
## Ridge Example code
from sklearn.linear_model import Ridge
ridge = Ridge(alpha= 1)
ridge.fit(X_train, y_train)
ridge.predict(X_test)

### Lasso Regression

- Least Absolute Shrinkage and Selection Operator Regression
- Similar to Ridge Regression, a regularization term equal to the L1 norm of the parameter vector is added to the cost function
- The additional term is: $\alpha \sum_{i=1}^{n} |\theta_i| $
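
A minimal sketch of the "selection" part of the name, on synthetic data assumed for illustration: only the first two of six features influence the target, and with a sufficiently large `alpha` the L1 penalty is expected to drive the coefficients of the irrelevant features to exactly zero while shrinking the relevant ones.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumed): only features 0 and 1 influence the target
rng = np.random.default_rng(0)
X_l = rng.standard_normal((200, 6))
y_l = 3 * X_l[:, 0] - 2 * X_l[:, 1] + 0.1 * rng.standard_normal(200)

lasso_demo = Lasso(alpha=0.5).fit(X_l, y_l)
print(lasso_demo.coef_)

# Indices of the features that the model kept
selected = np.flatnonzero(lasso_demo.coef_)
print(selected)
```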

### Implementation with sklearn

In [None]:
## Lasso Example code
from sklearn.linear_model import Lasso
lasso = Lasso(alpha= 1)
lasso.fit(X_train, y_train)
lasso.predict(X_test)

### Elastic Net
- Elastic Net is a middle-ground between Ridge Regression and Lasso Regression
- Along with the regularization hyperparameter $\alpha$, there is another hyperparameter that controls the ratio of the mix: `r`
- The additional term to be added with OLS Linear Regression would be: $ r \alpha \sum_{i=1}^{n} |\theta_i| + \frac{1-r}{2}\alpha \sum_{i=1}^{n} \theta_i^2  $
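
A small sketch that evaluates this penalty term directly. The helper `elastic_net_penalty` and the parameter vector are hypothetical; the point is that `r = 1` reduces to the Lasso term and `r = 0` to the Ridge term.

```python
import numpy as np

def elastic_net_penalty(theta, alpha, r):
    """r * alpha * sum|theta_i| + (1 - r)/2 * alpha * sum(theta_i^2), for i >= 1."""
    w = np.asarray(theta, dtype=float)[1:]  # skip theta_0 (intercept)
    l1 = np.abs(w).sum()
    l2 = (w ** 2).sum()
    return r * alpha * l1 + (1 - r) / 2 * alpha * l2

theta_demo = [10.0, 3.0, -4.0]  # hypothetical parameter vector, theta_0 = 10

pen_lasso = elastic_net_penalty(theta_demo, alpha=0.1, r=1.0)  # pure L1 term
pen_ridge = elastic_net_penalty(theta_demo, alpha=0.1, r=0.0)  # pure L2 term
pen_mix = elastic_net_penalty(theta_demo, alpha=0.1, r=0.5)    # even mix
print(pen_lasso, pen_ridge, pen_mix)
```

In scikit-learn's `ElasticNet`, the mix ratio `r` corresponds to the `l1_ratio` argument.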

### Implementation with sklearn

In [None]:
## Elastic Net Example code
from sklearn.linear_model import ElasticNet
enet = ElasticNet(alpha= 0.1, l1_ratio= 0.5)
enet.fit(X_train, y_train)
enet.predict(X_test)

### [Other regressors in sklearn](https://scikit-learn.org/stable/modules/classes.html#classical-linear-regressors)

### Disclaimer: Problems / Assumptions

![Warning!!!](xkcd.png)
<a href="https://www.explainxkcd.com/wiki/index.php/1725:_Linear_Regression"><p style="text-align:center">Source</p></a>

1. Non-linearity of the response-predictor relationships
2. Collinearity
3. Outliers