# Advanced Linear Regression
<br/>
# Improving the fit - Cross Validation

## Cross Validation

Until now, we have been training the model on a dataset and measuring its performance on that same dataset. This is not a reliable way to measure performance, because in practice the model is applied to data that differs from the training data. Performance on data the model has not seen during training is therefore the right measure of how good the model is. Since a separate test set is not always available, we hold out part of the available data and evaluate the model on it, which gives a more realistic estimate of performance.

Cross validation is a technique that splits the input dataset into a training set and a test set, with the majority of the data going to the training set. A single split may not expose the model to all the variation in the data. Hence, we split the dataset several times so that the test set is a different portion of the data each time. With k splits, the technique is called k-fold cross validation. It trains k different models, evaluates each on its own test set, and takes the mean of the k errors as the benchmark performance error. The figure below shows how the train-test splits are selected by moving the test window so that the model is evaluated on every portion of the data.

<br/><br/>
<div style='float: left;'><img src='../../../images/vis_7.png'></div>
<br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/>
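To make the moving test window concrete, here is a minimal sketch in plain Python of how unshuffled k-fold index pairs could be generated. The helper name k_fold_indices is hypothetical and only illustrates the idea; it is not scikit-learn's implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so that sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        # The test window is one contiguous block; everything else is training data.
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample lands in the test set exactly once, which is why the mean over the k errors uses every portion of the data for evaluation.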

## Cross Validation (CV) using Scikit-Learn

Cross validation is available in scikit-learn. Let us perform a 5-fold cross validation on the Boston Housing dataset.

Import KFold from sklearn.model_selection, and linear_model and metrics from sklearn. KFold generates the CV splits of the dataset, linear_model provides the regression model, and metrics offers the mean_squared_error function used to evaluate the performance. (The older sklearn.cross_validation module has been removed from scikit-learn.)

```python
import numpy as np
from sklearn import linear_model, metrics
from sklearn.model_selection import KFold
```
The target is the Housing Price variable 'MEDV' which we shall assign to the variable y.

```python
y = boston_data['MEDV']
```
Instantiate a 5 fold split using the KFold function on the boston_data:

```python
k_folds = KFold(n_splits=5)
```

The above command creates an object that generates pairs of train and test indices, splitting the dataset as explained in the illustration above. We therefore get 5 such pairs, each holding out a different part of the data for testing. We can now iterate over the pairs and evaluate the Mean Squared Error (MSE) on each split, appending every MSE to a scores list. The iloc indexer of the dataframe selects rows by position, so it can be used to pick out the training rows and the 'unseen' testing rows from the indices that KFold provides. Note that the features must not include the target column 'MEDV'; otherwise the model would be given the very value it is asked to predict.

```python
# Features: every column except the target 'MEDV'
X = boston_data.drop(columns=['MEDV'])
scores = []
for train_index, test_index in k_folds.split(X):
    # Fit on the training rows, then score on the held-out rows
    lm = linear_model.LinearRegression().fit(X.iloc[train_index], y.iloc[train_index])
    y_hat = lm.predict(X.iloc[test_index])
    scores.append(metrics.mean_squared_error(y.iloc[test_index], y_hat))
```
Taking the mean of all such MSEs over all splits will give us a value that we can use to 'honestly' evaluate our model:

```python
print(np.mean(scores))
```
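As a cross-check, the same evaluation can be sketched with scikit-learn's cross_val_score helper. The data below is synthetic (an assumption made so that the example is self-contained; it is not the Boston dataset), and the names X_demo, y_demo and k_folds_demo are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the Boston dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

k_folds_demo = KFold(n_splits=5)

# Manual loop: fit on each training part, score on the held-out part.
manual_scores = []
for train_index, test_index in k_folds_demo.split(X_demo):
    lm = LinearRegression().fit(X_demo[train_index], y_demo[train_index])
    y_hat = lm.predict(X_demo[test_index])
    manual_scores.append(mean_squared_error(y_demo[test_index], y_hat))

# Same evaluation via cross_val_score. scikit-learn reports the MSE negated
# so that larger scores are always better, hence the sign flip.
cv_scores = -cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=k_folds_demo, scoring="neg_mean_squared_error",
)

print(np.mean(manual_scores), cv_scores.mean())
```

Both approaches use the same folds, so the per-fold errors should agree; the helper simply hides the loop.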

<br/>

###  Instructions
Perform k-fold cross validation with k=3, without randomizing the dataset, and:

- Determine the mean of the scores and assign it to the mean_scores variable.
- Set the variables ending in _var appropriately.

In [None]:
from sklearn import feature_selection, metrics
from sklearn import datasets, linear_model
from sklearn.model_selection import KFold
from matplotlib import rcParams
import statsmodels.formula.api as sm


import numpy as np
import pandas as pd
import seaborn as sns

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
boston_dataset = datasets.load_boston()
boston_data = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston_data['MEDV'] = boston_dataset.target


## Solution

```python
# Perform linear regression using sklearn
reg_model = linear_model.LinearRegression()
X = boston_data[['CRIM', 'ZN', 'CHAS', 'NOX', 'RM', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']]
y = boston_data['MEDV']
reg_model.fit(X, y)

n_fold_var = 3
k_folds = KFold(n_splits=n_fold_var)
scores = []
for train_index, test_index in k_folds.split(X):
    # Use X (features only) so that the target 'MEDV' is not leaked into the model
    lm = linear_model.LinearRegression().fit(X.iloc[train_index], y.iloc[train_index])
    y_hat = lm.predict(X.iloc[test_index])
    scores.append(metrics.mean_squared_error(y.iloc[test_index], y_hat))

mean_scores = np.mean(scores)
print(mean_scores)
```

In [None]:
# Run the above code here 





<br/><br/><br/>
# Regression using Scikit-Learn

Scikit-Learn can also be used for linear regression. 

```python
from sklearn import linear_model, feature_selection
reg_model = linear_model.LinearRegression()
X = boston_data[['CRIM', 'ZN', 'CHAS', 'NOX', 'RM', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']]
y = boston_data['MEDV']
reg_model.fit(X, y)
p_values = feature_selection.f_regression(X, y)[1]
residuals = (y-reg_model.predict(X)).values
```
The residuals are errors and we can plot a distribution of these errors using distplot:

```python
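import matplotlib.pyplot as plt

g = sns.distplot(residuals, color="m")  # deprecated since seaborn 0.11; histplot(residuals, kde=True) is the newer alternative
plt.show()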
```
<br/>
<div style='float: left;'><img src='../../../images/figure_6.png'></div>
<br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/>

The residuals closely resemble a Gaussian distribution, which suggests that the data is well suited to modeling with linear regression.

<br/>
## Visualizing Regression Line Fit 
Linear regression fits a line to the dataset. We can visualize this line when a single independent variable is involved in the modeling; with more variables, we can imagine a hyperplane extending across several dimensions. Going back to modeling housing prices with the crime rate variable, specifying the kind argument as "reg" in jointplot shows the best-fit line.

```python
sns.jointplot(x="CRIM", y="MEDV", data=boston_data, kind="reg")
```
<br/>
<div style='float: left;'><img src='../../../images/figure_7.png'></div>
<br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/>

The line in the above figure is the best linear fit for predicting housing prices from the crime rate.

<br/>

### Instructions

Model the house prices as a function of the percentage of lower-status population (LSTAT) using scikit-learn.

- Also generate a joint plot for the above data and assign the plot to g.
- Determine the mean squared error and assign it to mse_lm.


In [None]:
from sklearn import linear_model, feature_selection
from sklearn.metrics import mean_squared_error

# Modify the code below to predict housing prices for LSTAT


### Solution

```python
reg_model = linear_model.LinearRegression()
X = boston_data[['LSTAT']]
y = boston_data['MEDV']
reg_model.fit(X, y)
p_values = feature_selection.f_regression(X, y)[1]
y_hat = reg_model.predict(X)
mse_lm = mean_squared_error(y, y_hat)  # arguments are (y_true, y_pred)
g = sns.jointplot(x="LSTAT", y="MEDV", data=boston_data, kind="reg")
```

## Gradient Descent

We saw in previous sections that, to get the best-fit line in Linear Regression, we need to reduce the MSE to the minimum possible value. The line is described by a slope and an intercept through the equation:

$$ y = mx + b $$

where:

‘m’ is the slope

‘x’ is the independent variable

‘b’ is the intercept

However, this equation applies only to Univariate Linear Regression, where we have a single independent variable. In multivariate Linear Regression there is one coefficient per independent variable, and a widely used way to find the coefficients that minimise the cost is <b>Gradient Descent</b>.

The main objective of the gradient descent algorithm is to minimise the cost function. It is one of the most widely used optimisation algorithms for minimising errors (the difference between the actual value and the predicted value).

The goal is the same as in the operation above, where we looked for the slope ‘m’ and intercept ‘b’ of the best-fit line. With gradient descent, we repeatedly try new values for the parameters theta 0 (the intercept) and theta 1 (the slope), each time moving them in the direction that lowers the cost function, until the algorithm converges.

A real-world analogy is finding the best direction to walk down a hill. We take a step in the direction that leads downhill. After each step, we look around again for the direction that takes us down fastest. The algorithm uses the same approach to minimise the cost function.

Starting at the top of the mountain in the figure below, we take our first step downhill in the direction specified by the negative gradient. Next we recalculate the negative gradient (passing in the coordinates of our new point) and take another step in the direction it specifies. We continue this process iteratively until we get to the bottom of the graph, or to a point where we can no longer move downhill, that is, a local minimum.

<br/>
<div style='float: left;'><img src='../../../images/gradient_descent.png' style="width:55vw"></div>
<br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/>

### Learning Rate

The size of these steps is called the learning rate. With a high learning rate we can cover more ground each step, but we risk overshooting the lowest point since the slope of the hill is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get to the bottom.
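The trade-off can be seen in a small sketch. It assumes a simple cost f(x) = x**2 (chosen only for illustration, with its minimum at 0), and the function name descend and the three learning rates are hypothetical choices.

```python
def descend(learning_rate, start=10.0, steps=20):
    """Run a fixed number of gradient descent steps on f(x) = x**2."""
    x = start
    for _ in range(steps):
        gradient = 2 * x                 # derivative of x**2
        x = x - learning_rate * gradient
    return x


small = descend(0.01)     # tiny steps: after 20 steps still far from the minimum
moderate = descend(0.3)   # larger steps: very close to the minimum
large = descend(1.1)      # too large: every step overshoots and the value grows
print(small, moderate, large)
```

With the same number of steps, the small rate has barely moved, the moderate rate has essentially reached the bottom, and the large rate has moved away from it.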

### Cost Function

There are two parameters in our cost function we can control: m (weight) and b (bias). Since we need to consider the impact each one has on the final prediction, we need to use partial derivatives. We calculate the partial derivatives of the cost function with respect to each parameter and store the results in a gradient.

The cost function is given by:

$$ f(m,b) = \frac{1}{N} \sum_{i=1}^{N} (y_i - (mx_i + b))^2 $$

The gradient is given by:

$$
f'(m,b) =
\begin{bmatrix}
  \frac{\partial f}{\partial m} \\
  \frac{\partial f}{\partial b}
\end{bmatrix}
=
\begin{bmatrix}
  \frac{1}{N} \sum_{i=1}^{N} -2x_i(y_i - (mx_i + b)) \\
  \frac{1}{N} \sum_{i=1}^{N} -2(y_i - (mx_i + b))
\end{bmatrix}
$$
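The two partial derivatives translate almost directly into code. The following is a minimal sketch in plain Python, assuming a tiny noise-free dataset on the line y = 2x + 1; the function name gradient_step, the learning rate and the iteration count are illustrative choices rather than tuned values.

```python
def gradient_step(m, b, xs, ys, learning_rate):
    """Apply one gradient descent update to the weight m and the bias b."""
    n = len(xs)
    # Partial derivatives of the MSE cost with respect to m and b.
    dm = sum(-2 * x * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
    db = sum(-2 * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
    # Move against the gradient, scaled by the learning rate.
    return m - learning_rate * dm, b - learning_rate * db


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]  # points lying exactly on y = 2x + 1

m, b = 0.0, 0.0
for _ in range(5000):
    m, b = gradient_step(m, b, xs, ys, learning_rate=0.05)

print(m, b)
```

Because the data lies exactly on a line, the updates should bring m and b close to the slope and intercept used to generate it.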
    

### Types of Gradient Descent

Gradient Descent has three variants:

- Batch gradient descent
- Stochastic gradient descent
- Mini-batch gradient descent
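The variants differ in how many samples are used to compute each update: all of them (batch), a single one (stochastic), or a small subset (mini-batch). A rough sketch follows, assuming the same MSE cost as above and noise-free data on the line y = 2x + 1; the function name minibatch_gradient_descent and its default values are hypothetical.

```python
import random


def minibatch_gradient_descent(xs, ys, batch_size, learning_rate=0.02,
                               epochs=2000, seed=0):
    """Fit y = m*x + b, updating the parameters once per batch of samples."""
    rng = random.Random(seed)
    m, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            n = len(batch)
            # Gradient of the MSE computed on the current batch only.
            dm = sum(-2 * xs[i] * (ys[i] - (m * xs[i] + b)) for i in batch) / n
            db = sum(-2 * (ys[i] - (m * xs[i] + b)) for i in batch) / n
            m -= learning_rate * dm
            b -= learning_rate * db
    return m, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2 * x + 1 for x in xs]

batch_fit = minibatch_gradient_descent(xs, ys, batch_size=len(xs))  # batch
stochastic_fit = minibatch_gradient_descent(xs, ys, batch_size=1)   # stochastic
mini_batch_fit = minibatch_gradient_descent(xs, ys, batch_size=2)   # mini-batch
print(batch_fit, stochastic_fit, mini_batch_fit)
```

Treating the batch size as the only knob is a design choice made for this sketch; it shows that the three variants share the same update rule and differ only in how much data feeds each step.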

In [None]:
# Sample Gradient Descent code
# Hit run
next_x = 6  # We start the search at x=6
gamma = 0.01  # Step size multiplier
precision = 0.00001  # Desired precision of result
max_iters = 10000  # Maximum number of iterations

# Derivative of f(x) = x**4 - 3 * x**3
df = lambda x: 4 * x**3 - 9 * x**2

for i in range(max_iters):
    current_x = next_x
    next_x = current_x - gamma * df(current_x)
    step = next_x - current_x
    if abs(step) <= precision:
        break

print("Minimum at", next_x)

# The output for the above will be something like
# "Minimum at 2.2499646074278457"

## Solution

Just run the above code