# Linear Regression 

**Introduction:**

Regression is the task of predicting a continuous target variable from one or more independent variables (features). Scikit-Learn offers several regression models, and we'll use a few of them below.

Scikit-Learn also ships with a few built-in datasets that can be loaded directly into memory. We'll use one of them, the Boston Housing dataset, and predict house prices from the other attributes in the dataset.



## 1 - Importing necessary libraries ##

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import warnings
import sys
print("Python Version : ",sys.version)
print("Scikit-Learn Version : ",sklearn.__version__)
warnings.filterwarnings("ignore") ## Silence warnings (including future warnings) for cleaner output.
np.set_printoptions(precision=3)

## The magic function below renders plots inline in the current notebook.
## The alternative (%matplotlib notebook) renders interactive plots inside the notebook.
%matplotlib inline

**Linear Regression**

In the linear regression model, we fit a line (a hyperplane when there are several features) through the data so that the sum of squared vertical distances between the line and the training points is as small as possible. Once that line has been found, we use it to make predictions on unseen data.
The method is also known as Ordinary Least Squares (OLS) because the objective it minimizes is the sum of squared differences between the predictions and the actual targets in the training set.
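
As a rough illustration of the least-squares idea described above, the sketch below fits a line to synthetic data with NumPy's `np.linalg.lstsq`. The data, the variable names, and the true slope and intercept are assumptions made up for this example; they are not part of the Boston dataset.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3*x + 2 plus small noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=50)

# Design matrix with a column of ones so that an intercept is fitted too.
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution: minimizes the sum of squared residuals ||A w - y||^2.
w, *_ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = w

residuals = y - A @ w
print("slope     :", round(float(slope), 3))
print("intercept :", round(float(intercept), 3))
print("sum of squared residuals :", round(float(residuals @ residuals), 3))
```

The recovered slope and intercept should land close to the values used to generate the data, and any other line would give a larger sum of squared residuals on these points.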

## 2 - Load Datasets ##
We'll load the Boston housing data provided by scikit-learn. The loader returns a Bunch object, which behaves much like a dictionary. We'll also print a few details about the dataset.


In [None]:
from sklearn.datasets import load_boston ## Function for loading Boston data (removed in scikit-learn 1.2, so this cell needs an older version).
boston = load_boston()
#print(type(boston)) ## It returns Bunch object which is similar to dictionary.
#print(boston.DESCR) ## DESCR attribute describes dataset.
print('Feature Names : ' + str(boston.feature_names))
print('Dataset shape : ' + str(boston.data.shape))
print('Target shape : ' + str(boston.target.shape))

## 3 - Splitting Data Into Train/Test Sets ##
We'll split the dataset into two parts:
- Training data, which is used to train the model.
- Test data, against which the accuracy of the trained model is checked.

The train_test_split function from sklearn's model_selection module splits the data into two sets, here 80% for training and 20% for testing. We also pass a seed (random_state=123) so that we always get the same split and can reproduce the results later.

In [None]:
from sklearn.model_selection import train_test_split # Function for splitting dataset into train/test set.
X = boston.data
Y = boston.target
## Specifying either train_size or test_size is enough; sklearn derives the other. Both are given here for clarity.
## random_state makes the split reproducible. Without it, a different split is generated on every run.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8, test_size = 0.2, random_state = 123)
print('Train & Test sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
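
To make the effect of random_state concrete, here is a small sketch on toy data. The arrays `X_demo` and `y_demo` are hypothetical and exist only for this illustration; the point is that two calls with the same seed return the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 10 samples, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Same random_state -> identical split on every call.
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)

print("Train/test sizes    :", a_train.shape, a_test.shape)
print("Identical test rows :", np.array_equal(a_test, b_test))
```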

## 4 - Initializing Model ##
Below we initialize LinearRegression, the basic model used extensively for regression tasks.

In [None]:
from sklearn.linear_model import LinearRegression ## Linear Regression Implementation
linear_regressor = LinearRegression()
linear_regressor

## 5 - Fitting Default Model To Train Data ##
We train the model by passing the training features and training targets to fit(). The method also returns the fitted estimator itself.

In [None]:
linear_regressor.fit(X_train,Y_train)

## 7 - Evaluating Trained Model On Test Data ##
Almost all models in the Scikit-Learn API provide a predict() method, which predicts the target variable for the dataset passed to it.
Below we compare the house prices predicted by our model with the actual house prices of the test data.

In [None]:
y_test_pred = linear_regressor.predict(X_test)
print('First Few Actual Housing Prices(Test Data) : ' + str(Y_test[:5]))
print('First Few Predicted Housing Prices(Test Data) : ' + str(y_test_pred[:5]))

Scikit-Learn's LinearRegression model has a score() method which returns the coefficient of determination $R^2$ for the data and target values passed to it. The best possible value is 1. A value of 0 means the model does no better than always predicting the mean of the targets, and a negative value means it does worse than that.

Note: Do not confuse $R^2$ with MSE, as the two are quite different. MSE can be calculated with mean_squared_error from the metrics module of sklearn.

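Formula of $R^2$:

$$R^2 = 1 - \frac{u}{v}$$

where

$$u = \sum_i \left(y_{\text{true},i} - y_{\text{pred},i}\right)^2, \qquad v = \sum_i \left(y_{\text{true},i} - \bar{y}_{\text{true}}\right)^2$$

Here $u$ is the residual sum of squares (the MSE multiplied by the number of samples, not the MSE itself) and $v$ is the total sum of squares.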
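
The formula above can be checked numerically. The sketch below is a minimal NumPy version under the assumption that u is the residual sum of squares and v is the total sum of squares; the helper name `r2_score_manual` and the sample values are hypothetical and chosen only for illustration.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - u/v, with u the residual and v the total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    u = ((y_true - y_pred) ** 2).sum()
    v = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - u / v

# Hypothetical values chosen only for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
good = np.array([2.5, 5.5, 7.0, 9.0])      # close to the truth
mean_only = np.full(4, y_true.mean())      # always predicts the mean
bad = np.array([9.0, 7.0, 5.0, 3.0])       # worse than predicting the mean

print("R^2 (good)      :", r2_score_manual(y_true, good))
print("R^2 (mean only) :", r2_score_manual(y_true, mean_only))
print("R^2 (bad)       :", r2_score_manual(y_true, bad))
print("MSE (good)      :", ((y_true - good) ** 2).mean())
```

The three cases show a score near 1, a score of 0 for the mean-only predictor, and a negative score for predictions that are worse than the mean. The last line prints the MSE, which is a different quantity on a different scale.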

In [None]:
print('R^2 Score on Train Data : %.3f'%linear_regressor.score(X_train, Y_train))
print('R^2 Score on Test Data : %.3f'%linear_regressor.score(X_test, Y_test))

As discussed above, linear regression fits a line through the data so that the squared error between the predicted values and the actual targets is as small as possible. This is why many ML practitioners refer to it as Ordinary Least Squares. We can access the parameters of the fitted line through the coef_ (one weight per feature) and intercept_ attributes of the regressor.

In [None]:
print('Weight Coefficients : '+ str(linear_regressor.coef_))
print('\nY-Axis Intercept : '+ str(linear_regressor.intercept_))
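
To show how these two attributes relate to predictions, the sketch below rebuilds the output of predict() by hand. The synthetic data and the names `X_demo`, `y_demo`, and `true_w` are assumptions for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (assumption): y = 2*x0 - 1*x1 + 0.5*x2 + 4.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_w + 4.0

model = LinearRegression().fit(X_demo, y_demo)

# A prediction is the linear combination X @ coef_ + intercept_.
manual_pred = X_demo @ model.coef_ + model.intercept_
print("coef_      :", model.coef_)
print("intercept_ :", model.intercept_)
print("Matches predict() :", np.allclose(manual_pred, model.predict(X_demo)))
```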

## 8 - Visualizing Prediction Results On Test Data ##


In [None]:
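## Pair each actual price with its prediction and sort the pairs by predicted price, so both series are easy to compare in the plot.
sorted_labels_acc_to_test_y = list(sorted(zip(Y_test, y_test_pred), key=lambda x: x[1]))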
sorted_test_y, sorted_test_preds = zip(*sorted_labels_acc_to_test_y)

with plt.style.context(('ggplot', 'seaborn')):
    plt.scatter(range(len(sorted_test_y)),sorted_test_y, s=75, alpha=0.7, label='Actual')
    plt.scatter(range(len(sorted_test_preds)), sorted_test_preds, s=75, alpha=0.7, label='Prediction')
    plt.ylabel('House Price')
    plt.title('Actual vs Predicted House Prices of Test Data')
    plt.legend(loc='best')

## 9 - Ridge Regression (L2 Penalty) ##

Ridge regression is another estimator, which adds L2 regularization to the cost function being minimized. The penalty pushes all weights towards zero without making them exactly zero, so the fitted weights end up small.
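
The shrinking effect can be seen in a small experiment. The sketch below compares LinearRegression with Ridge for increasing values of `alpha`, the parameter that sets the strength of the L2 penalty in scikit-learn. The synthetic data and the chosen alpha values are assumptions for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
print("OLS coef_           :", ols.coef_)

# Larger alpha means a stronger L2 penalty and therefore smaller weights.
alphas = [1.0, 10.0, 100.0, 1000.0]
norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
    norms.append(np.linalg.norm(ridge.coef_))
    print("alpha=%7.1f coef_ : %s" % (alpha, ridge.coef_))

print("Coefficient norms   :", np.round(norms, 3))
```

The coefficient norm drops as alpha grows, yet none of the weights becomes exactly zero.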

## 10 - Initializing Model ##

In [None]:
from sklearn.linear_model import Ridge ## Ridge Regression Implementation
ridge_regressor = Ridge()
ridge_regressor

Below we try the saga solver of LogisticRegression, which, despite its name, is a classification model and expects class labels as the target. With this solver we can use the l1, l2, or elasticnet penalty, or no penalty (none). It is the only solver that supports the elasticnet penalty, and it tends to be faster on large datasets.

In [None]:
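%%time

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

## LogisticRegression needs class labels, so the continuous Boston target cannot be used here.
## Assumption: the iris dataset is loaded instead, since later cells reference iris.target_names.
## The split settings are also an assumption and may not reproduce the outputs shown in this notebook.
iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, train_size=0.8, random_state=123)

params = {'penalty' : ['l1', 'l2','elasticnet', 'none'],
          'fit_intercept': [True, False],
          'C': np.linspace(0.1,1.0,10),
          'l1_ratio': np.linspace(0.1,1.0,10)}

grid = GridSearchCV(LogisticRegression(random_state=1, solver='saga', n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)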

**Printing First Few Cross Validation Results**

In [None]:
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.

Below we try the sag solver. It supports only the l2 penalty or no penalty (none), and it tends to be faster on large datasets.

In [None]:
%%time

params = {'penalty' : ['l2', 'none'],
         'fit_intercept': [True, False],
         'C': np.linspace(0.1,1.0,10),
         'l1_ratio': np.linspace(0.1,1.0,10)}

grid = GridSearchCV(LogisticRegression(random_state=1, solver='sag', n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)

**Printing First Few Cross Validation Results**

In [None]:
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.

Below we try the lbfgs solver. It supports only the l2 penalty or no penalty (none).

In [None]:
%%time

params = {'penalty' : ['l2','none'],
         'fit_intercept': [True, False],
         'C': np.linspace(0.1,1.0,10),
         'l1_ratio': np.linspace(0.1,1.0,10)}

grid = GridSearchCV(LogisticRegression(random_state=1, solver='lbfgs', n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)

**Printing First Few Cross Validation Results**

In [None]:
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.

Below we try the newton-cg solver. It supports only the l2 penalty or no penalty (none).

In [20]:
%%time

params = {'penalty' : ['l2','none'],
         'fit_intercept': [True, False],
         'C': np.linspace(0.1,1.0,10),
         'l1_ratio': np.linspace(0.1,1.0,10)}

grid = GridSearchCV(LogisticRegression(random_state=1, solver='newton-cg', n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)

Train Accuracy : 0.975
Test Accuracy : 0.967
Best Score Through Grid Search : 0.975
Best Parameters :  {'C': 0.4, 'fit_intercept': False, 'l1_ratio': 0.1, 'penalty': 'l2'}
CPU times: user 2.01 s, sys: 80.2 ms, total: 2.09 s
Wall time: 1min 2s


**Printing First Few Cross Validation Results**

In [None]:
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.

## 11 - K-Nearest Neighbors ##

K-nearest neighbors is one of the simplest algorithms. It stores every point of the training dataset together with the class it belongs to. When a new, unknown point arrives for prediction, it looks at a predefined number of the nearest training points and assigns the new point the majority class among them. The n_neighbors parameter sets how many neighbors are checked.
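
The majority-vote procedure described above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: the helper `knn_predict`, the Euclidean distance choice, and the toy data are all assumptions made for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_fit, y_fit, x_new, n_neighbors=3):
    """Predict the class of x_new by majority vote among its nearest stored points."""
    # Euclidean distance from x_new to every stored training point.
    distances = np.linalg.norm(X_fit - x_new, axis=1)
    # Indices of the n_neighbors closest points.
    nearest = np.argsort(distances)[:n_neighbors]
    # Majority class among those neighbors.
    votes = Counter(y_fit[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0])))  # point near the first cluster
print(knn_predict(X_toy, y_toy, np.array([4.9, 5.0])))  # point near the second cluster
```

KNeighborsClassifier follows the same idea but uses tree-based or brute-force neighbor search, which matters for speed on larger datasets.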

### 11.1 Initializing Model

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
knn_classifier

### 11.2 Fitting Model To Train Data

In [None]:
knn_classifier.fit(X_train,Y_train)

### 11.3 Evaluating Trained Model On Test Data


In [None]:
Y_preds = knn_classifier.predict(X_test)
print(Y_preds)
print(Y_test)
print('Accuracy : %.3f'%(Y_preds == Y_test).mean())
print('Accuracy : %.3f'%knn_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.

In [None]:
print(knn_classifier.predict_proba(X_test)[:10]) ## It returns probability predicted by model for each class for each example.

### 11.4 Visualizing Prediction Results On Test Data

In [None]:
with plt.style.context(('ggplot','seaborn')):
    plt.figure(figsize=(12,5))
    plt.subplot(121)
    for i,c in [(0,'red'),(1,'green'),(2,'blue')]:
        plt.scatter(X_test[Y_test==i,0],X_test[Y_test==i,3], c=c, s=40, marker='s', label=iris.target_names[i])
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[3])
    plt.legend(loc='best')
    plt.title('Actual')

    plt.subplot(122)
    for i,c in [(0,'red'),(1,'green'),(2,'blue')]:
        plt.scatter(X_test[Y_preds==i,0],X_test[Y_preds==i,3], c=c, s=40, marker='s', label=iris.target_names[i])
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[3])
    plt.legend(loc='best')
    plt.title('Prediction');

### 11.5 Fine-Tuning Model By Doing Grid Search On Various Hyperparameters
Below is a list of hyperparameters that we can tune to get the best estimator for our data.

**n_neighbors** - Number of neighbors used to determine the class of a new point. default=5

**algorithm** - Algorithm for finding the nearest neighbors. It takes one of the values [ball_tree, kd_tree, brute, auto]. default=auto

**leaf_size** - Leaf size of the KDTree and BallTree. It controls the speed of construction and querying of the tree as well as its memory requirement. default=30

In [28]:
%%time

params = {'n_neighbors' : np.arange(1,10),
         'leaf_size': np.arange(5,50,5),
         'algorithm': ['ball_tree', 'kd_tree', 'brute', 'auto']}

grid = GridSearchCV(KNeighborsClassifier(n_jobs=-1), param_grid=params, cv=3, n_jobs=-1)
grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)

Train Accuracy : 0.983
Test Accuracy : 0.933
Best Score Through Grid Search : 0.983
Best Parameters :  {'algorithm': 'ball_tree', 'leaf_size': 5, 'n_neighbors': 3}
CPU times: user 2.38 s, sys: 164 ms, total: 2.55 s
Wall time: 53.1 s


### 11.6 Printing First Few Cross Validation Results

In [None]:
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))
cross_val_results.head() ## Printing first few results.

References:

- Supervised Learning - Regression: https://coderzcolumn.com/tutorials/machine-learning/supervised-learning-regression-using-scikit-learn-sklearn