# **Linear Regression with scikit-learn**

In [20]:
import numpy as np

## Generate synthetic data (X)

In [21]:
# Generate 4 samples having 2 features
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
print(X.shape)

(4, 2)


## Generate synthetic target (Y)

In general, a linear regression model with two features has the form:
**y = b1 * x1 + b2 * x2 + b0**

Here, b1 and b2 are the coefficients of the model and b0 is the intercept term. All three are learned from the data during training.

Suppose we want the synthetic data to follow this function:
**y = 2 * x1 + 3 * x2 + 5**


In [22]:
# Generate target as per the chosen equation
y = np.dot(X, np.array([2, 3])) + 5
print(y)

[10 13 15 18]
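To make the dot product concrete, the sketch below expands the same computation one sample at a time and compares it with the vectorized form. The names `weights`, `bias`, `y_manual`, and `y_vectorized` are illustrative and are not used elsewhere in this notebook.

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
weights = np.array([2, 3])  # illustrative name for b1, b2
bias = 5                    # illustrative name for b0

# Expand y = 2 * x1 + 3 * x2 + 5 explicitly for every row of X
y_manual = np.array([weights[0] * x1 + weights[1] * x2 + bias for x1, x2 in X])

# The vectorized form used in the cell above
y_vectorized = np.dot(X, weights) + bias

print(y_manual)                                 # [10 13 15 18]
print(np.array_equal(y_manual, y_vectorized))   # True
```

For the first sample, [1, 1], this is 2 * 1 + 3 * 1 + 5 = 10, which matches the first entry of the printed target.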


The model that we will build doesn't know the relationship between X and y. It has to discover that relationship by learning the parameters of the linear regression model: the coefficients of the two features (x1 and x2) and the intercept (bias) term.

## Build and Train a LinearRegression Model

In [23]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X, y)

LinearRegression()

## Check goodness of fit

The **coefficient of determination** is defined as **R^2 = 1 - u/v**, where **u** is the residual sum of squares, `((y_true - y_pred) ** 2).sum()`, and **v** is the total sum of squares, `((y_true - y_true.mean()) ** 2).sum()`.
The best possible score is 1.0, and the score can be negative because a model can be arbitrarily worse than a constant prediction.
A constant model that always predicts the expected value of y, disregarding the input features, would get a score of 0.0.
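As a rough check of the definition, the sketch below computes u and v by hand for the four-sample dataset and compares the result with `reg.score`. It rebuilds the data and the model so that it runs on its own; `r2_manual` is an illustrative name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([2, 3])) + 5

reg = LinearRegression().fit(X, y)
y_pred = reg.predict(X)

u = ((y - y_pred) ** 2).sum()     # residual sum of squares
v = ((y - y.mean()) ** 2).sum()   # total sum of squares
r2_manual = 1 - u / v

# The manual value should agree with reg.score up to floating-point error
print(np.isclose(r2_manual, reg.score(X, y)))
```

Because the target was generated without noise, u is essentially zero here, so the score is 1.0 up to rounding.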


In [24]:
# R-squared: coefficient of determination
reg.score(X, y)

1.0

## Trained model coefficients

In [25]:
reg.coef_

array([2., 3.])

## Trained model intercept

In [26]:
reg.intercept_

5.0

So the model equation after training is:
**y = 5.0 + 2*x1 + 3*x2**

This is exactly the relationship between X and y that we used to generate the synthetic data earlier.
Training has therefore recovered the correct values of the model parameters (coefficients and intercept).
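scikit-learn documents `LinearRegression` as ordinary least squares, so the same parameters can be recovered with a plain least-squares solve. The sketch below uses NumPy's `np.linalg.lstsq` on a design matrix with an added column of ones. It illustrates the idea and makes no claim about how scikit-learn solves the problem internally; `X_design` and `params` are illustrative names.

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([2, 3])) + 5

# Append a column of ones so the intercept is learned as a third parameter
X_design = np.hstack([X, np.ones((X.shape[0], 1))])

# Solve min ||X_design @ params - y||^2
params, residuals, rank, singular_values = np.linalg.lstsq(X_design, y, rcond=None)
b1, b2, b0 = params

print(np.allclose(params, [2, 3, 5]))  # True
```

The first two entries of `params` correspond to `reg.coef_` and the last one to `reg.intercept_`.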



## Make Prediction

In [27]:
reg.predict(np.array([[3, 5]]))

array([26.])

In [28]:
reg.predict(np.array([[5, 10]]))

array([45.])
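These predictions can be checked by hand: 2 * 3 + 3 * 5 + 5 = 26 and 2 * 5 + 3 * 10 + 5 = 45. The sketch below does the same with the fitted attributes, assuming the model is refit on the four-sample data; `X_new` and `y_manual` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([2, 3])) + 5
reg = LinearRegression().fit(X, y)

X_new = np.array([[3, 5], [5, 10]])  # the two inputs predicted above

# Manual prediction: features times coefficients, plus the intercept
y_manual = X_new @ reg.coef_ + reg.intercept_

print(np.allclose(y_manual, reg.predict(X_new)))  # True
```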

#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

## **Generate Synthetic Data for Regression with scikit-learn**

In [29]:
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=1000, n_features=5, random_state=0, noise=10.0, bias=100.0)
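`make_regression` can also return the coefficients of the underlying linear model when called with `coef=True`, which gives a ground truth to compare the learned coefficients against. The sketch below uses separate names (`X_demo`, `y_demo`, `true_coef`) so that it does not overwrite the notebook's `X` and `y`.

```python
from sklearn.datasets import make_regression

# Same settings as above, plus coef=True to also get the generating coefficients
X_demo, y_demo, true_coef = make_regression(
    n_samples=1000, n_features=5, random_state=0,
    noise=10.0, bias=100.0, coef=True,
)

print(X_demo.shape, y_demo.shape, true_coef.shape)  # (1000, 5) (1000,) (5,)
```

With `noise=10.0`, a fitted model should land close to `true_coef` and to the bias of 100, but not exactly on them.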

## Split data into training and test set

In [30]:
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # keep 20% of the data for testing

In [31]:
X_train.shape

(800, 5)

In [32]:
X_test.shape

(200, 5)
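The split above is random, so the exact rows in each set (and the numbers reported later) change from run to run. If reproducibility matters, `train_test_split` accepts a `random_state` argument. The sketch below shows that two calls with the same seed give identical splits; the seed value 42 and the names `X_demo`, `y_demo`, `split_a`, and `split_b` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_regression(
    n_samples=1000, n_features=5, random_state=0, noise=10.0, bias=100.0
)

# Same seed, same split
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(all(np.array_equal(a, b) for a, b in zip(split_a, split_b)))  # True
print(split_a[0].shape, split_a[1].shape)  # (800, 5) (200, 5)
```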

## Build and Train a Linear Regression Model

In [33]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()

In [34]:
# Train the model
reg.fit(X_train, y_train)

LinearRegression()

## Check trained model coefficients

In [35]:
reg.coef_

array([40.68497529, 66.77973907, 10.88259561, 60.39913941, 25.72777919])

In [36]:
reg.intercept_

100.09352838063292

So, the model equation after training is (values rounded to two decimals):

**y = 100.09 + 40.68*x1 + 66.78*x2 + 10.88*x3 + 60.40*x4 + 25.73*x5**
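Writing the equation out by hand is error-prone, so it can help to build it from the fitted attributes. The sketch below is one way to do that. It fits on the full synthetic dataset rather than on the training split, so its numbers will differ slightly from those above; `reg_demo`, `terms`, and `equation` are illustrative names.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X_demo, y_demo = make_regression(
    n_samples=1000, n_features=5, random_state=0, noise=10.0, bias=100.0
)
reg_demo = LinearRegression().fit(X_demo, y_demo)

# One "coefficient*feature" term per column, numbered from x1
terms = [f"{c:.2f}*x{i}" for i, c in enumerate(reg_demo.coef_, start=1)]
equation = f"y = {reg_demo.intercept_:.2f} + " + " + ".join(terms)

print(equation)
```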


## Make predictions on the test data

In [37]:
reg.predict(X_test)

array([ 147.16278614,   50.41340967,   58.40592907,   36.61235445,
         48.94930909,   -5.41799738,  131.2258254 ,  216.88110982,
         73.5627408 ,  -28.93634998,   51.66988787,  -17.84522915,
        222.43342778,   11.11965928,   95.54410512,  154.35266743,
         90.14583654,   81.74566608, -114.06595397,  162.9680702 ,
         74.32188547,  192.10158557, -215.29541558,  111.70506127,
         63.37339592,   77.59885364,  169.41711029,    0.74502149,
       -133.98246357,   61.07678511,  183.02001647,   80.83135014,
         82.7199532 ,  179.90074649,   56.01911993,  157.07818046,
         96.60457596,   27.32859011,   46.59168629,    7.41593775,
        101.88716909,  132.36622858,   27.55008245,  110.94372464,
        200.9906544 ,  235.14253161,   55.060757  ,  -41.80592625,
       -139.21992041,   95.14225763,   69.04637441,   85.04628802,
        325.88733428,  324.62445738,  141.81912598,  -93.57707346,
         67.63811266,  164.70112492,   78.53483703,  201.66168

## Check goodness of fit - R-squared

The **coefficient of determination** is defined as **R^2 = 1 - u/v**, where **u** is the residual sum of squares, `((y_true - y_pred) ** 2).sum()`, and **v** is the total sum of squares, `((y_true - y_true.mean()) ** 2).sum()`.
The best possible score is 1.0, and the score can be negative because a model can be arbitrarily worse than a constant prediction.
A constant model that always predicts the expected value of y, disregarding the input features, would get a score of 0.0.


In [38]:
# For a good model, this value should be close to 1.
reg.score(X_test, y_test)

0.9872907376409152
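R-squared has no units, so it is often reported together with an error measure on the scale of y. The sketch below computes the test R-squared with `sklearn.metrics.r2_score`, which should match `score`, and adds the root mean squared error. It uses its own seeded split (`random_state=0`, an arbitrary choice) and its own names, so the values will not match the ones above exactly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_regression(
    n_samples=1000, n_features=5, random_state=0, noise=10.0, bias=100.0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

reg_demo = LinearRegression().fit(X_tr, y_tr)
y_pred = reg_demo.predict(X_te)

r2 = r2_score(y_te, y_pred)                        # same quantity as reg_demo.score(X_te, y_te)
rmse = np.sqrt(mean_squared_error(y_te, y_pred))   # error in the units of y

print(f"train R^2: {reg_demo.score(X_tr, y_tr):.4f}")
print(f"test R^2:  {r2:.4f}")
print(f"test RMSE: {rmse:.2f}")
```

Comparing the training and test scores is a quick way to spot overfitting: a test score far below the training score would be a warning sign.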