# Linear Regression With scikit-learn

<b>Underfitting</b> occurs when a model can't accurately capture the dependencies among data, usually as a consequence of its own simplicity. It often yields a low R² on known data and generalizes poorly to new data.

<b>Overfitting</b> happens when a model learns both data dependencies and random fluctuations. In other words, a model learns the existing data too well. Complex models, which have many features or terms, are often prone to overfitting. When applied to known data, such models usually yield a high R². However, they often don't generalize well and have a significantly lower R² when used with new data.
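Both effects can be seen in a rough sketch. It assumes synthetic data (a noisy quadratic, invented purely for illustration) and compares polynomial models of degree 1, 2, and 9 on a training set and a separate test set. The variable names and the choice of degrees are assumptions, not part of the original material.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: a quadratic trend plus random noise
rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=(15, 1))
x_test = rng.uniform(-1, 1, size=(15, 1))
y_train = 1 + 2 * x_train[:, 0] ** 2 + rng.normal(scale=0.2, size=15)
y_test = 1 + 2 * x_test[:, 0] ** 2 + rng.normal(scale=0.2, size=15)

r2_train, r2_test = {}, {}
for degree in (1, 2, 9):
    # Expand x into polynomial terms, then fit an ordinary linear model
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    r2_train[degree] = model.score(poly.transform(x_train), y_train)
    r2_test[degree] = model.score(poly.transform(x_test), y_test)
    print(f"degree {degree}: train R^2 = {r2_train[degree]:.3f}, "
          f"test R^2 = {r2_test[degree]:.3f}")
```

The training R² can't decrease as the degree grows, because each model contains the previous one as a special case. Whether the test R² drops for degree 9 depends on the noise that was drawn, so treat the printed numbers as illustrative.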

- Mean Absolute Error (MAE): The mean of the absolute values of the errors. This is the easiest of the metrics to understand since it's just the average size of the error.
- Mean Squared Error (MSE): The mean of the squared errors. It's more popular than Mean Absolute Error because it puts more weight on large errors: squaring makes large errors grow quadratically relative to smaller ones.
- Root Mean Squared Error (RMSE): The square root of the MSE. 
- Coefficient of determination (R^2): Not an error, but rather a popular metric to measure the performance of your regression model. It represents how close the data points are to the fitted regression line. The higher the R-squared value, the better the model fits your data. The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). You can use the .score() method to get this metric. 
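The four metrics above can be computed with `sklearn.metrics`. This is a minimal sketch with made-up values for `y_true` and `y_pred` (they don't come from any model in this notebook).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative values only, not the output of a real model
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)  # mean of |error|
mse = mean_squared_error(y_true, y_pred)   # mean of error squared
rmse = np.sqrt(mse)                        # back in the units of y
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(f"MAE:  {mae:.3f}")   # MAE:  1.000
print(f"MSE:  {mse:.3f}")   # MSE:  1.500
print(f"RMSE: {rmse:.3f}")  # RMSE: 1.225
print(f"R^2:  {r2:.3f}")    # R^2:  0.700
```

The errors here are 1, 0, -1, and -2, so the single error of size 2 contributes 4 of the 6 units of squared error, which is why MSE reacts more strongly to large errors than MAE does.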


<img src="https://files.realpython.com/media/poly-reg.5790f47603d8.png" />

<img src="https://files.realpython.com/media/fig-lin-reg.a506035b654a.png" />

### Recommended Python Packages for Linear Regression:
- NumPy
- Scikit-learn
- Statsmodels

In [4]:
#Step 1: Import libraries
import numpy as np
from sklearn.linear_model import LinearRegression

The fundamental data type of NumPy is the array type called numpy.ndarray, which is referenced here as array.
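As a small sketch of what that means in practice, the snippet below builds a one-dimensional array and reshapes it into a single column. The names `values` and `column` are chosen here for illustration.

```python
import numpy as np

values = np.array([5, 15, 25])     # one-dimensional array
column = values.reshape((-1, 1))   # two-dimensional: 3 rows, 1 column

print(type(values))   # <class 'numpy.ndarray'>
print(values.shape)   # (3,)
print(column.shape)   # (3, 1)
```

The `-1` tells NumPy to work out the number of rows from the length of the array.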

In [10]:
#Step 2: Provide the data
#x must be two-dimensional, with one column and as many rows as necessary, which is what .reshape((-1, 1)) produces
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

In [11]:
print(x)

[[ 5]
 [15]
 [25]
 [35]
 [45]
 [55]]


In [12]:
print(y)

[ 5 20 14 32 22 38]


In [13]:
#Step 3: Create a model and fit it
#.fit() calculates the optimal value of the weights b0 and b1, using the existing input and output, x and y, as the arguments.
#.fit() fits the model
#.fit() returns self, which is the variable model itself. 
model = LinearRegression()
model.fit(x, y)



In [16]:
#OR all in one line
model = LinearRegression().fit(x, y)

This statement creates the variable model as an instance of LinearRegression. You can provide several optional parameters to LinearRegression:

- <b>fit_intercept</b> is a Boolean that, if True, decides to calculate the intercept b₀ or, if False, considers it equal to zero. It defaults to True.
- <b>normalize</b> is a Boolean that, if True, decides to normalize the input variables. It defaults to False, in which case it doesn't normalize the input variables. This parameter was deprecated in scikit-learn 1.0 and removed in 1.2, so it's only available in older versions.
- <b>copy_X</b> is a Boolean that decides whether to copy (True) or overwrite the input variables (False). It's True by default.
- <b>n_jobs</b> is either an integer or None. It represents the number of jobs used in parallel computation. It defaults to None, which usually means one job. -1 means to use all available processors.

Your model as defined above uses the default values for all parameters. 
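To see what one of these parameters changes, here is a brief sketch that fits the same data twice, once with the defaults and once with `fit_intercept=False`. The names `default_model` and `origin_model` are chosen here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

# Default: the intercept is estimated from the data
default_model = LinearRegression().fit(x, y)

# fit_intercept=False: the line is forced through the origin
origin_model = LinearRegression(fit_intercept=False).fit(x, y)

print(f"default: intercept={default_model.intercept_}, slope={default_model.coef_}")
print(f"origin:  intercept={origin_model.intercept_}, slope={origin_model.coef_}")
```

Forcing the intercept to zero changes the slope as well, because the line has to compensate for passing through the origin.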


In [None]:
#Step 4: Get results

#Get R^2, the coefficient of determination
r_sq = model.score(x,y)

#Get b0, the intercept, and b1, the slope
b0 = model.intercept_
slope = model.coef_

print(r_sq)
print(b0)
print(slope)

Note: By convention, in scikit-learn, a trailing underscore (e.g., intercept_) indicates an attribute that is estimated from the data during fitting.
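A quick way to check this convention is to look for the attribute before and after calling `.fit()`. This is only a sketch, and the names `before_fit` and `after_fit` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

model = LinearRegression()
before_fit = hasattr(model, "coef_")   # estimated attributes don't exist yet
model.fit(x, y)
after_fit = hasattr(model, "coef_")    # created by .fit()

print(before_fit)  # False
print(after_fit)   # True
```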

In [17]:
#Print b0, the intercept, and b1, the slope
print(f'intercept: {model.intercept_}')
print(f"slope: {model.coef_}")

intercept: 5.633333333333329
slope: [0.54]


In [18]:
#Once you have your model, you can use it for predictions with either existing or new data. To obtain the predicted response, use .predict()
y_pred = model.predict(x)
print(f"predicted response: \n{y_pred}")

predicted response: 
[ 8.33333333 13.73333333 19.13333333 24.53333333 29.93333333 35.33333333]


In [24]:
#Equivalent to .predict(x), computed by hand. The result is two-dimensional here because x is two-dimensional.
y_pred = model.intercept_ + model.coef_ * x
print(f'predicted response: \n{y_pred}')

predicted response: 
[[ 8.33333333]
 [13.73333333]
 [19.13333333]
 [24.53333333]
 [29.93333333]
 [35.33333333]]


In practice, regression models are often applied for forecasts. This means that you can use fitted models to calculate the outputs based on new inputs.
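As a minimal sketch of such a forecast, the snippet below refits the model from the steps above and applies it to inputs it has not seen. The array `x_new` is an arbitrary choice made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])
model = LinearRegression().fit(x, y)

# New inputs must have the same two-dimensional layout as x
x_new = np.arange(5).reshape((-1, 1))   # [[0], [1], [2], [3], [4]]
y_new = model.predict(x_new)
print(y_new)
# [5.63333333 6.17333333 6.71333333 7.25333333 7.79333333]
```

Each prediction is the intercept (about 5.63) plus 0.54 times the input, matching the fitted values printed earlier.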

### Example 1

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
%matplotlib inline

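#cdf is assumed to be a pandas DataFrame loaded earlier, with the columns
#ENGINESIZE, FUELCONSUMPTION_COMB, and CO2EMISSIONS

#Viewing how linear a relationship is
plt.scatter(cdf.FUELCONSUMPTION_COMB, cdf.CO2EMISSIONS, color='blue')
plt.xlabel('Fuel consumption')
plt.ylabel('Emission')
plt.show()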

#Creating a train and test dataset
msk = np.random.rand(len(cdf)) < 0.8
train = cdf[msk]
test = cdf[~msk]

#Simple regression model
from sklearn import linear_model
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit(train_x, train_y)
print('Coefficients: ', regr.coef_)
print('Intercept: ', regr.intercept_)

#Plot the fit line over the data
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')
plt.plot(train_x, regr.coef_[0][0]*train_x + regr.intercept_[0], '-r')
plt.xlabel('Engine size')
plt.ylabel('Emission')
plt.show()

#Evaluate the model on the test set
from sklearn.metrics import r2_score
test_x = np.asanyarray(test[['ENGINESIZE']])
test_y = np.asanyarray(test[['CO2EMISSIONS']])
test_y_ = regr.predict(test_x)
print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - test_y)))
print("Mean squared error (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))
print("R2-score: %.2f" % r2_score(test_y, test_y_))
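The random mask used above is one way to split the data. As an alternative sketch, scikit-learn's `train_test_split` gives a reproducible split with an exact test fraction. The data here is synthetic and hypothetical (an invented linear relation between engine size and emissions), since the dataset itself isn't part of this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the engine size and CO2 emission columns
rng = np.random.default_rng(42)
engine_size = rng.uniform(1.0, 6.0, size=(100, 1))
co2 = 125 + 40 * engine_size[:, 0] + rng.normal(scale=10, size=100)

# Hold out 20% of the rows for testing; random_state makes the split repeatable
x_train, x_test, y_train, y_test = train_test_split(
    engine_size, co2, test_size=0.2, random_state=0
)

regr = LinearRegression().fit(x_train, y_train)
test_r2 = regr.score(x_test, y_test)
print(f"rows: train={len(x_train)}, test={len(x_test)}")
print(f"R^2 on the test set: {test_r2:.2f}")
```

Unlike the mask approach, which puts roughly 80% of the rows in the training set, this always yields exactly 80 training rows and 20 test rows for 100 samples.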

# Multiple Linear Regression With scikit-learn

- Multiple linear regression, sometimes called multivariate regression, is a case of linear regression with two or more independent variables.
- With two independent variables, the estimated regression function represents a regression plane in a three-dimensional space. The goal of regression is to determine the values of the weights b0, b1, and b2 such that this plane is as close as possible to the actual responses, while yielding the minimal sum of squared residuals (SSR).
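To make the goal concrete, here is a sketch that finds b0, b1, and b2 by minimizing the SSR directly with NumPy's least-squares solver instead of scikit-learn. The names `design`, `weights`, and `ssr` are chosen here for illustration.

```python
import numpy as np

x = np.array([[0, 1], [5, 1], [15, 2], [25, 5],
              [35, 11], [45, 15], [55, 34], [60, 35]])
y = np.array([4, 5, 20, 14, 32, 22, 38, 43])

# Prepend a column of ones so that the first weight acts as the intercept b0
design = np.column_stack([np.ones(len(x)), x])

# Least squares: the weights that minimize sum((y - design @ weights) ** 2)
weights, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
b0, b1, b2 = weights

residuals = y - design @ weights
ssr = np.sum(residuals ** 2)
print(f"b0={b0:.4f}, b1={b1:.4f}, b2={b2:.4f}, SSR={ssr:.4f}")
```

`LinearRegression` on the same data should report these values as `intercept_` and `coef_`, since it solves the same least-squares problem.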

In [8]:
import numpy as np
from sklearn.linear_model import LinearRegression
x = [[0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55, 34], [60, 35]]
y = [4, 5, 20, 14, 32, 22, 38, 43]
x, y = np.array(x), np.array(y)
print(x)
print(y)
type(x)
type(y)

[[ 0  1]
 [ 5  1]
 [15  2]
 [25  5]
 [35 11]
 [45 15]
 [55 34]
 [60 35]]
[ 4  5 20 14 32 22 38 43]


numpy.ndarray

In [9]:
#Create a model and fit it
model = LinearRegression().fit(x, y)

#Obtain the properties of the model the same way as in the case of simple linear regression
r_sq = model.score(x, y)
print(f'coefficient of determination: {r_sq}')
print(f'intercept: {model.intercept_}')
print(f'coefficients: {model.coef_}')

#Predicting a response
y_pred = model.predict(x)
print(f"predicted response:\n{y_pred}")

#Obtain the same response without .predict(), by multiplying each column of x by its weight and adding the intercept:
y_pred = model.intercept_ + np.sum(model.coef_ * x, axis=1)
print(f"predicted response:\n{y_pred}")

### Example 2

In [None]:
#Multiple linear regression
from sklearn import linear_model
regr = linear_model.LinearRegression()
x = np.asanyarray(train[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])
y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit(x, y)
#The coefficients
print('Coefficients: ', regr.coef_)

#Predictions
x = np.asanyarray(test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])
y = np.asanyarray(test[['CO2EMISSIONS']])
y_hat = regr.predict(x)
print("Mean squared error: %.2f"
      % np.mean((y_hat - y) ** 2))

#.score() returns R^2, the coefficient of determination: 1 is perfect prediction
print('R^2 score: %.2f' % regr.score(x, y))

# Polynomial Regression With scikit-learn

Implementing polynomial regression with scikit-learn is very similar to linear regression. There's only one extra step: you need to transform the array of inputs to include nonlinear terms such as x^2.

In [14]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

#Reshape is used here because we need a two-dimensional array
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([15, 11, 2, 8, 25, 32])
print(x)
print(y)

[[ 5]
 [15]
 [25]
 [35]
 [45]
 [55]]
[15 11  2  8 25 32]


In [15]:
#We now need to include x^2, so we need to transform the input array x to contain an additional column with the values of x^2.
#The class PolynomialFeatures is very convenient for this purpose.

#We use a transformer to transform the input x.
#degree is the degree of the polynomial regression function.
#interaction_only is a Boolean that decides whether to include only interaction features (True) or all features (False, the default).
#include_bias is a Boolean that decides whether to include the bias, or intercept, column of 1 values (True, the default) or not (False).
transformer = PolynomialFeatures(degree = 2, include_bias = False)

In [None]:
#Now we must fit the transformer, the instance of PolynomialFeatures, to the input x
transformer.fit(x)

#Now we create a new, modified input array.
x_ = transformer.transform(x)

#We can also use .fit_transform() to replace the three previous statements with only one
x_ = PolynomialFeatures(degree = 2, include_bias = False).fit_transform(x)

#The modified input array x_ contains two columns: one with the original inputs, and the other with their squares.

In [None]:
#Next, create a model and fit it
model = LinearRegression().fit(x_, y)

#Check the results
r_sq = model.score(x_, y) 
print(f"coefficient of determination: {r_sq}")
print(f"intercept: {model.intercept_}")
print(f"coefficients: {model.coef_}")

#Finally, we predict a response
y_pred = model.predict(x_)
print(f'predicted response: \n{y_pred}')
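The transform and fit steps can also be chained. The sketch below uses scikit-learn's `make_pipeline` so that the polynomial expansion is applied automatically whenever you fit or predict. This is an alternative arrangement rather than the approach used above, and the name `poly_model` is illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([15, 11, 2, 8, 25, 32])

# The pipeline expands x to [x, x^2] and then fits a linear model on the result
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_model.fit(x, y)

r_sq = poly_model.score(x, y)
y_pred = poly_model.predict(x)   # takes the raw x, not the transformed x_
print(f"coefficient of determination: {r_sq}")
print(f"predicted response:\n{y_pred}")
```

With a pipeline you pass the original inputs to `.predict()`, which removes the risk of forgetting to transform new data first.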

# Advanced Linear Regression with statsmodels

In [None]:
import numpy as np
import statsmodels.api as sm
x = [[0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55, 34], [60, 35]]
y = [4, 5, 20, 14, 32, 22, 38, 43]

x,y = np.array(x), np.array(y)

#You need to add the column of ones to the inputs if you want statsmodels to calculate the intercept b0. It doesn't take b0 into account by default.
x = sm.add_constant(x)

#The regression model based on ordinary least squares is an instance of the class statsmodels.regression.linear_model.OLS
model = sm.OLS(y, x)

#Once your model is created, you can apply .fit() on it
results = model.fit()

#By calling .fit(), you obtain the variable results, which is an instance of the class statsmodels.regression.linear_model.RegressionResultsWrapper.
#This object holds a lot of information about the regression model. 

#You can call .summary() to get the table with the results of linear regression
print(results.summary())

In [None]:
print(f"coefficient of determination: {results.rsquared}")
print(f"adjusted coefficient of determination: {results.rsquared_adj}")
print(f"regression coefficients: {results.params}")

In [None]:
#Predict a response
print(f"predicted response:\n{results.fittedvalues}")
print(f"predicted response:\n{results.predict(x)}")