# Simple Linear Regression

In this notebook, we use scikit-learn to implement simple linear regression. We load a dataset on the fuel consumption and carbon dioxide emissions of cars, split it into training and test sets, fit a model on the training set, evaluate it on the test set, and finally use it to predict unknown values.

In [114]:
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv('FuelConsumption.csv')
df.head()

Unnamed: 0,MODELYEAR,MAKE,MODEL,VEHICLECLASS,ENGINESIZE,CYLINDERS,TRANSMISSION,FUELTYPE,FUELCONSUMPTION_CITY,FUELCONSUMPTION_HWY,FUELCONSUMPTION_COMB,FUELCONSUMPTION_COMB_MPG,CO2EMISSIONS
0,2014,ACURA,ILX,COMPACT,2.0,4,AS5,Z,9.9,6.7,8.5,33,196
1,2014,ACURA,ILX,COMPACT,2.4,4,M6,Z,11.2,7.7,9.6,29,221
2,2014,ACURA,ILX HYBRID,COMPACT,1.5,4,AV7,Z,6.0,5.8,5.9,48,136
3,2014,ACURA,MDX 4WD,SUV - SMALL,3.5,6,AS6,Z,12.7,9.1,11.1,25,255
4,2014,ACURA,RDX AWD,SUV - SMALL,3.5,6,AS6,Z,12.1,8.7,10.6,27,244


In [16]:
df_num = df.drop(['MAKE','MODEL','VEHICLECLASS','TRANSMISSION','FUELTYPE','CO2EMISSIONS'],axis=1)
df_num.head()

Unnamed: 0,MODELYEAR,ENGINESIZE,CYLINDERS,FUELCONSUMPTION_CITY,FUELCONSUMPTION_HWY,FUELCONSUMPTION_COMB,FUELCONSUMPTION_COMB_MPG
0,2014,2.0,4,9.9,6.7,8.5,33
1,2014,2.4,4,11.2,7.7,9.6,29
2,2014,1.5,4,6.0,5.8,5.9,48
3,2014,3.5,6,12.7,9.1,11.1,25
4,2014,3.5,6,12.1,8.7,10.6,27


In [36]:
for col in df_num.columns:
    pearson_coef, p_value = stats.pearsonr(df_num[col], df['CO2EMISSIONS'])
    print("Pearson Coefficient of {} :\t Coef {:.2f} and P-value {}".format(col, pearson_coef, p_value))

Pearson Coefficient of MODELYEAR :	 Coef nan and P-value nan
Pearson Coefficient of ENGINESIZE :	 Coef 0.87 and P-value 0.0
Pearson Coefficient of CYLINDERS :	 Coef 0.85 and P-value 2.770937203986145e-298
Pearson Coefficient of FUELCONSUMPTION_CITY :	 Coef 0.90 and P-value 0.0
Pearson Coefficient of FUELCONSUMPTION_HWY :	 Coef 0.86 and P-value 3.9186556e-316
Pearson Coefficient of FUELCONSUMPTION_COMB :	 Coef 0.89 and P-value 0.0
Pearson Coefficient of FUELCONSUMPTION_COMB_MPG :	 Coef -0.91 and P-value 0.0


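Except for **MODELYEAR**, every feature is strongly linearly correlated with **CO2EMISSIONS** (absolute coefficient of at least 0.85, with p-values at or near zero). The coefficient for **MODELYEAR** is nan, most likely because the column is constant (every row shown above is 2014), in which case the Pearson coefficient is undefined.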
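
To illustrate why a constant column produces nan, here is a minimal NumPy sketch of the Pearson formula. The helper `pearson` is a hypothetical name introduced for illustration, not SciPy's implementation, and the inputs are only the five rows shown in the `df.head()` output, so the ENGINESIZE value it yields differs from the full-dataset 0.87.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation; returns nan when either input has zero variance."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    denom = np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
    if denom == 0:
        # A constant column has no spread, so the ratio is undefined.
        return float("nan")
    return float((dx * dy).sum() / denom)

# First five rows copied from the df.head() output (not the full dataset).
model_year = [2014, 2014, 2014, 2014, 2014]
engine_size = [2.0, 2.4, 1.5, 3.5, 3.5]
co2 = [196, 221, 136, 255, 244]

r_year = pearson(model_year, co2)     # constant input -> nan
r_engine = pearson(engine_size, co2)  # strong positive correlation
print(r_year, round(r_engine, 2))
```

A column with no variation cannot explain any variation in the target, so dropping it before modelling loses nothing.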

## Train/Test Split

Let's use only ENGINESIZE for now to predict CO2EMISSIONS.

In [44]:
x = df[['ENGINESIZE']]
y = df[['CO2EMISSIONS']]

In [103]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.15,random_state=1)
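
The idea behind a shuffled hold-out split can be sketched in plain NumPy. This is a rough illustration of the concept, not scikit-learn's implementation: `shuffle_split` is a hypothetical helper, the 20-row size is made up, and the seed here will not reproduce the rows that `random_state=1` selects in `train_test_split`.

```python
import numpy as np

def shuffle_split(n_samples, test_size=0.15, seed=1):
    """Return (train_idx, test_idx) after a seeded shuffle of the row indices."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    # Round the test share up so that a small dataset still gets test rows.
    n_test = int(np.ceil(test_size * n_samples))
    return indices[n_test:], indices[:n_test]

# Hypothetical dataset of 20 rows.
train_idx, test_idx = shuffle_split(n_samples=20)
print(len(train_idx), len(test_idx))  # 17 3
```

Fixing the seed makes the split repeatable, so the reported test score does not change between runs.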

In [104]:
lr = LinearRegression()
lr.fit(x_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
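
For a single feature, the ordinary least squares fit has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below shows that formula on hypothetical noise-free data; `fit_line` is an illustrative helper and is not how scikit-learn computes the fit internally.

```python
import numpy as np

def fit_line(x, y):
    """Closed-form least squares for y = intercept + slope * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    slope = (dx * (y - y.mean())).sum() / (dx ** 2).sum()
    intercept = y.mean() - slope * x.mean()
    return float(intercept), float(slope)

# Hypothetical data generated exactly from y = 125 + 40 * x.
x_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_demo = 125.0 + 40.0 * x_demo

intercept_demo, slope_demo = fit_line(x_demo, y_demo)
print(intercept_demo, slope_demo)
```

Because the demo data lie exactly on a line, the recovered intercept and slope match the generating values up to floating-point error.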

In [105]:
lr.score(x_test,y_test)

0.7799606185162282
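
For a regressor, `score` returns the coefficient of determination R², so a value of about 0.78 means the model accounts for roughly 78% of the variance in the test-set emissions. A minimal sketch of the formula follows; `r_squared` is an illustrative helper and the numbers are hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)

y_true_demo = [200.0, 250.0, 300.0]          # hypothetical targets
perfect = r_squared(y_true_demo, y_true_demo)             # 1.0
baseline = r_squared(y_true_demo, [250.0, 250.0, 250.0])  # 0.0
print(perfect, baseline)
```

Predicting the mean for every row scores 0, and a model that does worse than the mean scores below 0.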

In [109]:
# Predict CO2 emissions for the test set and show the first five
yhat = lr.predict(x_test)
yhat[0:5]

array([[242.89590576],
       [180.10877717],
       [203.65395039],
       [203.65395039],
       [305.68303435]])

In [126]:
coef = round(lr.coef_[0][0],2)
intercept = round(lr.intercept_[0],2)
print("Intercept {} and Coefficient {}".format(intercept,coef))

Intercept 125.17 and Coefficient 39.24


**Model: CO2EMISSIONS = 125.17 + 39.24 * ENGINESIZE**
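
As a quick check, the fitted equation can be applied by hand. The sketch below uses the rounded intercept and coefficient printed above, so its results may differ slightly from `lr.predict`; `predict_co2` is a hypothetical helper name.

```python
INTERCEPT = 125.17  # rounded intercept from the output above
COEF = 39.24        # rounded coefficient from the output above

def predict_co2(engine_size):
    """Apply the fitted line CO2EMISSIONS = INTERCEPT + COEF * ENGINESIZE."""
    return INTERCEPT + COEF * engine_size

print(round(predict_co2(3.5), 2))  # 262.51
print(round(predict_co2(2.0), 2))  # 203.65
```

The second value agrees with the repeated prediction of about 203.65 in the `yhat` output, which would be consistent with those test rows having an ENGINESIZE of 2.0 (an assumption, since the test inputs are not shown).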