# Simple Linear Regression

## Part 1. Using statsmodels

In [1]:
#Necessary imports
import pandas as pd
import numpy as np

In [2]:
car_data = pd.read_csv('/users/VarinderSingh/desktop/data_models/CarPrice_Assignment.csv')

In [3]:
#Take a peek into the dataset
car_data.head()

Unnamed: 0,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0


In [4]:
car_data.shape

(205, 26)

In [5]:
car_data.dtypes
#Some columns need a different type: doornumber and cylindernumber are counts,
#but they are written out as text (e.g. doornumber is 'two' or 'four') rather than stored as numbers
#I need to install word2number and convert these columns to actual numbers

car_ID                int64
symboling             int64
CarName              object
fueltype             object
aspiration           object
doornumber           object
carbody              object
drivewheel           object
enginelocation       object
wheelbase           float64
carlength           float64
carwidth            float64
carheight           float64
curbweight            int64
enginetype           object
cylindernumber       object
enginesize            int64
fuelsystem           object
boreratio           float64
stroke              float64
compressionratio    float64
horsepower            int64
peakrpm               int64
citympg               int64
highwaympg            int64
price               float64
dtype: object

In [6]:
from word2number import w2n
#w2n.word_to_num(car_data['carbody']) doesn't work: word_to_num expects a single string, not a whole column
#car_data['carbody'] = car_data['carbody'].word_to_nu


In [7]:
def num(x):
    #Return the number instead of printing it so the result can be reused
    return w2n.word_to_num(x)

In [8]:
num('three')

3
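One way to apply this kind of conversion to a whole column is sketched below. It assumes the text columns hold only spelled-out numbers such as 'two' and 'four'. The `WORD_TO_INT` table and the `words_to_ints` helper are hypothetical names, and a plain dictionary stands in for word2number so the sketch has no extra dependency.

```python
# Hypothetical lookup table (an assumption); extend it if the data
# contains other spelled-out numbers.
WORD_TO_INT = {
    "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "eight": 8, "twelve": 12,
}


def words_to_ints(values):
    """Convert spelled-out numbers to ints, failing loudly on unknown words."""
    converted = []
    for value in values:
        key = str(value).strip().lower()
        if key not in WORD_TO_INT:
            raise ValueError(f"no numeric mapping for {value!r}")
        converted.append(WORD_TO_INT[key])
    return converted


# Sample values in the style of the doornumber column
sample_doors = ["two", "two", "four", "four"]
door_counts = words_to_ints(sample_doors)
print(door_counts)
```

With pandas, the same table could be passed to `Series.map`, for example `car_data['doornumber'].map(WORD_TO_INT)`. In that case unmapped words become NaN instead of raising an error.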


In [9]:
#Checking for any empty values
car_data.isnull().sum()

car_ID              0
symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
dtype: int64

### The question I want to ask: which independent variable(s) directly contribute to price?

1. Y, the response variable, is price
2. X will be the feature(s) used to predict it

In [10]:
#Doing the statistical analysis
import statsmodels.api as sm

In [11]:
#Setting the independent variables (X-variables)
df = car_data[car_data.columns.difference(['price'])]

#The target (Y-variable) is the price of the car
target = car_data['price']

In [12]:
X = df['horsepower']
y = target

model = sm.OLS(y, X).fit()
predictions = model.predict(X) # make the predictions by the model

model.summary()

0,1,2,3
Dep. Variable:,price,R-squared (uncentered):,0.901
Model:,OLS,Adj. R-squared (uncentered):,0.9
Method:,Least Squares,F-statistic:,1854.0
Date:,"Sat, 18 Jan 2020",Prob (F-statistic):,2.43e-104
Time:,06:21:46,Log-Likelihood:,-2031.7
No. Observations:,205,AIC:,4065.0
Df Residuals:,204,BIC:,4069.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
horsepower,132.0043,3.066,43.055,0.000,125.959,138.049

0,1,2,3
Omnibus:,83.232,Durbin-Watson:,0.594
Prob(Omnibus):,0.0,Jarque-Bera (JB):,226.121
Skew:,1.804,Prob(JB):,7.92e-50
Kurtosis:,6.667,Cond. No.,1.0


### Interpreting the Results
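1. OLS stands for ordinary least squares, which is why the method is listed as least squares. It fits the regression line that minimizes the sum of the squared residuals, the vertical distances between the observed prices and the line.
2. Df Residuals and Df Model are degrees of freedom. Df Model is the number of predictors (1, horsepower), and Df Residuals is the number of observations minus the number of estimated parameters (205 - 1 = 204).
3. The coef of 132.0043 means that as horsepower increases by 1, the predicted price increases by 132.0043.
4. R-squared is the proportion of the variation in price that the model explains. Because this model has no constant, statsmodels reports the uncentered R-squared, which is not comparable to the usual R-squared of a model with an intercept. The standard error (std err) is the standard deviation of the sampling distribution of the coefficient estimate.
5. The [0.025, 0.975] columns are the bounds of a 95% confidence interval: we are 95% confident that the true horsepower coefficient is between 125.959 and 138.049.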
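The t statistic and the confidence bounds in the summary can be reproduced from the coefficient and its standard error. The sketch below uses the values from the table and an approximate critical value of the t distribution with 204 degrees of freedom. That constant is an assumption typed in by hand, not something the notebook computed.

```python
# Values copied from the summary table above
coef = 132.0043
std_err = 3.066

# Approximate two-sided 95% critical value of the t distribution with
# 204 degrees of freedom (assumed constant; scipy.stats.t.ppf(0.975, 204)
# could be used to compute it instead)
t_crit = 1.9717

# t statistic: how many standard errors the estimate is away from zero
t_stat = coef / std_err

# 95% confidence interval: estimate plus or minus t_crit standard errors
ci_lower = coef - t_crit * std_err
ci_upper = coef + t_crit * std_err

print(f"t = {t_stat:.3f}")
print(f"95% CI = [{ci_lower:.3f}, {ci_upper:.3f}]")
```

The results should agree with the t, [0.025 and 0.975] columns up to rounding of the displayed standard error.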

The previous model did not contain a constant, so the regression line was forced through the origin. The constant in a regression analysis is the value at which the regression line crosses the y-axis, also known as the y-intercept. Let's create a model with this data that has a constant.

In [13]:
X = sm.add_constant(X)



In [15]:
model = sm.OLS(y, X).fit() #Note the argument order: sm.OLS(y, X) takes the response first, then the predictors
predictions = model.predict(X)

model.summary()

0,1,2,3
Dep. Variable:,price,R-squared:,0.653
Model:,OLS,Adj. R-squared:,0.651
Method:,Least Squares,F-statistic:,382.2
Date:,"Sat, 18 Jan 2020",Prob (F-statistic):,1.48e-48
Time:,06:33:21,Log-Likelihood:,-2024.0
No. Observations:,205,AIC:,4052.0
Df Residuals:,203,BIC:,4059.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-3721.7615,929.849,-4.003,0.000,-5555.163,-1888.360
horsepower,163.2631,8.351,19.549,0.000,146.796,179.730

0,1,2,3
Omnibus:,47.741,Durbin-Watson:,0.792
Prob(Omnibus):,0.0,Jarque-Bera (JB):,91.702
Skew:,1.141,Prob(JB):,1.22e-20
Kurtosis:,5.352,Cond. No.,314.0


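This time the results differ from the previous model. The intercept (const) is -3721.7615 and the slope for horsepower is 163.2631. R-squared drops from 0.901 to 0.653 because the first model reported an uncentered R-squared, which is not comparable with the centered R-squared reported here.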
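To see why a model without a constant can report a high R-squared while fitting poorly, here is a small sketch on made-up data (not the car dataset). The helper names are hypothetical, and the formulas are the textbook closed-form solutions for a single predictor, not the code statsmodels runs internally.

```python
def fit_with_intercept(x, y):
    """Closed-form simple OLS: returns (intercept, slope)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def fit_through_origin(x, y):
    """OLS slope when the line is forced through (0, 0)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def r_squared(y, fitted, centered=True):
    """Centered R-squared compares against the mean of y, uncentered against 0."""
    baseline = sum(y) / len(y) if centered else 0.0
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - baseline) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up data with a clearly nonzero intercept
x = [1, 2, 3, 4, 5]
y = [12, 13, 15, 16, 19]

intercept, slope = fit_with_intercept(x, y)
slope_origin = fit_through_origin(x, y)

fitted_const = [intercept + slope * xi for xi in x]
fitted_origin = [slope_origin * xi for xi in x]

print("with constant:   ", intercept, slope, r_squared(y, fitted_const))
print("origin, uncentered R2:", r_squared(y, fitted_origin, centered=False))
print("origin, centered R2:  ", r_squared(y, fitted_origin, centered=True))
```

On this toy data the through-origin line still has a high uncentered R-squared, but its centered R-squared is negative. That is the same effect that makes the 0.901 and 0.653 figures above incomparable.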

In [18]:
#Adding another independent variable

X = df[['horsepower','enginesize']]
y = target
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()
predictions = model.predict(X) # make the predictions by the model

model.summary()

0,1,2,3
Dep. Variable:,price,R-squared:,0.793
Model:,OLS,Adj. R-squared:,0.791
Method:,Least Squares,F-statistic:,387.7
Date:,"Sat, 18 Jan 2020",Prob (F-statistic):,6.93e-70
Time:,06:49:13,Log-Likelihood:,-1970.9
No. Observations:,205,AIC:,3948.0
Df Residuals:,202,BIC:,3958.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-8389.7331,822.532,-10.200,0.000,-1e+04,-6767.882
horsepower,58.8474,11.013,5.344,0.000,37.132,80.562
enginesize,122.4470,10.458,11.709,0.000,101.826,143.068

0,1,2,3
Omnibus:,10.756,Durbin-Watson:,0.774
Prob(Omnibus):,0.005,Jarque-Bera (JB):,16.522
Skew:,0.304,Prob(JB):,0.000258
Kurtosis:,4.251,Cond. No.,558.0


## Part 2. Linear Regression Using scikit-learn

In [19]:
from sklearn import linear_model

In [22]:
X = df[['horsepower']]
y = target

In [23]:
lm = linear_model.LinearRegression()
model = lm.fit(X,y)

In [25]:
predictions = lm.predict(X)


In [29]:
#A very long list of predictions
print(predictions)

[14400.43827331 14400.43827331 21420.749895   12931.07072458
 15053.49051719 14237.17521234 14237.17521234 14237.17521234
 19135.06704143 22400.32826082 12767.80766361 12767.80766361
 16033.068883   16033.068883   16033.068883   25992.11560215
 25992.11560215 25992.11560215  4114.86543222  7706.65277355
  7706.65277355  7380.12665161  7380.12665161 12931.07072458
  7380.12665161  7380.12665161  7380.12665161 12931.07072458
 10645.38787101 19951.38234628  5747.49604192  8686.23113937
  6074.02216386  8686.23113937  8686.23113937  8686.23113937
  8686.23113937 10318.86174907 10318.86174907 10318.86174907
 10318.86174907 12767.80766361 12604.54460264  9012.75726131
  7706.65277355  7706.65277355 10971.91399295 25012.53723634
 25012.53723634 39053.16047972  7380.12665161  7380.12665161
  7380.12665161  7380.12665161  7380.12665161 12767.80766361
 12767.80766361 12767.80766361 18318.75173658  9992.33562713
  9992.33562713  9992.33562713  9992.33562713  6727.07440773
  9992.33562713 15869.80

In [30]:
lm.score(X,y)

0.653088356490231

In [31]:
lm.coef_

array([163.26306097])

In [32]:
lm.intercept_

-3721.7614943227563
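The score, coefficient and intercept match the statsmodels model with a constant (R-squared 0.653, slope 163.2631, const -3721.7615). As a quick check, the predictions can be reproduced by hand from `lm.intercept_` and `lm.coef_`. The sketch below hard-codes the printed values, and `predict_price` is a hypothetical helper rather than part of scikit-learn.

```python
# Values printed by lm.intercept_ and lm.coef_ above
intercept = -3721.7614943227563
slope = 163.26306097


def predict_price(horsepower):
    """Simple linear regression prediction: intercept + slope * x."""
    return intercept + slope * horsepower


# Horsepower of rows 0 and 2 in car_data.head() (111 and 154)
price_row0 = predict_price(111)
price_row2 = predict_price(154)

print(round(price_row0, 4))
print(round(price_row2, 4))
```

These should agree with the first and third entries of the printed predictions array (14400.43827331 and 21420.749895), up to the rounding of the displayed coefficient.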