In [20]:
# Simple regression using statsmodels

import statsmodels.api as sm
import pandas as pd
import numpy as np
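# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires an older scikit-learn release.
from sklearn.datasets import load_boston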

In [21]:
boston = load_boston()
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [22]:
df = pd.DataFrame(boston.data)
df.columns = boston.feature_names
df['Housing Value'] = boston.target
df.head()

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,Housing Value
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


In [18]:
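# Simple regression: predict housing value from the average number of rooms
X = df['RM']
Y = df['Housing Value']

# statsmodels does not add an intercept automatically, so add a constant column
X = sm.add_constant(X)

# Fit ordinary least squares and print the full summary table
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)  # in-sample fitted values
print(model.summary())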

                            OLS Regression Results                            
Dep. Variable:          Housing Value   R-squared:                       0.484
Model:                            OLS   Adj. R-squared:                  0.483
Method:                 Least Squares   F-statistic:                     471.8
Date:                Sat, 27 Feb 2021   Prob (F-statistic):           2.49e-74
Time:                        20:36:19   Log-Likelihood:                -1673.1
No. Observations:                 506   AIC:                             3350.
Df Residuals:                     504   BIC:                             3359.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -34.6706      2.650    -13.084      0.0

In [23]:
# Multiple regression: add LSTAT as a second predictor alongside RM
X = df[['RM', 'LSTAT']]
Y = df['Housing Value']

X = sm.add_constant(X)

model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:          Housing Value   R-squared:                       0.639
Model:                            OLS   Adj. R-squared:                  0.637
Method:                 Least Squares   F-statistic:                     444.3
Date:                Sat, 27 Feb 2021   Prob (F-statistic):          7.01e-112
Time:                        20:45:33   Log-Likelihood:                -1582.8
No. Observations:                 506   AIC:                             3172.
Df Residuals:                     503   BIC:                             3184.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.3583      3.173     -0.428      0.6

This model has an R-squared of 0.639, meaning that about 64% of the variation in Boston housing values is explained by the average number of rooms per dwelling (RM) and the percentage of the population classified as lower status (LSTAT), a measure based on education level and the share of workers classified as laborers (https://opendata.stackexchange.com/questions/15740/what-does-lower-status-mean-in-boston-house-prices-dataset).
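To make the R-squared interpretation concrete, here is a minimal sketch that computes it by hand as 1 - SS_res / SS_tot for a one-predictor fit. The data points and the helper names `fit_simple_ols` and `r_squared` are made up for illustration and are not taken from the Boston dataset or from statsmodels; only the standard library is used so the arithmetic stays visible.

```python
# Illustrative sketch: R-squared computed by hand for a simple linear fit.
# The numbers below are invented, not values from the Boston dataset.

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def r_squared(y, y_hat):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

intercept, slope = fit_simple_ols(x, y)
y_hat = [intercept + slope * xi for xi in x]
r2 = r_squared(y, y_hat)
print("R-squared:", round(r2, 3))
```

For this toy data the residual sum of squares is 2.4 and the total sum of squares is 6, so the fit explains 60% of the variation in `y`. The 0.639 in the summary is read the same way, with two predictors instead of one.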

The Durbin-Watson statistic is less than 2, which indicates positive autocorrelation in the residuals.
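The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. It ranges from 0 to 4, with values near 2 indicating no first-order autocorrelation. The sketch below implements that formula directly on two invented residual sequences; the function name `durbin_watson` here is a local helper, not the statsmodels implementation.

```python
# Illustrative sketch of the Durbin-Watson statistic on invented residuals.

def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2), for t = 1 .. n-1."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Neighbouring residuals are similar: positive autocorrelation, DW below 2
smooth = [1.0, 0.8, 0.5, 0.1, -0.3, -0.7, -0.9, -0.5]

# Residuals flip sign every step: negative autocorrelation, DW above 2
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

dw_smooth = durbin_watson(smooth)
dw_alternating = durbin_watson(alternating)
print("smooth:", round(dw_smooth, 3))
print("alternating:", round(dw_alternating, 3))
```

Because the statistic compares each residual with the one before it, its value depends on the order of the rows. For cross-sectional data such as housing tracts, that order may not carry any natural meaning.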

The residuals do not appear to be normally distributed: the skew is greater than 0.5, indicating a right-skewed distribution, and the kurtosis is greater than 3, indicating a sharper peak and heavier tails than a normal distribution.
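The thresholds in this reading come from the moment-based definitions of skewness and kurtosis, under which a normal distribution has skew 0 and kurtosis 3. As a rough sketch, assuming those plain (non-excess) definitions and using invented residuals, the two quantities can be computed like this; `skew_and_kurtosis` is a hypothetical helper name.

```python
# Illustrative sketch: moment-based skewness and (non-excess) kurtosis.
# The residuals below are invented to show a long right tail.

def skew_and_kurtosis(values):
    """Return (skew, kurtosis) using central moments m2, m3, m4."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # a normal distribution gives 3
    return skew, kurtosis


# Mostly small negative or zero residuals plus one large positive outlier
residuals = [-1.0, -1.0, -1.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0]

skew, kurtosis = skew_and_kurtosis(residuals)
print("skew:", round(skew, 3))
print("kurtosis:", round(kurtosis, 3))
```

The single large positive residual pulls the skew above zero and pushes the kurtosis above 3. This is the same pattern described for the model's residuals.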



In [70]:
# Refit the same two-feature model with scikit-learn for comparison
from sklearn.linear_model import LinearRegression

X = df[['RM', 'LSTAT']]
Y = df['Housing Value']

lm = LinearRegression().fit(X, Y)


print('Target values for features: ', lm.predict(X))
print('\n')
print('R-Squared: ', lm.score(X,Y))
print('\n')
print('Coefficients of RM and LSTAT features: ', lm.coef_)
print('\n')
print('Intercept of the model: ', lm.intercept_)



Target values for features:  [28.94101368 25.48420566 32.65907477 32.40652    31.63040699 28.05452701
 21.28707846 17.78559653  8.10469338 18.24650673 17.99496223 20.73221309
 18.5534842  23.64474107 23.10895823 22.9239452  24.65257604 19.73611045
 18.9297215  20.57377596 13.51732408 20.14832175 17.90896697 15.48764606
 18.35281036 16.56210901 18.74440281 18.34995811 23.51018847 24.94888935
 13.23095259 21.20092715 11.15596625 15.89983805 16.63398622 22.65107562
 21.07107521 22.81275431 22.53014238 29.46686594 33.15564849 30.0244275
 26.33937234 25.50630935 23.42747337 21.03183392 19.03080004 17.28696205
  6.35742724 16.77652446 20.38222834 23.73891662 28.42223975 23.78518476
 19.13293549 32.4841017  27.4553513  30.83048667 25.54262118 22.91599173
 19.44389291 19.76157796 27.21060683 26.99027936 29.66411644 27.68813019
 21.54751591 23.38578845 18.73350058 22.97822472 27.01833368 22.66525802
 25.99579831 25.61529631 26.24614271 24.92488095 22.94287168 23.32670532
 22.46574406 22.7230509