# Machine Learning and Statistics Project, November 2019
***

## **Submitted by:** Francis Adepoju (G00364694)
***
## __Title:__ Using Descriptive Statistics and Plots to Describe the Boston House Prices Dataset 
***


## Summary of the dataset:
#### The Boston data frame has 506 rows and 14 columns. This data frame contains the following columns:
1. __crim__   - per capita crime rate by town.
2. __zn__     - proportion of residential land zoned for lots over 25,000 sq.ft.
3. __indus__  - proportion of non-retail business acres per town.
4. __chas__   - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
5. __nox__    - nitrogen oxides concentration (parts per 10 million).
6. __rm__     - average number of rooms per dwelling.
7. __age__    - proportion of owner-occupied units built prior to 1940.
8. __dis__    - weighted mean of distances to five Boston employment centres.
9. __rad__    - index of accessibility to radial highways.
10. __tax__    - full-value property-tax rate per **`$10,000`**.
11. __ptratio__  - pupil-teacher ratio by town.
12. __black__  - `1000 * (Bk - 0.63)^2` where Bk is the proportion of blacks by town.
13. __lstat__  - lower status of the population (percent).
14. __medv__  - median value of owner-occupied homes in **`$1,000`**.

#### NOTE: 
1. The __medv__ variable is the target (y) variable; the effect of the remaining 13 variables on house prices is to be investigated.
2. In this project, we use Python [1] together with the SciPy [2], Keras [3], and Jupyter [4] packages to produce a comprehensive description of house prices from the Boston house prices dataset [5].
***



#### Import necessary Libraries

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.linalg as sl



#### Load Dataset from my GitHub repository

In [4]:
# Let's use pandas to read the CSV file and organise the housing data.
# Load the Boston housing dataset from the "raw" URL of housing.csv in my GitHub repository.
#df = pd.read_csv("housingCSV2.csv")
df = pd.read_csv("https://raw.githubusercontent.com/dewaledr/MLearning-Projects/master/housing.csv")
print(df.head())
print(df.tail())

      crim    zn  indus  chas    nox     rm   age     dis  rad  tax  ptratio  \
0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296     15.3   
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8   
2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242     17.8   
3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222     18.7   
4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222     18.7   

    black  lstat  medv  
0  396.90   4.98  24.0  
1  396.90   9.14  21.6  
2  392.83   4.03  34.7  
3  394.63   2.94  33.4  
4  396.90   5.33  36.2  
        crim   zn  indus  chas    nox     rm   age     dis  rad  tax  ptratio  \
501  0.06263  0.0  11.93     0  0.573  6.593  69.1  2.4786    1  273     21.0   
502  0.04527  0.0  11.93     0  0.573  6.120  76.7  2.2875    1  273     21.0   
503  0.06076  0.0  11.93     0  0.573  6.976  91.0  2.1675    1  273     21.0   
504  0.10959  0.0  11.93     0  0.573  6.794 

***
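Before fitting any model, the descriptive statistics named in the title can be produced with `describe()`. The sketch below is only an illustration: it re-types the five rows printed by `df.head()` above so that it runs without network access, and the name `head_df` is made up for this example. On the full dataset the same call would simply be `df.describe()`.

```python
import pandas as pd

# The five rows shown by df.head() above, typed in by hand (illustrative only).
columns = ["crim", "zn", "indus", "chas", "nox", "rm", "age", "dis",
           "rad", "tax", "ptratio", "black", "lstat", "medv"]
rows = [
    [0.00632, 18.0, 2.31, 0, 0.538, 6.575, 65.2, 4.0900, 1, 296, 15.3, 396.90, 4.98, 24.0],
    [0.02731, 0.0, 7.07, 0, 0.469, 6.421, 78.9, 4.9671, 2, 242, 17.8, 396.90, 9.14, 21.6],
    [0.02729, 0.0, 7.07, 0, 0.469, 7.185, 61.1, 4.9671, 2, 242, 17.8, 392.83, 4.03, 34.7],
    [0.03237, 0.0, 2.18, 0, 0.458, 6.998, 45.8, 6.0622, 3, 222, 18.7, 394.63, 2.94, 33.4],
    [0.06905, 0.0, 2.18, 0, 0.458, 7.147, 54.2, 6.0622, 3, 222, 18.7, 396.90, 5.33, 36.2],
]
head_df = pd.DataFrame(rows, columns=columns)

# count, mean, std, min, quartiles and max for every numeric column.
summary = head_df.describe()

# Show a few statistics for a few columns to keep the printout short.
print(summary.loc[["mean", "std", "min", "max"], ["rm", "lstat", "medv"]])
```

With only five rows these numbers say little about Boston as a whole; the point is the shape of the call, not the values.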

#### Multiple linear regression using sklearn.
[https://scikit-learn.org/stable/](https://scikit-learn.org/stable/)

In [5]:
# Import linear_model from sklearn.
import sklearn.linear_model as lm

In [6]:
# Create a linear regression model instance.
m = lm.LinearRegression()

Assuming the following linear relationship holds between the house prices and the other variables, where $c_0$ is the intercept that the model fits alongside the 13 coefficients:
$$ medv = c_0 + a (crim) + b (zn) + c (indus) + d (chas) + e (nox) + f (rm) + g (age) + 
          h (dis) + j (rad) + k (tax) + m (ptratio) + n (black) + p (lstat) $$

In [7]:
# We want to do linear regression on these 13 variables to predict the median house value (medv).
x = df[['crim', 'zn', 'indus','chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 
        'black', 'lstat']]

In [8]:
# The target variable: median house price (medv).
y = df['medv']

In [9]:
# Ask our model to fit the data.
m.fit(x, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [10]:
# Here's our intercept.
m.intercept_

36.459488385090005

In [11]:
# Here are our coefficients, in the same order as the columns of x.
m.coef_

array([-1.08011358e-01,  4.64204584e-02,  2.05586264e-02,  2.68673382e+00,
       -1.77666112e+01,  3.80986521e+00,  6.92224640e-04, -1.47556685e+00,
        3.06049479e-01, -1.23345939e-02, -9.52747232e-01,  9.31168327e-03,
       -5.24758378e-01])
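To see what these numbers mean, a fitted linear model predicts `medv` as the intercept plus the dot product of the coefficients with a row of inputs. The sketch below is a hand check using only values already printed above (the intercept, the coefficient array, and the first row of `df.head()`); the names `row0` and `predicted_medv` are made up for the illustration.

```python
import numpy as np

# Intercept and coefficients exactly as printed by m.intercept_ and m.coef_ above.
intercept = 36.459488385090005
coef = np.array([-1.08011358e-01,  4.64204584e-02,  2.05586264e-02,  2.68673382e+00,
                 -1.77666112e+01,  3.80986521e+00,  6.92224640e-04, -1.47556685e+00,
                  3.06049479e-01, -1.23345939e-02, -9.52747232e-01,  9.31168327e-03,
                 -5.24758378e-01])

# First row of df.head(): crim, zn, indus, chas, nox, rm, age, dis,
# rad, tax, ptratio, black, lstat. Its observed medv is 24.0.
row0 = np.array([0.00632, 18.0, 2.31, 0, 0.538, 6.575, 65.2, 4.0900,
                 1, 296, 15.3, 396.90, 4.98])
observed_medv = 24.0

# Prediction = intercept + sum(coefficient_i * feature_i).
predicted_medv = intercept + coef @ row0
residual = observed_medv - predicted_medv

# The prediction comes out at roughly 30.0 (in $1,000s) against an observed 24.0.
print(f"predicted: {predicted_medv:.2f}  observed: {observed_medv}  residual: {residual:.2f}")
```

This should agree with `m.predict(x.iloc[[0]])` up to rounding of the printed coefficients. It also shows that a single coefficient, such as the large negative one on `nox`, only makes sense alongside the scale of its variable.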

In [12]:
# See how good our fit is (R-squared, measured on the same data we fitted).
m.score(x, y)

0.7406426641094094
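For `LinearRegression`, `score` returns the coefficient of determination, R-squared = 1 - SS_res / SS_tot, so about 74% of the variance in `medv` is explained on the data the model was fitted to. The sketch below computes R-squared by hand with NumPy on synthetic data (not the housing data); every name and number in it is invented for the illustration.

```python
import numpy as np

# Synthetic data: three made-up predictors and a noisy linear response.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=n)

# Ordinary least squares with an intercept: prepend a column of ones.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta

# R-squared: share of the variance of y that the fitted values account for.
ss_res = np.sum((y - y_hat) ** 2)      # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares around the mean
r_squared = 1.0 - ss_res / ss_tot

print(f"R-squared: {r_squared:.3f}")
```

Because the score here is computed on the training data, it is likely to be an optimistic estimate of how well the model would predict unseen houses.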

### Using statsmodels for the above operation

In [14]:
# Import statsmodels.
import statsmodels.api as sm

In [15]:
# Tell statsmodels to include an intercept (sm.OLS does not add one by default).
xwithc = sm.add_constant(x)

# Create a model.
msm = sm.OLS(y, xwithc)
# Fit the data.
rsm = msm.fit()
# Print a summary.
print(rsm.summary())

                            OLS Regression Results                            
Dep. Variable:                   medv   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.734
Method:                 Least Squares   F-statistic:                     108.1
Date:                Mon, 14 Oct 2019   Prob (F-statistic):          6.72e-135
Time:                        20:16:34   Log-Likelihood:                -1498.8
No. Observations:                 506   AIC:                             3026.
Df Residuals:                     492   BIC:                             3085.
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         36.4595      5.103      7.144      0.0

#### The condition number is large (15,100). This might indicate strong multicollinearity or other numerical problems, and needs further investigation.
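One possible way to start that investigation is sketched below. It uses synthetic data rather than the housing data, and the helper `vif` is a made-up name implementing the textbook variance inflation factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the other columns. It also shows that the condition number reacts to the units of the columns, so a large value does not by itself prove multicollinearity. `np.linalg.cond` is assumed here to be comparable to the figure in the summary.

```python
import numpy as np

# Synthetic predictors: b is nearly a copy of a, c is unrelated to both.
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = a + 0.05 * rng.normal(size=n)
c = rng.normal(size=n)
X = np.column_stack([a, b, c])

def vif(X, j):
    """Variance inflation factor of column j (hypothetical helper)."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(target)), others])  # intercept + other columns
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print("VIF per column:", np.round(vifs, 1))

# Condition number of the design matrix (with intercept column).
design = np.column_stack([np.ones(n), X])
cond_raw = np.linalg.cond(design)

# Same information, but column c expressed in units 1000 times smaller.
design_rescaled = design.copy()
design_rescaled[:, 3] *= 1000
cond_rescaled = np.linalg.cond(design_rescaled)
print(f"condition number: {cond_raw:.1f} (raw), {cond_rescaled:.1f} (c rescaled)")
```

A common rule of thumb treats a VIF above about 10 as a sign of problematic collinearity. For the housing data, the same function could be applied to `x.values`. Standardising the columns before recomputing the condition number would help separate scale effects (for example `tax` in the hundreds against `nox` below 1) from genuine collinearity.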


NOTE: Kaggle account opened and housing.csv dataset downloaded on 3rd October 2019.

# References:
#### [1] Python Software Foundation: https://www.python.org/
#### [2] SciPy developers: https://www.scipy.org/
#### [3] Keras: https://keras.io/
#### [4] Project Jupyter: https://jupyter.org/
#### [5] Housing Values in Suburbs of Boston: https://www.kaggle.com/c/boston-housing.