# Simple Regression Modeling with Boston Housing dataset - Lab

In this final lab, we shall apply the regression analysis and diagnostics techniques covered in this section to the familiar "Boston Housing" dataset. We performed a detailed EDA of this dataset in an earlier section, so we already have a good understanding of how it is composed. In this lab we shall try to identify how well some of the features in this dataset predict house price.

### Objectives:
You will be able to:
* Build several linear models with the Boston Housing dataset using OLS
* For each model, analyze OLS diagnostics for model validity 
* Visually explain the results and interpret the diagnostics from Statsmodels 
* Comment on the goodness of fit for a simple regression model

Let's get started. 

### Import necessary libraries and load 'BostonHousing.csv' as pandas dataframe.

In [128]:
# Your code here
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.stats.api as sms
import statsmodels.formula.api as smf
import scipy.stats as stats
from scipy.stats import kurtosis, skew
plt.style.use('ggplot')
boston = pd.read_csv('BostonHousing.csv')
boston.head()

The features and the target are all present as columns in the `boston` dataframe. The features serve as the independent variables, and the median house value, `medv`, is the target. The feature names are the column names. The description is available at [Kaggle](https://www.kaggle.com/c/boston-housing).

### Inspect the columns of the dataset and comment on type of variables present

In [129]:
# Your code here
print(boston.info())
# len(boston.crim.unique().tolist()), len(boston.zn.unique().tolist()),len(boston.indus.unique().tolist()),
# len(boston.chas.unique().tolist()),len(boston.nox.unique().tolist()), len(boston.rm.unique().tolist()),
for column in ['rad', 'chas','tax']:
    print(column, 'len unique values =',len(boston[column].unique().tolist()))
    print(column, ' unique values:', boston[column].unique().tolist())



In [130]:
# Record your observations here 
# there are 506 rows, 14 columns and no null or missing values
# The dataset mostly contains continuous variables
# chas is categorical (Charles River dummy variable: 1 if tract bounds river, 0 otherwise)
# rad (index of accessibility to radial highways) is also categorical, with 9 unique values
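
A quick way to back up observations like these is to count the distinct values in each column. The sketch below uses a small made-up dataframe (`toy`, not the real Boston data) and an arbitrary cutoff, so treat both as assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "chas": [0, 1, 0, 0, 1, 0],
    "rad": [1, 2, 3, 1, 2, 3],
    "rm": [6.1, 5.9, 7.2, 6.4, 6.8, 5.5],
})

# Number of distinct values per column; very few distinct values
# suggests the column may be categorical rather than continuous.
unique_counts = toy.nunique()

# The cutoff of 3 is an arbitrary choice for this tiny example.
categorical_like = unique_counts[unique_counts <= 3].index.tolist()

print(unique_counts)
print(categorical_like)
```

On the real dataframe the same idea would be `boston.nunique()`, with a cutoff chosen after looking at the counts.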





### Create histograms for all variables in the dataset and comment on their shape (uniform or not?)

In [131]:
# Your code here 
boston.hist(figsize=(18,10))

In [132]:
#check for skewness & kurtosis
for column in boston: 
    print ('Skewness =',column,  skew(boston[column]))
    print ('kurtosis =',column, kurtosis(boston[column]))

In [133]:
# Your observations here 
# indus and rm are fairly symmetrical 
# positively skewed: crim, zn, rad, dis, medv
# negatively skewed: b
# leptokurtic: crim, zn, chas, b
# platykurtic: indus, nox, rm, age, dis, rad, tax, ptratio, lstat, medv
# Some variables have outliers at extreme tails (b, zn, crim, rm)
# the target variable looks good with some outliers in the right tail
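
When reading the skewness and kurtosis numbers, it helps to know what a roughly normal sample looks like. The sketch below uses synthetic samples (not the Boston data) to show that `scipy.stats.kurtosis` reports excess kurtosis by default, so a normal sample scores close to 0 on both measures, while a right-skewed sample scores clearly positive.

```python
import numpy as np
from scipy.stats import kurtosis, skew

# Synthetic samples, used only to calibrate how the statistics read.
rng = np.random.default_rng(0)
normal_sample = rng.normal(size=10_000)
skewed_sample = rng.exponential(size=10_000)  # long right tail

# kurtosis() uses fisher=True by default, i.e. it returns excess kurtosis
# (0 for a normal distribution, positive for heavier tails).
normal_skew = skew(normal_sample)
normal_kurt = kurtosis(normal_sample)
skewed_skew = skew(skewed_sample)
skewed_kurt = kurtosis(skewed_sample)

print("normal:", normal_skew, normal_kurt)
print("skewed:", skewed_skew, skewed_kurt)
```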

Based on this, we shall choose a selection of features which appear to be more 'normal' than others.

### Create a new dataset with `['crim', 'dis', 'rm', 'zn', 'age', 'medv']`

In [134]:
# Your code here
df_new = boston[['crim', 'dis', 'rm', 'zn', 'age', 'medv']].copy()
df_new.head()


### Check for linearity assumption for all chosen features with target variable using scatter plots and comment on the results

In [135]:
# Your code here 
# Scatter each chosen feature against the target, one plot per feature
for column in ['crim', 'dis', 'rm', 'zn', 'age']:
    plt.scatter(df_new[column], df_new['medv'], label=column)
    plt.xlabel(column)
    plt.ylabel('medv')
    plt.legend()
    plt.show()

In [136]:
# Your observations here 
# The linearity of crim is a bit unclear as the values are very small and too close to each other
# There is SOME linearity apparent in the variables, although the variance along the y-axis is a bit unpredictable for some values
# Some outliers are present in almost all cases
# Data probably needs more normalization and pre-processing to "clean it up"
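
As a numerical complement to eyeballing the scatter plots, a Pearson correlation coefficient summarizes how strong a straight-line relationship is. This is a sketch on synthetic data, not part of the original lab, and it also shows the limitation: a clearly curved relationship can still give a coefficient near 0.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic data: one linear relationship and one curved relationship.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=500)
y_linear = 2.0 * x + rng.normal(scale=1.0, size=500)
y_curved = x ** 2 + rng.normal(scale=1.0, size=500)

# pearsonr returns the coefficient and a p-value; only the coefficient is kept.
r_linear, _ = pearsonr(x, y_linear)
r_curved, _ = pearsonr(x, y_curved)

print("linear:", r_linear)
print("curved:", r_curved)
```

A low coefficient therefore does not mean "no relationship", which is why the scatter plots remain the primary check.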




Clearly our data needs a lot of pre-processing to improve the results. The key to such Kaggle competitions is to process the data in such a way that we can identify the relationships and make predictions in the best possible way. For now, we shall leave the dataset untouched and just move on with regression. So far our assumptions, although not strongly satisfied, hold well enough for us to move on.

### Let's do Regression 

Right here is the real deal. Let's perform a number of simple regression experiments between the chosen independent variables and the dependent variable (price). We shall do this in a loop, and in every iteration we shall pick one of the independent variables and perform the following steps:

* Run a simple OLS regression between independent and dependent variables
* Plot a regression line on the scatter plots
* Plot the residuals using `sm.graphics.plot_regress_exog()`.
* Plot a Q-Q plot for regression residuals normality test 
* Store the following values in an array for each iteration:
    * Independent Variable
    * r_squared
    * intercept
    * slope
    * p-value
    * normality (JB)
* Comment on each output 
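
Before running the full loop, it may help to see a single iteration in isolation. The sketch below fits one simple OLS model on synthetic data (a made-up frame `toy` with columns `x` and `y`, not the Boston data) and collects the values listed above; the plotting steps are left out to keep it short.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms

# Synthetic data with a known intercept (3) and slope (2); an assumption for illustration.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
toy = pd.DataFrame({"x": x, "y": 3.0 + 2.0 * x + rng.normal(scale=1.0, size=200)})

# Fit a simple OLS regression of y on x using the formula interface.
model = smf.ols(formula="y ~ x", data=toy).fit()

# jarque_bera returns (JB statistic, JB p-value, skew, kurtosis) of the residuals.
jb_stat, jb_pvalue, _, _ = sms.jarque_bera(model.resid)

# Collect the per-model values by label rather than by position.
row = {
    "ind_var": "x",
    "r_squared": model.rsquared,
    "intercept": model.params["Intercept"],
    "slope": model.params["x"],
    "p-value": model.pvalues["x"],
    "normality (JB)": jb_stat,
}
print(row)
```

Looking values up by label (`"Intercept"`, `"x"`) avoids depending on the order of the fitted parameters.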

In [None]:
# Your code here
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats
import statsmodels.stats.api as sms



results = [['ind_var', 'r_squared', 'intercept', 'slope', 'p-value', 'normality (JB)']]
for val in ['crim', 'dis', 'rm', 'zn', 'age']:
    print("Boston Housing DataSet - Regression Analysis and Diagnostics for formula: medv~" + val)
    print("-------------------------------------------------------------------------------------")

    # Fit a simple OLS model of medv on the current predictor
    f = 'medv~' + val
    model = smf.ols(formula=f, data=df_new).fit()

    # Draw the fitted line between the smallest and largest predictor values
    X_new = pd.DataFrame({val: [df_new[val].min(), df_new[val].max()]})
    preds = model.predict(X_new)
    df_new.plot(kind='scatter', x=val, y='medv')
    plt.plot(X_new[val], preds, c='red', linewidth=2)
    plt.show()

    # Residual diagnostics and a Q-Q plot of the residuals
    fig = plt.figure(figsize=(15, 8))
    fig = sm.graphics.plot_regress_exog(model, val, fig=fig)
    fig = sm.graphics.qqplot(model.resid, dist=stats.norm, line='45', fit=True)
    plt.show()

    # params and pvalues are label-indexed Series, so use .iloc for positional access
    results.append([val, model.rsquared, model.params.iloc[0], model.params.iloc[1],
                    model.pvalues.iloc[1], sms.jarque_bera(model.resid)[0]])
    input("Press Enter to continue...")

In [None]:
# The first row of results holds the column names
pd.DataFrame(results[1:], columns=results[0])

In [None]:
# Your observations here 
# We could do a detailed analysis of each experiment and elaborate on it 
# Here we shall show a summary of selected observations

# Crime has a negative relationship with price, i.e. less crime > higher price and vice versa
# Crime does not show any clear signs of heteroscedasticity 
# Crime has a low r-squared so not such a good fit 
# Residuals not normally distributed (needs log normalization that we'll see in next section)

# a positive relationship between dis and medv
# dis residual plots show some signs of heteroscedasticity as cone-shaped residuals
# normality is still questionable 

# rm shows a strong positive relationship
# rm residuals show no signs of heteroscedasticity, however some outliers are present
# rm qqplot shows a long right tail which hurts normality 

# zn variable scatter shows a lot of variance


So clearly the results are not highly reliable. The best goodness of fit, i.e. r-squared, is seen with `rm`, so in this analysis it is our best predictor.

---
#### So how can we improve upon these results?
1. Pre-Processing 

This is where pre-processing of data comes in. Dealing with outliers, normalizing data, scaling values, etc. can help regression analysis get more meaningful results from the given data.

2. Advanced Analytical Methods

Simple regression is a very basic analysis technique, and trying to fit a straight line to complex analytical questions may prove to be very inefficient. In the next section we shall look at multiple regression, where we can use multiple features **AT ONCE** to define a relationship with the outcome. We shall also look at some pre-processing and data simplification techniques and revisit the Boston dataset with an improved toolkit.
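
To make the pre-processing point concrete, here is a minimal sketch of one common transformation, a log transform, applied to a synthetic right-skewed variable (not the Boston data). It only illustrates the effect on skewness; whether it helps a particular feature has to be checked case by case.

```python
import numpy as np
from scipy.stats import skew

# Synthetic, strictly positive, right-skewed variable (log-normal by construction).
rng = np.random.default_rng(3)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)

# Taking the natural log pulls in the long right tail.
logged = np.log(raw)

skew_raw = skew(raw)
skew_logged = skew(logged)

print("skew before:", skew_raw)
print("skew after: ", skew_logged)
```

Note that `np.log` requires strictly positive values; a column containing zeros would need a different treatment.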

## Level up - Optional 

Apply some of the data wrangling skills that you learned in the previous section to pre-process the set of independent variables we chose above. You can start off with outliers and think of a way to deal with them. See how it affects the goodness of fit.
![](https://i.pinimg.com/originals/e5/a5/1e/e5a51eff1b2133105ebaa9b779106ae2.png)
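
One possible starting point for the outlier exercise is the 1.5 × IQR rule. The sketch below is only an illustration on synthetic data with ten artificially injected outliers (the frame `toy` and its columns are made up); it drops rows outside the IQR fences and compares r-squared before and after.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic linear data with ten injected outliers in the target.
rng = np.random.default_rng(11)
x = rng.uniform(0, 10, size=300)
y = 5.0 + 1.5 * x + rng.normal(scale=1.0, size=300)
y[:10] += 40.0
toy = pd.DataFrame({"x": x, "y": y})

# Keep only rows whose target lies within 1.5 * IQR of the quartiles.
q1, q3 = toy["y"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = toy["y"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
trimmed = toy[mask]

# Compare goodness of fit with and without the flagged rows.
r2_raw = smf.ols("y ~ x", data=toy).fit().rsquared
r2_trimmed = smf.ols("y ~ x", data=trimmed).fit().rsquared

print(len(toy), len(trimmed))
print(r2_raw, r2_trimmed)
```

Dropping rows is only one option; on real data it is worth checking whether the flagged points are errors or genuine extreme values before removing them.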

## Summary 

In this lab, we attempted to bring all the skills learned so far to a slightly more detailed dataset. We looked at the outcome of our analysis and realized that the data might need some pre-processing to see a clear improvement in results. We shall pick up from this point in the next section and bring in data pre-processing techniques, along with some assumptions that are needed for multiple regression.