<link rel='stylesheet' href='../assets/css/main.css'/>

[<< back to main index](../README.md)

# Multiple Linear Regression Lab 6: Akaike’s Information Criterion (AIC)

### Overview
Use AIC to figure out which attributes to include in the model.

### Builds on
None

### Run time
approx. 20 minutes

### Notes



In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import RFormula
print('Spark UI running on http://YOURIPADDRESS:' + sc.uiWebUrl.split(':')[2])


## Step 1 : House data

In [None]:
housePrices = spark.read.csv("/data/house-prices/house-sales-full.csv", header=True, inferSchema=True)
housePrices.show()

## Step 2: Apply an R formula for Feature Extraction

R users will be familiar with the concept of the **formula**. The formula syntax has a lot of features, but in its most basic form it looks like this:

```
 y-variable ~ x-variable1 + x-variable2 + ...
```

Basically, the y variable is the variable we are trying to predict, while the x variables are the ones we use to make the prediction. There are some complexities, but that's the gist of it.

In the process, `RFormula` will index string columns and convert categorical variables using one-hot encoding. Remember, features in Spark are only allowed to be numeric (doubles). NAs are also forbidden, so those are converted as well.
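To make the encoding step concrete, here is a minimal plain-Python sketch of string indexing followed by one-hot encoding. It is only a conceptual illustration: the helper names `string_index` and `one_hot` and the sample values are hypothetical, and the exact category ordering and dropped level that Spark chooses may differ.

```python
from collections import Counter

def string_index(values):
    """Map each distinct string to an integer index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: idx for idx, value in enumerate(ordered)}

def one_hot(values, drop_last=True):
    """Turn a list of strings into rows of 0.0/1.0 indicator columns."""
    index = string_index(values)
    # Dropping one level avoids a redundant column (it is implied by the others).
    width = len(index) - 1 if drop_last else len(index)
    rows = []
    for value in values:
        row = [0.0] * width
        pos = index[value]
        if pos < width:  # the dropped level is encoded as all zeros
            row[pos] = 1.0
        rows.append(row)
    return rows

# Hypothetical PropertyType values, for illustration only.
property_type = ["Single Family", "Townhouse", "Single Family", "Multiplex"]
encoded = one_hot(property_type)
for original, row in zip(property_type, encoded):
    print(original, row)
```

Each string column therefore expands into several numeric columns, which is why the fitted model can report more coefficients than there are variables in the formula.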

**=>TODO: instantiate the R formula with the formula text and features column = "features"**

In [None]:
#lm(SalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + Bedrooms + BldgGrade + PropertyType + NbrLivingUnits + SqFtFinBasement + YrBuilt + YrRenovated + NewConstruction,
#              data = house.prices, na.action = na.omit)
    

variables = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'Bedrooms', 'BldgGrade', 'PropertyType',
               'NbrLivingUnits', 'SqFtFinBasement', 'YrBuilt', 'YrRenovated', 'NewConstruction']

textFormula = "SalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + Bedrooms + BldgGrade + PropertyType + \
               NbrLivingUnits + SqFtFinBasement + YrBuilt + YrRenovated + NewConstruction"

formula = RFormula(
    formula=????,
    featuresCol="????",
    labelCol="label")    


featureVector = formula.fit(housePrices).transform(housePrices)
featureVector.select("features", "label").show()


## Step 3: Run MLR With All Attributes

In [None]:
glr = GeneralizedLinearRegression(family="gaussian", link="identity", maxIter=10, regParam=0.3)
lrModel = glr.fit(featureVector)

print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

In [None]:

# Summarize the model over the training set and print out some metrics
summary = lrModel.summary
print("Coefficient Standard Errors: " + str(summary.coefficientStandardErrors))
print("T Values: " + str(summary.tValues))
print("P Values: " + str(summary.pValues))
print("Dispersion: " + str(summary.dispersion))
print("Null Deviance: " + str(summary.nullDeviance))
print("Residual Degree Of Freedom Null: " + str(summary.residualDegreeOfFreedomNull))
print("Deviance: " + str(summary.deviance))
print("Residual Degree Of Freedom: " + str(summary.residualDegreeOfFreedom))
print("AIC: " + str(summary.aic))
print("Deviance Residuals: ")
summary.residuals().show()

**Inspect the summary output**

**=>TODO: What can we say about our model in terms of evaluation?**

## Step 4: Run AIC calculation

We can do some variable selection here. In general, a lower AIC is better: AIC rewards goodness of fit but penalizes every extra parameter. By removing variables that contribute little, we may get a lower AIC and therefore a better model.
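As a rough illustration of that trade-off, the sketch below computes the textbook AIC, $2p - 2\ln L$, for a Gaussian linear model from its residual sum of squares. The function name `gaussian_aic` and all numbers are hypothetical, and Spark's `summary.aic` may use different constants, so only differences between models fitted on the same data are meaningful.

```python
import math

def gaussian_aic(rss, n, num_coefficients):
    """Textbook AIC = 2p - 2 ln(L) for a Gaussian linear model.

    rss: residual sum of squares; n: number of rows;
    num_coefficients: fitted coefficients including the intercept.
    """
    sigma2 = rss / n  # maximum-likelihood estimate of the error variance
    log_likelihood = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    p = num_coefficients + 1  # +1 for the estimated variance
    return 2 * p - 2 * log_likelihood

# Hypothetical numbers, for illustration only (not results from this lab).
aic_full = gaussian_aic(rss=250.0, n=100, num_coefficients=12)
aic_small = gaussian_aic(rss=251.0, n=100, num_coefficients=9)
print("full model AIC: ", aic_full)
print("small model AIC:", aic_small)
```

In this made-up case, dropping three coefficients barely increases the residual error, so the smaller model ends up with the lower AIC.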

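But how can we do this? Let's programmatically generate model combinations, and then run them. Choosing 8 variables out of 11 gives 165 combinations; the loop below also tries subsets of 9 and 10 variables (55 and 11 more), for 231 candidate models in total, with the full 11-variable model from the previous step as the baseline.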
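The number of candidate models can be checked without Spark. This small sketch counts the subsets for the sizes that `range(8, len(variables))` iterates over (8, 9, and 10), once with `math.comb` and once by enumerating them with `itertools.combinations`; the names `subset_counts`, `total`, and `enumerated` are illustrative only.

```python
import itertools
import math

variables = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'Bedrooms', 'BldgGrade', 'PropertyType',
             'NbrLivingUnits', 'SqFtFinBasement', 'YrBuilt', 'YrRenovated', 'NewConstruction']

# Closed-form count of subsets for each size k = 8, 9, 10.
subset_counts = {k: math.comb(len(variables), k) for k in range(8, len(variables))}
total = sum(subset_counts.values())

# Cross-check by actually enumerating the subsets.
enumerated = sum(1 for k in range(8, len(variables))
                 for _ in itertools.combinations(variables, k))

print(subset_counts, total, enumerated)
```

Each candidate requires a full model fit, so the count gives a feel for how long the search will take.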

**=>TODO: Run a loop over all combinations of 8 variables or more. Watch the output. Which is optimal?**

In [None]:
import itertools

def formulaGen(xvars, yvar):
    """Build an R-style formula string such as 'y ~ x1 + x2 + x3'."""
    return yvar + " ~ " + " + ".join(xvars)

min_aic = summary.aic
min_model = lrModel
min_formula = textFormula

# Try every subset of 8, 9, or 10 variables; the full 11-variable model is the baseline above
for L in range(8, len(variables)):
    for subset in itertools.combinations(variables, L):
        this_formula = formulaGen(subset, 'SalePrice')
        formula = RFormula(
            formula=this_formula,
            featuresCol="features",
            labelCol="label")
        featureVector_iter = formula.fit(housePrices).transform(housePrices)
        lr_iter = glr.fit(featureVector_iter)
        iter_aic = lr_iter.summary.aic
        # Keep this model only if it beats the best AIC seen so far
        if iter_aic < min_aic:
            min_aic = iter_aic
            min_model = lr_iter
            min_formula = this_formula
            print("New Lowest AIC found:" + str(min_aic))

print(min_formula)
print("AIC:" + str(min_aic))
# Summarize the model over the training set and print out some metrics
summary = min_model.summary
print("Coefficient Standard Errors: " + str(summary.coefficientStandardErrors))
print("T Values: " + str(summary.tValues))
print("P Values: " + str(summary.pValues))
print("Dispersion: " + str(summary.dispersion))
print("Null Deviance: " + str(summary.nullDeviance))
print("Residual Degree Of Freedom Null: " + str(summary.residualDegreeOfFreedomNull))
print("Deviance: " + str(summary.deviance))
print("Residual Degree Of Freedom: " + str(summary.residualDegreeOfFreedom))
print("AIC: " + str(summary.aic))
print("Deviance Residuals: ")
summary.residuals().show()



**Observe the formula: which attributes are included, and which are dropped?**
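One way to list the dropped attributes is to split the winning formula back into its terms and compare them with the full variable list. The sketch below assumes the simple `y ~ a + b + c` shape produced in this lab; `formula_terms` is a hypothetical helper, and `example_formula` is made up for illustration and is not a result of the search.

```python
def formula_terms(formula):
    """Split 'y ~ a + b + c' into its label and its list of predictor names."""
    label, rhs = formula.split("~")
    return label.strip(), [term.strip() for term in rhs.split("+")]

variables = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'Bedrooms', 'BldgGrade', 'PropertyType',
             'NbrLivingUnits', 'SqFtFinBasement', 'YrBuilt', 'YrRenovated', 'NewConstruction']

# Hypothetical formula, for illustration only (not the lab's actual answer).
example_formula = ("SalePrice ~ SqFtTotLiving + Bathrooms + Bedrooms + BldgGrade + "
                   "PropertyType + SqFtFinBasement + YrBuilt + YrRenovated + NewConstruction")

label, kept = formula_terms(example_formula)
dropped = [v for v in variables if v not in kept]

print("Label:  ", label)
print("Kept:   ", kept)
print("Dropped:", dropped)
```

In the notebook, passing `min_formula` instead of `example_formula` would report the attributes the search actually removed.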