# Homework 3

## Pair Programming Group Number: FILL IN HERE
## Members of Team: FILL IN HERE

## Feature engineering and linear regression

For this week's homework we are going to load a data set that isn't the cleanest, repair it, add a feature, analyze the features, build a linear regression model, and use that model to estimate numeric values. Is linear regression _really_ machine learning? It depends on who you ask, but it is definitely an important tool for data analytics.

In [2]:
# only use these libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Load in the melb_data_sold_train.csv file here
df = pd.read_csv('melb_data_sold_train.csv')

## Q1 Fix the dataframe to remove any blanks
The linear regression needs all attribute and dependent values to be defined. Use list-wise deletion to remove entries with missing values. Save the modified dataframe, with the indices reset to run from $0$ to $\text{length}-1$, into the variable `df1` for use in a later problem.
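
As a small illustration of list-wise deletion, here is a sketch on a made-up three-row frame (the column values are hypothetical, not taken from the Melbourne data). It also shows the two behaviours of `reset_index`: with `drop=True` the old row labels are discarded, and without it they are kept in a new column named `index`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only
toy = pd.DataFrame({"Rooms": [3, 2, 4], "Price": [500000.0, np.nan, 750000.0]})

# List-wise deletion: drop every row that has at least one missing value
kept = toy.dropna()

# Renumber the rows 0..len-1 and discard the old labels
renumbered = kept.reset_index(drop=True)

# Without drop=True the old labels survive in a new column named "index"
with_old_labels = kept.reset_index()

print(renumbered)
print(list(with_old_labels.columns))
```

If the old labels are kept, that extra `index` column is numeric, so it has to be excluded again before building a feature matrix.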

In [None]:
df1 = df.dropna().reset_index()

## Q2 Add a new feature
Toorak is known as one of the priciest suburbs in Melbourne. Create a new column in your dataframe that holds the distance in kilometers from the center of Toorak to the latitude/longitude of that row. Use the latitude/longitude $(-37.841820, 145.015986)$ for the center of Toorak. You may assume the Earth is spherical with a radius of $6371.0088$ km. To check your function: the property located at $(-37.68178, 144.73779)$ is approximately 30.2 km away.
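
For reference, here is a standalone sketch of the haversine calculation in its arcsine form, checked against the 30.2 km figure quoted above. The function name `great_circle_km` and the constant name are hypothetical, chosen only for this illustration. Because it uses NumPy ufuncs throughout, the same function accepts scalars or whole arrays of coordinates.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # radius given in the problem statement

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km; inputs are in degrees (scalars or arrays)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    # haversine of the central angle between the two points
    h = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # central angle = 2 * arcsin(sqrt(h)); multiply by the radius for km
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))

# Toorak center vs. the check property from the problem statement
d = great_circle_km(-37.841820, 145.015986, -37.68178, 144.73779)
print(float(d))

# The same call works on arrays of coordinates
lats = np.array([-37.68178, -37.841820])
lons = np.array([144.73779, 145.015986])
d_many = great_circle_km(-37.841820, 145.015986, lats, lons)
print(d_many)
```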

In [1]:
# Step 1: Define the Haversine distance as a function,
# using the Haversine formula found at https://en.wikipedia.org/wiki/Haversine_formula

def hav(theta):
    # haversine of an angle given in radians
    return (np.sin(theta/2))**2

# assumes that pt1 and pt2 are [lat,long] np arrays that contain the locations of the 2 earth coordinates in deg
def haversine_distance(pt1,pt2):
    # convert degrees to radians before the calculation
    lat1 = np.radians(pt1[0])
    lat2 = np.radians(pt2[0])
    lon1 = np.radians(pt1[1])
    lon2 = np.radians(pt2[1])
    havTheta = hav(lat2-lat1)+np.cos(lat1)*np.cos(lat2)*hav(lon2-lon1)
    # hav(theta) = (1-cos(theta))/2, so theta = arccos(1-2*hav(theta))
    d_r = np.arccos(1-havTheta*2)
    # central angle in radians times the Earth radius in km
    return d_r*6371.0088

In [None]:
# A quick check to see if we are getting the expected value
toorak_pt = np.array([-37.841820, 145.015986])
haversine_distance(toorak_pt,[-37.68178,144.73779])


In [None]:
# Step 2: Add a new column to `df1` called 'distance_to_toorak' that uses the haversine_distance function 
# to calculate the distance to Toorak for every row in our dataframe. Save the new dataframe as `df2`
# toorak_pt = np.array([-37.841820, 145.015986])
df2 = df1.assign(distance_to_toorak = haversine_distance(toorak_pt, np.array([df1["Lattitude"], df1["Longtitude"]])))

df2

## Q3 Create a one hot encoding for the categorical column 'Type'
Modify the data frame `df2` such that it removes the column for `Type` and replaces it with the appropriate number of columns for a one-hot encoding of the column `Type` and save that dataframe as `df3` for use in a later problem. The pandas method [get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) will be very useful here. 
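
To see what a one-hot encoding looks like, here is a minimal sketch on a hypothetical three-row frame. It passes the whole frame to `get_dummies` with `columns=["Type"]`, which removes `Type` and appends one indicator column per category (prefixed `Type_`). The `dtype=int` argument is there because newer pandas versions default to boolean indicator columns, which a later numeric-only column selection would skip.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only
toy = pd.DataFrame({"Rooms": [3, 2, 4], "Type": ["h", "u", "t"]})

# Replace `Type` with one 0/1 column per category
encoded = pd.get_dummies(toy, columns=["Type"], dtype=int)

print(encoded)
```

Each row has exactly one `1` across the indicator columns. Note that this variant appends the new columns at the end rather than inserting them where `Type` used to be.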

In [None]:
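ind = df2.columns.get_loc("Type")
df3 = df2.drop(columns="Type")

# dtype=int keeps the indicator columns numeric (newer pandas versions default to bool)
ohe = pd.get_dummies(df2["Type"], dtype=int)

# insert the indicator columns where `Type` used to be
for i, col in enumerate(ohe.columns):
    df3.insert(ind+i, col, ohe[col])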
    
df3

## Q4 Calculate the pairwise correlations between all of your numeric attributes
Use the Pearson correlation as discussed in the lectures to calculate the pairwise correlations between the attributes in the dataframe `df3`. Read the documentation for [corr](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html). 
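
As a sanity check on what `corr` computes, the sketch below evaluates the Pearson coefficient by hand (covariance divided by the product of the standard deviations) on hypothetical toy data and compares it with the pandas result. The helper name `pearson` is made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical, nearly linear toy data
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.1, 5.9, 8.2]})

def pearson(a, b):
    """Pearson correlation of two equal-length 1-D sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da = a - a.mean()
    db = b - b.mean()
    return float((da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum()))

r_manual = pearson(toy["x"], toy["y"])
r_pandas = toy.corr(method="pearson").loc["x", "y"]

print(r_manual, r_pandas)
```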

In [None]:
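# restrict to the numeric columns; newer pandas versions raise an error on text columns
pCorr = df3.select_dtypes(include=[np.number]).corr(method="pearson")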
pCorr

## Q5 Create a linear regression model to predict home values
Using the math in ESLII, Section 3.2, equation (3.6), calculate $\hat{\beta} = (X^T X)^{-1} X^T y$.

We are going to create a linear regression model using our numeric attribute columns in `df3`, with the home values (`Price` column) as the value we are trying to predict. You may use numpy to do matrix calculations, but you may not use a built-in regression library (for example, you may not use scikit-learn).

The features you use to build the matrix $X$ should all be numeric and include the distance to Toorak and the one hot encodings. 
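
To make the least-squares step concrete, here is a sketch on synthetic data (everything in it is made up for illustration). It solves the normal equations behind (3.6) with `np.linalg.solve` instead of forming an explicit inverse, and compares the result with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Synthetic design matrix: an intercept column plus two random features
features = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), features])

# Synthetic targets from known coefficients plus a little noise
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + 0.01 * rng.normal(size=n)

# Normal equations: solve (X^T X) beta = X^T y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# The same fit via an SVD-based least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)
```

One caveat, offered as an observation rather than a requirement of the assignment: if $X$ contains an intercept column together with a full set of one-hot columns, those columns are linearly dependent, so $X^TX$ is singular. A direct solve can then fail or be numerically unstable, whereas `lstsq` still returns a minimum-norm solution.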

In [None]:
# Step one, build the matrix X
numericCols = df3.select_dtypes(include=[np.number])
numericCols.insert(0,"intercept",1)
# drop the target, the old index, and the postcode/latitude/longitude columns
X = numericCols.drop(columns=["Price","index","Postcode","Lattitude","Longtitude"])
print(X)

In [None]:
# Step two, build the column vector y
# reshape to an (n,1) column vector
y = numericCols["Price"].to_numpy().reshape(-1,1)

In [None]:
# Step three, find beta hat per the formula (3.6) (you should use the library we used in class)
[beta_hat, residuals, rank, s] = np.linalg.lstsq(X, y, rcond=-1)
beta_hat

In [None]:
# Test the model to see if we get something "reasonable" - I picked row 23 at random
np.matmul(X.iloc[23],beta_hat) 

In [None]:
# This is the actual price at this point
y[23]

## Q6 Apply the linear regression model to the test data and visualize the error
We will cover other methods of evaluating any sort of prediction later, but for this week's exercise I have partitioned the data into two files.  Load the melb_data_sold_test.csv data set and use the $\hat{\beta}$ you calculated in the last step to predict the housing prices for data in melb_data_sold_test.  Create a visualization that shows the absolute error in your predictions. Remember to do all your data pre-processing on the data loaded from the melb_data_sold_test file before you apply beta_hat.  For the visualization, a histogram of the absolute error vs the total housing prices is sufficient.  Use [hist](https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.hist.html) for reference. 

While doing imputation, there are some helpful parameters in [fillna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html). 
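
As an illustration of mean imputation, the sketch below fills the missing numeric values of a hypothetical toy frame in one call. It relies on `fillna` accepting a Series of per-column fill values, matched by column name. Text columns are left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing number and one missing string
toy = pd.DataFrame({"Car": [1.0, np.nan, 3.0], "Suburb": ["Kew", None, "Jacana"]})

# Pick out the numeric columns only
numeric_cols = toy.select_dtypes(include=np.number).columns

# Fill each numeric column with its own mean; fillna matches the Series by column name
filled = toy.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

print(filled)
```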

In [7]:
# Step zero, load the melb_data_sold_test.csv data for testing.  Use Imputation to fill in any missing numeric values
# We use imputation here instead of deletion since we want a prediction for _every_ row in the test file.

df_sold_test = pd.read_csv('melb_data_sold_test.csv')

for val in df_sold_test.select_dtypes(include=np.number).columns:
    df_sold_test[val] = df_sold_test[val].fillna(value=df_sold_test[val].mean())

Rooms
Price
Postcode
Bedroom2
Bathroom
Car
Landsize
BuildingArea
YearBuilt
Lattitude
Longtitude


In [8]:
# Step one, add the new attribute for the 'distance_to_toorak' and the one hot encoding to the new data frame
toorak_pt = np.array([-37.841820, 145.015986])
df_sold_test["distance_to_toorak"] = haversine_distance(toorak_pt, np.array([df_sold_test["Lattitude"], df_sold_test["Longtitude"]]))
df_sold_test

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Date,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,distance_to_toorak
0,Jacana,29 Fox Ct,3,h,620500.0,29/07/2017,3047.0,3.0,1.0,4.0,764.0,244.154731,1968.796396,Hume,-37.68907,144.91459,Northern Metropolitan,19.181567
1,Keilor Park,45 Collinson St,3,h,750000.0,29/07/2017,3042.0,3.0,2.0,2.0,761.0,222.000000,1980.000000,Brimbank,-37.72224,144.85739,Western Metropolitan,19.263068
2,Kensington,42 Gower St,3,h,1060000.0,29/07/2017,3031.0,3.0,1.0,0.0,190.0,244.154731,1968.796396,Melbourne,-37.79560,144.92779,Northern Metropolitan,9.296810
3,Kew,6/385 Barkers Rd,3,t,1405000.0,29/07/2017,3101.0,3.0,2.0,2.0,325.0,129.000000,1980.000000,Boroondara,-37.81614,145.05056,Southern Metropolitan,4.168250
4,Kew,11 Raheen Dr,4,h,3015000.0,29/07/2017,3101.0,4.0,2.0,2.0,813.0,276.000000,1970.000000,Boroondara,-37.80437,145.01725,Southern Metropolitan,4.165735
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1016,Wantirna South,15 Mara Cl,4,h,1330000.0,26/08/2017,3152.0,4.0,2.0,2.0,717.0,191.000000,1980.000000,,-37.86887,145.22116,Eastern Metropolitan,18.262739
1017,Werribee,5 Nuragi Ct,4,h,635000.0,26/08/2017,3030.0,4.0,2.0,1.0,662.0,172.000000,1980.000000,,-37.89327,144.64789,Western Metropolitan,32.814349
1018,Westmeadows,9 Black St,3,h,582000.0,26/08/2017,3049.0,3.0,2.0,2.0,256.0,244.154731,1968.796396,,-37.67917,144.89390,Northern Metropolitan,21.030518
1019,Wheelers Hill,12 Strada Cr,4,h,1245000.0,26/08/2017,3150.0,4.0,2.0,2.0,652.0,244.154731,1981.000000,,-37.90562,145.16761,South-Eastern Metropolitan,15.081333


In [None]:
# Step two, build the attribute matrix Xdot 
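
One thing to watch when building the test matrix: its columns must appear in the same order as the columns used to fit $\hat{\beta}$, and a category that never occurs in the test file still needs its (all-zero) indicator column. Here is a sketch of one way to line things up with `reindex`, using a hypothetical training column list and a two-row toy test frame.

```python
import pandas as pd

# Hypothetical column order that was used when fitting beta_hat
train_cols = ["intercept", "Rooms", "h", "t", "u"]

# Toy test data in which the category "t" never appears
test = pd.DataFrame({"Rooms": [3, 2], "Type": ["h", "u"]})

# One-hot encode and add the intercept column
dummies = pd.get_dummies(test["Type"], dtype=int)
Xdot = pd.concat([test.drop(columns="Type"), dummies], axis=1)
Xdot.insert(0, "intercept", 1)

# Match the training layout; columns missing from the test data are filled with 0
Xdot = Xdot.reindex(columns=train_cols, fill_value=0)

print(Xdot)
```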


In [None]:
# Step three, multiply Xdot by Beta hat. DO NOT USE A LOOP.  This is a vector of predicted prices
# called y_hat in the notes

In [None]:
# Step four, calculate the error vector, |actual price - predicted price|. We call this our "absolute error"
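
For steps three and four, a loop-free prediction is a single matrix product. The sketch below uses tiny made-up arrays (not the housing data) to show the shapes involved.

```python
import numpy as np

# Hypothetical inputs: 3 rows, an intercept column plus one feature
Xdot = np.array([[1.0, 3.0],
                 [1.0, 4.0],
                 [1.0, 2.0]])
beta_hat = np.array([[100.0], [50.0]])           # shape (2, 1)
y_actual = np.array([[260.0], [290.0], [200.0]])  # shape (3, 1)

# All predictions at once: (3, 2) @ (2, 1) -> (3, 1)
y_hat = Xdot @ beta_hat

# Element-wise absolute error, same shape as y_hat
abs_error = np.abs(y_actual - y_hat)

print(y_hat.ravel())
print(abs_error.ravel())
```

Keep both vectors the same shape before subtracting: if one is `(n,)` and the other is `(n, 1)`, NumPy broadcasting silently produces an `(n, n)` matrix instead of a vector.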

In [None]:
# Step five, create a histogram of the absolute error, and on the same plot create a histogram of the actual price.  
# You should use the "alpha" parameter to make the graph on top slightly translucent 
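
Here is a sketch of two overlaid histograms on one set of axes, using synthetic numbers in place of the real prices and errors. The bin count and the `alpha` value are arbitrary choices for the illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for the real vectors
actual_price = rng.uniform(300_000, 2_000_000, size=200)
abs_error = np.abs(rng.normal(0, 150_000, size=200))

fig, ax = plt.subplots()

# Draw the actual prices first, then the errors on top, slightly translucent
ax.hist(actual_price, bins=30, label="actual price")
ax.hist(abs_error, bins=30, alpha=0.6, label="absolute error")

ax.set_xlabel("value (same units as Price)")
ax.set_ylabel("number of properties")
ax.legend()
```

In a notebook the figure renders inline at the end of the cell. Both histograms share the x-axis, so the plot shows how large the errors are relative to the prices themselves.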