<a href="https://colab.research.google.com/github/quinn-dougherty/DS-Unit-2-Sprint-2-Linear-Regression/blob/master/module1-OLS-regression/Copy_of_Linear_Regression_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# King County Housing Data - Linear Regression Assignment

Data for this assignment was obtained from Kaggle: <https://www.kaggle.com/harlfoxem/housesalesprediction>

Complete the challenges below to iteratively improve your home price estimates and practice implementing predictive linear regression models.

# Bivariate Regression

Pick the X variable that you think will be the most correlated with Y. 

Split your dataset with a 50-50 train-test split (50% of the data for training and 50% for testing).

Train a regression model using this single X and single Y variable. Once you have trained the model and obtained its coefficients, plot the points on a graph and draw your line of best fit over them.

Report your Root Mean Squared Error and R-Squared for this model.
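
Both metrics can be computed by hand from the test-set targets and the model's predictions. The sketch below is a minimal pure-Python illustration; the function names `rmse` and `r_squared` and the toy numbers are made up for this example. In the notebook itself, `mean_squared_error` and `r2_score` from `sklearn.metrics` do the same job.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values (hypothetical, not taken from the housing data)
y_true = [0.0, 0.0, 4.0, 4.0]
y_pred = [1.0, 1.0, 3.0, 3.0]

print(rmse(y_true, y_pred))       # 1.0
print(r_squared(y_true, y_pred))  # 0.75
```

RMSE is in the units of the target (dollars here), while R-squared is unitless, which is why the two are usually reported together.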



In [51]:
import pandas as pd
import numpy as np
import datetime as dt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import altair as alt
alt.data_transformers.enable('default', max_rows=None)

df = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/kc_house_data.csv')
pd.set_option('display.max_columns', 100)

df.date = pd.to_datetime(df.date)
df.date = df.date.map(dt.datetime.toordinal)

print(df.shape)
df.head()



(21613, 21)


| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 735519 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 735576 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 735654 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 735576 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 735647 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


In [0]:
class bivar_simple_regr: 
  def __init__(self, dat, y_feat):
    self.df = dat
    self.Y_feat = y_feat
    self.features = self.df.drop([self.Y_feat], axis=1).columns
    self.bivs = self.make_dict()
    self.best_feat = self.argmax_bivs()

  def make_model(self, feat): 
    '''fits Y_feat ~ feat on a 50-50 split and returns (beta, R2, RMSE), 
    with R2 and RMSE scored on the test half'''
    X = self.df[feat].values.reshape(-1,1)
    y = self.df[self.Y_feat]

    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                      test_size=0.5, 
                                                      random_state=42)
  
    model = LinearRegression().fit(X_train, y_train)

    Y_test_predict = model.predict(X_test)

    MSE = mean_squared_error(y_test, Y_test_predict)
    RMSE = np.sqrt(MSE)
    R2 = r2_score(y_test, Y_test_predict)
  
    # beta[0] is the intercept, beta[1] is the slope
    beta = {0: model.intercept_, 1: model.coef_[0]}
  
    return (beta, R2, RMSE)

  def make_dict(self): 
    return {feat: self.make_model(feat) for feat in self.features}

  def argmax_bivs(self): 
    '''takes the dict feature name -> (beta, R2, RMSE) and returns the feature name 
    corresponding to the biggest R2'''
    return max(self.bivs, key=lambda name: self.bivs[name][1])
  
  def report(self): 
    feat = self.best_feat
    
    Beta, R2, RMSE = self.bivs[feat]
  
    s1 = f'The best performing feature in predicting {self.Y_feat} with simple linear regression is {feat} at R2={R2:.3} and RMSE={RMSE:.3}\n\n'
  
    s2 = f'M = {Beta[1]:.3}\nb = {Beta[0]:.3}\nIn other words, {self.Y_feat}_hat = {Beta[1]:.3}*X + {Beta[0]:.3} '
  
    s3 = f'where X is the values of {feat} interpreted as a random variable.\n\n'
  
    return s1+s2+s3
  
  def plot_data(self, feat): 
    C = alt.Chart(self.df).mark_circle(fillOpacity=0.1).encode(x=feat, y=self.Y_feat)

    return C
  
  def plot_LBF(self, feat): 
    X = np.linspace(self.df[feat].min(), self.df[feat].max())
    
    Beta = self.bivs[feat][0]
    m = Beta[1]
    b = Beta[0]
       
    C = alt.Chart(pd.DataFrame.from_dict({'x': X, 'y': m * X + b})).mark_line().encode(x='x',y='y')
    return C
  
  def SHOW(self): 
    print(self.report())
    return self.plot_data(self.best_feat) + self.plot_LBF(self.best_feat)




In [63]:
bivar_price = bivar_simple_regr(df, 'price')

#bivar_price.SHOW()

#bivar_price.bivs

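# drop features whose single-variable R2 is below 0.1; 
# print the R2 of each feature that is kept, then how many were kept
droppem = []
kept = []
for name in bivar_price.bivs.keys(): 
  r2 = bivar_price.bivs[name][1]
  if r2 < 0.1: 
    droppem.append(name)
  else: 
    print(r2)
    kept.append(name)

print(len(kept))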



0.2727289287491844
0.49226256935760826
0.15740302402072848
0.4309667377739639
0.3702620965114556
0.10538844274155323
0.3384999778053297
7


In [62]:
from scipy.special import binom

# number of unordered pairs that can be drawn from 7 features
binom(7,2)

21.0

# Two-variable Multiple Regression

To ramp up slowly, pick a second X variable that you think will be the most correlated with Y. 

Split your dataset with a 50-50 train-test split (50% of the data for training and 50% for testing).

Train a regression model using these two X variables. Once you have trained the model and obtained its coefficients, plot the points on a graph and fit your **plane** of best fit to the graph.

Report your Root Mean Squared Error and R-squared for this model.
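
With two predictors the fitted model is a plane, `y_hat = b0 + b1*x1 + b2*x2`. The sketch below fits one by ordinary least squares with `numpy.linalg.lstsq` (swapped in here for `LinearRegression`, which solves the same least-squares problem) and then evaluates the plane on a grid of the kind a 3-D surface plot would need. The data are synthetic and noise-free, and the names `fit_plane` and `plane_grid` are illustrative assumptions rather than anything from the assignment.

```python
import numpy as np

def fit_plane(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2; returns array (b0, b1, b2)."""
    A = np.column_stack([np.ones_like(x1), x1, x2])  # design matrix with intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def plane_grid(coef, x1, x2, n=20):
    """Evaluate the fitted plane on an n-by-n grid spanning the observed ranges."""
    g1, g2 = np.meshgrid(np.linspace(x1.min(), x1.max(), n),
                         np.linspace(x2.min(), x2.max(), n))
    return g1, g2, coef[0] + coef[1] * g1 + coef[2] * g2

rng = np.random.default_rng(0)
x1 = rng.uniform(500, 4000, size=200)   # synthetic stand-in for one feature
x2 = rng.uniform(1, 13, size=200)       # synthetic stand-in for a second feature
y = 10_000 + 250 * x1 + 40_000 * x2     # points lying exactly on a known plane

coef = fit_plane(x1, x2, y)
g1, g2, gy = plane_grid(coef, x1, x2)   # arrays ready for a surface plot
print(np.round(coef, 2))
```

The grid arrays `g1`, `g2`, and `gy` can be passed to any 3-D plotting routine alongside a scatter of the raw points.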

In [0]:
# for a plane of best fit with 2 independent variables, we'll need to test each possible pair of features. 
# this amounts to choosing 2 columns from df.columns with no repetition, where order doesn't matter. 

from itertools import combinations
from scipy.special import binom
binom(len(df.drop(['price'], axis=1).columns), 2)

# 190 pairs: a bit too many. 

# a lot of the features weren't performing very well in the case of 1 independent variable. 
# let's assume that they won't contribute much in the case of 2 either. 

def droppin_em(dat=df): 
  feat_performance = bivar_simple_regr(dat, 'price').bivs

  droppem = []
  for name in feat_performance.keys(): 
    if feat_performance[name][1] < 0.1: 
      droppem.append(name)
  return droppem

droppem = droppin_em(df)
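
Once the weak features are dropped, the candidate pairs can be enumerated with `itertools.combinations`, which yields each unordered pair exactly once. The sketch below is hypothetical: the `kept` list is an illustrative stand-in for whichever columns are not in `droppem`, and each resulting pair would then be passed to a two-feature model.

```python
from itertools import combinations
from math import comb

# Illustrative stand-in: replace with the columns that are NOT in droppem.
kept = ['bathrooms', 'sqft_living', 'view', 'grade',
        'sqft_above', 'sqft_basement', 'sqft_living15']

# Unordered pairs with no repetition, in the order the names appear in `kept`
pairs = list(combinations(kept, 2))

print(len(pairs))   # 21
print(pairs[0])     # ('bathrooms', 'sqft_living')
```

With 7 candidates this gives 21 pairs, which is far more manageable than the 190 pairs over all 20 non-target columns.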


# Multiple Regression

Now using all available X variables, split your data into test and training datasets, train your model, obtain its coefficients, and report the Root Mean Squared Error and R-squared values.
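
As a rough guide to the workflow, the sketch below runs the same steps on synthetic data: split, fit, read off the coefficients, and score on the held-out half. Everything about the data (five random features, the `true_coef` values, the noise level) is an assumption made for the example. On the housing data, `X` would instead be `df.drop('price', axis=1)` and `y` would be `df['price']`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: 500 rows, 5 features, known coefficients plus small noise
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
true_coef = np.array([3.0, -2.0, 0.5, 0.0, 1.5])
y = 10.0 + X @ true_coef + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Score on the held-out half only
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print('intercept:', model.intercept_)
print('coefficients:', model.coef_)
print(f'RMSE = {rmse:.3f}, R2 = {r2:.3f}')
```

Because the coefficients are known in advance here, the fitted values can be checked against `true_coef`; with real data there is no such ground truth, so the test-set RMSE and R-squared are the main evidence of fit quality.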

In [0]:
##### Your Code Here #####

# Stretch Goals

Pick the stretch goals that you think will benefit you most.

- Explore the concept of $R^2$: learn how it is calculated and how it relates to covariance, correlation, and variance.
- Start to research Polynomial Regression and Log-Linear Regression (tomorrow's topics). Find a new regression dataset and try to implement one of these models. 
- Research "Feature Engineering" and see what features you can engineer on the above dataset. How much can you improve your accuracy with feature engineering?
- Further explore the concept of "Model Validation" - we'll spend a whole week on this soon. What other measures of model accuracy could we have used besides Root Mean Squared Error?
- Write a blog post explaining the basics of Linear Regression.
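
As a starting point for the first stretch goal, the standard definitions are sketched here as background. For targets $y_i$, predictions $\hat{y}_i$, and target mean $\bar{y}$:

$$
R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

For a simple linear regression with an intercept, scored on the same data it was fitted to, this equals the squared Pearson correlation between $x$ and $y$:

$$
R^2 = r_{xy}^2 = \left( \frac{\operatorname{Cov}(x, y)}{\sqrt{\operatorname{Var}(x)\,\operatorname{Var}(y)}} \right)^2
$$

On a held-out test set the identity need not hold exactly, and $R^2$ can be negative when the model predicts worse than the constant $\bar{y}$.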

Remember to share your findings in the Slack channel. :)
