
Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 3

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

Instead, predict property sales prices for **One Family Dwellings** (`BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'`) using a subset of the data where the **sale price was more than \$100 thousand and less than \$2 million.**

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


- [ ] Do train/test split. Use data from January through March 2019 to train. Use data from April 2019 to test.
- [ ] Do exploratory visualizations with Seaborn.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a linear regression model with multiple features.
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.
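
The first checklist item asks for a split by sale date rather than a random split. Here is a minimal sketch of one way to do that, assuming `SALE_DATE` parses with `pd.to_datetime`. The tiny `sales` frame and its values are made up for illustration and are not rows from the NYC file.

```python
import pandas as pd

# Hypothetical stand-in rows; the real notebook would use the loaded NYC frame
sales = pd.DataFrame({
    'SALE_DATE': ['01/15/2019', '02/20/2019', '03/31/2019', '04/01/2019', '04/28/2019'],
    'SALE_PRICE': [450_000, 620_000, 380_000, 510_000, 700_000],
})
sales['SALE_DATE'] = pd.to_datetime(sales['SALE_DATE'], format='%m/%d/%Y')

# Train on January through March 2019, test on April 2019
april_start = pd.Timestamp('2019-04-01')
may_start = pd.Timestamp('2019-05-01')
train = sales[sales['SALE_DATE'] < april_start]
test = sales[(sales['SALE_DATE'] >= april_start) & (sales['SALE_DATE'] < may_start)]

print(len(train), len(test))
```

Splitting on the date column before any scaling or encoding keeps April rows out of every fitted preprocessing step.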


## Stretch Goals
- [ ] Add your own stretch goal(s)!
- [ ] Try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html) instead of Linear Regression, especially if your errors blow up! Watch [Aaron Gallant's 9 minute video on Ridge Regression](https://www.youtube.com/watch?v=XK5jkedy17w) to learn more.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in a more inferential statistical approach to linear regression and feature selection, looking at p values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, presented in an accessible, readable way (without an excessive number of formulas or academic prerequisites).
(That book is good regardless of whether your cultural worldview is inferential statistics or predictive machine learning.)
- [ ] Read Leo Breiman's paper, ["Statistical Modeling: The Two Cultures"](https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726)
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html):

> Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:

> - **Convenience and encapsulation.** You only have to call fit and predict once on your data to fit a whole sequence of estimators.
> - **Joint parameter selection.** You can grid search over parameters of all estimators in the pipeline at once.
> - **Safety.** Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
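
The three points quoted above can be sketched with a small pipeline that scales, selects features, and fits a ridge model, then searches `k` and `alpha` together. This is only an illustration: the data is synthetic (from `make_regression`), and the step names `scale`, `select`, and `ridge` and the grid values are arbitrary choices, not part of the assignment.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for an encoded feature matrix
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=10.0, random_state=42)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(score_func=f_regression)),
    ('ridge', Ridge()),
])

# Search k and alpha jointly; within each CV fold the scaler and selector
# are refit on that fold's training rows only
param_grid = {
    'select__k': [2, 4, 6, 8, 10],
    'ridge__alpha': [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_absolute_error')
search.fit(X, y)

print(search.best_params_)
```

Because the whole pipeline is the estimator passed to `GridSearchCV`, a single `fit` call covers every step, and `search.predict` applies the same fitted scaler and selector to new rows.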

In [0]:
%%capture
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # category_encoders, version >= 2.0
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade category_encoders pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module3')

In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy as sp
import category_encoders as ce

from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.preprocessing import StandardScaler

import time
from math import sqrt

import pandas_profiling
import pprint
pp = pprint.PrettyPrinter(indent=4)

In [0]:
# Read New York City property sales data
odf = pd.read_csv('../data/condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
odf.columns = [col.replace(' ', '_') for col in odf]

# SALE_PRICE was read as strings.
# Remove symbols, convert to integer
odf['SALE_PRICE'] = (
    odf['SALE_PRICE']
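    # regex=False forces a literal match; read as a regex, '$' means end of string
    .str.replace('$', '', regex=False)
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)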
    .astype(int)
)

fam = (odf['BUILDING_CLASS_CATEGORY'] == '01 ONE FAMILY DWELLINGS')
price_range = (100_000<odf['SALE_PRICE']) & (odf['SALE_PRICE']<2_000_000)
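# Copy the filtered rows so the column assignments below don't write to a slice of the original frame
odf = odf[fam & price_range].copy()

odf['LAND_SQUARE_FEET'] = odf['LAND_SQUARE_FEET'].str.replace(',', '', regex=False).astype(int)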
odf['ZIP_CODE'] = odf['ZIP_CODE'].astype(str)

toss = ['APARTMENT_NUMBER', 'ZIP_CODE', 'EASE-MENT', 'TAX_CLASS_AT_TIME_OF_SALE', 'BUILDING_CLASS_CATEGORY', 'ADDRESS']
odf = odf.drop(toss, axis=1)

# Custom Functions

In [0]:
def timeit(method):
  '''
  Decorator to time how long a function runs
  '''
  def timed(*args, **kw):
      def _pretty(value):
          '''From seconds to Days:Hours:Minutes:Seconds'''

          # Break whole elapsed seconds into days, hours, minutes and seconds
          minutes, seconds = divmod(int(value), 60)
          hours, minutes = divmod(minutes, 60)
          days, hours = divmod(hours, 24)

          return f'{days}D:{hours}H:{minutes}M:{seconds}S'

      ts = time.time()
      result = method(*args, **kw)
      te = time.time()
      print(f'\n{method.__name__} took {_pretty(te-ts)}\n')
      return result
  return timed

In [0]:
@timeit
def calc_best_features(dataframe, target):
  '''
  Calculate the best features to use with linear regression.

  Parameters:
  dataframe - Your dataframe with target as a feature.
  target - What you want to predict.

  Return:
  Dictionary:
   - Different scoring metrics of the best model.
   - Number of features
   - The best features.
  '''

  kf = KFold(n_splits=10, random_state=42, shuffle=True)
  model = LinearRegression()

  features = [f for f in dataframe.columns if f not in target]
  stats = {
        'mse': np.inf,
    }
    
  for train_index, test_index in kf.split(dataframe[features]):
    train, test = dataframe.iloc[train_index], dataframe.iloc[test_index]

    for k in range(1, len(features)+1):
        
        selector = SelectKBest(score_func=f_regression, k=k)
        X_train_selected = selector.fit_transform(train[features], train[target])
        X_test_selected = selector.transform(test[features])
        
        model.fit(X_train_selected, train[target])
        y_pred = model.predict(X_test_selected)
        
        # Score this fold once and reuse the value
        fold_mse = mean_squared_error(test[target], y_pred)
        if fold_mse < stats['mse']:
          stats['mse'] = fold_mse
          stats['rmse'] = sqrt(fold_mse)
          stats['r2'] = r2_score(test[target], y_pred)
          stats['mae'] = mean_absolute_error(test[target], y_pred)
          stats['num_features'] = k

          selected_mask = selector.get_support()
          stats['features'] = list(train[features].columns[selected_mask])
          stats['exclude_features'] = list(train[features].columns[~selected_mask])

  return stats

In [0]:
def build_linear_model(dataframe, target):
  '''
  Builds a linear regression model. Metrics are calculated on a hold out set.

  Parameters:
  dataframe - Your dataframe with target as a feature.
  target - What you want to predict.

  Return:
  Dictionary:
   - Different scoring metrics of the best model.
   - The Model
  '''

  kf = KFold(n_splits=10, random_state=42, shuffle=True)
  model = LinearRegression()

  features = [f for f in dataframe.columns if f not in target]
  stats = {
        'mse': np.inf,
    }
  
  build_set, holdout_set = train_test_split(dataframe, test_size=0.1, shuffle=True)
    
  for train_index, test_index in kf.split(build_set[features]):
    train, test = build_set.iloc[train_index], build_set.iloc[test_index]

    # Fit a fresh estimator per fold; refitting one shared object would leave
    # best_model pointing at whichever fold was fit last
    model = LinearRegression()
    model.fit(train[features], train[target])
    y_pred = model.predict(test[features])

    fold_mse = mean_squared_error(test[target], y_pred)
    if fold_mse < stats['mse']:
      stats['mse'] = fold_mse
      best_model = model

      # stats['rmse'] = sqrt(mean_squared_error(test[target], y_pred))
      # stats['r2'] = r2_score(test[target], y_pred)
      # stats['mae'] = mean_absolute_error(test[target], y_pred)
      # stats['model'] = best_model

  holdout_pred = best_model.predict(holdout_set[features])

  stats['mse'] = mean_squared_error(holdout_set[target], holdout_pred)
  stats['rmse'] = sqrt(mean_squared_error(holdout_set[target], holdout_pred))
  stats['r2'] = r2_score(holdout_set[target], holdout_pred)
  stats['mae'] = mean_absolute_error(holdout_set[target], holdout_pred)
  stats['model'] = best_model

  return stats

In [0]:
def build_ridge_model(dataframe, target):
  '''
  Builds a ridge regression model. Metrics are calculated on a hold out set.

  Parameters:
  dataframe - Your dataframe with target as a feature.
  target - What you want to predict.

  Return:
  Dictionary:
   - Different scoring metrics of the best model.
   - alpha
   - The Model
  '''

  features = [f for f in dataframe.columns if f not in target]
  stats = {
        'mse': np.inf,
    }
  
  build_set, holdout_set = train_test_split(dataframe, test_size=0.1, random_state=42, shuffle=True)

  ridge = RidgeCV(alphas = [i/10 for i in range(1, 20, 1)], cv = 10)
  ridge.fit(build_set[features], build_set[target])

  y_pred = ridge.predict(holdout_set[features])

  stats['mse'] = mean_squared_error(holdout_set[target], y_pred)
  stats['rmse'] = sqrt(mean_squared_error(holdout_set[target], y_pred))
  stats['r2'] = r2_score(holdout_set[target], y_pred)
  stats['mae'] = mean_absolute_error(holdout_set[target], y_pred)
  stats['model'] = ridge
  stats['alpha'] = ridge.alpha_

  return stats

In [0]:
df = odf

# Preprocessing

In [0]:
sale_dates = pd.to_datetime(df['SALE_DATE'])
# Days between each sale and the most recent sale in the data
df['days_old'] = (sale_dates.max() - sale_dates).dt.days
df = df.drop('SALE_DATE', axis=1)

In [0]:
temp = df
scaler = StandardScaler()
to_scale = list(set(list(temp.select_dtypes(exclude=['object']))) - set(['SALE_PRICE']))
temp[to_scale] = scaler.fit_transform(temp[to_scale])
scaled = temp

In [0]:
le = ce.OneHotEncoder()
to_encode = list(scaled.select_dtypes(include=['object']))
encoded = le.fit_transform(scaled[to_encode])
encoded_df = pd.concat([scaled, encoded], axis=1).drop(to_encode, axis = 1)

# Build Model

In [37]:
mean_squared_error([encoded_df['SALE_PRICE'].mean() for _ in range(0,len(encoded_df))], encoded_df['SALE_PRICE'])

85816118571.50015
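
The value above is the MSE of always predicting the mean sale price. Since the assignment asks for mean absolute error, the same baseline can be sketched with scikit-learn's `DummyRegressor`, which also reports MAE in dollars. The prices below are made up for illustration and are not from the NYC data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up prices for illustration only
y = np.array([300_000, 450_000, 500_000, 750_000])
X = np.zeros((len(y), 1))  # the mean strategy ignores the features

# Always predicts the mean of the y it was fit on
baseline = DummyRegressor(strategy='mean').fit(X, y)
y_pred = baseline.predict(X)

baseline_mse = mean_squared_error(y, y_pred)
baseline_mae = mean_absolute_error(y, y_pred)
print(baseline_mse, baseline_mae)
```

Fitting the dummy model on the training rows and scoring it on the held-out rows gives a baseline that is directly comparable to the model metrics reported below.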

In [143]:
pp.pprint(build_linear_model(encoded_df, ['SALE_PRICE']))

{   'mae': 125652.99050632911,
    'model': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False),
    'mse': 32101293782.363926,
    'r2': 0.5300170024915367,
    'rmse': 179168.33922979786}


In [131]:
pp.pprint(build_ridge_model(encoded_df, ['SALE_PRICE']))

{   'alpha': 0.8,
    'mae': 133114.370822912,
    'model': RidgeCV(alphas=array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3,
       1.4, 1.5, 1.6, 1.7, 1.8, 1.9]),
        cv=10, fit_intercept=True, gcv_mode=None, normalize=False, scoring=None,
        store_cv_values=False),
    'mse': 38368434505.96907,
    'r2': 0.5395847582249651,
    'rmse': 195878.6218707112}
