<a href="https://colab.research.google.com/github/coding-dojo-data-science/data-viz-wk17-codealong/blob/main/WarmUps/DataViz_Week17_Lect01_WarmUp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🔥 DataViz: Week 17, Lecture 01 - Warm Up


- 01/24/22


## 🗺️ INSTRUCTIONS

- Run the code in this notebook to create and evaluate the Linear Regression model at the bottom.
  - `Runtime` menu > `Run All`
- Scroll down to the regression model results.
- Answer the questions below the model and be prepared to discuss them with the class.

[Click here to jump to the warm-up questions](#warmup)

Note: this notebook **does NOT follow best practices** (minimal EDA, removing outliers before train/test split, etc.). It is meant for discussion purposes.


# CODE 


### Updating Scikit-Learn to v1.1.3

In [None]:
## UPDATING SKLEARN ON COLAB
!pip install scikit-learn==1.1.3

from IPython.display import clear_output
clear_output()

import sklearn as sk
vers = !python --version
print(f"Python Vers: {vers[0]}")
print(f"Scikit-learn Vers: {sk.__version__}")

In [None]:
## Import Standard Packages
import os, sys
import pandas as pd
import numpy as np


import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
fav_style = ('ggplot','tableau-colorblind10')
fav_context  ={'context':'notebook', 'font_scale':1.1}
plt.style.use(fav_style)
sns.set_context(**fav_context)
plt.rcParams['savefig.transparent'] = False
plt.rcParams['savefig.bbox'] = 'tight'

In [None]:
## Preprocessing Imports ([ ] TO DO: Consider making preprocess_imports module)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import (make_column_transformer, make_column_selector, 
                             ColumnTransformer)
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn import metrics

from sklearn.linear_model import LinearRegression

##import statsmodels correctly
import statsmodels.api as sm
from scipy import stats

## fixing the random seed for reproducibility
SEED = 321
np.random.seed(SEED)

In [None]:
def evaluate_regression(model, X_train,y_train, X_test, y_test,as_frame=True): 
  """Evaluates a scikit-learn regression model using R-squared, MAE, and RMSE.
  Displays the results as a DataFrame if as_frame is True (default);
  otherwise prints them.
  """
  ## Training Data
  y_pred_train = model.predict(X_train)
  r2_train = metrics.r2_score(y_train, y_pred_train)
  rmse_train = metrics.mean_squared_error(y_train, y_pred_train, 
                                          squared=False)
  mae_train = metrics.mean_absolute_error(y_train, y_pred_train)


  ## Test Data
  y_pred_test = model.predict(X_test)
  r2_test = metrics.r2_score(y_test, y_pred_test)
  rmse_test = metrics.mean_squared_error(y_test, y_pred_test, 
                                          squared=False)
  mae_test = metrics.mean_absolute_error(y_test, y_pred_test)

  ## if returning a dataframe:
  if as_frame:
      df_version =[['Split','R^2','MAE','RMSE']]
      df_version.append(['Train',r2_train, mae_train, rmse_train])
      df_version.append(['Test',r2_test, mae_test, rmse_test])
      df_results = pd.DataFrame(df_version[1:], columns=df_version[0])
      df_results = df_results.round(2)


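      # adapting hide_index for pd version (compare numerically, not as strings)
      pd_major_minor = tuple(int(part) for part in pd.__version__.split('.')[:2])
      if pd_major_minor < (1, 4):
        display(df_results.style.hide_index().format(precision=2, thousands=','))
      else:
        display(df_results.style.hide(axis='index').format(precision=2, thousands=','))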

  ## If not dataframe, just print results.    
  else: 
      print(f"Training Data:\tR^2 = {r2_train:,.2f}\tRMSE = {rmse_train:,.2f}\tMAE = {mae_train:,.2f}")
      print(f"Test Data:\tR^2 = {r2_test:,.2f}\tRMSE = {rmse_test:,.2f}\tMAE = {mae_test:,.2f}")



def plot_residuals(model,X_test_df, y_test,figsize=(12,5)):
  """Plots a Q-Q Plot and residual plot for a regression model.
  """
  ## Make predictions and calculate residuals
  y_pred = model.predict(X_test_df)
  resid = y_test - y_pred

  fig, axes = plt.subplots(ncols=2,figsize=figsize)

  ## Normality 
  sm.graphics.qqplot(resid, line='45',fit=True,ax=axes[0]);

  ## Homoscedasticity
  ax = axes[1]
  ax.scatter(y_pred, resid, edgecolor='white',lw=0.5)
  ax.axhline(0,zorder=0)
  ax.set(ylabel='Residuals',xlabel='Predicted Value');
  fig.suptitle("Residual Plots", y=1.01)
  plt.tight_layout()


## Load Data

In [None]:
## Load in data
FILE = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSEZQEzxja7Hmj5tr5nc52QqBvFQdCAGb52e1FRK1PDT2_TQrS6rY_TR9tjZjKaMbCy1m5217sVmI5q/pub?output=csv"
df = pd.read_csv(FILE)

# setting index & dropping 
# df = df.set_index('id')
use_cols = ['bedrooms','bathrooms','sqft_living','yr_built','waterfront',
            'floors','price']
df = df[use_cols].copy()
df.info()
df.head()


## EDA

> This notebook does NOT follow best practices (minimal EDA, removing outliers before train/test split, etc.)

In [None]:
df.isna().sum()

In [None]:
df.describe()

### Checking the Target

In [None]:
## Plotting histogram and boxplot together
target = 'price'
grid_spec = {'height_ratios':[0.7,0.3]}

fig, axes = plt.subplots(nrows=2, figsize=(10,6), 
                         gridspec_kw=grid_spec)
axes[0].set_title(f"Distribution of {target}")

sns.histplot(data=df, x=target, ax=axes[0])
sns.boxplot(data=df, x=target, ax=axes[1])

- This dataset is known to be a tricky regression problem unless the assumptions of linear regression are addressed.
- As a minimum viable product (M.V.P.) approach, we remove outliers from the target only.

#### Removing Outliers from Target

In [None]:
## find outliers 
idx_outliers = np.abs(stats.zscore(df[target]) )>3
idx_outliers.sum()
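
The z-score filter above can be sketched on made-up numbers. The prices below are synthetic and purely illustrative (they are not from the warm-up dataset). The sketch assumes a population standard deviation, since `np.std` with its default `ddof=0` matches the default behavior of `scipy.stats.zscore`.

```python
import numpy as np

# Synthetic, illustrative prices (NOT from the warm-up dataset):
# 200 evenly spaced "typical" prices plus one extreme value.
prices = np.append(np.linspace(300_000, 700_000, 200), 5_000_000)

# z-score = (value - mean) / standard deviation (population std, ddof=0)
z_scores = (prices - prices.mean()) / prices.std()

# Flag any value more than 3 standard deviations from the mean
is_outlier = np.abs(z_scores) > 3

print(f"Outliers flagged: {is_outlier.sum()} of {len(prices)}")

# Keep only the rows that were not flagged
prices_kept = prices[~is_outlier]
```

Note that the mean and standard deviation are computed from every row. In this notebook that happens before the train/test split, which is one of the shortcuts flagged in the note at the top.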

In [None]:
# saving non-outliers
df = df[~idx_outliers].copy()
df

In [None]:
# visualizing final target

fig, axes = plt.subplots(nrows=2, figsize=(10,6), 
                         gridspec_kw=grid_spec)
axes[0].set_title(f"Distribution of {target} (Outliers Removed)")

sns.histplot(data=df, x=target, ax=axes[0])
sns.boxplot(data=df, x=target, ax=axes[1])

### EDA: Features vs Target

In [None]:
fig_pairplot = sns.pairplot(df,y_vars='price');
fig_pairplot

In [None]:
fig_corr,ax = plt.subplots(figsize=(8,7))
sns.heatmap(df.corr(), annot=True, fmt= ".2g",cmap='coolwarm', ax=ax);
ax.set_title("Correlation Heatmap");

## Preprocessing

In [None]:
## define X/y and train-test-split
target='price'
y = df[target].copy()
X = df.drop(columns=target).copy()

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=SEED)


## make categorical preprocessing pipeline
cat_sel = make_column_selector(dtype_include='object')

cat_pipe = make_pipeline( SimpleImputer(strategy='constant', 
                                        fill_value='MISSING'),
                         OneHotEncoder(drop='first',
                                       sparse=False)
                        )


## make numeric preprocessing pipeline
num_sel = make_column_selector(dtype_include='number')

num_pipe = make_pipeline( SimpleImputer(strategy='mean'))

>- Note: So far, all of the code should be familiar to you.
    - With sklearn v1.1+, add `verbose_feature_names_out=False` to column transformers (`ColumnTransformer` / `make_column_transformer`) so that the output feature names keep the original column names instead of being prefixed with the transformer name.
        - (If you want to see what the verbose version looks like, feel free to give it a try!)

In [None]:
## make the preprocessing column transformer
preprocessor = make_column_transformer( (num_pipe, num_sel),
                                        (cat_pipe,cat_sel),                                      
                                       verbose_feature_names_out=False)
preprocessor

In [None]:

## Fit preprocessor and get feature names
preprocessor.fit(X_train)
feature_names = preprocessor.get_feature_names_out()
feature_names

In [None]:
### PREP ALL X VARS AS DATAFRAMES
## Prepare X_train_df (the preprocessor was already fit in the previous cell)
X_train_df = pd.DataFrame( preprocessor.transform(X_train), 
                          columns = feature_names,
                         index = X_train.index)

## Prepare X_test_df
X_test_df = pd.DataFrame( preprocessor.transform(X_test),
                          columns = feature_names,
                         index=X_test.index)
X_train_df

<a id="warmup"></a>

<a name="warmup"></a>
# 🔥 **Warm-Up Questions** 


## Linear Regression

In [None]:
## Features Used
X_train_df.head(3)

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X_train_df,y_train)

## Questions to Answer - Part 1: Model Metrics

In [None]:
## evaluate model
evaluate_regression(lin_reg, X_train_df, y_train, 
                    X_test_df, y_test)



> Use the cells below to answer the following questions:

- **Q1: Does this model have a good MAE?**
  - ...
- **Q2: Does this model have a good R-squared?**
  - ...

- **Q3: Is this model overfit, underfit, or neither?**
  - ...
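
As a reminder of what the three metrics in the table measure, here is a small worked sketch that computes them by hand. The numbers are made up for illustration and have nothing to do with the model's actual results.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 380.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))                    # average absolute error, in target units
rmse = np.sqrt(np.mean(errors ** 2))             # penalizes large errors more than MAE
ss_res = np.sum(errors ** 2)                     # variation left unexplained by the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1 - ss_res / ss_tot                         # share of variation explained

print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}, R^2 = {r2:.3f}")
```

Because MAE and RMSE are expressed in the units of the target, judging whether they are "good" usually means comparing them with the scale of `price` (for example, the summary from `df.describe()`).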

## Questions to Answer - Part 2: Assumptions of Linear Regression

> Use the visualizations below to answer the following questions:

- **Q4: Does this regression meet the assumptions of Linear Regression?**
  - 1) Little to no multicollinearity
    - ...
  - 2) Features have a linear relationship with the target
    - ...
  - 3) Homoscedasticity
    - ...
  - 4) Normally distributed residuals
    - ...
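
For the multicollinearity check, one optional numeric complement to a correlation heatmap is the variance inflation factor (VIF). It is not part of the original warm-up. The sketch below uses synthetic data and a hypothetical helper named `vif_scores`, and computes VIF from its definition rather than from a library routine.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features for illustration: x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(321)
x1 = rng.normal(size=200)
toy_X = pd.DataFrame({"x1": x1,
                      "x2": x1 + 0.1 * rng.normal(size=200),
                      "x3": rng.normal(size=200)})

def vif_scores(X):
    """Return VIF = 1 / (1 - R^2), where R^2 comes from regressing
    each column on all of the other columns."""
    scores = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        scores[col] = 1 / (1 - r2)
    return pd.Series(scores, name="VIF")

vif = vif_scores(toy_X)
print(vif.round(1))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity, though the exact cutoff is a judgment call. Assuming the cells above have been run, the same helper could be applied to `X_train_df`.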



In [None]:
fig_corr

In [None]:
## Using a pairgrid 
# https://seaborn.pydata.org/tutorial/axis_grids.html#plotting-pairwise-data-relationships
features= df.drop(columns=target).columns
pair_g = sns.PairGrid(df, y_vars=["price"], x_vars=features)
pair_g.map(sns.regplot,
      scatter_kws={'ec':'white','lw':0.5},
      line_kws={'color':'red','ls':':','lw':2});

In [None]:
plot_residuals(lin_reg, X_test_df,y_test)

# Final Questions to Answer (if you can)

- **Q5: Which features made the model predict a higher price?**
  - ...

- **Q6: Which feature made the model predict a lower price?**
  - ...
- **Q7: What effect does having more bathrooms have on the predicted price?**
  - ...
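
These questions are about the model's coefficients, which the cells above do not display. The following is a minimal sketch of how coefficients can be read from a fitted `LinearRegression`, using synthetic data with a known relationship; the names `toy_X`, `toy_y`, and `toy_model` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known relationship: y = 3*a - 2*b + 10
rng = np.random.default_rng(321)
toy_X = pd.DataFrame({"a": rng.normal(size=50), "b": rng.normal(size=50)})
toy_y = 3 * toy_X["a"] - 2 * toy_X["b"] + 10

toy_model = LinearRegression().fit(toy_X, toy_y)

# Pair each coefficient with its feature name and sort for readability
coeffs = pd.Series(toy_model.coef_, index=toy_X.columns).sort_values()
print(coeffs)
print(f"Intercept: {toy_model.intercept_:.2f}")
```

A positive coefficient means the prediction rises as that feature increases with the other features held fixed; a negative coefficient means it falls. In this notebook the analogous call would be `pd.Series(lin_reg.coef_, index=feature_names)`, assuming `lin_reg` and `feature_names` from the cells above.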


> Return to the main Zoom room once you've answered the questions.