<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br><h2>Script 05 | Advanced Regression Models</h2>
<h4>DAT-5390 | Computational Data Analytics with Python</h4>
Chase Kusterer - Faculty of Analytics<br>
Hult International Business School<br><br><br>

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<h2>Comparing statsmodels and scikit-learn</h2><br>
It may seem counterproductive to build models in both statsmodels and scikit-learn, but each package has its advantages.<br><br>
<u>Advantages of statsmodels</u><br>

* This is a great tool for generating model summaries, enabling analysts to base decisions on familiar metrics such as p-values and R-Square.
* Model outputs are similar to those of R and Excel.
<br><br>

<u>Advantages of scikit-learn</u><br>

* Minimal things happen behind the scenes, making scikit-learn faster than statsmodels. This becomes a serious advantage when running a model in real time on a server or cloud.
* It is incredibly easy to change model types, allowing analysts to experiment with minimal effort.
<br><br>

<u>Disadvantages of statsmodels</u><br>

* Oftentimes, a substantial amount of code needs to be modified in order to change model types.
* Some metrics in a model's summary output may not be relevant to the analysis at hand. Furthermore, the most important metrics for the analysis may not be available in the summary.
<br><br>

<u>Disadvantages of scikit-learn</u><br>

* Analysts have to tell scikit-learn which metrics to generate, which can be tedious in a complex analysis.
* Some statistical outputs, such as p-values, are not reported by scikit-learn's linear models.

<br>
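
To make the scikit-learn trade-off concrete, here is a minimal sketch of requesting specific metrics by hand. It uses synthetic data rather than the housing dataset, and the names <em>train_score</em>, <em>test_score</em>, and <em>tt_gap</em> are illustrative assumptions, not part of any library.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# synthetic data (illustrative only, not the housing dataset)
rng = np.random.default_rng(seed = 702)
X   = rng.normal(size = (200, 3))
y   = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale = 0.5, size = 200)

# train/test split
x_train, x_test, y_train, y_test = train_test_split(X, y,
                                                    test_size    = 0.25,
                                                    random_state = 702)

# INSTANTIATING and FITTING
model = LinearRegression().fit(x_train, y_train)

# SCORING (each metric has to be requested explicitly)
train_score = model.score(x_train, y_train)   # R-Square on training data
test_score  = model.score(x_test, y_test)     # R-Square on testing data
tt_gap      = abs(train_score - test_score)   # train-test gap

print(f"Train R-Square: {train_score:.3f}")
print(f"Test  R-Square: {test_score:.3f}")
print(f"Train-Test Gap: {tt_gap:.3f}")
```

Nothing like a p-value appears unless it is computed separately, which is why statsmodels is still handy for summaries.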

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<h2>Part I: Preparation</h2>

In [None]:
# importing libraries
import pandas as pd                                  # data science essentials
import numpy  as np                                  # mathematical essentials
import matplotlib.pyplot as plt                      # data viz
import seaborn as sns                                # enhanced data viz
import statsmodels.formula.api as smf                # linear modeling
from sklearn.model_selection import train_test_split # train/test split
import sklearn.linear_model                          # faster linear modeling
from baserush.optimize import quick_lm               # efficient base modeling
from baserush.summary import  lr_summary             # New! model summaries


# new libraries
from sklearn.preprocessing import StandardScaler  # standard scaler
import warnings                                   # warnings from code

# setting pandas print options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', lambda x: '%.2f' % x)


# suppressing warnings
warnings.filterwarnings(action = 'ignore')


# specifying the path and file name
file = './datasets/housing_feature_rich.xlsx'


# reading the file into Python
housing = pd.read_excel(io     = file,
                        header = 0   )


housing.drop(labels  = ['property_id'],
             axis    = 1,
             inplace = True)


#####################################
# importing model coefficients file #
#####################################
results_path = "./model_results/model_results.xlsx"

results_df   = pd.read_excel(io     = results_path,
                             header = 0           )



# checking housing dataset
housing.head(n = 5)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h3>Candidate Models</h3><br>
Run the following code to instantiate the candidate models from previous scripts.

In [None]:
#################################
## original data (full models) ##
#################################
# all x-data
x_all = list(housing.drop(labels  = ['Sale_Price', 'log_Sale_Price'],
                          axis    = 1))

# continuous x-data
x_original = list(housing.loc[ : , 'Lot_Area' : 'Porch_Area' ])



################
## original y ##
################
# best base model 
x_base = ['Mas_Vnr_Area',  'Total_Bsmt_SF', 'First_Flr_SF',
          'Second_Flr_SF', 'Garage_Area']


# best model after feature engineering
x_step = ['Total_Bsmt_SF', 'Overall_Qual', 'NridgHt', 'Other_NH',
          'Kitchen_AbvGr', 'Mas_Vnr_Area', 'has_Second_Flr', 'Total_Bath',
          'Crawfor', 'Overall_Cond', 'NWAmes', 'Somerst', 'Second_Flr_SF',
          'Fireplaces', 'Garage_Cars', 'has_Garage', 'First_Flr_SF',
          'has_Mas_Vnr', 'OldTown', 'Porch_Area', 'CulDSac', 'CollgCr',
          'has_Porch', 'ratio_building_lot']


###################
## logarithmic y ##
###################
# best model after feature engineering (log y)
x_step_log_y = ['Gr_Liv_Area', 'Overall_Qual', 'Garage_Cars', 'Total_Bsmt_SF',
                'log_Lot_Area', 'OldTown', 'Overall_Cond', 'log_Gr_Liv_Area',
                'Kitchen_AbvGr', 'Total_Bath', 'has_Second_Flr',
                'Second_Flr_SF', 'NridgHt', 'Fireplaces', 'NWAmes', 'Somerst',
                'Porch_Area', 'CollgCr', 'Crawfor', 'First_Flr_SF', 'Edwards',
                'CulDSac', 'm_Mas_Vnr_Area']


########################
## response variables ##
########################
original_y = 'Sale_Price'
log_y      = 'log_Sale_Price'

<br>

In [None]:
# creating placeholder DataFrame for results
results = pd.DataFrame()

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

Let's run the OLS models from the previous script so that we can compare them to more advanced models.

In [None]:
## stepwise model using Sale_Price ##
sp_model = quick_lm(x_data        = housing[ x_all ],
                    y_data        = housing[ original_y ],
                    threshold_in  = 0.01,
                    threshold_out = 0.05,
                    test_size     = 0.25,
                    verbose       = False) # suppressing output


# storing results
results = lr_summary(x = housing[ sp_model['selected_features'] ],
                     y = housing[ original_y ],
                     model = sklearn.linear_model.LinearRegression(),
                     model_name = 'OLS Model (y)',
                     results_df = results)


# checking results
results.head(n = 25)

<br>

In [None]:
## stepwise model using log_Sale_Price ##
sp_model = quick_lm(x_data        = housing[ x_all ],
                    y_data        = housing[ log_y ],
                    threshold_in  = 0.01,
                    threshold_out = 0.05,
                    test_size     = 0.25,
                    verbose       = False)


# storing results
results = lr_summary(x = housing[ sp_model['selected_features'] ],
                     y = housing[ log_y ],
                     model = sklearn.linear_model.LinearRegression(),
                     model_name = 'OLS Model (log y)',
                     results_df = results)


# checking results
results.head(n = 25)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part II: Ridge Regression</h2><br>
Ridge regression is a model type that has a shrinkage parameter. In other words, ridge models can tune each x-feature to make it more stable (more formally known as <strong>regularization</strong>). Think of stability as a coefficient from an OLS regression model that has a p-value exactly equal to zero. Therefore, instability would be a coefficient with a p-value greater than zero. Too much instability in a coefficient implies that an x-feature is insignificant, such as when a p-value gets above 0.05 (assuming 95% confidence). In OLS regression, we <strong>regularize</strong> a model by removing insignificant features. Note that mathematically, this is the same as setting the feature's coefficient to zero.
<br><br>
Now imagine a model that has the ability to shrink a feature's coefficient instead of setting it to zero. This is what ridge and other regularization models do. When a ridge model finds a coefficient that is unstable, it shrinks it until stability is achieved. This tends to lead to weaker predictive performance in terms of metrics like R-Square. However, this also tends to lead to greater stability, which can be observed through metrics like the train-test gap. <strong>Stable models are preferred to unstable models</strong> because they are more likely to perform as expected in the real world. Remember that even though a model may "look" good on paper, its job is to predict something that is currently unknown. There is less risk in getting unexpected results if a model is stable.
<br><br>
Here's a video if you'd like to <a href="https://www.youtube.com/watch?app=desktop&v=Q81RR3yKn30">learn more about ridge regression</a>.<br><br>
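
The shrinkage idea can be sketched on a small synthetic example. The code below is an illustration under assumptions (made-up data with two nearly identical x-features, and arbitrarily chosen alpha values); it is not the housing analysis. It compares the overall size of the coefficients from OLS against ridge models with increasing <em>alpha</em>.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# synthetic data with two nearly identical x-features (illustrative only)
rng = np.random.default_rng(seed = 702)
x1  = rng.normal(size = 100)
x2  = x1 + rng.normal(scale = 0.05, size = 100)
X   = np.column_stack([x1, x2])
y   = x1 + x2 + rng.normal(scale = 1.0, size = 100)

# OLS coefficients (no shrinkage)
ols_norm = np.linalg.norm(LinearRegression().fit(X, y).coef_)

# ridge coefficients for increasing values of alpha
alphas      = [0.1, 1.0, 10.0, 100.0]
ridge_norms = []

for alpha in alphas:
    ridge = Ridge(alpha = alpha).fit(X, y)
    ridge_norms.append(np.linalg.norm(ridge.coef_))

# comparing overall coefficient size
print(f"OLS coefficient size: {ols_norm:.3f}")

for alpha, norm in zip(alphas, ridge_norms):
    print(f"Ridge (alpha = {alpha:>5}): {norm:.3f}")
```

Larger values of <em>alpha</em> apply more shrinkage, so the coefficient size should fall as alpha grows.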

In [None]:
help(sklearn.linear_model.Ridge)

<br>

<strong>a)</strong> Complete the code below using x_all and log_y.

In [None]:
# preparing x-data
x_data = housing[ _____ ]

# preparing y-data
y_data = housing[ _____ ]

<br><br><strong>b)</strong> Develop a ridge regression model with <strong>sklearn.linear_model.Ridge(&nbsp;)</strong>.

In [None]:
# instantiating a ridge model
model = sklearn.linear_model._____(alpha        = 1.0,
                                   random_state = 702)


# analyzing results
results = lr_summary(x          = x_data,
                     y          = y_data,
                     model      = model,
                     model_name = 'Unscaled Ridge Model',
                     results_df = results)


# checking results
results.head(n = 25)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
<h2>Part III: Lasso Regression</h2><br>
Next up is lasso regression, which is also a regularization model and is very similar to ridge regression. The major difference between these model types is that lasso can shrink a coefficient to zero, whereas ridge can only get extremely close to zero. This means that lasso models have a built-in variable selection technique: they can set coefficients to zero, which effectively kicks those features out of the model. This can be very useful in the early stages of an analysis.
<br><br>
This video is a great way to <a href="https://www.youtube.com/watch?app=desktop&v=NGf0voTMlcs">learn more about lasso regression</a>.<br><br>

<img src="./script_images/lasso.png" alt="Ted Lasso" width="400"/>
<br><br>
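
As a rough illustration of lasso's built-in variable selection, the sketch below fits lasso and ridge models to the same synthetic data, where only the first two of six x-features actually drive y. The data and the alpha value are assumptions chosen for demonstration, not settings recommended for the housing dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# synthetic data: only the first two x-features matter (illustrative only)
rng = np.random.default_rng(seed = 702)
X   = rng.normal(size = (200, 6))
y   = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale = 0.5, size = 200)

# fitting both model types with the same alpha
lasso = Lasso(alpha = 0.5).fit(X, y)
ridge = Ridge(alpha = 0.5).fit(X, y)

# counting coefficients that are exactly zero
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("Lasso coefficients:", lasso.coef_.round(3))
print("Ridge coefficients:", ridge.coef_.round(3))
print("Zeroed out by lasso:", n_zero_lasso)
print("Zeroed out by ridge:", n_zero_ridge)
```

The noise features should receive coefficients of exactly zero from lasso, while ridge keeps small non-zero values for them.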

In [None]:
help(sklearn.linear_model.Lasso)

<br><strong>a)</strong> Develop a lasso regression model with <strong>sklearn.linear_model.Lasso(&nbsp;) </strong>.

In [None]:
# instantiating a lasso model
model = sklearn.linear_model._____(alpha        = 1.0,
                                   random_state = 702)


# analyzing results
results = lr_summary(x          = x_data,
                     y          = y_data,
                     model      = model,
                     model_name = 'Unscaled Lasso Model',
                     results_df = results)


# checking results
results.head(n = 25)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
<h2>Part IV: Stochastic Gradient Descent</h2><br>
I have a very interesting memory from my childhood that I would like to share with you. When I was around five years old, I remember seeing a lot of soap commercials on TV. Each one would explain how their soap was the best, much better when compared to other leading brands. One would be the best at killing germs. Another would be the best at moisturizing. This one smells the best. That one is the most recommended by doctors, so it's the best. As I saw more commercials, I kept thinking: Why don't they just combine all the best soaps together? Wouldn't that lead to a soap that's the best at everything? There would be no more debate on this subject.
<br><br>
Later in life, I learned about ridge and lasso models. Both use different regularization techniques, which can lead to one model working better than the other depending on the data. So, how do you know when to use one over the other? Wouldn't it be great if we could just combine them together like the soaps of my childhood? This is exactly what an <strong>elastic net</strong> does. In scikit-learn, one way to fit an elastic net is <strong>SGDRegressor</strong> with <em>penalty = 'elasticnet'</em>, which estimates the coefficients using stochastic gradient descent. If you'd like to know more, check out <a href="https://www.youtube.com/watch?app=desktop&v=1dKRdX9bfIo">this video on elastic net regression</a>.
<br><br>
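
To see how an elastic net blends the two penalties, here is a small sketch of the penalty term on its own. The helper <em>elastic_net_penalty</em> is a hypothetical function written for this illustration (it is not a scikit-learn function), and the coefficients are made up. The form follows the elastic net penalty described in the scikit-learn documentation, where <em>l1_ratio</em> controls the mix; exact scaling conventions can differ between estimators.

```python
import numpy as np

def elastic_net_penalty(coefs, alpha, l1_ratio):
    """Hypothetical helper: blends lasso (L1) and ridge (L2) penalties."""
    coefs = np.asarray(coefs, dtype = float)
    l1    = np.sum(np.abs(coefs))        # lasso-style penalty
    l2    = 0.5 * np.sum(coefs ** 2)     # ridge-style penalty
    return alpha * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)

# made-up coefficients for illustration
coefs = [2.0, -1.0, 0.5]

lasso_like = elastic_net_penalty(coefs, alpha = 0.001, l1_ratio = 1.0)
ridge_like = elastic_net_penalty(coefs, alpha = 0.001, l1_ratio = 0.0)
blended    = elastic_net_penalty(coefs, alpha = 0.001, l1_ratio = 0.15)

print(f"l1_ratio = 1.00 (all lasso): {lasso_like:.6f}")
print(f"l1_ratio = 0.00 (all ridge): {ridge_like:.6f}")
print(f"l1_ratio = 0.15 (blended)  : {blended:.6f}")
```

With <em>l1_ratio = 0.15</em>, the penalty is mostly ridge with a little lasso mixed in.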

In [None]:
help(sklearn.linear_model.SGDRegressor)

<br><br><strong>a)</strong> Develop an SGD regression model with <strong>SGDRegressor(&nbsp;)</strong>.

In [None]:
# instantiating an SGD model
model = sklearn.linear_model._____(loss     = 'squared_error',
                                   penalty  = 'elasticnet',
                                   alpha    = 0.001,
                                   l1_ratio = 0.15,
                                   random_state = 702)


# analyzing results
results = lr_summary(x          = x_data,
                     y          = y_data,
                     model      = model,
                     model_name = 'Unscaled SGD Model',
                     results_df = results)


# checking results
results.head(n = 25)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part V: Standardization</h2><br>
In this section, we will learn how to <strong>standardize</strong> the X-features. In other words, we are going to put them into a form where each feature's variance is measured on the same scale. Some algorithms base their calculations on distance, which requires standardization so that they work properly. Others have penalty terms that assume all features have a mean of zero and a standard deviation of one, which is the result of standardization.
<br><br>
In general, distance- and penalty-based algorithms (like K-Nearest Neighbors, which we will see in the next script) perform much better after standardization. This is because distance-based algorithms use variance to compute similarity amongst observations: the closer two observations are in terms of their variance, the more similar the algorithm will think they are. Therefore, if the data is not standardized, features with more variance (typically those measured on larger scales) may take over the model. This can be a lot to conceptualize, so let's take it step by step, keeping in mind that our goal is to ensure that the variance in each feature is treated fairly by the algorithms we develop.
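
Here is a small sketch of why scale matters for distance. The three homes below are made up for illustration (lot area in square feet and number of bathrooms), and <em>lot_area_share</em> is a hypothetical helper. It measures how much of the squared distance between the first two homes comes from lot area, before and after standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# three hypothetical homes: [lot area (sq ft), number of bathrooms]
homes = np.array([[8000.0, 1.0],
                  [8100.0, 4.0],
                  [9000.0, 1.0]])

def lot_area_share(a, b):
    """Share of the squared distance between a and b due to lot area."""
    squared_diffs = (a - b) ** 2
    return float(squared_diffs[0] / squared_diffs.sum())

# before standardization
raw_share = lot_area_share(homes[0], homes[1])

# after standardization
scaled       = StandardScaler().fit_transform(homes)
scaled_share = lot_area_share(scaled[0], scaled[1])

print(f"Lot area share of distance (raw)   : {raw_share:.4f}")
print(f"Lot area share of distance (scaled): {scaled_share:.4f}")
```

Before scaling, a 100 square foot difference swamps a three-bathroom difference. After scaling, the bathroom difference drives most of the distance.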
<br><br><strong>Standard Scaler</strong><br>
Technically speaking, this is our first unsupervised learning technique! Congrats on all that you've accomplished thus far! Notice how the process of data standardization is very similar to that of building models in scikit-learn:<br>

* Instantiate
* Fit
* <strike>Predict</strike> Transform
* <strike>Score</strike> Convert
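
The steps above can be sketched on a tiny made-up DataFrame. The column names are borrowed from the housing data, but the values are invented for illustration. The sketch also checks the result against a manual z-score, assuming the population standard deviation (ddof = 0), which is what StandardScaler uses.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# tiny made-up dataset (values are illustrative only)
toy_df = pd.DataFrame({'Lot_Area'    : [8000.0, 9600.0, 11250.0, 9550.0],
                       'Garage_Cars' : [2.0, 2.0, 3.0, 1.0]})

# INSTANTIATING
scaler = StandardScaler()

# FITTING (learns each column's mean and standard deviation)
scaler.fit(toy_df)

# TRANSFORMING (applies the z-score to each column)
toy_scaled = scaler.transform(toy_df)

# CONVERTING the resulting array back into a labeled DataFrame
toy_scaled_df = pd.DataFrame(toy_scaled, columns = toy_df.columns)

# manual z-score for comparison
manual_df = (toy_df - toy_df.mean()) / toy_df.std(ddof = 0)

print(toy_scaled_df.round(decimals = 2))
print(np.allclose(toy_scaled_df.values, manual_df.values))
```

Each scaled column should end up with a mean of zero and a standard deviation of one.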

<br>
<strong>a)</strong> Complete the code below to standardize all X-features.

In [None]:
# INSTANTIATING a StandardScaler() object
scaler = StandardScaler()


# FITTING and TRANSFORMING
x_scaled = scaler.fit_transform( _____ )


# converting scaled data into a DataFrame
x_scaled_df = pd.DataFrame(x_scaled)


# labeling columns
x_scaled_df.columns = _____.columns


# checking the results
x_scaled_df.describe(include = 'number').round(decimals = 2)

<br><br>
<h3>Standardizing Candidate Models</h3><br>
Now that we've created a standardized version of the x-data, we can standardize each candidate model with subsetting, as exemplified in the two code cells below.

In [None]:
# x_all (not standardized)
housing[x_all].iloc[ : , 0:3 ].head(n=5)

In [None]:
# x_all (standardized)
x_scaled_df[x_all].iloc[ : , 0:3 ].head(n=5)

<br><hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
<h2>Part VI: Team Challenge</h2>
<br>Below are the available candidate models.
<br><br>
<u><strong>Candidate Models (X-features)</strong></u>

* x_all
* x_original
* x_base
* x_step
* x_step_log_y


<br>
<u><strong>Response Variables (y)</strong></u>

* original_y
* log_y


<br>Your objectives in this challenge are to:

1. Determine which models perform better with standardized data.
2. Choose the best candidate model for each of the following model types:

* <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html">OLS Regression</a>
* <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html">Ridge Regression</a>
* <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html">Lasso Regression</a>
* <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html">SGD Regression</a>

<br>
<strong>a)</strong> Run each candidate model on each model type, recording its results in <em>results</em>.

In [None]:
# preparing x-data
x_data =  _____ [ _____ ] # df can be housing or x_scaled_df

# preparing y-data
y_data = housing[ _____ ] # df can only be housing

<br>

In [None]:
# instantiating a model (the arguments shown apply to SGDRegressor;
# adjust them for other model types)
model = sklearn.linear_model._____(loss     = 'squared_error',
                                   penalty  = 'elasticnet',
                                   alpha    = 0.001,
                                   l1_ratio = 0.15,
                                   random_state = 702)


# analyzing results
results = lr_summary(x          = x_data,
                     y          = y_data,
                     model      = model,
                     model_name = _____,
                     results_df = results)


# checking results
results.head(n = 25)

<br>
<h3>Analysis</h3>

Use this markdown cell to write your analysis. Good luck!





<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

~~~


 █████╗ ██╗     ██╗ ██████╗ ███╗   ██╗██╗███╗   ██╗ ██████╗ 
██╔══██╗██║     ██║██╔════╝ ████╗  ██║██║████╗  ██║██╔════╝ 
███████║██║     ██║██║  ███╗██╔██╗ ██║██║██╔██╗ ██║██║  ███╗
██╔══██║██║     ██║██║   ██║██║╚██╗██║██║██║╚██╗██║██║   ██║
██║  ██║███████╗██║╚██████╔╝██║ ╚████║██║██║ ╚████║╚██████╔╝
╚═╝  ╚═╝╚══════╝╚═╝ ╚═════╝ ╚═╝  ╚═══╝╚═╝╚═╝  ╚═══╝ ╚═════╝ 
                                                            
██╗    ██╗██╗████████╗██╗  ██╗                              
██║    ██║██║╚══██╔══╝██║  ██║                              
██║ █╗ ██║██║   ██║   ███████║                              
██║███╗██║██║   ██║   ██╔══██║                              
╚███╔███╔╝██║   ██║   ██║  ██║                              
 ╚══╝╚══╝ ╚═╝   ╚═╝   ╚═╝  ╚═╝                              
                                                            
███████╗██╗   ██╗ ██████╗ ██████╗███████╗███████╗███████╗██╗
██╔════╝██║   ██║██╔════╝██╔════╝██╔════╝██╔════╝██╔════╝██║
███████╗██║   ██║██║     ██║     █████╗  ███████╗███████╗██║
╚════██║██║   ██║██║     ██║     ██╔══╝  ╚════██║╚════██║╚═╝
███████║╚██████╔╝╚██████╗╚██████╗███████╗███████║███████║██╗
╚══════╝ ╚═════╝  ╚═════╝ ╚═════╝╚══════╝╚══════╝╚══════╝╚═╝
                                                            


~~~

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br>
<h2>Bonus: Storing Model Results</h2><br>
The code below will store the model results from above as a new Excel file.

In [None]:
# saving results in Excel
results.to_excel(excel_writer = "./model_results/model_results_2.xlsx",
                 index        = False)

<br>