<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br><h2>Script 05 | The Model Building Framework</h2>
<br>
Written by Chase Kusterer<br>
<a href="https://github.com/chase-kusterer">GitHub</a> | <a href="https://www.linkedin.com/in/kusterer/">LinkedIn</a>
<br><br><br>

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br>

<h3>Basic Modeling Strategy for a Continuous Response Variable</h3>

<strong>1. Data Exploration, Preparation, and Feature Engineering</strong><br>
Analyze the data, treat anomalies (missing values, etc.), and develop new features.<br><br>

<strong>2. Prepare for Model Development</strong><br>
Split the dataset into training and testing sets.<br><br>

<strong>3. Model Development in statsmodels</strong><br>
Experiment with different feature combinations in linear regression (OLS) and analyze results.<br><br>

<strong>4. Develop Candidate Models</strong><br>
Find X-feature combinations that predict well.<br><br>

<strong>5. Model Tournament</strong><br>
Apply X-feature combinations to different model types in scikit-learn.<br><br><br>

<h3>The Model Tournament Workflow</h3>
    
<strong>1. Instantiate</strong><br>
Create a blueprint of the model, just like with user-defined functions.<br><br>

<strong>2. Fit</strong><br>
Run the training data through the model object. This estimates the model's parameters, such as the intercept and coefficients, which are later used to calculate metrics such as R-Square.<br><br>

<strong>3. Predict</strong><br>
Use the fitted model to predict on the testing set. This helps us understand the stability of the model.<br><br>

<strong>4. Score</strong><br>
Analyze the model's performance based on its regression metrics.<br><br>
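
The four steps above can be sketched in a few lines. This is a minimal illustration only: it assumes synthetic data from scikit-learn's <em>make_regression</em> instead of the housing dataset, and the two model types (and the Ridge <em>alpha</em> value) are arbitrary choices for demonstration.

```python
# a minimal sketch of instantiate -> fit -> predict -> score
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# synthetic stand-in data (an assumption, NOT the housing dataset)
X, y = make_regression(n_samples    = 200,
                       n_features   = 5,
                       noise        = 10.0,
                       random_state = 219)

x_train, x_test, y_train, y_test = train_test_split(X, y,
                                                    test_size    = 0.25,
                                                    random_state = 219)

# INSTANTIATING one object per model type (illustrative choices)
models = {"Linear Regression" : LinearRegression(),
          "Ridge Regression"  : Ridge(alpha = 1.0)}

results = {}

for name, model in models.items():
    model.fit(x_train, y_train)                   # FITTING
    pred        = model.predict(x_test)           # PREDICTING
    train_score = model.score(x_train, y_train)   # SCORING (R-Square)
    test_score  = model.score(x_test, y_test)
    results[name] = (round(train_score, 4), round(test_score, 4))

# displaying results
for name, (train_score, test_score) in results.items():
    print(f"{name:<20} train: {train_score}  test: {test_score}")
```

Because every scikit-learn estimator shares the same <em>fit</em>, <em>predict</em>, and <em>score</em> methods, swapping in another model type only requires adding it to the dictionary.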

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<h2>Part I: Training and Testing Sets</h2><br>
Our previous model building endeavors using statsmodels had one major drawback:<br><br>

<div align="center"><h4>The models were trained using all of the data.</h4></div><br>

This may not seem like a big deal, but modeling in this way can be very dangerous. Allowing an algorithm to see all of the data runs the risk of <strong>overfitting</strong>, or tailoring so closely to a dataset that the algorithm predicts poorly on new observations. Remember, the primary goal of building a model is to predict well on observations where the end result is unknown (i.e., new cases). Therefore, we need to set aside some data before a model is trained (known as a <strong>testing</strong> or <strong>validation set</strong>). After training, the testing set will help us understand how the model predicts on new data.<br><br><br>
<strong>Some Things to Keep in Mind</strong><br>
1. Data exploration and feature engineering are always conducted on the full dataset.
2. Model adjustments should never be made on the testing set. The testing set should only be used to analyze a model's fit.<br>
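
To make the risk of overfitting concrete, here is a minimal sketch on synthetic data. The setup is an assumption chosen to exaggerate the effect: 60 observations and 40 features, where only the first feature is actually related to the response. With so many features relative to observations, the model can fit the noise in the training set.

```python
# a minimal sketch of overfitting, using made-up data
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed = 219)

# 60 observations, 40 features; only the first feature matters
X = rng.normal(size = (60, 40))
y = 3 * X[:, 0] + rng.normal(size = 60)

# setting aside 25% of the data BEFORE training
x_train, x_test, y_train, y_test = train_test_split(X, y,
                                                    test_size    = 0.25,
                                                    random_state = 219)

# fitting on the training set only
model = LinearRegression().fit(x_train, y_train)

# comparing performance on seen vs. unseen data
train_score = model.score(x_train, y_train)
test_score  = model.score(x_test, y_test)

print('Training Score :', round(train_score, ndigits = 4))
print('Testing Score  :', round(test_score, ndigits = 4))
```

Had the model been scored only on the data it was trained on, the poor performance on new observations would have gone unnoticed.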

<h4>Imports and Loading the Dataset</h4>
Run the following code to import packages and the feature-enhanced version of the dataset.

In [None]:
# importing packages
import pandas as pd                                  # data science essentials
import matplotlib.pyplot as plt                      # data viz
import seaborn as sns                                # enhanced data viz
import statsmodels.formula.api as smf                # linear modeling
from sklearn.model_selection import train_test_split # train/test split
import sklearn.linear_model                          # faster linear modeling
import numpy as np                                   # mathematical essentials

# setting pandas and numpy print options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
np.set_printoptions(suppress=True)

# specifying the path and file name
file = './datasets/housing_feature_rich.xlsx'


# reading the file into Python
housing = pd.read_excel(io     = file,
                        header = 0   )


# dropping the identifier column
housing.drop(labels  = 'property_id',
             axis    = 1,
             inplace = True)


# checking the file
housing.head(n = 5)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part II: Preparing for Modeling</h2><br>
The best X-feature sets from previous scripts have been prepared as lists in the code below. These are known as <strong>candidate models</strong>. We will apply each candidate model to different model types in the hopes of finding the one with the best predictive performance.

In [None]:
#################################
## original data (full models) ##
#################################
# all x-data
x_all = list(housing.drop(labels  = ['Sale_Price', 'log_Sale_Price'],
                          axis    = 1))

# continuous x-data
x_original = list(housing.loc[ : , 'Lot_Area' : 'Porch_Area' ])



################
## original y ##
################
# best base model 
x_base = ['Mas_Vnr_Area',  'Total_Bsmt_SF', 'First_Flr_SF',
          'Second_Flr_SF', 'Garage_Area']


# best model after feature engineering
x_rich = ['Lot_Area', 'Garage_Cars', 'Overall_Qual', 'Total_Bsmt_SF',
          'NridgHt', 'Kitchen_AbvGr', 'has_Second_Flr',
          'Mas_Vnr_Area', 'has_Garage', 'Porch_Area',
          'NWAmes', 'OldTown', 'Overall_Cond', 'NAmes',
          'Edwards', 'Somerst', 'Fireplaces', 'Second_Flr_SF',
          'First_Flr_SF', 'has_Mas_Vnr', 'CulDSac', 'Total_Bath',
          'Crawfor', 'Garage_Area', 'has_Porch']



###################
## logarithmic y ##
###################
# best model after feature engineering (log y)
x_rich_log_y = ['Lot_Area', 'First_Flr_SF', 'Second_Flr_SF', 'Garage_Cars' ,
                'Overall_Qual', 'Overall_Cond', 'Total_Bsmt_SF', 'OldTown',
                'Kitchen_AbvGr', 'Total_Bath', 'has_Second_Flr', 'NridgHt',
                'Fireplaces', 'Porch_Area', 'Somerst', 'CollgCr', 'Crawfor',
                'CulDSac', 'NWAmes', 'Edwards', 'Gilbert']



########################
## response variables ##
########################
original_y = 'Sale_Price'
log_y      = 'log_Sale_Price'

<br><br>
<strong>a)</strong> Complete the code below using x_all and original_y.

In [None]:
# preparing x-data
x_data = housing[ _____ ]


# preparing y-data
y_data = _____


# train-test split
x_train, x_test, y_train, y_test = train_test_split(x_data, # x
                                                    y_data, # y
                                                    test_size    = 0.25,
                                                    random_state = 219 )

In [None]:
# preparing x-data
x_data = housing[ x_all ]


# preparing y-data
y_data = housing[ original_y ]


# train-test split
x_train, x_test, y_train, y_test = train_test_split(x_data,
                                                    y_data,
                                                    test_size    = 0.25,
                                                    random_state = 219 )
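
As a quick illustration of what <em>train_test_split</em> returns, the sketch below uses a small made-up DataFrame (the column names <em>x1</em>, <em>x2</em>, and <em>y</em> are hypothetical). Note that the x-data is a DataFrame subset and the y-data is a single column (a Series), not just the column's name.

```python
# a minimal sketch of a train-test split on toy data
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# toy dataset with 100 rows (hypothetical column names)
toy = pd.DataFrame(data = {"x1" : np.arange(100),
                           "x2" : np.arange(100) * 2,
                           "y"  : np.arange(100) * 3})

# x-data is two-dimensional, y-data is one-dimensional
toy_x = toy[ ["x1", "x2"] ]
toy_y = toy[ "y" ]

# holding out 25% of the rows for testing
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x,
    toy_y,
    test_size    = 0.25,
    random_state = 219)

# checking shapes
print(toy_x_train.shape, toy_y_train.shape)
print(toy_x_test.shape, toy_y_test.shape)
```

The same rows are selected for the x and y sides of each set, so their indexes stay aligned. Setting <em>random_state</em> makes the split reproducible.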

<br>

In [None]:
# checking results
print(f"""
Data Shapes
-----------
x_train: {x_train.shape}
y_train: {y_train.shape}

x_test: {x_test.shape}
y_test: {y_test.shape}
""")

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part III: Example: Modeling with OLS Regression</h2>

In [None]:
# naming the model
model_name = "Linear Regression"


# INSTANTIATING model object
model = sklearn.linear_model.LinearRegression()


# FITTING to training data
model_fit = model.fit(x_train, y_train)


# PREDICTING on new data
model_pred = model.predict(x_test)


# SCORING results (R-Square)
model_train_score = round(model.score(x_train, y_train), ndigits = 4)
model_test_score  = round(model.score(x_test, y_test), ndigits = 4)
model_gap         = round(model_train_score - model_test_score, ndigits = 4)


# displaying results
print('Training Score :', model_train_score)
print('Testing Score  :', model_test_score)
print('Train-Test Gap :', model_gap)

<br>
<h3>Extracting Coefficients</h3>

In [None]:
# zipping each feature name to its coefficient
model_coefficients = zip(x_train.columns,
                         model.coef_.round(decimals = 4))


# setting up a placeholder list to store model features
coefficient_lst = [('intercept', model.intercept_.round(decimals = 4))]


# appending each feature-coefficient pair to the list one by one
for coefficient in model_coefficients:
    coefficient_lst.append(coefficient)
    

# checking the results
for pair in coefficient_lst:
    print(pair)

<br>
<h3>Model Summary</h3>

In [None]:
# dynamically printing model summary
ols_model =  f"""\
Model Name:     {model_name}
Train_Score:    {model_train_score}
Test_Score:     {model_test_score}
Train-Test Gap: {model_gap}

Coefficients
-----------
{pd.DataFrame(data = coefficient_lst, columns = ["Feature", "Coefficient"])}"""

print(ols_model)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<strong>How fit should a model be?</strong><br>
As a general heuristic, if the training and testing scores are within 0.05 of each other, the model is unlikely to be overfit. Don't worry if the testing score ends up higher than the training score. Some sources claim that such a model is underfit, but this is a common misconception that is beyond the scope of this course.<br><br>
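
The heuristic above can be wrapped in a small helper. This is only a sketch: the function name <em>check_fit</em> and its <em>threshold</em> argument are hypothetical, and the 0.05 default simply mirrors the rule of thumb stated above.

```python
# a minimal sketch of the train-test gap heuristic
def check_fit(train_score, test_score, threshold = 0.05):
    """Returns the absolute train-test gap and whether it is within
    the threshold (hypothetical helper, not part of scikit-learn)."""

    # absolute gap, since the testing score may exceed the training score
    gap = round(abs(train_score - test_score), ndigits = 4)

    return gap, gap <= threshold


# example usage with made-up scores
stable_gap, stable_ok     = check_fit(train_score = 0.85, test_score = 0.82)
unstable_gap, unstable_ok = check_fit(train_score = 0.95, test_score = 0.70)

print('Stable model   :', stable_gap, stable_ok)
print('Unstable model :', unstable_gap, unstable_ok)
```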

<h2>Exploring the Model Building Framework</h2><br>
Let's explore each component of the model building framework.

In [None]:
help(sklearn.linear_model.LinearRegression)

<br>
<h3>Instantiate</h3>
This creates a blueprint of the model, just like with user-defined functions.

In [None]:
# this is a blueprint
model = sklearn.linear_model.LinearRegression()


# printing blueprint
print(model)

<br>
<h3>Fit</h3>
Runs the training data through the model, estimating its parameters (the intercept and coefficients). Metrics such as R-Square are calculated later, in the Score step.

In [None]:
# model is created from blueprint and data
model_fit = model.fit(x_train, y_train)


# printing model attributes
print(f"""
Intercept
---------
{round(model_fit.intercept_, ndigits = 2)}


Coefficients
------------
{model_fit.coef_.round(decimals = 2)}


Total X-Features
----------------
{model_fit.n_features_in_}
""")

<br>
<h3>Predict</h3>
Uses the fitted model to predict on the testing set. This helps us understand the stability of the model.

In [None]:
# applying model to validation set
model_pred = model.predict(x_test)


# printing predictions (validation set)
print(model_pred.round(decimals = 0).astype(dtype = int))
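
For a linear regression, predicting amounts to plugging each observation into the fitted equation: the intercept plus each feature multiplied by its coefficient. The sketch below checks this on synthetic data (an assumption; it does not use the housing dataset).

```python
# a minimal sketch: reproducing .predict() by hand
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# synthetic stand-in data
X, y = make_regression(n_samples    = 50,
                       n_features   = 3,
                       noise        = 5.0,
                       random_state = 219)

# instantiating and fitting
lr = LinearRegression().fit(X, y)

# intercept + (features x coefficients)
manual_pred = lr.intercept_ + X @ lr.coef_

# comparing to scikit-learn's predictions
same_result = np.allclose(manual_pred, lr.predict(X))
print('Manual predictions match .predict():', same_result)
```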

<br>

In [None]:
## Residual Analysis ##

# organizing true and predicted values
model_residuals = {
    "True"            : y_test,
    "Predicted"       : model_pred.round(decimals = 0).astype(dtype = int)
}


# converting residuals into df
resid_df = pd.DataFrame(data = model_residuals)


# checking results
resid_df.head(n = 5)
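
A residual is the difference between a true value and its prediction. As a minimal sketch on made-up numbers (the sale prices below are invented for illustration), a residual column can be calculated directly from columns named like the ones above:

```python
# a minimal sketch: calculating residuals from true and predicted values
import pandas as pd

# made-up sale prices for illustration
toy_resid = pd.DataFrame(data = {"True"      : [200000, 150000, 310000],
                                 "Predicted" : [195000, 160000, 300000]})

# residual = true value - predicted value
toy_resid["Residual"] = toy_resid["True"] - toy_resid["Predicted"]

# checking results
print(toy_resid)
```

A positive residual means the model under-predicted; a negative residual means it over-predicted.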

<br><br>
<strong>a)</strong> Complete the code to develop a residual plot from <em>resid_df</em>. Plot <em>Predicted</em> on the x-axis and <em>True</em> on the y-axis.

In [None]:
# developing a residual plot
sns.residplot(data        = _____,
              x           = _____,
              y           = _____,
              lowess      = True,
              color       = 'blue',
              scatter_kws = {'alpha': 0.3},   # data point transparency
              line_kws    = {'color': 'red'}) # line color


# title and axis labels
plt.title(label   = "Residual Plot - Full Model")
plt.xlabel(xlabel = "Predicted Sale Price")
plt.ylabel(ylabel = "Residual Sale Price")


# layout and rendering visual
plt.tight_layout()
plt.show()

In [None]:
# developing a residual plot
sns.residplot(data        = resid_df,
              x           = 'Predicted',
              y           = 'True',
              lowess      = True,
              color       = 'blue',
              scatter_kws = {'alpha': 0.3},   # data point transparency
              line_kws    = {'color': 'red'}) # line color


# title and axis labels
plt.title(label   = "Residual Plot - Full Model")
plt.xlabel(xlabel = "Predicted Sale Price")
plt.ylabel(ylabel = "Residual Sale Price")


# layout and rendering visual
plt.tight_layout()
plt.show()

<br>
<h3>Score</h3>
Quantifies the quality of predictions based on a scoring metric (in this case, R-Square). More information on scoring metrics <a href="https://scikit-learn.org/stable/modules/model_evaluation.html">can be found here</a>.

In [None]:
# checking overall predictive quality
print('Training Score:', round(model.score(x_train, y_train), ndigits = 4))
print('Testing  Score:', round(model.score(x_test, y_test),   ndigits = 4))
print('Train-Test Gap:', model_gap)
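
For regressors such as LinearRegression, <em>.score()</em> returns R-Square, which can also be calculated by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below, which assumes synthetic data in place of the housing dataset, compares the manual calculation to scikit-learn's results.

```python
# a minimal sketch: calculating R-Square by hand
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# synthetic stand-in data
X, y = make_regression(n_samples    = 200,
                       n_features   = 4,
                       noise        = 15.0,
                       random_state = 219)

x_train, x_test, y_train, y_test = train_test_split(X, y,
                                                    test_size    = 0.25,
                                                    random_state = 219)

# fitting and predicting
lr   = LinearRegression().fit(x_train, y_train)
pred = lr.predict(x_test)

# R-Square = 1 - (residual sum of squares / total sum of squares)
ss_res    = np.sum((y_test - pred) ** 2)
ss_tot    = np.sum((y_test - y_test.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

# comparing the three approaches
print('Manual R-Square :', round(manual_r2, ndigits = 4))
print('.score()        :', round(lr.score(x_test, y_test), ndigits = 4))
print('r2_score()      :', round(r2_score(y_test, pred), ndigits = 4))
```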

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part IV: Practice Time!</h2><br>
Run this script again using other X-feature sets.<br>

* Which feature set results in the highest R-Square values?
* Which feature set results in the lowest train-test gap?
* Are there any feature sets that flatten out the lowess estimator in the residual plot?
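
Rather than rerunning the script by hand for each feature set, the comparison can be automated with a loop. The following is a minimal sketch on a made-up DataFrame: the column names (<em>area</em>, <em>rooms</em>, <em>noise</em>, <em>price</em>) and the candidate lists are hypothetical stand-ins for the housing data and the X-feature sets defined in Part II.

```python
# a minimal sketch: comparing candidate feature sets in a loop
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed = 219)

# made-up dataset (hypothetical columns, NOT the housing data)
toy = pd.DataFrame(data = {
    "area"  : rng.normal(loc = 1500, scale = 300, size = 200),
    "rooms" : rng.integers(low = 2, high = 8, size = 200),
    "noise" : rng.normal(size = 200)})

toy["price"] = (100 * toy["area"] + 5000 * toy["rooms"]
                + rng.normal(loc = 0, scale = 10000, size = 200))

# candidate feature sets (hypothetical)
candidates = {"area_only"  : ["area"],
              "area_rooms" : ["area", "rooms"],
              "all"        : ["area", "rooms", "noise"]}

rows = []

for label, features in candidates.items():

    # same split settings for every candidate
    x_train, x_test, y_train, y_test = train_test_split(
        toy[features], toy["price"],
        test_size = 0.25, random_state = 219)

    # instantiate, fit, and score
    lr          = LinearRegression().fit(x_train, y_train)
    train_score = round(lr.score(x_train, y_train), ndigits = 4)
    test_score  = round(lr.score(x_test, y_test), ndigits = 4)

    rows.append({"Feature Set"    : label,
                 "Train Score"    : train_score,
                 "Test Score"     : test_score,
                 "Train-Test Gap" : round(train_score - test_score, ndigits = 4)})

# collecting results for side-by-side comparison
results = pd.DataFrame(data = rows)
print(results)
```

Using the same <em>random_state</em> for every candidate keeps the comparison fair, since each feature set is evaluated on the same training and testing rows.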

<br><br><hr style="height:.9px;border:none;color:#333;background-color:#333;" /><hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

~~~

  _    _                           __  __           _      _ _             _ 
 | |  | |                         |  \/  |         | |    | (_)           | |
 | |__| | __ _ _ __  _ __  _   _  | \  / | ___   __| | ___| |_ _ __   __ _| |
 |  __  |/ _` | '_ \| '_ \| | | | | |\/| |/ _ \ / _` |/ _ \ | | '_ \ / _` | |
 | |  | | (_| | |_) | |_) | |_| | | |  | | (_) | (_| |  __/ | | | | | (_| |_|
 |_|  |_|\__,_| .__/| .__/ \__, | |_|  |_|\___/ \__,_|\___|_|_|_| |_|\__, (_)
              | |   | |     __/ |                                     __/ |  
              |_|   |_|    |___/                                     |___/   
                                                                 
~~~

<br>
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<br>