<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br><h2>Script 01 | Base Modeling</h2>
<br>
Written by Chase Kusterer<br>
<a href="https://github.com/chase-kusterer">GitHub</a> | <a href="https://www.linkedin.com/in/kusterer/">LinkedIn</a>
<br><br><br>

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br><br>
In this script, we will:

* Refresh our knowledge of Python and important analytical concepts.
* Develop a base model using continuous features from the Ames Housing dataset.

<br>
<h3>What is Predictive Modeling?</h3><br>
Base modeling is a form of <strong>predictive modeling</strong>. A straightforward definition of predictive modeling can be found in <a href="https://towardsdatascience.com/selecting-the-correct-predictive-modeling-technique-ba459c370d59">this article from Towards Data Science</a>:
<br><br>

<div align="center">
    Predictive modeling is the process of taking known results and developing<br>a model that can predict values for new occurrences.
    
<a class="tocSkip"></a></div><br>
In other words, predictive modeling is the process of using historical data to estimate the values of unknown observations. The keyword in this definition is <em>estimate</em>. Predictive modeling is all about developing our "best guess" given all available information. <font style=color:red><strong>Predictive modeling is NOT about being right. It is about being less wrong.</strong></font>
<br><br><br>
<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part I: Imports and Path</h2><br>

<strong>a) Import the following packages:</strong>
* pandas (as pd)
* seaborn (as sns)
* matplotlib.pyplot (as plt)

Then, import the <em>ames_continuous</em> dataset using pd.read_excel().

In [None]:
# importing libraries
_____ pandas _____ pd       # data science essentials
import matplotlib.pyplot as _____       # essential graphical output
import _____ as sns       # enhanced graphical output
from os import listdir                # NEW! paths and directories
import statsmodels.formula.api as smf # NEW! model building

# setting pandas print options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


# specifying file name
file = './____/ames_continuous.xlsx'


# reading the file into Python
housing = _____


# outputting the first ten rows of the dataset
housing._____

In [None]:
# importing libraries
import pandas as pd                   # data science essentials
import matplotlib.pyplot as plt       # essential graphical output
import seaborn as sns                 # enhanced graphical output
from os import listdir                # NEW! paths and directories
import statsmodels.formula.api as smf # NEW! model building

# setting pandas print options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


# specifying file name
file = './datasets/ames_continuous.xlsx'


# reading the file into Python
housing = pd.read_excel(io = file)


# outputting the first ten rows of the dataset
housing.head(n = 10)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h3>Breaking Down the Path</h3><br>
There is tremendous value in understanding the path structure your computer uses to find things. Such knowledge is transferable across a wide variety of technical applications. Our current path is as follows:
<br><br>

~~~
./datasets/ames_continuous.xlsx
~~~

<br><br>
The syntax of this path can be interpreted as follows:
<br><br>

~~~
[FOLDER WHERE THIS NOTEBOOK IS] / [FOLDER WHERE THE DATASET IS] / [EXCEL FILE NAME].xlsx
~~~

<br><br>
If we were to write the path in human language, it would appear as follows:
<br><br><br>

~~~
Start in the folder where this Notebook is located...

\ and then

Navigate into the folder named "datasets", which is located in the same place as this Notebook... 

\ and then

Select the file named "ames_continuous.xlsx"

~~~
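
<br>
The same walk can be sketched in code. The example below is a minimal illustration rather than part of the course files: it builds a throwaway folder with Python's <em>tempfile</em> module, pretends that folder is where the Notebook lives, and then checks how the relative path resolves. The placeholder file is empty and only stands in for the real dataset.

```python
import os
import tempfile

# building a throwaway folder layout that mirrors the one described above
# (these are stand-ins created for illustration, not the real course files)
with tempfile.TemporaryDirectory() as notebook_dir:
    os.mkdir(os.path.join(notebook_dir, "datasets"))
    demo_file = os.path.join(notebook_dir, "datasets", "ames_continuous.xlsx")
    open(demo_file, mode="w").close()  # empty placeholder file

    # pretending the temporary folder is where the Notebook lives
    original_dir = os.getcwd()
    os.chdir(notebook_dir)

    try:
        relative_path = "./datasets/ames_continuous.xlsx"

        # does the relative path point at something real?
        path_exists = os.path.exists(relative_path)

        # what the relative path expands to on this machine
        absolute_path = os.path.abspath(relative_path)

        # contents of the datasets folder
        folder_contents = os.listdir("./datasets")

    finally:
        # returning to the original working directory
        os.chdir(original_dir)

print(path_exists)
print(folder_contents)
```

Because the path starts with a single dot, it is resolved relative to the current working directory. This is why the same string keeps working if the whole project folder is moved to another location.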

<br><br>
The contents of each part of the directory can be observed with the help of the <a href="https://docs.python.org/3/library/os.html#os.listdir">list directory function</a> (listdir) coming from <a href="https://docs.python.org/3/library/os.html">the os package</a>.

In [None]:
# calling help on listdir (from the os package)
help(listdir)

<br>

In [None]:
# everything in this Notebook's folder (current directory)
for item in listdir(path="."): # one dot
    print(item)

<br>

In [None]:
# going backwards in the path (parent directory)
for item in listdir(path=".."): # two dots
    print(item)

<br>

In [None]:
# checking what's in the datasets folder
for item in listdir(path="./datasets"): # ./[folder name]
    print(item)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h4>b) Navigate to the <em>script_images</em> folder.</h4>

In [None]:
# printing all files in the script images directory
_____

In [None]:
# printing all files in the script images directory
for item in listdir(path="./script_images"): # into script_images folder
    print(item)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

It looks like there's a .gif in this folder. Let's check it out!

<h4>c) Complete the code and copy/paste it into the markdown cell below.</h4><br>

~~~
![dude_gif](./script_images/_____._____)
~~~


~~~
--------------------------------------
 CLICK HERE TO OPEN THE MARKDOWN CELL 
--------------------------------------
~~~

![dude_gif](./script_images/dude_wheres_my_car.gif)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part II: Initial Exploration of the Dataset</h2><br>

<h4>a) How many observations (rows) are present in the dataset? How many features (columns)?</h4><br>
Use the following code to complete the formatted string (an f-string) that prints the number of observations and the number of features.

In [None]:
# formatting and printing the dimensions of the dataset
print(f"""
Size of Original Dataset
------------------------
Observations (rows): {housing.shape[0]}
Features (columns) : {housing.shape[1]}
""")
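
<br>
As a quick refresher on why the indexing above works: the shape attribute of a DataFrame is a tuple in the form (rows, columns). The sketch below uses a tiny made-up DataFrame (the values are invented for illustration) to show the tuple being unpacked and placed into an f-string.

```python
import pandas as pd

# tiny made-up dataset (values invented for illustration)
toy = pd.DataFrame({"Sale_Price"  : [200000, 150000, 300000],
                    "Gr_Liv_Area" : [1500, 1100, 2200]})

# shape is a tuple: (number of rows, number of columns)
n_rows, n_cols = toy.shape

# curly braces in an f-string are replaced by the value inside them
summary = f"Observations (rows): {n_rows}\nFeatures (columns) : {n_cols}"

print(summary)
```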

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h4>b) Access general information about each feature, including data types and the number of non-missing values.</h4>

In [None]:
# INFOrmation about each variable
housing.info(verbose = True)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
The dataset has missing values. This is definitely something of interest that we will take care of in a later script.<br><br>
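If a per-feature count of missing values is easier to read than the info(&nbsp;) output, one possible approach is to chain isnull(&nbsp;) and sum(&nbsp;). The DataFrame below is made up for illustration and is not the Ames data.

```python
import numpy as np
import pandas as pd

# made-up data with a few gaps (not the Ames data)
toy = pd.DataFrame({"Sale_Price"   : [200000, 150000, 300000],
                    "Mas_Vnr_Area" : [112.0, np.nan, 0.0],
                    "Garage_Area"  : [528.0, 730.0, np.nan]})

# isnull() marks each missing cell as True; sum() counts the Trues per column
missing_counts = toy.isnull().sum()

print(missing_counts)
```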
<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
<h3>Analyzing the Distribution of Sale Prices</h3><br>
Notice how the Y-variable (<em>&nbsp;Sale_Price&nbsp;</em>) is encoded as an integer as it does not contain decimal places. While this is common in real estate pricing, it introduces a violation of continuity, which is important for predictive models like linear regression. A truly continuous variable should, in theory, have an infinite range and an infinite number of decimal places. Our Y-variable does not meet either condition. However, we must keep in mind that statistics and real-world applications are expected to have a certain degree of misalignment.
<br><br>
This is one of the many reasons that we do not expect our predictions to be perfect (for example, our predicted sale prices will have decimal places). We do, however, expect to develop a general understanding as to what features affect the sale price of a house in Ames, Iowa. The word <em>general</em> is important as base models are often built using one of several <a href="https://en.wikipedia.org/wiki/Generalized_linear_model">generalized linear models</a>.
<br><br>
Note that a <strong>y-variable</strong> is often referred to as a <strong>response variable</strong> or a <strong>dependent variable</strong>. Think of this in terms of the following:<br>

* Question: How much is the sale price of a home in Ames, Iowa?<br>
* <strong>Response</strong>: Well, it <strong>depends</strong> on the features of each house.

<br>
Additional names for the X- and y-variables can be found in <a href="https://www.statsmodels.org/stable/endog_exog.html">the User Guide for statsmodels</a>.
<br><br>
Next, we will use a histogram to visualize <em>Sale_Price</em>. We are hoping to find a normal distribution that is symmetrical. Symmetry is important for modeling with straight lines, as in linear regression.
<br>
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<br>
<h4>e) Develop a histogram to analyze the distribution of the Y-variable.</h4><br>
Does it look as if this variable is normally distributed? Does it appear to be skewed positive or negative? The following help(&nbsp;) file may be useful in completing this task.

In [None]:
# documentation for making histograms with seaborn
help(sns.histplot)

<br>

In [None]:
# developing a histogram using HISTPLOT
sns.____(data  = ____,
         x     = ____,
         kde   = True)


# title and axis labels
plt.title(label   = "Distribution of Housing Sale Prices")
plt.xlabel(xlabel = "Sale Price") # avoiding using dataset labels
plt.ylabel(ylabel = "Count")


# displaying the histogram
____.____()

In [None]:
# developing a histogram using HISTPLOT
sns.histplot(data   = housing,
             x      = 'Sale_Price',
             kde    = True)


# title and axis labels
plt.title(label   = "Distribution of Housing Sale Prices")
plt.xlabel(xlabel = "Sale Price") # avoiding using dataset labels
plt.ylabel(ylabel = "Count")

# displaying the histogram
plt.show()

<br>
As can be observed from the histogram above, sale prices are skewed positive. This is also something of interest that we will take care of in a later script. For now, let's move forward, as the distribution of sale prices appears to be close enough to normal for a base model.
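
<br>
A histogram is not the only way to check for skewness. As a rough numeric cross-check, we can compare the mean to the median and call the skew(&nbsp;) method from pandas. The prices below are made up for illustration: a positive skew pulls the mean above the median and produces a skewness value greater than zero.

```python
import pandas as pd

# made-up prices with one expensive house in the right tail
prices = pd.Series([100_000, 120_000, 130_000, 150_000, 160_000, 400_000])

# the right tail pulls the mean above the median
mean_price   = prices.mean()
median_price = prices.median()

# skewness above zero indicates a positive (right) skew
skewness = prices.skew()

print(f"Mean    : {mean_price:,.0f}")
print(f"Median  : {median_price:,.0f}")
print(f"Skewness: {skewness:.2f}")
```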

<h4>a) Complete the code below to generate descriptive statistics, rounded to two decimal places.</h4>

In [None]:
# descriptive statistics for numeric data
housing_stats = housing.iloc[ :, 1: ]._____(include = 'number')._____


# checking results
housing_stats

In [None]:
# descriptive statistics for numeric data
housing_stats = housing.iloc[ :, 1: ].describe(include = 'number').round(decimals = 2)


# checking results
housing_stats

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

Let's subset the results to focus on the distributions of each feature.

In [None]:
# analyzing feature distributions
housing_stats.iloc[ 3:, : ].round(decimals = -2) # negative rounding
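
<br>
A quick note on negative rounding: a positive value for decimals rounds to the right of the decimal point, while a negative value rounds to the left of it. Thus, decimals = -2 rounds to the nearest hundred. The sketch below uses made-up numbers to show the effect.

```python
import pandas as pd

# made-up values for illustration
values = pd.Series([1234.56, 87.0, 951.0])

# decimals = 2 would keep two decimal places
# decimals = -2 rounds to the nearest hundred instead
rounded = values.round(decimals = -2)

print(rounded.tolist())
```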

<br>

Everything is looking good with the exception of pool areas (&nbsp;<em>Pool_Area</em>&nbsp;). The dataset might not have enough houses with pools for this feature to be useful in base modeling.

<h4>b) Create a frequency table for <em>Pool_Area</em> using value_counts(&nbsp;).</h4>

In [None]:
# frequency table for Pool_Area
_____

In [None]:
# frequency table for Pool_Area
housing['Pool_Area'].value_counts()

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
<strong>c) Modify the code to show:</strong>
    
* True if a pool area is greater than zero.
* False if a pool area is equal to zero.

In [None]:
# house has pool True or False


In [None]:
# house has pool True or False
(housing['Pool_Area'] > 0).value_counts(sort      = True,
                                        ascending = True)
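
<br>
Since True is treated as 1 and False as 0, the same boolean comparison can also tell us what share of houses have a pool. The example below is a small sketch on made-up pool areas, not the Ames data.

```python
import pandas as pd

# made-up pool areas: one house out of ten has a pool
pool_area = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 512, 0])

# comparison returns a Series of True/False values
has_pool = pool_area > 0

# counting Trues and Falses
pool_counts = has_pool.value_counts()

# the mean of a boolean Series is the proportion of True values
pool_share = has_pool.mean()

print(pool_counts)
print(f"Share of houses with a pool: {pool_share:.0%}")
```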

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

There aren't enough houses with pools in the dataset. Thus, we will need to drop <em>Pool_Area</em> until more data is collected. This is also a good time to drop <em>property_id</em> as it does not serve an analytical purpose in our situation.

In [None]:
# dropping property_id and Pool_Area
housing.drop(columns = ['property_id', 'Pool_Area'],
             inplace = True,
             errors  = 'ignore')

<br>

In [None]:
# checking results
housing.columns
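
<br>
The errors = 'ignore' argument is what makes the drop cell safe to run more than once. Without it, pandas raises a KeyError when asked to drop a column that is already gone. The sketch below demonstrates this on a small made-up DataFrame.

```python
import pandas as pd

# made-up data for illustration
toy = pd.DataFrame({"property_id" : [1, 2],
                    "Pool_Area"   : [0, 0],
                    "Sale_Price"  : [200000, 150000]})

# first run: both columns are dropped
toy.drop(columns = ["property_id", "Pool_Area"],
         inplace = True,
         errors  = "ignore")

# second run: the columns are already gone, but no error is raised
toy.drop(columns = ["property_id", "Pool_Area"],
         inplace = True,
         errors  = "ignore")

remaining_columns = list(toy.columns)

# without errors = "ignore", a missing column raises a KeyError
try:
    toy.drop(columns = ["Pool_Area"])
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(remaining_columns)
print(raised_key_error)
```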

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
<h2>Part III: Base Modeling</h2><br>
It's time to develop a base model. Base models are very important as they:

* Allow us to confirm our original (base) data, assumptions, and domain knowledge. Think of this as a common sense test. If the findings from our base model don't make sense, we likely need new data, different assumptions about the relationships between our features, or better domain knowledge before moving forward.
* Provide a benchmark to compare to more complex models. Additionally, as models get more complex, they tend to get less interpretable and harder to act on.
* Are built with features that follow the assumptions of the type of model we are using. This will be discussed in more detail in class.

<br>
Base modeling will also help us understand the value of the analytical techniques covered throughout this course (missing value analysis, feature engineering, etc.). To get started, let's analyze the linear correlations between our <em>Sale_Price</em> and our X-features. This will help us find good X-candidates for our model.
<br>
<h4>a) Complete the code below and analyze the correlations with <em>Sale_Price</em>.</h4>

In [None]:
# developing a correlation matrix
housing_corr = housing.____(method = 'pearson')


# filtering results to show correlations with Sale_Price
housing_corr.loc[ : , ____].round(decimals = 2).sort_values(ascending = False)

In [None]:
# developing a correlation matrix
housing_corr = housing.corr(method = 'pearson')


# filtering results to show correlations with Sale_Price
housing_corr.loc[ : , 'Sale_Price'].round(decimals = 2).sort_values(ascending = False)
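
<br>
One thing to keep in mind when reading a sorted list of correlations: a strong negative correlation is just as useful for prediction as a strong positive one, yet it lands at the bottom of a descending sort. A possible workaround is to sort by absolute value, as sketched below. The column names House_Age and Lot_Noise, along with all of the values, are made up for illustration.

```python
import pandas as pd

# made-up data: living area rises with price, age falls with price
toy = pd.DataFrame({"Sale_Price"  : [100, 150, 200, 250, 300],
                    "Living_Area" : [900, 1200, 1500, 2000, 2400],
                    "House_Age"   : [50, 40, 30, 15, 5],
                    "Lot_Noise"   : [3, 1, 4, 1, 5]})

# Pearson correlations with the response variable
toy_corr = toy.corr(method = 'pearson').loc[ : , 'Sale_Price']

# ranking by strength of relationship, regardless of direction
ranked = toy_corr.sort_values(key       = lambda s: s.abs(),
                              ascending = False)

print(ranked.round(decimals = 2))
```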

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h4>b) Develop a scatter plot between <em>Sale_Price</em> and the feature with the strongest correlation to <em>Sale_Price</em>.</h4>

In [None]:
# setting figure size
fig, ax = plt.subplots(figsize = (9, 6))


# developing a scatterplot
sns.____(x    = ____,
         y    = ____,
         data = ____)


# SHOWing the results
____.____()

In [None]:
# setting figure size
fig, ax = plt.subplots(figsize = (9, 6))


# developing a scatterplot
sns.scatterplot(x    = 'Gr_Liv_Area',
                y    = 'Sale_Price',
                data = housing)


# SHOWing the results
plt.show()

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h3>Building A Base Model</h3><br>
The following code has been provided for you. Its purpose is to provide a basic framework for developing a predictive model in Python using the <a href="https://www.statsmodels.org/stable/index.html">statsmodels</a> package. Keep in mind that there are several techniques we can use to improve this model, which we will cover in later sessions.

In [None]:
## using the statsmodels package ##

# Step 1: INSTANTIATE a model object
lm_best = smf.ols(formula = """Sale_Price ~ Gr_Liv_Area""",
                  data = housing)


# Step 2: FIT the data into the model object
results = lm_best.fit()


# Step 3: analyze the SUMMARY output
print(results.summary())
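
<br>
The summary output is convenient to read, but the fitted results object also stores its key numbers as attributes, which is handy when comparing several models. The sketch below fits the same kind of one-feature model on a tiny made-up dataset (values invented for illustration) and pulls out the coefficients, p-values, and adjusted R-squared.

```python
import pandas as pd
import statsmodels.formula.api as smf

# tiny made-up dataset (values invented for illustration)
toy = pd.DataFrame({"Gr_Liv_Area" : [900, 1200, 1500, 1800, 2100, 2400],
                    "Sale_Price"  : [110000, 138000, 172000,
                                     195000, 228000, 262000]})

# instantiating and fitting in one line
toy_results = smf.ols(formula = "Sale_Price ~ Gr_Liv_Area",
                      data    = toy).fit()

# pandas Series indexed by term name ('Intercept', 'Gr_Liv_Area')
coefficients = toy_results.params
p_values     = toy_results.pvalues

# a single number
adj_r_squared = toy_results.rsquared_adj

print(coefficients.round(decimals = 2))
print(p_values.round(decimals = 4))
print(round(adj_r_squared, 3))
```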

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
<h2>Challenge</h2>

<h4>Develop an optimal base model using more than one X-feature.</h4><br>
Your task is to find the combination of X-features that maximizes adjusted R-squared, where all coefficients have p-values of $\leq$ 0.05.
<br><br><br>
<strong><u>Tips</u></strong>

* A common approach is to start with all of the X-features in the model and remove insignificant ones one at a time (also known as backward selection).
* If a feature is removed from a model, expect the p-values for other features to change. Try not to remove too many at a time and test out different combinations.
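
<br>
The backward selection tip can also be automated. The sketch below is one possible way to do it, not the required approach: the function name backward_elimination is hypothetical, and the data is synthetic (only x1 and x2 actually drive y). Note that this loop only enforces the p-value threshold. It does not guarantee the highest possible adjusted R-squared, so testing different combinations by hand is still worthwhile.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic data (made up for illustration): only x1 and x2 drive y
rng = np.random.default_rng(seed = 219)

toy = pd.DataFrame({"x1" : rng.normal(size = 200),
                    "x2" : rng.normal(size = 200),
                    "x3" : rng.normal(size = 200)})

toy["y"] = 3 * toy["x1"] - 2 * toy["x2"] + rng.normal(size = 200)


def backward_elimination(data, response, features, alpha = 0.05):
    """Removes the least significant feature, one at a time, until
    every remaining p-value is at or below alpha."""
    features = list(features)

    while features:
        # refitting with the features that are still in play
        formula = f"{response} ~ {' + '.join(features)}"
        results = smf.ols(formula = formula, data = data).fit()

        # finding the feature with the largest p-value
        p_values = results.pvalues.drop(labels = "Intercept")
        worst    = p_values.idxmax()

        # stopping once everything is significant
        if p_values[worst] <= alpha:
            return features, results

        # otherwise, removing one feature and trying again
        features.remove(worst)

    return features, None


kept, final_results = backward_elimination(data     = toy,
                                           response = "y",
                                           features = ["x1", "x2", "x3"])

print(kept)
print(round(final_results.rsquared_adj, 3))
```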

In [None]:
# Step 1: INSTANTIATE a model object
lm_best = smf.ols(formula =  """Sale_Price ~ ____ +
                                             ____ +""",
                                data = housing)


# Step 2: FIT the data into the model object
results = lm_best.fit()


# Step 3: analyze the SUMMARY output
print(results.summary())

In [None]:
## Sample Solution ##

# Step 1: INSTANTIATE a model object
lm_best = smf.ols(formula =  """Sale_Price ~ Mas_Vnr_Area  +
                                             Total_Bsmt_SF +
                                             First_Flr_SF  +
                                             Second_Flr_SF +
                                             Garage_Area   +
                                             Porch_Area""",
                                data = housing)


# Step 2: FIT the data into the model object
results = lm_best.fit()


# Step 3: analyze the SUMMARY output
print(results.summary())

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
<h2>Part IV: statsmodels OLS Study Sheet</h2><br>
Below is a summary of the OLS regression output from statsmodels.
<br>

<br><br>
<div style = "width:image width px; font-size:80%; text-align:center;">
<br>
<img src="./script_images/statsmodels_OLS_output_2.png" width="800" height="500" style="padding-bottom:0.5em;"><em>Figure 1a: statsmodels OLS Regression Output Study Sheet - Part I</em>
<br><br><br><hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br><br>
<img src="./script_images/statsmodels_OLS_output_3.png" width="800" height="500" style="padding-bottom:0.5em;"><em>Figure 1b: statsmodels OLS Regression Output Study Sheet - Part II</em>
</div>

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

~~~
 _    _      _                          
| |  | |    | |                         
| |  | | ___| | ___ ___  _ __ ___   ___ 
| |/\| |/ _ \ |/ __/ _ \| '_ ` _ \ / _ \
\  /\  /  __/ | (_| (_) | | | | | |  __/
 \/  \/ \___|_|\___\___/|_| |_| |_|\___|
                                        
                                        
______            _    _ _ _            
| ___ \          | |  | | | |           
| |_/ / __ _  ___| | _| | | |           
| ___ \/ _` |/ __| |/ / | | |           
| |_/ / (_| | (__|   <|_|_|_|           
\____/ \__,_|\___|_|\_(_|_|_)           
                                        
~~~

<br><br><hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br>