[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%2010%20Notebooks/GDAN%205400%20-%20Week%2010%20Notebooks%20%28IV%29%20-%20Task%209.ipynb)

This notebook provides a mini-tutorial on task #9 of Coding Assignment #5.

---

### Overview of Coding Assignment 5

In the fifth assignment, we are switching to another competition on *Kaggle*, an online platform for data science and machine learning that provides datasets, competitions, collaborative notebooks, and learning resources.

In the assignment, you will complete the following tasks:

- Task 1: Join the Kaggle Competition  
- Task 2: Load the Housing Prices `Training` Dataset  
- Task 3: Identify Variables with Missing Data  
- Task 4: Fill in Missing Values for `LotFrontage`
- Task 5: Explore the Data with Histograms  
- Task 6: Generate an Automated Data Report  
- Task 7: Create a Binary Variable `2+ Car Garage` from `GarageCars`  
- Task 8: Prepare the Data for Modeling
- Task 9: Train and Evaluate at Least Three Models
- Task 10: Make Predictions on `test.csv` and Generate Submission File


These exercises will help strengthen your ability to explore, preprocess, and model real-world datasets using machine learning. You will gain hands-on experience with data cleaning, feature engineering, and predictive modeling, all while working with a classic dataset in a competitive Kaggle environment.

<br> Read in the Usual Packages and Set Up the Environment

In [None]:
import numpy as np
import pandas as pd

#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)  #Set PANDAS to show all columns in DataFrame
pd.set_option('display.max_colwidth', 500)  #Show up to 500 characters per cell (full option name)

# Task 1: Join the Kaggle Competition  

In [None]:
kaggle_displayname = input("Enter your Kaggle Display Name: ")
print(f"Your Kaggle name is: {kaggle_displayname}")

# Task 2: Load the Housing Prices `Training` Dataset
I have uploaded the training and test datasets onto the class GitHub repository.

In [None]:
train_url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/refs/heads/main/Housing_Prices/train.csv'
train = pd.read_csv(train_url)
print('# of rows in training dataset:', len(train), '\n')
train[:2]

# Task 3: Identify Variables with Missing Data  
- Determine which variables contain missing values in the dataset using any acceptable method

*Hints:*
- You can use the `.info()` method, `.isnull().sum()`, `.isna().sum()`, or `.describe()`

In [None]:
train.info()

In [None]:
train.isnull().sum()[train.isnull().sum() > 0]

# Task 4: Fill in Missing Values for `LotFrontage`
- The `LotFrontage` column contains missing values that must be filled before modeling.  
- Use the **median** value to replace missing values, as it is less affected by outliers.  
- After filling in the missing values, verify that `LotFrontage` no longer has any missing entries.  
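
To see why the median is the safer fill value, here is a minimal sketch on made-up lot-frontage numbers (the values and the name `toy` are hypothetical, not taken from the dataset). One extreme value pulls the mean upward but leaves the median where it was:

```python
import numpy as np
import pandas as pd

# Hypothetical lot-frontage values (not from the dataset); 300 is an outlier
toy = pd.Series([60, 65, 70, 75, 300, np.nan])

toy_mean = toy.mean()      # pulled upward by the outlier
toy_median = toy.median()  # unaffected by the outlier

print("Mean:", toy_mean)
print("Median:", toy_median)

# Fill the missing entry with the median, as Task 4 asks
toy_filled = toy.fillna(toy_median)
print("Missing after fill:", toy_filled.isnull().sum())
```

Both `.mean()` and `.median()` skip missing values by default, so the `NaN` does not affect either statistic.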

In [None]:
print("Missing values in LotFrontage column:", train["LotFrontage"].isnull().sum())

In [None]:
train['LotFrontage'] = train['LotFrontage'].fillna(train["LotFrontage"].median())
print("Missing values in LotFrontage column:", train["LotFrontage"].isnull().sum())

# Task 5: Explore the Data with Histograms  
- Generate histograms for all **numeric features** in the dataset.  
- Use these histograms to understand the distribution of key variables in the dataset.
- **Tips:** 
  - Instead of plotting separate histograms for each variable, use the **shortcut method** we covered in class to generate all histograms at once.
  - Make sure to read in the plotting packages (*hint*: there are two relevant import lines that we used in our Week 5 through Week 8 notebooks)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

train.select_dtypes(include='number').hist(figsize=(13, 8))
plt.tight_layout()
plt.show()

# Task 6: Generate an Automated Data Report  
- Install and use `ydata-profiling` to create a detailed report of the dataset.  
- This report will provide insights into **missing values, distributions, correlations, and more**.  
- **Tip:** Instead of manually exploring each variable, use this **automated tool** to summarize the data in one step.  
- Save the report as an **HTML file** for easy viewing.

In [None]:
# Install ydata-profiling
!pip install ydata_profiling --quiet
# Import the report class
from ydata_profiling import ProfileReport

In [None]:
# Generate the report
profile = ProfileReport(train, title="Housing_Prices")

In [None]:
# Save the report to an HTML file
profile.to_file("housing_prices.html")

# Task 7: Create a Binary Variable `2+ Car Garage` from `GarageCars`
- The variable `GarageCars` is described as `Garage size in car capacity`
- Run frequencies on the variable before proceeding.
- Also, check that there are no missing values.
- *Hint:* `2+ Car Garage` should have values of only `0` and `1`. 
- Run a cross-tabulation on `2+ Car Garage` and `GarageCars` to ensure the new variable maps as expected.
- Ensure `2+ Car Garage` is not missing any values.

---

I will show you an example here using the variable `OverallQual`, from which we will create a binary variable called `High_Quality`.

From the full [codebook](https://github.com/gdsaxton/GDAN5400/blob/main/Housing_Prices/data_description.txt) we see that the values of `OverallQual` are the following:

OverallQual: Rates the overall material and finish of the house

       10	Very Excellent
       9	Excellent
       8	Very Good
       7	Good
       6	Above Average
       5	Average
       4	Below Average
       3	Fair
       2	Poor
       1	Very Poor

In [None]:
#Check for missing values
print("Missing values in OverallQual column:", train["OverallQual"].isnull().sum())

In [None]:
#Frequencies 
train['OverallQual'].value_counts().sort_index()

In [None]:
#Create the binary variable
train['High_Quality'] = train['OverallQual'].apply(lambda x: 1 if x >= 7 else 0)
train['High_Quality'].value_counts()

In [None]:
#Check for missing values
print("Missing values in High_Quality column:", train["High_Quality"].isnull().sum())

#### Task 7 also asks you to generate a *cross-tabulation*. In PANDAS, we do this with the `pd.crosstab()` command. 
You will put the two variables (`train['High_Quality']` and `train['OverallQual']`) in the parentheses. This command allows you to compare the values of the binary variable `High_Quality` with those of the ten-category variable `OverallQual`. We do this as a data-verification check: the only observations with a value of *1* on `High_Quality` should be those with values of *7 or higher* on `OverallQual`. That is in fact what we find below.

In [None]:
#Cross-tabulation to verify coding
pd.crosstab(train['High_Quality'], train['OverallQual'])

# Task 8: Prepare the Data for Modeling
- Select the **predictor variables (`X`)** and the **target variable (`y`)**.  
- Set `LotArea`, `LotFrontage`, `YearBuilt`, `1stFlrSF`, and `2+ Car Garage` as the features for prediction (`X`).  
- Split the data into **training (`X_train, y_train`)** and **validation (`X_val, y_val`)** sets using a standard 80/20 split.  
- Set `random_state=42` in your `train_test_split` command to ensure reproducibility.  

---

I will show the process here using four of the five variables from the assignment (the fifth, `2+ Car Garage`, is the one you create yourself in Task 7). 

In [None]:
features = ['LotArea', 'LotFrontage', 'YearBuilt', '1stFlrSF']
train[features].describe().T

In [None]:
X = train[features]
y = train['SalePrice']
print(X.shape, y.shape)

# Splitting training data into train and validation sets
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# Print out number of rows and columns in each of the four dataframes we just generated
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

# Task 9: Train and Evaluate at Least Three Models
- **Write a loop** to train **at least three different models** (e.g., linear regression) using the training data.  
- The predictor variables (`X`) used in the models are `LotArea`, `LotFrontage`, `YearBuilt`, `1stFlrSF`, and `2+ Car Garage`. 
- Use the trained model to make *predictions* on the validation set.  
- Evaluate the model’s performance using the **RMSLE** score.
  - *Hint:* We want the score to be as low as possible.
- Make sure to include the proper import statements from `sklearn`.
- Choose the model with the best performance.
  - *Hint:* You can either re-run your best-performing model separately, or extract it from any results dataframe you may have generated during the loop. 
  
--- 

### Explanation: Trying Out Different Models

In data analytics projects, we will typically test multiple machine learning models to compare their performance in predicting our outcome of interest, which in this case is housing prices. Instead of relying only on linear regression, we could also evaluate **linear models (Ridge, Lasso, ElasticNet), tree-based models (DecisionTree, RandomForest, XGBoost), and nonlinear models (SVR)**.  

By training each model on the same dataset and computing the **Root Mean Squared Logarithmic Error (RMSLE)** for validation predictions, we can determine which model generalizes best. This process helps us **identify the most accurate and robust approach** for this specific problem, guiding model selection for final predictions.  
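
As a quick illustration of what RMSLE measures, the sketch below computes it by hand as the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + actual)`, then checks the result against scikit-learn. The prices and the `_demo` variable names are made up for this example.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Hypothetical sale prices and predictions (made up for illustration)
y_true_demo = np.array([200000, 150000, 320000])
y_pred_demo = np.array([210000, 140000, 300000])

# RMSLE by hand: square root of the mean squared difference of log(1 + value)
manual_rmsle = np.sqrt(np.mean((np.log1p(y_pred_demo) - np.log1p(y_true_demo)) ** 2))

# Same quantity via scikit-learn
sklearn_rmsle = np.sqrt(mean_squared_log_error(y_true_demo, y_pred_demo))

print("Manual RMSLE: ", manual_rmsle)
print("sklearn RMSLE:", sklearn_rmsle)
```

Because the errors are taken on a log scale, RMSLE reflects relative rather than absolute differences: being off by 10,000 matters more on a 150,000 house than on a 320,000 one.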

We will be using the same **train-validation split and features** from Task 8 (`LotArea`, `LotFrontage`, `YearBuilt`, `1stFlrSF`), so we will not re-run those parts of the code.  

---

In the following example I will show you a loop in which we train three models:
- Linear Regression
- Decision Tree regressor
- Support Vector Regression (SVR)

#### Linear Regression (The Straight-Line Approach)
- **How it works**: Assumes that house prices change in a **straight-line relationship** with each feature (e.g., each additional year in `YearBuilt` changes the predicted price by the same fixed amount).  
- **Pros**: Simple, interpretable.  
- **Cons**: Can't capture complex patterns.  
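
A minimal sketch of the "fixed amount" idea, using synthetic data in which the price rises by exactly 1,000 per year (the numbers and the names `years`, `prices`, and `lin_demo` are made up for illustration). The fitted coefficient recovers that per-year change:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (made up): price = 50,000 + 1,000 for each year after 1950
years = np.arange(1950, 2011, 10).reshape(-1, 1)
prices = 50000 + 1000 * (years.ravel() - 1950)

lin_demo = LinearRegression().fit(years, prices)

# The slope is the fixed price change per additional year
print("Price change per year:", lin_demo.coef_[0])
```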


---

#### Decision Tree (The Rule-Based Approach)
- **How it works**: Think of this model as a series of **Yes/No questions** that split the data into groups based on features.  
  - Example: *Is the house built after 2000?* → If yes, go to the next rule.  
- **Why it's useful**: Can handle **non-linear relationships** in the data.  
- **Cons**: Can overfit if the tree is too deep.  
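
To make the Yes/No idea concrete, here is a small sketch on synthetic data (the values and the names `year_built`, `price`, and `tree_demo` are made up). Limiting the tree to `max_depth=1` forces a single split, and `export_text` prints the learned rule:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data (made up): newer houses sell for more
year_built = np.array([[1960], [1975], [1990], [2005], [2010], [2015]])
price = np.array([100000, 110000, 120000, 250000, 260000, 270000])

# max_depth=1 allows a single Yes/No split
tree_demo = DecisionTreeRegressor(max_depth=1, random_state=42).fit(year_built, price)

# Print the learned rule as text
print(export_text(tree_demo, feature_names=["YearBuilt"]))

# Each side of the split predicts the average price of its group
print(tree_demo.predict([[1980], [2008]]))
```

With no depth limit, the tree keeps splitting until each leaf matches the training data closely, which is where the overfitting risk comes from.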

---

#### Support Vector Regression (SVR)
- **How it works**: Instead of fitting a single best-fit line, SVR finds a **small range (margin)** where most predictions will fall.  
- **Why it's useful**: Handles **nonlinear relationships** better than standard regression.  
- **Cons**: Can be slow on large datasets; it also had the highest RMSLE of the three models in our analysis.  
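
The sketch below, on made-up data, illustrates one likely reason default SVR struggles with a target on the scale of house prices. With the default `C=1.0`, the fitted predictions can move only a limited distance from a constant, so they come out nearly flat. The data and the names `X_demo`, `y_demo`, and `svr_demo` are assumptions for illustration, and the link to the result in this notebook is a plausible explanation rather than a confirmed one.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic data (made up): one feature, target on the scale of house prices
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 100, size=(200, 1))
y_demo = 100000 + 1000 * X_demo.ravel()

# Default SVR (RBF kernel, C=1.0), the same settings used in this notebook's loop
svr_demo = SVR().fit(X_demo, y_demo)
preds = svr_demo.predict(X_demo)

# Compare how much the actual values vary with how much the predictions vary
print("Spread of actual values:  ", y_demo.max() - y_demo.min())
print("Spread of SVR predictions:", preds.max() - preds.min())
```

Scaling the features and target, or tuning `C`, are common remedies, but they are outside the scope of this assignment.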

---


#### **📊 Summary of Models**
| Model | Purpose |
|-------|---------|
| **Linear Regression** | Baseline model, assumes a linear relationship | 
| **Decision Tree** | Splits data using **rule-based conditions** | 
| **SVR** | Uses a **flexible margin** instead of a single line | 

---


#### Key Question: Which Model Will Perform Best?

As you will see below, of these three, `Linear Regression` provides the best results.

---


#### Code Explanation

We are doing some new things in the following code block, so I will walk you through the key parts.

1. **Define a dictionary of models (`models`)**  
   - Several regression models are stored in a dictionary with their names as keys and model objects as values.  
   - The models in this example are **Linear Regression, Decision Tree, and SVR**; the same pattern extends to others such as Ridge, Lasso, ElasticNet, Random Forest, or XGBoost.  

2. **Create an empty list (`results`)**  
   - This list will store the performance of each model.  

3. **Loop through each model, train it, and evaluate it**  
   - For each model:  
     - It is trained using `X_train` and `y_train`.  
     - It makes predictions on `X_val` (validation data).  
     - The **Root Mean Squared Logarithmic Error (RMSLE)** is calculated to measure model performance.  
     - The model's name and its RMSLE score are added to the `results` list. 
   - Note that this is a basic `for` loop. What is new is that we are looping not over a `list`, which you have seen before, but over a `dictionary`, so we use the `.items()` method.
     - The `.items()` method is available on dictionary objects such as our `models` dictionary, in which:
       - `Keys` are the model names (e.g., `'LinearRegression'`, `'DecisionTree'`).
       - `Values` are the actual model objects (e.g., `LinearRegression()`, `DecisionTreeRegressor()`).
       - So, in the line `for name, model in models.items():`, `models.items()` yields the dictionary's key-value pairs one at a time.
         - `for name, model` unpacks each pair:
           - `name` gets the model's name (a string).
           - `model` gets the actual model instance (e.g., `LinearRegression()`).

4. **Convert results into a Pandas DataFrame (`results_df`)**  
   - The results are stored in a DataFrame and sorted in **ascending order by RMSLE**, so the best-performing model appears first.  
   - For a refresher on *RMSLE* see the Week 9 materials.
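
The dictionary loop described in step 3 can be tried on its own before running the full code. In this sketch, plain strings stand in for the model objects, and the names `demo_models`, `demo_name`, `demo_model`, and `seen` are hypothetical, chosen so they do not overwrite anything else in the notebook:

```python
# Hypothetical dictionary with strings standing in for model objects
demo_models = {
    'LinearRegression': 'linear model object',
    'DecisionTree': 'tree model object',
}

seen = []
for demo_name, demo_model in demo_models.items():
    # demo_name receives the key, demo_model receives the value
    print(demo_name, '->', demo_model)
    seen.append(demo_name)
```

Dictionaries preserve insertion order, so the pairs come out in the order they were defined.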
   
#### Why Are We Doing the Code This Way?
This approach allows us to **compare multiple models efficiently** and determine which one gives the best predictions for house prices. It helps us make an informed decision on **which model to use in the final analysis**.     
   
   
---




In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_log_error

# Define RMSLE scoring function
def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, np.maximum(y_pred, 0)))  # Ensure predictions are non-negative

# Define models with default parameters
models = {
    'LinearRegression': LinearRegression(),
    'DecisionTree': DecisionTreeRegressor(),
    'SVR': SVR(),
}

# Create empty list for storing results
results = []

#Train each model in a loop, saving the model name and RMSLE score to the results list
for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    rmsle_score = rmsle(y_val, y_pred)
    results.append({'Model': name, 'RMSLE': rmsle_score})

# Convert results to DataFrame; sort by RMSLE and output
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('RMSLE')
results_df

### Select the Best Model for Generating Updated Submission File
First, select the best model and regenerate predictions on the validation set without retraining.

In [None]:
# Retrieve the trained model without retraining
model = models['LinearRegression']   # No re-fitting, just using the stored model
#model.fit(X_train, y_train)  #If you want to retrain the model, uncomment this line

# Generate predictions on validation set
val_predictions = model.predict(X_val) # Model is already trained, just predict

# Evaluate model performance on validation data
rmsle_score = rmsle(y_val, val_predictions)
print("RMSLE:", rmsle_score)

<br>Alternative: Extract best model from `results_df` programmatically

In [None]:
# Extract the best model from results_df based on lowest RMSLE
best_model_name = results_df.loc[results_df['RMSLE'].idxmin(), 'Model']
print(f"Best Model: {best_model_name}")

model = models[best_model_name] # No re-fitting, just using the stored model
#model.fit(X_train, y_train)  #If you want to retrain the model, uncomment this line

# Generate predictions on validation set
val_predictions = model.predict(X_val) # Model is already trained, just predict

# Evaluate model performance on validation data
rmsle_score = rmsle(y_val, val_predictions)
print("RMSLE:", rmsle_score)