<p><a href="https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Coding%20Assignment%205/GDAN%205400%20-%20Coding%20Assignment%205.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></p>

# Coding Assignment #5

In this fifth assignment, we are switching to another competition on *Kaggle*, an online platform for data science and machine learning that provides datasets, competitions, collaborative notebooks, and learning resources.

---

In this assignment, you will:

1. **Join the Kaggle Competition** – Sign up for the [Housing Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) competition on Kaggle.
2. **Load the Data** – Read in the `train.csv` and `test.csv` files using PANDAS.
3. **Explore the Data** – Use methods like `df.info()`, histograms, and plots to understand the dataset.
4. **Handle Missing Data** – Identify and fill missing values appropriately.
5. **Prepare Data for Modeling** – Convert categorical variables to numeric, select relevant features, and split the data into training and validation sets.
6. **Train an Appropriate Regression Model** – Fit three models that are appropriate for a 'regression' problem in order to predict the sale price. Choose the best one.
7. **Evaluate the Model** – Measure prediction error (RMSLE) on the validation set.
8. **Make Predictions on the Test Set** – Apply the trained model to generate sale price predictions for Kaggle's test data.
9. **Create a Submission File** – Format the predictions into a CSV file.
10. **Upload to Kaggle** – Submit your predictions and check your score.


These exercises will help strengthen your ability to explore, preprocess, and model real-world datasets using machine learning. You will gain hands-on experience with data cleaning, feature engineering, and predictive modeling, all while working with a classic dataset in a competitive Kaggle environment.


---


As a reference, here is a data dictionary describing the variables you will find in the dataset:

---


| **Feature**         | **Description** |
|---------------------|----------------|
| SalePrice      | Property's sale price in dollars (target variable) |
| MSSubClass     | Building class |
| MSZoning       | General zoning classification |
| LotFrontage    | Linear feet of street connected to property |
| LotArea        | Lot size in square feet |
| Street         | Type of road access |
| Alley          | Type of alley access |
| LotShape       | General shape of property |
| LandContour    | Flatness of the property |
| Utilities      | Type of utilities available |
| LotConfig      | Lot configuration |
| LandSlope      | Slope of property |
| Neighborhood   | Physical locations within Ames city limits |
| Condition1     | Proximity to main road or railroad |
| Condition2     | Proximity to main road or railroad (if a second is present) |
| BldgType       | Type of dwelling |
| HouseStyle     | Style of dwelling |
| OverallQual    | Overall material and finish quality |
| OverallCond    | Overall condition rating |
| YearBuilt      | Original construction date |
| YearRemodAdd   | Remodel date |
| RoofStyle      | Type of roof |
| RoofMatl       | Roof material |
| Exterior1st    | Exterior covering on house |
| Exterior2nd    | Exterior covering on house (if more than one material) |
| MasVnrType     | Masonry veneer type |
| MasVnrArea     | Masonry veneer area in square feet |
| ExterQual      | Exterior material quality |
| ExterCond      | Present condition of exterior material |
| Foundation     | Type of foundation |
| BsmtQual       | Height of the basement |
| BsmtCond       | General condition of the basement |
| BsmtExposure   | Walkout or garden level basement walls |
| BsmtFinType1   | Quality of basement finished area |
| BsmtFinSF1     | Type 1 finished square feet |
| BsmtFinType2   | Quality of second finished area (if present) |
| BsmtFinSF2     | Type 2 finished square feet |
| BsmtUnfSF      | Unfinished square feet of basement area |
| TotalBsmtSF    | Total square feet of basement area |
| Heating        | Type of heating |
| HeatingQC      | Heating quality and condition |
| CentralAir     | Central air conditioning (Yes/No) |
| Electrical     | Electrical system type |
| 1stFlrSF       | First floor square feet |
| 2ndFlrSF       | Second floor square feet |
| LowQualFinSF   | Low quality finished square feet (all floors) |
| GrLivArea      | Above grade (ground) living area square feet |
| BsmtFullBath   | Basement full bathrooms |
| BsmtHalfBath   | Basement half bathrooms |
| FullBath       | Full bathrooms above grade |
| HalfBath       | Half bathrooms above grade |
| BedroomAbvGr   | Number of bedrooms above basement level |
| KitchenAbvGr   | Number of kitchens |
| KitchenQual    | Kitchen quality |
| TotRmsAbvGrd   | Total rooms above grade (excludes bathrooms) |
| Functional     | Home functionality rating |
| Fireplaces     | Number of fireplaces |
| FireplaceQu    | Fireplace quality |
| GarageType     | Garage location |
| GarageYrBlt    | Year garage was built |
| GarageFinish   | Interior finish of the garage |
| GarageCars     | Garage size in car capacity |
| GarageArea     | Garage size in square feet |
| GarageQual     | Garage quality |
| GarageCond     | Garage condition |
| PavedDrive     | Paved driveway presence |
| WoodDeckSF     | Wood deck area in square feet |
| OpenPorchSF    | Open porch area in square feet |
| EnclosedPorch  | Enclosed porch area in square feet |
| 3SsnPorch      | Three-season porch area in square feet |
| ScreenPorch    | Screen porch area in square feet |
| PoolArea       | Pool area in square feet |
| PoolQC         | Pool quality |
| Fence          | Fence quality |
| MiscFeature    | Miscellaneous feature not covered in other categories |
| MiscVal        | Dollar value of miscellaneous feature |
| MoSold         | Month sold |
| YrSold         | Year sold |
| SaleType       | Type of sale |
| SaleCondition  | Condition of sale |

---


Read in the Usual Packages and Set Up the Environment

In [None]:
import numpy as np
import pandas as pd

#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)  #Set PANDAS to show all columns in DataFrame
pd.set_option('max_colwidth', 500)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix

# **Instructions: Steps to Complete**

### Task 1: Join the Kaggle Competition  
- Join the [Housing Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) competition
- Enter your **Kaggle display name** in the input box below.  
- To help me find you on the leaderboard, please add **'GDAN5400'** to your display name (e.g., *"Gregory Saxton - GDAN5400"*).  
- **Important:** When running the code, you must enter a name and press **Enter** to proceed.  

In [None]:
kaggle_displayname = input("Enter your Kaggle Display Name: ")
print(f"Your Kaggle name is: {kaggle_displayname}")

### Task 2: Load the Housing Prices `Training` Dataset  
- Import the necessary libraries (**NumPy** and **PANDAS**).  
- Load the Housing Prices training dataset directly from a **GitHub URL** into a Pandas DataFrame. Name the dataset `train`.
- Print the **number of rows** in the dataset.  
- Display the **first two rows** to verify the data has loaded correctly.  
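
As a rough sketch of the loading steps above: `pd.read_csv()` accepts a URL string as well as a file path or buffer. The in-memory CSV below is a made-up stand-in (an assumption, not the course file) so the example runs on its own. In the assignment you would pass the GitHub URL instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the GitHub raw-file URL; in the assignment,
# pass that URL string to pd.read_csv() instead of this buffer.
toy_csv = io.StringIO(
    "Id,LotArea,LotFrontage,SalePrice\n"
    "1,8450,65,208500\n"
    "2,9600,80,181500\n"
    "3,11250,,223500\n"
)

train = pd.read_csv(toy_csv)

print(f"Number of rows: {len(train)}")  # len() of a DataFrame counts rows
print(train.head(2))                    # first two rows as a sanity check
```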

### Task 3: Identify Variables with Missing Data  
- **Part (i)**
  - Determine which variables contain missing values in the dataset using any acceptable method
- **Part (ii)**
  - Use the provided input function to manually enter the variables that have missing values, then hit `Enter` to continue. 

*Hints:*
- You can use the `.info()` method, `.isnull().sum()`, `.isna().sum()`, or `.describe()`
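
One possible way to do Part (i), shown on a small made-up DataFrame (the values are illustrative only): count the missing entries per column with `.isna().sum()` and keep only the columns where the count is above zero.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real training set
train = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [np.nan, "Pave", np.nan, np.nan],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Count missing values per column, then keep only columns with at least one
missing_counts = train.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)

# Column names you could then type into the input box for Part (ii)
missing_columns = missing_counts.index.tolist()
print(missing_columns)
```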

### Task 4: Fill in Missing Values for `LotFrontage`
- The `LotFrontage` column contains missing values that must be filled before modeling.  
- Use the **median** value to replace missing values, as it is less affected by outliers.  
- After filling in the missing values, verify that `LotFrontage` no longer has any missing entries.  
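
A minimal sketch of median imputation on made-up values. The outlier of 300 would pull the mean upward, but the median is unaffected by how extreme it is, which is the reason for preferring the median here.

```python
import numpy as np
import pandas as pd

# Toy column with two missing values and one outlier (300)
train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 70.0, np.nan, 300.0]})

# .median() skips NaN by default
median_frontage = train["LotFrontage"].median()
train["LotFrontage"] = train["LotFrontage"].fillna(median_frontage)

# Verify: the count of missing values should now be zero
print(train["LotFrontage"].isna().sum())
```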

### Task 5: Explore the Data with Histograms  
- Generate histograms for all **numeric features** in the dataset.  
- Use these histograms to understand the distribution of key variables in the dataset.
- **Tips:** 
  - Instead of plotting separate histograms for each variable, use the **shortcut method** we covered in class to generate all histograms at once.
  - Make sure to read in the plotting packages (*hint*: there are two relevant import lines we used in our Week 7 and Week 8 notebooks, as well as weeks 5 and 6)
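
For illustration, and assuming the shortcut meant here is the pandas `DataFrame.hist()` method (the class notebooks may use a different call), the sketch below draws one histogram per numeric column of a randomly generated toy DataFrame. Non-numeric columns such as `Street` are skipped automatically.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated toy data, not the real housing dataset
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "LotArea": rng.normal(10000, 3000, 200),
    "YearBuilt": rng.integers(1900, 2010, 200),
    "SalePrice": rng.lognormal(12, 0.4, 200),
    "Street": ["Pave"] * 200,  # non-numeric, so no histogram is drawn for it
})

# One call produces a grid with a histogram for every numeric column
axes = train.hist(bins=30, figsize=(10, 6))
plt.tight_layout()
plt.show()
```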

### Task 6: Generate an Automated Data Report  
- Install and use `ydata-profiling` to create a detailed report of the dataset.  
- This report will provide insights into **missing values, distributions, correlations, and more**.  
- **Tip:** Instead of manually exploring each variable, use this **automated tool** to summarize the data in one step.  
- Save the report as an **HTML file** for easy viewing.

### **Task 7: Create a Binary Variable `2+ Car Garage` from `GarageCars`**  
- The variable `GarageCars` is described as `Garage size in car capacity`
- Run frequencies on the variable before proceeding.
- Also, check that there are no missing values.
- *Hint:* `2+ Car Garage` should have values of only `0` and `1`. 
- Run a cross-tabulation on `2+ Car Garage` and `GarageCars` to ensure the new variable maps as expected.
- Ensure `2+ Car Garage` is not missing any values.
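
The checks above could look roughly like this on a made-up `GarageCars` column. The threshold of `>= 2` is inferred from the variable name `2+ Car Garage`.

```python
import pandas as pd

# Toy values standing in for the real GarageCars column
train = pd.DataFrame({"GarageCars": [0, 1, 2, 2, 3, 1, 2, 4]})

# Frequencies and missing-value check on the source variable
print(train["GarageCars"].value_counts(dropna=False).sort_index())
print(train["GarageCars"].isna().sum())

# A comparison with NaN evaluates to False, so a missing garage size would
# silently become 0; that is why the missing-value check comes first.
train["2+ Car Garage"] = (train["GarageCars"] >= 2).astype(int)

# Cross-tabulation: 0 and 1 cars should map to 0; 2 or more should map to 1
check = pd.crosstab(train["GarageCars"], train["2+ Car Garage"])
print(check)
print(train["2+ Car Garage"].isna().sum())
```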

### **Task 8: Prepare the Data for Modeling**  
- Select the **predictor variables (`X`)** and the **target variable (`y`)**.  
- Set `LotArea`, `LotFrontage`, `YearBuilt`, `1stFlrSF`, and `2+ Car Garage` as the features for prediction (`X`).  
- Split the data into **training (`X_train, y_train`)** and **validation (`X_val, y_val`)** sets using a standard 80/20 split.  
- Set `random_state=42` in your `train_test_split` command to ensure reproducibility.  
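
A sketch of the selection and split on randomly generated toy data (50 rows, so the 80/20 split yields 40 training and 10 validation rows). With the real data, `train` would be the DataFrame prepared in the earlier tasks.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomly generated toy data with the same column names as the assignment
rng = np.random.default_rng(0)
n = 50
train = pd.DataFrame({
    "LotArea": rng.integers(5000, 15000, n),
    "LotFrontage": rng.integers(40, 120, n).astype(float),
    "YearBuilt": rng.integers(1950, 2010, n),
    "1stFlrSF": rng.integers(600, 2000, n),
    "2+ Car Garage": rng.integers(0, 2, n),
    "SalePrice": rng.integers(100000, 300000, n),
})

features = ["LotArea", "LotFrontage", "YearBuilt", "1stFlrSF", "2+ Car Garage"]
X = train[features]     # predictor variables
y = train["SalePrice"]  # target variable

# test_size=0.2 gives the 80/20 split; random_state makes it reproducible
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_val.shape)
```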

### **Task 9: Train and Evaluate at Least Three Models**  
- **Write a loop** to train **at least three different models** (e.g., linear regression) using the training data.  
- The predictor variables (`X`) used in the models are `LotArea`, `LotFrontage`, `YearBuilt`, `1stFlrSF`, and `2+ Car Garage`. 
- Use each trained model to make *predictions* on the validation set.  
- Evaluate each model’s performance using the **RMSLE** (root mean squared logarithmic error) score.
  - *Hint:* We want the score to be as low as possible.
- Make sure to include the proper import statements from `sklearn`.
- Choose the model with the best performance.
  - *Hint:* You can either re-run your best-performing model separately, or extract it from any results dataframe you may have generated during the loop. 
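
One way the loop could be structured, shown on randomly generated toy data. The three models chosen here (linear regression, decision tree, random forest) are only an example selection, not a required set. RMSLE is computed as the square root of scikit-learn's `mean_squared_log_error`, which compares log(1 + prediction) with log(1 + actual). Predictions are clipped at zero first because that function rejects negative values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Randomly generated toy data, not the real housing dataset
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "LotArea": rng.integers(5000, 15000, n),
    "LotFrontage": rng.integers(40, 120, n).astype(float),
    "YearBuilt": rng.integers(1950, 2010, n),
    "1stFlrSF": rng.integers(600, 2000, n),
    "2+ Car Garage": rng.integers(0, 2, n),
})
y = 50000 + 80 * X["1stFlrSF"] + 20000 * X["2+ Car Garage"] + rng.normal(0, 10000, n)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Example selection of three regression models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

results = []
fitted = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Clip at zero: mean_squared_log_error rejects negative values
    preds = np.clip(model.predict(X_val), 0, None)
    rmsle = np.sqrt(mean_squared_log_error(y_val, preds))
    results.append({"Model": name, "RMSLE": rmsle})
    fitted[name] = model

# Lower RMSLE is better, so the best model is the first row after sorting
results_df = pd.DataFrame(results).sort_values("RMSLE").reset_index(drop=True)
print(results_df)

best_name = results_df.loc[0, "Model"]
best_model = fitted[best_name]
print(f"Best model: {best_name}")
```

Keeping the fitted models in a dictionary means the best one can be pulled out by name afterwards without retraining.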


---

### Task 10: Make Predictions on `test.csv` and Generate Submission File

Use your **best-performing model** to predict house prices in the Kaggle test dataset and generate a correctly formatted submission file.

---

### **Instructions:**  
1️⃣ **Load the Kaggle Test Dataset (`test.csv`)**  
   - Download the **test dataset** from Kaggle’s **House Prices - Advanced Regression Techniques** competition (for ease of use, you can access the version I have provided on GitHub). 
   - Ensure the test dataset has the *same structure* as the training dataset.  

2️⃣ **Apply the Same Data Preprocessing Steps**  
   - Use *the same feature selection* as in training: `LotArea`, `LotFrontage`, `YearBuilt`, `1stFlrSF`, and `2+ Car Garage`.  
   - Handle any missing values in these columns using the median values.
   - *Hint:* Before making predictions, *ensure there are no missing values* in the variables used for training. Double-check all variables and apply any necessary transformations before proceeding.  
   
3️⃣ **Use the Best Model to Make Predictions**  
   - Use the trained model to make *predictions on the test set*.  
     - Load and apply the best model from your training in Task 9.
   - Ensure that all predictions are *non-negative* (house prices cannot be negative).  
   
4️⃣ **Format Predictions for Kaggle Submission**  
   - The submission file must contain *two columns:* `Id` and `SalePrice`.  
   - Save the predictions as **`submission.csv`** in the required format:  
     - *Important:* The submission file *must match Kaggle’s format exactly*—every `Id` must have a prediction, and no values can be missing.     

   ```
   Id,SalePrice
   1461,185000
   1462,160000
   1463,220000
   ...
   ```

5️⃣ **Submit to Kaggle**  
   - *Check for missing values* before submitting.  
   - Upload `submission.csv` to Kaggle.  
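
The steps above can be sketched end to end on made-up data. The toy `best_model` below stands in for whichever model won in Task 9. The sketch fills missing values with each test column's own median, which is one reading of the instruction; using the training-set medians is a common alternative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["LotArea", "LotFrontage", "YearBuilt", "1stFlrSF", "2+ Car Garage"]

# Toy training data and model standing in for the best model from Task 9
rng = np.random.default_rng(0)
n = 100
X_train = pd.DataFrame({
    "LotArea": rng.integers(5000, 15000, n),
    "LotFrontage": rng.integers(40, 120, n).astype(float),
    "YearBuilt": rng.integers(1950, 2010, n),
    "1stFlrSF": rng.integers(600, 2000, n),
    "2+ Car Garage": rng.integers(0, 2, n),
})[features]
y_train = 50000 + 80 * X_train["1stFlrSF"] + rng.normal(0, 5000, n)
best_model = LinearRegression().fit(X_train, y_train)

# Made-up test rows; the real file would come from pd.read_csv()
test = pd.DataFrame({
    "Id": [1461, 1462, 1463, 1464],
    "LotArea": [9000, 12000, 10500, 8000],
    "LotFrontage": [80.0, np.nan, 74.0, 78.0],
    "YearBuilt": [1961, 1958, 1997, 1998],
    "1stFlrSF": [900, 1300, 950, 1000],
    "GarageCars": [1.0, 1.0, 2.0, np.nan],
})

# Same preprocessing as in training: fill gaps, then build the binary feature
for col in ["LotArea", "LotFrontage", "YearBuilt", "1stFlrSF", "GarageCars"]:
    test[col] = test[col].fillna(test[col].median())
test["2+ Car Garage"] = (test["GarageCars"] >= 2).astype(int)

X_test = test[features]
print(X_test.isna().sum().sum())  # should be 0 before predicting

# Predict and clip at zero so that no price is negative
preds = np.clip(best_model.predict(X_test), 0, None)

# Two columns, Id and SalePrice; index=False keeps the row index out of the file
submission = pd.DataFrame({"Id": test["Id"], "SalePrice": preds})
submission.to_csv("submission.csv", index=False)
print(submission.head())
```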

---

## **Deliverables**
1. Submit the link to your Google Colab notebook in the assignment area in Canvas.
2. Include comments in your code to explain each step.

---

## Bonus Points

If you beat my Kaggle score you can earn an additional 10% (1 point out of 10). 

To get the bonus, upload a screenshot showing your submission with your *Public Score*. Below is a screenshot of my submission with a score of 0.27167. Get your RMSLE below this to claim your bonus!

![](https://github.com/gdsaxton/GDAN5400/blob/main/Housing_Prices/Linear_Regression.png?raw=true)

##### Here is code you can use to upload your screenshot

In [None]:
from google.colab import files
from IPython.display import display, Image

# Upload the screenshot
uploaded = files.upload()  # Prompts file upload dialog

# Display the uploaded image (assuming it's the first uploaded file)
for filename in uploaded.keys():
    display(Image(filename))
    break  # Display only the first uploaded image