[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%2010%20Notebooks/GDAN%205400%20-%20Week%2010%20Notebooks%20%28I%29%20-%20Tasks%201-6.ipynb)

This notebook provides a mini-tutorial on the first six tasks of Coding Assignment #5.

---

### Overview of Coding Assignment 5

In the fifth assignment, we are switching to another competition on *Kaggle*, an online platform for data science and machine learning that provides datasets, competitions, collaborative notebooks, and learning resources.

In the assignment, you will complete the following tasks:

- Task 1: Join the Kaggle Competition  
- Task 2: Load the Housing Prices `Training` Dataset  
- Task 3: Identify Variables with Missing Data  
- Task 4: Fill in Missing Values for `LotFrontage`
- Task 5: Explore the Data with Histograms  
- Task 6: Generate an Automated Data Report  
- Task 7: Create a Binary Variable `2+ Car Garage` from `GarageCars`  
- Task 8: Prepare the Data for Modeling
- Task 9: Train and Evaluate at Least Three Models
- Task 10: Make Predictions on `test.csv` and Generate Submission File


These exercises will help strengthen your ability to explore, preprocess, and model real-world datasets using machine learning. You will gain hands-on experience with data cleaning, feature engineering, and predictive modeling, all while working with a classic dataset in a competitive Kaggle environment.

Read in the Usual Packages and Set Up the Environment

In [None]:
import numpy as np
import pandas as pd

# See http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)  # Show all columns when displaying a DataFrame
pd.set_option('display.max_colwidth', 500)  # Show up to 500 characters per cell

# Task 1: Join the Kaggle Competition  

In [None]:
kaggle_displayname = input("Enter your Kaggle Display Name: ")
print(f"Your Kaggle name is: {kaggle_displayname}")

# Task 2: Load the Housing Prices `Training` Dataset
I have uploaded the training and test datasets to the class GitHub repository.

In [None]:
train_url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/refs/heads/main/Housing_Prices/train.csv'
train = pd.read_csv(train_url)
print('# of rows in training dataset:', len(train), '\n')
train.head(2)  # Preview the first two rows

# Task 3: Identify Variables with Missing Data  
- Determine which variables contain missing values in the dataset using any acceptable method.

*Hints:*
- You can use the `.info()` method, `.isnull().sum()`, `.isna().sum()`, or `.describe()` (a `count` below the total number of rows signals missing values in a numeric column).

In [None]:
train.info()

In [None]:
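missing_counts = train.isnull().sum()  # Number of missing values per column
missing_counts[missing_counts > 0]     # Keep only columns with at least one missing value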
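
Raw counts are easier to interpret alongside the share of rows they represent. The sketch below is one possible way to build such a summary; it uses a small made-up DataFrame (`toy`, with hypothetical values) rather than the real training data, and the names `missing_summary`, `n_missing`, and `pct_missing` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the housing training set
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
})

# Count missing values per column and keep only columns with at least one
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]

# Express the counts as a percentage of all rows and sort, worst first
missing_summary = pd.DataFrame({
    "n_missing": missing_counts,
    "pct_missing": (missing_counts / len(toy) * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing_summary)
```

Swapping `toy` for `train` should give the same kind of table for the housing data.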

# Task 4: Fill in Missing Values for `LotFrontage`
- The `LotFrontage` column contains missing values that must be filled before modeling.  
- Use the **median** value to replace missing values, as it is less affected by outliers.  
- After filling in the missing values, verify that `LotFrontage` no longer has any missing entries.  

In [None]:
print("Missing values in LotFrontage column:", train["LotFrontage"].isnull().sum())

In [None]:
train['LotFrontage'] = train['LotFrontage'].fillna(train["LotFrontage"].median())
print("Missing values in LotFrontage column:", train["LotFrontage"].isnull().sum())
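
To see why the median is described as less affected by outliers, the following minimal sketch compares the mean and median on a short made-up series. The values in `frontage` are hypothetical and are not taken from the housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical lot-frontage values with one extreme outlier (300) and two gaps
frontage = pd.Series([60.0, 65.0, 70.0, 75.0, 300.0, np.nan, np.nan])

mean_value = frontage.mean()      # Pulled upward by the outlier
median_value = frontage.median()  # Middle value; unaffected by how extreme 300 is

# Fill the gaps with the median, as in the cell above
filled = frontage.fillna(median_value)

print("Mean:", mean_value)
print("Median:", median_value)
print("Missing after fill:", filled.isnull().sum())
```

Here the single large value drags the mean well above most of the observations, while the median stays in the middle of the typical range.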

# Task 5: Explore the Data with Histograms  
- Generate histograms for all **numeric features** in the dataset.  
- Use these histograms to understand the distribution of key variables in the dataset.
- **Tips:** 
  - Instead of plotting separate histograms for each variable, use the **shortcut method** we covered in class to generate all histograms at once.
  - Make sure to read in the plotting packages (*hint*: there are two relevant import lines that we used in our Week 5 through Week 8 notebooks).

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

train.select_dtypes(include='number').hist(figsize=(13, 8))
plt.tight_layout()
plt.show()
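
The `select_dtypes(include='number')` step is what restricts the plot to numeric features. As a small illustration, the sketch below applies it to a made-up DataFrame (`toy`, with hypothetical values) and lists the columns that would be passed on to `.hist()`.

```python
import pandas as pd

# Hypothetical toy data mixing numeric and text columns
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, 80.0, 68.0],
    "Street": ["Pave", "Pave", "Grvl"],
})

# Only integer and float columns are kept; the text column is dropped
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

Running the same line on `train` is a quick way to check which variables appear in the histogram grid.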

# Task 6: Generate an Automated Data Report  
- Install and use `ydata-profiling` to create a detailed report of the dataset.  
- This report will provide insights into **missing values, distributions, correlations, and more**.  
- **Tip:** Instead of manually exploring each variable, use this **automated tool** to summarize the data in one step.  
- Save the report as an **HTML file** for easy viewing.

In [None]:
# Install ydata-profiling
!pip install ydata-profiling --quiet
# Import the report class
from ydata_profiling import ProfileReport

In [None]:
# Generate the report
profile = ProfileReport(train, title="Housing_Prices")

In [None]:
# Save the report to an HTML file
profile.to_file("housing_prices.html")