## Advanced House Price Prediction

This notebook explores more advanced techniques for the House Price Prediction task. We will dive deeper into:

1.  **Exploratory Data Analysis (EDA)** with Seaborn to find patterns.
2.  **Advanced Feature Engineering**, including Target Encoding for categorical features.
3.  Training a more robust regression model: **Random Forest** here, with **XGBoost** as a follow-up.
4.  **Model Evaluation** and interpreting feature importances.

## 1. Setup and Data Loading

First, we import all the necessary libraries and load our dataset.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Set plot style
sns.set_style('whitegrid')

# Load data
df = pd.read_csv('train.csv')

print(df.head())

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12   

  YrSold  SaleType  SaleCondition  SalePrice  
0   2008        WD   

## 2. Exploratory Data Analysis (EDA) with Seaborn

Let's visualize the data to understand relationships and distributions.

### 2.1. Target Variable Distribution

We'll check the distribution of `SalePrice`. House prices are usually right-skewed, and some models (linear regression in particular) fit better when the target is roughly symmetric, so we might need to transform it (e.g., using a log transform).

In [None]:
# Your code here: Create a histogram and a Q-Q plot for SalePrice
# sns.histplot(df['SalePrice'], kde=True)
# plt.title('Distribution of SalePrice')
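A minimal sketch of how a log transform affects skewness. It uses synthetic log-normal draws as a stand-in for `df['SalePrice']`, so the numbers are illustrative only and say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for df['SalePrice'] (an assumption, not real data).
rng = np.random.default_rng(0)
sale_price = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=2000), name="SalePrice")

# Skewness near 0 indicates a roughly symmetric distribution.
skew_raw = sale_price.skew()
skew_log = np.log1p(sale_price).skew()

print(f"Skew before log transform: {skew_raw:.2f}")
print(f"Skew after log transform:  {skew_log:.2f}")
```

For the Q-Q plot itself, `scipy.stats.probplot(sale_price, dist='norm', plot=plt)` is one option if SciPy is installed.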

### 2.2. Correlation Heatmap

A heatmap is a great way to see which numerical features are most correlated with `SalePrice`.

In [None]:
# Your code here: Calculate the correlation matrix and plot a heatmap
# corr_matrix = df.corr(numeric_only=True)  # skip string columns; df.corr() alone raises on them in pandas >= 2.0
# plt.figure(figsize=(12, 9))
# sns.heatmap(corr_matrix, vmax=.8, square=True)
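Alongside the heatmap, it can help to rank features by their correlation with the target. The sketch below does this on a synthetic frame. The column names mirror the dataset, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (values are invented for illustration).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),
    "LotArea": rng.normal(10000, 3000, n),
    "Street": rng.choice(["Pave", "Grvl"], n),  # non-numeric column
})
toy["SalePrice"] = 100 * toy["GrLivArea"] + rng.normal(0, 20000, n)

# numeric_only=True leaves string columns such as Street out of the matrix.
corr_matrix = toy.corr(numeric_only=True)

# Rank features by absolute correlation with the target.
ranked = (
    corr_matrix["SalePrice"]
    .drop("SalePrice")
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```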

### 2.3. Categorical Feature Analysis

Let's see how different categories relate to `SalePrice` using boxplots.

In [None]:
# Your code here: Create a boxplot for 'Neighborhood' vs 'SalePrice'
# plt.figure(figsize=(12, 6))
# sns.boxplot(x='Neighborhood', y='SalePrice', data=df)
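Boxplots with many categories are easier to read when the boxes are sorted. One way to do this, sketched here with hypothetical neighborhood labels `A`, `B`, `C`, is to compute the category order from the median price and pass it to the `order` argument of `sns.boxplot`.

```python
import pandas as pd

# Toy data; the neighborhood labels and prices are hypothetical.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C", "C"],
    "SalePrice": [200, 220, 100, 120, 300, 340],
})

# Sort categories by median SalePrice, lowest first.
order = (
    toy.groupby("Neighborhood")["SalePrice"]
    .median()
    .sort_values()
    .index.tolist()
)
print(order)  # ['B', 'A', 'C']

# With seaborn available, the order can then be applied as:
# sns.boxplot(x="Neighborhood", y="SalePrice", data=toy, order=order)
```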

## 3. Feature Engineering

This is where we clean the data, create new features, and handle categorical variables.

### 3.1. Handling Missing Values

Identify columns with missing values and decide on a strategy to fill them (e.g., mean, median, or a constant).

In [None]:
# Your code here: Identify and fill missing values
# For simplicity, we'll focus on a few features and drop rows with NaNs in them for now.
# Later, you can implement more sophisticated imputation.
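As one possible starting point, the sketch below reports the share of missing values per column, then fills numeric columns with the median and non-numeric columns with a constant. The toy frame and the `"Missing"` placeholder are assumptions. In a real pipeline the medians should be computed on the training split only.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; column names mirror the dataset, values are illustrative.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, "Grvl", np.nan, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share[missing_share > 0])

# Median for numeric columns, a constant label for everything else.
filled = toy.copy()
num_cols = filled.select_dtypes(include="number").columns
cat_cols = filled.columns.difference(num_cols)
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].median())
filled[cat_cols] = filled[cat_cols].fillna("Missing")
print(filled)
```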

### 3.2. Target Encoding

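Here, we'll implement Target Encoding for a high-cardinality feature like `Neighborhood`: each category is replaced by the mean `SalePrice` of that category. To prevent data leakage, we'll split the data first, compute the encoding from the training set only, and then apply that same mapping to both the training and validation sets.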

In [None]:
# 1. Select features and target
features = ['Neighborhood', 'LotArea', 'GrLivArea', 'TotalBsmtSF', 'YearBuilt']
target = 'SalePrice'
df_subset = df[features + [target]].dropna()

# 2. Split data BEFORE encoding
X_train, X_val, y_train, y_val = train_test_split(df_subset[features], df_subset[target], test_size=0.2, random_state=42)

# 3. Calculate target encoding on the training set
# Your code here: Calculate the mean SalePrice for each Neighborhood in the training data
# X_train has no SalePrice column, so group y_train by X_train's Neighborhood (their indices align)
# neighborhood_map = y_train.groupby(X_train['Neighborhood']).mean()

# 4. Apply the encoding to both training and validation sets
# X_train['Neighborhood_encoded'] = X_train['Neighborhood'].map(neighborhood_map)
# X_val['Neighborhood_encoded'] = X_val['Neighborhood'].map(neighborhood_map)

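# 5. Fill any NaNs in the validation set (neighborhoods that did not appear in the training set)
# Assign the result back; fillna(..., inplace=True) on a selected column may not update X_val
# X_val['Neighborhood_encoded'] = X_val['Neighborhood_encoded'].fillna(y_train.mean())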
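The steps above can be sketched end to end on a tiny example. The frames below are hand-made stand-ins for the real split, with hypothetical neighborhood labels. `C` appears only in validation, to show the fallback to the global training mean.

```python
import pandas as pd

# Hand-made stand-ins for the real split (labels and prices are hypothetical).
X_train = pd.DataFrame({"Neighborhood": ["A", "A", "B", "B"]})
y_train = pd.Series([100.0, 200.0, 300.0, 500.0], index=X_train.index)
X_val = pd.DataFrame({"Neighborhood": ["A", "C"]})  # "C" is unseen in training

# Mean target per category, computed from the training data only.
neighborhood_map = y_train.groupby(X_train["Neighborhood"]).mean()
global_mean = y_train.mean()

# Apply the same mapping to both sets; unseen categories get the global mean.
X_train = X_train.assign(
    Neighborhood_encoded=X_train["Neighborhood"].map(neighborhood_map)
)
X_val = X_val.assign(
    Neighborhood_encoded=X_val["Neighborhood"].map(neighborhood_map).fillna(global_mean)
)
print(X_val)  # A -> 150.0, C -> 275.0
```

Plain per-category means can overfit rare categories. Smoothing toward the global mean or out-of-fold encoding are common refinements, not shown here.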

## 4. Model Training

Now we'll train a more powerful model. A Random Forest is a reasonable choice: it captures non-linear relationships and feature interactions, and it needs no feature scaling.

In [None]:
# Your code here: Prepare the final feature set (dropping original categorical columns)
# X_train_final = X_train.drop('Neighborhood', axis=1)
# X_val_final = X_val.drop('Neighborhood', axis=1)

# Initialize and train the model
# rf_model = RandomForestRegressor(n_estimators=100, random_state=42, oob_score=True)
# rf_model.fit(X_train_final, y_train)
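With `oob_score=True`, the fitted forest exposes `oob_score_`, an R-squared estimate computed from the samples each tree did not see during bootstrapping. A minimal sketch on synthetic data follows. The features and coefficients are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data; column names mirror the dataset, the relationship is made up.
rng = np.random.default_rng(42)
n = 400
X = pd.DataFrame({
    "GrLivArea": rng.uniform(500, 3000, n),
    "YearBuilt": rng.integers(1900, 2010, n),
})
y = 80 * X["GrLivArea"] + 500 * (X["YearBuilt"] - 1900) + rng.normal(0, 10000, n)

rf_model = RandomForestRegressor(n_estimators=100, random_state=42, oob_score=True)
rf_model.fit(X, y)

# Out-of-bag R-squared: a quick check that needs no separate hold-out set.
print(f"OOB R-squared: {rf_model.oob_score_:.3f}")
```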

## 5. Model Evaluation

### 5.1. Performance Metrics

Let's see how our model performed on the validation set.

In [None]:
# Your code here: Make predictions and calculate RMSE and R-squared
# predictions = rf_model.predict(X_val_final)
# rmse = np.sqrt(mean_squared_error(y_val, predictions))
# r2 = r2_score(y_val, predictions)
# print(f'RMSE: {rmse}')
# print(f'R-squared: {r2}')
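To make the two metrics concrete, the sketch below computes RMSE and R-squared by hand with NumPy and checks them against scikit-learn. The four values are toy numbers, not house prices.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values chosen for easy arithmetic.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# RMSE: square root of the mean squared residual, in the target's units.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R-squared: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)

print(f"RMSE: {rmse_manual:.4f} (sklearn: {rmse_sklearn:.4f})")
print(f"R-squared: {r2_manual:.4f} (sklearn: {r2_sklearn:.4f})")
```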

### 5.2. Feature Importance

Let's find out which features the model found most important.

In [None]:
# Your code here: Get feature importances from the trained model and plot them
# importances = rf_model.feature_importances_
# feature_names = X_train_final.columns
# importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
# importance_df = importance_df.sort_values(by='Importance', ascending=False)

# sns.barplot(x='Importance', y='Feature', data=importance_df)
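A small sanity check on how impurity-based importances behave. In this sketch, one synthetic feature drives the target and the hypothetical `Noise` column is unrelated to it, so the informative feature should rank first. The importances always sum to 1.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: GrLivArea drives the target, Noise is unrelated (both invented).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "GrLivArea": rng.uniform(500, 3000, n),
    "Noise": rng.normal(0, 1, n),
})
y = 100 * X["GrLivArea"] + rng.normal(0, 5000, n)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Same table as in the cell above, sorted from most to least important.
importance_df = (
    pd.DataFrame({"Feature": X.columns, "Importance": model.feature_importances_})
    .sort_values(by="Importance", ascending=False)
    .reset_index(drop=True)
)
print(importance_df)
```

Impurity-based importances can overstate high-cardinality or continuous features. `sklearn.inspection.permutation_importance` on the validation set is a common cross-check.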

## 6. Conclusion and Next Steps

Summarize your findings here. What worked well? What could be improved?

**Next Steps:**
- Implement more sophisticated feature engineering (e.g., creating `HouseAge` from `YearBuilt`).
- Try other models like XGBoost or LightGBM.
- Perform hyperparameter tuning to optimize the model.
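As an example of the first item, `HouseAge` could be derived as the age at sale, assuming `YrSold` (visible in the `df.head()` output) is the intended reference year. The rows below are made up for illustration.

```python
import pandas as pd

# Toy rows; column names come from the dataset, values are illustrative.
toy = pd.DataFrame({
    "YearBuilt": [2003, 1976, 1915],
    "YrSold": [2008, 2007, 2006],
})

# Age of the house at the time of sale, in years.
toy["HouseAge"] = toy["YrSold"] - toy["YearBuilt"]
print(toy)
```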