# House prices: The First 

Forked from:  
* [House prices: Easy mode (top 12%)](https://www.kaggle.com/code/matthieugouel/house-prices-easy-mode-top-12/notebook)

Sources that helped me a lot:
* [Stacked Ensemble Models (Top 3% on Leaderboard)](https://www.kaggle.com/code/limyenwee/stacked-ensemble-models-top-3-on-leaderboard)
* [Preprocessing & Modeling with Stacking -->Top 5%](https://www.kaggle.com/code/alexturkmen/preprocessing-modeling-with-stacking-top-5#2---Preprocessing)

## Initialization

In this section we import the dataset and the required packages.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/house-prices-advanced-regression-techniques/sample_submission.csv
/kaggle/input/house-prices-advanced-regression-techniques/data_description.txt
/kaggle/input/house-prices-advanced-regression-techniques/train.csv
/kaggle/input/house-prices-advanced-regression-techniques/test.csv


In [2]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, OrdinalEncoder
from sklearn.metrics import mean_squared_log_error, r2_score

from xgboost import XGBRegressor

from shap import Explainer

# Dataset directory
base_path = Path("../input/house-prices-advanced-regression-techniques")

In [3]:
df_train = pd.read_csv(base_path / "train.csv")
df_test = pd.read_csv(base_path / "test.csv")

In [9]:
# Separate the features from the target (the training frame is named df_train above)
X = df_train.drop(columns=['SalePrice', 'Id'])
y = df_train['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=8888)

## Exploratory Data Analysis (EDA)

In this section we explore the data to guide the feature engineering process.  
A few things stand out:
* There are both numerical and categorical features
* There are missing values
* The numerical values are not scaled
* The target variable (SalePrice) is skewed


Of course, we could investigate further (collinearity, information gain, ...), 
but we choose to keep the analysis light and general here. 
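
For readers who do want to look at collinearity, here is a minimal sketch of one common approach: compute pairwise correlations and list the pairs above a threshold. It runs on synthetic data; the frame `demo`, its column names, and the 0.8 cutoff are all assumptions made for illustration, not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for numeric features (hypothetical column names)
rng = np.random.default_rng(0)
area = rng.normal(1500, 400, size=500)
demo = pd.DataFrame({
    "area": area,
    "rooms": area / 250 + rng.normal(0, 0.3, size=500),  # nearly collinear with area
    "age": rng.uniform(0, 100, size=500),                # unrelated to the others
})

# Absolute pairwise Pearson correlations
corr = demo.corr().abs()

# Keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# List the pairs above an arbitrary threshold of 0.8
high_pairs = upper.stack().loc[lambda s: s > 0.8]
print(high_pairs)
```

On the real data the same idea would apply to the numerical columns of `X`; whether to drop one feature of a highly correlated pair is a separate modeling decision.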


In [None]:
df_train.head()

In [None]:
df_train["SalePrice"].describe()

In [None]:
sns.displot(df_train["SalePrice"])

In [None]:
print(df_train["SalePrice"].skew())
print(df_train["SalePrice"].kurt())

In [None]:
# Visualize missing values
total = X.isnull().sum().sort_values(ascending=False)
percent = (X.isnull().sum()/X.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

missing_data=missing_data.head(35)
f, ax = plt.subplots(figsize=(16, 8))
plt.xticks(rotation=90)  # rotation expects a number (or 'vertical'), not the string '90'
sns.barplot(x=missing_data.index, y=missing_data['Percent'])
plt.title('Percent missing data by feature', fontsize=15)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.show()

## Feature Engineering

In this section, we perform some basic feature engineering: 

* Split features into numerical and categorical data
* Fill missing numerical values with the mean
* Fill missing categorical values with a "Missing" category
* Scale numerical features 
* Encode categorical features into numbers
* Apply a log transformation to target values to mitigate the skewness
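
The steps listed above can also be expressed as a single scikit-learn `ColumnTransformer`. The sketch below is an alternative formulation, not the code this notebook runs: the tiny frame `demo`, the `price` series, and the pipeline step names are hypothetical, and encoding unseen categories as `-1` is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, RobustScaler

# Tiny hypothetical frame standing in for the housing features
demo = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 11250.0],
    "Alley": ["Pave", np.nan, "Grvl", np.nan],
})
price = pd.Series([208500, 181500, 223500, 140000])

# Split features into numerical and categorical columns
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
categoric_cols = demo.select_dtypes(exclude="number").columns.tolist()

# Numerical: fill with the mean, then scale
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", RobustScaler()),
])

# Categorical: fill with a "Missing" category, then encode as numbers
categoric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
    ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categoric_pipe, categoric_cols),
])

features = preprocess.fit_transform(demo)

# Log-transform the skewed target; np.expm1 reverses it
log_price = np.log1p(price)
print(features.shape)
```

With a real split, `preprocess.fit_transform` would be called on the training features only and `preprocess.transform` on the validation features, so that the statistics and category codes learned on the training data are reused rather than refit.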

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
# Get numerical and categorical features
numeric_feats = X.dtypes[X.dtypes != "object"].index
categoric_feats = X.dtypes[X.dtypes == "object"].index

# Get features with missing values
na_numeric_feats = [k for k, v in X[numeric_feats].isnull().sum().to_dict().items() if v > 0]
na_categoric_feats = [k for k, v in X[categoric_feats].isnull().sum().to_dict().items() if v > 0]

print(na_numeric_feats)
print(na_categoric_feats)

In [None]:
# Clean numerical features with missing values
imp = SimpleImputer(strategy="mean")
X_train[na_numeric_feats] = imp.fit_transform(X_train[na_numeric_feats])
X_val[na_numeric_feats] = imp.transform(X_val[na_numeric_feats])

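# Clean categorical features with missing values
# (assign back instead of calling fillna(inplace=True) on a column selection,
# which may not modify the original frame in recent pandas versions)
for feat in na_categoric_feats:
    X_train[feat] = X_train[feat].fillna("Missing")
    X_val[feat] = X_val[feat].fillna("Missing")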

In [None]:
# Scale numerical features
for feat in numeric_feats:
    scaler = RobustScaler()
    X_train[feat] = scaler.fit_transform(X_train[feat].values.reshape(-1, 1))
    X_val[feat] = scaler.transform(X_val[feat].values.reshape(-1, 1))

In [None]:
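# Encode categorical features
# Fit the encoder on the training data only and reuse it on the validation data,
# so that each category maps to the same number in both sets.
# Categories seen only in the validation set are encoded as -1.
for feat in categoric_feats:
    encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
    X_train[feat] = encoder.fit_transform(X_train[feat].values.reshape(-1, 1))
    X_val[feat] = encoder.transform(X_val[feat].values.reshape(-1, 1))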

In [None]:
# Log-transformation of skewed target variable
y_train = np.log1p(y_train)
y_val = np.log1p(y_val)

## Modeling

In this section we fit our regressor and evaluate it.  
For simplicity, we use a single XGBoost ensemble regressor and do not stack or blend different models.

Hyperparameter tuning was run once and the corresponding cell is now commented out.
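
Because the target is log-transformed, a plain RMSE computed on the model's outputs is the same quantity as an RMSLE computed on the prices themselves. The sketch below illustrates that equivalence with NumPy only; the helper names `rmse` and `rmsle` and the example prices are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log1p as in this notebook."""
    return rmse(np.log1p(y_true), np.log1p(y_pred))

# Hypothetical prices and predictions, for illustration only
price_true = np.array([208500.0, 181500.0, 223500.0, 140000.0])
price_pred = np.array([200000.0, 190000.0, 210000.0, 150000.0])

# A model trained on log1p(price) works on the log scale
log_true = np.log1p(price_true)
log_pred = np.log1p(price_pred)

# RMSE on the log scale ...
score_log_scale = rmse(log_true, log_pred)

# ... matches RMSLE on the price scale after undoing the transform with expm1
score_price_scale = rmsle(np.expm1(log_true), np.expm1(log_pred))

print(f"{score_log_scale:.5f} {score_price_scale:.5f}")
```

The same `np.expm1` call is what turns log-scale predictions back into prices.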

In [None]:
# # Hyperparameter Tuning for XGBoost
# from sklearn.model_selection import GridSearchCV

# parameters = {
#     "n_estimators": [500, 750, 1000, 1500, 2000], 
#     "learning_rate": [0.01, 0.02, 0.05], 
#     "max_depth": [6, 8], 
#     "subsample": [0.3, 0.5, 0.7]
# }

# grid = GridSearchCV(XGBRegressor(objective='reg:squarederror'), parameters)
# grid.fit(X_train, y_train)

# print(grid.best_params_)

In [None]:
# Fit the model with training data
model = XGBRegressor(n_estimators=1500, learning_rate=0.02, max_depth=6, subsample=0.7)
model.fit(X_train, y_train)

In [None]:
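from sklearn.metrics import mean_squared_error

# y_train and y_val are already log1p-transformed, so a plain RMSE on them
# is the RMSLE of the prices; applying mean_squared_log_error here would take the log twice.
print("-----")
print("* Training set")
y_pred = model.predict(X_train)
print(f"R2: {r2_score(y_train, y_pred):.2%}")
print(f"RMSLE: {np.sqrt(mean_squared_error(y_train, y_pred)):.5f}")

print("-----")
print("* Validation set")
y_pred = model.predict(X_val)
print(f"R2: {r2_score(y_val, y_pred):.2%}")
print(f"RMSLE: {np.sqrt(mean_squared_error(y_val, y_pred)):.5f}")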