# House prices: The First 

Fork from:  
* [House prices: Easy mode (top 12%)](https://www.kaggle.com/code/matthieugouel/house-prices-easy-mode-top-12/notebook)

Sources that helped me a lot:
* [Preprocessing & Modeling with Stacking -->Top 5%](https://www.kaggle.com/code/alexturkmen/preprocessing-modeling-with-stacking-top-5#2---Preprocessing)
* [Stacked Ensemble Models (Top 3% on Leaderboard)](https://www.kaggle.com/code/limyenwee/stacked-ensemble-models-top-3-on-leaderboard)

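* **Goals**:  
1. Following the existing logic, build a simple first version of the model, ideally reaching the top 3%;  
2. Use SHAP for attribution analysis of the model, followed by clustering.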

## Part 1: Initialization

In this section we import the dataset and the required packages.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, OrdinalEncoder
from sklearn.metrics import mean_squared_log_error, r2_score,accuracy_score

from xgboost import XGBRegressor
import lightgbm as lgb
import xgboost as xgb
from shap import Explainer

# Dataset directory
base_path = Path("../input/house-prices-advanced-regression-techniques")

In [None]:
df_train = pd.read_csv(base_path / "train.csv")
df_test = pd.read_csv(base_path / "test.csv")

X = df_train.drop(columns=['SalePrice', 'Id'])
y = df_train['SalePrice']

categoric_feats = X.dtypes[X.dtypes == "object"].index
X[categoric_feats]=X[categoric_feats].astype("category")


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=8888)

yy_test=df_test.drop(columns=['Id'])
yy_test[categoric_feats]=yy_test[categoric_feats].astype("category")

In [None]:
df_train.head()

# Part 2: MVP Model (minimum viable product)  
**Modeling**  
For simplicity, we use only a LightGBM ensemble regressor and do not stack or blend different models.
Hyperparameter tuning was applied and then commented out.
1. Explore the data
2. Model directly with LightGBM, generate predictions, and obtain a baseline score

In [None]:
f, ax = plt.subplots(figsize=(16,4))
df_train["SalePrice"].plot.hist(bins=50,ax=ax)
#sns.displot(df_train["SalePrice"])
print("skew=",df_train["SalePrice"].skew(),"kurt=",df_train["SalePrice"].kurt())
print(df_train["SalePrice"].describe())

In [None]:
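# Initial LightGBM model
# Note: in LightGBM, subsample only takes effect when subsample_freq > 0
model = lgb.LGBMRegressor(n_estimators=1500, learning_rate=0.02, max_depth=6, subsample=0.7)
model.fit(X_train, y_train, categorical_feature=categoric_feats.to_list())

# Evaluate the predictions on the held-out split
y_pred = model.predict(X_test)

# scikit-learn metrics take (y_true, y_pred) in that order
print("Mean Squared Log Error : " + str(mean_squared_log_error(y_test, y_pred)))
print("r2_score : " + str(r2_score(y_test, y_pred)))

# Predict on the competition test set and write a real CSV file
# (ndarray.tofile writes raw binary, not CSV)
yy_pred = model.predict(yy_test)
submission = pd.DataFrame({"Id": df_test["Id"], "SalePrice": yy_pred})
submission.to_csv("../working/house-prices-predV1.csv", index=False)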

In [None]:
f,ax=plt.subplots(figsize=(16,8))
lgb.plot_importance(model, max_num_features=30,ax=ax)
plt.title("Feature importances",fontsize=15)
# plt.ylabel(fontsize=15)
plt.show()

## Model explanation with SHAP

## Automatic hyperparameter tuning

In [None]:
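# Parameters and Dataset for lgb.cv. `params` and `lgb_train` were not defined
# in the original cell; the values below are assumptions that mirror the
# LGBMRegressor settings used above.
params = {"objective": "regression", "learning_rate": 0.02, "max_depth": 6, "verbose": -1}
lgb_train = lgb.Dataset(X_train, label=y_train)

cv_results = lgb.cv(
                    params,
                    lgb_train,
                    num_boost_round=1500,
                    seed=1,
                    nfold=5,
                    stratified=False,  # stratified folds only apply to classification
                    metrics='rmse',    # 'auc' is a classification metric
                    callbacks=[lgb.log_evaluation(period=100), lgb.early_stopping(stopping_rounds=30)]
                    )

# LightGBM model with early stopping (early stopping needs an eval_set to monitor)
callbacks = [lgb.log_evaluation(period=100), lgb.early_stopping(stopping_rounds=30)]
model2 = lgb.LGBMRegressor(n_estimators=1500, learning_rate=0.02, max_depth=6, subsample=0.7)
model2.fit(X_train, y_train, eval_set=[(X_test, y_test)],
           categorical_feature=categoric_feats.to_list(), callbacks=callbacks)

# Evaluate the predictions of the early-stopped model
# (the score is optimistic because X_test also drove the early stopping)
y_pred = model2.predict(X_test)

print("Mean Squared Log Error : " + str(mean_squared_log_error(y_test, y_pred)))
print("r2_score : " + str(r2_score(y_test, y_pred)))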


# Feature Engineering
**Exploratory Data Analysis (EDA)**  
A first look at the data shows:
* There are numerical and categorical features
* There are missing values
* The numerical values are not scaled
* The target values (SalePrice) are skewed
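
To make the skewness observation concrete, here is a small sketch that compares the skew of a right-skewed price-like series before and after `np.log1p`. The series is synthetic (drawn from a log-normal distribution as an assumption); with the real data, `df_train["SalePrice"]` would take the place of `prices`.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" (assumption: log-normal, not the real data)
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=5000)))

skew_raw = prices.skew()            # clearly positive for a long right tail
skew_log = np.log1p(prices).skew()  # close to zero after the transform

print("skew before log1p:", skew_raw)
print("skew after log1p:", skew_log)
```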



In this section, we perform some basic feature engineering: 

* Split features into numerical and categorical data
* Fill missing numerical values with the mean
* Fill missing categorical values with a "Missing" category
* Scale numerical features 
* Encode categorical features into numbers
* Apply a log transformation to target values to mitigate the skewness
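
The feature steps in this list can also be expressed as a single scikit-learn `ColumnTransformer`, which keeps fitting (training split) and transforming (any other split) clearly separated. The sketch below is one possible arrangement, not the notebook's actual code: the two-column frames `train` and `valid` are toy data, and encoding unseen categories as -1 is a design choice made here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, RobustScaler

numeric_cols = ["LotArea"]
categorical_cols = ["Alley"]

# Numerical features: mean imputation, then robust scaling
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", RobustScaler()),
])
# Categorical features: "Missing" category, then ordinal codes
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
    ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
])
preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

# Toy data (assumption) with missing values and one unseen category
train = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 11250.0],
    "Alley": pd.Series(["Grvl", np.nan, "Pave", np.nan], dtype="object"),
})
valid = pd.DataFrame({
    "LotArea": [np.nan, 10000.0],
    "Alley": pd.Series(["Pave", "Other"], dtype="object"),
})

Xt = preprocess.fit_transform(train)  # statistics learned on the training split only
Xv = preprocess.transform(valid)      # reused unchanged on the other split
print(Xt)
print(Xv)
```

Fitting only on the training split avoids leaking validation statistics into the preprocessing, and `handle_unknown` keeps the encoder from failing on categories it has not seen.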

In [None]:
# Visualize missing values
total = X.isnull().sum().sort_values(ascending=False)
percent = (X.isnull().sum()/X.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20).T

In [None]:
f, ax = plt.subplots(figsize=(16,4))
plt.xticks(rotation=90)
missing_data['Percent'][:20].plot.bar(ax=ax)
# sns.barplot(x=missing_data.index, y=missing_data['Percent',:20])
plt.title('Percent missing data by feature', fontsize=15)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.show()

In [None]:
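# Get numerical and categorical features
# (categorical columns may already have been cast to "category" above,
# so select by "number" rather than comparing against "object")
numeric_feats = X.select_dtypes(include="number").columns
categoric_feats = X.select_dtypes(exclude="number").columns

# Get features with missing values
na_numeric_feats = [k for k, v in X[numeric_feats].isnull().sum().to_dict().items() if v > 0]
na_categoric_feats = [k for k, v in X[categoric_feats].isnull().sum().to_dict().items() if v > 0]

# Work on copies so the assignments below do not write into a slice of X
# (the held-out split from train_test_split is named X_test)
X_train, X_test = X_train.copy(), X_test.copy()

# Clean numerical features with missing values (fit on the training split only)
imp = SimpleImputer(strategy="mean")
X_train[na_numeric_feats] = imp.fit_transform(X_train[na_numeric_feats])
X_test[na_numeric_feats] = imp.transform(X_test[na_numeric_feats])

# Clean categorical features with missing values
# (cast to object first: "Missing" is not an existing category)
for feat in na_categoric_feats:
    X_train[feat] = X_train[feat].astype(object).fillna("Missing")
    X_test[feat] = X_test[feat].astype(object).fillna("Missing")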

In [None]:
# Scale numerical features (fit on the training split, reuse on the held-out split)
scaler = RobustScaler()
X_train[numeric_feats] = scaler.fit_transform(X_train[numeric_feats])
X_test[numeric_feats] = scaler.transform(X_test[numeric_feats])

In [None]:
# Encode categorical features (fit on the training split only;
# categories unseen during fitting are encoded as -1)
for feat in categoric_feats:
    encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
    X_train[feat] = encoder.fit_transform(X_train[[feat]]).ravel()
    X_test[feat] = encoder.transform(X_test[[feat]]).ravel()

In [None]:
# Log-transformation of skewed target variable
y_train = np.log1p(y_train)
y_test = np.log1p(y_test)
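
Once the target has been transformed with `np.log1p`, a model trained on it predicts on the log scale, so predictions need `np.expm1` before they are written to a submission. The sketch below illustrates the round trip and shows that RMSE on the log scale matches the root of `mean_squared_log_error` on the original scale. The prices and the prediction errors are made-up numbers used only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Made-up sale prices and hypothetical model outputs on the log scale
y_true = np.array([120000.0, 185000.0, 250000.0, 340000.0])
y_log = np.log1p(y_true)
pred_log = y_log + np.array([0.05, -0.02, 0.10, -0.08])

# Map predictions back to prices before reporting or submitting them
pred = np.expm1(pred_log)

# RMSE on the log scale and RMSLE on the price scale are the same quantity
rmse_log = np.sqrt(mean_squared_error(y_log, pred_log))
rmsle = np.sqrt(mean_squared_log_error(y_true, pred))
print(rmse_log, rmsle)
```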