# Scores

- dummy-model: 5.55666 
- baseline model with 5 most correlated numeric features: 1.03676
- baseline model + 5 most correlated numeric features + not null categorical data: 1.22499
- Random Forest + 5 most correlated numeric features + not null categorical data: 1.00179
- Random Forest + 5 most correlated numeric features + 117 most important categorical columns: 1.00156

- The goal of this notebook is to build a first end-to-end pipeline for model exploration. A working pipeline gives us feedback on each modeling step (pre-processing, feature engineering, etc.) and tells us whether it improves model performance.

- This pipeline is composed of:
    - Dataset loading
    - The minimum preprocessing or feature engineering required to prepare the data for the model (e.g. removing missing values)
    - Modeling (a first baseline model)
    - Evaluation
    - Preparing the submission file for the Kaggle competition

In [None]:
from pathlib import Path
import pandas as pd
import copy

pd.set_option('display.max_columns', None)

# Dataset

- Download the data

In [None]:
!kaggle competitions download -c house-prices-advanced-regression-techniques
!mkdir -p ../../data/house-prices
!unzip -o house-prices-advanced-regression-techniques.zip -d ../../data/house-prices
!rm house-prices-advanced-regression-techniques.zip

In [None]:
!ls ../../data/house-prices

- Read data

In [None]:
ROOT_DIR = Path('.').resolve().parents[1].absolute()
DATA_DIR = ROOT_DIR / 'data' / 'house-prices'

target_column = 'SalePrice'

In [None]:
df_master = pd.read_csv(DATA_DIR / 'train.csv', index_col='Id')
df_master.sample(5)

# Pre-processing

## Continuous data

In [None]:
df_continuous = df_master.select_dtypes(include='number')
df_continuous.head()

- Drop rows with null values

In [None]:
df_continuous.isna().sum()

In [None]:
df_continuous = df_continuous.dropna()

In [None]:
highest_corr = df_continuous.corrwith(df_continuous[target_column]).sort_values(ascending=False)
highest_corr.head()

In [None]:
# Skip the first entry (the target itself) and keep the next 5 features
top = list(highest_corr.iloc[1:6].index)

In [None]:
top

In [None]:
top_target = copy.deepcopy(top)

In [None]:
top_target

In [None]:
top_target.append(target_column)

In [None]:
top_target

In [None]:
df_continuous = df_continuous[top_target]
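
The selection above can be illustrated on a small example. This is a minimal sketch on made-up data: the values and the name `toy` are hypothetical, and only the column names are borrowed from the dataset. It shows why the first entry of the sorted correlations is skipped: the target always correlates perfectly with itself.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'SalePrice': [100, 200, 300, 400, 500],
    'GrLivArea': [10, 20, 30, 40, 55],
    'YrSold':    [3, 1, 2, 1, 3],
    'LotArea':   [5, 9, 8, 14, 15],
})

# Correlation of every column with the target, strongest first
corr = toy.corrwith(toy['SalePrice']).sort_values(ascending=False)

# The target itself comes first, so drop it before taking the top features
top_k = corr.drop('SalePrice').head(2).index.tolist()
print(top_k)
```

Sorting by raw correlation, as here, ranks strongly negative correlations last; ranking by absolute value would be a possible variation.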

## Categorical data

In [None]:
# Categorical columns are stored as strings, i.e. the 'object' dtype
df_categorical = df_master.select_dtypes(include='object')
df_categorical.head()
df_categorical.shape

In [None]:
df_categorical.isna().sum()

In [None]:
cat_data_cols = df_categorical.columns[df_categorical.isna().any()].tolist()
cat_data_cols

In [None]:
df_categorical = df_categorical.drop(cat_data_cols, axis=1)
df_categorical.shape

In [None]:
df_categorical = df_categorical.dropna()
#df_categorical = df_categorical.fillna(value='None')
df_categorical.shape

In [None]:
df_categorical.isna().sum()

In [None]:
df_categorical = pd.get_dummies(df_categorical)
df_categorical.shape

In [None]:
df_categorical.head()

# Model training

- Split data into train and test sets

In [None]:
print(df_categorical.shape)
print(df_continuous.shape)

Least important features, copied manually from the output of the Feature importances section:

In [None]:
l_important = ['OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea', 'TotalBsmtSF',
       'MSZoning_C (all)', 'MSZoning_FV', 'MSZoning_RH', 'MSZoning_RL',
       'MSZoning_RM', 'Street_Grvl', 'Street_Pave', 'LotShape_IR1',
       'LotShape_IR2', 'LotShape_IR3', 'LotShape_Reg', 'LandContour_Bnk',
       'LandContour_HLS', 'LandContour_Low', 'LandContour_Lvl',
       'Utilities_AllPub', 'Utilities_NoSeWa', 'LotConfig_Corner',
       'LotConfig_CulDSac', 'LotConfig_FR2', 'LotConfig_FR3',
       'LotConfig_Inside', 'LandSlope_Gtl', 'LandSlope_Mod', 'LandSlope_Sev',
       'Neighborhood_Blmngtn', 'Neighborhood_Blueste', 'Neighborhood_BrDale',
       'Neighborhood_BrkSide', 'Neighborhood_ClearCr', 'Neighborhood_CollgCr',
       'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_Gilbert',
       'Neighborhood_IDOTRR', 'Neighborhood_MeadowV', 'Neighborhood_Mitchel',
       'Neighborhood_NAmes', 'Neighborhood_NPkVill', 'Neighborhood_NWAmes',
       'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_OldTown',
       'Neighborhood_SWISU', 'Neighborhood_Sawyer', 'Neighborhood_SawyerW',
       'Neighborhood_Somerst', 'Neighborhood_StoneBr', 'Neighborhood_Timber',
       'Neighborhood_Veenker', 'Condition1_Artery', 'Condition1_Feedr',
       'Condition1_Norm', 'Condition1_PosA', 'Condition1_PosN',
       'Condition1_RRAe', 'Condition1_RRAn', 'Condition1_RRNe',
       'Condition1_RRNn', 'Condition2_Artery', 'Condition2_Feedr',
       'Condition2_Norm', 'Condition2_PosA', 'Condition2_PosN',
       'Condition2_RRAe']

In [None]:
# dummy model
#df_final = df_continuous
# baseline model
# df_final = df_continuous
# improved model - both categorical and continuous data
df_final = pd.merge(df_continuous, df_categorical, left_index=True, right_index=True)
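# Note: drop() returns a new DataFrame; the result is not assigned here, so df_final keeps all columns
df_final.drop(l_important, axis=1)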
X, y = df_final.drop(target_column, axis=1), df_final[target_column]
X.shape, y.shape

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train.shape, X_test.shape

- Train a model

In [None]:
#from sklearn.linear_model import LinearRegression

#model = LinearRegression()
#model.fit(X_train, y_train)

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(max_depth=6, n_estimators=140)
model.fit(X_train, y_train)

# Model evaluation

In [None]:
import numpy as np
from sklearn.metrics import mean_squared_log_error

def compute_rmsle(y_test: np.ndarray, y_pred: np.ndarray, precision: int = 2) -> float:
    rmsle = np.sqrt(mean_squared_log_error(y_test, y_pred))
    return round(rmsle, precision)

In [None]:
y_pred = model.predict(X_test)
# Replace negative predictions with 0
y_pred = np.where(y_pred < 0, 0, y_pred)
compute_rmsle(y_test, y_pred)
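
For reference, the metric can also be computed by hand. The sketch below assumes the standard definition of RMSLE, the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + target)`; the function name `rmsle_manual` and the toy prices are hypothetical.

```python
import numpy as np

def rmsle_manual(y_true, y_pred):
    """RMSLE computed directly from its definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x) is defined at 0, so zero predictions are allowed,
    # but negative predictions below -1 are not (hence the clipping to 0 above)
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Toy prices (hypothetical values, not taken from the dataset)
y_true_toy = [200_000, 150_000, 310_000]
y_pred_toy = [210_000, 140_000, 300_000]
print(rmsle_manual(y_true_toy, y_pred_toy))
```

Because the error is measured on a log scale, the metric penalizes relative rather than absolute differences.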

## Feature importances

Less is sometimes better: the idea here is to remove columns that are not important for the model.

In [None]:
feats = {}
for feature, importance in zip(X_test.columns, model.feature_importances_):
    feats[feature] = importance

importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'importance'})

In [None]:
importances.shape

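These are the least important columns; they can be dropped before the next model retraining.

In [None]:
# Sort ascending so the least important features come first
importances.sort_values('importance').head(70).index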
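
As a minimal sketch of this selection step, the example below ranks features by importance and drops the weakest ones. The importance values and the names `toy_importances` and `X_toy` are hypothetical; in the notebook the values would come from `model.feature_importances_`, indexed by the training columns.

```python
import pandas as pd

# Hypothetical importances, for illustration only
toy_importances = pd.Series(
    [0.55, 0.02, 0.30, 0.00, 0.13],
    index=['OverallQual', 'Street_Grvl', 'GrLivArea', 'Utilities_NoSeWa', 'GarageCars'],
    name='importance',
)

# Sort ascending so the least important features come first
ranked = toy_importances.sort_values()
least_important = ranked.head(2).index.tolist()
print(least_important)

# drop() returns a new DataFrame, so the result has to be assigned
X_toy = pd.DataFrame(0, index=range(3), columns=toy_importances.index)
X_reduced = X_toy.drop(columns=least_important)
print(list(X_reduced.columns))
```

Without the sort, slicing the first rows only returns features in column order, which says nothing about their importance.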

# Inference

- Read data

In [None]:
inference_df = pd.read_csv(DATA_DIR / 'test.csv', index_col='Id')
inference_df.head()

- Feature engineering

In [None]:
continuous_inference_df = inference_df.select_dtypes(include='number')
continuous_inference_df = continuous_inference_df[top]
continuous_inference_df.head()

In [None]:
continuous_inference_df = continuous_inference_df.dropna()

In [None]:
continuous_inference_df.shape

In [None]:
categorical_inference_df = inference_df.select_dtypes(include='object')
print(categorical_inference_df.shape)
categorical_inference_df = categorical_inference_df.drop(cat_data_cols, axis=1)
print(categorical_inference_df.shape)
categorical_inference_df = categorical_inference_df.dropna()
categorical_inference_df = pd.get_dummies(categorical_inference_df)
categorical_inference_df = categorical_inference_df.reindex(columns = df_categorical.columns, fill_value=0)
print(categorical_inference_df.shape)
#categorical_inference_df.head()
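
The `reindex` call above keeps the one-hot columns of the inference data identical to those seen during training. The sketch below shows the effect on toy data; the frames `train_cat` and `test_cat` and the category `'Dirt'` are made up for illustration.

```python
import pandas as pd

train_cat = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
test_cat = pd.DataFrame({'Street': ['Pave', 'Dirt']})  # 'Dirt' is a made-up category

train_dummies = pd.get_dummies(train_cat)
test_dummies = pd.get_dummies(test_cat)

# Keep exactly the training columns, in the training order:
# - columns missing from the test data are added and filled with 0
# - columns unseen during training (here 'Street_Dirt') are discarded
aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(aligned.columns))
```

The model expects the same columns in the same order as at training time, which is what this alignment provides.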


In [None]:
total_inference_df = pd.merge(continuous_inference_df, categorical_inference_df, left_index=True, right_index=True)

In [None]:
total_inference_df.head()
total_inference_df.shape

- Make inference

In [None]:
#predictions = model.predict(continuous_inference_df)
predictions = model.predict(total_inference_df)
predictions

In [None]:
# Check whether there are negative predictions; if so, get their indexes
np.where(predictions < 0)

In [None]:
predictions[1109]

In [None]:
# Replace negative predictions with 0
predictions = np.where(predictions < 0, 0, predictions)
np.where(predictions < 0)

# Submission
- [Kaggle API usage](https://www.kaggle.com/docs/api)

- Prepare submission file

In [None]:
#continuous_inference_df.head()
total_inference_df.head()

In [None]:
total_inference_df.shape

In [None]:
predictions.shape

In [None]:
#continuous_inference_df[target_column] = predictions
#continuous_inference_df.head()

total_inference_df[target_column] = predictions
total_inference_df.head()

- The submission needs a prediction for every sample. Because we dropped the samples with missing values in the selected continuous or categorical columns, they are absent from `total_inference_df`. We therefore take their ids from the original dataframe `inference_df` and set their predictions to 0.
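
The idea can be sketched on toy data before applying it to the real frames. The ids and prices below are hypothetical, as are the names `all_ids`, `preds` and `submission`.

```python
import pandas as pd

# All ids expected in the submission (hypothetical values)
all_ids = pd.DataFrame({'Id': [1461, 1462, 1463, 1464]})
# Predictions exist only for the rows that survived dropna()
preds = pd.DataFrame({'Id': [1461, 1463], 'SalePrice': [120_000.0, 185_000.0]})

# A left merge keeps every id; ids without a prediction get NaN
submission = all_ids.merge(preds, on='Id', how='left')
n_missing = int(submission['SalePrice'].isna().sum())

# Fill the missing predictions with 0
submission['SalePrice'] = submission['SalePrice'].fillna(0)
print(n_missing)
print(submission)
```

The left merge guarantees one row per id, so the file has the expected length even when some rows were dropped.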

In [None]:
#continuous_inference_df = continuous_inference_df[[target_column]].reset_index()
#continuous_inference_df.head()
total_inference_df = total_inference_df[[target_column]].reset_index()
total_inference_df.head()

In [None]:
inference_ids_df = inference_df.reset_index()[['Id']]
inference_ids_df.head()

In [None]:
f'Number of missing predictions = {len(inference_ids_df) - len(total_inference_df)}'

In [None]:
#submission_df = inference_ids_df.merge(continuous_inference_df, on='Id', how='left')
submission_df = inference_ids_df.merge(total_inference_df, on='Id', how='left')
submission_df.head()

- Fill missing predictions (records we dropped because they had missing values in the selected continuous or categorical columns)

In [None]:
# Validate the number of missing predictions
#submission_df[target_column].isna().sum() == len(inference_ids_df) - len(continuous_inference_df)
submission_df[target_column].isna().sum() == len(inference_ids_df) - len(total_inference_df)

In [None]:
# Fill missing values with 0
submission_df[target_column] = submission_df[target_column].fillna(0)
submission_df[target_column].isna().sum()

In [None]:
submission_df.shape

In [None]:
# Save submission file
submission_file_path = DATA_DIR / 'submission.csv'
submission_df.to_csv(submission_file_path, index=False)

In [None]:
pd.read_csv(DATA_DIR / 'submission.csv').head()

- Make submission

In [None]:
# Used to prevent making a submission in case of `notebook run-all`
assert False

In [None]:
!kaggle competitions submit -c house-prices-advanced-regression-techniques -f ../../data/house-prices/submission.csv -m dummy-model

In [None]:
!kaggle competitions submissions -c house-prices-advanced-regression-techniques