<a href="https://colab.research.google.com/github/anik199/House_Prices_Advanced_Regression/blob/main/A_brief_overview_of_house_prices_catboost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Load Competition Dataset

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os


In [2]:
# load dataset 
df_train = pd.read_csv('/content/train.csv')  

df_test = pd.read_csv('/content/test.csv')

In [None]:
# Environment preparation
import sys
# Pin scikit-learn so the whole notebook runs (OrdinalEncoder's handle_unknown="use_encoded_value" needs 0.24 or newer)
!{sys.executable} -m pip install --upgrade scikit-learn=="0.24.2"

In [None]:
# Record these versions in case you need to reproduce the environment when running the kernel yourself
import sklearn; sklearn.show_versions() 

# Input Dataset

In [5]:
# STORE SUBMISSION RELEVANT COLUMNS
test_df_submission_columns = df_test['Id']

In [6]:
# DISCARD IRRELEVANT COLUMNS
df_train.drop(['Id'], axis=1, inplace=True)
df_test.drop(['Id'], axis=1, inplace=True)

## Remove Missing Data in Numerical Columns

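The input dataset has <b>37 numeric columns</b> that are treated for missing data; they are listed in `_NUMERIC_COLS_WITH_MISSING_VALUES` in the code cell of this section.

The following code fills the missing values in those columns. Each null is replaced with the mean of its column, computed on the training set and reused for the test set.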
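Before imputing, it can help to see which numeric columns actually contain nulls. The snippet below is a minimal sketch on a made-up toy frame (the name `toy` and its values are assumptions for illustration, not the competition data); the same two calls could be pointed at `df_train`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train (values are made up for illustration).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "MSZoning": ["RL", "RL", None, "RM"],
})

# Count nulls per numeric column and keep only the columns that have any.
numeric = toy.select_dtypes(include="number")
missing_counts = numeric.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

Only `LotFrontage` is reported here, since `LotArea` is complete and `MSZoning` is not numeric.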

In [7]:
from sklearn.impute import SimpleImputer

In [8]:
# PREPROCESSING-1
_NUMERIC_COLS_WITH_MISSING_VALUES = ['MSSubClass', 'BsmtFinSF2', 'Fireplaces', 'FullBath', 'LotArea', 'GarageYrBlt', 'GarageCars', 'BsmtHalfBath', 'BsmtFinSF1', 'OverallCond', 'MiscVal', 'TotRmsAbvGrd', 'YrSold', 'KitchenAbvGr', 'HalfBath', 'YearBuilt', '1stFlrSF', 'GarageArea', 'MoSold', 'LotFrontage', 'BsmtFullBath', 'YearRemodAdd', 'ScreenPorch', 'WoodDeckSF', 'GrLivArea', 'SalePrice', 'OpenPorchSF', 'BedroomAbvGr', '3SsnPorch', 'BsmtUnfSF', 'EnclosedPorch', 'PoolArea', 'LowQualFinSF', 'MasVnrArea', '2ndFlrSF', 'OverallQual', 'TotalBsmtSF']
for _col in _NUMERIC_COLS_WITH_MISSING_VALUES:
    # Fit a mean imputer on the training column, then reuse the same mean on the test column
    __simple_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    df_train[_col] = __simple_imputer.fit_transform(df_train[_col].values.reshape(-1,1))[:,0]
    # Columns absent from the test set (e.g. the target SalePrice) are skipped
    if _col in df_test:
        df_test[_col] = __simple_imputer.transform(df_test[_col].astype(df_train[_col].dtypes).values.reshape(-1,1))[:,0]
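The per-column loop above can also be written as a single call, because `SimpleImputer` computes one statistic per column. The following is a small sketch on made-up data (the frames `train` and `test` and their values are assumptions), showing that the means come from the training frame only and are then applied to the test frame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train/test frames (made-up values) with gaps in both columns.
train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0],
                      "GarageArea": [400.0, 500.0, np.nan]})
test = pd.DataFrame({"LotFrontage": [np.nan, 70.0],
                     "GarageArea": [np.nan, 300.0]})

cols = ["LotFrontage", "GarageArea"]
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform learns one mean per column from the training data...
train[cols] = imputer.fit_transform(train[cols])
# ...and transform reuses those training means on the test data.
test[cols] = imputer.transform(test[cols])

print(imputer.statistics_)  # per-column means learned from train
print(test)
```

Fitting on the training data alone keeps test-set information out of the preprocessing step.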

## Remove Missing Data in Categorical Columns

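The input dataset has <b>43 categorical (string) columns</b> that are treated for missing data; they are listed in `_STRING_COLS_WITH_MISSING_VALUES` in the code cell of this section.

The following code fills the missing values, replacing each null with the most frequent value of its column in the training set.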

In [9]:
# PREPROCESSING-2
_STRING_COLS_WITH_MISSING_VALUES = ['Condition2', 'Fence', 'LandSlope', 'GarageCond', 'BsmtExposure', 'LotConfig', 'HeatingQC', 'SaleCondition', 'MSZoning', 'Alley', 'Exterior1st', 'Exterior2nd', 'Heating', 'HouseStyle', 'Utilities', 'BsmtCond', 'SaleType', 'BldgType', 'FireplaceQu', 'Street', 'BsmtFinType2', 'ExterQual', 'ExterCond', 'BsmtFinType1', 'Functional', 'GarageQual', 'CentralAir', 'LandContour', 'LotShape', 'GarageFinish', 'PoolQC', 'Foundation', 'RoofMatl', 'Neighborhood', 'GarageType', 'PavedDrive', 'KitchenQual', 'Condition1', 'Electrical', 'MiscFeature', 'MasVnrType', 'BsmtQual', 'RoofStyle']
for _col in _STRING_COLS_WITH_MISSING_VALUES:
    # Fit a most-frequent imputer on the training column, then reuse that value on the test column
    __simple_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    df_train[_col] = __simple_imputer.fit_transform(df_train[_col].values.reshape(-1,1))[:,0]
    if _col in df_test:
        df_test[_col] = __simple_imputer.transform(df_test[_col].astype(df_train[_col].dtypes).values.reshape(-1,1))[:,0]
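To make the `most_frequent` strategy concrete, here is a minimal sketch on a single made-up string column (the frames and values are assumptions for illustration). The mode is learned from the training column and then used to fill the test column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy string column with a gap in both train and test (made-up values).
train = pd.DataFrame({"GarageType": ["Attchd", "Detchd", np.nan, "Attchd"]}, dtype=object)
test = pd.DataFrame({"GarageType": [np.nan, "Detchd"]}, dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# "Attchd" occurs twice in train, so it becomes the fill value everywhere.
train["GarageType"] = imputer.fit_transform(train[["GarageType"]])[:, 0]
test["GarageType"] = imputer.transform(test[["GarageType"]])[:, 0]

print(train["GarageType"].tolist())
print(test["GarageType"].tolist())
```

For columns where a missing entry carries meaning (the competition's data description uses NA in `PoolQC` to mean that there is no pool, for instance), `SimpleImputer(strategy="constant", fill_value=...)` may be an alternative worth considering.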

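## Encode Categorical Features
Encode categorical features as an integer array.

The input dataset has <b>43 columns</b> with string values; they are listed in `_CATEGORICAL_COLS` in the code cell of this section.

The following code encodes the selected columns. Categories that appear only in the test set are mapped to -1.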

In [10]:
# PREPROCESSING-3
from sklearn.preprocessing import OrdinalEncoder
_CATEGORICAL_COLS = ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']
# Ordinal (integer) encoding, not one-hot; categories unseen during fit are encoded as -1
_ordinal_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
# Assign the encoded arrays directly so the result does not depend on index alignment
df_train[_CATEGORICAL_COLS] = _ordinal_encoder.fit_transform(df_train[_CATEGORICAL_COLS])
df_test[_CATEGORICAL_COLS] = _ordinal_encoder.transform(df_test[_CATEGORICAL_COLS])
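The effect of `handle_unknown="use_encoded_value"` is easiest to see on a tiny example. The sketch below uses made-up frames in which the test set contains a category (`"Dirt"`) that never occurs in training; it assumes scikit-learn 0.24 or newer, where this option is available.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data: "Dirt" appears only in the test frame (made-up values).
train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]}, dtype=object)
test = pd.DataFrame({"Street": ["Grvl", "Dirt"]}, dtype=object)

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Categories are sorted during fit, so "Grvl" -> 0.0 and "Pave" -> 1.0.
train_codes = encoder.fit_transform(train[["Street"]])
# The unseen category "Dirt" is mapped to the configured unknown_value, -1.
test_codes = encoder.transform(test[["Street"]])

print(train_codes[:, 0])
print(test_codes[:, 0])
```

Without this option, the default `handle_unknown="error"` would raise on the unseen category at transform time.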

## Numeric to Scale

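The input dataset has <b>37 numeric columns</b>, including the target `SalePrice`. We compress their values to a log scale with [log1p](https://numpy.org/doc/stable/reference/generated/numpy.log1p.html), which computes log(1 + x) and therefore maps zero to zero. Because the target is transformed as well, the model predicts on the log scale and the predictions are converted back with `expm1` later on.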

In [11]:
# PREPROCESSING-4
import numpy as np
_NUMERIC_COLS_TO_SCALE = ['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold', 'SalePrice']
for _col in _NUMERIC_COLS_TO_SCALE:
    df_train[_col] = np.log1p(df_train[_col])
    if _col in df_test:
        df_test[_col] = np.log1p(df_test[_col])
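As a quick illustration of why `log1p` is a convenient choice here, the sketch below applies it to a few made-up values, including zero, and then inverts the transform with `np.expm1`.

```python
import numpy as np

# Made-up values: a zero (e.g. an absent pool area), a small count, a price.
values = np.array([0.0, 1.0, 208500.0])

# log1p(x) = log(1 + x), so zero stays zero instead of becoming -inf.
logged = np.log1p(values)

# expm1(y) = exp(y) - 1 undoes the transform.
restored = np.expm1(logged)

print(logged)
print(restored)
```

The round trip recovers the original values up to floating-point error, which is what makes it safe to train on the log scale and report prices on the original one.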

In [12]:
# DETACH TARGET
train_df_feature = df_train.drop(['SalePrice'], axis=1)
train_df_target = df_train['SalePrice']
test_df_feature = df_test

## Model: CatBoostRegressor


In [None]:
%pip install catboost

In [None]:
# MODEL
import numpy as np
from catboost import CatBoostRegressor
model = CatBoostRegressor()
model.fit(train_df_feature, train_df_target)
y_pred = model.predict(test_df_feature)
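The cell above fits on all training rows, so it gives no estimate of how well the model generalises. One common way to get such an estimate is a hold-out split. The sketch below is only an illustration under stated assumptions: it uses synthetic data and scikit-learn's `GradientBoostingRegressor` as a stand-in so that it runs without CatBoost installed; in the notebook, `CatBoostRegressor` with `train_df_feature` and `train_df_target` would take their place.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for CatBoostRegressor
from sklearn.model_selection import train_test_split

# Synthetic features and a synthetic log-scale target (assumed data, not the competition's).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = 11.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0.0, 0.05, size=200)

# Hold out 20% of the rows for validation.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingRegressor(random_state=0)
model.fit(X_fit, y_fit)
val_pred = model.predict(X_val)

# The target is already on the log scale, so this is an RMSE of log values.
rmse_log = float(np.sqrt(np.mean((val_pred - y_val) ** 2)))
print(f"hold-out RMSE (log scale): {rmse_log:.4f}")
```

Since `SalePrice` was passed through `log1p` earlier, an RMSE computed this way measures error in log prices rather than in currency units.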

In [18]:
# INVERSE TARGET
import numpy as np
if 'test_df_target' in locals():
    test_df_target = np.expm1(test_df_target)
y_pred = np.expm1(y_pred)

# Submission File


In [19]:
# SUBMISSION
submission = pd.DataFrame(columns=['Id'], data=test_df_submission_columns)
submission['SalePrice'] = y_pred
submission.to_csv("submission.csv", index=False)
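Before uploading, a few cheap checks on the submission frame can catch common mistakes. The helper below, `check_submission`, is a hypothetical function written for this illustration (it is not part of pandas or of the competition tooling), and the frames it is run on are made up.

```python
import numpy as np
import pandas as pd

def check_submission(submission: pd.DataFrame, expected_rows: int) -> list:
    """Return a list of problems found; an empty list means the frame looks sane."""
    problems = []
    if list(submission.columns) != ["Id", "SalePrice"]:
        problems.append("unexpected columns")
        return problems
    if len(submission) != expected_rows:
        problems.append("unexpected row count")
    if submission["SalePrice"].isna().any():
        problems.append("missing predictions")
    if (submission["SalePrice"] <= 0).any():
        problems.append("non-positive prices")
    if submission["Id"].duplicated().any():
        problems.append("duplicate ids")
    return problems

# Made-up frames: one clean, one with several problems.
good = pd.DataFrame({"Id": [1, 2, 3], "SalePrice": [120000.0, 155000.0, 180000.0]})
bad = pd.DataFrame({"Id": [1, 1, 3], "SalePrice": [120000.0, np.nan, -5.0]})

print(check_submission(good, expected_rows=3))
print(check_submission(bad, expected_rows=3))
```

In the notebook this could be called as `check_submission(submission, expected_rows=len(df_test))` right before writing the CSV.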