# House Pricing Predictions
This is my attempt at the [Kaggle](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview) data science challenge: predicting house prices given a large number of features.

I got inspired by great notebooks such as:
* the feature-engineering lesson by [***RYAN HOLBROOK***](https://www.kaggle.com/code/ryanholbrook/feature-engineering-for-house-prices)
* the amazing data exploration notebook by [***Pedro Marcelino***](https://www.kaggle.com/code/pmarcelino/comprehensive-data-exploration-with-python)

## Preliminary work: imports and loading data

In [None]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import sklearn

In [None]:
# variables to store the location of the training and test datasets
train_file = "train.csv"
test_file = "test.csv"

In [None]:
df_train_org = pd.read_csv(train_file)
df_test_org = pd.read_csv(test_file)
# df and df_test are copies of the original datasets with target value referred to as y
Y = "y"
df = df_train_org.rename(columns={'SalePrice': Y})
df_test = df_test_org.rename(columns={'SalePrice':Y})

## Data Exploration 
In this section we explore the basic aspects of the provided dataset.

In [None]:
import copy
# a function to drop one or several columns from both the train and test dataframes
def drop_cols(cols):
    global df, df_test
    
    if isinstance(cols, str):
        df = df.drop(cols, axis=1)
        df_test = df_test.drop(cols, axis=1)
    else:
        for col in cols:
            df = df.drop(col, axis=1)
            df_test = df_test.drop(col, axis=1)

In [None]:
# let's first understand the nature of our data
print(df.shape, df_test.shape) 
# each training sample is described by 81 columns. This number is relatively high.

print((df['Id'].values == range(1 ,len(df) + 1)).all()) 
# as we can see, the Id column is merely for enumeration purposes. It can be either dropped or set as an index.
drop_cols("Id")
print(df.shape, df_test.shape)


### Categorical and Numerical features
Let's consider the different types of features. First, we divide them into numerical and non-numerical. The non-numerical ones are categorical (or can be treated as such). Numerical columns with integer values can also be considered categorical if the number of unique values is small.
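
The rule of thumb above can be sketched on toy data. The helper name `low_cardinality_int_cols` and the `max_unique` cutoff are illustrative assumptions, not part of the original notebook, and the toy frame only borrows a few column names from the dataset.

```python
import pandas as pd

def low_cardinality_int_cols(frame, max_unique=15):
    """Return integer columns whose number of distinct values is at most max_unique."""
    int_cols = [c for c in frame.columns if pd.api.types.is_integer_dtype(frame[c])]
    return [c for c in int_cols if frame[c].nunique() <= max_unique]

# toy data, not the real dataset
toy = pd.DataFrame({
    "MSSubClass": [20, 60, 20, 120, 60, 20],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Pave", "Pave"],
})

# MSSubClass has 3 distinct values, LotArea has 6, Street is not an integer column
candidates = low_cardinality_int_cols(toy, max_unique=3)
print(candidates)
```

The cutoff is a judgment call: a small value only flags label-like columns, while a larger one also flags counts such as the number of bathrooms.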

In [None]:
# consider non-numerical values
object_type = "object"
cat_type = 'category'
non_num_cols = df.select_dtypes([object_type, cat_type]).columns
print(non_num_cols)
num_cols = df.select_dtypes(np.number).columns
print(num_cols)

In [None]:
# let's consider the subset of numerical columns with few discrete values
num_discrete = df.select_dtypes('int64').columns
num_dis_count = [len(df[num_d].unique()) for num_d in num_discrete]  
print(dict(zip(num_discrete, num_dis_count)))

# we can see that "MSSubClass" is a categorical feature
# a number of features are not categorical per se, but can be treated as such: bath-related features, Fireplaces, GarageCars, and most importantly
# OverallQual and OverallCond


## Data Cleaning
Certain columns might contain corrupted data and thus require cleaning. I will start with categorical columns. 


### Cleaning categorical columns 
The main procedure is as follows:
* replace the values that are not listed in the data description with the documented ones; these are mainly typos in string values
* in the worst case, drop rows whose values differ significantly from the pre-determined categories
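
A minimal sketch of the first step, assuming the allowed values have been copied by hand from the data description. The helper `unexpected_values` and the `allowed_zoning` list are illustrative and should be checked against the actual description file.

```python
import pandas as pd

def unexpected_values(series, allowed):
    """Return the sorted values of the series that are missing from the allowed set."""
    return sorted(set(series.dropna()) - set(allowed))

# assumed set of documented MSZoning codes (verify against the data description)
allowed_zoning = ["A", "C", "FV", "I", "RH", "RL", "RP", "RM"]

# toy column, not the real dataset
toy = pd.Series(["RL", "RM", "C (all)", "FV", "RL"], name="MSZoning")

bad = unexpected_values(toy, allowed_zoning)
print(bad)

# once a typo is identified, map it to the documented value
fixed = toy.replace({"C (all)": "C"})
print(unexpected_values(fixed, allowed_zoning))
```

Listing the unexpected values first keeps the correction explicit: every replacement is a conscious decision rather than a silent overwrite.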


In [None]:
for col, uni_values in zip(non_num_cols, [df[col].unique() for col in non_num_cols]):
    print(col)
    print(uni_values)

In [None]:
# investigating the output of the previous cell as well as inspecting the content of the documentation
# suggests a couple of corrupted values in certain columns such as Exterior2nd

# let's define a function to replace these values both in the training and test dataframes
def replace_values(col_names, wrong_correct):
    # accept either one column name with one dict, or a list of column names with a list of dicts
    one = isinstance(col_names, str) and isinstance(wrong_correct, dict)
    many = isinstance(col_names, list) and all(isinstance(d, dict) for d in wrong_correct)
    assert one or many

    global df, df_test

    if one:
        col_names, wrong_correct = [col_names], [wrong_correct]

    for col, dic in zip(col_names, wrong_correct):
        # restrict the replacement to the column in question
        # (replacing on the whole dataframe would also alter unrelated columns)
        df[col] = df[col].replace(dic)
        df_test[col] = df_test[col].replace(dic)


In [None]:
# the value C (all) is corrupted in MSZoning
mszoning = "MSZoning"
correct_mszoning = {"C (all)": "C"}
ext2 = "Exterior2nd"
correct_ext2 = {"Brk Cmn": "BrkComm", "CmentBd": "CemntBd"}

ms_ext = [mszoning, ext2]
correct = [correct_mszoning, correct_ext2]

replace_values(ms_ext, correct)


### Cleaning numerical columns
This task is slightly trickier as it might require domain expertise. The main procedure is as follows:
1. replace (or drop) values that contradict common sense: for instance, areas cannot be negative and months cannot exceed $12$
2. consider the relationships between columns: values in certain columns cannot be smaller / larger than the corresponding values in other columns. This step requires a more careful study of the nature of the problem.
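
Both kinds of checks can be written as vectorized pandas expressions. The snippet below is a sketch on made-up rows that reuse the dataset's column names; it is not the code used later in this notebook, which applies row-wise functions instead.

```python
import pandas as pd

# toy rows, not the real dataset
toy = pd.DataFrame({
    "MoSold": [2, 13, 7],
    "YearBuilt": [1990, 2001, 1975],
    "YearRemodAdd": [1995, 1999, 1975],
})

# rule 1: a month must lie in [1, 12]; between() is inclusive on both ends
bad_month = toy[~toy["MoSold"].between(1, 12)]
print(bad_month)

# rule 2: a remodel cannot precede construction; taking the row-wise maximum
# lifts any earlier remodel year up to the construction year
fixed_remodel = toy[["YearBuilt", "YearRemodAdd"]].max(axis=1)
print(fixed_remodel.tolist())
```

Whether to repair an inconsistent value (as in rule 2) or drop the row is a modelling decision; the repair shown here mirrors the choice made for the year features further down.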

In [None]:
# let's first display the numerical columns
print(num_cols)

In [None]:
# let's consider areas
areas = [col for col in num_cols if ("area" in col.strip().lower())]
# investigating the data description, the term SF generally refers to surface (area)
areas.extend([col for col in num_cols if "SF" in col.strip()]) 
print(areas)
# verify all values are positive
areas_with_neg = [any(df[area] < 0) for area in areas]
areas_with_neg = [area for area, a in zip(areas, areas_with_neg) if a]
print(areas_with_neg)
# as we can see, no area value is negative

The data description indicates that:
* OverallQual and OverallCond should belong to the interval [1, 10]
* YearRemodAdd is the year of remodel, thus it should be larger than or equal to YearBuilt
* The relationship between the year the garage was built and the other year features should be investigated.
* MSSubClass represents a label encoding of the different types of houses: values should belong to the predetermined set of values specified in the data description
* MoSold is the month in which the house was sold. It should belong to [1, 12]
* It might be worthwhile investigating any abnormalities in the relationship between "BsmtFinSF1", "BsmtUnfSF", "BsmtFinSF2" and "TotalBsmtSF"
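
For the last point, one possible check is sketched below. It assumes that the total basement area is meant to equal the sum of the two finished areas and the unfinished area; that relation is an assumption to confirm against the data description. The helper name `basement_area_mismatch` is hypothetical.

```python
import pandas as pd

def basement_area_mismatch(frame):
    """Return the rows where the three partial areas do not add up to the total."""
    parts = frame["BsmtFinSF1"] + frame["BsmtFinSF2"] + frame["BsmtUnfSF"]
    return frame[parts != frame["TotalBsmtSF"]]

# toy rows, not the real dataset; the last row is inconsistent on purpose
toy = pd.DataFrame({
    "BsmtFinSF1": [700, 0, 300],
    "BsmtFinSF2": [0, 0, 100],
    "BsmtUnfSF": [150, 0, 200],
    "TotalBsmtSF": [850, 0, 650],
})

mismatch = basement_area_mismatch(toy)
print(mismatch)
```

Rows with missing values in any of the four columns would also be flagged, since a comparison with NaN is never equal.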
 

In [None]:
# first of all let's define a method to replace the names of features in both training and test datasets

def new_col_names(old_new_names: dict):
    global df, df_test
    # DataFrame.rename silently ignores missing labels, so report them explicitly
    df_no_col = [col for col in old_new_names if col not in df.columns]
    df_test_no_col = [col for col in old_new_names if col not in df_test.columns]
    if df_no_col:
        print("{cols} are not in the {dataf}".format(cols=df_no_col, dataf="training dataset"))
    if df_test_no_col:
        print("{cols} are not in the {dataf}".format(cols=df_test_no_col, dataf="test dataset"))
    df = df.rename(columns=old_new_names)
    df_test = df_test.rename(columns=old_new_names)


In [None]:
old_new_cols = {"OverallQual": "qua", 
"OverallCond": "cond", "YearBuilt": "Yb", "YearRemodAdd": "Yr", "MSSubClass": "mss" , 
"BsmtFinSF1": "bSF1", "BsmtFinSF2": "bSF2", "BsmtUnfSF": "bubf", "GarageYrBlt":"GYb"}
new_col_names(old_new_cols)

In [None]:
# let's define a function that applies a function to either
# the whole dataframe (row by row) or certain columns of the dataframe

def apply_functions(funcs, col_names=None):
    # either one function is passed and applied to every row of the dataframe,
    # or an equal number of columns and functions is passed and each function is applied to the corresponding column
    all_data = callable(funcs) and col_names is None
    # if the funcs argument is a single function, iterating over it raises a TypeError
    try:
        col_funcs = all(callable(f) for f in funcs) and len(funcs) == len(col_names)
    except TypeError:
        col_funcs = False

    assert all_data or col_funcs

    global df, df_test
    if col_names is None: # if the function is to be applied to the whole dataframe
        df = df.apply(funcs, axis=1)
        df_test = df_test.apply(funcs, axis=1)
    else:
        for col, f in zip(col_names, funcs):
            # apply each function to its own column only
            df[col] = df[col].apply(f)
            df_test[col] = df_test[col].apply(f)

# verify the remodeling and building years features
yb = "Yb"
yr = "Yr"
            
def set_built_remodel_years(row):
    if row[yr] < row[yb]:
        row[yr] = row[yb]
    return row

In [None]:
# verify the integrity of overall quality and condition features
qua = "qua"
cond = "cond"

assert all(df[qua].isin(range(1, 11)))
assert all(df[cond].isin(range(1, 11)))

assert df[df[yb] > df[yr]].empty # the resulting dataframe is empty: no problems with either of these features
apply_functions(set_built_remodel_years)
assert df_test[df_test[yb] > df_test[yr]].empty

# verify the MSSubClass feature
mss = "mss"
mss_values = [20, 30, 40, 45, 50, 60, 70, 75, 80, 85, 90, 120, 150, 160, 180, 190]

assert df[~df[mss].isin(mss_values)].empty # all values of the "MSSubClass" feature belong to the documented set
assert df_test[~df_test[mss].isin(mss_values)].empty
# verify the month feature

assert df[~df["MoSold"].isin(range(1, 13))].empty  # all values of the month feature are correct
assert df_test[~df_test["MoSold"].isin(range(1, 13))].empty

In [None]:
# let's consider the year where the garage was built
gyb = "GYb"
print(df[df[gyb] < df[yb]][[gyb, yb, yr]])

print(df_test[df_test[gyb] < df_test[yb]][[gyb, yb, yr]]) 
# with few exceptions, when the garage year precedes the year the house was built, it does so by only a few years,
# which suggests that the garage was meant to be built with the house, yet the house took slightly longer to complete.

# if GYb is less than Yb then we will set it to Yb.
def set_garage_year(row):
    if row[gyb] < row[yb]:
        row[gyb] = row[yb]
    return row
# set the changes in both train and test data
print(callable(set_garage_year))
apply_functions(set_garage_year)

assert df[df[gyb] < df[yb]].empty
assert df_test[df_test[gyb] < df_test[yb]].empty 

# the changes were applied to both data sets

## Imputing missing values
The second step is imputing the missing values and making sure no NaN values are passed to our machine learning models. The main procedure is as follows:
1. drop all columns with a NaN ratio exceeding a certain threshold
2. impute the missing values. The strategy depends mainly on the feature in question:
    * if the feature is highly related to other features, then a natural and accurate value can be deduced from them
    * some statistical value (such as the mode or the median) could be used to impute the missing values
    * investigating related features could help come up with an aggregated value when the data is grouped by a number of features.
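
The first and last strategies can be sketched on toy data as follows. The helper names `high_nan_columns` and `fill_with_group_median`, and the choice of grouping `LotFrontage` by `Neighborhood`, are illustrative assumptions; the notebook itself later uses simpler fill values.

```python
import numpy as np
import pandas as pd

def high_nan_columns(frame, threshold=0.8):
    """Return the columns whose ratio of missing values exceeds the threshold."""
    ratios = frame.isna().mean()
    return ratios[ratios > threshold].index.tolist()

def fill_with_group_median(frame, col, by):
    """Fill missing values of col with the median of col within each group of by."""
    group_median = frame.groupby(by)[col].transform("median")
    return frame[col].fillna(group_median)

# toy data, not the real dataset
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 50.0, 50.0, np.nan],
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, np.nan, "Gd"],
})

to_drop = high_nan_columns(toy, threshold=0.8)
filled = fill_with_group_median(toy, "LotFrontage", "Neighborhood")
print(to_drop)
print(filled.tolist())
```

A grouped median keeps the imputed value close to what is typical for similar houses, at the cost of assuming that the grouping column is itself complete.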

### 1st strategy: drop columns

In [None]:
# let's first discover which columns have Nan values
nan_val_train = df.isna().sum()
cols_nan = nan_val_train[nan_val_train > 0]
print(cols_nan)
NAN_THRESHOLD = 0.8 # all columns with a nan ratio above 0.8 will be dropped
cols_nan = cols_nan / len(df)
# print(cols_nan)
cols_nan_drop = cols_nan[cols_nan > NAN_THRESHOLD].index.values
print(cols_nan_drop) 

In [None]:
nan_val_test = df_test.isna().sum()
nan_val_test = nan_val_test[nan_val_test > 0]
print(nan_val_test)

As we can see, Alley, PoolQC, Fence and MiscFeature have an extremely high ratio of NaN values.  
Before dropping these columns, it is necessary to drop the features tightly related to them.  
Investigating the data description reveals that:  
* PoolQC is related to PoolArea
* MiscFeature is related to MiscVal


In [None]:
# let's better understand those relations
pq = "PoolQC"
pa = "PoolArea"
print(df[df[pq].isna()][pa].value_counts()) 
# Nan values are associated with 0 pool area, which means that there is no pool in the first place
# an extremely high ratio of houses do not have a pool.
# let's consider the prices of houses with pool
print(df[~df[pq].isna()][Y]) # there are only a few values, with relatively high variance

In [None]:
misf = "MiscFeature"
misv = "MiscVal"
print(df[df[misf].isna()][misv].value_counts()) # we can see that houses with NaN on MiscFeature
# have 0 on MiscVal, which means that they do not have any additional features to mention
# both of these features should be dropped

In [None]:
# let's drop the features in questions
cols_nan_drop_final = cols_nan_drop.tolist() + [misv, pa]
drop_cols(cols_nan_drop_final)

assert all(df.drop(Y, axis=1).columns.values == df_test.columns.values)

### 2nd strategy: deduce values

In [None]:
# let's consider the columns left with nan values
nan_val_train = df.isna().sum()
cols_nan = nan_val_train[nan_val_train > 0]
print(cols_nan)

The columns with NaN values can be divided into the following groups:
* Masonry veneer 
* Basement
* Garage
* FirePlace
* LotFrontage (only one column) and Electrical


In [None]:
def set_nan(col_names, fill_values):
    # either one column name with one fill value, or two sequences of equal length
    one = isinstance(col_names, str)
    try:
        many = (not one) and len(col_names) == len(fill_values)
    except TypeError:
        many = False
    assert one or many
    global df, df_test
    if one:
        df[col_names] = df[col_names].fillna(fill_values)
        df_test[col_names] = df_test[col_names].fillna(fill_values)
    else:
        for col, v in zip(col_names, fill_values):
            df[col] = df[col].fillna(v)
            df_test[col] = df_test[col].fillna(v)

In [None]:
# let's consider Masonry Veneer columns
msvt = "MasVnrType"
msva = "MasVnrArea"

print(df[(df[msvt].isna()) | (df[msva].isna())][[msva, msvt]]) 
# we can see that type and area are either both nan or both non-nan
# a reasonable assumption is that there is no Masonry Veneer in such cases
set_nan([msvt, msva], ['None', 0])

In [None]:
# let's consider the garage columns
gt = "GarageType"
gf = "GarageFinish"
gc = "GarageCars"
ga = "GarageArea"
gcond = "GarageCond"
gqua = "GarageQual"

g = [gt, gyb, gf, gc, ga, gcond, gqua]
print(df[df[gt].isna() | df[gyb].isna() | df[gf].isna() | df[gcond].isna() | df[gqua].isna()][ga].value_counts())
# if any of the garage features in a certain row is set to Nan, then its area is 0
# which means there is no garage

# let's first consider the Garage built year feature for such rows
print(df[df[gt].isna() | df[gf].isna() | df[gcond].isna() | df[gqua].isna()][gyb].value_counts())

set_nan(g, ["NA", 0, "NA", 0, 0, "NA", "NA"])

garage_old_new = {gt: "gt", gf:"gf", gc:"gc", ga:"ga", gcond:"gcond", gqua:"gqua"}

new_col_names(garage_old_new)

gt = "gt"
gf = "gf"
gc = "gc"
ga = "ga"
gcond = "gcond"
gqua = "gqua"


In [None]:
# let's consider the basement nan-values
bqua = "BsmtQual"
bcond = "BsmtCond"
bexp = "BsmtExposure"
bf2 = "BsmtFinType2"
bf1 = "BsmtFinType1"
baf1 = "bSF1"
baf2 = "bSF2"

# BsmtQual         37
# BsmtCond         37
# BsmtExposure     38
# BsmtFinType1     37
# BsmtFinType2     38

print(df[df[bqua].isna() & df[bcond].isna() & df[bexp].isna() & df[bf1].isna() ][baf1].value_counts())
# we can assume that Nan values for each of these basement features reflect No basement

set_nan([bqua, bcond, bexp, bf1, bf2], ["NA"] * 5)
set_nan(["bSF1", "bSF2", "bubf", "TotalBsmtSF", "BsmtFullBath", "BsmtHalfBath"], [0] * 6)
# change the basement's features 

basement_new_old = {bqua: "bqua", bcond: "bcond", bexp: "bexp", bf1:"bf1", bf2: "bf2"}
new_col_names(basement_new_old)

bqua = "bqua"
bcond = "bcond"
bexp = "bexp"
bf2 = "bf2"
bf1 = "bf1"
# the area columns keep the names "bSF1" and "bSF2" since they were not renamed above
baf1 = "bSF1"
baf2 = "bSF2"


In [None]:
# let's check the rest real quick
# LotFrontage     259
# Electrical        1
# FireplaceQu     690

firequa = "FireplaceQu"
firep = "Fireplaces"
print(df[df[firequa].isna()][firep].value_counts())
# so nan values refer to non-existent fireplaces
set_nan(firequa, "NA")

e = "Electrical"
print(df[e].value_counts())
lf = "LotFrontage"
# we can assume with high probability that the only missing value is SBrkr
# the simplest solution for LotFrontage is to set nan values to 0
set_nan([e, lf], ["SBrkr", 0])

fire_new_old = {firequa:"firequa", firep:"firep"}
new_col_names(fire_new_old)

firequa = "firequa"
firep = "firep"


In [None]:
def classify_cols():
    global df, df_test
    cat_train = df.drop(Y, axis=1).select_dtypes(['category', 'object']).columns.values.tolist()
    cat_test = df_test.select_dtypes(['category', 'object']).columns.values.tolist()
    
    num_train = df.drop(Y, axis=1).select_dtypes(np.number).columns.values.tolist()
    num_test = df_test.select_dtypes(np.number).columns.values.tolist()
    assert (num_train == num_test) and (cat_train == cat_test)
    return num_train, cat_train

In [None]:
num_cols, cat_cols = classify_cols()
print(num_cols, cat_cols, sep='\n\n\n')

In [None]:

# as all these columns with NaN values are categorical, a reasonable statistic to choose is the mode
# let's import an imputer to take care of missing values for us
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
df_cols = df.columns.values.tolist()
df_test_cols = df_cols.copy()
df_test_cols.remove(Y)

# the imputing process sets the column types to object, so the original types are stored first
num_cols, cat_cols = classify_cols() 

df = imp.fit_transform(df)
df = pd.DataFrame(df, columns=df_cols)
df_test = imp.fit_transform(df_test)
df_test = pd.DataFrame(df_test, columns=df_test_cols)

for c in num_cols:
    df[c] = df[c].astype('float64')
    df_test[c] = df_test[c].astype('float64')

df[Y] = df[Y].astype('float64')

In [None]:
assert ((num_cols, cat_cols) == classify_cols())

In [None]:
assert df_test.isna().sum().sum() == 0 and df.isna().sum().sum() == 0

## Encode the categorical features

In [None]:
# store the current state of the dataframes for later modifications
df_cat = df.copy()
df_t_cat = df_test.copy()

In [None]:
# let's define the orders for each of the ordinal columns
usual_levels = ["NA", "Po", "Fa", "TA", "Gd", "Ex"]

ordered_levels = {
    "ExterQual": usual_levels,
    "ExterCond": usual_levels,
    "bqua": usual_levels,
    "bcond": usual_levels,
    "HeatingQC": usual_levels,
    "KitchenQual": usual_levels,
    "firequa": usual_levels,
    "gqua": usual_levels,
    "gcond": usual_levels,
    "LotShape": ["Reg", "IR1", "IR2", "IR3"],
    "LandSlope": ["Sev", "Mod", "Gtl"],
    "bexp": ["NA", "No", "Mn", "Av", "Gd"],
    "bf1": ["NA", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"],
    "bf2": ["NA", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"],
    "Functional": ["Sal", "Sev", "Maj1", "Maj2", "Mod", "Min2", "Min1", "Typ"],
    "gf": ["NA", "Unf", "RFn", "Fin"],
    "PavedDrive": ["N", "P", "Y"],
    "Utilities": ["NoSeWa", "NoSewr", "AllPub"],
    "CentralAir": ["N", "Y"],
    "Electrical": ["Mix", "FuseP", "FuseF", "FuseA", "SBrkr"],
}

In [None]:
# let's encode ordinal data
def cat_to_ord(col: str, categories: list, ordered: bool = True):
    global df, df_test
    final_cat = categories if ordered else categories[::-1]
    # map each level to its rank; values missing from the list become NaN
    mapping = dict(zip(final_cat, range(len(final_cat))))
    df[col] = df[col].map(mapping)
    df_test[col] = df_test[col].map(mapping)


for k, v in ordered_levels.items():
    cat_to_ord(k, v)

In [None]:
num_c, cat_c = classify_cols()
for k in ordered_levels.keys():
    assert k not in cat_c 

In [None]:
# let's encode non ordinal data using the label encoder
from sklearn.preprocessing import LabelEncoder

non_ord_cat = [cat for cat in classify_cols()[1] if cat not in ordered_levels.keys()]

le = LabelEncoder()

for col in non_ord_cat:
    df[col] = le.fit_transform(df[col].values)    
    df_test[col] = le.transform(df_test[col])
# changing some column names for ease of manipulation

new_col_names({"TotalBsmtSF": "bSF", "bubf": "b_unfSF"})

In [None]:
assert df.select_dtypes(['object', 'category']).empty and df_test.select_dtypes(['object', 'category']).empty


## Feature Engineering
In this section we will work on improving the performance by creating a number of synthetic features out of the given ones. 

In [None]:
# let's save the dataframe before introducing any new feature
df_base = df.copy()
df_test_base = df_test.copy()

In [None]:
print(df.isna().sum().sum())
print(df_test.isna().sum().sum())

### Baseline performance
Creating a baseline model with the initial (remaining) features gives me a reference on which I can base my next decisions. I will use the same performance metric as in the competition. The baseline model is an XGBoost regressor (`XGBRegressor`).
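
The competition scores submissions with the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price. A dependency-free sketch of that metric is given below; the function name `rmsle` is just a label used here.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared error between log(prediction) and log(observation)."""
    squared = [(math.log(p) - math.log(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# a 10% overestimate costs the same on a cheap and on an expensive house
cheap = rmsle([100_000], [110_000])
expensive = rmsle([500_000], [550_000])
print(cheap, expensive)
```

Working on the log scale means relative errors are what matter, which is why the model is trained on `np.log(y)` and evaluated with a plain squared error in the cells that follow.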

In [None]:
# defining the cross validation procedure
from sklearn.model_selection import KFold
n_splits = 5 
random_state = 3
shuffle = True
kf = KFold(n_splits=n_splits, random_state=random_state, shuffle=shuffle)

In [None]:
print(sklearn.metrics.get_scorer_names())

In [None]:
# the competition metric is the root mean squared error between the logarithm
# of the predicted price and the logarithm of the observed price
# let's define a function to calculate a model's performance according to this metric
scoring = "neg_mean_squared_error"
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor as xgr
def model_performance(X, y, model=None):
    # build a fresh default model on each call instead of sharing one default instance
    if model is None:
        model = xgr(random_state=0)
    log_y = np.log(y)
    # use the shuffled KFold splitter defined above
    score = cross_val_score(model, X, log_y, cv=kf, scoring=scoring)
    return np.sqrt(-score.mean()) 

In [None]:
X = df.copy()
y = X.pop(Y)

from sklearn.ensemble import RandomForestRegressor

base_score = model_performance(X, y)
print(base_score)

### Informative features: mutual information
The mutual information is quite a powerful and general technique to determine the relevance of features with respect to the target variable.
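
To make the quantity concrete, here is the plug-in formula for two discrete variables, $I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$, written with the standard library only. This is a sketch of the definition, not of scikit-learn's estimator: `mutual_info_regression` handles continuous variables with a nearest-neighbor based estimate instead.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        mi += p_joint * math.log(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi

# a feature that determines the target carries maximal information
dependent = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
# a feature that is independent of the target carries none
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(dependent, independent)
```

Unlike correlation, this score does not assume a linear relationship, which is what makes it a convenient first filter for uninformative features.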


In [None]:
from sklearn.feature_selection import mutual_info_regression

def make_mi_scores(X, y):
    X = X.copy()
    # discrete features are the ones with type int
    discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]

    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features, random_state=0)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores


def plot_mi_scores(scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")

In [None]:
mi_scores = make_mi_scores(X, y)


In [None]:
# drop all the features that do not score higher than a certain predetermined threshold.
mi_threshhold = 0.02

print(df.columns)
print(df_test.columns)
irrelevant_feats = mi_scores[mi_scores < mi_threshhold].index.values.tolist()

drop_cols(irrelevant_feats)
print(df.columns)
print(df_test.columns)

In [None]:
X = df.copy()
y = X.pop(Y) 

new_score = model_performance(X, y)
print(new_score) # a very small gain is achieved by removing the uninformative features

In [None]:
print(mi_scores.head(20))

#### Quality feature
In this subsection I try to understand the Quality feature and its relations to the other features.

In [None]:
# let's create a function that returns three lists: num, ord, cat features
def feature_types():
    global df, df_test
    # using the dataframe version before the numerical encoding
    num = df_cat.select_dtypes(np.number).columns.values.tolist()
    # named ordinal to avoid shadowing the built-in ord
    ordinal = [k for k in ordered_levels.keys()]
    # filter the columns that are still in the dataframe
    num = [col for col in num if col in df.columns]
    ordinal = [col for col in ordinal if col in df.columns]
    # the columns left in df that do not belong to the previous two are categorical
    cat = [k for k in df.columns if k not in ordinal and k not in num]
    return num, ordinal, cat

In [None]:
num_cols, ord_cols, cat_cols = feature_types()

In [None]:
num_or_ordinal = num_cols  + ord_cols
corr_with_qua = df.loc[:, num_or_ordinal].corr()[qua] 
corr_with_qua = corr_with_qua[(corr_with_qua > 0.4) | (corr_with_qua < -0.4)]
print(corr_with_qua.sort_values())

In [None]:
def plot_scatters(col_names, x=None):
    global df, df_test
    if x is None:
        x = df.index.values.tolist()
    for col in col_names:
        plt.scatter(x=x, y=df[col].values)
        
    plt.title(str(col_names) + " variation")
    plt.show()

In [None]:
# string names avoid the deprecation warnings raised by recent pandas versions for NumPy callables
agg_funcs = ['count', 'mean', 'median', 'min', 'max']

In [None]:
# let's consider a new feature: qua * cond
df['state'] = df[qua] * df[cond]
# print(df['state'].value_counts().sort_index())
print(df[['state', Y, qua, cond]].corr()[Y])
# as the condition is not as significant as the quality, it might be worth trying a different transformation
df['state'] = df[qua] *  np.floor(np.sqrt(df[cond]))
print(df[['state', Y, qua, cond]].corr()[Y])
# apparently this feature makes the best out of both features. Let's verify its effect with and without the old features
X = df.copy()
y = X.pop(Y)
print(model_performance(X, y))
# there is a slight improvement of 0.01
# the feature is worth keeping
df_test['state'] = df_test[qua] *  np.floor(np.sqrt(df_test[cond]))

In [None]:
X = df.copy()
y = X.pop(Y)
print(model_performance(X, y))
# there is a slight improvement of 0.01
# the feature is worth keeping

### Garage features
Garage is clearly an informative element. Some more investigation is needed.

In [None]:
garage_cols = [col for col in df.columns if col.lower().startswith("g")]
garage_cols.remove("GrLivArea")
print(df[garage_cols].corr())

In [None]:
# let's inspect gt, GYB, and gf
# for c in [gt, gyb, gf]:
#     print(df_cat[c].value_counts())

print(df_cat[(df_cat[gyb] != df_cat[yb]) & (df_cat[gyb]!= df_cat[yr])][gt].value_counts())

In [None]:
print(pd.pivot_table(df, values=gyb, index=gqua, aggfunc=agg_funcs + [pd.Series.mode]))
print(pd.pivot_table(df, values=gyb, index=gcond, aggfunc=agg_funcs + [pd.Series.mode]))
# we can see that the older the garage, the more likely its quality and condition are to degrade
# let's try to incorporate this idea
print(pd.pivot_table(df, values=gqua, index=gf, aggfunc=agg_funcs + [pd.Series.mode]))
print(pd.pivot_table(df, values=gcond, index=gf, aggfunc=agg_funcs + [pd.Series.mode]))

# gf seems to be of little to no significance

In [None]:
print(pd.pivot_table(df, index=gqua, values=qua, aggfunc=agg_funcs))
print(df[[gqua, gcond, Y]].corr()[Y])
# the new state of the garage should be similar to the one created for the whole house
# let's suggest the new state
# 0, 1, 2 -> 1/2: quality is low anyway
# 3 -> 1: only the quality matters in this case
# 4, 5 -> 3/2: when the condition is high, a better condition means a better state

gcond2= "gcond2"
def new_gcond(row):
    row[gcond2] = 0
    if row[gcond] in [0, 1, 2]:
        row[gcond2] = 0.5
    elif row[gcond] == 3:
        row[gcond2] = 1
    else:
        row[gcond2] = 1.5
    return row
apply_functions(new_gcond)

df['gstate'] = df[qua] * df[gcond2]
df_test['gstate'] = df_test[qua] * df_test[gcond2]

print(df[[Y, gcond, gqua, gcond2, 'gstate']].corr()[Y]) 
# as we can see, gstate is highly correlated with the target variable
# let's test the new feature
X = df.copy()
y = X.pop(Y)
g_mi_scores =  make_mi_scores(X, y)[[gcond, gqua, gcond2, 'gstate']]
print(g_mi_scores)

In [None]:
# at the moment the new feature seems quite promising, let's inspect its effect on the performance
print(model_performance(X, y))

# let's try removing the different subsets of [gcond, gqua, gcond2]
X1 = df.drop([gcond, gqua, gcond2], axis=1).copy()
y1 = X1.pop(Y)
print(model_performance(X1, y1))

## uncomment the lines of code below to 
## see the performance with the different subsets of removed features

# X2= df.drop([gqua, gcond2], axis=1).copy()
# y2 = X2.pop(Y)
# print(model_performance(X2, y2))

# X3= df.drop([gcond, gcond2], axis=1).copy()
# y3 = X3.pop(Y)
# print(model_performance(X3, y3))

# X4= df.drop([gcond, gqua], axis=1).copy()
# y4 = X4.pop(Y)
# print(model_performance(X4, y4))

# X5= df.drop([gcond], axis=1).copy()
# y5 = X5.pop(Y)
# print(model_performance(X5, y5))

# X6= df.drop([gcond2], axis=1).copy()
# y6 = X6.pop(Y)
# print(model_performance(X6, y6))

# X7= df.drop([gqua], axis=1).copy()
# y7 = X7.pop(Y)
# print(model_performance(X7, y7))

## even though the performance did not improve, the mi scores as well
## as the correlation scores are solid evidence of the usefulness of this feature


In [None]:
min_gyb = df[gyb].min()
max_gyb = df[gyb].max()
# there are two possible ways, either consider age divided by the number of garages
# or consider the square root or log of the age
gage = "gage"
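df[gage] = np.log(np.sqrt((max_gyb + 2) - df[gyb]))

# use the test garage years here, not the training ones
# note: any test garage year above max_gyb + 1 would produce nan or -inf and needs separate handling
df_test[gage] = np.log(np.sqrt((max_gyb + 2) - df_test[gyb]))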

gage_counts = df[gage].value_counts()
plt.bar(gage_counts.index.values, gage_counts.values)
plt.show()
# the gage feature is promising
 

In [None]:
# let's drop gt, gf for the moment, as they do not seem to be as informative 
drop_cols([gt, gf])

In [None]:
garage_cols = [col for col in df.columns if col.lower().startswith("g")]
garage_cols.remove("GrLivArea")
print(garage_cols )

In [None]:
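# restore the garage area in both dataframes so that they keep the same columns
df[ga] = df_base[ga]
df_test[ga] = df_test_base[ga]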

In [None]:
# let's consider the relation between the two garage size features: area and car capacity
print(df[[ga, gc, Y]].corr())
print(pd.pivot_table(df, values=ga, index=gc, aggfunc=agg_funcs))

# there is practically no difference between having 3 and 4 car capacity in a garage
# let's experiment with setting the five 4-car garages (and any 5-car garage) to a capacity of 3
replace_values([gc, gc], [{4:3}, {5: 3}])
print(df[garage_cols].corr())

In [None]:
# let's investigate the effect of the area within the garage with the same car capacity
g1cap = df[df[gc] == 1]
g2cap = df[df[gc] == 2]
g3cap = df[df[gc] == 3]
gs = [g1cap, g2cap, g3cap]

In [None]:
for g in [g1cap, g2cap, g3cap]:
    print(g.loc[:, [Y, ga]].corr()[Y])  # numerically a garage's area is not of much relevance given the capacity is either 1 or 3

In [None]:
gcs = [g1cap[ga].values, g2cap[ga].values, g3cap[ga].values]
plt.boxplot(gcs)
plt.show()    

In [None]:
# let's inspect each garage capacity separately
ga_outliers = {1:[120, 450], 2: [250, 780], 3:[600, 1100]}
# let's try rendering the data to normal distribution
for g, out in zip(gs, ga_outliers):
    # work on a copy to avoid assigning to a slice of df
    g_no_out = g[(g[ga] <= ga_outliers[out][1]) & (g[ga] >= ga_outliers[out][0])].copy()
    g_no_out[ga] = np.log((g_no_out[ga]))
    g_no_out[Y] = g_no_out[Y] / (10 ** 4)
    plt.bar(g_no_out[ga].values, g_no_out[Y].values)
    plt.show()
    
# add the bounds for the homes without a garage
ga_outliers[0] = [0.0, 0.0]

In [None]:
def transform_garage_areas(row):
    # ga_outliers[row[gc]][0]: represents the minimum (no -outlier) area associated with the capacity of the garage 
    row[gc] = int(row[gc])
    row[ga] = max(ga_outliers[row[gc]][0], row[ga])
    row[ga] = min(ga_outliers[row[gc]][1], row[ga])
    row[ga] = np.log(np.sqrt(row[ga])) if row[ga] > 0 else 0
    return row   
apply_functions(transform_garage_areas)

In [None]:
plt.bar(df[ga].values, df[Y].values)

In [None]:
# drop the garage features that were used to build the new ones
drop_cols([gcond2, gcond, gqua, gyb])
garage_cols = [col for col in df.columns if col.lower().startswith("g")]
garage_cols.remove("GrLivArea")
print(garage_cols)

In [None]:
X = df.copy()
y = X.pop(Y)
print(model_performance(X, y))
classify_cols() # added to make sure both dataframes still have the same columns

### Basement Features
The large number of basement-related features requires a closer investigation of the relations between them. Feature engineering could pay off well in this aspect of the problem.


In [None]:
df['bSF2'] = df_base['bSF2']
df_test['bSF2'] = df_test_base['bSF2']
new_col_names({"TotalBsmtSF": "bSF", "bubf": "b_unfSF"})
bas_cols_init = [col for col in df.columns.values if col.lower().startswith("b")]
bas_cols_init.remove("BldgType")
bas_cols_init.remove( "BedroomAbvGr")
print(bas_cols_init)

#### Applying PCA on basement features
Due to the large number (8 features for one aspect), complexity and (potential) importance of basement features, it is worth experimenting with dimensionality reduction such as PCA.
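
As a reminder of what PCA computes, the sketch below derives the components of standardized data from a singular value decomposition with NumPy. The helper `pca_svd` and the toy matrix are illustrative assumptions; the cells that follow rely on scikit-learn's `PCA` instead.

```python
import numpy as np

def pca_svd(X):
    """Return scores, loadings and explained variance ratios of standardized X."""
    X = np.asarray(X, dtype=float)
    # standardize each column to zero mean and unit variance
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
    scores = U * S                      # coordinates of the samples on the components
    loadings = Vt.T                     # one column of feature weights per component
    explained = S ** 2 / np.sum(S ** 2)
    return scores, loadings, explained

# two perfectly correlated toy features: one component carries all the variance
toy = [[1, 2], [2, 4], [3, 6], [4, 8]]
scores, loadings, explained = pca_svd(toy)
print(explained)
```

When several features move together, as basement areas and quality ratings may, a few components can summarize them; the loadings indicate which original features each component mixes.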

In [None]:
# let's try to apply PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def apply_pca(X, features, n_comp=None, standardize=True):
    # Standardize the data passed
    if standardize:
        scaler = StandardScaler()
        # scale the data using the built-in scaler
        X = scaler.fit_transform(X)        
    
    # Create principal components
    if n_comp is None:
        pca = PCA()
    else:
        pca = PCA(n_components=n_comp)
        
    X_pca = pca.fit_transform(X)
    # Convert to dataframe
    component_names = [f"PC{i+1}" for i in range(X_pca.shape[1])]
    X_pca = pd.DataFrame(X_pca, columns=component_names)
    # Create loadings
    loadings = pd.DataFrame(
        pca.components_.T,  # transpose the matrix of loadings
        columns=component_names,  # so the columns are the principal components
        index=features,  # and the rows are the original features
    )
    return pca, X_pca, loadings
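
To see what the explained variance and the loadings returned by `apply_pca` stand for, here is a small NumPy-only sketch on synthetic data. It uses a plain SVD instead of sklearn's `PCA` (a swapped-in technique, chosen only to keep the example dependency-light), and the data is made up: three strongly related columns plus one independent column.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical data: three strongly related "surface" columns and one independent column
base = rng.normal(size=200)
X = np.column_stack([
    base + 0.1 * rng.normal(size=200),
    2 * base + 0.1 * rng.normal(size=200),
    -base + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

# standardize each column, which is what StandardScaler does in apply_pca
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# the right singular vectors of the standardized data are the principal directions
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

# share of the total variance carried by each principal component
explained_variance_ratio = S**2 / np.sum(S**2)
# rows are the original features, columns are the principal components
loadings = Vt.T
print(explained_variance_ratio.round(3))
```

When several columns measure almost the same thing, the first component absorbs most of the variance, which is the situation I expect for the basement surfaces.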


In [None]:
df_bas = df.loc[:, bas_cols_init + [Y]]
X = df_bas.copy()
y = X.pop(Y)

bas_pca, bas_X_pca, bas_loadings = apply_pca(X, bas_cols_init)

# print(bas_X_pca, bas_loadings, sep='\n')

bas_pca_mi_scores = make_mi_scores(bas_X_pca, y)
print(bas_pca_mi_scores)

bas_X_pca[Y] = y
print(bas_X_pca.corr()[Y])

In [None]:
# I will take PC3, PC1 and PC8 as the basement features derived from PCA
df['bas1'] = bas_X_pca['PC3']
df['bas2'] = bas_X_pca['PC1']
df['bas3'] = bas_X_pca['PC8']
# drop_cols(bas_cols_init)
BAS_PCA = ['bas1', 'bas2', 'bas3']

In [None]:
def display_one_distribution(iterable, label, n_unique=30):
    # few distinct values: print the value counts; otherwise draw a boxplot
    if len(np.unique(iterable)) <= n_unique:
        print(iterable.value_counts().sort_index())
    else:
        print("boxplot for " + label)
        plt.boxplot(iterable, labels=[label])  # labels must be a sequence, one entry per box
        plt.show()

def display_distributions(iterables, labels=None, n_unique=30):
    # a collection of 1-D iterables (e.g. a list of Series): print the value counts
    # of the low-cardinality ones and draw a single boxplot for the others
    if all(np.ndim(i) > 0 for i in iterables):
        if labels is None:
            labels = ['c' + str(k) for k in range(len(iterables))]
        num_iter = []
        num_l = []
        for i, l in zip(iterables, labels):
            if len(np.unique(i)) <= n_unique:
                print("values for " + l)
                print()
                print(i.value_counts().sort_index())
            else:
                num_iter.append(i)
                num_l.append(l)

        if num_iter:
            plt.boxplot(num_iter, labels=num_l)
            plt.show()
    else:
        # a single iterable was passed: fall back to the one-column helper
        label = labels if isinstance(labels, str) else "c0"
        display_one_distribution(iterables, label, n_unique)
    

In [None]:
bSF = "bSF"
print(pd.pivot_table(df, index=bqua, values=bSF, aggfunc=agg_funcs))
# let's investigate the relation between the total area, finished and unfinished area
# using the base dataframe as the area of the 2nd basement has been deemed irrelevant
bSF1 = "bSF1"
df_base['b_sum_SF'] = df_base[bSF1] + df_base["b_unfSF"] + df_base["bSF2"]
total_no_sum = df_base[df[bSF] != df_base['b_sum_SF']].copy() 
print(total_no_sum.empty)
# so, as indicated by the names of the columns, the 4 surface features are related: the total is the sum of the other three
# let's consider 2 main new features: the finished surface and the ratio of finished surface

df['bfSF'] = df_base['bSF1'] + df_base['bSF2'] 
df_test['bfSF'] = df_test_base['bSF1'] + df_test_base['bSF2']


def bf_ratio(row):
    row['bf_ratio'] = (row['bfSF'] / row[bSF]) if row[bSF] != 0 else 0
    return row

apply_functions(bf_ratio)

bs_ratio = 'bf_ratio'  # name of the column created by bf_ratio() above
bfSF = 'bfSF'
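
The finished surface and its ratio can also be computed on whole columns. The sketch below uses a made-up three-row frame with the same column names as above; the zero total in the second row stands for a house without a basement.

```python
import numpy as np
import pandas as pd

# hypothetical toy data with the notebook's basement column names
toy = pd.DataFrame({"bSF1": [500, 0, 300],
                    "bSF2": [100, 0, 0],
                    "b_unfSF": [400, 0, 300]})
toy["bSF"] = toy["bSF1"] + toy["bSF2"] + toy["b_unfSF"]

# finished surface = both finished parts
toy["bfSF"] = toy["bSF1"] + toy["bSF2"]

# turn zero totals into NaN before dividing, then set the undefined ratios to 0
toy["bf_ratio"] = (toy["bfSF"] / toy["bSF"].replace(0, np.nan)).fillna(0.0)
print(toy)
```

The `replace(0, np.nan)` step plays the role of the `if row[bSF] != 0 else 0` guard in `bf_ratio`.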


In [None]:
print(pd.pivot_table(df, index=bcond, values=bqua, aggfunc=agg_funcs))
# let's consider the qualities for condition 2
print(df[df[bcond] == 2][bqua].value_counts()) # the majority has quality 3
print(df[df[bcond] == 3][bqua].value_counts())
print(df[bqua].value_counts())

# so the point here is clear: bcondition is informative in a very specific way:
# condition = 0 or condition = 5 then quality = condition
# otherwise, no clear relation is visible for the moment

# let's consider the categorical feature b_is_ex
df['b_is_ex'] = (df[bcond] == 5).astype(int)
df_test['b_is_ex'] = (df_test[bcond] == 5).astype(int)

print(pd.pivot_table(df, index=bexp, values=bqua, aggfunc=agg_funcs)) # the feature does not seem to influence
# the quality of the basement.

print(pd.pivot_table(df, index=bf1, values=bqua, aggfunc=agg_funcs)) 

In [None]:
# to finalize the state of the basement
# I will introduce the bstate feature
# as the product (quality + 0.5 * b_is_ex) * floor(log(bexp + e))
# the same formula (including the 0.5 weight) has to be used for the train and test sets
df['bstate'] = (df['bqua'] + 0.5 * df['b_is_ex']) * np.floor(np.log(df['bexp'] + np.exp(1)))
df_test['bstate'] = (df_test['bqua'] + 0.5 * df_test['b_is_ex']) * np.floor(np.log(df_test['bexp'] + np.exp(1)))
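
It helps to look at how the exposure factor `floor(log(bexp + e))` behaves on its own. The sketch below assumes that `bexp` holds small ordinal codes; the actual encoding is defined outside this section, so the codes 1 to 5 are only an illustration.

```python
import numpy as np

# hypothetical ordinal exposure codes (the real encoding of bexp is not shown here)
exposure_codes = np.arange(1, 6)

# the factor that multiplies the quality term in bstate
multiplier = np.floor(np.log(exposure_codes + np.exp(1)))
for code, m in zip(exposure_codes, multiplier):
    print(code, m)

# the factor reaches 2 only once bexp + e >= e**2
threshold = np.exp(2) - np.exp(1)
print(threshold)
```

Because of the floor, the factor is a step function: it only changes once `bexp` passes `e**2 - e` (between 4.6 and 4.7). A code of 0 gives `log(e)`, which sits right on the boundary of the floor, so if the codes stay below 5 it may be worth checking whether the factor changes `bstate` at all.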


In [None]:
X1 = df.drop(BAS_PCA, axis=1)
y1 = X1.pop(Y)

print(model_performance(X1, y1))

X2 = df.drop(['bstate', 'bfSF', 'bf_ratio'], axis=1)
y2 = X2.pop(Y)

print(model_performance(X2, y2))
# it is settled, I will keep my features
df.drop(BAS_PCA, axis=1, inplace=True)
bas_cols = ['bqua', 'bcond','bexp', 'bf1', 'bSF1', 'bf2', 'b_unfSF', 'bSF', 'bSF2', 'b_is_ex', "BsmtFullBath"]
drop_cols(bas_cols)
classify_cols()

### Area features 
Several features describe areas of the house. I will investigate the possible relations between those areas and how to make the most of them.

In [None]:
area_features = [col for col in df.columns.values if 'area' in col.strip().lower() or 'sf' in col.strip().lower()]
print(area_features)

In [None]:
new_col_names({"1stFlrSF": "f1SF","2ndFlrSF": "f2SF", "LotArea": "area", "GrLivArea": "liv_area"})
df['floor_SF'] = df['f1SF'] + df['f2SF']

print(len(df[df['floor_SF'] != df['liv_area']])) # one valid assumption for this difference is that some houses have a

# df['liv_ratio'] = df['liv_area'] /  df['area']

new_col_names({"TotRmsAbvGrd": "n_rooms"})

df['area_by_room'] = df['liv_area'] / df['n_rooms']## unused

df['liv_ratio'] = df['liv_area'] / df['area'] ## unused

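X_area = df[['f1SF', 'f2SF', 'area', 'liv_area', 'area_by_room', 'liv_ratio', Y]].copy()
# pop the target first: otherwise the copy passed as features would still contain the target column
y_area = X_area.pop(Y)

print(make_mi_scores(X_area, y_area))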

In [None]:
df.drop("area_by_room", axis=1, inplace=True)
df.drop("liv_ratio", axis=1, inplace=True)
df.drop("floor_SF", axis=1, inplace=True)

cv = df['n_rooms'].value_counts()
x = cv.index.values
y = cv.values
plt.bar(x, y)
plt.show()

In [None]:
# the main idea here is to create a feature that increases with the area (the code uses the lot area, 'area'), but penalizes having too many or too few rooms compared to 7

df['area+room'] = 4 * np.log(df['area']) / (np.abs(df['n_rooms'] - 7) + 1) 
df_test['area+room'] = 4 * np.log(df_test['area']) / (np.abs(df_test['n_rooms'] - 7) + 1) 

df.plot(kind='scatter', x='area+room', y=Y)

X = df.copy()
y = X.pop(Y)

print(make_mi_scores(X, y))
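
To make the shape of the `area+room` feature concrete, here is a tiny sketch with a fixed, made-up area of 9000 and a varying number of rooms. Only the formula comes from the cell above; the numbers are illustrative.

```python
import numpy as np
import pandas as pd

# hypothetical houses: same area, 3 to 11 rooms
toy = pd.DataFrame({"area": [9000] * 9, "n_rooms": np.arange(3, 12)})

# same formula as above: log of the area, divided by the distance to 7 rooms (+1)
toy["area+room"] = 4 * np.log(toy["area"]) / (np.abs(toy["n_rooms"] - 7) + 1)

# number of rooms for which the feature is largest
best_rooms = toy.loc[toy["area+room"].idxmax(), "n_rooms"]
print(toy)
print(best_rooms)
```

For a fixed area the feature peaks at 7 rooms and falls off symmetrically on both sides, so 6 and 8 rooms get the same value.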

In [None]:
# the first-floor and second-floor areas are dropped now
drop_cols(['f1SF', 'f2SF'])


In [None]:
print(df.columns)
print(df_test.columns)


In [None]:
# # we will consider the first 5 besides the MasVrnArea
# new_col_names({"1stFlrSF": "f1SF","2ndFlrSF": "f2SF", "LotArea": "area", "GrLivArea": "liv_area"})

# df['liv_ratio'] = df['liv_area'] /  df['area']
# df.plot(kind='scatter', x='liv_ratio', y=Y)
# # df.plot(kind='scatter', x='area', y=Y)
# # the lot area's plot it remarkably skewed. Let's do two things:
# ## consider the values with extremely large lot areas (possible outliers)
# ## consider applying a mathematical transformation

# # df[df['area'] <= 5 * 10 ** 4].plot(kind='scatter', x='area', y=Y)
# print(df['area'].describe())
# # the fourth quartile starts from 11600
# fth_q_lot_area = 11600

# sus_area = df[df['area'] > fth_q_lot_area]
# sus_area.plot(kind='scatter', x='area', y=Y)
# # apparently outliers start from 30k 

# sus_area = df[df['area'] > 3 * 10 ** 4]
# norm_area = df[df['area'] <= 3 * 10 ** 4]
# # norm_area.plot(kind='scatter', x='area', y=Y)
# norm_area['area'] = np.log(norm_area['area'])

# plt.bar(norm_area['area'], norm_area[Y])
# plt.show()

# # norm_area.plot(kind='scatter', x='area', y=Y) # close to normal distribution with a slightly high mean

#### Porches: outside areas
We have 4 features that represent outside areas that might add value to the house.

In [None]:
wd = "WoodDeckSF"
p = 'porch'
p_list = [col for col in df.columns if "porch" in col.lower()]
p_list.append(wd)

df[p] = 0
for pr in p_list:
    df[p] += (df[pr] != 0)
print(df[p].value_counts())

# a house in the training data has at most one porch
# the final porch feature is the sum of all porch areas plus the wood deck area

def area_out(row):
    row[p] = 0
    for por in p_list:
        row[p] += row[por]
    return row

apply_functions(area_out)

drop_cols(p_list)
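
Both the count of non-empty outside areas and their total can be obtained with column-wise operations. This is a sketch on a made-up frame: the three column names are used only as examples of what `p_list` may contain.

```python
import pandas as pd

# hypothetical toy data; the columns stand in for the entries of p_list
toy = pd.DataFrame({
    "OpenPorchSF": [40, 0, 0],
    "EnclosedPorch": [0, 0, 120],
    "WoodDeckSF": [100, 0, 0],
})
outside_cols = list(toy.columns)

# how many outside areas each house has (non-zero surface)
toy["n_outside"] = (toy[outside_cols] != 0).sum(axis=1)

# total outside surface, the equivalent of area_out
toy["porch"] = toy[outside_cols].sum(axis=1)
print(toy)
```

Listing the source columns before creating `porch` matters, otherwise the new column would be summed into itself.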

#### Heat and Electricity
Several features related to heating and electricity can be combined to form a more informative feature.

In [None]:
hq = "HeatingQC"
ca = "CentralAir"
es = "Electrical"
hf = [hq, ca, es]
for v in hf:
    print(df[v].value_counts())

# the distribution of such values is significantly unbalanced
# let's verify whether the difference is impactful or not

def set_heating(row):
    # int(...) works whether the comparisons return Python or NumPy booleans
    row['heat'] = int((row[hq] >= 3) and (row[ca] == 1) and (row[es] == 'SBrkr'))
    return row

apply_functions(set_heating)
drop_cols(hf)
classify_cols() # making sure all changes are reflected on both datasets
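
The combined flag can also be built directly from the three columns. The sketch assumes, as the comparisons in `set_heating` imply, that `HeatingQC` and `CentralAir` have already been encoded as numbers; the rows are made up.

```python
import pandas as pd

# hypothetical toy data with numerically encoded HeatingQC and CentralAir
toy = pd.DataFrame({
    "HeatingQC": [4, 2, 3],
    "CentralAir": [1, 1, 0],
    "Electrical": ["SBrkr", "SBrkr", "FuseA"],
})

# element-wise AND of the three conditions, stored as 0/1
toy["heat"] = (
    (toy["HeatingQC"] >= 3)
    & (toy["CentralAir"] == 1)
    & (toy["Electrical"] == "SBrkr")
).astype(int)
print(toy)
```

On columns the element-wise `&` replaces the `and` used inside the row function.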

### Years features
It might be more meaningful to consider the age of the house than the year it was built.

In [None]:
# let's first start with converting the years to ages as they are more intuitive
age = 'age'
n_age = "new_age"

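# use one reference year (the latest construction year in the training set) for both datasets,
# so that the same year maps to the same age in train and test
ref_year = df[yb].max()

df[age] = ref_year - df[yb] + 1
df[n_age] = ref_year - df[yr] + 1

df_test[age] = ref_year - df_test[yb] + 1
df_test[n_age] = ref_year - df_test[yr] + 1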

drop_cols([yb, yr])


In [None]:
# # let's try clustering the age

# df['age_group'] = pd.qcut(df[n_age], 8)
# df_test['age_group'] = pd.qcut(df[n_age], 8)

# age_groups = df['age_group'].cat.categories.tolist()
# age_groups.sort()

# def set_age_group(row):
#     row['age_group'] = age_groups.index(row['age_group']) + 1
#     return row

# apply_functions(set_age_group)
# X = df.copy()
# y = X.pop(Y)
# print(make_mi_scores(X, y)[[age, n_age, 'age_group']])
    

In [None]:
# df.drop(age_eval, axis=1)
# drop_cols(n_age)
# X = df.copy()
# y = X.pop(Y)
# print(model_performance(X, y))

# X1 = df.drop(age_eval, axis=1).copy()
# y1 = X1.pop(Y)
# print(model_performance(X1, y1))



### Hyperparameter tuning
I am using the XGBoost regressor, which is quite a powerful model, but its hyperparameters still have to be chosen with care.

In [None]:
X_train = df.copy()
y_train = X_train.pop(Y)

xgb_params = dict(
    max_depth=6,           # maximum depth of each tree - try 2 to 10
    learning_rate=0.01,    # effect of each tree - try 0.0001 to 0.1
    n_estimators=1000,     # number of trees (that is, boosting rounds) - try 1000 to 8000
    min_child_weight=1,    # minimum number of houses in a leaf - try 1 to 10
    colsample_bytree=0.7,  # fraction of features (columns) per tree - try 0.2 to 1.0
    subsample=0.7,         # fraction of instances (rows) per tree - try 0.2 to 1.0
    reg_alpha=0.5,         # L1 regularization (like LASSO) - try 0.0 to 10.0
    reg_lambda=1.0,        # L2 regularization (like Ridge) - try 0.0 to 10.0
    num_parallel_tree=1,   # set > 1 for boosted random forests
)

xgb = xgr(**xgb_params)
model_performance(X_train, y_train, xgb)
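
One simple way to explore the ranges mentioned in the comments of `xgb_params` is a random search. The sketch below only generates candidate parameter sets with the standard library; `search_space` and `sample_params` are hypothetical helpers written for this example, not part of XGBoost.

```python
import math
import random

# search ranges copied from the comments next to xgb_params
search_space = {
    "max_depth": ("int", 2, 10),
    "learning_rate": ("log", 1e-4, 1e-1),
    "n_estimators": ("int", 1000, 8000),
    "min_child_weight": ("int", 1, 10),
    "colsample_bytree": ("float", 0.2, 1.0),
    "subsample": ("float", 0.2, 1.0),
    "reg_alpha": ("float", 0.0, 10.0),
    "reg_lambda": ("float", 0.0, 10.0),
}

def sample_params(rng):
    """Draw one random parameter set from search_space."""
    params = {}
    for name, (kind, low, high) in search_space.items():
        if kind == "int":
            params[name] = rng.randint(low, high)
        elif kind == "log":
            # sample uniformly on a log scale so small learning rates are covered too
            params[name] = 10 ** rng.uniform(math.log10(low), math.log10(high))
        else:
            params[name] = rng.uniform(low, high)
    return params

rng = random.Random(0)
candidates = [sample_params(rng) for _ in range(5)]
for candidate in candidates:
    print(candidate)
```

Each candidate could then be scored in the same way as above, for instance with `model_performance(X_train, y_train, xgr(**candidate))`, keeping the set with the best score.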

In [None]:
# select the test columns by name so that they match the training columns and their order
X_test = df_test[X_train.columns]
# XGB minimizes MSE, but the competition loss is RMSLE,
# so we log-transform y for training and exp-transform the predictions
xgb.fit(X_train, np.log(y_train))
predictions = np.exp(xgb.predict(X_test))

def create_sub(y_pred, sub_name):
    # df_test_org is only read here, so no global declaration is needed
    sub_df = pd.DataFrame({"Id": df_test_org['Id'], "SalePrice": y_pred})
    sub_df.to_csv(sub_name, index=False)

create_sub(predictions, "XGBoostRegressor.csv")
print("Your submission was successfully saved!")
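
The reason for training on `log(y)` is easier to see with numbers. The sketch below defines a small `rmsle` helper (my own name, written for this example) as the root mean squared error between the logarithms of the prices, and evaluates it on made-up values.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared error between the logarithms of the prices."""
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))

# hypothetical prices: a cheap and an expensive house, both overestimated by 10%
cheap_error = rmsle(np.array([100000.0]), np.array([110000.0]))
expensive_error = rmsle(np.array([500000.0]), np.array([550000.0]))
print(cheap_error, expensive_error)

# predictions made in log space are mapped back to prices with exp
log_pred = np.log(np.array([110000.0, 550000.0]))
prices = np.exp(log_pred)
print(prices)
```

Under this metric only the relative error matters, so a 10% miss costs the same on a cheap house as on an expensive one, and a model that minimizes squared error on `log(y)` optimizes exactly this quantity.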