``Differences from previous versions:``
- Using a different approach to encoding and imputing data: missing numerical values and NaN values in the categorical features are more often replaced with zeros. Since I will be using all (or at least most) of the features in the dataset, it could be helpful to have zeros rather than values that are probably misleading. Columns with a low number of missing values will simply be imputed using the KNN algorithm.
- Using regularizers more extensively, as well as controlling layer properties such as weight and bias initializers more closely.
- Wrote some new utility functions that help with the EDA process.

In [1]:
# Imports:
import pandas as pd
import numpy as np
from utils import *
import seaborn as sns

In [2]:
# Retrieve Data
data = retrieve_data()
train = data['train'].copy()
test = data['test'].copy()

# The dependent feature
y_feat = 'SalePrice'

# Preprocessing:
The general strategy is to combine the categorical and numerical features of the training and test sets and process them together. For categorical variables, encoding dictionaries are built from the training data and then applied to the combined dataframe. After the categorical features are converted to numerical ones, some missing values will remain; these are imputed with KNN imputation.

1. Encoding categorical features: The goal is to find a number that represents each unique value in each categorical column, so that every string value can be replaced with a numerical one. The encode_categorical_feature function in ./utils.py takes the dataset and the column to encode. For each unique value it returns the average SalePrice of the rows with that value, divided by the sum of these averages over all unique values in the column. Given that x1, x2, ..., xn are the unique values in column C, let avg1 be the average SalePrice (the dependent column) when C is x1, and likewise avg2, avg3, ..., avgn for the other values. To keep the encodings small, each average is divided by the sum of all averages: Avg1 = avg1 / (avg1 + avg2 + ... + avgn), and so on. The encodings of a column therefore sum to one: Avg1 + Avg2 + ... + Avgn = 1. With this technique, NaN values within a categorical feature can be encoded by the same logic, as if NaN were one more category, so no further imputation is needed. However, if a column has many missing values, NaN should not be encoded this way; those values are imputed after the encoding instead. So, if more than 10% of the data in a column is missing, NaN is not encoded for that column (implemented in get_encoding_dicts).

2. Imputing numerical data: The numerical data will simply be imputed using the KNNImputer class of sklearn.
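The encoding described in item 1 can be sketched as follows. This is a minimal illustration, not the actual encode_categorical_feature from ./utils.py, whose code is not shown here; the function name `mean_share_encoding`, the `"__missing__"` placeholder, and the toy data are assumptions made for the example.

```python
import pandas as pd


def mean_share_encoding(df, column, target, encode_nan=True):
    """Map each unique value of `column` to its mean target divided by the sum of all means."""
    values = df[column]
    if encode_nan:
        # Treat NaN as one more category (hypothetical placeholder label)
        values = values.fillna("__missing__")
    # Average target per unique value; rows whose key is still NaN are dropped
    means = df[target].groupby(values).mean()
    # Divide by the sum of the averages so the encodings add up to one
    return (means / means.sum()).to_dict()


# Toy data, made up for illustration
toy = pd.DataFrame({
    "Alley": ["Grvl", "Pave", "Grvl", None, "Pave"],
    "SalePrice": [100.0, 200.0, 140.0, 90.0, 240.0],
})

encoding = mean_share_encoding(toy, "Alley", "SalePrice")
print(encoding)
```

With `encode_nan=False` the placeholder key is simply absent from the dictionary, which leaves the NaN cells to a later imputation step.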

## In-depth analysis of categorical variables:
1. Compare the number of NaNs across features, both for features in the same category and for features in different ones.
2. If at least 90% of the data is present for a given feature (column), map the encoded numerical values, including the one for NaN, into the dataframe. Otherwise, encode only the non-NaN values of the feature and impute the remaining missing values using another technique. Dropping columns with too many missing values is a common option, but in order to use the data for neural networks it makes sense to just impute the missing values with zeros.
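The 10% rule in item 2 could be expressed roughly like this. The helper name `split_by_missing_share` and the toy frame are hypothetical; get_encoding_dicts in ./utils.py may implement the check differently.

```python
import pandas as pd


def split_by_missing_share(df, columns, max_missing=0.10):
    """Split columns into those whose NaN gets encoded and those where it is skipped."""
    # Fraction of missing rows per column
    missing_share = df[columns].isna().mean()
    encode_nan = missing_share[missing_share <= max_missing].index.tolist()
    skip_nan = missing_share[missing_share > max_missing].index.tolist()
    return encode_nan, skip_nan


# Toy data, made up for illustration
toy = pd.DataFrame({
    "Electrical": ["SBrkr"] * 19 + [None],    # 5% missing
    "Alley": ["Grvl", "Pave"] + [None] * 18,  # 90% missing
})

encode_nan_cols, skip_nan_cols = split_by_missing_share(toy, ["Electrical", "Alley"])
print("Encode NaN as a category:", encode_nan_cols)
print("Ignoring:", skip_nan_cols)
```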

In [3]:
# Get the DataFrames
cat_info, num_info = missing_info(data)

In [4]:
cat_info

Unnamed: 0,Test,Train
Alley,1352,1369
MasVnrType,16,8
BsmtQual,44,37
BsmtCond,45,37
BsmtExposure,44,38
BsmtFinType1,42,37
BsmtFinType2,42,38
Electrical,0,1
FireplaceQu,730,690
GarageType,76,81


### Important to note for categorical features:
1. Alley, PoolQC, Fence, and MiscFeature are the features with an excessive number of missing values in both training and testing.
2. FireplaceQu does not have as many missing values as the features above, but it is going to be treated the same way.
3. Although for some of these features (Alley, MiscFeature, PoolQC) NA just means that the house does not have that feature, we will still impute them as numerical variables instead of imputing with zeros.

#### Note: To conclude, there are 5 features whose NaN values will not be encoded as a category of their column.


In [34]:
num_info

Unnamed: 0,Test,Train
LotFrontage,259.0,227
MasVnrArea,8.0,15
GarageYrBlt,81.0,78
BsmtFinSF1,0.0,1
BsmtFinSF2,0.0,1
BsmtUnfSF,0.0,1
TotalBsmtSF,0.0,1
BsmtFullBath,0.0,2
BsmtHalfBath,0.0,2
GarageCars,0.0,1


#### The numbers of missing values in the test and train datasets for numerical variables are close to each other; there is no huge difference.

## Numerical data:
Based on this dataframe, there are some features with missing values in the training data that have none in the test data. There is no need to manually impute anything in the case of numerical values; I am just going to let KNN handle it.

In [6]:
# Get the lengths of the training and test data to split the combined data again later on
train_len = train.shape[0]
test_len = test.shape[0]

# Combine the train and test:
# Note: Pass the copies so the actual dataframes won't change and we can still use them
feat_cols = combine_train_test(train.copy(), test.copy())

In [7]:
feat_cols

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,2,2008,WD,Normal
1,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,0,,,,0,5,2007,WD,Normal
2,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,,,0,9,2008,WD,Normal
3,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,0,,,,0,2,2006,WD,Abnorml
4,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,0,,,,0,12,2008,WD,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2914,160,RM,21.0,1936,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,6,2006,WD,Normal
2915,160,RM,21.0,1894,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,4,2006,WD,Abnorml
2916,20,RL,160.0,20000,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,9,2006,WD,Abnorml
2917,85,RL,62.0,10441,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,MnPrv,Shed,700,7,2006,WD,Normal


In [8]:
# Get the needed dictionaries to be used for encoding categorical features
cat_dicts = get_encoding_dicts(train, data['train_cat_list'])  

Ignoring: Alley
Ignoring: FireplaceQu
Ignoring: PoolQC
Ignoring: Fence
Ignoring: MiscFeature


In [9]:
# Checking one of the values.
cat_dicts['Alley']

{nan: 0, 'Grvl': 0.4211, 'Pave': 0.5789}

In [10]:
# Do the encoding
encoded_feat_cols = encode_categorical(feat_cols.copy(), cat_dicts)
encoded_feat_cols[data['train_cat_list']]

Unnamed: 0,MSZoning,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,...,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,PoolQC,Fence,MiscFeature,SaleType,SaleCondition
0,0.2590,0.5818,0.0,0.1993,0.2376,0.5682,0.1826,0.3097,0.0430,0.1133,...,0.1817,0.2939,0.1930,0.2296,0.4298,0.0,0.000,0.000,0.1035,0.1726
1,0.2590,0.5818,0.0,0.1993,0.2376,0.5682,0.1837,0.3097,0.0519,0.0875,...,0.1817,0.2939,0.1930,0.2296,0.4298,0.0,0.000,0.000,0.1035,0.1726
2,0.2590,0.5818,0.0,0.2493,0.2376,0.5682,0.1826,0.3097,0.0430,0.1133,...,0.1817,0.2939,0.1930,0.2296,0.4298,0.0,0.000,0.000,0.1035,0.1726
3,0.2590,0.5818,0.0,0.2493,0.2376,0.5682,0.1875,0.3097,0.0458,0.1133,...,0.1201,0.2067,0.1930,0.2296,0.4298,0.0,0.000,0.000,0.1035,0.1443
4,0.2590,0.5818,0.0,0.2493,0.2376,0.5682,0.1837,0.3097,0.0729,0.1133,...,0.1817,0.2939,0.1930,0.2296,0.4298,0.0,0.000,0.000,0.1035,0.1726
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2914,0.1713,0.5818,0.0,0.1993,0.2376,0.5682,0.1826,0.3097,0.0214,0.1133,...,0.0925,0.1503,0.1064,0.1263,0.4298,0.0,0.000,0.000,0.1035,0.1726
2915,0.1713,0.5818,0.0,0.1993,0.2376,0.5682,0.1826,0.3097,0.0214,0.1133,...,0.0985,0.2067,0.1930,0.2296,0.4298,0.0,0.000,0.000,0.1035,0.1443
2916,0.2590,0.5818,0.0,0.1993,0.2376,0.5682,0.1826,0.3097,0.0340,0.1133,...,0.1201,0.2067,0.1930,0.2296,0.4298,0.0,0.000,0.000,0.1035,0.1443
2917,0.2590,0.5818,0.0,0.1993,0.2376,0.5682,0.1826,0.3097,0.0340,0.1133,...,0.0925,0.1503,0.1064,0.1263,0.4298,0.0,0.247,0.227,0.1035,0.1726


#### Note: There were some features which did not have any missing values in the training dataset but did in the test set. Hence there are going to be some missing values in the previously categorical features; from now on they are going to be imputed the same way as the numerical features. In other words, they will be treated as numerical.

## Imputing Data with KNN:
- The features of both train and test are going to be imputed together, at the same time, using the KNN algorithm.
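As a small illustration of what the imputer does, here is a toy run of sklearn's KNNImputer with `weights="distance"`. The array is made up; the fill value is a distance-weighted average of the missing column over the nearest rows, where distance is measured on the columns that are present.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data, made up for illustration
toy = np.array([
    [1.0, 2.0],
    [2.0, np.nan],  # the value to fill in
    [3.0, 6.0],
    [8.0, 16.0],
])

# Two neighbors, with closer rows weighted more heavily
toy_imputer = KNNImputer(n_neighbors=2, weights="distance")
toy_imputed = toy_imputer.fit_transform(toy)

print(toy_imputed[1, 1])
```

The larger `n_neighbors` is, the closer the fill value moves toward a weighted column average, which is worth keeping in mind when a third of the rows is used as the neighborhood.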

In [12]:
# Impute the missing values with KNNImputer
from sklearn.impute import KNNImputer

# Get the list of columns with missing values
missing_features = encoded_feat_cols.columns[encoded_feat_cols.isna().any()].tolist()

# The number of neighbors the imputer looks for is 1/3 of the rows of the whole dataframe
num = (train_len + test_len) // 3

# Instantiate the Imputer object
imputer = KNNImputer(n_neighbors=num, weights="distance")
# Fit and transform using the imputer on the missing data and get the imputed combined data
imputed_combined = pd.DataFrame(
    imputer.fit_transform(encoded_feat_cols),
    columns=encoded_feat_cols.columns,
)

# Check the imputation:
True in imputed_combined.isna().any().values

False

### Feature Selection: all the features will be used for now

In [36]:
# Get the fitting and predicting datasets:

# features:
features = imputed_combined.columns.to_list()

# Copy the slice so that adding a column does not trigger a SettingWithCopyWarning
train_part = imputed_combined.iloc[:train_len].copy()
# Add the y_feat column (for use further along the way)
train_part[y_feat] = train[y_feat]

# X holds the features used for training
X = train_part[features]
y = train[y_feat] # y, the dependent column of the dataset

# The dataset used for prediction
X_test = imputed_combined.iloc[train_len:].reset_index(drop=True)

# Normalized version of datasets
norm_X = normalize(X.copy())
norm_y = normalize(y.copy())
norm_X_test = normalize(X_test.copy())
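
The `normalize` helper comes from utils and its code is not shown here. If it computes its statistics from whatever frame it receives (an assumption), then X and X_test are scaled with different statistics. A common alternative is to fit the statistics on the training features only and reuse them for the test features. The sketch below uses z-score scaling; the helper names `fit_standardizer` and `apply_standardizer` and the toy frames are hypothetical.

```python
import pandas as pd


def fit_standardizer(train_df):
    """Return the per-column mean and standard deviation of the training features."""
    mean = train_df.mean()
    # Constant columns have zero spread; use 1.0 to avoid dividing by zero
    std = train_df.std(ddof=0).replace(0, 1.0)
    return mean, std


def apply_standardizer(df, mean, std):
    """Scale any frame with statistics that were fitted elsewhere."""
    return (df - mean) / std


# Toy data, made up for illustration
toy_train = pd.DataFrame({"GrLivArea": [1000.0, 1500.0, 2000.0], "Street": [1.0, 1.0, 1.0]})
toy_test = pd.DataFrame({"GrLivArea": [1500.0, 2500.0], "Street": [1.0, 1.0]})

mean, std = fit_standardizer(toy_train)
scaled_train = apply_standardizer(toy_train, mean, std)
scaled_test = apply_standardizer(toy_test, mean, std)
print(scaled_test)
```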

In [14]:
train_part.corr()[y_feat].nlargest(13)[1:]

OverallQual     0.790982
Neighborhood    0.738629
GrLivArea       0.708624
ExterQual       0.690933
BsmtQual        0.681904
KitchenQual     0.675721
GarageCars      0.640409
GarageArea      0.623431
TotalBsmtSF     0.613581
1stFlrSF        0.605852
FullBath        0.560664
GarageFinish    0.553058
Name: SalePrice, dtype: float64

## Fitting parts
The general idea is to come up with a number of neural network architectures, then fine-tune their hyperparameters and put them into the models.py file. Note that if a model has an error of less than 0.14, it should be saved in the models folder under the name NN-[error rate].
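The saving rule could be wrapped in a small helper like the one below. This is only a sketch: `save_if_good`, the four-decimal name format, and the `_FakeModel` stand-in (used so the example runs without TensorFlow) are assumptions. With a real Keras model, `model.save` would be called the same way, though newer Keras versions may require a file extension such as `.keras`.

```python
import tempfile
from pathlib import Path


def save_if_good(model, error, threshold=0.14, models_dir="models"):
    """Save the model as NN-[error rate] when its error is below the threshold."""
    if error >= threshold:
        return None
    target = Path(models_dir) / f"NN-{error:.4f}"
    target.parent.mkdir(parents=True, exist_ok=True)
    model.save(str(target))
    return target


class _FakeModel:
    """Stand-in with a Keras-like save() method, for illustration only."""

    def __init__(self):
        self.saved_to = None

    def save(self, path):
        self.saved_to = path


with tempfile.TemporaryDirectory() as tmp:
    good, bad = _FakeModel(), _FakeModel()
    good_path = save_if_good(good, 0.1312, models_dir=tmp)
    bad_path = save_if_good(bad, 0.2050, models_dir=tmp)

print(good_path, bad_path)
```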

In [15]:
# General imports
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers, regularizers, losses, metrics
from tensorflow.keras.regularizers import l1, l2, l1_l2, L1L2
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization, InputLayer,LeakyReLU
from tensorflow.keras.optimizers.schedules import ExponentialDecay, InverseTimeDecay
from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.losses import MeanSquaredLogarithmicError
from tensorflow.keras.initializers import TruncatedNormal

import tensorflow_docs as tfdocs
import tensorflow_docs.plots
import tensorflow_docs.modeling

# Set the default float type so that all values inside each layer are computed as float64
tf.keras.backend.set_floatx('float64')

In [19]:
# Scheduler objects to control the optimizer learning rate:
# Note: this part can be skipped since it is strictly written for the models.py file
from tensorflow.keras.optimizers.schedules import InverseTimeDecay, ExponentialDecay

def TimeDecayScheduler(learning_rate=0.001, decay_steps=200, decay_rate=1.2, name=""):
    """ Returns an InverseTimeDecay object with the given properties to be used in the optimizer. """
    return InverseTimeDecay(
        initial_learning_rate=learning_rate, 
        decay_steps=decay_steps,
        decay_rate=decay_rate,
        name=name
    )


def ExponentialScheduler(initial_learning_rate, decay_steps, decay_rate, name=""):
    """ Returns an ExponentialDecay object with the given properties to be used in the optimizer. """
    return ExponentialDecay(
        initial_learning_rate=initial_learning_rate,
        decay_steps=decay_steps,
        decay_rate=decay_rate,
        name=name
    )


# Actual Optimizers: Adam and RMSprop are the main two optimizers that are going to be used for this project since they accept schedulers and happen to be effective.

from tensorflow.keras.optimizers import Adam, RMSprop

def AdamOptimizer(learning_rate=0.001, scheduler=None):
    """
        # params:
        learning_rate: the initial learning rate to be used
        scheduler: If this is passed by the user then use it in the optimizer instead of the learning rate

        # returns: an Adam optimizer
    """
    if scheduler is None:
        return Adam(learning_rate)
    else:
        return Adam(scheduler)
    

def RMSpropOptimizer(learning_rate=0.001, scheduler=None):
    """
        # params:
            learning_rate: the initial learning rate to be used
            scheduler: If this is passed by the user then use it in the 
            optimizer instead of the learning rate
        
        # returns: an RMSprop optimizer
    """
    if scheduler is None:
        return RMSprop(learning_rate)
    else:
        return RMSprop(scheduler)

# CallBacks:
from tensorflow.keras.callbacks import EarlyStopping

def EarlyStopCallBack(patience=100):
    """
        # params: patience, the number of epochs with no improvement after which training stops
        # returns: an EarlyStopping callback object
    """
    return EarlyStopping(monitor='val_loss', patience=patience)


# Models: 
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization  # Layers 
from tensorflow.keras.regularizers import l2, l1, l1_l2, L1L2  # Regularizer
from tensorflow.keras.losses import MeanSquaredLogarithmicError # Error-metric
import tensorflow_docs as tfdocs # For logging purposes

## Ideas to try out and improve the model
1. Weight-Initializers:
    - Use tf.keras.initializers.RandomNormal and tf.keras.initializers.RandomUniform
    - Tweak their properties and see how they would work.
2. Bias in Dense layers:
    - Set up an initializer and a regularizer for the bias of the layer
    - Also use those for the weights
    - Tweak the arguments of the early-stop callback
3. Layers:
    - Use LeakyReLU/ThresholdedReLU/PReLU as a layer
    - Maybe try out tf.keras.layers.experimental.preprocessing.Normalization*
    - Tweak BatchNormalization layer arguments
4. Overfitting:
    - Use tf.keras.layers.GaussianDropout and tf.keras.layers.GaussianNoise (which could be viewed as a data augmentation method).

In [32]:
# A small neural net to test how keras-tuner works
from kerastuner.tuners import RandomSearch

# Set the default float type so that all values inside each layer are computed as float64
tf.keras.backend.set_floatx('float64')

def test_model():
    
    model = keras.Sequential([
         Dense(64, activation='elu', 
              kernel_regularizer=l1(0.001),
              bias_regularizer=l2(0.001),
              bias_initializer=TruncatedNormal(mean=0, stddev=0.005),
              kernel_initializer=TruncatedNormal(mean=0, stddev=25)
        ),
        Dense(1)
    ])
    
    # Try to implement the hp in here
    time_lr = InverseTimeDecay(
      initial_learning_rate=0.0015,
      decay_steps=5000,
      decay_rate=0.0009
    )
    
    optimizer = Adam(time_lr)
        
    model.compile(
        loss=MeanSquaredLogarithmicError(name='MSLE'), 
        optimizer=optimizer,
    )
  
    return model

EPOCHS = 2000

# The patience parameter is the number of epochs to wait for an improvement before stopping
early_stop = EarlyStopping(monitor='val_loss', patience=135, mode='min', restore_best_weights=True)

m = test_model()

In [33]:
history = m.fit(X, y, epochs=3000,
          verbose=0, validation_split=0.33,
          callbacks=[early_stop, tfdocs.modeling.EpochDots()])
print()
validate(quantize(pd.DataFrame(m.predict(X_test, verbose=0))[0]))


Epoch: 0, loss:233.5263,  val_loss:228.9488,  
....................................................................................................
Epoch: 100, loss:71.1638,  val_loss:71.0619,  
....................................................................................................
Epoch: 200, loss:53.3159,  val_loss:53.2319,  
....................................................................................................
Epoch: 300, loss:38.2720,  val_loss:38.2062,  
....................................................................................................
Epoch: 400, loss:26.4442,  val_loss:26.3950,  
....................................................................................................
Epoch: 500, loss:17.5191,  val_loss:17.4849,  
....................................................................................................
Epoch: 600, loss:10.9667,  val_loss:10.9437,  
................................................................

In [68]:
def NN02():
    """
        With the lowest number of layers, generate the highest amount of bias and heavily
        regularize it.
    """
    model = keras.Sequential([
        InputLayer(input_shape=[len(X.keys())]),
        
        Dense(64, activation='elu', 
              kernel_regularizer=l2(0.001),
              bias_regularizer=l2(0.001),
              bias_initializer=TruncatedNormal(mean=0, stddev=0.005),
              kernel_initializer=TruncatedNormal(mean=0, stddev=0.25)
        ), BatchNormalization(),
        Dense(256, activation='elu', 
              kernel_regularizer=l2(0.0001),
              bias_regularizer=l2(0.001),
              bias_initializer=TruncatedNormal(mean=0, stddev=0.005),
              kernel_initializer=TruncatedNormal(mean=0, stddev=0.25)
        ), BatchNormalization(),
        Dense(1028, activation='elu', 
              kernel_regularizer=l2(0.009),
              bias_regularizer=l2(0.009),
              bias_initializer=TruncatedNormal(mean=0, stddev=0.005),
              kernel_initializer=TruncatedNormal(mean=0, stddev=0.25)
        ),
        
        BatchNormalization(),
        
        Dense(4, 
              kernel_regularizer=L1L2(0.04, 0.004), 
              bias_regularizer=l2(0.1),
              bias_initializer=TruncatedNormal(mean=0, stddev=2), 
              kernel_initializer=TruncatedNormal(mean=0, stddev=5)
        ),
        Dense(4, 
              kernel_regularizer=L1L2(0.05, 0.005), 
              bias_regularizer=l2(0.001), 
              bias_initializer=TruncatedNormal(mean=0, stddev=2), 
              kernel_initializer=TruncatedNormal(mean=0, stddev=5)
        ),
        Dense(4, 
              kernel_regularizer=L1L2(0.6, 0.06), 
              bias_regularizer=l2(0.2), 
              bias_initializer=TruncatedNormal(mean=0, stddev=2), 
              kernel_initializer=TruncatedNormal(mean=0, stddev=5)
        ),
        
        Dense(1)
      ])
    
    time_lr = TimeDecayScheduler(learning_rate=0.018, decay_steps=20000, decay_rate=0.059, name="")
    
    optimizer = Adam(time_lr)
        
    model.compile(
        loss=MeanSquaredLogarithmicError(name='MSLE'), 
        optimizer=optimizer, 
    )
  
    return model

model = NN02()

In [23]:
lr = 0.0015

for i in range(2000):
    if i < 50 or i % 100 == 0:
        print(i, lr)
    lr = lr / (1 + 0.0005 * i / 1000)

0 0.0015
1 0.0015
2 0.001499999250000375
3 0.001499997750002625
4 0.001499995500009375
5 0.0014999925000243747
6 0.0014999887500524998
7 0.0014999842500997495
8 0.0014999790001732488
9 0.0014999730002812478
10 0.001499966250433121
11 0.0014999587506393677
12 0.0014999505009116126
13 0.001499941501262605
14 0.001499931751706219
15 0.001499921252257453
16 0.0014999100029324311
17 0.001499898003748401
18 0.0014998852547237359
19 0.001499871755877933
20 0.0014998575072316144
21 0.0014998425088065261
22 0.0014998267606255397
23 0.0014998102627126499
24 0.0014997930150929763
25 0.001499775017792763
26 0.0014997562708393775
27 0.001499736774261312
28 0.0014997165280881827
29 0.0014996955323507298
30 0.0014996737870808172
31 0.0014996512923114323
32 0.0014996280480766872
33 0.0014996040544118165
34 0.0014995793113531791
35 0.0014995538189382573
36 0.0014995275772056562
37 0.0014995005861951046
38 0.0014994728459474546
39 0.001499444356504681
40 0.0014994151179098815
41 0.0014993851302072775
42
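
The loop above divides the running value again at every iteration, so its output compounds. For comparison, the Keras documentation gives the non-staircase InverseTimeDecay schedule as a closed form of the optimizer step: `initial_learning_rate / (1 + decay_rate * step / decay_steps)`. The sketch below evaluates that formula in plain Python with the values used in test_model; the function name `inverse_time_decay` is made up for the example. Note that a step is one optimizer update (one batch), not one epoch.

```python
def inverse_time_decay(step, initial_learning_rate=0.0015, decay_steps=5000, decay_rate=0.0009):
    """Closed-form learning rate of a non-staircase inverse time decay at a given step."""
    return initial_learning_rate / (1 + decay_rate * step / decay_steps)


# The rate depends only on the step count, not on the previous rate
for step in (0, 5000, 50000, 500000):
    print(step, inverse_time_decay(step))
```

With these settings the rate barely moves over thousands of steps, so the schedule behaves almost like a constant learning rate for a run of this length.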

In [61]:
EPOCHS = 2000

# The patience parameter is the number of epochs to wait for an improvement before stopping
early_stop = EarlyStopping(monitor='val_loss', patience=135, mode='min', restore_best_weights=True)

In [72]:
history = model.fit(norm_X, y, epochs=EPOCHS,
          verbose=0, validation_split=0.33,
          callbacks=[early_stop, tfdocs.modeling.EpochDots()])
print()
validate(quantize(pd.DataFrame(model.predict(norm_X_test, verbose=0))[0]))


Epoch: 0, loss:0.6540,  val_loss:0.6901,  
....................................................................................................
Epoch: 100, loss:0.6180,  val_loss:6.0551,  
....................................................................................................
Epoch: 200, loss:0.6059,  val_loss:0.6355,  
....................................................................................................
Epoch: 300, loss:0.6090,  val_loss:0.6387,  
....................................................................................................
Epoch: 400, loss:0.5975,  val_loss:0.6749,  
....................................................................................................
Epoch: 500, loss:0.5861,  val_loss:0.6163,  
....................................................................................................
Epoch: 600, loss:0.5862,  val_loss:0.5794,  
................................................................................

### Testing different hyperparameters
- Run a number of different fits and compare their results.
- Normalized X works way better! Training is much faster, and loss and val_loss decrease close to each other.
- Using BatchNormalization before each layer would be very helpful.

Note: A lower loss value does not necessarily mean that the model's predictions improve.
##### Important Note: when we increase the amount of regularization, the loss naturally increases. Hence, when comparing the epochs of different models, this should be taken into account.
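One way to compare models with different regularization strengths is to compute the plain mean squared logarithmic error on held-out predictions, since the loss Keras reports during fitting also contains the regularization penalty terms. The sketch below is a NumPy version of the MSLE formula with made-up numbers; it leaves out the small-value clipping that the Keras implementation applies.

```python
import numpy as np


def msle(y_true, y_pred):
    """Mean squared logarithmic error without any regularization penalty."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))


# Toy prices, made up for illustration
actual = [208500.0, 181500.0, 223500.0]
predicted = [200000.0, 190000.0, 230000.0]
print(msle(actual, predicted))
```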

In [90]:
output = pd.DataFrame({'Id': test.Id,
                      'SalePrice': quantize(pd.DataFrame(model.predict(norm_X_test, verbose=0))[0])})
output.to_csv('submissions/submission.csv', index=False)