<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Downloading-and-Caching-Datasets" data-toc-modified-id="Downloading-and-Caching-Datasets-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Downloading and Caching Datasets</a></span></li><li><span><a href="#NUMERICAL-FEATURES" data-toc-modified-id="NUMERICAL-FEATURES-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>NUMERICAL FEATURES</a></span></li><li><span><a href="#Categorical-Features" data-toc-modified-id="Categorical-Features-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Categorical Features</a></span></li></ul></div>

In [1]:
import os
import requests
import zipfile
import tarfile
import hashlib
from mxnet import gluon, autograd, init, np, npx
from mxnet.gluon import nn
import d2l
import matplotlib.pyplot as plt
import pandas as pd
npx.set_np()


In [2]:
%matplotlib inline

## Downloading and Caching Datasets

In [3]:
DATA_HUB = dict() #@save
DATA_URL = 'http://d2l-data.s3-accelerate.amazonaws.com/' #@save

In [4]:
def download(name, cache_dir=os.path.join('..', 'data')):  #@save
    """Download a file inserted into DATA_HUB, return the local filename."""
    assert name in DATA_HUB, f"{name} does not exist in {DATA_HUB}."
    url, sha1_hash = DATA_HUB[name]
    d2l.mkdir_if_not_exist(cache_dir)
    fname = os.path.join(cache_dir, url.split('/')[-1])
    if os.path.exists(fname):
        sha1 = hashlib.sha1()
        with open(fname, 'rb') as f:
            while True:
                data = f.read(1048576)
                if not data:
                    break
                sha1.update(data)
        if sha1.hexdigest() == sha1_hash:
            return fname  # Hit cache
    print(f'Downloading {fname} from {url}...')
    r = requests.get(url, stream=True, verify=True)
    with open(fname, 'wb') as f:
        f.write(r.content)
    return fname

In [5]:
def download_extract(name, folder=None):  #@save
    """Download and extract a zip/tar file."""
    fname = download(name)
    base_dir = os.path.dirname(fname)
    data_dir, ext = os.path.splitext(fname)
    if ext == '.zip':
        fp = zipfile.ZipFile(fname, 'r')
    elif ext in ('.tar', '.gz'):
        fp = tarfile.open(fname, 'r')
    else:
        assert False, 'Only zip/tar files can be extracted.'
    fp.extractall(base_dir)
    return os.path.join(base_dir, folder) if folder else data_dir

def download_all():  #@save
    """Download all files in the DATA_HUB."""
    for name in DATA_HUB:
        download(name)

In [6]:
DATA_HUB['kaggle_house_train'] = (  #@save
    DATA_URL + 'kaggle_house_pred_train.csv',
    '585e9cc93e70b39160e7921475f9bcd7d31219ce')

DATA_HUB['kaggle_house_test'] = (  #@save
    DATA_URL + 'kaggle_house_pred_test.csv',
    'fa19780a7b011d9b009e8bff8e99922a8ee2eb90')

In [7]:
train_data = pd.read_csv(download('kaggle_house_train'))
test_data = pd.read_csv(download('kaggle_house_test'))

In [8]:
print(train_data.shape)
print(test_data.shape)

(1460, 81)
(1459, 80)


The training dataset contain 1460 examples (observations), 80 features, and 1 label (SalePrice), while the test data contain 1459 examples and 80 features.

---
a look at the first four, the eight, the tenth and last two features as well as the label (SalePrice) for the first 3 examples

In [9]:
train_data.iloc[0:3,[0,1,2,3,7,9, -3, -2, -1]]

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotShape,Utilities,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,Reg,AllPub,WD,Normal,208500
1,2,20,RL,80.0,Reg,AllPub,WD,Normal,181500
2,3,60,RL,68.0,IR1,AllPub,WD,Normal,223500


we going to concatenante both the training and test data and dropping the ID feature since it contains no information but just servering as an indetifier

In [10]:
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))

In [11]:
all_features.shape

(2919, 79)

In [12]:
all_features.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2919 entries, 0 to 1458
Data columns (total 79 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     2919 non-null   int64  
 1   MSZoning       2915 non-null   object 
 2   LotFrontage    2433 non-null   float64
 3   LotArea        2919 non-null   int64  
 4   Street         2919 non-null   object 
 5   Alley          198 non-null    object 
 6   LotShape       2919 non-null   object 
 7   LandContour    2919 non-null   object 
 8   Utilities      2917 non-null   object 
 9   LotConfig      2919 non-null   object 
 10  LandSlope      2919 non-null   object 
 11  Neighborhood   2919 non-null   object 
 12  Condition1     2919 non-null   object 
 13  Condition2     2919 non-null   object 
 14  BldgType       2919 non-null   object 
 15  HouseStyle     2919 non-null   object 
 16  OverallQual    2919 non-null   int64  
 17  OverallCond    2919 non-null   int64  
 18  YearBuil

from the result display above any features having a Non-Null Count value less than 2919 is an indication that the feature contain missing values. Data type (Dtype) represented by object are usually categorical features

## NUMERICAL FEATURES

In [13]:
numeric_features=all_features.dtypes[all_features.dtypes!='object'].index

Number of numerical features 

In [14]:
all_features[numeric_features].shape[1]

36

number of missing value in each  numerical feature

In [15]:
all_features[numeric_features].isna().sum()

MSSubClass         0
LotFrontage      486
LotArea            0
OverallQual        0
OverallCond        0
YearBuilt          0
YearRemodAdd       0
MasVnrArea        23
BsmtFinSF1         1
BsmtFinSF2         1
BsmtUnfSF          1
TotalBsmtSF        1
1stFlrSF           0
2ndFlrSF           0
LowQualFinSF       0
GrLivArea          0
BsmtFullBath       2
BsmtHalfBath       2
FullBath           0
HalfBath           0
BedroomAbvGr       0
KitchenAbvGr       0
TotRmsAbvGrd       0
Fireplaces         0
GarageYrBlt      159
GarageCars         1
GarageArea         1
WoodDeckSF         0
OpenPorchSF        0
EnclosedPorch      0
3SsnPorch          0
ScreenPorch        0
PoolArea           0
MiscVal            0
MoSold             0
YrSold             0
dtype: int64

to put all numerical features on a common scale, we standardize the numerical features by rescaling numerical features to zero mean and unit variance
$$ \frac{x-\mu}{\sigma^{2}} $$

In [16]:
all_features[numeric_features]=all_features[numeric_features].apply(lambda x: (x-x.mean())/x.std())

After standardizing the numerical features their means vanish (have 0 means), hence we can set the missing values to their means which is zero

In [17]:
all_features[numeric_features]=all_features[numeric_features].fillna(0)

## Categorical Features

In [18]:
cat_features=all_features.dtypes[all_features.dtypes=='object'].index

In [19]:
all_features[cat_features].isna().sum()

MSZoning            4
Street              0
Alley            2721
LotShape            0
LandContour         0
Utilities           2
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
RoofStyle           0
RoofMatl            0
Exterior1st         1
Exterior2nd         1
MasVnrType         24
ExterQual           0
ExterCond           0
Foundation          0
BsmtQual           81
BsmtCond           82
BsmtExposure       82
BsmtFinType1       79
BsmtFinType2       80
Heating             0
HeatingQC           0
CentralAir          0
Electrical          1
KitchenQual         1
Functional          2
FireplaceQu      1420
GarageType        157
GarageFinish      159
GarageQual        159
GarageCond        159
PavedDrive          0
PoolQC           2909
Fence            2348
MiscFeature      2814
SaleType            1
SaleCondition       0
dtype: int64

 we are dropping categorical features with missing values >=100

In [20]:
for i in cat_features:
    if all_features[i].isna().sum() >=100:
        all_features.drop(columns=[i],inplace=True)

In [21]:
all_features.shape

(2919, 70)

In [22]:
cat_features=all_features.dtypes[all_features.dtypes=='object'].index

In [23]:
all_features[cat_features].isna().sum()

MSZoning          4
Street            0
LotShape          0
LandContour       0
Utilities         2
LotConfig         0
LandSlope         0
Neighborhood      0
Condition1        0
Condition2        0
BldgType          0
HouseStyle        0
RoofStyle         0
RoofMatl          0
Exterior1st       1
Exterior2nd       1
MasVnrType       24
ExterQual         0
ExterCond         0
Foundation        0
BsmtQual         81
BsmtCond         82
BsmtExposure     82
BsmtFinType1     79
BsmtFinType2     80
Heating           0
HeatingQC         0
CentralAir        0
Electrical        1
KitchenQual       1
Functional        2
PavedDrive        0
SaleType          1
SaleCondition     0
dtype: int64

 Number of unique category in each categorical feature

In [24]:
for i in cat_features:
    print(all_features[i].unique())

['RL' 'RM' 'C (all)' 'FV' 'RH' nan]
['Pave' 'Grvl']
['Reg' 'IR1' 'IR2' 'IR3']
['Lvl' 'Bnk' 'Low' 'HLS']
['AllPub' 'NoSeWa' nan]
['Inside' 'FR2' 'Corner' 'CulDSac' 'FR3']
['Gtl' 'Mod' 'Sev']
['CollgCr' 'Veenker' 'Crawfor' 'NoRidge' 'Mitchel' 'Somerst' 'NWAmes'
 'OldTown' 'BrkSide' 'Sawyer' 'NridgHt' 'NAmes' 'SawyerW' 'IDOTRR'
 'MeadowV' 'Edwards' 'Timber' 'Gilbert' 'StoneBr' 'ClearCr' 'NPkVill'
 'Blmngtn' 'BrDale' 'SWISU' 'Blueste']
['Norm' 'Feedr' 'PosN' 'Artery' 'RRAe' 'RRNn' 'RRAn' 'PosA' 'RRNe']
['Norm' 'Artery' 'RRNn' 'Feedr' 'PosN' 'PosA' 'RRAn' 'RRAe']
['1Fam' '2fmCon' 'Duplex' 'TwnhsE' 'Twnhs']
['2Story' '1Story' '1.5Fin' '1.5Unf' 'SFoyer' 'SLvl' '2.5Unf' '2.5Fin']
['Gable' 'Hip' 'Gambrel' 'Mansard' 'Flat' 'Shed']
['CompShg' 'WdShngl' 'Metal' 'WdShake' 'Membran' 'Tar&Grv' 'Roll'
 'ClyTile']
['VinylSd' 'MetalSd' 'Wd Sdng' 'HdBoard' 'BrkFace' 'WdShing' 'CemntBd'
 'Plywood' 'AsbShng' 'Stucco' 'BrkComm' 'AsphShn' 'Stone' 'ImStucc'
 'CBlock' nan]
['VinylSd' 'MetalSd' 'Wd Shng' 'HdBoa

fill the missing values in each categorical feature with the category in ecah feature having the highest value

In [25]:
for i in cat_features:
    all_features[i].fillna(max(all_features[i].value_counts()),inplace=True)

In [26]:
all_features.isna().sum()

MSSubClass       0
MSZoning         0
LotFrontage      0
LotArea          0
Street           0
                ..
MiscVal          0
MoSold           0
YrSold           0
SaleType         0
SaleCondition    0
Length: 70, dtype: int64

In [27]:
all_features.shape

(2919, 70)

In [28]:
all_features = pd.get_dummies(all_features)
all_features.shape

(2919, 265)

In [29]:
n_train = train_data.shape[0]
train_features = np.array(all_features[:n_train].values, dtype=np.float32)
test_features = np.array(all_features[n_train:].values, dtype=np.float32)
train_labels = np.array(train_data.SalePrice.values.reshape(-1, 1), dtype=np.float32)

In [30]:
## Model

In [31]:
def get_net():
    net=nn.Sequential()
    net.add(nn.Dense(20,activation='relu'))
    net.add(nn.Dense(15,activation='relu'))
    net.add(nn.Dense(10,activation='relu'))
    net.add(nn.Dense(1))
    net.initialize()
    return net

# Loss
<img src="images/pred_loss.jpg" /><img src="images/pred_loss1.jpg" />
(source: From the book am using: Dive into Deep Learning by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola page 192-193)

np.clip(a, a_min, a_max, out=None)

In [32]:
loss=gluon.loss.L2Loss()
def log_rmse(net, features, labels):
    # To further stabilize the value when the logarithm is taken, set the
    # value less than 1 as 1
    clipped_preds =np.clip(net(features),1,float('inf')) 
    return np.sqrt(2*loss(np.log(clipped_preds),np.log(labels)).mean())

In [33]:
def train(net, train_features, train_labels, test_features, test_labels,num_epochs, learning_rate, 
          weight_decay, batch_size):
    train_ls, test_ls = [], []
    train_iter = d2l.load_array((train_features, train_labels), batch_size)
    # The Adam optimization algorithm is used here
    trainer =gluon.Trainer(net.collect_params(),'adam',{'learning_rate':0.5,'wd':weight_decay})
    for epoch in range(num_epochs):
        for X,y in train_iter:
            with autograd.record():
                l=loss(net(X),y)
            l.backward()
            trainer.step(batch_size)
        train_ls.append(log_rmse(net,train_features,train_labels))
        if test_labels is not None:
            test_ls.append(log_rmse(net,test_features,test_labels))
    return train_ls,test_ls       

# K-Fold Cross-Validation

In [34]:
def get_k_fold_data(k, i, X, y):
    assert k>1
    fold_size=X.shape[0]//k
    X_train, y_train = None, None
    for j in range(k):
        idx=slice(j*fold_size,(j+1)*fold_size)
        X_part,y_part=X[idx,:],y[idx]
        if j==i:
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train,y_train=X_part,y_part
        else:
            X_train=np.concatenate([X_train,X_part],0)
            y_train=np.concatenate([y_train,y_part],0)
    return X_train,y_train,X_valid,y_valid

In [35]:
def k_fold(k, X_train, y_train, num_epochs,learning_rate, weight_decay, batch_size):
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        data = get_k_fold_data(k, i, X_train, y_train)
        net = get_net()
        train_ls, valid_ls = train(net, *data, num_epochs, learning_rate,weight_decay, batch_size)
        train_l_sum += train_ls[-1]
        valid_l_sum += valid_ls[-1]
        print(f'fold {i + 1}, train log rmse {float(train_ls[-1]):f}, 'f'valid log rmse {float(valid_ls[-1]):f}')
    return train_l_sum / k, valid_l_sum / k

In [36]:
k, num_epochs, lr, weight_decay, batch_size = 6, 100, 5, 0.2, 64
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr,weight_decay, batch_size)
print(f'{k}-fold validation: avg train log rmse: {float(train_l):f}, 'f'avg valid log rmse: {float(valid_l):f}')

fold 1, train log rmse 0.078861, valid log rmse 0.145665
fold 2, train log rmse 0.078889, valid log rmse 0.181006
fold 3, train log rmse 0.064732, valid log rmse 0.169467
fold 4, train log rmse 0.266506, valid log rmse 0.141528
fold 5, train log rmse 0.061772, valid log rmse 0.145504
fold 6, train log rmse 0.066492, valid log rmse 0.185322
6-fold validation: avg train log rmse: 0.102875, avg valid log rmse: 0.161415
