# Houses Kaggle Competition (revisited with Deep Learning 🔥) 

[<img src='https://github.com/lewagon/data-images/blob/master/ML/kaggle-batch-challenge.png?raw=true' width=600>](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)

⚙️ Let's re-use our previous **pipeline** built in the module **`05-07-Ensemble-Methods`** and try to improve our final predictions with a Neural Network!

## (0) Libraries and imports

In [91]:
%load_ext autoreload
%autoreload 2

# DATA MANIPULATION
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np

# DATA VISUALISATION
import matplotlib.pyplot as plt
import seaborn as sns

# VIEWING OPTIONS IN THE NOTEBOOK
from sklearn import set_config; set_config(display='diagram')



## (1) 🚀 Getting Started

### (1.1) Load the datasets

💾 Let's load our **training dataset**

In [92]:
data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/houses_train_raw.csv")
X = data.drop(columns='SalePrice')
y = data['SalePrice']

💾 Let's also load the **test set**

❗️ Remember ❗️ You have access to `X_test` but only Kaggle has `y_test`

In [55]:
X_test = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/houses_test_raw.csv")

### (1.2) Train/Val Split

❓ **Holdout** ❓ 

As you are not allowed to use the test set (and you don't have access to `y_test` anyway), split your dataset into a training set and a validation set.

In [57]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=0.3)

### (1.3) Import the preprocessor

🎁 You will find in `utils/preprocessor.py` the **`data-preprocessing pipeline`** that was built in our previous iteration.

❓ Run the cell below, and make sure you understand what the pipeline does. Look at the code in `preprocessor.py` ❓

In [58]:
from utils.preprocessor import create_preproc

preproc = create_preproc(X_train)
preproc

❓ **Scaling your numerical features and encoding the categorical features** ❓

Apply these transformations to _both_ your training set and your validation set.

In [59]:
preproc=preproc.fit(X_train, y_train)

X_train_proc=preproc.transform(X_train)
X_val_proc=preproc.transform(X_val)

## (2) 🔮 Your predictions in Tensorflow/Keras

🚀 This is your first **regression** task with Keras! 

💡 Here are a few tips to get started:
- Kaggle's [rule](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview/evaluation) requires minimizing the **`rmsle`** (Root Mean Squared Log Error).
    - We can specify `msle` directly as a loss function with Tensorflow.Keras!
    - Just remember to take the square root of your loss results to read your rmsle metric.
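
💡 To make the link between the `msle` loss and the `rmsle` metric concrete, here is a minimal NumPy sketch. The helper names `msle` and `rmsle` and the toy prices are illustrative and not part of the notebook. The formula uses `log(1 + x)`, which is how Keras defines its `msle` loss; for prices in the hundreds of thousands the `+ 1` makes no practical difference.

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean squared log error, using log(1 + x) as Keras' `msle` loss does."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

def rmsle(y_true, y_pred):
    """Root mean squared log error: simply the square root of the MSLE."""
    return np.sqrt(msle(y_true, y_pred))

# Toy prices, made up for illustration
y_true = np.array([200_000, 150_000, 320_000])
y_pred = np.array([210_000, 140_000, 300_000])

print(f"MSLE:  {msle(y_true, y_pred):.5f}")
print(f"RMSLE: {rmsle(y_true, y_pred):.5f}")
```

Read the other way round, an `rmsle` of 0.13 corresponds to an `msle` loss of about 0.0169 (0.13 squared).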
    
    
😃 The best boosted-tree ***rmsle*** score to beat is around ***0.13***

---

<img src="https://i.pinimg.com/564x/4c/fe/ef/4cfeef34af09973211f584e8307b433c.jpg" alt="Impossible mission" style="height: 300px; width:500px;"/>

---


❓ **Your mission, should you choose to accept it:** ❓
- 💪 Beat the best boosted-tree 💪 

    - Your responsibilities are:
        - to build the ***best neural network architecture*** possible,
        - and to control the number of epochs to ***avoid overfitting***.
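
💡 One common way to control the number of epochs is early stopping: stop once the validation loss has not improved for a given number of epochs, and keep the best epoch. Keras offers this through its `EarlyStopping` callback (with `monitor`, `patience` and `restore_best_weights` arguments). The sketch below only illustrates the stopping rule itself, on a made-up list of validation losses; the function name `early_stopping_epoch` and the `patience` value are assumptions chosen for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops at the first epoch where the validation loss has not
    improved for `patience` consecutive epochs. If that never happens,
    stop_epoch is the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: improving first, then degrading (overfitting)
val_losses = [0.90, 0.50, 0.30, 0.25, 0.26, 0.27, 0.28, 0.29]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"Stop at epoch {stop_epoch}, best epoch was {best_epoch}")
```

With Keras, the same rule is applied by passing `callbacks=[EarlyStopping(patience=..., restore_best_weights=True)]` to `model.fit`, together with validation data so that `val_loss` can be monitored.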

### (2.1) Predicting the houses' prices using a Neural Network

❓ **Preliminary Question: Initializing a Neural Network** ❓

Create a function `initialize_model` which initializes a Dense Neural network:
- You are responsible for designing the architecture (number of layers, number of neurons)
- The function should also compile the model with the following parameters:
    - ***optimizer = "adam"***
    - ***loss = "msle"*** (_Optimizing directly for the Squared Log Error!_)
        

In [94]:
from tensorflow.keras import models, layers

def initialize_model():
    
    #############################
    #  1 - Model architecture   #
    ############################# 
    
    model = models.Sequential()
    # Infer the input size from the preprocessed data instead of hard-coding it
    model.add(layers.Dense(50, activation='relu', input_dim=X_train_proc.shape[1]))
    model.add(layers.Dense(50, activation='relu')) 
    model.add(layers.Dense(1, activation='linear'))
    
    #############################
    #  2 - Optimization Method  #
    #############################
    model.compile(loss='msle', # regression task: we optimize the mean squared log error directly
                  optimizer='adam') 

    return model 


model = initialize_model()

❓ **Questions/Guidance** ❓

1. Initialize a Neural Network
2. Train it
3. Evaluate its performance
4. Is the model overfitting the dataset? 

In [95]:
X_train_proc.shape

(1022, 159)

In [96]:
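# Pass the validation set so that `val_loss` is recorded at each epoch (needed to detect overfitting)
history = model.fit(X_train_proc, y_train,
                    validation_data=(X_val_proc, y_val),
                    epochs=1000, verbose=0)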

In [97]:
model.evaluate(X_val_proc, y_val)



0.01635262928903103

🎁 We coded a `plot_history` function that you can use to detect overfitting

In [98]:
def plot_history(history):
    plt.plot(np.sqrt(history.history['loss']))
    plt.plot(np.sqrt(history.history['val_loss']))
    plt.title('Model Loss')
    plt.ylabel('RMSLE')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Val'], loc='best')
    plt.show()

In [99]:
loss = np.array(history.history["loss"])**0.5
loss

array([10.67276252,  9.26814099,  8.41006417,  7.81003013,  7.34089794,
        6.95897807,  6.64856402,  6.38898266,  6.16500681,  5.96653699,
        5.78381786,  5.61116955,  5.4469848 ,  5.2901192 ,  5.14009124,
        4.99754025,  4.86309731,  4.73713568,  4.61945712,  4.50945264,
        4.4063725 ,  4.30952793,  4.21816719,  4.13184875,  4.0498811 ,
        3.97188402,  3.8974328 ,  3.82632367,  3.75804597,  3.69250278,
        3.62947845,  3.56865041,  3.50995813,  3.45325549,  3.39831276,
        3.34509593,  3.29348306,  3.24331491,  3.19458582,  3.14718853,
        3.10098223,  3.0559994 ,  3.01216346,  2.9693293 ,  2.92752541,
        2.88672073,  2.84677282,  2.80775127,  2.76957343,  2.73212266,
        2.69551409,  2.6596422 ,  2.62442537,  2.5899289 ,  2.5560746 ,
        2.52283369,  2.49024969,  2.45819899,  2.42670555,  2.39579961,
        2.36536662,  2.33548418,  2.30609429,  2.27716713,  2.24866234,
        2.22065395,  2.19307239,  2.16588519,  2.13911686,  2.11

### (2.2) Challenging yourself

❓ **Questions to challenge yourself:** ❓
- Are you satisfied with your score?
- Before publishing it, ask yourself whether you could really trust it or not?
- Have you cross-validated your neural network? 
    - Feel free to cross-validate it manually with a *for loop* in Python to make sure that your results are robust against the randomness of a _train-val split_ before submitting to Kaggle
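
💡 A manual cross-validation loop could look like the sketch below. It is a minimal illustration, not the reference solution: `cross_validate_rmsle`, `fit_predict` and `median_baseline` are hypothetical names, the data is synthetic, and a trivial median predictor stands in for the neural network so that the loop runs without Keras. The folds are built with NumPy; sklearn's `KFold(shuffle=True)` plays the same role.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error, using log(1 + x)."""
    return float(np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)))

def cross_validate_rmsle(fit_predict, X, y, n_splits=5, seed=0):
    """Return one validation RMSLE per fold.

    `fit_predict(X_tr, y_tr, X_va)` is a hypothetical callable that trains a
    fresh model on the training fold and returns 1-D predictions for X_va.
    """
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    # Shuffle the row indices, then cut them into n_splits folds
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        y_pred = fit_predict(X[train_idx], y[train_idx], X[val_idx])
        scores.append(rmsle(y[val_idx], y_pred))
    return scores

# Stand-in "model" so the sketch runs without Keras: predict the training median
def median_baseline(X_tr, y_tr, X_va):
    return np.full(len(X_va), np.median(y_tr))

# Synthetic data, for illustration only
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.uniform(100_000, 300_000, size=100)

scores = cross_validate_rmsle(median_baseline, X_demo, y_demo)
print(f"RMSLE per fold: {np.round(scores, 3)}")
print(f"Mean: {np.mean(scores):.3f} | Std: {np.std(scores):.3f}")
```

To cross-validate the neural network, `fit_predict` would call `initialize_model()` inside the function (so that every fold starts from fresh weights), fit it on the training fold, and return `model.predict(X_va).flatten()`. The `flatten()` matters: Keras returns predictions of shape `(n, 1)`, which would otherwise broadcast against the 1-D target when computing the error.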

### (2.3) (Bonus) Using all your CPU cores to run Neural Networks

🔥 **BONUS** 🔥 **Multiprocessing computing using [dask](https://docs.dask.org/en/latest/delayed.html)** and **all your CPU cores**:

_(to mimic SkLearn's `n_jobs=-1`)_

In [100]:
!pip install --quiet dask

In [101]:
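# Preprocess the full dataset so that its rows line up with `y` (same number of rows, same order)
X_preproc = preproc.transform(X)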

In [102]:
from sklearn.model_selection import KFold
from dask import delayed

cv = 5
kf = KFold(n_splits = cv, shuffle = True)
#f = delayed(evaluate_model)

results = delayed([(X_preproc, y, train_index, val_index) for (train_index, val_index) in kf.split(X_preproc)]).compute(scheduler='processes', num_workers=8)

#pd.concat(results, axis=0).reset_index(drop=True)
results

[(array([[0.17806333, 0.        , 0.        , ..., 0.        , 1.        ,
          0.        ],
         [0.09178522, 0.58974359, 0.        , ..., 0.        , 1.        ,
          0.        ],
         [0.34740707, 0.47008547, 0.        , ..., 0.        , 0.        ,
          0.        ],
         ...,
         [0.18173474, 0.        , 0.        , ..., 0.        , 1.        ,
          0.        ],
         [0.41005048, 0.        , 0.        , ..., 0.        , 0.        ,
          0.        ],
         [0.2248738 , 0.        , 0.        , ..., 0.        , 0.        ,
          0.        ]]),
  0       208500
  1       181500
  2       223500
  3       140000
  4       250000
           ...  
  1455    175000
  1456    210000
  1457    266500
  1458    142125
  1459    147500
  Name: SalePrice, Length: 1460, dtype: int64,
  array([   0,    1,    2,    3,    6,    7,    8,    9,   11,   12,   13,
           14,   15,   16,   17,   18,   19,   20,   21,   22,   24,   25,
           2

## (3) 🏅FINAL SUBMISSION

🦄 Predict the ***prices of the houses in your test set*** and submit your results to Kaggle! 



In [103]:
# X_test was already loaded in (1.1); reloading it here is harmless
X_test = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/houses_test_raw.csv")
X_test_preproc = preproc.transform(X_test)

In [104]:
y_pred = model.predict(X_test_preproc)

💾 Save your predictions in a Dataframe called `results` with the format required by Kaggle so that when you export it to a `.csv`, Kaggle can read it.

In [105]:
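# Kaggle expects two columns: "Id" and "SalePrice"
# (this assumes the raw test set contains an "Id" column, as Kaggle's original test.csv does)
results = pd.DataFrame({"Id": X_test["Id"], "SalePrice": y_pred.flatten()})

📤  Export your results using Kaggle's submission format and submit it online!

_(Uncomment the last cell of this notebook)_

In [107]:
results.to_csv("submission_final.csv", header=True, index=False)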

---

🏁 Congratulations!

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... it's time for the Recap!