# MLP Regression
* Data pipeline
* Model structure and design
* Model compilation
* Training / Testing

In [None]:
# import the needed modules

import tensorflow as tf
import numpy as np
import pandas as pd 
import os
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Data pipeline
* Read and load the data
* Understand the data
* Remove noise and normalize the data
* Define batches
* Convert to tensors
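
The last two steps of the list above can be sketched without any deep learning framework. The snippet below is a minimal illustration, not this notebook's actual pipeline: `iterate_batches` is a hypothetical helper, and the toy values are made up. It converts a DataFrame to `float32` arrays and yields shuffled mini-batches.

```python
import numpy as np
import pandas as pd


def iterate_batches(features, labels, batch_size, shuffle=True, seed=0):
    """Yield (features, labels) mini-batches as float32 NumPy arrays."""
    X = np.asarray(features, dtype=np.float32)
    y = np.asarray(labels, dtype=np.float32).reshape(-1, 1)
    indices = np.arange(len(X))
    if shuffle:
        # A fixed seed keeps the shuffling reproducible.
        np.random.default_rng(seed).shuffle(indices)
    for start in range(0, len(X), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]


# Toy data (made-up values) standing in for the real features and label.
toy = pd.DataFrame({
    "LotArea": [8000, 9500, 11000, 9000, 14000],
    "SalePrice": [200000, 180000, 220000, 140000, 250000],
})

batches = list(iterate_batches(toy[["LotArea"]], toy["SalePrice"], batch_size=2))
for X_batch, y_batch in batches:
    print(X_batch.shape, y_batch.shape)
```

With TensorFlow, `tf.data.Dataset.from_tensor_slices((X, y)).shuffle(buffer_size).batch(batch_size)` covers the same steps and yields tensors directly.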

### Predicting House Prices on Kaggle
Now that we have introduced some basic tools for building and training deep networks and regularizing them with techniques including weight decay and dropout, we are ready to put all this knowledge into practice by participating in a Kaggle competition. The house price prediction competition is a great place to start. The data is fairly generic and does not exhibit exotic structure that might require specialized models (as audio or video might). This dataset covers house prices in Ames, IA from 2006 to 2010.
It is considerably larger than the famous [Boston housing dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names) of Harrison and Rubinfeld (1978), boasting both more examples and more features. In this section, we will walk you through details of data preprocessing, model design, and hyperparameter selection. We hope that through a hands-on approach, you will gain some intuitions that will guide you in your career as a data scientist.

* The training dataset includes 1460 examples, 80 features, and 1 label.
* The test dataset contains 1459 examples and 80 features.

On the house price prediction competition page, you can find the dataset (under the "Data" tab), submit predictions, and see your ranking. The URL is: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

*Figure: The house price prediction competition page.*

In [None]:
# Make a data directory to store the data!
os.makedirs(os.path.join('.', 'data'), exist_ok=True)


#Thanks to D2L! You can also download from Kaggle
dataname = "HousePrices"
raw_train_url = "http://d2l-data.s3-accelerate.amazonaws.com/kaggle_house_pred_train.csv"
raw_test_url = "http://d2l-data.s3-accelerate.amazonaws.com/kaggle_house_pred_test.csv"



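def load_data(train_url, test_url, name, folder=".", save_data=False):
  """Read the train/test CSVs and optionally save local copies in `folder`."""
  raw_train = pd.read_csv(train_url)
  raw_test = pd.read_csv(test_url)

  if save_data:
    # os.path.join inserts the path separator, so folder="." also works.
    raw_train.to_csv(os.path.join(folder, name + "Train.csv"), index=False)
    raw_test.to_csv(os.path.join(folder, name + "Test.csv"), index=False)

  return raw_train, raw_test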



raw_train, raw_test = load_data(raw_train_url, raw_test_url, dataname, "data/", save_data=False)

raw_train.shape, raw_test.shape
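
To avoid downloading the files on every run, one option is to cache them locally. The sketch below is an assumption about how that could look, not part of the original notebook: `load_cached_csv` is a hypothetical helper, and a small temporary file stands in for the remote URL so the example runs offline.

```python
import os
import tempfile

import pandas as pd


def load_cached_csv(source, cache_path):
    """Read `cache_path` if it exists; otherwise read `source` and cache it."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path)
    frame = pd.read_csv(source)  # `source` may be a URL or a local path
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    frame.to_csv(cache_path, index=False)  # index=False avoids an extra column
    return frame


# Offline demonstration: a small local file stands in for the remote URL.
with tempfile.TemporaryDirectory() as tmp_dir:
    source_path = os.path.join(tmp_dir, "source.csv")
    cache_path = os.path.join(tmp_dir, "data", "HousePricesTrain.csv")
    pd.DataFrame({"Id": [1, 2], "SalePrice": [100000, 150000]}).to_csv(
        source_path, index=False
    )

    first = load_cached_csv(source_path, cache_path)   # reads source, writes cache
    os.remove(source_path)                             # the source is no longer needed
    second = load_cached_csv(source_path, cache_path)  # served from the cache
    cache_was_written = os.path.exists(cache_path)
```

In the notebook, `source` would be `raw_train_url` or `raw_test_url`, and `cache_path` a file under the `data` directory created earlier.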

**Have a look at the data!**

In [None]:
raw_train.head()

**Here are the columns!**

In [None]:
#All columns
raw_test.columns

**Numeric Columns**

In [None]:
#numeric columns
numeric_columns = raw_test.dtypes[raw_test.dtypes!='object'].index
numeric_columns

**Object columns**

In [None]:
#object columns
object_columns = raw_test.dtypes[raw_test.dtypes=='object'].index
object_columns
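
As an alternative to comparing `dtypes` against `'object'`, pandas offers `DataFrame.select_dtypes`. The sketch below uses a made-up toy frame (the column names are borrowed from the dataset, the values are not) to show the same numeric and non-numeric split. It may be more robust when string columns are not stored with the `object` dtype.

```python
import pandas as pd

# Toy frame with made-up values: two numeric columns and one string column.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8000.0, 9500.0, 12000.0],
    "MSZoning": ["RL", "RM", "RL"],
})

# Columns with any numeric dtype (integers and floats).
numeric_cols = toy.select_dtypes(include="number").columns
# Everything else, here the string column.
other_cols = toy.select_dtypes(exclude="number").columns

print(list(numeric_cols))
print(list(other_cols))
```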

**Understand the data**

Plot histogram of numerical variables

In [None]:
# Plot a histogram for each numerical variable to inspect its distribution.

def draw_histograms(df, variables, n_rows, n_cols):
    # Set the figure size once here instead of on every hist() call.
    fig = plt.figure(figsize=(40, 60))

    for i, var_name in enumerate(variables):
        ax = fig.add_subplot(n_rows, n_cols, i + 1)
        df[var_name].hist(bins=40, ax=ax, color='blue', alpha=0.7)
        ax.set_title(var_name, fontsize=30)
        ax.tick_params(axis='both', which='major', labelsize=20)
        ax.tick_params(axis='both', which='minor', labelsize=20)
        ax.set_xlabel('')
    fig.tight_layout(rect=[0, 0.03, 1, 0.95])  # Leave a small margin around the grid.
    plt.show()
    
draw_histograms(raw_train[numeric_columns].drop(columns=['Id']), numeric_columns[1:], 9, 4)
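
The call above hard-codes a 9 by 4 grid, which holds exactly 36 panels. If the number of plotted columns changes, the grid can be derived instead. The helper below, `grid_shape`, is a hypothetical addition sketched under that assumption.

```python
import math


def grid_shape(n_plots, n_cols):
    """Return (n_rows, n_cols) with enough rows to hold n_plots panels."""
    if n_plots < 0 or n_cols < 1:
        raise ValueError("n_plots must be >= 0 and n_cols must be >= 1")
    n_rows = max(1, math.ceil(n_plots / n_cols))
    return n_rows, n_cols


print(grid_shape(36, 4))  # 36 panels fit exactly into 9 rows of 4
print(grid_shape(37, 4))  # one extra panel needs a tenth row
```

The result could be unpacked into the call, for example `draw_histograms(df, variables, *grid_shape(len(variables), 4))`.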

**Explore some columns and their correlations**

In [None]:
explore_columns = ["LotArea", "TotalBsmtSF", "GarageArea", "ScreenPorch", "PoolArea","SalePrice"]


corr = raw_train[explore_columns].corr()
f, ax = plt.subplots(figsize=(15, 12))
sns.heatmap(corr, linewidths=.5, vmax=1, square=True)

**Explore some columns and their correlation to SalePrice**

In [None]:
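k = 10  # maximum number of variables for the heatmap
# Note: `corr` was computed from explore_columns above, so at most
# len(explore_columns) variables are available here, even though k = 10.
cols = corr.nlargest(k, 'SalePrice')['SalePrice'].index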
cm = np.corrcoef(raw_train[cols].values.T)
f, ax = plt.subplots(figsize=(15, 12))
sns.set(font_scale=1.5)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 20}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
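
To rank every numeric column by its correlation with the label, rather than only a hand-picked subset, the correlation matrix can be built from all numeric columns first. The sketch below uses a made-up toy frame (the column names are borrowed from the dataset, the values are invented) and the same `nlargest` selection as above.

```python
import pandas as pd

# Toy frame with invented values; "Street" is a non-numeric column.
toy = pd.DataFrame({
    "GrLivArea": [1, 2, 3, 4, 5, 6],
    "PoolArea": [3, 1, 4, 1, 5, 9],
    "Street": ["Pave"] * 6,
    "SalePrice": [10, 20, 30, 40, 50, 61],
})

# Correlation matrix over numeric columns only.
corr_all = toy.select_dtypes(include="number").corr()

k = 2  # number of variables to keep, including SalePrice itself
top_cols = corr_all.nlargest(k, "SalePrice")["SalePrice"].index
print(list(top_cols))
```

Because `SalePrice` correlates perfectly with itself, it always comes first, so `k` counts the label plus `k - 1` features.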

**General correlation heatmap for columns**

In [None]:
corr = raw_train[numeric_columns].corr()
f, ax = plt.subplots(figsize=(15, 12))
sns.set(font_scale=1)
sns.heatmap(corr, linewidths=.5, vmax=1, square=True)

### **Data Preprocessing**
We can see that in each example, the first feature is the ID. This helps the model identify each training example. While this is convenient, it does not carry any information for prediction purposes. Hence, we will remove it from the dataset before feeding the data into the model. Besides, given a wide variety of data types, we will need to preprocess the data before we can start modeling.

Let's start with the numerical features. First, we apply a heuristic, **replacing all missing values by the corresponding feature's mean.** Then, to put all features on a common scale, we **standardize the data by rescaling features to zero mean and unit variance**:

$$ x \leftarrow \frac{x - \mu}{\sigma}, $$