# Machine Learning, DNN for regression

### Description of the dataset NO2
The dataset used in this lesson was obtained from the StatLib repository, http://lib.stat.cmu.edu/datasets/ (NO2).
The data are a subsample of 500 observations from a data set that originates in a study where air pollution at a road is related to traffic volume and meteorological variables, collected by the Norwegian Public Roads Administration. The response variable (column 1) consists of hourly values of the logarithm of the concentration of NO2 (particles), measured at Alnabru in Oslo, Norway, between October 2001 and August 2003. The predictor variables (columns 2 to 8) are the logarithm of the number of cars per hour, temperature $2$ meters above ground (degree C), wind speed (meters/second), the temperature difference between $25$ and $2$ meters above ground (degree C), wind direction (degrees between 0 and 360), hour of day and day number from October 1 2001. Submitted by Magne Aldrin (magne.aldrin@nr.no). [28/Jul/04] (19kbytes)


Target variable:
- NO2 concentration (log) [lno2]  

Features:    
- log of cars per hour [lc]

- temperature 2 meters above the ground (degree C)[t2] 

- wind speed (meters/second) [ws]

- temperature difference between 25 meters and 2 meters above the ground (degree C) [td25] 

- wind direction (degrees between 0 and 360) [wd] 

- hour of day [hd]

- day number from Oct. 1 2001 [dn] 

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from math import sqrt
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense
from keras import metrics

: 

### ASSIGNING LABELS TO FEATURES
We list the labels of the feature columns to have a clean representation of the dataset.


In [None]:
features = ['lc', 't2', 'ws', 'td25', 'wd', 'hd', 'dn']

: 

### CREATE THE DNN LARGE NETWORK MODEL


<img src="images/DNN-all.png" alt="DNN Model Large" width="500"/>


#### Create the network as indicated in the image above. All activation functions should be ReLU. Have a look at the slide for hints on the code to use. 


In [None]:
def create_model_large():
    model = Sequential()
    model.add(Dense(128, input_dim=7, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(32, activation='relu'))
    # Linear output for regression: a sigmoid would confine predictions to (0, 1)
    model.add(Dense(1, activation='linear'))
    return model


: 

### CREATE THE TINY ANN MODEL

#### Create an ANN with 1 layer containing 3 neurons (model tiny); all activation functions should be ReLU.

In [None]:
def create_model_tiny():
    model = Sequential()
    model.add(Dense(3, input_dim=7, activation='relu'))  # 1 hidden layer with 3 neurons
    model.add(Dense(1, activation='linear'))  # linear output for regression (a sigmoid is bounded in (0, 1))
    return model


: 

### CREATE THE SMALL ANN MODEL

#### Create an ANN with 1 layer containing 10 neurons (model small); all activation functions should be ReLU.

In [None]:
def create_model_small():
    model = Sequential()
    model.add(Dense(10, input_dim=7, activation='relu'))  # 1 hidden layer with 10 neurons
    model.add(Dense(1, activation='linear'))  # linear output for regression (a sigmoid is bounded in (0, 1))
    return model


: 

### CREATE THE MEDIUM DNN MODEL

#### Create a DNN with 2 layers containing respectively 10 and 30 neurons (model medium); all activation functions should be ReLU.


In [None]:
def create_model_medium():
    model = Sequential()
    model.add(Dense(10, input_dim=7, activation='relu'))  # 1st hidden layer with 10 neurons
    model.add(Dense(30, activation='relu'))  # 2nd hidden layer with 30 neurons
    model.add(Dense(1, activation='linear'))  # linear output for regression (a sigmoid is bounded in (0, 1))
    return model


: 
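The four architectures differ mainly in how many trainable parameters they have, which is what `model.summary()` reports. As a rough check, the count can be worked out by hand: a dense layer with `n_inputs` inputs and `n_units` neurons has `n_inputs * n_units` weights plus `n_units` biases. The sketch below is plain Python and does not use Keras; the helper names `dense_params` and `count_params` are hypothetical and only serve the illustration.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units


def count_params(layer_sizes, n_inputs=7):
    """Sum the parameters of a stack of dense layers fed by n_inputs features."""
    total = 0
    for n_units in layer_sizes:
        total += dense_params(n_inputs, n_units)
        n_inputs = n_units  # the next layer reads this layer's outputs
    return total


# Hidden layer sizes of each model, followed by the single output unit
architectures = {
    "tiny": [3, 1],
    "small": [10, 1],
    "medium": [10, 30, 1],
    "large": [128, 64, 32, 1],
}

param_counts = {name: count_params(sizes) for name, sizes in architectures.items()}
for name, n in param_counts.items():
    print(f"{name:>6}: {n} trainable parameters")
```

With only 500 observations available, the large model has far more parameters than data points, which is worth keeping in mind when comparing the training and validation curves later on.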

### Evaluating the model

The following functions compute the Root Mean Squared Error and the Normalized Root Mean Squared Error between the ground truth (real) and inferred (pred) responses.

In [None]:
def RMSE(real, pred):
    return sqrt(mean_squared_error(real, pred))

def NRMSE(real, pred):
    # Normalize the RMSE by the range of the observed values
    # (the division must happen outside the square root)
    return RMSE(real, pred) / (real.max() - real.min())

: 
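To make the two metrics concrete, here is a small worked example in plain Python. It assumes the common convention in which the NRMSE is the RMSE divided by the range of the observed values (other conventions normalize by the mean or the standard deviation). The names `rmse_by_hand` and `nrmse_by_hand` and the numbers are made up for illustration.

```python
from math import sqrt


def rmse_by_hand(real, pred):
    """Square root of the mean of the squared errors."""
    squared_errors = [(r - p) ** 2 for r, p in zip(real, pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def nrmse_by_hand(real, pred):
    """RMSE divided by the range (max - min) of the observed values."""
    return rmse_by_hand(real, pred) / (max(real) - min(real))


# Toy values: every prediction is off by exactly 0.5
toy_real = [2.0, 3.0, 4.0, 5.0]
toy_pred = [2.5, 2.5, 4.5, 4.5]

toy_rmse = rmse_by_hand(toy_real, toy_pred)    # 0.5
toy_nrmse = nrmse_by_hand(toy_real, toy_pred)  # 0.5 / 3
print(toy_rmse, toy_nrmse)
```

The RMSE is expressed in the units of the response (here log NO2), while the NRMSE is dimensionless and can be read as the typical error relative to the spread of the data.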

### Loading  the data 

#### Using the pandas library, load the data and provide a descriptive summary

In [None]:
# Load data
df = pd.read_csv('NO2.csv', index_col=False)

# Descriptive statistics summary
df.describe()

: 

### Visualizing the relationships among the features in the data

#### Compute and visualize a correlation matrix among the features (using the seaborn library).

In [None]:
# Correlation matrix
corrmat = df.corr()

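# Mask the upper triangle (diagonal included) so that each pair is shown only once:
# the non-zero entries of np.triu(corrmat) hide the corresponding cells
matrix = np.triu(corrmat)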
f, ax = plt.subplots(figsize=(12, 9))
sns.set(font_scale=1)
sns.heatmap(corrmat, vmin=-1, vmax=1, center= 0, square=True, annot=True, annot_kws={'size': 8}, mask=matrix, fmt='.2g', cmap= 'coolwarm')

plt.show()

: 
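By default `df.corr()` computes the Pearson correlation coefficient for every pair of columns. As a reminder of what each cell of the heatmap means, the sketch below computes the coefficient by hand in plain Python; the function name `pearson` and the toy values are illustrative and not taken from the dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy series: one rises with x, the other falls with x
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_rising = [2.0, 4.0, 6.0, 8.0]
toy_falling = [8.0, 6.0, 4.0, 2.0]

r_rising = pearson(toy_x, toy_rising)    # perfect positive linear relation
r_falling = pearson(toy_x, toy_falling)  # perfect negative linear relation
print(r_rising, r_falling)
```

Values close to +1 or -1 indicate a strong linear relationship, while values near 0 indicate no linear relationship (a nonlinear one may still exist).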

### Standardizing the data

#### Rescale the features to the range between -1 and 1 (min-max scaling). Do not rescale the response.

In [None]:
# Min-max scaling of the features to [-1, 1]; the response 'lno2' is left untouched
sc = MinMaxScaler(feature_range=(-1, 1))

# MinMaxScaler rescales each column independently
df[features] = sc.fit_transform(df[features])


: 
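The transformation applied by `MinMaxScaler` to each column is a simple linear map: the column minimum goes to the lower bound and the column maximum to the upper bound. A minimal plain-Python sketch of that formula follows; the helper `minmax_scale_by_hand` and the sample values are hypothetical.

```python
def minmax_scale_by_hand(values, low=-1.0, high=1.0):
    """Map min(values) to low and max(values) to high, linearly in between."""
    v_min, v_max = min(values), max(values)
    return [low + (high - low) * (v - v_min) / (v_max - v_min) for v in values]


# Toy wind directions in degrees
toy_wd = [0.0, 90.0, 180.0, 360.0]
toy_scaled = minmax_scale_by_hand(toy_wd)
print(toy_scaled)  # [-1.0, -0.5, 0.0, 1.0]

# The result spans [-1, 1], but its mean is not necessarily zero
toy_mean = sum(toy_scaled) / len(toy_scaled)
print(toy_mean)  # -0.125
```

Note that min-max scaling fixes the range, not the mean or the standard deviation; a zero mean and unit variance would instead require a z-score transformation.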

### Preparing the data for training, using validation set approach 

#### Remove labels and create the validation set.

In [None]:
#NumPy representation of the data frame (removing labels)
df = df.to_numpy() #df=df.values

# separate the predictors (columns 1 to 7) from the response (column 0)
X = df[:, 1:8]
y = df[:, 0]

seed = 7
np.random.seed(seed)

# split the dataset into 75% for training and 25% for testing (500 -> 375, 125)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=seed)

# split the training set into 80% for training and 20% for validation (375 -> 300, 75)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=seed)

# Model creation (choose between tiny, small, medium and large by using the appropriate function)
model = create_model_tiny()
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mean_absolute_error'])
model.summary()

: 
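The sizes of the three subsets follow directly from the two fractions used above. The sketch below reproduces the arithmetic in plain Python, under the assumption that the test size is rounded up and the remainder goes to training (which is how `train_test_split` is understood to behave when only `test_size` is given). The helper `split_sizes` is hypothetical.

```python
from math import ceil


def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test) for a single hold-out split."""
    n_test = ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test


# First split: 500 observations, 25% held out for testing
n_train_full, n_test = split_sizes(500, 0.25)

# Second split: 20% of the remaining training data held out for validation
n_train, n_val = split_sizes(n_train_full, 0.2)

print(n_train, n_val, n_test)  # 300 75 125
```

Overall this corresponds to 60% of the data for training, 15% for validation and 25% for testing.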

### Fit the DNN to the data

In [None]:
history = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=150, batch_size=32)

: 
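To get a feel for how much training the call above performs, it helps to count the weight updates. The sketch below assumes 300 training samples (the size obtained from the 75/25 and 80/20 splits), mini-batch training with one update per batch, and a smaller final batch when the sample count is not a multiple of the batch size.

```python
from math import ceil

n_train_samples = 300  # assumption: size of X_train after both splits
batch_size = 32
epochs = 150

# Number of mini-batches (and hence weight updates) per epoch
steps_per_epoch = ceil(n_train_samples / batch_size)

# The last batch of each epoch holds whatever is left over
last_batch_size = n_train_samples - (steps_per_epoch - 1) * batch_size

# Total number of weight updates over the whole training run
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch_size, total_updates)  # 10 12 1500
```

A smaller batch size gives more (and noisier) updates per epoch; a larger one gives fewer and smoother updates.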

### Evaluate the accuracy (here the mean absolute error) and the loss of your model. This is what your plots should look like:
<table><tr>
<td> <img src="images/accuracy.png" style="width: 500px;"/> </td>
<td> <img src="images/loss.png" style="width: 500px;"/> </td>
</tr></table>

In [None]:
# Summarize history for the mean absolute error (the regression counterpart of accuracy)
plt.plot(history.history['mean_absolute_error'])
plt.plot(history.history['val_mean_absolute_error'])
plt.title('model mean absolute error')
plt.ylabel('mean absolute error')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()

: 

In [None]:
# Summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()

: 
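When reading the loss curves, a useful habit is to locate the epoch at which the validation loss is lowest: if the training loss keeps decreasing after that point while the validation loss rises, the model is likely overfitting. The sketch below shows one way to find that epoch from a dictionary shaped like `history.history`; the numbers in `example_history` are invented for illustration and are not results from this notebook.

```python
# Illustrative values only; in the notebook this would be history.history
example_history = {
    "loss": [1.20, 0.80, 0.55, 0.45, 0.40, 0.37],
    "val_loss": [1.25, 0.85, 0.60, 0.52, 0.54, 0.58],
}

val_losses = example_history["val_loss"]

# Index (0-based epoch) of the smallest validation loss
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)

# Gap between validation and training loss at the end of training
final_gap = example_history["val_loss"][-1] - example_history["loss"][-1]

print(f"best epoch: {best_epoch}, final gap: {final_gap:.2f}")
```

In this made-up example the validation loss bottoms out at epoch 3 and then grows while the training loss keeps falling, which is the typical signature of overfitting.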

### Now evaluate the predictions of your model in terms of RMSE and NRMSE. You should obtain plots like these:
<table><tr>
<td> <img src="images/inference.png" style="width: 500px;"/> </td>
<td> <img src="images/inference_diff.png" style="width: 500px;"/> </td>
</tr></table>

In [None]:
# Prediction
pred = model.predict(X_test).ravel()  # flatten the (n, 1) output to a 1-D array

rmse = RMSE(y_test, pred)
nrmse = NRMSE(y_test, pred)


print("rmse : ",rmse,"  nrmse : ",nrmse)

: 

In [None]:
my_x = np.arange(len(X_test))  # index of each test sample

fig = plt.figure(figsize=(20,10))
plt.scatter(my_x, y_test, label='Real', color='blue', marker='o')
plt.scatter(my_x, pred, label='Inference', color='red', marker='s')
plt.title(f'Inference of log(NO2)  rmse: {rmse:.3f}, nrmse: {nrmse:.3f}')
plt.ylabel('log(NO2)')
plt.xlabel('readings (rows of file)')
plt.grid()
plt.legend()
plt.show()

: 

In [None]:
fig = plt.figure(figsize=(20,10))
my_d = abs(y_test-pred)

plt.bar(my_x,my_d)
plt.title(f'Inference of log(NO2) -- absolute difference between inference and ground truth rmse: {rmse:.3f}, nrmse: {nrmse:.3f}')
plt.ylabel('absolute value of difference in inference [log(NO2)]')
plt.xlabel('readings (rows of file)')
plt.grid()
plt.show()

: 