# **Lab 5. Neural networks for classification**
This lab's primary purpose is to explain the process of building, training and evaluating a simple deep neural network using the California housing dataset. Our goal is to predict whether the median house value in a given California neighborhood is high or not, using a threshold-based approach to create a binary classification problem.

## **1. Preparing the dataset**
As usual, let's start by setting up the necessary libraries and utilities.

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers
from matplotlib import pyplot as plt
import seaborn as sns


Then we create two dataframes, `df_train` and `df_test`. The data comes as two separate files, so the distinction between the training set and the test set is already made at the dataset level.

In [None]:
# Import dataset
df_train = pd.read_csv("dataset_train.csv")
df_test = pd.read_csv("dataset_test.csv")

### **1.1. Dataset shuffling**
To remove any unintended patterns or biases that come from the order of the data, we shuffle the dataset. We do that by creating a randomly permuted sequence of the row indexes of the `df_train` dataframe and then reordering the rows of the dataframe according to that permutation.

In [None]:
# Data shuffling
df_train = df_train.reindex(np.random.permutation(df_train.index))
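
The shuffle above draws a different permutation at every run. As a minimal sketch of a reproducible variant, the example below uses a hypothetical toy dataframe `df_demo` and a seeded `numpy` generator (the seed value and all names are assumptions, not part of the lab) to show that reindexing with a permuted index reorders the rows without dropping any.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe standing in for df_train
df_demo = pd.DataFrame({"population": [10, 20, 30, 40, 50]})

# Seeded generator so that the shuffle is repeatable (seed chosen arbitrarily)
rng = np.random.default_rng(seed=42)
shuffled_index = rng.permutation(df_demo.index.to_numpy())

# Reorder the rows according to the permuted index
df_shuffled = df_demo.reindex(shuffled_index)

# Same rows as before, possibly in a different order
print(df_shuffled)
```

Each original index label appears exactly once in `df_shuffled`, so only the order of the samples changes.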

### **1.2. Dataset normalization**
Then we compute the mean and the standard deviation of the training set in order to perform z-score normalization. This puts all features on a common scale, with a mean of zero and a standard deviation of one, which leads to more efficient and effective training of the neural network.

In [None]:
# Computing the mean and the standard deviation of the training set
df_train_mean = df_train.mean()
df_train_std = df_train.std()

# Performing z-score normalization
df_train_norm = (df_train - df_train_mean)/df_train_std
df_test_norm = (df_test - df_train_mean)/df_train_std
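
As a small sanity check of the idea, the sketch below applies the same formula to hypothetical toy data (`toy_train` and `toy_test` are made-up names and values). It illustrates that the normalized training column ends up with mean 0 and standard deviation 1, while the test column is simply rescaled with the training statistics.

```python
import pandas as pd

# Hypothetical toy data standing in for the training and test sets
toy_train = pd.DataFrame({"median_income": [2.0, 4.0, 6.0, 8.0]})
toy_test = pd.DataFrame({"median_income": [5.0, 11.0]})

# Statistics are computed on the training set only
train_mean = toy_train.mean()
train_std = toy_train.std()  # pandas uses the sample standard deviation (ddof=1)

# The same mean and standard deviation rescale both sets
toy_train_norm = (toy_train - train_mean) / train_std
toy_test_norm = (toy_test - train_mean) / train_std

print(toy_train_norm["median_income"].mean())  # approximately 0
print(toy_train_norm["median_income"].std())   # approximately 1
print(toy_test_norm)
```

The test set is not expected to have mean 0 after this step, because its own statistics are never used.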

Note that the test set is normalized with the mean and the standard deviation computed on the training set, so that no information from the test set leaks into the preprocessing. Let's use the `head()` method to inspect the first few rows of `df_train_norm`.

In [None]:
df_train_norm.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 6331 | 0.654364 | -0.811862 | 0.588757 | -0.512244 | -0.546646 | -0.240078 | -0.577919 | -0.300855 | -0.877717 |
| 16653 | -1.589839 | 1.345025 | -1.000192 | 0.712098 | 0.233901 | 0.246918 | 0.295896 | 0.820384 | 0.386253 |
| 16574 | -1.559916 | 1.260808 | -1.079639 | -0.049389 | -0.297535 | -0.025765 | -0.195625 | 0.931644 | -0.300912 |
| 10673 | -0.502647 | 0.806973 | -0.841297 | -0.889776 | -0.862186 | -0.792413 | -0.902999 | -0.461481 | -0.913929 |
| 14369 | -1.270663 | 0.942656 | -1.635772 | 0.435027 | 0.734495 | 0.41593 | 0.636579 | -0.11151 | -0.081054 |


### **1.3. Creating binary labels**
This dataset was designed for regression, because the outcome we want to predict is the price of the houses. We can turn it into a classification problem on a statistical basis, that is, by fixing a statistical threshold. What counts as a high price is a subjective choice: taking the 75th percentile as the cutoff already places us in the upper tail of the distribution, so a high price is one well above the average. We then split the label of the dataset into ones and zeros, where one means a high price and zero means a normal price. To give meaning to the 1 and the 0, the label is named `median_house_value_is_high`. In other words, the label is a binarized version of the original regression target.

In [None]:
# 75th percentile of median house value
threshold = df_train['median_house_value'].quantile(q = 0.75)
print(threshold)

# Creating labels
df_train_norm['median_house_value_is_high'] = (df_train['median_house_value'] > threshold).astype(float)
df_test_norm['median_house_value_is_high'] = (df_test['median_house_value'] > threshold).astype(float)

265000.0
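
To make the thresholding step concrete, here is a minimal sketch on hypothetical house values (the series `toy_values` is made up and not taken from the lab's dataset). By construction, roughly a quarter of the samples end up above the 75th percentile.

```python
import pandas as pd

# Hypothetical house values (in dollars), not taken from the lab's dataset
toy_values = pd.Series([100000, 150000, 200000, 250000,
                        300000, 350000, 400000, 450000])

# 75th percentile of the toy values (pandas interpolates linearly by default)
toy_threshold = toy_values.quantile(q=0.75)

# 1.0 where the value is strictly above the threshold, 0.0 otherwise
toy_is_high = (toy_values > toy_threshold).astype(float)

print(toy_threshold)
print(toy_is_high.mean())  # fraction of samples labelled as high
```

As in the lab's code, the threshold is computed on the raw values rather than on the z-scores, which keeps it interpretable as a price.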


We can inspect the result of our labelling by calling the `head()` method again.

In [None]:
df_train_norm['median_house_value_is_high'].head()

6331     0.0
16653    0.0
16574    0.0
10673    0.0
14369    0.0
Name: median_house_value_is_high, dtype: float64

## **2. Representing data**
Let's define an empty list that will eventually hold all the feature columns. The features we use to predict whether the median house value is high or not (`median_house_value_is_high`) are `latitude`, `longitude`, `median_income` and `population`.

In [None]:
# Empty list
feature_columns = []


Before training the model, we need to apply two preprocessing and feature engineering techniques: bucketing and feature crossing.

### **2.1. Bucketing (discretization)**
Latitude and longitude are continuous values, but we want to simplify them and protect privacy. Instead of using precise geographical coordinates, we group these values into categories known as buckets. Imagine dividing the geographic area into a grid where each cell represents a range of latitude and longitude. We choose a resolution equal to 30% of the standard deviation of the normalized data. This means each bucket covers a section of the geographic area whose width corresponds to 30% of the standard deviation.

In practice, this involves creating regular intervals for latitude and longitude, turning them into categories that represent broader, less detailed geographic areas.

Bucketing is a technique where continuous numerical values are converted into discrete intervals or categories called buckets. This can simplify the model's task and reduce the noise from minute variations in the data. As we said, for the features `latitude` and `longitude`, we group their values into ranges rather than using their precise values.

First we set `resolution_in_Zs`, which determines the size of each bucket. Earlier we scaled all the columns, including `latitude` and `longitude`, into their z-scores. So, instead of picking a resolution in degrees, we use a resolution in terms of z-scores. A `resolution_in_Zs` equal to 1 corresponds to a full standard deviation. In our case it is set to 0.3, that is, 30% of a standard deviation.

In [None]:
# Resolution
resolution_in_Zs = 0.3

Then we start by treating `latitude` as a numeric column, enabling `tensorflow` to handle it. After that we create boundaries for the buckets by generating a range of values from the minimum to the maximum latitude in the training set. Each step in this range is 0.3 of the standard deviation of the normalized data. For example, if the range of our latitude is from -2 to 2 and the standard deviation is 1, this would create buckets at -2, -1.7, -1.4 and so on. Finally, we create the bucketized columns using these boundaries.

In [None]:
# Creating a bucket feature column for latitude
latitude_as_a_numeric_column = tf.feature_column.numeric_column("latitude")
latitude_boundaries = list(np.arange(int(min(df_train_norm['latitude'])), 
                                     int(max(df_train_norm['latitude'])), 
                                     resolution_in_Zs))
latitude = tf.feature_column.bucketized_column(latitude_as_a_numeric_column, latitude_boundaries)

Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.


We do the same thing for `longitude`.

In [None]:
# Creating a bucket feature column for longitude
longitude_as_a_numeric_column = tf.feature_column.numeric_column("longitude")
longitude_boundaries = list(np.arange(int(min(df_train_norm['longitude'])), 
                                      int(max(df_train_norm['longitude'])), 
                                      resolution_in_Zs))
longitude = tf.feature_column.bucketized_column(longitude_as_a_numeric_column, longitude_boundaries)
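
Independently of `tf.feature_column`, the bucketing idea can be sketched with plain `numpy`. The example below assumes the z-score range from -2 to 2 used in the text and a few hypothetical latitude values; `np.digitize` is used here as a stand-in for what a bucketized column does and is not the lab's implementation.

```python
import numpy as np

# Assumed z-score range and resolution, matching the example in the text
resolution = 0.3
boundaries = np.arange(-2, 2, resolution)  # -2.0, -1.7, -1.4, ..., 1.9

# Hypothetical normalized latitudes
toy_latitudes = np.array([-2.5, -2.0, 0.0, 3.0])

# For each value, np.digitize returns the index of the bucket it falls into:
# 0 means below the first boundary, len(boundaries) means at or above the last one
bucket_ids = np.digitize(toy_latitudes, boundaries)

print(len(boundaries))
print(bucket_ids)
```

With `n` boundaries there are `n + 1` buckets, because the first and the last bucket are open-ended. This also matters for the lab's code: wrapping the minimum and the maximum in `int()` truncates them toward zero, so the most extreme latitudes and longitudes land in those two open-ended buckets.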

### **2.2. Feature crossing**
After creating the buckets for latitude and longitude, we don't treat them as independent columns. Instead, we combine these buckets into a new feature that represents the geographic location. This process is called feature crossing. Rather than considering `latitude` and `longitude` separately, we combine them to form a single feature that captures the interaction between these two variables. This combination allows us to handle geographic position as a unified entity rather than as two separate measurements, making it easier for the model to learn how location affects house prices.

For our purpose, we use the bucketized columns to create the crossed feature `latitude_x_longitude`.

In [None]:
# Crossing features
latitude_x_longitude = tf.feature_column.crossed_column([latitude, longitude], hash_bucket_size=100)

Instructions for updating:
Use `tf.keras.layers.experimental.preprocessing.HashedCrossing` instead for feature crossing when preprocessing data to train a Keras model.


Here the `hash_bucket_size` parameter specifies the number of hash buckets used to represent the crossed feature. Hashing maps every combination of a latitude bucket and a longitude bucket into a fixed number of categories. This lets us handle a potentially large number of combinations without creating an unwieldy number of features, at the cost that different combinations may occasionally share the same hash bucket.

Once we have the crossed feature, we create an indicator column `crossed_feature`. This represents the crossed feature as a one-hot encoded vector, which is suitable for feeding into our model. Finally, we add this crossed feature to the `feature_columns` list, which we previously declared empty.

In [None]:
# Adding the crossed features to the list
crossed_feature = tf.feature_column.indicator_column(latitude_x_longitude)
feature_columns.append(crossed_feature)  

Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
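
The combination of hashing and one-hot encoding can be illustrated without TensorFlow. In the sketch below, `hashed_cross()` and `one_hot()` are hypothetical helpers, and `zlib.crc32` is only a deterministic stand-in: TensorFlow uses its own hash function, so the actual bucket ids will differ.

```python
import zlib
import numpy as np

HASH_BUCKET_SIZE = 100  # same value as hash_bucket_size in the lab

def hashed_cross(lat_bucket, lon_bucket, num_bins=HASH_BUCKET_SIZE):
    """Map a (latitude bucket, longitude bucket) pair to one of num_bins ids."""
    # crc32 is a stand-in chosen for determinism; TensorFlow uses its own hash
    key = f"{lat_bucket}_x_{lon_bucket}".encode("utf-8")
    return zlib.crc32(key) % num_bins

def one_hot(index, size=HASH_BUCKET_SIZE):
    """Return a vector of zeros with a single 1.0 at the given index."""
    vector = np.zeros(size)
    vector[index] = 1.0
    return vector

# Hypothetical bucket ids for one neighborhood
crossed_id = hashed_cross(lat_bucket=3, lon_bucket=7)
crossed_vector = one_hot(crossed_id)

print(crossed_id, crossed_vector.shape)
```

Whatever the number of latitude and longitude buckets, the resulting vector always has `HASH_BUCKET_SIZE` entries, which is what keeps the input size of the model fixed.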


We then add the remaining features, `median_income` and `population`, to the list. These are used as plain numeric columns, without bucketing or crossing.

In [None]:
# Adding median income to the list
median_income = tf.feature_column.numeric_column("median_income")
feature_columns.append(median_income)

# Adding population to the list
population = tf.feature_column.numeric_column("population")
feature_columns.append(population)

Now that we have our list of features, we convert it into a single dense tensor that can be fed into our deep neural network. This step is crucial because neural networks expect their input to be in the form of a dense matrix (or tensor), where each input feature is represented as a continuous value or a one-hot encoded vector.

We use the `DenseFeatures` layer from `keras` to combine all the features into a single dense tensor. This tensor is what the neural network will use as input.

In [None]:
# Converting the list of features into a layer
my_feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

AttributeError: module 'keras._tf_keras.keras.layers' has no attribute 'DenseFeatures'

## **3. Building the neural network**
We can finally build our neural network.

### **3.1. Defining functions**
Since we do not have a built-in function for visualization, we need to create one, called `plot_curve()`. One crucial plot we require is the curve showing the evolution of a series of metrics over the epochs, since the behavior of a neural network changes as training proceeds. Typically we might plot the log loss, also known as cross-entropy loss.

Our `plot_curve()` function will take as an input the number of epochs, the history of the model and the list of metrics to be considered.


In [None]:
# Defining the plotting function
def plot_curve(epochs, hist, list_of_metrics):
    """Plot a curve of one or more classification metrics vs epoch."""
    plt.figure()
    plt.xlabel('Epoch')
    plt.ylabel('Value')

    for m in list_of_metrics:
        x = hist[m]
        plt.plot(epochs[1:], x[1:], label = m)

    plt.legend()

Let's now define a function that allows us to create the model. Our `create_model()` function will take as an input the learning rate, the previously built input layer `feature_layer` and the list of metrics to be considered. We first initialize our model through the `Sequential()` constructor from the `models` module of `keras`.

Then we add the input layer and after that we add a single dense layer with the sigmoid function as the activation function. In `keras` it is quite straightforward to specialize the model because we call the `add()` method each time we need to specify a certain characteristic.

Finally we compile the model. We need to specify the optimizer, the loss and the evaluation metrics. The standard loss for binary classification with a sigmoid output is `BinaryCrossentropy()`, which is the one we use here.

In [None]:
# Defining the model
def create_model(my_learning_rate, feature_layer, my_metrics):
    """Create and compile a simple neural network model."""

    # Simple model
    model = tf.keras.models.Sequential()

    # Adding the layer containing the feature columns
    model.add(feature_layer)

    # Adding a single layer
    model.add(tf.keras.layers.Dense(units = 1, activation = tf.sigmoid))

    # Building the layers into a model tensorflow can execute
    model.compile(optimizer = tf.keras.optimizers.RMSprop(learning_rate = my_learning_rate),
                  loss = tf.keras.losses.BinaryCrossentropy(),
                  metrics = my_metrics)

    return model
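
To see what the single sigmoid unit and the binary cross-entropy loss compute, here is a minimal `numpy` sketch. The helper names, the scores and the labels are hypothetical, and the clipping constant `eps` is an assumption made here to keep the logarithms finite, not a statement about the internals of `keras`.

```python
import numpy as np

def sigmoid(z):
    """Squash a real-valued score into the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps keeps the logarithms finite."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    per_sample = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(per_sample))

# Hypothetical raw scores (weighted sum of the inputs plus bias) and true labels
scores = np.array([0.0, 2.0, -2.0])
labels = np.array([1.0, 1.0, 0.0])

probs = sigmoid(scores)
loss = binary_cross_entropy(labels, probs)

print(probs)
print(loss)
```

A score of 0 maps to a probability of 0.5, and the loss shrinks as the predicted probabilities move toward the true labels.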

To train our neural network model, it's good practice to define a structured training function rather than using inline scripting. Training neural networks is more complex than training other algorithms and often requires more than just one line of code. Our `train_model()` function will take as an input the model created using the previous function, the training dataset, the number of epochs, the batch size (the number of samples used in each training batch) and the label name (the target output).

In `tensorflow`, the training dataset needs to be split into features and labels. Unlike other libraries that accept dataframe structures directly, here we pass separate structures for the features and the labels. Therefore, we need to extract the features and the label from the training dataset.

When fitting the model, we want to keep track of the training history. The fitting process uses the features as the input and the label column as the output. Additional parameters include the number of epochs and the batch size. We also set the `shuffle` argument to `True`, meaning that the data rows will be shuffled again before each epoch, when the batches for training are created. To extract the epoch information, we access the `epoch` attribute of the `history` object returned by the fitting process.

Since we prefer working with `pandas` dataframes for visualization, we convert the `history` dictionary into the dataframe `hist`.

In [None]:
# Defining the training function
def train_model(model, dataset, epochs, batch_size, label_name):
    """Feed a dataset into the model in order to train it."""

    # Splitting dataset into features and label
    features = {name:np.array(value) for name, value in dataset.items()}
    label = np.array(features.pop(label_name))

    # Storing fitting results
    history = model.fit(x = features, y = label, batch_size = batch_size, epochs = epochs, shuffle = True)

    # Getting useful details for plotting the loss curve
    epochs = history.epoch

    # Converting the history dictionary to a pandas dataframe
    hist = pd.DataFrame(history.history)

    return epochs, hist
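
The split into features and label performed inside `train_model()` can be tried in isolation. The sketch below uses a hypothetical three-row dataframe `toy_dataset` (made-up values) to show the dictionary of arrays that the function builds and how `pop()` separates the label from it.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset with two features and the binary label
toy_dataset = pd.DataFrame({
    "median_income": [0.5, -1.2, 0.3],
    "population": [1.1, 0.0, -0.7],
    "median_house_value_is_high": [1.0, 0.0, 0.0],
})

# One NumPy array per column, keyed by column name
toy_features = {name: np.array(value) for name, value in toy_dataset.items()}

# pop() removes the label from the dictionary and returns it
toy_label = np.array(toy_features.pop("median_house_value_is_high"))

print(sorted(toy_features.keys()))
print(toy_label)
```

After the call to `pop()`, the label is no longer among the features, so the model cannot accidentally use it as an input.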

### **3.2. Defining parameters and metrics**
When training the neural network, several parameters need to be specified. Grid search is a common method for tuning these parameters: we define a set of candidate values for each parameter, which together form a grid of possible combinations, and we evaluate the model on each combination. However, for simplicity and to avoid prolonged training times, we fix the values of the learning rate, the number of epochs and the batch size.

In [None]:
# Hyperparameters
learning_rate = 0.001
epochs = 20
batch_size = 100
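
Although the lab fixes these values, the grid mentioned above is easy to enumerate. The sketch below builds it with `itertools.product`; the candidate values other than 0.001, 20 and 100 are hypothetical, and the training and scoring of each combination is left out.

```python
from itertools import product

# Hypothetical candidate values; only 0.001, 20 and 100 are used in this lab
learning_rates = [0.001, 0.01]
epoch_counts = [20, 50]
batch_sizes = [100, 500]

# Cartesian product of the candidate values: one dictionary per combination
grid = [
    {"learning_rate": lr, "epochs": n_epochs, "batch_size": bs}
    for lr, n_epochs, bs in product(learning_rates, epoch_counts, batch_sizes)
]

print(len(grid))
print(grid[0])
```

The number of combinations is the product of the list lengths, which is why grid search quickly becomes expensive when each combination requires a full training run.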

We also specify the label of the outcome we are interested in and the classification threshold.

In [None]:
label_name = 'median_house_value_is_high'
classification_threshold = 0.50

Finally we define the list of metrics we are interested in, in order to evaluate the model from different points of view.

In [None]:
# List of metrics
metrics = [tf.keras.metrics.BinaryAccuracy(name = 'accuracy', threshold = classification_threshold),
           tf.keras.metrics.Precision(name = 'precision', thresholds = classification_threshold),
           tf.keras.metrics.Recall(name = 'recall', thresholds = classification_threshold)]
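
To clarify what the three metrics measure at a given classification threshold, here is a minimal `numpy` sketch. The function `binary_metrics()` and the toy labels and probabilities are hypothetical; the sketch assumes that a probability strictly above the threshold counts as a positive prediction.

```python
import numpy as np

def binary_metrics(y_true, y_prob, threshold=0.5):
    """Return accuracy, precision and recall at the given threshold."""
    y_pred = (y_prob > threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # true negatives
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return accuracy, precision, recall

# Hypothetical labels and predicted probabilities for eight samples
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.6, 0.4, 0.7, 0.2, 0.1, 0.3, 0.05])

accuracy, precision, recall = binary_metrics(y_true, y_prob)
print(accuracy, precision, recall)
```

Because only about a quarter of the neighborhoods are labelled as high, accuracy alone can look good even when many high-value neighborhoods are missed, which is why precision and recall are tracked as well.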

### **3.3. Training the model**
We can finally implement the model using our `create_model()` function and then train it using our `train_model()` function.

In [None]:
# Implementing the model
my_model = create_model(learning_rate, my_feature_layer, metrics)

# Training the model
epochs, hist = train_model(my_model, df_train_norm, epochs, batch_size, label_name)

Finally we plot the metrics using our `plot_curve()` function. As we can see, precision and recall settle after a certain number of epochs, while accuracy reaches a plateau.

In [None]:
# Evaluating metrics
list_of_metrics_to_plot = ['accuracy', 'precision', 'recall']
plot_curve(epochs, hist, list_of_metrics_to_plot)

## **4. Testing the model**
Finally we evaluate our trained model on the test set. Again, we build two separate structures for the features and for the label.

In [None]:
test_features = {name:np.array(value) for name, value in df_test_norm.items()}
test_label = np.array(test_features.pop(label_name))


Then we evaluate our model on the test set, using the same batch size as in the training phase. The call returns the loss followed by the metrics defined above.

In [None]:
my_model.evaluate(x = test_features, y = test_label, batch_size = batch_size)