# Welcome to the DCOM 2023 How Models Learn Hands On Challenge

<img src="dcom23back.jpg" width="800" height="600">

So we've got this fake real estate agency called California DCom and they want our help figuring out how to price houses in California. To do that, we're gonna take a look at some data and figure out what factors influence how much a house costs.

We're gonna be using this dataset called California Housing Data, which is pretty popular for teaching people about this kind of stuff. It's not meant to be super up-to-date or anything, just good for learning.

The dataset has a bunch of columns with info like the median income of people in the area, the average number of rooms and bedrooms in a household, and the population size. Plus, there's the latitude and longitude of each area, which is kinda cool.

The thing we're trying to predict here is the median value of a house in each area, given in hundreds of thousands of dollars. So it's a **regression problem**, meaning we're trying to predict a number instead of just putting things into categories.

Oh, and fun fact - the dataset was made from info gathered in the 1990 US census, with one row of data per "census block group". Basically, that's the smallest geographical unit the U.S. Census Bureau publishes sample data for, typically home to somewhere between 600 and 3,000 people.

And one last thing to keep in mind - some areas in the dataset might have crazy big values for things like number of rooms or bedrooms. That's because those columns are averages per household, so an area with only a few households but lots of empty houses (think vacation resorts) ends up with a huge average.

🎯 The **learning objectives** are: 
1. Gain an understanding of hyperparameters and their role in deep learning.
2. Learn how to evaluate model performance and compare different models using metrics like mean squared error (MSE) and coefficient of determination (R-squared).

Dive into the world of ML, gain valuable experience in hyperparameter tuning and deep learning, and enter your results in the leaderboard. Good luck to all participants!

In [None]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense, LayerNormalization, Dropout
from keras.optimizers import Adam
from keras.callbacks import History
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
import time

In [None]:
california_housing = fetch_california_housing(as_frame=True)
california_housing.frame

# Visualizing house price distributions

So we've got the longitude and latitude that tell us where the districts from the dataset are on a map. And we're thinking maybe we can use that info to figure out if certain spots have really expensive houses or not.

To test that out, we made this scatter plot thingy where the horizontal (x) axis is longitude and the vertical (y) axis is latitude. Each district is a circle, and its size and color depend on how much the houses in that area are worth. It's a pretty cool way to see if there are any trends or patterns in the data.

In [None]:
california_img=mpimg.imread('california.png')
sns.scatterplot(data=california_housing.frame, x="Longitude", y="Latitude",
                size="MedHouseVal", hue="MedHouseVal",
                palette="viridis", alpha=0.5)
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5)
plt.legend(title="MedHouseVal", bbox_to_anchor=(1.05, 0.95),
           loc="upper left")
_ = plt.title("Median house value depending on\n their spatial location")

So if you're not really familiar with California, you might not know that all these data points we're looking at actually make a map of the state. And it's pretty cool because we can see that the houses that are worth the most are all huddled up along the coast where the big cities are, like San Diego, Los Angeles, San Jose, and San Francisco. Guess people really like living by the ocean and in the city, huh?

In [None]:
# Load the dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target

In [None]:
# Split the data into training and testing sets
# You may want to set random_state to an integer of your choice
random_state = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)

In [None]:
# Normalize the features: fit the scaler on the training set only,
# then apply the same transformation to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Challenge: find the hyperparameter values that will provide the closest estimation for the real house prices

## Deep neural networks

So we're gonna be training this thing called a deep neural network - it's a type of machine learning that's actually inspired by the way our own brains work! The network is made up of tons of little nodes called neurons, which work together to analyze patterns and make predictions.

Basically, each neuron takes input from other neurons and uses that input to make its own calculations. Then it spits out an output based on all that input. And when you put all those neuron outputs together, you get the final output of the network.
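
To make that a bit more concrete, here's a tiny sketch of what a single neuron computes, written in plain Python. The numbers are made up, and the helper names `relu` and `neuron_output` are just for illustration (Keras does all of this for us, for whole layers at once).

```python
# A single artificial neuron: a weighted sum of its inputs plus a bias,
# passed through an activation function (ReLU here).
def relu(z):
    return max(0.0, z)

def neuron_output(inputs, weights, bias):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return relu(z)

# Made-up numbers, for illustration only
inputs = [0.5, -1.0, 2.0]
weights = [0.8, 0.2, 0.5]
bias = 0.1

print(neuron_output(inputs, weights, bias))
```

During training, the weights and the bias are the values the network keeps adjusting.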

Neural networks are super helpful when you're dealing with a ton of data and patterns that are just too complex for us humans to figure out on our own. They're used all the time in things like image recognition, natural language processing, and predicting stuff in fields like finance, medicine, and engineering.

Now, it's worth noting that there are plenty of other algorithms out there for regression that might be easier to use and work just as well or even better. But in this challenge, we're gonna focus on this deep neural network thing and learn how to tune it up to make it work better.

## Model Architecture

Okay, so this next part is all about the way we set up our model to predict how much houses in California are worth. We made a function called create_model that creates this neural network thing that does the predicting.

Basically, the neural network is made up of a bunch of layers that are all connected to each other, and each layer uses an activation function called ReLU, which lets the network pick up non-linear patterns. We also use layer normalization, which keeps the values flowing between layers on a similar scale so training is more stable, and dropout, which helps keep the network from overfitting.
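
If layer normalization sounds abstract, here's a rough sketch of the idea in plain Python. It's simplified on purpose: the real Keras layer also learns a scale and an offset, which are left out here, and the function name `layer_norm` and the sample values are made up.

```python
import math

def layer_norm(activations, eps=1e-3):
    """Rescale one sample's activations to zero mean and (almost) unit variance.

    Simplified: the learnable scale and offset of the Keras layer are omitted.
    eps is a small constant that avoids dividing by zero.
    """
    mean = sum(activations) / len(activations)
    var = sum((a - mean) ** 2 for a in activations) / len(activations)
    return [(a - mean) / math.sqrt(var + eps) for a in activations]

hidden = [0.0, 2.0, 4.0, 10.0]   # made-up outputs of one layer for one district
normalized = layer_norm(hidden)
print([round(v, 3) for v in normalized])
```

Whatever the scale of the incoming values, the next layer always sees numbers in a similar range.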

We put all this together and use a loss function called mean squared error to measure how far off our predictions are, and then the Adam optimizer adjusts the network's weights to make that error smaller. And at the end, we have a model that we can use to make predictions about house prices in California. Cool, right?


In [None]:
def create_model(input_shape=(8,), learning_rate=0.1, num_hidden_layers=1,
                 num_neurons_per_layer=32, dropout_prob=0.1):
    model = Sequential()
    # First hidden layer; it also declares the shape of the input features
    model.add(Dense(num_neurons_per_layer, input_shape=input_shape, activation='relu'))
    model.add(LayerNormalization(axis=1))
    model.add(Dropout(dropout_prob))

    # Remaining hidden layers, so the total matches num_hidden_layers
    for _ in range(num_hidden_layers - 1):
        model.add(Dense(num_neurons_per_layer, activation='relu'))
        model.add(LayerNormalization(axis=1))
        model.add(Dropout(dropout_prob))

    # Output layer: a single linear neuron for the predicted house value
    model.add(Dense(1))
    optimizer = Adam(learning_rate=learning_rate)
    model.compile(loss='mean_squared_error', optimizer=optimizer)
    return model

## Hyperparameters

Okay, so you know how when you're cooking something, you might adjust the heat, the amount of salt, or the cooking time to get it just right? Well, in machine learning, we have something kind of similar called hyperparameter tuning.

Basically, when we're building a machine learning model, we have to make some choices about how it's going to work. These choices are called hyperparameters, and they include things like how many layers a neural network should have, how many nodes are in each layer, and how quickly the model should learn from the data.

Hyperparameter tuning is the process of experimenting with different choices for these hyperparameters to try and find the best combination for a particular problem. It's kind of like adjusting the heat or seasoning when you're cooking - you try different settings until you get the best result.

The goal of hyperparameter tuning is to create a machine learning model that is as accurate as possible, while also being efficient enough to work quickly and not use up too many resources. It can be a bit of trial and error, but it's an important part of building a good machine learning model.
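
In this challenge you'll tune things by hand, but the same trial and error can be automated. Here's a minimal sketch of a grid search that tries every combination in a small search space. Heads up: the search space values are just examples, and `validation_loss` is a made-up stand-in (it does not train anything) so the sketch runs instantly. In practice it would train a model with the given settings and return its validation loss.

```python
from itertools import product

# Hypothetical search space; the values are just examples
search_space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "num_hidden_layers": [1, 2, 3],
    "num_neurons_per_layer": [16, 32],
}

def validation_loss(params):
    """Stand-in for 'train a model with these settings and return its
    validation loss'. The formula is invented for illustration."""
    return (abs(params["learning_rate"] - 0.01)
            + abs(params["num_hidden_layers"] - 2) * 0.1
            + abs(params["num_neurons_per_layer"] - 32) * 0.001)

best_params, best_loss = None, float("inf")
for values in product(*search_space.values()):
    params = dict(zip(search_space.keys(), values))
    loss = validation_loss(params)
    if loss < best_loss:
        best_params, best_loss = params, loss

print(best_params, best_loss)
```

The number of combinations grows fast (3 x 3 x 2 = 18 here), which is why tuning can eat up a lot of time and compute.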

## Set the parameters by changing the variables below



In [None]:
# Let's have an idea of how long it took you to find your solution.
start_time = time.time()

### Batch Size: 
The batch size is how many examples the model looks at together before it updates its weights. A bigger batch size usually makes each epoch run faster, but the model gets fewer weight updates per epoch, and very large batches can make it generalize a bit worse.
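
Here's a quick sketch of how the batch size changes the number of weight updates in one epoch. It assumes the 80/20 split used in this notebook on the 20,640 rows of the dataset, which leaves 16,512 training examples. The helper name `updates_per_epoch` is made up.

```python
import math

def updates_per_epoch(num_samples, batch_size):
    # The last batch may be smaller than the others, hence the ceiling
    return math.ceil(num_samples / batch_size)

# Assumed training set size: 80% of 20,640 rows
num_train = 16512
for size in (16, 32, 256, 1024):
    print(size, updates_per_epoch(num_train, size))
```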

In [None]:
batch_size = 32

### Epochs: 
So, an epoch is basically when the network goes through all the training data once. By adjusting the number of epochs, you can decide how many times the network should go through the training data. If you increase the number of epochs, the model can become more accurate, but it can also start to overfit the data.

In [None]:
epochs = 10

### Learning Rate
The learning rate is like the gas pedal for the neural network during training. If the learning rate is high, the network goes full speed ahead and updates the weights more aggressively, which speeds things up but can make it overshoot good solutions or even fail to converge. On the other hand, if the learning rate is low, the network takes it slow and steady, making more conservative updates to the weights, which is more stable but also slower to train, and it may not get very far in a small number of epochs. So it's a trade-off between speed and stability, and you have to find the sweet spot that works best for your problem.
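
You can see this trade-off on a toy problem. The sketch below runs plain gradient descent (not Adam, to keep it short) on the function f(w) = w², whose minimum is at w = 0. The function name `gradient_descent` and the three learning rates are just picked for illustration.

```python
def gradient_descent(learning_rate, steps=20, w=5.0):
    """Minimize f(w) = w**2, whose gradient is 2*w and whose minimum is w = 0."""
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

for lr in (0.01, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

After 20 steps, the smallest rate has barely moved away from the starting point, the middle one is close to 0, and the largest one has bounced further and further away from the minimum.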

In [None]:
learning_rate = 0.1

### Number of Hidden Layers: 
Hidden layers are kind of like secret layers of neurons inside a neural network that don't talk directly to the input or output layers. The more of these hidden layers there are, the more complex patterns the network can learn, but it can also make it harder to train and more likely to overfit. So, adding more hidden layers can make the network stronger, but also riskier.

In [None]:
num_hidden_layers = 3

### Number of Neurons per Layer: 
The number of neurons in a hidden layer determines how much the network can learn and how complex it can be. The more neurons in a layer, the more powerful the network is, but it can also cause overfitting.
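
One way to get a feel for "how powerful" a network is: count its trainable parameters. Here's a small sketch that counts the weights and biases of the fully connected layers only (the layer normalization layers add a few more, which are ignored here). The helper name `count_dense_parameters` is made up, and in the notebook `model.summary()` reports the real numbers.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases of fully connected layers.

    layer_sizes lists the number of units per layer, starting with the
    number of input features and ending with the output layer.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix + one bias per neuron
    return total

# 8 input features, three hidden layers, 1 output
print(count_dense_parameters([8, 16, 16, 16, 1]))
print(count_dense_parameters([8, 64, 64, 64, 1]))
```

Going from 16 to 64 neurons per layer multiplies the parameter count by more than ten, which is why wider layers can learn more but also overfit more easily.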

In [None]:
num_neurons_per_layer = 16

### Dropout Probability:

Dropout is like a bouncer that randomly kicks out a percentage of neurons in each layer during training. The dropout probability decides what fraction of neurons is switched off at each training step. If we increase the dropout probability, it can make the model less likely to overfit and more robust, but too much dropout keeps the network from learning enough and lowers its accuracy.
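
Here's a rough sketch of what dropout does to one layer's outputs during a training step, in plain Python. It uses the common "inverted dropout" trick, where the surviving values are scaled up so the average stays about the same. The function name `dropout` and the inputs are made up for illustration.

```python
import random

def dropout(activations, drop_prob, rng):
    """Inverted dropout for one training step.

    Each activation is zeroed with probability drop_prob; the survivors are
    scaled by 1 / (1 - drop_prob) so the expected sum stays the same.
    """
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)            # fixed seed so the run is repeatable
activations = [1.0] * 10_000
dropped = dropout(activations, drop_prob=0.1, rng=rng)

zeroed_fraction = dropped.count(0.0) / len(dropped)
print(round(zeroed_fraction, 3), round(sum(dropped) / len(dropped), 3))
```

Dropout is only active during training. When the model makes predictions, all neurons are used.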

In [None]:
dropout_prob = 0.1

## Compile the model

In [None]:
# Build and compile the model
input_shape = (X_train.shape[1],)
model = create_model(input_shape, learning_rate, num_hidden_layers, num_neurons_per_layer, dropout_prob)
model.summary()


## Launch the training

In [None]:
# model.fit returns a History object holding the loss recorded at each epoch
history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
                    validation_data=(X_test, y_test), verbose=1)


# Is your model overfitting?

Overfitting happens when a deep neural network learns the training data too well, but starts to perform poorly on new data. Here's one way to spot it, plus a couple of ways to fight it:

Check the training and validation loss: During training, **keep an eye on the training and validation loss**. If the training loss keeps decreasing while the validation loss starts to increase or level off, it might be a sign of overfitting. This means that the model is memorizing the training data too well and is not doing a good job of generalizing to new data.

Use regularization techniques: Regularization techniques, such as dropout, can help prevent overfitting by adding noise to the network during training. This can force the network to learn more robust features that generalize better to new data.

Get more data: One of the best ways to prevent overfitting is to get more data. This can help the network learn more representative patterns and reduce the chance of memorizing specific examples.

By using these methods, you can analyze and prevent overfitting in your deep neural network, resulting in a more accurate and robust model.
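
The loss check described above can also be done with a few lines of code instead of eyeballing the plot. Here's a minimal sketch with invented loss curves. The helper name `best_epoch` and all the numbers are made up for illustration.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

# Invented loss curves, shaped like a model that starts to overfit
train_loss = [1.20, 0.80, 0.60, 0.50, 0.42, 0.36, 0.31, 0.27]
val_loss = [1.25, 0.85, 0.68, 0.62, 0.60, 0.63, 0.67, 0.72]

epoch = best_epoch(val_loss)
gap = val_loss[-1] - train_loss[-1]
print(f"validation loss bottoms out at epoch {epoch}; final gap = {gap:.2f}")
```

With a real training run, you would pass in `history.history['val_loss']` instead. If the best epoch comes well before the last one and the gap keeps growing, training for more epochs probably won't help.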

In [None]:
# Plot the training and validation loss over epochs
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper right')
plt.show()


# Evaluate the model

When evaluating a regression model, you can use the R2 score and Mean Squared Error (MSE) as metrics to determine its performance.

The **R2 score** is a measure of how well the model fits the data. Its best possible value is 1, which indicates a perfect prediction. A score of 0 means the model performs no better than always predicting the mean value of the target variable, and the score can even be negative when the model does worse than that.

Another commonly used metric is **Mean Squared Error (MSE)**, which measures the average squared difference between the predicted and actual values of the target variable. The lower the MSE, the better the model's performance.

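These metrics allow you to compare different regression models and choose the one that fits the data best. R2 score is a good metric because it tells you what share of the variability in the target the model explains, independent of the target's units, while MSE is good because it penalizes large errors more than small ones. Keep in mind that MSE is expressed in squared units of the target, so here it's in (hundreds of thousands of dollars) squared.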
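
If you're curious what's behind these two numbers, here's a small sketch that computes them by hand on made-up values. The notebook itself uses `mean_squared_error` and `r2_score` from scikit-learn, and the helper names `mse` and `r2` below are just for illustration.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # model's squared error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # error of "always predict the mean"
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]        # made-up house values (x $100,000)
good = [1.1, 1.9, 3.2, 3.8]
mean_only = [2.5, 2.5, 2.5, 2.5]     # always predict the mean
bad = [4.0, 3.0, 2.0, 1.0]           # worse than predicting the mean

for name, preds in [("good", good), ("mean", mean_only), ("bad", bad)]:
    print(name, round(mse(y_true, preds), 3), round(r2(y_true, preds), 3))
```

The last case shows that a model doing worse than "always predict the mean" ends up with a negative R2.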

In [None]:
preds = model.predict(X_test)

print('MSE on test data: %.3f' % (
        mean_squared_error(y_test, preds)))
print('R^2 score on test data: %.3f' % (
        r2_score(y_test, preds)))
print("Your execution time was %s seconds" % (time.time() - start_time))

Hey, just so you know, there are other ways to check if your model is actually solving the problem it's supposed to solve. Even if the metrics we talked about earlier seem okay, it's a good idea to check the residual plot to see if the errors are distributed evenly across all possible house prices. And keep in mind that sometimes the dataset just doesn't have enough information to make accurate predictions about median house prices.
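
A residual is simply the actual value minus the predicted value. As a numeric stand-in for the residual plot, here's a minimal sketch that averages the residuals per price range. The values are invented, and the helper name `mean_residual_by_bin` and the bin edges are assumptions for illustration.

```python
def mean_residual_by_bin(y_true, y_pred, bin_edges):
    """Average (actual - predicted) within each range of actual values."""
    summary = {}
    for low, high in zip(bin_edges, bin_edges[1:]):
        residuals = [t - p for t, p in zip(y_true, y_pred) if low <= t < high]
        if residuals:
            summary[(low, high)] = sum(residuals) / len(residuals)
    return summary

# Invented values (x $100,000): fine for cheap houses, too low for expensive ones
y_true = [0.8, 1.2, 1.9, 2.4, 3.1, 3.8, 4.6, 5.0]
y_pred = [0.9, 1.1, 2.0, 2.3, 2.8, 3.3, 3.7, 3.9]

summary = mean_residual_by_bin(y_true, y_pred, [0, 2, 4, 6])
for price_range, avg in summary.items():
    print(price_range, round(avg, 2))
```

If the averages drift away from zero as prices go up, like in this made-up case, the model is systematically underestimating expensive areas, even if the overall MSE looks fine.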

# Submit your results 🏁

Enter the R2 score and the MSE from above in the leaderboard:
https://sap-my.sharepoint.com/:l:/p/anderson_santana_de_oliveira/FLIQ2iJMYj1CkQFq2hp-X7oBahYvlAAuO5IwcEqppqulSg?e=7BHEwK


# Optional: Predict Prices around Palo Alto

In [None]:
# Delimiting a zone around Palo Alto
xmin, xmax = -122.5, -121.5
ymin, ymax = 37.2, 38.2

# Filtering the dataframe to keep only the districts around Palo Alto
df = california_housing.frame
palo_alto_df = df.loc[(df['Longitude'] >= xmin) & (df['Longitude'] <= xmax) & (df['Latitude'] >= ymin) & (df['Latitude'] <= ymax)]

In [None]:
# There are 3.7k entries in this perimeter
palo_alto_df

In [None]:
# Let's randomly select a handful of entries to predict
# (.copy() so we can safely add a column to this subset later)
palo_alto_df = palo_alto_df.sample(5).copy()

In [None]:
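# Remember we have to scale the features before we predict the price.
# Reuse the scaler fitted earlier (transform, not fit_transform), so these
# rows are scaled exactly like the data the model was trained on
palo_alto_input = scaler.transform(palo_alto_df.drop('MedHouseVal', axis=1).to_numpy())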

In [None]:
# Let's save the predictions in the Palo Alto dataframe
# (predict returns one column per output, so flatten it to a 1-D array)
palo_alto_df['Predictions'] = model.predict(palo_alto_input).ravel()

In [None]:
california_img=mpimg.imread('california.png')
sns.scatterplot(data=palo_alto_df, x="Longitude", y="Latitude",
                size="MedHouseVal", hue="MedHouseVal",
                palette="viridis", alpha=0.5)
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5)
plt.legend(title="MedHouseVal", bbox_to_anchor=(1.05, 0.95),
           loc="upper left")
_ = plt.title("Real Median house value for 5 instances in Palo Alto")

In [None]:
california_img=mpimg.imread('california.png')
sns.scatterplot(data=palo_alto_df, x="Longitude", y="Latitude",
                size="Predictions", hue="Predictions",
                palette="viridis", alpha=0.5)
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5)
plt.legend(title="Predicted Prices", bbox_to_anchor=(1.05, 0.95),
           loc="upper left")
_ = plt.title("Predicted median house value for 5 instances in Palo Alto")