# Neural Network - Bonus: Data Detective

This notebook is optional. Work through it if you want more practice and want to learn to spot a pitfall that almost every data scientist runs into at least once.

The notebook is a variation of project 3 on regression for the `diamonds` dataset.

### Preparation

In [None]:
import pandas as pd
import tensorflow as tf
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import Adam

Data loading:

In [None]:
# Load the dataset
data = pd.read_csv('diamonds.csv')

# Define X by dropping the 'price' column
X = data.drop('price', axis=1)

# Keep only numerical columns in X
X = X.select_dtypes(include=['number'])

# Define y as the 'price' column
y = data['price']

### Train Test split
**Exercise:** Split the dataset into training and testing sets.

In [None]:
'''
  Hint:
    1) Use the train_test_split function from the sklearn.model_selection module to split the dataset into training and testing sets.
    2) Set the test_size parameter to 0.2 to allocate 20% of the data for testing.
    3) Use random_state=1 to ensure the split is reproducible.
'''

# X_train, X_test, y_train, y_test = ...
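
To illustrate what the hint describes, here is a minimal sketch on synthetic stand-in data rather than the diamonds dataset. The names `X_demo`, `y_demo`, and the `_tr`/`_te` variables are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the diamonds dataset): 10 samples, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Features and targets are shuffled together, so each row of `X_tr` still belongs to the matching entry of `y_tr`.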

### Transformation
**Exercise:** Scale the data using a `StandardScaler`.

In [None]:
'''
  Hint:
      1) To standardize the features, use the StandardScaler class from the sklearn.preprocessing module.
      2) Create an instance of StandardScaler and use the fit_transform() method on the training data to compute and apply the scaling.
      3) Use the transform() method to apply the same scaling to the test data, ensuring consistency across both datasets.
'''

# Initialize the scaler
# ...

# Fit the scaler on the training data and transform both training and test data
# ...
# X_train_scaled = ...
# X_test_scaled = ...
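
The fit-on-train, transform-on-test pattern from the hint can be sketched on a tiny made-up array. The `_demo` names are illustrative only and are not part of the exercise.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the diamonds dataset)
X_tr_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_te_demo = np.array([[4.0, 40.0]])

scaler_demo = StandardScaler()
# fit_transform learns mean and standard deviation from the training rows only
X_tr_demo_scaled = scaler_demo.fit_transform(X_tr_demo)
# transform reuses those training statistics; the test rows never influence them
X_te_demo_scaled = scaler_demo.transform(X_te_demo)

print(scaler_demo.mean_)  # per-feature means of the training data: [ 2. 20.]
print(X_te_demo_scaled)   # test row expressed in training standard deviations
```

Fitting the scaler on the test data as well would let information from the test set leak into preprocessing, which is why only `transform()` is applied there.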

### Neural network architecture

It is time to define the neural network model using TensorFlow's Keras API to perform regression on the diamond dataset.

* **Input Layer:**
A dense layer with ***50*** neurons and the *ReLU* activation function. Set `input_shape` to the number of features in `X_train` so that the model knows the shape of the input data.
* **Hidden Layer 1:**
A second dense layer with ***75*** neurons and the ReLU activation function.
* **Hidden Layer 2:**
A third dense layer with ***30*** neurons and the ReLU activation function.
* **Output Layer:**
A dense layer with a **single neuron** (*no activation function*) since this is a regression problem and we want to predict a continuous value (`price` of the diamond).

**Exercise:** Using TensorFlow and Keras, define a neural network according to these specifications.

In [None]:
'''
Hint:
  1) Use tf.keras.Sequential() to create a Sequential model. This allows you to add layers to the model one by one.
  2) Add the first Dense layer with 50 neurons and activation='relu'. Pass input_shape so the model knows the number of features.
  3) Add two hidden Dense layers with 75 and 30 neurons, each with activation='relu'.
  4) Add a final Dense layer with 1 neuron. Since it's a regression task, the output layer has no activation function.
'''

# model = ...

In [None]:
# Print the model summary
model.summary()

### Model Training

Now that we have defined the model, we can prepare the code to train it.

**Exercise:** Implement the necessary steps to train the model. You can look at one of the notebooks from the second block for inspiration. Which key parameters do you have to set?

In [None]:
# ...

### Model Evaluation
Now we are ready to evaluate the model. 

**Exercise:** Predict the prices for both the training and the test data. Evaluate the performance using the MSE and the R² score.

In [None]:
# Evaluate the model
#...
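
As a reminder of how the two metrics behave, here is a small sketch with hand-picked numbers instead of model predictions. The arrays `y_true_demo` and `y_pred_demo` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions (not model output)
y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 8.0])

# MSE: mean of the squared errors, here (1 + 0 + 1) / 3
mse_demo = mean_squared_error(y_true_demo, y_pred_demo)
# R²: 1 - (residual sum of squares / total sum of squares), here 1 - 2 / 8
r2_demo = r2_score(y_true_demo, y_pred_demo)

print(f"MSE: {mse_demo:.3f}, R2: {r2_demo:.3f}")  # MSE: 0.667, R2: 0.750
```

Computing both metrics separately on the training and the test predictions makes it easier to see whether the model generalizes or merely fits the training data.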

**Exercise:** Compare this result with the one you obtained in project 3 (if you did the regression project).

Would you trust the result?

Let us run a small experiment. Recall how we loaded the data:

`data = pd.read_csv('diamonds.csv')`

Now replace that line with

`data = pd.read_csv('diamonds.csv', index_col=0)`

and run the notebook again. 

What has happened? Why did this small change make such a big difference?
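
One way to start investigating is to compare what the two calls actually produce. The sketch below uses a made-up three-row CSV held in memory. Whether `diamonds.csv` has the same layout is an assumption that you can check yourself, for example by printing `data.columns` for both variants.

```python
import io
import pandas as pd

# Hypothetical three-row CSV whose first column is an unnamed row counter
csv_text = ",carat,price\n1,0.23,326\n2,0.21,326\n3,0.29,334\n"

# Default loading: every column in the file becomes a regular data column
df_default = pd.read_csv(io.StringIO(csv_text))
# index_col=0: the first column is used as the row index instead
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))  # ['Unnamed: 0', 'carat', 'price']
print(list(df_indexed.columns))  # ['carat', 'price']
```

With the real data, a natural next step is to look at which columns end up in `X` under each variant and how each of them relates to `y`.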