# Loss Functions

In this exercise, you will compare the effect of different loss functions on a linear regression model.

👇 Let's download a CSV file to use for this challenge and parse it into a DataFrame.

In [1]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/04-Under-the-Hood/loss_functions_dataset.csv")
data.sample(5)

Unnamed: 0,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Glazing Area,Average Temperature
76,0.71,710.5,269.5,220.5,3.5,0.1,12.255
640,0.79,637.0,343.0,147.0,7.0,0.4,43.85
595,0.79,637.0,343.0,147.0,7.0,0.4,44.975
683,0.86,588.0,294.0,147.0,7.0,0.4,31.875
131,0.69,735.0,294.0,220.5,3.5,0.1,12.5


🎯 Your task is to predict the average temperature inside a greenhouse based on its design. Your temperature predictions will help you select the appropriate greenhouse design for each plant, based on their climate needs. 

🌿 You know that plants can handle small temperature variations, but are exponentially more sensitive as the temperature variations increase. 

## 1. Theory 

❓ Theoretically, which Loss function would you train your model on to limit the risk of killing plants?

<details>
<summary> 🆘 Answer </summary>
    
In theory, you would use a Mean Squared Error (MSE) loss function. Because it squares each error, it penalizes large prediction errors much more heavily than small ones and pushes the model away from committing them. This keeps temperature deviations smaller and lowers the risk for plants.

</details>

> YOUR ANSWER HERE
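
To make the intuition concrete, here is a minimal sketch comparing how the squared and absolute losses weight a handful of prediction errors. The error values are hypothetical and chosen for illustration only; they are not taken from the dataset.

```python
# Hypothetical prediction errors in °C (illustrative values only)
errors = [0.5, 1.0, 2.0, 10.0]

squared = [e ** 2 for e in errors]   # per-sample contribution to MSE
absolute = [abs(e) for e in errors]  # per-sample contribution to MAE

mse = sum(squared) / len(errors)
mae = sum(absolute) / len(errors)

# Share of the total loss that comes from the single largest error
share_mse = max(squared) / sum(squared)
share_mae = max(absolute) / sum(absolute)

print(f"MSE = {mse:.4f}, largest error carries {share_mse:.0%} of the loss")
print(f"MAE = {mae:.4f}, largest error carries {share_mae:.0%} of the loss")
```

With these values, the single 10 °C error accounts for a larger share of the squared loss than of the absolute loss, which is why a model trained on MSE works harder to avoid it.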

## 2. Application

### 2.1 Preprocessing

❓ Standardise the features

In [2]:
from sklearn.preprocessing import StandardScaler

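# Features to standardize: every column except the target (last column)
features = data.iloc[:, :-1]

# Fit the scaler and transform the features to zero mean and unit variance
scaler = StandardScaler()
standardized_features = scaler.fit_transform(features)

# Convert the resulting NumPy array back into a DataFrame
standardized_data = pd.DataFrame(standardized_features, columns=features.columns)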
standardized_data.head()

Unnamed: 0,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Glazing Area
0,2.041777,-1.785875,-0.561951,-1.470077,1.0,-1.760447
1,2.041777,-1.785875,-0.561951,-1.470077,1.0,-1.760447
2,2.041777,-1.785875,-0.561951,-1.470077,1.0,-1.760447
3,2.041777,-1.785875,-0.561951,-1.470077,1.0,-1.760447
4,1.284979,-1.229239,0.0,-1.198678,1.0,-1.760447
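
As a sanity check on what standardization does, the sketch below computes z-scores by hand with NumPy. The column values are hypothetical; the formula (subtract the mean, divide by the population standard deviation with `ddof=0`) is the one `StandardScaler` applies to each column.

```python
import numpy as np

# Hypothetical feature column (values chosen for illustration)
surface_area = np.array([514.5, 588.0, 637.0, 710.5, 735.0])

# z-score: subtract the mean, divide by the population standard deviation
# (ddof=0, which is what StandardScaler uses)
z = (surface_area - surface_area.mean()) / surface_area.std(ddof=0)

print(z.round(3))
print("mean:", round(float(z.mean()), 6), "std:", round(float(z.std()), 6))
```

After the transformation the column has a mean of 0 and a standard deviation of 1, so features measured on very different scales contribute comparably to the gradient updates.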


### 2.2 Modeling

In this section, you are going to verify the theory by evaluating models optimized on different Loss functions.

### Least Squares (MSE) Loss

❓ **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **Least Squares Loss** (MSE)



In [8]:
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score
import numpy as np


X = standardized_data
y = data['Average Temperature']

# Linear regression fitted by SGD on the squared-error (MSE) loss
model = SGDRegressor(loss='squared_error', max_iter=1000, tol=1e-3)

# 10-fold cross-validation
scores = cross_val_score(model, X, y, cv=10, scoring='neg_mean_squared_error')

# sklearn returns negated MSE; flip the sign to get positive values
mse_scores = -scores
print("Result CV (MSE) :")
print(mse_scores)
print("\nMean MSE : ", np.mean(mse_scores))

Result CV (MSE) :
[17.92974941  6.31729545  9.1499346   9.85794101  5.98230436  9.74408713
  6.68730637  9.27537654  9.62548707  5.85524217]

Mean MSE :  9.042472409417071


❓ Compute:
- the mean cross-validated R2 score, and save it in the variable `r2`
- the single biggest prediction error in °C across all your folds, and save it in the variable `max_error_celsius`

(Tip: `max_error` is an accepted scoring metric in sklearn)

In [9]:
from sklearn.metrics import make_scorer, max_error
from sklearn.model_selection import cross_val_score

# Mean cross-validated R2
r2 = cross_val_score(model, X, y, cv=10, scoring='r2').mean()

# make_scorer's default (greater_is_better=True) keeps max_error positive
max_error_scorer = make_scorer(max_error)

# Largest absolute prediction error across the 10 folds
max_error_celsius = cross_val_score(model, X, y, cv=10, scoring=max_error_scorer).max()

print("Mean R2 score : ", r2)
print("Max error in Celsius : ", max_error_celsius)

Mean R2 score :  0.897990756400435
Max error in Celsius :  9.849909159125911
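
The sign convention of scorers is easy to get wrong, so here is a small sketch of two equivalent ways to obtain the worst error. It assumes synthetic data and a plain `LinearRegression` (deterministic, so both approaches see identical fits); `X_demo`, `y_demo` and `demo_model` are illustrative names, not part of the challenge.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, max_error
from sklearn.model_selection import cross_val_predict, cross_val_score

# Synthetic stand-in for the greenhouse data (assumption: not the real dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)
demo_model = LinearRegression()

# greater_is_better=False negates the metric so that "higher is better" holds
neg_scorer = make_scorer(max_error, greater_is_better=False)
neg_max_errors = cross_val_score(demo_model, X_demo, y_demo, cv=10, scoring=neg_scorer)
worst_from_scores = (-neg_max_errors).max()

# Alternative: collect out-of-fold predictions and measure the worst residual directly
y_pred = cross_val_predict(demo_model, X_demo, y_demo, cv=10)
worst_from_predictions = np.abs(y_demo - y_pred).max()

print(worst_from_scores, worst_from_predictions)
```

With a negated scorer, taking `.max()` directly would return the smallest error rather than the largest, so the sign has to be flipped first.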


### Mean Absolute Error (MAE) Loss

What if we optimize our model on the MAE instead?

❓ **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **MAE** Loss

<details>
<summary>💡 Hints</summary>

- MAE loss cannot be directly specified in `SGDRegressor`. It must be engineered by adjusting the right parameters

</details>

In [10]:
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score
import numpy as np

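# SGD linear regression with an (approximate) MAE loss:
# the epsilon-insensitive loss is max(0, |error| - epsilon), so a small epsilon
# approximates the absolute error (epsilon=0 would match it exactly)
model_mae = SGDRegressor(loss='epsilon_insensitive', epsilon=0.01, max_iter=1000, tol=1e-3)

# 10-fold cross-validation
mae_scores = cross_val_score(model_mae, X, y, cv=10, scoring='neg_mean_absolute_error')

# sklearn returns negated MAE; flip the sign to get positive values
mae_scores = -mae_scores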

print("Result cv (MAE) :")
print(mae_scores)
print("\nMean MAE : ", np.mean(mae_scores))

Result cv (MAE) :
[3.58039648 2.02803781 2.19757716 2.56629626 1.81520912 2.31851565
 1.92319708 2.30091734 2.38464491 1.7955801 ]

Mean MAE :  2.2910371916480314
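
To see why the epsilon-insensitive loss can stand in for MAE, the sketch below evaluates it by hand on a few hypothetical (target, prediction) pairs. The function `epsilon_insensitive` is an illustrative helper written for this example, not a scikit-learn API.

```python
def epsilon_insensitive(y_true, y_pred, epsilon):
    """Per-sample epsilon-insensitive loss: max(0, |y - y_hat| - epsilon)."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# Hypothetical (target, prediction) pairs in °C
pairs = [(20.0, 20.005), (20.0, 21.5), (20.0, 15.0)]

for y_true, y_pred in pairs:
    abs_err = abs(y_true - y_pred)
    loss_0 = epsilon_insensitive(y_true, y_pred, epsilon=0.0)     # identical to |error|
    loss_001 = epsilon_insensitive(y_true, y_pred, epsilon=0.01)  # ignores errors below 0.01
    print(f"|error|={abs_err:.3f}  eps=0 -> {loss_0:.3f}  eps=0.01 -> {loss_001:.3f}")
```

With `epsilon=0` the loss is exactly the absolute error; with `epsilon=0.01` it differs only by a 0.01 °C dead zone, which is negligible at the scale of the errors in this challenge.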


❓ Compute:
- the mean cross-validated R2 score, and store it in `r2_mae`
- the single biggest prediction error across all your folds, and store it in `max_error_mae`

In [11]:
from sklearn.metrics import make_scorer, max_error
from sklearn.model_selection import cross_val_score

# Mean cross-validated R2 for the model optimized on MAE
r2_mae = cross_val_score(model_mae, X, y, cv=10, scoring='r2').mean()

# make_scorer's default (greater_is_better=True) keeps max_error positive
max_error_scorer = make_scorer(max_error)

# Largest absolute prediction error across the 10 folds
max_error_mae = cross_val_score(model_mae, X, y, cv=10, scoring=max_error_scorer).max()

print("Mean R2 score for MAE : ", r2_mae)
print("Max error for MAE : ", max_error_mae)

Mean R2 score for MAE :  0.8756120371896783
Max error for MAE :  11.227877971582256


## 3. Conclusion

❓Which of the models you evaluated seems the most appropriate for your task?

<details>
<summary> 🆘Answer </summary>
    
Although the mean cross-validated R2 scores of the two models are similar (about 0.90 for MSE versus 0.88 for MAE), the model optimized on MAE is more likely to make an occasional large error (a maximum error of about 11.2 °C versus 9.8 °C here), which increases the risk of killing plants!

    
</details>

> YOUR ANSWER HERE

# 🏁 Check your code and push your notebook

In [12]:
from nbresult import ChallengeResult

result = ChallengeResult(
    'loss_functions',
    r2 = r2,
    r2_mae = r2_mae,
    max_error = max_error_celsius,
    max_error_mae = max_error_mae
)

result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/francoisgirard/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/francoisgirard/code/francoisgirard51/05-ML/04-Under-the-hood/data-loss-functions/tests
plugins: asyncio-0.19.0, typeguard-2.13.3, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 3 items

test_loss_functions.py::TestLossFunctions::test_max_error_order PASSED   [ 33%]
test_loss_functions.py::TestLossFunctions::test_r2 PASSED                [ 66%]
test_loss_functions.py::TestLossFunctions::test_r2_mae PASSED            [100%]



💯 You can commit your code:

git add tests/loss_functions.pickle

git commit -m 'Completed loss_functions step'

git push origin master

