# Loss Functions

In this exercise, you will compare the effects of Loss functions on a `LinearRegression` model.

👇 Import the data from the attached csv file

In [2]:
# YOUR CODE HERE
import pandas as pd

df = pd.read_csv('data.csv')
df

Unnamed: 0,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Glazing Area,Average Temperature
0,0.98,514.5,294.0,110.25,7.0,0.0,18.440
1,0.98,514.5,294.0,110.25,7.0,0.0,18.440
2,0.98,514.5,294.0,110.25,7.0,0.0,18.440
3,0.98,514.5,294.0,110.25,7.0,0.0,18.440
4,0.90,563.5,318.5,122.50,7.0,0.0,24.560
...,...,...,...,...,...,...,...
763,0.64,784.0,343.0,220.50,3.5,0.4,19.640
764,0.62,808.5,367.5,220.50,3.5,0.4,16.710
765,0.62,808.5,367.5,220.50,3.5,0.4,16.775
766,0.62,808.5,367.5,220.50,3.5,0.4,16.545


🎯 Your task is to predict the average temperature inside a greenhouse based on its design. Your temperature predictions will help you select the appropriate greenhouse design for each plant, based on their climate needs. 

🌿 You know that plants can handle small temperature variations, but are exponentially more sensitive as the temperature variations increase. 

## 1. Theory 

❓ Theoretically, which Loss function would you train your model on to limit the risk of killing plants?

<details>
<summary> 🆘 Answer </summary>
    
By theory, you would use a Mean Square Error (MSE) Loss function. It would penalize outlier predictions and prevent your model from committing large errors. This would ensure smaller temperature variations and a lower risk for plants.

</details>

> YOUR ANSWER HERE

## 2. Application

### 2.1 Preprocessing

👇 Scale the features

In [3]:
# YOUR CODE HERE
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X = df.drop(columns='Average Temperature')
y = df['Average Temperature']

scaler.fit(X)
X_scaled = scaler.transform(X)

### 2.2 Modeling

In this section, you are going to verify the theory by evaluating models optimized on different Loss functions.

### Least Squares (MSE) Loss

👇 **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **Least Squares Loss** (MSE)



In [53]:
# YOUR CODE HERE
from sklearn.model_selection import cross_validate
from sklearn.linear_model import SGDRegressor

model = SGDRegressor()
cv = cross_validate(model, X_scaled, y, cv=10, scoring=['r2', 'max_error'])

In [54]:
cv['test_r2'].mean()

0.8973941768554994

In [55]:
cv['test_max_error'].mean()

-9.014279076977825

👇 Compute 
- the mean cross-validated R2 score `r2`
- the single biggest prediction error in °C of all your folds `max_error`?

(Tips: `max_error` is an accepted scoring metrics in sklearn)

In [56]:
# YOUR CODE HERE
r2 = cv['test_r2'].mean()
max_error = cv['test_max_error'].mean()

### Mean Absolute Error (MAE) Loss

What if we optimize our model on the MAE instead?

👇 **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **MAE** Loss

<details>
<summary>💡 Hints</summary>

- MAE loss cannot be directly specified in `SGDRegressor`. It must be engineered by adjusting the right parameters

</details>

In [57]:
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression

model2 = SGDRegressor(loss='epsilon_insensitive')


In [58]:
# YOUR CODE HERE
#model2 = SGDRegressor(loss='mae')
#cv2 = cross_validate(model, X_scaled, y, cv=10, scoring=['max_error', 'r2'])

👇 Compute 
- the mean cross-validated R2 score `r2_mae`
- the single biggest prediction error of all your folds `max_error_mae`?

In [50]:
# YOUR CODE HERE
r2_mae = cv2['test_r2'].mean()
max_error_mae = cv2['test_max_error'].mean()

0.8973719640764303

## 3. Conclusion

❓Which of the models you evaluated seems the most appropriate for your task?

<details>
<summary> 🆘Answer </summary>
    
Although mean cross-validated r2 scores are approximately similar between the two models, the one optimized on a MAE has more chance to make larger mistakes from time to time, increasing the risk of killing plants!

    
</details>

> YOUR ANSWER HERE

# 🏁 Check your code

In [36]:
from nbresult import ChallengeResult

result = ChallengeResult('loss_functions',
    r2 = r2,
    r2_mae = r2_mae,
    max_error = max_error,
    max_error_mae = max_error_mae,                     
)
result.write()
print(result.check())

platform linux -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/bitazaratustra/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/bitazaratustra/code/bitazaratustra/data-challenges/05-ML/04-Under-the-hood/01-Loss-Functions
plugins: anyio-3.4.0
[1mcollecting ... [0mcollected 3 items

tests/test_loss_functions.py::TestLossFunctions::test_max_error_order [31mFAILED[0m[31m [ 33%][0m
tests/test_loss_functions.py::TestLossFunctions::test_r2 [32mPASSED[0m[31m          [ 66%][0m
tests/test_loss_functions.py::TestLossFunctions::test_r2_mae [31mFAILED[0m[31m      [100%][0m

[31m[1m____________________ TestLossFunctions.test_max_error_order ____________________[0m

self = <tests.test_loss_functions.TestLossFunctions testMethod=test_max_error_order>

    [94mdef[39;49;00m [92mtest_max_error_order[39;49;00m([96mself[39;49;00m):
>       [96mself[39;49;00m.assertLess([96mabs[39;49;00m([96mself[39;49;00m.result.max_error), [96mabs[