# Loss Functions

In this exercise, you will compare the effects of different loss functions on a linear regression model.

👇 Import the data from the attached CSV file

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("data.csv")

In [2]:
df

| | Relative Compactness | Surface Area | Wall Area | Roof Area | Overall Height | Glazing Area | Average Temperature |
|---|---|---|---|---|---|---|---|
| 0 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0.0 | 18.440 |
| 1 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0.0 | 18.440 |
| 2 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0.0 | 18.440 |
| 3 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0.0 | 18.440 |
| 4 | 0.90 | 563.5 | 318.5 | 122.50 | 7.0 | 0.0 | 24.560 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 0.64 | 784.0 | 343.0 | 220.50 | 3.5 | 0.4 | 19.640 |
| 764 | 0.62 | 808.5 | 367.5 | 220.50 | 3.5 | 0.4 | 16.710 |
| 765 | 0.62 | 808.5 | 367.5 | 220.50 | 3.5 | 0.4 | 16.775 |
| 766 | 0.62 | 808.5 | 367.5 | 220.50 | 3.5 | 0.4 | 16.545 |


🎯 Your task is to predict the average temperature inside a greenhouse based on its design. Your temperature predictions will help you select the appropriate greenhouse design for each plant, based on its climatic needs.

🌿 You know that plants can handle small temperature variations, but become exponentially more sensitive as those variations grow.

## 1. Theory 

❓ Theoretically, which Loss function would you train your model on to limit the risk of killing plants?

<details>
<summary> 🆘 Answer </summary>
    
In theory, you would use a Mean Squared Error (MSE) loss function. Because it squares each error, it penalizes large prediction errors far more heavily than small ones, which discourages the model from making large mistakes. This keeps temperature deviations smaller and lowers the risk for plants.

</details>

To penalize large errors more heavily than small ones, we use MSE.
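
As a small numeric illustration of that argument, the sketch below compares MAE and MSE on two made-up error profiles. The error values are hypothetical and chosen only to make the arithmetic easy; they are not taken from `data.csv`.

```python
import numpy as np

# Hypothetical prediction errors in °C (illustrative values, not from data.csv)
small_errors = np.array([1.0, 1.0, 1.0, 1.0])   # four modest misses
one_big_error = np.array([0.0, 0.0, 0.0, 4.0])  # one large miss

def mae(errors):
    # Mean absolute error: every degree of error costs the same
    return np.mean(np.abs(errors))

def mse(errors):
    # Mean squared error: cost grows with the square of the error
    return np.mean(errors ** 2)

# Both error profiles have the same MAE
print(mae(small_errors), mae(one_big_error))

# MSE is four times larger when the error is concentrated in one prediction
print(mse(small_errors), mse(one_big_error))
```

A model trained on MAE is indifferent between these two profiles, while a model trained on MSE strongly prefers the first one, which is the safer behaviour for plants.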

## 2. Application

### 2.1 Preprocessing

👇 Scale the features

In [3]:
# All features are numerical, so they can all be scaled

X = df.drop(columns="Average Temperature")
y = df["Average Temperature"]

from sklearn.preprocessing import MinMaxScaler

# Rescale each feature to the [0, 1] range
mmscaler = MinMaxScaler()
mmscaler.fit(X)
X = pd.DataFrame(data=mmscaler.transform(X), columns=X.columns)
X

| | Relative Compactness | Surface Area | Wall Area | Roof Area | Overall Height | Glazing Area |
|---|---|---|---|---|---|---|
| 0 | 1.000000 | 0.000000 | 0.285714 | 0.000000 | 1.0 | 0.0 |
| 1 | 1.000000 | 0.000000 | 0.285714 | 0.000000 | 1.0 | 0.0 |
| 2 | 1.000000 | 0.000000 | 0.285714 | 0.000000 | 1.0 | 0.0 |
| 3 | 1.000000 | 0.000000 | 0.285714 | 0.000000 | 1.0 | 0.0 |
| 4 | 0.777778 | 0.166667 | 0.428571 | 0.111111 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 763 | 0.055556 | 0.916667 | 0.571429 | 1.000000 | 0.0 | 1.0 |
| 764 | 0.000000 | 1.000000 | 0.714286 | 1.000000 | 0.0 | 1.0 |
| 765 | 0.000000 | 1.000000 | 0.714286 | 1.000000 | 0.0 | 1.0 |
| 766 | 0.000000 | 1.000000 | 0.714286 | 1.000000 | 0.0 | 1.0 |
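
To make the scaling step concrete, here is a minimal sketch that reproduces min-max scaling by hand, `(x - min) / (max - min)` per column, and checks it against `MinMaxScaler`. The toy matrix `X_toy` is an assumption for illustration: it reuses three `Relative Compactness` and `Surface Area` pairs visible in the tables above, on the assumption that 0.62/0.98 and 514.5/808.5 are the column extremes.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix: [Relative Compactness, Surface Area] for three rows
X_toy = np.array([[0.98, 514.5],
                  [0.90, 563.5],
                  [0.62, 808.5]])

# Manual min-max scaling, column by column
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_manual = (X_toy - col_min) / (col_max - col_min)

# Same transformation with scikit-learn
X_sklearn = MinMaxScaler().fit_transform(X_toy)

print(X_manual)
print(np.allclose(X_manual, X_sklearn))
```

The middle row comes out as roughly `0.777778` and `0.166667`, matching row 4 of the scaled table.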


### 2.2 Modelling

In this section, you are going to verify the theory by evaluating models optimized on different Loss functions.

### Least Squares (MSE) Loss

👇 **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **Least Squares Loss** (MSE)



In [4]:
from sklearn.model_selection import cross_validate

from sklearn.linear_model import SGDRegressor
sgdr = SGDRegressor(loss='squared_error')  # OLS solved by SGD (named 'squared_loss' before scikit-learn 1.0)

cv = cross_validate(sgdr, X, y, cv=10, scoring=['max_error',
                                                'r2', 
                                                'neg_mean_absolute_error',
                                                'neg_mean_squared_error'])
cv = pd.DataFrame(cv)
cv



| | fit_time | score_time | test_max_error | test_r2 | test_neg_mean_absolute_error | test_neg_mean_squared_error |
|---|---|---|---|---|---|---|
| 0 | 0.006735 | 0.001158 | -9.287883 | 0.774477 | -3.387399 | -18.784843 |
| 1 | 0.006432 | 0.001068 | -9.121657 | 0.899754 | -2.026848 | -6.927465 |
| 2 | 0.004071 | 0.000972 | -9.496657 | 0.884832 | -2.388593 | -10.089618 |
| 3 | 0.004825 | 0.001043 | -9.811958 | 0.874554 | -2.544204 | -10.658519 |
| 4 | 0.006499 | 0.000938 | -9.387168 | 0.925773 | -1.820095 | -6.452464 |
| 5 | 0.006155 | 0.000991 | -9.127379 | 0.892308 | -2.261665 | -10.154525 |
| 6 | 0.006832 | 0.000997 | -9.033465 | 0.924115 | -1.881864 | -6.980851 |
| 7 | 0.00597 | 0.000989 | -9.386187 | 0.913028 | -2.236755 | -9.582432 |
| 8 | 0.010452 | 0.001416 | -8.913273 | 0.891719 | -2.246383 | -10.006127 |
| 9 | 0.005421 | 0.001052 | -8.273607 | 0.934118 | -1.87137 | -6.408185 |
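
For intuition about what "optimized by SGD on a least squares loss" means, the following is a minimal from-scratch sketch on synthetic data. It is not how `SGDRegressor` is implemented (which, among other things, adds regularization and a learning-rate schedule by default); the toy data, the learning rate `lr` and the number of epochs are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noiseless linear data: y = 3*x1 - 2*x2 + 1 (illustrative only)
X_toy = rng.uniform(0, 1, size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 1.0

w = np.zeros(2)  # weights
b = 0.0          # intercept
lr = 0.1         # constant learning rate (assumed value)

for epoch in range(50):
    # Visit the samples one at a time, in a new random order each epoch
    for i in rng.permutation(len(X_toy)):
        error = X_toy[i] @ w + b - y_toy[i]
        # Gradient of 0.5 * error**2 with respect to w and b
        w -= lr * error * X_toy[i]
        b -= lr * error

print(w, b)
```

Each update is proportional to the error itself, so samples with large errors pull the parameters hardest. That is the mechanism by which a squared loss discourages large mistakes.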


👇 Compute
- the mean cross-validated R2 score `r2`
- the single biggest prediction error in °C across all your folds `max_error`

(Tip: `max_error` is an accepted scoring metric in sklearn)

In [5]:
r2 = cv['test_r2'].mean()
r2

0.8914679727710334

In [6]:
max_error = abs(cv['test_max_error'].min())
max_error

9.811957549244884
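
In the cross-validation table above, the `test_max_error` values are negative, which is why the cell takes `abs(...)` of the minimum. As a point of comparison, the sketch below calls the metric function `sklearn.metrics.max_error` directly, which returns the largest absolute error as a positive number. The arrays are illustrative values, and the function is imported under the alias `max_error_metric` so that it does not overwrite the notebook's `max_error` variable.

```python
import numpy as np
from sklearn.metrics import max_error as max_error_metric

# Illustrative temperatures in °C (not from data.csv)
y_true = np.array([18.0, 20.0, 25.0])
y_pred = np.array([18.5, 17.0, 25.0])

# Largest absolute difference between truth and prediction
worst = max_error_metric(y_true, y_pred)
print(worst)

# Equivalent manual computation
print(np.max(np.abs(y_true - y_pred)))
```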

### Mean Absolute Error (MAE) Loss

What if we optimize our model on the MAE instead?

👇 **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **MAE** Loss

<details>
<summary>💡 Hints</summary>

- MAE loss cannot be directly specified in `SGDRegressor`. It must be engineered by adjusting the right parameters

</details>
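
One way to read the hint: the epsilon-insensitive loss ignores errors smaller than `epsilon` and grows linearly beyond that, so setting `epsilon` to 0 leaves the plain absolute error. The sketch below writes that loss out by hand; the helper `epsilon_insensitive` and the sample values are hypothetical and only serve the illustration.

```python
import numpy as np

def epsilon_insensitive(y_true, y_pred, epsilon):
    # Errors smaller than epsilon cost nothing; beyond that the loss grows linearly
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# Illustrative temperatures in °C (not from data.csv)
y_true = np.array([18.0, 20.0, 25.0])
y_pred = np.array([18.5, 17.0, 25.0])

# With epsilon = 1, the 0.5 °C error is ignored
print(epsilon_insensitive(y_true, y_pred, epsilon=1.0))

# With epsilon = 0, the loss is exactly the absolute error
print(epsilon_insensitive(y_true, y_pred, epsilon=0.0))
```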

In [11]:
from sklearn.model_selection import cross_validate

from sklearn.linear_model import SGDRegressor
sgdrmae = SGDRegressor(loss="epsilon_insensitive", epsilon=0)  # with epsilon=0 this loss is the absolute error (MAE)

cvmae = cross_validate(sgdrmae, X, y, cv=10, scoring=['max_error',
                                                'r2', 
                                                'neg_mean_absolute_error',
                                                'neg_mean_squared_error'])
cvmae = pd.DataFrame(cvmae)
cvmae

| | fit_time | score_time | test_max_error | test_r2 | test_neg_mean_absolute_error | test_neg_mean_squared_error |
|---|---|---|---|---|---|---|
| 0 | 0.015713 | 0.003752 | -11.482624 | 0.725709 | -3.720256 | -22.846952 |
| 1 | 0.012502 | 0.003332 | -9.990904 | 0.86618 | -2.002703 | -9.247606 |
| 2 | 0.011593 | 0.003315 | -10.421739 | 0.867641 | -2.265463 | -11.595651 |
| 3 | 0.012039 | 0.003111 | -10.818016 | 0.844125 | -2.594048 | -13.243959 |
| 4 | 0.011354 | 0.002953 | -11.209622 | 0.905039 | -1.943613 | -8.25491 |
| 5 | 0.010206 | 0.002965 | -11.242505 | 0.855285 | -2.593103 | -13.645546 |
| 6 | 0.011221 | 0.002915 | -10.741845 | 0.912465 | -1.989929 | -8.052623 |
| 7 | 0.011393 | 0.003247 | -12.029915 | 0.87041 | -2.472264 | -14.277961 |
| 8 | 0.011392 | 0.003482 | -11.431069 | 0.867793 | -2.430876 | -12.217146 |
| 9 | 0.011527 | 0.003185 | -10.888683 | 0.923596 | -1.882536 | -7.431707 |


👇 Compute
- the mean cross-validated R2 score `r2_mae`
- the single biggest prediction error across all your folds `max_error_mae`

In [12]:
r2_mae = cvmae['test_r2'].mean()

max_error_mae = abs(cvmae['test_max_error'].min())

r2_mae, max_error_mae

(0.8638242127057282, 12.029914987639422)

## 3. Conclusion

❓ Which of the models you evaluated seems the most appropriate for your task?

<details>
<summary> 🆘 Answer </summary>
    
Although the mean cross-validated R2 scores of the two models are similar (about 0.89 for MSE and 0.86 for MAE in the run above), the model optimized on MAE makes larger worst-case errors (about 12.0 °C versus 9.8 °C), which increases the risk of killing plants!

    
</details>

> YOUR ANSWER HERE

# 🏁 Check your code

In [13]:
from nbresult import ChallengeResult

result = ChallengeResult('loss_functions',
    r2 = r2,
    r2_mae = r2_mae,
    max_error = max_error,
    max_error_mae = max_error_mae,                     
)
result.write()
print(result.check())

platform linux -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/cherif/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/cherif/code/cherifbenham/data-challenges/05-ML/04-Under-the-hood/01-Loss-Functions
plugins: anyio-3.4.0
collecting ... collected 3 items

tests/test_loss_functions.py::TestLossFunctions::test_max_error_order PASSED [ 33%]
tests/test_loss_functions.py::TestLossFunctions::test_r2 PASSED [ 66%]
tests/test_loss_functions.py::TestLossFunctions::test_r2_mae PASSED [100%]



💯 You can commit your code:

git add tests/loss_functions.pickle

git commit -m 'Completed loss_functions step'

git push origin master


