# Regularization

Let's improve our understanding of what impacted **Titanic** passengers' chance of survival
- We will use logistic classifiers which are easy to interpret
- Remember we already did it with statsmodels in lecture "Decision Science - Logistic Regression"
- We were using `p-values` & statistical assumptions to detect which features were irrelevant / don't generalize
- This time, we will use `regularization` to detect relevant/irrelevant features based on under/overfitting criteria
- **Our goal is to compare `L1` and `L2` penalties**

## 1. We load and preprocess the data for you

In [71]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Ridge, Lasso, LinearRegression, RidgeClassifier,


In [2]:
#data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_titanic_dataset_encoded.csv")
data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_titanic_dataset_encoded.csv")
# the dataset is already one-hot-encoded
data.head()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,sex_female,class_First,class_Third,who_child,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
0,0,3,22.0,1,0,7.25,0,0,1,0,0,0,1
1,1,1,38.0,1,0,71.2833,1,1,0,0,1,0,0
2,1,3,26.0,0,0,7.925,1,0,1,0,0,0,1
3,1,1,35.0,1,0,53.1,1,1,0,0,0,0,1
4,0,3,35.0,0,0,8.05,0,0,1,0,0,0,1


In [3]:
# We build X and y

y = data["survived"]
X = data.drop(columns="survived")
X.head()

Unnamed: 0,pclass,age,sibsp,parch,fare,sex_female,class_First,class_Third,who_child,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
0,3,22.0,1,0,7.25,0,0,1,0,0,0,1
1,1,38.0,1,0,71.2833,1,1,0,0,1,0,0
2,3,26.0,0,0,7.925,1,0,1,0,0,0,1
3,1,35.0,1,0,53.1,1,1,0,0,0,0,1
4,3,35.0,0,0,8.05,0,0,1,0,0,0,1


In [4]:
# We MinMaxScale our features for you
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(X)
X_scaled = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
X.shape

(714, 12)

In [5]:
X_scaled.describe()

Unnamed: 0,pclass,age,sibsp,parch,fare,sex_female,class_First,class_Third,who_child,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
count,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0
mean,0.618347,0.367921,0.102521,0.071895,0.067719,0.365546,0.260504,0.497199,0.116246,0.182073,0.039216,0.77591
std,0.419125,0.18254,0.185957,0.142215,0.103291,0.481921,0.439217,0.500343,0.320745,0.386175,0.194244,0.417274
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.247612,0.0,0.0,0.015713,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,0.5,0.346569,0.0,0.0,0.030726,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,1.0,0.472229,0.2,0.166667,0.065144,1.0,1.0,1.0,0.0,0.0,0.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [6]:
X_scaled.head()

Unnamed: 0,pclass,age,sibsp,parch,fare,sex_female,class_First,class_Third,who_child,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
0,1.0,0.271174,0.2,0.0,0.014151,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,0.0,0.472229,0.2,0.0,0.139136,1.0,1.0,0.0,0.0,1.0,0.0,0.0
2,1.0,0.321438,0.0,0.0,0.015469,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,0.0,0.434531,0.2,0.0,0.103644,1.0,1.0,0.0,0.0,0.0,0.0,1.0
4,1.0,0.434531,0.0,0.0,0.015713,0.0,0.0,1.0,0.0,0.0,0.0,1.0


## 1.  Logistic Regression without regularization

❓ Rank the features by decreasing order of importance after training a simple **non-regularized** Logistic Regression (i.e. look at the coefficients after fitting)
- Careful: `LogisticRegression` is penalized by default
  - take a look at the [penalty parameter](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to find out how to remove the penalty)
- Increase `max_iter` to a larger number until model converges

<details>
    <summary>Hint</summary>
    <img src="https://wagon-public-datasets.s3.amazonaws.com/data-science-images/05-ML/05-Model-Tuning/model_selection.png" alt="penalizing a regression" width="500" height="600">
</details>

In [161]:
log_reg_no_penalty = LogisticRegression(
    penalty=None,
    max_iter=100)
#tol=0.000002

log_reg_no_penalty.fit(X_scaled, y)

print("Iterations to resolution", log_reg_no_penalty.n_iter_)

pd.Series(
    log_reg_no_penalty.coef_[0], log_reg_no_penalty.feature_names_in_
).sort_values(ascending=False)

array([30], dtype=int32)

sex_female                 2.680469
fare                       1.443161
who_child                  1.358557
class_First                0.738188
embark_town_Cherbourg      0.103875
embark_town_Southampton   -0.193365
embark_town_Queenstown    -0.580103
pclass                    -0.693957
class_Third               -0.829465
parch                     -0.925921
age                       -2.155362
sibsp                     -2.498910
dtype: float64

In [157]:
log_reg_no_penalty.tol

2e-06

In [158]:
# Issue circumvented by changing solver
log_reg_no_penalty = LogisticRegression(penalty=None, solver="saga", max_iter=3000
                                    )

log_reg_no_penalty.fit(X_scaled, y)

pd.Series(
    log_reg_no_penalty.coef_[0], log_reg_no_penalty.feature_names_in_
).sort_values(ascending=False)

sex_female                 2.672109
fare                       1.360842
who_child                  1.336734
class_First                1.244102
pclass                     0.314349
parch                     -0.894680
class_Third               -1.340305
age                       -2.195306
sibsp                     -2.477157
embark_town_Cherbourg     -3.407387
embark_town_Southampton   -3.708719
embark_town_Queenstown    -4.103779
dtype: float64

In [159]:
log_reg_no_penalty.n_iter_

array([2402], dtype=int32)

In [160]:
log_reg_no_penalty.tol

0.0001

❓How do you interpret, in plain English, the value for the coefficient `sex_female`?

<details>
    <summary>Answer</summary>

> "All other things being equal (such as age, ticket class etc...),
being a women increases your log-odds of survival by 2.67 (your coef value)"
    
> "Controling for all other explaining factors available in this dataset,
being a women increases your odds of survival by exp(2.67) = 14"

</details>


In [9]:
# All else being equal, sex=female increases log-odds of surviving by 2.68.

❓ What is the feature that most impacts the chances of survival according to your model?  
Fill the `top_1_feature` list below with the name of this feature

In [35]:
top_1_feature = ["embark_town_Queenstown"]

In [36]:
from nbresult import ChallengeResult
result = ChallengeResult('unregularized', top_1_feature = top_1_feature)
result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/branchedelac/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/branchedelac/code/branchedelac/05-ML/05-Model-Tuning/data-regularization/tests
plugins: anyio-3.6.2, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
[1mcollecting ... [0mcollected 1 item

test_unregularized.py::TestUnregularized::test_top_1 [32mPASSED[0m[32m              [100%][0m



💯 You can commit your code:

[1;32mgit[39m add tests/unregularized.pickle

[32mgit[39m commit -m [33m'Completed unregularized step'[39m

[32mgit[39m push origin master



## 2.  Logistic Regression with an L2 penalty

Let's use a **Logistic model** whose log-loss has been penalized with a **L2** term to figure out the **most important features** without overfitting.  
This is the "classification" equivalent to the "Ridge" regressor

❓ Instantiate a **strongly regularized** `LogisticRegression` and rank its features by importance (look at the coefficients)
- By "strongly regularized" we mean "more than Sklearn's default regularization factor". 
- Sklearn's default values are very useful orders of magnitudes to keep in mind for "scaled features"

In [96]:
log_reg_l2 = LogisticRegression(penalty="l1", solver="saga", max_iter=3000, C=0.8
                                    )

log_reg_l2.fit(X_scaled, y)

pd.Series(
    log_reg_l2.coef_[0], log_reg_l2.feature_names_in_
).sort_values(ascending=False)


sex_female                 2.506551
who_child                  1.052041
class_First                0.689248
embark_town_Cherbourg      0.336729
fare                       0.000000
embark_town_Southampton    0.000000
embark_town_Queenstown    -0.012846
parch                     -0.087071
pclass                    -0.773258
class_Third               -0.788551
age                       -1.707951
sibsp                     -1.750422
dtype: float64

❓ What are the top 2 features driving chances of survival according to your model?  
Fill the `top_2_features` list below with the name of these features

In [97]:
top_2_features = ["sex_female", "who_child"]

#### 🧪 Test your code below

In [98]:
from nbresult import ChallengeResult
result = ChallengeResult('ridge', top_2 = top_2_features)
result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/branchedelac/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/branchedelac/code/branchedelac/05-ML/05-Model-Tuning/data-regularization/tests
plugins: anyio-3.6.2, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
[1mcollecting ... [0mcollected 1 item

test_ridge.py::TestRidge::test_top2 [32mPASSED[0m[32m                               [100%][0m



💯 You can commit your code:

[1;32mgit[39m add tests/ridge.pickle

[32mgit[39m commit -m [33m'Completed ridge step'[39m

[32mgit[39m push origin master



## 2. Logistic Regression with an L1 penalty

This time, we'll use a logistic model whose log-loss has been penalized with a **L1** term to **filter-out the less important features**.  
This is the "classification" equivalent to the **Lasso** regressor

❓ Instantiate a **strongly regularized** `LogisticRegression` and rank its features by importance

In [102]:
log_reg_l1 = LogisticRegression(penalty="l1", solver="saga", max_iter=3000, C=0.6
                                    )
log_reg_l1.fit(X_scaled, y)

pd.Series(
    log_reg_l1.coef_[0], log_reg_l1.feature_names_in_
).sort_values(ascending=False)


sex_female                 2.465885
who_child                  0.995650
class_First                0.671182
embark_town_Cherbourg      0.330848
parch                      0.000000
fare                       0.000000
embark_town_Queenstown     0.000000
embark_town_Southampton    0.000000
pclass                    -0.715022
class_Third               -0.797423
age                       -1.520140
sibsp                     -1.547123
dtype: float64

❓ What are the features that have absolutely no impact on chances of survival, according to your L1 model?  
Fill the `zero_impact_features` list below with the name of these features; you may have to add elements to the list.

- Do you notice how some of them were "highly important" according to the non-regularized model? 
- From now on, we will always regularize our linear models!

In [103]:
zero_impact_features = ["parch", "fare", "embark_town_Queenstown", "embark_town_Southampton"]

#### 🧪 Test your code below

In [104]:
from nbresult import ChallengeResult
result = ChallengeResult('lasso', zero_impact_features = zero_impact_features)
result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/branchedelac/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/branchedelac/code/branchedelac/05-ML/05-Model-Tuning/data-regularization/tests
plugins: anyio-3.6.2, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
[1mcollecting ... [0mcollected 1 item

test_lasso.py::TestLasso::test_zero_impact [32mPASSED[0m[32m                        [100%][0m



💯 You can commit your code:

[1;32mgit[39m add tests/lasso.pickle

[32mgit[39m commit -m [33m'Completed lasso step'[39m

[32mgit[39m push origin master



**🏁 Congratulation! Don't forget to commit and push your notebook**