# Regularization

Let's improve our understanding of what impacted **Titanic** passengers' chances of survival
- We will use logistic classifiers, which are easy to interpret
- Remember we already did it with statsmodels in lecture "Decision Science - Logistic Regression"
- There, we used `p-values` & statistical assumptions to detect which features were irrelevant or did not generalize
- This time, we will use `regularization` to detect relevant/irrelevant features based on under/overfitting criteria
- **Our goal is to compare `L1` and `L2` penalties**

## 0. We load and preprocess the data for you

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_titanic_dataset_encoded.csv")

# the dataset is already one-hot-encoded
data.head()

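| | survived | pclass | age | sibsp | parch | fare | sex_female | class_First | class_Third | who_child | embark_town_Cherbourg | embark_town_Queenstown | embark_town_Southampton |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1 | 3 | 26.0 | 0 | 0 | 7.925 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 3 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |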


In [3]:
# We build X and y

y = data["survived"]
X = data.drop(columns="survived")
X.head()

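| | pclass | age | sibsp | parch | fare | sex_female | class_First | class_Third | who_child | embark_town_Cherbourg | embark_town_Queenstown | embark_town_Southampton |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | 26.0 | 0 | 0 | 7.925 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 3 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |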


In [4]:
# We MinMaxScale our features for you
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(X)
X_scaled = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
X.shape

(714, 12)

## 1.  Logistic Regression without regularization

❓ Rank the features by decreasing order of importance after training a simple **non-regularized** Logistic Regression (i.e. look at the coefficients after fitting)
- Careful: `LogisticRegression` is penalized by default
  - take a look at the [penalty parameter](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to find out how to remove the penalty
- Increase `max_iter` to a larger number until the model converges

In [13]:
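from sklearn.linear_model import LogisticRegression

# penalty=None removes the default L2 penalty
# (scikit-learn >= 1.2; older versions expect the string 'none' instead)
logistic_regression = LogisticRegression(penalty=None, max_iter=10000)
logistic_regression.fit(X_scaled, y)

# One coefficient per feature; the sign gives the direction of the effect
feature_importances = logistic_regression.coef_[0]
importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances})

# Sorted by signed value: strongly negative coefficients (bottom rows)
# are as influential as strongly positive ones (top rows)
ranked_importance_df = importance_df.sort_values(by='Importance', ascending=False)
print(ranked_importance_df)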

                    Feature  Importance
5                sex_female    2.671883
0                    pclass    2.547187
6               class_First    2.360417
4                      fare    1.358812
8                 who_child    1.336356
3                     parch   -0.893820
1                       age   -2.196151
7               class_Third   -2.456891
2                     sibsp   -2.477131
9     embark_town_Cherbourg  -11.221671
11  embark_town_Southampton  -11.523126
10   embark_town_Queenstown  -11.918725




❓How do you interpret, in plain English, the value for the coefficient `sex_female`?

<details>
    <summary>Answer</summary>

> "All other things being equal (such as age, ticket class etc...),
being a woman increases your log-odds of survival by 2.67 (your coef value)"
    
> "Controlling for all other explanatory factors available in this dataset,
being a woman multiplies your odds of survival by exp(2.67) ≈ 14.4"

</details>
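
To make the step from log-odds to odds concrete, here is a small sketch using the `sex_female` coefficient printed above. The 20% baseline survival probability is a made-up value chosen for illustration only; it is not taken from the dataset.

```python
import numpy as np

# Coefficient printed above for `sex_female` in the non-regularized model
coef_sex_female = 2.671883

# exp(coef) is the factor by which the odds are multiplied when the
# (scaled) feature goes from 0 to 1, all other features held constant
odds_ratio = np.exp(coef_sex_female)
print(f"log-odds change: {coef_sex_female:.2f}")
print(f"odds ratio:      {odds_ratio:.1f}")

# Hypothetical baseline probability (assumption, for illustration only)
baseline_p = 0.20
baseline_odds = baseline_p / (1 - baseline_p)

# Apply the odds ratio, then convert the new odds back to a probability
new_odds = baseline_odds * odds_ratio
new_p = new_odds / (1 + new_odds)
print(f"probability moves from {baseline_p:.0%} to {new_p:.0%}")
```

Note that the odds are multiplied, not the probability: the same odds ratio produces a different probability change depending on the baseline.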


❓ What is the feature that most impacts the chances of survival according to your model?  
Fill the `top_1_feature` list below with the name of this feature

In [18]:
top_1_feature = ["embark_town_Queenstown"]
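
Since a coefficient's sign only gives the direction of the effect, one way to find the most impactful feature is to rank by absolute value. A minimal sketch, with the coefficients printed above copied by hand into a hypothetical `coefs` series:

```python
import pandas as pd

# Coefficients copied from the non-regularized fit printed above
coefs = pd.Series({
    "sex_female": 2.671883,
    "pclass": 2.547187,
    "class_First": 2.360417,
    "fare": 1.358812,
    "who_child": 1.336356,
    "parch": -0.893820,
    "age": -2.196151,
    "class_Third": -2.456891,
    "sibsp": -2.477131,
    "embark_town_Cherbourg": -11.221671,
    "embark_town_Southampton": -11.523126,
    "embark_town_Queenstown": -11.918725,
})

# Order the features by coefficient magnitude, keeping the signed values
order = coefs.abs().sort_values(ascending=False).index
ranked = coefs.reindex(order)
print(ranked)

# The first entry is the feature with the largest absolute coefficient
top_feature = ranked.index[0]
print(top_feature)
```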

In [19]:
from nbresult import ChallengeResult
result = ChallengeResult('unregularized', top_1_feature = top_1_feature)
result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/francoisgirard/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/francoisgirard/code/francoisgirard51/05-ML/05-Model-Tuning/data-regularization/tests
plugins: asyncio-0.19.0, typeguard-2.13.3, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 1 item

test_unregularized.py::TestUnregularized::test_top_1 PASSED              [100%]



💯 You can commit your code:

git add tests/unregularized.pickle

git commit -m 'Completed unregularized step'

git push origin master



## 2.  Logistic Regression with an L2 penalty

Let's use a **Logistic model** whose log-loss has been penalized with an **L2** term to figure out the **most important features** without overfitting.  
This is the "classification" equivalent of the "Ridge" regressor

❓ Instantiate a **strongly regularized** `LogisticRegression` and rank its features by importance (look at the coefficients)
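- By "strongly regularized" we mean "more than Sklearn's default regularization factor". In `LogisticRegression`, `C` is the inverse of the regularization strength (default `C=1.0`), so a stronger penalty means a smaller `C`.
- Sklearn's default values are very useful orders of magnitude to keep in mind for "scaled features"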
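
The effect of `C` on an L2-penalized model can be sketched on synthetic data. The dataset below is a stand-in generated with `make_classification`, not the Titanic data, and the grid of `C` values is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a scaled dataset (assumption: not the Titanic data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = MinMaxScaler().fit_transform(X_demo)

# Fit the default (L2-penalized) model for decreasing values of C
norms = {}
for C in [10, 1, 0.1, 0.01]:
    model = LogisticRegression(C=C, max_iter=10_000).fit(X_demo, y_demo)
    # Euclidean norm of the coefficient vector: smaller means more shrinkage
    norms[C] = np.linalg.norm(model.coef_[0])
    print(f"C={C:<5} ||coef|| = {norms[C]:.4f}")
```

The coefficient norm shrinks as `C` decreases, which is why coefficients from a strongly regularized model are much smaller than those from an unpenalized one.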

In [20]:
from sklearn.linear_model import LogisticRegression

# C is the inverse of the regularization strength (default C=1.0),
# so C=0.001 is a much stronger L2 penalty than the default
logistic_regression_l2 = LogisticRegression(penalty='l2', C=0.001, max_iter=10000)
logistic_regression_l2.fit(X_scaled, y)

# Rank the features by signed coefficient; compare absolute values to judge importance
feature_importances_l2 = logistic_regression_l2.coef_[0]
importance_df_l2 = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances_l2})
ranked_importance_df_l2 = importance_df_l2.sort_values(by='Importance', ascending=False)

print(ranked_importance_df_l2)

                    Feature  Importance
5                sex_female    0.086500
6               class_First    0.041793
9     embark_town_Cherbourg    0.023665
8                 who_child    0.015245
4                      fare    0.008876
3                     parch    0.004379
2                     sibsp   -0.001203
10   embark_town_Queenstown   -0.003203
1                       age   -0.005416
11  embark_town_Southampton   -0.021575
0                    pclass   -0.047718
7               class_Third   -0.053644


❓ What are the top 2 features driving chances of survival according to your model?  
Fill the `top_2_features` list below with the name of these features

In [21]:
top_2_features = ["sex_female", "class_Third"]

#### 🧪 Test your code below

In [22]:
from nbresult import ChallengeResult
result = ChallengeResult('ridge', top_2 = top_2_features)
result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/francoisgirard/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/francoisgirard/code/francoisgirard51/05-ML/05-Model-Tuning/data-regularization/tests
plugins: asyncio-0.19.0, typeguard-2.13.3, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 1 item

test_ridge.py::TestRidge::test_top2 PASSED                               [100%]



💯 You can commit your code:

git add tests/ridge.pickle

git commit -m 'Completed ridge step'

git push origin master



## 3. Logistic Regression with an L1 penalty

This time, we'll use a logistic model whose log-loss has been penalized with an **L1** term to **filter out the less important features**.  
This is the "classification" equivalent of the **Lasso** regressor
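
The difference between the two penalties can be sketched on synthetic data: L2 shrinks coefficients towards zero, whereas L1 can set some of them exactly to zero. The dataset, the value `C=0.1` and the `liblinear` solver below are illustrative assumptions, not the settings expected by the challenge.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 3 informative features and 7 pure-noise features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
X_demo = MinMaxScaler().fit_transform(X_demo)

# Same regularization strength, different penalty
l2_model = LogisticRegression(C=0.1, max_iter=10_000).fit(X_demo, y_demo)
# 'liblinear' is one of the solvers that supports the L1 penalty
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty
n_zero_l2 = int(np.sum(l2_model.coef_[0] == 0))
n_zero_l1 = int(np.sum(l1_model.coef_[0] == 0))
print(f"L2: {n_zero_l2} zero coefficients out of {l2_model.coef_.shape[1]}")
print(f"L1: {n_zero_l1} zero coefficients out of {l1_model.coef_.shape[1]}")
```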

❓ Instantiate a **strongly regularized** `LogisticRegression` and rank its features by importance

In [27]:
from sklearn.linear_model import LogisticRegression

# Instantiate a strongly regularized LogisticRegression with an L1 penalty.
# C is the inverse of the regularization strength (default C=1.0); C=0.1 is an
# illustrative choice, and smaller values set more coefficients to zero.
# 'saga' is one of the solvers that supports the L1 penalty.
logistic_regression_l1 = LogisticRegression(penalty='l1', C=0.1, max_iter=10000, solver='saga')
logistic_regression_l1.fit(X_scaled, y)

# Get feature importances (coefficients equal to 0 are features filtered out by L1)
feature_importances_l1 = logistic_regression_l1.coef_[0]

# Create a DataFrame to rank the features by importance
importance_df_l1 = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances_l1})
ranked_importance_df_l1 = importance_df_l1.sort_values(by='Importance', ascending=False)

print(ranked_importance_df_l1)


❓ What are the features that have absolutely no impact on chances of survival, according to your L1 model?  
Fill the `zero_impact_features` list below with the name of these features; you may have to add elements to the list.

- Do you notice how some of them were "highly important" according to the non-regularized model? 
- From now on, we will always regularize our linear models!
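
Rather than reading the zero coefficients off the printed table, they can be extracted from a fitted model. The sketch below uses a hypothetical helper `zero_coefficient_features` and synthetic data with made-up column names; it illustrates the idea and is not the challenge's expected solution.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler


def zero_coefficient_features(model, feature_names):
    """Return the names of the features whose fitted coefficient is exactly 0."""
    coefs = pd.Series(model.coef_[0], index=feature_names)
    return coefs[coefs == 0].index.tolist()


# Synthetic stand-in data with hypothetical column names (not the Titanic features)
X_raw, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=2, n_redundant=0, random_state=1
)
cols = [f"feature_{i}" for i in range(8)]
X_demo = pd.DataFrame(MinMaxScaler().fit_transform(X_raw), columns=cols)

# L1-penalized fit; C=0.1 and 'liblinear' are illustrative choices
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X_demo, y_demo)

zero_features = zero_coefficient_features(l1_model, X_demo.columns)
print(zero_features)
```

The same helper could be applied to a model fitted on `X_scaled`, passing `X.columns` as the feature names.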

In [24]:
zero_impact_features = ["pclass", "age", "sibsp", "parch"]

#### 🧪 Test your code below

In [25]:
from nbresult import ChallengeResult
result = ChallengeResult('lasso', zero_impact_features = zero_impact_features)
result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/francoisgirard/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/francoisgirard/code/francoisgirard51/05-ML/05-Model-Tuning/data-regularization/tests
plugins: asyncio-0.19.0, typeguard-2.13.3, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 1 item

test_lasso.py::TestLasso::test_zero_impact FAILED                        [100%]

__________________________ TestLasso.test_zero_impact __________________________

self = <tests.test_lasso.TestLasso testMethod=test_zero_impact>

    def test_zero_impact(self):
        res_set = set(self.result.zero_impact_features)
        minimal_truth_set = {"embark_town_Queenstown"}
>       self.assertTrue(len(res_set & minimal_truth_set) == 1

**🏁 Congratulations! Don't forget to commit and push your notebook**