# Lasso Regularization

In [62]:
import pandas as pd
data = pd.read_csv('data.csv')
data

,age,children,bmi,sex_female,sex_male,smoker,region_northeast,region_northwest,region_southeast,region_southwest,price_range
0,28,1,37.620,1,0,0,0,0,1,0,cheap
1,28,1,24.320,1,0,0,1,0,0,0,expensive
2,35,1,34.800,1,0,0,0,0,0,1,cheap
3,51,3,36.385,1,0,0,0,1,0,0,expensive
4,20,0,30.590,1,0,0,1,0,0,0,cheap
...,...,...,...,...,...,...,...,...,...,...,...
129,53,1,36.100,0,1,0,0,0,0,1,expensive
130,18,0,30.115,1,0,0,1,0,0,0,expensive
131,40,4,29.300,1,0,0,0,0,0,1,expensive
132,28,0,25.800,1,0,0,0,0,0,1,cheap


- Each row corresponds to the profile of a health insurance client
- The target is the `price_range` category
- The features are client characteristics


👇 Optimize the regularization penalty of a Lasso classification model. According to your optimal model, which features do not influence the charges paid by a client?

We won't do a train/test split for now, simply assuming that it was already done and our data is now the training set.

You can use `RandomizedSearchCV`, `GridSearchCV`, or a combination of both.

Note: not all solvers support all types of penalty. Look at the [documentation for Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)!

<details>
    <summary>Hints</summary>

- [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) uses **Ridge** regularization by default. You just have to tune the hyperparameter `C` = 1/`alpha`

- To use **Lasso**, simply change the `penalty` hyperparameter to `'l1'` and the solver to `'liblinear'` or `'saga'` (not all solvers support all penalty types)

``` python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty='l1', solver='liblinear', C=1/10)
```
</details>
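
As a complement to the hint above, here is a minimal sketch of a randomized search over `C` for an L1-penalized logistic regression. It runs on synthetic data from `make_classification` (an assumption, standing in for the insurance data), and the names `X_demo`, `y_demo` and `search`, as well as the sampling range for `C`, are illustrative choices rather than values taken from the exercise.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the insurance data (assumption, not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

lasso_clf = LogisticRegression(penalty="l1", solver="liblinear")

# Sample C on a log scale: a small C means a strong penalty (C = 1/alpha)
search = RandomizedSearchCV(
    lasso_clf,
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=20,
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

A common way to combine both searches is to use the randomized search to locate a promising order of magnitude for `C`, then run a narrow `GridSearchCV` around it.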


In [63]:
data.columns

Index(['age', 'children', 'bmi', 'sex_female', 'sex_male', 'smoker',
       'region_northeast', 'region_northwest', 'region_southeast',
       'region_southwest', 'price_range'],
      dtype='object')

In [69]:
y = data['price_range']
X = data[['age', 'children', 'bmi', 'sex_female', 'sex_male', 'smoker',
       'region_northeast', 'region_northwest', 'region_southeast',
       'region_southwest']]

In [70]:
# Encode the target
from sklearn.preprocessing import LabelEncoder
# Instantiate the encoder
le = LabelEncoder()
# Fit the encoder on the target and transform it into a 1d array of integers
y = le.fit_transform(y)
# One-column DataFrame, only used to inspect the class balance
target = pd.DataFrame(y, columns=['price range'])

In [71]:
target.value_counts()

price range
1              67
0              67
dtype: int64
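
It helps to know which category each integer stands for, because binary metrics such as recall treat label `1` as the positive class by default. The sketch below uses a toy list of labels (an assumption, not the real column) to show how `LabelEncoder` exposes the mapping through `classes_`; the names `labels`, `encoder` and `mapping` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the price_range column (assumption)
labels = ["cheap", "expensive", "cheap", "expensive"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted; the label at position i is encoded as the integer i
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}

print(mapping)           # {'cheap': 0, 'expensive': 1}
print(encoded.tolist())  # [0, 1, 0, 1]
```

With this encoding, a recall score measures how many of the `expensive` profiles are correctly identified.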

In [73]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
num_cols = ['age', 'children', 'bmi']
# Work on an explicit copy to avoid pandas' SettingWithCopyWarning
X = X.copy()
X[num_cols] = scaler.fit_transform(X[num_cols])
X


,age,children,bmi,sex_female,sex_male,smoker,region_northeast,region_northwest,region_southeast,region_southwest
0,0.217391,0.2,0.670736,1,0,0,0,0,1,0
1,0.217391,0.2,0.231937,1,0,0,1,0,0,0
2,0.369565,0.2,0.577697,1,0,0,0,0,0,1
3,0.717391,0.6,0.629990,1,0,0,0,1,0,0
4,0.043478,0.0,0.438799,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...
129,0.760870,0.2,0.620587,0,1,0,0,0,0,1
130,0.000000,0.0,0.423128,1,0,0,1,0,0,0
131,0.478261,0.8,0.396239,1,0,0,0,0,0,1
132,0.217391,0.0,0.280765,1,0,0,0,0,0,1


In [74]:
# Grid Search
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Log = LogisticRegression(penalty='l1', solver='liblinear')

# Hyperparameter grid
grid = {'C': [0.1, 0.5, 1, 2, 5]}

# Instantiate Grid Search
grid_search = GridSearchCV(Log, grid,
                           scoring='recall',
                           cv=5,
                           n_jobs=-1  # parallelize computation
                          )

# Fit data to Grid Search (pass the 1d array y, not the one-column DataFrame)
grid_search.fit(X, y)
grid_search.best_estimator_


LogisticRegression(C=0.5, penalty='l1', solver='liblinear')

In [75]:
grid_search.best_score_

0.8802197802197803

In [77]:
# Refit the Lasso model with the best C found by the grid search
best_log = LogisticRegression(penalty='l1', solver='liblinear', C=0.5)
best_log.fit(X, y)


LogisticRegression(C=0.5, penalty='l1', solver='liblinear')

In [86]:
# Rank the features by order of importance

In [85]:
# One row per feature, coefficients rounded to one decimal
coefs = pd.DataFrame({
    "coef_lasso": pd.Series(best_log.coef_[0], index=X.columns),
}).round(1)
coefs

,coef_lasso
age,2.3
children,-0.1
bmi,-0.1
sex_female,-0.3
sex_male,-0.4
smoker,1.9
region_northeast,0.0
region_northwest,0.0
region_southeast,-0.5
region_southwest,-0.2


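Age and smoker are the key features: they have the largest coefficients (2.3 and 1.9). `region_northeast` and `region_northwest` round to 0.0 in this table, which makes them the candidates for features that do not influence the price range. Check the unrounded values in `best_log.coef_` to confirm that Lasso set them exactly to zero.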
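
To tell an exactly-zero coefficient from one that is merely small, you can count exact zeros directly. The sketch below does this for both penalties on synthetic data (an assumption, not the insurance dataset); `X_demo`, `y_demo`, `zero_counts` and the value `C=0.1` are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: 3 informative features out of 10 (assumption)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
X_demo = MinMaxScaler().fit_transform(X_demo)

zero_counts = {}
for penalty in ("l1", "l2"):
    model = LogisticRegression(penalty=penalty, solver="liblinear", C=0.1)
    model.fit(X_demo, y_demo)
    # Count coefficients that are exactly zero, without rounding first
    zero_counts[penalty] = int(np.sum(model.coef_[0] == 0))

print(zero_counts)
```

Lasso (`l1`) can drive coefficients exactly to zero, while Ridge (`l2`) typically only shrinks them, so rounding before comparing the two can hide the difference.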

# Ridge Regularization

Redo the same with Ridge regularization. You can simply change the penalty to `'l2'`.

In [79]:
# Grid Search
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Log = LogisticRegression(penalty='l2', solver='liblinear')

# Hyperparameter grid
grid = {'C': [0.1, 0.5, 1, 2, 5]}

# Instantiate Grid Search
grid_search = GridSearchCV(Log, grid,
                           scoring='recall',
                           cv=5,
                           n_jobs=-1  # parallelize computation
                          )

# Fit data to Grid Search (pass the 1d array y, not the one-column DataFrame)
grid_search.fit(X, y)
grid_search.best_estimator_


LogisticRegression(C=0.1, solver='liblinear')

In [80]:
# Refit a Ridge model (note: this uses C=0.5, whereas the grid search above selected C=0.1)
best_log = LogisticRegression(penalty='l2', solver='liblinear', C=0.5)
best_log.fit(X, y)


LogisticRegression(C=0.5, solver='liblinear')

In [81]:
# Rank the features by order of importance
best_log.coef_

array([[ 2.32663749, -0.14117744, -0.09623545, -0.25274511, -0.42040058,
         1.92180987,  0.04494919,  0.03456274, -0.51836081, -0.23429682]])

⚠️ Please push the exercise once you have completed it 🙃

<span style="font-size:2em;">🏁</span>

In [90]:
coefs['coef_ridge'] = pd.Series(best_log.coef_[0], index=X.columns)
coefs = coefs.round(1)
coefs

,coef_lasso,coef_ridge
age,2.3,2.3
children,-0.1,-0.1
bmi,-0.1,-0.1
sex_female,-0.3,-0.3
sex_male,-0.4,-0.4
smoker,1.9,1.9
region_northeast,0.0,0.0
region_northwest,0.0,0.0
region_southeast,-0.5,-0.5
region_southwest,-0.2,-0.2


# Regularization

## Import data

In [None]:
import pandas as pd
data = pd.read_csv('data.csv')
data

- Each row corresponds to the profile of a health insurance client
- The target is the `price_range` category
- The features are client characteristics


We won't do a train/test split for now, simply assuming that it was already done and our data is now the training set.


👇 Create your `X` and `y`. Encode your binary target, and scale your features.

## Lasso

👇 Optimize the regularization penalty of a Lasso classification model. Don't forget to scale your features first: the penalty acts on coefficient magnitudes, so features on different scales are not penalized evenly.

❓ According to your optimal model, which features do not influence the charges paid by a client?

You can use `RandomizedSearchCV`, `GridSearchCV`, or a combination of both.  
Note: not all solvers support all types of penalty. Look at the [documentation for Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)!

<details>
    <summary>Hints</summary>

- [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) uses **Ridge** regularization by default. You just have to tune the hyperparameter `C` = 1/`alpha`

- To use **Lasso**, simply change the `penalty` hyperparameter to `'l1'` and the solver to `'liblinear'` or `'saga'` (not all solvers support all penalty types)

``` python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty='l1', solver='liblinear', C=1/10)
```
</details>
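
One possible way to combine scaling and the penalty search is to put both in a `Pipeline`, so that the scaler is refit on the training folds of each cross-validation split. The sketch below is a hedged illustration on synthetic data (an assumption, not the insurance dataset); `X_demo`, `y_demo`, `pipe` and `search` are illustrative names, and the grid of `C` values is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the insurance data (assumption)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

# Scaling happens inside the pipeline, so it is learned per training fold
pipe = make_pipeline(
    MinMaxScaler(),
    LogisticRegression(penalty="l1", solver="liblinear"),
)

# make_pipeline names each step after its lowercased class name;
# nested parameters are addressed as "<step>__<param>"
param_grid = {"logisticregression__C": [0.1, 0.5, 1, 2, 5]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
```

The fitted classifier is then available as `search.best_estimator_[-1]`, and its `coef_` attribute holds the coefficients to inspect.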


## Ridge

Redo the same with Ridge regularization. You can simply change the penalty to `'l2'`.

## Let's play with `GridSearchCV` a bit more

Grid searches can be computationally expensive. You don't want to run one several times just to measure several performance metrics.

👇 Can you make **one** `GridSearchCV` that logs the `accuracy`, `precision` and `recall` scores at each fit, while keeping `accuracy` as the decision metric used to choose the `best_estimator_` automatically? (Read the docs!)

<details><summary>Hints</summary>

Look at the `refit` argument
</details>

👇 Take some time to understand what's in the `cv_results_` attribute of your fitted `GridSearchCV`.
Can you rank, for instance, your hyperparameter candidates by mean cross-validated `recall` score?
(Turn `cv_results_` into a DataFrame to make things clearer)
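
One possible approach, sketched on synthetic data (an assumption, not the insurance dataset), is to pass a list of scorer names to `scoring` and name the decision metric in `refit`. The variable names `multi_search`, `results` and `ranking` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the insurance data (assumption)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

multi_search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": [0.1, 0.5, 1, 2, 5]},
    scoring=["accuracy", "precision", "recall"],  # all three are logged
    refit="accuracy",  # metric used to pick best_estimator_
    cv=5,
)
multi_search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays with one entry per candidate
results = pd.DataFrame(multi_search.cv_results_)

# Rank candidates by mean cross-validated recall (rank 1 = best)
ranking = results.sort_values("rank_test_recall")[
    ["param_C", "mean_test_accuracy", "mean_test_precision", "mean_test_recall"]
]
print(ranking)
print(multi_search.best_params_)
```

With several metrics, `cv_results_` holds one `mean_test_<metric>`, `std_test_<metric>` and `rank_test_<metric>` column per scorer, while `best_params_` and `best_estimator_` follow the metric named in `refit`.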

⚠️ Please push the exercise once you have completed it 🙃

<span style="font-size:2em;">🏁</span>