# Solvers ⚙️

In this exercise, you will investigate the effects of different `solvers` on `LogisticRegression` models.

👇 Run the code below

In [1]:
import pandas as pd

df = pd.read_csv("data.csv")

df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,sulphates,alcohol,quality rating
0,9.47,5.97,7.36,10.17,6.84,9.15,9.78,9.52,10.34,8.8,6
1,10.05,8.84,9.76,8.38,10.15,6.91,9.7,9.01,9.23,8.8,7
2,10.59,10.71,10.84,10.97,9.03,10.42,11.46,11.25,11.34,9.06,4
3,11.0,8.44,8.32,9.65,7.87,10.92,6.97,11.07,10.66,8.89,8
4,12.12,13.44,10.35,9.95,11.09,9.38,10.22,9.04,7.68,11.38,3


- The dataset consists of different wines 🍷
- The features describe different properties of the wines 
- The target 🎯 is a quality rating given by an expert

## 1. Target engineering

In this section, you are going to transform the ratings into a binary target.

👇 How many observations are there for each rating?

In [2]:
qr=df['quality rating']
qr.unique(), len(qr)

(array([ 6,  7,  4,  8,  3,  1,  2, 10,  5,  9]), 100000)

In [3]:
# Count the observations for each possible rating (0 to 10)
for i in range(11):
    print(i, len(df[df['quality rating'] == i]))

# The ratings are quite balanced!

0 0
1 10090
2 10030
3 9838
4 9928
5 10124
6 9961
7 9954
8 9977
9 9955
10 10143
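
The same per-rating counts can also be obtained in a single call. A minimal sketch, assuming a small hypothetical `ratings` Series that stands in for `df['quality rating']` (it is not the real data):

```python
import pandas as pd

# Hypothetical stand-in for df['quality rating']
ratings = pd.Series([6, 7, 4, 8, 3, 6], name="quality rating")

# value_counts() counts each distinct rating; sort_index() orders the result by rating
counts = ratings.value_counts().sort_index()
print(counts)
```

Unlike the explicit loop, this only lists ratings that actually occur, so a rating with zero observations does not appear.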


👇 Create `y` by transforming the target into a binary classification task where quality ratings below 6 are bad [0], and ratings of 6 and above are good [1]

In [18]:
y = qr
y = y.apply(lambda x: 0 if x<6 else 1)

👇 Check the class balance of the new binary target

In [21]:
y.sum()/len(y)

0.4999
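
The binarization and the balance check can also be written without `apply`. A minimal sketch, assuming a hypothetical `ratings` Series in place of `df['quality rating']`:

```python
import pandas as pd

# Hypothetical stand-in for df['quality rating']
ratings = pd.Series([6, 7, 4, 8, 3])

# A boolean comparison cast to int gives 1 for ratings >= 6 and 0 otherwise
y_demo = (ratings >= 6).astype(int)

# Share of each class; values close to 0.5 indicate a balanced binary target
balance = y_demo.value_counts(normalize=True)
print(y_demo.tolist())
print(balance)
```

The mean of a 0/1 target equals the share of the positive class, which is what `y.sum()/len(y)` computes.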

Create your `X` by scaling the features. This will allow for fair comparison of different solvers.

In [23]:
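from sklearn.preprocessing import MinMaxScaler

# Every column except the last one ('quality rating') is a feature
feature_columns = df.columns[:-1]
X = df[feature_columns]

# Scale each feature to the [0, 1] range; keep `mms` to transform new data later
mms = MinMaxScaler()
mms.fit(X)
X = pd.DataFrame(data=mms.transform(X), columns=feature_columns)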

In [24]:
X

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,sulphates,alcohol
0,0.531348,0.285244,0.265966,0.504968,0.229879,0.363248,0.451878,0.432173,0.557503,0.413523
1,0.576803,0.420113,0.459984,0.343270,0.412348,0.123932,0.442488,0.370948,0.435926,0.413523
2,0.619122,0.507989,0.547292,0.577236,0.350606,0.498932,0.649061,0.639856,0.667032,0.432028
3,0.651254,0.401316,0.343573,0.457995,0.286659,0.552350,0.122066,0.618247,0.592552,0.419929
4,0.739028,0.636278,0.507680,0.485095,0.464168,0.387821,0.503521,0.374550,0.266156,0.597153
...,...,...,...,...,...,...,...,...,...,...
99995,0.332288,0.215695,0.337914,0.363144,0.371555,0.568376,0.287559,0.596639,0.785323,0.427046
99996,0.617555,0.453947,0.465643,0.397471,0.391400,0.458333,0.579812,0.643457,0.473165,0.425623
99997,0.590909,0.520677,0.620049,0.648600,0.341786,0.350427,0.469484,0.472989,0.524644,0.403559
99998,0.357367,0.190320,0.239289,0.390244,0.320838,0.427350,0.647887,0.515006,0.336254,0.459075
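
To make the scaling step concrete, here is a minimal sketch of what `MinMaxScaler` computes with its default settings, assuming a hypothetical toy matrix `X_demo` rather than the wine data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy feature matrix: 3 samples, 2 features
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])

scaler = MinMaxScaler().fit(X_demo)
X_scaled = scaler.transform(X_demo)

# Manual computation of the same rescaling, column by column
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_manual = (X_demo - col_min) / (col_max - col_min)

# New data reuses the fitted scaler; values outside the training range
# land outside [0, 1] because clipping is off by default
new_point = scaler.transform(np.array([[5.0, 5.0]]))
print(X_scaled)
print(new_point)
```

Because the minimum and maximum are learned at `fit` time, any later sample has to go through the same fitted scaler rather than a freshly fitted one.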


## 2. LogisticRegression solvers

👇 Logistic Regression models can be optimized using different **solvers**. Find out:
- Which is the `fastest_solver`?
- What can you say about their respective precision scores?

`solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']`
 
For more information on these 5 solvers, check out [this Stack Overflow thread](https://stackoverflow.com/questions/38640109/logistic-regression-python-solvers-defintions).

In [34]:
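from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

solver = "newton-cg"  # try a single solver first to inspect the per-fold results
model = LogisticRegression(solver=solver)
cv = cross_validate(model, X, y, cv=10, scoring=['precision'])
cv = pd.DataFrame(cv)
cv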

Unnamed: 0,fit_time,score_time,test_precision
0,0.366684,0.011384,0.879263
1,0.375994,0.011701,0.872837
2,0.371884,0.011346,0.878232
3,0.377856,0.011336,0.868964
4,0.37322,0.010418,0.872066
5,0.377454,0.011417,0.877699
6,0.37983,0.011321,0.87868
7,0.377059,0.011377,0.870309
8,0.376533,0.011223,0.878507
9,0.398969,0.011572,0.865811


In [35]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

# Mean fit time and mean precision over the 10 folds, keyed by solver name
fit_time = {}
precision = {}

for solver in solvers:
    model = LogisticRegression(solver=solver)
    cv = cross_validate(model, X, y, cv=10, scoring=['precision'])
    cv = pd.DataFrame(cv)
    fit_time[solver] = cv["fit_time"].mean()
    precision[solver] = cv["test_precision"].mean()


In [36]:
fit_time, precision

({'newton-cg': 0.3727376461029053,
  'lbfgs': 0.4983675479888916,
  'liblinear': 0.33599421977996824,
  'sag': 0.8242790937423706,
  'saga': 1.5173983573913574},
 {'newton-cg': 0.8742370014407493,
  'lbfgs': 0.8742344676126175,
  'liblinear': 0.8743399285518076,
  'sag': 0.8741957220807933,
  'saga': 0.8742370014407493})
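
The fastest solver can also be read off programmatically rather than by eye. A minimal sketch, assuming a hypothetical `fit_time_demo` dict holding the mean fit times printed above, rounded to four decimals:

```python
# Mean fit times in seconds, copied from the output above and rounded
fit_time_demo = {
    'newton-cg': 0.3727,
    'lbfgs': 0.4984,
    'liblinear': 0.3360,
    'sag': 0.8243,
    'saga': 1.5174,
}

# min() with key=dict.get returns the key whose value is the smallest
fastest = min(fit_time_demo, key=fit_time_demo.get)
print(fastest)  # liblinear
```

Timings depend on the machine and on the run, so the ranking between close solvers (here `liblinear` and `newton-cg`) may change from one execution to the next.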

In [38]:
fastest_solver = "liblinear"

<details>
    <summary>☝️ Intuition</summary>

All solvers should produce similar precision scores because our cost function is "easy" enough: it is convex, so any minimum a solver reaches is the global one, and all 5 solvers find it. For very complex cost functions, such as in Deep Learning, different solvers may stop at different values of the loss function.

</details> 
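
This intuition can be checked on a small problem. The sketch below assumes hypothetical synthetic data from `make_classification` (not the wine dataset) and a tight `tol` with a generous `max_iter` so that every solver has time to converge; it compares the training log loss reached by each solver:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data, standardized so that sag/saga converge quickly
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

losses = {}
for solver in ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']:
    # Tight tolerance and a large iteration budget let every solver converge
    model = LogisticRegression(solver=solver, tol=1e-6, max_iter=5000, random_state=0)
    model.fit(X_demo, y_demo)
    losses[solver] = log_loss(y_demo, model.predict_proba(X_demo))

# Gap between the best and the worst final loss across solvers
spread = max(losses.values()) - min(losses.values())
print(losses)
print(spread)
```

Small differences are still expected: `liblinear` also regularizes the intercept, so it minimizes a slightly different objective than the other four solvers.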

###  🧪 Test your code

In [39]:
from nbresult import ChallengeResult

result = ChallengeResult('solvers',
                         fastest_solver=fastest_solver
                         )
result.write()
print(result.check())

platform linux -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/cherif/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/cherif/code/cherifbenham/data-challenges/05-ML/04-Under-the-hood/02-Solvers
plugins: anyio-3.4.0
collecting ... collected 1 item

tests/test_solvers.py::TestSolvers::test_fastest_solver PASSED           [100%]



💯 You can commit your code:

git add tests/solvers.pickle

git commit -m 'Completed solvers step'

git push origin master


## 3. Stochastic Gradient Descent

Logistic Regression models can also be optimized via Stochastic Gradient Descent.

👇 Evaluate a Logistic Regression model optimized via **Stochastic Gradient Descent**. How do its precision score and training time compare to those of the models trained in section 2?


<details>
<summary>💡 Hint</summary>

- If you are stuck, look at the [SGDClassifier doc](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)!

</details>



In [43]:
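from sklearn.linear_model import SGDClassifier

# Caution: the default loss ("hinge") fits a linear SVM. To optimize a logistic
# regression, pass loss="log_loss" (named "log" before scikit-learn 1.1).
# The result below was recorded with the default settings.
model_sgd = SGDClassifier()
cv = cross_validate(model_sgd, X, y, cv=10, scoring=['precision'])
cv = pd.DataFrame(cv)
cv["fit_time"].mean(), cv["test_precision"].mean()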

(0.20905606746673583, 0.8875590953222305)

☝️ The SGD model should have the shortest training time, for similar performance. This is a direct effect of computing each update of the Gradient Descent on a single data point rather than on the full dataset.
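
To illustrate the idea, here is a minimal from-scratch sketch of stochastic gradient descent for logistic regression. It assumes hypothetical toy data and a fixed `learning_rate`; it is not how `SGDClassifier` is implemented internally, which adds regularization and a learning-rate schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: the class is 1 when the two features sum to a positive value
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(float)

w = np.zeros(2)
b = 0.0
learning_rate = 0.1

for epoch in range(5):
    # One epoch visits every sample once, in a fresh random order
    for i in rng.permutation(len(X_demo)):
        p = 1.0 / (1.0 + np.exp(-(X_demo[i] @ w + b)))  # predicted probability
        error = p - y_demo[i]  # gradient of the log loss with respect to the logit
        # Each parameter update uses the gradient of a single sample
        w -= learning_rate * error * X_demo[i]
        b -= learning_rate * error

# Training accuracy of the learned linear decision rule
accuracy = np.mean(((X_demo @ w + b) > 0) == (y_demo == 1))
print(w, b, accuracy)
```

Each update costs the same regardless of the dataset size, which is why this approach scales well to large datasets.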

## 4. Predictions

👇 Use the best model to predict the binary quality (0 or 1) of the following wine. Store your
- `predicted_class`
- `predicted_proba_of_class`

In [41]:
new_data = pd.read_csv('new_data.csv')

new_data

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,sulphates,alcohol
0,9.54,13.5,12.35,8.78,14.72,9.06,9.67,10.15,11.17,12.17


In [42]:
data1 = mms.transform(new_data)
data1

array([[0.53683386, 0.63909774, 0.66936136, 0.37940379, 0.66427784,
        0.35363248, 0.43896714, 0.50780312, 0.64841183, 0.65338078]])

In [60]:
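model_log = LogisticRegression(solver='liblinear')
model_log.fit(X, y)

predicted_class = model_log.predict(data1)[0]
# predict_proba columns follow model_log.classes_ ([0, 1]), so indexing by the
# predicted class returns the probability of that class
predicted_proba_of_class = model_log.predict_proba(data1)[0][predicted_class]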



In [62]:

predicted_class, predicted_proba_of_class

(0, 0.9669923040923084)
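
The link between `predict` and `predict_proba` can be seen on a tiny example. A minimal sketch, assuming hypothetical one-feature toy data: the columns of `predict_proba` follow the order of `model.classes_`, so the probability of the predicted class is found by looking up that class's position.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, class 1 for larger values
X_demo = np.array([[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression(solver='liblinear').fit(X_demo, y_demo)

sample = np.array([[0.1]])
pred = model.predict(sample)[0]
proba_row = model.predict_proba(sample)[0]  # one probability per entry of model.classes_

# Look up the column that corresponds to the predicted class
class_index = list(model.classes_).index(pred)
proba_of_pred = proba_row[class_index]
print(model.classes_, pred, proba_of_pred)
```

For a binary target encoded as 0 and 1, `class_index` equals the predicted class itself, which is why indexing the probability row with the predicted class works in that case.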


# 🏁  Check your code and push your notebook

In [63]:
from nbresult import ChallengeResult

result = ChallengeResult('new_data_prediction',
    predicted_class=predicted_class,
    predicted_proba_of_class=predicted_proba_of_class
)
result.write()
print(result.check())

platform linux -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/cherif/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/cherif/code/cherifbenham/data-challenges/05-ML/04-Under-the-hood/02-Solvers
plugins: anyio-3.4.0
collecting ... collected 2 items

tests/test_new_data_prediction.py::TestNewDataPrediction::test_predicted_class PASSED [ 50%]
tests/test_new_data_prediction.py::TestNewDataPrediction::test_predicted_proba PASSED [100%]



💯 You can commit your code:

git add tests/new_data_prediction.pickle

git commit -m 'Completed new_data_prediction step'

git push origin master
