# Homework 4

(╯°□°)╯︵◓

You've been searching far and wide, but you're still trying to understand the power that's inside. This time round though, you're armed with new weapons: supervised learning algorithms. Pokemons will have no more secrets after you analyse the pokedex!

The data can be found under `pokedex/pokemons.csv`, and is the same as homeworks 1, 2, & 3. Run the cell below to get an overview of the dataset:

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('pokedex/pokemons.csv')
df.head()



Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


## Problem 1

We've learned that Pokemon masters never mix training and battles. Since we are building supervised learning models, we want to _split our data_ into train and test datasets. We are only running a single round of experiments with no hyperparameter optimisation, so we'll skip validation sets this time.

💪 **Task: Split the `df` DataFrame into training and test DataFrames.**
- the split should be 80% training 20% test
- use `random_state=0`
- store your datasets into two variables called `train_df` and `test_df`. No need to save to disk!

In [2]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.20, random_state=0)

In [3]:
def test_split():
    assert "train_df" in globals(), "Can't find train_df, have you used the correct variable name for your train dataset?"
    assert "test_df" in globals(), "Can't find test_df, have you used the correct variable name for your test dataset?"
    assert train_df["Total"].sum() == 275746, "Your dataset split doesn't look quite right. Did you use the correct random_state?"
    print('Success! 🎉')

test_split()

Success! 🎉


In [4]:
# Just in case
train_df.to_csv('pokedex/pokemons_train.csv', index=False)
test_df.to_csv('pokedex/pokemons_test.csv', index=False)
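
As a sanity check on what `train_test_split` does, here is a minimal sketch on a made-up DataFrame (`toy_df` and the `_a`/`_b` names are hypothetical and not part of the homework). It shows the 80/20 sizes and that a fixed `random_state` reproduces the same shuffle.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Pokedex: 50 rows with a single stat column.
toy_df = pd.DataFrame({"HP": np.arange(50)})

# Two splits with the same seed.
train_a, test_a = train_test_split(toy_df, test_size=0.20, random_state=0)
train_b, test_b = train_test_split(toy_df, test_size=0.20, random_state=0)

print(len(train_a), len(test_a))             # 40 10
print(train_a.index.equals(train_b.index))   # True: same seed, same shuffle
```

No row ends up in both sets, which is what keeps the test score an honest estimate.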

## Problem 2

Now that we have split our dataset, we are ready to train. A crucial statistic in pokemon battles is `HP`. This is the amount of damage you have to inflict on your opponent to win the fight, so being able to _predict_ this amount would be an enormous advantage 👊.

💪 **Task: Train a linear regression model which predicts the label `HP`.**
- train the model on your training dataset
- use `Attack`, `Defense`, `Sp. Atk`, `Sp. Def`, and `Speed` as features
- use `HP` as label
- scale the features using standardization before you train the model
- store your trained model in a variable called `reg`

In [7]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression


def to_features(df, feature_list, label_list):
    X = df[feature_list].values
    y = df[label_list].values
    return X, y

feature_list = ['Attack','Defense','Sp. Atk','Sp. Def','Speed']
label_list = 'HP'
X_train, y_train = to_features(train_df, feature_list, label_list)

#Standardization
scaler = StandardScaler()
scaler = scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

#Training model
reg = LinearRegression().fit(X_train_scaled, y_train)

In [8]:
import math

def test_regression():
    assert reg, "Can't find reg, have you used the correct variable name for your model?"
    assert math.isclose(reg.coef_.sum(), 15.46275, rel_tol=1e-6), "Your model parameters don't look quite right"
    print('Success! 🎉')

test_regression()

Success! 🎉


🧠 **Bonus Task: List and describe the main steps that happen during your linear regression model's _training_, i.e. inside of sklearn's `.fit()` method.**


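- Step 1: initialise the model parameters `theta` (randomly, or at zero)
- Step 2: compute the cost function `J(theta)`, the mean squared error between predictions and labels
- Step 3: compute the derivative of the cost function with respect to the parameters, `dJ/dtheta`
- Step 4: move `theta` a small step proportional to the negative of the Step 3 derivative, `-dJ/dtheta`
- Step 5: go back to Step 2 and repeat until the cost stays at the same minimal value, i.e. `dJ/dtheta ~= 0`

Note: these are the steps of gradient descent. sklearn's `LinearRegression.fit()` does not iterate like this; it solves the ordinary least-squares problem directly with linear algebra, which lands on the same minimum of `J`.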
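
The gradient descent steps above can be sketched in a few lines of NumPy. This is an illustration on synthetic data, not sklearn's internals: `LinearRegression` solves least squares directly, so the sketch only checks that both routes reach the same parameters. The data, the `learning_rate` value and the `reg_toy` name are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))   # synthetic features, already on a standard scale
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=200)

# Batch gradient descent on the mean squared error J(theta).
Xb = np.hstack([np.ones((len(X), 1)), X])     # prepend a bias column
theta = np.zeros(Xb.shape[1])                 # Step 1: initialise theta
learning_rate = 0.1                           # hypothetical value, picked by hand
for _ in range(2000):                         # Step 5: repeat
    residual = Xb @ theta - y                 # Step 2: errors behind the cost J
    gradient = 2 * Xb.T @ residual / len(y)   # Step 3: dJ/dtheta
    theta -= learning_rate * gradient         # Step 4: step against the gradient

reg_toy = LinearRegression().fit(X, y)        # direct least-squares solve
print(np.allclose(theta[1:], reg_toy.coef_, atol=1e-6))
print(np.isclose(theta[0], reg_toy.intercept_, atol=1e-6))
```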

🧠 **Bonus Task: Explain the purpose of feature scaling, and why it's a good idea to use it here.**

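The main goal is to bring features that live on different ranges onto the same range while keeping the proportions within each feature. Standardization does this by subtracting each feature's mean and dividing by its standard deviation. Some machine learning algorithms will not work properly without this normalization.
In our case it is a good idea because training minimizes a cost function through its derivatives: when features are on very different scales, gradient-based optimization converges slowly or unstably, and regularization penalizes the coefficients unevenly. For a plain least-squares fit the predictions would be the same either way, but scaling still makes the learned coefficients comparable across features.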
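
To make the standardization concrete, here is a small sketch that compares `StandardScaler` with the formula `(x - mean) / std` applied per column. The four rows reuse the `HP` and `Total` values from the table at the top; the `_toy` and `X_raw` names are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# HP and Total of the first few pokemons: two columns on very different ranges.
X_raw = np.array([[45.0, 318.0],
                  [60.0, 405.0],
                  [80.0, 525.0],
                  [39.0, 309.0]])

scaler_toy = StandardScaler().fit(X_raw)
X_std = scaler_toy.transform(X_raw)

# Standardization by hand: subtract the column mean, divide by the column std.
manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

print(np.allclose(X_std, manual))
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```

After the transform every column has mean 0 and standard deviation 1, whatever its original range.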

## Problem 3

You encounter an unknown pokemon, and it looks very strong. 🙀 Use your `HP` regression model to see if you can take it on!

💪 **Task: Predict the `HP` of an unknown pokemon using your linear regression model.**
- the stats of the unknown pokemon are found below
- predict using your trained model, `reg`
- store the prediction in a variable called `y_predict`


In [16]:
attack = 79
defense = 109
sp_atk = 73
sp_def = 84
speed = 68

X_unknown_pokemon = np.asarray([attack,defense,sp_atk,sp_def,speed]).reshape(1, 5)
X_scaled_unknown_pokemon = scaler.transform(X_unknown_pokemon)
y_predict = reg.predict(X_scaled_unknown_pokemon)

In [17]:
def test_predict_hp():
    expected_prediction = 70.324
    assert y_predict, f"Can't find y_predict, have you used the correct variable name?"
    assert math.isclose(y_predict, expected_prediction, rel_tol=1e-4), f'The prediction should be {expected_prediction}, but your model predicted {y_predict}'
    print('Success! 🎉')
    print(f"The unknown pokemon has predicted HP: {y_predict.item():.1f}")
    return

test_predict_hp()

Success! 🎉
The unknown pokemon has predicted HP: 70.3


## Problem 4

Professor Oak told you about a rare breed of exceptionally powerful pokemon... the _legendary_ pokemon. A trainer who finds and captures a legendary pokemon is sure to become invincible!

💪 **Task: Train a logistic regression model which predicts if pokemons are `Legendary`.**
- train the model on your training dataset
- use `HP`, `Attack`, `Defense`, `Sp. Atk`, `Sp. Def`, and `Speed` as features
- use `Legendary` as label
- scale the features using standardization before you train the model
- store your trained model in a variable called `clf`

In [18]:
from sklearn.linear_model import LogisticRegression

feature_list = ['HP','Attack','Defense','Sp. Atk','Sp. Def','Speed']
label_list = 'Legendary'
X_train, y_train = to_features(train_df, feature_list, label_list)

# Standardization
scaler = StandardScaler()
scaler = scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

# Learning model
clf = LogisticRegression(random_state=0).fit(X_train_scaled, y_train)

In [19]:
def test_classification():
    assert clf, "Can't find clf, have you used the correct variable name for your model?"
    assert math.isclose(clf.coef_.sum(), 5.71640, rel_tol=1e-5), "Your model parameters don't look quite right"
    print('Success! 🎉')
    return

test_classification()

Success! 🎉


🧠 **Bonus Task: What are the differences between logistic regression and linear regression?**

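Linear regression answers a quantity prediction problem: it outputs a continuous value. Logistic regression answers a classification problem: it passes the same kind of linear combination of features through a sigmoid to get a probability between 0 and 1, then thresholds that probability to predict a label.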
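
A minimal sketch of that difference on synthetic data: the logistic model still computes a linear score, but squashes it through a sigmoid to get a probability. The data and the `clf_toy` and `sigmoid` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic binary label

clf_toy = LogisticRegression().fit(X, y)

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The same linear combination a linear regression would output directly...
linear_score = X @ clf_toy.coef_.ravel() + clf_toy.intercept_[0]
# ...turned into the probability of the positive class.
proba_manual = sigmoid(linear_score)
proba_sklearn = clf_toy.predict_proba(X)[:, 1]

print(np.allclose(proba_manual, proba_sklearn))
```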

## Problem 5

Finding legendary pokemons is no easy task, and we expect that we need a more _powerful_ model to accurately predict them.

💪 **Task: Train a logistic regression model with polynomial features and regularization which predicts if pokemons are Legendary.**
- use `HP`, `Attack`, `Defense`, `Sp. Atk`, `Sp. Def`, and `Speed` as features
- use `Legendary` as label
- add polynomial features of degree 3
- scale the features using standardization before you train the model
- use ridge logistic regression to regularize your model
- store your trained model in a variable called `clf`  
Pro-tip: [RidgeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html)

In [43]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import RidgeClassifier

feature_list = ['HP','Attack','Defense','Sp. Atk','Sp. Def','Speed']
label_list = 'Legendary'
X_train, y_train = to_features(train_df, feature_list, label_list)
X_test, y_test = to_features(test_df, feature_list, label_list)

# Polynomial features
poly_degree = 3
poly = PolynomialFeatures(poly_degree, include_bias=False)
poly = poly.fit(X_train)
X_poly = poly.transform(X_train)
X_test_poly = poly.transform(X_test)

# Standardization
scaler = StandardScaler()
scaler = scaler.fit(X_poly)
X_train_scaled = scaler.transform(X_poly)
X_test_scaled = scaler.transform(X_test_poly)

# Learning model
clf = RidgeClassifier(random_state=0).fit(X_train_scaled, y_train)
clf_score = clf.score(X_test_scaled, y_test)

In [44]:
def test_polynomial_regression():
    assert clf, "Can't find clf, have you used the correct variable name for your model?"
    assert math.isclose(clf.coef_.sum(), 0.340915, rel_tol=1e-5), "Your model parameters don't look quite right"
    print('Success! 🎉')
    return

test_polynomial_regression()

Success! 🎉


🧠 **Bonus Task: How do polynomial features make our model more powerful?**

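Linear models work well but have limitations: they can only learn linear relationships between the features and the label. Polynomial features allow our model to learn more complex hypotheses by adding powers and products of the original features as new inputs, while the learning algorithm itself stays linear.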

🧠 **Bonus Task: What is the purpose of regularization? Why is it a good idea to use it here?**

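Regularization helps to solve the overfitting problem of our model by penalizing large coefficients. It is a good idea here because the degree-3 expansion turns our 6 stats into 83 features, which gives the model a lot of freedom to fit noise in the training data. With a classic `LogisticRegression` on these features we would need a higher `max_iter` to reach convergence, and the result would tend to overfit our data.
According to the RidgeClassifier docs, this classifier first converts the target values into {-1, 1} and then treats the problem as a regression task (multi-output regression in the multiclass case). Its L2 penalty keeps the many polynomial coefficients small, so it makes sense to use it for a task where we added polynomial features.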
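
The shrinking effect of the ridge penalty can be sketched on synthetic data: as `alpha` grows, the norm of the learned coefficients drops. The data and the `alpha` values are assumptions chosen for illustration, not tuned for the Pokedex.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))        # 20 features, only the first one matters
y = (X[:, 0] > 0).astype(int)

alphas = [0.01, 1.0, 100.0]           # hypothetical regularization strengths
norms = []
for alpha in alphas:
    clf_toy = RidgeClassifier(alpha=alpha).fit(X, y)
    norms.append(np.linalg.norm(clf_toy.coef_))
    print(f"alpha={alpha:>6}: ||coef|| = {norms[-1]:.4f}")
```

A stronger penalty leaves less room for the coefficients of uninformative features to grow.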

💪 **Bonus Task: Train the exact same regularized logistic regression model with polynomial features, but this time, chain your preprocessors and your model into a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)**

In [46]:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

feature_list = ['HP','Attack','Defense','Sp. Atk','Sp. Def','Speed']
label_list = 'Legendary'
X_train, y_train = to_features(train_df, feature_list, label_list)
X_test, y_test = to_features(test_df, feature_list, label_list)

# Composite estimator
model = make_pipeline(PolynomialFeatures(include_bias=False),
                     StandardScaler(),
                     RidgeClassifier(random_state=0))
params = {
    'polynomialfeatures__degree' : [2,3,4]
}

grid = GridSearchCV(model, param_grid=params, cv=4)
grid.fit(X_train, y_train)

GridSearchCV(cv=4,
             estimator=Pipeline(steps=[('polynomialfeatures',
                                        PolynomialFeatures(include_bias=False)),
                                       ('standardscaler', StandardScaler()),
                                       ('ridgeclassifier',
                                        RidgeClassifier(random_state=0))]),
             param_grid={'polynomialfeatures__degree': [2, 3, 4]})

In [47]:
f"Problem 5 params : 'polynomialfeatures__degree': {poly_degree} with a score of {clf_score}"

"Problem 5 params : 'polynomialfeatures__degree': 3 with a score of 0.93125"

In [48]:
f"Best params : {grid.best_params_} with a score of {grid.score(X_test, y_test)}"

"Best params : {'polynomialfeatures__degree': 2} with a score of 0.925"

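**Does this mean the degree-3 polynomial from Problem 5 is overfitting?** Not necessarily: cross-validation on the training set prefers degree 2, but degree 3 still scores slightly higher on the test set (0.93125 vs 0.925), so the two are too close to call from these numbers alone.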
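
The bonus task asks for the exact same model as Problem 5, just chained into a pipeline. A minimal sketch of that, without the grid search and on synthetic data, checks that the manual chain and the pipeline learn the same coefficients. The `X_toy`, `y_toy`, `manual_clf` and `pipe` names are made up for the example.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 6))              # six synthetic "stats"
y_toy = X_toy[:, 0] * X_toy[:, 1] > 0          # synthetic boolean label

# Manual chain, step by step as in Problem 5.
poly_toy = PolynomialFeatures(3, include_bias=False).fit(X_toy)
X_poly_toy = poly_toy.transform(X_toy)
scaler_toy = StandardScaler().fit(X_poly_toy)
manual_clf = RidgeClassifier(random_state=0).fit(
    scaler_toy.transform(X_poly_toy), y_toy)

# Same steps chained into a Pipeline, with the degree fixed at 3.
pipe = make_pipeline(PolynomialFeatures(3, include_bias=False),
                     StandardScaler(),
                     RidgeClassifier(random_state=0))
pipe.fit(X_toy, y_toy)

pipe_clf = pipe.named_steps["ridgeclassifier"]
print(np.allclose(manual_clf.coef_, pipe_clf.coef_))
```

With the pipeline, `pipe.predict(X)` takes raw stats and applies the fitted preprocessors itself, so there is no risk of forgetting a transform at prediction time.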

## Problem 6

You are travelling across the land, when you spot a large rainbow bird in the sky. Maybe it's a legendary pokemon! Let's check with our freshly trained regularized polynomial classifier.

💪 **Task: Predict if the rainbow bird is a legendary pokemon using your classifier.**
- the stats of the rainbow bird pokemon are found below
- predict using your trained model, `clf`
- store the prediction in a variable called `y_predict`

In [52]:
hp = 106
attack = 130
defense = 90
sp_atk = 110
sp_def = 154
speed = 90

X_unknown_pokemon = np.asarray([hp,attack,defense,sp_atk,sp_def,speed]).reshape(1, 6)
X_poly_pokemon = poly.transform(X_unknown_pokemon)
X_scaled_pokemon = scaler.transform(X_poly_pokemon)
y_predict = clf.predict(X_scaled_pokemon)

In [53]:
def test_predict_legendary():
    assert y_predict == True, f'The prediction should be {True}, but your model predicted {y_predict}'
    print('Success! 🎉')
    print("The rainbow bird is predicted to be a legendary pokemon!")
    
test_predict_legendary()

Success! 🎉
The rainbow bird is predicted to be a legendary pokemon!


## Problem 7

This legendary pokemon classifier is neat, and the Pokedex scientists are interested in including it in their next update. However, they want to make sure that it is accurate enough. 


💪 **Task: Evaluate the accuracy of your legendary Pokemon classifier.**
- evaluate your model on your test dataset
- store the result in a variable called `accuracy`

In [54]:
accuracy = clf.score(X_test_scaled, y_test)

In [55]:
def test_evaluation():
    assert math.isclose(accuracy, 0.93125, rel_tol=1e-5), "Your accuracy doesn't look quite right"
    print('Success! 🎉')
    print(f"You can predict legendary pokemons with an accuracy of {accuracy*100:.1f}%!")
    
test_evaluation()

Success! 🎉
You can predict legendary pokemons with an accuracy of 93.1%!


🧠 **Bonus Task: What is the definition of the accuracy metric?**

It's the number of correct predictions divided by the total number of predictions. Multiplied by 100, it reads as a percentage.
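
As a quick check of that definition, this sketch computes accuracy by hand on made-up labels and compares it with sklearn's `accuracy_score`. The `y_true` and `y_pred` arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: is each of 8 pokemons legendary?
y_true = np.array([True, False, False, True, False, False, False, True])
y_pred = np.array([True, False, False, False, False, True, False, True])

# Correct predictions divided by total predictions.
manual_accuracy = (y_true == y_pred).sum() / len(y_true)

print(manual_accuracy)                   # 0.75 (6 correct out of 8)
print(accuracy_score(y_true, y_pred))    # 0.75
```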