# Assignment 3

(╯°□°)╯︵◓

You've been searching far and wide, but you're still trying to understand the power that's inside. This time round though, you're armed with new weapons: supervised learning algorithms. Pokemons will have no more secrets after you analyse the pokedex!

The data can be found under `pokedex/pokemons.csv`, and is the same as assignment 1 & 2. Run the cell below to get an overview of the dataset:

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('pokedex/pokemons.csv')
df.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


## Problem 1

A crucial statistic in pokemon battles is `HP`. This is the amount of damage you have to inflict on your opponent to win the fight, so being able to _predict_ this amount would be an enormous advantage 👊.

💪 **Task: Train a linear regression model which predicts the label `HP`.**
- use `Attack`, `Defense`, `Sp. Atk`, `Sp. Def`, and `Speed` as features
- use `HP` as label
- scale the features using standardization before you train the model
- store your trained model in a variable called `reg`

In [2]:
# Importing Linear Regression Model.
from sklearn.linear_model import LinearRegression

#Defining Essentials for our Model.
linear_features = df[['Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]
linear_labels = df['HP']

#Defining the features and labels in terms of Matrix X and vector y.
#Reshaping isn't required in our case as we have the appropriate formatting present already.
linear_X = linear_features.values
linear_y = linear_labels.values

# Importing Standard Scaler to scale our features.
from sklearn.preprocessing import StandardScaler

# Fitting our Features through the Scaler.
linear_scaler = StandardScaler().fit(linear_X)
linear_scaled_features = linear_scaler.transform(linear_X)

#Training our model and storing it in 'reg'
reg = LinearRegression().fit(linear_scaled_features, linear_y)

#Getting Predictions of label HP.
PredictedHP = reg.predict(linear_scaled_features)

In [3]:
import math

def test_regression():
    assert reg, "Can't find reg, have you used the correct variable name for your model?"
    assert math.isclose(reg.coef_.sum(), 15.20036, rel_tol=1e-6), "Your model parameters don't look quite right"
    print('Success! 🎉')

test_regression()

Success! 🎉


🧠 **Task: List and describe the main steps that happen during your linear regression model's _training_ , i.e inside of sklearn's `.fit()` method.**

🧠 **Task: Explain the purpose of feature scaling, and why it's a good idea to use it here.**

🧠 **Bonus Task: Can you list other feature scaling methods?**

💡 ***Answer 1:*** Training means finding the parameters **$\theta$** that minimize a cost function ***J***, here the mean squared error between the predicted and the actual `HP`. The generic iterative way to do this is gradient descent:

* Start with initial (for example random) parameters **$\theta$**, one per feature plus an intercept. ***(Multivariate linear regression in the case above.)***
* Define the cost function ***J***, which measures how far the model's predictions are from the labels.
* Calculate the derivative of ***J*** at the current parameters ***($\frac{dJ}{d\theta}$)***.
* Take a step proportional to the negative of the derivative ***($-\alpha \frac{dJ}{d\theta}$)***, which moves the parameters towards lower cost.
* Recompute the cost and repeat until a minimum is reached ***($\frac{dJ}{d\theta} \approx 0$)***.

Note that sklearn's `LinearRegression.fit()` does not iterate like this: it solves the same least-squares problem directly (it wraps `scipy.linalg.lstsq`). Because ***J*** is convex, both approaches arrive at the same minimum.
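
The gradient descent steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data is synthetic rather than taken from the pokedex, the learning rate `alpha` and the iteration cap are arbitrary choices, and the starting point is a vector of zeros instead of random values. The direct least-squares solution is computed alongside for comparison.

```python
import numpy as np

# Synthetic data (hypothetical, not the pokedex): 200 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 4.0 + rng.normal(scale=0.1, size=200)

# Add a column of ones so the intercept is learned as theta[0].
Xb = np.hstack([np.ones((X.shape[0], 1)), X])

def cost(theta):
    """Cost J(theta): half the mean squared error."""
    residuals = Xb @ theta - y
    return (residuals @ residuals) / (2 * len(y))

def gradient(theta):
    """Gradient dJ/dtheta of the cost above."""
    return Xb.T @ (Xb @ theta - y) / len(y)

theta = np.zeros(Xb.shape[1])  # starting point (zeros instead of random values)
alpha = 0.1                    # learning rate, an assumed value
for _ in range(2000):
    grad = gradient(theta)
    if np.linalg.norm(grad) < 1e-10:  # gradient is approximately zero: stop
        break
    theta = theta - alpha * grad      # step against the gradient

# Direct least-squares solution for comparison.
theta_direct, *_ = np.linalg.lstsq(Xb, y, rcond=None)

print(theta)
print(theta_direct)
```

Both printed vectors should agree to several decimal places, which illustrates that the iterative and the direct route lead to the same parameters on a well-conditioned problem.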

***

💡 ***Answer 2:*** `Feature scaling` takes features measured on different scales and brings them onto a common scale, so that no feature dominates the model simply because its values are numerically larger.

In our case, we use sklearn's `StandardScaler`, which applies ***standardization***: each feature is transformed so that its `mean` becomes **0** and its `variance` becomes **1**. The features end up on a small, comparable scale while the relative distances between values within each feature are preserved. For plain least squares this does not change the predictions, but it makes the coefficients directly comparable across features, and it matters for gradient-based and regularized models, where unscaled features slow down convergence or are penalized unevenly.
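
As a small illustration of standardization, the sketch below compares `StandardScaler` with the formula $z = (x - \mu) / \sigma$ written out by hand. The four rows are made-up values chosen for the example, not a claim about the real dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stats for four pokemon: columns are Attack and Speed.
X = np.array([[49.0, 45.0],
              [62.0, 60.0],
              [82.0, 80.0],
              [52.0, 65.0]])

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# The same transformation by hand: subtract the column mean,
# divide by the column standard deviation.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

The fitted `scaler` remembers the training mean and standard deviation, which is why the same object should be reused to transform new samples before prediction.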

***

💡🎉 ***Bonus Answer:*** Below is a list of a few feature scaling methods.

* **Standardization:** Discussed above. (Works well when the data is roughly normally distributed.)
* **Normalization *(also called Min-Max Scaling)*:** Rescales each feature so that its values fall within a fixed range such as [0, 1] or [-1, 1]. (Works well when the standard deviation is small; it is sensitive to outliers because the range is set by the minimum and maximum.)
* **Maximum Absolute Scaling:** Scales each feature individually so that its maximum absolute value equals `1.0`. The data is never shifted or centered, so the sparsity of the data is preserved.
* **Robust Scaling:** Used when the data contains many outliers. It subtracts the median and divides by the interquartile range (the distance between the 25th and 75th percentiles). Because the median and quantiles are barely affected by a few extreme values, the scaling is more reliable than one based on the mean and standard deviation.
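
To make the differences between these scalers concrete, here is a minimal sketch that applies three of them to one made-up feature containing a large outlier. The input values are an assumption chosen for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler, RobustScaler

# One hypothetical feature with a large outlier (the last value).
x = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

minmax = MinMaxScaler().fit_transform(x)  # (x - min) / (max - min)
maxabs = MaxAbsScaler().fit_transform(x)  # x / max(|x|)
robust = RobustScaler().fit_transform(x)  # (x - median) / IQR

print(minmax.ravel())  # outlier maps to 1, the other values are squeezed near 0
print(maxabs.ravel())  # same squeeze, but without shifting the data
print(robust.ravel())  # median maps to 0, typical values stay spread out
```

With min-max and max-abs scaling the single outlier compresses all ordinary values into a narrow band, while robust scaling keeps them well separated and leaves the outlier far away.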

> These are a few that I could logically understand out of all the ones I found in my research online. There might be easier or tougher ones remaining, but my submission this time will be really close to the set deadline.

***

## Problem 2

You encounter an unknown pokemon, and it looks very strong. 🙀 Use your `HP` regression model to see if you can take it on!

💪 **Task: Predict the `HP` of an unknown pokemon using your linear regression model.**
- the stats of the unknown pokemon are found below
- predict using your trained model, `reg`
- store the prediction in a variable called `y_predict`


In [4]:
attack = 79
defense = 109
sp_atk = 73
sp_def = 84
speed = 68

#The stats of the pokemon shall be taken as the feature and written in the form of a Numpy array.
unknown_pokemon = np.array([attack, defense, sp_atk, sp_def, speed])

#Features must be a matrix and we have a 1-D array, so we reshape it into a matrix with 1 row and 5 columns.
unknown_pokemon_feature = unknown_pokemon.reshape(1,5)

#Scaling this feature with the scaler fitted in Problem 1 so it matches the data the model was trained on.
unknown_pokemon_feature_scaled = linear_scaler.transform(unknown_pokemon_feature)

#Scaled Feature is ready for prediction so running it through reg.predict and storing in variable 'y_predict'
y_predict = reg.predict(unknown_pokemon_feature_scaled)
y_predict

array([69.55099276])

In [5]:
def test_predict_hp():
    expected_prediction = 69.551
    assert y_predict, f"Can't find y_predict, have you used the correct variable name?"
    assert math.isclose(y_predict, expected_prediction, rel_tol=1e-4), f'The prediction should be {expected_prediction}, but your model predicted {y_predict}'
    print('Success! 🎉')
    return

test_predict_hp()

Success! 🎉


## Problem 3

Professor Oak told you about a rare breed of exceptionally powerful pokemon... the _legendary_ pokemon. A trainer who finds and captures a legendary pokemon is sure to become invincible!

💪 **Task: Train a logistic regression model which predicts if pokemons are `Legendary`.**
- use `HP`, `Attack`, `Defense`, `Sp. Atk`, `Sp. Def`, and `Speed` as features
- use `Legendary` as label
- scale the features using standardization before you train the model
- store your trained model in a variable called `clf`

In [6]:
#Importing Logistic Regression Model.
from sklearn.linear_model import LogisticRegression

#Standard Scaler pre-imported in Problem 1. 
#Defining Essentials of our Model.
logistic_features = df[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]
logistic_labels = df['Legendary']

#Defining features and labels in terms of Matrix X and vector y
logistic_X = logistic_features.values
logistic_y = logistic_labels.values

#Scaling and Standardizing our features.
logistic_scaler = StandardScaler().fit(logistic_X)
logistic_scaled_features = logistic_scaler.transform(logistic_X)

#Training our model and storing it in 'clf'
clf = LogisticRegression().fit(logistic_scaled_features, logistic_y)

#Getting Predictions of label 'Legendary' based on standardized features.
PredictedLegendary = clf.predict(logistic_scaled_features)

In [7]:
def test_regression():
    assert clf, "Can't find clf, have you used the correct variable name for your model?"
    assert math.isclose(clf.coef_.sum(), 5.80924, rel_tol=1e-5), "Your model parameters don't look quite right"
    print('Success! 🎉')
    return

test_regression()

Success! 🎉


🧠 **Task: What are the differences between logistic regression and linear regression?**

💡 ***Answer 3:***

| **Linear Regression** | **Logistic Regression** |
| --- | --- |
| Predicts a ***Continuous*** Dependent Variable based on the values of the Independent Variables | Predicts a ***Categorical*** Dependent Variable based on the values of the Independent Variables |
| Estimated by ***least squares*** (the regression coefficients are chosen to minimize the sum of the squared distances between the observed responses and the model's predictions) | Estimated by ***maximum likelihood*** (the regression coefficients are chosen such that the probability of *y* given *X* is maximized) |
| Best represented in the form of a ***Straight Line*** | Best represented in the form of an ***S-shaped (sigmoid) Curve*** |
| Assumes a ***linear relationship*** between the Dependent and Independent variables | Assumes a linear relationship between the Independent variables and the ***log-odds*** of the Dependent variable, not the variable itself |
| Resultant Output is a ***Predicted Real Value*** | Resultant Output is a ***Probability*** between 0 and 1, which is thresholded into a ***Predicted Binary Value*** |

> `Extra`
- **Common Application of Linear Regression:** Forecasting Sales in Business Models.
- **Common Application of Logistic Regression:** Classification Problems and Image Processing.
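
The difference in output between the two models can be sketched as follows. The parameters and the sample are hypothetical values chosen for illustration; the point is that logistic regression passes the same linear combination through the sigmoid function and then thresholds the result.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and one standardized sample with two features.
theta = np.array([1.5, -0.5])
intercept = -1.0
x = np.array([2.0, 1.0])

linear_output = x @ theta + intercept  # what linear regression would return
probability = sigmoid(linear_output)   # what logistic regression models
predicted_class = probability >= 0.5   # thresholding gives the binary label

print(linear_output, probability, predicted_class)
```

A threshold of 0.5 on the probability corresponds to a threshold of 0 on the linear output, which is why the decision boundary of logistic regression is still linear in the features.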

***

## Problem 4

Finding legendary pokemons is no easy task, and we expect that we need a more _powerful_ model to accurately predict them.

💪 **Task: Train a logistic regression model with polynomial features and regularization which predicts if pokemons are Legendary.**
- use `HP`, `Attack`, `Defense`, `Sp. Atk`, `Sp. Def`, and `Speed` as features
- use `Legendary` as label
- add polynomial features of degree 3
- scale the features using standardization before you train the model
- use ridge logistic regression to regularize your model
- store your trained model in a variable called `clf`  
Pro-tip: [RidgeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html)

In [8]:
#Linear Regression Pre-Imported in Problem 1.
#Logistic Regression Pre-Imported in Problem 3.
#Standard Scaler Pre-Imported in Problem 1.

#Importing Polynomial Features Pre-processing.
from sklearn.preprocessing import PolynomialFeatures

#Importing Ridge Classification Model.
from sklearn.linear_model import RidgeClassifier

#Defining Essentials of our Model
pologreg_features = df[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]
pologreg_labels = df['Legendary']

#Defining Features in Terms of Matrix X and Vector y.
pologreg_X = pologreg_features.values
pologreg_y = pologreg_labels.values

#Using Polynomial Features to add features of degree 3.
pologreg_poly = PolynomialFeatures(3, include_bias = False)
pologreg_poly = pologreg_poly.fit(pologreg_X)
X_pologreg_poly = pologreg_poly.transform(pologreg_X)

#Scaling and Standardizing features with added Polynomials.
pologreg_scaler = StandardScaler()
pologreg_scaler = pologreg_scaler.fit(X_pologreg_poly)
X_pologreg_poly_scaled = pologreg_scaler.transform(X_pologreg_poly)

#Using Ridge Logistic Regression to Regularize the Model.
pologreg_reg = RidgeClassifier(alpha=1).fit(X_pologreg_poly_scaled, pologreg_y)

#Taking this value and sending it to 'clf' as per question requirement.
clf = pologreg_reg

In [9]:
def test_regression():
    assert clf, "Can't find clf, have you used the correct variable name for your model?"
    assert math.isclose(clf.coef_.sum(), 0.37331, rel_tol=1e-5), "Your model parameters don't look quite right"
    print('Success! 🎉')
    return

test_regression()

Success! 🎉


🧠 **Task: How do polynomial features make our model more powerful?**

🧠 **Task: What is the purpose of regularization? Why is it a good idea to use it here?**

🧠 **Bonus Task: Can you cite other regularization methods?**

💪 **Bonus Task: Train the exact same regularized logistic regression model with polynomial features, but this time, chain your preprocessors and your model into a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)**

💡 ***Answer 4:*** Polynomial features create additional features by raising the original features to powers and multiplying them together, up to a chosen degree. A linear model trained on these extra features can fit curved, non-linear relationships and decision boundaries that it could not capture with the original features alone. The higher the degree, the more flexible the model becomes, although a very high degree also increases the risk of overfitting.

***
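
As a quick illustration of what polynomial features look like, the sketch below expands a made-up two-feature sample to degree 2, then counts how many features a degree-3 expansion of six inputs produces, as in Problem 4. The sample values are assumptions for the example.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A single hypothetical sample with two features, a = 2 and b = 3.
X_small = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(X_small))
# columns: a, b, a^2, a*b, b^2  ->  [[2. 3. 4. 6. 9.]]

# With six features and degree 3, the feature count grows quickly.
X_six = np.zeros((1, 6))
expanded = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X_six)
n_features = expanded.shape[1]
print(n_features)  # 83
```

Going from 6 to 83 features is what gives the model its extra flexibility, and it is also the reason regularization becomes important.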

💡 ***Answer 5:*** Regularization is used to avoid overfitting, so that a model fitted on the training set also generalizes to data it has not seen. The variation in the data can be divided into two broad categories:
- `Pattern`
- `Noise`

The end goal of regularization is a model that learns the `Pattern` in the data while ignoring the `Noise`. This is done by adding a penalty on the size of the model's parameters to the cost function, so that large coefficients are only kept when they clearly reduce the error. Ridge regularization penalizes the sum of the squared coefficients, with `alpha` controlling the strength of the penalty. Noise cannot be eliminated completely, but its influence can be reduced by penalizing the large coefficients that a flexible model would need in order to fit it.
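
The effect of the penalty strength can be observed directly. The following sketch trains `RidgeClassifier` with three values of `alpha` on synthetic data (an assumption, not the pokedex) and prints the size of the coefficient vector each time.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (hypothetical): the label depends on the first feature only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = ((X[:, 0] + rng.normal(scale=0.5, size=100)) > 0).astype(int)

X_poly = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_poly)

# A larger alpha means a stronger penalty, hence smaller coefficients.
norms = {}
for alpha in (0.01, 1.0, 100.0):
    model = RidgeClassifier(alpha=alpha).fit(X_scaled, y)
    norms[alpha] = np.linalg.norm(model.coef_)
    print(alpha, norms[alpha])
```

The coefficient norm shrinks as `alpha` grows. Choosing a good value in practice would normally involve validation on held-out data, which is outside the scope of this sketch.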

In our case, adding polynomial features of degree 3 turns the 6 original stats into 83 features, which gives the model a lot of freedom to fit noise and outliers in the training data. Regularization lets us keep the extra information from the higher-degree polynomial terms while keeping their coefficients small, so the model stays accurate on unseen pokemon.
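
For the Pipeline bonus task, one possible approach is sketched below. It chains the same three steps used in Problem 4 (polynomial features, standardization, `RidgeClassifier`) into a single estimator. The data here is a synthetic stand-in, and the step names `"poly"`, `"scale"` and `"ridge"` are arbitrary labels chosen for the example; with the real dataset, `X` and `y` would be replaced by `pologreg_X` and `pologreg_y`.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the six stat columns and the Legendary label.
rng = np.random.default_rng(42)
X = rng.integers(20, 160, size=(200, 6)).astype(float)
y = X.sum(axis=1) > 580  # "legendary" here simply means a high stat total

pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=3, include_bias=False)),
    ("scale", StandardScaler()),
    ("ridge", RidgeClassifier(alpha=1)),
])
pipe.fit(X, y)

# New samples go through the same fitted steps automatically.
new_pokemon = np.array([[106, 130, 90, 110, 154, 90]], dtype=float)
prediction = pipe.predict(new_pokemon)
print(prediction)
```

The main benefit of the pipeline is that prediction needs a single `pipe.predict(...)` call on the raw stats, so there is no risk of forgetting the polynomial expansion or the scaling step, or of applying them in the wrong order.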

## Problem 5

You are travelling across the land, when you spot a large rainbow bird in the sky. Maybe it's a legendary pokemon! Let's check with our freshly trained regularized polynomial classifier.

💪 **Task: Predict if the rainbow bird is a legendary pokemon using your classifier.**
- the stats of the rainbow bird pokemon are found below
- predict using your trained model, `clf`
- store the prediction in a variable called `y_predict`

In [10]:
hp = 106
attack = 130
defense = 90
sp_atk = 110
sp_def = 154
speed = 90

#Using Defined Stats and making it into an array.
rainbow_pokemon_stats = np.array([hp, attack, defense, sp_atk, sp_def, speed])

#Reshaping the Array so it is in a matrix form.
rainbow_pokemon_features = rainbow_pokemon_stats.reshape(1,6)

#Adding the same polynomial features to the Rainbow Pokemon's features.
rainbow_pokemon_pologreg_poly = pologreg_poly.transform(rainbow_pokemon_features)

#Scaling and Standardizing the Rainbow Pokemon's features for an accurate prediction.
rainbow_pokemon_pologreg_scaled = pologreg_scaler.transform(rainbow_pokemon_pologreg_poly)

#Storing Prediction in a variable 'y_predict'.
y_predict = clf.predict(rainbow_pokemon_pologreg_scaled)

In [11]:
def test_predict_legendary():
    assert y_predict == True, f'The prediction should be {True}, but your model predicted {y_predict}'
    print('Success! 🎉')
    
test_predict_legendary()

Success! 🎉
