# Task for Today  

***

## Legendary Pokémon Classification  

Use a feedforward neural network to predict whether a given Pokémon is **legendary** or not, based on its *features*.


<img src="https://wallpapers.com/images/hd/legendary-pokemon-pictures-7yo7x0f1l2b2tu0r.jpg" width="800" height="500" alt="legendaries">

Data available at: https://github.com/Vaeliss/Pokemon_challenge/blob/main/pokemon.csv

Download the `pokemon.csv` file and upload it to the Files section of Colab.

# Challenge

TAs want to battle!

<img src="https://pokemongohub.net/wp-content/uploads/2023/06/grunts-1.jpg" width="400" height="300" alt="TAs">

Rules of the challenge:

- Gotta catch 'em all! ...But give priority to the legendaries.
- F1-score is usually the measure of choice for imbalanced datasets; in this case, however, failing to "catch" a legendary (a false negative) is worse than a false alarm. They're so rare, you might not get another chance to catch 'em if they flee...
- In ML terms, we give recall more importance than precision for the task (check the whiteboard if you don't know their meaning).
- F2-score (i.e., [F-$\beta$-score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html) with $\beta = 2$) is hence used as the main evaluation metric for your model.

- **TAs achieved an F2-score of 0.80. Can you beat them?!**
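
To make the metric concrete, here is a minimal sketch of the F-$\beta$ formula written in terms of confusion-matrix counts, $F_\beta = \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP + \beta^2\,FN + FP}$. The helper name `fbeta_from_counts` and the counts are made up for illustration; in the notebook itself you would normally call scikit-learn's `fbeta_score`.

```python
def fbeta_from_counts(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from confusion-matrix counts; beta > 1 weights recall more than precision."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom > 0 else 0.0


# Hypothetical counts: 8 legendaries caught, 4 false alarms, 2 legendaries missed.
f1 = fbeta_from_counts(tp=8, fp=4, fn=2, beta=1.0)
f2 = fbeta_from_counts(tp=8, fp=4, fn=2, beta=2.0)
print(f"F1 = {f1:.3f}, F2 = {f2:.3f}")  # F1 = 0.727, F2 = 0.769
```

Here recall (0.80) is higher than precision (0.67), so F2 comes out above F1: missed legendaries cost more than false alarms.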

# Imports and dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

import torch
from torch import nn
import torch.optim as optim

In [None]:
_ = torch.manual_seed(42) # for a fair comparison, don't change the seed!

In [None]:
data = pd.read_csv('pokemon.csv')

In [None]:
data

In [None]:
data_raw = data.copy() # usually, if memory allows it, it's a good idea to keep a raw version of your data

# Pre-processing / encoding

In [None]:
data.info()

In [None]:
data.isna().sum()

In [None]:
data = data.drop(['dexnum', 'name', 'type2'], axis=1)
# dropping type2 is actually a debatable step, as it may provide useful information
# data = data.drop(["#", "Name"], axis=1)

In [None]:
data['legendary'] = data['legendary'].astype(int)
data['generation'] = data['generation'].astype(str)

In [None]:
data.dtypes

Categorical variables are one-hot encoded.

In [None]:
def onehot_encode(df, column, prefix):
    df = df.copy()
    dummies = pd.get_dummies(df[column], prefix=prefix)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df

In [None]:
data = onehot_encode(data, 'type1', 't')
data = onehot_encode(data, 'generation', 'g')
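
As a quick illustration of what the encoding step does, here is a small sketch on a toy frame. The rows are made up; only the column name `type1` and the prefix `t` mirror the cells above.

```python
import pandas as pd

# Toy frame with made-up rows, not taken from pokemon.csv.
toy = pd.DataFrame({"hp": [45, 106, 39], "type1": ["Grass", "Psychic", "Fire"]})

# Same steps as onehot_encode: build the dummy columns, append them, drop the original.
dummies = pd.get_dummies(toy["type1"], prefix="t")
encoded = pd.concat([toy.drop("type1", axis=1), dummies], axis=1)

print(encoded.columns.tolist())  # ['hp', 't_Fire', 't_Grass', 't_Psychic']
```

Depending on the pandas version, the dummy columns may be boolean rather than integer, so it can be worth checking `data.dtypes` again before building tensors.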

In [None]:
data.shape

## Splitting and Scaling

In [None]:
data.columns # note that only the first 9 features are continuous now

In [None]:
y = data['legendary']
X = data.drop('legendary', axis=1)

In [None]:
scaler = StandardScaler()

X_scaled = scaler.fit_transform(X.iloc[:,:9])
X = np.concatenate((X_scaled, np.array(X.iloc[:,9:])), axis=1)

In [None]:
# keep the class proportions equal across the splits and specify a seed of 42, we want a fair fight!
# Note: the final split should be 0.60/0.20/0.20 for train/valid/test

train_size = 0.6
valid_size = 0.4
test_size = 0.5
X_train, X_test, y_train, y_test = # TODO
X_valid, X_test, y_valid, y_test = # TODO
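
One possible way to obtain a 60/20/20 split is two chained calls to `train_test_split`. The sketch below runs on synthetic data and assumes that "keeping the proportions equal" means stratifying on the label; the variable names with the `_toy` suffix and the short split names are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 5))            # synthetic features, not the Pokémon data
y_toy = (rng.random(1000) < 0.1).astype(int)  # roughly 10% positives

# First split: 60% train, 40% held out.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_toy, y_toy, train_size=0.6, stratify=y_toy, random_state=42
)
# Second split: halve the held-out part into validation and test (20% each overall).
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42
)

print(len(X_tr), len(X_va), len(X_te))  # 600 200 200
print(y_tr.mean(), y_va.mean(), y_te.mean())
```

With stratification, the share of positives printed on the last line should be nearly identical across the three sets, which matters when the positive class is this rare.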

# Model definition

In [None]:
device = # TODO
print(f"Using {device} device")

### Define your model:

Choose for your model:
- number of hidden layers
- number of neurons per layer (careful with input and output, these are not a choice)
- activation functions
- any other possible component among those seen so far in theory.
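
For orientation, here is a minimal sketch of what such a model could look like. The class name `LegendaryNet`, the layer sizes, the dropout rate and the input width of 36 are all placeholder assumptions, not the TAs' architecture; the input width in particular should come from your data (e.g. `X_train.shape[1]`).

```python
import torch
from torch import nn

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class LegendaryNet(nn.Module):  # hypothetical name
    def __init__(self, in_features: int, hidden: int = 32, p_drop: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one raw logit; the sigmoid is applied outside the model
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


model = LegendaryNet(in_features=36).to(device)  # 36 is a placeholder width
out = model(torch.randn(4, 36, device=device))
print(out.shape)  # torch.Size([4, 1])
```

Returning a raw logit rather than a probability is a design choice that pairs with a loss operating on logits; if you put a sigmoid inside the model instead, the loss has to change accordingly.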

In [None]:
# TODO

Instantiate your model and print it out

In [None]:
# TODO

### Hyperparameters:

Choose carefully:
- learning rate (this is usually the most important hyperparameter to get right, but some optimizers are more forgiving than others)
- batch size
- number of epochs
- any other hyperparameters that you might need

In [None]:
# TODO

### Loss function and optimizer:

- What's the appropriate loss function for the task?
- Decide which optimizer you want to use ([Documentation](https://pytorch.org/docs/stable/optim.html))
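
As one possible answer, the sketch below pairs a binary cross-entropy loss on logits with Adam. The `pos_weight` term, which up-weights the rare positive class, is an optional assumption on top of the exercise, and the labels, the stand-in model and the learning rate are made up.

```python
import torch
from torch import nn

# Made-up labels with 20% positives, standing in for y_train.
y_toy = torch.tensor([0., 0., 0., 0., 0., 0., 0., 0., 1., 1.])
n_pos = y_toy.sum()
n_neg = len(y_toy) - n_pos
pos_weight = (n_neg / n_pos).reshape(1)  # 4.0: each positive counts as four negatives

# BCEWithLogitsLoss expects raw logits and applies the sigmoid internally.
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

stand_in_model = nn.Linear(6, 1)  # placeholder for your own network
optimizer = torch.optim.Adam(stand_in_model.parameters(), lr=1e-3)

# With all-zero logits every prediction is 0.5, so the loss is log(2) * (4*2 + 8) / 10.
loss = criterion(torch.zeros(10), y_toy)
print(f"{loss.item():.4f}")  # 1.1090
```

Weighting the positive class pushes the model towards higher recall at the cost of precision, which is in line with the F2 objective, but whether it actually helps here is something to check on the validation set.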

In [None]:
# TODO

Define your TensorDatasets and DataLoaders; remember to use the appropriate dtype for your tensors.
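
A minimal sketch of this step on synthetic arrays follows. The array names, the batch size of 32 and the `(N, 1)` label shape are assumptions; the label shape is chosen to match a model that outputs a single logit per sample.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-ins for X_train / y_train (a pandas Series would need .to_numpy() first).
X_np = np.random.default_rng(0).normal(size=(100, 6))
y_np = np.zeros(100)
y_np[:10] = 1

# float32 matches the default dtype of nn.Linear weights; BCE-style losses also need float targets.
X_t = torch.tensor(X_np, dtype=torch.float32)
y_t = torch.tensor(y_np, dtype=torch.float32).unsqueeze(1)  # shape (N, 1)

train_ds = TensorDataset(X_t, y_t)
trainloader = DataLoader(train_ds, batch_size=32, shuffle=True)

xb, yb = next(iter(trainloader))
print(xb.shape, yb.shape, len(trainloader))  # torch.Size([32, 6]) torch.Size([32, 1]) 4
```

Shuffling is usually enabled only for the training loader; the validation and test loaders can keep their order.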

In [None]:
# TODO

In [None]:
# Keep track of training and validation losses during training

train_loss_list = []
valid_loss_list = []

train_length = len(trainloader)
valid_length = len(validloader)

# Training

Implement your training and evaluation (for the validation set) loops
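
The overall shape of such loops can be sketched as follows, on synthetic data with a tiny stand-in model. Every number here (data, layer sizes, learning rate, 20 epochs) is a placeholder, and the sketch stays on the CPU; on a GPU each batch would also need `.to(device)`.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(42)

# Synthetic stand-in data: the label depends only on the first feature.
X = torch.randn(200, 4)
y = (X[:, :1] > 1.0).float()
trainloader = DataLoader(TensorDataset(X[:150], y[:150]), batch_size=32, shuffle=True)
validloader = DataLoader(TensorDataset(X[150:], y[150:]), batch_size=32)

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

train_loss_list, valid_loss_list = [], []
for epoch in range(20):
    # Training pass: gradients enabled, weights updated after every batch.
    model.train()
    running = 0.0
    for xb, yb in trainloader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        running += loss.item()
    train_loss_list.append(running / len(trainloader))  # mean loss per batch

    # Validation pass: no gradients, no weight updates.
    model.eval()
    running = 0.0
    with torch.no_grad():
        for xb, yb in validloader:
            running += criterion(model(xb), yb).item()
    valid_loss_list.append(running / len(validloader))

print(f"train {train_loss_list[-1]:.3f} | valid {valid_loss_list[-1]:.3f}")
```

Switching between `model.train()` and `model.eval()` matters as soon as the model contains layers such as dropout or batch normalization, which behave differently in the two modes.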

In [None]:
# TODO

# Results

### Plotting

Plot out the training and validation losses over the epochs

In [None]:
plt.plot( ... , label='train') # TODO
plt.plot( ... , label='valid') # TODO
plt.legend(loc="best")
plt.grid(True)
plt.show()

### Metrics

Print out appropriate metrics for the task
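
As a sketch of how the metrics fit together, the example below scores made-up probabilities at two decision thresholds. The labels, the probabilities and the idea of lowering the threshold below 0.5 are illustrative assumptions; any threshold tuning should be done on the validation set, not on the test set.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, fbeta_score, recall_score

# Made-up labels and sigmoid outputs, standing in for the test set and model predictions.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
probs = np.array([0.1, 0.2, 0.4, 0.6, 0.1, 0.3, 0.9, 0.7, 0.45, 0.2])

f2_by_threshold = {}
for threshold in (0.5, 0.4):
    y_pred = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    f2_by_threshold[threshold] = fbeta_score(y_true, y_pred, beta=2)
    print(
        f"threshold={threshold}: TP={tp} FP={fp} FN={fn} "
        f"recall={recall_score(y_true, y_pred):.2f} F2={f2_by_threshold[threshold]:.3f}"
    )
```

On this toy data, lowering the threshold from 0.5 to 0.4 catches one more legendary at the price of one more false alarm, and F2 rises from about 0.526 to about 0.714.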

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, fbeta_score, classification_report

# TODO

In [None]:
wrong_predictions = # OPTIONAL TODO

Did you manage to catch them all?

______________________________________________________________________________

This notebook is largely inspired (with some improvements and updates) by a video featured on [Data Every Day](https://www.youtube.com/watch?v=3Fr1npNxkJk).