# Part 3: Deep Learning Model with Transformers
## Author: Brady Lamson
## Date: Fall 2023
# Overview and Motivation

This portion of the project has two distinct goals.

First, I will fine-tune the `distilbert-base-uncased` model to predict each pokemon's primary and secondary types from its text description alone.

Second, this notebook serves as a detailed walkthrough of fine-tuning a Hugging Face transformer. Much of the information for this task is scattered across documentation and articles of varying usefulness, so compiling it into one notebook will hopefully be useful to me and to anyone else who reads it.

This walkthrough also covers hyperparameter search with the `optuna` library, which many of the articles I have read online leave out. I hope this gives it a niche that other guides have not filled.

## References

As I've never done multi-label classification using `transformers` before, I'll be using [this guide by Ronak Patel](https://colab.research.google.com/github/rap12391/transformers_multilabel_toxic/blob/master/toxic_multilabel.ipynb#scrollTo=CQQ7CoOag_r7) that is featured in [Towards Data Science](https://towardsdatascience.com/transformers-for-multilabel-classification-71a1a0daf5e1). I won't be following it 1:1, but it's there to help me get some traction.

## Potential Limitations

Good performance is something I will aim for, but it is not the end goal. I fear that the dataset I am working with will put a cap on performance. Pokemon types are extremely varied, with 18 distinct types in this dataset alone, plus a "none" label for pokemon that have no secondary type (19 secondary labels in total). On top of that, I am predicting both primary and secondary types, which turns this into a multi-label prediction problem. Combinations of types therefore become important, and many combinations appear only once. This limitation likely cannot be overcome without removing problematic rows from the training split or acquiring more data.
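
To make the concern about rare combinations concrete, here is a minimal sketch that counts how often each (primary, secondary) pair occurs and lists the pairs seen only once. The `pairs` list is made-up toy data, not the real dataset; on the actual dataframe the same count could be taken over the `primary_type` and `secondary_type` columns.

```python
from collections import Counter

# Hypothetical toy sample of (primary, secondary) type pairs; NOT the real data.
pairs = [
    ("grass", "poison"),
    ("grass", "poison"),
    ("fire", "none"),
    ("fire", "none"),
    ("fire", "flying"),
    ("water", "none"),
    ("bug", "steel"),
]

# Count how many times each combination occurs.
combo_counts = Counter(pairs)

# A combination seen exactly once can land in only one split, so a model
# either never trains on it or is never evaluated on it.
singletons = sorted(combo for combo, n in combo_counts.items() if n == 1)

print(f"{len(combo_counts)} distinct combinations, {len(singletons)} appear once")
for combo in singletons:
    print(combo)
```

On this toy sample three of the five combinations are singletons, which is the kind of sparsity described above.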

A future improvement outside the scope of this project is to collect the descriptions of each pokemon from every game. There are many pokemon games, and their descriptions of a given pokemon tend to be similar but not identical. Using all of them would give each pokemon several rows, which artificially enlarges the dataset, provides more text to train on, and makes rare type combinations more frequent. This is not without downsides, as it would also inflate the frequency of already frequent type combinations, but a variant of this plan with a bit more thought put into it may be worth considering if maximizing model performance is a priority.

# Data Loading and Preprocessing

Our goal here is to repeat the pre-processing from part 2, so some of that content appears again.
From there we'll convert our dataset to a `DatasetDict` (from the Hugging Face `datasets` library), which will contain all of our splits. The big difference is that we take the same dataframe and reshape it to work within the transformers framework.

In [1]:
import pandas as pd
import numpy as np
import os

# Set seed
np.random.seed(776)

In [2]:
data_path = "./data/pokemon.csv"
data_exists = os.path.isfile(data_path)

if not data_exists:
    # This part requires a kaggle api key. On linux this will be saved to your home directory in .kaggle/kaggle.json
    !kaggle datasets download -d cristobalmitchell/pokedex
    !unzip pokedex.zip -d data

df = (
    # load in the data
    pd.read_csv(data_path, sep='\t', encoding='utf-16-le')
    # select the relevant columns
    .loc[:, ['english_name', 'primary_type', 'secondary_type', 'description']]
    # Change the type columns into categories and handle NaNs in secondary typing
    .assign(
        primary_type=lambda x: x['primary_type'].astype("category"),
        secondary_type=lambda x: x['secondary_type'].fillna("none").astype("category")
    )
)
display(df.head())
df.info()  # info() prints its summary itself and returns None, so it needs no display()
display(df.describe())

|   | english_name | primary_type | secondary_type | description |
|---|---|---|---|---|
| 0 | Bulbasaur | grass | poison | There is a plant seed on its back right from t... |
| 1 | Ivysaur | grass | poison | When the bulb on its back grows large, it appe... |
| 2 | Venusaur | grass | poison | Its plant blooms when it is absorbing solar en... |
| 3 | Charmander | fire | none | It has a preference for hot things. When it ra... |
| 4 | Charmeleon | fire | none | It has a barbaric nature. In battle, it whips ... |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 898 entries, 0 to 897
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   english_name    898 non-null    object  
 1   primary_type    898 non-null    category
 2   secondary_type  898 non-null    category
 3   description     898 non-null    object  
dtypes: category(2), object(2)
memory usage: 17.3+ KB

|   | english_name | primary_type | secondary_type | description |
|---|---|---|---|---|
| count | 898 | 898 | 898 | 898 |
| unique | 898 | 18 | 19 | 896 |
| top | Bulbasaur | water | none | Although it’s alien to this world and a danger... |
| freq | 1 | 123 | 429 | 3 |


## Clean up the description text

In [3]:
import re

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text: str) -> str:
    # Replace every character that is not a letter (or é, as in "pokémon") with a space
    text = re.sub(r'[^a-zA-Zé]', ' ', text)
    
    # Tokenize the text on whitespace
    words = text.split()
    
    # Lowercase each word and drop stopwords
    filtered_words = [word.lower() for word in words if word.lower() not in STOP_WORDS]
    
    # Join the words back into a single string
    return ' '.join(filtered_words)

df["cleaned_text"] = df["description"].apply(clean_text)
df.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/brady_fingoal/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


|   | english_name | primary_type | secondary_type | description | cleaned_text |
|---|---|---|---|---|---|
| 0 | Bulbasaur | grass | poison | There is a plant seed on its back right from t... | plant seed back right day pokémon born seed sl... |
| 1 | Ivysaur | grass | poison | When the bulb on its back grows large, it appe... | bulb back grows large appears lose ability sta... |
| 2 | Venusaur | grass | poison | Its plant blooms when it is absorbing solar en... | plant blooms absorbing solar energy stays move... |
| 3 | Charmander | fire | none | It has a preference for hot things. When it ra... | preference hot things rains steam said spout t... |
| 4 | Charmeleon | fire | none | It has a barbaric nature. In battle, it whips ... | barbaric nature battle whips fiery tail around... |
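
As a quick illustration of what the cleaning step does to a single sentence, the sketch below applies the same regex and lowercase filtering to a made-up sample string. To stay self-contained it swaps NLTK's stopword list for a tiny hand-written set (`TOY_STOP_WORDS`, an assumption for this example only), so it will not match the real `clean_text` on arbitrary input.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); far smaller than the NLTK list.
TOY_STOP_WORDS = {"the", "it", "is", "s", "a", "of"}

def clean_text_sketch(text: str) -> str:
    # Replace every character that is not an ASCII letter or 'é' with a space.
    # Digits, punctuation, apostrophes and hyphens all become word breaks.
    text = re.sub(r"[^a-zA-Zé]", " ", text)
    # Lowercase each whitespace-separated token and drop the toy stopwords.
    kept = [w.lower() for w in text.split() if w.lower() not in TOY_STOP_WORDS]
    return " ".join(kept)

# Made-up sentence, not taken from the dataset.
sample = "The Pokémon's tail-flame burns brightly; it is 3 feet long!"
cleaned = clean_text_sketch(sample)
print(cleaned)
```

Note that the apostrophe splits "Pokémon's" into "pokémon" and a stray "s", which is then removed because "s" is in the stopword set.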


## Vectorize Categorical Data

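Here we'll apply the same one-hot encoding as in part 2, but this time before the splits since, in retrospect, encoding after the split made no sense. We also create a `one_hot_labels` column holding each row's full label vector and a `condensed_labels` column listing the indices of its active labels.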

In [4]:
from typing import Tuple, List

def convert_bool_list_to_true_indices(bools: List[int]) -> List[int]:
    ret_val = [i for i, x in enumerate(bools) if x == 1]

    return ret_val
    

def one_hot_encode_categories(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """
    Goal of this function is to create "dummy variables" for each category in primary/secondary typing
    We essentially use this to create a one-hot encoding for our one-word variables.
    We combine this into one giant dataframe at the end
    """

    # Here we create the dummy vars and then add suffixes to the end to differentiate between primary/secondary
    primary_dummies = pd.get_dummies(df['primary_type'], dtype=int)
    primary_dummies.columns = [f"{col}_primary" for col in primary_dummies.columns]
    secondary_dummies = pd.get_dummies(df['secondary_type'], dtype=int)
    secondary_dummies.columns = [f"{col}_secondary" for col in secondary_dummies.columns]
    labels = list(primary_dummies.columns) + list(secondary_dummies.columns)

    # Here we combine the dummy columns to our original dataframe
    one_hot_encoding = pd.concat([df, primary_dummies, secondary_dummies], axis=1)
    one_hot_encoding["one_hot_labels"] = list(one_hot_encoding[labels].values)
    one_hot_encoding["condensed_labels"] = one_hot_encoding["one_hot_labels"].apply(convert_bool_list_to_true_indices)
    return one_hot_encoding, labels

df_encoded, labels = one_hot_encode_categories(df)
df_encoded.head()

|   | english_name | primary_type | secondary_type | description | cleaned_text | bug_primary | dark_primary | dragon_primary | electric_primary | fairy_primary | ... | ice_secondary | none_secondary | normal_secondary | poison_secondary | psychic_secondary | rock_secondary | steel_secondary | water_secondary | one_hot_labels | condensed_labels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bulbasaur | grass | poison | There is a plant seed on its back right from t... | plant seed back right day pokémon born seed sl... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ... | [9, 32] |
| 1 | Ivysaur | grass | poison | When the bulb on its back grows large, it appe... | bulb back grows large appears lose ability sta... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ... | [9, 32] |
| 2 | Venusaur | grass | poison | Its plant blooms when it is absorbing solar en... | plant blooms absorbing solar energy stays move... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ... | [9, 32] |
| 3 | Charmander | fire | none | It has a preference for hot things. When it ra... | preference hot things rains steam said spout t... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ... | [6, 30] |
| 4 | Charmeleon | fire | none | It has a barbaric nature. In battle, it whips ... | barbaric nature battle whips fiery tail around... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ... | [6, 30] |
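
The index pairs in `condensed_labels` follow from the column order: the 18 primary columns come first in alphabetical order, followed by the 19 secondary columns, so every secondary index is offset by 18. The sketch below rebuilds that ordering by hand, without pandas, and reproduces the `[9, 32]` and `[6, 30]` pairs from the output above. The names `label_names`, `encode_types` and `true_indices` are hypothetical helpers for this illustration and are not part of the notebook.

```python
from typing import List

# The 18 type names in this dataset, in alphabetical order.
PRIMARY_TYPES = [
    "bug", "dark", "dragon", "electric", "fairy", "fighting", "fire", "flying",
    "ghost", "grass", "ground", "ice", "normal", "poison", "psychic", "rock",
    "steel", "water",
]
# Secondary typing has one extra category, "none", for single-type pokemon.
SECONDARY_TYPES = sorted(PRIMARY_TYPES + ["none"])

# Primary labels first, then secondary labels: 18 + 19 = 37 in total.
label_names = (
    [f"{t}_primary" for t in PRIMARY_TYPES]
    + [f"{t}_secondary" for t in SECONDARY_TYPES]
)

def encode_types(primary: str, secondary: str) -> List[int]:
    """Return a multi-hot vector with exactly two active positions."""
    vector = [0] * len(label_names)
    vector[label_names.index(f"{primary}_primary")] = 1
    vector[label_names.index(f"{secondary}_secondary")] = 1
    return vector

def true_indices(vector: List[int]) -> List[int]:
    """Return the positions of the active labels."""
    return [i for i, x in enumerate(vector) if x == 1]

print(len(label_names))
print(true_indices(encode_types("grass", "poison")))
print(true_indices(encode_types("fire", "none")))
```

For example, "poison" is the 15th secondary label (position 14), and 18 + 14 gives the index 32 seen for Bulbasaur.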


## Create Type Mapping

In [5]:
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}

print(id2label)

{0: 'bug_primary', 1: 'dark_primary', 2: 'dragon_primary', 3: 'electric_primary', 4: 'fairy_primary', 5: 'fighting_primary', 6: 'fire_primary', 7: 'flying_primary', 8: 'ghost_primary', 9: 'grass_primary', 10: 'ground_primary', 11: 'ice_primary', 12: 'normal_primary', 13: 'poison_primary', 14: 'psychic_primary', 15: 'rock_primary', 16: 'steel_primary', 17: 'water_primary', 18: 'bug_secondary', 19: 'dark_secondary', 20: 'dragon_secondary', 21: 'electric_secondary', 22: 'fairy_secondary', 23: 'fighting_secondary', 24: 'fire_secondary', 25: 'flying_secondary', 26: 'ghost_secondary', 27: 'grass_secondary', 28: 'ground_secondary', 29: 'ice_secondary', 30: 'none_secondary', 31: 'normal_secondary', 32: 'poison_secondary', 33: 'psychic_secondary', 34: 'rock_secondary', 35: 'steel_secondary', 36: 'water_secondary'}


## Sanity Check: Verify our Preprocessing and Mapping Worked

In [16]:
for _, row in df_encoded.head().iterrows():
    actual_primary = row["primary_type"]
    actual_secondary = row["secondary_type"]
    # Primary columns precede secondary ones in `labels`, so the first
    # active index is the primary type and the second is the secondary type
    mapped_primary = id2label[row["condensed_labels"][0]]
    mapped_secondary = id2label[row["condensed_labels"][1]]

    print(f"Actual: {actual_primary}, {actual_secondary}")
    print(f"Mapped: {mapped_primary}, {mapped_secondary}\n")

Actual: grass, poison
Mapped: grass_primary, poison_secondary

Actual: grass, poison
Mapped: grass_primary, poison_secondary

Actual: grass, poison
Mapped: grass_primary, poison_secondary

Actual: fire, none
Mapped: fire_primary, none_secondary

Actual: fire, none
Mapped: fire_primary, none_secondary
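
The same lookup is what turns model output back into type names. As a rough sketch, assuming a multi-label model emits one score between 0 and 1 per label, the snippet below keeps every label whose score clears a threshold and translates the indices with an `id2label`-style dictionary. The four-entry mapping, the scores, the 0.5 threshold and the `decode_scores` helper are all made up for illustration; the real `id2label` has 37 entries.

```python
from typing import Dict, List

# Hypothetical miniature mapping; the notebook's id2label has 37 entries.
toy_id2label: Dict[int, str] = {
    0: "fire_primary",
    1: "grass_primary",
    2: "none_secondary",
    3: "poison_secondary",
}

def decode_scores(
    scores: List[float], id2label: Dict[int, str], threshold: float = 0.5
) -> List[str]:
    """Return the names of all labels whose score is at or above the threshold."""
    return [id2label[i] for i, score in enumerate(scores) if score >= threshold]

# Made-up per-label scores, one per entry in toy_id2label.
toy_scores = [0.08, 0.91, 0.12, 0.77]
print(decode_scores(toy_scores, toy_id2label))
```

Nothing in a plain threshold guarantees exactly one primary and one secondary label per pokemon, so a real decoding step may need an extra rule such as taking the highest-scoring label within each group.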

