# NLP Project: Pokemon Type Predictor (Part 2)
## Author: Brady Lamson
## Date: Fall 2023

Part 2 builds a basic model capable of multi-label classification. We'll use each Pokémon's Pokédex description to try to predict its primary and secondary type.

# Data Loading

Here we connect to Kaggle and download the dataset. Note that an access token is required to run this code; if you haven't set up the Kaggle API yet, [here is a link to the Kaggle docs](https://www.kaggle.com/docs/api).
This file also has some wack encoding: it is tab-separated and stored as UTF-16 (little-endian) rather than UTF-8, so I have to pass `sep` and `encoding` explicitly to read it in.
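To make the encoding point concrete, here is a minimal standard-library sketch. The two-row sample and the temporary file are made up for illustration; they are not the real `pokemon.csv`. It shows what goes wrong with the wrong codec and that the matching codec plus a tab delimiter recovers the rows.

```python
import csv
import os
import tempfile

# Hypothetical two-row sample standing in for the real pokemon.csv.
sample = "english_name\tprimary_type\nFlabébé\tfairy\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")

    # Write the sample as UTF-16 little-endian, the encoding passed to pandas above.
    with open(path, "w", encoding="utf-16-le", newline="") as f:
        f.write(sample)

    # Reading with the wrong codec fails once it reaches the bytes for "é".
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Reading with the matching codec and a tab delimiter recovers the rows.
    with open(path, encoding="utf-16-le", newline="") as f:
        rows = list(csv.reader(f, delimiter="\t"))

print(utf8_failed)
print(rows)
```

The notebook itself does the equivalent through `pd.read_csv(..., sep='\t', encoding='utf-16-le')`.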

In [1]:
import pandas as pd
import numpy as np
import os

# Set seed
np.random.seed(776)

In [2]:
data_path = "./data/pokemon.csv"
data_exists = os.path.isfile(data_path)

if not data_exists:
    # This part requires a kaggle api key. On linux this will be saved to your home directory in .kaggle/kaggle.json
    !kaggle datasets download -d cristobalmitchell/pokedex
    !unzip pokedex.zip -d data

df = (
    # load in the data
    pd.read_csv(data_path, sep='\t', encoding='utf-16-le')
    # select the relevant columns
    .loc[:, ['english_name', 'primary_type', 'secondary_type', 'description']]
    # Change the type columns into categories and handle NaNs in secondary typing
    .assign(
        primary_type=lambda x: x['primary_type'].astype("category"),
        secondary_type=lambda x: x['secondary_type'].fillna("none").astype("category")
    )
)
display(df.head())
display(df.info())
display(df.describe())

Unnamed: 0,english_name,primary_type,secondary_type,description
0,Bulbasaur,grass,poison,There is a plant seed on its back right from t...
1,Ivysaur,grass,poison,"When the bulb on its back grows large, it appe..."
2,Venusaur,grass,poison,Its plant blooms when it is absorbing solar en...
3,Charmander,fire,none,It has a preference for hot things. When it ra...
4,Charmeleon,fire,none,"It has a barbaric nature. In battle, it whips ..."


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 898 entries, 0 to 897
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   english_name    898 non-null    object  
 1   primary_type    898 non-null    category
 2   secondary_type  898 non-null    category
 3   description     898 non-null    object  
dtypes: category(2), object(2)
memory usage: 17.3+ KB


None

Unnamed: 0,english_name,primary_type,secondary_type,description
count,898,898,898,898
unique,898,18,19,896
top,Bulbasaur,water,none,Although it’s alien to this world and a danger...
freq,1,123,429,3


## Remove stopwords and special characters

Here we clean up our text as usual. We want to be careful to keep the accented é used in Pokémon, though: a naive regex that keeps only `a-zA-Z` will eradicate it, which is *kind of* a big deal for a Pokémon project.
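A quick sketch of the difference, using a made-up sentence rather than a real Pokédex entry. The only change between the two patterns is the extra `é` in the character class.

```python
import re

# Hypothetical sentence, not an actual description from the dataset.
text = "This Pokémon's seed grows larger!"

# A naive pattern keeps only unaccented ASCII letters, so "é" becomes a space.
naive = re.sub(r"[^a-zA-Z]", " ", text).split()

# Adding "é" to the character class keeps the word intact.
careful = re.sub(r"[^a-zA-Zé]", " ", text).split()

print(naive)
print(careful)
```

With the naive pattern, "Pokémon" is split into two fragments, which would then be treated as two unrelated tokens downstream.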

In [3]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import re

def clean_text(text: str) -> str:
    # Remove special characters using regex
    text = re.sub(r'[^a-zA-Zé]', ' ', text)
    
    # Tokenize the text
    words = text.split()
    
    # Remove stopwords using nltk
    stop_words = set(stopwords.words('english'))
    filtered_words = [word.lower() for word in words if word.lower() not in stop_words]
    
    # Join the words back into a sentence
    cleaned_text = ' '.join(filtered_words)
    
    return cleaned_text

df["cleaned_text"] = df.description.apply(lambda x: clean_text(x))
df.head()

[nltk_data] Downloading package stopwords to /home/brady/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,english_name,primary_type,secondary_type,description,cleaned_text
0,Bulbasaur,grass,poison,There is a plant seed on its back right from t...,plant seed back right day pokémon born seed sl...
1,Ivysaur,grass,poison,"When the bulb on its back grows large, it appe...",bulb back grows large appears lose ability sta...
2,Venusaur,grass,poison,Its plant blooms when it is absorbing solar en...,plant blooms absorbing solar energy stays move...
3,Charmander,fire,none,It has a preference for hot things. When it ra...,preference hot things rains steam said spout t...
4,Charmeleon,fire,none,"It has a barbaric nature. In battle, it whips ...",barbaric nature battle whips fiery tail around...


In [4]:
# Example of a description before and after cleaning
print(df.description[0])
print(df.cleaned_text[0])

There is a plant seed on its back right from the day this Pokémon is born. The seed slowly grows larger.
plant seed back right day pokémon born seed slowly grows larger


## Split Data

Here our data's issues crop up. Our data is EXTREMELY unbalanced, which you can normally work around with stratification. The issue here is that we have multi-label data that is extremely unbalanced. I can't stratify on both type columns at once, because stratification requires every combination that occurs to appear at least twice. We have many type combinations that only appear once, so we can't meet that criterion without removing many rows of data.
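To illustrate the problem, here is a small sketch that counts how many combinations appear only once. The pairs below are invented for the example, not rows from the dataset.

```python
from collections import Counter

# Hypothetical (primary, secondary) pairs, not rows from the real dataset.
pairs = [
    ("water", "none"), ("water", "none"), ("water", "none"),
    ("grass", "poison"), ("grass", "poison"),
    ("ground", "ghost"),   # appears once
    ("fairy", "steel"),    # appears once
]

counts = Counter(pairs)

# A stratified split needs at least two rows per stratum, so singletons block it.
singletons = [pair for pair, n in counts.items() if n < 2]

print(f"{len(singletons)} of {len(counts)} combinations appear only once")
print(singletons)
```

On the real data, the same count could presumably be taken with `value_counts()` over the two type columns; I haven't included that output here.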

There are other ways to work around this but they're outside the scope of what I'm capable of learning right now.

As such, I stratify on secondary type only, as it is far more unbalanced than primary type. In the end, though, this stratification only helps somewhat.

Honestly, this issue right here has me confident that a built-from-scratch model on this dataset couldn't work. I'd need way more rows of data, perhaps duplicate Pokémon with different descriptions from other games. Aside from that, I'd need an already large model with a sizable vocabulary and embedding space to tackle this. Oh well, we carry on!

In [5]:
from sklearn.model_selection import train_test_split

# So many variables assigned here, I apologize for the mess
names_train, names_test, x_text_train, x_text_test, y_primary_train, y_primary_test, y_secondary_train, y_secondary_test = train_test_split(
    df.english_name, df.cleaned_text, df.primary_type, df.secondary_type,
    test_size=0.2,
    random_state=10,
    stratify=df.secondary_type
)

## Vectorizing Text Data

We'll be using term frequency-inverse document frequency (TF-IDF) to encode our text, but that happens later, so we're keeping this section blank. We can actually do that stage right in the sklearn pipeline itself!
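For intuition about what TF-IDF produces, here is a hand-rolled sketch on three made-up documents. It uses the smoothed IDF formula that scikit-learn documents as its default, followed by L2 normalisation. The helper names (`idf`, `tfidf`, `df_counts`) are my own, and this is not the library's implementation.

```python
import math
from collections import Counter

# Three hypothetical cleaned descriptions.
docs = [
    "plant seed back",
    "plant blooms solar energy",
    "hot things steam",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df_counts = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1

def tfidf(tokens):
    tf = Counter(tokens)
    weights = {term: count * idf(term) for term, count in tf.items()}
    # L2-normalise so every document vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf(tokenized[0])
print(vec)
```

"plant" shows up in two documents, so it ends up with a lower weight than "seed" or "back", which each appear in only one.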

## Vectorize Categorical Data

Things get a bit weird here as we're working with multi-label data. We're going to use `pandas.get_dummies()` to convert our categorical data into a one-hot encoding. We'll then combine the primary and secondary one-hot encoded matrices into one larger matrix.

In [8]:
print(f"Number of Primary Types: {len(list(df.primary_type.unique()))}")
print(f"Number of Secondary Types: {len(list(df.secondary_type.unique()))}")

Number of Primary Types: 18
Number of Secondary Types: 19


In [9]:
def one_hot_encode_categories(primary_cat: pd.Series, secondary_cat: pd.Series) -> pd.DataFrame:
    """
    Goal of this function is to create "dummy variables" for each category in primary/secondary typing
    We essentially use this to create a one-hot encoding for our one-word variables.
    We combine this into one giant dataframe at the end
    """

    # Here we create the dummy vars and then add suffixes to the end to differentiate between primary/secondary
    primary_dummies = pd.get_dummies(primary_cat, dtype=int)
    primary_dummies.columns = [f"{col}_primary" for col in primary_dummies.columns]
    secondary_dummies = pd.get_dummies(secondary_cat, dtype=int)
    secondary_dummies.columns = [f"{col}_secondary" for col in secondary_dummies.columns]

    one_hot_encoding = pd.concat([primary_dummies, secondary_dummies], axis=1)

    return one_hot_encoding
    
y_train = one_hot_encode_categories(y_primary_train, y_secondary_train)
y_test = one_hot_encode_categories(y_primary_test, y_secondary_test)
y_train.head()

Unnamed: 0,bug_primary,dark_primary,dragon_primary,electric_primary,fairy_primary,fighting_primary,fire_primary,flying_primary,ghost_primary,grass_primary,...,grass_secondary,ground_secondary,ice_secondary,none_secondary,normal_secondary,poison_secondary,psychic_secondary,rock_secondary,steel_secondary,water_secondary
621,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
61,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
305,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
633,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
589,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0


In [10]:
y_train.columns

Index(['bug_primary', 'dark_primary', 'dragon_primary', 'electric_primary',
       'fairy_primary', 'fighting_primary', 'fire_primary', 'flying_primary',
       'ghost_primary', 'grass_primary', 'ground_primary', 'ice_primary',
       'normal_primary', 'poison_primary', 'psychic_primary', 'rock_primary',
       'steel_primary', 'water_primary', 'bug_secondary', 'dark_secondary',
       'dragon_secondary', 'electric_secondary', 'fairy_secondary',
       'fighting_secondary', 'fire_secondary', 'flying_secondary',
       'ghost_secondary', 'grass_secondary', 'ground_secondary',
       'ice_secondary', 'none_secondary', 'normal_secondary',
       'poison_secondary', 'psychic_secondary', 'rock_secondary',
       'steel_secondary', 'water_secondary'],
      dtype='object')

In [11]:
y_test.columns

Index(['bug_primary', 'dark_primary', 'dragon_primary', 'electric_primary',
       'fairy_primary', 'fighting_primary', 'fire_primary', 'flying_primary',
       'ghost_primary', 'grass_primary', 'ground_primary', 'ice_primary',
       'normal_primary', 'poison_primary', 'psychic_primary', 'rock_primary',
       'steel_primary', 'water_primary', 'bug_secondary', 'dark_secondary',
       'dragon_secondary', 'electric_secondary', 'fairy_secondary',
       'fighting_secondary', 'fire_secondary', 'flying_secondary',
       'ghost_secondary', 'grass_secondary', 'ground_secondary',
       'ice_secondary', 'none_secondary', 'normal_secondary',
       'poison_secondary', 'psychic_secondary', 'rock_secondary',
       'steel_secondary', 'water_secondary'],
      dtype='object')

In [12]:
# Sanity check to ensure these contain the same columns
set(y_test.columns) == set(y_train.columns)

True

Here we have an example row and can verify that everything checks out like it should!

In [58]:
name = names_train.iloc[0]
display(x_text_train.iloc[0])
display(df.loc[df["english_name"] == name])
y_train.iloc[0]

'sculpted clay ancient times one knows driven continually line boulders'

Unnamed: 0,english_name,primary_type,secondary_type,description,cleaned_text
621,Golett,ground,ghost,They were sculpted from clay in ancient times....,sculpted clay ancient times one knows driven c...


bug_primary           0
dark_primary          0
dragon_primary        0
electric_primary      0
fairy_primary         0
fighting_primary      0
fire_primary          0
flying_primary        0
ghost_primary         0
grass_primary         0
ground_primary        1
ice_primary           0
normal_primary        0
poison_primary        0
psychic_primary       0
rock_primary          0
steel_primary         0
water_primary         0
bug_secondary         0
dark_secondary        0
dragon_secondary      0
electric_secondary    0
fairy_secondary       0
fighting_secondary    0
fire_secondary        0
flying_secondary      0
ghost_secondary       1
grass_secondary       0
ground_secondary      0
ice_secondary         0
none_secondary        0
normal_secondary      0
poison_secondary      0
psychic_secondary     0
rock_secondary        0
steel_secondary       0
water_secondary       0
Name: 621, dtype: int64

# The First Model

Here we use what's called a "label powerset". It's a popular way to deal with multi-label problems, and it's what we'll be using as a basic starter model. We won't be doing any grid search on this or anything, just some basic default parameters.

A label powerset essentially turns a multi-label dataset into a multi-class dataset by turning every category pair into its own class.
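Here is a rough sketch of that transformation in plain Python. The label columns and rows are invented for illustration, and this is not how `skmultilearn` implements it internally.

```python
# Hypothetical multi-hot rows over the labels
# [grass_primary, water_primary, none_secondary, poison_secondary].
y = [
    (1, 0, 0, 1),  # grass / poison
    (0, 1, 1, 0),  # water / none
    (1, 0, 0, 1),  # grass / poison again
    (0, 1, 1, 0),  # water / none again
    (1, 0, 1, 0),  # grass / none
]

# Forward mapping: each distinct label combination becomes one class id.
combo_to_class = {}
for row in y:
    combo_to_class.setdefault(row, len(combo_to_class))

y_multiclass = [combo_to_class[row] for row in y]

# Reverse mapping: turn a predicted class id back into a multi-hot row.
class_to_combo = {cls: combo for combo, cls in combo_to_class.items()}

print(y_multiclass)
print(class_to_combo[2])
```

One consequence worth keeping in mind: a model trained this way can only ever predict type combinations it saw during training.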

[Documentation can be found here](http://scikit.ml/api/skmultilearn.problem_transform.lp.html)

In [56]:
from skmultilearn.problem_transform import LabelPowerset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf',  LabelPowerset(LogisticRegression(max_iter=120)))
])

# Fit the classifier on the training data and then predict
pipe.fit(x_text_train, y_train)
predictions = pipe.predict(x_text_test)

In [15]:
pipe

## Model Evaluation

Here we print out the predictions in a human-readable format. This part took an embarrassingly long time to figure out, as different models can return many different types of sparse matrices.

In [27]:
types = list(y_train.columns)

def showcase_predictions(predictions, number):
    # Show the first `number` test-set predictions next to the true types
    for index, prediction in enumerate(predictions[:number]):
        # First let's extract the real pokemon info
        pokemon_name = names_test.iloc[index]
        type_row = y_test.iloc[index]
        actual_primary, actual_secondary = type_row[type_row != 0].index.tolist()
    
        # And now the predictions
        pred_types = prediction.rows[0]
        pred_primary, pred_secondary = pred_types
        pred_primary = types[pred_primary]
        pred_secondary = types[pred_secondary]
    
        print(f"Pokemon Name: {pokemon_name}")
        print(f"Predicted Types: {pred_primary}, {pred_secondary}")
        print(f"Actual Types: {actual_primary}, {actual_secondary}\n")    

showcase_predictions(predictions, 5)

Pokemon Name: Lickitung
Predicted Types: water_primary, none_secondary
Actual Types: normal_primary, none_secondary

Pokemon Name: Buneary
Predicted Types: water_primary, none_secondary
Actual Types: normal_primary, none_secondary

Pokemon Name: Comfey
Predicted Types: water_primary, none_secondary
Actual Types: fairy_primary, none_secondary

Pokemon Name: Zacian
Predicted Types: normal_primary, none_secondary
Actual Types: fairy_primary, fairy_secondary

Pokemon Name: Happiny
Predicted Types: normal_primary, none_secondary
Actual Types: normal_primary, none_secondary



### Overall Metrics

In [17]:
import sklearn.metrics as metrics
def print_metrics(y_test, predictions):
    accuracy = metrics.accuracy_score(y_test, predictions)
    f1_score = metrics.f1_score(y_test, predictions, average='samples')
    recall = metrics.recall_score(y_test, predictions, average='samples')
    precision = metrics.precision_score(y_test, predictions, average='samples')
    print(f"accuracy: {accuracy:.2f}")
    print(f"f1: {f1_score:.2f}")
    print(f"recall: {recall:.2f}")
    print(f"precision: {precision:.2f}")

print_metrics(y_test, predictions)

accuracy: 0.13
f1: 0.33
recall: 0.33
precision: 0.33
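
The gap between accuracy and F1 comes from how each is computed. For multi-label targets, scikit-learn's `accuracy_score` is subset accuracy (the whole label set has to match), while `average='samples'` scores each row by label overlap and then averages. A minimal sketch with invented label sets:

```python
# Hypothetical true and predicted label sets for three Pokémon.
y_true = [
    {"normal_primary", "none_secondary"},
    {"fairy_primary", "fairy_secondary"},
    {"water_primary", "none_secondary"},
]
y_pred = [
    {"water_primary", "none_secondary"},   # secondary right, primary wrong
    {"normal_primary", "none_secondary"},  # both wrong
    {"water_primary", "none_secondary"},   # both right
]

# Subset accuracy: a row only counts when the whole label set matches.
subset_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def sample_f1(true, pred):
    # F1 for a single row, based on how many labels overlap.
    overlap = len(true & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(true)
    return 2 * precision * recall / (precision + recall)

samples_f1 = sum(sample_f1(t, p) for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"subset accuracy: {subset_accuracy:.2f}")
print(f"samples F1: {samples_f1:.2f}")
```

Getting only the secondary type right earns half credit under F1 but nothing under accuracy, which is why accuracy sits so much lower.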


### Class Specific Metrics

This report is large but gives metrics for each specific category. Important not to get lost in the weeds here. For some additional context, "support" is how many times a label actually occurs in the true test labels, not how often the model predicted it. So if `bug_primary` has a support of 13, that means 13 Pokémon in the test set have bug as their primary type.
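
As a small worked example of the per-label columns, here is a sketch that computes precision, recall and support by hand, with support counted from the true labels as in scikit-learn's classification report. The six labels and the `label_report` helper are made up for illustration.

```python
# Hypothetical true and predicted primary types for six Pokémon.
y_true = ["water", "water", "grass", "normal", "water", "bug"]
y_pred = ["water", "water", "water", "water", "grass", "water"]

def label_report(label):
    # True positives: rows where the label is both predicted and correct.
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # how often the model said `label`
    support = sum(t == label for t in y_true)    # how often `label` is the truth
    precision = tp / predicted if predicted else 0.0
    recall = tp / support if support else 0.0
    return precision, recall, support

print(label_report("water"))
print(label_report("bug"))
```

Note that "bug" still has a support of 1 even though the model never predicts it; that is the pattern behind the rows of zeros in the report.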

In [18]:
# Convert to dict to make it easier to work with
# This is beneficial if you want to make visualizations from this for instance
report = metrics.classification_report(y_test, predictions, target_names=types, output_dict=True)
report = pd.DataFrame(report).transpose()
report

  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,precision,recall,f1-score,support
bug_primary,0.0,0.0,0.0,13.0
dark_primary,0.0,0.0,0.0,5.0
dragon_primary,0.0,0.0,0.0,6.0
electric_primary,0.6,0.5,0.545455,6.0
fairy_primary,0.0,0.0,0.0,9.0
fighting_primary,0.0,0.0,0.0,6.0
fire_primary,0.0,0.0,0.0,5.0
flying_primary,0.0,0.0,0.0,2.0
ghost_primary,0.0,0.0,0.0,8.0
grass_primary,0.666667,0.2,0.307692,20.0


Let's check which types even *have* a score to examine.

In [19]:
report.loc[report["f1-score"] != 0]

Unnamed: 0,precision,recall,f1-score,support
electric_primary,0.6,0.5,0.545455,6.0
grass_primary,0.666667,0.2,0.307692,20.0
normal_primary,0.15,0.428571,0.222222,21.0
water_primary,0.169811,0.72,0.274809,25.0
none_secondary,0.477778,1.0,0.646617,86.0
micro avg,0.333333,0.333333,0.333333,360.0
macro avg,0.055791,0.076988,0.053967,360.0
weighted avg,0.181715,0.333333,0.212701,360.0
samples avg,0.333333,0.333333,0.333333,360.0


# The Second Model

Here we'll be using a "classifier chain", which is another way to handle multi-label data. It can wrap a ton of different classifiers, so it's perfect for grid search! To be honest, I'm kinda just throwing stuff at the wall here.
* [Docs on classifier chains](http://scikit.ml/api/skmultilearn.problem_transform.cc.html)
* [Docs on Support Vector Classifiers](https://scikit-learn.org/stable/modules/svm.html)
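
Roughly, a classifier chain trains one binary classifier per label, and each classifier also gets the labels earlier in the chain as extra input features. The sketch below only shows that data flow on a tiny invented dataset. It trains nothing, and `chain_inputs` is a hypothetical helper, not part of `skmultilearn`.

```python
# Hypothetical tiny dataset: two text features per row, three binary labels.
X = [[0.2, 0.0], [0.0, 0.7], [0.5, 0.1]]
Y = [[1, 0, 1], [0, 1, 0], [1, 0, 0]]

def chain_inputs(X, Y, position):
    """Training inputs for the classifier at `position` in the chain:
    the original features plus every label earlier in the chain."""
    return [features + labels[:position] for features, labels in zip(X, Y)]

# Show how the first row's input grows as we move down the chain.
for position in range(3):
    print(position, chain_inputs(X, Y, position)[0])
```

At prediction time the true labels aren't available, so each classifier's own predictions are passed down the chain instead.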

In [51]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from skmultilearn.problem_transform import ClassifierChain
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', ClassifierChain()) 
])

# Try multiple classifiers with their own sets of parameters (this will take a while)
classifiers = [
    {
        'clf__classifier': [SVC()],  # Support Vector Classifier
        'clf__classifier__kernel': ['linear', 'rbf'],
        'clf__classifier__C': [0.1, 1, 10]
    },
    {
        'clf__classifier': [RandomForestClassifier()],  # Random Forest Classifier
        'clf__classifier__n_estimators': [100, 200, 300],
        'clf__classifier__max_depth': [None, 5, 10]
    }
]

# Use default 5-fold cross validation
grid_search = GridSearchCV(pipe, classifiers, cv=5)
grid_search.fit(x_text_train, y_train)

# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_
print(f"Best Parameters: {best_params}")
print(f"Best Score: {best_score}")

# Extract the best model and predict
best_estimator = grid_search.best_estimator_
predictions = best_estimator.predict(x_text_test)

## Model Evaluation

In [52]:
showcase_predictions(predictions.tolil(), 5)

Pokemon Name: Lickitung
Predicted Types: water_primary, none_secondary
Actual Types: normal_primary, none_secondary

Pokemon Name: Buneary
Predicted Types: water_primary, none_secondary
Actual Types: normal_primary, none_secondary

Pokemon Name: Comfey
Predicted Types: grass_primary, none_secondary
Actual Types: fairy_primary, none_secondary

Pokemon Name: Zacian
Predicted Types: water_primary, none_secondary
Actual Types: fairy_primary, fairy_secondary

Pokemon Name: Happiny
Predicted Types: water_primary, none_secondary
Actual Types: normal_primary, none_secondary



### Overall Metrics

In [53]:
print_metrics(y_test, predictions)

accuracy: 0.14
f1: 0.34
recall: 0.34
precision: 0.35


### Class Specific Metrics

In [54]:
report = metrics.classification_report(y_test, predictions, target_names=types, output_dict=True)
report = pd.DataFrame(report).transpose()
report

  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,precision,recall,f1-score,support
bug_primary,0.2,0.076923,0.111111,13.0
dark_primary,0.0,0.0,0.0,5.0
dragon_primary,0.0,0.0,0.0,6.0
electric_primary,0.666667,0.333333,0.444444,6.0
fairy_primary,0.0,0.0,0.0,9.0
fighting_primary,0.0,0.0,0.0,6.0
fire_primary,0.333333,0.2,0.25,5.0
flying_primary,0.0,0.0,0.0,2.0
ghost_primary,0.666667,0.25,0.363636,8.0
grass_primary,0.5,0.35,0.411765,20.0


Let's check which types even *have* a score to examine.

In [55]:
report.loc[report["f1-score"] != 0]

Unnamed: 0,precision,recall,f1-score,support
bug_primary,0.2,0.076923,0.111111,13.0
electric_primary,0.666667,0.333333,0.444444,6.0
fire_primary,0.333333,0.2,0.25,5.0
ghost_primary,0.666667,0.25,0.363636,8.0
grass_primary,0.5,0.35,0.411765,20.0
ice_primary,1.0,0.111111,0.2,9.0
normal_primary,0.181818,0.190476,0.186047,21.0
psychic_primary,0.6,0.2,0.3,15.0
rock_primary,0.2,0.125,0.153846,8.0
steel_primary,1.0,0.142857,0.25,7.0


At first glance, the overall metrics make this model look basically equivalent to the previous one, but the filtered report shows more nuance. Way more types are getting at least some correct predictions than in the previous model. That, to me, is an enormous improvement.

Overall, many of the types have high precision scores but low recall scores. This means the model is overly cautious about predicting them: it misses a ton of each type but tends to be right when it does predict one.

On some of the more frequent types like water, the precision is horrible and recall is high. That just means it's predicting water a TON, which obviously catches the true water types, but it's lumping in a bunch of Pokémon that it shouldn't be.

Overall the model is still total garbage, but this data may just be very hard to work with as it is.