Name: Andres Figueroa

Email: andresfigueroa@brandeis.edu

Description: The goal of this project is to build a model that predicts whether a Pokémon is legendary
based on its stats and other features.


#### Building Our DataFrame

In [2]:
import pandas as pd

df = pd.read_csv('pokemon.csv')

print(df.columns)
print(df.head())

Index(['abilities', 'against_bug', 'against_dark', 'against_dragon',
       'against_electric', 'against_fairy', 'against_fight', 'against_fire',
       'against_flying', 'against_ghost', 'against_grass', 'against_ground',
       'against_ice', 'against_normal', 'against_poison', 'against_psychic',
       'against_rock', 'against_steel', 'against_water', 'attack',
       'base_egg_steps', 'base_happiness', 'base_total', 'capture_rate',
       'classfication', 'defense', 'experience_growth', 'height_m', 'hp',
       'japanese_name', 'name', 'percentage_male', 'pokedex_number',
       'sp_attack', 'sp_defense', 'speed', 'type1', 'type2', 'weight_kg',
       'generation', 'is_legendary'],
      dtype='object')
                     abilities  against_bug  against_dark  against_dragon  \
0  ['Overgrow', 'Chlorophyll']          1.0           1.0             1.0   
1  ['Overgrow', 'Chlorophyll']          1.0           1.0             1.0   
2  ['Overgrow', 'Chlorophyll']          1.0         

---

#### Defining Our ML Problem


##### Dataset Explanation
The dataset I chose contains information on all 802 Pokémon from seven generations. It includes statistics such as base stats, types, abilities, height, weight, egg steps, capture rate, experience growth, and more.


##### What We Are Predicting (Label)
The label I will be predicting is 'is_legendary'. It is binary: a Pokémon is either legendary or not.


##### Problem Type
This is a supervised learning problem, because we are training a model using labeled data where the target (is_legendary) is known.

More specifically, it is a binary classification problem, since the model predicts one of two classes: legendary or not legendary.



##### Importance
Though this is a personal project (I liked Pokémon growing up), a model like this could help companies or developers working on games enrich user interaction, balance game dynamics, and target content creation around Pokémon classification.

---

#### Understanding Our Data

##### Missing Data

In [9]:
df.isnull().sum()[df.isnull().sum() > 0]

height_m            20
percentage_male     98
type2              384
weight_kg           20
dtype: int64
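
To put missing counts like the ones above in context, it can help to also look at them as a share of all rows. The sketch below does this on a small made-up frame; the `toy` DataFrame and its values are assumptions for illustration, not rows from pokemon.csv.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are made up for illustration.
toy = pd.DataFrame({
    "height_m": [0.7, np.nan, 2.0, 1.1],
    "type2": ["poison", None, None, "flying"],
    "hp": [45, 60, 80, 100],
})

missing_count = toy.isnull().sum()   # number of missing values per column
missing_share = toy.isnull().mean()  # fraction of rows missing per column

# Keep only the columns that actually have missing values.
summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
summary = summary[summary["missing"] > 0]
print(summary)
```

Running the same two lines on `df` instead of `toy` would show how much of each column is affected before deciding whether to drop the column or the rows.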

##### Note:
While 'height_m' and 'weight_kg' are numerical features, I will probably drop them, because Pokémon, even legendary ones, range from big to small, and I believe height and weight have no bearing on whether a Pokémon is legendary. I believe the same about types: a legendary Pokémon can have almost any typing, so I will probably drop 'type1' and 'type2'. 'percentage_male' is interesting because I remembered that there are genderless Pokémon, which explains the missing values. Still, I believe gender doesn't play a role in determining whether a Pokémon is legendary. By removing these features, I am simplifying the dataset and avoiding unnecessary noise.

In [7]:
to_drop = ['abilities', 'name', 'percentage_male', 'pokedex_number', 'generation', 'type1', 'type2', 'japanese_name', 'base_egg_steps', 'classfication'] # Some features that I think are irrelevant to the problem
df = df.drop(columns = to_drop, errors = 'ignore')

to_keep = []
for col in df.columns:
    # Drop every 'against_*' feature: type weaknesses tell us nothing about legendary status
    if 'against' not in col:
        to_keep.append(col)

df = df[to_keep].copy()  # work on a copy so the column assignment below is unambiguous
#print(df.dtypes)
df['capture_rate'] = pd.to_numeric(df['capture_rate'], errors='coerce')  # entries that are not plain numbers become NaN
#print(df.dtypes)
df = df.dropna()  # drop rows that still have missing values (e.g. height_m, weight_kg)

df.columns

Index(['attack', 'base_happiness', 'base_total', 'capture_rate', 'defense',
       'experience_growth', 'height_m', 'hp', 'sp_attack', 'sp_defense',
       'speed', 'weight_kg', 'is_legendary'],
      dtype='object')

Explanation:
The variable 'to_drop' lists features that I consider irrelevant to the ML problem (whether a Pokémon is legendary or not), so they are dropped.

I dropped all columns (features) with 'against' in their name, because a Pokémon's type weaknesses tell us nothing about whether it is legendary.

I decided to drop the feature 'base_egg_steps' because legendary Pokémon don't hatch from eggs. Keeping it would essentially cause data leakage, giving the model the answer rather than letting it learn anything.

'capture_rate' was of the object type, specifically a string, so I converted it to a numerical value. Rows with missing values were then removed with dropna().
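
As a minimal sketch of what the 'capture_rate' conversion does, the example below runs `pd.to_numeric(..., errors='coerce')` on a made-up series. The values, including the one malformed entry, are hypothetical and only assume that an object-typed column may contain something that is not a plain number.

```python
import pandas as pd

# Hypothetical values: numeric strings plus one entry that is not a plain number.
raw = pd.Series(["45", "255", "3", "30 (form A)"], name="capture_rate")

# errors='coerce' turns anything unparseable into NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")
print(converted.isna().sum(), "value(s) could not be parsed")

# dropna() then removes those rows, as in the cell above.
cleaned = converted.dropna()
print(cleaned.tolist())
```

Counting the NaN values right after the conversion is a cheap way to see how many rows the later `dropna()` will remove because of this column.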

#### Defining Label and Features

In [3]:
y = df['is_legendary']
X = df.drop(columns = 'is_legendary')

#### Splitting the Data, Train and Test

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1234, stratify = y)

I chose to stratify so that the class distribution stays the same in both the training and test datasets. Legendary Pokémon are rare, so I want to make sure that each dataset actually has legendary Pokémon in it.
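
The effect of `stratify` can be checked on a small synthetic example. The labels below are made up (90 negatives, 10 positives) and stand in for a rare class such as legendary Pokémon; `X_demo` and `y_demo` are hypothetical names used only for this sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 90 negatives, 10 positives (made-up numbers).
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1234, stratify=y_demo
)

# With stratify, both splits keep the 10% positive share.
print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```

Without `stratify`, the number of positives in the 20-row test split would vary with the random seed and could, in an unlucky draw, be zero.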

#### Training a Decision Tree

In [5]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

dt = DecisionTreeClassifier(max_depth=4, random_state=42)
dt.fit(X_train, y_train)

y_pred = dt.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9871794871794872
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       142
           1       0.88      1.00      0.93        14

    accuracy                           0.99       156
   macro avg       0.94      0.99      0.96       156
weighted avg       0.99      0.99      0.99       156



The values above are performance metrics (accuracy, precision, recall, F1-score), numbers that measure how good the model's predictions were.

Accuracy tells us the fraction of all predictions that were correct (0.9871).

Precision tells us how many of the predicted legendary Pokémon were actually legendary (0.88).

Recall tells us how many legendary Pokémon were found out of all the actual legendary Pokémon (1.00).

F1-score is the harmonic mean of precision and recall, so it balances the two (0.93 for the legendary class).

Support is the number of actual samples of each class in the test set: 142 non-legendary and 14 legendary Pokémon.

The model catches all 14 legendary Pokémon in the test set (100% recall). It is very accurate (0.9871), but a precision of 0.88 means that 2 non-legendary Pokémon were wrongly flagged as legendary.
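
The reported numbers can be reproduced by hand from the underlying counts. The counts below are inferred from the classification report rather than read from a confusion matrix: a support of 14 with recall 1.00 gives 14 true positives, and a precision of 0.88 on 14 true positives is only consistent with 2 false positives (14 / 16 = 0.875).

```python
# Counts inferred from the classification report above (not read from a confusion matrix).
tp = 14        # legendary, predicted legendary
fn = 0         # legendary, predicted non-legendary (recall is 1.00)
fp = 2         # non-legendary, predicted legendary
tn = 142 - fp  # the remaining non-legendary test samples

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.4f}")
```

Calling `sklearn.metrics.confusion_matrix(y_test, y_pred)` on the real predictions would be the direct way to confirm these counts.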