<a href="https://www.kaggle.com/code/alvinhanafie/conquering-pokemon-based-on-stats-and-capture-rate?scriptVersionId=178677627" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Capture and Conquer: A Strategic Guide to Pokémon Based on Stats and Capture Rate

Using the Pokémon dataset, we assign each Pokémon to a cluster based on its stats and capture rate. The purpose of the clustering is to inform players about each Pokémon's combat proficiency and rarity. Players can use the clusters as a guide to judge which Pokémon suit their play style, how likely they are to survive in battle, and whether they are worth capturing.

In [None]:
!pip install gap-stat

In [None]:
# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import scipy.stats as stats
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from yellowbrick.cluster import SilhouetteVisualizer
from sklearn.datasets import make_blobs
from gap_statistic import OptimalK

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import warnings
warnings.filterwarnings('ignore')

In [None]:
# graphic settings

sns.set_theme(rc={'figure.figsize':(20.7,8.27)})
sns.set_style("whitegrid")
sns.set_palette("dark")  # color_palette() only returns a palette; set_palette() applies it
plt.style.use("fivethirtyeight")

## Import Data

In [None]:
df = pd.read_csv('/kaggle/input/pokemon/pokemon.csv')

## Data Analysis

In [None]:
df.info()

## Dataset Description

name: The English name of the Pokémon

japanese_name: The Original Japanese name of the Pokémon

pokedex_number: The entry number of the Pokémon in the National Pokedex

percentage_male: The percentage of the species that are male. Blank if the Pokémon is genderless.

type1: The Primary Type of the Pokémon

type2: The Secondary Type of the Pokémon

classification: The Classification of the Pokémon as described by the Sun and Moon Pokedex

height_m: Height of the Pokémon in metres

weight_kg: The Weight of the Pokémon in kilograms

capture_rate: Capture Rate of the Pokémon

base_egg_steps: The number of steps required to hatch an egg of the Pokémon

abilities: A stringified list of abilities that the Pokémon is capable of having

experience_growth: The Experience Growth of the Pokémon

base_happiness: Base Happiness of the Pokémon

against_?: Eighteen features that denote the amount of damage taken against an attack of a particular type

hp: The Base HP of the Pokémon

attack: The Base Attack of the Pokémon

defense: The Base Defense of the Pokémon

sp_attack: The Base Special Attack of the Pokémon

sp_defense: The Base Special Defense of the Pokémon

speed: The Base Speed of the Pokémon

generation: The numbered generation which the Pokémon was first introduced

is_legendary: Denotes if the Pokémon is legendary.


Source: https://www.kaggle.com/datasets/rounakbanik/pokemon?resource=download

In [None]:
# overview of the first 5 Pokémon

df.head()

In [None]:
# Descriptive statistics for numerical features

df.describe()

In [None]:
# Descriptive statistics for categorical features

df.describe(include='object')

There are 34 numerical features and 7 categorical features in the Pokémon dataset. For this clustering, we will select only the features that are relevant to Pokémon combat capability and capture rate.

By focusing on stats and capture rates, the clustering analysis remains clear and straightforward, avoiding the complexity that might arise from incorporating too many variables. Clusters based on stats and capture rates are also easier for players to interpret, providing clear insights into different categories of Pokémon.

Stats determine a Pokémon's effectiveness in battles, and capture rates affect how easily players can obtain these Pokémon, both of which are central to gameplay strategy. Players often make decisions based on these two aspects when forming teams and planning captures, making them highly relevant for clustering.


## Duplicate Data Handling

The name column has 801 unique values, which equals the total number of rows in the dataset. Every row therefore describes a different Pokémon, so there should be no duplicate rows in the dataset.

In [None]:
df.duplicated().sum()

There are no duplicate rows in the Pokémon dataset.
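The two checks above answer slightly different questions: `df.duplicated()` flags rows that are identical in every column, while name uniqueness only looks at one column. A minimal sketch on a made-up three-row frame (the values are hypothetical, not taken from the dataset) shows the difference:

```python
import pandas as pd

# Hypothetical toy data: the name "Pikachu" appears twice with different hp values
toy = pd.DataFrame({"name": ["Pikachu", "Raichu", "Pikachu"], "hp": [35, 60, 40]})

# Rows identical across every column
full_row_duplicates = int(toy.duplicated().sum())

# Rows that repeat an earlier value in the name column only
name_duplicates = int(toy.duplicated(subset="name").sum())

# Direct uniqueness check on a single column
names_unique = toy["name"].is_unique

print(full_row_duplicates, name_duplicates, names_unique)
```

Here the full-row check finds nothing even though a name repeats, which is why checking both is a reasonable precaution.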

## Feature Selection

As explained before, not all features directly influence a Pokémon's combat capability or capture rate, for example base_happiness, classification, and base_egg_steps. To keep the clusters focused on combat proficiency and capture rate, the following features are selected for further analysis:

- hp
- attack
- sp_attack
- defense
- sp_defense
- speed
- capture_rate

In [None]:
# use relevant features for clustering

selected_features = ['hp', 'attack', 'defense', 'sp_attack', 'sp_defense', 'speed', 'capture_rate']
pokemon_df = df[selected_features]

In [None]:
# check data type of relevant features

pokemon_df.info()

In [None]:
pokemon_df.describe()

In [None]:
pokemon_df.describe(include=object)

Each feature has a similar mean and median, which suggests that the distributions are roughly symmetric rather than heavily skewed.

We will inspect this further in the outlier handling section.
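Comparing mean and median is a quick symmetry check, and the sample skewness gives a single number for the same idea. The sketch below uses two made-up series (not dataset values) to illustrate how the two measures move together:

```python
import pandas as pd

# Hypothetical toy series, chosen only to illustrate the idea
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 1, 2, 10])

# For a symmetric series the mean equals the median and the skewness is zero
sym_gap = symmetric.mean() - symmetric.median()
sym_skew = symmetric.skew()

# A long right tail pulls the mean above the median and gives positive skewness
right_gap = right_skewed.mean() - right_skewed.median()
right_skew = right_skewed.skew()

print(sym_gap, sym_skew, right_gap, right_skew)
```

A similar mean and median is therefore evidence of symmetry, but not by itself proof of a normal distribution.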

## Missing Value Handling

In [None]:
pokemon_df.isna().sum()

Even though the initial check shows no missing values in any feature, we will also check for "hidden" missing values, such as unknown or invalid entries, by looking at the value counts of each feature.

In [None]:
for col in pokemon_df.columns:
    print(f"============= {col} =================")
    display(pokemon_df[col].value_counts())
    print()

After looking through the unique values of each feature, hp and capture_rate need to be analyzed further. The other features have no missing or invalid values.
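Scanning value counts by eye works for a small dataset. As a complementary sketch, non-numeric entries in a column that should be numeric can be located with `pd.to_numeric(..., errors="coerce")`, which turns unparseable strings into NaN. The series below is a small stand-in for the capture_rate column:

```python
import pandas as pd

# Small stand-in for a column that was read as strings
raw = pd.Series(["45", "255", "30 (Meteorite)255 (Core)", "3"], name="capture_rate")

# Values that cannot be parsed as numbers become NaN
as_number = pd.to_numeric(raw, errors="coerce")

# Keep the original strings of the entries that failed to parse
invalid_entries = raw[as_number.isna()]

print(invalid_entries.tolist())
```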

### hp feature

One Pokémon has an hp value of only 1, which is quite suspicious. We should check the name of this Pokémon to confirm whether the value is valid.

In [None]:
df[df['hp'] == 1]

The Pokémon with 1 HP is Shedinja. According to Bulbapedia, Shedinja really does have only 1 HP. Hence, this value is not treated as a missing or invalid value.

https://bulbapedia.bulbagarden.net/wiki/Shedinja_(Pok%C3%A9mon)#1_HP_trivia

### capture_rate feature

The capture_rate feature is still stored as a categorical (object) data type.

Based on the unique values above, capture rate is supposed to be a numerical (int) feature. One row has the value '30 (Meteorite)255 (Core)', which is a string. We will check which Pokémon it is.

In [None]:
df[df['capture_rate'] == '30 (Meteorite)255 (Core)']

The Pokémon with a capture rate of '30 (Meteorite)255 (Core)' is Minior. According to Bulbapedia, Minior has two forms (Meteor and Core).

In Meteor form its capture rate is 30, while in Core form its capture rate is 255.

Comparing the stats in the dataset with those on the website shows that the Minior entry in the dataset is in Core form.

Since we know the actual capture rate (255), we will use it to replace '30 (Meteorite)255 (Core)'.

Median or mode imputation is not used, since the actual value is available from the website.

https://bulbapedia.bulbagarden.net/wiki/Minior_(Pok%C3%A9mon)#Meteor_Form

In [None]:
# replace '30 (Meteorite)255 (Core)' with '255'

pokemon_df = pokemon_df.copy()

pokemon_df.loc[pokemon_df['capture_rate'] == '30 (Meteorite)255 (Core)', 'capture_rate'] = '255'

In [None]:
# check whether '30 (Meteorite)255 (Core)' has been replaced by 255

print(f"============= capture_rate =================")
display(pokemon_df['capture_rate'].value_counts())
print()

The count of the value 255 has increased from 69 to 70 after the replacement, so all values now share the same format.

No more missing or invalid values are found in the dataset.

## Feature Encoding

One problem remains in the dataset: the capture_rate feature is still stored as a categorical data type. Hence, this feature will be converted to numerical values.

In [None]:
# feature encoding to numerical (int) data type

pokemon_df['capture_rate'] = pokemon_df['capture_rate'].astype(int)

A direct conversion to numerical values is chosen over other encoding techniques such as One-Hot Encoding or Label Encoding, because the capture_rate feature already contains the actual numerical capture rates. It was read as a categorical data type only because one special Pokémon (Minior) had two capture rates listed, depending on its form.

With a direct conversion to the integer data type, the feature represents the actual capture rate of each Pokémon rather than a substitute code, such as the one mean encoding would produce.
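To illustrate why a direct cast is preferable here, the sketch below compares `astype(int)` with integer category codes on three string values. The category codes stand in for what a label encoder would do; the values are illustrative:

```python
import pandas as pd

# Capture rates stored as strings, as in the raw column
raw = pd.Series(["45", "255", "3"])

# Direct cast keeps the real magnitudes
direct = raw.astype(int)

# Category codes follow the sorted order of the strings ("255" < "3" < "45"),
# so the codes no longer reflect how large each capture rate is
label_codes = raw.astype("category").cat.codes

print(direct.tolist(), label_codes.tolist())
```

With the codes, the highest capture rate (255) ends up with the smallest number, which would distort any distance-based clustering.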

In [None]:
# check data type of capture_rate after feature encoding

pokemon_df.info()



The capture_rate feature has been converted from a categorical to a numerical data type, so it can now be used in the clustering model.



## Correlation Analysis

In [None]:
# heatmap correlation

corr = pokemon_df.corr()

mask = np.triu(np.ones_like(corr, dtype=bool))

sns.heatmap(corr, cmap = 'Blues', annot=True, mask = mask)
plt.show()

All stats features are positively correlated with each other, and all of them are negatively correlated with capture rate. This is a good sign for clustering, since the features are moderately correlated with each other (with the exception of speed and defense).

The capture rate is inversely correlated with every stat, meaning that the stronger the Pokémon (HP, attack, defense, special attack, special defense, and speed), the lower its capture rate. This aligns with game balance: more powerful Pokémon are harder to catch and require more effort to obtain, which adds challenge and strategy to the game.

## Outlier Handling

In [None]:
# function to check histogram, distribution plot, and boxplot

def check_plot(df, variable):
    # check distribution plot from variable in df.

    # figure size and title
    plt.figure(figsize=(16, 4))
    plt.suptitle(f' Outlier Analysis for {variable} feature', fontsize=16, y=1.05)

    # histogram
    plt.subplot(1, 3, 1)
    sns.histplot(df[variable], bins = 30)
    plt.title('Histogram')

    # distribution (Q-Q) plot
    plt.subplot(1, 3, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)
    plt.ylabel('Variable quantiles')

    # box plot
    plt.subplot(1, 3, 3)
    sns.boxplot(y=df[variable])
    plt.title('Boxplot')

    plt.show()


In [None]:
# loop over the features to plot their distributions

for col in selected_features:
    check_plot(pokemon_df, col)

All features except capture_rate have outliers at the high end, especially hp, defense, and sp_defense, which have extremely high outlier values. The IQR boundaries of each feature will be checked for outlier handling.

In [None]:
# function to compute the upper and lower IQR boundaries of a column

def find_outlier_boundary(df, variable):

    IQR = df[variable].quantile(0.75) - df[variable].quantile(0.25)

    lower_boundary = df[variable].quantile(0.25) - (IQR * 1.5)
    upper_boundary = df[variable].quantile(0.75) + (IQR * 1.5)

    return upper_boundary, lower_boundary

In [None]:
# dataframe comparing the IQR boundaries with the minimum and maximum value of each column

pd.DataFrame(data = {'Upper_IQR': [find_outlier_boundary(pokemon_df, col)[0] for col in selected_features],
                     'Maximum': [pokemon_df[col].max() for col in selected_features],
                     'Lower_IQR': [find_outlier_boundary(pokemon_df, col)[1]  for col in selected_features],
                     'Minimum': [pokemon_df[col].min() for col in selected_features]},
             index = [col for col in selected_features])

The lower IQR boundary is negative for every feature except hp, and every feature except hp has a minimum value above its lower boundary. Hence, only hp (whose minimum is 1) will have values raised to its lower boundary (5); the other features have no low outliers.

At the upper end, every feature except capture_rate has a maximum value above its upper IQR boundary. Values above the upper boundary will be replaced with the boundary itself, so that the distributions become less skewed before clustering.

In [None]:
# replace the outliers with the upper and lower IQR boundaries

for col in selected_features:
    Population_upper_limit, Population_lower_limit = find_outlier_boundary(pokemon_df, col)

    pokemon_df[col]= np.where(pokemon_df[col] > Population_upper_limit, Population_upper_limit,
                       np.where(pokemon_df[col] < Population_lower_limit, Population_lower_limit, pokemon_df[col]))
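As a worked illustration of the same capping rule, the sketch below applies the 1.5 × IQR boundaries to a small made-up series and uses `Series.clip`, which is an equivalent alternative to the nested `np.where` call. The numbers are hypothetical:

```python
import pandas as pd

# Hypothetical values with one very large entry
values = pd.Series([1, 40, 50, 60, 70, 80, 255])

q1, q3 = values.quantile(0.25), values.quantile(0.75)  # 45.0 and 75.0
iqr = q3 - q1                                          # 30.0

lower = q1 - 1.5 * iqr  # 0.0
upper = q3 + 1.5 * iqr  # 120.0

# Values outside [lower, upper] are moved onto the nearest boundary
capped = values.clip(lower=lower, upper=upper)

print(lower, upper, capped.tolist())
```

Only 255 lies outside the boundaries here, so it is the only value that changes (to 120).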

In [None]:
# loop over the features to plot their distributions after outlier treatment

for col in selected_features:
    check_plot(pokemon_df, col)

After the IQR-based outlier treatment, no feature has outliers left, as the boxplots show. The distributions also look more balanced once the extreme values are capped at the IQR boundaries, as seen in the histograms, probability plots, and boxplots.

The K-means algorithm is likely to perform better after outlier handling, because each centroid is the mean of its cluster and the objective is the sum of squared distances to the centroids, so extreme values have a disproportionate influence.

## Feature Scaling

In [None]:
# standardize the features so that they are on similar scales

scaler = StandardScaler()
scaled_features = scaler.fit_transform(pokemon_df)

In [None]:
# comparing each features before and after feature scaling

scaled_features = pd.DataFrame(scaled_features,columns= selected_features)

for col in selected_features:
    plt.subplot(121)
    sns.kdeplot(pokemon_df[col])
    plt.title(f'original distribution of {col} feature')

    plt.subplot(122)
    sns.kdeplot(scaled_features[col])
    plt.title(f'scaled distribution of {col} feature')
    plt.show()

All features have been standardized to zero mean and unit variance, so they share a similar range. Standardization does not change the shape of a distribution, but it ensures that no feature overpowers the others in the distance calculation during clustering.
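For reference, `StandardScaler` computes z = (x - mean) / std for each feature, using the population standard deviation. The sketch below reproduces that formula with NumPy on made-up values. The `skewness` helper is a hypothetical function written for this illustration; it shows that the shape of the distribution is left unchanged:

```python
import numpy as np

def skewness(values):
    # Moment-based skewness; hypothetical helper for this illustration only
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Hypothetical feature values
x = np.array([3.0, 45.0, 45.0, 60.0, 190.0, 255.0])

# np.std defaults to ddof=0 (population standard deviation)
z = (x - x.mean()) / x.std()

print(round(z.mean(), 6), round(z.std(), 6))
print(round(skewness(x), 6), round(skewness(z), 6))
```

The scaled values have mean 0 and standard deviation 1, while the skewness before and after scaling is the same.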

## Evaluate optimal cluster number by Elbow method, Silhouette Score, and Gap Statistic method

To obtain the optimal number of clusters, we will evaluate the Elbow method, the Silhouette Score, and the Gap Statistic method.

### Elbow Method

In [None]:
# Elbow Method for optimal k (cluster number)

inertia = []
for n_clusters in range(1, 11):
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=300, n_init=10, random_state=42)
    kmeans.fit(scaled_features)
    inertia.append(kmeans.inertia_)

# Plotting the Elbow Curve

plt.plot(range(1, 11), inertia, marker='o', markersize= 12, linestyle='-', color='b')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.xticks(range(1, 11))
plt.grid(True)
plt.show()

In [None]:
# input elbow analysis results to dataframe

inertia_values = pd.Series(inertia)

# Absolute decrease in inertia compared to the previous cluster number
inertia_decrease = inertia_values.shift(1) - inertia_values

# Calculate percentage decrease

percentage_decrease = (inertia_decrease / inertia_values.shift(1)) * 100

# present elbow analysis results in dataframe

inertia_table = pd.concat([inertia_decrease, percentage_decrease], axis=1)
inertia_table.reset_index(inplace=True)
inertia_table.columns = ['Cluster Number', 'Inertia Decrease', 'Percentage Inertia Decrease (%)']
inertia_table['Cluster Number'] += 1

inertia_table

Based on the elbow analysis shown in the graph and table, there is no significant decrease in inertia after k = 4 (each further cluster reduces inertia by less than 10%). Hence, the optimal number of clusters according to the elbow method is 4.
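The 10% rule of thumb can also be applied programmatically. The sketch below uses hypothetical inertia values (not the notebook's results) and picks the last k before the percentage decrease first falls under the threshold:

```python
import pandas as pd

# Hypothetical inertia values for k = 1..5, for illustration only
inertia_by_k = pd.Series([1000.0, 600.0, 450.0, 380.0, 350.0], index=range(1, 6))

# Percentage decrease relative to the previous k (NaN for k = 1)
pct_drop = -inertia_by_k.pct_change() * 100

# First k whose improvement over k - 1 is below the threshold
threshold = 10.0
first_small_drop = pct_drop[pct_drop < threshold].index[0]

# The elbow is the last k that still gave a worthwhile improvement
elbow_k = first_small_drop - 1

print(pct_drop.round(2).tolist(), elbow_k)
```

The threshold is a judgment call, so the result is best read together with the elbow plot rather than on its own.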

### Silhouette Score Method

In [None]:
# range of cluster number for experiment

min_clusters = 2
max_clusters = 10

# Calculate silhouette score for different numbers of clusters

silhouette_scores = []

for n_clusters in range(min_clusters, max_clusters+1):
    kmeans = KMeans(n_clusters=n_clusters,  init='k-means++', max_iter=300, n_init=10, random_state = 42)
    cluster_labels = kmeans.fit_predict(scaled_features)
    silhouette_avg = silhouette_score(scaled_features, cluster_labels)
    silhouette_scores.append(silhouette_avg)

# Find the optimal number of clusters with the highest silhouette score
optimal_num_clusters = np.argmax(silhouette_scores) + min_clusters

range_n_clusters = range(min_clusters, max_clusters + 1)

plt.plot(range_n_clusters, silhouette_scores, marker='o', markersize= 12, linestyle='-', color='b')
plt.xlabel('Values of K')
plt.ylabel('Silhouette score')
plt.title('Silhouette analysis For Optimal k')
plt.show()

print(f"Optimal number of clusters: {optimal_num_clusters} with silhouette score average {round(max(silhouette_scores),3)}")

In [None]:
# silhouette score results visualization

silhouette_scores = []

for n_clusters in range(2, 11):

    # Create KMeans instance for different number of clusters
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', n_init=10, max_iter=300, random_state=42)

    # Visualize silhouette score
    visualizer = SilhouetteVisualizer(kmeans, colors='yellowbrick')
    visualizer.fit(scaled_features)
    visualizer.show()

    # calculate average of silhouette score
    silhouette_scores.append(visualizer.silhouette_score_)

Based on the silhouette analysis shown in the graphs, the optimal number of clusters is 2, with an average silhouette score of 0.307.
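For intuition about what the score measures: each point gets s = (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the nearest other cluster. The sketch below is a hand-rolled version for one-dimensional toy data only; `silhouette_values` is a hypothetical helper, and for real data `silhouette_score` from scikit-learn remains the right tool:

```python
import numpy as np

def silhouette_values(points, labels):
    # Hypothetical helper: per-point silhouette for 1-D points,
    # assuming every cluster has at least two members
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    distances = np.abs(points[:, None] - points[None, :])
    scores = []
    for i, own in enumerate(labels):
        same = labels == own
        same[i] = False  # exclude the point itself
        a = distances[i, same].mean()
        b = min(distances[i, labels == other].mean()
                for other in set(labels.tolist()) if other != own)
        scores.append((b - a) / max(a, b))
    return np.array(scores)

# Two tight, well-separated toy clusters
scores = silhouette_values([0, 1, 10, 11], [0, 0, 1, 1])
mean_score = scores.mean()

print(scores.round(3), round(mean_score, 3))
```

Tight, well-separated clusters score close to 1, so an average of about 0.3 indicates clusters that are only moderately separated.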

### Gap statistic

In [None]:
# Function to calculate the gap statistic for a given number of clusters

def calculate_gap(data, k_max=10, n_refs=20):
    # NOTE: despite its name, k_max is the single cluster number being evaluated
    X = np.asarray(data)
    rng = np.random.default_rng(42)

    # Calculate inertia for the original data
    kmeans = KMeans(n_clusters=k_max, init='k-means++', n_init=10, max_iter=300, random_state=42)
    kmeans.fit(X)
    wcss_obs = kmeans.inertia_

    # Calculate inertia for reference datasets drawn uniformly
    # within the range of each feature (no cluster structure)
    wcss_refs = []
    for _ in range(n_refs):
        random_data = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)
        kmeans = KMeans(n_clusters=k_max, init='k-means++', n_init=10, max_iter=300, random_state=42)
        kmeans.fit(random_data)
        wcss_refs.append(kmeans.inertia_)

    # Calculate Gap Statistic
    gap = np.mean(np.log(wcss_refs)) - np.log(wcss_obs)

    # Calculate standard deviation of log-inertia for reference datasets
    sd_refs = np.std(np.log(wcss_refs))

    # Simulation error term used in the gap statistic selection rule
    se = sd_refs * np.sqrt(1 + 1 / n_refs)

    return gap, se

In [None]:
# Plot the gap score

gap_values = []
errors = []
for k in range(1, 11):
  gap, se = calculate_gap(scaled_features, k_max=k)
  gap_values.append(gap)
  errors.append(se)

plt.figure()
plt.plot(range(1, 11), gap_values, marker='o', markersize= 12)
plt.xlabel("Number of clusters")
plt.ylabel("Gap Score")
plt.title('Gap Analysis For Optimal k')
plt.show()
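The `errors` list collected above is not used by the plot. In the original gap statistic procedure (Tibshirani et al.), the error term feeds a selection rule: choose the smallest k such that Gap(k) ≥ Gap(k+1) − s(k+1). The sketch below applies that rule to hypothetical gap values and contrasts it with simply taking the maximum gap. `first_k_within_one_error` is an illustrative helper, not part of any library:

```python
import numpy as np

# Hypothetical gap values and error terms for k = 1..5 (not results from this notebook)
gaps = np.array([0.10, 0.30, 0.45, 0.44, 0.50])
errs = np.array([0.02, 0.02, 0.02, 0.02, 0.02])

def first_k_within_one_error(gaps, errs):
    # Smallest k (1-based) with Gap(k) >= Gap(k+1) - s(k+1)
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - errs[i + 1]:
            return i + 1
    return len(gaps)  # fall back to the largest k that was tried

chosen_k = first_k_within_one_error(gaps, errs)
k_by_max_gap = int(np.argmax(gaps)) + 1

print(chosen_k, k_by_max_gap)
```

The two criteria can disagree, with the maximum-gap choice tending toward larger k, which is worth keeping in mind when a gap-based recommendation looks unusually high.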

In [None]:
# Optimal cluster number with Gap Statistic method

optimalK = OptimalK(parallel_backend='joblib', random_state=42)
n_clusters = optimalK(scaled_features, cluster_array=np.arange(1, 11))
n_clusters

Based on the gap analysis, the optimal number of clusters is 9.

In summary, the Elbow, Silhouette score, and Gap Statistic methods suggest different optimal numbers of clusters: 4, 2, and 9, respectively.

We will compare the cluster quality of each suggestion with the aid of a PCA visualization.

In [None]:
cluster_values = [2, 4, 9]

# Create subplots
fig, axs = plt.subplots(1, len(cluster_values), figsize=(15, 5))

# Iterate over each number of clusters
for i, n_clusters in enumerate(cluster_values):
    # Initialize KMeans with specified number of clusters
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=300, n_init=10, random_state=42)
    kmeans.fit(scaled_features)

    # Assign cluster labels to the original dataset
    pokemon_df['cluster'] = kmeans.labels_

    # Apply PCA for dimensionality reduction
    pca = PCA(n_components=2)
    principal_components = pca.fit_transform(scaled_features)

    # Create a DataFrame with the principal components and cluster labels
    pca_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
    pca_df['cluster'] = pokemon_df['cluster']  # Adding cluster labels

    # Plot clusters in 2D using PCA components
    for cluster in pca_df['cluster'].unique():
        cluster_data = pca_df[pca_df['cluster'] == cluster]
        axs[i].scatter(cluster_data['PC1'], cluster_data['PC2'], label=f'Cluster {cluster}')

    axs[i].set_title(f'Cluster Number = {n_clusters}')
    axs[i].set_xlabel('Principal Component 1')
    axs[i].set_ylabel('Principal Component 2')
    axs[i].legend()

plt.tight_layout()
plt.show()

We will not use the number of clusters recommended by the Gap Statistic (9), since the clusters are heavily scattered and overlapping in the PCA plot, which implies that the data is not separated effectively.

Both the number of clusters recommended by the Elbow method (4) and by the Silhouette score (2) are usable, because the data points are separated reasonably well in each case.

The separation with 2 clusters is cleaner than with 4 clusters, where some data points in clusters 2 and 3 overlap.
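One caveat when judging overlap from a 2D PCA plot is that the projection keeps only part of the total variance, so clusters that overlap in two components may still be separated in the full feature space. The sketch below computes the share of variance retained by the first two components on synthetic data with NumPy. With scikit-learn, the same quantity is available as `pca.explained_variance_ratio_`:

```python
import numpy as np

# Synthetic data with 7 features, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))
X[:, 1] += X[:, 0]  # introduce some correlation between two features

# PCA via singular value decomposition of the centered data
centered = X - X.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)

# Share of total variance explained by each principal component
explained = singular_values ** 2 / np.sum(singular_values ** 2)

# Share retained by a 2-D projection
retained_2d = explained[:2].sum()

print(explained.round(3), round(retained_2d, 3))
```

If the retained share is low, the scatter plots should be read as a rough guide rather than a definitive picture of cluster separation.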

Each choice has pros and cons, so we will compare the interpretation of 2 and 4 clusters.

Let's look at the median and the data distribution of each cluster for the 2-cluster solution first.

## Clustering Result using 2 Clusters

In [None]:
kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state = 42)
kmeans.fit(scaled_features)

pokemon_df['cluster'] = kmeans.labels_

In [None]:
# overview of the median stats and capture rate for each cluster

pokemon_df.groupby('cluster').agg({'hp': 'median',
                                   'attack': 'median',
                                   'defense': 'median',
                                   'sp_attack': 'median',
                                   'sp_defense': 'median',
                                   'speed': 'median',
                                   'capture_rate': ['median', 'count']})

## Cluster Interpretation for 2 Cluster Type

The data is split into 2 categories with distinct differences between them.

<b>Cluster 0 (Late Game Pokémon): </b>

Pokémon in this cluster have a low capture rate (median of 45), indicating that they are difficult to catch. This difficulty goes together with their relatively powerful stats (around 80).

With the lowest capture rate, Pokémon in this cluster represent a significant challenge but offer substantial rewards. Catching them encourages a deeper understanding of game mechanics, as players need to employ more advanced strategies. Their balanced, high stats across the board make them highly desirable for competitive play and endgame content.

Pokémon in this group are usually found late in the game, such as in secret places or post-game areas, or even as secret or final bosses.

<b>Cluster 1 (Early Game Pokémon): </b>

Pokémon in this cluster have a high capture rate (median of 190), so they are easy to catch. They also have relatively low values in all stats (around 50). Players do not need to worry much when they encounter Pokémon in this category, since they are not too strong. They are also worth trying to catch thanks to their very high capture rate, which players can use to complete their Pokédex.

Pokémon in this category tend to appear in the first portion of the game, usually in more common areas, where players can learn and explore the basics of battle without facing overwhelming challenges.



## Clustering Result using 4 Clusters

In [None]:
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state = 42)
kmeans.fit(scaled_features)

pokemon_df['cluster'] = kmeans.labels_

In [None]:
# overview of the median stats and capture rate for each cluster

pokemon_df.groupby('cluster').agg({'hp': 'median',
                                   'attack': 'median',
                                   'defense': 'median',
                                   'sp_attack': 'median',
                                   'sp_defense': 'median',
                                   'speed': 'median',
                                   'capture_rate': ['median', 'count']})

## Cluster Interpretation for 4 Cluster Type

The data is split into 4 categories with distinct differences between them.

<b>Cluster 0 (Middle Game - Speed Favored Pokémon): </b>

Pokémon in this cluster are stronger than those in cluster 2, as indicated by their moderate capture rate (median of 60) and higher stats (around 75). They have very high speed (median of 90) and special attack (median of 80), but low defense (median of 63) and special defense (median of 67).

Pokémon in this group may suit players who prefer an aggressive, fast, and risky approach in battle. Their lower defenses make them glass cannons that rely on quick strikes to win battles.

Pokémon in this group are better suited to special moves, since their special attack is higher than their attack. Although their attack is only moderate, the combination of high speed and decent special attack lets them move before slower opponents, so they can often deal damage, or even knock out a foe, before being hit.

Pokémon in this group are usually found in the mid game, where players start facing tougher opponents and need a nimbler party composition.

<b>Cluster 1 (Middle Game - Defense Favored Pokémon): </b>

Pokémon in this cluster are stronger than those in cluster 2, as indicated by their moderate capture rate (median of 50) and higher stats (around 75). They have very high attack (median of 89) and special defense (median of 80), but low speed (median of 50).

The most notable strength of this cluster is its extremely high defense (median of 95), making these Pokémon ideal for players who prefer a defensive, safe, and slow approach in battle, since they can withstand more heavy blows from foes in long battles.

Pokémon in this group have higher attack than special attack, indicating that they are better suited to physical moves than to special moves.

Pokémon in this group are usually found in the mid game, where players start facing tougher opponents and need a sturdier party composition.

<b>Cluster 2 (Early Game Pokémon): </b>

Pokémon in this cluster have a high capture rate (median of 190), so they are easy to catch. They also have relatively low values in all stats (around 50). Players do not need to worry much when they encounter Pokémon in this category, since they are not too strong. They are also worth trying to catch thanks to their very high capture rate, which players can use to complete their Pokédex.

Pokémon in this category tend to appear in the first portion of the game, usually in more common areas, where players can learn and explore the basics of battle without facing overwhelming challenges.

<b>Cluster 3 (Late Game Pokémon): </b>

Pokémon in this cluster have a low capture rate (median of 45), indicating that they are difficult to catch. This difficulty goes together with their relatively powerful stats (around 100).

With the lowest capture rate, Pokémon in this cluster represent a significant challenge but offer substantial rewards. Catching them encourages a deeper understanding of game mechanics, as players need to employ more advanced strategies. Their balanced, high stats across the board make them highly desirable for competitive play and endgame content.

Pokémon in this group are usually found late in the game, such as in secret places or post-game areas, or even as secret or final bosses.

Both cluster 0 and cluster 1 are categorized as middle game Pokémon, since they share similar overall stats. However, they offer distinct playstyles: players can choose Pokémon from cluster 0 if they prefer high agility, or from cluster 1 if they prefer high durability.


## Summary

Both the 2-cluster and the 4-cluster solutions can be used, each with its own pros and cons.

Using 2 clusters gives a clear and simple interpretation, as players can easily categorize Pokémon as 'Weak' or 'Strong'. However, this approach may oversimplify, as players cannot identify 'Moderate' Pokémon between the two. This can make the transition from early to late game feel steep. The 2-cluster approach remains an alternative if simplicity is preferred.

Using 4 clusters gives players more information, since it describes more specific characteristics of the Pokémon in each group. With this approach, players can better strategize about which Pokémon suit their playstyle.

Additionally, the 4-cluster solution has two middle game categories that represent 'Moderate' Pokémon. These options are not available with the 2-cluster solution.

Personally, I prefer the 4-cluster approach because it gives players more valuable insights than simply classifying Pokémon as weak or strong. With this approach, players can also consider which Pokémon are worth trying to catch, based on their playstyle preferences.

## Assign 4 Clusters Type Result to Original Dataset

In [None]:
df['cluster'] = kmeans.labels_

In [None]:
# add pokedex number and name for easier readability
# select only the clustered features for better clarity

selected_features = ['pokedex_number', 'name', 'hp', 'attack', 'defense', 'sp_attack', 'sp_defense', 'speed', 'capture_rate', 'cluster']
selected_df = df[selected_features]

# Define cluster names instead of cluster number for better interpretation

cluster_names = {
    0: '(Agile) Middle Game',
    1: '(Durable) Middle Game',
    2: 'Early Game',
    3: 'Late Game'
}

# Replace cluster labels with descriptive cluster names

selected_df = selected_df.copy()
selected_df['cluster'] = selected_df['cluster'].map(cluster_names)

# show a random sample of 20 Pokémon as an example

selected_df.sample(20, random_state=42)
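K-means label numbers are arbitrary: a different `random_state` or a change in the data can swap which cluster is labelled 0, 1, 2, or 3, which would silently break a hard-coded name mapping. One possible safeguard, sketched below on made-up data with hypothetical tier names, is to derive the mapping from a cluster statistic such as the median capture rate:

```python
import pandas as pd

# Hypothetical cluster assignments and capture rates, for illustration only
toy = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2],
    "capture_rate": [45, 45, 190, 255, 60, 75],
})

# Median capture rate per cluster
medians = toy.groupby("cluster")["capture_rate"].median()

# Order the clusters from easiest to hardest to catch
order = medians.sort_values(ascending=False).index.tolist()

# Assign names by rank instead of by raw label number
tiers = ["Early Game", "Middle Game", "Late Game"]
names = dict(zip(order, tiers))

toy["tier"] = toy["cluster"].map(names)
print(names)
```

Distinguishing the two middle game clusters would need a second criterion, for example comparing median speed with median defense.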

## Checking Cluster Assignment

To check whether the clustering is sensible, we inspect the clusters of Pikachu and Raichu as an example. Pikachu is one of the earliest Pokémon, and Raichu is its evolution. Because Raichu evolves from Pikachu and has better overall stats, Raichu should be assigned to a higher cluster than Pikachu.

In [None]:
selected_df[selected_df['name'] == 'Pikachu']

In [None]:
selected_df[selected_df['name'] == 'Raichu']

Pikachu is assigned to cluster 2 (Early Game), while Raichu is assigned to cluster 0 ((Agile) Middle Game). This matches the expectation and suggests that the Pokémon in the dataset have been clustered sensibly based on their stats and capture rate.