<p align="center">
<img src="../images/pokemon_world.png" width="1000" height="500" />
</p>


Welcome to this notebook, where we embark on an exciting journey to construct a classification model aimed at discerning whether a Pokémon is legendary. Our tools of choice for this mission include:

- Pandas: Our trusty companion for data manipulation and loading.
- Plotly: An interactive plotting tool that will breathe life into our data visualizations.
- ydata-profiling (formerly Pandas Profiling): A swift and automated approach for exploratory data analysis (EDA).
- PyCaret: A low-code machine learning library that simplifies our modeling process.


Join us as we delve into the world of Pokémon, leveraging data and machine learning to unravel the mysteries of legendary status!

# Import necessary libraries

In [1]:
import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
#import seaborn as sns
import matplotlib.pyplot as plt
import plotly.figure_factory as ff

from ydata_profiling import ProfileReport
from pycaret.classification import ClassificationExperiment

from pathlib import Path



# Load data into a pandas dataframe

In [None]:

def load_data(filename: str, separator: str = ",") -> pd.DataFrame:
    """
    Loads data from the given filename in the '../data/' directory into a pandas DataFrame

    Args:
    filename (str): The name of the dataset file.
    separator (str, optional): The separator used in the CSV. Defaults to ','.

    Returns:
    pd.DataFrame: The loaded data in the form of a DataFrame.
    """
    # Get the current notebook directory
    notebook_dir = Path.cwd()
    
    # Move up one directory to get to the parent directory
    parent_dir = notebook_dir.parent
    
    # Build the full path to the data file
    full_path = parent_dir / "data" / filename
    
    try:
        dataframe = pd.read_csv(full_path, sep=separator)
        return dataframe
    except FileNotFoundError:
        print(f"File not found under the following path: {full_path}")
        return None

: 

In [None]:
pokemon_df = load_data(filename= "pokemon.csv")
pokemon_df

: 

# Data Analysis

In [None]:
def summarize_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """
    Returns a summary of the DataFrame with column name, data type, and number of missing values.

    Args:
    df (pd.DataFrame): The DataFrame to summarize.

    Returns:
    pd.DataFrame: Summary of the input DataFrame.
    """
    summary = [(col, df[col].dtype, df[col].isnull().sum()) for col in df.columns]
    summary_df = pd.DataFrame(summary, columns=['Column Name', 'Data Type', 'Number of Missing Values'])
    return summary_df

: 

In [None]:
summary_df = summarize_dataframe(pokemon_df)
summary_df

: 

Upon analyzing the dataset, we observe missing values in three columns: Type_2, Pr_Male, and Egg_Group_2. While missing data often requires careful scrutiny, it is essential to understand the context behind such omissions.

In many datasets, the absence of values might be indicative of data collection errors or other discrepancies. Under such circumstances, imputation becomes a vital step. Imputation strategies vary based on the nature of the variable:

- **Categorical Variables**: Typically, the mode (most frequent category) is used to fill in missing values.

- **Numerical Variables**: The mean or median can serve as a replacement, contingent upon the distribution of the data. For more advanced scenarios, leveraging supervised machine learning models to predict and replace missing values based on other column values can be considered.

However, in our specific context, these missing values are not arbitrary and are substantiated by domain knowledge:

- **Type_2**: Not every Pokémon possesses a secondary type, explaining the missing entries in this column.
- **Pr_Male**: Represents the probability of a Pokémon being male. The missing values here are justified because certain Pokémon species are genderless.
- **Egg_Group_2**: Analogous to the Type_2 column, not all Pokémon belong to a secondary egg group.

Given this understanding, our approach to handling these missing values will be dictated by the inherent characteristics of the Pokémon dataset and the domain knowledge we possess.
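For contrast, the generic imputation strategies described above (mode for categorical columns, median for numerical ones) can be sketched as follows. This is a minimal illustration on an invented toy DataFrame; the column names `habitat` and `weight_kg` are assumptions made for the example and do not come from the Pokémon dataset.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration only (not taken from pokemon.csv)
toy_df = pd.DataFrame({
    "habitat": ["cave", "forest", None, "forest"],
    "weight_kg": [6.9, np.nan, 100.0, 8.5],
})

# Categorical column: fill missing entries with the mode (most frequent category)
habitat_mode = toy_df["habitat"].mode().iloc[0]
toy_df["habitat"] = toy_df["habitat"].fillna(habitat_mode)

# Numerical column: fill missing entries with the median, which is less
# sensitive to outliers (such as the 100.0 above) than the mean
weight_median = toy_df["weight_kg"].median()
toy_df["weight_kg"] = toy_df["weight_kg"].fillna(weight_median)

print(toy_df)
```

In this notebook the missing values carry meaning, so explicit placeholder values are used instead of statistical imputation.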

In [None]:
def fill_type_2(df: pd.DataFrame) -> pd.DataFrame:
    """
    Fills missing values in the 'Type_2' column with 'No_Second_Type'.

    Args:
    df (pd.DataFrame): The input DataFrame.

    Returns:
    pd.DataFrame: The transformed DataFrame.
    """
    df['Type_2'] = df['Type_2'].fillna('No_Second_Type')
    return df

def fill_egg_group_2(df: pd.DataFrame) -> pd.DataFrame:
    """
    Fills missing values in the 'Egg_Group_2' column with 'No_Second_Egg_Group'.

    Args:
    df (pd.DataFrame): The input DataFrame.

    Returns:
    pd.DataFrame: The transformed DataFrame.
    """
    df['Egg_Group_2'] = df['Egg_Group_2'].fillna('No_Second_Egg_Group')
    return df

def fill_pr_male(df: pd.DataFrame) -> pd.DataFrame:
    """
    Fills missing values in the 'Pr_Male' column with -1.

    Args:
    df (pd.DataFrame): The input DataFrame.

    Returns:
    pd.DataFrame: The transformed DataFrame.
    """
    df['Pr_Male'] = df['Pr_Male'].fillna(-1)
    return df

def transformation_pipeline(df: pd.DataFrame, functions: list) -> pd.DataFrame:
    """
    Applies a series of transformation functions to the input DataFrame.

    Args:
    df (pd.DataFrame): The input DataFrame.
    functions (list): List of transformation functions to be applied.

    Returns:
    pd.DataFrame: The transformed DataFrame.
    """
    df_copy = df.copy()
    for func in functions:
        df_copy = func(df_copy)
    return df_copy

: 

In [None]:
transformed_pokemon_df = transformation_pipeline(pokemon_df, [fill_type_2, fill_egg_group_2, fill_pr_male])


: 

In [None]:
summary_transformed_df = summarize_dataframe(transformed_pokemon_df)
summary_transformed_df

: 

With the missing values replaced, we can start our exploratory data analysis. For that we will use ydata-profiling (formerly Pandas Profiling).

In [None]:
profile = ProfileReport(transformed_pokemon_df, title="Profiling Report")
profile.to_notebook_iframe()

: 

In [None]:
def plot_correlation_heatmap(df: pd.DataFrame):
    """
    Plots a correlation heatmap for the features in the input DataFrame using Plotly.

    Args:
    df (pd.DataFrame): The input DataFrame.
    """
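    # Calculate the correlation matrix on numeric columns only;
    # without numeric_only=True, pandas >= 2.0 raises an error on string columns
    correlation = df.corr(numeric_only=True)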
    
    # Create the heatmap using Plotly
    heatmap = ff.create_annotated_heatmap(
        z=correlation.values,
        x=list(correlation.columns),
        y=list(correlation.index),
        annotation_text=correlation.round(2).values,
        showscale=True,
        colorscale='Viridis'
    )
    
    # Update layout for better visualization
    heatmap.layout.update({
        'title': 'Features Correlation Heatmap',
        'xaxis': {'side': 'bottom'}
    })
    
    heatmap.show()

: 

In [None]:
plot_correlation_heatmap(transformed_pokemon_df)

: 

Because the task description places strong emphasis on Pokémon types, the following plots show how the types are distributed.

In [None]:
def single_type_histogram(pokemon_df: pd.DataFrame):
    """
    Generates a histogram for Pokémon with a single type.

    Args:
    pokemon_df (pd.DataFrame): The input DataFrame containing Pokémon data.

    Returns:
    go.Figure: Figure object for the histogram.
    """
    
    single_type_df = pokemon_df[pokemon_df['Type_2'].isnull()]

    # Order by count and get top 3 types
    ordered_types = single_type_df['Type_1'].value_counts().index.tolist()
    top_3_types = ordered_types[:3]

    # Determine colors for bars
    colors = [px.colors.qualitative.Plotly[i] if t in top_3_types else 'lightgray' for i, t in enumerate(ordered_types)]

    # Plot for single type Pokémon
    fig_single_type = px.histogram(single_type_df, x='Type_1', title="Distribution of Pokémon with Single Type", 
                                category_orders={'Type_1': ordered_types}, color='Type_1',
                                color_discrete_sequence=colors)
    return fig_single_type

: 

In [None]:
fig_single_type= single_type_histogram(pokemon_df)
fig_single_type.show()

: 

In [None]:


def both_types_histogram(pokemon_df: pd.DataFrame):
    """
    Generates a histogram for Pokémon with both types.

    Args:
    pokemon_df (pd.DataFrame): The input DataFrame containing Pokémon data.

    Returns:
    go.Figure: Figure object for the histogram.
    """
    
    # Work on a copy of the dual-type rows so the caller's DataFrame is not modified
    both_types_df = pokemon_df[pokemon_df['Type_2'].notnull()].copy()
    both_types_df['Both_Types'] = both_types_df['Type_1'] + '/' + both_types_df['Type_2']
    ordered_combinations = both_types_df['Both_Types'].value_counts().index.tolist()
    top_3_combinations = ordered_combinations[:3]
    colors = [px.colors.qualitative.Plotly[i] if t in top_3_combinations else 'lightgray' for i, t in enumerate(ordered_combinations)]
    
    fig = px.histogram(both_types_df, x='Both_Types', title="Distribution of Pokémon with Both Types", 
                       category_orders={'Both_Types': ordered_combinations}, color='Both_Types', color_discrete_sequence=colors)
    return fig

: 

In [None]:
fig_both_types = both_types_histogram(pokemon_df)
fig_both_types.show()

: 

In [None]:
def plot_avg_strength_radar(df: pd.DataFrame, stats: list) -> None:
    """
    Plots a radar chart comparing the average stats of legendary and non-legendary Pokémon.

    Args:
    df (pd.DataFrame): The input DataFrame containing Pokémon data.
    stats (list): The list of stats to be compared.

    Returns:
    None
    """

    # Build a Boolean mask without modifying the caller's DataFrame
    is_legendary = df['isLegendary'].astype(bool)

    # Calculate the average stats of legendary and non-legendary Pokémon
    avg_legendary = df.loc[is_legendary, stats].mean()
    avg_non_legendary = df.loc[~is_legendary, stats].mean()

    # Plot the average strengths on a radar chart
    fig_leg = go.Figure()

    # Add trace for Legendary Pokémon
    fig_leg.add_trace(go.Scatterpolar(
        r=avg_legendary.values.tolist(),
        theta=stats,
        fill='toself',
        name='Legendary',
        textfont=dict(color="black"),
        line=dict(color='blue'),
    ))

    # Add trace for Non-Legendary Pokémon
    fig_leg.add_trace(go.Scatterpolar(
        r=avg_non_legendary.values.tolist(),
        theta=stats,
        fill='toself',
        name='Non-Legendary',
        textfont=dict(color="black"),
        line=dict(color='red'),
    ))

    # Update layout for better visualization
    fig_leg.update_layout(
        polar=dict(radialaxis=dict(visible=True)),
        title='Average Strength of Legendary vs. Non-Legendary Pokémon',
        showlegend=True
    )
    
    fig_leg.show()

# Usage
stats = ['HP', 'Attack', 'Defense', 'Sp_Atk', 'Sp_Def', 'Speed']
plot_avg_strength_radar(pokemon_df, stats)

: 

In [None]:
def plot_strongest_legendaries(df: pd.DataFrame, stats: list) -> None:
    """
    Plots a radar chart comparing the stats of the three strongest legendary Pokémon.

    Args:
    df (pd.DataFrame): The input DataFrame containing Pokémon data.
    stats (list): The list of stats to be compared.

    Returns:
    None
    """

    # Select only legendary Pokémon; copy so the new column below
    # neither modifies the caller's DataFrame nor triggers SettingWithCopyWarning
    legendary_df = df[df['isLegendary'].astype(bool)].copy()

    # Calculate total stats for each Pokémon
    legendary_df['Total_Stats'] = legendary_df[stats].sum(axis=1)

    # Sort the legendary Pokémon by total stats in descending order and take the top 3
    strongest_legendaries = legendary_df.sort_values(by='Total_Stats', ascending=False).head(3)

    # Plot the stats on a radar chart
    fig_leg = go.Figure()

    # Add a trace for each of the strongest legendary Pokémon
    for index, row in strongest_legendaries.iterrows():
        fig_leg.add_trace(go.Scatterpolar(
            r=row[stats].values.tolist(),
            theta=stats,
            fill='toself',
            name=row['Name'],
            text=[f'{stat}: {value}' for stat, value in zip(stats, row[stats].values.tolist())],
            hoverinfo='text'
        ))

    # Update layout for better visualization
    fig_leg.update_layout(
        polar=dict(radialaxis=dict(visible=True)),
        title='Stats of the Three Strongest Legendary Pokémon',
        showlegend=True
    )
    
    fig_leg.show()

: 

In [None]:
stats = ['HP', 'Attack', 'Defense', 'Sp_Atk', 'Sp_Def', 'Speed']
plot_strongest_legendaries(pokemon_df, stats)

: 

# Train legendary classifier using PyCaret

In [None]:
cls_exp = ClassificationExperiment() # we first create an instance of the classification experiment class

: 

In [None]:
cls_exp.setup(
    transformed_pokemon_df,
    target='isLegendary',
    ignore_features=['Name', 'color', 'Number', 'hasGender'],
    session_id=123,
    fix_imbalance=True,
)

: 

In [None]:
best_model = cls_exp.compare_models(turbo=True)

: 

In [None]:
print(best_model)

: 

In [None]:
# Plot the confusion matrix
cls_exp.plot_model(best_model, plot = 'confusion_matrix')

: 

Our model demonstrates a high degree of accuracy in identifying legendary Pokémon, with only three instances of misclassification where non-legendary Pokémon were incorrectly identified as legendary.

Importantly, there were no occurrences where legendary Pokémon were misidentified as non-legendary. This distinction is critical given the objective of our task: to prioritize the capture of legendary Pokémon due to their superior strength and rarity. 

The model's current configuration is favorable as it errs on the side of over-identification rather than under-identification of legendary Pokémon, thus reducing the likelihood of missing an opportunity to capture a legendary Pokémon.
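The trade-off described above maps onto precision and recall. The sketch below computes both from confusion-matrix counts. The three false positives and zero false negatives come from the text; the number of true positives (10) is an invented placeholder, since the actual count depends on the hold-out split.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) for the positive class from raw counts."""
    # Precision: share of predicted legendaries that really are legendary
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual legendaries that the model found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# fp=3 and fn=0 follow the text above; tp=10 is a hypothetical value
precision, recall = precision_recall(tp=10, fp=3, fn=0)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

With no false negatives, recall is 1.0 regardless of the true-positive count, which is the property that matters when missing a legendary Pokémon is the costlier mistake.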

In [None]:
cls_exp.plot_model(best_model, plot = 'feature')

: 

# Fine-tuning

Even though the model already performs well, we can try to improve it further by tuning its hyperparameters.
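The tuning cell below optimizes F1. Assuming PyCaret reports the standard binary F1 score for the positive (legendary) class, the metric is the harmonic mean of precision and recall, so it rewards reducing false positives without giving up recall:

$$
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN}
$$

Here $TP$, $FP$, and $FN$ are the counts of true positives, false positives, and false negatives.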

In [None]:
tuned_best_model = cls_exp.tune_model(best_model, optimize="F1",n_iter=10)

: 

# Misclassified Legendaries

In [None]:
def misclassified_legendaries(trained_model, df: pd.DataFrame, legendary_status: bool, predicted_status: bool) -> pd.DataFrame:
    """
    Identifies misclassified Pokémon based on their legendary status and predicted status, using a trained model.
    Returns a DataFrame of misclassified Pokémon.

    Args:
    trained_model: The trained PyCaret model.
    df (pd.DataFrame): The input DataFrame containing Pokémon data.
    legendary_status (bool): The actual legendary status to filter misclassified Pokémon.
                             - True: to filter Pokémon that are legendary.
                             - False: to filter Pokémon that are not legendary.
    predicted_status (bool): The predicted legendary status to filter misclassified Pokémon.
                             - True: to filter Pokémon predicted as legendary.
                             - False: to filter Pokémon predicted as non-legendary.

    Returns:
    pd.DataFrame: The DataFrame of misclassified Pokémon.
    """
    # Drop the target column; drop() returns a new DataFrame, so the input stays unchanged
    features = df.drop(columns='isLegendary')
    
    # Use the trained model to make predictions (relies on the notebook-level cls_exp experiment)
    predictions = cls_exp.predict_model(trained_model, data=features)
    
    # Attach the predicted labels to a copy so the caller's DataFrame is not modified
    result = df.copy()
    result['predictions'] = predictions['prediction_label'].values
    
    # Filter on the actual and predicted legendary status
    misclassified = result[(result["isLegendary"] == legendary_status) & (result["predictions"] == predicted_status)]
    
    return misclassified

: 

In [None]:
misclassified_legendaries(best_model,transformed_pokemon_df, legendary_status=False, predicted_status=True )

: 

In [None]:
cls_exp.save_model(best_model, 'pokemon_classifier')

: 
