#Importing Libraries and Dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

In [None]:
dataset = pd.read_csv('Tabular Dataset.csv')

#The Tabular Data

In [None]:
dataset.head()

Unnamed: 0,National\r\nDex,Pokemon\r\nName,HP,Attack,Defense,Special\r\nAttack,Special\r\nDefense,Speed,Total,Ability I,Ability II,Hidden Ability,Weight (kg),Height (m),Pokedex Color,Egg Group I,Egg Group II,Type I,Type II
0,1.0,Bulbasaur,45,49,49,65,65,45,318,Overgrow,,Chlorophyll,6.9,0.7,Green,Monster,Grass,Grass,Poison
1,2.0,Ivysaur,60,62,63,80,80,60,405,Overgrow,,Chlorophyll,13.0,1.0,Green,Monster,Grass,Grass,Poison
2,3.0,Venusaur,80,82,83,100,100,80,525,Overgrow,,Chlorophyll,100.0,2.0,Green,Monster,Grass,Grass,Poison
3,3.1,Venusaur (Mega),80,100,123,122,120,80,625,Thick Fat,,,155.5,2.4,Green,Monster,Grass,Grass,Poison
4,4.0,Charmander,39,52,43,60,50,65,309,Blaze,,Solar Power,8.5,0.6,Red,Monster,Dragon,Fire,


This data holds all the tabular information about every Pokemon that we wanted to use for the project.

- The 'National Dex' number identifies every Pokemon, alongside the 'Pokemon Name' for convenience.
- 'HP', 'Attack', 'Defense', 'Special Attack', 'Special Defense', and 'Speed' are the Pokemon's stat values, which we can use to check whether particularly high or low stats say anything about a Pokemon's typing.
- 'Total' is the sum of the previous six stats, included to see whether higher totals are more common for certain types.
- Every Pokemon has an 'Ability I', which in-game is a passive power; we want to find out whether certain abilities commonly go with the same types.
- Some Pokemon also have an 'Ability II' and/or a 'Hidden Ability', which work on the same premise.
- 'Weight (kg)' and 'Height (m)' are the official weight and height of the Pokemon, included to see whether they are related to type.
- 'Pokedex Color' is the official in-game classification of a Pokemon's colour, regardless of whether the Pokemon is visually predominantly that colour.
- 'Egg Group I' and 'Egg Group II' are further classifications, based on which Pokemon are compatible for breeding.
- 'Type I' and 'Type II' are the one or two Pokemon types that we are hoping to predict and learn more about.
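
As a quick sanity check on the column descriptions, the sketch below tidies the headers and confirms that 'Total' equals the sum of the six stats. The five rows are copied from the `dataset.head()` output; the assumption that the raw headers contain `\r\n` line breaks, and the names `sample`, `clean_header` and `stat_cols`, are ours and not part of the dataset.

```python
import pandas as pd

# Five rows copied from the dataset.head() output.
# Assumption: the raw headers contain '\r\n' line breaks.
sample = pd.DataFrame({
    'Pokemon\r\nName': ['Bulbasaur', 'Ivysaur', 'Venusaur', 'Venusaur (Mega)', 'Charmander'],
    'HP': [45, 60, 80, 80, 39],
    'Attack': [49, 62, 82, 100, 52],
    'Defense': [49, 63, 83, 123, 43],
    'Special\r\nAttack': [65, 80, 100, 122, 60],
    'Special\r\nDefense': [65, 80, 100, 120, 50],
    'Speed': [45, 60, 80, 80, 65],
    'Total': [318, 405, 525, 625, 309],
})

def clean_header(name):
    # Collapse any run of whitespace (including '\r\n') into a single space.
    return ' '.join(name.split())

sample = sample.rename(columns=clean_header)

stat_cols = ['HP', 'Attack', 'Defense', 'Special Attack', 'Special Defense', 'Speed']

# True for each row where the six stats add up to the recorded Total.
total_matches = sample[stat_cols].sum(axis=1).eq(sample['Total'])
print(total_matches.all())
```

Running the same check on the full `dataset` (after the same header clean-up) would flag any row where the recorded total and the individual stats disagree.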

In [None]:
type_colour_map = {
    'Grass' : '#71c558',
    'Fire' : '#ea7a3c',
    'Water' : '#539ae2',
    'Bug' : '#94bc4a',
    'Normal' : '#aab09f',
    'Dark' : '#736c75',
    'Poison' : '#b468b7',
    'Electric' : '#e5c531',
    'Ground' : '#cc9f4f',
    'Ice' : '#70cbd4',
    'Fairy' : '#e397d1',
    'Steel' : '#89a1b0',
    'Fighting' : '#cb5f48',
    'Psychic' : '#e5709b',
    'Rock' : '#b2a061',
    'Ghost' : '#846ab6',
    'Dragon' : '#6a7baf',
    'Flying' : '#7da6de'
}

type_order = ['Grass', 'Fire', 'Water', 'Bug', 'Normal', 'Dark', 'Poison', 'Electric', 'Ground', 'Ice', 'Fairy', 'Steel', 'Fighting', 'Psychic', 'Rock', 'Ghost', 'Dragon', 'Flying']

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#Pokedex Colour vs Type

A Pokemon's colour could be a useful signal for its Type I or Type II. By learning which colours predominantly make up each type, we should be able to make predictions about the types of future Pokemon.

In [None]:
colour = dataset['Pokedex Color'].values.tolist()
typeI = dataset['Type I'].values.tolist()
typeII = dataset['Type II'].values.tolist()

colour_map = {
    'Black' : '#000000',
    'Blue' : '#0000FF',
    'Brown' : '#B16620',
    'Gray' : '#AAAAAA',
    'Green' : '#00FF00',
    'Pink' : '#FF9AED',
    'Purple' : '#FF00FF',
    'Red' : '#FF0000',
    'White' : '#FFFFFF',
    'Yellow' : '#FFFF00'
}


dataframe = {
    'Pokedex Colour': colour,
    'Type I' : typeI,
    'Type II' : typeII
}

pd_df = pd.DataFrame(dataframe)

pd_df_count_typeI = pd_df.groupby(['Type I', 'Pokedex Colour']).size()
pd_df_count_typeI = pd_df_count_typeI.to_frame(name='Count').reset_index()

pd_df_count_typeII = pd_df.groupby(['Type II', 'Pokedex Colour']).size()
pd_df_count_typeII = pd_df_count_typeII.to_frame(name='Count').reset_index()


fig_typeI = px.bar(
    pd_df_count_typeI,
    x='Type I',
    y='Count',
    color='Pokedex Colour',
    title='Pokedex Colour vs Type I',
    color_discrete_map=colour_map,
    category_orders={'Type I': type_order}
)

fig_typeII = px.bar(
    pd_df_count_typeII,
    x='Type II',
    y='Count',
    color='Pokedex Colour',
    title='Pokedex Colour vs Type II',
    color_discrete_map=colour_map,
    category_orders={'Type II': type_order}
)




fig_typeI.show()
fig_typeII.show()

There are a few interesting things we can pull from this data, notably that most Type I values seem to have a predominant colour. For example, 'Grass' is frequently green, 'Normal' is brown, 'Electric' is yellow, and 'Poison' is purple. There are also some connections that make less sense, such as Bug types being predominantly red despite the in-game type icon being green, and the higher than expected number of blue Rock types.
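
To put a number on "predominant colour", one option is a row-normalised cross-tabulation. The sketch below uses made-up toy rows purely to show the mechanics (they are not real counts from the dataset); the names `toy`, `colour_share` and `summary` are illustrative.

```python
import pandas as pd

# Made-up toy rows, only to show the mechanics; not real counts.
toy = pd.DataFrame({
    'Type I': ['Grass', 'Grass', 'Grass', 'Electric', 'Electric', 'Bug', 'Bug', 'Bug'],
    'Pokedex Colour': ['Green', 'Green', 'Brown', 'Yellow', 'Yellow', 'Red', 'Red', 'Green'],
})

# Row-normalised table: each row holds the colour shares within one type.
colour_share = pd.crosstab(toy['Type I'], toy['Pokedex Colour'], normalize='index')

# Most common colour per type and the fraction of that type it covers.
summary = pd.DataFrame({
    'Top Colour': colour_share.idxmax(axis=1),
    'Share': colour_share.max(axis=1),
})
print(summary)
```

With the real data, `toy` would be replaced by `pd_df`; a type whose top share is close to 1 has a strongly predominant colour, while a low top share suggests colour says little about that type.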

Type II colours seem to be all over the place in comparison, with a high number of green Dark types and blue Ground types, and Psychic showing a wide mix of colours. These will be harder to make accurate predictions with, but can still be used to a lesser extent.

There is some imbalance in the dataset. For example, Flying is the rarest Type I with only 10 Pokemon, while as a Type II it has 136 Pokemon, more than double the next highest, Psychic, at 52. Thankfully, Flying is the only major outlier.
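
Counts like these can be read off with `value_counts`. The sketch below is a minimal illustration on made-up rows (not the real dataset); the frame `toy` and the column labels 'As Type I' and 'As Type II' are our own names.

```python
import pandas as pd

# Made-up toy rows; None marks a Pokemon without a second type.
toy = pd.DataFrame({
    'Type I': ['Bug', 'Bug', 'Normal', 'Flying', 'Water'],
    'Type II': ['Flying', 'Flying', 'Flying', None, None],
})

# value_counts ignores missing values by default, so monotype rows
# simply do not contribute to the Type II column.
type_counts = pd.DataFrame({
    'As Type I': toy['Type I'].value_counts(),
    'As Type II': toy['Type II'].value_counts(),
}).fillna(0).astype(int)

print(type_counts)
```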

#Number of Egg Groups vs Number of Types

Pokemon Egg Groups tell you which Pokemon can breed with each other in-game. With this in mind, it seems reasonable to assume that Pokemon in similar egg groups have similar types. However, some Pokemon have two egg groups while others have one, much like how some Pokemon have two types while others have one. Here we check whether Pokemon with a second egg group are more likely to have a second type.

In [None]:
# Step 1: Create the num_types and num_egg_groups columns
num_types = dataset['Type II'].apply(lambda x: 1 if pd.isna(x) else 2)
num_egg_groups = dataset['Egg Group II'].apply(lambda x: 1 if pd.isna(x) else 2)

df = {
    'Type Count' : num_types,
    'Egg Group Count' : num_egg_groups
}

df = pd.DataFrame(df)
# Step 2: Count combinations of num_types and num_egg_groups
type_egg_group_counts = df.groupby(['Type Count', 'Egg Group Count']).size()
type_egg_group_counts = type_egg_group_counts.to_frame(name='Count').reset_index()

# Step 3: Plot with Plotly Express
fig = px.bar(
    type_egg_group_counts,
    x='Type Count',
    y='Count',
    color='Egg Group Count',
    title='Number of Egg Groups vs Number of Types'
)

fig.show()


Unfortunately, this exploratory analysis does not give us much information. There are significantly fewer Pokemon with two egg groups, but that does not seem to affect how likely a Pokemon is to have a second type.

For single-type Pokemon, 26.49% have a secondary egg group, and for dual-type Pokemon, 25.4% have a secondary egg group. This difference is practically indistinguishable.
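
Percentages of this kind can be computed directly rather than read off the chart. The following is a minimal sketch on made-up flags (not the real counts), assuming columns named like the 'Type Count' and 'Egg Group Count' columns built in the cell above; `toy`, `share` and `pct_second_group` are illustrative names.

```python
import pandas as pd

# Made-up toy flags; in the notebook these would come from num_types
# and num_egg_groups.
toy = pd.DataFrame({
    'Type Count':      [1, 1, 1, 1, 2, 2, 2, 2],
    'Egg Group Count': [1, 1, 1, 2, 1, 1, 1, 2],
})

# Row-normalised percentages: each row (type count) sums to 100.
share = pd.crosstab(toy['Type Count'], toy['Egg Group Count'], normalize='index') * 100

# Percentage of Pokemon with a second egg group, per type count.
pct_second_group = share[2]
print(pct_second_group)
```

If the two percentages come out nearly equal, as they do in the real data, the number of egg groups carries little information about the number of types.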

#Total vs Type

By looking at the median, minimum and maximum of the Total stat for each type, we should be able to find trends, with some types being significantly weaker and some stronger.

In [None]:
stat_total = dataset['Total']
typeI = dataset['Type I']
typeII = dataset['Type II']

df = {
    'Stat Total' : stat_total,
    'Type I' : typeI,
    'Type II' : typeII
}

df = pd.DataFrame(df)

fig_typeI = px.violin(df, x='Type I', y='Stat Total', color='Type I', color_discrete_map=type_colour_map, category_orders={'Type I':type_order})
fig_typeI.show()

This data doesn't give us too much information. However, we can tell that Dragon has the highest stat total at 780, while Water has the lowest at 175. Dragon also has the highest median at 580, so Pokemon with high stat totals are more likely to be Dragon types.
It is interesting to see that almost every type fits into one of two categories: either an elongated rhombus, narrow at the top and bottom, with a single gradual widening towards the middle and a single narrowing after it, or a shape with two bulbs, where the distribution narrows in between before widening again.
The only exception is Fighting, which is roughly a rhombus but bulges out as it approaches the median, suggesting that a lot of Fighting types have stat totals around that value.
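
The minimum, median and maximum quoted above are read off the violin plot; they can also be tabulated with a group-by. This is a minimal sketch on made-up totals (not real values from the dataset), with `toy` and `summary` as illustrative names.

```python
import pandas as pd

# Made-up toy totals, only to show the mechanics.
toy = pd.DataFrame({
    'Type I': ['Dragon', 'Dragon', 'Dragon', 'Water', 'Water', 'Water'],
    'Stat Total': [300, 600, 680, 200, 320, 530],
})

# One row per type with its minimum, median and maximum stat total,
# sorted so the type with the highest median comes first.
summary = (
    toy.groupby('Type I')['Stat Total']
       .agg(['min', 'median', 'max'])
       .sort_values('median', ascending=False)
)
print(summary)
```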

#Total vs Number of Types

Some Pokemon have higher stat totals than others. We are hoping for a distinct link between high stat totals and having multiple types, so that we can more accurately predict whether a Pokemon has one or two types.

In [None]:
# Step 1: Create the num_types column and bin the stat totals into five ranges
num_types = dataset['Type II'].apply(lambda x: 1 if pd.isna(x) else 2)
stat_total = dataset['Total'].apply(lambda x: '175-296' if x>174 and x<297
                                              else '297-417' if x>296 and x<418
                                              else '418-538' if x>417 and x<539
                                              else '539-659' if x>538 and x<660
                                              else '660-780' if x>659 and x<781
                                              else '780+')

df = {
    'Type Count' : num_types,
    'Stat Total' : stat_total
}

df = pd.DataFrame(df)
# Step 2: Count combinations of num_types and stat total ranges
type_stat_counts = df.groupby(['Type Count', 'Stat Total']).size()
type_stat_counts = type_stat_counts.to_frame(name='Count').reset_index()

# Step 3: Plot with Plotly Express
fig = px.bar(
    type_stat_counts,
    x='Type Count',
    y='Count',
    color='Stat Total',
    title='Stat Total vs Number of Types'
)

fig.show()


As with the number of egg groups, the stat total is not much of an indicator of the number of types, apart from at extreme values. Pokemon in the lower fifth of the range (175-296) are more than twice as likely to be monotype, while Pokemon in the upper fifth (660-780) are slightly more likely to be dual-type.

17% of all monotypes are in the lower fifth, while 8% of dual types are in the lower fifth.

2% of monotypes are in the upper fifth, while 5% of dual types are in the upper fifth.
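
The five ranges used in this section can also be produced with `pd.cut` instead of a chained conditional. The sketch below is an alternative, not the code that generated the figures above; the sample totals are chosen only to exercise the bin edges, and 800 is a hypothetical out-of-range value.

```python
import pandas as pd

# Values chosen to sit on the bin edges; 800 is a hypothetical
# out-of-range total to exercise the fallback label.
totals = pd.Series([175, 296, 297, 417, 418, 538, 539, 659, 660, 780, 800])

# Right-closed bins: (174, 296], (296, 417], ..., (659, 780].
edges = [174, 296, 417, 538, 659, 780]
labels = ['175-296', '297-417', '418-538', '539-659', '660-780']

binned = pd.cut(totals, bins=edges, labels=labels)

# Anything outside the edges comes back as missing; label it '780+'
# to mirror the final else branch of the chained conditional.
binned = binned.cat.add_categories(['780+']).fillna('780+')
print(binned.tolist())
```

A side benefit is that the result is an ordered categorical, so the ranges keep their natural order when grouped or plotted.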

#Highest Stat vs Type

Analysing which stat is the highest can hopefully show some trends, with certain types being more likely to have one particular stat higher than the others.

In [None]:
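# Intended to select the six base stat columns (HP to Speed) by position;
# check the slice against dataset.columns if the column layout changes.
stats = dataset.iloc[:, 2:8]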
typeI = dataset['Type I']
typeII = dataset['Type II']

max_stats = pd.DataFrame(
    {
        'Highest Stat' : stats.idxmax(axis=1),
        'Type I' : typeI,
        'Type II' : typeII
    }
)

stat_typeI_count = max_stats.groupby(['Highest Stat', 'Type I']).size()
stat_typeI_count = stat_typeI_count.to_frame(name='Count').reset_index()
fig_typeI = px.bar(
    stat_typeI_count,
    x='Type I',
    y='Count',
    color='Highest Stat',
    title='Highest Stat vs Type I',
    category_orders={'Type I': type_order}
)


stat_typeII_count = max_stats.groupby(['Highest Stat', 'Type II']).size()
stat_typeII_count = stat_typeII_count.to_frame(name='Count').reset_index()
fig_typeII = px.bar(
    stat_typeII_count,
    x='Type II',
    y='Count',
    color='Highest Stat',
    title='Highest Stat vs Type II',
    category_orders={'Type II': type_order}
)

fig_typeI.show()
fig_typeII.show()


At first, this does not seem to provide much information. Most types have two or even three stats that are almost tied for being most commonly the highest (the notable exception being Fighting, where Attack is significantly more common).

Looking deeper, though, the most information seems to come from which stats are *least* commonly the highest. The prime example is Electric, where not a single Pokemon has Special Defense as its highest stat. Bug and Rock types are very unlikely to have HP as their highest stat, and Ground types typically do not have Special Attack as theirs.

Another interesting note is the surprisingly average number of Psychic types with Attack as their highest stat, given that Psychic types are usually known for high Special Attack.
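
One caveat when reading these charts: `idxmax` returns the first column holding the maximum, so a Pokemon whose top two stats are equal is counted under whichever column comes first. The sketch below shows this on three rows copied from the `dataset.head()` output; the cleaned column names and the `tied` flag are our additions.

```python
import pandas as pd

# Assumption: headers already cleaned of embedded line breaks.
stat_cols = ['HP', 'Attack', 'Defense', 'Special Attack', 'Special Defense', 'Speed']

# Stat rows copied from the dataset.head() output.
sample = pd.DataFrame(
    [[45, 49, 49, 65, 65, 45],
     [80, 100, 123, 122, 120, 80],
     [39, 52, 43, 60, 50, 65]],
    index=['Bulbasaur', 'Venusaur (Mega)', 'Charmander'],
    columns=stat_cols,
)

# Name of the highest stat per Pokemon (first column wins on ties).
highest = sample.idxmax(axis=1)

# Flag rows where more than one stat equals the row maximum.
row_max = sample.max(axis=1)
tied = sample.eq(row_max, axis=0).sum(axis=1) > 1

print(pd.DataFrame({'Highest Stat': highest, 'Tied': tied}))
```

Bulbasaur has equal Special Attack and Special Defense, so it is counted under Special Attack only; counting how many rows are tied would show how much this affects the charts.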

#Size vs Type

By plotting the Height and Weight of every Pokemon in each type, we hope to find patterns. Perhaps Rock or Steel types will weigh the most, while Fairy types will on average weigh the least, and Electric types might be smaller as they are typically faster and more agile.

In [None]:
size_df = {
    'Weight(kg)' : dataset['Weight (kg)'],
    'Height(m)' : dataset['Height (m)'],
    'Type I' : dataset['Type I'],
}

size_df = pd.DataFrame(size_df)

fig = px.scatter(size_df, x='Height(m)', y='Weight(kg)', facet_col='Type I', color='Type I',
                 title='Height vs. Weight by Type I (Faceted)',
                 color_discrete_map=type_colour_map,
                 category_orders={'Type I':type_order})

# LAYOUT COMMANDS #############################################################
fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1]))
fig.for_each_xaxis(lambda xaxis: xaxis.update(title_text=''))
fig.update_layout(
    xaxis_title='Height(m)',
    title_x=0.5
)
###############################################################################

fig.show()


We can see that most types have a few outliers, especially in weight: most Pokemon cluster at small values, with the occasional one hundreds of kilograms above the rest. The biggest case of this is Poison, where the outlier is 'Eternamax Eternatus', a giant boss Pokemon in the games that is much larger than every other Pokemon and has a high weight to match.

Some other Pokemon match, or even beat, Eternamax Eternatus in weight, but it is by far the tallest, and it stands out from the other relatively small and light Poison types.

This doesn't give us much to work with, as a lot of Pokemon are clustered close to the (0, 0) mark. We can still see that Steel has a few more heavy Pokemon than other types, and that Flying has no outliers, although this is likely because there are very few Type I Flying types.
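
To identify which Pokemon sits behind an extreme point in each facet, one option is to look up the tallest entry per type. The sketch below uses the five rows from the `dataset.head()` output, with cleaned column names as an assumption; `tallest_idx` and `tallest` are illustrative names.

```python
import pandas as pd

# Rows copied from the dataset.head() output (header names cleaned).
sample = pd.DataFrame({
    'Pokemon Name': ['Bulbasaur', 'Ivysaur', 'Venusaur', 'Venusaur (Mega)', 'Charmander'],
    'Weight (kg)': [6.9, 13.0, 100.0, 155.5, 8.5],
    'Height (m)': [0.7, 1.0, 2.0, 2.4, 0.6],
    'Type I': ['Grass', 'Grass', 'Grass', 'Grass', 'Fire'],
})

# Row label of the tallest Pokemon within each Type I group.
tallest_idx = sample.groupby('Type I')['Height (m)'].idxmax()

# Look those rows up to see the names behind the extreme points.
tallest = sample.loc[tallest_idx, ['Type I', 'Pokemon Name', 'Height (m)']].set_index('Type I')
print(tallest)
```

Because most points sit near the origin, passing `log_x=True` and `log_y=True` to `px.scatter` may also spread the clusters out, provided every height and weight is positive.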

#HP vs Type

Comparing HP stats across types will hopefully show whether certain types tend to have higher or lower HP than others.

In [None]:
hp_df = pd.DataFrame(
    {
        'HP' : dataset['HP'],
        'Type I' : dataset['Type I'],
        'Type II' : dataset['Type II']
    }
)

fig_typeI = px.violin(hp_df, x='Type I', y='HP', color='Type I', color_discrete_map=type_colour_map, category_orders={'Type I' : type_order})
fig_typeI.show()

fig_typeII = px.violin(hp_df, x='Type II', y='HP', color='Type II', color_discrete_map=type_colour_map, category_orders={'Type II' : type_order})
fig_typeII.show()

Unfortunately, we don't get much information out of this. Apart from the occasional outlier, the majority of types have similar average HP stats. The only amusing result is Type I Bug and Type II Ghost having a minimum HP of 1, thanks to 'Shedinja' being the only Pokemon with 1 HP.