<a href="https://colab.research.google.com/github/Zzzhenya/pokemon_data/blob/readme_edits/Pokemon%20Data%20Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Complete Pokemon Dataset

## Task

In [None]:
import pandas

Discover which Pokemon would make the best team.

1. Have a look at the variables, understand what they are.
2. Formulate several research questions. Here is some inspiration:
    - How do the height and weight of a Pokemon correlate with its various base stats?
    - What factors influence the Experience Growth and Egg Steps? Are these quantities correlated?
    - Which type is the strongest overall? Which is the weakest?
    - In what ways are legendary pokemon different from the rest?
3. For each question:
    - Make plots and/or compute statistics.
    - Write a short conclusion that refers to the evidence you found in the data.


## Dataset

_Source: https://www.kaggle.com/datasets/rounakbanik/pokemon_

This dataset contains information on all 802 Pokemon from all Seven Generations of Pokemon.

The information contained in this dataset includes Base Stats, Performance against Other Types, Height, Weight, Classification, Egg Steps, Experience Points, Abilities, etc.

Variables:
- name: The English name of the Pokemon
- japanese_name: The Original Japanese name of the Pokemon
- pokedex_number: The entry number of the Pokemon in the National Pokedex
- percentage_male: The percentage of the species that are male. Blank if the Pokemon is genderless.
- type1: The Primary Type of the Pokemon
- type2: The Secondary Type of the Pokemon
- classification: The Classification of the Pokemon as described by the Sun and Moon Pokedex
- height_m: Height of the Pokemon in metres
- weight_kg: The Weight of the Pokemon in kilograms
- capture_rate: Capture Rate of the Pokemon
- base_egg_steps: The number of steps required to hatch an egg of the Pokemon
- abilities: A stringified list of abilities that the Pokemon is capable of having
- experience_growth: The Experience Growth of the Pokemon
- base_happiness: Base Happiness of the Pokemon
- against_?: Eighteen features that denote the amount of damage taken against an attack of a particular type
- hp: The Base HP of the Pokemon
- attack: The Base Attack of the Pokemon
- defense: The Base Defense of the Pokemon
- sp_attack: The Base Special Attack of the Pokemon
- sp_defense: The Base Special Defense of the Pokemon
- speed: The Base Speed of the Pokemon
- generation: The numbered generation in which the Pokemon was first introduced
- is_legendary: Denotes if the Pokemon is legendary.
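
Because `abilities` is stored as a stringified list, it needs to be parsed before individual abilities can be counted or filtered. The following is a minimal sketch, assuming the values look like Python list literals; the sample rows and the `abilities_list` and `n_abilities` column names are illustrative and not taken from the dataset.

```python
import ast

import pandas as pd

# Illustrative rows: the strings mimic a stringified Python list.
sample = pd.DataFrame({
    "name": ["A", "B"],
    "abilities": ["['Overgrow', 'Chlorophyll']", "['Blaze']"],
})

# ast.literal_eval parses the string into a real list without executing code.
sample["abilities_list"] = sample["abilities"].apply(ast.literal_eval)

# Number of abilities per row (hypothetical helper column).
sample["n_abilities"] = sample["abilities_list"].apply(len)
```

If the real column follows the same pattern, the same two `apply` calls could be run on `df["abilities"]`.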




## Analysis

In [None]:
import pandas as pd
import seaborn as sns


# From https://drive.google.com/file/d/1ayvCVXaCHsXUsn8TDGRiay0v5SKyTRDH/view
df = pd.read_csv(
    "https://drive.google.com/uc?id=1ayvCVXaCHsXUsn8TDGRiay0v5SKyTRDH",
)
df.shape

In [None]:
# Print both versions; a notebook cell only displays its last expression.
print(pd.__version__)
print(sns.__version__)

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(25,10))
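# numeric_only=True skips text columns (name, type1, ...), which recent pandas versions otherwise reject.
sns.heatmap(df.corr(method="kendall", numeric_only=True), annot=True, fmt=".3f", ax=ax);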

In [None]:
import matplotlib

In [None]:
matplotlib.__version__

In [None]:
df.columns



Type/Strength Matrix

1. 'against_bug'
2. 'against_dark'
3. 'against_dragon'
4. 'against_electric'
5. 'against_fairy'
6. 'against_fight'
7. 'against_fire'
8. 'against_flying'
9. 'against_ghost'
10. 'against_grass'
11. 'against_ground'
12. 'against_ice'
13. 'against_normal'
14. 'against_poison'
15. 'against_psychic'
16. 'against_rock'
17. 'against_steel'
18. 'against_water'


Base stats

* 'base_total' = hp + attack + defense + sp_attack + sp_defense + speed
* 'hp'
* 'attack'
* 'defense'
* 'sp_attack'
* 'sp_defense'
* 'speed'
* 'base_egg_steps'
* 'base_happiness'
* 'experience_growth'
* 'capture_rate'
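
The relationship between 'base_total' and the six base stats can be checked directly. The sketch below uses made-up toy rows rather than the real data; `STAT_COLS`, `recomputed` and `mismatches` are illustrative names.

```python
import pandas as pd

STAT_COLS = ["hp", "attack", "defense", "sp_attack", "sp_defense", "speed"]

# Toy rows standing in for the real DataFrame (values are made up).
toy = pd.DataFrame({
    "hp": [45, 60], "attack": [49, 62], "defense": [49, 63],
    "sp_attack": [65, 80], "sp_defense": [65, 80], "speed": [45, 60],
    "base_total": [318, 405],
})

# Row-wise sum of the six stats, compared against the stored total.
recomputed = toy[STAT_COLS].sum(axis=1)

# Rows where the stored total disagrees with the recomputed one.
mismatches = toy[recomputed != toy["base_total"]]
```

An empty `mismatches` frame would support the formula; running the same comparison on `df` would show whether it holds for every row of the real dataset.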

About the Pokemon


* 'name'
* 'pokedex_number'
* 'japanese_name'
* 'generation'
* 'classfication'
* 'is_legendary'
* 'abilities'
* 'type1'
* 'type2'
* 'height_m'
* 'weight_kg'
* 'percentage_male'











In [None]:

pokemon = df[['pokedex_number','name','japanese_name', 'generation','classfication',
                    'is_legendary','abilities','type1','type2','height_m','weight_kg','percentage_male',
                    'base_egg_steps','base_happiness','capture_rate','experience_growth',]]
pokemon

pokemon_with_stats = df[['pokedex_number','name','japanese_name', 'generation','classfication',
                    'is_legendary','abilities','type1','type2','height_m','weight_kg','percentage_male',
                    'base_egg_steps','base_happiness','capture_rate','experience_growth','hp',
                    'attack', 'defense', 'sp_attack', 'sp_defense','speed', 'base_total']]

In [None]:
pokemon.describe()


In [None]:
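# Correlation of hp with the other numeric columns (printed, since only the last expression of a cell is displayed)
print(pokemon_with_stats.corr(numeric_only=True).loc['hp'].sort_values(ascending=False))
# Question: How do the height and weight of a Pokemon correlate with its various base stats?
pokemon_with_stats.corr(numeric_only=True).loc['height_m'].sort_values(ascending=False)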

In [None]:
# Question: How does weight vary with height?
sns.lmplot(data=pokemon_with_stats, x="height_m", y="weight_kg")

print("Height in m has a positive association with weight in kg.\nHeight has a negative association with base happiness.")
print("The taller the Pokemon, the higher its weight. The taller the Pokemon, the lower its base happiness.")

In [None]:
sns.scatterplot(data=pokemon_with_stats, x="weight_kg", y="base_total", hue="is_legendary").set(title="weight_kg vs base_total")
# Print the correlations so they are shown even though this is not the last expression in the cell.
print(pokemon_with_stats.corr(numeric_only=True).loc['base_total'].sort_values(ascending=False))
print("\nBase total has a high positive correlation with sp_attack, attack, sp_defense, defense, hp and speed.\nThis makes sense since base total is defined as the sum of these fields.\n")
print("\nLegendary Pokemon seem to have a higher base total in general.")

In [None]:

print("Weight unknown Pokemon:", pokemon_with_stats.weight_kg.isna().sum())
print("Height unknown Pokemon:", pokemon_with_stats.height_m.isna().sum())
# Question: Are the Pokemon with a null weight the same Pokemon that have a null height?
# Explore the Pokemon with a null weight/height
print(pokemon_with_stats.loc[pokemon_with_stats.weight_kg.isna()][["name", "generation", "is_legendary", "pokedex_number"]])
print(pokemon_with_stats.loc[pokemon_with_stats.height_m.isna()][["name", "generation", "is_legendary", "pokedex_number"]])
print("Most of the Pokemon with a null weight/height are from the 1st generation and are not legendary.")

In [None]:
# Hypothesis 1: Heavier/taller Pokemon have higher stats

#Exploring all pokemon stats against weight/height 
#'hp','attack','defense','sp_attack','sp_defense',
fig = plt.figure(figsize=(30, 12))
spec = fig.add_gridspec(ncols=5, nrows=2)

#weight
ax0 = fig.add_subplot(spec[0, 0])
sns.scatterplot(data=pokemon_with_stats, x="weight_kg", y = "hp", hue = "is_legendary", ax=ax0).set(title="weight_kg vs hp")
ax1 = fig.add_subplot(spec[0, 1])
sns.scatterplot(data=pokemon_with_stats, x="weight_kg", y = "attack",  hue = "is_legendary", ax=ax1).set(title="weight_kg vs attack")
ax2 = fig.add_subplot(spec[0, 2])
sns.scatterplot(data=pokemon_with_stats, x="weight_kg", y = "defense",  hue = "is_legendary", ax=ax2).set(title="weight_kg vs defense")
ax3 = fig.add_subplot(spec[0, 3])
sns.scatterplot(data=pokemon_with_stats, x="weight_kg", y = "sp_attack",  hue = "is_legendary", ax=ax3).set(title="weight_kg vs sp_attack")
ax4 = fig.add_subplot(spec[0, 4])
sns.scatterplot(data=pokemon_with_stats, x="weight_kg", y = "sp_defense",  hue = "is_legendary", ax=ax4).set(title="weight_kg vs sp_defense")

#height
ax0 = fig.add_subplot(spec[1, 0])
sns.scatterplot(data=pokemon_with_stats, x="height_m", y = "hp",  hue = "is_legendary", ax=ax0).set(title="height_m vs hp")
ax1 = fig.add_subplot(spec[1, 1])
sns.scatterplot(data=pokemon_with_stats, x="height_m", y = "attack",  hue = "is_legendary", ax=ax1).set(title="height_m vs attack")
ax2 = fig.add_subplot(spec[1, 2])
sns.scatterplot(data=pokemon_with_stats, x="height_m", y = "defense",  hue = "is_legendary", ax=ax2).set(title="height_m vs defense")
ax3 = fig.add_subplot(spec[1, 3])
sns.scatterplot(data=pokemon_with_stats, x="height_m", y = "sp_attack",  hue = "is_legendary", ax=ax3).set(title="height_m vs sp_attack")
ax4 = fig.add_subplot(spec[1, 4])
sns.scatterplot(data=pokemon_with_stats, x="height_m", y = "sp_defense",  hue = "is_legendary", ax=ax4).set(title="height_m vs sp_defense")


fig.suptitle('\nweight and height vs hp, attack, defense, sp_attack, sp_defense', fontsize = 18)

In [None]:
#'base_total','speed', "base_egg_steps", "base_happiness", "experience_growth"
fig = plt.figure(figsize=(30, 12))
spec = fig.add_gridspec(ncols=5, nrows=2)
ax0 = fig.add_subplot(spec[0, 0])
#weight
sns.scatterplot(data=pokemon_with_stats, x="weight_kg", y = "speed", hue = "is_legendary", ax=ax0).set(title="weight_kg vs speed")
ax1 = fig.add_subplot(spec[0, 1])
sns.scatterplot(data=pokemon_with_stats, x="weight_kg", y = "base_total",  hue = "is_legendary", ax=ax1).set(title="weight_kg vs base_total")
ax2 = fig.add_subplot(spec[0, 2])
sns.scatterplot(data=pokemon_with_stats, x="weight_kg", y = "base_egg_steps",  hue = "is_legendary", ax=ax2).set(title="weight_kg vs base_egg_steps")
ax3 = fig.add_subplot(spec[0, 3])
sns.scatterplot(data=pokemon_with_stats, x="weight_kg", y = "base_happiness",  hue = "is_legendary", ax=ax3).set(title="weight_kg vs base_happiness")
ax4 = fig.add_subplot(spec[0, 4])
sns.scatterplot(data=pokemon_with_stats, x="weight_kg", y = "experience_growth",  hue = "is_legendary", ax=ax4).set(title="weight_kg vs experience_growth")

#height
ax0 = fig.add_subplot(spec[1, 0])
sns.scatterplot(data=pokemon_with_stats, x="height_m", y = "speed",  hue = "is_legendary", ax=ax0).set(title="height_m vs speed")
ax1 = fig.add_subplot(spec[1, 1])
sns.scatterplot(data=pokemon_with_stats, x="height_m", y = "base_total",  hue = "is_legendary", ax=ax1).set(title="height_m vs base_total")
ax2 = fig.add_subplot(spec[1, 2])
sns.scatterplot(data=pokemon_with_stats, x="height_m", y = "base_egg_steps",  hue = "is_legendary", ax=ax2).set(title="height_m vs base_egg_steps")
ax3 = fig.add_subplot(spec[1, 3])
sns.scatterplot(data=pokemon_with_stats, x="height_m", y = "base_happiness",  hue = "is_legendary", ax=ax3).set(title="height_m vs base_happiness")
ax4 = fig.add_subplot(spec[1, 4])
sns.scatterplot(data=pokemon_with_stats, x="height_m", y = "experience_growth",  hue = "is_legendary", ax=ax4).set(title="height_m vs experience_growth")


fig.suptitle('\nweight and height vs speed, base_total, base_egg_steps, base_happiness, experience_growth', fontsize = 18)

In [None]:
# capture_rate
sns.scatterplot(data=pokemon_with_stats, x="weight_kg", y = "capture_rate",  hue = "is_legendary").set(title="\nweight_kg vs capture_rate\n")

In [None]:
sns.scatterplot(data=pokemon_with_stats, x="generation", y = "capture_rate",  hue = "is_legendary").set(title="\ngeneration vs capture_rate\n")

In [None]:
#happiness of a pokemon - Friendship, also known as Happiness, is an important invisible stat in Pokémon Sword and Shield,
#which, as the name suggests, determines how friendly your Pokémon are with you.

print("Even though the correlations are positive, there is no clear linear trend between the base stats\n ('hp','attack','defense','sp_attack','sp_defense', 'base_total','speed')\nand the weight or height of a Pokemon.\n")
print("Other findings:\n")
print("Most Pokemon have a height of less than 4 m and a weight of less than 400 kg.\n")
print("The hue distribution shows that legendary Pokemon are far fewer than non-legendary ones.\n")
print("Base total of legendary Pokemon is generally higher.\n")
print("Base egg steps, base happiness and experience growth take discrete values.\n")
print("Base happiness of most legendary Pokemon is very low (0), so they are not very friendly.\n")
print("Base egg steps of most legendary Pokemon is very high (more than 15000).\n")
print("Experience growth of most legendary Pokemon takes a single value (between 1.2 * 10^6 and 1.4 * 10^6).\n")

In [None]:
pokemon_with_stats.corr(numeric_only=True).loc['is_legendary'].sort_values(ascending = False)


In [None]:
#explore legendary pokemon
#explore non-legendary pokemon
#explore by classfication
#explore generation-wise distributions
#type 1/type 2

In [None]:
#generation-wise experience growth

sns.scatterplot(data=pokemon_with_stats, x="generation", y = "experience_growth", hue = "is_legendary"  ).set(title="\ngeneration vs experience_growth\n")

print("In Pokemon generations 1, 2, 5, 6 and 7, experience growth spans 4 discrete values.\n")
print("In Pokemon generations 3 and 4, experience growth spans 6 discrete values.\n")

In [None]:
sns.scatterplot(data=pokemon_with_stats, x="generation", y = "experience_growth", hue = "is_legendary", alpha = 0.1  )

In [None]:
#experience growth vs type 1 of each generation
fig = plt.figure(figsize=(35, 12))
spec = fig.add_gridspec(ncols=5, nrows=2)

#type1
ax0 = fig.add_subplot(spec[0, 0])

sns.scatterplot(data=pokemon_with_stats.loc[pokemon_with_stats.generation==1], x="experience_growth", y = "type1", ax=ax0).set(title="generation 1 : experience_growth vs type 1")
ax1 = fig.add_subplot(spec[0, 1])
sns.scatterplot(data=pokemon_with_stats.loc[pokemon_with_stats.generation==2], x="experience_growth", y = "type1", ax=ax1).set(title="generation 2 : experience_growth vs type 1")
ax2 = fig.add_subplot(spec[0, 2])
sns.scatterplot(data=pokemon_with_stats.loc[pokemon_with_stats.generation==3], x="experience_growth", y = "type1", ax=ax2).set(title="generation 3 : experience_growth vs type 1")
ax3 = fig.add_subplot(spec[0, 3])
sns.scatterplot(data=pokemon_with_stats.loc[pokemon_with_stats.generation==4], x="experience_growth", y = "type1", ax=ax3).set(title="generation 4 : experience_growth vs type 1")
ax4 = fig.add_subplot(spec[0, 4])
sns.scatterplot(data=pokemon_with_stats.loc[pokemon_with_stats.generation==5], x="experience_growth", y = "type1", ax=ax4).set(title="generation 5 : experience_growth vs type 1")

#row 2
ax5 = fig.add_subplot(spec[1, 0])
sns.scatterplot(data=pokemon_with_stats.loc[pokemon_with_stats.generation==6], x="experience_growth", y = "type1", ax=ax5).set(title="generation 6 : experience_growth vs type 1")
ax6 = fig.add_subplot(spec[1, 1])
sns.scatterplot(data=pokemon_with_stats.loc[pokemon_with_stats.generation==7], x="experience_growth", y = "type1", ax=ax6).set(title="generation 7 : experience_growth vs type 1")

fig.suptitle("\nExperience-growth vs type 1 variation of each generation", fontsize = 18)



In [None]:
#experience growth vs type 2 vs generation
fig = plt.figure(figsize=(35, 15))
spec = fig.add_gridspec(ncols=5, nrows=2)

#row 1
ax0 = fig.add_subplot(spec[0, 0])

sns.scatterplot(data=pokemon_with_stats.loc[pokemon_with_stats.generation==1], x="experience_growth", y = "type2", ax=ax0).set(title="generation 1 : experience_growth vs type 2")
ax1 = fig.add_subplot(spec[0, 1])
sns.scatterplot(data=pokemon_with_stats.loc[pokemon_with_stats.generation==2], x="experience_growth", y = "type2", ax=ax1).set(title="generation 2 : experience_growth vs type 2")
ax2 = fig.add_subplot(spec[0, 2])
sns.scatterplot(data=pokemon_with_stats.loc[pokemon_with_stats.generation==3], x="experience_growth", y = "type2", ax=ax2).set(title="generation 3 : experience_growth vs type 2")
ax3 = fig.add_subplot(spec[0, 3])
sns.scatterplot(data=pokemon_with_stats.loc[pokemon_with_stats.generation==4], x="experience_growth", y = "type2", ax=ax3).set(title="generation 4 : experience_growth vs type 2")
ax4 = fig.add_subplot(spec[0, 4])
sns.scatterplot(data=pokemon_with_stats.loc[pokemon_with_stats.generation==5], x="experience_growth", y = "type2", ax=ax4).set(title="generation 5 : experience_growth vs type 2")

#row 2
ax5 = fig.add_subplot(spec[1, 0])
sns.scatterplot(data=pokemon_with_stats.loc[pokemon_with_stats.generation==6], x="experience_growth", y = "type2", ax=ax5).set(title="generation 6 : experience_growth vs type 2")
ax6 = fig.add_subplot(spec[1, 1])
sns.scatterplot(data=pokemon_with_stats.loc[pokemon_with_stats.generation==7], x="experience_growth", y = "type2", ax=ax6).set(title="generation 7 : experience_growth vs type 2")

fig.suptitle("\nExperience-growth vs type 2 variation of each generation", fontsize = 18)

In [None]:
print("There seem to be more type options, for both type 1 and type 2, in the middle range of experience growth, except for type 2 in generation 7 Pokemon.")

In [None]:
# generation-wise base egg steps
sns.scatterplot(data=pokemon_with_stats, x="generation", y = "base_egg_steps", hue = "is_legendary"  ).set(title="\ngeneration vs base_egg_steps\n")

In [None]:
sns.lmplot(data=pokemon_with_stats, x="generation", y = "pokedex_number").set(title="\ngeneration vs pokedex_number\n")

print("Pokedex number is given to the pokemon based on the order of generation. First generation pokemon are numbered first.")

In [None]:
df.name.isna().sum()
df.japanese_name.isna().sum()
df.pokedex_number.isna().sum()
df.type1.unique()
df.type2.isna().sum()
#Explore later - Who has type 2 power, who doesn't.
df.base_egg_steps.unique()
df.base_egg_steps.isna().sum()
df.base_happiness.isna().sum()
df.base_happiness.unique()
df.experience_growth.isna().sum()
df.experience_growth.unique()

## Conclusion


*   The data can be grouped into 3 sections: Pokemon information, strength matrix and stats

**Considering pokemon information:**


*   All Pokemon have a pokedex number, a name and a Japanese name.

*   Pokedex numbers are assigned in order of generation, e.g. first generation Pokemon are numbered first, then the second generation, and so on.

*   Every Pokemon also has a primary type (type1), which is one of 18 ('grass', 'fire', 'water','bug', 'normal', 'poison', 'electric', 'ground', 'fairy', 'fighting', 'psychic','rock', 'ghost', 'ice', 'dragon', 'dark', 'steel', 'flying').

*   Not all Pokemon have a type 2 power.

*  Taller Pokemon (height_m) tend to weigh more (weight_kg), and taller Pokemon tend to have a lower base_happiness.

*  Some Pokemon have no recorded male percentage, weight or height. The Pokemon with no recorded height are the same ones with no recorded weight.
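
The claim that the same Pokemon lack both measurements can be tested rather than read off two printed tables. A minimal sketch on toy data follows (the rows are made up; `missing_height`, `missing_weight` and `same_rows` are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy rows: one entry has neither a height nor a weight.
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "height_m": [0.7, np.nan, 1.2],
    "weight_kg": [6.9, np.nan, 40.0],
})

# Boolean masks marking the rows where each measurement is missing.
missing_height = toy["height_m"].isna()
missing_weight = toy["weight_kg"].isna()

# True only if exactly the same rows are missing both values.
same_rows = missing_height.equals(missing_weight)
```

Applied to `df`, a `True` result would confirm the statement; a `False` result would mean at least one Pokemon lacks only one of the two values.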

**Considering strength matrix:**

*   The columns whose names start with "against" give, for each Pokemon, the amount of damage it takes from an attack of each of the 18 types. This shows that the game involves Pokemon battling each other.

*  This matrix probably helps players, or the game's decision-making system, choose which Pokemon to send against an opponent (with an advantage, a disadvantage, or on even terms).
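
As an illustration of how the matrix could support such a choice, the sketch below lists the attack types each Pokemon is weak to. It assumes that a value above 1 means the Pokemon takes extra damage from that type; the toy names and values are made up, and only three of the eighteen columns are shown.

```python
import pandas as pd

# Toy rows with three of the eighteen against_* columns (values are made up).
toy = pd.DataFrame({
    "name": ["leafy", "flamey"],
    "against_fire": [2.0, 0.5],
    "against_water": [0.5, 2.0],
    "against_grass": [0.25, 0.5],
})

# Select every column of the strength matrix by its prefix.
against_cols = [c for c in toy.columns if c.startswith("against_")]
multipliers = toy.set_index("name")[against_cols]

# Assumption: a multiplier above 1 marks a weakness to that attack type.
weaknesses = {
    name: [c.replace("against_", "") for c in against_cols if row[c] > 1]
    for name, row in multipliers.iterrows()
}
```

Selecting by prefix keeps the sketch independent of the exact number of type columns.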

**Considering stats:** 

*  base_total has a positive association with hp, attack, defense, sp_attack, sp_defense, and speed

*  'base_egg_steps', 'base_happiness','experience_growth', 'capture_rate' are discrete value fields. There is a small number of clearly defined levels/values for these.

> * base egg steps : 1280 , 2560 , 3840 , 5120 , 6400 , 7680 , 8960 , 10240 , 20480 , 30720
> * base happiness : 0 , 35 , 70 , 90 , 100 , 140
> * experience growth : 600000 , 800000 , 1000000 , 1059860 , 1250000 , 1640000  

*  base_egg_steps is very high for legendary Pokemon (more than 15000).

*  Base happiness of most legendary Pokemon is very low (0).

*  Even though there are 6 discrete experience growth values, Pokemon of generations 1, 2, 5, 6 and 7 use only 4 of them. Only Pokemon of generations 3 and 4 use the other two.

*  The variation of type1 and type2 against experience growth across generations is interesting. The common pattern is that fewer type1 and type2 options appear at the lower experience growth levels, although there is no clear overall trend of type options against experience growth.

*  Dataset can be further analyzed considering legendary pokemon as a special scenario. 
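
One way to start treating legendary Pokemon as a special case is to summarise the stats separately for each value of `is_legendary`. The sketch below does this on made-up toy rows; the choice of the median and of these three columns is an assumption for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy rows: two non-legendary (0) and two legendary (1) entries, values made up.
toy = pd.DataFrame({
    "is_legendary": [0, 0, 1, 1],
    "base_total": [300, 500, 600, 680],
    "base_happiness": [70, 70, 0, 35],
    "base_egg_steps": [5120, 5120, 30720, 20480],
})

# Median of each stat within the legendary and non-legendary groups.
summary = toy.groupby("is_legendary")[
    ["base_total", "base_happiness", "base_egg_steps"]
].median()
```

The median is used here because fields such as base_happiness take only a few discrete values, so a mean could land on a value no Pokemon actually has.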

**Overall:**

*  It can be concluded that more game trends could be extracted by analyzing the dataset with more specific questions, but not all game rules can be recovered from the dataset.

