<a href="https://colab.research.google.com/github/blackcrowX/Data-Analysis-Projects/blob/main/Python/pokemon8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align="center">Exploratory Data Analysis into Pokemon</h1>

<img src="https://static.wikia.nocookie.net/logo-timeline/images/2/21/Pok%C3%A9mon_%28Print%29.svg/revision/latest?cb=20181024043055"/>

<p align="center"><em>Image taken from: https://logo-timeline.fandom.com/wiki/Pok%C3%A9mon/Other</em></p>


## Table of Contents

*   Introduction
*   Dataset
*   Setup

  1. [Import Libraries](#1)
  2. [Import Data](#2)

*   Data Cleaning

  3. [Review Dataframe](#3)
  4. [Review Info](#5)
  5. [Review Missing Values](#6)
  6. [Organise Columns](#8)
  7. [Adjust Index](#10)

*   Data Analysis

  8. [Frequency](#11)
  9. [The Strongest and The Weakest](#7)
  10. [The Fastest and The Slowest](#8)
  11. [Summary](#9)

*   Data Visualisation

  12. [Count Plot](#11)
  13. [Pie Plot](#12)
  14. [Box Plot and Violin Plot](#13)
  15. [Swarm Plot](#14)
  16. [Heat Map](#15)

*   Conclusion

# Introduction 


This case study explores a dataset of Pokemon statistics. It manipulates the data and uses summary statistics and visualisations to answer questions about it.

Considering how diverse Pokemon are, I was interested in analyzing this dataset to learn how the game is balanced and to identify the best Pokemon, if one exists.



## Dataset

The dataset lists all 898 Pokemon species (1072 entries including alternate forms) as of 2021. It contains each Pokemon's number, name, first and second type, basic statistics, total statistics, generation, and legendary status. The dataset was published by <a href="https://data.world/data-society/pokemon-with-stats">data.world</a>.

# Setup

## Step 1: Import Libraries

Import and configure libraries required for data analysis.

In [1]:
import numpy as np
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

## Step 2: Import Dataset
Import dataset as variable `df` into Python.

In [2]:
url = 'https://raw.githubusercontent.com/blackcrowX/Data-Analysis-Projects/main/Datasets/pokemon-stats-gen-1-8.csv'
df = pd.read_csv(url)

# Data Cleaning

## Step 3: Review Dataframe

Read the first five rows of the dataframe.

In [3]:
df.head()

,number,name,type1,type2,total,hp,attack,defense,sp_attack,sp_defense,speed,generation,legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,Mega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,3,Gigantamax Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False


Read the last five rows of the dataframe.

In [4]:
df.tail()

,number,name,type1,type2,total,hp,attack,defense,sp_attack,sp_defense,speed,generation,legendary
1067,896,Glastrier,Ice,,580,100,145,130,65,110,30,8,True
1068,897,Spectrier,Ghost,,580,100,65,60,145,80,130,8,True
1069,898,Calyrex,Psychic,Grass,500,100,80,80,80,80,80,8,True
1070,898,Ice Rider Calyrex,Psychic,Ice,680,100,165,150,85,130,50,8,True
1071,898,Shadow Rider Calyrex,Psychic,Ghost,680,100,85,80,165,100,150,8,True


From the initial review of the head and tail of the dataframe, there appear to be missing values (`NaN`) in `type2`, and each entry is identified by a unique name rather than a unique number, since alternate forms share a number.
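The observation about names and numbers can be checked directly. A minimal sketch on a toy frame built from the first rows shown above; the variable names `sample` and `shared_numbers` are introduced for this example and are not part of the notebook:

```python
import pandas as pd

# Toy frame copied from the first five rows of the output above.
sample = pd.DataFrame({
    "number": [1, 2, 3, 3, 3],
    "name": ["Bulbasaur", "Ivysaur", "Venusaur",
             "Mega Venusaur", "Gigantamax Venusaur"],
})

# Names are unique, Pokedex numbers are not (alternate forms share a number).
names_unique = sample["name"].is_unique
numbers_unique = sample["number"].is_unique

# Rows whose number occurs more than once, i.e. species with alternate forms.
shared_numbers = sample[sample["number"].duplicated(keep=False)]

print("names unique:", names_unique)
print("numbers unique:", numbers_unique)
print(shared_numbers)
```

On the full dataframe the same two properties would apply to `df["name"]` and `df["number"]`.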

## Step 4: Review Info
Check the info of the dataset.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1072 entries, 0 to 1071
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   number      1072 non-null   int64 
 1   name        1072 non-null   object
 2   type1       1072 non-null   object
 3   type2       574 non-null    object
 4   total       1072 non-null   int64 
 5   hp          1072 non-null   int64 
 6   attack      1072 non-null   int64 
 7   defense     1072 non-null   int64 
 8   sp_attack   1072 non-null   int64 
 9   sp_defense  1072 non-null   int64 
 10  speed       1072 non-null   int64 
 11  generation  1072 non-null   int64 
 12  legendary   1072 non-null   bool  
dtypes: bool(1), int64(9), object(3)
memory usage: 101.7+ KB


The dataframe has 13 columns: nine of integer type, three of object type and one of boolean type. The number of rows (1072) matches the number of Pokemon species including alternate forms. Each column's type matches the values it holds.
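The same summary can be derived programmatically rather than read off the `info()` output. A small sketch, assuming a toy frame with only four of the columns; the grouping into `int_cols`, `bool_cols` and `other_cols` is my own illustration:

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_integer_dtype

# Toy frame with a subset of the columns (values taken from the outputs above).
sample = pd.DataFrame({
    "number": [1, 896],
    "name": ["Bulbasaur", "Glastrier"],
    "total": [318, 580],
    "legendary": [False, True],
})

n_rows, n_cols = sample.shape

# Group the columns by the kind of data they hold.
int_cols = [col for col in sample.columns if is_integer_dtype(sample[col])]
bool_cols = [col for col in sample.columns if is_bool_dtype(sample[col])]
other_cols = [col for col in sample.columns if col not in int_cols + bool_cols]

print("shape:", n_rows, "x", n_cols)
print("integer:", int_cols)
print("boolean:", bool_cols)
print("other:", other_cols)
```

The text columns end up in `other_cols` regardless of whether pandas reports them as `object` or as a dedicated string type.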

## Step 5: Review Missing Values
Check for `Null` or `NaN` values.


In [6]:
df.isnull().sum()

number          0
name            0
type1           0
type2         498
total           0
hp              0
attack          0
defense         0
sp_attack       0
sp_defense      0
speed           0
generation      0
legendary       0
dtype: int64

Since all Pokemon species have a primary type but not necessarily a secondary type, we'll fill in these missing values with a placeholder.

In [7]:
# Assign back: inplace=True on a column selection may not update df in newer pandas
df['type2'] = df['type2'].fillna('None')

With this, the dataframe has no remaining missing values.
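To illustrate the fill step in isolation, here is a minimal sketch on a toy frame. The three rows come from the outputs above; the names `sample`, `missing_before` and `missing_after` are introduced for this example. Note that `"None"` is an ordinary placeholder string, not Python's `None`.

```python
import numpy as np
import pandas as pd

# Toy frame: Glastrier and Spectrier have no secondary type in the output above.
sample = pd.DataFrame({
    "name": ["Bulbasaur", "Glastrier", "Spectrier"],
    "type1": ["Grass", "Ice", "Ghost"],
    "type2": ["Poison", np.nan, np.nan],
})

missing_before = int(sample["type2"].isnull().sum())

# Fill the gaps with a placeholder string and assign the result back.
sample["type2"] = sample["type2"].fillna("None")

missing_after = int(sample["type2"].isnull().sum())
print(missing_before, "->", missing_after)
```

Because the placeholder is a string, it later shows up as its own category when counting unique secondary types.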

## Step 6: Organize Columns

Rename the columns `type1` to `primary_type` and `type2` to `secondary_type`.

In [8]:
df.rename(columns = {"type1":"primary_type", "type2":"secondary_type"}, inplace = True)

Check columns in the dataframe.

In [9]:
df.columns

Index(['number', 'name', 'primary_type', 'secondary_type', 'total', 'hp',
       'attack', 'defense', 'sp_attack', 'sp_defense', 'speed', 'generation',
       'legendary'],
      dtype='object')

## Step 7: Review Columns

Review columns of the dataframe.

In [11]:
print("Pokemon Names:", df["name"].nunique())
print("Pokedex Numbers:", df["number"].nunique())
print("Primary Types:", df["primary_type"].nunique())
print("Secondary Types:", df["secondary_type"].nunique())
print("Generations:", df["generation"].nunique())

Pokemon Names: 1072
Pokedex Numbers: 898
Primary Types: 20
Secondary Types: 19
Generations: 9


In [12]:
print("Total Stats:", df["total"].min(), "-", df["total"].max())
print("HP Stats:", df["hp"].min(), "-", df["hp"].max())
print("Attack Stats:", df["attack"].min(), "-", df["attack"].max())
print("Defense Stats:", df["defense"].min(), "-", df["defense"].max())
print("Special Attack Stats:", df["sp_attack"].min(), "-", df["sp_attack"].max())
print("Special Defense Stats:", df["sp_defense"].min(), "-", df["sp_defense"].max())
print("Speed Stats:", df["speed"].min(), "-", df["speed"].max())

Total Stats: 175 - 1125
HP Stats: 1 - 255
Attack Stats: 5 - 190
Defense Stats: 5 - 250
Special Attack Stats: 10 - 194
Special Defense Stats: 20 - 250
Speed Stats: 5 - 200


The number of unique values in `name`, `number` and `legendary` looks fine, and the stats appear to have sensible ranges. But we need to look into why there are more primary types than secondary types, and why there are nine different values for the eight existing generations of Pokemon.

In [13]:
print("Primary Type:",df["primary_type"].unique())
print("Secondary Type:",df["secondary_type"].unique())
print("Generations:",df["generation"].unique())

Primary Type: ['Grass' 'Fire' 'Water' 'Blastoise' 'Bug' 'Normal' 'Dark' 'Poison'
 'Electric' 'Ground' 'Ice' 'Fairy' 'Steel' 'Fighting' 'Psychic' 'Rock'
 'Ghost' 'Dragon' 'Flying' 'Graass']
Secondary Type: ['Poison' 'None' 'Flying' 'Dragon' 'Water' 'Normal' 'Psychic' 'Steel'
 'Ground' 'Fairy' 'Grass' 'Fighting' 'Electric' 'Ice' 'Dark' 'Ghost'
 'Rock' 'Fire' 'Bug']
Generations: [1 7 8 2 3 4 5 6 0]


Now we know that `primary_type` has two incorrect values, `Blastoise` and `Graass`, and `generation` has one incorrect value, `0`.
We will replace these with their correct values.
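Instead of spotting bad values by eye, they can be found by comparing each column against the values it is allowed to hold. A sketch under assumptions: `VALID_TYPES` is a constant I introduce here listing the 18 official types, the valid generation range is taken to be 1 to 8 as stated above, and the toy rows are taken from the notebook's outputs.

```python
import pandas as pd

# The 18 official Pokemon types (constant introduced for this example).
VALID_TYPES = {
    "Normal", "Fire", "Water", "Electric", "Grass", "Ice",
    "Fighting", "Poison", "Ground", "Flying", "Psychic", "Bug",
    "Rock", "Ghost", "Dragon", "Dark", "Steel", "Fairy",
}

# Toy frame: one clean row plus the three kinds of bad entries.
sample = pd.DataFrame({
    "name": ["Bulbasaur", "Gigantamax Blasoise", "Eldegoss", "Meltan"],
    "primary_type": ["Grass", "Blastoise", "Graass", "Steel"],
    "generation": [1, 1, 8, 0],
})

# Rows whose primary type is not one of the known types.
bad_type_rows = sample[~sample["primary_type"].isin(VALID_TYPES)]

# Rows whose generation falls outside the range 1 to 8 (inclusive).
bad_generation_rows = sample[~sample["generation"].between(1, 8)]

print(bad_type_rows)
print(bad_generation_rows)
```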

In [14]:
df.loc[df['generation'] == 0]

,number,name,primary_type,secondary_type,total,hp,attack,defense,sp_attack,sp_defense,speed,generation,legendary
950,808,Meltan,Steel,,300,46,65,65,55,35,34,0,True
951,809,Melmetal,Steel,,600,135,143,143,80,65,34,0,True
952,809,Gigantamax Melmetal,Steel,,600,135,143,143,80,65,34,0,True


In [38]:
df.generation = df.generation.replace(0, 7)

In [33]:
df.loc[df['primary_type'] == "Graass"]

,number,name,primary_type,secondary_type,total,hp,attack,defense,sp_attack,sp_defense,speed,generation,legendary
978,830,Eldegoss,Graass,,460,60,50,90,80,120,60,8,False


In [39]:
df.primary_type = df.primary_type.str.replace("Graass", "Grass")

In [34]:
df.loc[df['primary_type'] == "Blastoise"]

,number,name,primary_type,secondary_type,total,hp,attack,defense,sp_attack,sp_defense,speed,generation,legendary
15,9,Gigantamax Blasoise,Blastoise,Water,530,79,83,100,85,105,78,1,False


In [41]:
df.loc[15,["primary_type","secondary_type"]] = ["Water", "None"]
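Hard-coding the index label `15` works here, but the same correction can be written against the faulty value itself, which keeps working if the row order changes. A minimal sketch on a toy frame; the rows are illustrative and `mask` is a name introduced for this example:

```python
import pandas as pd

# Toy frame: the middle row carries the misplaced values seen above.
sample = pd.DataFrame({
    "name": ["Blastoise", "Gigantamax Blasoise", "Eldegoss"],
    "primary_type": ["Water", "Blastoise", "Grass"],
    "secondary_type": ["None", "Water", "None"],
})

# Select the row by its faulty value instead of by a fixed index label.
mask = sample["primary_type"] == "Blastoise"

# Write the corrected pair of types into the selected row.
sample.loc[mask, ["primary_type", "secondary_type"]] = ["Water", "None"]

print(sample)
```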

Check if the changes have been updated correctly.

In [42]:
print("Pokemon Names:", df["name"].nunique())
print("Pokedex Numbers:", df["number"].nunique())
print("Primary Types:", df["primary_type"].nunique())
print("Secondary Types:", df["secondary_type"].nunique())
print("Generations:", df["generation"].nunique())
print("Legendary:", df["legendary"].nunique())

Pokemon Names: 1072
Pokedex Numbers: 898
Primary Types: 18
Secondary Types: 19
Generations: 8
Legendary: 2
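These checks can also be written as assertions, so that a mistake in the cleaning steps fails loudly instead of relying on reading printed counts. A sketch only: the helper `check_cleaned` is hypothetical and the thresholds (at most 18 primary types, generations 1 to 8) are assumptions based on the findings above.

```python
import pandas as pd

def check_cleaned(frame: pd.DataFrame) -> None:
    """Raise AssertionError if the frame still shows the known problems."""
    assert frame["primary_type"].nunique() <= 18, "unexpected primary type"
    assert frame["generation"].between(1, 8).all(), "generation out of range"
    assert frame["secondary_type"].notnull().all(), "missing secondary type"

# Toy frame that has already been cleaned.
sample = pd.DataFrame({
    "name": ["Bulbasaur", "Eldegoss"],
    "primary_type": ["Grass", "Grass"],
    "secondary_type": ["Poison", "None"],
    "generation": [1, 8],
})

check_cleaned(sample)  # passes silently on clean data
print("all checks passed")
```

On the real data the call would be `check_cleaned(df)` once the replacements above have run.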


## Step 8: Adjust Index

Set the dataframe index to the `name` column.

In [15]:
df = df.set_index("name")  # set_index returns a new dataframe, so assign it back
df

name,number,primary_type,secondary_type,total,hp,attack,defense,sp_attack,sp_defense,speed,generation,legendary
Bulbasaur,1,Grass,Poison,318,45,49,49,65,65,45,1,False
Ivysaur,2,Grass,Poison,405,60,62,63,80,80,60,1,False
Venusaur,3,Grass,Poison,525,80,82,83,100,100,80,1,False
Mega Venusaur,3,Grass,Poison,625,80,100,123,122,120,80,1,False
Gigantamax Venusaur,3,Grass,Poison,525,80,82,83,100,100,80,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,896,Ice,,580,100,145,130,65,110,30,8,True
Spectrier,897,Ghost,,580,100,65,60,145,80,130,8,True
Calyrex,898,Psychic,Grass,500,100,80,80,80,80,80,8,True
Ice Rider Calyrex,898,Psychic,Ice,680,100,165,150,85,130,50,8,True
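One benefit of a name index is label-based lookup with `.loc`. A minimal sketch on a toy frame; note that `set_index` returns a new dataframe, so the result is stored in a separate variable here (`indexed`, a name chosen for this example):

```python
import pandas as pd

# Toy frame with three rows from the outputs above.
sample = pd.DataFrame({
    "number": [3, 3, 898],
    "name": ["Venusaur", "Mega Venusaur", "Calyrex"],
    "total": [525, 625, 500],
})

# set_index returns a new dataframe; `sample` keeps its original columns.
indexed = sample.set_index("name")

# Look up a single value by row label and column name.
mega_total = indexed.loc["Mega Venusaur", "total"]
print(mega_total)
```

Since every name in the dataset is unique, each label lookup of this kind returns exactly one row.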


# Data Analysis

## Step 8: Frequency

Now, let's list the unique values in `primary_type` and `secondary_type`, along with how many there are.

In [16]:
print("Primary Type:",df["primary_type"].unique(), "=", len(df["primary_type"].unique()))
print("Secondary Type:",df["secondary_type"].unique(), "=", len(df["secondary_type"].unique()))

Primary Type: ['Grass' 'Fire' 'Water' 'Bug' 'Normal' 'Dark' 'Poison' 'Electric' 'Ground'
 'Ice' 'Fairy' 'Steel' 'Fighting' 'Psychic' 'Rock' 'Ghost' 'Dragon'
 'Flying'] = 18
Secondary Type: ['Poison' 'None' 'Flying' 'Dragon' 'Water' 'Normal' 'Psychic' 'Steel'
 'Ground' 'Fairy' 'Grass' 'Fighting' 'Electric' 'Ice' 'Dark' 'Ghost'
 'Rock' 'Fire' 'Bug'] = 19
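Listing the unique values shows which types exist, but not how often each occurs. One way to get the actual frequencies is `value_counts`. A minimal sketch on a toy frame of six rows from the outputs above; `type_counts` and `type_share` are names introduced for this example:

```python
import pandas as pd

# Toy frame: three Grass entries and one each of three other types.
sample = pd.DataFrame({
    "name": ["Bulbasaur", "Ivysaur", "Venusaur",
             "Glastrier", "Spectrier", "Calyrex"],
    "primary_type": ["Grass", "Grass", "Grass", "Ice", "Ghost", "Psychic"],
})

# Absolute frequency of each primary type, most common first.
type_counts = sample["primary_type"].value_counts()

# Relative frequency (share of all rows) of each primary type.
type_share = sample["primary_type"].value_counts(normalize=True)

print(type_counts)
print(type_share)
```

Applied to `df["primary_type"]`, the same two calls would give the distribution over the full dataset.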


## Step 9: Distribution over

# Data Visualisation

# Conclusion