# Machine Learning 2023-2024 - UMONS 
# Exploratory Data Analysis of the Pokemon dataset


The goal of this lab is to get more familiar with the pandas library in Python, which lets you manipulate dataframes, compute statistics of their variables, and visualize them. Data exploration is an important step before using any of the Machine Learning models that you'll discover throughout the course. It gives you a deeper understanding of the content of the dataset, which eases any later manipulation.

In this lab, we'll work with the 'Pokemon' dataset, which contains the attributes of several Pokemon across various generations:
- `#`: ID for each pokemon
- `Name`: name of each pokemon
- `Type 1`: each pokemon has a type, this determines weakness/resistance to attacks
- `Type 2`: second type for pokemons that are dual type
- `Total`: sum of all stats that come after this, a general guide to how strong a pokemon is
- `HP`: hit points, or health, defines how much damage a pokemon can withstand before fainting
- `Attack`: the base modifier for normal attacks (e.g. Scratch, Punch)
- `Defense`: the base damage resistance against normal attacks
- `Sp. Atk`: special attack, the base modifier for special attacks (e.g. Fire Blast, Bubble Beam)
- `Sp. Def`: special defense, the base damage resistance against special attacks
- `Speed`: determines which pokemon attacks first each round
- `Generation`: the generation to which the pokemon belongs
- `Legendary`: whether the pokemon is legendary

**1. Import all necessary libraries**

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

plt.style.use('fivethirtyeight') # Custom plot style

**2. Read the CSV file 'Pokemon.csv' and load it into a dataframe. Print the first 10 rows.**

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/bsouhaib/ML1-24/main/labs/lab1/data/Pokemon.csv')
df.head(n=10)

**3. Print technical information about the Pokemon dataset using `.info()`.**

In [None]:
df.info()

**4. Print the shape of the dataframe.** 

In [None]:
df.shape

**5. Drop the '#' column and set the dataframe index to the 'Name' column.**

In [None]:
df = df.drop(columns=['#'])  # drop the '#' column.
df = df.set_index('Name')  # Set 'Name' as index.
df.head()

**6. Check if there are any missing values in the dataframe, and count them per column. For non-numerical variables, replace the missing values with 'Unknown'. Check that the dataframe no longer contains missing values.**

In [None]:
# Count missing values
df.isna().sum()

In [None]:
# Replace missing values for the categorical variable 'Type 2' with 'Unknown'.
df['Type 2'] = df['Type 2'].fillna('Unknown')

# Check that the dataframe does not contain any more missing values.
assert df.notna().all().all()  # The first all() checks every row within each column; the second all() combines the columns.
df.isna().sum()
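
The task asks for all non-numerical variables to be filled, not only 'Type 2'. As a minimal sketch, the same steps can be applied to every non-numerical column at once. The `toy` dataframe and its values below are illustrative stand-ins, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Pokemon dataframe (illustrative values only).
toy = pd.DataFrame(
    {
        'Type 1': ['Grass', 'Fire', 'Water'],
        'Type 2': ['Poison', np.nan, np.nan],
        'HP': [45, 39, 44],
    },
    index=['Bulbasaur', 'Charmander', 'Squirtle'],
)

# Number of missing values per column, before filling.
missing_before = toy.isna().sum()

# Select every non-numerical column and fill its missing values.
text_cols = toy.select_dtypes(exclude='number').columns
toy[text_cols] = toy[text_cols].fillna('Unknown')

# Number of missing values per column, after filling.
missing_after = toy.isna().sum()
print(missing_before, missing_after, sep='\n\n')
```

Numerical columns are left untouched here, so any missing numerical value would still show up in `missing_after`.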

**7. Change the data types of the variables 'Type 1', 'Type 2' and 'Generation' to categorical data.**

In [None]:
df = df.astype({'Type 1': 'category', 'Type 2': 'category', 'Generation': 'category'})
df.dtypes

**8. Get general statistics (mean, standard deviation, ...) for the numerical variables of the dataset.** 

In [None]:
with pd.option_context('display.precision', 1): # We only display one decimal place
    display(df.describe())
 
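# Manual check: population standard deviation of 'HP' (divides by n).
# It differs slightly from the 'std' row of describe(), which divides by n - 1.
np.sqrt((df['HP']**2).mean() - df['HP'].mean()**2)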
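
The manual formula and the `std` reported by `describe()` use different normalizations. The sketch below illustrates the relation between the two on a small series; the `hp` values are made up for the example.

```python
import numpy as np
import pandas as pd

hp = pd.Series([45, 60, 80, 39, 58], name='HP')  # illustrative values
n = len(hp)

# Population standard deviation: sqrt(E[X^2] - E[X]^2), i.e. ddof=0.
pop_std = np.sqrt((hp**2).mean() - hp.mean()**2)

# pandas defaults to the sample standard deviation (ddof=1).
sample_std = hp.std()

# The two differ by a factor sqrt(n / (n - 1)).
print(pop_std, sample_std, pop_std * np.sqrt(n / (n - 1)))
```

For a dataset with several hundred rows the factor is close to 1, which is why the two values are nearly identical in practice.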

**9. For the categorical variables, count the number of values per category, as well as the number of co-occurrences, i.e. the number of times each combination of categories occurs simultaneously.**

In [None]:
df['Type 1'].value_counts()

In [None]:
df['Type 2'].value_counts()

In [None]:
df['Generation'].value_counts()

In [None]:
df[['Type 1', 'Type 2', 'Generation']].value_counts()
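
For two variables, a contingency table can be easier to read than the long output of `value_counts`. This is an optional alternative using `pd.crosstab`, shown on a toy dataframe with made-up rows.

```python
import pandas as pd

# Toy stand-in with two categorical variables (illustrative values only).
toy = pd.DataFrame({
    'Type 1': ['Grass', 'Fire', 'Fire', 'Water', 'Grass'],
    'Type 2': ['Poison', 'Unknown', 'Flying', 'Unknown', 'Poison'],
})

# Rows are 'Type 1' categories, columns are 'Type 2' categories,
# and each cell counts how often the pair occurs together.
co_occurrences = pd.crosstab(toy['Type 1'], toy['Type 2'])
print(co_occurrences)
```

Unlike `value_counts`, the table also shows combinations that never occur, with a count of 0.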

**10. Get all the attributes of 'Bulbasaur'**

In [None]:
df.loc['Bulbasaur']

**11. Sort the dataframe by increasing values of 'Attack' and decreasing values of 'Defense' (i.e. if two Pokemons have the same value for 'Attack', the one with higher 'Defense' should appear first).** 

In [None]:
df = df.sort_values(by=['Attack', 'Defense'], ascending=[True, False])
df.head()
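
To see how the tie-breaking works, here is a small sketch on a toy dataframe. The index labels `A` to `D` and the values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame(
    {'Attack': [50, 30, 50, 30], 'Defense': [40, 20, 70, 60]},
    index=['A', 'B', 'C', 'D'],  # hypothetical names
)

# 'Attack' ascending first; ties are broken by 'Defense' descending.
sorted_toy = toy.sort_values(by=['Attack', 'Defense'], ascending=[True, False])
print(sorted_toy)
```

Rows `D` and `B` share the lowest 'Attack', so `D` comes first because of its higher 'Defense'.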

**12. Create a dataframe containing all Pokemons of type 1 'Psychic' having more than 100 in 'Attack', less than 40 in 'Defense' and more than 45 in Speed.**

In [None]:
sub_df = df[
    (df['Type 1'] == 'Psychic') 
    & (df['Attack'] > 100) 
    & (df['Defense'] < 40) 
    & (df['Speed'] > 45)
]
sub_df.head()
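
The same kind of filter can also be written with `DataFrame.query`, which some find more readable. This is only an alternative sketch on a toy dataframe with hypothetical names and values. Column names containing a space must be wrapped in backticks inside the query string.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        'Type 1': ['Psychic', 'Psychic', 'Fire', 'Psychic'],
        'Attack': [150, 95, 120, 110],
        'Defense': [20, 30, 35, 50],
        'Speed': [150, 100, 90, 90],
    },
    index=['P1', 'P2', 'P3', 'P4'],  # hypothetical names
)

# All four conditions must hold, as with the '&' operator on boolean masks.
sub = toy.query("`Type 1` == 'Psychic' and Attack > 100 and Defense < 40 and Speed > 45")
print(sub)
```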

**13. Create two new columns, 'AttackAll' and 'DefenseAll', which correspond to the sum of 'Attack' and 'Sp. Atk' and the sum of 'Defense' and 'Sp. Def', respectively.**

In [None]:
df['AttackAll'] = df['Attack'] + df['Sp. Atk']
df['DefenseAll'] = df['Defense'] + df['Sp. Def']
df.head()

**14. Create a new column 'AtkOverDef' corresponding to the ratio of 'AttackAll' over 'DefenseAll' for each Pokemon.** 

In [None]:
df['AtkOverDef'] = df['AttackAll'] / df['DefenseAll']
# A non-zero value divided by 0 is mapped to 'inf' (0 / 0 gives NaN); no error is raised.
df.head()
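
The behaviour of the division when the denominator is 0 can be checked on a small example. The values below are made up, and replacing infinite ratios with missing values is just one possible way of handling them.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'AttackAll': [100, 80, 0], 'DefenseAll': [50, 0, 0]})

# Element-wise division: 100/50 is finite, 80/0 is inf, 0/0 is NaN.
ratio = toy['AttackAll'] / toy['DefenseAll']

# Optionally turn infinite ratios into missing values.
clean_ratio = ratio.replace([np.inf, -np.inf], np.nan)
print(ratio, clean_ratio, sep='\n\n')
```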

**15. Change the column names to upper case, and remove the '.' characters and the blanks from the column names.**

In [None]:
# Convert to upper case, then remove the dots and the blanks.
df.columns = df.columns.str.upper().str.replace('.', '', regex=False).str.replace(' ', '')  
df.head()
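
As a quick check of what the chained string methods produce, the sketch below applies them to a few of the column names used above.

```python
import pandas as pd

cols = pd.Index(['Type 1', 'Sp. Atk', 'Sp. Def', 'AttackAll'])

# regex=False treats '.' as a literal dot rather than "any character".
cleaned = (
    cols.str.upper()
    .str.replace('.', '', regex=False)
    .str.replace(' ', '', regex=False)
)
print(list(cleaned))
```

This is why the remaining questions refer to names such as 'TYPE1' and 'TYPE2'.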

**16. Plot a histogram of the different 'TYPE1' categories. The figure must be 16 inches wide and 4 inches high.
Use the matplotlib.pyplot library and the countplot function from the seaborn library. The counts should appear in increasing order.**

In [None]:
plt.figure(figsize=(16, 4))
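# value_counts sorts in decreasing order by default; ascending=True gives increasing counts.
sns.countplot(x='TYPE1', data=df, order=df.TYPE1.value_counts(ascending=True).index);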

**17. Do the same as above, but for the 'TYPE2' categories.** 

In [None]:
plt.figure(figsize=(16, 4))
# ascending=True orders the bars by increasing counts.
sns.countplot(x='TYPE2', data=df, order=df.TYPE2.value_counts(ascending=True).index);

**18. Plot the densities of the variables 'ATTACK', 'DEFENSE' and 'SPEED' on three separate plots. Use the displot function of the seaborn library.**

In [None]:
for x in ['ATTACK', 'DEFENSE', 'SPEED']:
    sns.displot(x=x, kind='kde', data=df, height=3, aspect=2)
    # Alternative: histogram with a density curve overlaid.
    # sns.displot(x=x, kde=True, data=df, height=3, aspect=2)

**19. Plot the density of the variable 'ATTACK' for legendary and non-legendary Pokemons. The two densities should appear on different facets of the same plot.**

In [None]:
sns.displot(data=df, x='ATTACK', kind='kde', col='LEGENDARY');

**20. Generate a scatter plot of the variable 'DEFENSE' on the y-axis, and the variable 'ATTACK' on the x-axis. Legendary and non-legendary Pokemons should be indicated using different colors.**

In [None]:
sns.scatterplot(data=df, x='ATTACK', y='DEFENSE', hue='LEGENDARY');

**21. Filter the dataframe to contain only Pokemons of generations 1 and 4. Using the filtered dataframe, generate a scatter plot of the variable 'TOTAL' on the y-axis and the variable 'ATTACK' on the x-axis, distinguishing the two filtered generations. Note that, after filtering the dataframe, you can use the method `Series.cat.remove_unused_categories` so that the generations that were filtered out no longer appear in the plot. The figure should be 8 inches high and 8 inches wide.**

In [None]:
df_filtered = df[(df.GENERATION == 1) | (df.GENERATION == 4)].copy()
df_filtered.GENERATION = df_filtered.GENERATION.cat.remove_unused_categories()

plt.figure(figsize=(8, 8))
sns.scatterplot(x='ATTACK', y='TOTAL', data=df_filtered, hue='GENERATION');
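
Filtering a categorical column keeps all of its original categories, even those that no longer occur. The sketch below illustrates this on a toy series and shows the effect of `remove_unused_categories`; the values are made up.

```python
import pandas as pd

gen = pd.Series([1, 2, 3, 4, 1, 4], dtype='category', name='GENERATION')

# Keep only generations 1 and 4; the dtype still remembers 2 and 3.
filtered = gen[gen.isin([1, 4])]
before = list(filtered.cat.categories)

# Drop the categories that have no remaining observations.
after = list(filtered.cat.remove_unused_categories().cat.categories)
print(before, after)
```

Without this step, the filtered-out generations may still be listed in the plot legend.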

**22. Create a histogram of the variable 'GENERATION'. Separate legendary and non-legendary Pokemons. The counts should appear on the same figure in decreasing order.**

In [None]:
sns.countplot(x='GENERATION', data=df, order=df.GENERATION.value_counts().index, hue='LEGENDARY');

**23. Generate a boxplot of the variable 'TOTAL' with the boxplot function from the seaborn library. How do you interpret it?**

In [None]:
plt.figure(figsize=(1, 4))
sns.boxplot(y='TOTAL', data=df);

The horizontal line within the box corresponds to the median of the variable 'TOTAL' (around 450). The extremities of the box correspond to the first quartile $Q_1=330$ (25th percentile) and to the third quartile $Q_3=515$ (75th percentile), i.e. 25% of all observations of the variable 'TOTAL' are below 330, while 75% of all observations are below 515.

The 'T' lines correspond to the whiskers. The extremity of the upper whisker corresponds to the highest observation that is lower than $Q_3 + 1.5\,\text{IQR}= 792.5$, while the extremity of the lower whisker corresponds to the lowest observation that is higher than $Q_1 - 1.5\,\text{IQR}=52.5$, where $\text{IQR} = Q_3 - Q_1=185$ is the inter-quartile range. Here the highest observation for the variable 'TOTAL' is 780, while the lowest observation is 180, so both whiskers end at the extreme observations. Any value above $Q_3 + 1.5\,\text{IQR}$ or below $Q_1 - 1.5\,\text{IQR}$ is called an *outlier*, and appears as an individual marker beyond the whiskers. In our case, there are no outliers for the variable 'TOTAL'.

In [None]:
df.TOTAL.describe()
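
The whisker bounds quoted above follow from simple arithmetic on the quartiles. The sketch below reproduces it with the values stated in the text; `whisker_bounds` is a hypothetical helper introduced only for this illustration.

```python
def whisker_bounds(q1, q3, factor=1.5):
    """Return the (lower, upper) limits beyond which a value is an outlier."""
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


# Quartiles and extreme observations of 'TOTAL' as reported in the text.
q1, q3 = 330.0, 515.0
observed_min, observed_max = 180, 780

iqr = q3 - q1
lower, upper = whisker_bounds(q1, q3)

# An outlier exists only if an observation falls outside [lower, upper].
has_outliers = observed_min < lower or observed_max > upper
print(iqr, lower, upper, has_outliers)
```

On the real dataframe, the quartiles could instead be read from `df.TOTAL.quantile([0.25, 0.75])`.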

**24. Generate one boxplot of the variable 'TOTAL' per category of the variable 'GENERATION'. Separate legendary and non-legendary Pokemons. All boxplots must appear on the same plot.**

In [None]:
plt.figure(figsize=(6, 6))
sns.boxplot(x='LEGENDARY', y='TOTAL', data=df, hue='GENERATION');