# Data Visualization
Pokemon Edition

## Welcome

With our fundamentals covered, we can now turn to analyzing some data. You'll notice that the `data` folder contains [three files](https://github.com/hermish/hkn-workshops/tree/master/data-visualization/data).

1. The first one contains the pokemon characteristics (the first column being the id of the pokemon).
2. The second one contains information about previous combats. The first two columns contain the ids of the combatants and the third one the id of the winner. Important: the pokemon in the first column attacks first.
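As a rough sketch of how the two tables relate, the example below links winner ids back to names. The miniature tables and the column names (`id`, `Name`, `First`, `Second`, `Winner`) are made up for illustration and are not the real file headers.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (made-up names and ids).
pokemon = pd.DataFrame({
    'id': [1, 2, 3],
    'Name': ['Alphamon', 'Betamon', 'Gammamon'],
})
combats = pd.DataFrame({
    'First': [1, 2, 3, 1],    # id of the pokemon that attacks first
    'Second': [2, 3, 1, 3],   # id of its opponent
    'Winner': [1, 3, 3, 3],   # id of the winner
})

# Build an id -> Name lookup and translate each winner id into a name.
id_to_name = pokemon.set_index('id')['Name']
combats['WinnerName'] = combats['Winner'].map(id_to_name)

# Fraction of combats won by the pokemon that attacked first.
first_mover_win_rate = (combats['Winner'] == combats['First']).mean()

print(combats)
print(first_mover_win_rate)
```

The same idea should carry over to the real files once the toy column names are replaced with the actual ones.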

Open [these files](https://github.com/hermish/hkn-workshops/tree/master/data-visualization/data) to get a feel for what the raw data looks like: our data is in `csv` format, which stands for comma-separated values. Basically, our data comes as a table, with one row per line and the entries in each row separated by commas.
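To make the format concrete, here is a minimal sketch that parses a few lines of CSV text directly. The rows and names are made up for illustration; only the layout (a header line, then comma-separated rows) mirrors the real files.

```python
import io
import pandas as pd

# A tiny, made-up CSV: the first line holds the column names,
# and every following line is one row of the table.
csv_text = """id,Name,Attack,Defense
1,Alphamon,49,49
2,Betamon,62,63
3,Gammamon,82,83
"""

# pd.read_csv accepts a file path or any file-like object,
# so StringIO lets us parse the text without creating a file.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy)
print(toy.dtypes)  # numeric columns are detected automatically
```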

## Reading in Data

To do anything with this data, we first need to read it into Python. Luckily, this is such a common task that there are already many tools for reading, working with and visualizing data. Below, we import these libraries and read in the data.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt # plotting
import seaborn as sns # aesthetics
%config InlineBackend.figure_format = 'retina'

In [None]:
# Read pokemon statistics and print first 10 rows out of 800
data = pd.read_csv('data/pokemon.csv')
print(len(data))
data.head(10)

In [None]:
# Print summary for columns! (info() prints by itself and returns None)
data.info()
# Display last 10 rows
data.tail(10)

## Matplotlib

Matplotlib is a Python library that helps us plot data: the most basic plots are line, scatter and histogram plots.

- Line plots are usually used when the x-axis is time
- Scatterplots are better when we want to check if there is correlation between two variables
- Histograms visualize the distribution of numerical data


Matplotlib allows us to customize the colors, labels, thickness of lines, title, opacity, grid, figure size... Basically, we can make any graph conceivable, though some are easier than others.
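Before turning to the pokemon table, the three plot types can be sketched side by side on synthetic data. Everything below (the random numbers and the variable names) is invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data, generated only to illustrate the three plot types.
rng = np.random.default_rng(0)
days = np.arange(30)
temperature = 20 + np.cumsum(rng.normal(0, 1, size=30))  # drifts over time
x = rng.normal(50, 10, size=200)
y = 0.8 * x + rng.normal(0, 5, size=200)  # y depends on x, plus noise

# One row of three plots, each drawn on its own axes.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].plot(days, temperature, color='blue', linewidth=1)
axes[0].set_title('Line: a value over time')
axes[0].set_xlabel('Day')

axes[1].scatter(x, y, color='red', alpha=0.5)
axes[1].set_title('Scatter: two variables against each other')

axes[2].hist(x, bins=20, color='green', alpha=0.7)
axes[2].set_title('Histogram: distribution of one variable')

fig.tight_layout()
plt.show()
```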

In [None]:
# LINE PLOTS
data['Speed'].plot(
    kind='line', # Type of plot
    color='blue', # Line color
    label='Speed', # Legend name
    linewidth=1,
    alpha = 0.5, # Transparency
    grid = True
)
data['Defense'].plot(
    color='green',
    label='Defense',
    linewidth=1,
    alpha = 0.5,
    grid = True
)

plt.legend(loc='upper right') # legend locations
plt.xlabel('Pokemon Number') # x-axis label
plt.ylabel('Stat') # y-axis label
plt.title('Speed and Defense across Pokemon') # Title
plt.show()

In [None]:
# SCATTER PLOT
data.plot(
    kind='scatter', # Type of plot
    x='Attack', # x-axis column
    y='Defense', # y-axis column
    alpha = 0.5, # Transparency
    color = 'red'
)
plt.xlabel('Attack') # x-axis label
plt.ylabel('Defense') # y-axis label
plt.title('Attack-Defense Relationship') # Title
plt.show()

In [None]:
# HISTOGRAM
data['Speed'].plot( # Choosing what to plot
    kind = 'hist', # Type of plot
    bins = 50,
    figsize = (15, 15) # Size of graph
)
plt.show()

## Working with Data

The table that holds our data is called a DataFrame. Each column can be thought of as a list or series of data, which can be accessed much like the elements of a list. Below we print out the defense of the first 10 pokemon and the attack of the first 5.

In [None]:
print(data['Defense'].head(10))
print(data['Attack'].head(5))
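For a closer look at how column access works, here is a small sketch on a made-up table (the names and numbers are invented and do not come from the dataset).

```python
import pandas as pd

# Made-up table for illustration only.
toy = pd.DataFrame({
    'Name': ['Alphamon', 'Betamon', 'Gammamon'],
    'Attack': [49, 62, 82],
    'Defense': [49, 63, 83],
})

defense = toy['Defense']            # one column name -> a Series
print(type(defense).__name__)       # Series
print(defense.iloc[0])              # first entry, by position: 49
print(defense.iloc[:2].tolist())    # slicing works like a list: [49, 63]

subset = toy[['Name', 'Attack']]    # a list of names -> a smaller DataFrame
print(subset)
```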

We can then easily filter the data to show all the pokemon with a defense greater than 200. We do this by checking whether the defense in each row is greater than 200, which gives one `True` or `False` per row, and then asking only for the rows where this is true. It turns out that there are only 3 pokemon like this! Change the condition to see what else you can discover.

In [None]:
strong_defenders = data['Defense'] > 200
data[strong_defenders]
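Conditions can also be combined. The sketch below uses a made-up table to show what the intermediate `True`/`False` column looks like and how `&` (and) and `~` (not) work on it. Each condition needs its own parentheses.

```python
import pandas as pd

# Made-up table for illustration only.
toy = pd.DataFrame({
    'Name': ['Alphamon', 'Betamon', 'Gammamon', 'Deltamon'],
    'Attack': [40, 120, 95, 130],
    'Defense': [210, 60, 205, 220],
})

# The comparison yields one boolean per row.
mask = toy['Defense'] > 200
print(mask.tolist())   # [True, False, True, True]
print(mask.sum())      # True counts as 1, so this counts matching rows: 3

# Both conditions must hold (note the parentheses around each one).
strong_both = toy[(toy['Defense'] > 200) & (toy['Attack'] > 90)]

# ~ flips the mask: rows that do NOT have more than 200 defense.
weak_defense = toy[~mask]

print(strong_both)
print(weak_defense)
```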

## Data Analysis

Now we can analyze the pokemon data. The easiest way to get an idea of how the data looks is to plot some of it and compute a few summary facts. Luckily for us, it is really easy to get the following from our table:

- count: number of entries
- mean: the average value, over all pokemon
- std: standard deviation, measures how spread out the data is
- min: minimum entry
- 25%: first quartile (25% of entries are below)
- 50%: median or second quartile (50% of entries are below)
- 75%: third quartile (75% of entries are below)
- max: maximum entry

In [None]:
data.describe()
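To see where these numbers come from, the sketch below recomputes each entry of `describe()` by hand on a small made-up series. Note that the standard deviation reported by pandas is the sample version (dividing by n - 1).

```python
import pandas as pd

# Five made-up values, chosen so the results are easy to check by eye.
attack = pd.Series([10, 20, 30, 40, 50], name='Attack')
summary = attack.describe()
print(summary)

# Each entry matches the corresponding direct computation.
by_hand = {
    'count': attack.count(),
    'mean': attack.mean(),
    'std': attack.std(),            # sample standard deviation (ddof=1)
    'min': attack.min(),
    '25%': attack.quantile(0.25),
    '50%': attack.median(),
    '75%': attack.quantile(0.75),
    'max': attack.max(),
}
for key, value in by_hand.items():
    print(key, summary[key] == value)
```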

From this overview, we can easily see the minimum and maximum for each of these statistics. To visualize this, we often use a boxplot, which lets us picture the center, spread and outliers at a glance. The box spans the 25th to the 75th percentile, and the line inside it marks the median (not the mean). By default, the whiskers above and below the box extend to the most extreme data points that lie within 1.5 times the box height of the box edges; any points beyond the whiskers are drawn individually as outliers.

In [None]:
data.boxplot(
    column='Attack'
)
plt.show()
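The pieces of a boxplot can also be computed directly. The sketch below does so for a made-up array, assuming the default rule in which whiskers reach at most 1.5 times the interquartile range beyond the box.

```python
import numpy as np

# Made-up values with one unusually large entry.
values = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 200])

# The box: first quartile, median and third quartile.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range = height of the box

# Whiskers reach the most extreme points inside these limits.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the limits is drawn as an individual outlier point.
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(q1, median, q3)             # 32.5 55.0 77.5
print(whisker_low, whisker_high)  # 10 90
print(outliers)                   # [200]
print(values.mean())              # 65.0, pulled above the median by the outlier
```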

Notice that some of our pokemon are legendary while others are not. We might guess that the attack of legendary pokemon is greater than that of non-legendary pokemon. We can test this hypothesis by creating a boxplot of the attack for each group, which is easy to do using the `by` keyword.

In [None]:
data.boxplot(
    column='Attack', # Column plotting
    by='Legendary' # Group
)
plt.show()
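The same comparison can be made numerically with `groupby`. This is a sketch on a made-up table, reusing the `Attack` and `Legendary` column names from above.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    'Attack': [50, 60, 70, 110, 130],
    'Legendary': [False, False, False, True, True],
})

# Split the rows by the Legendary flag, then summarize Attack in each group.
by_group = toy.groupby('Legendary')['Attack'].agg(['count', 'mean', 'median'])
print(by_group)
```

Swapping `toy` for the real `data` table should give the numbers behind the grouped boxplot.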

## More

Finally, there are a lot more interesting figures we can make using Python. For example, when we plotted attack against defense, the two seemed to be linked or *correlated*.

In general, we can test how strongly two variables are linearly correlated using the "Pearson product-moment correlation coefficient." This number is always between -1 and +1. Values near $\pm1$ mean the correlation is strong (positive or negative), while values closer to zero indicate that the variables are less strongly linked.
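For reference, the coefficient for paired values $x_i$ and $y_i$ with means $\bar{x}$ and $\bar{y}$ is $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$. The sketch below computes it by hand on made-up numbers and compares the result with NumPy's `np.corrcoef`. The helper name `pearson` is our own.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Made-up values for illustration only.
attack = [10, 20, 30, 40, 50]
rising = [12, 25, 31, 38, 55]    # tends to grow with attack
falling = [50, 40, 30, 20, 10]   # shrinks exactly as attack grows

r_rising = pearson(attack, rising)
r_falling = pearson(attack, falling)

print(r_rising)                              # close to +1
print(r_falling)                             # -1.0
print(np.corrcoef(attack, rising)[0, 1])     # NumPy's value for comparison
```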

We can visualize how all the variables are correlated by plotting these values in a heat map. The more strongly correlated values are more strongly colored.

In [None]:
f, ax = plt.subplots(figsize=(6, 6))
sns.heatmap(
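    # Keep numeric and boolean columns only: recent pandas versions raise
    # an error if corr() is called on a table that contains text columns
    data.select_dtypes(include=['number', 'bool']).corr(),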
    annot=True,
    linewidths=.5,
    fmt= '.1f',
    ax=ax
)
plt.show()

## Make your own Graphs!