# Visualizing Data with Pandas, Matplotlib, and Seaborn

[Matplotlib](http://matplotlib.org) is a basic plotting library for Python inspired by Matlab.
[Seaborn](http://stanford.edu/~mwaskom/software/seaborn) is built on top of matplotlib with integrated analysis and specialized plots.

See also the full gallery of [Seaborn](http://stanford.edu/~mwaskom/software/seaborn/examples/index.html) or [Matplotlib](http://matplotlib.org/gallery.html).

We're going to be using data on football player attributes from FIFA 2017. Why? Well, because it's fun, it's highly multi-dimensional, and we can slice the data in countless ways. The [dataset](https://www.kaggle.com/artimous/complete-fifa-2017-player-dataset-global) has some other parts too, in case you'd like to keep poking around this data.


In [None]:
#disable some annoying warning
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

#plots the figures in place instead of a new window
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
# pandas.tools.plotting was removed from pandas; the same functions live in pandas.plotting
from pandas.plotting import radviz, parallel_coordinates, andrews_curves
import numpy as np

## Load Data

Here we load the FIFA 2017 player attribute data introduced above.


In [None]:
fifa = pd.read_csv('data/fifa_player_data.csv.gz', compression='gzip')

In [None]:
sorted(list(fifa.columns))

In [None]:
fifa.head(5)

In [None]:
# We can also sort the data in any number of ways
fifa.sort_values('Acceleration', ascending=False).head(5)

In [None]:
# We can also make the sort work on numerous values
fifa.sort_values(['Acceleration', 'Curve'], ascending=False).head(5)

In [None]:
# We can also query our data
fifa.query('Club_Position == "ST"').head()
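
If you want to filter on values held in Python variables, `query` can reference them with an `@` prefix. A minimal sketch on a made-up frame (the names and numbers below are hypothetical, not from the FIFA data), shown next to the equivalent boolean mask:

```python
import pandas as pd

# Hypothetical toy data, not taken from the FIFA dataset
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Club_Position': ['ST', 'GK', 'ST'],
    'Acceleration': [88, 45, 72],
})

position = 'ST'
min_acceleration = 80

# query() can reference Python variables with the @ prefix
via_query = toy.query('Club_Position == @position and Acceleration >= @min_acceleration')

# The same filter written as a boolean mask
via_mask = toy[(toy['Club_Position'] == position) & (toy['Acceleration'] >= min_acceleration)]

print(via_query)
```

Both forms select the same rows; which one reads better is mostly a matter of taste.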

## Grouping

We can even group items much like we'd do in SQL, perform calculations on some aggregated property, then plot the values to get an idea of larger-scale trends.

In [None]:
fifa.groupby('Nationality').agg({'Ball_Control':'max'}).sort_values('Ball_Control', ascending=False).plot(kind='bar', figsize=(30, 5))
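
The same group, aggregate, sort pattern is easier to inspect on a tiny frame first. The sketch below uses hypothetical values (not from the FIFA data) and named aggregation, which assumes pandas 0.25 or later:

```python
import pandas as pd

# Hypothetical toy data, not taken from the FIFA dataset
toy = pd.DataFrame({
    'Nationality': ['Spain', 'Spain', 'Italy', 'Italy', 'Brazil'],
    'Ball_Control': [80, 90, 70, 75, 95],
})

# One row per group, several aggregates at once, sorted by the maximum
summary = (
    toy.groupby('Nationality')
       .agg(max_control=('Ball_Control', 'max'),
            mean_control=('Ball_Control', 'mean'),
            players=('Ball_Control', 'size'))
       .sort_values('max_control', ascending=False)
)
print(summary)
```

Keeping the group size next to the aggregate is handy: a maximum computed over a single player says much less than one computed over hundreds.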

<p style="color:#f1c40f; font-size: 2em">Exercise 1</p>

<p style="background: #f1c40f; padding: 10px; color: #2c3e50">Which <strong>Club_Position</strong> has the players with the highest average <strong>Speed</strong>?</p>

In [None]:
# Answer 1

<p style="background: #f1c40f; padding: 10px; color: #2c3e50">Which <strong>Club_Position</strong> has the players with the highest <strong>Speed</strong> variance?</p>

In [None]:
# Answer 2

<p style="background: #f1c40f; padding: 10px; color: #2c3e50">Which <strong>Club</strong> has the players with the highest average <strong>Speed</strong>?</p>

In [None]:
# Answer 3

### Histograms

In [None]:
# We can plot the distributions for all of our features, just to see how everything looks
fifa.hist(bins=20, figsize=(20,20));

In [None]:
# We can plot the distribution for just one of our features, and also change the number of bins.
fifa.Penalties.plot(kind='hist', bins=20)

In [None]:
plt.figure(figsize=(20,6))
# Pass the column as a keyword argument; newer seaborn versions reject positional data arguments
sns.countplot(x='Club_Position', data=fifa, palette='viridis')

At first glance, this data is a little strange. Unfortunately, in FIFA, many players in a squad will not be in the first team. Therefore we have many players classified as 'Sub' or 'Res'. We also have some 'CF' (Centre Forwards), but this is the same as an 'ST' (Striker).

**Many of our analyses may not care about the position, but in the ones that do, we should ideally clean our data.**
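
One possible way to do that cleaning is sketched below on a made-up frame. Treating 'CF' as 'ST' and dropping 'Sub' and 'Res' follows the description above; whether that is the right call depends on the analysis, so take it as an assumption rather than a rule.

```python
import pandas as pd

# Hypothetical toy data, not taken from the FIFA dataset
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E'],
    'Club_Position': ['ST', 'CF', 'Sub', 'Res', 'GK'],
})

# Treat centre forwards as strikers
cleaned = toy.assign(Club_Position=toy['Club_Position'].replace({'CF': 'ST'}))

# Drop squad roles that say nothing about where a player plays
cleaned = cleaned[~cleaned['Club_Position'].isin(['Sub', 'Res'])]

print(cleaned['Club_Position'].value_counts())
```

Note that this leaves the original frame untouched, so analyses that do not care about position can keep using every player.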

In [None]:
fig = plt.figure(figsize=(30, 7))
g = sns.countplot(x='Nationality', data=fifa, palette=sns.color_palette(['#2ecc71']), order=fifa.groupby('Nationality').size().sort_values(ascending=False).index)
# This rotates the labels, otherwise, we can't see anything!
plt.setp(g.get_xticklabels(), rotation=90)
g.figure.get_axes()[0].set_yscale('log')

In [None]:
# Seaborn offers not just a histogram but also a kernel density estimate
sns.distplot(fifa.Rating,bins=50, color='#2ecc71')

<p style="color:#f1c40f; font-size: 2em">Exercise 2</p>

<p style="background: #f1c40f; padding: 10px; color: #2c3e50">EASY: Plot the distributions for <strong>Acceleration</strong>.</p>

In [None]:
# Answer 4

<p style="background: #f1c40f; padding: 10px; color: #2c3e50">ADVANCED: Plot two histograms, one on top of the other for <strong>Acceleration</strong> of Italian Players vs Brazilian players.</p>

In [None]:
# Answer 5

### Box Plots

In [None]:
# Group by nationality
fifa.boxplot('Rating', by='Nationality', rot=90, figsize=(70, 10))

In [None]:
# Or, we can use seaborn, which has a nicer styling by default.
fig = plt.figure(figsize=(30, 7))
g = sns.boxplot(x="Nationality", y="Acceleration", data=fifa, palette=sns.color_palette(['#2ecc71']))
plt.setp(g.get_xticklabels(), rotation=90);

### Violin Plots

In [None]:
# Violin plots do a better job of showing us the number of support points across the distribution
fig = plt.figure(figsize=(30, 7))
g = sns.violinplot(x="Club_Position", y="Acceleration", data=fifa, palette=sns.color_palette(['#2ecc71']))

### Swarm Plots

These plot the actual points, with some jitter to help us see roughly how many points lie around particular values.

In [None]:
fig = plt.figure(figsize=(30, 7))
g = sns.swarmplot(x="Club_Position", y="Acceleration", data=fifa.sample(1000), palette=sns.color_palette(['#2ecc71']))

In [None]:
# Swarm Plots can also be overlaid on top of other plots.
# We use a sample here, otherwise there are too many points to plot.
_sample = fifa.sample(2000)

fig = plt.figure(figsize=(30, 7))
# Both layers must share the same x variable, otherwise the categories will not line up
g = sns.boxplot(x="Club_Position", y="Acceleration", data=_sample, palette=sns.color_palette(['#2ecc71']))
# A darker colour keeps the points visible on top of the green boxes
g = sns.swarmplot(x="Club_Position", y="Acceleration", data=_sample, color='#2c3e50')
plt.setp(g.get_xticklabels(), rotation=90);

### Boxen Plots

In [None]:
# Boxen Plots also show how the points are spread along the Y axis:
# each rectangle's width scales with the share of points that fall in it.
fig = plt.figure(figsize=(30, 7))
g = sns.boxenplot(x="Club_Position", y="Acceleration", data=fifa, palette=sns.color_palette(['#2ecc71']))

In [None]:
# Let's group by Country and Show Position Differences
fig = plt.figure(figsize=(30, 7))
g = sns.boxenplot(x="Nationality", y="Finishing", hue="Club_Position", data=fifa[fifa.Nationality.isin(['Germany', 'Sweden', 'France', 'Switzerland', 'Israel'])], palette='viridis')

<p style="color:#f1c40f; font-size: 2em">Exercise 3</p>

<p style="background: #f1c40f; padding: 10px; color: #2c3e50">EASY: Create a plot grouped by Club_Position showing the distribution of Acceleration for each Nationality.</p>

In [None]:
# Answer 6a

<p style="background: #f1c40f; padding: 10px; color: #2c3e50">ADVANCED: How can we sort the values of the box plots so that they are ordered by mean?</p>

In [None]:
# Answer 6b

## Plotting Joint Distributions

We can also look at the correlations between variables in the traditional scatter plot. We can create them in a number of ways.

### Scatter Plots

In [None]:
# The first way is using Matplotlib
plt.figure(figsize=(20,6))
plt.scatter(x=fifa['Acceleration'],y=fifa['Speed'], alpha=0.2)
plt.title('Acceleration vs Speed')
plt.xlabel('Acceleration')
plt.ylabel('Speed')

We can also create one image with many plots inside. Here we'll create a number of subplots using matplotlib.

There are other ways of doing this too, using `subplot2grid` for instance, but this is the easiest for now.

In [None]:
fig, axes = plt.subplots(ncols=2, nrows=2, figsize=(10,10))

# We can access each subplot via a 2D Axes Array.
# An easier way in which to access the axes is to use axes.ravel() which will turn the 2D array into a 1D array
# and allow you to reference the first plot with axes[0], second with axes[1] and so on.

axes[0,0].scatter(x=fifa['Acceleration'],y=fifa['Speed'])
axes[0,0].title.set_text('Acceleration vs Speed')
axes[0,0].set_ylabel('Speed')
axes[0,0].set_xlabel('Acceleration')

axes[0,1].scatter(x=fifa['Acceleration'],y=fifa['Age'])
axes[0,1].title.set_text('Acceleration vs Age')
axes[0,1].set_xlabel('Acceleration')
axes[0,1].set_ylabel('Age')

axes[1,1].scatter(x=fifa['Acceleration'],y=fifa['Balance'])
axes[1,1].title.set_text('Acceleration vs Balance')
axes[1,1].set_xlabel('Acceleration')
axes[1,1].set_ylabel('Balance')


axes[1,0].scatter(x=fifa['Acceleration'],y=fifa['Stamina'])
axes[1,0].title.set_text('Acceleration vs Stamina')
axes[1,0].set_xlabel('Acceleration')
axes[1,0].set_ylabel('Stamina')

plt.tight_layout()
plt.show()
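
The comment in the cell above mentions `axes.ravel()`. Here is a minimal sketch of that approach, which replaces the four near-identical blocks with one loop. It runs on random stand-in data (hypothetical, not the FIFA values) so it can be tried on its own:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical random data standing in for the FIFA attributes
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(20, 100, size=(200, 5)),
                   columns=['Acceleration', 'Speed', 'Age', 'Stamina', 'Balance'])

fig, axes = plt.subplots(ncols=2, nrows=2, figsize=(10, 10))

# ravel() flattens the 2x2 Axes array so one loop can fill every panel, row by row
for ax, feature in zip(axes.ravel(), ['Speed', 'Age', 'Stamina', 'Balance']):
    ax.scatter(x=toy['Acceleration'], y=toy[feature], alpha=0.3)
    ax.set_title(f'Acceleration vs {feature}')
    ax.set_xlabel('Acceleration')
    ax.set_ylabel(feature)

fig.tight_layout()
```

Adding a fifth or sixth panel then only means changing the grid size and the list of features.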

In [None]:
plt.figure(figsize=(15,15))
g = sns.scatterplot(x="Ball_Control", y="Dribbling", hue='Club_Position', size='Rating', data=fifa, color="#555555")

### Joint Plots

Seaborn offers a lot of functionality out of the box.
For instance, the joint plot gives us both histograms for the distributions of each variable, and also the joint distribution in the form of a scatter plot.

In [None]:
# `size` was renamed to `height` in seaborn 0.9
sns.jointplot(x="Ball_Control", y="Dribbling", data=fifa, kind="reg", color="#555555", height=7)

<p style="color:#f1c40f; font-size: 2em">Exercise 4</p>

<br/>
<div style="background: #f1c40f; padding: 10px; color: #2c3e50"><p>EASY: Which features are more correlated? </p>
<ul>
<li><strong>Reactions</strong> and <strong>Speed</strong>;</li>
<li><strong>Reactions</strong> and <strong>Curve</strong>; or</li>
<li><strong>Reactions</strong> and <strong>Composure</strong></li>
</ul>
</div>

In [None]:
# Answer 7


## Multivariate Distributions

So far, we've seen how we can visualize one or two variables and their distributions, but we often have many more variables. Therefore we'd like to be able to visualize more at once to discover areas of interest.

### Scatter Plot Matrices

In [None]:
# We can do this pretty easily with Seaborn, however, for very large data frames, 
# this is computationally expensive to plot. So we'll focus on one country, Spain!
sns.pairplot(fifa[fifa.Nationality=='Spain'][['Composure', 'Long_Shots', 
                                              'Curve', 'Volleys', 'Jumping', 
                                              'Heading', 'Club_Position']], hue='Club_Position')

<p style="color:#f1c40f; font-size: 2em">Exercise 5</p>

<br/>
<div style="background: #f1c40f; padding: 10px; color: #2c3e50"><p>EASY: This is a great example of how a potentially nice visualization can be messed up by a terrible colour map.<br/>
How would you go about improving this visualization?</p>
</div>

In [None]:
# Answer 8

Plotting individual points is cool, but there is a lot of noise in these plots. There are other techniques you can use to increase the saliency of content in the visualization. 

A key question to always ask yourself is, what do I really want to see? In this instance, I want to see which values are more correlated with each other. Therefore having all the points, while nice, will slow down my visualization, and add noise.

So, with that in mind, let's look at visualizing this data more cleanly using Kernel Density Estimation.

### KDE Matrices

In [None]:
g = sns.PairGrid(fifa[fifa.Nationality=='Spain'][['Composure', 'Long_Shots', 'Curve', 'Volleys', 'Jumping', 'Heading']])
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.kdeplot, cmap="Blues_d", n_levels=6);
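
If kernel density estimation feels like a black box, it helps to build a one-dimensional version by hand: place a small Gaussian bump on every data point and average them. The sketch below does that with NumPy only. The ratings are random stand-ins (hypothetical, not the FIFA values), and the bandwidth uses Scott's rule of thumb purely for illustration; it is not meant to reproduce seaborn's implementation.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth` centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    # Distance of every grid point to every sample, in units of the bandwidth
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Hypothetical ratings drawn at random, not taken from the FIFA dataset
rng = np.random.default_rng(0)
ratings = rng.normal(loc=66, scale=7, size=500)

# Scott's rule of thumb for the bandwidth (an illustrative choice)
bandwidth = ratings.std(ddof=1) * len(ratings) ** (-1 / 5)

grid = np.linspace(ratings.min() - 20, ratings.max() + 20, 400)
density = gaussian_kde_1d(ratings, grid, bandwidth)

# A density should integrate to roughly 1 over the grid
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual points; a larger one smooths them away. The contour plots above are the two-dimensional analogue of the same idea.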

In [None]:
g = sns.PairGrid(fifa[fifa.Nationality=='Spain'][['Composure', 'Long_Shots', 'Curve', 'Volleys', 'Jumping', 'Heading', 'Club_Position']].sample(500))
g.map_upper(plt.scatter)
g.map_lower(sns.kdeplot)
g.map_diag(plt.hist);

<div style="background: #f1c40f; padding: 10px; color: #2c3e50"><p>EASY: Switch out my variables for some that you are interested in.<br/></p>
</div>

In [None]:
# Answer 9

### Parallel Coordinates

My favourite! Perfect for visualizing many variables at once.

In [None]:
spanish_players = fifa[(fifa.Nationality  == 'Spain') & (fifa.Club_Position.isin(['RF', 'CB', 'GK']))]
string_columns = [fifa.columns[idx] for idx, data_type in enumerate(fifa.dtypes) if data_type == 'object']
string_columns.pop(string_columns.index('Club_Position'))

In [None]:
plt.figure(figsize=(30, 6))
g = parallel_coordinates(spanish_players.drop(string_columns + ['Contract_Expiry'], axis=1), 'Club_Position', colormap='viridis')
plt.legend(loc=(1,1))
#plt.setp(g.get_xticklabels(), rotation=90)
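
Parallel coordinates draw every feature against one shared y axis, so a column with a much larger range can flatten all the others. If that happens, one option is to min-max scale each feature first. A minimal sketch on made-up values (hypothetical, not from the FIFA data):

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Hypothetical toy data, not taken from the FIFA dataset
toy = pd.DataFrame({
    'Club_Position': ['GK', 'GK', 'CB', 'CB', 'RF', 'RF'],
    'Age': [31, 27, 29, 24, 22, 26],
    'Speed': [45, 50, 62, 70, 88, 91],
    'Heading': [15, 20, 82, 78, 60, 55],
})

numeric = toy.drop(columns='Club_Position')

# Min-max scale every feature to [0, 1] so no single column dominates the shared axis
scaled = (numeric - numeric.min()) / (numeric.max() - numeric.min())
scaled['Club_Position'] = toy['Club_Position']

plt.figure(figsize=(8, 4))
ax = parallel_coordinates(scaled, 'Club_Position', colormap='viridis')
ax.legend(loc=(1, 0.5))
```

The trade-off is that the y axis no longer shows the original units, so this suits comparing shapes across groups more than reading off values.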

You can play with these if you want :D

### RadViz

Sometimes confusing, but I think they are neat!

In [None]:
# Set the style before creating the figure so that it applies to the new axes
plt.rcParams['axes.facecolor'] = 'white'
plt.rcParams['axes.edgecolor'] = '#ecf0f1'

fig, axes = plt.subplots(ncols=2, nrows=2, figsize=(20, 8))

countries = [['France', 'Spain'], ['England', 'Germany']]
positions = ['RM', 'LM', 'CAM']
n_rows = 2
n_cols= 2

for row in range(n_rows):
    for col in range(n_cols):
        ax = axes[row, col]
        country=countries[row][col]

        players_no_gk_sub_res = fifa[(fifa.Nationality  == country) & (fifa.Club_Position.isin(positions))]
        radviz(players_no_gk_sub_res[['Shot_Power', 'Agility', 'Speed', 'Acceleration', 'Vision', 'Ball_Control', 'Club_Position']], 'Club_Position', colormap='viridis', ax=ax)
        
        ax.set_title(country)
#         if(row != 0 & col !=1):
#             legend = ax.legend()
#             legend.remove()

fig.tight_layout()

## Global Correlation Detection

While the techniques we've shown above are good for a handful of variables, they don't scale so well. What we often do instead is go for a global view of the data first in order to understand the relationships at a macro level.

This is a very fast process to carry out in Pandas.

In [None]:
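# Correlation tables for all numeric variables can be quickly computed across a large dataframe.
# Selecting the numeric columns first avoids errors on string columns in recent pandas versions.
fifa.select_dtypes(include='number').corr()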
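
A full correlation table repeats every pair twice and includes the trivial diagonal. One way to get a quick ranked list instead is to keep only the upper triangle and sort by absolute value. The sketch below runs on random stand-in data (hypothetical, built so that two columns are strongly related), and `sort_values(key=...)` assumes pandas 1.1 or later:

```python
import numpy as np
import pandas as pd

# Hypothetical random data standing in for the FIFA attributes
rng = np.random.default_rng(0)
speed = rng.normal(70, 10, size=300)
toy = pd.DataFrame({
    'Speed': speed,
    'Acceleration': speed + rng.normal(0, 3, size=300),  # built to track Speed
    'Strength': rng.normal(65, 12, size=300),             # independent of Speed
    'Name': ['player'] * 300,                             # non-numeric, must be excluded
})

corr = toy.select_dtypes(include='number').corr()

# Keep only the upper triangle so each pair appears once, then rank by |r|
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs.head())
```

This gives a short list to sanity-check, but it is still a list of numbers, which is exactly the problem the next exercise asks you to solve visually.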

<p style="color:#f1c40f; font-size: 2em">Exercise 6</p>

<p style="background: #f1c40f; padding: 10px; color: #2c3e50">MEDIUM: So, you can quickly calculate all feature correlations using Pandas, but reading all these numbers is a cognitively demanding task. How can we visualize such data? Hint, have a look at the seaborn gallery :)</p>

In [None]:
# Answer 10

<p style="background: #f1c40f; padding: 10px; color: #2c3e50">MEDIUM: Remember in our lectures how we showed it was easier for people to compare heatmaps when related information is placed together? <br/>What component in seaborn could help you do this?<br/>Can you find the surprising relationship using this plot?</p>

In [None]:
# Answer 11

> You're almost on the cusp of data science now :D

<p style="background: #f1c40f; padding: 10px; color: #2c3e50">ADVANCED: At the macro level, we always lose information, since we lose granularity. Can you partition the data on some level to see how correlations change?</p>

In [None]:
# Answer 12
# Hint, you can for instance, query the dataframe by some Nationalities, and plot each correlation heatmap in 
# a separate plot.

## Faceted Visualizations

How you split the data can also provide many insights, and there are many ways in which we could split this data (which is why I chose it :)).

In [None]:
# One histogram of Acceleration per Club_Position (`size` was renamed to `height` in seaborn 0.9)
g = sns.FacetGrid(fifa, col="Club_Position", height=2, aspect=1, col_wrap=5, sharex=True, sharey=False)
g.map(plt.hist, 'Acceleration', bins=20).add_legend()

> Cool, right? You can play with this more

In [None]:
g = sns.FacetGrid(fifa[fifa.Nationality.isin(['Spain', 'Italy', 'England', 'France'])], col="Club_Position", 
                  hue="Nationality", height=3, aspect=1, col_wrap=5, sharex=True, sharey=False)
g.map(plt.scatter, 'Acceleration', 'Ball_Control').add_legend()

## Building a Scouting System

**Clustering with Seaborn (with scipy in the back office)**

Seaborn also has some great tools available to help in some common tasks such as data clustering.

With this in mind, we'll have a look at how we can use seaborn to scout players to find cheaper alternatives for the expensive ones :)

In [None]:
# Pivot tables allow us to change how our data is formatted.
# Here I want to have a row per player, where I have 4 features
pivoted = fifa[fifa.Club_Position == 'ST'].pivot_table(index=['Name'], values=['Ball_Control', 'Reactions', 'Strength', 'Acceleration'])

In [None]:
pivoted.head()
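
One thing to keep in mind: `pivot_table` aggregates rows that share an index value, using the mean by default. If two different players happen to share a name (an assumption about the data, worth checking), they are silently merged into one row. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical toy data; two different players happen to share the name 'Nino'
toy = pd.DataFrame({
    'Name': ['Nino', 'Nino', 'Costa'],
    'Ball_Control': [60, 80, 78],
    'Strength': [70, 50, 90],
})

# pivot_table aggregates duplicate index values, using the mean by default
pivoted_toy = toy.pivot_table(index=['Name'], values=['Ball_Control', 'Strength'])
print(pivoted_toy)

# A quick check for names that occur more than once
print(toy['Name'].duplicated().any())
```

If duplicates turn out to matter, adding a second index column that tells the players apart (a club, for instance) is one way around it.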

In [None]:
g = sns.clustermap(pivoted, z_score=1, figsize=(10,60), center=0, cmap="RdBu")
# plt.setp(g.ax_heatmap.get_yticklabels(), rotation=0);
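
The `z_score=1` argument asks for each column to be standardized before clustering, so that every feature is expressed as "how far from the average striker" rather than in raw points. The sketch below shows that calculation by hand in pandas on made-up values. It illustrates the idea and is not seaborn's own code.

```python
import pandas as pd

# Hypothetical toy data, not taken from the FIFA dataset
toy = pd.DataFrame(
    {'Ball_Control': [60.0, 70.0, 80.0], 'Strength': [90.0, 50.0, 70.0]},
    index=['A', 'B', 'C'],
)

# Column-wise z-score: subtract each column's mean, divide by its standard deviation
z = (toy - toy.mean()) / toy.std()
print(z)
```

With `center=0` and a diverging colour map such as `RdBu`, values above the column average then end up in one colour and values below it in the other.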

In [None]:
# We'll remove the string columns, since Parallel Coordinates etc. will not work with them
string_columns = [fifa.columns[idx] for idx, data_type in enumerate(fifa.dtypes) if data_type == 'object']
string_columns.pop(string_columns.index('Name'))
filtered_data = fifa[fifa.Name.isin(['Romelu Lukaku', 'Artem Dzyuba', 'Diego Costa', 'Christian Benteke', 'Negredo', 'Nino', 'Alexandre Lacazette'])].drop(string_columns + ['Contract_Expiry'], axis=1)

In [None]:
fig = plt.figure(figsize=(30, 7))
pc = parallel_coordinates(filtered_data, 'Name', colormap='viridis')
plt.setp(pc.get_xticklabels(), rotation=90);

In [None]:
plt.figure(figsize=(10, 6))
radviz(filtered_data[['Stamina', 'Agility', 'Speed', 'Acceleration', 'Strength', 'Ball_Control', 'Aggression', 'Name']], 'Name',colormap='viridis')
plt.legend(loc=(1,0.6)) 

## Styling Matplotlib Plots

By default, matplotlib is undoubtedly ugly. But all is not lost: you can clean up matplotlib plots by adjusting `rcParams`.

In [None]:
pd.set_option("display.max_columns",None)
plt.rcParams['axes.facecolor'] = 'white'
plt.rcParams['axes.edgecolor'] = '#ecf0f1'
plt.rcParams['grid.color'] = '#ecf0f1'
plt.rcParams['axes.labelcolor'] = '#7f8c8d'
plt.rcParams['text.color'] = '#7f8c8d'
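
Setting `plt.rcParams` directly changes the style for every later plot in the session. If you only want the styling for one figure, `plt.rc_context` applies it temporarily. A minimal sketch reusing the colours above (the plotted numbers are placeholders):

```python
import matplotlib.pyplot as plt

# The same colours as above, collected in one dictionary
clean_style = {
    'axes.facecolor': 'white',
    'axes.edgecolor': '#ecf0f1',
    'grid.color': '#ecf0f1',
    'axes.labelcolor': '#7f8c8d',
    'text.color': '#7f8c8d',
}

before = plt.rcParams['axes.edgecolor']

# Settings apply only inside the with-block and are restored afterwards
with plt.rc_context(clean_style):
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot([1, 2, 3], [2, 4, 3])
    ax.set_xlabel('x')
    inside = plt.rcParams['axes.edgecolor']

after = plt.rcParams['axes.edgecolor']
```

Keeping the style in a dictionary also makes it easy to share between notebooks.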

In [None]:
plt.figure(figsize=(10, 6))
radviz(filtered_data[['Stamina', 'Agility', 'Speed', 'Acceleration', 'Strength', 'Ball_Control', 'Aggression', 'Name']], 'Name', colormap='Spectral')
plt.legend(loc=(1,0.6)) 

> Wow, I hear you say, how beautiful :D

## Your Own Exploratory Analysis

<p style="color:#f1c40f; font-size: 2em">Exercise 7</p>

<p style="background: #f1c40f; padding: 10px; color: #2c3e50">OPTIONAL: Conduct your own interesting analysis.</p>