## Tutorial 3. Exploratory data analysis: *the beauty of seaborn*


Created by Emanuel Flores-Bautista, 2019. All content contained in this notebook is licensed under a [Creative Commons License 4.0](https://creativecommons.org/licenses/by/4.0/). The code is licensed under a [MIT license](https://opensource.org/licenses/MIT).

In [None]:
import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.datasets
import matplotlib as mpl
from math import pi

#Setting all the plots in the notebook
%matplotlib inline

#Make the figure format appear as svg
%config InlineBackend.figure_format = 'svg' 

A key aspect of data analysis is *finding a story within a data set*. Whether you are analyzing a dataset for scientific research, developing a data-driven story for a grant or a blog post, or pursuing any other goal, exploratory data analysis is the best way to start building intuition about the data you're working with.

In this tutorial we will walk through, in more depth, some of the plots you can make in Matplotlib and especially in Seaborn. I will show you what I think are some great plotting tools for meaningful exploratory data analysis.

### Visualizing distributions

We're going to start by looking at different methods to visualize distributions. Let's begin with matplotlib's `plt.hist` function. To do that, we're going to draw 1000 random samples from a Gaussian distribution with mean $\mu = 5$ and standard deviation $\sigma = 2$.

In [None]:
np.random.seed(42)
x = np.random.normal(loc = 5, scale = 2, size = 1000) #gaussian distro, mean = 5, std = 2


plt.hist(x)
plt.xlabel('values')
plt.ylabel('frequency');
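
As a rough sketch of what the histogram above represents, `np.histogram` (which `plt.hist` relies on for binning) returns the bin counts and bin edges without drawing anything. The names `x_demo`, `counts`, and `edges` are illustrative and not part of the original notebook; 10 bins is the default in both functions.

```python
import numpy as np

# Illustrative sample drawn the same way as x above (names are assumptions)
rng = np.random.RandomState(42)
x_demo = rng.normal(loc=5, scale=2, size=1000)

# Bin the data into 10 equal-width bins, without plotting
counts, edges = np.histogram(x_demo, bins=10)

print(counts.sum())                 # every sample lands in exactly one bin
print(len(edges))                   # 10 bins are delimited by 11 edges
print(x_demo.mean(), x_demo.std())  # should be close to 5 and 2
```

The bar heights in the plot are the entries of `counts`, and the bar boundaries are the entries of `edges`.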


We can easily control the plotting style using the [sns.set_style](https://seaborn.pydata.org/tutorial/aesthetics.html) function of the Seaborn library. For example, we can plot on a "darkgrid" background.

In [None]:
sns.set_style('darkgrid')

In [None]:
plt.hist(x)
plt.xlabel('values')
plt.ylabel('frequency');

Cool, right? We can see that the font also changed!

I personally prefer a white background. I also wrote a simple function to set the plotting options for this workshop. Let's load the TCD_19 plotting options!

In [None]:
import TCD19_utils as TCD_19

TCD_19.set_plotting_style_2()

In [None]:
plt.hist(x)
plt.xlabel('values')
plt.ylabel('frequency');

Now we have different font sizes and other cool features that make the plots easier to read.

In the previous tutorial I mentioned that violin plots compute a kernel density estimate (KDE) of the distribution. If you want a deeper intuition about what a KDE means, please refer to Jake VanderPlas's great [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html). KDE plots are a nice way to visualize continuous distributions. The Seaborn library has some nice implementations in both the `sns.distplot` and `sns.kdeplot` functions. [Here is a nice walkthrough](https://seaborn.pydata.org/tutorial/distributions.html) of the different options to visualize distributions in Seaborn.
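
To make the idea of a KDE more concrete, here is a minimal sketch that builds one by hand: each sample contributes a small Gaussian bump, and the estimate is the average of all bumps. The function `gaussian_kde_1d` and the fixed `bandwidth=0.5` are assumptions for illustration; Seaborn chooses its bandwidth with its own rule and this is not its internal code.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distances between grid points and samples,
    # shape (len(grid), len(samples))
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

rng = np.random.RandomState(0)
samples = rng.normal(loc=5, scale=2, size=500)
grid = np.linspace(-5, 15, 401)
density = gaussian_kde_1d(samples, grid, bandwidth=0.5)

# A density should integrate to roughly 1 over a wide enough grid
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 2))
```

A smaller bandwidth gives a bumpier curve that follows individual samples; a larger one gives a smoother curve that can hide real structure.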

In [None]:
fig, axes = plt.subplots(4, 1, figsize = (8,8))

sns.distplot(x, ax = axes[0]);
sns.distplot(x, rug = True, hist = False, ax = axes[1])
sns.rugplot(x, ax = axes[2]);  # plot each data point as a tick along the axis
sns.kdeplot(x, shade = True, ax= axes[3]);

Now, let's draw some samples from the Laplace distribution.

In [None]:
y = np.random.laplace(loc = 5, scale = 2, size = 1000)  # Laplace distro, location = 5, scale = 2

In [None]:
sns.kdeplot(y, shade = True)

Now let's concatenate both arrays and visualize their joint distribution. 

In [None]:
data = np.vstack((x, y))

In [None]:
data = data.T
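
The stacking and transposing steps can be checked on a tiny example. This is only a sketch with stand-in arrays `a` and `b` (hypothetical names, not from the notebook) to show how the shapes change.

```python
import numpy as np

a = np.arange(3)         # stand-in for x
b = np.arange(3) * 10    # stand-in for y

stacked = np.vstack((a, b))  # shape (2, 3): one row per variable
paired = stacked.T           # shape (3, 2): one row per observation

print(stacked.shape, paired.shape)
print(paired[1])             # the second observation: a[1] and b[1]

# np.column_stack produces the same (n_samples, 2) layout in one step
same = np.array_equal(paired, np.column_stack((a, b)))
print(same)
```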

We can also visualize 2-D distributions using the `sns.kdeplot` function. I personally love this visualization. In this plot, the color represents the joint probability density; that is, the more points there are in a particular region, the more purple (in this case) the contour will appear. The `n_levels` parameter controls the number of contours the plot will show.
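
For intuition about what the contours encode, the sketch below estimates a joint density with `scipy.stats.gaussian_kde` and compares it at two points. This is an analogous computation under stated assumptions, not Seaborn's internal code; the arrays `gx` and `ly` and the two evaluation points are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.RandomState(1)
gx = rng.normal(loc=5, scale=2, size=500)    # Gaussian samples
ly = rng.laplace(loc=5, scale=2, size=500)   # Laplace samples

# gaussian_kde expects an array of shape (n_dimensions, n_points)
kde = gaussian_kde(np.vstack((gx, ly)))

# Evaluate the joint density near the center and far out in a corner
center = kde([[5.0], [5.0]])[0]
edge = kde([[12.0], [-4.0]])[0]

print(center, edge)
print(center > edge)
```

Contour lines in the plot connect points where this estimated density takes the same value, so the innermost contours surround the regions with the most samples.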

In [None]:
plt.figure(figsize = (8,4))

sns.kdeplot(x,y, n_levels= 15, cmap = 'viridis_r', shade = False)

plt.ylim(-5, 13);

In [None]:
plt.figure(figsize = (8,4))

sns.kdeplot(x, y, n_levels= 15, cmap = 'viridis_r', shade = False)

plt.scatter(x, y, alpha = 0.1, c = 'purple')
plt.ylim(-5, 13);

We can also shade the 2-D KDE.

In [None]:
plt.figure(figsize = (8,4))

sns.kdeplot(x,y, n_levels= 15, cmap = 'viridis_r', shade = True)
plt.ylim(-5, 13);

To switch gears to other types of data, let's look back at our tips dataset.

In [None]:
tips = sns.load_dataset("tips")
tips.head()

In [None]:
tips.groupby('day').count()

`sns.FacetGrid` is a great way to split a dataset into a grid of panels, one per combination of categories.
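
Conceptually, faceting is a group-by: each panel receives the rows that share one combination of the row and column variables. The sketch below shows that partition with plain pandas on a small made-up table (the values in `toy` are hypothetical and are not taken from the tips dataset).

```python
import pandas as pd

# Hypothetical miniature of a tips-like table
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female", "Male"],
    "time": ["Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
    "total_bill": [10.0, 12.5, 30.0, 22.0, 18.0],
    "tip": [1.5, 2.0, 5.0, 3.0, 2.5],
})

# Each (time, sex) combination corresponds to one panel of the grid
panels = {key: group[["total_bill", "tip"]]
          for key, group in toy.groupby(["time", "sex"])}

for key, group in panels.items():
    print(key, len(group))
```

The plotting function passed to the grid's `map` method is then called once per panel, on that panel's subset only.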

In [None]:
ax = sns.FacetGrid(tips, col="sex", row = 'time', hue="smoker", palette = 'Set2_r')
ax.map(plt.scatter, "total_bill", "tip", alpha=.7)
ax.add_legend()
plt.subplots_adjust(hspace=1.3, wspace=1);

Another very flexible grid-based function is [`sns.catplot`](https://seaborn.pydata.org/generated/seaborn.catplot.html).

In [None]:
g = sns.catplot(x="tip", y="smoker",hue="sex", row="time",
                 data=tips,orient="h", height=2.4, aspect=3,
                palette="Set3", kind="box")

In [None]:
# Load the example Titanic dataset
titanic = sns.load_dataset("titanic")

# Set up a grid to plot survival probability against several variables
g = sns.PairGrid(titanic, y_vars="survived",
                 x_vars=["class", "sex", "who", "alone"],
                 height=5, aspect=.5)

# Draw a seaborn pointplot onto each Axes
g.map(sns.pointplot, scale=1.3, errwidth=4, color="xkcd:plum")
g.set(ylim=(0, 1))
sns.despine(fig=g.fig, left=True)

To wrap up our distribution visualization ride, let's look at the [`boxenplot`](https://seaborn.pydata.org/examples/large_distributions.html), which is a mixture of a violin and a box plot.
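
A boxenplot (also called a letter-value plot) draws a nested series of boxes, each covering a wider pair of quantiles than the last. The sketch below computes such quantile pairs with NumPy. The helper `letter_values` and its `depth` parameter are hypothetical; Seaborn decides how many boxes to draw with its own rule.

```python
import numpy as np

def letter_values(data, depth=4):
    """Return (lower, upper) quantile pairs for tail areas 1/4, 1/8, 1/16, ..."""
    data = np.asarray(data, dtype=float)
    pairs = []
    for k in range(2, depth + 2):
        tail = 0.5 ** k
        pairs.append((np.quantile(data, tail), np.quantile(data, 1 - tail)))
    return pairs

# Evenly spaced values between 0 and 1, so quantiles are easy to read
demo = np.linspace(0, 1, 1001)
pairs = letter_values(demo, depth=4)

for lower, upper in pairs:
    print(round(lower, 4), round(upper, 4))
```

The first pair is the familiar interquartile box; each following pair reaches further into the tails, which is why the plot is informative for large datasets.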

In [None]:
palette = sns.cubehelix_palette(10, reverse = True)

In [None]:
diamonds = sns.load_dataset("diamonds")  # use the diamonds dataset
diamonds.head()

In [None]:
clarity_ranking = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

sns.boxenplot(x="clarity", y="carat",
              palette = palette, order=clarity_ranking,
              scale="linear", data=diamonds)

In [None]:
# Load the example planets dataset
planets = sns.load_dataset("planets")
planets.head(3)

### Please, don't use barplots and pie charts!
* Radar charts and treemaps

In [None]:
import sys

# Install squarify into the same environment as the running kernel
!{sys.executable} -m pip install squarify

In [None]:
import squarify    

In [None]:
state_problems = pd.read_csv('../data/190220_SEPLAN-estructuras_problem.csv')

In [None]:
priority_problems = state_problems[state_problems.Prioridad == 1]

In [None]:
ejes = np.unique(priority_problems['Ejes sectoriales'].values)
n = len(ejes)
ejes

In [None]:
palette= sns.color_palette('Greens', n_colors = n)[::-1]

In [None]:
plt.figure(figsize = (8, 6))
priority_problems['Ejes sectoriales'].value_counts().plot(kind = 'barh', color = palette,width = 1, alpha = 0.8)

plt.xlabel('Frecuencia', fontsize = 18);

In [None]:
ejes_count = priority_problems.groupby('Ejes sectoriales').count()['Prioridad']

In [None]:
norm = mpl.colors.Normalize(vmin=min(ejes_count.values), vmax=max(ejes_count.values))
colors = [mpl.cm.Greens(norm(value)) for value in ejes_count.values]
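
The normalization step above rescales counts linearly to the interval [0, 1] before they are handed to the colormap. Here is a small sketch with made-up counts (the array `counts` is hypothetical) comparing `Normalize` against the explicit formula.

```python
import numpy as np
from matplotlib.colors import Normalize

counts = np.array([2, 6, 10])  # hypothetical category counts

norm = Normalize(vmin=counts.min(), vmax=counts.max())
scaled = np.asarray(norm(counts))

# The same mapping written out by hand
manual = (counts - counts.min()) / (counts.max() - counts.min())

print(scaled)
print(np.allclose(scaled, manual))
```

The smallest count maps to the light end of the colormap and the largest to the dark end, so the shade of each treemap tile tracks its size.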

In [None]:
plt.figure(figsize=(14,8))
squarify.plot(label=ejes_count.keys(), sizes=ejes_count.values, color = colors, alpha=.6)

plt.axis('off');

In [None]:
(47 / ejes_count.values.sum())*100

In [None]:
categories = list(ejes_count.keys())
N = len(categories)

In [None]:
values = list(ejes_count.values)
values.append(values[0])

In [None]:
values

In [None]:
values_sum = np.sum(values[:-1])

In [None]:
porcentajes= [(val/values_sum)*100 for val in values]

In [None]:
angles = [n / float(N) * 2 * pi for n in range(N)]
angles += angles[:1]
#angles

In [None]:
sns.set_style('whitegrid')

In [None]:
##radar chart
plt.figure(1, figsize=(7, 7))

# Initialise the spider plot
ax = plt.subplot(111, polar=True)
 
# Draw one axis per variable and add the category labels
plt.xticks(angles[:-1], categories, color='grey', size=12)
 
# Draw ylabels
#ax.set_rlabel_position(0)

#Set first variable to the vertical axis 
ax.set_theta_offset(pi / 2)

#Set clockwise rotation
ax.set_theta_direction(-1)

#Set yticks to gray color 
plt.yticks([3,6], ["3","6"], color="grey", size=10)
plt.ylim(0,9)
#plt.yscale('log')
 
# Plot data
ax.plot(angles, porcentajes, linewidth=1,color = 'lightgreen')
 
# Fill area
ax.fill(angles, porcentajes, 'lightgreen', alpha=0.3);
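
The data preparation for the radar chart can be summarized in a few lines: convert counts to percentages, spread the categories evenly around the circle, and repeat the first point so the outline closes. The sketch below uses made-up categories and counts (the `counts` dictionary is hypothetical) and skips the plotting.

```python
from math import pi

counts = {"A": 5, "B": 3, "C": 2}  # hypothetical categories and counts
labels = list(counts)
total = sum(counts.values())

# One percentage and one angle per category
percentages = [100 * v / total for v in counts.values()]
angles = [i / len(labels) * 2 * pi for i in range(len(labels))]

# Repeat the first point at the end so the polygon closes
percentages_closed = percentages + percentages[:1]
angles_closed = angles + angles[:1]

print(percentages_closed)
print(angles_closed)
```

Without the repeated first point, the line would stop at the last category and leave a gap in the outline.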

### Scatter plots

Another pretty cool plot from seaborn is `sns.scatterplot`. We can map a colormap to one variable and the dot size to another.
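
To illustrate what mapping a variable to dot size means, the sketch below linearly rescales a numeric column into a size range such as `(10, 200)`. The helper `linear_sizes` is hypothetical and only approximates the idea; Seaborn's own size mapping may differ in its details.

```python
import numpy as np

def linear_sizes(values, sizes=(10, 200)):
    """Rescale values linearly so the min maps to sizes[0] and the max to sizes[1]."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.nanmin(values), np.nanmax(values)
    unit = (values - lo) / (hi - lo)   # now between 0 and 1
    return sizes[0] + unit * (sizes[1] - sizes[0])

marker_sizes = linear_sizes([1.0, 5.5, 10.0])
print(marker_sizes)
```

Missing values stay missing after the rescaling, so points without a value for the size variable have no defined marker size.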

In [None]:
TCD_19.set_plotting_style_2()

In [None]:
#Initialize a new palette 
cmap = sns.cubehelix_palette(10, as_cmap=True)

plt.figure(figsize = (8,6))

ax = sns.scatterplot(x="distance", y="orbital_period",
                     hue="year", size="mass",
                     palette=cmap, sizes=(10, 200),
                     data=planets)

In [None]:
dots = sns.load_dataset("dots")
dots.head()


In [None]:
# Define a palette to ensure that colors will be
# shared across the facets
palette = dict(zip(dots.coherence.unique(),
                   sns.color_palette("rocket_r", 6)))

# Plot the lines on two facets
sns.relplot(x="time", y="firing_rate",
            hue="coherence", size="choice", col="align",
            size_order=["T1", "T2"], palette=palette,
            height=5, aspect=.75, facet_kws=dict(sharex=False),
            kind="line", legend="full", data=dots)

In [None]:
# Load an example dataset with long-form data
fmri = sns.load_dataset("fmri")
fmri.tail()

In [None]:
# Plot the responses for different events and regions
sns.lineplot(x="timepoint", y="signal",
             hue="region", style="event",
             data=fmri, palette= 'Set2_r')

In [None]:
flights = sns.load_dataset("flights")
flights.head()


In [None]:
# Reshape from long to wide form: one row per month, one column per year
flights = flights.pivot(index="month", columns="year", values="passengers")
flights.head()
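
The reshaping done by `pivot` is easier to see on a tiny table. The sketch below uses a hypothetical four-row frame `long_df` (toy values, not loaded from the flights dataset) and turns it from long form into a month-by-year grid.

```python
import pandas as pd

# Hypothetical long-form table: one row per (year, month) observation
long_df = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Wide form: months become the index, years become the columns
wide = long_df.pivot(index="month", columns="year", values="passengers")

print(wide)
print(wide.loc["Jan", 1950])
```

A heatmap needs exactly this wide layout, because each cell of the grid becomes one colored square.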

In [None]:
ax = sns.heatmap(flights, robust = True, cmap = 'inferno_r')

In [None]:
plt.figure(figsize = (8,6))
sns.heatmap(flights, annot=True, fmt="d",robust = True, cmap = 'inferno_r')

In [None]:
month_correlation_mat = flights.T.corr()
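
`DataFrame.corr` computes correlations between columns, so transposing first makes the months the columns and yields a month-by-month correlation matrix. The sketch below shows this on a small hypothetical wide table (`wide_demo` holds toy values, not the real dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: rows are months, columns are years
wide_demo = pd.DataFrame(
    {1949: [112, 118, 132], 1950: [115, 126, 141], 1951: [145, 150, 178]},
    index=["Jan", "Feb", "Mar"],
)

# After transposing, each month is a column, so corr() compares months
month_corr = wide_demo.T.corr()

print(month_corr.shape)
print(list(month_corr.index))
print(np.allclose(np.diag(month_corr), 1.0))
```

Without the transpose, the same call would instead return a year-by-year correlation matrix.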

In [None]:
sns.clustermap(month_correlation_mat, cmap = 'inferno_r', robust = True)