# Pandas & Seaborn

<img src='helpers/pandas.png' width=400>

Pandas is a Python software library for manipulating and analyzing data.  

It is one of the most widely used tools for data munging. With it you can:
* present data in readable tabular formats
* filter data with a variety of convenient methods
* work with a variety of data formats (CSV, Excel, …)
* quickly plot data with built-in convenience functions
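
As a minimal sketch of the points above, the snippet below reads a tiny CSV, filters it, and summarizes it. The CSV text, column names, and values are invented for illustration and are not part of the workshop data.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """subject,timepoint,signal
s0,0,0.10
s0,1,0.25
s1,0,-0.05
s1,1,0.30
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Filter rows with a boolean mask.
s0_rows = df[df["subject"] == "s0"]

# Summarize: mean signal per subject.
mean_signal = df.groupby("subject")["signal"].mean()

print(s0_rows)
print(mean_signal)
```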

# Example Dataset: fMRI time series

This example fMRI dataset is taken from https://github.com/mwaskom/Waskom_CerebCortex_2017 and is one of Seaborn's built-in toy datasets.

If you are curious about further analysis, see the following article related to the data:
* Michael L. Waskom, Michael C. Frank, Anthony D. Wagner. "Adaptive Engagement of Cognitive Control in Context-Dependent Decision Making." Cerebral Cortex, Volume 27, Issue 2, February 2017, Pages 1270–1284, https://doi.org/10.1093/cercor/bhv333

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [None]:
fmri = sns.load_dataset("fmri")

In [None]:
fmri

We can use matplotlib to make any of the plots in this notebook.  (Remember that the pandas and seaborn plotting routines are built on matplotlib.)

In [None]:
plt.plot(fmri['timepoint'],
         fmri['signal'],
         'ko')

Of course, there's more structure to the data than what's visible here.  The signal evolves differently over time depending on the values in the 'subject', 'event', and 'region' columns.

In [None]:
fmri['subject'].unique()

In [None]:
fmri['event'].unique()

In [None]:
fmri['region'].unique()

In [None]:
fmri.loc[(fmri['subject']=='s0') & 
         (fmri['event']=='stim') & 
         (fmri['region']=='parietal')].sort_values(by='timepoint')
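
The selection above combines several boolean masks. The sketch below shows the same idea on a small invented table (the values are made up; only the column names mirror the fmri data) and compares it with an equivalent `query` string.

```python
import pandas as pd

# Small invented table with the same column names as the fmri data.
toy = pd.DataFrame({
    "subject": ["s0", "s0", "s1", "s0"],
    "event": ["stim", "cue", "stim", "stim"],
    "timepoint": [1, 0, 0, 0],
    "signal": [0.2, 0.1, 0.3, 0.0],
})

# Each comparison yields a boolean Series; `&` combines them element-wise.
# The parentheses are required because `&` binds tighter than `==`.
mask = (toy["subject"] == "s0") & (toy["event"] == "stim")
by_mask = toy.loc[mask].sort_values(by="timepoint")

# The same selection written as a query string.
by_query = toy.query("subject == 's0' and event == 'stim'").sort_values(by="timepoint")

print(by_mask)
```

Both forms keep the original row labels, so the two results compare equal.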

In [None]:
fmri_s0 = fmri.loc[(fmri['subject']=='s0') & 
                   (fmri['event']=='stim') & 
                   (fmri['region']=='parietal')].sort_values(by='timepoint').copy()
fmri_s1 = fmri.loc[(fmri['subject']=='s1') & 
                   (fmri['event']=='stim') & 
                   (fmri['region']=='parietal')].sort_values(by='timepoint').copy()              

In [None]:
plt.plot(fmri_s0['timepoint'],
         fmri_s0['signal'],
         'k')
plt.plot(fmri_s1['timepoint'],
         fmri_s1['signal'],
         'b')

Pandas DataFrames have built-in plotting methods, which can be a bit easier to use out of the box than calling matplotlib directly.

In [None]:
fmri_s1.plot(x = 'timepoint', 
             y = 'signal')

In [None]:
fmri_s1.plot(x = 'timepoint', 
             y = 'signal', 
             color = 'k')

And what if we want to plot all the data?

In [None]:
# not quite this easy!
fmri.plot(x='timepoint',
          y='signal')

What's with the zig-zags?

By default, pandas makes a line plot that connects the points in row order. Since the rows are not sorted by timepoint, and each timepoint has many signal values, the connecting line zigzags back and forth in both the x and y directions.

In [None]:
fmri.plot(x='timepoint',
          y='signal',
          kind='scatter')
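
Another way to avoid the zig-zags, sketched here on invented data, is to reduce the table to one value per timepoint before drawing a line. Averaging is just one possible choice of summary.

```python
import pandas as pd

# Invented long-format data: several rows per timepoint, in shuffled order.
toy = pd.DataFrame({
    "timepoint": [2, 0, 1, 0, 2, 1],
    "signal": [0.4, 0.0, 0.3, 0.2, 0.6, 0.1],
})

# Average the signal at each timepoint; groupby sorts the keys by default,
# so the resulting index is in increasing order.
mean_by_time = toy.groupby("timepoint")["signal"].mean()

# A Series plots its index on the x axis, so this line no longer zig-zags.
ax = mean_by_time.plot()
ax.set_ylabel("mean signal")
```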

The `kind` parameter makes it very easy to make a variety of different elementary plots:

* `line` : line plot
* `bar` : vertical bar plot
* `barh` : horizontal bar plot
* `hist` : histogram
* `box` : boxplot
* `kde` : kernel density estimation plot
* `density` : same as kde
* `area` : area plot
* `pie` : pie plot
* `scatter` : scatter plot
* `hexbin` : hexbin plot
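
As a quick sketch of one of the kinds listed above, here is `kind='hist'` applied to invented, normally distributed data (the column name `signal` is reused only for familiarity).

```python
import numpy as np
import pandas as pd

# Invented data: 200 draws from a standard normal distribution.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"signal": rng.normal(size=200)})

# A histogram needs only one column; `bins` is passed through to matplotlib.
ax = toy.plot(y="signal", kind="hist", bins=20)
ax.set_xlabel("signal")
```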

In [None]:
fmri_s0.plot(x='timepoint',
             y='signal',
             kind='scatter')

In [None]:
fmri_s0.plot(x='timepoint',
             y='signal',
             kind='bar')

In [None]:
fmri_s0.plot(x='timepoint',
             y='signal',
             kind='barh')

Because pandas plotting methods return matplotlib Axes objects, pandas plots can be tailored with matplotlib commands!

In [None]:
fmri_s0.plot(x='timepoint',
             y='signal',
             color='k')
fmri_s1.plot(x='timepoint',
             y='signal',
             color='b')

In [None]:
ax = fmri_s0.plot(x='timepoint',
                  y='signal',
                  color='k')
fmri_s1.plot(x='timepoint',
             y='signal',
             color='b',
             ax=ax)

In [None]:
fig,ax = plt.subplots(1,1,figsize=(7,5))
fmri_s0.plot(x='timepoint',
             y='signal',
             color='k',
             ax=ax)
fmri_s1.plot(x='timepoint',
             y='signal',
             color='b',
             ax=ax)
ax.legend(['s0','s1'])
ax.set_title('Subjects s0 and s1')
plt.show()

# Seaborn

If Matplotlib 'tries to make easy things easy and hard things possible,' Seaborn tries to make a well-defined set of hard things easy too.

https://seaborn.pydata.org

<img src='helpers/seaborn.png' width=700>
          
* Built on top of matplotlib and closely integrated with pandas data structures.
* Used for making statistical graphics and using visualization to quickly and easily explore and understand data.
* The style settings can also affect matplotlib plots, even if you don't make them with seaborn.

"lineplot" will draw a line plot with the possibility of semantic groupings. (https://seaborn.pydata.org/generated/seaborn.lineplot.html)

In [None]:
sns.lineplot(data=fmri_s0,x='timepoint',y='signal')

Now make the same plot for all the subjects:

In [None]:
sns.lineplot(data=fmri,x='timepoint',y='signal')

What is the above showing? -> It's actually showing the mean signal at each timepoint (the line) and a 95% confidence interval for that mean (the shaded band).
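
To get a feel for what such an interval represents, here is a hand-rolled percentile bootstrap of the mean on invented values. Seaborn's default confidence interval is also bootstrap-based, but this is only a rough sketch of the idea, not seaborn's internal code.

```python
import numpy as np

# Invented measurements at a single timepoint (e.g. one value per subject).
values = np.array([0.10, 0.25, -0.05, 0.30, 0.15, 0.20, 0.05, 0.18])

rng = np.random.default_rng(0)
n_boot = 1000

# Resample with replacement and record the mean of each resample.
boot_means = np.array([
    rng.choice(values, size=values.size, replace=True).mean()
    for _ in range(n_boot)
])

# The central 95% of the bootstrap means gives the interval.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {values.mean():.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```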

In [None]:
sns.lineplot(data=fmri.loc[fmri['subject']=='s0'],x='timepoint',y='signal')

Wait... the above shows a mean too?

Yes -> the data for subject s0 still includes every event and region, so there are several signal values at each timepoint.

In [None]:
sns.lineplot(data=fmri_s0,x='timepoint',y='signal')

Seaborn does make it easy to "split" the visualizations up using colors (hues), columns, and rows.

In [None]:
sns.lineplot(data=fmri, x='timepoint', y='signal', hue='subject')

In [None]:
sns.lineplot(data=fmri, x='timepoint', y='signal', hue='event')

In [None]:
sns.lineplot(data=fmri, x='timepoint', y='signal', hue='event', style='region')

Here again, note that although styling is useful for breaking data up into pieces for comparison, it merits some thought and sometimes some experimentation to arrive at the most useful form.

In [None]:
sns.lineplot(data=fmri, x='timepoint', y='signal', style='event', hue='region')

The above is easier to compare, since lines of different colors no longer sit right next to each other.

Statistical note:  by default, lineplot shows the mean and a 95% confidence interval.  Through the `errorbar` parameter you can instead use the standard error, the standard deviation, or a percentile interval.  See https://seaborn.pydata.org/tutorial/error_bars.html
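
For reference, the alternative error measures can be computed by hand. The sketch below uses invented values and plain NumPy; it illustrates the definitions rather than seaborn's own implementation.

```python
import numpy as np

# Invented measurements at a single timepoint.
values = np.array([0.10, 0.25, -0.05, 0.30, 0.15, 0.20, 0.05, 0.18])

mean = values.mean()
sd = values.std(ddof=1)                # sample standard deviation
se = sd / np.sqrt(values.size)         # standard error of the mean
pi_low, pi_high = np.percentile(values, [2.5, 97.5])  # 95% percentile interval

print(f"mean={mean:.3f}  sd={sd:.3f}  se={se:.3f}  pi=({pi_low:.3f}, {pi_high:.3f})")
```

The standard deviation and percentile interval describe the spread of the data, while the standard error (like the confidence interval) describes the uncertainty in the mean, so it shrinks as more observations are added.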

In [None]:
sns.lineplot(data=fmri, 
             x='timepoint', y='signal', 
             style='event', hue='region', 
             errorbar=None)

The above can be useful if you want to clear the plot of error markings and simply focus on the trend in the mean.

In [None]:
sns.lineplot(data=fmri, 
             x='timepoint', y='signal', 
             style='event', hue='region', 
             errorbar=('se',1))

"relplot" is useful for drawing relational plots (like line and scatter plots), onto a FacetGrid (separating values of a given variable along columns or rows).

https://seaborn.pydata.org/generated/seaborn.relplot.html

In [None]:
sns.relplot(data=fmri, 
            x='timepoint', y='signal', 
            hue='event', 
            col='region')

In [None]:
# Remember that fmri_s0 only retains the 'stim' event and the 'parietal' region for subject s0.

sns.relplot(data=fmri_s0, 
            x='timepoint', y='signal', 
            hue='event', 
            col='region')

In [None]:
sns.relplot(data=fmri, 
            x='timepoint', y='signal', 
            hue='event', 
            col='region', 
            kind='line')

Just like with Pandas, you can use matplotlib commands in tandem with Seaborn plotting, as long as the data structures are consistent. (Sometimes you have to make sure whether you are dealing with figure and axes objects or whether you can simply execute the pyplot methods.)

In [None]:
# All events for subject s0 in the parietal region.
ds0 = fmri.loc[(fmri['subject'] == 's0') &
               (fmri['region'] == 'parietal')].sort_values(by='timepoint', ignore_index=True)
# The same selection, split by event.
dstim = fmri.loc[(fmri['subject'] == 's0') &
                 (fmri['event'] == 'stim') &
                 (fmri['region'] == 'parietal')].sort_values(by='timepoint', ignore_index=True)
dcue = fmri.loc[(fmri['subject'] == 's0') &
                (fmri['event'] == 'cue') &
                (fmri['region'] == 'parietal')].sort_values(by='timepoint', ignore_index=True)
# ignore_index=True resets both indexes to 0..n-1, so this subtraction
# pairs rows by position (both frames are sorted by timepoint).
dstim['event_diff'] = dstim['signal'] - dcue['signal']
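
The subtraction above depends on both frames having the same row order. As an alternative sketch (on invented numbers, and assuming exactly one row per timepoint and event), the data can be pivoted so that the difference is aligned on the timepoint itself.

```python
import pandas as pd

# Invented long-format data: one row per (timepoint, event).
toy = pd.DataFrame({
    "timepoint": [0, 0, 1, 1, 2, 2],
    "event": ["stim", "cue", "stim", "cue", "stim", "cue"],
    "signal": [0.10, 0.05, 0.40, 0.10, 0.20, 0.25],
})

# One row per timepoint, one column per event.
wide = toy.pivot(index="timepoint", columns="event", values="signal")

# The subtraction is aligned on the timepoint index, not on row position.
wide["event_diff"] = wide["stim"] - wide["cue"]
print(wide)
```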

In [None]:
fig,ax = plt.subplots(2,1)
sns.lineplot(data=dstim, x='timepoint', y='signal', ax=ax[0])
sns.lineplot(data=dstim, x='timepoint', y='event_diff', ax=ax[1])
ax[1].set_ylabel('difference in stim vs cue')
plt.show()

In [None]:
fig,ax = plt.subplots(2,1)
sns.lineplot(data=ds0, x='timepoint', y='signal', hue='event', ax=ax[0])
ax[1].fill_between(dstim['timepoint'],dstim['event_diff'])
ax[1].set_ylabel('difference in stim vs cue')

In [None]:
plt.plot(dstim['signal'], dstim['event_diff'])

To color this by time, we can build it up in pieces and assign the pieces different colors.

In [None]:
import matplotlib.cm as cm

In [None]:
cm.jet(2)

In [None]:
dstim['timepoint'].values

In [None]:
[cm.jet(i) for i in dstim['timepoint'].values]

In [None]:
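# Draw one segment per pair of consecutive timepoints, each in its own color.
# Calling cm.jet with an integer indexes its color table (0-255), hence i*10.
for i in dstim['timepoint'].values[:-1]:
    i0 = dstim.loc[dstim['timepoint'] == i]
    i1 = dstim.loc[dstim['timepoint'] == i+1]
    plt.scatter(i0['signal'], i0['event_diff'], color=cm.jet(i*10))
    plt.plot([i0['signal'], i1['signal']], [i0['event_diff'], i1['event_diff']], color=cm.jet(i*10))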
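
The loop above draws each piece separately. A possible alternative, sketched here on an invented trajectory, is matplotlib's `LineCollection`, which draws all segments at once and maps a value per segment onto a colormap. The arrays `t`, `x`, and `y` are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import LineCollection

# Invented trajectory: x and y evolve with time t.
t = np.arange(19)
x = np.sin(t / 3.0)
y = np.cos(t / 3.0) * t / 18.0

# Build an array of segments with shape (n_segments, 2 points, 2 coordinates).
points = np.column_stack([x, y]).reshape(-1, 1, 2)
segments = np.concatenate([points[:-1], points[1:]], axis=1)

# Share one normalization so points and segments use the same color scale.
norm = plt.Normalize(t.min(), t.max())

fig, ax = plt.subplots()
lc = LineCollection(segments, cmap="jet", norm=norm)
lc.set_array(t[:-1])  # one color value per segment (its start time)
ax.add_collection(lc)
ax.scatter(x, y, c=t, cmap="jet", norm=norm)
fig.colorbar(lc, ax=ax, label="timepoint")
```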

# Multivariate Analysis with Seaborn
## Using the Penguins Dataset

"Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network."
-- https://github.com/allisonhorst/palmerpenguins

In [None]:
penguins = sns.load_dataset("penguins")

In [None]:
penguins.info()

In [None]:
penguins

### Histograms for numerical data, conditioned on categorical values

In [None]:
sns.histplot(data=penguins, 
             x="flipper_length_mm")

In [None]:
sns.histplot(data=penguins, 
             x="flipper_length_mm", 
             hue="species")

In [None]:
sns.histplot(data=penguins, 
             x="flipper_length_mm", 
             hue="species", 
             multiple="stack")

In [None]:
sns.histplot(data=penguins, 
             x="flipper_length_mm", 
             hue="species", 
             multiple="fill")

In [None]:
sns.histplot(data=penguins, 
             x="flipper_length_mm", 
             hue="species", 
             multiple="dodge")

## FacetGrid

The Seaborn FacetGrid "class maps a dataset onto multiple axes arrayed in a grid of rows and columns that correspond to levels of variables in the dataset. The plots it produces are often called 'lattice', 'trellis', or 'small-multiple' graphics." -- https://seaborn.pydata.org/generated/seaborn.FacetGrid.html

If you want to use facet plots, I highly recommend reading the above page for more information.

In [None]:
g = sns.FacetGrid(penguins, col="species")
g.map_dataframe(sns.histplot, x="flipper_length_mm")

"catplot" is useful for drawing categorical plots onto a FacetGrid.  https://seaborn.pydata.org/generated/seaborn.catplot.html

In [None]:
sns.catplot(data=penguins,
            x="flipper_length_mm",
            hue="species",
            col="species")

In [None]:
sns.catplot(data=penguins,
            x="flipper_length_mm",
            hue="species",
            col="species",
            kind='box')

In [None]:
sns.catplot(data=penguins,
            x="flipper_length_mm",
            hue="species",
            row="species",
            kind='box', height=1, aspect=4)

"displot" is useful for drawing distribution plots onto a FacetGrid.  https://seaborn.pydata.org/generated/seaborn.displot.html

In [None]:
sns.displot(data=penguins,
            x="flipper_length_mm",
            hue="species",
            col="species")

In [None]:
sns.displot(data=penguins,
            x="flipper_length_mm",
            hue="species",
            col="species", kind='kde')

You can also call `kdeplot` directly.

In [None]:
sns.kdeplot(data=penguins, 
             x="flipper_length_mm", 
             hue="species")

In [None]:
sns.kdeplot(data=penguins, 
             x="flipper_length_mm", 
             hue="species", 
             multiple="stack")

In [None]:
sns.kdeplot(data=penguins, 
             x="flipper_length_mm", 
             hue="species", 
             multiple="fill")

In [None]:
sns.kdeplot(data=penguins, 
             x="flipper_length_mm", 
             hue="species", 
             multiple="layer")

In [None]:
g = sns.FacetGrid(penguins, col="species")
g.map_dataframe(sns.kdeplot, x="flipper_length_mm")

## Pairplot

A pairplot is extremely useful for getting a snapshot, all at once, of the relationships between pairs of variables.
https://seaborn.pydata.org/generated/seaborn.pairplot.html

In [None]:
sns.pairplot(data=penguins)

In [None]:
sns.scatterplot(data=penguins, x='bill_length_mm', y='bill_depth_mm')

In [None]:
sns.scatterplot(data=penguins, x='body_mass_g', y='flipper_length_mm')

## Regression

"lmplot" and "regplot" are useful for looking at regression analysis.
* "regplot" will plot data and a linear regression model fit.
* "lmplot": regplot + FacetGrid.  Plot data and regression model fits across a FacetGrid.
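
To make "linear regression model fit" concrete, here is a least-squares line computed with `np.polyfit` on invented data. This is a swapped-in illustration of the kind of line that gets drawn, not seaborn's internal code, and the variable ranges are made up.

```python
import numpy as np

# Invented data: roughly linear with a little noise.
rng = np.random.default_rng(0)
x = np.linspace(3000, 6000, 50)              # e.g. a body-mass-like variable
y = 140 + 0.015 * x + rng.normal(0, 2, 50)

# Degree-1 least-squares fit: y is approximated by slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
y_fit = slope * x + intercept

print(f"slope = {slope:.4f}, intercept = {intercept:.1f}")
```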

In [None]:
sns.regplot(data=penguins, x='body_mass_g', y='flipper_length_mm')

In [None]:
sns.lmplot(data=penguins, x='body_mass_g', y='flipper_length_mm')

In [None]:
# This raises a TypeError: regplot is an axes-level function and has no `hue` parameter.
# Use lmplot (regplot + FacetGrid) to split the fit by a variable.

sns.regplot(data=penguins, x='body_mass_g', y='flipper_length_mm', hue='species')

In [None]:
sns.lmplot(data=penguins, x='body_mass_g', y='flipper_length_mm', hue='species')

In [None]:
sns.lmplot(data=penguins, x='body_mass_g', y='flipper_length_mm', col='species')

In [None]:
sns.lmplot(data=penguins, x='body_mass_g', y='flipper_length_mm', hue='species', col='sex')

In [None]:
sns.lmplot(data=penguins, x='body_mass_g', y='flipper_length_mm', col='species', hue='sex')

### Residuals

Plotting residuals (the differences between the data and the regression fit) lets us see whether the fit has missed any patterns.

In [None]:
import numpy as np

In [None]:
# generate 100 points from a normal 
# distribution that has mean = 0 and std dev = 3.5
np.random.seed(42)
noise = np.random.normal(0,3.5,100)

x = np.linspace(0,10,100)
y = x**2 + noise

In [None]:
sns.regplot(x=x,y=y)

In [None]:
sns.residplot(x=x,y=y)

In [None]:
sns.regplot(x=x,y=y,order=2)

In [None]:
sns.residplot(x=x,y=y,order=2)
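
The same comparison can be made numerically. The sketch below regenerates the synthetic data from above and uses `np.polyfit` (a stand-in for the fits drawn in the plots) to compute residuals for a first-order and a second-order fit.

```python
import numpy as np

# Same synthetic data as above: a parabola plus normal noise.
np.random.seed(42)
noise = np.random.normal(0, 3.5, 100)
x = np.linspace(0, 10, 100)
y = x**2 + noise

# Residuals = data minus fitted values, for a line and for a parabola.
resid_linear = y - np.polyval(np.polyfit(x, y, deg=1), x)
resid_quadratic = y - np.polyval(np.polyfit(x, y, deg=2), x)

print("residual std, order 1:", resid_linear.std())
print("residual std, order 2:", resid_quadratic.std())
```

The first-order residuals keep the curvature that the straight line missed, so their spread is noticeably larger than that of the second-order residuals.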

In [None]:
sns.regplot(data=penguins, x='body_mass_g', y='flipper_length_mm')

In [None]:
sns.residplot(data=penguins, x='body_mass_g', y='flipper_length_mm')

In [None]:
sns.jointplot(data=penguins, x='body_mass_g', y='flipper_length_mm')

In [None]:
sns.jointplot(data=penguins, x='body_mass_g', y='flipper_length_mm', kind='resid')

In [None]:
g = sns.FacetGrid(penguins, col="species")
g.map_dataframe(sns.residplot, x="flipper_length_mm", y="bill_depth_mm")

In [None]:
g = sns.FacetGrid(penguins, col="species", hue='species')
g.map_dataframe(sns.regplot, x="flipper_length_mm", y="bill_depth_mm")

# Please help us evaluate these workshops:

[Link to review](https://forms.gle/LTmgZ89K8J8Z8BdZ9)