<a href="https://colab.research.google.com/github/connorgrannis/nch_python_workshop/blob/master/Week3_plotting_with_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# __Plotting with Python__

While graphing in SPSS is easy, it doesn't always produce the best looking graphs and it's fairly rigid on the types of graphs it can create.

There are several graphing packages in Python, but the main ones we're going to talk about are *matplotlib* and *seaborn*.

### __Matplotlib__
Matplotlib is like the NumPy of graphing; many other Python graphing libraries, including pandas plotting and seaborn, are built on top of it. While it's not as user-friendly as some of the others, it's helpful to know what's going on under the hood so that you have greater control over libraries like seaborn that are built on it.

Almost all of the functions we'll be using are contained in the *pyplot* module, so we'll import matplotlib like this:

```
import matplotlib.pyplot as plt
```

At the end of last week we used Pandas to create a scatter plot of the Titanic data comparing `Age` and `Fare`.  Let's load the titanic set and do the same graph using matplotlib.

In [0]:
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

titanic = pd.read_csv('https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv')

In [0]:
# scatter using pandas
titanic.plot(kind='scatter', x='Age', y='Fare')

In [0]:
# scatter using matplotlib
plt.scatter(titanic['Age'], titanic['Fare'])

This is a good time to go over positional arguments vs. keyword arguments. 

In the first cell we specify each argument of the function and specify what value we want it to be: `kind='scatter'`, `x='Age'`, `y='Fare'`. We can place these arguments in any order we like within the function because we specified the keyword for each one. For example, 
```
titanic.plot(y='Fare', kind='scatter', x='Age')
```
will do the exact same thing.


This is not true for the second cell however. `plt.scatter()` expects the x variable in the first argument, and the y variable in the second argument. If we were to type out:
```
plt.scatter(titanic['Fare'], titanic['Age'])
```
this would swap the axes, putting `Fare` on the x-axis and `Age` on the y-axis, because the function still treats the first argument as the x variable. These are "positional arguments": if you don't specify the keyword, values are matched to parameters in the order you supply them. Most function parameters accept either form, so `plt.scatter(x=titanic['Age'], y=titanic['Fare'])` also works.
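To make the difference concrete outside of plotting, here is a minimal sketch. The function `describe_point` is a hypothetical helper invented for this illustration; it is not part of pandas or matplotlib.

```python
# Hypothetical helper, used only to illustrate how arguments are matched
def describe_point(x, y, label="point"):
    return f"{label}: x={x}, y={y}"

# Positional: values are matched to parameters by their order
by_position = describe_point(30, 7.25)

# Keyword: values are matched by name, so the order is free
by_keyword = describe_point(y=7.25, x=30)

# Swapping positional values silently swaps their meaning
swapped = describe_point(7.25, 30)

print(by_position)
print(by_keyword)
print(swapped)
```

The first two calls produce the same string, while the third quietly puts the fare where the age should be, which is exactly what happens when the columns are swapped in `plt.scatter()`.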

Notice how the pandas plot added in axis labels, but matplotlib didn't.  We can add them in using the following syntax:

In [0]:
plt.scatter(titanic['Age'], titanic['Fare'])
plt.xlabel('Age')
plt.ylabel('Fare')

Here, we're using the `scatter` function from `plt`. But matplotlib can make all of the traditional graphs:
- scatter
- bar
- line
- box and whisker
- pie charts

and even some less traditional ones:
- 3dplots
- streamplots
- filled curves
- polar plots
- and [more](https://matplotlib.org/3.2.1/tutorials/introductory/sample_plots.html)



Your turn! 

Using the titanic dataset, show the distribution of `Age` using a `histogram`, plot the distribution of `Age` by `Survived` using a `boxplot`, and see if there are differences in ticket price (`Fare`) by `Survived` using a `bar` graph.

In [0]:
# histogram


In [0]:
# boxplot


In [0]:
# bar


With all these options, why should we bother learning about seaborn or any other plotting package? In matplotlib it's possible to change almost every aspect of a graph to get it to look exactly how you want, but other packages have nicer-looking default settings and cleaner syntax.


### Seaborn
Seaborn is one of our favorite graphing packages and looks much better by default.  Again, this is based on matplotlib and it's possible to get an identical graph using matplotlib, but seaborn makes it easier. We often use seaborn to produce the main graph and then use pyplot to tweak the axes, grids, titles, etc.

Let's create the same 3 graphs plus the original scatter using seaborn.  It's common to give seaborn the nickname `sns`.

In [0]:
import seaborn as sns
sns.scatterplot(x='Age', y='Fare', data=titanic)

In [0]:
# histogram


In [0]:
# boxplot


In [0]:
# bar


#### Errorbars
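By default, seaborn draws 95% confidence intervals as its error bars. A 68% interval is approximately the mean plus or minus one standard error, so you can get close to standard-error bars with the following argument. (In seaborn 0.12 and later, `ci` is deprecated in favor of `errorbar`, for example `errorbar=('ci', 68)` or `errorbar='se'`.)


```
ci=68    # 68% confidence interval, roughly +/- 1 standard error
```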



In [0]:
sns.barplot(x='Survived', y='Fare', data=titanic, ci=68)
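Why 68? The sketch below is a rough illustration with made-up fares, not seaborn's own calculation (seaborn estimates its intervals by bootstrapping). Under a normal approximation, the half-width of a central 68% interval is very close to one standard error.

```python
from statistics import NormalDist, mean, stdev

# Made-up fares, for illustration only
fares = [7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08]

n = len(fares)
se = stdev(fares) / n ** 0.5          # standard error of the mean

# Multiplier for the half-width of a central 68% interval (normal approximation)
z68 = NormalDist().inv_cdf(0.5 + 0.68 / 2)

print(f"mean = {mean(fares):.2f}, SE = {se:.2f}")
print(f"68% multiplier = {z68:.3f}")  # close to 1, so mean +/- z68*SE is about mean +/- SE
```

Because the multiplier is so close to 1, a 68% interval is a common stand-in for standard-error bars, although the two are not identical.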

#### Modifying the figure
Now that we have a nice plot, we can use pyplot to tidy things up and make our graph easier to interpret.  Let's change what we're graphing: age goes on the y-axis and survival status on the x-axis.

In [0]:
sns.barplot(x='Survived', y='Age', data=titanic, ci=68)

plt.grid(axis='y', alpha=0.3)   # setting horizontal gridlines and making them semi-transparent
plt.ylim(0, titanic['Age'].max()+10)  # fixing the y-axis to start at 0 and end at ten over the max value
plt.xticks(ticks=range(2), labels=['No', 'Yes'])  # Changing the labels of the ticks on the x-axis
plt.title('Differences in age between survivors and non-survivors', fontsize=12)   # making a title above the graph and setting the fontsize
plt.tight_layout()     # adjusts the padding on the plots to make everything fit better

Try commenting out some of the above lines that modify different aspects of the bar graph and running the cell again. See how the graph changes, and try adding your own modifications!

#### FacetGrids
Sometimes we want to graph several related variables at the same time.  FacetGrids allow us to create subplots within the same figure, which is a great way of displaying related data.

FacetGrids work a little differently than a regular graph.  First, we have to create the grid of subplots, and then we _map_ the plot of the data onto the grid we just made. Let's break that down step by step.

In [0]:
# Create subplots
fg = sns.FacetGrid(data=titanic, row='Survived', col='Sex')

This created a 2 by 2 grid because both `Survived` and `Sex` have two unique values.

In [0]:
# map plots to subplots
fg = sns.FacetGrid(data=titanic, row='Survived', col='Sex')
fg = fg.map(sns.scatterplot, 'Age', 'Fare')   # a scatter plot has no error bars, so no `ci` argument is needed

### Hue
Another, more common, way of combining graphs is using `hue`.  We can show all four groups from above in a single plot.

In [0]:
sns.barplot(x='Survived', y='Age', ci=68, hue='Sex', data=titanic)
plt.xticks(ticks=range(2), labels=['No', 'Yes'])

In [0]:
sns.scatterplot(x='Age', y='Fare', hue='Sex', data=titanic)   # no `ci` here: a scatter plot has no error bars

### Linear Regression Scatter plots
There are two convenient ways to add a regression line to a scatter plot:
* regplot
  * performs a single linear regression fit and plot
* lmplot
  * combines regplot and FacetGrid to allow variable subsets and multiple regression lines

In [0]:
sns.regplot(x='Age', y='Fare', data=titanic, ci=68, color="orange", line_kws={'color': 'cyan'})

In [0]:
sns.lmplot(x='Age', y='Fare', hue='Sex', ci=68, data=titanic)

### Combining graphs
You can also make multiple graphs before displaying them.  Matplotlib will add everything to the same set of axes until you execute the `plt.show()` command.  In Jupyter notebooks, the figure is displayed automatically when the cell finishes running.
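As a small illustration with made-up numbers, the sketch below draws two layers with plain pyplot calls and then inspects the current axes to confirm that both landed in the same place.

```python
import matplotlib.pyplot as plt

# Made-up numbers, for illustration only
plt.figure()
plt.bar([0, 1], [30, 28], alpha=0.3)           # first layer: two bars
plt.scatter([0, 0, 1, 1], [22, 41, 19, 35])    # second layer: drawn on the same axes

ax = plt.gca()                                 # the axes that both calls drew on
print(len(ax.patches), len(ax.collections))    # number of bars, number of scatter layers
```

Both the bars and the scatter layer are attached to the same axes object, which is why they appear overlaid in one figure.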

Let's overlay a swarmplot on a bar graph.

In [0]:
# barplot
sns.barplot(x='Survived', y='Age', data=titanic)

# swarmplot
sns.swarmplot(x='Survived', y='Age', data=titanic)

I think we can all agree that this is a pretty terrible graph.  The point of combining these two graphs was to get a sense of the distribution, but we can't even see half of the scatter points.

Luckily, we can adjust the opacity of the bars and the dots; we can adjust the edges of the bars; and we can adjust the size of the scatter dots.  All of this will hopefully make this graph more useful.

Let's also trim down the size of our dataset just so we can visualize the data better for this example.  In the code below I've changed the `data` to be only the first 300 rows.

In [0]:
# swarmplot
sns.swarmplot(x='Survived', 
              y='Age', 
              data=titanic[:300],
              size=7,       # making the dot size slightly bigger
              alpha=.77,    # adjusting the opacity of the scatter
              )

# barplot
sns.barplot(x='Survived', 
            y='Age', 
            data=titanic[:300],
            linewidth=2.5,
            errcolor='0',   # drawing the error bars in black
            edgecolor='.2', # outlining the bars in dark gray
            alpha=0.25,     # adjusting the opacity of the bars
            )

This is just a quick example of how you can begin to overlay and combine graphs. You can overlay any number and combination of graphs; the possibilities are endless!

# Annotating Graphs
Matplotlib and seaborn also provide functions that allow you to add shapes, text, and more on top of your graph to help you draw attention to a portion of the graph.

### Adding shapes


In [0]:
sns.scatterplot(x='Age', y='Fare', data=titanic)
# the text is passed positionally because its keyword was renamed from `s` to `text` in matplotlib 3.3
plt.annotate(' ', xy=(35, 500), xytext=(35, 400), arrowprops=dict(facecolor='red', width=0.5, headwidth=5))
plt.annotate(' ', xy=(34, 505), xytext=(25, 500), arrowprops=dict(facecolor='red', width=0.5, headwidth=5))
plt.annotate(' ', xy=(37, 510), xytext=(45, 500), arrowprops=dict(facecolor='red', width=0.5, headwidth=5))

Your turn!

Change the arrow style to a [FancyArrowPatch](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.annotate.html).  Also note the use of `dict()` in the `arrowprops` statement above.
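On the `dict()` point: calling `dict()` with keyword arguments is simply another way of writing a dictionary literal. A minimal sketch, reusing the arrow properties from the cell above:

```python
# Two equivalent ways to build the same dictionary
props_call = dict(facecolor='red', width=0.5, headwidth=5)
props_literal = {'facecolor': 'red', 'width': 0.5, 'headwidth': 5}

print(props_call == props_literal)
```

Either form can be passed to `arrowprops`; the `dict()` form just saves typing the quotes around each key.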

Your turn again!

Drop the rows with the three highest `Fare` values.

Click the drop down arrow to the left of the next cell to see a possible solution

### Possible solution to dropping 3 highest values

In [0]:
# drop the three highest fares from above
print(titanic.shape)
rowstodrop = titanic.nlargest(3, 'Fare').index   # index labels of the three highest fares
titanic.drop(rowstodrop, inplace=True)           # drop accepts a whole index, so no loop is needed
print(titanic.shape)

### Adding brackets and p-values to bar graphs

When showing group differences with bar graphs, it's common to do this with brackets. One way to do this is by putting your graph into PowerPoint, adding the brackets, grouping them so everything acts as a single object, and saving this as a picture to use later.  

An alternative is to have Python add the brackets and, optionally, the associated p-values to the graph.  This cuts out the middleman and gives you a graph that's ready to use.

Let's see how it works.

In [0]:
# bar plot that we want to annotate
sns.barplot(x='Survived', y='Fare', data=titanic, ci=68)

In [0]:
# empty lists to store the heights and centers of both bars
heights = []
centers = []

# to save keystrokes and minimize room for error
x = 'Survived'
y = 'Fare'

# create figure
plt.figure()
splot = sns.barplot(x=x, y=y, data=titanic, ci=68)
plt.xticks(ticks=range(2), labels=["No", "Yes"])

# setting the upper range of the y-axis so we'll have room for brackets
plt.ylim(top=titanic[y].max())

# use "patches" to get attributes of each bar
for p in splot.patches:
  heights.append(p.get_height())
  # get the x-coordinate of where the bar starts and add half of the width to get the center
  centers.append(p.get_x() + p.get_width() /2.) # using 2. instead of 2 to make it a float
  
##########
# Let's see what we have so far
print(heights)
print(centers)
print(titanic[y].max())
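To see what the patches loop is doing without any real data, here is a minimal sketch with plain matplotlib and two made-up bar heights. Each bar is a rectangle "patch" whose left edge, width, and height can be read back after drawing.

```python
import matplotlib.pyplot as plt

plt.figure()
bars = plt.bar([0, 1], [22.1, 48.4], width=0.8)   # made-up bar heights

# each bar is a rectangle patch: read its height, and find its center from left edge + half width
heights = [p.get_height() for p in bars.patches]
centers = [p.get_x() + p.get_width() / 2 for p in bars.patches]

print(heights)
print(centers)
```

The centers come out at the x positions the bars were drawn at, which is why they make convenient anchor points for the bracket ends.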

In [0]:
# set the bracket height to an eighth of the max y value above the tallest bar
lineheight = np.array(heights).max() + titanic[y].max()/8

# create brackets out of two vertical lines, and one horizontal line
plt.plot([centers[0], centers[0]], [lineheight-titanic[y].max()/20, lineheight], c='k')      # first tick
plt.plot([centers[0], centers[1]], [lineheight, lineheight], c='k')        # line
plt.plot([centers[1], centers[1]], [lineheight-titanic[y].max()/20, lineheight], c='k')      # end tick

# add text above line
plt.annotate("p-value = ??", (.5, lineheight+titanic[y].max()/20))

# add title
plt.title(f'{y} by {x}')

In [0]:
# put it all together
# empty lists to store the heights and centers of both bars
heights = []
centers = []

# to save keystrokes and minimize room for error
x = 'Survived'
y = 'Fare'

# changing the graphing theme
# (on matplotlib 3.6 and later this style sheet is named "seaborn-v0_8")
plt.style.use("seaborn")

# create figure
plt.figure()
splot = sns.barplot(x=x, y=y, data=titanic, ci=68)
plt.xticks(ticks=range(2), labels=["No", "Yes"])

# setting the upper range of the y-axis so we'll have room for brackets
plt.ylim(top=titanic[y].max())

# use "patches" to get attributes of each bar
for p in splot.patches:
  heights.append(p.get_height())
  # get the x-coordinate of where the bar starts and add half of the width to get the center
  centers.append(p.get_x() + p.get_width() /2.) # using 2. instead of 2 to make it a float
  
# set the bracket height to an eighth of the max y value above the tallest bar
lineheight = np.array(heights).max() + titanic[y].max()/8

# create brackets out of two vertical lines, and one horizontal line
plt.plot([centers[0], centers[0]], [lineheight-titanic[y].max()/20, lineheight], c='k')      # first tick
plt.plot([centers[0], centers[1]], [lineheight, lineheight], c='k')        # line
plt.plot([centers[1], centers[1]], [lineheight-titanic[y].max()/20, lineheight], c='k')      # end tick

# add text above line
plt.annotate("p-value = ??", (.4, lineheight+titanic[y].max()/20))

# add title
plt.title(f'{y} by {x}')

# display plot
plt.show()
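If you find yourself drawing brackets often, the three `plt.plot` calls could be wrapped in a small function. The sketch below is one possible way to do that; `add_bracket` is a hypothetical helper (it is not part of matplotlib or seaborn), and the bar heights are made up.

```python
import matplotlib.pyplot as plt

def add_bracket(ax, x_left, x_right, y, tick, label):
    """Draw a bracket from x_left to x_right at height y, with a label above it.

    Hypothetical helper for illustration; not part of matplotlib or seaborn.
    """
    # two vertical ticks and one horizontal line, drawn as a single path
    ax.plot([x_left, x_left, x_right, x_right],
            [y - tick, y, y, y - tick], c='k')
    # center the label horizontally over the bracket
    return ax.annotate(label, ((x_left + x_right) / 2, y + tick), ha='center')

fig, ax = plt.subplots()
ax.bar([0, 1], [22, 48])      # made-up bar heights
ax.set_ylim(top=80)           # leave room above the bars for the bracket
note = add_bracket(ax, 0, 1, y=60, tick=3, label="p-value = ??")
```

Drawing the bracket as one four-point path instead of three separate lines is a design choice that keeps the corners joined; the result looks the same as the version above.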

### __Choosing Colourmaps__

You may have already noticed that we've been playing around with the colours in some of the graphs we've shown. Matplotlib (and by extension seaborn and any other graphing library that depends on Matplotlib) provides preset "colourmaps", or "cmaps", for applying a colour palette across continuous data. You can also define your own colourmaps interactively in a Jupyter Notebook using the seaborn function `sns.choose_cubehelix_palette()`.

Try running the code cell below and changing the sliders to find a colourmap you like!

In [0]:
sns.choose_cubehelix_palette()

In [0]:
# setting the `as_cmap` argument to True returns a continuous colourmap instead of a discrete set of colours

sns.choose_cubehelix_palette(as_cmap=True)

So what do we do once we've gotten it to look how we like it?

Notice how each of the sliders has a value next to it? We'll take each of those values and assign them to the arguments in the seaborn method: `sns.cubehelix_palette()` and then we'll have our own customized colourmap. Let's do that below:

In [0]:
# fill these in with your own custom values that you chose above
cmap = sns.cubehelix_palette(n_colors=?, start=?, rot=?, gamma=?, hue=?, light=?, dark=?, reverse=?, as_cmap=?)
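For reference, here is what a filled-in call might look like. The values below are arbitrary examples chosen for this sketch, not recommendations, so substitute your own slider settings. `n_colors` is left out because this sketch asks for a continuous colourmap with `as_cmap=True`.

```python
import seaborn as sns

# Arbitrary example values; replace them with the slider values you chose
cmap = sns.cubehelix_palette(start=0.5, rot=-0.75, gamma=1.0, hue=0.8,
                             light=0.85, dark=0.15, reverse=False, as_cmap=True)

# A colourmap maps numbers between 0 and 1 to RGBA colours
print(cmap(0.0))
print(cmap(1.0))
```

Using `as_cmap=True` is convenient when the colourmap will be passed as `palette` for a numeric `hue` variable, since a continuous map does not need one colour per unique value.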

Now that we've defined our cmap, let's use it in a graph with the titanic dataset from earlier. Instead of using the `hue` argument with something binary like 'Survived', try it with a numeric variable that takes more values. Here I used 'SibSp', the number of siblings and spouses each passenger had aboard.

In [0]:
sns.scatterplot(x='Age', y='Fare', data=titanic, hue='SibSp', palette=cmap)

# Interactive Graphs
`Plotly` is another graphing package that creates JavaScript-based graphs and displays them in a web browser.  This allows us to create interactive graphs that can be fun and informative.  A good use case for these graphs could be a dashboard that breaks down different aspects of your lab.

Note that `plotly` will NOT work with Spyder.  It WILL work with Jupyter notebooks, or if you execute it in a traditional `.py` script it will launch the graph in the default browser.

In [0]:
import plotly.graph_objects as go
fig = go.Figure(
    data=[go.Bar(y=[2,1,3])],
    layout_title_text = "Figure Title"
)
fig

On the graph above, hover your cursor over one of the bars to display the coordinates.

Drag your cursor over a section of the graph to zoom in to that area and double-click to zoom back to 100%

Another way to make graphs using plotly is by creating a blank graph and adding in `traces`, or graph instances.

This is a good method to use if you plan on adding graphs iteratively through a loop.

In [0]:
fig = go.Figure()
fig.add_trace(go.Bar(x=[1, 2, 3], y=[1, 3, 2]))
fig.show()

In [0]:
fig.add_trace(go.Bar(x=[4, 5], y=[5,3]))
fig.show()

Single-clicking an entry in the legend hides that trace on the graph.

### Subplots
Plotly can also make interactive subplots, but notice that unlike matplotlib and seaborn, the axes don't have to match.  Additionally, using the cursor to filter a view will only affect that subplot.

In [0]:
# making subplots
from plotly.subplots import make_subplots
fig = make_subplots(rows=1, cols=2)
fig.add_trace(go.Scatter(y=[4,2,1], mode="lines"), row=1, col=1)
fig.add_trace(go.Bar(y=[2,1,3]), row=1, col=2)
fig.show()

### Express
Plotly comes with a module called `express` whose syntax looks more familiar.

In [0]:
import plotly.express as px
iris = px.data.iris()
fig = px.scatter(iris, x="sepal_width", y="sepal_length", color="species")
fig.show()

Just like the graph_object graphs, we can add traces to the express graphs

In [0]:
fig.add_trace(
    go.Scatter(x=[2,4], y=[4,8], mode='lines', line=go.scatter.Line(color='gray'),
               showlegend=False)
)
fig.show()

### Adjusting graph opacity


In [0]:
fig = go.Figure(go.Bar(x=[1, 2, 3], y=[1, 3, 2], opacity = 0.4))
fig.show()

Let's make the edge lines darker

In [0]:
fig.data[0].marker.line.width=4
fig.data[0].marker.line.color='black'
fig.show()

# Graphing with Coordinates

Here we present some other ways you can create interesting graphs using Python.
Below, we created these plots using only `plt.scatter()` using a freely available database on all reported car accidents in the US (1, 2). If you're interested, you can take a look at the code we used to create these [here](https://github.com/connorgrannis/nch_python_workshop/tree/master/Example%20code), under "accidents_plot.py". 

We've opted not to show how these were graphed in real time as there are about 3 million datapoints in total, so plotting takes a little time.
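To give a feel for the general idea without the full dataset, here is a minimal sketch. It is not the code from "accidents_plot.py"; it scatters synthetic, randomly generated longitude and latitude values, so no map shape will emerge the way it does with real accident locations.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic coordinates, NOT the accident dataset: random points in a
# longitude/latitude box roughly covering the contiguous US
rng = np.random.default_rng(0)
lon = rng.uniform(-125, -67, size=5000)
lat = rng.uniform(25, 49, size=5000)

plt.figure(figsize=(8, 5))
# small, semi-transparent markers keep dense regions readable
points = plt.scatter(lon, lat, s=1, alpha=0.3)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
```

With real data, plotting longitude on the x-axis and latitude on the y-axis is enough for dense regions to trace out coastlines and highways on their own.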

If the images below don't load, try loading this page in Google Chrome or double click this cell and the hyperlink will take you to a working image.

![2016](https://drive.google.com/uc?id=1Ghf4nphlq-5zJcaRJeaVadn4JI6f8OC_)

![2017](https://drive.google.com/uc?id=1c3kjtkldo3iulOCc5HqWm4P1p45oYjto)

![2018](https://drive.google.com/uc?id=1p6flaDZ30WAciIRC7n3_0JLJ_oF5ei1W)

![2019](https://drive.google.com/uc?id=110HBM9nY6FMT4P_W2UkA8K526sObkLyZ)


1. Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. "A Countrywide Traffic Accident Dataset." 2019.

2. Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.

Here are some images of plots we've done with some of our lab's brain data.  The syntax is pretty specific to our lab directories and has PHI, so we're not posting the code here.  But if this is something that you want to try with your lab, you can reach out to us and we'd be more than happy to help.

![2D connectome plot](https://drive.google.com/uc?id=15nxsBIP8jUMOb3HEpylLUIBt3rZNBLE1)


Below is an interactive 3D version of the same plot. You'll have to click the link and download the HTML file first, then double-click it to view it in your browser:

[3D connectome plot](https://drive.google.com/uc?id=1HytuzRDtWH4jtT00vUz-BpxT1B8D8V0S)

Note that the interactive 3D brain is graphed partly using plotly.

# Review:
We've talked about how to graph data using pandas, matplotlib, seaborn, and plotly. 

You should now know how to:
- Make traditional graphs (scatter, bar, box)
- Adjust the opacity of data points
- Combine multiple graphs
- Modify axis labels
- Add shapes and text to graphs
- Make interactive graphs

We've also seen how Python can create nice graphs using coordinates, either longitude and latitude or brain coordinates.

# Practice #1
Based on the graphs above, there's a positive relationship between `Fare` and `Survived`, such that the more money a person spent on their ticket, the more likely it was that they survived.  But could that be based on the location of their cabin?  Unfortunately, `Cabin` has a lot of missing data, but we should still have a reasonable sample size.  Additionally, each `Cabin` value combines a section letter and a room number.

Let's create a new column in the titanic dataset that describes which *section* their cabin was in.  In order to do this, you'll need to do the following:
1. Use the 'Cabin' series
2. Drop the NaNs
3. Only take the first cabin assigned to that person
4. Only keep the first character in their cabin assignment
5. Save it as a new column in the dataframe.

After you've created that new series:
1. Use groupby to group the dataset on the sections
2. Display a barplot of the fare for each section
3. Print the mean fare for each section using only one line of code

# Practice #2
Add `Sex` to the annotated bar graph from earlier and connect any bars that visually look different, since we haven't talked about performing stats yet (next week). Explore the [documentation](https://matplotlib.org/3.2.1/gallery/style_sheets/style_sheets_reference.html) to choose a different theme that you like.