## Day 2: Diving Deeper into Pandas
<img src="https://pbs.twimg.com/media/B_1KzLlUYAIFadB.jpg:large" width="200" height="200" />


Welcome to Day 2. Python programmers often organize their code into chunks called modules. A module contains a set of related commands to accomplish a task. Today we are going to learn two of the most useful modules for data manipulation and plotting, `pandas` and `seaborn`. 

First we will use a simple data set on mouse weights taken from males and females of different strains to illustrate how to obtain simple descriptive statistics, group the data, and plot. 

Then, we will look at some imaging statistics data from Imaris and explore some more complex operations on the data.

## 2.1 Loading Data using `read_csv`

To start, we'll load up two modules with the `import` statement. The first is the `pandas` module, which will let us manipulate 2D tables. We refer to the `pandas` module here as `pd` as an abbreviation. The second module we load is the `numpy` module, mostly for the arithmetic functions built into it.

When you load a module using `import`, all of the functions it provides, such as `np.mean`, become accessible to you. Modules and import statements help programmers avoid naming conflicts because you can use short, straightforward names for functions and variables without worrying that they're already taken. MATLAB code, by contrast, traditionally keeps most functions in one shared namespace, which can make it harder to tell where a given function comes from.

First, let's load the data up using the `pd.read_csv` function. If you want to see where the dataset is, it's in the `day-2/data/` folder. The `read_csv` function is a part of the `pandas` module, so we have to include the `pd.` in front of it so the computer knows to look in the `pandas` module to find and use this function. By using `pd.read_csv`, we return what is called a `pandas` `DataFrame`.   Most of our manipulations and plotting are going to be done on this `DataFrame`.

In [None]:
import pandas as pd
import numpy as np

mousedata = pd.read_csv("data/mouseData.csv")
mousedata
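To see what `read_csv` does without needing a file on disk, here is a minimal sketch that parses a small CSV held in a string. The column names mirror the ones used in this lesson, but the values are made up for illustration and are not taken from `mouseData.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV text; the values are invented for illustration only.
csv_text = """Sex,Strain,Weight
M,B6,25.1
F,B6,20.3
M,D2,27.8
F,D2,22.4
"""

# read_csv accepts a file path or any file-like object, such as StringIO.
toy_mice = pd.read_csv(io.StringIO(csv_text))

print(type(toy_mice))   # the result is a pandas DataFrame
print(toy_mice.shape)   # (number of rows, number of columns)
print(toy_mice.dtypes)  # one data type per column
```

The first line of the text is treated as the header, which is why `Sex`, `Strain`, and `Weight` become column names rather than data.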

A really useful function that we can immediately call on `mousedata` is `describe`. `describe` will return summary statistics on the numerical variables.

In [None]:
mousedata.describe()

## 2.2 A Quick Intro to functions

You can think of a function as a bit of reusable code. The important thing is that you need to define the inputs (what goes into the function) and the output (what comes out of the function).

Try and run the following function. What does it do?

In [None]:
##always begin with "def" when defining a new function, 
##have an interface defined in the "()", 
##and the definition ends in ":"
def square_x(x):
    out = x * x
    return out

square_x(40)

Let's look a bit more closely at how the `square_x` function is written. It begins with the word `def` (short for define) followed by the name of the function, a variable name in parentheses, and a colon. The variable is the input to the function. The colon is also a necessary part of the function definition, and it begins the code block that defines what the function does.

The rest of the function consists of this code block. In Python, this block must be indented. The last line in the block contains the word 'return' followed by a variable name. This variable is the output of the function.
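As a second illustration of the same anatomy (`def`, a name, an input in parentheses, a colon, an indented block, and a `return`), here is a small sketch. The function name `fahrenheit_to_celsius` is just an example chosen for this illustration and is not used elsewhere in the lesson.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from Fahrenheit to Celsius."""
    # The indented lines form the body of the function.
    temp_c = (temp_f - 32) * 5 / 9
    # The value after `return` is the output of the function.
    return temp_c

boiling = fahrenheit_to_celsius(212)
print(boiling)  # 100.0
```

The string in triple quotes is a docstring. It is optional, but it is what `help()` displays for your function.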

## Exercise

Make a new function called `cube_y` that takes `y` as an input, and returns the cube of `y`. 

Run `cube_y(2)` to test out your function.

In [None]:
## space for your answer here



Most of the time we will want a function to take multiple inputs. We can do this by listing more inputs, separated by commas, in the function interface.

In [None]:
def mult_xy(x, y):
    out = x * y
    return out

mult_xy(10, 5)

## 2.3 Grouping

Another great `pandas` function is `groupby()`. It will group a `DataFrame` by one or more columns, and let you iterate through each group. 

In [None]:
group_mouse = mousedata.groupby('Sex')
group_mouse

**Question**: when we take the mean of each group later in this section, why would the result only include `Weight`? Does it make sense to do `mean_x` on our `Strain` variable?

In [None]:
group_mouse.get_group('M')
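To make "iterate through each group" concrete, here is a minimal sketch using a small made-up table. The column names follow this lesson, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical data; values are invented for illustration only.
toy_mice = pd.DataFrame({
    "Sex": ["M", "F", "M", "F"],
    "Strain": ["B6", "B6", "D2", "D2"],
    "Weight": [25.1, 20.3, 27.8, 22.4],
})

# Looping over a groupby yields (group label, sub-DataFrame) pairs.
group_sizes = {}
for sex, group in toy_mice.groupby("Sex"):
    group_sizes[sex] = len(group)
    print(sex)
    print(group)

print(group_sizes)
```

Each `group` is an ordinary `DataFrame` holding only the rows for that label, which is the same thing `get_group` returns for a single label.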

What can you do with `groupby()`? One way to use it is to get *aggregate measures* for each group. For example, if we wanted the mean weight by sex, we can use the `apply` method on the grouped data. First we define a simple function called `mean_x` that returns the mean (we could have just used `np.mean` directly, but wrapping it in our own function makes the code a little easier to understand).

In [None]:
def mean_x(x):
    return np.mean(x)

Then we can use the `apply()` function to get the mean by sex.

In [None]:
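# Select the numeric Weight column first; a mean is not defined for text columns such as Strain
mousedata.groupby('Sex')['Weight'].apply(mean_x)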

Note that the `apply()` method takes a function as input and applies it to each group, in this case the male weights and the female weights.
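Any function that reduces a group to a single number can be passed to `apply()`. The sketch below uses a hypothetical helper, `weight_range`, on made-up data, and compares it with the built-in `mean()` shortcut available on grouped data.

```python
import pandas as pd

# Hypothetical data; values are invented for illustration only.
toy_mice = pd.DataFrame({
    "Sex": ["M", "F", "M", "F"],
    "Weight": [25.0, 20.0, 28.0, 22.0],
})

def weight_range(x):
    # x is the Series of weights belonging to one group
    return x.max() - x.min()

# apply() calls weight_range once per group and collects the results.
range_by_sex = toy_mice.groupby("Sex")["Weight"].apply(weight_range)

# Common aggregations such as the mean are also available directly.
mean_by_sex = toy_mice.groupby("Sex")["Weight"].mean()

print(range_by_sex)
print(mean_by_sex)
```

Both results are indexed by the group label, so a single value can be looked up with, for example, `mean_by_sex["M"]`.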

## Exercise

Define a function to calculate the standard deviation (in numpy the function you need will be called `np.std`) and apply it to return the standard deviation of weights by `Strain`.

In [None]:
## Space for your answer here.

## 2.4 Plotting

Let's look at some ways to visualize our `DataFrame`. We are going to use a module called `seaborn` to do our plotting, because the default plot options are pretty good, so we have to do less customization.

Let's just plot the distribution of weights as a histogram. How many bins does our histogram have?

In [None]:
## import the two modules we need: matplotlib and seaborn
import matplotlib.pyplot as plt 
import seaborn as sns

##we need this line in our notebook to make matplotlib/seaborn work with Jupyter
%matplotlib inline

In [None]:
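# Histogram of weights
# Note: distplot is deprecated as of seaborn 0.11; in newer versions,
# sns.histplot(mousedata.Weight, kde=True) draws a similar plot
sns.distplot(mousedata.Weight)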

## Exercise

Look up the help for `sns.distplot`. Note that there is a long list of input variables that are set as equal to `None` or `False`. This means that these are optional input variables that, unless defined, will run at their default definitions.

To practice utilizing these optional inputs, change the number of bins to 40 in the plot.

In [None]:
## Space for your answer here.

help(sns.distplot)
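The optional inputs described in the exercise above work the same way in functions you write yourself. Here is a minimal sketch, where `describe_weight` and its parameters are hypothetical names made up for this illustration.

```python
def describe_weight(weight, unit="g", verbose=False):
    """Format a weight, with an optional unit and verbosity setting."""
    # unit and verbose keep their default values unless the caller overrides them.
    if verbose:
        return f"The weight is {weight} {unit}"
    return f"{weight} {unit}"

# Only the required input is given, so the defaults are used.
default_label = describe_weight(25)

# Optional inputs are overridden by name.
custom_label = describe_weight(25, unit="oz", verbose=True)

print(default_label)  # 25 g
print(custom_label)   # The weight is 25 oz
```

Passing optional inputs by name, as in `unit="oz"`, means you do not have to remember their order.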

## 2.5 Boxplots

Boxplots are super useful for comparing the distribution of a variable across groups, since they show the median and quartiles of each group. Here we use the `sns.boxplot` function and group by `Sex`.

In [None]:
# Boxplot
sns.boxplot(x = "Sex", y="Weight", data=mousedata)

# Set title with matplotlib
plt.title('Mouse weight by sex')
plt.show()

## Exercise 

Create boxplots showing the weight data measured from the 2 different strains, B6 and D2. Make sure to add a title to your plot, such as "Weight by Strain".

In [None]:
##Space for your answer here


## 2.6 Faceting

Faceting is one of the most powerful ways of exploring data. For example, we can see whether there is a `Strain` by `Sex` effect by producing *conditional* histograms, one for each combination of `Strain` and `Sex`.

In [None]:
g = sns.FacetGrid(mousedata, col="Strain", row="Sex")
g = g.map(sns.distplot, "Weight", bins = 20)

## A more complicated example

We're going to do much more manipulation and visualization with `pandas` using data taken from Imaris. Imaris is image analysis software with many sophisticated functions. Below is a confocal image taken of inner hair cells stained with antibodies against CtBP2 (a pre-synaptic ribbon marker), GluR2 (a post-synaptic receptor) and Myosin VIIa (which labels the entire hair cell). There are three color channels (red, green, and blue) which indicate the intensity of the staining for CtBP2, GluR2 and Myosin VIIa, respectively.


## The data

Up to 25 auditory nerve fibers synapse onto individual inner hair cells in
normal-hearing individuals. However, these synapses can be permanently lost due
to aging, exposure to noise or ototoxic drugs.  In experiments that study
hearing loss, we need a way of quantifying the number of synapses per inner
hair cell.

One approach is to dissect the cochlea out of the experimental animals and use
whole-mount immunohistochemistry to label the tissue with antibodies for
pre-synaptic ribbons (CtBP2), post-synaptic receptors (GluR2) and cytoskeleton
(Myosin VIIa). The distribution of these proteins can be captured by taking a series of
two-dimensional images at various depths in the tissue.  These images are then
"stacked" to create a three-dimensional image known as a Z-stack (since the
third dimension is commonly referred to as the Z-axis).

<table>
	<tbody>
		<tr>
			<td>1A. CtBP2 (pre-synaptic ribbon)</td>
			<td>1B. GluR2 (post-synaptic glutamate receptor)</td>
		</tr>
		<tr>
			<td><img src="../day-4/data/CtBP2.png" /></td>
			<td><img src="../day-4/data/GluR2.png" /></td>
		</tr>
	</tbody>
</table>

## The problem

A functional inner hair cell synapse requires both a pre-synaptic ribbon and a
post-synaptic glutamate receptor. The next step in our analysis is to determine
whether each CtBP2 punctum is near a GluR2 label. 

This dataset was analyzed using Imaris to identify all CtBP2 puncta (white dots
in fig. 2a). If you look closely at the composite (fig. 2b), you'll see that
not all puncta have a glutamate receptor patch next to them! We
should not be counting these for the purpose of analysis. So, we need to find a
way to detect these false hits and eliminate them.

<table>
	<tbody>
		<tr>
			<td>2A. CtBP2 puncta</td>
			<td>2B. CtBP2 puncta overlaid on GluR2</td>
		</tr>
		<tr>
			<td><img src="../day-4/data/CtBP2+points.png" /></td>
			<td><img src="../day-4/data/CtBP2+GluR2+points.png" /></td>
		</tr>
	</tbody>
</table>

One approach is to extract a fixed volume around each CtBP2 punctum (e.g., a 1 µm
cube) and quantify the amount of GluR2 label in the volume. But we don't know
very much about the format of the data. We need to do a little exploration first.

We used Imaris to detect all the "spots" in the CtBP2 (red) channel and compute some statistics about these spots. We've extracted the statistics file from the Imaris file into `csv` format just to make things easier. Just know that there are routines to extract this information from the file. Today we will practice exploring and visualizing the Imaris data, and on Day 4 we will return to the question of how to quantify the amount of GluR2 label in a fixed volume around each CtBP2 punctum.

In [None]:
point_stats = pd.read_csv("data/points_statistics.csv")

## 2.6 Exploring the Imaris Statistics

Because this data file was automatically generated by Imaris, we first need to figure out how it is organized.

We can start by taking a look at the first few rows of our summary table using `point_stats.head()`. In general, this is a really good habit to get into: our data may or may not have a header, and we may have loaded the data incorrectly.

In [None]:
##Show first few rows
point_stats.head()

What are some things we notice? Well, there appear to be some data that describe the entire sample (such as "Total Number of Spots") as well as data for localized points identified by Imaris in the red channel (such as "Area").  

In [None]:
##show last few rows
point_stats.tail()

In [None]:
##show dimensions of data frame
point_stats.shape

We can also see that the various traits describing a given spot (such as "Area" and "Volume") are not columns, but are instead listed under the categorical column `Name`. To see the full list of names, use the `unique()` method: 

In [None]:
point_stats.Name.unique()

Looking back to the first few rows of data, it appears that ID_Object of -1 designates statistics that describe the entire sample. Let's confirm by viewing all rows with ID_Object of -1:

In [None]:
point_stats[point_stats["ID_Object"]==-1]

Let's look at all of the statistics that were collected for a single spot identified by Imaris, starting with the first:

In [None]:
point_stats[point_stats["ID_Object"]==0]

Let's look at the raw data for Diameter of spots in the X dimension ("Diameter X"):

In [None]:
point_stats[point_stats["Name"]=="Diameter X"].head(20)
#OR
point_stats[point_stats["ID_StatisticsType"]==237].head(20)
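The row selections above all follow one pattern: a comparison produces a column of `True`/`False` values, and that column picks out the rows to keep. Here is a minimal sketch on a made-up long-format table. The column names match `point_stats`, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical long-format table; values are invented for illustration only.
toy_stats = pd.DataFrame({
    "ID_Object": [0, 0, 1, 1],
    "Name": ["Area", "Volume", "Area", "Volume"],
    "Value": [1.2, 0.3, 1.5, 0.4],
})

# The comparison gives one True/False value per row.
is_area = toy_stats["Name"] == "Area"
print(is_area)

# Indexing with that boolean Series keeps only the True rows.
area_rows = toy_stats[is_area]

# Conditions can be combined with & (and) or | (or); each needs parentheses.
area_of_spot_1 = toy_stats[is_area & (toy_stats["ID_Object"] == 1)]
print(area_of_spot_1)
```

Storing the condition in a named variable such as `is_area` can make longer selections easier to read.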

If we only want to view the `ID_Object`, `Value`, and `Name` columns from this `DataFrame`, we use the `loc` indexer (note the square brackets rather than parentheses): 

In [None]:
point_stats.loc[:,["ID_Object", "Value", "Name"]]
#OR use iloc to refer to the columns by their integer position (the i in iloc is short for 'integer')
point_stats.iloc[:,[1,3,6]].head(20)
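The difference between `loc` (labels) and `iloc` (integer positions) is easiest to see when the row labels are not simply 0, 1, 2. The sketch below uses a small made-up table with hypothetical values.

```python
import pandas as pd

# Hypothetical table whose row labels are 10, 20, 30 rather than 0, 1, 2.
toy = pd.DataFrame(
    {"Name": ["Area", "Volume", "Area"], "Value": [1.2, 0.3, 1.5]},
    index=[10, 20, 30],
)

# loc selects by label: the row labelled 20, the column named "Value".
by_label = toy.loc[20, "Value"]

# iloc selects by position: the second row, the second column.
by_position = toy.iloc[1, 1]

# Slices differ too: iloc excludes the end position, loc includes the end label.
first_two_by_position = toy.iloc[:2]
up_to_label_20 = toy.loc[:20]

print(by_label, by_position)
```

Here both lookups return the same cell, because the row labelled 20 happens to be the second row.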

## Exercise

Use our `mean_x` function to return the mean `Intensity Max X` across the dataset.

In [None]:
## Space for your answer here



## 2.7 Pivoting

Now let's create a DataFrame that makes it easier to view the statistics Imaris has collected for each identified spot in the red channel. We will call this DataFrame `point_stats_matrix`. To do this, use the `pivot()` method, which reshapes data based on column values. This method is extremely useful in transforming data from *long* format to *wide* format. 

The `pivot` method takes three arguments: `index`, which you can think of as being the rows of the data, `columns`, which specifies what columns should exist in the data, and `values`, which holds the actual numerical values we want in each cell.

In [None]:
point_stats_matrix = point_stats.pivot(index='ID_Object', columns='Name', values='Value')
point_stats_matrix.head()
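To see the long-to-wide reshaping on something small enough to check by eye, here is a minimal sketch with a made-up table. The column names match `point_stats`, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical long-format table: one row per (spot, statistic) pair.
toy_long = pd.DataFrame({
    "ID_Object": [0, 0, 1, 1],
    "Name": ["Area", "Volume", "Area", "Volume"],
    "Value": [1.2, 0.3, 1.5, 0.4],
})

# Wide format: one row per spot, one column per statistic.
toy_wide = toy_long.pivot(index="ID_Object", columns="Name", values="Value")

print(toy_long)
print(toy_wide)
```

Four rows of long data become two rows by two columns. Note that `pivot` raises a `ValueError` if the same `index`/`columns` pair appears more than once; in that situation `pivot_table`, which aggregates duplicates, is one alternative.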

Remember that the statistics for the entire data set (including "Number of spots per time point" and "Total number of spots") have an ID_Object of -1. Let's remove this row:

In [None]:
point_stats_matrix = point_stats_matrix.drop(-1)
point_stats_matrix.head(20)

In [None]:
point_stats_matrix.describe()

## 2.8 Plotting our DataFrame for further exploration

Next let's try some simple visualization, starting with a histogram of area measurements for the spots. Remember that we already imported `seaborn` and `matplotlib` above in order to use plotting functions contained in these modules, so we don't need to import them again before using the functions below. As a reminder, when we imported, we abbreviated the `seaborn` module as `sns`. 

In [None]:
sns.distplot(point_stats_matrix.Area)

How about a boxplot of Area values?

In [None]:
sns.boxplot(y="Area", data=point_stats_matrix)

## Exercise

Take a look at the help for `lmplot` below and make a scatterplot comparing `Intensity Max Z` on the x axis against `Volume` on the y axis. 

In [None]:
help(sns.lmplot)



In [None]:
sns.lmplot(x='Intensity Max Z', y='Volume', fit_reg=False, data=point_stats_matrix)

## 2.9 Filtering

Next we will discuss filtering. The scatterplot you created in the exercise above shows a smattering of points with an unusually large volume. Perhaps we decide that we don't trust that these are isolated points and therefore should exclude these outliers from our dataset. To do this, we will create a DataFrame named `filtered_points` that only includes spots with a volume of 0.8 or less:

In [None]:
filtered_points = point_stats_matrix[point_stats_matrix.Volume <= 0.8]

Now create another scatterplot to confirm that your filter worked:

In [None]:
#space for new scatterplot



**Question**: what is the output of `point_stats_matrix.Volume <= 0.8`? Try it out by running the cell below. 
    
How does this help us select the rows we want out of `point_stats_matrix`? (Hint: Think about what `True` and `False` mean.)

In [None]:
point_stats_matrix.Volume <= 0.8

## Exercise

Filter the `point_stats_matrix` dataset to have `Intensity Center X` > 10000 and assign it to `psm10000`. (Because of the spaces in the column name, you will have to use `point_stats_matrix['Intensity Center X']` to access the column.)

Make a scatter plot of the X and Y Intensity Centers to confirm that your filtering worked. 

In [None]:
##space for your answer here.

## Computing a new column based on other columns

Pandas becomes especially powerful once you start adding new columns that are calculated from other columns.
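As a minimal sketch of the idea, the example below builds two new columns from existing ones on a small made-up table. The new column names `Area per Volume` and `Is Large` are hypothetical, as are the values.

```python
import pandas as pd

# Hypothetical wide-format table; values are invented for illustration only.
toy_wide = pd.DataFrame(
    {"Area": [1.0, 1.5, 2.0], "Volume": [0.25, 0.5, 1.0]},
    index=pd.Index([0, 1, 2], name="ID_Object"),
)

# Arithmetic between columns is done row by row.
toy_wide["Area per Volume"] = toy_wide["Area"] / toy_wide["Volume"]

# A comparison gives a True/False column that can be used for filtering later.
toy_wide["Is Large"] = toy_wide["Volume"] > 0.8

print(toy_wide)
```

Assigning to a column name that does not exist yet creates the column; assigning to an existing name overwrites it.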

## 2.10 Getting data out

What if you want to save the `point_stats_matrix` DataFrame as its own csv file? Try running the code below. Where did it write the dataset?

In [None]:
point_stats_matrix.to_csv("data/point_stats-mod.csv")

There is also support for reading and writing Excel files if you need it: http://pandas.pydata.org/pandas-docs/stable/io.html#excel-files
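When a saved file is read back in, the row labels come back as an ordinary column unless you say otherwise. The sketch below shows a round trip on a small made-up table, writing to a temporary folder so nothing in `data/` is touched. The file name `toy_stats.csv` and the values are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical table; values are invented for illustration only.
toy_wide = pd.DataFrame(
    {"Area": [1.0, 1.5], "Volume": [0.25, 0.5]},
    index=pd.Index([0, 1], name="ID_Object"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_stats.csv")

    # to_csv writes the index (ID_Object) as the first column by default.
    toy_wide.to_csv(out_path)

    # index_col tells read_csv to turn that column back into the row labels.
    reloaded = pd.read_csv(out_path, index_col="ID_Object")

print(reloaded)
```

If you do not want the row labels in the file at all, `to_csv` also accepts `index=False`.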

## 2.11 What you learned today

Congrats on getting this far! You have seen lots of features of Pandas and Seaborn that let you manipulate and visualize data. 

1. `groupby`
2. Filtering
3. Boxplots and Scatterplots
4. Faceting
5. Pivoting data