# Intro to Pandas and Matplotlib Part II

## Quiz 5 Answers

Question 1

In [None]:
def func1(x):
    return(x*x)
def func2(x):
    return(x*x*x)
def func3(x):
    return(func1(x), func2(x))
print(list(map(func3, range(1,3)))) 

Question 2

In [None]:
print(sum(filter(lambda x: x>3 and x<7, range(1,10))))

Question 3

In [None]:
from functools import reduce
x = [[1,3], [5,7], [6,4]]
print(reduce(min, map(max,x)))

Remember to install pandas and matplotlib after activating your environment.

In [None]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

## Background

Last class period, we filtered a dataset by q-value and fold-change.  Today, we want to explore some additional features of the dataset that the authors left unexplored, mainly whether there are patterns in the cis-regulatory features of the mRNAs the authors selected.  Remember that the selected mRNAs came from an experiment in which the authors over-expressed a protein.  Without diving into the intricacies of this particular field of research, this protein is believed to regulate mRNAs via their 5'UTR.  In other studies, researchers have found that these mRNAs had GC-rich 5'UTRs or overly long UTRs.  Today we are going to see if any of those patterns hold up in this mRNA set.

## Review

I have given you two datasets.  One is the original supplemental table from last week.  The other was downloaded from the UCSC Table Browser (and then sort of cleaned up) and includes the 5'UTR length and GC content of genes.

In [None]:
genelist = pd.read_table('Supplement_table02.txt', sep='\t', index_col=0)

In [None]:
genelist.head()


### Your Turn

Pull out the rows where the gene symbol is 'ACOT7'.  Create a mask that evaluates to True for the ACOT7 rows.

Use the mask you created to index into the genelist dataframe.
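If boolean masks are still fuzzy, here is a minimal sketch of the idea on made-up data. The frame `toy` and its column names are hypothetical and have nothing to do with the real genelist table.

```python
import pandas as pd

# Made-up toy data; the column names 'symbol' and 'score' are hypothetical.
toy = pd.DataFrame({'symbol': ['A', 'B', 'A', 'C'],
                    'score': [1.0, 2.0, 3.0, 4.0]})

# Comparing a column to a value gives one True/False per row.
mask = toy['symbol'] == 'A'

# Indexing with the mask keeps only the rows where it is True.
selected = toy[mask]
print(selected)
```

The same two steps (build the mask, then index with it) should carry over to the genelist dataframe.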

## Axis attributes and Groupby

Notice how there are multiple rows where the gene name is the same?  We are going to average those rows this time using the groupby method.  First we are going to play with the axis argument of mean, and then apply mean only to rows that share a gene name.

In [None]:
test_frame = pd.DataFrame([{'a':1, 'b':2, 'c':3}, {'a':2, 'b':4, 'c':6}])
test_frame

If we want to take the means of the dataframe, we can do this either row-wise or column-wise using the axis argument.  Axis 0 runs down the rows, so mean(axis=0) returns one value per column; axis 1 runs across the columns, so mean(axis=1) returns one value per row. This statement is confusing as hell, so I refer you to Stack Overflow for a nice explanation: [axis definition in pandas](http://stackoverflow.com/questions/22149584/what-does-axis-in-pandas-mean)

![axis](Screen Shot 2016-10-05 at 12.32.56 PM.png)

In [None]:
test_frame.mean(axis=0)


In [None]:
test_frame.mean(axis=1)

If I want to filter my genelist dataframe so that I keep only rows that have more than 2 non-null values, how would I go about doing it?

Step 1: Check out the count method, which returns the number of non-null (aka anything that's not NaN) items.  You will have to use the correct axis in order to get one count per row.

Step 2: Use your answer from step 1 to create a boolean mask that picks out rows with a count greater than 2.

Step 3: Index into the genelist dataframe with your boolean mask from step 2 and save the result in a variable named filtered.
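The three steps can be sketched on a small made-up frame before you try them on genelist. Everything here (the frame `toy`, its labels, and the variable names) is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data with some missing values; labels are hypothetical.
toy = pd.DataFrame({'s1': [1.0, np.nan, 3.0],
                    's2': [np.nan, np.nan, 2.0],
                    's3': [5.0, 1.0, 4.0],
                    's4': [2.0, np.nan, 1.0]},
                   index=['g1', 'g2', 'g3'])

# Step 1: count the non-null values in each row (axis=1 gives one count per row).
counts = toy.count(axis=1)

# Step 2: boolean mask for rows with more than 2 non-null values.
keep = counts > 2

# Step 3: index into the frame with the mask.
kept = toy[keep]
print(kept)
```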

## Groupby

In [None]:
genelist.groupby('Gene symbol')

Groupby returns a GroupBy object.  To view it, you can use the groups attribute, which returns a dictionary whose keys are the unique group values and whose values are the axis labels belonging to each group.

In [None]:
genelist.groupby('Gene symbol').groups

Once you've created a groupby object, any method you call on it is applied to each group separately.

In [None]:
# .ix was removed in pandas 1.0; .loc does the same label/boolean selection here
genelist.loc[genelist['Gene symbol']=='ACOT7',:]

In [None]:
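# numeric_only=True skips text columns, which newer pandas no longer drops silently
condensed = genelist.groupby('Gene symbol').mean(numeric_only=True)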


The next cell shows that Gene symbol is now the index.

In [None]:
# .ix was removed in pandas 1.0; use .loc for label-based lookup
condensed.loc['ACOT7',:]
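To see the whole groupby pattern in one place, here is a small sketch on made-up data. The frame `toy` and its column names are hypothetical.

```python
import pandas as pd

# Made-up data: gene 'A' appears twice, like the repeated gene symbols in genelist.
toy = pd.DataFrame({'gene': ['A', 'B', 'A'],
                    'ratio': [1.0, 2.0, 3.0]},
                   index=['p1', 'p2', 'p3'])

grouped = toy.groupby('gene')

# groups maps each unique gene to the row labels that belong to it.
groups = grouped.groups
print(groups)

# mean() averages within each group; the grouping column becomes the index.
averaged = grouped.mean()
print(averaged)
```

The two 'A' rows collapse into a single row holding their average, and `averaged` is indexed by gene rather than by the original row labels.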

## Merging two dataframes

We want to look at the UTRs of genes that are in our genelist.  We also want to see if our selected genes have UTRs that are different from the rest of the genelist.

There are many ways to smoosh together two dataframes (or more).  I will refer you to the documents which have pictures and stuff for all of the different ways to perform this operation [merge join concat](http://pandas.pydata.org/pandas-docs/stable/merging.html).  Today we will cover merging.

There are 4 main types of merging in pandas: outer, inner, right, and left.

![merging](merge_methods.png)

We are interested in an inner join, where only the genes in common between the two dataframes are kept and all others are discarded.  So first we need to find at least one column (or the index) that both dataframes have in common to merge on.  Then we need to decide what arguments to give the merge function, that is, how to tell it which columns to merge on.

In [None]:
features = pd.read_table('longest_isoform_length_GC_UTR.txt', sep='\t', index_col=0)

In [None]:
features.shape

In [None]:
features.head()

In [None]:
condensed.shape

In [None]:
top_genes = pd.merge(condensed, features, how = 'inner', left_index = True, right_index=True)

In [None]:
top_genes

What happens if we do an outer merge? Redo the previous command and see what happens when you give the how argument 'outer'.
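On two tiny made-up frames, the difference between the two how values looks roughly like this. The frames `left` and `right` and their column names are hypothetical stand-ins, not the real tables.

```python
import pandas as pd

# Two made-up frames that share only some index labels.
left = pd.DataFrame({'ratio': [1.0, 2.0, 3.0]}, index=['A', 'B', 'C'])
right = pd.DataFrame({'utr_len': [100, 200, 300]}, index=['B', 'C', 'D'])

# Inner: keep only labels present in both frames.
inner = pd.merge(left, right, how='inner', left_index=True, right_index=True)

# Outer: keep every label from either frame and fill the gaps with NaN.
outer = pd.merge(left, right, how='outer', left_index=True, right_index=True)

print(inner)
print(outer)
```

Here the inner merge keeps only B and C, while the outer merge keeps A through D with NaN wherever one frame had no matching row.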

## A brief introduction to matplotlib

For all of your matplotlib questions, here is a decent tutorial: [matplotlib tutorial](http://www.scipy-lectures.org/intro/matplotlib/matplotlib.html) and a nice jupyter notebook [matplotlib notebook](http://nbviewer.jupyter.org/github/WeatherGod/AnatomyOfMatplotlib/blob/master/AnatomyOfMatplotlib-Part1-Figures_Subplots_and_layouts.ipynb)

We are going to plot the distributions of both GC content and UTR length for the feature data within the same figure, as two separate subplots, using matplotlib.

In [None]:
fig = plt.figure()
#ax = fig.add_subplot(111) # I'll explain the "111" later. Basically, 1 row and 1 column.
#ax.set(xlim=[0.5, 4.5], ylim=[-2, 8], title='An Example Axes', ylabel='Y-Axis', xlabel='X-Axis')


In [None]:
fig = plt.figure() #think of this like a container to put all of your plots on
#ax1 = fig.add_subplot(221)
#ax2 = fig.add_subplot(222)
#ax3 = fig.add_subplot(223)
#ax4 = fig.add_subplot(224)
#ax1.plot([1,2,3],[4,5,6])

![pics of subplots](plot_subplot-grid_1.png)

Notice that the subplot index starts at 1 and not 0, which is some weird shit that matplotlib decided would be a great idea in that it mimics how MATLAB indexes plots. Roll with it.  You can also call the plots using 0-based indexing using this syntax:

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8,8))
axes[0,0].set(title='Upper Left')
axes[0,1].set(title='Upper Right')
axes[1,0].set(title='Lower Left')
axes[1,1].set(title='Lower Right')
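For comparison, here is a sketch of the same 2x2 layout built with the 1-based add_subplot numbering described above. The `labels` dictionary is just a hypothetical helper for this example.

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 6))

# add_subplot(nrows, ncols, index): index counts from 1,
# left to right and then top to bottom.
labels = {1: 'Upper Left', 2: 'Upper Right', 3: 'Lower Left', 4: 'Lower Right'}
for index, label in labels.items():
    ax = fig.add_subplot(2, 2, index)  # same as fig.add_subplot(221), (222), ...
    ax.set_title(label)

# Collect the titles in the order the axes were added.
titles = [ax.get_title() for ax in fig.axes]
print(titles)
```

So index 1 lines up with axes[0,0], index 2 with axes[0,1], and so on across each row.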

We are going to go through this next block of code line by line to figure out what's going on.

In [None]:
fig, axs = plt.subplots(ncols=2, figsize=(15,4))
ax1 = top_genes['UTR_len'].plot(kind='kde', ax=axs[0])
ax1.set_title('UTR_len_distribution')
ax1.set_xlabel('number of bps')
ax2 = top_genes['GC_content'].plot(kind = 'kde', ax=axs[1])
ax2.set_title('%GC_distribution')
ax2.set_xlabel('% GC')

Now, if we want to see whether the distribution changes, we can overlay the whole gene set.

In [None]:
fig, axs = plt.subplots(ncols=2, figsize=(15,4))
ax1 = top_genes['UTR_len'].plot(kind='kde', ax=axs[0], legend=True)
ax1b = features['UTR_len'].plot(kind='kde', ax=axs[0], legend=True)
ax1.set_title('UTR_len_distribution')
ax1.set_xlabel('number of bps')
ax2 = top_genes['GC_content'].plot(kind = 'kde', ax=axs[1], legend=True)
ax2b = features['GC_content'].plot(kind = 'kde', ax=axs[1], legend=True)
ax2.set_title('%GC_distribution')
ax2.set_xlabel('% GC')
fig.savefig('test_figure.png')

## Apply method

Now we are going to clean up the folding energy for the 5'UTR using the apply method.  We will apply it to a series; however, if given an axis argument, it can also be applied to a dataframe.

In [None]:
features['hg19.foldUtr5.energy']

I want to get the maximum value out of each row for this column.  This would require a for loop in Python in order to unpack the list of values in each row.  But we are working in pandas, and there is usually a workaround to get out of for loops.  Here's an example for one row of what we are trying to achieve:

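In [None]:
# split() returns strings, so convert to numbers before taking the max
energies = pd.Series(features['hg19.foldUtr5.energy']['AAK1'].split(','))
pd.to_numeric(energies, errors='coerce').max()

In [None]:
features['hg19.foldUtr5.energy'].apply(lambda item: pd.to_numeric(pd.Series(item.split(',')), errors='coerce').max())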
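One thing to watch with comma-separated numbers: split returns strings, and strings compare character by character rather than by value. Here is a small sketch on made-up energies. The series `toy` and the helper `max_energy` are hypothetical, and I am assuming every entry is a plain comma-separated string of numbers.

```python
import pandas as pd

# Made-up comma-separated energies, one string per gene.
toy = pd.Series(['-9.5,-10.1,-2.0', '-3.0,-20.5,-1.5'], index=['g1', 'g2'])

def max_energy(item):
    """Largest numeric value in a comma-separated string (hypothetical helper)."""
    # Skip empty pieces, which a trailing comma would produce.
    values = [float(v) for v in item.split(',') if v]
    return max(values)

# Taking the max of the raw strings compares text, not numbers.
as_text = toy.apply(lambda item: max(item.split(',')))

# Converting to float first gives the numeric maximum.
as_numbers = toy.apply(max_energy)

print(as_text)
print(as_numbers)
```

For g1 the text comparison picks '-9.5' because '9' sorts after '1' and '2', while the numeric maximum is -2.0.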