In [None]:
# imports
from datascience import Table
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('fivethirtyeight')
from sklearn.cluster import KMeans

# Part 1
This lab and the labs that follow use the `datascience` Table API. For more information about the Table API, see http://data8.org/datascience/tutorial.html#getting-started.

For part 1 of the lab, we will investigate the expression of 1000 genes in CD8+ T cells from mice infected with vesicular stomatitis virus, measured over six days. This data is from the Immgen database and can be explored at http://rstats.immgen.org/PopulationComparison/.

## Load in the Data

1. First, load the data from '../data/lab2/lab2_immgen.csv' using the `read_table` function. (The cell below reads a hosted copy of the same file.)

In [None]:
# load in data for part 1
table1 = Table.read_table('https://raw.githubusercontent.com/data-8/mcb-88-connector/gh-pages/data/lab2/lab2_immgen.csv')
table1

### What do the table headers mean?
For each gene, we are given a gene symbol and description. Other values include fold change (FC) and mean expression. We will discuss mean expression in part 2. 

A description of fold change: https://en.wikipedia.org/wiki/Fold_change
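
As a rough illustration of the idea, fold change compares expression in one condition against expression in a reference condition. The sketch below uses made-up numbers and assumes fold change is the 6-day mean divided by the naive mean. The FC column in this table may follow a different convention (for example the inverse ratio or a log scale), so the function name and values here are hypothetical.

```python
import math

def fold_change(reference, treated):
    """Ratio of treated expression to reference expression."""
    return treated / reference

# Hypothetical mean expression values for one gene (not taken from the lab data).
naive_expression = 200.0
day6_expression = 800.0

fc = fold_change(naive_expression, day6_expression)
log2_fc = math.log2(fc)

print(fc)       # 4.0: expression is four times higher than in naive cells
print(log2_fc)  # 2.0: the same change expressed as two doublings
```

A fold change above 1 means expression went up relative to the reference, and a value between 0 and 1 means it went down.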


The data loaded in shows 1000 genes and their mean expression after 6 days of incubation. This expression value is shown in the column labeled 'mean_6d'. 

Let's start by sorting the genes by expression. We will sort in descending order to find the 3 most highly expressed genes.

In [None]:
# sort by mean_6d descending then choose top 3

# First, select the Gene symbol and mean expression columns
expression = table1.select(['Gene_Symbol', 'mean_6d'])

# Next, sort by mean expression. Set descending to True.
sortedDescending = expression.sort('mean_6d', descending=True)

# Select the top 3 expressed genes
print(sortedDescending.rows[0], sortedDescending.rows[1], sortedDescending.rows[2])

Here, we see that the most highly expressed genes are Ccl5, Hist1h3b, and Hist2h3b.

## Question 1
Using the **sort()** functionality for table, what are the 3 lowest expressed genes and their expression values? 

In [None]:
# Answer here:

## Question 2
Still using the sort() function, find the genes with the 3 highest and 3 lowest fold change (FC) values. Are these the same as the highest and lowest expressed genes found above? Why or why not?

In [None]:
# Answer here:

## Question 3
A. Plot the naive and 6 day expressions, with naive (mean_naive) on the x axis and 6 day (mean_6d) on the y axis.


B. What does this plot tell you about relative expression after 6 days of incubation?

**Answer here:**

In [None]:
plot_data = table1.select(['mean_naive', 'mean_6d'])

# Answer here: Plot the plot_data, using the '.scatter' function


# Part 2

For part 2 of the lab, we will gather timeseries data for the expression of 42 genes from 0 hours to 100 days. The goal of this exercise is to find relationships among genes using k-means clustering. A description of k-means clustering can be found here: 
We will be using the sklearn implementation: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
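
To give a feel for what k-means does, here is a minimal pure-Python sketch of its two alternating steps: assign each point to its nearest center, then move each center to the mean of its assigned points. This is an illustration only, not the sklearn implementation, which by default uses a more careful initialization (k-means++) and further optimizations. The function names and the toy profiles are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def simple_kmeans(points, initial_centers, n_iterations=10):
    centers = list(initial_centers)
    labels = []
    for _ in range(n_iterations):
        # Assignment step: label each point with the index of its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old center if a cluster ends up empty
                centers[k] = mean_point(members)
    return labels, centers

# Hypothetical expression profiles at two time points, forming two obvious groups.
profiles = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centers = simple_kmeans(profiles, initial_centers=[profiles[0], profiles[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In this lab each gene is a point with 10 coordinates (one per time point) rather than 2, but the procedure is the same.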

In [None]:
# load in data for part 2
table2 = Table.read_table('https://raw.githubusercontent.com/data-8/mcb-88-connector/gh-pages/data/lab2/immgen_timeseries.csv')

# Show the table
table2

### What do the table headers mean?
In this table, 'mean_n' indicates mean gene expression for naive cells. 'mean_12h' indicates mean expression after 12 hours, etc.

Fit 10 clusters to the 42 genes available in immgen_timeseries.csv. You must first convert the table into a NumPy matrix. To do so, select the relevant columns, then call `table.to_df().values`. (Older versions of pandas offered `as_matrix()` for this, but it was removed in pandas 1.0.)

In [None]:
# Select the expression columns for 0 to 100 days
columns = table2.select(['mean_n','mean_12h','mean_24h','mean_48h','mean_6d', 'mean_8d','mean_10d', 'mean_15d', 'mean_45d','mean_100d'])

# To use kmeans, we must format our data as a matrix. Converting the table to a pandas DataFrame and
# taking its values gives us a NumPy matrix of expression data.
matrix = columns.to_df().values

# Run kmeans with 10 clusters
kmeans = KMeans(n_clusters=10, random_state=0).fit(matrix)

# Print out the cluster labels. Each label 0-9 represents a different cluster. Genes in the same cluster should have 
#'similar' expression patterns. 
kmeans.labels_
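
The label array has one entry per gene, in the same order as the rows of the matrix passed to `fit`. One quick sanity check is to count how many genes landed in each cluster. The sketch below uses a short made-up label list; in the lab you could pass `kmeans.labels_.tolist()` instead.

```python
from collections import Counter

# Hypothetical labels for eight genes (not the lab's actual output).
example_labels = [0, 2, 1, 0, 2, 2, 1, 0]

# Count how many genes were assigned to each cluster id.
cluster_sizes = Counter(example_labels)
for cluster_id in sorted(cluster_sizes):
    print(cluster_id, cluster_sizes[cluster_id])
```

With 42 genes spread over 10 clusters, expect some clusters to contain only a few genes.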

Now, we will plot the time series for the genes in cluster 0, from 0 to 100 days of incubation.

In [None]:
# Add a new column to our table that holds the cluster label kmeans assigned.
columns_and_labels = columns.with_column('cluster_id', kmeans.labels_.tolist())

In [None]:
# Get genes that were clustered into cluster 0. Drop the cluster column for plotting.
cluster_0 = columns_and_labels.where('cluster_id', 0).drop('cluster_id')

# Graphing a line for each gene is tricky. To do this, we will convert our data to a matrix,
# then use matplotlib. 'T' transposes our data so that each column holds one gene.
# (.values replaces as_matrix(), which was removed in pandas 1.0.)
x = cluster_0.to_df().values.T

# Plot each gene using matplotlib.
matplotlib.pyplot.plot(x)
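
The transpose matters because matplotlib's `plot`, when given a 2-D array, draws one line per column. The table stores one gene per row, so transposing puts each gene in its own column, with time points running down the rows. Here is a sketch of that reshaping with plain Python lists; the values and variable names are hypothetical.

```python
# Two hypothetical genes measured at three time points (one gene per row).
gene_rows = [
    [5.0, 7.0, 6.0],  # gene A at t0, t1, t2
    [1.0, 2.0, 4.0],  # gene B at t0, t1, t2
]

# Transpose: each inner list now holds one time point across all genes.
time_rows = [list(values) for values in zip(*gene_rows)]
print(time_rows)  # [[5.0, 1.0], [7.0, 2.0], [6.0, 4.0]]

# Column j of the transposed data is gene j's full time series.
gene_b_series = [row[1] for row in time_rows]
print(gene_b_series)  # [1.0, 2.0, 4.0]
```

Note that when `plot` receives only the y values, the x axis shows the index of each time point (0 through 9 here), not the elapsed time.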

## Question 4
Following the instructions from the cell above, plot and save clusters 2, 4, 7 and 8.

In [None]:
# Answer here:

## Question 5
What are some trends you can see between and across gene clusters?

**Answer here:**

## Question 6
What may be some factors that differentiate specific genes from one another?

**Answer here:**

## Question 7
On a scale from 1 to 10 (1 being worst, 10 being best), please rate this lab in terms of its:
1. Clarity
2. Difficulty
3. Length
4. Insight

## Bonus Problem 1

Look at some of the clusters and the genes in each cluster. Can you find any genes in cluster 0 that may have synergistic roles? (Hint: This will require you to join Gene Symbols with cluster ids.)

**Answer here:**

## Bonus Problem 2

Read over https://www.immgen.org/ImmGenPublications/ni.2536.pdf. Specifically, take a look at Figure 1c. The clusters in Figure 1c may be similar to the clusters we created above. What cluster categories does the paper group genes into? Can you guess which, if any, of the clusters we generated match the clusters from the paper?

**Answer here:**