*This jupyter notebook is part of Arizona State University's course CAS 523 (Methods for Complex Systems Science: Statistics and Dimensionality Reduction) and was written by Bryan Daniels.  It was last updated August 31, 2022.*

*This assignment uses data gathered by Ying Wang and Robert E. Page, Jr. at Arizona State University.*

# Reducing the dimensionality of gene expression data using PCA

In this exercise, we will practice using Principal Components Analysis to extract useful insights from a high-dimensional set of gene expression data.  We will see how a scientific question can be more easily approached when we visualize the data in a lower-dimensional space.

This is part of a research project that I worked on together with Ying Wang, Rob Page, and Gro Amdam here at ASU, who are experts in honey bee physiology, behavior, and genetics.  Because it combines their expertise with mine in physics and complex systems data analysis, this project is also a good example of interdisciplinary collaboration.

## Get set up and load the data

**NOTE:** The dataset that we use in this assignment is not yet publicly available, but I have permission to share it with students.  For this reason, I do not include the data file in the public GitHub repository.  Instead, it is available for download on Canvas.  Please follow the link on the Canvas assignment page to download the file `Wang_Page_nanostring_data_2016_Day_15.csv` and place it in the folder `data/WangPage2016/` in your own copy of the GitHub repository.  Please do not share these data outside of this class.

Let's load some useful basic packages and functions first:

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
plt.rcParams.update({'font.size': 18}) # increases font size on plots
from helpers.prettyPlotting import scatter1D # custom 1D scatter plot
from pathlib import Path # to handle file paths across all operating systems

We will use the scikit-learn class `sklearn.decomposition.PCA` to perform PCA.  The documentation is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).

In [None]:
from sklearn.decomposition import PCA

Now load the data:

In [None]:
dataPath = Path('data/WangPage2016/Wang_Page_nanostring_data_2016_Day_15.csv')
expressionData = pd.read_csv(dataPath).drop(columns=['Age','VG protein '])

## What do these data represent?

These measurements were taken in honey bees at a precise time during their development (15 days old) when some bees are starting to leave the nest to forage for food.  Interestingly, some bees become foragers at a much younger age, while others stay in the nest much longer to take care of younger bees.  This transition is relatively sudden, with few bees switching back to in-nest activities once they start foraging.  There seem to be two separate "types" of bees related to which tasks they perform.  This is similar to how different cell types perform different tasks in your body.

Our question: As with cell types in human development, are different bee types (those that perform distinct functions) related to which genes are expressed?

My collaborators chose genes to measure that were suspected to be related to the behavioral transition to foraging.  These data represent how strongly these genes are expressed in individual honey bees.  (Specifically, these are [measurements of the amount of RNA](https://en.wikipedia.org/wiki/RNA-Seq) present for each of the genes of interest.  We have taken the logarithm of the raw data to more easily capture wide variations in expression.)

Let's first look at the form of the data we have:

In [None]:
expressionData

This is a `pandas` dataframe in which the columns represent the genes (90 of them) and the rows represent 16 individual bees whose gene expression was measured.

The default when printing a dataframe to the screen is to hide as many rows and columns as necessary to fit on a screen at once without a lot of scrolling.  To see the names of all the genes in the data, we can look at the `columns` attribute:

In [None]:
expressionData.columns

❓ **What is the dimension of each sample data point from this dataset?  That is, if I imagine plotting the gene expression profile of each bee as a point in space, what is the dimensionality of this space?** *Hint: I'm thinking here of each bee representing a single sample.*

✳️ **Answer:**

## Try visualizing in 2D

Due to the large dimensionality of the dataset, it can be difficult to decide which aspects to focus on for thinking about our question about bee types.  Which genes are important?

One way to start is to visualize the data in lower dimensions by focusing on one or a few genes of interest at a time.  An easy way to do this using `pandas` is to use the `plot.scatter` function, which takes the names of two columns and constructs a scatter plot.  For example, we can visualize the expression of the genes *vg* and *ILP-2* in our 16 bees:

In [None]:
expressionData.plot.scatter('vg','ILP-2');
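
Each such plot shows only one pair of genes at a time.  As a rough sense of scale, here is a minimal sketch that counts how many distinct pairs exist, assuming the 90 gene columns described above and using the standard-library function `math.comb`:

```python
from math import comb

n_genes = 90  # number of gene columns, as stated above

# Number of unordered pairs of distinct genes ("90 choose 2")
n_pairs = comb(n_genes, 2)
print(n_pairs)
```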

❓ **Choose a few other pairs of genes and make 2D scatter plots.  What insight do you gain from these plots?  Are there disadvantages to this approach?** *Hint: Consider the number of possible pairs of genes.  Also consider what would happen if many genes made small contributions to distinguishing between two bee behavioral types.*

✳️ **Answer:** 

As I initially played around with these data, I happened to find that the pair of genes *vg* and *P110* made for an intriguing scatter plot.

❓ **Make a scatter plot for the genes *vg* and *P110*.  Just looking at this plot, can you construct a simple hypothesis about how the expression of these two genes might correspond to two bee types?**

✳️ **Answer:** 

# Use PCA to do dimensionality reduction

Instead of searching through many possible genes related to this transition, can we use dimensionality reduction to find one or a few dimensions that are particularly interesting?

Recall that Principal Components Analysis (PCA) is one way of picking out such dimensions: PCA chooses the dimensions with largest variance.  This could be useful for our question about bee types because, if gene expression varies with bee type, then we expect larger variance (and correlated variance) among the genes that define the distinct bee types.
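
To make "the dimension with largest variance" concrete, here is a small sketch on synthetic 2D data (not the bee data).  The names `toy_data` and `toy_pca` and all of the numbers are made up for illustration: the points are spread widely along one diagonal direction and only slightly along the perpendicular one, so we expect the first principal component to line up with the diagonal.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-in data: 200 points that vary a lot along one
# diagonal direction and only a little along the perpendicular one
long_axis = rng.normal(scale=3.0, size=200)
short_axis = rng.normal(scale=0.3, size=200)
toy_data = np.column_stack([long_axis + short_axis, long_axis - short_axis])

toy_pca = PCA(n_components=2).fit(toy_data)

# The first component should point (up to an overall sign) close to
# the diagonal direction (1, 1)/sqrt(2)
print(toy_pca.components_[0])

# Nearly all of the variance should lie along that first component
print(toy_pca.explained_variance_ratio_)
```

Note that the overall sign of a principal component is arbitrary: a component and its negative describe the same direction.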

The following code runs PCA on our `expressionData` dataframe, keeping only the 10 components with largest variance:

In [None]:
pca_results = PCA(n_components=10).fit(expressionData)

The results are stored as attributes of the `pca_results` object, which we explore below.  (If you are curious about what all is in there, recall how tab completion works in Jupyter notebooks: you can type `pca_results.` followed by the tab key to see a list of the object's attributes and methods.)
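
If tab completion is not available, one alternative is to list the attributes programmatically.  The sketch below relies on scikit-learn's naming convention that quantities learned during `fit` end with an underscore; the object `toy_pca` is fit on random stand-in data rather than the bee data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
toy_pca = PCA(n_components=2).fit(rng.normal(size=(10, 4)))  # stand-in data

# Keep public attribute names that end with "_" (learned during fit)
fitted_attributes = [name for name in dir(toy_pca)
                     if name.endswith('_') and not name.startswith('_')]
print(fitted_attributes)
```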

## How low-dimensional are the data?

As a first step for thinking about what PCA is doing, let's ask how much variance there is in the data along each of these first 10 components.  Specifically, we'll ask what proportion of the total variance lies along each principal component.  This is stored as `explained_variance_ratio_`:

In [None]:
pca_results.explained_variance_ratio_

By construction, the first components have the largest variance (or "explain" the most variance, in the common lingo).
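
To see what these ratios mean, here is a sketch that recomputes them by hand on synthetic stand-in data (the names `toy_data` and `toy_pca` are made up for illustration).  The idea is that each ratio is the variance of the data projected onto one component, divided by the total variance summed over all of the original dimensions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic stand-in: 16 samples, 5 features with different spreads
toy_data = rng.normal(size=(16, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

toy_pca = PCA(n_components=3).fit(toy_data)

# Total variance = sum of the per-feature variances
# (ddof=1 matches the normalization scikit-learn uses)
total_variance = toy_data.var(axis=0, ddof=1).sum()

# Variance of the data after projecting onto each kept component
projected = toy_pca.transform(toy_data)
variance_along_components = projected.var(axis=0, ddof=1)

ratio_by_hand = variance_along_components / total_variance
print(ratio_by_hand)
print(toy_pca.explained_variance_ratio_)  # should agree with the line above
```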

A common way of visualizing this is to plot the total variance included as a function of the number of principal components kept.  The following code computes this "cumulative sum" and plots it:

In [None]:
var_explained_cumulative = pca_results.explained_variance_ratio_.cumsum()
plt.plot(np.arange(len(var_explained_cumulative))+1,var_explained_cumulative,'o:')
plt.xlabel('Number of principal components')
plt.ylabel('Proportion of\nvariance included')
plt.axis(xmin=1,ymax=1,ymin=0);
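
Besides reading values off the plot, the cumulative sum can be queried directly.  The following sketch uses hypothetical ratios (`toy_ratios`, not taken from the bee data) and an arbitrary target fraction to find the smallest number of components whose cumulative variance reaches the target.

```python
import numpy as np

# Hypothetical explained-variance ratios, NOT taken from the bee data
toy_ratios = np.array([0.50, 0.20, 0.15, 0.10, 0.05])
toy_cumulative = toy_ratios.cumsum()

target = 0.8  # fraction of variance to retain (chosen only for illustration)

# np.argmax returns the index of the first True; add 1 to turn it into a count
n_needed = int(np.argmax(toy_cumulative >= target)) + 1
print(n_needed)
```

One caveat: if the target is never reached (for instance because too few components were kept), `np.argmax` returns 0 and this sketch would misleadingly report 1, so it is worth checking that the last cumulative value is at least the target.  According to its documentation, scikit-learn's `PCA` can also make this choice itself when `n_components` is given as a fraction between 0 and 1 together with `svd_solver='full'`.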

❓ **If I wanted to reduce the dimension of the dataset so that I included the minimum number of dimensions to capture 90% of the total variance, how many dimensions would I need?  In what sense is this a measure of how low dimensional the dataset is?** *Hint: It might help to think about extreme cases: What would have to be true about the data for 90% of the variance to be explained by a single dimension?  When would we have to keep all dimensions?*

✳️ **Answer:** 

## Interpreting the first principal component

For our question about bee types, it makes sense to focus on the first principal component (the one with largest variance): If the dissimilarity in bee behavior is connected strongly to gene expression, then we expect these large differences in behavior to correspond to large differences in gene expression.  We are looking for large variance!

The first principal component is stored in `pca_results` as `components_[0]` (I include the names of the genes here by creating a `pandas` series indexed by the names in `expressionData.columns`):

In [None]:
component1 = pd.Series(pca_results.components_[0],
                       index = expressionData.columns)

Let's see what the first component looks like.  Recall that a principal component is defined in terms of weights given to each of the original dimensions (each of the original genes, in this case):

In [None]:
component1

So the principal component is a list of length 90, with a weight for each gene (either positive or negative).
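
Two properties of these weights are useful to keep in mind: each principal component is a unit vector (its squared weights sum to 1), and different components are orthogonal to each other.  Here is a sketch that checks both on synthetic stand-in data (`toy_data` and `toy_pca` are illustrative names, not part of the assignment):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
toy_data = rng.normal(size=(16, 6))  # synthetic stand-in: 16 samples, 6 features
toy_pca = PCA(n_components=3).fit(toy_data)

# Each row of components_ is one principal component (one weight per feature)
print(toy_pca.components_.shape)

# The squared weights of each component sum to 1
print((toy_pca.components_ ** 2).sum(axis=1))

# Dot products between components: 1 on the diagonal, 0 elsewhere
gram = toy_pca.components_ @ toy_pca.components_.T
print(np.round(gram, 6))
```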

I typically find it useful to visualize things when possible.  Here's one way to visualize the principal component (I split into two plots for easier legibility):

In [None]:
plt.figure(figsize=(15,2))  # set up a large plot area
component1[:45].plot.bar(); # plot the weights of the first 45 genes

In [None]:
plt.figure(figsize=(15,2))  # set up a large plot area
component1[45:].plot.bar(); # plot the weights of all genes past the first 45

How to interpret these results?  Most genes don't contribute much to the principal component (they have small weights), and a few contribute a lot.  One way to find the genes that contribute most is to sort by the absolute value of their weight:

In [None]:
abs(component1).sort_values()

The list is sorted in ascending order, so the largest contributions appear at the bottom: *hex 110* has the largest contribution, followed by *Hex70a*, and so on.
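
If you would rather see the largest contributions first, along with their signs, one option is to sort in descending order and then look up the original weights.  This sketch uses a hypothetical series `toy_component` with made-up gene names and weights:

```python
import pandas as pd

# Hypothetical weights with made-up gene names, for illustration only
toy_component = pd.Series({'geneA': 0.05, 'geneB': -0.70, 'geneC': 0.10,
                           'geneD': 0.60, 'geneE': -0.02})

# Rank by absolute weight, largest first, and keep the top three names
top_genes = toy_component.abs().sort_values(ascending=False).head(3).index

# Look up the original signed weights for those genes
print(toy_component[top_genes])
```

Keeping the sign can matter for interpretation: genes with weights of opposite sign move in opposite directions along the component.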

## Reducing data to a single dimension

Of course, the point of dimensionality reduction is that we can look at the data using these reduced coordinates.  In the extreme case, instead of the full dimensionality of the dataset, we can characterize each sample (each bee) by a *single* number.  This number is the "linear projection" of the full dimensional data onto the principal component—that is, we weight the gene expression of each bee by multiplying by the weights of the first principal component, then add them up to get a single value.

This projection, also called a "dot product", is accomplished by `np.dot`:

In [None]:
data_along_component1 = np.dot(expressionData,component1)
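
A side note on this projection: scikit-learn's own `transform` method subtracts the mean of each feature before projecting, while `np.dot` on the raw data does not.  The sketch below, on synthetic stand-in data with illustrative names, suggests that the two results differ only by a constant shift, so the relative positions of the samples along the component are the same either way.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
toy_data = rng.normal(loc=5.0, size=(16, 6))  # stand-in data with nonzero mean
toy_pca = PCA(n_components=2).fit(toy_data)
first_component = toy_pca.components_[0]

raw_projection = np.dot(toy_data, first_component)       # no centering
centered_projection = toy_pca.transform(toy_data)[:, 0]  # centers first

# The two differ only by a constant: the projection of the feature means
offset = np.dot(toy_pca.mean_, first_component)
print(np.allclose(raw_projection - offset, centered_projection))
```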

Projected along the first principal component, our dataset is reduced to 16 single numbers, one for each bee:

In [None]:
data_along_component1

Let's visualize this instead of trying to look at a list of numbers:

In [None]:
scatter1D(data_along_component1)
plt.xlabel('Distance along first component');

Or we might make a histogram:

In [None]:
plt.hist(data_along_component1,bins=10)
plt.xlabel('Distance along first component')
plt.ylabel('Number of bees');

❓ **Do you see any evidence that the gene expression corresponds to two distinct groups of bees along the principal component?  How might you distinguish this case from what we would expect if gene expression simply varied continuously with no distinct groups?** *Hint: The most naive expectation for continuously varying gene expression would be a bell-shaped normal distribution.*

✳️ **Answer:** 

# Separate bees into potential groups

We might separate bees into groups by setting a threshold along the first principal component.

❓ **What threshold value would you use to separate the bees into two distinct groups?** *Hint: Feel free to choose this by eye.*

✳️ **Answer:** 

Insert your threshold into the following code, which then splits the bees into two groups and assigns them colors based on which group they are in.

In [None]:
threshold = #insert your threshold here#
beesA = np.where(data_along_component1 > threshold)[0]
beesB = np.where(data_along_component1 < threshold)[0]

# make list of colors based on the group
colors = []
for i in range(len(data_along_component1)):
    if i in beesA: 
        colors.append('crimson')
    else: 
        colors.append('cornflowerblue')
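
As an aside, the same color list can be built without an explicit loop.  This sketch uses hypothetical values (`toy_projection` and `toy_threshold` are made up) and the three-argument form of `np.where`, which picks element-wise between two values:

```python
import numpy as np

# Hypothetical projected values and threshold, for illustration only
toy_projection = np.array([-2.1, 0.4, 3.0, -0.7, 1.8])
toy_threshold = 0.0

# Where the condition is True use 'crimson', otherwise 'cornflowerblue';
# .tolist() converts the result to a plain Python list of strings
toy_colors = np.where(toy_projection > toy_threshold,
                      'crimson', 'cornflowerblue').tolist()
print(toy_colors)
```

As in the loop above, any value that is not strictly greater than the threshold ends up in the blue group.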

Here's an example using the colors in a scatter plot (where red dots correspond to bees in group A, and blue to group B):

In [None]:
expressionData.plot.scatter('LOC410022','AKHR',
                            c=colors,s=100);

❓ **Which genes do you expect to be most correlated with the separation into the two groups?  Check your answer by using the above code to make a 2D colored scatter plot that shows the expression levels of these genes in the two groups.** *Hint: You have already calculated a sorted list of genes above...*

✳️ **Answer:** 

❓ **Do the two groups we noticed above in the scatter plot of *P110* versus *vg* (before we did PCA) also correspond to the two groups defined along the principal component?** 

✳️ **Answer:** 