# Principal Components Analysis (PCA)

PCA identifies the axes that correspond to the greatest variation in a dataset. Usually, most of the variation can be summarized by a few principal components, so the structure of the dataset can be represented using only those components.
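
As a rough illustration of this idea, the sketch below runs PCA on a small made-up dataset using NumPy's singular value decomposition. The data, the variable names, and the choice of SVD are assumptions for illustration only; this is not how the GenePattern PCA module is implemented.

```python
import numpy as np

# Toy dataset (made up for illustration): 6 samples x 3 features,
# where the third feature is nearly a copy of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 2))
data = np.column_stack([base[:, 0], base[:, 1],
                        base[:, 0] + 0.01 * rng.normal(size=6)])

# Center each feature, then take the singular value decomposition.
centered = data - data.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Squared singular values are proportional to the variance along each component.
explained = singular_values**2 / np.sum(singular_values**2)

# Project the samples onto the first two principal components.
projected = centered @ components[:2].T

print(explained)
```

Because the third feature carries almost no independent information, nearly all of the variance should fall on the first two components, which is why two axes are enough to represent this toy dataset.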

## 1. Import required Python libraries

<div class="alert alert-info">
- Click in the cell below
- Execute the cell by doing **one** of the following:
    - Type `Shift-Enter`
    - Choose the Cell &#x2192; Run Cells menu option
    - <img align="left" src="https://datasets.genepattern.org/images/jupyter-run.png"> &#x2190; Click the Run icon on the navigation bar under the menu.

In [9]:
import numpy as np
import matplotlib.pyplot as plt
import re
import urllib.request
from matplotlib.ticker import FuncFormatter

## 2. Sign in to GenePattern

<div class="alert alert-info">
- If you haven't yet logged in, enter your credentials into the cell below and click Login:

In [1]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.GPAuthWidget(genepattern.register_session("https://gp-beta-ami.genepattern.org/gp", "", ""))

## 3. Compute the principal components of the dataset

<div class="alert alert-info">
- Click and drag the following breast cancer dataset link to the **input filename** parameter below: [BRCA_HUGO_symbols.preprocessed.gct](https://datasets.genepattern.org/data/ccmi_tutorial/2017-12-15/BRCA_HUGO_symbols.preprocessed.gct) 
- Notice we are clustering by **columns**, which correspond to samples. This means we will be observing which samples cluster with one another.
- Click **Run**

In [2]:
pca_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00017')
pca_job_spec = pca_task.make_job_spec()
pca_job_spec.set_parameter("input.filename", "")
pca_job_spec.set_parameter("cluster.by", "3")
pca_job_spec.set_parameter("output.file", "<input.filename_basename>")
genepattern.GPTaskWidget(pca_task)

When the job completes, you will see a new cell above with the title **#######.PCA**, where the ####### corresponds to the GenePattern job ID of your PCA analysis. You will also see 4 result files:

Filename | Description
:------------ | :-------------
`<filename>_s.odf` | the **s matrix (eigenvalues, stored on the diagonal)**
`<filename>_t.odf` | the **t matrix (transformed original dataset)**
`<filename>_u.odf` | the **u matrix (eigenvectors)**
`gp_execution_log.txt` | the execution log - a record of the analysis run

## 4. Visualize the PCA results
To visualize the results of the PCA analysis, we will read the **s matrix** and **u matrix** files into Python array structures and create graphs based on the arrays. We do not need the **t matrix** for this analysis.

### a. Read file results into Python variables

The GenePattern results are files on the GenePattern server. To read them into Python arrays, we will use the `Send to Code` functionality in GenePattern Notebook.

<div class="alert alert-info">
1. In the cell above titled **#######.PCA**, you will see a filename that ends in `_s.odf`. To the right of this file, you will see the following icon: ![(i) icon](https://datasets.genepattern.org/images/gpnb-information-icon.png) Click this icon.
2. You will see a menu of several choices. Select `Send to Code`.
3. You will see a new code cell appear below the **#######.PCA** job results cell.
4. In this cell, you will see a Python variable name such as `brca_hugo_symbols_preprocessed_s_odf_1597528`
5. Select and copy this variable name.
6. In the cell below, paste the variable name into the input field for **gp s matrix file**.
7. Repeat the above steps for the filename above that ends in `_u.odf`.
8. Execute the cell below by clicking **Run**.

In [3]:
@genepattern.build_ui(name="Convert GenePattern ODF Matrix model result files to numpy arrays",
                    description="Take as input the S and U matrices resulting from a GenePattern PCA job " +
                                "and convert them to numpy arrays", parameters={
    # The plotting cell reads the returned tuple from `result_matrices`
    "output_var": {
        "default": "result_matrices",
        "hide": True
    }
})
def pca_results_to_arrays(gp_s_matrix_file, gp_u_matrix_file):
    s_matrix_array = gp_matrix_odf_to_nparray(gp_s_matrix_file)
    u_matrix_array = gp_matrix_odf_to_nparray(gp_u_matrix_file)
    return s_matrix_array, u_matrix_array
    
def gp_matrix_odf_to_nparray(gp_file):
    # Read the result file from the GenePattern server and decode bytes -> string
    fh = gp_file.open()
    matrix_text = fh.read().decode("utf-8")

    # Remove the 5 header lines, then strip the trailing tab from each row
    matrix_text = re.sub(r".*\n", "", matrix_text, count=5)
    matrix_text = re.sub(r"\t\n", "\n", matrix_text)

    # Split into rows, skipping the empty string left after the final newline
    matrix_2dlist = [row.split("\t") for row in matrix_text.split("\n") if row]

    # Convert the nested list of numeric strings to a float array
    return np.array(matrix_2dlist, dtype=float)
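
The parsing steps above can be tried without a GenePattern server. The sketch below applies the same idea (skip five header lines, drop the trailing tab, parse tab-separated numbers) to a made-up string. The header text is a placeholder and does not reproduce a real ODF header, and `np.loadtxt` is used here as an alternative to building the array by hand.

```python
import io
import numpy as np

# Hypothetical ODF-like text: five header lines, then tab-separated rows.
# The header fields are placeholders, not a real GenePattern file.
odf_text = (
    "header line 1\n"
    "header line 2\n"
    "header line 3\n"
    "header line 4\n"
    "header line 5\n"
    "1.0\t0.0\t\n"
    "0.0\t2.5\t\n"
)

# Drop the header, strip the trailing tab on each row, and skip blank lines.
rows = [line.rstrip("\t") for line in odf_text.splitlines()[5:] if line.strip()]

# Let numpy parse the remaining tab-separated numbers into a 2-D float array.
matrix = np.loadtxt(io.StringIO("\n".join(rows)), delimiter="\t", ndmin=2)
print(matrix.shape)
```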



### b. Read phenotype assignments to each sample

We will next read the file that contains the phenotype assignments (e.g., tumor, normal, etc.) for the samples in our dataset. These are in the [CLS](http://software.broadinstitute.org/cancer/software/genepattern/file-formats-guide#CLS) file format.

<div class="alert alert-info">
- Click and drag the file containing the phenotype descriptions to the **cls file url** parameter below: [BRCA_HUGO_symbols.preprocessed.cls](https://datasets.genepattern.org/data/ccmi_tutorial/2017-12-15/BRCA_HUGO_symbols.preprocessed.cls) 
- Click **Run**

In [4]:
@genepattern.build_ui(name="Read a phenotype assignment file (cls format) from a url and return its data",
                      description="Take as input the url to a cls file and return the data it contains: " +
                                  "number of samples, number of classes, class names, class assignments",
                      parameters={
    # The plotting cell reads the returned tuple from `class_data`
    "output_var": {
        "default": "class_data",
        "hide": True
    }
})
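def read_phenotype_assignments(cls_file_url):
    # urlopen yields bytes, so class names stay as bytes until decoded for display
    with urllib.request.urlopen(cls_file_url) as cls_file:
        # Line 1: <number of samples> <number of classes> 1
        l1 = cls_file.readline()
        (num_samples, num_classes, _) = [int(i) for i in l1.split()]

        # Line 2: '#' followed by the class names; drop the leading '#'
        l2 = cls_file.readline()
        class_names = l2.split()[1:]

        # Line 3: one class index per sample
        l3 = cls_file.readline()
        class_assignments = [int(i) for i in l3.split()]

    return (num_samples, num_classes, class_names, class_assignments)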
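
To make the three-line CLS layout concrete, here is a minimal sketch that parses a made-up CLS string instead of a URL. The sample content and class names are invented for illustration; only the line layout follows the CLS format.

```python
# Made-up CLS content for four samples in two classes.
cls_text = "4 2 1\n# Tumor Normal\n0 0 1 1\n"

header, names_line, assignments_line = cls_text.splitlines()

# Line 1: number of samples, number of classes, and a constant 1.
num_samples, num_classes, _ = (int(x) for x in header.split())

# Line 2: class names after the leading '#'.
class_names = names_line.split()[1:]

# Line 3: one class index per sample, in sample order.
class_assignments = [int(x) for x in assignments_line.split()]

# Basic consistency checks between the header and the body.
assert len(class_assignments) == num_samples
assert len(set(class_assignments)) == num_classes
```

Checks like the last two are optional, but they catch a truncated or mismatched file before it reaches the plotting code.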


### c. Set up Python variables for plotting

<div class="alert alert-info">
- Execute the cell below

In [18]:
# Extract the s and u matrices from the results
(s_matrix, u_matrix) = result_matrices

# The principal components are the transpose of the u matrix:
pc = u_matrix.transpose()

# The s matrix only has entries on the diagonal (the eigenvalues).
# Extract these into a 1-D array to facilitate processing:
evalues = np.diag(s_matrix)

# Compute the fraction of the total contributed by each eigenvalue
ev_percents = evalues / evalues.sum()

# The `class_data` variable contains the class information - parse it out into variables:
(num_samples, num_classes, class_names, class_assignments) = class_data

# Create color map for up to 6 classes:
colormap = ["#ff0000","#0000ff", "#00ff00", "#00ffff", "#ff00ff", "#ffff00"]
colors = [colormap[class_assignments[i]] for i in range(len(class_assignments))]
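
The diagonal extraction and normalization can be checked on a small example. The sketch below uses a made-up 3x3 diagonal matrix in place of the real s matrix; `s_toy` and its values are assumptions for illustration.

```python
import numpy as np

# Made-up 3x3 diagonal matrix standing in for the s matrix.
s_toy = np.diag([6.0, 3.0, 1.0])

# Extract the diagonal as a 1-D array.
diag_values = np.diag(s_toy)

# Share of the total contributed by each component.
fractions = diag_values / diag_values.sum()
print(fractions)
```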

### d. Display a scatter plot of the first 3 principal components

<div class="alert alert-info">
- Execute the cell below

In [None]:
import plotly
plotly.offline.init_notebook_mode() # To embed plots in the output cell of the notebook
import plotly.graph_objs as go

# Build one 3D scatter trace per class so each class gets its own color and legend entry
traces = []

for j in range(num_classes):
    # Indices of the samples assigned to class j
    in_class = [indx for indx, c in enumerate(class_assignments) if c == j]
    traces.append(go.Scatter3d(
        x=[pc[0][indx] for indx in in_class],
        y=[pc[1][indx] for indx in in_class],
        z=[pc[2][indx] for indx in in_class],
        mode='markers',
        name=class_names[j].decode("utf-8"),
        marker=dict(
            size=10,
            color=colormap[j]
        )
    ))

layout = go.Layout(
    height=1000, width=1000,
    title='TCGA Breast Cancer vs Normal PCA',
    scene=dict(xaxis=dict(title="PC 1", visible=True),
               yaxis=dict(title="PC 2", visible=True),
               zaxis=dict(title="PC 3", visible=True))
)

fig = dict(data=traces, layout=layout)
plotly.offline.iplot(fig)

### e. Display percentage of variance explained for each principal component

<div class="alert alert-info">
- Execute the cell below

In [None]:
plt.clf()

def percents(x, pos):
    'The two args are the value and tick position'
    return '%1.1f%%' % (x * 100)

formatter = FuncFormatter(percents)

plt.title("Variance Explained Per Principal Component")
# One bar per principal component, numbered from 1
x_vals = range(1, len(ev_percents) + 1)
plt.bar(x_vals, ev_percents, 0.8)

# Show the y-axis tick labels as percentages
plt.gca().yaxis.set_major_formatter(formatter)

plt.xlabel("Principal Component")
plt.ylabel("Variance Explained")
plt.show()
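
A common follow-up to this plot is to ask how many components are needed to reach a given share of the variance. The sketch below shows one way to do that with a cumulative sum. The fractions and the 90% threshold are made-up values for illustration, not results from the breast cancer dataset.

```python
import numpy as np

# Made-up per-component fractions (summing to 1), standing in for ev_percents.
fractions = np.array([0.55, 0.25, 0.12, 0.05, 0.03])

# Running total of variance explained by the first k components.
cumulative = np.cumsum(fractions)

# Smallest number of components whose running total reaches the threshold.
threshold = 0.90
n_components = int(np.searchsorted(cumulative, threshold) + 1)
print(n_components)
```

With these values the running totals are 0.55, 0.80, and 0.92, so three components are enough to pass the threshold.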

## Extra credit

- Perform the PCA analysis on the following files, which contain 38 samples from two leukemia subtypes, ALL and AML. The rightmost column indicates where to drag each URL.

Filename | Description | Send to this notebook parameter
:------------ | :------------- | :-------------
[all_aml_train.preprocessed.gct](https://github.com/genepattern/example-notebooks/blob/master/2017-11-07_CCMI_workshop/all_aml_train.preprocessed.gct?raw=true) | Gene expression file | PCA analysis cell **input filename** parameter
[all_aml_train.cls](https://raw.githubusercontent.com/genepattern/example-notebooks/master/2017-11-07_CCMI_workshop/all_aml_train.cls) | Phenotype assignments file | "Read a phenotype" analysis cell **cls file url** parameter