# Application of matrix decomposition to biological data

So far, we've applied PCA and ICA to datasets that weren't truly biological. Now we'll try a real biological dataset, obtained from a public database.

## 1. Find the database and accession codes

Most recent papers end with a section called "**Accession Codes**" or "**Accession Numbers**", which lists an identifying code for the data deposited in a public database.

### Example data accession section from a Cell paper

![Accession numbers in Cell journal](figures/accession_numbers_cell.png)

### Example data accession section from a Nature Biotech paper
![Accession codes in Nature Biotech journal](figures/accession_codes_buettner.png)

Let's do this for the Shalek2013 paper. Note: For some "older" papers, the accession code may not be on the PDF version of the paper but on the online version only. What I usually do then is search for the title of the paper and go to the journal website.

* What database was the data deposited to? 
* What is its accession number?

## 2. Go to the data in the database

If you search for the database and the accession number, the first result will usually be the database with the paper info and the deposited data! Below is an example search for "Array Express E-MTAB-2805."

![Example search for "Array Express E-MTAB-2805"](figures/buettner_search_accession.png)

Search for its database and accession number and you should get to a page that looks like this:

![GEO overview page for Shalek 2013](figures/shalek2013_geo.png)

## 3. Find the gene expression matrix

Many recent papers *do* provide a processed expression matrix in the accession database that you can use directly. Luckily for us, that's exactly what the authors of the Shalek 2013 dataset did. At the bottom of the page there's a table of supplementary files, and one of them is called "`GSE41265_allGenesTPM.txt.gz`". The cell below passes that file's "(ftp)" link to the command "`wget`", which I think of as short for "web-get": it downloads files from the internet on the command line.

Run the cell below to download the text file

In [1]:
! wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE41nnn/GSE41265/suppl/GSE41265_allGenesTPM.txt.gz

--2016-06-04 11:38:46--  ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE41nnn/GSE41265/suppl/GSE41265_allGenesTPM.txt.gz
           => 'GSE41265_allGenesTPM.txt.gz'
Resolving ftp.ncbi.nlm.nih.gov... 2607:f220:41e:250::12, 130.14.250.11
Connecting to ftp.ncbi.nlm.nih.gov|2607:f220:41e:250::12|:21... failed: Operation timed out.
Connecting to ftp.ncbi.nlm.nih.gov|130.14.250.11|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /geo/series/GSE41nnn/GSE41265/suppl ... done.
==> SIZE GSE41265_allGenesTPM.txt.gz ... 1099290
==> PASV ... done.    ==> RETR GSE41265_allGenesTPM.txt.gz ... done.
Length: 1099290 (1.0M) (unauthoritative)


2016-06-04 11:40:07 (294 KB/s) - 'GSE41265_allGenesTPM.txt.gz' saved [1099290]
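If `wget` isn't installed on your machine, the same download can be sketched with Python's standard library. This is a minimal sketch, not the method used in this notebook: the helper name `download` is made up for illustration, and the actual download line is commented out because it needs network access to the FTP server.

```python
import shutil
import urllib.request

# The same "(ftp)" link that the wget cell above uses
GEO_URL = ("ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE41nnn/GSE41265/"
           "suppl/GSE41265_allGenesTPM.txt.gz")


def download(url, destination):
    """Stream the contents of `url` into a local file and return its path.

    `download` is a hypothetical helper name, not part of any library.
    """
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination


# Uncomment to fetch the real file (needs network access):
# download(GEO_URL, "GSE41265_allGenesTPM.txt.gz")
```

Copying in chunks with `shutil.copyfileobj` avoids holding the whole file in memory, which matters more for files much bigger than this one.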



We can use the unix command "`ls`" (short for "listing") to look around the files that are available here and prove to ourselves that we actually have the file we just downloaded.

In [2]:
ls

2.1_Introduction.ipynb
2.2_Hierarchical_clustering.ipynb
2.3_Matrix_Decomposition.ipynb
2.4_Manifold_learning.ipynb
2.5_Compare_unsupervised.ipynb
2.6_Additional_reading.ipynb
2.9_Application_of_matrix_decomposition_to_shalek2013.ipynb
GSE41265_allGenesTPM.txt.gz
[1m[36mfigures[m[m/
[1m[36mpapers[m[m/


See, "`GSE41265_allGenesTPM.txt.gz`" is there!

Since the file ends in "`.gz`", we know it's a "gnu-zipped" or "gzipped" ("gee-zipped") file, which is one specific flavor of "zipping", or compressing, a file. To decompress it we need a gzip-aware program, and that's "`gunzip`" ("gnu-unzip").

Run the next cell to unzip the file

In [3]:
! gunzip GSE41265_allGenesTPM.txt.gz

Let's "`ls`" again to see what files have changed

In [4]:
ls

2.1_Introduction.ipynb
2.2_Hierarchical_clustering.ipynb
2.3_Matrix_Decomposition.ipynb
2.4_Manifold_learning.ipynb
2.5_Compare_unsupervised.ipynb
2.6_Additional_reading.ipynb
2.9_Application_of_matrix_decomposition_to_shalek2013.ipynb
GSE41265_allGenesTPM.txt
[1m[36mfigures[m[m/
[1m[36mpapers[m[m/


So now we have the unzipped version of the file, "`GSE41265_allGenesTPM.txt`". I wonder how much space they saved by zipping it?

Let's use the flag "`-l`" for "long listing", which will show us the sizes.

In [5]:
ls -l

total 7456
-rw-r--r--   1 olga  staff     2084 May 24 16:04 2.1_Introduction.ipynb
-rw-r--r--   1 olga  staff   214633 Jun  3 12:09 2.2_Hierarchical_clustering.ipynb
-rw-r--r--   1 olga  staff   351682 Jun  3 12:16 2.3_Matrix_Decomposition.ipynb
-rw-r--r--   1 olga  staff   198471 May 24 19:29 2.4_Manifold_learning.ipynb
-rw-r--r--   1 olga  staff   163301 May 25 10:15 2.5_Compare_unsupervised.ipynb
-rw-r--r--   1 olga  staff     1532 May 24 16:04 2.6_Additional_reading.ipynb
-rw-r--r--   1 olga  staff     7137 Jun  4 12:16 2.9_Application_of_matrix_decomposition_to_shalek2013.ipynb
-rw-r--r--   1 olga  staff  2866331 Jun  4 11:40 GSE41265_allGenesTPM.txt
drwxr-xr-x  27 olga  staff      918 Jun  4 11:59 figures/
drwxr-xr-x   5 olga  staff      170 May 24 16:03 papers/


Oof, this is in raw bytes, and I can't easily convert to multiples of 1024 in my head (1024 bytes = 1 kilobyte, 1024 kilobytes = 1 megabyte, etc.; the 1000-bytes-per-kilobyte figure is a lie that the hard drive companies use!). So let's use the `-h` flag, which tells the computer to do the conversion for us. We can combine multiple flags after the same dash, so

    ls -l -h

can be shortened to:

    ls -lh

In [7]:
ls -lh

total 7464
-rw-r--r--   1 olga  staff   2.0K May 24 16:04 2.1_Introduction.ipynb
-rw-r--r--   1 olga  staff   210K Jun  3 12:09 2.2_Hierarchical_clustering.ipynb
-rw-r--r--   1 olga  staff   343K Jun  3 12:16 2.3_Matrix_Decomposition.ipynb
-rw-r--r--   1 olga  staff   194K May 24 19:29 2.4_Manifold_learning.ipynb
-rw-r--r--   1 olga  staff   159K May 25 10:15 2.5_Compare_unsupervised.ipynb
-rw-r--r--   1 olga  staff   1.5K May 24 16:04 2.6_Additional_reading.ipynb
-rw-r--r--   1 olga  staff   8.6K Jun  4 12:18 2.9_Application_of_matrix_decomposition_to_shalek2013.ipynb
-rw-r--r--   1 olga  staff   2.7M Jun  4 11:40 GSE41265_allGenesTPM.txt
drwxr-xr-x  27 olga  staff   918B Jun  4 11:59 figures/
drwxr-xr-x   5 olga  staff   170B May 24 16:03 papers/
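To make the divide-by-1024 conversion concrete, here is a rough sketch of what `-h` does for us. The function name `human_readable` is made up for illustration, and the rounding is an assumption: `ls` has its own formatting rules (for example, it prints `210K` with no decimal), so don't expect an exact match for every file.

```python
def human_readable(n_bytes):
    """Format a byte count using powers of 1024, to one decimal place.

    `human_readable` is a hypothetical helper; `ls -h` rounds differently.
    """
    size = float(n_bytes)
    for unit in ["B", "K", "M", "G", "T"]:
        # Stop once the number is below 1024, or when we run out of units
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024


# Size in bytes of GSE41265_allGenesTPM.txt, taken from the `ls -l` output
print(human_readable(2866331))
```

For the expression matrix this gives the same `2.7M` that `ls -lh` reports.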


Okay, the unzipped file is 2.7 megabytes, and in the "`wget`" output we saw that the gzipped file was about 1 megabyte, so the gzipping saved more than half the space (roughly 60%)! I bet that adds up over all the millions of files that GEO hosts.
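If you'd rather not eyeball it, the two byte counts we've already seen give the saving directly. This is just arithmetic on the sizes reported by `wget` and `ls -l` above; the variable names are my own.

```python
compressed_bytes = 1099290    # size reported in the wget output
uncompressed_bytes = 2866331  # size reported by `ls -l` after gunzip

ratio = compressed_bytes / uncompressed_bytes  # fraction of the original size
saved = 1 - ratio                              # fraction of space saved

print(f"compressed file is {ratio:.1%} of the original")
print(f"space saved: {saved:.1%}")
```

The compressed file comes out a bit under 40% of the original size, so gzip saved a bit over 60%.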

Anyways, let's get on with the analysis.

## 3. Reading in the data file

To read the gene expression matrix, we'll use "`pandas`", a Python package named for "panel data" analysis (as in panels of data). It's a fantastic library for working with dataframes, and is Python's answer to R's dataframes. We'll take this opportunity to import ALL of the Python libraries that we'll use today.

We'll read in the data using `pandas` and look at the first 5 rows of the dataframe with the dataframe-specific function `.head()`. Whenever I read a new table or modify a dataframe, I **ALWAYS** look at it to make sure it was correctly imported and read in, and I want you to get into the same habit.
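As a side note, `pandas` can also read a gzipped table directly, because it infers the compression from the "`.gz`" suffix. Here's a toy sketch of that, using a tiny made-up table written to a temporary folder (the gene names and values are invented for illustration, not taken from the real file). It also shows what `index_col=0` and `.head()` do.

```python
import gzip
import os
import tempfile

import pandas as pd

# Tiny made-up tab-separated table, written out as a gzipped file
toy_text = "GENE\tS1\tS2\nGeneA\t0.0\t1.5\nGeneB\t2.0\t0.0\n"
toy_path = os.path.join(tempfile.mkdtemp(), "toy_expression.txt.gz")
with gzip.open(toy_path, "wt") as handle:
    handle.write(toy_text)

# Compression is inferred from the ".gz" suffix;
# index_col=0 uses the first column (GENE) as the row names
toy = pd.read_csv(toy_path, sep="\t", index_col=0)

print(toy.shape)
print(toy.head())
```

So decompressing first, as we did, isn't strictly required; it just makes the file easier to peek at with other tools.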

In [14]:
# Imports in alphabetical order, a common convention
import matplotlib as mpl
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA, FastICA

%matplotlib notebook

shalek2013_expression = pd.read_table('GSE41265_allGenesTPM.txt', 
                                      index_col=0)  # index_col=0 sets the first column as the row names 
shalek2013_expression.head()

GENE,S1,S2,S3,S4,S5,S6,S7,S8,S9,S10,...,S12,S13,S14,S15,S16,S17,S18,P1,P2,P3
XKR4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019906,0.0
AB338584,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B3GAT2,0.0,0.0,0.023441,0.0,0.0,0.029378,0.0,0.055452,0.0,0.029448,...,0.0,0.0,0.031654,0.0,0.0,0.0,42.150208,0.680327,0.022996,0.110236
NPL,72.00859,0.0,128.062012,0.095082,0.0,0.0,112.310234,104.329122,0.11923,0.0,...,0.0,0.116802,0.1042,0.106188,0.229197,0.110582,0.0,7.109356,6.727028,14.525447
T2,0.109249,0.172009,0.0,0.0,0.182703,0.076012,0.078698,0.0,0.093698,0.076583,...,0.693459,0.010137,0.081936,0.0,0.0,0.086879,0.068174,0.062063,0.0,0.050605


So we have 21 columns, but pandas by default shows a maximum of 20, so let's change the setting to see ALL of the samples instead of skipping sample 11 (**S11**). We'll change this for rows, too; why will become obvious in a second.

In [15]:
pd.options.display.max_columns = 21
pd.options.display.max_rows = 21
shalek2013_expression.head()

GENE,S1,S2,S3,S4,S5,S6,S7,S8,S9,S10,S11,S12,S13,S14,S15,S16,S17,S18,P1,P2,P3
XKR4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019906,0.0
AB338584,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B3GAT2,0.0,0.0,0.023441,0.0,0.0,0.029378,0.0,0.055452,0.0,0.029448,0.024137,0.0,0.0,0.031654,0.0,0.0,0.0,42.150208,0.680327,0.022996,0.110236
NPL,72.00859,0.0,128.062012,0.095082,0.0,0.0,112.310234,104.329122,0.11923,0.0,0.0,0.0,0.116802,0.1042,0.106188,0.229197,0.110582,0.0,7.109356,6.727028,14.525447
T2,0.109249,0.172009,0.0,0.0,0.182703,0.076012,0.078698,0.0,0.093698,0.076583,0.0,0.693459,0.010137,0.081936,0.0,0.0,0.086879,0.068174,0.062063,0.0,0.050605


Now we can see all the samples!
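Setting `pd.options.display.max_columns` changes the display for the rest of the session. If you only want a wider view for a single look, `pd.option_context` applies the setting temporarily. Below is a toy sketch with a made-up 25-column frame of zeros (not the real data).

```python
import pandas as pd

# Made-up single-row frame with 25 columns, wider than the default limit
wide = pd.DataFrame([[0.0] * 25], columns=[f"C{i}" for i in range(1, 26)])

default_max = pd.get_option("display.max_columns")

# Inside the block every column is displayed; None means "no limit"
with pd.option_context("display.max_columns", None):
    inside = pd.get_option("display.max_columns")
    print(wide)

# Outside the block the previous setting is back
after = pd.get_option("display.max_columns")
print(inside, after)
```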

Let's take a look at the full size of the matrix with `.shape`:

In [17]:
shalek2013_expression.shape

(27723, 21)

Wow, ~28k rows! Those must be the genes, while the columns are the 18 single-cell samples and 3 pooled samples.
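We can double-check that count from the column names alone. This sketch rebuilds the names seen in the header (`S1` to `S18`, `P1` to `P3`) and assumes, going by the counts above, that the `S` prefix marks single cells and the `P` prefix marks pooled samples.

```python
# Column names as they appear in the header of the expression matrix
sample_names = [f"S{i}" for i in range(1, 19)] + ["P1", "P2", "P3"]

# Assumption: "S" = single cell, "P" = pooled sample
single_cells = [name for name in sample_names if name.startswith("S")]
pooled = [name for name in sample_names if name.startswith("P")]

print(len(single_cells), len(pooled), len(sample_names))
```

On the real dataframe, the same list comprehensions would work on `shalek2013_expression.columns`.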

## 4. Verify that the matrix conforms to machine learning standards

Okay, so we have the genes as the rows and the samples as the columns. To make this compatible with machine learning algorithms, we need to transpose it so that the rows are the samples (individual cells and bulk sequencing libraries) and the columns are the features (genes). We'll do that with `.T` and verify the shape, in addition to showing the top 5 rows.

I like to print the shape of the matrix in addition to showing the "head" so I can keep track of the number of rows and columns.

In [18]:
shalek2013_expression = shalek2013_expression.T
print(shalek2013_expression.shape)
shalek2013_expression.head()

(21, 27723)


GENE,XKR4,AB338584,B3GAT2,NPL,T2,T,PDE10A,1700010I14RIK,6530411M01RIK,PABPC6,...,AK085062,DHX9,RNASET2B,FGFR1OP,CCR6,BRP44L,AK014435,AK015714,SFT2D1,PRR18
S1,0.0,0.0,0.0,72.00859,0.109249,0.0,0.0,0.0,0.0,0.0,...,0.0,0.774638,23.520936,0.0,0.0,460.316773,0.0,0.0,39.442566,0.0
S2,0.0,0.0,0.0,0.0,0.172009,0.0,0.0,0.0,0.0,0.0,...,0.0,0.367391,1.887873,0.0,0.0,823.89029,0.0,0.0,4.967412,0.0
S3,0.0,0.0,0.023441,128.062012,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.249858,0.31351,0.166772,0.0,1002.354241,0.0,0.0,0.0,0.0
S4,0.0,0.0,0.0,0.095082,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.354157,0.0,0.887003,0.0,1230.766795,0.0,0.0,0.131215,0.0
S5,0.0,0.0,0.0,0.0,0.182703,0.0,0.0,0.0,0.0,0.0,...,0.0,0.039263,0.0,131.077131,0.0,1614.749122,0.0,0.242179,95.485743,0.0
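To see what the transpose does on something small enough to check by eye, here's a toy sketch. The gene and sample names are borrowed from the real table, but the values are rounded stand-ins. After `.T`, the layout is (samples, features), which is the orientation scikit-learn estimators such as `PCA` expect.

```python
import pandas as pd

# Toy genes-by-samples table; values are rounded stand-ins for illustration
genes_by_samples = pd.DataFrame(
    {"S1": [0.0, 72.0], "S2": [0.0, 0.0], "P1": [0.0, 7.1]},
    index=pd.Index(["XKR4", "NPL"], name="GENE"),
)

# Transpose: rows become samples, columns become genes (features)
samples_by_genes = genes_by_samples.T

print(genes_by_samples.shape)   # genes x samples
print(samples_by_genes.shape)   # samples x genes
print(samples_by_genes)
```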


## 5. Filter on bad genes and bad cells

Okay, now we're ready to do some analysis! 

In [None]:
shalek2013_expression