# Visualizing ALDEx2 feature differentials using Qurro

In this example, we use transcriptomic data from [TCGA](https://portal.gdc.cancer.gov/repository) [1]. We downloaded 100 gene expression files from lung squamous cell carcinoma (LUSC) primary tumors and 49 solid tissue normal expression files. We have pre-processed this data into a feature table for ease of use, but the GDC manifest file is also provided, along with the script used to aggregate the files.

**Please note that this example is primarily intended to demonstrate how to use Qurro**, rather than to demonstrate "best practices" in using ALDEx2 or in analyzing transcriptomic data. We designed this tool in the context of microbiome / metabolome data, but it should be applicable to arbitrary compositional datasets :)

[1] Weinstein, J. N., Collisson, E. A., Mills, G. B., Shaw, K. R. M., Ozenberger, B. A., Ellrott, K., ... & Cancer Genome Atlas Research Network. (2013). The cancer genome atlas pan-cancer analysis project. *Nature genetics, 45*(10), 1113.


## Requirements:

This notebook requires **qurro**, **pandas (>= 0.24.0 and < 1.0.0)**, **gtfparse**, and **biom-format** to be installed on the Python side of things. The R package [ALDEx2](http://bioconductor.org/packages/release/bioc/html/ALDEx2.html) is also required for the `run_aldex.R` script.

## 0. Setting up
In this section, we replace the output directory with an empty directory. This just lets us run this notebook multiple times, without any tools complaining about overwriting files.

In [73]:
# Clear the output directory so we can write these files there
!rm -rf output/*
# Since git doesn't keep track of empty directories, create the output/ directory if it doesn't already exist
# (if it does already exist, -p ensures that an error won't be thrown)
!mkdir -p output
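
For readers who would rather stay in Python than shell out, a minimal sketch of the same reset is shown below. The helper name `reset_dir` is hypothetical, and unlike `rm -rf output/*` it removes the directory itself (hidden files included) before recreating it.

```python
import shutil
from pathlib import Path


def reset_dir(path):
    """Delete `path` (if present) and recreate it as an empty directory."""
    directory = Path(path)
    # ignore_errors=True mirrors `rm -rf`: a missing directory is not an error
    shutil.rmtree(directory, ignore_errors=True)
    # parents/exist_ok mirror `mkdir -p`
    directory.mkdir(parents=True, exist_ok=True)
    return directory
```

Calling `reset_dir("output")` would then stand in for the two shell commands above.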

## 1. Processing the feature table

The original feature table has over 60,000 features, which is too computationally expensive for a simple tutorial example like this. To speed things up, we'll filter this table to use the top 1000 genes by total abundance. (Again, this is just for demonstrative purposes -- in practice you should think carefully about when or when not to filter your data.)

In [3]:
import os
import re
import subprocess
import pandas as pd

In [75]:
feature_table = pd.read_csv(
    "input/TCGA_LUSC_expression_feature_table.tsv",
    sep="\t",
    index_col=0,
)
print(feature_table.shape)
feature_table.head()

(60483, 149)


feature-id,TCGA-43-5670-11A,TCGA-77-8008-11A,TCGA-18-3410-01A,TCGA-43-6771-11A,TCGA-66-2758-01A,TCGA-90-7767-01A,TCGA-66-2795-01A,TCGA-77-7138-01A,TCGA-39-5019-01A,TCGA-90-6837-11A,...,TCGA-56-7823-01B,TCGA-77-7142-11A,TCGA-22-5483-11A,TCGA-77-8153-01A,TCGA-85-A4JC-01A,TCGA-39-5040-11A,TCGA-43-6143-11A,TCGA-77-7335-11A,TCGA-85-7698-01A,TCGA-43-2581-01A
ENSG00000000003.13,4707,1016,2286,1238,2982,2795,1516,5507,5504,2079,...,1988,1059,1297,1970,2450,1432,1500,1790,3722,2471
ENSG00000000005.5,5,1,1,5,1,0,0,0,2,3,...,1,5,4,0,0,4,3,0,1,0
ENSG00000000419.11,2019,1731,2123,2121,3868,2722,2130,4834,4557,1264,...,1194,1449,1382,747,1741,3167,1454,1569,2477,1826
ENSG00000000457.12,1062,968,1883,655,800,1987,935,898,1014,840,...,449,883,614,678,305,705,917,658,961,470
ENSG00000000460.15,204,202,1923,205,706,2191,956,927,1347,264,...,748,159,180,887,275,181,206,153,581,770


In [76]:
feature_table_filt = feature_table.loc[
    feature_table.sum(axis=1).sort_values(ascending=False).head(1000).index, :
]
print(feature_table_filt.shape)
feature_table_filt.head()

(1000, 149)


feature-id,TCGA-43-5670-11A,TCGA-77-8008-11A,TCGA-18-3410-01A,TCGA-43-6771-11A,TCGA-66-2758-01A,TCGA-90-7767-01A,TCGA-66-2795-01A,TCGA-77-7138-01A,TCGA-39-5019-01A,TCGA-90-6837-11A,...,TCGA-56-7823-01B,TCGA-77-7142-11A,TCGA-22-5483-11A,TCGA-77-8153-01A,TCGA-85-A4JC-01A,TCGA-39-5040-11A,TCGA-43-6143-11A,TCGA-77-7335-11A,TCGA-85-7698-01A,TCGA-43-2581-01A
ENSG00000168484.11,972916,843573,223,1850407,1139,135,488,2352,1482,1815346,...,386,846652,1635957,42,6,1734787,2453372,1001158,699,490
ENSG00000185303.14,975800,887399,10662,1711785,29460,458,3623,4214,178301,686350,...,501,860576,2331898,1074,8,1001212,1513129,967935,2460,7480
ENSG00000198804.2,550882,567646,391264,765533,222015,88019,119269,206227,522146,380501,...,141559,449164,439839,252347,252814,353950,881008,166618,243607,458559
ENSG00000122852.13,786296,663972,8201,1509224,15981,1544,2141,3157,151564,752807,...,432,663762,2079926,522,16,1166295,1096501,906560,780,7326
ENSG00000198886.2,315275,499467,179894,626067,194809,115463,104261,279515,396785,404277,...,262923,442133,429011,309030,218573,241436,696461,131715,252738,276778


We also want to strip everything after the period in the Ensembl IDs. For example, `ENSG00000000003.13` should be converted to `ENSG00000000003`.

In [77]:
feature_table_filt.index = [re.search("ENSG[0-9]*", x).group() for x in feature_table_filt.index]
feature_table_filt.index.name = "feature-id"
feature_table_filt.head()

feature-id,TCGA-43-5670-11A,TCGA-77-8008-11A,TCGA-18-3410-01A,TCGA-43-6771-11A,TCGA-66-2758-01A,TCGA-90-7767-01A,TCGA-66-2795-01A,TCGA-77-7138-01A,TCGA-39-5019-01A,TCGA-90-6837-11A,...,TCGA-56-7823-01B,TCGA-77-7142-11A,TCGA-22-5483-11A,TCGA-77-8153-01A,TCGA-85-A4JC-01A,TCGA-39-5040-11A,TCGA-43-6143-11A,TCGA-77-7335-11A,TCGA-85-7698-01A,TCGA-43-2581-01A
ENSG00000168484,972916,843573,223,1850407,1139,135,488,2352,1482,1815346,...,386,846652,1635957,42,6,1734787,2453372,1001158,699,490
ENSG00000185303,975800,887399,10662,1711785,29460,458,3623,4214,178301,686350,...,501,860576,2331898,1074,8,1001212,1513129,967935,2460,7480
ENSG00000198804,550882,567646,391264,765533,222015,88019,119269,206227,522146,380501,...,141559,449164,439839,252347,252814,353950,881008,166618,243607,458559
ENSG00000122852,786296,663972,8201,1509224,15981,1544,2141,3157,151564,752807,...,432,663762,2079926,522,16,1166295,1096501,906560,780,7326
ENSG00000198886,315275,499467,179894,626067,194809,115463,104261,279515,396785,404277,...,262923,442133,429011,309030,218573,241436,696461,131715,252738,276778
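
As a cross-check on the regular expression above, the same conversion can be sketched with plain string splitting. The helper name `strip_ensembl_version` is hypothetical, and the sketch assumes each ID carries at most one `.version` suffix.

```python
def strip_ensembl_version(gene_id):
    """Return the Ensembl gene ID without its trailing version number."""
    # partition() splits on the first period and leaves IDs without one unchanged
    stable_id, _, _version = gene_id.partition(".")
    return stable_id


versioned_ids = ["ENSG00000000003.13", "ENSG00000168484.11", "ENSG00000198804"]
stripped_ids = [strip_ensembl_version(x) for x in versioned_ids]
print(stripped_ids)
```

Either approach should leave IDs that already lack a version untouched.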


There are 149 samples but only 135 cases, because some patients have both their primary tumor and normal samples present in the feature table. To better facilitate comparisons, we're going to only keep the normal samples for those patients who have both. The last three characters of the barcode represent the sample type: `01A` or `01B` corresponds to a tumor sample, while `11A` corresponds to solid tissue normal.

In [78]:
from collections import Counter

all_cases = [re.search("TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4}", x).group() for x in feature_table_filt.columns]
duplicated_cases = [barcode for barcode, count in Counter(all_cases).items() if count > 1]
len(duplicated_cases)

14

In [79]:
samples_to_remove = []
for col in feature_table_filt:
    case, sample = re.search("(TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4})-([0-1]{2}[AB])", col).groups()
    if case in duplicated_cases and sample.startswith("01"):
        samples_to_remove.append(col)
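
The barcode convention described above can also be wrapped in a small helper. This is a minimal sketch: the name `parse_barcode` is hypothetical, and only the two sample-type codes that appear in this dataset (`01` and `11`) are mapped to labels.

```python
import re

# Case barcode (TCGA-XX-YYYY), then a two-digit sample-type code and a vial letter
BARCODE_PATTERN = re.compile(r"^(TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4})-(\d{2})([A-Z])$")

# Only the two sample-type codes used in this tutorial are mapped here
SAMPLE_TYPES = {"01": "Primary Tumor", "11": "Solid Tissue Normal"}


def parse_barcode(barcode):
    """Split a sample barcode into (case barcode, sample type label)."""
    match = BARCODE_PATTERN.match(barcode)
    if match is None:
        raise ValueError("Unrecognized barcode: {}".format(barcode))
    case, type_code, _vial = match.groups()
    return case, SAMPLE_TYPES.get(type_code, "Unknown")


print(parse_barcode("TCGA-43-5670-11A"))
print(parse_barcode("TCGA-56-7823-01B"))
```

Raising on unrecognized barcodes, rather than silently skipping them, makes it easier to notice columns that do not follow the expected pattern.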

Once these tumor samples are dropped, each patient has only one sample represented in the feature table, so we can use case barcodes (TCGA-XX-YYYY) instead of sample barcodes (TCGA-XX-YYYY-ZZZ). This will allow us to more easily compare across our metadata conditions.

In [80]:
feature_table_filt = feature_table_filt.drop(columns=samples_to_remove)
feature_table_filt.columns = [
    re.search("TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4}", x).group() for x in feature_table_filt.columns
]
print(feature_table_filt.shape)
feature_table_filt.head()

(1000, 135)


feature-id,TCGA-43-5670,TCGA-77-8008,TCGA-18-3410,TCGA-43-6771,TCGA-66-2758,TCGA-66-2795,TCGA-39-5019,TCGA-90-6837,TCGA-77-A5GB,TCGA-52-7810,...,TCGA-85-8582,TCGA-77-7142,TCGA-22-5483,TCGA-77-8153,TCGA-85-A4JC,TCGA-39-5040,TCGA-43-6143,TCGA-77-7335,TCGA-85-7698,TCGA-43-2581
ENSG00000168484,972916,843573,223,1850407,1139,488,1482,1815346,4,68,...,156,846652,1635957,42,6,1734787,2453372,1001158,699,490
ENSG00000185303,975800,887399,10662,1711785,29460,3623,178301,686350,451,1731,...,236,860576,2331898,1074,8,1001212,1513129,967935,2460,7480
ENSG00000198804,550882,567646,391264,765533,222015,119269,522146,380501,163642,321512,...,198960,449164,439839,252347,252814,353950,881008,166618,243607,458559
ENSG00000122852,786296,663972,8201,1509224,15981,2141,151564,752807,265,947,...,168,663762,2079926,522,16,1166295,1096501,906560,780,7326
ENSG00000198886,315275,499467,179894,626067,194809,104261,396785,404277,172102,414434,...,174529,442133,429011,309030,218573,241436,696461,131715,252738,276778


In [81]:
feature_table_filt.to_csv(
    "output/TCGA_LUSC_expression_feature_table_filt.tsv",
    sep="\t",
    index=True,
)

## 2. Merging the sample sheet and some clinical/exposure information into a "sample metadata" file

We want to include several fields in the sample metadata:

1. Sample type (tumor or normal)
2. Race
3. Age at diagnosis
4. Gender
5. Cigarettes per day
6. Years smoked

Of course, there are certainly other sample metadata categories that could be worth including in the visualization -- again, this is just an example.

In [82]:
sample_sheet = pd.read_csv("input/gdc_sample_sheet.2020-02-13.tsv", sep="\t")
sample_sheet.head()

Unnamed: 0,File ID,File Name,Data Category,Data Type,Project ID,Case ID,Sample ID,Sample Type
0,e4c62f17-d1e8-4543-9b7e-daa2b68306e0,bc5be208-5934-40dd-81df-567599ea2a51.htseq.cou...,Transcriptome Profiling,Gene Expression Quantification,TCGA-LUSC,TCGA-33-6737,TCGA-33-6737-01A,Primary Tumor
1,220a03f7-7ab6-4233-8f65-7ac5decca4b9,60040f95-8414-4956-bd8a-ec461a49207c.htseq.cou...,Transcriptome Profiling,Gene Expression Quantification,TCGA-LUSC,TCGA-18-3410,TCGA-18-3410-01A,Primary Tumor
2,8894c42e-ce65-4088-88e3-921ce7165261,950e2ba0-a247-4bd6-8092-f97cc4018a79.htseq.cou...,Transcriptome Profiling,Gene Expression Quantification,TCGA-LUSC,TCGA-33-A4WN,TCGA-33-A4WN-01A,Primary Tumor
3,daa44ce1-1671-46b9-aa48-2f4155f0ee49,a998a5b1-397d-4497-a58c-9b9e1c7f491e.htseq.cou...,Transcriptome Profiling,Gene Expression Quantification,TCGA-LUSC,TCGA-56-7579,TCGA-56-7579-01A,Primary Tumor
4,ef056c34-c2b9-47dd-afbf-ed81fc16dc74,840bb854-0669-485e-9d83-c4e1e4f10626.htseq.cou...,Transcriptome Profiling,Gene Expression Quantification,TCGA-LUSC,TCGA-34-5236,TCGA-34-5236-01A,Primary Tumor


In [83]:
sample_sheet_new = sample_sheet.set_index("Case ID", drop=True)
sample_sheet_new = sample_sheet_new[["Sample Type"]]
sample_sheet_new.head()

Case ID,Sample Type
TCGA-33-6737,Primary Tumor
TCGA-18-3410,Primary Tumor
TCGA-33-A4WN,Primary Tumor
TCGA-56-7579,Primary Tumor
TCGA-34-5236,Primary Tumor


In [84]:
clinical = pd.read_csv(
    "input/clinical.tsv", 
    sep="\t", 
    na_values=["--", "not reported"],
    index_col="submitter_id"
)
clinical = clinical[["age_at_diagnosis", "race", "gender"]]
clinical.head()

submitter_id,age_at_diagnosis,race,gender
TCGA-43-5670,25652.0,white,male
TCGA-43-5670,25652.0,white,male
TCGA-66-2800,25810.0,,male
TCGA-66-2800,25810.0,,male
TCGA-66-2788,20637.0,,male


In [85]:
exposure = pd.read_csv(
    "input/exposure.tsv",
    sep="\t",
    na_values=["--", "Not Reported"],
    index_col="submitter_id"
)
exposure = exposure[["cigarettes_per_day", "years_smoked"]]
exposure.head()

submitter_id,cigarettes_per_day,years_smoked
TCGA-43-5670,1.643836,10.0
TCGA-66-2800,4.109589,50.0
TCGA-66-2788,4.383562,40.0
TCGA-77-7338,2.260274,
TCGA-56-7222,1.260274,


In [86]:
sample_sheet_clinical_exposure = sample_sheet_new.join(clinical).join(exposure)
sample_sheet_clinical_exposure.index.name = "Sample ID"
sample_sheet_clinical_exposure = sample_sheet_clinical_exposure.loc[
    ~sample_sheet_clinical_exposure.index.duplicated(keep="first"), :
]
print(sample_sheet_clinical_exposure.shape)
sample_sheet_clinical_exposure.head()

(135, 6)


Sample ID,Sample Type,age_at_diagnosis,race,gender,cigarettes_per_day,years_smoked
TCGA-18-3410,Primary Tumor,29827.0,,male,,
TCGA-18-3411,Primary Tumor,23370.0,,female,2.739726,
TCGA-18-3416,Primary Tumor,30435.0,,male,2.191781,
TCGA-18-4086,Primary Tumor,23731.0,,male,1.643836,
TCGA-18-5595,Primary Tumor,18611.0,,male,,
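
One thing that may be worth double-checking here: `keep="first"` retains whichever row of a duplicated case happens to come first, which is not guaranteed to be the `Solid Tissue Normal` row that was kept in the feature table. A minimal sketch of a more explicit approach is shown below. It uses toy data with placeholder case IDs, and the helper name `prefer_normal` is hypothetical.

```python
import pandas as pd


def prefer_normal(metadata, type_column="Sample Type"):
    """Keep one row per index value, preferring Solid Tissue Normal rows."""
    flat = metadata.reset_index()
    id_column = flat.columns[0]
    flat["_is_tumor"] = flat[type_column] != "Solid Tissue Normal"
    # mergesort is stable: normal rows (False) move ahead of tumor rows (True)
    flat = flat.sort_values("_is_tumor", kind="mergesort")
    flat = flat.drop_duplicates(subset=id_column, keep="first")
    return flat.drop(columns="_is_tumor").set_index(id_column).sort_index()


# Toy metadata: case-A has both a tumor and a normal row, case-B only a tumor row
toy = pd.DataFrame(
    {"Sample Type": ["Primary Tumor", "Solid Tissue Normal", "Primary Tumor"]},
    index=pd.Index(["case-A", "case-A", "case-B"], name="Sample ID"),
)
deduplicated = prefer_normal(toy)
print(deduplicated)
```

With this ordering, a case that has both sample types should always end up labeled with the normal row, matching the choice made for the feature table.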


In [87]:
sample_sheet_clinical_exposure.to_csv("output/sample_metadata.tsv", sep="\t", index=True)

## 3. Running ALDEx2 on the count data

[ALDEx2](https://bioconductor.org/packages/release/bioc/html/ALDEx2.html) is a tool used for determining differentially abundant features from a compositional dataset.

For the sake of demonstration, we're going to compare the tumor samples to the normal samples (with the goal of finding differentially abundant features between these sample "types"). Of course, the way you run ALDEx2 may be more complicated if there are multiple sample metadata fields you'd like to include.

We have provided a script, `run_aldex.R`, to run ALDEx2 on our processed feature table. Note that this step can take some time, depending on the size of the input dataset.

[2] Fernandes, A. D., Reid, J. N., Macklaim, J. M., McMurrough, T. A., Edgell, D. R., & Gloor, G. B. (2014). Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. *Microbiome, 2*(1), 15.

In [88]:
!Rscript run_aldex.R

[1] "aldex.clr: generating Monte-Carlo instances and clr values"
[1] "operating in serial mode"
[1] "removed rows with sums equal to zero"
[1] "computing center with all features"
[1] "data format is OK"
[1] "dirichlet samples complete"
[1] "clr transformation complete"
[1] "aldex.ttest: doing t-test"
[1] "running tests for each MC instance:"
|------------(25%)----------(50%)----------(75%)----------|
[1] "aldex.effect: calculating effect sizes"
[1] "operating in serial mode"
[1] "sanity check complete"
[1] "rab.all  complete"
[1] "rab.win  complete"
[1] "rab of samples complete"
[1] "within sample difference calculated"
[1] "between group difference calculated"
[1] "group summaries calculated"
[1] "effect size calculated"
[1] "summarizing output"
[1] "ALDEx2 results written to output/TCGA_LUSC_aldex_results.tsv"


In [89]:
assert os.path.exists("output/TCGA_LUSC_aldex_results.tsv")

## 4. Converting the feature table from TSV to BIOM

Here, we're going to convert the filtered feature table from a tab-separated file (TSV) to a BIOM-formatted file. We need to do this because Qurro requires that the input table is in the BIOM format.

### 4.1. About the BIOM format
The BIOM format is a commonly used file format for storing and representing these sorts of feature tables. It's especially good at storing sparse datasets, that is, datasets containing a lot of zeroes. (This is useful for datasets obtained from metagenomic or marker gene sequencing, which are usually very sparse.)

[3] McDonald, D., Clemente, J. C., Kuczynski, J., Rideout, J. R., Stombaugh, J., Wendel, D., ... & Knight, R. (2012). The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome. *Gigascience, 1*(1), 2047-217X.

In [90]:
!biom convert \
    -i output/TCGA_LUSC_expression_feature_table_filt.tsv \
    -o output/TCGA_LUSC_expression_feature_table_filt.biom \
    --table-type="Gene table" \
    --to-hdf5

## 5. Optional: Mapping Ensembl gene identifiers to HGNC in order to get feature metadata

We might want to know which Ensembl features map to which HUGO features. Having this sort of information available in the Qurro visualization can be useful for a few purposes -- examples include checking the name of a gene that looks particularly interesting, or searching (using Qurro's filtering tools) for only genes that have a `gene_biotype` including the phrase `protein_coding`.

**However, feature metadata isn't required to run Qurro -- if you'd like, you're welcome to skip this section and go straight to the "Running Qurro" section.**

### 5.1. Downloading and unzipping the GTF file
First, we'll download the `GRCh38` GTF file. It's compressed (notice the `.gtf.gz` filetype), so we'll uncompress the file using `gunzip`.

In [1]:
!wget -P input/ ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.gtf.gz
!gunzip input/Homo_sapiens.GRCh38.99.gtf.gz

--2020-02-28 22:39:53--  ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.gtf.gz
           => 'input/Homo_sapiens.GRCh38.99.gtf.gz'
Resolving ftp.ensembl.org... 193.62.193.8
Connecting to ftp.ensembl.org|193.62.193.8|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-99/gtf/homo_sapiens ... done.
==> SIZE Homo_sapiens.GRCh38.99.gtf.gz ... 46905912
==> PASV ... done.    ==> RETR Homo_sapiens.GRCh38.99.gtf.gz ... done.
Length: 46905912 (45M) (unauthoritative)


2020-02-28 22:40:31 (1.29 MB/s) - 'input/Homo_sapiens.GRCh38.99.gtf.gz' saved [46905912]



### 5.2. Convert the GTF file into a simpler format
Next, we'll create a feature metadata file mapping the Ensembl ID to the gene name. Qurro expects that its sample / feature metadata files are just simple TSV files, where the first column is the sample or feature ID and other columns correspond to metadata fields.
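
To make that layout concrete, here is a minimal sketch that writes a tiny metadata table in the shape described above (ID column first, metadata fields after it). The feature IDs, gene names, and biotypes are placeholders invented for illustration, not real annotations.

```python
import csv
import io

# Toy feature metadata: the first column is the feature ID,
# the remaining columns are metadata fields (all values are placeholders)
rows = [
    {"feature-id": "ENSG_EXAMPLE_1", "gene_name": "GENE1", "gene_biotype": "protein_coding"},
    {"feature-id": "ENSG_EXAMPLE_2", "gene_name": "GENE2", "gene_biotype": "lncRNA"},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["feature-id", "gene_name", "gene_biotype"],
    delimiter="\t",
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)

metadata_tsv = buffer.getvalue()
print(metadata_tsv)
```

Writing to an in-memory buffer keeps the sketch free of side effects; swapping `buffer` for an open file handle would produce an actual TSV file.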

To do the conversion, we're going to use the [`gtfparse`](https://github.com/openvax/gtfparse) Python library.

#### 5.2.1. Read the GTF file into a pandas DataFrame using `gtfparse`
This may take a few minutes, since the uncompressed GTF file is pretty big (about 1.2 GB).

In [4]:
from gtfparse import read_gtf

gtf = read_gtf(os.path.join("input", "Homo_sapiens.GRCh38.99.gtf"))

INFO:root:Extracted GTF attributes: ['gene_id', 'gene_version', 'gene_name', 'gene_source', 'gene_biotype', 'transcript_id', 'transcript_version', 'transcript_name', 'transcript_source', 'transcript_biotype', 'tag', 'transcript_support_level', 'exon_number', 'exon_id', 'exon_version', 'protein_id', 'protein_version', 'ccds_id']


#### 5.2.2. Filter the DataFrame's rows

1. Only include rows where `source` ends with `havana` (e.g. the source is `havana` or `ensembl_havana`)
2. Only include rows where `feature` is `gene`

In [5]:
print("GTF file contains {} rows.".format(len(gtf.index)))

# Filter to just havana or ensembl_havana rows
# https://stackoverflow.com/a/12098586/10730311
gtf = gtf[gtf["source"].str.endswith("havana")]

# Filter to just rows where the feature is listed as gene
gtf = gtf[gtf["feature"] == "gene"]

print("Filtered GTF file contains {} rows.".format(len(gtf.index)))

GTF file contains 2905054 rows.
Filtered GTF file contains 51330 rows.


#### 5.2.3. Adjust the DataFrame's columns
Set `gene_id` as the first (index) column, and only include `gene_name` and `gene_biotype` columns. (There's no reason we can't include more columns; these are just two columns likely to be useful for exploratory purposes.)

In [6]:
gtf.set_index("gene_id", inplace=True, verify_integrity=True)
gtf = gtf.filter(items=["gene_name", "gene_biotype"], axis="columns")

#### 5.2.4. Output the DataFrame to a properly-formatted feature metadata file

Now that we have everything sorted out (rows and columns have been filtered, the first column corresponds to the gene IDs) we can output this to a TSV file!

In [7]:
gtf.to_csv(os.path.join("output", "gene_metadata.tsv"), sep="\t")

## 6. Running Qurro

Now that we have everything ready, we can run Qurro!

Note that if you skipped section 5 above (i.e. you didn't generate an `output/gene_metadata.tsv` file) you'll need to omit the `--feature-metadata output/gene_metadata.tsv \` line below.

In [8]:
!qurro \
    --ranks output/TCGA_LUSC_aldex_results.tsv \
    --table output/TCGA_LUSC_expression_feature_table_filt.biom \
    --sample-metadata output/sample_metadata.tsv \
    --feature-metadata output/gene_metadata.tsv \
    --output-dir output/qurro


In [9]:
assert os.path.exists("output/qurro/index.html")
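
If you'd like the command to adapt automatically to whether section 5 was run, one option is to assemble it in Python. The sketch below is only an illustration: the helper name `build_qurro_command` is hypothetical, and it uses just the command-line flags already shown above.

```python
import os


def build_qurro_command(output_dir="output"):
    """Assemble the qurro command, adding feature metadata only if it exists."""
    command = [
        "qurro",
        "--ranks", os.path.join(output_dir, "TCGA_LUSC_aldex_results.tsv"),
        "--table", os.path.join(output_dir, "TCGA_LUSC_expression_feature_table_filt.biom"),
        "--sample-metadata", os.path.join(output_dir, "sample_metadata.tsv"),
    ]
    feature_metadata = os.path.join(output_dir, "gene_metadata.tsv")
    if os.path.exists(feature_metadata):
        # Section 5 was run, so pass the gene annotations along
        command += ["--feature-metadata", feature_metadata]
    command += ["--output-dir", os.path.join(output_dir, "qurro")]
    return command


print(" ".join(build_qurro_command()))
```

The resulting list can be handed to `subprocess.run(command, check=True)` in place of the shell cell above.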

### 6.1. Interacting with the Qurro visualization
Open the `output/qurro/index.html` page in the web browser of your choice (Firefox / Chrome recommended, but any modern browser should be ok). From here you can explore all the options Qurro has to offer!

The remainder of this section outlines one simple analysis you could do at this point.

#### 6.1.1. Selecting a differential field of interest
As an example, we're going to navigate to the `Differential` selection menu and select `rab:win:Primary:Tumor`. This column contains the median CLR values for each feature for the `Primary Tumor` samples. (See [the ALDEx2 documentation here](https://bioconductor.org/packages/devel/bioc/vignettes/ALDEx2/inst/doc/ALDEx2_vignette.pdf) for details on what each of the `Differential` fields output by ALDEx2 means -- this information is useful in guiding how you interpret these rankings. Note that `rab:win` is listed in the ALDEx2 documentation as `rab.win` -- periods have been replaced with colons due to [a technical issue](https://github.com/biocore/qurro#temporary-caveat).)

Notice how when you change the selected `Differential`, the y-axis values and x-axis orderings of each feature in the rank plot will be updated accordingly.

You may also want to check the box that says `Fit bar widths to a constant plot width?`, which will squish the rank plot so that it takes up less horizontal space on the screen.
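
To make "CLR values" a bit more concrete: the centered log-ratio transform takes the log of each count in a sample and subtracts the mean log across all features in that sample. The sketch below shows only the bare transform on toy counts. It adds a pseudocount of 0.5 (an assumption made here to avoid taking the log of zero), uses base 2 logs, and does not reproduce ALDEx2's Monte Carlo sampling, so its numbers will not match the `rab:win` columns.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's counts (log base 2)."""
    logs = [math.log2(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log centers the values so they sum to zero
    return [value - mean_log for value in logs]


toy_counts = [972916, 223, 1139, 0]
clr_values = clr(toy_counts)
print([round(v, 2) for v in clr_values])
```

Features with positive CLR values are more abundant than the sample's geometric mean, and features with negative values are less abundant.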

#### 6.1.2. Auto-selecting extremely ranked features

Enter `5` in the `Autoselecting Features` section. The autoselection tools in Qurro let you easily take the log-ratio of very high and very low ranked features -- this is useful when we'd expect the selected ranking to do a good job "distinguishing" features based on their association with certain sample groups. As you might imagine, this is probably the case with `rab:win:Primary:Tumor`!

Click `Apply` and you should see the rank plot on the top left highlight the selected features on the rankings. The numerator features have been colored red, and the denominator features have been colored blue. (...We didn't realize until a while into developing this that a lot of these plots look like the [Tricolour](https://en.wikipedia.org/wiki/Flag_of_France).)

Anyway, selecting a log-ratio also updated the sample plot (on the top right of the screen). Let's play around with it!
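
The idea behind autoselection can be sketched in a few lines: sort the features by the chosen ranking, then take a fixed percentage from each end. This is only an illustration. The helper name `autoselect` and the rounding rule (round down, but keep at least one feature per side) are assumptions made here, not a description of Qurro's internals.

```python
def autoselect(rankings, percent):
    """Return (numerator, denominator) feature IDs from the top and bottom of a ranking."""
    # Sort feature IDs from the lowest to the highest rank value
    ordered = sorted(rankings, key=rankings.get)
    n_selected = max(1, int(len(ordered) * percent / 100))
    denominator = ordered[:n_selected]   # lowest-ranked features
    numerator = ordered[-n_selected:]    # highest-ranked features
    return numerator, denominator


# Toy ranking: 40 made-up features whose rank value equals their number
toy_rankings = {"gene{}".format(i): float(i) for i in range(1, 41)}
numerator, denominator = autoselect(toy_rankings, percent=5)
print(numerator, denominator)
```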

![](imgs/aldex_screenshot_annotated.png)

#### 6.1.3. Adjusting the sample plot
We can use the controls below the sample plot to relate the currently selected log-ratio to our sample metadata -- in this case, how does this log-ratio look in relation to the tumor versus non-tumor samples?

To investigate that question, change the `x-axis` field to `Sample Type` and check the box that says `Use boxplots for categorical data?`. You should see a boxplot appear that shows a clear separation between `Primary Tumor` and `Solid Tissue Normal`, which makes sense!

#### 6.1.4. Interpreting log-ratios across groups

Notice how the log-ratio values for the `Primary Tumor` samples seem generally higher than those for the `Solid Tissue Normal` samples. We selected the log-ratio of the 5% highest and 5% lowest ranked features based on the `rab:win:Primary:Tumor` field -- therefore, we know that the numerator features in our log-ratio should be somewhat more frequent in `Primary Tumor` samples, and the denominator features should be somewhat less frequent in `Primary Tumor` samples.

Hopefully it should make sense, then, that this log-ratio is generally higher in `Primary Tumor` samples.
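
To see why, it can help to compute such a log-ratio by hand. The sketch below assumes that a sample's log-ratio is the natural log of its summed numerator counts over its summed denominator counts. The counts are made up, and the feature names and the helper `log_ratio` are hypothetical.

```python
import math


def log_ratio(sample_counts, numerator, denominator):
    """Natural log of summed numerator counts over summed denominator counts."""
    top = sum(sample_counts[f] for f in numerator)
    bottom = sum(sample_counts[f] for f in denominator)
    if top == 0 or bottom == 0:
        # The log-ratio is undefined when either side has no counts
        return None
    return math.log(top / bottom)


numerator = ["up_in_tumor_1", "up_in_tumor_2"]
denominator = ["down_in_tumor_1", "down_in_tumor_2"]

# Made-up counts for one tumor-like and one normal-like sample
tumor_like = {"up_in_tumor_1": 900, "up_in_tumor_2": 700,
              "down_in_tumor_1": 30, "down_in_tumor_2": 10}
normal_like = {"up_in_tumor_1": 40, "up_in_tumor_2": 60,
               "down_in_tumor_1": 800, "down_in_tumor_2": 1200}

print(log_ratio(tumor_like, numerator, denominator))
print(log_ratio(normal_like, numerator, denominator))
```

A sample dominated by the numerator features gets a positive log-ratio, and one dominated by the denominator features gets a negative one, which is the pattern the boxplot shows across sample types.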

As an exercise for the reader, try setting the rank plot to use the `rab:win:Solid:Tissue:Normal` field and re-applying the autoselection. How does the sample plot look now?

### 6.2. Finishing up
That's as far as this tutorial goes for now, but we encourage you to try out some of the other things you can do in Qurro! For example, try selecting a different field in the sample plot's x-axis, or searching for features by gene name.

As always, please feel free to [open an issue](https://github.com/biocore/qurro/issues) if you have any questions, comments, or suggestions.