In [11]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set()

from IPython.display import display
from ipywidgets import FloatProgress
import json
import sys
sys.path.insert(0, '../')



# Reverse Gene Expression Score for Python

This notebook drives a Python 3 reimplementation of the Reverse Gene Expression Score (RGES) test originally proposed by Chen et al.<sup>1</sup> It takes the following files as input:

**Phenotype Signature**

A tab-separated file with the following columns:
- ```entrezgene```: Entrez gene ID numbers
- ```log2FoldChange```: The log2 fold change for the upregulated genes.
- ```log2fc.y```: The **unsigned** log2 fold change for the downregulated genes.
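
To make the column layout concrete, the sketch below builds a toy signature in memory and parses it with pandas. The gene IDs and values are invented for illustration, and the toy file assumes that each gene fills only one of the two fold-change columns, which the real input may not do.

```python
import io

import pandas as pd

# Hypothetical two-gene signature: one upregulated, one downregulated.
# Empty fields are parsed as NaN.
toy_tsv = (
    "entrezgene\tlog2FoldChange\tlog2fc.y\n"
    "7157\t1.8\t\n"   # upregulated: value in log2FoldChange only
    "1956\t\t2.3\n"   # downregulated: unsigned value in log2fc.y only
)
toy_signature = pd.read_csv(io.StringIO(toy_tsv), sep="\t")
print(toy_signature)
```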

**Drug Profiles**

This file should be in [GCTX](https://clue.io/connectopedia/gctx_format) format. It should be the direct output of ranking a ```GCTX``` file of differential expression data with the [cmapR](https://github.com/cmap/cmapR) ```rank.gct``` method, which returns a matrix of the same size whose columns contain ranks instead of expression values.
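
For readers unfamiliar with rank matrices, the sketch below shows column-wise ranking on a toy pandas DataFrame. It is only an analogue of the idea, not cmapR's implementation: the descending rank direction, the tie-breaking rule, and all names and values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical differential expression matrix: rows are genes, columns are profiles.
toy_diffex = pd.DataFrame(
    {"sig_A": [2.1, -0.5, 0.3], "sig_B": [-1.2, 0.8, 0.0]},
    index=["7157", "1956", "672"],
)

# Rank each column independently; here the largest value gets rank 1 (an assumption).
toy_ranked = toy_diffex.rank(axis=0, ascending=False, method="first").astype(int)
print(toy_ranked)
```

The result has the same shape and labels as the input, with each column holding ranks rather than expression values.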

## Step 1: Import Pipeline Functions

The Python implementation of RGES has three main modules:
- ```DiffEx```: Represents and processes the phenotype signature data
- ```L1KGCT```: Represents and processes the ranked LINCS ```GCTX``` data
- ```Score```: Implements the scoring algorithm

In [4]:
from RGES.DiffEx import DiffEx
from RGES.L1KGCT import L1KGCTX
from RGES.Score import score

## Step 2: Load Data

The following code builds the Python representations of the phenotype signature and the LINCS drug profile data, then performs some preprocessing.

### Data File Paths

In [22]:
PHENOTYPE_SIGNATURE_PATH = "/home/jovyan/oncogxA/Alex/l1k/DEG_SC_5um_entrezgene.txt"
DRUG_PROFILE_PATH = "/home/jovyan/oncogxA/Alex/l1k/10x_ilincs_sigs_top500_ranked_n500x978.gctx"
#DRUG_PROFILE_PATH = "/home/jovyan/oncogxA/Alex/l1k/LINCS_FULL_GEO_RANKED/GSE70138_2017-03-06_landmarks_ranked_n118050x972.gctx"

### Phenotype Signature and LINCS Drug Profile Objects

In [23]:
DE = DiffEx(PHENOTYPE_SIGNATURE_PATH)
LINCS = L1KGCTX(DRUG_PROFILE_PATH) 

### Phenotype Signature Preprocessing

This step moves all of the fold change data into the same column and makes sure the downregulated genes have negative log fold change.

In [24]:
# Use the negated downregulated value when present; otherwise keep the upregulated value
merge_l2fc = lambda x: -1.0*x['log2fc.y'] if not np.isnan(x['log2fc.y']) else x['log2FoldChange']
DE.data['log2FoldChange'] = DE.data.apply(merge_l2fc, axis=1)
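
The same merge can be written without a row-wise ```apply```. The sketch below demonstrates the idea on a toy DataFrame with invented values; it assumes the two columns behave as in the notebook's input, with ```NaN``` in ```log2fc.y``` for genes that are not downregulated.

```python
import numpy as np
import pandas as pd

# Hypothetical signature rows: up, down, up.
toy = pd.DataFrame({
    "log2FoldChange": [1.8, np.nan, 0.7],
    "log2fc.y": [np.nan, 2.3, np.nan],
})

# Negate the unsigned downregulated values, then fall back to the
# upregulated column wherever the downregulated value is missing.
merged = (-toy["log2fc.y"]).fillna(toy["log2FoldChange"])
print(merged.tolist())
```

On large signatures the vectorized form avoids calling a Python function once per row.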

## Step 3: Compute Scores

The code below defines a function to iterate through the ```LINCS``` data file and compute an RGES score for each profile based on the ```DE``` data. Scores are stored in a ```dict``` object and written to the JSON file specified in ```OUTPATH```.

### Scores Output Path

In [25]:
OUTPATH = "LINCS_top500_scores.json"
#OUTPATH = "LINCS_FULL.json"

### Scoring Function

In [26]:
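def score_lincs():
    """Score every LINCS profile against DE and write the results to OUTPATH."""
    signames = list(LINCS.data)  # profile identifiers to score
    prog_bar = FloatProgress(min=0, max=len(signames))
    display(prog_bar)
    score_d = {}  # {drug_profile: score}
    for signame in signames:
        score_d[signame] = score(DE, LINCS, signame)
        prog_bar.value += 1
    # A context manager ensures the file is flushed and closed
    with open(OUTPATH, 'w') as outfile:
        json.dump(score_d, outfile)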

In [27]:
score_lincs()
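
Once the scores are written, they can be loaded back and sorted for inspection. The sketch below round-trips a toy score dictionary through a temporary JSON file instead of the real output; the profile names and values are invented. It sorts in ascending order on the assumption that, as in Chen et al., more negative scores indicate stronger reversal, which should be checked against this implementation.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical scores standing in for the contents of the real output file.
toy_scores = {"sig_A": -0.42, "sig_B": 0.10, "sig_C": -0.05}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_scores.json")
    with open(path, "w") as fh:
        json.dump(toy_scores, fh)
    with open(path) as fh:
        loaded = json.load(fh)

# Most negative score first.
ranked = pd.Series(loaded, name="RGES").sort_values()
print(ranked)
```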

1. Chen, B., Ma, L., Paik, H., Sirota, M., Wei, W., Chua, M.-S., So, S., and Butte, A.J. (2017). Reversal of cancer gene expression correlates with drug efficacy and reveals therapeutic targets. Nature Communications 8, 16022.