# PIC-SURE API use-case: quick analysis on COPDgene data

This tutorial notebook aims to get you quickly up and running with the Python PIC-SURE API. It covers the main functionalities of the API.

## PIC-SURE python API 
### What is PIC-SURE? 

<!--img src="./img/PIC-SURE_logo.png" width= "360px"> -->

As part of the BioData Catalyst initiative, the Patient Information Commons Standard Unification of Research Elements (PIC-SURE) platform has been integrating clinical and genomic datasets from multiple TOPMed and TOPMed related studies funded by the National Heart Lung and Blood Institute (NHLBI). 

The original data exposed through the PIC-SURE API are organized in highly heterogeneous ways. PIC-SURE hides this complexity and exposes the datasets of the different studies in a single tabular format. By easing the process of data extraction, it allows investigators to focus on downstream analyses and facilitates reproducible science.

Currently, only phenotypic variables are accessible through the PIC-SURE API, but access to genomic variables is coming soon.

### More about PIC-SURE
The API is available in two programming languages, Python and R, enabling investigators to query the databases the same way in either language.

PIC-SURE is a larger project of which the R/Python PIC-SURE API is only one building block. Among other things, PIC-SURE also offers a graphical user interface that allows researchers to explore variables across multiple studies, filter patients that match criteria, and create cohorts from this interactive exploration.

The python API is actively developed by the Avillach-Lab at Harvard Medical School.

PIC-SURE API GitHub repo:
* https://github.com/hms-dbmi/pic-sure-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-client



 -------   

# Getting your own user-specific security token

**Before running this notebook, please be sure to review the `get_your_token.ipynb` notebook. It explains how to get a security token, which is mandatory to access the databases.**

# Environment set-up

### Prerequisites
- python 3.6 or later
- pip python package manager, already available in most systems with a python interpreter installed ([pip installation instructions](https://pip.pypa.io/en/stable/installing/))
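
The version prerequisite above can be checked programmatically before installing anything. The following is a minimal sketch; the helper name `python_is_supported` is an illustration and not part of the PIC-SURE API.

```python
import sys

# Minimum interpreter version stated in the prerequisites above
MIN_VERSION = (3, 6)


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if not python_is_supported():
    print("This notebook expects Python %d.%d or later." % MIN_VERSION)
```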

### Packages installation

Install the packages listed in the `requirements.txt` file, as well as the two components of the PIC-SURE API from GitHub: the PIC-SURE adapter and the PIC-SURE client.

In [None]:
!cat requirements.txt

In [1]:
import sys
!{sys.executable} -m pip install -r requirements.txt

Collecting numpy==1.16.4
  Using cached numpy-1.16.4.zip (5.1 MB)
Collecting tqdm>=4.38.0
  Using cached tqdm-4.54.1-py2.py3-none-any.whl (69 kB)
Building wheels for collected packages: numpy
  Building wheel for numpy (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-u0bydm7v/numpy_9ff7f43bb233458f9c5e07d89f79b852/setup.py'"'"'; __file__='"'"'/tmp/pip-install-u0bydm7v/numpy_9ff7f43bb233458f9c5e07d89f79b852/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-m9xzwjpx
       cwd: /tmp/pip-install-u0bydm7v/numpy_9ff7f43bb233458f9c5e07d89f79b852/
  Complete output (332 lines):
  Running from numpy source directory.
    return is_string(s) and ('*' in s or '?' is s)
  blas_opt_info:
  blas_mkl_info:
  cus

  ERROR: Command errored out with exit status 1:
   command: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-u0bydm7v/numpy_9ff7f43bb233458f9c5e07d89f79b852/setup.py'"'"'; __file__='"'"'/tmp/pip-install-u0bydm7v/numpy_9ff7f43bb233458f9c5e07d89f79b852/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' clean --all
       cwd: /tmp/pip-install-u0bydm7v/numpy_9ff7f43bb233458f9c5e07d89f79b852
  Complete output (10 lines):
  Running from numpy source directory.
  
  `setup.py clean` is not supported, use one of the following instead:
  
    - `git clean -xdf` (cleans all files)
    - `git clean -Xdf` (cleans all versioned files, doesn't touch
                        files that aren't checked into the git repo)
  
  Add `--force` to your command to use it anyway if you must (unsupported).
  
  ---------------------------

In [2]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git

Collecting git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
  Cloning https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git to /tmp/pip-req-build-pz266pbf
Collecting httplib2
  Using cached httplib2-0.18.1-py3-none-any.whl (95 kB)
Building wheels for collected packages: PicSureHpdsLib
  Building wheel for PicSureHpdsLib (setup.py) ... done
  Created wheel for PicSureHpdsLib: filename=PicSureHpdsLib-0.9.0-py2.py3-none-any.whl size=21879 sha256=83afb0891538673037c26f0aa052646a0a464e94280eef92b826180f38642037
  Stored in directory: /tmp/pip-ephem-wheel-cache-zfsj48c3/wheels/e8/35/43/484d5d574661fc4a2c5b083551bc3c7254695764ed17ce397e
Successfully built PicSureHpdsLib
Installing collected packages: httplib2, PicSureHpdsLib
  Attempting uninstall: httplib2
    Found existing installation: httplib2 0.18.1
    Uninstalling httplib2-0.18.1:
      Successfully uninstalled httplib2-0.18.1
  Attempting uninstall: PicSureHpdsLib
    Found existing installation: Pi

Import all the external dependencies, as well as user-defined functions stored in the `python_lib` folder

In [4]:
import json
from pprint import pprint

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from scipy import stats

import PicSureHpdsLib
import PicSureClient

from utils import get_multiIndex_variablesDict, joining_variablesDict_onCol

##### Setting the display parameter for tables and plots

In [5]:
# Pandas DataFrame display options
pd.set_option("display.max_rows", 100)

# Matplotlib display parameters
plt.rcParams["figure.figsize"] = (14,8)
font = {'weight' : 'bold',
        'size'   : 12}
plt.rc('font', **font)

## Connecting to a PIC-SURE resource

Several pieces of information are required to get access to data through the PIC-SURE API: a network URL, a resource ID, and a user-specific security token.

In [6]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file = "token.txt"

In [7]:
with open(token_file, "r") as f:
    my_token = f.read()
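
Token files often end with a trailing newline, which may end up inside the token string. The sketch below strips surrounding whitespace and fails early when the file is empty. The helper `read_token` is hypothetical and not part of the PIC-SURE libraries.

```python
from pathlib import Path


def read_token(path):
    """Return the token stored in `path`, without surrounding whitespace."""
    token = Path(path).read_text().strip()
    if not token:
        # An empty file cannot contain a valid token, so stop right away
        raise ValueError(f"No token found in {path!r}")
    return token
```

With this helper, the cell above could be written as `my_token = read_token(token_file)`.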

In [8]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token, True)
adapter = PicSureHpdsLib.Adapter(connection)
resource = adapter.useResource(resource_id)


|        certificates to be acceptable for connections.  This may be useful for           |
|        working in a development environment or on systems that host public              |
|        data.  BEST SECURITY PRACTICES ARE THAT IF YOU ARE WORKING WITH SENSITIVE        |
|        DATA THEN ALL SSL CERTS BY THOSE EVIRONMENTS SHOULD NOT BE SELF-SIGNED.          |
+--------------------------------------+------------------------------------------------------
|  Resource UUID                       |  Resource Name                                  
+--------------------------------------+------------------------------------------------------
| ERROR:                             
|     User is not authorized. [Token invalid or expired]
+--------------------------------------+------------------------------------------------------


KeyError: 'Resource UUID "02e23f52-f354-4e8b-992c-d37c8b9ba140" was not found!'

Two objects are created here: a `connection` and a `resource` object.

As we will only be using one single resource, **the `resource` object is actually the only one we will need to proceed with data analysis hereafter**. 

It is connected to the specific data source ID we specified, and enables us to query and retrieve data from this database.

## Getting help with the PIC-SURE API

Each object exposed by the PicSureHpdsLib library has a `help()` method. Calling it without parameters prints out information about the functionalities of this object.

In [None]:
resource.help()

# NOTE: the statements below use `variablesDict` and `my_query`, which are only
# created later in this notebook; run them after those cells.

# Masks on the multiIndex levels, then the matching full variable names
mask_disease = variablesDict.index.get_level_values(0) == "disease"
mask_age = variablesDict.index.get_level_values(1) == "AGE"
age_var = variablesDict.loc[mask_age, "varName"]
disease_var = variablesDict.loc[mask_disease, "varName"]

# Getting variable names to filter query on
mask_ever_asthma = variablesDict["simplified_varName"] == "Ever asthma?"
mask_current_asthma = variablesDict["simplified_varName"] == "Current asthma?"
mask_lung_cancer = variablesDict["simplified_varName"] == "lung_cancer_self_report"
lung_cancer = variablesDict.loc[mask_lung_cancer, "varName"]
ever_asthma = variablesDict.loc[mask_ever_asthma, "varName"]
current_asthma = variablesDict.loc[mask_current_asthma, "varName"]

# Category values of the lung cancer variable (assumption: inspect this list
# and keep only the categories you actually want to filter on)
values_lung_cancer = variablesDict.loc[mask_lung_cancer, "categoryValues"].iloc[0]

my_query.filter().add(lung_cancer, values=values_lung_cancer)
my_query.select().add(list(ever_asthma) + list(current_asthma) + list(age_var))

For instance, this output tells us that this `resource` object has 3 methods, and it gives a quick definition of those methods. 

## Using the *variables dictionary*

Once a connection to the desired resource has been established, we first need to find out which variables are available in the database. To this end, we will use the `dictionary` method of the `resource` object.

A `dictionary` instance lets us retrieve matching records by searching for a specific term, or retrieve information about all the available variables, using the `find()` method. For instance, looking for variables containing the term `COPD` in their names is done this way:

In [11]:
dictionary = resource.dictionary()
dictionary_search = dictionary.find("COPD")

NameError: name 'resource' is not defined

Subsequently, objects created by the `dictionary.find` method expose the search results via 4 different methods: `.count()`, `.keys()`, `.entries()`, and `.DataFrame()`. 

In [None]:
pprint({"Count": dictionary_search.count(), 
        "Keys": dictionary_search.keys()[0:5],
        "Entries": dictionary_search.entries()[0:5]})

In [None]:
dictionary_search.DataFrame().head()

**The `.DataFrame()` method returns the result of the dictionary search as a pandas DataFrame. This makes it possible to:**


* Use the various information exposed in the dictionary (patient count, variable type ...) as criteria for variable selection.
* Use the row names of the DataFrame to get the actual variables names, to be used in the query, as shown below.

Variable names aren't very practical to use right away, for two reasons:
1. They are very long.
2. They contain backslashes, which makes copy-pasting error-prone.

However, retrieving the dictionary search result as a DataFrame helps to deal with this, as shown below:

In [None]:
plain_variablesDict = resource.dictionary().find().DataFrame()
plain_variablesDict.shape

Indeed, calling the `dictionary.find()` function without arguments returns every entry, as shown in the help documentation.

In [None]:
resource.dictionary().help()

In [None]:
plain_variablesDict.iloc[10:20,:]

The dictionary currently returned by the API provides information about the variables, such as:
- observationCount: number of entries with non-null value
- categorical: type of the variables, True if strings, False if numerical
- min/max: only provided for numerical variables
- HpdsDataType: 'phenotypes' or 'genotypes'. Currently, the API only exposes 'phenotypes' variables
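
These columns can serve directly as selection criteria. The sketch below uses a small, made-up DataFrame that only mimics the column layout described above (the variable names and counts are invented) and keeps the categorical variables with between 100 and 2000 observations.

```python
import pandas as pd

# Toy stand-in for the dictionary DataFrame; all values are illustrative only
toy_dict = pd.DataFrame(
    {
        "observationCount": [50, 1500, 900, 10000],
        "categorical": [True, True, False, True],
        "min": [None, None, 0.0, None],
        "max": [None, None, 120.0, None],
    },
    index=[
        "\\Toy study\\Var A\\",
        "\\Toy study\\Var B\\",
        "\\Toy study\\Var C\\",
        "\\Toy study\\Var D\\",
    ],
)

# Keep categorical variables whose observation count lies in [100, 2000]
mask = toy_dict["categorical"] & toy_dict["observationCount"].between(100, 2000)
selected = toy_dict.index[mask].tolist()
print(selected)
```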

### Export Full Data Dictionary to CSV

To export the data dictionary, we first create a pandas DataFrame called `fullVariableDict`.

In [None]:
fullVariableDict = resource.dictionary().find().DataFrame()

Check that the fullVariableDict dataframe contains some values.

In [None]:
fullVariableDict.iloc[0:3,:]

In [None]:
fullVariableDict.to_csv('data_dictionary.csv')

You should now see a `data_dictionary.csv` file in the file explorer.

#### Variable dictionary + pandas multiIndex

We can use a simple user-defined function (`get_multiIndex_variablesDict`) to add a little more information to the variable dictionary and to simplify working with variable names. It takes advantage of the pandas MultiIndex functionality ([see the pandas official documentation on this topic](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)).

Although not an official feature of the API, such functionality illustrates how to quickly select groups of related variables.

Printing the 'multiIndexed' variable dictionary makes it easy to see the tree-like organisation of the variable names. Moreover, original and simplified variable names are now stored in the "varName" and "simplified_varName" columns, respectively. The simplified variable name is simply the last component of the variable name, which is usually the most informative part for knowing what each variable is about.
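
To make the idea of a "simplified" name concrete, here is a minimal sketch of how a backslash-delimited variable name can be split into its components. It is not the implementation of `get_multiIndex_variablesDict`, and the example name is made up.

```python
def split_variable_name(var_name):
    """Split a backslash-delimited variable name into its non-empty parts."""
    return [part for part in var_name.split("\\") if part]


# Hypothetical variable name, written with escaped backslashes
example_name = "\\Toy study\\Subject\\Medical History\\Ever had asthma?\\"

levels = split_variable_name(example_name)
simplified_name = levels[-1]  # the last component is the most informative one

print(levels)
print(simplified_name)
```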

In [12]:
variablesDict = get_multiIndex_variablesDict(plain_variablesDict)

NameError: name 'get_multiIndex_variablesDict' is not defined

In [None]:
variablesDict

In [None]:
# Now that we have seen what our entire dictionary looks like, we limit the number of rows displayed in future outputs
pd.set_option("display.max_rows", 50)

Below is a simple example that illustrates how easy a multiIndex dictionary is to use. Let's say we are interested in every variable pertaining to the "Medical History" and "Medication History" subcategories.

In [None]:
mask_medication = variablesDict.index.get_level_values(2) == "Medication History"
mask_medical = variablesDict.index.get_level_values(2) == "Medical History"
medication_history_variables = variablesDict.loc[mask_medical | mask_medication,:]
medication_history_variables

Although pretty simple, this approach can easily be combined with other filters to quickly select the desired group of variables.
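
As a sketch of such a combination, the example below builds a tiny MultiIndex DataFrame by hand (level values and counts are invented) and combines a mask on an index level with a mask on a column.

```python
import pandas as pd

# Hand-made MultiIndex mimicking the tree-like variable names (illustrative only)
index = pd.MultiIndex.from_tuples(
    [
        ("Toy study", "Subject", "Medical History", "Ever had asthma?"),
        ("Toy study", "Subject", "Medication History", "Inhaler use"),
        ("Toy study", "Subject", "Demographics", "Age"),
    ]
)
toy_dict = pd.DataFrame({"observationCount": [900, 450, 1000]}, index=index)

# Mask on the third index level, combined with a mask on a regular column
level_2 = toy_dict.index.get_level_values(2)
mask_history = level_2.isin(["Medical History", "Medication History"])
mask_count = (toy_dict["observationCount"] >= 500).to_numpy()

selected = toy_dict.loc[mask_history & mask_count]
print(selected)
```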

## Querying and retrieving data

Besides the dictionary, the second cornerstone of the API is the `query` object. It is the entry point for retrieving data from the resource.

First, we need to create a query object.

In [None]:
my_query = resource.query()

The query object has several methods that are used to build a query.

- The `query.select().add()` method accepts a variable name (string) or a list of variable names as argument. The query will return all the listed variables, without any subsetting of records (i.e., subjects/rows).

- The `query.require().add()` method accepts a variable name (string) or a list of variable names as argument. The query will return all the variables passed, and only the records that contain no null values for those variables.

- The `query.anyof().add()` method accepts a variable name (string) or a list of variable names as argument. The query will return all the listed variables, and only the records that contain at least one non-null value among those variables.

- The `query.filter().add()` method accepts a variable name as argument, plus additional values to filter on for that variable. The query will return this variable and only the records that match the filter criteria.

These 4 methods can be combined when building a query. The records eventually returned by the query have to meet all the specified filters.
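
The record-selection rules described above can be illustrated on plain Python data. The sketch below is not how PIC-SURE implements queries; the records, the variable names (`age`, `sex`, `fev1`), and the `keep` helper are all made up to show how the `require`, `anyof`, and range-`filter` criteria combine.

```python
# Toy records; None stands for a null (missing) value
records = [
    {"id": 1, "age": 45, "sex": "F", "fev1": 2.1},
    {"id": 2, "age": None, "sex": "M", "fev1": None},
    {"id": 3, "age": 67, "sex": None, "fev1": None},
]


def keep(record, require=(), anyof=(), ranges=None):
    """Return True if the record satisfies every criterion at once."""
    ranges = ranges or {}
    # require: no null value allowed for any of these variables
    if any(record[var] is None for var in require):
        return False
    # anyof: at least one non-null value among these variables
    if anyof and all(record[var] is None for var in anyof):
        return False
    # filter: value must exist and lie within [low, high]
    for var, (low, high) in ranges.items():
        value = record[var]
        if value is None or not (low <= value <= high):
            return False
    return True


selected = [
    r["id"]
    for r in records
    if keep(r, require=["age"], anyof=["sex", "fev1"], ranges={"age": (20, 70)})
]
print(selected)
```

Only the first record passes: the second has a null `age`, and the third has no non-null value among `sex` and `fev1`.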

#### Building the query

In [None]:
mask = variablesDict["simplified_name"] == "How old were you when you completely stopped smoking? [Years old]"
yo_stop_smoking_varname = variablesDict.loc[mask, "name"] 

mask_cat = variablesDict["categorical"] == True
mask_count = variablesDict["observationCount"].between(100,2000)
varnames = variablesDict.loc[mask_cat & mask_count, "name"]

my_query.filter().add(yo_stop_smoking_varname, min=20, max=70)
my_query.select().add(varnames[:50])

## Selecting Consent Groups

Sometimes it is necessary to limit results to a group of patients who have provided common consent types. By default, PIC-SURE enforces limits based on the consents that each researcher has individually been authorized for; however, it may be desirable to restrict the results further. To view the available consent groups, you can use the `query.filter().show()` function on a new query. Look for the list of values under "\_Consents\Short Study Accession with Consent Code\"

In [None]:
resource.query().filter().show()

In order to update the values, the existing list needs to be cleared first, then replaced. (phs000179.c2 is one consent code used in the COPDGene study.)

In [None]:
my_query.filter().delete("\\_Consents\\Short Study Accession with Consent Code\\")

In [None]:
my_query.filter().add("\\_Consents\\Short Study Accession with Consent Code\\", ['phs000179.c2'])
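
The doubled backslashes in the consent path above come from Python string escaping: in a regular string literal, `\\` stands for a single backslash. The short sketch below shows this, along with a raw-string variant; a raw string literal cannot end with a backslash, so the final one is appended separately.

```python
# Each "\\" in the source code is one backslash in the actual string
consent_path = "\\_Consents\\Short Study Accession with Consent Code\\"
print(consent_path)
print(consent_path.count("\\"))  # number of backslashes in the string

# Equivalent raw-string form, with the trailing backslash added separately
raw_variant = r"\_Consents\Short Study Accession with Consent Code" + "\\"
print(raw_variant == consent_path)
```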

### Retrieving the data

Once our query object is built, we use the `getResultsDataFrame` function to retrieve the data corresponding to our query.

In [None]:
query_result = my_query.getResultsDataFrame(low_memory=False)

In [None]:
query_result.shape

In [None]:
query_result.tail()

In [None]:
query_result[yo_stop_smoking_varname].plot.hist(legend=None, title= "Age stopping smoking", bins=15)

## Retrieving data from query run through PIC-SURE UI

It is possible for you to retrieve the results of a query that you have previously run using the PIC-SURE UI. To do this you must "select data for export", then select the information that you want the query to return and then click "prepare data export". Once the query is finished executing, a group of buttons will be presented.  Click the "copy query ID to clipboard" button to copy your unique query identifier so you can paste it into your notebook.


Paste your query's ID into your notebook and assign it to a variable. You then use the `resource.retrieveQueryResults(yourQueryUUID)` function, called on an initialized resource object, to retrieve the data from your query as shown below.


The screenshot below shows the button of interest in the PIC-SURE UI. It shows that the previously run query has a DataSetID of `dce08fab-98d3-434a-937a-cb583679efe8`. At this point a copy-paste process is used to provide the DataSetID to the API, as shown in the example code below. To run this code you must replace the example query ID with the ID of a query that you have run in the PIC-SURE UI.

<img src="https://drive.google.com/uc?id=1kxFLxjEdMfkF4HjdWBaNju0PyMrYxGR0">


In [None]:
# To run this using your notebook you must replace it with the ID value of a query that you have run.
DataSetID = '<<replace with your QuerySetID>>'

In [None]:
%%capture
results = resource.retrieveQueryResults(DataSetID)

In [None]:
from io import StringIO
df_UI = pd.read_csv(StringIO(results), low_memory=False)

In [None]:
df_UI.head()