# Extraction of LINCS Dataset Metadata from LDP API

*October 9<sup>th</sup>, 2017*

**Denis Torre**

##### Overview
This Jupyter notebook walks step by step through extracting LINCS dataset metadata from the *fetchdata* endpoint of the LINCS Data Portal (LDP) API. The code cells are written for Python 2 (they use `urllib2` and the `print` statement). For more information, see the API documentation: http://lincsportal.ccs.miami.edu/apis/.

##### Quick Links

Use these links to download the final data in comma-separated (CSV) format:

* Full Table ([link](https://raw.githubusercontent.com/denis-torre/lincs-dataset-browser/master/notebooks/lincs_data.csv))
* Filtered Table - only most relevant fields ([link](https://raw.githubusercontent.com/denis-torre/lincs-dataset-browser/master/notebooks/filtered_lincs_data.csv))


### Extraction of LINCS Dataset Metadata
##### 1. Download data from API

In [21]:
# Import modules
import json
import urllib2
import pandas as pd

# API URL: search term '*' with a limit of 2000 results
api_url = 'http://lincsportal.ccs.miami.edu/dcic/api/fetchdata?searchTerm=*&limit=2000'

# Read API data
api_data_string = urllib2.urlopen(api_url).read()
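
The URL above carries two query parameters, `searchTerm` and `limit`. The sketch below shows one way to assemble the same URL from its parts, so that either value is easy to change. It is written for Python 3 (unlike the Python 2 cells in this notebook), and `build_fetchdata_url` is a hypothetical helper name, not part of the LDP API.

```python
from urllib.parse import urlencode

# Base address of the fetchdata endpoint, copied from the cell above.
FETCHDATA_ENDPOINT = 'http://lincsportal.ccs.miami.edu/dcic/api/fetchdata'


def build_fetchdata_url(search_term='*', limit=2000):
    """Return a fetchdata URL for the given search term and result limit."""
    params = {'searchTerm': search_term, 'limit': limit}
    # safe='*' keeps the '*' character literal instead of percent-encoding it.
    return FETCHDATA_ENDPOINT + '?' + urlencode(params, safe='*')


# With the default arguments this reproduces the URL used in the cell above.
api_url = build_fetchdata_url()
```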

Here we print the first 1500 characters of the API response string, which is formatted as JSON.

In [17]:
print api_data_string[:1500]

{"results":{ "totalDocuments":350,"documents":[
  {
    "centerdatasetid": "http://lincs.hms.harvard.edu/db/datasets/20023/",
    "funding": "1U54HG006097-01",
    "projectname": "LINCS phase 1",
    "assayoverview": "The KINOMEscan assay platform is based on a competition binding assay that is run for a compound of interest against each of a panel of 317 to 456 kinases. The assay has three components: a kinase-tagged phage, a test compound, and an immobilized ligand that the compound competes with to displace the kinase. The amount of kinase bound to the immobilized ligand is determined using quantitative PCR of the DNA tag.  Results for each kinase are reported as \"Percent of control\", where the control is DMSO and where a 100% result means no inhibition of kinase binding to the ligand in the presence of the compound, and where low percent results mean strong inhibition. The KINOMEscan data are presented graphically on TREEspot Kinase Dendrograms (http://www.kinomescan.com/Tools---

##### 2. Extract list of search results

In [18]:
# Decode as ASCII, dropping any non-ASCII bytes
api_data_string = unicode(api_data_string, errors='ignore')

# Convert data to Python dictionary
api_data_dictionary = json.loads(api_data_string)

# Extract result list
api_result_list = api_data_dictionary['results']['documents']
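
The decoding step above relies on `errors='ignore'`, which silently drops every byte that is not valid in the target encoding. The sketch below illustrates that behavior in Python 3 on a made-up byte string; the sample text is hypothetical and not taken from the API response.

```python
# Hypothetical sample: the UTF-8 encoding of "50 µM dose".
raw_bytes = '50 \u00b5M dose'.encode('utf-8')

# Decoding as ASCII with errors='ignore' drops the two bytes that encode "µ".
ascii_only = raw_bytes.decode('ascii', errors='ignore')
print(repr(ascii_only))

# Decoding as UTF-8 keeps the character. This assumes the payload is UTF-8.
utf8_text = raw_bytes.decode('utf-8')
print(repr(utf8_text))
```

Whether the LDP API returns UTF-8 is an assumption here. If it does, decoding as UTF-8 would preserve such characters instead of discarding them.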

Here we print the number of results returned by the query.

In [19]:
print len(api_result_list)

350
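
The response reports a `totalDocuments` count next to the `documents` list, so the two can be compared to check whether the `limit` parameter truncated the results. The sketch below, written for Python 3, runs the same extraction on a tiny hand-written payload. The two records are illustrative placeholders that only mimic the shape of the real response.

```python
import json

# Hand-written sample that mimics the shape of the API response.
sample_response = '''
{"results": {"totalDocuments": 2, "documents": [
  {"centerdatasetid": "http://example.org/datasets/1/", "projectname": "Example project"},
  {"centerdatasetid": "http://example.org/datasets/2/", "projectname": "Example project"}
]}}
'''

parsed = json.loads(sample_response)
documents = parsed['results']['documents']
total_reported = parsed['results']['totalDocuments']

# If the two numbers differ, the limit parameter probably truncated the results.
all_returned = len(documents) == total_reported
print(len(documents), total_reported, all_returned)
```

In this notebook both values are 350, which suggests that the limit of 2000 was large enough to return every dataset.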


##### 3. Convert results to table format

In [23]:
# Convert to Pandas DataFrame
api_result_dataframe = pd.DataFrame(api_result_list).set_index('datasetid')
api_result_dataframe.head()

| datasetid | \_version\_ | antibody | assaydesignmethod | assayformat | assayname | assayoverview | biologicalbucket | biologicalprocess | cellline | centerdatasetid | ... | screeninglabinvestigator | size | smallmolecule | smlincsidentifier | statsfields | statsvalues | technologies | timepoints | tool | toollink |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LDS-1009 | 1571407499608719360 |  | [KINOMEscan] | Biochemical | [KINOMEscan kinase-small molecule binding assay] | The KINOMEscan assay platform is based on a co... | Binding | [Small molecule metabolic process] |  | http://lincs.hms.harvard.edu/db/datasets/20023/ | ... | Qingsong Liu | [ 0.07] | [SB590885] | [LSM-42746, LSM-42746] | [smallmolecule, protein] | [1, 432] | KINOMEscan | [] | [Harmonizome, Life] | [http://amp.pharm.mssm.edu/Harmonizome/dataset... |
| LDS-1018 | 1571407499677925376 |  | [KINOMEscan] | Biochemical | [KINOMEscan kinase-small molecule binding assay] | The KINOMEscan assay platform is based on a co... | Binding | [Small molecule metabolic process] |  | http://lincs.hms.harvard.edu/db/datasets/20032/ | ... | Qingsong Liu | [ 0.07] | [GW843682X] | [LSM-1014, LSM-1014] | [smallmolecule, protein] | [1, 432] | KINOMEscan | [] | [Harmonizome, Life] | [http://amp.pharm.mssm.edu/Harmonizome/dataset... |
| LDS-1017 | 1571407499740839936 |  | [KINOMEscan] | Biochemical | [KINOMEscan kinase-small molecule binding assay] | The KINOMEscan assay platform is based on a co... | Binding | [Small molecule metabolic process] |  | http://lincs.hms.harvard.edu/db/datasets/20031/ | ... | Qingsong Liu | [ 0.07] | [GSK-461364] | [LSM-1013, LSM-1013] | [smallmolecule, protein] | [1, 432] | KINOMEscan | [] | [Harmonizome, Life] | [http://amp.pharm.mssm.edu/Harmonizome/dataset... |
| LDS-1006 | 1571407499804803072 |  | [KINOMEscan] | Biochemical | [KINOMEscan kinase-small molecule binding assay] | The KINOMEscan assay platform is based on a co... | Binding | [Small molecule metabolic process] |  | http://lincs.hms.harvard.edu/db/datasets/20020/ | ... | Qingsong Liu | [ 0.07] | [Sorafenib] | [LSM-1008, LSM-1008] | [smallmolecule, protein] | [1, 432] | KINOMEscan | [] | [Harmonizome, Life] | [http://amp.pharm.mssm.edu/Harmonizome/dataset... |
| LDS-1007 | 1571407499867717632 |  | [KINOMEscan] | Biochemical | [KINOMEscan kinase-small molecule binding assay] | The KINOMEscan assay platform is based on a co... | Binding | [Small molecule metabolic process] |  | http://lincs.hms.harvard.edu/db/datasets/20021/ | ... | Qingsong Liu | [ 0.07] | [HG6-64-1] | [LSM-43248, LSM-43248] | [smallmolecule, protein] | [1, 432] | KINOMEscan | [] | [Harmonizome, Life] | [http://amp.pharm.mssm.edu/Harmonizome/dataset... |


In the DataFrame above, each row represents a different LINCS dataset. Below we keep only the most relevant columns for simplicity; refer to the full table for a more comprehensive overview of the data.

In [28]:
# Keep only the most relevant columns
relevant_columns = ['datasetgroup', 'datasetname', 'description', 'cellline', 'smallmolecule', 'technologies']
filtered_api_result_dataframe = api_result_dataframe[relevant_columns]
filtered_api_result_dataframe.head()

| datasetid | datasetgroup | datasetname | description | cellline | smallmolecule | technologies |
| --- | --- | --- | --- | --- | --- | --- |
| LDS-1009 | LDG-1008 | SB590885 KINOMEscan | For the panel of kinases, the percent of kinas... |  | [SB590885] | KINOMEscan |
| LDS-1018 | LDG-1017 | GW843682 KINOMEscan | For the panel of kinases, the percent of kinas... |  | [GW843682X] | KINOMEscan |
| LDS-1017 | LDG-1016 | GSK461364 KINOMEscan | For the panel of kinases, the percent of kinas... |  | [GSK-461364] | KINOMEscan |
| LDS-1006 | LDG-1005 | Sorafenib KINOMEscan | For the panel of kinases, the percent of kinas... |  | [Sorafenib] | KINOMEscan |
| LDS-1007 | LDG-1006 | HG-6-64-1 KINOMEscan | For the panel of kinases, the percent of kinas... |  | [HG6-64-1] | KINOMEscan |
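
Step 3 turns a list of dictionaries into a table and then selects columns by name. The sketch below, written for Python 3, repeats both operations on two hand-written records (the IDs, names, and cell line are illustrative, not API data). It shows that a key missing from a record becomes `NaN`, and that `reindex` is one way to select columns without raising an error when a requested column is absent.

```python
import pandas as pd

# Hand-written records; the second one lacks the 'cellline' key.
sample_records = [
    {'datasetid': 'LDS-A', 'datasetname': 'Example A',
     'cellline': ['MCF7'], 'technologies': 'KINOMEscan'},
    {'datasetid': 'LDS-B', 'datasetname': 'Example B',
     'technologies': 'KINOMEscan'},
]

# One row per record, indexed by dataset ID; missing keys become NaN.
sample_dataframe = pd.DataFrame(sample_records).set_index('datasetid')

# 'smallmolecule' is deliberately absent from the sample records.
wanted_columns = ['datasetname', 'cellline', 'smallmolecule']

# reindex fills absent columns with NaN; plain [[...]] indexing would raise KeyError.
sample_subset = sample_dataframe.reindex(columns=wanted_columns)
print(sample_subset)
```

Plain list indexing, as used in the cell above, is the stricter choice: it fails loudly if the API stops returning one of the expected fields.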


In [29]:
# Write the full and filtered tables to CSV files
api_result_dataframe.to_csv('lincs_data.csv')
filtered_api_result_dataframe.to_csv('filtered_lincs_data.csv')
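
Several columns, such as `smallmolecule`, hold Python lists. When such a table is written with `to_csv`, each list is stored as its text representation, so it comes back as a plain string when the file is read again. The sketch below, written for Python 3, demonstrates this round trip with an in-memory buffer and shows one possible way to restore the lists using `ast.literal_eval`. The two-row frame is a hand-written sample that reuses values from the table above.

```python
import ast
import io

import pandas as pd

# Hand-written frame with one list-valued column, like 'smallmolecule' above.
sample_dataframe = pd.DataFrame(
    {'smallmolecule': [['SB590885'], ['Sorafenib']]},
    index=pd.Index(['LDS-1009', 'LDS-1006'], name='datasetid'),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample_dataframe.to_csv(buffer)
buffer.seek(0)

# After reading back, each list is a plain string such as "['SB590885']".
reloaded = pd.read_csv(buffer, index_col='datasetid')
first_cell = reloaded.loc['LDS-1009', 'smallmolecule']

# ast.literal_eval turns that string back into a Python list.
restored = reloaded['smallmolecule'].apply(ast.literal_eval)
print(type(first_cell).__name__, restored.loc['LDS-1009'])
```

`to_csv` writes comma-separated values by default; passing `sep='\t'` would produce tab-separated output instead.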