![alt text](pdc.PNG "PDC")

The notebook can be found [here](https://github.com/esacinc/PDC-Public/blob/master/API_documentation/PDC_retrieve_study_metadata.ipynb).

This notebook demonstrates how to:

1. Use the Proteome Data Commons (PDC) API to retrieve all study metadata.
2. Print the metadata to a file.

The metadata are intended to help you select studies of interest for downstream analysis (e.g., protein expression).

In [1]:
import requests
import json

First let's define a function to call the PDC API.

In [2]:
def query_pdc(query):
    # PDC API url
    url = 'https://pdc.esacinc.com/graphql'

    # Send the query as a JSON POST request
    print('Sending query.')
    pdc_response = requests.post(url, json={'query': query})

    # Raise an HTTPError if the response status indicates a failure
    pdc_response.raise_for_status()

    # Response OK: decode the JSON body and return it
    return pdc_response.json()
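
A GraphQL server commonly reports query-level problems (for example, a misspelled field name) in an `errors` list inside the JSON body, which an HTTP status check alone may not catch. The sketch below shows one way to surface such errors. `extract_data` is a hypothetical helper, not part of the PDC API, and it assumes the PDC follows this common GraphQL convention.

```python
def extract_data(response_json):
    """Return the 'data' payload of a GraphQL response.

    Raises RuntimeError if the response carries a non-empty 'errors' list.
    Assumption: the server reports query errors under an 'errors' key.
    """
    errors = response_json.get('errors')
    if errors:
        # Each error is expected to be a dict with a 'message' entry;
        # fall back to the raw value if it is not.
        messages = '; '.join(
            str(e.get('message', e)) if isinstance(e, dict) else str(e)
            for e in errors
        )
        raise RuntimeError('GraphQL query failed: ' + messages)
    return response_json.get('data')

# Example usage in this notebook (requires network access):
# data = extract_data(query_pdc(some_query))
```

Keeping this check separate from `query_pdc` leaves the original function unchanged, so you can apply it only where you want stricter validation.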

Next, let's set up the GraphQL query. This query is designed to retrieve a broad range of metadata defined at the program, project, and study levels. You can limit what is returned by omitting fields from your query.

You can also practice running queries using the [GraphiQL tool](https://pdc.esacinc.com/graphiql), installed on the PDC server.

In [3]:
study_metadata_query = '''
{
  programsProjectsStudies {
    program_id
    program_submitter_id
    name
    sponsor
    start_date
    end_date
    program_manager
    projects {
      project_id
      project_submitter_id
      name
      studies {
        study_id
        study_submitter_id
        submitter_id_name
        study_name
        program_name
        project_name
        program_id
        project_id
        project_submitter_id
        disease_type
        primary_site
        analytical_fraction
        experiment_type
        acquisition_type
        cases_count
        aliquots_count
      }
    }
  }
}'''
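
To illustrate limiting the result by omitting fields, here is a trimmed variant that uses only fields already present in the full query above. The variable name `slim_study_query` is hypothetical, and the rest of this notebook continues to use `study_metadata_query`.

```python
# A smaller query: program and project names, plus a handful of study fields.
# Every field here also appears in the full query; the others are simply omitted.
slim_study_query = '''
{
  programsProjectsStudies {
    name
    projects {
      name
      studies {
        study_id
        study_submitter_id
        disease_type
        primary_site
        analytical_fraction
        experiment_type
      }
    }
  }
}'''

# Example usage in this notebook (requires network access):
# slim_mdata = query_pdc(slim_study_query)
```

A smaller query should produce a smaller response that is easier to scan when you only need a few attributes per study.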

Next, let's run the query using the function that we defined and the GraphQL statement that we set up in `study_metadata_query`.

In [4]:
study_mdata = query_pdc(study_metadata_query)

Sending query.


Next, let's format the response as indented JSON, preview the first 1,000 characters, and then save it to a file.

In [5]:
formatted = json.dumps(study_mdata, indent=2)
print(formatted[0:1000])

{
  "data": {
    "programsProjectsStudies": [
      {
        "program_id": "10251935-5540-11e8-b664-00a098d917f8",
        "program_submitter_id": "Clinical Proteomic Tumor Analysis Consortium",
        "name": "Clinical Proteomic Tumor Analysis Consortium",
        "sponsor": null,
        "start_date": "2018-06-29",
        "end_date": null,
        "program_manager": "Ratna Thangudu",
        "projects": [
          {
            "project_id": "267d6671-0e78-11e9-a064-0a9c39d33490",
            "project_submitter_id": "CPTAC3-Discovery",
            "name": "CPTAC3-Discovery",
            "studies": [
              {
                "study_id": "cfe9f4a2-1797-11ea-9bfa-0a42f3c845fe",
                "study_submitter_id": "CPTAC GBM Discovery Study - Proteome",
                "submitter_id_name": "CPTAC GBM Discovery Study - Proteome",
                "study_name": null,
                "program_name": null,
                "project_name": null,
                "program_id": null,

In [6]:
with open('studymdata.json', 'w') as outfile:
    # formatted is a single string, so write() is the appropriate call
    outfile.write(formatted)

That's it! You can now use the information in the file, along with the browse/filter tools on the PDC, to identify studies that may be of interest to you. The output also contains many of the identifiers used in the PDC, which will be useful when using other API calls.
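
The nested response can be awkward to scan when looking for studies of interest. The following is a minimal sketch, assuming the response has the program, project, study nesting requested by the query in this notebook. `flatten_studies` is a hypothetical helper, not part of the PDC API. It copies the program and project names from the parent levels because, in the sample output above, the study-level `program_name` and `project_name` fields are null.

```python
def flatten_studies(response_json):
    """Flatten a programsProjectsStudies response into one dict per study."""
    rows = []
    data = response_json.get('data') or {}
    for program in data.get('programsProjectsStudies') or []:
        for project in program.get('projects') or []:
            for study in project.get('studies') or []:
                # Start with the parent names, then add all study-level fields
                row = {
                    'program': program.get('name'),
                    'project': project.get('name'),
                }
                row.update(study)
                rows.append(row)
    return rows

# Example usage in this notebook:
# rows = flatten_studies(study_mdata)
# proteome = [r for r in rows if r.get('analytical_fraction') == 'Proteome']
```

Each row is a plain dictionary, so the list can be filtered with ordinary list comprehensions or written out with the standard `csv` module.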

This ends this notebook.
Please submit any questions or requests to: PDCHelpDesk@mail.nih.gov