<h1><center>Data Acquisition</center></h1>

This notebook deals with data acquisition from the GDC API, exploring the endpoints and filters that can be used to ultimately produce a pandas DataFrame for easy manipulation. A separate notebook, titled DataRegAlg, should be used for the actual acquisition. This notebook serves to explore different endpoints, demonstrate the functionality of the GDC API, and shed light on how the code in DataRegAlg works.


The files queried in this notebook are miRNA-seq txt files from the GDC API. Each file represents one tissue sample, with each row representing a miRNA. These files will be used for differential expression analysis.

---

## Import Packages

This notebook will be using multiple packages to query and interpret data. Outlined below is the reason for each.

| Package | Use |
| --- | --- |
|requests| retrieving information from the url endpoints given on the <br> [GDC API User Guide](https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/) |
|pandas (pd) | storage and wrangling of data in pandas.DataFrame objects |
| json | serializing the filter dictionary into a JSON string for the request parameters |

In [1]:
import requests
import pandas as pd
import json

---

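## Workflow


The workflow is as follows:

- Retrieve file information (metadata), including file UUIDs, from the **files** endpoint
- Retrieve the contents of a file from the **data** endpoint using its UUID
- Clean the HTTP response content and convert it into a DataFrame
- Check the distribution of sample types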

---

## Retrieve File Information/Metadata

The first step is to retrieve information on the files pertaining to our analysis. This information includes file UUIDs that can be used to query the GDC API for the txt file contents. To retrieve UUIDs, the **files** endpoint will be used and multiple filters will be passed, shortcutting to the **cases** endpoint for more selective filtering.



#### The Following Cells:


* Use the **files** endpoint

* Shortcut to the **cases** endpoint

* Utilize multiple filters: "files.data_type", "files.experimental_strategy" and "cases.project.project_id"

* Return relevant file info in JSON format

In [2]:
url="https://api.gdc.cancer.gov/files"                                 # url found on the GDC website

filt = {                                                               # creates dictionary of filtering parameters
    "op":"and",
    "content":[
        {
            "op":"=",                                                  # filters for miRNA expression Quantification..
            "content":{                                                # .. which increases speed of other filtering
                "field":"files.data_type",
                "value":["miRNA Expression Quantification"]
            }
        },
        {
            "op":"=",                                                  # filter for miRNA-seq information
            "content":{
                "field":"files.experimental_strategy",
                "value":["miRNA-Seq"]
            }
        },
        {
            "op":"in",                                                 # filter by project ID; this takes a list of values, ..
            "content":{                                                # .. allowing us to input specific projects
                "field":"cases.project.project_id",
                "value":["TCGA-BRCA"]
            }
        }
            ]
}
            
D = {"filters":json.dumps(filt),
    "size":"50",
    "expand": "cases.project",
    "fields":"file_id,file_name,cases.submitter_id,cases.samples.sample_type,data_format"}

r = requests.get(url, params=D)  
cats=r.json()

Below is the first file the above code retrieved. The string associated with the "id" key will be used with the **data** endpoint to retrieve the corresponding txt file.

In [3]:
cats['data']['hits'][0]

{'cases': [{'project': {'dbgap_accession_number': None,
    'disease_type': 'Breast Invasive Carcinoma',
    'name': 'Breast Invasive Carcinoma',
    'primary_site': 'Breast',
    'project_id': 'TCGA-BRCA',
    'released': True,
    'state': 'open'},
   'samples': [{'sample_type': 'Primary Tumor'}],
   'submitter_id': 'TCGA-S3-AA12'}],
 'data_format': 'TXT',
 'file_id': 'bd5873b0-4c3c-4aba-987e-0730145d5ea1',
 'file_name': '2596fddb-8ada-40da-95b1-e8631e9b48d1.mirbase21.mirnas.quantification.txt',
 'id': 'bd5873b0-4c3c-4aba-987e-0730145d5ea1'}
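
Since the goal is a table of files, the nested hits can be flattened into one record per file. The following is a minimal sketch, not the code used in DataRegAlg: `flatten_hits` is a hypothetical helper, and it assumes each hit carries a single case with a single sample, as in the output above.

```python
def flatten_hits(hits):
    """Turn GDC 'hits' into flat records, one per file."""
    records = []
    for hit in hits:
        case = hit["cases"][0]        # assumption: one case per file
        sample = case["samples"][0]   # assumption: one sample per case
        records.append({
            "file_id": hit["file_id"],
            "file_name": hit["file_name"],
            "submitter_id": case["submitter_id"],
            "sample_type": sample["sample_type"],
        })
    return records


# The hit shown above, trimmed to the fields that are used here.
example_hit = {
    "cases": [{"samples": [{"sample_type": "Primary Tumor"}],
               "submitter_id": "TCGA-S3-AA12"}],
    "file_id": "bd5873b0-4c3c-4aba-987e-0730145d5ea1",
    "file_name": "2596fddb-8ada-40da-95b1-e8631e9b48d1.mirbase21.mirnas.quantification.txt",
}
records = flatten_hits([example_hit])
print(records[0]["sample_type"])  # Primary Tumor
```

With the live response, `pd.DataFrame(flatten_hits(cats['data']['hits']))` should give one row per file.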

---

## Retrieve Data from list above

The next step is to use the UUIDs from the previous query to retrieve file data. Each file must be acquired individually, so a for-loop appears to be the proper tool for the job. Here, only one file is retrieved.

#### The Following Cells:

- Use the **data** endpoint

- Use a simple HTTP GET request including the file UUID (file_id)


__Note:__ According to the GDC API Users Guide, retrieving multiple files requires a POST request. However, because our files are relatively small, I do not think this is necessary here. I believe POST requests are intended for larger genomic datasets, where they make acquisition easier without requiring large computational power.

In [4]:
url = 'https://api.gdc.cancer.gov/data/'+'bd5873b0-4c3c-4aba-987e-0730145d5ea1'     # Add string of file UUID to url.

D = {}

r = requests.get(url, params=D, headers = {"Content-Type": "application/json"})  

In [5]:
r

<Response [200]>
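
The for-loop mentioned above could look roughly like the sketch below, which issues one GET request per UUID. `fetch_files` and its `get` argument are hypothetical names introduced for illustration; `get` exists only so the loop can be tried without network access, and it falls back to `requests.get` when omitted. The UUIDs in the demo are placeholders, not real file IDs.

```python
DATA_URL = "https://api.gdc.cancer.gov/data/"


def fetch_files(file_ids, get=None):
    """Return {file_id: text}, issuing one GET request per file UUID."""
    if get is None:
        import requests               # imported lazily so the demo runs offline
        get = requests.get

    contents = {}
    for file_id in file_ids:
        response = get(DATA_URL + file_id)
        response.raise_for_status()        # stop early on an HTTP error
        contents[file_id] = response.text  # decoded body of the txt file
    return contents


class FakeResponse:
    """Stand-in for a requests response, used only for the offline demo."""
    text = "placeholder file contents\n"

    def raise_for_status(self):
        pass


demo = fetch_files(["uuid-1", "uuid-2"], get=lambda url: FakeResponse())
print(sorted(demo))  # ['uuid-1', 'uuid-2']
```

Against the live API, calling `fetch_files` with a list of real `file_id` values and no `get` argument would download each file in turn.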

---

## Clean and Convert the HTTP response content into a DataFrame

The content of the request response arrives as bytes. To use the retrieved data, it must be decoded to a string, split into rows and columns, and then stored in a DataFrame. The *Pandas* package will be used to store and wrangle data.


#### The Following Cells:

- Use *Pandas* DataFrames

- Iterate over each row of acquired data and save it to lists for conversion to a DataFrame

In [6]:
name = []                                                                           # creates lists to append to
readCnt = []
readCntMil = []
xMap = []


b = r.content.decode('utf-8')                                                       # decode the response bytes into a string

c = b.splitlines()                                                                  # split into rows (no empty trailing row)


for i in c:                                                                         # iterate over every row, header included

    d = i.split('\t')                                                               # split into columns

    name.append(d[0])                                                               # append to appropriate list
    readCnt.append(d[1])
    readCntMil.append(d[2])
    xMap.append(d[3])

In [7]:
# create a dictionary for conversion to pandas.DataFrame (the [1:] slices skip the header row)
df_dict= {'miRNA_id':name[1:],
          "read_count":readCnt[1:],
         "reads_per_million_miRNA_mapped":readCntMil[1:],
         "cross-mapped":xMap[1:]}

In [8]:
# create a pandas dataframe for easy manipulation and exportation

df = pd.DataFrame(data=df_dict)

df = df[
    ['miRNA_id',
          "read_count",
         "reads_per_million_miRNA_mapped",
         "cross-mapped"]
       ]

In [9]:
df.head()

| | miRNA_id | read_count | reads_per_million_miRNA_mapped | cross-mapped |
| --- | --- | --- | --- | --- |
| 0 | hsa-let-7a-1 | 47975 | 10625.871 | N |
| 1 | hsa-let-7a-2 | 47663 | 10556.766847 | N |
| 2 | hsa-let-7a-3 | 47888 | 10606.601573 | N |
| 3 | hsa-let-7b | 70374 | 15586.973336 | N |
| 4 | hsa-let-7c | 1652 | 365.897632 | N |
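
The same parsing can be sketched with the standard-library `csv` module, which also makes it easy to convert the numeric columns instead of keeping them as strings. `parse_quantification` is a hypothetical helper; the sample rows are copied from the table above, while the header line is a placeholder, since the exact header of the real file is not shown in this notebook.

```python
import csv
import io

COLUMNS = ["miRNA_id", "read_count",
           "reads_per_million_miRNA_mapped", "cross-mapped"]


def parse_quantification(text):
    """Parse tab-separated quantification text into a list of row dicts."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)                       # skip the header row of the file
    rows = []
    for row in reader:
        if not row:                    # ignore blank lines
            continue
        name, count, rpm, cross = row
        # store counts as numbers rather than strings
        rows.append(dict(zip(COLUMNS, (name, int(count), float(rpm), cross))))
    return rows


sample_text = (
    "miRNA_ID\tread_count\treads_per_million_miRNA_mapped\tcross-mapped\n"
    "hsa-let-7a-1\t47975\t10625.871\tN\n"
    "hsa-let-7b\t70374\t15586.973336\tN\n"
)
rows = parse_quantification(sample_text)
print(rows[0]["read_count"] + rows[1]["read_count"])  # 118349
```

The resulting list can be passed to `pd.DataFrame(rows)`. Reading the text directly with `pd.read_csv(io.StringIO(text), sep="\t")` is another option worth considering.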


---

## Testing

One important aspect of this analysis is ensuring proper sample sizes. Each file represents a sample from a specific site, ranging from tumor tissue to normal tissue. Before downloading the data and starting the analysis, the count of each sample type should be checked to ensure the analysis is worthwhile.


#### The Following Cells:

- Check the distribution of sample types
- Use the same protocol as the original file information collection and filtering

In [13]:
url="https://api.gdc.cancer.gov/files"                                 # url found on the GDC website

filt = {
    "op":"and",
    "content":[
        {
            "op":"=",
            "content":{
                "field":"files.data_type",
                "value":["miRNA Expression Quantification"]
            }
        },
        {
            "op":"=",
            "content":{
                "field":"files.experimental_strategy",
                "value":["miRNA-Seq"]
            }
        },
        {
            "op":"in",                                                 # filter by project ID; this takes a list of values, ..
            "content":{                                                # .. allowing us to input specific projects
                "field":"cases.project.project_id",
                "value":["TCGA-BRCA"]
            }
        }
            ]
}
            
D = {"filters":json.dumps(filt),
    "size":"2500",                                                   # the only major difference from the query above: retrieves all miRNA-Seq files
    "expand": "cases.project",
    "fields":"file_id,file_name,cases.submitter_id,cases.samples.sample_type,data_format"}
r = requests.get(url, params=D)  
cats=r.json()

In [14]:
a = {}

for hit in cats["data"]["hits"]:                                     # iterates over all files and counts each sample type

    sample_type = hit['cases'][0]['samples'][0]["sample_type"]       # sample type of the first sample of the first case

    a[sample_type] = a.get(sample_type, 0) + 1                       # start at 0 for a new sample type, then add one

In [15]:
a

{'Metastatic': 7, 'Primary Tumor': 1096, 'Solid Tissue Normal': 104}
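
The count above can also be written with `collections.Counter`, followed by a quick check for groups that may be too small to analyze. This is a sketch only: `count_sample_types` and `small_groups` are hypothetical helpers, and the minimum of 30 is an arbitrary placeholder, not a threshold taken from this analysis.

```python
from collections import Counter


def count_sample_types(hits):
    """Count files per sample type (first sample of the first case)."""
    return Counter(
        hit["cases"][0]["samples"][0]["sample_type"] for hit in hits
    )


def small_groups(counts, minimum):
    """Return the sample types with fewer than `minimum` files."""
    return sorted(name for name, n in counts.items() if n < minimum)


# Tiny hand-made hits with the same nesting as the API response.
demo_hits = [
    {"cases": [{"samples": [{"sample_type": "Primary Tumor"}]}]},
    {"cases": [{"samples": [{"sample_type": "Primary Tumor"}]}]},
    {"cases": [{"samples": [{"sample_type": "Solid Tissue Normal"}]}]},
]
demo_counts = count_sample_types(demo_hits)
print(demo_counts["Primary Tumor"])  # 2

# The counts reported above, checked against the placeholder minimum.
reported = Counter({"Metastatic": 7, "Primary Tumor": 1096,
                    "Solid Tissue Normal": 104})
print(small_groups(reported, minimum=30))  # ['Metastatic']
```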

***

The code from this notebook will be used in DataRegAlg to create an all-in-one function to query data.
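
As a rough illustration of how the query-building part might be wrapped up, the sketch below assembles the request parameters used in this notebook from a list of project IDs. It is not the code in DataRegAlg; `gdc_filter` and `build_file_query` are hypothetical names, and the filter values are the ones already used above.

```python
import json

FILES_URL = "https://api.gdc.cancer.gov/files"


def gdc_filter(field, values, op="in"):
    """Build one filter clause in the op/content shape used in this notebook."""
    return {"op": op, "content": {"field": field, "value": list(values)}}


def build_file_query(project_ids, size=50):
    """Assemble request parameters for miRNA-Seq quantification files."""
    filt = {
        "op": "and",
        "content": [
            gdc_filter("files.data_type",
                       ["miRNA Expression Quantification"], op="="),
            gdc_filter("files.experimental_strategy", ["miRNA-Seq"], op="="),
            gdc_filter("cases.project.project_id", project_ids),
        ],
    }
    return {
        "filters": json.dumps(filt),   # the API expects the filter as a JSON string
        "size": str(size),
        "expand": "cases.project",
        "fields": "file_id,file_name,cases.submitter_id,"
                  "cases.samples.sample_type,data_format",
    }


params = build_file_query(["TCGA-BRCA"], size=2500)
project_clause = json.loads(params["filters"])["content"][2]
print(project_clause["content"]["value"])  # ['TCGA-BRCA']
```

The parameters can then be sent with `requests.get(FILES_URL, params=params)`, as in the cells above.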

---
