<h1><center>Data Acquisition Algorithms</center></h1>

This notebook contains algorithms for obtaining, aggregating and cleaning miRNA-Seq data for a specific tumor type from the GDC API. Data acquired here is to be used for differential expression analysis.

Much of the code presented here is similar to DataRetrieval; however, here it is linked together to run all at once.

---

## Import Packages

This notebook will be using multiple packages to query and interpret data. Outlined below is the reason for each.

| Package | Use |
| --- | --- |
|requests| retrieving information from the url endpoints given on the <br> [GDC API User Guide](https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/) |
|pandas (pd) | storage and wrangling of data in pandas.DataFrame objects |
| json | serializing the filter dictionary into a JSON string that is passed as a request parameter |

In [1]:
import requests
import pandas as pd
import json

---

## Retrieve File Information/metadata

The first step is to retrieve information on the files pertaining to our analysis. This information includes file UUIDs that can be used to query the GDC API for the txt file contents. To retrieve UUIDs, the **files** endpoint is used with multiple filters, shortcutting to the **cases** endpoint for more selective filtering. The results are then used to create a dictionary containing each UUID and its sample type.



#### The Following Cells:


* Use the **files** endpoint

* Shortcut to the **cases** endpoint

* Utilize multiple filters: "files.data_type", "files.experimental_strategy" and "cases.project.project_id"

* Return relevant file info in JSON format

* Extract important information and save to dictionary
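
To make the filter mechanics more concrete, here is a minimal sketch of how a nested filter might be serialized and placed in a query string. It uses only the standard library and makes no network call. The helper name `build_query` and the default `size` are hypothetical; the field names are the ones listed above.

```python
import json
from urllib.parse import urlencode, parse_qs

def build_query(projects, size=10):
    """Hypothetical helper: encode a GDC-style filter as a URL query string."""
    filt = {
        "op": "and",
        "content": [
            {"op": "=",
             "content": {"field": "files.experimental_strategy",
                         "value": ["miRNA-Seq"]}},
            {"op": "in",
             "content": {"field": "cases.project.project_id",
                         "value": projects}},
        ],
    }
    params = {
        "filters": json.dumps(filt),     # the nested dict travels as a JSON string
        "fields": "file_id,cases.samples.sample_type",
        "size": str(size),
    }
    return urlencode(params)

query = build_query(["TCGA-PRAD"])

# Round-trip check: decode the query string and recover the filter
decoded = json.loads(parse_qs(query)["filters"][0])
print(decoded["content"][1]["content"]["value"])
```

The point of the round trip is that the whole filter is a single JSON-encoded parameter, which is why **file_filt()** calls `json.dumps` before sending the request.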

### file_filt()

This algorithm is the first step for data acquisition: extracting specific file information from the GDC with particular filters.

In [2]:
def file_filt(projects):
    """ Algorithm that uses the GDC API to find files that fit our criteria
    
    Parameters:
        projects: list of project IDs, each coded as a string (e.g. ["TCGA-PRAD"])
    
    """

    url="https://api.gdc.cancer.gov/files"                              # url found on the GDC website

    filt = {                                                            # creates dictionary of filtering parameters
        "op":"and",
        "content":[
            {
                "op":"=",                                               # filters for miRNA expression Quantification..
                "content":{                                             # .. which increases speed of other filtering
                    "field":"files.data_type",
                    "value":["miRNA Expression Quantification"]
                }
            },
            {
                "op":"=",                                               # filter for miRNA-seq information
                "content":{
                    "field":"files.experimental_strategy",
                    "value":["miRNA-Seq"]
                }
            },
            {
                "op":"in",                                              # filter by project ID; this takes a list of IDs,..
                "content":{                                             # .. allowing us to input specific projects
                    "field":"cases.project.project_id",
                    "value":projects
                }
            }
                ]
    }

    D = {"filters":json.dumps(filt),
        "size":"2500",
        "expand": "cases.project",
        "fields":"file_id,file_name,cases.submitter_id,cases.samples.sample_type,data_format"}

    r = requests.get(url, params=D) 
    
    cats=r.json()
    
    return cats

### file_Dict()

This algorithm uses the json object returned by **file_filt()** and creates a dictionary of UUIDs and sample types.

In [3]:
def file_Dict(cats):
    """ Creates a dictionary from the output of file_filt() containing id:sample_type
    
    Parameters:
        cats: JSON of all information returned by the file_filt() algorithm
       
    Returns a dictionary of file UUID : sample type
    """
    a={}
    for hit in cats["data"]["hits"]:                                     # iterate over all files and map each file UUID..
                                                                         # .. to the sample type of its first listed sample
        a[hit['file_id']]=hit['cases'][0]['samples'][0]["sample_type"]
    
    return a
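
As a quick illustration of the mapping step, the sketch below runs the same lookup on a hand-written response. The response shape is inferred from the indexing used in **file_Dict()**, the UUIDs are placeholders, and the function name `uuid_to_sample_type` is hypothetical.

```python
def uuid_to_sample_type(response):
    """Hypothetical variant of file_Dict(): map each file UUID to a sample type.

    Only the first case and first sample of each hit are read, as in file_Dict().
    """
    return {
        hit["file_id"]: hit["cases"][0]["samples"][0]["sample_type"]
        for hit in response["data"]["hits"]
    }

# Hand-written stand-in for the JSON returned by the files endpoint (assumed shape)
mock_response = {
    "data": {
        "hits": [
            {"file_id": "uuid-1",
             "cases": [{"samples": [{"sample_type": "Primary Tumor"}]}]},
            {"file_id": "uuid-2",
             "cases": [{"samples": [{"sample_type": "Solid Tissue Normal"}]}]},
        ]
    }
}

mapping = uuid_to_sample_type(mock_response)
print(mapping)
```

If a file were linked to more than one case or sample, this lookup would silently keep only the first one, which may be worth checking for other projects.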

---

## Retrieve Data from list above

The next step is to use the UUIDs from the previous query to retrieve file data.

#### The Following Cells:

- Use the **data** endpoint

- Use a simple HTTP GET request including the file UUID (file_id)


__Note:__ According to the GDC API Users Guide, retrieving multiple files in one request requires a POST request. However, because our files are relatively small, I do not think this is necessary here, so each file is fetched with its own GET request. I believe POST requests are intended for genomic datasets, to make acquisition easier without large computational power.
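
For comparison, a multi-file request could be prepared roughly as follows. This is a sketch only: it assumes the **data** endpoint accepts a JSON body containing an `ids` list, the helper name `build_bulk_request` is hypothetical, the UUIDs are placeholders, and the actual network call is left as a comment.

```python
import json

DATA_URL = "https://api.gdc.cancer.gov/data"

def build_bulk_request(uuids):
    """Hypothetical helper: return (url, headers, body) for a multi-file POST."""
    headers = {"Content-Type": "application/json"}
    body = json.dumps({"ids": list(uuids)})   # assumed payload layout
    return DATA_URL, headers, body

url, headers, body = build_bulk_request(["uuid-1", "uuid-2"])

# To send the request (not executed here):
#   import requests
#   r = requests.post(url, data=body, headers=headers)
print(body)
```

A bulk response would arrive as a single archive rather than one text file per request, so the cleaning step further down would need to unpack it first.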

### data_retrieval()

This algorithm performs a GET request to the GDC API using a file UUID acquired earlier. It is called inside a for-loop that iterates over the dictionary created by **file_Dict()**.

In [4]:
def data_retrieval(UUID):
    """ retrieves data through the GDC API from a UUID
    
    Parameters:
        UUID: file_id
    
    returns the content of the response
    """
    
    url = 'https://api.gdc.cancer.gov/data/'+ UUID     # Add string of file UUID to url.

    D = {}

    r = requests.get(url, params=D, headers = {"Content-Type": "application/json"})
    
    return r.content

---

## Clean and Convert the HTTP response content into a DataFrame

The contents of the request response come in byte format. To use the retrieved data, it must be converted to a string, split into rows and columns, and stored in a DataFrame. The *Pandas* package will be used to store and wrangle data.


#### The Following Cells:

- Use *Pandas* DataFrames

- Iterate over each row of acquired data and save it to lists for conversion to a DataFrame
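
As an alternative to splitting rows by hand, the bytes could be handed to `pandas.read_csv` directly. The sketch below uses a made-up two-row payload; the column names are an assumption about the file layout, and `parse_quantification` is a hypothetical helper name.

```python
import io
import pandas as pd

# Made-up payload standing in for the bytes returned by the data endpoint.
# The four column names are an assumption about the file layout.
raw = (
    b"miRNA_ID\tread_count\treads_per_million_miRNA_mapped\tcross-mapped\n"
    b"hsa-let-7a-1\t10\t100.5\tN\n"
    b"hsa-let-7a-2\t20\t200.25\tY\n"
)

def parse_quantification(content):
    """Hypothetical helper: parse tab-separated bytes into a DataFrame."""
    return pd.read_csv(io.BytesIO(content), sep="\t")

table = parse_quantification(raw)
print(table.dtypes)
```

One difference from manual splitting is that `read_csv` infers numeric column types, whereas values appended from split strings stay as strings until they are converted.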

### response_clean()

This algorithm converts the GET request response from **data_retrieval()** to a pandas DataFrame.

In [5]:
def response_clean(resp,UUID,tum_dict,cpm):
    """ cleans the response from the GDC API
    
    Parameters:
        resp: response from the data endpoint of the GDC API
        UUID: UUID to use
        tum_dict : dictionary of UUID and sample type
        cpm: boolean; True to use the reads-per-million column, False to use raw read counts
        
    Returns a clean pandas DataFrame
    """

    name = []                                                                       # creates lists to append to
    readCnt = []
    readCntMil = []
    xMap = []

    b = resp.decode("utf-8")                                                        # decode the bytes response into a string

    c = b.split('\n')                                                               # split by rows

    for i in c[:-1]:                                                                # iterate over every row; the last element is..
                                                                                    # .. empty when the file ends with a newline
        d = i.split('\t')                                                           # split by column

        name.append(d[0])                                                           # append to appropriate list
        readCnt.append(d[1])
        readCntMil.append(d[2])
        xMap.append(d[3])

    if cpm:
        df_dict= {'miRNA_id':name[1:],
                  tum_dict[UUID] + "_" + UUID:readCntMil[1:],                                 #readCntMil or readCnt
                 }
        
    else:
        df_dict= {'miRNA_id':name[1:],
                  tum_dict[UUID] + "_" + UUID:readCnt[1:],                                 #readCntMil or readCnt
                 }

    df = pd.DataFrame(data=df_dict)
 
    return df

### data_total()

This algorithm combines **data_retrieval()** and **response_clean()** by iterating over every file UUID in the dictionary from **file_Dict()** and creating a DataFrame that holds the data from all acquired files.

In [6]:
def data_total(tum_dict,cpm):
    """ aggregates all data together
    
    parameters:
        tum_dict: dictionary created from file_Dict()
        cpm: True for cpm values, False for raw read counts
        
    Returns: master DataFrame
    """
    
    df=pd.DataFrame(columns=['miRNA_id'])
    
    i=0
    
    for UUID, site in tum_dict.items():
        content = data_retrieval(UUID)
        
        temp_df = response_clean(content,UUID,tum_dict,cpm)
        
        df = pd.merge(df,temp_df,how='outer',on='miRNA_id')
        i+=1
        if i%100 == 0:
            print(i)
        
    return df
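
Merging inside the loop rebuilds the growing DataFrame on every iteration, which may become slow for projects with many files. A possible alternative, sketched below on made-up data, is to collect the per-file frames in a list and join them once with `pandas.concat`. The helper name `combine_samples` is hypothetical, and the sketch assumes each miRNA_id appears only once per file.

```python
import pandas as pd

def combine_samples(frames):
    """Hypothetical helper: outer-join per-sample frames on miRNA_id in one step."""
    indexed = [frame.set_index("miRNA_id") for frame in frames]
    combined = pd.concat(indexed, axis=1, join="outer")
    return combined.rename_axis("miRNA_id").reset_index()

# Two made-up per-file frames shaped like the output of response_clean()
a = pd.DataFrame({"miRNA_id": ["m1", "m2"], "Primary Tumor_a": [1.0, 2.0]})
b = pd.DataFrame({"miRNA_id": ["m2", "m3"], "Solid Tissue Normal_b": [3.0, 4.0]})

combined = combine_samples([a, b])
print(combined)
```

As with the outer merge, an miRNA missing from one file shows up as NaN in that file's column.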

---

### main()

This is the final algorithm. It aggregates all the code into one runnable cell that takes the project(s) to be queried and a flag for whether the values should be raw or converted to cpm. The previous cells must be run for **main()** to function properly.

**Note:** This cell may take some time to run. It prints a running count after every 100 files aggregated.

In [7]:
def main(projects,cpm):
    """ Uses all of the functions previously made
    
    Parameters:
        projects: ID(s) of the project(s) of interest, each coded as a string in a list
        cpm: True for cpm values, False for raw read counts
    
    Returns: usable dataframe
    """
    
    cats = file_filt(projects)
    
    site_dict = file_Dict(cats)
    
    master_df = data_total(site_dict,cpm)
    
    
    
    master_df=master_df.set_index('miRNA_id')
    
    finalT = master_df.transpose()
    
    finalT.index.name="UUID"
    
    finalT['Site'] = ""
    
    for i in finalT.index.values:
        if i[:13] == "Primary Tumor":
            finalT.loc[i,"Site"] = "Tumor"

        elif i[:13] == "Solid Tissue ":
            finalT.loc[i,"Site"] = "Normal"

        else:
            finalT.loc[i,"Site"] = "NA"
        
        
    
    
    
    return finalT

final = main(["TCGA-PRAD"],True)

100
200
300
400
500
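
The site labelling inside **main()** compares the first 13 characters of each row label. A minimal alternative sketch is shown below: it splits the label at the first underscore and looks the sample type up in a dictionary. It assumes labels follow the `<sample type>_<UUID>` pattern built in **response_clean()**; the names `SITE_BY_SAMPLE_TYPE` and `site_from_label` are hypothetical, and the UUIDs are placeholders.

```python
# Assumed mapping, mirroring the three outcomes in main()
SITE_BY_SAMPLE_TYPE = {
    "Primary Tumor": "Tumor",
    "Solid Tissue Normal": "Normal",
}

def site_from_label(label):
    """Hypothetical helper: derive the Site value from a '<sample type>_<UUID>' label."""
    sample_type = label.split("_", 1)[0]      # text before the first underscore
    return SITE_BY_SAMPLE_TYPE.get(sample_type, "NA")

labels = [
    "Primary Tumor_uuid-1",
    "Solid Tissue Normal_uuid-2",
    "Metastatic_uuid-3",
]
sites = [site_from_label(label) for label in labels]
print(sites)
```

Keeping the mapping in one dictionary makes it easier to add further sample types later without extending the if/elif chain.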


In [10]:
final.to_csv("..")

In [9]:
final.head()

| UUID | hsa-let-7a-1 | hsa-let-7a-2 | hsa-let-7a-3 | hsa-let-7b | hsa-let-7c | hsa-let-7d | hsa-let-7e | hsa-let-7f-1 | hsa-let-7f-2 | hsa-let-7g | ... | hsa-mir-942 | hsa-mir-943 | hsa-mir-944 | hsa-mir-95 | hsa-mir-9500 | hsa-mir-96 | hsa-mir-98 | hsa-mir-99a | hsa-mir-99b | Site |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Primary Tumor_f11e4a51-f661-4357-a6ea-bcac35803ff3 | 7699.123918 | 7683.570959 | 7695.849611 | 4706.407205 | 4285.454095 | 209.351012 | 599.607492 | 4641.330351 | 4683.282411 | 271.767491 | ... | 1.637154 | 0.0 | 6.34397 | 0.613933 | 0.0 | 15.962247 | 24.966592 | 990.68255 | 12480.02161 | Tumor |
| Primary Tumor_860889dd-ad00-411b-8eee-166d01e22fb5 | 9047.765059 | 9068.904198 | 9091.965077 | 7124.103402 | 5510.482451 | 374.952811 | 1156.460377 | 5329.838898 | 5477.599345 | 613.88914 | ... | 1.067633 | 0.0 | 2.56232 | 1.28116 | 0.0 | 32.242525 | 35.872478 | 1782.093484 | 20274.569658 | Tumor |
| Primary Tumor_1cc9ab15-5258-4c8b-b1b2-2519bdf5bcd4 | 4307.509537 | 4273.954612 | 4354.930134 | 5805.002069 | 3563.477599 | 508.592834 | 583.744774 | 1491.114323 | 1540.476114 | 351.078805 | ... | 2.218507 | 0.0 | 0.0 | 1.109254 | 0.0 | 48.252537 | 18.302687 | 1915.403874 | 20693.128284 | Tumor |
| Primary Tumor_5d7f0a0d-ffc9-42f4-b80f-a0ff244ab2ce | 11755.617135 | 11870.189263 | 11826.427879 | 6977.356413 | 9305.668788 | 395.403056 | 1281.484937 | 7355.358289 | 7755.240856 | 805.795248 | ... | 2.239756 | 0.172289 | 0.861445 | 2.756623 | 0.0 | 47.207162 | 45.656562 | 2506.6314 | 15278.580836 | Tumor |
| Primary Tumor_99e05909-2544-4b6c-9571-6f299385747d | 17415.647471 | 17405.199435 | 17410.116158 | 7923.913712 | 2836.949173 | 357.077008 | 758.404523 | 5124.454551 | 5563.88667 | 558.662651 | ... | 0.0 | 0.0 | 0.0 | 2.458362 | 0.0 | 46.708869 | 22.125254 | 1525.413312 | 14008.358429 | Tumor |


***