# Process metadata from SRP198979
#### Adam Klie<br>05/23/2020
Download the accession list (`SRR_Acc_List.txt`) and run table (`SraRunTable.txt`) from the SRA Run Selector: https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP198979&o=acc_s%3Aa

In [18]:
import pandas as pd
import numpy as np

### Get SRR IDs for data download
We need a list of SRR IDs so that fastq-dump can download the raw FASTQ data. Pull these out of `SraRunTable.txt`.

In [19]:
data_dir = '/home/aklie/scratch/exRNA_breast_cancer/fastq'

In [20]:
metadata = pd.read_csv('{}/SraRunTable.txt'.format(data_dir))

In [21]:
metadata.columns

Index(['Run', 'Age', 'Assay Type', 'AvgSpotLen', 'Bases', 'BioProject',
       'BioSample', 'Bytes', 'Center Name', 'Condition', 'Consent',
       'DATASTORE filetype', 'DATASTORE provider', 'DATASTORE region',
       'Experiment', 'gender', 'GEO_Accession (exp)', 'Instrument',
       'library_construction_method', 'LibraryLayout', 'LibrarySelection',
       'LibrarySource', 'Organism', 'Platform', 'ReleaseDate', 'Sample Name',
       'source_name', 'SRA Study'],
      dtype='object')

In [22]:
# The full dataset contains 128 index files that we don't need for this analysis;
# keep only the runs whose AvgSpotLen is 75 or 51
non_index = metadata[(metadata["AvgSpotLen"] == 75) | (metadata["AvgSpotLen"] == 51)]

In [23]:
# Get a list of SRR ids and save as input to download_data.sh
srrs = non_index["Run"].values
with open('{}/SRR_Acc_List_NonIndex.txt'.format(data_dir), 'w') as filehandle:
    filehandle.writelines("%s\n" % srr for srr in srrs)
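
The accession list written above is meant to feed `download_data.sh`, which is not shown in this notebook. As a rough sketch of what that step might look like, the snippet below turns a list of SRR IDs into `fastq-dump` command lines. The helper name `build_fastq_dump_commands`, the example IDs, the output directory, and the choice of the `--gzip` and `--outdir` options are assumptions for illustration, not the contents of the actual script.

```python
import shlex

def build_fastq_dump_commands(srr_ids, out_dir):
    """Return one fastq-dump command (as an argument list) per SRR ID."""
    # Hypothetical helper: the options are illustrative, not taken from download_data.sh
    return [["fastq-dump", "--gzip", "--outdir", out_dir, srr] for srr in srr_ids]

# Example IDs only; in practice they would be read from SRR_Acc_List_NonIndex.txt
example_ids = ["SRR9094428", "SRR9094429"]
commands = build_fastq_dump_commands(example_ids, "fastq")

for cmd in commands:
    # Print a shell-ready line; pass cmd to subprocess.run to actually execute it
    print(shlex.join(cmd))
```

Building argument lists rather than one long string keeps each ID safely quoted if the commands are later run with `subprocess.run`.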

### Generate metadata for Qiime2
For easy Qiime2 analysis, build a metadata table with sampleid (the SRR ID plus an `.unmapped` suffix), gender, cancer occurrence, and recurrence columns.

In [24]:
non_index["sampleid"] = non_index["Run"] + ".unmapped"

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
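
The warning above appears because `non_index` was produced by boolean indexing on `metadata`, so pandas cannot tell whether the assignment is also meant to affect the original frame. A common way to avoid it, sketched here on a toy table with made-up values, is to take an explicit `.copy()` of the filtered rows before adding columns.

```python
import pandas as pd

# Toy stand-in for the run table; the values are made up for illustration
toy_metadata = pd.DataFrame({
    "Run": ["SRR0000001", "SRR0000002", "SRR0000003"],
    "AvgSpotLen": [75, 8, 51],
})

# .copy() makes the filtered frame independent of toy_metadata,
# so adding a column to it is unambiguous
toy_non_index = toy_metadata[toy_metadata["AvgSpotLen"].isin([75, 51])].copy()
toy_non_index["sampleid"] = toy_non_index["Run"] + ".unmapped"

print(toy_non_index)
```

The original `toy_metadata` is left untouched; only the copy gains the `sampleid` column.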


In [25]:
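# Simple row-wise functions to generate useful metadata columns
def addCancerColumn(x):
    # Label the row "cancer" if the SRA "Condition" field mentions breast cancer
    if "breast cancer" in x["Condition"]:
        return "cancer"
    else:
        return "normal"

def addRecurrenceColumn(x):
    # Relies on the derived lowercase "condition" column, so addCancerColumn must be applied first
    if x["condition"] == "cancer":
        if "with recurrence" in x["Condition"]:
            return "yes"
        else:
            return "no"
    else:
        # Recurrence is undefined for normal samples
        return np.nan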
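
The two row-wise helpers above can also be expressed with vectorized string matching, which avoids `apply` and does not depend on the order in which the columns are created. This is only a sketch on a toy frame: the `Condition` strings are invented examples that contain the substrings the helpers look for, not values confirmed from the run table.

```python
import numpy as np
import pandas as pd

# Toy rows; the Condition strings are hypothetical
toy = pd.DataFrame({
    "Run": ["SRR0000001", "SRR0000002", "SRR0000003"],
    "Condition": [
        "breast cancer with recurrence",
        "breast cancer, no recurrence",
        "healthy control",
    ],
})

# Plain substring tests (regex=False), mirroring the `in` checks in the helpers
is_cancer = toy["Condition"].str.contains("breast cancer", regex=False)
has_recurrence = toy["Condition"].str.contains("with recurrence", regex=False)

toy["condition"] = np.where(is_cancer, "cancer", "normal")

# "yes"/"no" for every row, then blank out (NaN) the rows that are not cancer
recurrence = pd.Series(np.where(has_recurrence, "yes", "no"), index=toy.index)
toy["recurrence"] = recurrence.where(is_cancer)

print(toy[["Run", "condition", "recurrence"]])
```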

In [26]:
# Get a condition column specifying whether sample is from cancer or normal
non_index["condition"] = non_index.apply(addCancerColumn, axis=1)


In [27]:
# Get a recurrence column specifying whether each cancer patient had a recurrence (NaN for normal samples)
non_index["recurrence"] = non_index.apply(addRecurrenceColumn, axis=1)


In [28]:
# Check that the counts match the expected values: 96 cancer and 32 normal samples; 68 without recurrence and 28 with recurrence
print(non_index["condition"].value_counts())
print(non_index["recurrence"].value_counts())

cancer    96
normal    32
Name: condition, dtype: int64
no     68
yes    28
Name: recurrence, dtype: int64
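
The visual check above can also be written as an assertion, so that the notebook stops if the counts ever drift. A minimal sketch on a toy column follows; the helper name `check_counts` and the toy values are assumptions for illustration.

```python
import pandas as pd

def check_counts(series, expected):
    """Raise AssertionError if the value counts of `series` differ from `expected`."""
    # value_counts() excludes NaN by default, matching the printed output style
    observed = series.value_counts().to_dict()
    assert observed == expected, f"expected {expected}, got {observed}"
    return observed

toy_condition = pd.Series(["cancer", "cancer", "cancer", "normal"])
result = check_counts(toy_condition, {"cancer": 3, "normal": 1})
```

In this notebook the equivalent calls would be `check_counts(non_index["condition"], {"cancer": 96, "normal": 32})` and `check_counts(non_index["recurrence"], {"no": 68, "yes": 28})`.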


In [29]:
# Simplify table
sample_metadata = non_index[["sampleid", "gender", "condition", "recurrence"]]
sample_metadata.head()

| | sampleid | gender | condition | recurrence |
|---|---|---|---|---|
| 128 | SRR9094428.unmapped | female | cancer | yes |
| 129 | SRR9094429.unmapped | female | cancer | yes |
| 130 | SRR9094430.unmapped | female | cancer | yes |
| 131 | SRR9094431.unmapped | female | cancer | yes |
| 132 | SRR9094432.unmapped | female | cancer | yes |


In [30]:
# Save table as tsv for Qiime2 use
sample_metadata.to_csv('{}/sample_metadata.tsv'.format(data_dir), sep='\t', index=False)

In [34]:
# Generate a metadata file for cancer patients only, i.e. samples with a defined recurrence status ("yes" or "no")
recurrence_metadata = sample_metadata[sample_metadata["recurrence"].notna()]
recurrence_metadata.to_csv('{}/recurrence_metadata.tsv'.format(data_dir), sep='\t', index=False)
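
Before handing the TSV files to Qiime2, it can be worth reading one back to confirm that the `notna()` filter kept the intended rows and that the sample IDs are unique. The sketch below does this on a toy table and writes to an in-memory buffer rather than to `data_dir`; all values and the `toy_` names are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for sample_metadata; values are invented
toy_sample_metadata = pd.DataFrame({
    "sampleid": ["SRR0000001.unmapped", "SRR0000002.unmapped", "SRR0000003.unmapped"],
    "condition": ["cancer", "cancer", "normal"],
    "recurrence": ["yes", "no", np.nan],
})

# Keep only rows with a defined recurrence status (the cancer samples)
toy_recurrence = toy_sample_metadata[toy_sample_metadata["recurrence"].notna()]

# Write tab-separated text to a buffer instead of disk, then read it back
buffer = io.StringIO()
toy_recurrence.to_csv(buffer, sep="\t", index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer, sep="\t")

print(round_trip)
```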