<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Metadata-download" data-toc-modified-id="Metadata-download-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Metadata download</a></span></li><li><span><a href="#SRA" data-toc-modified-id="SRA-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>SRA</a></span><ul class="toc-item"><li><span><a href="#Import-needed-packages" data-toc-modified-id="Import-needed-packages-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Import needed packages</a></span></li><li><span><a href="#Download-per-sample-metadata-from-NCBI-FTP-using-previously-parsed-for-now,-will-need-to-write-code-to-parse" data-toc-modified-id="Download-per-sample-metadata-from-NCBI-FTP-using-previously-parsed-for-now,-will-need-to-write-code-to-parse-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Download per sample metadata from <a href="https://ftp.ncbi.nlm.nih.gov/sra/reports/Metadata/" target="_blank">NCBI FTP</a> <font color="red">using previously parsed for now, will need to write code to parse</font></a></span></li><li><span><a href="#Sample-and-study-technical-metadata-need-to-find-where-this-is-hiding" data-toc-modified-id="Sample-and-study-technical-metadata-need-to-find-where-this-is-hiding-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Sample and study technical metadata <font color="red">need to find where this is hiding</font></a></span></li><li><span><a href="#Download-BioSample-attribute-definitions" data-toc-modified-id="Download-BioSample-attribute-definitions-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Download BioSample <a href="https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/" target="_blank">attribute definitions</a></a></span></li></ul></li><li><span><a href="#Qiita" data-toc-modified-id="Qiita-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Qiita</a></span><ul class="toc-item"><li><span><a href="#Download-metadata-from-Qiita" data-toc-modified-id="Download-metadata-from-Qiita-3.1"><span 
class="toc-item-num">3.1&nbsp;&nbsp;</span>Download metadata from Qiita</a></span></li><li><span><a href="#Save-lists-of-Study-IDs-from-downloaded-file-names" data-toc-modified-id="Save-lists-of-Study-IDs-from-downloaded-file-names-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Save lists of Study IDs from downloaded file names</a></span></li><li><span><a href="#Pull-out-the-metadata-from-groups-of-studies-and-save-as-pickle-objects" data-toc-modified-id="Pull-out-the-metadata-from-groups-of-studies-and-save-as-pickle-objects-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Pull out the metadata from groups of studies and save as pickle objects</a></span></li><li><span><a href="#Combine-the-attribute-value-pairs-from-each-group-into-a-final-dataframe-and-save" data-toc-modified-id="Combine-the-attribute-value-pairs-from-each-group-into-a-final-dataframe-and-save-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Combine the attribute-value pairs from each group into a final dataframe and save</a></span></li></ul></li><li><span><a href="#Word-vector-model" data-toc-modified-id="Word-vector-model-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Word vector model</a></span></li></ul></div>

# Metadata download
Adam Klie<br>
11/23/2019<br>
Updated 08/24/2020<br>
Script to download the data needed to build, train, and test metadata prediction models.

# SRA

## Import needed packages

In [25]:
import xml.etree.ElementTree as ET
import pandas as pd
import ipywidgets as widgets
from IPython.display import IFrame
import qgrid
import tqdm
import os
import glob
import random

# Autoreload extension
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload
    
%autoreload 2

## Download per sample metadata from [NCBI FTP](https://ftp.ncbi.nlm.nih.gov/sra/reports/Metadata/) <font color='red'>using previously parsed for now, will need to write code to parse</font>

In [26]:
IFrame('https://ftp.ncbi.nlm.nih.gov/sra/reports/Metadata/', width=800, height=450)

In [27]:
#!wget -O ../data/sra/NCBI_SRA_Metadata_20181202.tar.gz https://ftp.ncbi.nlm.nih.gov/sra/reports/Metadata/NCBI_SRA_Metadata_20181202.tar.gz

In [28]:
#!tar -vxzf ../data/sra/NCBI_SRA_Metadata_20181202.tar.gz
#sra_raw = pd.read_json('../data/NCBI_SRA_Metadata_20181202')
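
The heading for this section notes that parsing code still has to be written. The following is a minimal sketch of one possible approach, not the notebook's actual method. It assumes the archive holds per-submission files ending in `.sample.xml`, and that each `SAMPLE` element carries `SAMPLE_ATTRIBUTE` children with `TAG` and `VALUE` entries; both assumptions should be checked against the real dump. The helper name `iter_sample_attributes` is hypothetical.

```python
import tarfile
import xml.etree.ElementTree as ET


def iter_sample_attributes(tar_path, suffix=".sample.xml"):
    """Yield (sample_accession, tag, value) tuples from a metadata tarball.

    Assumption (unverified): sample files end in `suffix` and use a
    SAMPLE / SAMPLE_ATTRIBUTE / {TAG, VALUE} element layout.
    """
    # Mode "r:*" lets tarfile detect the compression type itself
    with tarfile.open(tar_path, mode="r:*") as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(suffix)):
                continue
            handle = tar.extractfile(member)
            try:
                root = ET.parse(handle).getroot()
            except ET.ParseError:
                continue  # skip files that are not well-formed XML
            for sample in root.iter("SAMPLE"):
                accession = sample.get("accession")
                for attr in sample.iter("SAMPLE_ATTRIBUTE"):
                    yield accession, attr.findtext("TAG"), attr.findtext("VALUE")
```

Reading members straight from the tarball avoids extracting the whole archive to disk. The tuples could then be collected with `pd.DataFrame(list(...), columns=[...])` if a table is needed.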

## Sample and study technical metadata <font color='red'>need to find where this is hiding</font>

## Download BioSample [attribute definitions](https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/)

In [29]:
IFrame('https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/', width=800, height=500)

In [30]:
#!wget -O ../data/sra/BioSampleAttributes.xml "https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/?format=xml"

In [31]:
def parseXML(xmlfile):
    """Parse the BioSample attribute XML into a DataFrame, one row per <Attribute>."""

    # parse the file and get the root element
    tree = ET.parse(xmlfile)
    root = tree.getroot()

    # collect one dictionary per attribute definition
    attributes = []

    # iterate over the <Attribute> elements directly under the root
    for item in root.findall('./Attribute'):

        # map each child tag to its text; a repeated tag keeps only its last value
        attribute = {}
        for child in item:
            attribute[child.tag] = child.text

        attributes.append(attribute)

    # build the DataFrame once (DataFrame.append was removed in pandas 2.0)
    return pd.DataFrame(attributes)
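
Because `parseXML` stores `attribute[child.tag] = child.text`, a child tag that appears more than once inside an `<Attribute>` keeps only its last value. The sketch below shows one way to keep every value by joining repeats into a single string. The XML snippet is made up for illustration and is an assumption about the file's layout, and `parse_attributes_keep_repeats` is a hypothetical helper, not part of the original notebook.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up example in the assumed layout; not copied from the real file
SAMPLE_XML = """
<BioSampleAttributes>
  <Attribute>
    <Name>host</Name>
    <Synonym>host organism</Synonym>
    <Synonym>host species</Synonym>
  </Attribute>
</BioSampleAttributes>
"""


def parse_attributes_keep_repeats(root, sep="; "):
    """Return one dict per <Attribute>, joining repeated child tags with `sep`."""
    rows = []
    for item in root.findall("./Attribute"):
        values = defaultdict(list)
        for child in item:
            # empty elements have text None, so fall back to an empty string
            values[child.tag].append((child.text or "").strip())
        rows.append({tag: sep.join(texts) for tag, texts in values.items()})
    return rows


rows = parse_attributes_keep_repeats(ET.fromstring(SAMPLE_XML))
print(rows)
```

The returned list of dictionaries can be passed to `pd.DataFrame(rows)` to get the same kind of table that `parseXML` produces.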

In [32]:
df = parseXML('../data/sra/BioSampleAttributes.xml').set_index('Name')

In [33]:
qgrid_widget = qgrid.show_grid(df, show_toolbar=True)

In [34]:
qgrid_widget


In [35]:
df.to_pickle('../data/sra/BioSampleAttributes.pickle')

# Qiita

## Download metadata from Qiita 

Start by going to the [Qiita website](https://qiita.ucsd.edu/), opening the "More Info" drop-down menu, and selecting "Download public BIOM and metadata files". The cells in the next section read the downloaded files from `../data/qiita/download/`.

In [36]:
IFrame('https://qiita.ucsd.edu/', width=800, height=500)

## Save lists of Study IDs from downloaded file names

In [2]:
# The study ID is the part of each downloaded file name before the first underscore
file_ids = list({file_id.split("_")[0] for file_id in os.listdir("../data/qiita/download/")})

In [3]:
random.shuffle(file_ids)

In [4]:
id_groups = [file_ids[i:i + 20] for i in range(0, len(file_ids), 20)]

In [5]:
for i,group in enumerate(id_groups):
    with open('../data/qiita/file_ids_{}.txt'.format(i+1), 'w') as f:
        f.write("\n".join(group))
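
The cells above shuffle the study IDs, split them into groups of 20, and write one text file per group. The sketch below runs the same steps on made-up IDs, with a seeded shuffle so the grouping can be reproduced between runs. The seed, the helper names `chunk_ids` and `write_groups`, and the temporary output directory are illustrative assumptions, not part of the original workflow.

```python
import os
import random
import tempfile


def chunk_ids(ids, size=20, seed=None):
    """Shuffle a sorted copy of `ids` and split it into lists of at most `size`."""
    ids = sorted(ids)  # fixed starting order, independent of set/listdir order
    random.Random(seed).shuffle(ids)
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def write_groups(groups, out_dir):
    """Write each group to file_ids_<n>.txt (1-based), one ID per line."""
    paths = []
    for n, group in enumerate(groups, start=1):
        path = os.path.join(out_dir, "file_ids_{}.txt".format(n))
        with open(path, "w") as f:
            f.write("\n".join(group))
        paths.append(path)
    return paths


# Toy run with 45 made-up study IDs
toy_ids = [str(i) for i in range(1, 46)]
groups = chunk_ids(toy_ids, size=20, seed=0)
with tempfile.TemporaryDirectory() as tmp:
    written = write_groups(groups, tmp)
    print([len(g) for g in groups], len(written))  # [20, 20, 5] 3
```

Sorting before the seeded shuffle matters because neither `set` iteration order nor `os.listdir` order is guaranteed to be stable, so a seed alone would not make the groups repeatable.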

## Pull out the metadata from groups of studies and save as pickle objects

In [14]:
# !sbatch downloadQiitaMetadata.sbatch

Submitted batch job 3287403


## Combine the attribute-value pairs from each group into a final dataframe and save

In [20]:
# MultiIndex's `labels` argument was renamed to `codes` (removed under the old name in pandas 1.0)
# qiita_pairs = pd.Series(dtype=object, index=pd.MultiIndex(levels=[[],[]], codes=[[],[]], names=['Sample_ID', 'Attribute']))
# for i in tqdm.tqdm_notebook(range(27)):
#     curr_pickle = pd.read_pickle('../data/qiita/file_ids_{}.pickle'.format(i+1))
#     qiita_pairs = pd.concat([qiita_pairs, curr_pickle], axis=0)
# qiita_pairs.to_pickle('allQiita.pickle')
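
The commented-out loop above hard-codes 27 groups. As a hedged alternative, the sketch below discovers the per-group pickle files on disk and orders them by group number, so the combining step does not depend on a fixed count. It assumes the files follow the `file_ids_<n>.pickle` naming used above; `find_group_pickles` is a hypothetical helper, and the toy run uses empty placeholder files rather than real pickles.

```python
import glob
import os
import re
import tempfile


def find_group_pickles(directory):
    """Return file_ids_<n>.pickle paths in `directory`, sorted by <n>."""
    pattern = re.compile(r"file_ids_(\d+)\.pickle$")
    found = []
    for path in glob.glob(os.path.join(directory, "file_ids_*.pickle")):
        match = pattern.search(os.path.basename(path))
        if match:  # ignore names whose group part is not a number
            found.append((int(match.group(1)), path))
    return [path for _, path in sorted(found)]


# Toy run with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ["file_ids_10.pickle", "file_ids_2.pickle", "file_ids_1.pickle"]:
        open(os.path.join(tmp, name), "w").close()
    ordered = [os.path.basename(p) for p in find_group_pickles(tmp)]
    print(ordered)
```

With real data, the result could feed a single call such as `pd.concat([pd.read_pickle(p) for p in find_group_pickles('../data/qiita')], axis=0)`, which also avoids growing the Series one group at a time.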


# Word vector model

In [None]:
#!wget http://evexdb.org/pmresources/vec-space-models/PubMed-w2v.bin