#### Adam Klie<br>04/08/2020
# Downloading AGP data via `redbiom`
Downloads the American Gut Project (AGP) feature table and sample metadata from Qiita using `redbiom`.

## Requirements (details needed)
 - Bash kernel for Jupyter (https://macintoshguy.wordpress.com/2016/04/09/bash-notebooks-in-jupyter/)
 - `redbiom` (see env_setup.ipynb)
 - Run `conda activate <redbiom_env>` before opening the notebook
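
A quick way to confirm the environment is active is to check that the command-line tools used below are on `PATH`. This is a minimal sketch; `missing_tools` is a hypothetical helper name, not part of `redbiom`, and it only warns rather than stopping the notebook.

```shell
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# redbiom and biom are the two CLIs this notebook calls.
missing=$(missing_tools redbiom biom)
if [ -n "$missing" ]; then
    echo "Missing tools (activate the redbiom environment first):" $missing >&2
fi
```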

### 0. Set up for download

In [1]:
# Create the dated download directory if it does not exist, then move into it
DATE=$(date +%Y_%m_%d)
mkdir -p "../data/test/${DATE}"
cd "../data/test/${DATE}"

### 1. List the metadata categories that match "exercise"

In [2]:
redbiom search metadata \
    --categories "exercise" > exercise_metadata_list.txt

### 2. Determine how many samples have each metadata feature

In [3]:
# Count the samples that have a value for each category
while IFS= read -r p; do
  N_SAMPLES="$(redbiom summarize metadata-category \
      --category "$p" --dump | wc -l)"
  echo -e "$p\t$N_SAMPLES"
done < exercise_metadata_list.txt

exercise_location	   26347
exercise	    1510
enjoyment_of_exercise	     161
exercise_frequency	   28017
exercise_status	     289
total_hours_exercise	     161
exercise_frequency_unit	    1510
pm_lifestyle_change_how_change_in_exercise	     986
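
Ranking the categories by sample count makes the choice in the next step easier to see. The sketch below assumes tab-separated `category<TAB>count` lines like the output above; `rank_by_count` is a hypothetical helper name.

```shell
# Sort tab-separated "category<TAB>count" lines by count, largest first.
rank_by_count() {
    tab=$(printf '\t')
    LC_ALL=C sort -t "$tab" -k2,2nr
}

# Counts copied from the output above.
printf '%s\t%s\n' \
    exercise_location 26347 \
    exercise 1510 \
    enjoyment_of_exercise 161 \
    exercise_frequency 28017 \
    exercise_status 289 \
    total_hours_exercise 161 \
    exercise_frequency_unit 1510 \
    pm_lifestyle_change_how_change_in_exercise 986 |
    rank_by_count | head -n 3
```

In the notebook, the same function could be attached to the end of the loop, for example `done < exercise_metadata_list.txt | rank_by_count`.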


### 3. Use exercise_frequency (the category with the most samples, and its values are informative)

In [4]:
echo -e "exercise_frequency"
redbiom summarize metadata-category \
    --category "exercise_frequency" \
    --counter | tail -5

exercise_frequency
LabControl test	1095
Rarely (a few times/month)	3132
Daily	5133
Occasionally (1-2 times/week)	6217
Regularly (3-5 times/week)	9270


### 4. Choose a context

In [5]:
export CTX=Deblur-Illumina-16S-V4-150nt-780653

### 5. Save AGP sample IDs to a text file (optionally a random subset)

In [6]:
export IS_SUBSET=0  # 0 = random subset of NUM_SAMP samples; any other number = full dataset
export NUM_SAMP=100  # subset size, used only when IS_SUBSET is 0
export DATASET=${CTX}_${IS_SUBSET}_${NUM_SAMP}.ids
echo $DATASET

Deblur-Illumina-16S-V4-150nt-780653_0_100.ids
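
With this naming scheme, a full download would still carry `NUM_SAMP` in its file name (for example `..._1_100.ids`). One possible alternative, sketched below, spells out `subset_<n>` or `full` instead. `dataset_name` is a hypothetical helper, and it keeps the notebook's convention that `IS_SUBSET=0` means "subset".

```shell
# Build an ID-file name that says "full" or "subset_<n>" explicitly.
# Arguments: context name, IS_SUBSET flag (0 = subset), subset size.
dataset_name() {
    ctx=$1 is_subset=$2 num_samp=$3
    if [ "$is_subset" -eq 0 ]; then
        printf '%s_subset_%s.ids\n' "$ctx" "$num_samp"
    else
        printf '%s_full.ids\n' "$ctx"
    fi
}

dataset_name Deblur-Illumina-16S-V4-150nt-780653 0 100
```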


In [7]:
# All non-blank samples from Qiita study 10317 (AGP); optionally keep a random subset
if [ "$IS_SUBSET" -eq 0 ]
then
    redbiom search metadata "where qiita_study_id == 10317" | grep -vi "blank" | sort -R | head -n "$NUM_SAMP" > "$DATASET"
    wc -l "$DATASET"
else
    redbiom search metadata "where qiita_study_id == 10317" | grep -vi "blank" > "$DATASET"
    wc -l "$DATASET"
fi

     100 Deblur-Illumina-16S-V4-150nt-780653_0_100.ids
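
Because `sort -R` shuffles differently on each run, the subset above changes every time the cell is executed. If a repeatable subset is wanted, one option is to draw the random keys from `awk` with a fixed seed. This is a sketch under that assumption, not what the notebook ran; `seeded_sample` is a hypothetical helper, and the result is only repeatable for the same `awk` implementation.

```shell
# Pick n lines from stdin in a pseudo-random order that depends only on the seed.
# Each line gets a random key from awk, lines are sorted by key, and the first n are kept.
seeded_sample() {
    n=$1 seed=$2
    LC_ALL=C awk -v seed="$seed" \
        'BEGIN { srand(seed) } { printf "%.10f\t%s\n", rand(), $0 }' |
        LC_ALL=C sort -k1,1n |
        cut -f2- |
        head -n "$n"
}

# Toy IDs (hypothetical, not real AGP sample IDs).
printf 'id_%s\n' 1 2 3 4 5 6 7 8 9 10 | seeded_sample 3 42
```

In the subset branch, `sort -R | head -n "$NUM_SAMP"` could then be swapped for `seeded_sample "$NUM_SAMP" 42`, where 42 is an arbitrary seed.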


### 7. Fetch the biom table for these samples and context

In [8]:
redbiom fetch samples \
    --context $CTX \
    --from $DATASET \
    --output samples.biom

95 sample ambiguities observed. Writing ambiguity mappings to: samples.biom.ambiguities


### 9. Summarize the BIOM table for the fetched samples to verify the download

In [9]:
biom summarize-table -i samples.biom | head

Num samples: 95
Num observations: 8,147
Total count: 1,702,640
Table density (fraction of non-zero values): 0.027

Counts/sample summary:
 Min: 21.000
 Max: 229,399.000
 Median: 13,601.000
 Mean: 17,922.526
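
The summary reports 95 samples although 100 IDs were requested, most likely because some IDs have no data in the chosen context. A small check like the one below makes that gap explicit. It assumes the `Num samples:` line format shown in the output above; `num_samples_from_summary` is a hypothetical helper.

```shell
# Extract the integer after "Num samples:" from `biom summarize-table` text on stdin.
# Thousands separators (as in "8,147") are stripped.
num_samples_from_summary() {
    awk -F': *' '/^Num samples:/ { gsub(/,/, "", $2); print $2; exit }'
}

# Compare the requested ID count with what the table holds (numbers from the outputs above).
requested=100
fetched=$(printf 'Num samples: 95\nNum observations: 8,147\n' | num_samples_from_summary)
echo "requested=$requested fetched=$fetched missing=$((requested - fetched))"
```

In the notebook, the two numbers would come from `wc -l < "$DATASET"` and `biom summarize-table -i samples.biom | num_samples_from_summary`.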


### 10. Retrieve all the metadata associated with these samples

In [10]:
redbiom fetch sample-metadata \
    --from $DATASET \
    --context $CTX \
    --output metadata.tsv \
    --all-columns

95 sample ambiguities observed. Writing ambiguity mappings to: metadata.tsv.ambiguities
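
Once `metadata.tsv` is on disk, tabulating the chosen column is a quick sanity check that the subset covers the expected answers. The sketch below assumes a tab-separated file with one header row; `count_column` is a hypothetical helper, and the toy table (including the `#SampleID` header) is made up for illustration.

```shell
# Count the values of one named column in a tab-separated file with a header row.
# Prints "count<TAB>value", most common first.
count_column() {
    file=$1 column=$2
    awk -F'\t' -v col="$column" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i
                  if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
                  next }
        { n[$idx]++ }
        END { for (v in n) printf "%d\t%s\n", n[v], v }
    ' "$file" | LC_ALL=C sort -k1,1nr
}

# Toy table (hypothetical sample IDs; answer labels taken from step 3).
tmp=$(mktemp)
printf '%s\t%s\n' '#SampleID' exercise_frequency \
    s1 Daily s2 Daily s3 'Rarely (a few times/month)' > "$tmp"
count_column "$tmp" exercise_frequency
rm -f "$tmp"
```

For the real file the call would be `count_column metadata.tsv exercise_frequency`.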
