#### Adam Klie<br>04/08/2020
# Downloading AGP data via `redbiom`
Download the American Gut Project (AGP) feature table and metadata from Qiita
 - __Output__
     - `${CTX}_${IS_SUBSET}_${NUM_SAMP}.ids` -- IDs of samples to download
     - exercise_metadata_list.txt -- list of metadata categories in Qiita whose names contain the keyword "exercise"
     - metadata.tsv -- tab-separated metadata for the samples listed in the `.ids` file
     - samples.biom -- sequence-by-sample feature table
     - \*.ambiguities -- ambiguity mappings written by `redbiom fetch` (samples.biom.ambiguities, metadata.tsv.ambiguities)

## Requirements
- Follow README.md to set up the environment for the data download
    - bash kernel
    - `redbiom`

### 0. Set-up for download

In [1]:
# Set some environment variables for filenames 
TYPE=full  # "test" for subset of AGP, "full" for entire AGP dataset 
IS_SUBSET=1  # 0 if TYPE=test, 1 if TYPE=full
NUM_SAMP=ALL  # number of samples as int if TYPE=test, "ALL" if TYPE=full

In [2]:
# Make a dated data directory if necessary, then move into it for the download
DATE=$(date +%F | sed 's/-/_/g')
mkdir -p "../data/${TYPE}/${DATE}"  # -p does nothing if the directory already exists
cd "../data/${TYPE}/${DATE}"
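
The three variables set in the first cell have to agree with each other (`TYPE=test` goes with `IS_SUBSET=0` and an integer `NUM_SAMP`; `TYPE=full` goes with `IS_SUBSET=1` and `NUM_SAMP=ALL`). A minimal sketch of a sanity check that could run before any download is shown here. `check_config` is a hypothetical helper introduced for illustration; it is not part of `redbiom` or of the original notebook.

```shell
# Sketch: verify that TYPE, IS_SUBSET and NUM_SAMP are mutually consistent.
# check_config is a hypothetical helper, not part of redbiom.
check_config() {
    local type=$1 is_subset=$2 num_samp=$3
    case "$type" in
        full)
            # Full download: no subsampling, NUM_SAMP is the literal "ALL"
            [ "$is_subset" = 1 ] && [ "$num_samp" = ALL ] ;;
        test)
            # Subsampled download: NUM_SAMP must be a positive integer
            [ "$is_subset" = 0 ] || return 1
            case "$num_samp" in
                ''|*[!0-9]*|0*) return 1 ;;
            esac ;;
        *)
            return 1 ;;
    esac
}

# Fall back to the notebook's values if the variables are not set yet
TYPE=${TYPE:-full}; IS_SUBSET=${IS_SUBSET:-1}; NUM_SAMP=${NUM_SAMP:-ALL}
if check_config "$TYPE" "$IS_SUBSET" "$NUM_SAMP"; then
    echo "config ok: TYPE=$TYPE IS_SUBSET=$IS_SUBSET NUM_SAMP=$NUM_SAMP"
else
    echo "inconsistent TYPE / IS_SUBSET / NUM_SAMP" >&2
fi
```

Running the check first avoids creating a dated directory under the wrong `TYPE` when the settings were only partly edited.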

### 1. Define "exercise" metadata choices

In [3]:
redbiom search metadata \
    --categories "exercise" > exercise_metadata_list.txt

### 2. Determine how many samples have each metadata feature

In [4]:
# Count the samples annotated with each "exercise" category
while IFS= read -r p; do
  N_SAMPLES="$(redbiom summarize metadata-category \
      --category "$p" --dump | wc -l)"
  printf '%s\t%s\n' "$p" "$N_SAMPLES"
done < exercise_metadata_list.txt

exercise	1510
total_hours_exercise	161
exercise_frequency_unit	1510
exercise_status	289
exercise_frequency	28017
enjoyment_of_exercise	161
pm_lifestyle_change_how_change_in_exercise	986
exercise_location	26347


### 3. Use `exercise_frequency` (covers the most samples and has informative values)

In [5]:
echo -e "exercise_frequency"
redbiom summarize metadata-category \
    --category "exercise_frequency" \
    --counter | tail -5

exercise_frequency
LabControl test	1095
Rarely (a few times/month)	3132
Daily	5133
Occasionally (1-2 times/week)	6217
Regularly (3-5 times/week)	9270


### 4. Choose a context

In [6]:
export CTX=Deblur-Illumina-16S-V4-150nt-780653

### 5. Save all AGP sample IDs to a text file (with the option to subset)

In [7]:
export DATASET=${CTX}_${IS_SUBSET}_${NUM_SAMP}.ids
echo $DATASET

Deblur-Illumina-16S-V4-150nt-780653_1_ALL.ids


In [8]:
# Select AGP samples (Qiita study 10317) and drop blanks
if [ "$IS_SUBSET" -eq 0 ]  # if TYPE=test: keep a random subset of NUM_SAMP IDs
then
    redbiom search metadata "where qiita_study_id == 10317" | grep -vi "blank" | shuf -n "$NUM_SAMP" > "$DATASET"
else  # if TYPE=full: keep every ID
    redbiom search metadata "where qiita_study_id == 10317" | grep -vi "blank" > "$DATASET"
fi
wc -l "$DATASET"

26377 Deblur-Illumina-16S-V4-150nt-780653_1_ALL.ids
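
For `TYPE=test`, plain `shuf -n` draws a different subset on every run. If a repeatable test subset is wanted, one option is a seeded draw. The sketch below swaps in `awk`'s `srand`/`rand` in place of `shuf`; it is an alternative technique, not what the notebook does. `seeded_sample`, the seed value, and the example IDs are all made up for illustration, and the draw only repeats for the same `awk` implementation, seed, and input order.

```shell
# Sketch: seeded random subsample of N lines from stdin.
# seeded_sample is a hypothetical helper (an alternative to `shuf -n`).
seeded_sample() {
    local n=$1 seed=$2
    # Prefix each line with a seeded pseudo-random key, sort by that key,
    # keep the first N lines, then strip the key again.
    LC_ALL=C awk -v seed="$seed" \
        'BEGIN { srand(seed) } { printf "%.10f\t%s\n", rand(), $0 }' |
        LC_ALL=C sort -k1,1n |
        head -n "$n" |
        cut -f2-
}

# Example with made-up sample IDs: draw 3 of 8 with seed 42
printf '10317.%09d\n' 1 2 3 4 5 6 7 8 | seeded_sample 3 42
```

In the cell above, `shuf -n "$NUM_SAMP"` would become `seeded_sample "$NUM_SAMP" 42`, with the seed recorded alongside the output.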


### 7. Fetch the biom table for these samples and context

In [9]:
redbiom fetch samples \
    --context $CTX \
    --from $DATASET \
    --output samples.biom

25180 sample ambiguities observed. Writing ambiguity mappings to: samples.biom.ambiguities


### 9. Summarize the BIOM table to verify the download

In [10]:
biom summarize-table -i samples.biom | head

Num samples: 25,180
Num observations: 1,028,814
Total count: 524,626,716
Table density (fraction of non-zero values): 0.000

Counts/sample summary:
 Min: 2.000
 Max: 499,002.000
 Median: 14,772.500
 Mean: 20,835.056
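
The table holds fewer samples than the ID list from step 5 (25,180 versus 26,377). A quick way to track that gap is to parse the `Num samples` line of the summary and subtract. This is a minimal sketch: `num_samples` is a hypothetical helper, the summary text is copied from the output above, and the notebook itself does not say why the counts differ.

```shell
# Sketch: extract "Num samples" from `biom summarize-table` output on stdin.
# num_samples is a hypothetical helper, not part of biom or redbiom.
num_samples() {
    awk -F': ' '/^Num samples:/ { gsub(/,/, "", $2); print $2; exit }'
}

# Summary lines copied from the cell output above
summary='Num samples: 25,180
Num observations: 1,028,814
Total count: 524,626,716'

requested=26377   # line count of the .ids file reported in step 5
fetched=$(printf '%s\n' "$summary" | num_samples)
echo "requested=$requested fetched=$fetched missing=$((requested - fetched))"
# prints: requested=26377 fetched=25180 missing=1197
```

In the notebook, the two numbers would come from `wc -l < "$DATASET"` and `biom summarize-table -i samples.biom | num_samples`.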


### 10. Retrieve all the metadata associated with these samples

In [11]:
redbiom fetch sample-metadata \
    --from $DATASET \
    --context $CTX \
    --output metadata.tsv \
    --all-columns

25180 sample ambiguities observed. Writing ambiguity mappings to: metadata.tsv.ambiguities
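
Once `metadata.tsv` is on disk, the value counts for a category such as `exercise_frequency` can be recomputed locally over exactly the downloaded samples. The following is a minimal sketch that looks a column up by its header name and tallies its values. `count_values` is a hypothetical helper, and the small example file (including its `#SampleID` header) is made up for illustration rather than taken from the real download.

```shell
# Sketch: tally the values of one named column in a tab-separated file.
# count_values is a hypothetical helper, not a redbiom command.
count_values() {
    local file=$1 column=$2
    awk -F'\t' -v col="$column" '
        NR == 1 {
            # Find the index of the requested column in the header row
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) exit 1      # unknown column: print nothing
            next
        }
        { n[$c]++ }
        END { for (v in n) printf "%s\t%d\n", v, n[v] }
    ' "$file" | LC_ALL=C sort -t "$(printf '\t')" -k2,2n -k1,1
}

# Example on a tiny made-up metadata file
tmp=$(mktemp)
printf '#SampleID\texercise_frequency\n' > "$tmp"
printf '%s\t%s\n' s1 Daily s2 'Regularly (3-5 times/week)' s3 Daily >> "$tmp"
count_values "$tmp" exercise_frequency
rm -f "$tmp"
```

Against the real file the call would be `count_values metadata.tsv exercise_frequency`. Splitting on tabs only keeps values that contain spaces and parentheses, such as `Rarely (a few times/month)`, intact.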
