#### Adam Klie<br>04/08/2020
# Downloading AGP data via `redbiom`
Download the American Gut Project (AGP) feature table and sample metadata from Qiita using `redbiom`.

## Requirements (details needed)
- Follow README.md to set up the environment for the data download
- At minimum:
    - bash kernel
    - redbiom
    - the `biom` command-line tool (used below to summarize the table)

### 0. Set-up for download

In [2]:
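TYPE=test      # name of the output subdirectory under ../data
IS_SUBSET=0    # 0 = random subset of NUM_SAMP samples; any other value = full dataset
NUM_SAMP=100   # number of samples to draw when subsetting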

In [3]:
# Create the date-stamped download directory (mkdir -p is a no-op if it exists), then move into it
DATE=$(date +%Y_%m_%d)
mkdir -p "../data/${TYPE}/${DATE}"
cd "../data/${TYPE}/${DATE}"

### 1. Define "exercise" metadata choices

In [4]:
redbiom search metadata \
    --categories "exercise" > exercise_metadata_list.txt

### 2. Determine how many samples have each metadata feature

In [5]:
# For each category, count the lines that redbiom dumps for it
while IFS= read -r p; do
  N_SAMPLES="$(redbiom summarize metadata-category \
      --category "$p" --dump | wc -l)"
  echo -e "$p\t$N_SAMPLES"
done < exercise_metadata_list.txt

pm_lifestyle_change_how_change_in_exercise	986
enjoyment_of_exercise	161
exercise_status	289
exercise	1510
exercise_frequency	28017
exercise_location	26347
total_hours_exercise	161
exercise_frequency_unit	1510
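
With the per-category counts in hand, the largest category can be picked programmatically rather than by eye. The following is a minimal sketch, assuming the tab-separated `category<TAB>count` lines printed above are piped in. `largest_category` is a hypothetical helper name, not part of redbiom.

```bash
# Hypothetical helper: read "category<TAB>count" lines on stdin and
# print the category with the largest count.
largest_category() {
    sort -t "$(printf '\t')" -k2,2nr | head -n 1 | cut -f1
}

# Counts copied from the output above.
counts="$(printf '%s\t%s\n' \
    pm_lifestyle_change_how_change_in_exercise 986 \
    enjoyment_of_exercise 161 \
    exercise_status 289 \
    exercise 1510 \
    exercise_frequency 28017 \
    exercise_location 26347 \
    total_hours_exercise 161 \
    exercise_frequency_unit 1510)"

printf '%s\n' "$counts" | largest_category   # prints exercise_frequency
```

In the notebook itself, the same helper could be attached to the end of the loop, for example `done < exercise_metadata_list.txt | largest_category`.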


### 3. Use `exercise_frequency` (covers the most samples and has informative values)

In [6]:
echo -e "exercise_frequency"
redbiom summarize metadata-category \
    --category "exercise_frequency" \
    --counter | tail -5

exercise_frequency
LabControl test	1095
Rarely (a few times/month)	3132
Daily	5133
Occasionally (1-2 times/week)	6217
Regularly (3-5 times/week)	9270


### 4. Choose a context

In [7]:
export CTX=Deblur-Illumina-16S-V4-150nt-780653

### 5. Save all AGP sample ids to text file (option to subset)

In [8]:
export DATASET=${CTX}_${IS_SUBSET}_${NUM_SAMP}.ids
echo $DATASET

Deblur-Illumina-16S-V4-150nt-780653_0_100.ids


In [14]:
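# Query all AGP sample ids (Qiita study 10317) and drop blanks
if [ "$IS_SUBSET" -eq 0 ]
then
    # Subset: keep NUM_SAMP randomly chosen sample ids
    redbiom search metadata "where qiita_study_id == 10317" | grep -vi "blank" | shuf -n "$NUM_SAMP" > "$DATASET"
else
    # Full dataset: keep every sample id
    redbiom search metadata "where qiita_study_id == 10317" | grep -vi "blank" > "$DATASET"
fi
wc -l "$DATASET"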

100 Deblur-Illumina-16S-V4-150nt-780653_0_100.ids
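
Because `shuf` draws a different subset on every run, the 100 ids above cannot be regenerated exactly. One way to make the draw repeatable is a seeded decorate-sort-undecorate pass in `awk`. This is a sketch of an alternative, not what this notebook ran. `seeded_subset` and the seed value are hypothetical, and the result is only repeatable for a given `awk` implementation.

```bash
# Hypothetical helper: print N lines of stdin, chosen in a seeded pseudo-random order.
# Usage: seeded_subset N SEED < ids.txt
seeded_subset() {
    # Prefix each line with a seeded random key, sort by that key,
    # keep the first N lines, then strip the key again.
    awk -v seed="$2" 'BEGIN { srand(seed) } { printf "%.10f\t%s\n", rand(), $0 }' |
        LC_ALL=C sort -k1,1 |
        head -n "$1" |
        cut -f2-
}

# Toy ids standing in for the redbiom search output.
seq 1 1000 | sed 's/^/sample_/' | seeded_subset 5 42
```

To try it here, `shuf -n "$NUM_SAMP"` in the cell above would be swapped for `seeded_subset "$NUM_SAMP" 42`, with the seed recorded alongside the `.ids` file.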


### 7. Fetch the biom table for these samples and context

In [15]:
redbiom fetch samples \
    --context $CTX \
    --from $DATASET \
    --output samples.biom

96 sample ambiguities observed. Writing ambiguity mappings to: samples.biom.ambiguities


### 9. Summarize the BIOM table for the fetched samples to verify the download

In [16]:
biom summarize-table -i samples.biom | head

Num samples: 96
Num observations: 9,405
Total count: 2,410,045
Table density (fraction of non-zero values): 0.027

Counts/sample summary:
 Min: 49.000
 Max: 310,599.000
 Median: 15,544.000
 Mean: 25,104.635
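
The summary reports 96 samples although the `.ids` file holds 100 ids, so it is worth checking the count explicitly. The following is a minimal sketch that parses the `Num samples:` line. It is fed the first lines of the summary shown above so that it runs without the table, and `num_samples` is a hypothetical helper.

```bash
# Hypothetical helper: extract the sample count from
# `biom summarize-table` output read on stdin.
num_samples() {
    awk -F': ' '/^Num samples:/ { gsub(/,/, "", $2); print $2; exit }'
}

# First lines of the summary printed above.
summary='Num samples: 96
Num observations: 9,405
Total count: 2,410,045'

fetched="$(printf '%s\n' "$summary" | num_samples)"
requested=100   # lines in the .ids file, per `wc -l` above

if [ "$fetched" -ne "$requested" ]; then
    echo "warning: requested $requested ids but the table holds $fetched samples"
fi
```

Against the real file, the helper would be used as `biom summarize-table -i samples.biom | num_samples`.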


### 10. Retrieve all the metadata associated with these samples

In [17]:
redbiom fetch sample-metadata \
    --from $DATASET \
    --context $CTX \
    --output metadata.tsv \
    --all-columns

96 sample ambiguities observed. Writing ambiguity mappings to: metadata.tsv.ambiguities
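
Once `metadata.tsv` is written, a quick tally of the column of interest can confirm that the subset carries usable `exercise_frequency` values. This sketch assumes the file is tab-separated with a single header row. The toy rows and sample ids below are made up for illustration, and `tally_column` is a hypothetical helper.

```bash
# Hypothetical helper: count the values of one named column in a TSV
# that has a header row. Usage: tally_column COLUMN < file.tsv
tally_column() {
    awk -F '\t' -v col="$1" '
        NR == 1 {
            # Locate the requested column by its header name.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) exit 1
            next
        }
        { counts[$idx]++ }
        END { for (v in counts) printf "%d\t%s\n", counts[v], v }
    ' | sort -k1,1nr
}

# Toy table (made-up ids) standing in for metadata.tsv.
printf '%s\t%s\n' \
    '#SampleID' exercise_frequency \
    toy.1 Daily \
    toy.2 Daily \
    toy.3 'Rarely (a few times/month)' | tally_column exercise_frequency
```

On the downloaded file this would be run as `tally_column exercise_frequency < metadata.tsv`.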
