# MAG Quality Control
## 1. Fetching datasets from BUSCO for bacteria, archaea, and fungi

In [None]:
# Fetch the BUSCO lineage datasets used to assess MAG completeness and contamination. Same code as in W4.
# Bacteria
! qiime annotate fetch-busco-db \
    --p-lineages bacteria_odb12 \
    --o-db $data_dir/busco-db-bacteria.qza

# Archaea
! qiime annotate fetch-busco-db \
    --p-lineages archaea_odb12 \
    --o-db $data_dir/busco-db-archaea.qza

# Fungi
! qiime annotate fetch-busco-db \
    --p-lineages fungi_odb12 \
    --o-db $data_dir/busco-db-fungi.qza
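
The three fetch commands differ only in the lineage name, so they could also be generated in a loop. The following is a minimal sketch, not the code we ran: it only prints the commands (a dry run), and the `kingdom` variable and the fallback value of `data_dir` are assumptions made for illustration.

```shell
# Dry run: print one fetch command per lineage instead of executing it.
# data_dir falls back to "." only so that this sketch runs on its own.
data_dir="${data_dir:-.}"

n_cmds=0
for kingdom in bacteria archaea fungi; do
    cmd="qiime annotate fetch-busco-db --p-lineages ${kingdom}_odb12 --o-db $data_dir/busco-db-${kingdom}.qza"
    echo "$cmd"
    n_cmds=$((n_cmds + 1))
done
```

To execute the fetches instead of printing them, the `qiime` call can be written directly in the loop body.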

## 2. Bacteria 
### 2.1 Run BUSCO

In [None]:
! qiime annotate evaluate-busco \
    --i-mags $data_dir/updog_mags.qza \
    --i-db $data_dir/busco-db-bacteria.qza \
    --p-lineage-dataset bacteria_odb12 \
    --p-cpu 3 \
    --o-results $data_dir/busco-results-bacteria.qza \
    --o-visualization $data_dir/mags-busco-bacteria.qzv


### 2.2 Filtering MAGs
Now that we have evaluated the quality of our MAGs, we can use the BUSCO results to keep only those that are more than 50% complete and less than 10% contaminated.

In [None]:
mosh annotate filter-mags \
  --i-mags $data_dir/mags.qza \
  --m-metadata-file $data_dir/busco-results-bacteria.qza \
  --p-where "complete > 50 AND contamination < 10" \
  --p-no-exclude-ids \
  --p-on mag \
  --o-filtered-mags $data_dir/mags_filtered_bacteria_50.qza \
  --verbose
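
To make the `--p-where` condition concrete, here is a small sketch that applies the same thresholds to a toy table with `awk`. The MAG IDs and values are made up for illustration, and this is not how `filter-mags` is implemented; it only shows which rows the condition `complete > 50 AND contamination < 10` keeps.

```shell
# Toy BUSCO-style table (hypothetical IDs and values), tab-separated.
tmp_tsv=$(mktemp)
printf 'id\tcomplete\tcontamination\n' >  "$tmp_tsv"
printf 'mag_a\t92.1\t1.3\n'            >> "$tmp_tsv"   # passes both thresholds
printf 'mag_b\t48.0\t0.5\n'            >> "$tmp_tsv"   # too incomplete
printf 'mag_c\t75.4\t12.8\n'           >> "$tmp_tsv"   # too contaminated
printf 'mag_d\t50.0\t2.0\n'            >> "$tmp_tsv"   # exactly 50 is not > 50

# Skip the header, then keep rows with complete > 50 and contamination < 10.
kept=$(awk -F '\t' 'NR > 1 && $2 > 50 && $3 < 10 { print $1 }' "$tmp_tsv")
echo "$kept"

rm -f "$tmp_tsv"
```

Both comparisons are strict, so a MAG with exactly 50% completeness is dropped in this sketch.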

## 3. Archaea
The archaea workflow follows the same steps as for bacteria.
### 3.1 Run BUSCO


In [None]:
qiime annotate evaluate-busco \
    --i-mags $data_dir/updog_mags_131025.qza \
    --i-db $data_dir/busco-db-archaea.qza \
    --p-lineage-dataset archaea_odb12 \
    --p-cpu 3 \
    --o-results $data_dir/busco-results-archaea.qza \
    --o-visualization $data_dir/mags-busco-archaea.qzv

### 3.2 Filtering MAGs

In [None]:
mosh annotate filter-mags \
  --i-mags $data_dir/mags.qza \
  --m-metadata-file $data_dir/busco-results-archaea.qza \
  --p-where "complete > 50 AND contamination < 10" \
  --p-no-exclude-ids \
  --p-on mag \
  --o-filtered-mags $data_dir/mags_filtered_archaea_50.qza \
  --verbose

## 4. Fungi

Because the BUSCO quality check on all MAGs at once took too long and required too much computational capacity, we partitioned the MAGs by sample ID.

### 4.1 Partitioning

In [None]:
qiime types partition-sample-data-mags \
  --i-mags updog_mags.qza \
  --p-num-partitions 126 \
  --o-partitioned-mags busco_inputs/updog_mags_partitions

### 4.2 Run BUSCO

We then ran BUSCO on each partition file, one file per SLURM array task.

In [None]:
QZA_DIR="/cluster/scratch/$USER/updog/busco_inputs/updog_mags_partitions"

# Pick the file corresponding to this task
SAMPLE_FILE=$(ls $QZA_DIR/*.qza | sed -n "${SLURM_ARRAY_TASK_ID}p")

echo "Processing $SAMPLE_FILE on $SLURM_JOB_NODELIST"

# Run BUSCO for fungi
qiime annotate evaluate-busco \
    --i-mags $SAMPLE_FILE \
    --i-db $data_dir/busco-db-fungi.qza \
    --p-lineage-dataset fungi_odb12 \
    --p-cpu 3 \
    --o-results $output_dir/$(basename $SAMPLE_FILE .qza)_busco-results-fungi.qza \
    --o-visualization $output_dir/$(basename $SAMPLE_FILE .qza)_busco-fungi.qzv
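
The file selection above relies on the array task ID being used as a line number into the sorted file listing. The sketch below reproduces that step on a temporary directory with three empty placeholder files; the directory, the file names, and the fixed task ID are assumptions for the demo, since SLURM sets `SLURM_ARRAY_TASK_ID` itself in a real array job.

```shell
# Temporary directory with placeholder partition files (hypothetical names).
demo_dir=$(mktemp -d)
touch "$demo_dir/sample_a.qza" "$demo_dir/sample_b.qza" "$demo_dir/sample_c.qza"

# In a real array job SLURM sets this variable; it is fixed here for the demo.
SLURM_ARRAY_TASK_ID=2

# Task N picks the N-th file of the sorted listing.
picked=$(ls "$demo_dir"/*.qza | sed -n "${SLURM_ARRAY_TASK_ID}p")
picked_name=$(basename "$picked" .qza)
echo "Task $SLURM_ARRAY_TASK_ID would process: $picked_name"

# The number of files gives the upper bound of the array range.
n_files=$(ls "$demo_dir"/*.qza | wc -l | tr -d ' ')
echo "Array range: 1-$n_files"

rm -rf "$demo_dir"
```

Counting the files this way is one option for choosing the array range, so that every partition gets exactly one task and no task ID maps to an empty file name.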

### 4.3 Filtering MAGs

In [None]:
# Paths
data_dir=/cluster/scratch/$USER/updog
samples_dir=$data_dir/busco_inputs/updog_mags_partitions          # directory with per-sample .qza
busco_metrics=$data_dir/busco-results-bacteria.qza  # BUSCO results used for filtering (this is the bacteria results file; a fungi filter needs the fungi results from step 4.2)
output_dir=$data_dir/busco_filtered
mkdir -p $output_dir

# Pick the sample file for this array task
SAMPLE_FILE=$(ls $samples_dir/*.qza | sed -n "${SLURM_ARRAY_TASK_ID}p")
BASENAME=$(basename "$SAMPLE_FILE" .qza)

echo "Filtering $SAMPLE_FILE on $SLURM_JOB_NODELIST"

# Run the filter
mosh annotate filter-mags \
  --i-mags $SAMPLE_FILE \
  --m-metadata-file $busco_metrics \
  --p-where "complete > 50 AND contamination < 10" \
  --p-no-exclude-ids \
  --p-on mag \
  --o-filtered-mags $output_dir/${BASENAME}_filtered.qza \
  --verbose
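
The filter needs BUSCO results for the MAGs being filtered. Since step 4.2 writes one results file per partition, named `<partition>_busco-results-fungi.qza`, the matching file can be derived from the partition's basename. A minimal sketch, assuming a hypothetical `busco_dir` that stands in for the `$output_dir` used in step 4.2 (its real value is not shown in this notebook), with placeholder paths throughout:

```shell
# Placeholder locations, for illustration only.
samples_dir=/tmp/demo/partitions
busco_dir=/tmp/demo/busco_fungi   # assumed: the $output_dir of step 4.2

# One partition file and its basename, as in the filtering script.
SAMPLE_FILE="$samples_dir/sample_a.qza"
BASENAME=$(basename "$SAMPLE_FILE" .qza)

# Per-partition fungi BUSCO results, following the naming used in step 4.2.
busco_metrics="$busco_dir/${BASENAME}_busco-results-fungi.qza"
echo "$busco_metrics"
```

The resulting `busco_metrics` path could then be passed to `--m-metadata-file` so that each partition is filtered with its own fungi results.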

### 4.4 Collating filtered MAGs

We then collated all filtered fungal MAGs into a single file for the dereplication step.

In [None]:
qiime types collate-sample-data-mags \
  --i-mags $data_dir/busco_filtered/*.qza \
  --o-collated-mags $data_dir/mags_filtered_all_fungi.qza
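
Before collating, it can help to confirm that every partition produced a filtered file, since a failed array task would otherwise go unnoticed. The sketch below checks this on temporary directories with empty placeholder files; the directories and sample names are assumptions for the demo, and only the `_filtered.qza` suffix comes from the filtering script.

```shell
# Temporary stand-ins for the partition and filtered directories.
parts_dir=$(mktemp -d)
filt_dir=$(mktemp -d)
touch "$parts_dir/sample_a.qza" "$parts_dir/sample_b.qza" "$parts_dir/sample_c.qza"
touch "$filt_dir/sample_a_filtered.qza" "$filt_dir/sample_c_filtered.qza"

# Collect the partitions that have no matching filtered file.
missing=""
for f in "$parts_dir"/*.qza; do
    base=$(basename "$f" .qza)
    if [ ! -e "$filt_dir/${base}_filtered.qza" ]; then
        missing="${missing}${missing:+ }${base}"
    fi
done

if [ -n "$missing" ]; then
    echo "Partitions without filtered output: $missing"
else
    echo "All partitions have filtered output"
fi

rm -rf "$parts_dir" "$filt_dir"
```

Any partition listed as missing would need its filtering task rerun before the collation step.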