## Machine Learning: How well does microbial composition predict abduction status?

In [1]:
# Setup
import os
import qiime2 as q2
import pandas as pd

from qiime2 import Visualization

data_dir = 'Alien_data'

In [45]:
# Classify abduction status based on microbial composition

! qiime sample-classifier classify-samples \
  --i-table $data_dir/table-filtered.qza \
  --m-metadata-file $data_dir/str_metadata.tsv \
  --m-metadata-column alleged_abduction \
  --p-test-size 0.2 \
  --p-estimator RandomForestClassifier \
  --p-palette 'enigma' \
  --p-random-state 0 \
  --output-dir $data_dir/small-RF-classifier

Saved SampleEstimator[Classifier] to: Alien_data/small-RF-classifier/sample_estimator.qza
Saved FeatureData[Importance] to: Alien_data/small-RF-classifier/feature_importance.qza
Saved SampleData[ClassifierPredictions] to: Alien_data/small-RF-classifier/predictions.qza
Saved Visualization to: Alien_data/small-RF-classifier/model_summary.qzv
Saved Visualization to: Alien_data/small-RF-classifier/accuracy_results.qzv
Saved SampleData[Probabilities] to: Alien_data/small-RF-classifier/probabilities.qza
Saved Visualization to: Alien_data/small-RF-classifier/heatmap.qzv
Saved SampleData[TrueTargets] to: Alien_data/small-RF-classifier/training_targets.qza
Saved SampleData[TrueTargets] to: Alien_data/small-RF-classifier/test_targets.qza

In [46]:
Visualization.load(f'{data_dir}/small-RF-classifier/accuracy_results.qzv')

The overall accuracy is high but only slightly above the baseline accuracy. Classifications are clearly skewed towards the "non-abducted" category: almost all non-abducted samples are correctly classified, but 60% of abducted samples are misclassified as non-abducted. The baseline accuracy is high because of the imbalance between abducted and non-abducted sample sizes, and the overall accuracy is high mainly because nearly all of the more numerous non-abducted samples are classified correctly.
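
To make the comparison between overall accuracy and baseline accuracy concrete, the sketch below computes both from a confusion matrix, together with per-class recall. The counts are hypothetical, chosen only so that 60% of abducted samples are misclassified; they are not the values shown in `accuracy_results.qzv`, and the helper name `summarize_confusion` is likewise an assumption made for illustration.

```python
# Hypothetical confusion matrix: outer key = true class, inner key = predicted class.
# These counts are illustrative only, NOT the notebook's actual results.
confusion = {
    "non-abducted": {"non-abducted": 29, "abducted": 1},
    "abducted": {"non-abducted": 6, "abducted": 4},
}


def summarize_confusion(confusion):
    """Return overall accuracy, baseline accuracy, and per-class recall."""
    class_totals = {true: sum(row.values()) for true, row in confusion.items()}
    n_samples = sum(class_totals.values())
    n_correct = sum(confusion[label][label] for label in confusion)
    # Baseline: accuracy obtained by always predicting the most frequent class.
    baseline = max(class_totals.values()) / n_samples
    recall = {
        label: confusion[label][label] / class_totals[label] for label in confusion
    }
    return n_correct / n_samples, baseline, recall


accuracy, baseline, recall = summarize_confusion(confusion)
print(f"overall accuracy:  {accuracy:.3f}")
print(f"baseline accuracy: {baseline:.3f}")
for label, value in recall.items():
    print(f"recall for {label}: {value:.3f}")
```

With these made-up counts, overall accuracy (0.825) sits only slightly above the baseline (0.75) even though recall for the abducted class is just 0.4, which is the same pattern described above.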

In [47]:
# Visualize individual samples' predictions and probabilities

! qiime metadata tabulate \
  --m-input-file $data_dir/small-RF-classifier/test_targets.qza \
  --m-input-file $data_dir/small-RF-classifier/predictions.qza \
  --m-input-file $data_dir/small-RF-classifier/probabilities.qza \
  --o-visualization $data_dir/small-RF-classifier/test_predprob.qzv

Saved Visualization to: Alien_data/small-RF-classifier/test_predprob.qzv

In [48]:
Visualization.load(f'{data_dir}/small-RF-classifier/test_predprob.qzv')

In [49]:
# Feature importance: which microbial features were most important for
#                     distinguishing abducted vs non-abducted samples?

! qiime metadata tabulate \
  --m-input-file $data_dir/small-RF-classifier/feature_importance.qza \
  --o-visualization $data_dir/small-RF-classifier/feature_importance.qzv

Saved Visualization to: Alien_data/small-RF-classifier/feature_importance.qzv

In [50]:
Visualization.load(f'{data_dir}/small-RF-classifier/feature_importance.qzv')

In [52]:
! qiime sample-classifier heatmap \
  --i-table $data_dir/table-filtered.qza \
  --i-importance $data_dir/small-RF-classifier/feature_importance.qza \
  --m-sample-metadata-file $data_dir/str_metadata.tsv \
  --m-sample-metadata-column alleged_abduction \
  --p-group-samples \
  --p-feature-count 20 \
  --o-filtered-table $data_dir/small-RF-classifier/important-feature-table-top-20.qza \
  --o-heatmap $data_dir/small-RF-classifier/important-feature-heatmap.qzv

Saved Visualization to: Alien_data/small-RF-classifier/important-feature-heatmap.qzv
Saved FeatureTable[Frequency] to: Alien_data/small-RF-classifier/important-feature-table-top-20.qza

In [53]:
Visualization.load(f'{data_dir}/small-RF-classifier/important-feature-heatmap.qzv')

### Repeat with Optimized Feature Selection

Optimizing feature selection had no effect on accuracy.

In [54]:
# Classify abduction status based on microbial composition

! qiime sample-classifier classify-samples \
  --i-table $data_dir/table-filtered.qza \
  --m-metadata-file $data_dir/str_metadata.tsv \
  --m-metadata-column alleged_abduction \
  --p-test-size 0.2 \
  --p-estimator RandomForestClassifier \
  --p-optimize-feature-selection \
  --p-palette 'enigma' \
  --p-random-state 0 \
  --output-dir $data_dir/RF-opt-feature-selection

Saved SampleEstimator[Classifier] to: Alien_data/RF-opt-feature-selection/sample_estimator.qza
Saved FeatureData[Importance] to: Alien_data/RF-opt-feature-selection/feature_importance.qza
Saved SampleData[ClassifierPredictions] to: Alien_data/RF-opt-feature-selection/predictions.qza
Saved Visualization to: Alien_data/RF-opt-feature-selection/model_summary.qzv
Saved Visualization to: Alien_data/RF-opt-feature-selection/accuracy_results.qzv
Saved SampleData[Probabilities] to: Alien_data/RF-opt-feature-selection/probabilities.qza
Saved Visualization to: Alien_data/RF-opt-feature-selection/heatmap.qzv
Saved SampleData[TrueTargets] to: Alien_data/RF-opt-feature-selection/training_targets.qza
Saved SampleData[TrueTargets] to: Alien_data/RF-opt-feature-selection/test_targets.qza

In [55]:
Visualization.load(f'{data_dir}/RF-opt-feature-selection/accuracy_results.qzv')

### Repeat with Parameter Tuning

Does parameter tuning improve the model's accuracy? No: it had no effect. (Optimized feature selection is kept enabled in this run.)

In [57]:
! qiime sample-classifier classify-samples \
  --i-table $data_dir/table-filtered.qza \
  --m-metadata-file $data_dir/str_metadata.tsv \
  --m-metadata-column alleged_abduction \
  --p-test-size 0.2 \
  --p-optimize-feature-selection \
  --p-parameter-tuning \
  --p-estimator RandomForestClassifier \
  --p-palette 'enigma' \
  --p-random-state 0 \
  --output-dir $data_dir/RF-param-tuning

Saved SampleEstimator[Classifier] to: Alien_data/RF-param-tuning/sample_estimator.qza
Saved FeatureData[Importance] to: Alien_data/RF-param-tuning/feature_importance.qza
Saved SampleData[ClassifierPredictions] to: Alien_data/RF-param-tuning/predictions.qza
Saved Visualization to: Alien_data/RF-param-tuning/model_summary.qzv
Saved Visualization to: Alien_data/RF-param-tuning/accuracy_results.qzv
Saved SampleData[Probabilities] to: Alien_data/RF-param-tuning/probabilities.qza
Saved Visualization to: Alien_data/RF-param-tuning/heatmap.qzv
Saved SampleData[TrueTargets] to: Alien_data/RF-param-tuning/training_targets.qza
Saved SampleData[TrueTargets] to: Alien_data/RF-param-tuning/test_targets.qza

In [58]:
Visualization.load(f'{data_dir}/RF-param-tuning/accuracy_results.qzv')

### Repeat with More Estimators (more trees)
Does increasing the number of trees improve the model's accuracy?
The default is 100 trees; 300 and 500 are tried here. Neither improved overall accuracy.
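
One possible reading of this result can be sketched with a deliberately simplified model, which is an assumption and not how the random forest in QIIME 2 actually combines its trees: treat each tree as an independent voter that is correct with probability `p_tree` and let the ensemble take a majority vote. Odd tree counts (101, 301, 501) are used only to avoid ties, and the helper `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_trees, p_tree):
    """P(majority of n_trees independent voters is correct), each correct with p_tree.

    n_trees is assumed odd so that ties cannot occur.
    """
    needed = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_tree**k * (1 - p_tree) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )


# Assumed per-tree accuracies: 0.5 (no signal) and 0.6 (weak signal).
for p_tree in (0.5, 0.6):
    for n_trees in (101, 301, 501):
        acc = majority_vote_accuracy(n_trees, p_tree)
        print(f"p_tree={p_tree}, trees={n_trees}: ensemble accuracy = {acc:.3f}")
```

Under these assumptions, adding voters only helps when each one carries some signal: at `p_tree = 0.5` the ensemble stays at chance no matter how many trees are added. Real trees in a forest are correlated, so this is an intuition aid rather than a model of the classifier used here.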

In [63]:
! qiime sample-classifier classify-samples \
  --i-table $data_dir/table-filtered.qza \
  --m-metadata-file $data_dir/str_metadata.tsv \
  --m-metadata-column alleged_abduction \
  --p-test-size 0.2 \
  --p-estimator RandomForestClassifier \
  --p-optimize-feature-selection \
  --p-parameter-tuning \
  --p-n-estimators 300 \
  --p-palette 'enigma' \
  --p-random-state 0 \
  --output-dir $data_dir/RF-threehundred-trees

Saved SampleEstimator[Classifier] to: Alien_data/RF-threehundred-trees/sample_estimator.qza
Saved FeatureData[Importance] to: Alien_data/RF-threehundred-trees/feature_importance.qza
Saved SampleData[ClassifierPredictions] to: Alien_data/RF-threehundred-trees/predictions.qza
Saved Visualization to: Alien_data/RF-threehundred-trees/model_summary.qzv
Saved Visualization to: Alien_data/RF-threehundred-trees/accuracy_results.qzv
Saved SampleData[Probabilities] to: Alien_data/RF-threehundred-trees/probabilities.qza
Saved Visualization to: Alien_data/RF-threehundred-trees/heatmap.qzv
Saved SampleData[TrueTargets] to: Alien_data/RF-threehundred-trees/training_targets.qza
Saved SampleData[TrueTargets] to: Alien_data/RF-threehundred-trees/test_targets.qza

In [64]:
Visualization.load(f'{data_dir}/RF-threehundred-trees/accuracy_results.qzv')

In [61]:
! qiime sample-classifier classify-samples \
  --i-table $data_dir/table-filtered.qza \
  --m-metadata-file $data_dir/str_metadata.tsv \
  --m-metadata-column alleged_abduction \
  --p-test-size 0.2 \
  --p-estimator RandomForestClassifier \
  --p-optimize-feature-selection \
  --p-parameter-tuning \
  --p-n-estimators 500 \
  --p-palette 'enigma' \
  --p-random-state 0 \
  --output-dir $data_dir/RF-fivehundred-trees

Saved SampleEstimator[Classifier] to: Alien_data/RF-fivehundred-trees/sample_estimator.qza
Saved FeatureData[Importance] to: Alien_data/RF-fivehundred-trees/feature_importance.qza
Saved SampleData[ClassifierPredictions] to: Alien_data/RF-fivehundred-trees/predictions.qza
Saved Visualization to: Alien_data/RF-fivehundred-trees/model_summary.qzv
Saved Visualization to: Alien_data/RF-fivehundred-trees/accuracy_results.qzv
Saved SampleData[Probabilities] to: Alien_data/RF-fivehundred-trees/probabilities.qza
Saved Visualization to: Alien_data/RF-fivehundred-trees/heatmap.qzv
Saved SampleData[TrueTargets] to: Alien_data/RF-fivehundred-trees/training_targets.qza
Saved SampleData[TrueTargets] to: Alien_data/RF-fivehundred-trees/test_targets.qza

In [62]:
Visualization.load(f'{data_dir}/RF-fivehundred-trees/accuracy_results.qzv')

### Repeat with fewer and more folds (cross-validation)
The default is 5 folds; 3 and 10 are tried here, both with 500 trees. Again, no effect.
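
The number of folds matters mostly through how many samples of the rarer class end up in each fold. The sketch below deals class labels into k folds in a stratified, round-robin fashion and counts the minority-class samples per fold. The class sizes (40 non-abducted, 10 abducted) and the helper `stratified_fold_counts` are assumptions for illustration; the splitting that QIIME 2 performs internally may differ in detail.

```python
from collections import Counter


def stratified_fold_counts(labels, n_folds):
    """Deal samples of each class into folds round-robin; return per-fold class counts."""
    folds = [Counter() for _ in range(n_folds)]
    seen = Counter()
    for label in labels:
        # The i-th sample of a class goes to fold i mod n_folds.
        folds[seen[label] % n_folds][label] += 1
        seen[label] += 1
    return folds


# Hypothetical training set: class sizes are assumptions, not the real data.
labels = ["non-abducted"] * 40 + ["abducted"] * 10

for k in (3, 5, 10):
    counts = [fold["abducted"] for fold in stratified_fold_counts(labels, k)]
    print(f"{k:>2} folds -> abducted samples per fold: {counts}")
```

With ten folds and these assumed sizes, each validation fold holds a single abducted sample, so any per-fold estimate for that class is very coarse.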

In [65]:
! qiime sample-classifier classify-samples \
  --i-table $data_dir/table-filtered.qza \
  --m-metadata-file $data_dir/str_metadata.tsv \
  --m-metadata-column alleged_abduction \
  --p-test-size 0.2 \
  --p-estimator RandomForestClassifier \
  --p-optimize-feature-selection \
  --p-parameter-tuning \
  --p-n-estimators 500 \
  --p-cv 3 \
  --p-palette 'enigma' \
  --p-random-state 0 \
  --output-dir $data_dir/RF-three-fold

Saved SampleEstimator[Classifier] to: Alien_data/RF-three-fold/sample_estimator.qza
Saved FeatureData[Importance] to: Alien_data/RF-three-fold/feature_importance.qza
Saved SampleData[ClassifierPredictions] to: Alien_data/RF-three-fold/predictions.qza
Saved Visualization to: Alien_data/RF-three-fold/model_summary.qzv
Saved Visualization to: Alien_data/RF-three-fold/accuracy_results.qzv
Saved SampleData[Probabilities] to: Alien_data/RF-three-fold/probabilities.qza
Saved Visualization to: Alien_data/RF-three-fold/heatmap.qzv
Saved SampleData[TrueTargets] to: Alien_data/RF-three-fold/training_targets.qza
Saved SampleData[TrueTargets] to: Alien_data/RF-three-fold/test_targets.qza

In [66]:
Visualization.load(f'{data_dir}/RF-three-fold/accuracy_results.qzv')

In [71]:
# Removed the --p-optimize-feature-selection flag because it causes the error
# "linkage must be computed on at least two observations". Possibly, in some
# partitions of the data, only one feature is retained when using optimized
# feature selection (all feature importances seem low).

! qiime sample-classifier classify-samples \
  --i-table $data_dir/table-filtered.qza \
  --m-metadata-file $data_dir/str_metadata.tsv \
  --m-metadata-column alleged_abduction \
  --p-test-size 0.2 \
  --p-estimator RandomForestClassifier \
  --p-parameter-tuning \
  --p-n-estimators 500 \
  --p-cv 10 \
  --p-palette 'enigma' \
  --p-random-state 0 \
  --output-dir $data_dir/RF-ten-fold

Saved SampleEstimator[Classifier] to: Alien_data/RF-ten-fold/sample_estimator.qza
Saved FeatureData[Importance] to: Alien_data/RF-ten-fold/feature_importance.qza
Saved SampleData[ClassifierPredictions] to: Alien_data/RF-ten-fold/predictions.qza
Saved Visualization to: Alien_data/RF-ten-fold/model_summary.qzv
Saved Visualization to: Alien_data/RF-ten-fold/accuracy_results.qzv
Saved SampleData[Probabilities] to: Alien_data/RF-ten-fold/probabilities.qza
Saved Visualization to: Alien_data/RF-ten-fold/heatmap.qzv
Saved SampleData[TrueTargets] to: Alien_data/RF-ten-fold/training_targets.qza
Saved SampleData[TrueTargets] to: Alien_data/RF-ten-fold/test_targets.qza

In [72]:
Visualization.load(f'{data_dir}/RF-ten-fold/accuracy_results.qzv')

##### Overall, none of the parameter settings improved the overall accuracy

The overall accuracy is almost equal to the baseline accuracy, meaning the model performs only about as well as one that simply assigns every sample to the most frequent class. This suggests that, in this dataset, microbial composition does not provide enough information to predict abduction status.
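
Because overall accuracy rewards predicting the majority class, a metric that weights both classes equally can make the comparison with the baseline clearer. The sketch below computes balanced accuracy (the mean of per-class recall) for a predictor that always outputs the most frequent class. The labels are hypothetical, and `balanced_accuracy` is an illustrative helper rather than part of the QIIME 2 output.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical test labels (assumed 3:1 imbalance, not the real data).
y_true = ["non-abducted"] * 30 + ["abducted"] * 10

# Majority-class predictor: always output the most frequent label.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"plain accuracy:    {plain:.2f}")
print(f"balanced accuracy: {balanced_accuracy(y_true, y_pred):.2f}")
```

For this assumed 3:1 split, the majority-class predictor reaches a plain accuracy of 0.75 but a balanced accuracy of only 0.50, so a model whose plain accuracy merely matches the baseline has not necessarily learned anything about the rarer class.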

#### Not done: evaluating over- or under-fitting