## Machine Learning: How well does microbial composition predict abduction status?

In [1]:
# Setup
import os
import qiime2 as q2
import pandas as pd

from qiime2 import Visualization

data_dir = '../data'

In [2]:
# Classify abduction status based on microbial composition

! qiime sample-classifier classify-samples \
  --i-table $data_dir/taxonomy/table-filtered.qza \
  --m-metadata-file $data_dir/metadata/str_metadata.tsv \
  --m-metadata-column alleged_abduction \
  --p-test-size 0.2 \
  --p-estimator RandomForestClassifier \
  --p-palette 'enigma' \
  --p-random-state 0 \
  --output-dir $data_dir/ML/small-RF-classifier

Saved SampleEstimator[Classifier] to: ../data/ML/small-RF-classifier/sample_estimator.qza
Saved FeatureData[Importance] to: ../data/ML/small-RF-classifier/feature_importance.qza
Saved SampleData[ClassifierPredictions] to: ../data/ML/small-RF-classifier/predictions.qza
Saved Visualization to: ../data/ML/small-RF-classifier/model_summary.qzv
Saved Visualization to: ../data/ML/small-RF-classifier/accuracy_results.qzv
Saved SampleData[Probabilities] to: ../data/ML/small-RF-classifier/probabilities.qza
Saved Visualization to: ../data/ML/small-RF-classifier/heatmap.qzv
Saved SampleData[TrueTargets] to: ../data/ML/small-RF-classifier/training_targets.qza
Saved SampleData[TrueTargets] to: ../data/ML/small-RF-classifier/test_targets.qza

In [3]:
Visualization.load(f'{data_dir}/ML/small-RF-classifier/accuracy_results.qzv')

The overall accuracy is high (87.9%) but only slightly above the baseline accuracy (84.8%). Classifications are clearly skewed towards the "non-abducted" category: almost all non-abducted samples are correctly classified, but 60% of abducted samples are misclassified as non-abducted. The high baseline accuracy reflects the imbalance between abducted and non-abducted sample sizes, and the overall accuracy is driven mostly by the correct classification of nearly all non-abducted samples.
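
To make the comparison above concrete, the sketch below derives overall accuracy, baseline accuracy, and per-class recall from a 2x2 confusion matrix. The counts are hypothetical: they were chosen only to be consistent with the percentages quoted above, and the real counts should be read from `accuracy_results.qzv`.

```python
# Hypothetical test-set confusion matrix (outer key: true class, inner key: predicted class).
# These counts are an assumption chosen to match the percentages quoted above;
# the real values are in accuracy_results.qzv.
confusion = {
    "non_abducted": {"non_abducted": 27, "abducted": 1},
    "abducted": {"non_abducted": 3, "abducted": 2},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[label][label] for label in confusion)

# Overall accuracy: fraction of all samples that were predicted correctly.
accuracy = correct / total

# Baseline accuracy: what a model achieves by always predicting the majority class.
baseline = max(sum(row.values()) for row in confusion.values()) / total

# Per-class recall: fraction of each true class that was predicted correctly.
recall = {label: row[label] / sum(row.values()) for label, row in confusion.items()}

print(f"accuracy = {accuracy:.3f}, baseline = {baseline:.3f}")
print({label: round(value, 3) for label, value in recall.items()})
```

Per-class recall exposes what the overall accuracy hides: most of the minority class is missed even though the headline number looks high.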

In [4]:
# Visualize individual samples' predictions and probabilities

! qiime metadata tabulate \
  --m-input-file $data_dir/ML/small-RF-classifier/test_targets.qza \
  --m-input-file $data_dir/ML/small-RF-classifier/predictions.qza \
  --m-input-file $data_dir/ML/small-RF-classifier/probabilities.qza \
  --o-visualization $data_dir/ML/small-RF-classifier/test_predprob.qzv

Saved Visualization to: ../data/ML/small-RF-classifier/test_predprob.qzv

In [5]:
Visualization.load(f'{data_dir}/ML/small-RF-classifier/test_predprob.qzv')

In [6]:
# Feature importance: which microbial features were most important for
#                     distinguishing abducted from non-abducted samples?

! qiime metadata tabulate \
    --m-input-file $data_dir/ML/small-RF-classifier/feature_importance.qza \
    --o-visualization $data_dir/ML/small-RF-classifier/feature_importance.qzv

Saved Visualization to: ../data/ML/small-RF-classifier/feature_importance.qzv

In [7]:
Visualization.load(f'{data_dir}/ML/small-RF-classifier/feature_importance.qzv')

In [8]:
! qiime sample-classifier heatmap \
  --i-table $data_dir/taxonomy/table-filtered.qza \
  --i-importance $data_dir/ML/small-RF-classifier/feature_importance.qza \
  --m-sample-metadata-file $data_dir/metadata/str_metadata.tsv  \
  --m-sample-metadata-column alleged_abduction \
  --p-group-samples \
  --p-feature-count 20 \
  --o-filtered-table $data_dir/ML/small-RF-classifier/important-feature-table-top-20.qza \
  --o-heatmap $data_dir/ML/small-RF-classifier/important-feature-heatmap.qzv

Saved Visualization to: ../data/ML/small-RF-classifier/important-feature-heatmap.qzv
Saved FeatureTable[Frequency] to: ../data/ML/small-RF-classifier/important-feature-table-top-20.qza

In [9]:
Visualization.load(f'{data_dir}/ML/small-RF-classifier/important-feature-heatmap.qzv')

#

### RF with Optimized Feature Selection

Optimizing feature selection had no effect on accuracy.

In [10]:
# Classify abduction status based on microbial composition

! qiime sample-classifier classify-samples \
  --i-table $data_dir/taxonomy/table-filtered.qza \
  --m-metadata-file $data_dir/metadata/str_metadata.tsv \
  --m-metadata-column alleged_abduction \
  --p-test-size 0.2 \
  --p-estimator RandomForestClassifier \
  --p-optimize-feature-selection \
  --p-palette 'enigma' \
  --p-random-state 0 \
  --output-dir $data_dir/ML/RF-opt-feature-selection

Saved SampleEstimator[Classifier] to: ../data/ML/RF-opt-feature-selection/sample_estimator.qza
Saved FeatureData[Importance] to: ../data/ML/RF-opt-feature-selection/feature_importance.qza
Saved SampleData[ClassifierPredictions] to: ../data/ML/RF-opt-feature-selection/predictions.qza
Saved Visualization to: ../data/ML/RF-opt-feature-selection/model_summary.qzv
Saved Visualization to: ../data/ML/RF-opt-feature-selection/accuracy_results.qzv
Saved SampleData[Probabilities] to: ../data/ML/RF-opt-feature-selection/probabilities.qza
Saved Visualization to: ../data/ML/RF-opt-feature-selection/heatmap.qzv
Saved SampleData[TrueTargets] to: ../data/ML/RF-opt-feature-selection/training_targets.qza
Saved SampleData[TrueTargets] to: ../data/ML/RF-opt-feature-selection/test_targets.qza

In [11]:
Visualization.load(f'{data_dir}/ML/RF-opt-feature-selection/accuracy_results.qzv')

#

### RF with Parameter Tuning

Does parameter tuning improve the model's accuracy? It had no effect.

In [14]:
! qiime sample-classifier classify-samples \
  --i-table $data_dir/taxonomy/table-filtered.qza \
  --m-metadata-file $data_dir/metadata/str_metadata.tsv \
  --m-metadata-column alleged_abduction \
  --p-test-size 0.2 \
  --p-optimize-feature-selection \
  --p-parameter-tuning \
  --p-estimator RandomForestClassifier \
  --p-palette 'enigma' \
  --p-random-state 0 \
  --output-dir $data_dir/ML/RF-param-tuning

Saved SampleEstimator[Classifier] to: ../data/ML/RF-param-tuning/sample_estimator.qza
Saved FeatureData[Importance] to: ../data/ML/RF-param-tuning/feature_importance.qza
Saved SampleData[ClassifierPredictions] to: ../data/ML/RF-param-tuning/predictions.qza
Saved Visualization to: ../data/ML/RF-param-tuning/model_summary.qzv
Saved Visualization to: ../data/ML/RF-param-tuning/accuracy_results.qzv
Saved SampleData[Probabilities] to: ../data/ML/RF-param-tuning/probabilities.qza
Saved Visualization to: ../data/ML/RF-param-tuning/heatmap.qzv
Saved SampleData[TrueTargets] to: ../data/ML/RF-param-tuning/training_targets.qza
Saved SampleData[TrueTargets] to: ../data/ML/RF-param-tuning/test_targets.qza

In [15]:
Visualization.load(f'{data_dir}/ML/RF-param-tuning/accuracy_results.qzv')

#

### RF with More Trees
Does increasing the number of trees improve the model's accuracy?
The default is 100 trees; here we try 300 and 500. Neither improved overall accuracy.

In [16]:
! qiime sample-classifier classify-samples \
  --i-table $data_dir/taxonomy/table-filtered.qza \
  --m-metadata-file $data_dir/metadata/str_metadata.tsv \
  --m-metadata-column alleged_abduction \
  --p-test-size 0.2 \
  --p-estimator RandomForestClassifier \
  --p-optimize-feature-selection \
  --p-parameter-tuning \
  --p-n-estimators 300 \
  --p-palette 'enigma' \
  --p-random-state 0 \
  --output-dir $data_dir/ML/RF-threehundred-trees

Saved SampleEstimator[Classifier] to: ../data/ML/RF-threehundred-trees/sample_estimator.qza
Saved FeatureData[Importance] to: ../data/ML/RF-threehundred-trees/feature_importance.qza
Saved SampleData[ClassifierPredictions] to: ../data/ML/RF-threehundred-trees/predictions.qza
Saved Visualization to: ../data/ML/RF-threehundred-trees/model_summary.qzv
Saved Visualization to: ../data/ML/RF-threehundred-trees/accuracy_results.qzv
Saved SampleData[Probabilities] to: ../data/ML/RF-threehundred-trees/probabilities.qza
Saved Visualization to: ../data/ML/RF-threehundred-trees/heatmap.qzv
Saved SampleData[TrueTargets] to: ../data/ML/RF-threehundred-trees/training_targets.qza
Saved SampleData[TrueTargets] to: ../data/ML/RF-threehundred-trees/test_targets.qza

In [17]:
Visualization.load(f'{data_dir}/ML/RF-threehundred-trees/accuracy_results.qzv')

In [18]:
! qiime sample-classifier classify-samples \
  --i-table $data_dir/taxonomy/table-filtered.qza \
  --m-metadata-file $data_dir/metadata/str_metadata.tsv \
  --m-metadata-column alleged_abduction \
  --p-test-size 0.2 \
  --p-estimator RandomForestClassifier \
  --p-optimize-feature-selection \
  --p-parameter-tuning \
  --p-n-estimators 500 \
  --p-palette 'enigma' \
  --p-random-state 0 \
  --output-dir $data_dir/ML/RF-fivehundred-trees

Saved SampleEstimator[Classifier] to: ../data/ML/RF-fivehundred-trees/sample_estimator.qza
Saved FeatureData[Importance] to: ../data/ML/RF-fivehundred-trees/feature_importance.qza
Saved SampleData[ClassifierPredictions] to: ../data/ML/RF-fivehundred-trees/predictions.qza
Saved Visualization to: ../data/ML/RF-fivehundred-trees/model_summary.qzv
Saved Visualization to: ../data/ML/RF-fivehundred-trees/accuracy_results.qzv
Saved SampleData[Probabilities] to: ../data/ML/RF-fivehundred-trees/probabilities.qza
Saved Visualization to: ../data/ML/RF-fivehundred-trees/heatmap.qzv
Saved SampleData[TrueTargets] to: ../data/ML/RF-fivehundred-trees/training_targets.qza
Saved SampleData[TrueTargets] to: ../data/ML/RF-fivehundred-trees/test_targets.qza

In [19]:
Visualization.load(f'{data_dir}/ML/RF-fivehundred-trees/accuracy_results.qzv')

#

### RF with fewer and more folds (cross-validation)
The default is 5 folds; here we try 3 and 10, all with 500 trees. Again, this had no effect.
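
As a rough illustration of what changing the number of folds means, the sketch below splits sample indices into k folds so that every sample is held out exactly once. It is a plain-Python illustration only, not how QIIME 2 or scikit-learn implement cross-validation (those also stratify by class). `k_fold_indices` is a hypothetical helper, and the sample count of 20 is illustrative rather than the size of the real feature table.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Hypothetical helper: shuffle sample indices and split them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Slicing with a stride of k yields folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for j, fold in enumerate(folds) if j != held_out for idx in fold]
        splits.append((train, test))
    return splits


# Illustrative sample count, not the size of the real feature table.
for k in (3, 5, 10):
    splits = k_fold_indices(n_samples=20, k=k)
    test_sizes = [len(test) for _, test in splits]
    print(f"{k}-fold: test fold sizes = {test_sizes}")
```

More folds means each model is trained on a larger share of the data but evaluated on a smaller held-out set, which matters for a minority class with few samples.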

In [20]:
! qiime sample-classifier classify-samples \
  --i-table $data_dir/taxonomy/table-filtered.qza \
  --m-metadata-file $data_dir/metadata/str_metadata.tsv \
  --m-metadata-column alleged_abduction \
  --p-test-size 0.2 \
  --p-estimator RandomForestClassifier \
  --p-optimize-feature-selection \
  --p-parameter-tuning \
  --p-n-estimators 500 \
  --p-cv 3 \
  --p-palette 'enigma' \
  --p-random-state 0 \
  --output-dir $data_dir/ML/RF-three-fold

Saved SampleEstimator[Classifier] to: ../data/ML/RF-three-fold/sample_estimator.qza
Saved FeatureData[Importance] to: ../data/ML/RF-three-fold/feature_importance.qza
Saved SampleData[ClassifierPredictions] to: ../data/ML/RF-three-fold/predictions.qza
Saved Visualization to: ../data/ML/RF-three-fold/model_summary.qzv
Saved Visualization to: ../data/ML/RF-three-fold/accuracy_results.qzv
Saved SampleData[Probabilities] to: ../data/ML/RF-three-fold/probabilities.qza
Saved Visualization to: ../data/ML/RF-three-fold/heatmap.qzv
Saved SampleData[TrueTargets] to: ../data/ML/RF-three-fold/training_targets.qza
Saved SampleData[TrueTargets] to: ../data/ML/RF-three-fold/test_targets.qza

In [21]:
Visualization.load(f'{data_dir}/ML/RF-three-fold/accuracy_results.qzv')

In [22]:
# Removed the feature selection parameter because it causes the error
# "linkage must be computed on at least two observations". It is possible
# that in some partitions of the data only one feature remains important
# when using optimized feature selection (all feature importances seem low).

! qiime sample-classifier classify-samples \
  --i-table $data_dir/taxonomy/table-filtered.qza \
  --m-metadata-file $data_dir/metadata/str_metadata.tsv \
  --m-metadata-column alleged_abduction \
  --p-test-size 0.2 \
  --p-estimator RandomForestClassifier \
  --p-parameter-tuning \
  --p-n-estimators 500 \
  --p-cv 10 \
  --p-palette 'enigma' \
  --p-random-state 0 \
  --output-dir $data_dir/ML/RF-ten-fold

Saved SampleEstimator[Classifier] to: ../data/ML/RF-ten-fold/sample_estimator.qza
Saved FeatureData[Importance] to: ../data/ML/RF-ten-fold/feature_importance.qza
Saved SampleData[ClassifierPredictions] to: ../data/ML/RF-ten-fold/predictions.qza
Saved Visualization to: ../data/ML/RF-ten-fold/model_summary.qzv
Saved Visualization to: ../data/ML/RF-ten-fold/accuracy_results.qzv
Saved SampleData[Probabilities] to: ../data/ML/RF-ten-fold/probabilities.qza
Saved Visualization to: ../data/ML/RF-ten-fold/heatmap.qzv
Saved SampleData[TrueTargets] to: ../data/ML/RF-ten-fold/training_targets.qza
Saved SampleData[TrueTargets] to: ../data/ML/RF-ten-fold/test_targets.qza

In [23]:
Visualization.load(f'{data_dir}/ML/RF-ten-fold/accuracy_results.qzv')

##### Overall, no parameter setting improved the overall accuracy

The overall accuracy is almost equal to the baseline accuracy, meaning the model performs only about as well as one that always predicts the most frequent class. This suggests that microbial composition does not provide enough information to predict abduction status.

#

### Nested cross-validation

In [24]:
# 3-fold, 500 trees
! qiime sample-classifier classify-samples-ncv \
  --i-table $data_dir/taxonomy/table-filtered.qza \
  --m-metadata-file $data_dir/metadata/str_metadata.tsv \
  --m-metadata-column alleged_abduction \
  --p-parameter-tuning \
  --p-cv 3 \
  --p-estimator RandomForestClassifier \
  --p-n-estimators 500 \
  --p-random-state 0 \
  --output-dir $data_dir/ML/RF-ncv-classifier-three-fold

Saved SampleData[ClassifierPredictions] to: ../data/ML/RF-ncv-classifier-three-fold/predictions.qza
Saved FeatureData[Importance] to: ../data/ML/RF-ncv-classifier-three-fold/feature_importance.qza
Saved SampleData[Probabilities] to: ../data/ML/RF-ncv-classifier-three-fold/probabilities.qza

In [25]:
# Summarize the nested cross-validation predictions as a confusion matrix

! qiime sample-classifier confusion-matrix \
  --i-predictions $data_dir/ML/RF-ncv-classifier-three-fold/predictions.qza \
  --i-probabilities $data_dir/ML/RF-ncv-classifier-three-fold/probabilities.qza \
  --m-truth-file $data_dir/metadata/str_metadata.tsv \
  --m-truth-column alleged_abduction \
  --o-visualization $data_dir/ML/RF-ncv-classifier-three-fold/ncv_confusion_matrix.qzv

Saved Visualization to: ../data/ML/RF-ncv-classifier-three-fold/ncv_confusion_matrix.qzv

In [26]:
Visualization.load(f'{data_dir}/ML/RF-ncv-classifier-three-fold/ncv_confusion_matrix.qzv')

In [27]:
# 5-fold, 500 trees
! qiime sample-classifier classify-samples-ncv \
  --i-table $data_dir/taxonomy/table-filtered.qza \
  --m-metadata-file $data_dir/metadata/str_metadata.tsv \
  --m-metadata-column alleged_abduction \
  --p-parameter-tuning \
  --p-cv 5 \
  --p-estimator RandomForestClassifier \
  --p-n-estimators 500 \
  --p-random-state 0 \
  --output-dir $data_dir/ML/RF-ncv-classifier-five-fold

Saved SampleData[ClassifierPredictions] to: ../data/ML/RF-ncv-classifier-five-fold/predictions.qza
Saved FeatureData[Importance] to: ../data/ML/RF-ncv-classifier-five-fold/feature_importance.qza
Saved SampleData[Probabilities] to: ../data/ML/RF-ncv-classifier-five-fold/probabilities.qza

In [28]:
# Summarize the nested cross-validation predictions as a confusion matrix
! qiime sample-classifier confusion-matrix \
  --i-predictions $data_dir/ML/RF-ncv-classifier-five-fold/predictions.qza \
  --i-probabilities $data_dir/ML/RF-ncv-classifier-five-fold/probabilities.qza \
  --m-truth-file $data_dir/metadata/str_metadata.tsv \
  --m-truth-column alleged_abduction \
  --o-visualization $data_dir/ML/RF-ncv-classifier-five-fold/ncv_confusion_matrix.qzv

Saved Visualization to: ../data/ML/RF-ncv-classifier-five-fold/ncv_confusion_matrix.qzv

In [29]:
Visualization.load(f'{data_dir}/ML/RF-ncv-classifier-five-fold/ncv_confusion_matrix.qzv')

In [30]:
# 10-fold, 500 trees
! qiime sample-classifier classify-samples-ncv \
  --i-table $data_dir/taxonomy/table-filtered.qza \
  --m-metadata-file $data_dir/metadata/str_metadata.tsv \
  --m-metadata-column alleged_abduction \
  --p-parameter-tuning \
  --p-cv 10 \
  --p-estimator RandomForestClassifier \
  --p-n-estimators 500 \
  --p-random-state 0 \
  --output-dir $data_dir/ML/RF-ncv-classifier-ten-fold

Saved SampleData[ClassifierPredictions] to: ../data/ML/RF-ncv-classifier-ten-fold/predictions.qza
Saved FeatureData[Importance] to: ../data/ML/RF-ncv-classifier-ten-fold/feature_importance.qza
Saved SampleData[Probabilities] to: ../data/ML/RF-ncv-classifier-ten-fold/probabilities.qza

In [31]:
# Summarize the nested cross-validation predictions as a confusion matrix

! qiime sample-classifier confusion-matrix \
  --i-predictions $data_dir/ML/RF-ncv-classifier-ten-fold/predictions.qza \
  --i-probabilities $data_dir/ML/RF-ncv-classifier-ten-fold/probabilities.qza \
  --m-truth-file $data_dir/metadata/str_metadata.tsv \
  --m-truth-column alleged_abduction \
  --o-visualization $data_dir/ML/RF-ncv-classifier-ten-fold/ncv_confusion_matrix.qzv

Saved Visualization to: ../data/ML/RF-ncv-classifier-ten-fold/ncv_confusion_matrix.qzv

In [32]:
Visualization.load(f'{data_dir}/ML/RF-ncv-classifier-ten-fold/ncv_confusion_matrix.qzv')

#

### Repeat with Balanced Sample Sizes

Limitation of the method below: the random sampling of the non-abducted samples is not seeded, so the balanced subset is not reproducible across runs.
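
One way around this limitation, as a minimal sketch: Python's `random` module supports seeded generators, so the same subset can be drawn on every run. The sample IDs and class sizes below are hypothetical and only illustrate the pattern; they are not taken from the metadata file.

```python
import random

# Hypothetical sample IDs and class sizes, for illustration only.
non_abducted_ids = [f"NA{i:03d}" for i in range(12)]
n_abducted = 4

# Number of majority-class samples to remove so both classes have equal size.
n_cut = len(non_abducted_ids) - n_abducted

# A dedicated, seeded generator returns the same subset on every run.
# Sorting first removes any dependence on the order of the input list.
rng = random.Random(0)
cut_ids = set(rng.sample(sorted(non_abducted_ids), n_cut))

kept_ids = [sample_id for sample_id in non_abducted_ids if sample_id not in cut_ids]
print(len(cut_ids), len(kept_ids))
```

With pandas, `DataFrame.sample(n=..., random_state=...)` offers the same kind of reproducibility directly on the metadata table.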

In [35]:
# Randomly sample non-abducted samples in metadata
from random import sample

meta = pd.read_csv(f"{data_dir}/metadata/str_metadata.tsv", sep = "\t")
meta.head()

Unnamed: 0,sampleid,stool_consistency,hct_source,disease,categorical_time_relative_to_engraftment,week_relative_to_hct,timepoint_of_transplant,day_relative_to_nearest_hct,alleged_abduction
0,N4VICF,formed,cord,Myelodysplastic Syndromes,pre,one week before HCT,6.0,-6.0,non_abducted
1,8A0F9A,formed,cord,Leukemia,pre,two weeks before HCT,7.0,-7.0,non_abducted
2,5Y49IM,semi-formed,cord,Leukemia,peri,one week before HCT,7.0,0.0,abducted
3,ZKJI45,semi-formed,cord,Leukemia,post,one week after HCT,7.0,8.0,non_abducted
4,2I7SIQ,liquid,cord,Leukemia,peri,one week before HCT,0.0,0.0,abducted


In [36]:
meta_nonab = meta.loc[meta['alleged_abduction'] == "non_abducted"]
meta_nonab.head()

Unnamed: 0,sampleid,stool_consistency,hct_source,disease,categorical_time_relative_to_engraftment,week_relative_to_hct,timepoint_of_transplant,day_relative_to_nearest_hct,alleged_abduction
0,N4VICF,formed,cord,Myelodysplastic Syndromes,pre,one week before HCT,6.0,-6.0,non_abducted
1,8A0F9A,formed,cord,Leukemia,pre,two weeks before HCT,7.0,-7.0,non_abducted
3,ZKJI45,semi-formed,cord,Leukemia,post,one week after HCT,7.0,8.0,non_abducted
6,XO59R8,liquid,cord,Leukemia,pre,one week before HCT,1.0,-1.0,non_abducted
7,AFG7YZ,semi-formed,cord,Leukemia,post,two weeks after HCT,1.0,15.0,non_abducted


In [37]:
# Get number of non-abducted samples that need to be REMOVED 
ncut = len(meta_nonab) - len(meta.loc[meta['alleged_abduction'] == "abducted"])
ncut

109

In [38]:
cutIDs = sample(list(meta_nonab["sampleid"]), ncut)
cutIDs[:10]

['L60NPJ',
 '62ARKK',
 '5CPP5N',
 'G3VBHP',
 'RYYSOI',
 'PGIR4X',
 'XDEIYI',
 'ZQT8ZN',
 '0KB68F',
 'NTRPTL']

In [39]:
# Drop the randomly selected non-abducted samples so both classes have equal size
meta.drop(index=meta.loc[meta["sampleid"].isin(cutIDs)].index, inplace=True, errors='raise')

In [40]:
len(meta.loc[meta["alleged_abduction"] == "non_abducted"])

26
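
The counts above (109 non-abducted samples removed, 26 kept) imply 135 non-abducted and 26 abducted samples in the original metadata. As a sketch of what balancing does to the baseline, the snippet below computes the accuracy of always predicting the most frequent class before and after downsampling. `majority_baseline` is a hypothetical helper, and the baseline reported by QIIME 2 may differ slightly because it is computed on the samples actually evaluated.

```python
def majority_baseline(class_counts):
    """Hypothetical helper: accuracy of always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())


# Counts derived from the cells above: 109 removed + 26 kept = 135 non-abducted.
original = {"non_abducted": 109 + 26, "abducted": 26}
balanced = {"non_abducted": 26, "abducted": 26}

print(f"original baseline: {majority_baseline(original):.3f}")
print(f"balanced baseline: {majority_baseline(balanced):.3f}")
```

After balancing, a classifier has to beat 50% rather than roughly 84% to show that it has learned anything from the features.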

In [41]:
# write new metadata file
meta.to_csv(f"{data_dir}/metadata/meta_balanced.tsv", sep = "\t", index=False)

In [42]:
# Double-check the new metadata file
meta = pd.read_csv(f"{data_dir}/metadata/meta_balanced.tsv", sep = "\t")
meta.head()

Unnamed: 0,sampleid,stool_consistency,hct_source,disease,categorical_time_relative_to_engraftment,week_relative_to_hct,timepoint_of_transplant,day_relative_to_nearest_hct,alleged_abduction
0,5Y49IM,semi-formed,cord,Leukemia,peri,one week before HCT,7.0,0.0,abducted
1,2I7SIQ,liquid,cord,Leukemia,peri,one week before HCT,0.0,0.0,abducted
2,PCUMU7,semi-formed,cord,Leukemia,post,two weeks after HCT,0.0,16.0,abducted
3,Q4TOSG,formed,cord,Myelodysplastic Syndromes,post,HCT week,6.0,7.0,abducted
4,8MIL3L,formed,cord,Non-Hodgkin's Lymphoma,pre,two weeks before HCT,7.0,-7.0,non_abducted


In [43]:
# Filter feature table to only include the remaining samples 

! qiime feature-table filter-samples \
  --i-table $data_dir/taxonomy/table-filtered.qza \
  --m-metadata-file $data_dir/metadata/meta_balanced.tsv \
  --o-filtered-table $data_dir/ML/table-filtered-balanced.qza 

Saved FeatureTable[Frequency] to: ../data/ML/table-filtered-balanced.qza

In [44]:
# Rerun nested cross-validation

# 3-fold, 100 trees
! qiime sample-classifier classify-samples-ncv \
  --i-table $data_dir/ML/table-filtered-balanced.qza \
  --m-metadata-file $data_dir/metadata/meta_balanced.tsv \
  --m-metadata-column alleged_abduction \
  --p-parameter-tuning \
  --p-cv 3 \
  --p-estimator RandomForestClassifier \
  --p-n-estimators 100 \
  --p-random-state 0 \
  --output-dir $data_dir/ML/balanced-ncv-three-fold

Saved SampleData[ClassifierPredictions] to: ../data/ML/balanced-ncv-three-fold/predictions.qza
Saved FeatureData[Importance] to: ../data/ML/balanced-ncv-three-fold/feature_importance.qza
Saved SampleData[Probabilities] to: ../data/ML/balanced-ncv-three-fold/probabilities.qza

In [45]:
# Summarize the nested cross-validation predictions as a confusion matrix

! qiime sample-classifier confusion-matrix \
  --i-predictions $data_dir/ML/balanced-ncv-three-fold/predictions.qza \
  --i-probabilities $data_dir/ML/balanced-ncv-three-fold/probabilities.qza \
  --m-truth-file $data_dir/metadata/meta_balanced.tsv \
  --m-truth-column alleged_abduction \
  --o-visualization $data_dir/ML/balanced-ncv-three-fold/ncv_confusion_matrix.qzv

Saved Visualization to: ../data/ML/balanced-ncv-three-fold/ncv_confusion_matrix.qzv

In [46]:
Visualization.load(f'{data_dir}/ML/balanced-ncv-three-fold/ncv_confusion_matrix.qzv')

Now the baseline accuracy is lower because the class frequencies are equal, and the overall accuracy is even slightly worse than random chance. Repeating with more trees and more folds:

In [None]:
# 5-fold, 300 trees

! qiime sample-classifier classify-samples-ncv \
  --i-table $data_dir/ML/table-filtered-balanced.qza \
  --m-metadata-file $data_dir/metadata/meta_balanced.tsv \
  --m-metadata-column alleged_abduction \
  --p-parameter-tuning \
  --p-cv 5 \
  --p-estimator RandomForestClassifier \
  --p-n-estimators 300 \
  --p-random-state 0 \
  --output-dir $data_dir/ML/balanced-ncv-five-fold

In [None]:
# Summarize the nested cross-validation predictions as a confusion matrix

! qiime sample-classifier confusion-matrix \
  --i-predictions $data_dir/ML/balanced-ncv-five-fold/predictions.qza \
  --i-probabilities $data_dir/ML/balanced-ncv-five-fold/probabilities.qza \
  --m-truth-file $data_dir/metadata/meta_balanced.tsv \
  --m-truth-column alleged_abduction \
  --o-visualization $data_dir/ML/balanced-ncv-five-fold/ncv_confusion_matrix.qzv

In [None]:
Visualization.load(f'{data_dir}/ML/balanced-ncv-five-fold/ncv_confusion_matrix.qzv')

In [None]:
# 10-fold, 300 trees

! qiime sample-classifier classify-samples-ncv \
  --i-table $data_dir/ML/table-filtered-balanced.qza \
  --m-metadata-file $data_dir/metadata/meta_balanced.tsv \
  --m-metadata-column alleged_abduction \
  --p-parameter-tuning \
  --p-cv 10 \
  --p-estimator RandomForestClassifier \
  --p-n-estimators 300 \
  --p-random-state 0 \
  --output-dir $data_dir/ML/balanced-ncv-ten-fold

In [None]:
# Summarize the nested cross-validation predictions as a confusion matrix

! qiime sample-classifier confusion-matrix \
  --i-predictions $data_dir/ML/balanced-ncv-ten-fold/predictions.qza \
  --i-probabilities $data_dir/ML/balanced-ncv-ten-fold/probabilities.qza \
  --m-truth-file $data_dir/metadata/meta_balanced.tsv \
  --m-truth-column alleged_abduction \
  --o-visualization $data_dir/ML/balanced-ncv-ten-fold/ncv_confusion_matrix.qzv

In [None]:
Visualization.load(f'{data_dir}/ML/balanced-ncv-ten-fold/ncv_confusion_matrix.qzv')

#

### Trying Linear Support Vector Classifier (SVC)

With 5-fold nested cross-validation (this turned out even worse).

In [5]:
! qiime sample-classifier classify-samples-ncv \
  --i-table $data_dir/taxonomy/table-filtered.qza \
  --m-metadata-file $data_dir/metadata/str_metadata.tsv \
  --m-metadata-column alleged_abduction \
  --p-parameter-tuning \
  --p-cv 5 \
  --p-estimator LinearSVC \
  --p-random-state 0 \
  --output-dir $data_dir/ML/LSVC-ncv-five-fold

Saved SampleData[ClassifierPredictions] to: ../Alien_data/LSVC-ncv-five-fold/predictions.qza
Saved FeatureData[Importance] to: ../Alien_data/LSVC-ncv-five-fold/feature_importance.qza
Saved SampleData[Probabilities] to: ../Alien_data/LSVC-ncv-five-fold/probabilities.qza

In [6]:
# Summarize the nested cross-validation predictions as a confusion matrix

! qiime sample-classifier confusion-matrix \
  --i-predictions $data_dir/ML/LSVC-ncv-five-fold/predictions.qza \
  --i-probabilities $data_dir/ML/LSVC-ncv-five-fold/probabilities.qza \
  --m-truth-file $data_dir/metadata/str_metadata.tsv \
  --m-truth-column alleged_abduction \
  --o-visualization $data_dir/ML/LSVC-ncv-five-fold/LSVC_ncv_confusion_matrix.qzv

Saved Visualization to: ../Alien_data/LSVC-ncv-five-fold/LSVC_ncv_confusion_matrix.qzv

In [12]:
Visualization.load(f'{data_dir}/ML/LSVC-ncv-five-fold/LSVC_ncv_confusion_matrix.qzv')