## Step 1: SCRuB_Decontamination
**Goal: Run [SCRuB](https://shenhav-and-korem-labs.github.io/SCRuB/) to remove lab-associated contamination**

Citation:
Austin, G.I., Park, H., Meydan, Y. et al. Contamination source modeling with SCRuB improves cancer phenotype prediction from microbiome data. Nat Biotechnol (2023). https://doi.org/10.1038/s41587-023-01696-w  

QIIME 2 plugin install (q2-SCRuB): https://forum.qiime2.org/t/q2-scrub-release/26609

Conda envs: HCC-microbialDNA (terminal), comad_EAC_figures (Jupyter)

### Imports

In [2]:
import pandas as pd
import qiime2
import biom

In [4]:
# Import prep and sample information for the pangenome-filtered data downloaded from Qiita
prep = pd.read_csv('qiita_downloads/pese_pangenome_align-RS210-masked/15336_mod_prep_16181_20231119-181626.txt', sep = '\t')
meta = pd.read_csv('qiita_downloads/pese_pangenome_align-RS210-masked/sample_information_mod_prep_16181.txt', sep = '\t')

### Import BIOM and convert to QZA (only have to run once)

In [5]:
#RS210 -clean: 5/30/24
biom_table = biom.load_table('qiita_downloads/pese_pangenome_align-RS210-masked/biom/none.biom')

# Import the BIOM table as a QIIME 2 artifact, then view it as a pandas DataFrame (samples as rows)
biom_artifact = qiime2.Artifact.import_data('FeatureTable[Frequency]', biom_table)
biom_df = biom_artifact.view(pd.DataFrame)#.T

# Replace '_' with '.' in sample IDs; underscores caused ID-handling issues in R (SCRuB runs in R)
biom_df.index = biom_df.index.str.replace('_', '.')

display(biom_df)

# Convert the pandas DataFrame to a QZA artifact and save it (uncomment for the first run only)
#biom_q2 = qiime2.Artifact.import_data("FeatureTable[Frequency]", biom_df)
#biom_q2.save('qiita_downloads/pese_pangenome_align-RS210-masked/none.qza')

Unnamed: 0,G000001985,G000002415,G000002445,G000002495,G000002655,G000002725,G000002765,G000002825,G000002845,G000002855,...,G905367715,G905397275,G905397375,G905397435,G907163105,G907163265,G907166805,G910573815,G910576415,G913698045
CRC10.B.2.S58.L003.aligned,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CRC10.B.S57.L003.aligned,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CRC.10.P.2.S34.L003.aligned,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CRC.10.P.S33.L003.aligned,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CRC10.T.2.S60.L003.aligned,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
HCC.9.P.2.S128.L003.aligned,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
HCC9.B.S189.L003.aligned,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
HCC9.B.2.S190.L003.aligned,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
HCC8.T.S187.L003.aligned,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
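
Replacing `_` with `.` could in principle collapse two distinct sample IDs into one (for example `A_1` and `A.1`). The following is a minimal sketch of a guard against that, not part of the original notebook: `normalize_ids` is a hypothetical helper name, and the IDs are toy values rather than real sample names.

```python
import pandas as pd


def normalize_ids(ids: pd.Index) -> pd.Index:
    """Replace underscores with dots and fail loudly if IDs collide afterwards.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    normalized = ids.str.replace('_', '.', regex=False)
    # IDs that appear more than once after the replacement
    duplicated = normalized[normalized.duplicated()].unique().tolist()
    if duplicated:
        raise ValueError(f'ID collision after normalization: {duplicated}')
    return normalized


# Toy IDs, not real sample names
toy_ids = pd.Index(['S_1.aligned', 'S_2.aligned'])
clean_ids = normalize_ids(toy_ids)
print(list(clean_ids))
```

In the notebook, the same check could be applied as `biom_df.index = normalize_ids(biom_df.index)`.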


### Prep for SCRuB (only have to run once)

In [4]:
# Build the SCRuB metadata for the RS210-clean manual run (5/30/24)
combo = pd.merge(meta, prep, on='sample_name', how='outer')

scrub_meta = pd.DataFrame()
scrub_meta['sampleid'] = combo['run_prefix'] + '.aligned'
scrub_meta['is_control'] = combo['empo_1'].replace({'Host-associated': 0.0, 'Control': 1.0})
#scrub_meta['is_control'] = combo['empo_1'].replace({'Host-associated': 'FALSE', 'Control': 'TRUE'})
scrub_meta['sample_type'] = combo['qiita_sample_type']
scrub_meta['well_id'] = combo['sample_well']

# Apply the same '_' -> '.' conversion used for the feature-table sample IDs
scrub_meta['sampleid'] = scrub_meta['sampleid'].str.replace('_', '.')

scrub_meta.to_csv('processed_data/SCRuB/scrub_meta_pangenome_rs210clean.tsv', sep = '\t', index = False)

scrub_meta

Unnamed: 0,sampleid,is_control,sample_type,well_id
0,BLANK1.8D.2.S12.L003.aligned,1.0,control blank,H8
1,BLANK1.8E.2.S14.L003.aligned,1.0,control blank,J8
2,BLANK1.8G.2.S18.L003.aligned,1.0,control blank,N8
3,BLANK2.12F.2.S26.L003.aligned,1.0,control blank,L12
4,BLANK2.12G.S27.L003.aligned,1.0,control blank,M12
...,...,...,...,...
155,HCC9.B.2.S190.L003.aligned,0.0,tumor,P5
156,HCC.9.P.S127.L003.aligned,0.0,plasma,A10
157,HCC.9.P.2.S128.L003.aligned,0.0,plasma,B10
158,HCC9.T.S191.L003.aligned,0.0,tumor,M4
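
Because the metadata comes from an outer merge and a dictionary-based `replace`, it may be worth confirming before running SCRuB that every sample ID in the feature table has a metadata row and that `is_control` contains only 0.0 or 1.0. The sketch below shows one way to do that; `check_scrub_inputs` is a hypothetical helper, and the data are toy values. The column names `sampleid` and `is_control` follow the `scrub_meta` frame built above.

```python
import pandas as pd


def check_scrub_inputs(table_ids, meta: pd.DataFrame) -> dict:
    """Compare feature-table sample IDs with the SCRuB metadata.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    table_ids = set(table_ids)
    # Drop missing IDs, which an outer merge can introduce
    meta_ids = set(meta['sampleid'].dropna())
    # Rows whose is_control is neither 0.0 nor 1.0 (unmapped or missing values)
    bad_rows = ~meta['is_control'].isin([0.0, 1.0])
    return {
        'missing_from_meta': sorted(table_ids - meta_ids),
        'missing_from_table': sorted(meta_ids - table_ids),
        'bad_is_control': meta.loc[bad_rows, 'sampleid'].tolist(),
    }


# Toy data, not real samples
toy_meta = pd.DataFrame({
    'sampleid': ['a.aligned', 'b.aligned', 'c.aligned'],
    'is_control': [1.0, 0.0, 'Unknown'],
})
report = check_scrub_inputs(['a.aligned', 'b.aligned', 'd.aligned'], toy_meta)
print(report)
```

In the notebook, this could be called as `check_scrub_inputs(biom_df.index, scrub_meta)`; three empty lists would indicate that the two inputs line up.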


### Decontaminate with SCRuB

In [5]:
#RS210 - clean (New 6/17/24)
! qiime SCRuB SCRuB \
--i-table qiita_downloads/pese_pangenome_align-RS210-masked/none.qza \
--m-metadata-file processed_data/SCRuB/scrub_meta_pangenome_rs210clean.tsv \
--p-control-idx-column is_control \
--p-sample-type-column sample_type \
--p-well-location-column well_id \
--p-control-order "control blank" \
--o-scrubbed processed_data/SCRuB/pese_pangenome_align-RS210_masked_none_scrubbed.qza

Saved FeatureTable[Frequency] to: processed_data/SCRuB/pese_pangenome_align-RS210_masked_none_scrubbed.qza
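
After decontamination, a quick comparison of per-sample read totals shows how much SCRuB removed. The sketch below is not part of the original notebook: `summarize_removed` is a hypothetical helper, the data are toy values, and both inputs are assumed to be samples-by-features DataFrames with matching sample IDs.

```python
import pandas as pd


def summarize_removed(raw: pd.DataFrame, scrubbed: pd.DataFrame) -> pd.DataFrame:
    """Per-sample read totals before and after decontamination.

    Both frames are samples x features. Only samples present in both
    tables are compared. Hypothetical helper for illustration.
    """
    shared = raw.index.intersection(scrubbed.index)
    summary = pd.DataFrame({
        'raw_reads': raw.loc[shared].sum(axis=1),
        'scrubbed_reads': scrubbed.loc[shared].sum(axis=1),
    })
    summary['fraction_removed'] = 1 - summary['scrubbed_reads'] / summary['raw_reads']
    return summary


# Toy tables, not real data
raw = pd.DataFrame({'G1': [10.0, 4.0], 'G2': [10.0, 6.0]}, index=['s1', 'blank'])
scrubbed = pd.DataFrame({'G1': [10.0], 'G2': [5.0]}, index=['s1'])
summary = summarize_removed(raw, scrubbed)
print(summary)
```

In the notebook, the two frames could be obtained with `qiime2.Artifact.load(path).view(pd.DataFrame)` for the input and scrubbed QZA files, provided the sample IDs in both tables use the same convention.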