<a href="https://colab.research.google.com/github/erin-baggs/DuckweedMicrobes/blob/main/ITS_Data_Processing_UNITE_update.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Google Colab notebook for ITS rRNA amplicon sequencing  

# **Get data for tutorial**

To run this tutorial, download the [UNITE](https://unite.ut.ee/repository.php) database files (`sh_taxonomy_qiime_ver8_dynamic_s_all_10.05.2021.txt` and `sh_refs_qiime_ver8_dynamic_s_all_10.05.2021.fasta`).

The demultiplexed FASTA reads, which allow you to skip to **Mapping to reference database**, can be found on NCBI under BioProject PRJNA785658 (SRR22220775-SRR22220777). 

Versions of the barcode demultiplexing scripts used to process the raw nanopore reads can be found on [GitHub](https://github.com/krasileva-group/Duckweed-Microbiome.git).  

Once you have downloaded the data, put it in your Google Drive or upload it into the Colab session. 

To mount Google Drive, click the folder icon in the far-left panel, then click the folder with the Google Drive symbol at the top of the file browser and agree to mount Google Drive. 
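
The mount can also be done from a code cell. The following is a minimal sketch, assuming the standard `google.colab.drive.mount` call and the `/content/drive/MyDrive/ITS-Colab/UNITE` folder used later in this notebook. The `missing_files` helper is hypothetical and only reports which of the expected UNITE files are absent.

```python
import os

# File names taken from the download instructions above.
UNITE_FILES = [
    "sh_taxonomy_qiime_ver8_dynamic_s_all_10.05.2021.txt",
    "sh_refs_qiime_ver8_dynamic_s_all_10.05.2021.fasta",
]

def missing_files(folder, names=UNITE_FILES):
    """Return the names that are not present in `folder` (hypothetical helper)."""
    return [n for n in names if not os.path.isfile(os.path.join(folder, n))]

try:
    # Only importable inside a Colab runtime.
    from google.colab import drive
    drive.mount("/content/drive")
except ImportError:
    print("Not running in Colab; skipping Drive mount.")

# Assumed location, matching the paths used in the cells below.
unite_dir = "/content/drive/MyDrive/ITS-Colab/UNITE"
print("Missing UNITE files:", missing_files(unite_dir))
```

An empty list means both database files are where the later cells expect them.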

## Steps for analyzing raw reads (skip if downloaded fasta from SRA)

In [None]:
# The Google Colab environment does not have conda set up; this would
# ordinarily be the easiest option for installing these tools.

!pip install git+https://github.com/rrwick/Porechop.git  # just so pomoxis will install cleanly
!pip install medaka pomoxis aplanat intervaltree==3.0.2
# install samtools from source
!wget https://github.com/samtools/samtools/releases/download/1.10/samtools-1.10.tar.bz2
!tar -xjf samtools-1.10.tar.bz2
!cd samtools-1.10 && ./configure --prefix=/usr/local/ && make && make install
!wget https://github.com/lh3/minimap2/releases/download/v2.17/minimap2-2.17_x64-linux.tar.bz2
!tar -xjf minimap2-2.17_x64-linux.tar.bz2
!cp /content/minimap2-2.17_x64-linux/minimap2 /usr/local/bin/
!pip install requests 

In [None]:
!cp  /content/drive/MyDrive/ITS-Colab/scripts/adapter-barcode.py /usr/local/lib/python3.7/dist-packages/porechop/adapters.py

In [None]:
!porechop -i /content/drive/MyDrive/ITS-Colab/combined-ITS.fastq -b /content/drive/MyDrive/ITS-Colab/ITS-demultiplex

Barcodes to pond

BC01 = 404 
BC02 = 405 
BC03 = 923 
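
The barcode-to-pond mapping above can be kept in one place and applied programmatically. This is a sketch only: `BARCODE_TO_POND` and `rename_by_pond` are hypothetical names, and the function assumes the per-barcode files sit in a single folder.

```python
import os

# Barcode-to-pond mapping from the list above.
BARCODE_TO_POND = {"BC01": "404", "BC02": "405", "BC03": "923"}

def rename_by_pond(folder, ext=".fasta", mapping=BARCODE_TO_POND):
    """Rename <barcode><ext> to <pond><ext> inside `folder`; return the new paths."""
    renamed = []
    for barcode, pond in mapping.items():
        src = os.path.join(folder, barcode + ext)
        dst = os.path.join(folder, pond + ext)
        if os.path.exists(src):  # skip barcodes that produced no file
            os.rename(src, dst)
            renamed.append(dst)
    return renamed
```

Calling `rename_by_pond(folder)` on a directory of demultiplexed FASTA files would leave `404.fasta`, `405.fasta` and `923.fasta` in place of the barcode-named files.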

Combine fasta per site 

In [None]:
# The Google Colab environment does not have conda set up; this would
# ordinarily be the easiest option for installing these tools.

!pip install git+https://github.com/rrwick/Porechop.git  # just so pomoxis will install cleanly
!pip install medaka pomoxis aplanat intervaltree==3.0.2
# install samtools from source
!wget https://github.com/samtools/samtools/releases/download/1.10/samtools-1.10.tar.bz2
!tar -xjf samtools-1.10.tar.bz2
!cd samtools-1.10 && ./configure --prefix=/usr/local/ && make && make install
!wget https://github.com/lh3/minimap2/releases/download/v2.17/minimap2-2.17_x64-linux.tar.bz2
!tar -xjf minimap2-2.17_x64-linux.tar.bz2
!cp /content/minimap2-2.17_x64-linux/minimap2 /usr/local/bin/
!pip install requests 

Download the UNITE database files from [here](https://unite.ut.ee/repository.php) 

In [None]:
!minimap2 -d /content/drive/MyDrive/ITS-Colab/UNITE/UNITE.mmi /content/drive/MyDrive/ITS-Colab/UNITE/sh_refs_qiime_ver8_dynamic_s_all_10.05.2021.fasta

In [None]:
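!mkdir -p /content/drive/MyDrive/ITS-Colab/ITS-demultiplex-30-9-22/scratch
!mkdir -p /content/drive/MyDrive/ITS-Colab/ITS-demultiplex-30-9-22/fasta/
!mv /content/drive/MyDrive/ITS-Colab/ITS-demultiplex/*fastq /content/drive/MyDrive/ITS-Colab/ITS-demultiplex-30-9-22/
# set aside the unassigned reads and the unused barcode
!mv /content/drive/MyDrive/ITS-Colab/ITS-demultiplex-30-9-22/none.fastq /content/drive/MyDrive/ITS-Colab/ITS-demultiplex-30-9-22/BC04.fastq /content/drive/MyDrive/ITS-Colab/ITS-demultiplex-30-9-22/scratch
# convert each remaining FASTQ to FASTA (keep only the header and sequence lines)
!for f in /content/drive/MyDrive/ITS-Colab/ITS-demultiplex-30-9-22/*fastq ; do sed -n '1~4s/^@/>/p;2~4p' "$f" > /content/drive/MyDrive/ITS-Colab/ITS-demultiplex-30-9-22/fasta/$(basename "$f" .fastq).fasta ; done
# rename the per-barcode FASTA files to their pond IDs (BC01 = 404, BC02 = 405, BC03 = 923)
!mv /content/drive/MyDrive/ITS-Colab/ITS-demultiplex-30-9-22/fasta/BC01.fasta /content/drive/MyDrive/ITS-Colab/ITS-demultiplex-30-9-22/fasta/404.fasta
!mv /content/drive/MyDrive/ITS-Colab/ITS-demultiplex-30-9-22/fasta/BC02.fasta /content/drive/MyDrive/ITS-Colab/ITS-demultiplex-30-9-22/fasta/405.fasta
!mv /content/drive/MyDrive/ITS-Colab/ITS-demultiplex-30-9-22/fasta/BC03.fasta /content/drive/MyDrive/ITS-Colab/ITS-demultiplex-30-9-22/fasta/923.fasta
!mkdir -p /content/drive/MyDrive/ITS-Colab/sam/AbundanceTables-30-9-22
!mkdir -p /content/drive/MyDrive/ITS-Colab/sam/samoutput-30-9-22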

A loop can be added to the command below to process all the ITS mock sequences at once. Also, to count reads rather than species per genus, the FASTA headers need to be adjusted so that each one is unique.
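
Both suggestions in the note above can be sketched in Python. This is an illustration under assumptions, not the notebook's method: `uniquify_headers`, `minimap2_command` and `map_samples` are hypothetical helpers, and the header scheme (`<prefix>_<n>`) is just one way to make headers unique. The minimap2 options are the ones used in the mapping cell.

```python
import subprocess

def uniquify_headers(fasta_lines, prefix):
    """Yield FASTA lines with each header rewritten as '><prefix>_<n> <old header>'."""
    n = 0
    for line in fasta_lines:
        if line.startswith(">"):
            n += 1
            yield f">{prefix}_{n} {line[1:].rstrip()}\n"
        else:
            yield line

def minimap2_command(index, fasta):
    """Build the minimap2 call used in this notebook as an argument list."""
    return ["minimap2", "-K", "5M", "-ax", "map-ont", "-L", index, fasta]

def map_samples(samples, index, fasta_dir):
    """Map each sample's FASTA and write <sample>.sam in the working directory."""
    for sample in samples:
        with open(f"{sample}.sam", "w") as out:
            subprocess.run(minimap2_command(index, f"{fasta_dir}/{sample}.fasta"),
                           stdout=out, check=True)

# Example call (requires minimap2 and the index built earlier):
# map_samples(["404", "405", "923"],
#             "/content/drive/MyDrive/ITS-Colab/UNITE/UNITE.mmi",
#             "/content/drive/MyDrive/ITS-Colab/ITS-demultiplex-30-9-22/fasta")
```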

In [None]:
!minimap2 -K 5M -ax map-ont -L /content/drive/MyDrive/ITS-Colab/UNITE/UNITE.mmi /content/drive/MyDrive/ITS-Colab/ITS-demultiplex-30-9-22/fasta/404.fasta > 404.sam
!minimap2 -K 5M -ax map-ont -L /content/drive/MyDrive/ITS-Colab/UNITE/UNITE.mmi /content/drive/MyDrive/ITS-Colab/ITS-demultiplex-30-9-22/fasta/405.fasta > 405.sam
!minimap2 -K 5M -ax map-ont -L /content/drive/MyDrive/ITS-Colab/UNITE/UNITE.mmi /content/drive/MyDrive/ITS-Colab/ITS-demultiplex-30-9-22/fasta/923.fasta > 923.sam

In [None]:
!for file in *.sam; do echo "==> ${file} <=="; grep -v '^@' "${file}" > "${file}.output"; done

Filter alignments on the SAM FLAG field (column 2), keeping only records whose [flag](https://broadinstitute.github.io/picard/explain-flags.html) is 0 (mapped, forward strand, primary alignment) or 256 (secondary alignment).  

This was not needed for bacteria; I wonder if it's because the bacterial database is more complete or appropriate? 
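
For readers less familiar with SAM flags, the filter can be written out in plain Python. This is a sketch with hypothetical helper names (`keep_alignment`, `describe_flag`). It mirrors the `awk` condition in the next cell, so records flagged 4 (unmapped), 16 (reverse strand) or 2048 (supplementary) are dropped.

```python
def keep_alignment(sam_line, allowed_flags=(0, 256)):
    """Return True if a SAM record's FLAG (column 2) is one of `allowed_flags`."""
    if sam_line.startswith("@"):  # header line, not an alignment record
        return False
    fields = sam_line.rstrip("\n").split("\t")
    return len(fields) > 1 and int(fields[1]) in allowed_flags

def describe_flag(flag):
    """Decode a few common SAM FLAG bits into readable labels."""
    bits = {4: "unmapped", 16: "reverse strand", 256: "secondary", 2048: "supplementary"}
    return [label for bit, label in bits.items() if flag & bit]

# Toy records: read name, FLAG, reference name (remaining columns omitted).
records = ["r1\t0\tSH1", "r2\t16\tSH1", "r3\t256\tSH2", "r4\t4\t*"]
kept = [r.split("\t")[0] for r in records if keep_alignment(r)]
print(kept)
```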



In [None]:
!mkdir /content/unite
!cat /content/404.sam.output | awk -F'\t' '$2 == 0||$2 == 256 {print $0;}' > /content/unite/404.sam.output
!cat /content/405.sam.output | awk -F'\t' '$2 == 0||$2 == 256 {print $0;}' > /content/unite/405.sam.output
!cat /content/923.sam.output | awk -F'\t' '$2 == 0||$2 == 256 {print $0;}' > /content/unite/923.sam.output

In [None]:
!cp /content/unite/*sam.output /content/drive/MyDrive/ITS-Colab/sam/samoutput-30-9-22/  

# Abundance tables 
To create abundance tables we use the UNITE taxonomy file and some code adapted from the [Puntseq](https://github.com/d-j-k/puntseq) project. If interested, the related paper is Urban L et al. (2020), Freshwater monitoring by nanopore sequencing, eLife: [https://elifesciences.org/articles/61504](https://elifesciences.org/articles/61504)
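
The core idea of the cells below is to keep each read's best-scoring hit, look up its UNITE lineage, and count reads per taxon at the chosen rank. The sketch below shows this on toy data using only the standard library. The helper names are hypothetical, the lineages are made up for illustration, and ties between equally scoring hits are simply resolved by first occurrence, which is a simplification of what the notebook cells do.

```python
from collections import Counter

# Rank prefixes as used in UNITE QIIME-style lineages (k__...;p__...;...;s__...).
RANK_PREFIXES = {"kingdom": "k__", "phylum": "p__", "class": "c__", "order": "o__",
                 "family": "f__", "genus": "g__", "species": "s__"}

def rank_from_lineage(lineage, rank):
    """Return the name at `rank` from a semicolon-separated lineage, or None."""
    prefix = RANK_PREFIXES[rank]
    for part in lineage.split(";"):
        part = part.strip()
        if part.startswith(prefix):
            return part[len(prefix):]
    return None

def best_hit_counts(hits, taxonomy, rank):
    """Count reads per taxon using each read's highest alignment score.

    `hits` is an iterable of (read, reference_id, alignment_score) tuples.
    """
    best = {}
    for read, ref_id, score in hits:
        if read not in best or score > best[read][1]:
            best[read] = (ref_id, score)
    return Counter(rank_from_lineage(taxonomy[ref_id], rank) for ref_id, _ in best.values())

# Toy taxonomy and hits (illustrative values only).
taxonomy = {
    "SH1": "k__Fungi;p__Ascomycota;c__A;o__B;f__C;g__Alternaria;s__Alternaria_alternata",
    "SH2": "k__Fungi;p__Basidiomycota;c__D;o__E;f__F;g__Malassezia;s__Malassezia_restricta",
}
hits = [("r1", "SH1", 90), ("r1", "SH2", 50), ("r2", "SH1", 70), ("r3", "SH2", 60)]
counts = best_hit_counts(hits, taxonomy, "genus")
print(counts)
```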

In [None]:
#Generate table of reads per species
import pandas as pd
import io
import os
import requests
import numpy as np
# load the UNITE taxonomy file
sild = pd.read_csv('/content/drive/MyDrive/ITS-Colab/UNITE/sh_taxonomy_qiime_ver8_dynamic_s_all_10.05.2021.txt', sep='\t', header=None)
sild.columns = ['taxid','tree']
sild['ranks'] = [x.split(';')[-1:] for x in sild.tree.values]
sild['tree'] = [x[0:-1] for x in sild.tree]
sild.index = sild.taxid
ranks = 'species'
# choose dir of sam files
dirc = '/content/unite' 


# build the abundance table, one column per SAM file
nr = 0
for filename in os.listdir(dirc):
    print(filename)
    
    try:
        silva_10k = pd.read_csv('/content/unite/%s' %filename, 
                         sep='\t', header=None, usecols = [0,2,4,13])
    except: 
        continue
    
    silva_10k.columns = ['Read_ID', 'id','MS', 'ASs']
    silva_10k['ASs'] = silva_10k['ASs'].astype('str')
    silva_10k['AS'] = [x.split(':i:')[-1] for x in silva_10k['ASs'].values]
    silva_10k.dropna(axis=0, subset=['AS'], inplace=True)
    silva_10k['AS'] = silva_10k['AS'].astype('float')
    mini = silva_10k[silva_10k['AS'] == silva_10k.groupby('Read_ID')['AS'].transform('max')]
    mini = mini[['Read_ID', 'MS', 'AS','id']]              
    mini.columns = ['read','score','as','id']
    mini = mini[~mini.id.isnull()]  
    mini['taxid'] =sild.ranks.loc[mini.id.values].values


    if ranks == 'species':
        mini['ranks'] = sild.ranks.loc[mini.id.values].values
        mini.index = mini.read  
        for i in mini.index[mini.duplicated(subset='read', keep=False)].unique():
            minil = list(mini.loc[i].taxid.values)
            # drop reads whose equally scoring hits disagree on the taxon
            if minil.count(minil[0]) != len(minil):
                mini = mini.drop(i)
        mini.drop_duplicates(subset='read', keep='first', inplace=True)

    mini['ranks']= [(x[0].strip("[]")) for x in mini.ranks] 
    mini['ranks']= [(x.split("s__")[1]) for x in mini.ranks]     #Current WORKS
    # count reads per taxon and name the column after the sample
    mini2 = mini.ranks.value_counts().rename(filename.split('.')[0]).to_frame()

    if nr==0:
        minif = mini2.copy(deep=True)
    else:
        minif = minif.merge(mini2, left_index=True, right_index=True, how='outer')
    nr = nr+1
# describe all missing taxa as absent
minif = minif.fillna(0) 
  
minif.to_csv('/content/minimap2_unite_species_%s.txt' %ranks, sep='\t')


In [None]:
#Generate table of reads per genus
import pandas as pd
import io
import os
import requests
import numpy as np
# load the UNITE taxonomy file
sild = pd.read_csv('/content/drive/MyDrive/ITS-Colab/UNITE/sh_taxonomy_qiime_ver8_dynamic_s_all_10.05.2021.txt', sep='\t', header=None)
sild.columns = ['taxid','tree']
sild['ranks'] = [x.split(';')[-2:-1] for x in sild.tree.values]
sild['tree'] = [x[0:-1] for x in sild.tree]
sild.index = sild.taxid
ranks = 'genus'
# choose dir of sam files
dirc = '/content/unite' 
# build the abundance table, one column per SAM file
nr = 0
for filename in os.listdir(dirc):
    
    try:
        silva_10k = pd.read_csv('/content/unite/%s' %filename, 
                         sep='\t', header=None, usecols = [0,2,4,13])
    except: 
        continue
    
    silva_10k.columns = ['Read_ID', 'id','MS', 'ASs']
    silva_10k['ASs'] = silva_10k['ASs'].astype('str')
    silva_10k['AS'] = [x.split(':i:')[-1] for x in silva_10k['ASs'].values]
    silva_10k.dropna(axis=0, subset=['AS'], inplace=True)
    silva_10k['AS'] = silva_10k['AS'].astype('float')
    mini = silva_10k[silva_10k['AS'] == silva_10k.groupby('Read_ID')['AS'].transform('max')]
    mini = mini[['Read_ID', 'MS', 'AS','id']]              
    mini.columns = ['read','score','as','id']
    mini = mini[~mini.id.isnull()]  
    mini['taxid'] =sild.ranks.loc[mini.id.values].values


    if ranks == 'genus':
        mini['ranks'] = sild.ranks.loc[mini.id.values].values
        mini.index = mini.read  
        for i in mini.index[mini.duplicated(subset='read', keep=False)].unique():
            minil = list(mini.loc[i].taxid.values)
            # drop reads whose equally scoring hits disagree on the taxon
            if minil.count(minil[0]) != len(minil):
                mini = mini.drop(i)
        mini.drop_duplicates(subset='read', keep='first', inplace=True)

    mini['ranks']= [(x[0].strip("[]")) for x in mini.ranks] 
    mini['ranks']= [(x.split("g__")[1]) for x in mini.ranks]     #Current WORKS
    # count reads per taxon and name the column after the sample
    mini2 = mini.ranks.value_counts().rename(filename.split('.')[0]).to_frame()

    if nr==0:
        minif = mini2.copy(deep=True)
    else:
        minif = minif.merge(mini2, left_index=True, right_index=True, how='outer')

    nr = nr+1
# describe all missing taxa as absent
minif = minif.fillna(0) 
  
minif.to_csv('/content/minimap2_unite_%s.txt' %ranks, sep='\t')


In [None]:
#Generate table of reads per order
import pandas as pd
import io
import os
import requests
import numpy as np
# load the UNITE taxonomy file
sild = pd.read_csv('/content/drive/MyDrive/ITS-Colab/UNITE/sh_taxonomy_qiime_ver8_dynamic_s_all_10.05.2021.txt', sep='\t', header=None)
sild.columns = ['taxid','tree']
sild['ranks'] = [x.split(';')[-4:-3] for x in sild.tree.values]
sild['tree'] = [x[0:-1] for x in sild.tree]
sild.index = sild.taxid
ranks = 'order'
# choose dir of sam files
dirc = '/content/unite' 
# build the abundance table, one column per SAM file
nr = 0
for filename in os.listdir(dirc):
    print(filename) 
    
    try:
        silva_10k = pd.read_csv('/content/unite/%s' %filename, sep='\t', header=None, usecols = [0,2,4,13])
    except: 
        continue
    
    silva_10k.columns = ['Read_ID', 'id','MS', 'ASs']
    silva_10k['ASs'] = silva_10k['ASs'].astype('str')
    silva_10k['AS'] = [x.split(':i:')[-1] for x in silva_10k['ASs'].values]
    silva_10k.dropna(axis=0, subset=['AS'], inplace=True)
    silva_10k['AS'] = silva_10k['AS'].astype('float')
    mini = silva_10k[silva_10k['AS'] == silva_10k.groupby('Read_ID')['AS'].transform('max')]
    mini = mini[['Read_ID', 'MS', 'AS','id']]              
    mini.columns = ['read','score','as','id']
    mini = mini[~mini.id.isnull()]  
    mini['taxid'] =sild.ranks.loc[mini.id.values].values


    if ranks == 'order':
        mini['ranks'] = sild.ranks.loc[mini.id.values].values
        mini.index = mini.read  
        for i in mini.index[mini.duplicated(subset='read', keep=False)].unique():
            minil = list(mini.loc[i].taxid.values)
            # drop reads whose equally scoring hits disagree on the taxon
            if minil.count(minil[0]) != len(minil):
                mini = mini.drop(i)
        mini.drop_duplicates(subset='read', keep='first', inplace=True)

    mini['ranks']= [(x[0].strip("[]")) for x in mini.ranks] 
    mini['ranks']= [(x.split("o__")[1]) for x in mini.ranks]     #Current WORKS
    # count reads per taxon and name the column after the sample
    mini2 = mini.ranks.value_counts().rename(filename.split('.')[0]).to_frame()

    if nr==0:
        minif = mini2.copy(deep=True)
    else:
        minif = minif.merge(mini2, left_index=True, right_index=True, how='outer')

    nr = nr+1
# describe all missing taxa as absent
minif = minif.fillna(0) 
  
minif.to_csv('/content/minimap2_unite_%s.txt' %ranks, sep='\t')


In [None]:
#Generate table of reads per phylum
import pandas as pd
import io
import os
import requests
import numpy as np
# load the UNITE taxonomy file
sild = pd.read_csv('/content/drive/MyDrive/ITS-Colab/UNITE/sh_taxonomy_qiime_ver8_dynamic_s_all_10.05.2021.txt', sep='\t', header=None)
sild.columns = ['taxid','tree']
sild['ranks'] = [x.split(';')[-6:-5] for x in sild.tree.values]
sild['tree'] = [x[0:-1] for x in sild.tree]
sild.index = sild.taxid
ranks = 'phylum'
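# choose dir of sam files (the flag-filtered output, as in the cells above)
dirc = '/content/unite' 
# build the abundance table, one column per SAM file
nr = 0
for filename in os.listdir(dirc):
    
    try:
        silva_10k = pd.read_csv(os.path.join(dirc, filename), 
                         sep='\t', header=None, usecols = [0,2,4,13])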
    except: 
        continue
    
    silva_10k.columns = ['Read_ID', 'id','MS', 'ASs']
    silva_10k['ASs'] = silva_10k['ASs'].astype('str')
    silva_10k['AS'] = [x.split(':i:')[-1] for x in silva_10k['ASs'].values]
    silva_10k.dropna(axis=0, subset=['AS'], inplace=True)
    silva_10k['AS'] = silva_10k['AS'].astype('float')
    mini = silva_10k[silva_10k['AS'] == silva_10k.groupby('Read_ID')['AS'].transform('max')]
    mini = mini[['Read_ID', 'MS', 'AS','id']]              
    mini.columns = ['read','score','as','id']
    mini = mini[~mini.id.isnull()]  
    mini['taxid'] =sild.ranks.loc[mini.id.values].values


    if ranks == 'phylum':
        mini['ranks'] = sild.ranks.loc[mini.id.values].values
        mini.index = mini.read  
        for i in mini.index[mini.duplicated(subset='read', keep=False)].unique():
            minil = list(mini.loc[i].taxid.values)
            # drop reads whose equally scoring hits disagree on the taxon
            if minil.count(minil[0]) != len(minil):
                mini = mini.drop(i)
        mini.drop_duplicates(subset='read', keep='first', inplace=True)

    mini['ranks']= [(x[0].strip("[]")) for x in mini.ranks] 
    mini['ranks']= [(x.split("p__")[1]) for x in mini.ranks]     #Current WORKS
    # count reads per taxon and name the column after the sample
    mini2 = mini.ranks.value_counts().rename(filename.split('.')[0]).to_frame()

    if nr==0:
        minif = mini2.copy(deep=True)
    else:
        minif = minif.merge(mini2, left_index=True, right_index=True, how='outer')

    nr = nr+1
# describe all missing taxa as absent
minif = minif.fillna(0) 
  
minif.to_csv('/content/minimap2_unite_%s.txt' %ranks, sep='\t')
