# Introduction 

Scientific question: A previous study reports a close physical association between SARS-CoV-2 and Rothia dentocariosa in COVID-19 patients’ rooms. In this project I ask: what unique components and sequences distinguish Rothia dentocariosa from other closely related Rothia bacteria, such that it is detected near SARS-CoV-2?

Background: Rothia aeria, Rothia dentocariosa, and Rothia mucilaginosa are associated with respiratory infection and are found in the mouth as well as the lungs. All three belong to the family Micrococcaceae and sit close together in the Micrococcaceae phylogenetic tree. According to the research paper entitled “SARS-CoV-2 detection status associates with bacterial community composition in patients and the hospital environment,” a high percentage of Rothia dentocariosa is detected in SARS-CoV-2 patients’ rooms. I want to compare the sequences of these three Rothia bacteria by sequence alignment to find differences and regions that are unique to Rothia dentocariosa compared to the other two.

Source: 
    
* Rothia bacteria sequence source links

Rothia aeria 
https://www.ncbi.nlm.nih.gov/nuccore/AP017895

Rothia mucilaginosa
https://www.ncbi.nlm.nih.gov/nuccore/AP014938 

Rothia dentocariosa
https://www.ncbi.nlm.nih.gov/nuccore/CP054018.1 

* Type 1 and type 2 fimbriae PDB files

https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22type%22%3A%22group%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22full_text%22%2C%22parameters%22%3A%7B%22value%22%3A%22type1%20fimbriae%22%7D%7D%5D%2C%22logical_operator%22%3A%22and%22%7D%5D%2C%22logical_operator%22%3A%22and%22%2C%22label%22%3A%22full_text%22%7D%5D%2C%22logical_operator%22%3A%22and%22%7D%2C%22return_type%22%3A%22entry%22%2C%22request_options%22%3A%7B%22paginate%22%3A%7B%22start%22%3A0%2C%22rows%22%3A25%7D%2C%22scoring_strategy%22%3A%22combined%22%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%7D%2C%22request_info%22%3A%7B%22query_id%22%3A%2265ee21cd93486930c39966d1f9487d20%22%7D%7D


Scientific hypothesis: If Rothia dentocariosa is frequently and abundantly present near SARS-CoV-2, then it will show distinct sequence regions and structurally unique fimbriae proteins compared to other closely related bacteria associated with respiratory infection. The structure of the fimbriae matters because fimbriae adhere to other organisms as well as to non-living surfaces. Any differences in the fimbriae of Rothia dentocariosa may therefore be related to SARS-CoV-2.

# Package Description

* AlignIO: Used to read and write multiple sequence alignment files. It parses existing alignments and does not align sequences itself.

* cat: Unix shell command used to concatenate the FASTA files into one file.

* ClustalwCommandline: Biopython wrapper for the ClustalW multiple sequence alignment program.

* MafftCommandline: Biopython wrapper for the MAFFT multiple sequence alignment program.

* py3Dmol: Used for viewing the 3D structure of a protein.

* logomaker: Used for creating DNA, RNA, or protein sequence logos.

Source: 
- https://biopython.org/wiki/AlignIO
- https://www.reddit.com/r/bioinformatics/comments/owdrna/combining_multiple_fasta_files_into_one_file_for/
- https://pypi.org/project/py3Dmol/

## Method #1: Multiple sequence alignment with sequence of Rothia dentocariosa, Rothia mucilaginosa, and Rothia aeria. 

Method #1 description: For method 1, I perform a multiple sequence alignment of the Rothia aeria, Rothia dentocariosa, and Rothia mucilaginosa sequences. As described in the background, the three bacteria are closely related members of the Micrococcaceae family, and a high percentage of Rothia dentocariosa is detected in SARS-CoV-2 patients’ rooms. Aligning the three sequences lets me compare them and look for regions that are unique to Rothia dentocariosa and that may have a relationship with SARS-CoV-2.
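Once an alignment exists, the comparison described above can be sketched in a few lines. The example below is a minimal illustration on a toy alignment, not the project's result: the function name `target_specific_columns` and the toy sequences are assumptions made for this sketch. It reports the columns where one sequence differs from all the others while the others agree with each other.

```python
def target_specific_columns(alignment, target):
    """Return 0-based columns where `target` differs from every other
    sequence and all the other sequences share the same character."""
    target_seq = alignment[target]
    others = [seq for name, seq in alignment.items() if name != target]
    hits = []
    for col, base in enumerate(target_seq):
        other_bases = {seq[col] for seq in others}
        # The others must agree, and the target must differ from them.
        if len(other_bases) == 1 and base not in other_bases:
            hits.append(col)
    return hits


# Toy aligned sequences (hypothetical data, equal length, "-" marks a gap).
toy_alignment = {
    "R_dentocariosa": "ATGCCGTA-A",
    "R_mucilaginosa": "ATGACGTATA",
    "R_aeria":        "ATGACGTTTA",
}

print(target_specific_columns(toy_alignment, "R_dentocariosa"))  # [3, 8]
```

Gaps are treated like any other character here, so an insertion or deletion specific to the target is also reported.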

In [5]:
# Import libraries
from Bio import AlignIO
  
# Read each Rothia genome. Every file holds a single sequence, so each
# "alignment" object below contains exactly one row.
dentocariosa_dna = AlignIO.read("Rothia denotocariosa DNA copy.fasta", "fasta")

mucilaginosa_dna = AlignIO.read("Rothia mucilaginosa DNA copy.fasta", "fasta")

aeria_dna = AlignIO.read("Rothia aeria DNA copy.fasta", "fasta")
  
# Print alignment objects
print(dentocariosa_dna)

print(mucilaginosa_dna)

print(aeria_dna)


Alignment with 1 rows and 2533415 columns
TATTTCTCGTCGCGGGCTGATGCGTGCCGCCGCTGTAGGCGTGC...TGG sequence
Alignment with 1 rows and 2292716 columns
TCTCTCACACACCTTCGCTACAGATTTGTCGGTGGTGTATTCAC...GAG AP014938.1
Alignment with 1 rows and 2588680 columns
ATTGCGGCAGCCATCTACTCTCCCACACCCACCAAGATGCAGTA...AAA AP017895.1


##### Install cats for combining all three Rothia bacteria sequences into one file (see next chunk).

source: https://www.reddit.com/r/bioinformatics/comments/owdrna/combining_multiple_fasta_files_into_one_file_for/ 

In [38]:
# install cats

In [16]:
pip install cats

Note: you may need to restart the kernel to use updated packages.


##### combine all three sequences into one fasta file using cat. 

In [37]:
# combine all three sequences using cat. 

In [17]:
cat "Rothia denotocariosa DNA copy.fasta" "Rothia mucilaginosa DNA copy.fasta" "Rothia aeria DNA copy.fasta" > "combined.fasta"
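The same concatenation can be done in plain Python, which avoids depending on a shell. This is a minimal sketch using only the standard library; the helper name `combine_fasta` is an assumption made for illustration. It also adds a newline when an input file does not end with one, so that the next header never ends up glued to the previous sequence.

```python
from pathlib import Path


def combine_fasta(input_paths, output_path):
    """Concatenate several FASTA files into one file, in the given order."""
    with open(output_path, "w") as out:
        for path in input_paths:
            text = Path(path).read_text()
            out.write(text)
            # Make sure the next record starts on its own line.
            if text and not text.endswith("\n"):
                out.write("\n")


# Example call with the file names used in this notebook:
# combine_fasta(
#     ["Rothia denotocariosa DNA copy.fasta",
#      "Rothia mucilaginosa DNA copy.fasta",
#      "Rothia aeria DNA copy.fasta"],
#     "combined.fasta",
# )
```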

##### Tried to create the multiple sequence alignment using AlignIO; however, I got an error because the sequences have different lengths.

In [36]:
# align with combined sequences. 
from Bio import AlignIO

alignments = AlignIO.parse("combined.fasta", "fasta")
for alignment in alignments:
     print(alignment)
     print()

    
print(alignment)

ValueError: Sequences must all be the same length

In [35]:
# align with combined sequences. 
from Bio import AlignIO
alignment = AlignIO.read("combined.fasta", "fasta")
AlignIO.write([alignment], "combined.fasta", "fasta")

ValueError: Sequences must all be the same length
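The error arises because AlignIO expects sequences that are already aligned, meaning every row has the same length, while the combined file holds three unaligned genomes of different lengths. Biopython's `SeqIO.parse` is the usual reader for unaligned records. The sketch below illustrates the same check with the standard library only; `read_fasta` and `looks_aligned` are hypothetical helper names, and the toy records are made up.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping header -> sequence string."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def looks_aligned(records):
    """True when all sequences share one length, as AlignIO requires."""
    return len({len(seq) for seq in records.values()}) <= 1


toy = ">a\nATGC\nAT\n>b\nATG\n"
records = read_fasta(toy)
print({name: len(seq) for name, seq in records.items()})  # {'a': 6, 'b': 3}
print(looks_aligned(records))  # False
```

A file that fails this check has to go through an alignment program first; only the resulting alignment can be read with AlignIO.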

In [34]:
# Check bio.align.applications functions. 
import Bio.Align.Applications
dir(Bio.Align.Applications)

['ClustalOmegaCommandline',
 'ClustalwCommandline',
 'DialignCommandline',
 'MSAProbsCommandline',
 'MafftCommandline',
 'MuscleCommandline',
 'PrankCommandline',
 'ProbconsCommandline',
 'TCoffeeCommandline',
 '_ClustalOmega',
 '_Clustalw',
 '_Dialign',
 '_MSAProbs',
 '_Mafft',
 '_Muscle',
 '_Prank',
 '_Probcons',
 '_TCoffee',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__']

In [39]:
# import clustalwcommandline for multiple sequence alignments. 
from Bio.Align.Applications import ClustalwCommandline
help(ClustalwCommandline)

Help on class ClustalwCommandline in module Bio.Align.Applications._Clustalw:

class ClustalwCommandline(Bio.Application.AbstractCommandline)
 |  ClustalwCommandline(cmd='clustalw', **kwargs)
 |  
 |  Command line wrapper for clustalw (version one or two).
 |  
 |  http://www.clustal.org/
 |  
 |  Notes
 |  -----
 |  Last checked against versions: 1.83 and 2.1
 |  
 |  References
 |  ----------
 |  Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA,
 |  McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD,
 |  Gibson TJ, Higgins DG. (2007). Clustal W and Clustal X version 2.0.
 |  Bioinformatics, 23, 2947-2948.
 |  
 |  Examples
 |  --------
 |  >>> from Bio.Align.Applications import ClustalwCommandline
 |  >>> in_file = "unaligned.fasta"
 |  >>> clustalw_cline = ClustalwCommandline("clustalw2", infile=in_file)
 |  >>> print(clustalw_cline)
 |  clustalw2 -infile=unaligned.fasta
 |  
 |  You would typically run the command line with clustalw_cline() or via
 |  the

In [40]:
# Build the ClustalW command line for the multiple sequence alignment.
# The first argument is the name of the ClustalW executable, not the input.
from Bio.Align.Applications import ClustalwCommandline
cline = ClustalwCommandline("clustalw2", infile="/Users/choejeongin-in/BIMM143_SP22/combined.fasta")
print(cline)

clustalw2 -infile=/Users/choejeongin-in/BIMM143_SP22/combined.fasta


In [41]:
# import MafftCommandline for the multiple sequence alignments
from Bio.Align.Applications import MafftCommandline

In [45]:
# find the location of the file. 
import os

#to get the current working directory
directory = os.getcwd()

print(directory)

/Users/choejeong-in/BIMM143_SP22


In [42]:
# MafftCommandline for the multiple sequence alignments
# This only builds and prints the command line; it does not run MAFFT.
mafft_exe = "/opt/local/mafft"
in_file = "/Users/choejeongin-in/BIMM143_SP22/combined.fasta"
mafft_cline = MafftCommandline(mafft_exe, input=in_file)
print(mafft_cline)


/opt/local/mafft /Users/choejeongin-in/BIMM143_SP22/combined.fasta


In [43]:
# MafftCommandline for the multiple sequence alignments
from Bio.Align.Applications import MafftCommandline
mafft_cline = MafftCommandline(input="combined.fasta")
print(mafft_cline)
stdout, stderr = mafft_cline()
# Save the alignment; the output name "aligned.fasta" is an assumed placeholder.
with open("aligned.fasta", "w") as output:
    output.write(stdout)

In [44]:
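# MafftCommandline for the multiple sequence alignments
mafft_cline = MafftCommandline(input="/Users/choejeongin-in/BIMM143_SP22/combined.fasta") 
stdout, stderr = mafft_cline()

# Write the alignment in text mode ("w") to a separate file so the input is
# not overwritten; the name "aligned.fasta" is an assumed placeholder.
with open("aligned.fasta", "w") as handle: 
    handle.write(stdout)
    
stdout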

ApplicationError: Non-zero return code 127 from 'mafft /Users/choejeongin-in/BIMM143_SP22/combined.fasta', message '/bin/sh: mafft: command not found'
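Return code 127 is the shell's way of saying that the `mafft` program itself could not be found, so the failure is about the installation rather than the input file. One way to make this explicit is to look the executable up before running it. The sketch below uses `shutil.which` and `subprocess.run` from the standard library; the helper name `run_mafft` is an assumption, and `--auto` is the MAFFT option that lets the program choose an alignment strategy.

```python
import shutil
import subprocess


def run_mafft(in_file, exe="mafft"):
    """Run MAFFT on `in_file` and return the aligned FASTA text.

    Raises FileNotFoundError with a clear message when the executable
    is not installed or not on PATH.
    """
    exe_path = shutil.which(exe)
    if exe_path is None:
        raise FileNotFoundError(
            f"{exe!r} was not found on PATH; install MAFFT or pass its full path"
        )
    result = subprocess.run(
        [exe_path, "--auto", in_file],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if MAFFT exits with an error
    )
    return result.stdout


# Example call (requires MAFFT to be installed):
# aligned_text = run_mafft("combined.fasta")
```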

In [45]:
# MafftCommandline for the multiple sequence alignments
mafft_cline = MafftCommandline(input="/Users/choejeong-in/BIMM143_SP22/combined.fasta")
stdout, stderr = mafft_cline()

# Write the alignment in text mode ("w") to a separate file so the input is
# not overwritten; the name "aligned.fasta" is an assumed placeholder.
with open("aligned.fasta", "w") as handle:
    handle.write(stdout)
    
stdout

## Method #2: Structural analysis 

Method #2 description: For method 2, I chose structural analysis in order to compare the fimbriae of the three bacteria. Fimbriae are long, thin protein appendages on the bacterial cell surface that help bacteria attach to surfaces. Since Rothia dentocariosa appears together with SARS-CoV-2 more frequently, I’ll compare the shape of the fimbriae across the three bacteria and look for structural differences.


In [33]:
# install py3Dmol for visualizing fimbriae protein structures. 

In [73]:
pip install py3Dmol

Note: you may need to restart the kernel to use updated packages.


In [5]:
# Rothia mucilaginosa and Rothia aeria have type1 fimbriae
import py3Dmol
view = py3Dmol.view(query='pdb:5GQP')
view.setStyle({'cartoon':{'color':'spectrum'}})
view

<py3Dmol.view at 0x7fadd66b74f0>

In [29]:
# Rothia dentocariosa has type2 fimbriae 
import py3Dmol
view = py3Dmol.view(query='pdb:3S5C')
view.setStyle({'cartoon':{'color':'spectrum'}})
view

<py3Dmol.view at 0x7fadd95bcca0>

Structural analysis source: https://www.blopig.com/blog/2016/10/viewing-3d-molecules-interactively-in-jupyter-ipython-notebooks/

## Analysis Method: Sequence Logos

Analysis method description: For the analysis method, I chose 3D protein measurements and sequence logos. 3D protein measurements let us visualize any special regions on the bacterial fimbriae. Sequence logos help us identify the conserved and non-conserved regions of a sequence. Together these guide us toward distinctive regions that could be a key element linking SARS-CoV-2 and Rothia dentocariosa.

In [32]:
# install logomaker 

In [26]:
pip install logomaker

Note: you may need to restart the kernel to use updated packages.


In [31]:
# sequence logo using logomaker
import logomaker
logomaker.demo('combined.fasta')

LogomakerError: name = 'combined.fasta' is not valid. Must be one of dict_keys(['fig1d', 'fig1e', 'colorschemes', 'fig1f', 'fig1b', 'fig1c', 'logo'])

source: https://logomaker.readthedocs.io/en/latest/
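The error above occurs because `logomaker.demo` only accepts the names of logomaker's built-in demos, not a data file. A logo is drawn from a per-position matrix computed from aligned sequences; logomaker documents an `alignment_to_matrix` function and a `Logo` class for that purpose. The sketch below shows, with the standard library only, the quantity a logo displays: the information content of each alignment column in bits. The function name and toy sequences are assumptions, and no small-sample correction is applied.

```python
import math
from collections import Counter


def column_information(aligned_seqs, alphabet="ACGT"):
    """Information content (bits) of each column of an alignment.

    Fully conserved DNA columns score 2 bits; columns where all four
    bases are equally frequent score 0 bits. Gaps are ignored.
    """
    max_bits = math.log2(len(alphabet))
    info = []
    for column in zip(*aligned_seqs):
        counts = Counter(base for base in column if base in alphabet)
        total = sum(counts.values())
        if total == 0:
            info.append(0.0)  # column holds only gaps or unknown symbols
            continue
        entropy = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
        info.append(max_bits - entropy)
    return info


# Toy aligned sequences (hypothetical data).
toy = ["ACGT", "ACGA", "ACTC", "AGAG"]
print([round(bits, 2) for bits in column_information(toy)])
# [2.0, 1.19, 0.5, 0.0]
```

Low-information columns are the non-conserved positions, which is where differences between the three Rothia sequences would show up.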

# Analyzing the results 

So far, only method 2 ran correctly. The type 1 and type 2 fimbriae protein structures have very different shapes, and the type 1 structure appears to be about half the size of the type 2 structure. Interestingly, Rothia dentocariosa has type 2 fimbriae, the larger of the two structures. I hypothesize that this larger fimbrial structure may allow Rothia dentocariosa to be present near, and possibly bind to, SARS-CoV-2 more often than the other Rothia bacteria.
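The size comparison above is based on viewing the structures. A simple way to put a number on it is to count residues per chain in each PDB file. The sketch below is a minimal illustration that assumes the PDB text has already been downloaded (for example for entries 5GQP and 3S5C); the helper names and the toy records are made up. It counts one alpha-carbon (CA) atom per residue using the fixed column layout of PDB `ATOM` records.

```python
from collections import Counter


def residues_per_chain(pdb_text):
    """Count residues per chain by counting CA atoms in ATOM records.

    PDB columns (1-based): 13-16 atom name, 22 chain identifier.
    Alternate locations, if present, would be counted more than once.
    """
    counts = Counter()
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            counts[line[21]] += 1
    return dict(counts)


def _atom_line(serial, atom, residue, chain, res_number):
    """Build a minimal ATOM record with correct column positions (toy data)."""
    return f"ATOM  {serial:>5} {atom:<4} {residue:>3} {chain}{res_number:>4}"


toy_pdb = "\n".join([
    _atom_line(1, "N", "MET", "A", 1),
    _atom_line(2, "CA", "MET", "A", 1),
    _atom_line(3, "CA", "GLY", "A", 2),
    _atom_line(4, "CA", "SER", "B", 1),
])
print(residues_per_chain(toy_pdb))  # {'A': 2, 'B': 1}
```

Comparing these counts for the two structures would show whether one is really about twice the size of the other, keeping in mind that a PDB entry may cover only part of a protein.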

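I used Biopython functions for the multiple sequence alignment. AlignIO reads and writes DNA, RNA, or protein alignments, while the alignment itself is computed by an external program such as ClustalW or MAFFT, called through Biopython's command-line wrappers.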