# **vLife Virtusa**

# **Gene Similarity of Pandemics using BioPython**

I am going to use Biopython to analyze the established sequence of the coronavirus. For documentation, see: BioPython

Some facts regarding the virus itself: COVID-19

It is classified as a (+)ssRNA virus: its genome is positive-sense, single-stranded RNA that can be translated directly into protein.
The virus itself is called SARS-CoV-2; COVID-19 is the name of the respiratory disease it causes.

**Usecase Description**

Gene Similarity compares the SARS-CoV-2 genome sequence with the genomes of other pandemic-causing and related pathogens (SARS, MERS, Civet-SARS, BAT-SARS, EBOLAV, camelus, plasmodium-malariae, hiv2, and hedgehog) and reports a similarity score for each pair.

**SARS-CoV-2 sequence input**

The SARS-CoV-2 sequence is supplied as the input and is compared with the other genomes.

**Description**

The first 100 nucleotides of the SARS-CoV-2 genome are aligned pairwise with the first 100 nucleotides of each other genome using Biopython's `pairwise2` module, and a score is determined from the number of matches.
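To make this first score concrete: the notebook calls `pairwise2.align.globalxx`, which awards 1 for every matching position and applies no mismatch or gap penalty, so the best score equals the length of the longest common subsequence of the two strings. The sketch below computes that value in plain Python; the function name `globalxx_score` and the toy sequences are illustrative assumptions and are not part of the notebook.

```python
def globalxx_score(seq_a, seq_b):
    """Best global alignment score with match = 1 and no mismatch or gap penalty.

    Under this scoring the optimum equals the longest-common-subsequence length.
    """
    # previous[j] is the best score for the already processed prefix of seq_a
    # against seq_b[:j]
    previous = [0] * (len(seq_b) + 1)
    for char_a in seq_a:
        current = [0]
        for j, char_b in enumerate(seq_b, start=1):
            if char_a == char_b:
                # extend the best alignment of both shorter prefixes by one match
                current.append(previous[j - 1] + 1)
            else:
                # skip a character in one sequence or the other, at no cost
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


# Toy sequences (hypothetical, for illustration only)
print(globalxx_score("ACGTTA", "ACTTGA"))
```

Because nothing is penalized under this scoring, even unrelated sequences can reach fairly high values, so scores obtained on 100 nucleotides are best read as a rough indication only.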
Then the complete SARS-CoV-2 genome is aligned with each other genome using `Align.PairwiseAligner` in global mode, and a score is determined with the following parameters:
1. aligner.match_score = 1
2. aligner.mismatch_score = -1
3. aligner.open_gap_score = -0.5
4. aligner.extend_gap_score = -0.1

Finally, the score is divided by the total length of the SARS-CoV-2 genome sequence (the input) and multiplied by 100 to give the percentage similarity.
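As a worked illustration of how the four scoring parameters combine, the sketch below scores one fixed, already-aligned pair of toy sequences and converts the result into a percentage of the query length. It is a simplified stand-in: `Align.PairwiseAligner` searches for the highest-scoring alignment, whereas the hypothetical helpers `score_alignment` and `percent_similarity` only evaluate an alignment that is handed to them.

```python
def score_alignment(aligned_a, aligned_b, match=1.0, mismatch=-1.0,
                    open_gap=-0.5, extend_gap=-0.1):
    """Score two equal-length aligned strings that use '-' for gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0.0
    gap_side = None  # sequence holding the current gap run: "a", "b" or None
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # first position of a gap run pays open_gap, later ones extend_gap
            score += extend_gap if gap_side == side else open_gap
            gap_side = side
        else:
            score += match if x == y else mismatch
            gap_side = None
    return score


def percent_similarity(score, query_length):
    """Alignment score divided by the query length, as a percentage."""
    return 100.0 * score / query_length


# Toy alignment (hypothetical): three matches and one gap of length three
aligned_query = "ACGTTA"
aligned_other = "A---TA"
score = score_alignment(aligned_query, aligned_other)
query_length = len(aligned_query.replace("-", ""))
print(f"score = {score:.1f}, similarity = {percent_similarity(score, query_length):.2f}%")
```

Here the three matches contribute +3, the gap opening -0.5 and the two gap extensions -0.2, for a score of 2.3 on a query of length 6. Since mismatches and gaps are penalized, the score, and therefore the percentage, can be negative for dissimilar sequences.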


**Input**
The input for this use case is a DNA FASTA file of a COVID-19 isolate.

**Input to the model**

DNA FASTA files for SARS, MERS, Civet-SARS, BAT-SARS, EBOLAV, camelus, plasmodium-malariae, hiv2, and hedgehog.
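For reference, a FASTA file consists of a header line starting with `>` followed by one or more lines of sequence. The sketch below parses that layout in plain Python; `parse_fasta` and the two toy records are hypothetical, and Biopython's `SeqIO` module, which the notebook uses to read the SARS-CoV-2 input, does the same job on real files.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # a new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Two made-up records standing in for real genome files
example = io.StringIO(
    ">toy_virus_1 complete genome\nACGTAC\nGTTA\n"
    ">toy_virus_2 complete genome\nTTGACA\n"
)
genomes = dict(parse_fasta(example))
print({name: len(seq) for name, seq in genomes.items()})
```

Keeping the header separate from the sequence means the parsing does not depend on particular words, such as "complete genome", appearing in the header line.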

**Dataset source**
Input dataset: https://bigd.big.ac.cn/gwh/browse/virus/coronaviridae

FASTA files for SARS and MERS: https://www.kaggle.com/radenkovic/coronavirus-accession-sars-mers-cov2

FASTA files for Civet-SARS, BAT-SARS and EBOLAV: https://www.kaggle.com/akiator9/ebolav-vs-sarscov-vs-mers

FASTA files for camelus, plasmodium-malariae, hiv2 and hedgehog: https://www.kaggle.com/amansingh29/cameluscovfasta


In [5]:
import os
import re

import numpy as np
import pandas as pd
import plotly.offline as py
import plotly.graph_objects as go

os.environ['QT_QPA_PLATFORM'] = 'offscreen'  # run Qt headless

import matplotlib.pyplot as plt
%matplotlib inline
from colorama import Back, Style, Fore

# Install Biopython on the fly if it is missing. SeqIO is imported here too,
# so that a missing package triggers the install instead of failing earlier.
try:
    from Bio import Align, SeqIO, pairwise2
except ImportError:
    !pip install biopython
    from Bio import Align, SeqIO, pairwise2

We use Biopython packages for the analysis.

# **DNA INPUT OF COV2 FASTA SEQUENCE**

In [7]:
class Genesimilarity():
    @staticmethod
    def genome(dna_seq):
        # Reference genomes: the text after the marker words in the FASTA header
        # is taken as the sequence, with newlines removed.
        with open("../input/coronavirus-accession-sars-mers-cov2/sars.fasta", "r") as file:
            sars_genome = file.read().split("genome")[1].replace("\n", "")
        with open("../input/coronavirus-accession-sars-mers-cov2/mers.fasta", "r") as file:
            mers_genome = file.read().split("genome")[1].replace("\n", "")
        # Input genome: a record read with SeqIO, converted to a plain string
        cov_genome = str(dna_seq.seq)
        with open("../input/ebolav-vs-sarscov-vs-mers/Civet-SARS.fasta", "r") as file:
            civet_sl_cov_genome = file.read().split("complete genome")[-1].replace("\n", "")
        with open("../input/ebolav-vs-sarscov-vs-mers/BAT-SARS.fasta", "r") as file:
            bat_sl_cov_genome = file.read().split("complete genome")[-1].replace("\n", "")
        with open("../input/ebolav-vs-sarscov-vs-mers/EBOLAV.fasta", "r") as file:
            ebola5_genome = file.read().split("complete genome")[-1].replace("\n", "")
        with open("../input/cameluscovfasta/camelus.fasta", "r") as file:
            camel_cov_genome = file.read().split("complete genome")[-1].replace("\n", "")
        with open("../input/cameluscovfasta/plasmodium-malariae.fasta", "r") as file:
            malaria_genome = file.read().split("complete sequence")[-1].replace("\n", "")
        with open("../input/cameluscovfasta/hiv2.fasta", "r") as file:
            hiv2_genome = file.read().split("complete genome")[-1].replace("\n", "")
        with open("../input/cameluscovfasta/hedgehog.fasta", "r") as file:
            hedgehog_cov_genome = file.read().split("complete genome")[-1].replace("\n", "")

        # The two helpers below translate a genome and mark its open reading frames.
        # The similarity scores further down work on the nucleotide sequences
        # and do not call them.
        def gen_protein_seq(genome_str):
            # Codon table (DNA alphabet)
            protein = {"TTT": "F", "CTT": "L", "ATT": "I", "GTT": "V",
                   "TTC": "F", "CTC": "L", "ATC": "I", "GTC": "V",
                   "TTA": "L", "CTA": "L", "ATA": "I", "GTA": "V",
                   "TTG": "L", "CTG": "L", "ATG": "M", "GTG": "V",
                   "TCT": "S", "CCT": "P", "ACT": "T", "GCT": "A",
                   "TCC": "S", "CCC": "P", "ACC": "T", "GCC": "A",
                   "TCA": "S", "CCA": "P", "ACA": "T", "GCA": "A",
                   "TCG": "S", "CCG": "P", "ACG": "T", "GCG": "A",
                   "TAT": "Y", "CAT": "H", "AAT": "N", "GAT": "D",
                   "TAC": "Y", "CAC": "H", "AAC": "N", "GAC": "D",
                   "TAA": "STOP", "CAA": "Q", "AAA": "K", "GAA": "E",
                   "TAG": "STOP", "CAG": "Q", "AAG": "K", "GAG": "E",
                   "TGT": "C", "CGT": "R", "AGT": "S", "GGT": "G",
                   "TGC": "C", "CGC": "R", "AGC": "S", "GGC": "G",
                   "TGA": "STOP", "CGA": "R", "AGA": "R", "GGA": "G",
                   "TGG": "W", "CGG": "R", "AGG": "R", "GGG": "G"
                   }
            protein_seq = ""

            # Translate every complete codon; a trailing partial codon is ignored
            for i in range(0, len(genome_str) - len(genome_str) % 3, 3):
                protein_seq += protein[genome_str[i:i + 3]]
            return protein_seq

        genomes = {"SARS": sars_genome, "MERS": mers_genome, "COVID-19": cov_genome, "Civet_SL_CoV": civet_sl_cov_genome,
               "Bat_SL_CoV": bat_sl_cov_genome, "Ebola": ebola5_genome, "Camel_CoV": camel_cov_genome,
               "Malaria": malaria_genome, "HIV": hiv2_genome, "Hedgehog_CoV": hedgehog_cov_genome}

        def get_orfs(protein_seq):
            # Collect every stretch that starts at an "M" and runs to the next STOP
            orf_strands = []
            for seq in protein_seq.split("STOP"):
                for nu in range(len(seq) - 1, -1, -1):
                    if seq[nu] == "M":
                        orf_strands.append(seq[nu:] + "STOP")
            # Lower-case the ORFs inside the original protein sequence
            patterns = "|".join(orf_strands)
            res_seq = re.sub(patterns, lambda x: x.group(0).lower(), protein_seq)
            return res_seq

        # Step 1: align the first 100 nucleotides with pairwise2
        # (globalxx: match = 1, no mismatch or gap penalties)
        new1 = "Aligning the first 100 nucleotides of the COVID-19 genome with every other genome:\n"
        new2 = []
        for genome_name, genome in genomes.items():
            if genome_name != 'COVID-19':
                new2.append("COVID-19 and {} genome".format(genome_name))
                alignments = pairwise2.align.globalxx(cov_genome[:100], genome[:100])
                new2.append(pairwise2.format_alignment(*alignments[0], full_sequences=True))
        new2 = '\n'.join(new2)

        # Step 2: score the complete genomes with a global PairwiseAligner
        aligner = Align.PairwiseAligner()
        aligner.mode = 'global'
        aligner.match_score = 1
        aligner.mismatch_score = -1
        aligner.open_gap_score = -0.5
        aligner.extend_gap_score = -0.1

        a = "Similarity scores between\n"
        b = []
        for genome_name, genome in genomes.items():
            if genome_name != 'COVID-19':
                score = aligner.score(cov_genome, genome)
                # Percentage = score relative to the length of the input genome
                b.append("COVID-19 & {} genome sequences:\t {} ({:.2f}%)".format(
                    genome_name, round(score), 100 * score / len(cov_genome)))
        b = '\n'.join(b)
        # Print the report; the method returns nothing
        print(new1)
        print(new2)
        print(a)
        print(b)


dna_seq = SeqIO.read("/kaggle/input/testdatagenome/NC_046965.genome.fasta","fasta")
Genesimilarity.genome(dna_seq)

Aligning the first 100 nucleotides of the COVID-19 genome with every other genome:

COVID-19 and SARS genome
ACT-TTGAGCA-TTGATATATA--TAT---AT--ATATAT--C-ATACTCA-CCTT-G--C-CTTGTGCAAGAACCT-T-TTCTTAACA-AAACGGA-CTTATAG---TGCGTACG-GT---T-T-GCT-G--
| | || ||   ||  | | ||  ||    |   | | |   | | || || || | |  | |||||   || |  | | ||| |  |  |||| || ||| ||    | | |  | ||   | | ||| |  
A-TATT-AG--GTT--T-T-TACCTA-CCCA-GGA-A-A-AGCCA-AC-CAACC-TCGATCTCTTGT---AG-A--TCTGTTC-T--C-TAAAC-GAACTT-TA-AAAT-C-T--GTGTAGCTGTCGCTCGGC
  Score=67

COVID-19 and MERS genome
-ACTTTGAGCAT-TGATATA--TATA---TATA-T-A-TATCATACTCAC-C-TTGCCT-TGTGCAAGAACCTTT--TCTTAACAA-AACGGA-CTTATAG-T----GCGTAC--G--GTTT-GC-T---G-
 | ||| |  |  ||| |||  | |    |||  | | | ||   | | | | ||  || | ||| |||| ||||  | ||      ||| || |||| |  |    ||   |  |  |||| || |   | 
GA-TTT-A--A-GTGA-ATAGCT-T-GGCTAT-CTCACT-TC---C-C-CTCGTT--CTCT-TGC-AGAA-CTTTGAT-TT-----TAAC-GAACTTA-A-ATAAAAGC---CCTGTTGTTTAGCGTATCGT
  Score=68

COVID-19 and Civet_SL_CoV genome
ACTTTGAGCA

# **Conclusion**

The SARS-CoV-2 DNA sequence that is entered is compared pairwise with the genomes of SARS, MERS, Civet-SARS, BAT-SARS, EBOLAV, camelus, plasmodium-malariae, hiv2, and hedgehog, and a similarity score between the COVID-19 genome and each of the other genomes is determined.