Name: Anushka Srivastava
<br>Roll no: 2022086

#### Requirements
- Biopython library

In [None]:
# Run this first to install the Biopython library
!pip install Biopython

In [4]:
# import statements
from Bio import Entrez, SeqIO
from os import listdir, makedirs, path
import re

### Download the genome sequence of Baker’s Yeast (Saccharomyces cerevisiae) using Python or R.

All downloaded files are stored in a directory named `output_dir`, which is created automatically if it does not already exist.

In [None]:
if not path.exists("output_dir"):
    makedirs("output_dir")

We use the `nucleotide` database to fetch each sequence in FASTA format, then write the sequence to a file in the output directory.

In [None]:
# Download the FASTA file for the given accession number and store it in the output directory
def download_fasta_sequence(email, accession_number):
    Entrez.email = email  # contact email sent to NCBI with the request
    handle = Entrez.efetch(db="nucleotide", id=accession_number, rettype="fasta", retmode="text")
    sequence = handle.read()
    handle.close()  # release the connection once the sequence has been read
    file_name = accession_number + ".fasta"
    file_path = path.join("output_dir", file_name)
    with open(file_path, "w") as f:
        f.write(sequence)

We store the accession numbers of the yeast genome <u><i>(chromosomes 1-16 and the mitochondrial genome)</i></u> in a list and iterate through it to download each file.

In [20]:
email = "anushka22086@iiitd.ac.in"
accession_no = ["NC_001133.9","NC_001134.8","NC_001135.5","NC_001136.10",
                "NC_001137.3","NC_001138.5","NC_001139.9","NC_001140.6",
                "NC_001141.2","NC_001142.9","NC_001143.9","NC_001144.5",
                "NC_001145.3","NC_001146.8","NC_001147.6","NC_001148.4","NC_001224.1"]

In [None]:
for ac in accession_no:
    download_fasta_sequence(email,ac)
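
Re-running the loop above downloads every file again. A minimal sketch of one way to skip accessions that are already on disk is given below. The helper names `fasta_path` and `pending_downloads` are hypothetical and not part of the original notebook; the sketch assumes the same `<accession>.fasta` naming used by `download_fasta_sequence`.

```python
from os import path


def fasta_path(accession_number, output_dir="output_dir"):
    """Return the path where the FASTA file for an accession is expected (hypothetical helper)."""
    return path.join(output_dir, accession_number + ".fasta")


def pending_downloads(accession_numbers, output_dir="output_dir"):
    """Return the accessions that have no non-empty FASTA file in output_dir yet."""
    pending = []
    for accession in accession_numbers:
        target = fasta_path(accession, output_dir)
        # An empty file is treated as a failed download and fetched again
        if not (path.isfile(target) and path.getsize(target) > 0):
            pending.append(accession)
    return pending
```

With these helpers, the loop could iterate over `pending_downloads(accession_no)` instead of `accession_no`, so that only missing files are fetched.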

##### Reference:
- Tutorial 2 slides

### Find the Origin of Replication in the Yeast Genome.

To find the origins of replication (ORIs) in the genome of baker's yeast, we first locate its autonomously replicating sequences (ARS). In yeast, an ARS contains an origin of replication. ARS elements share a similar pattern called the ARS consensus sequence (ACS), which is 11 to 17 bp long.

According to the <a href = "https://www.yeastgenome.org/locus/S000121252">Saccharomyces Genome Database</a>, an ARS closely matches the 11 bp ACS <b>5'-WTTTAYRTTTW-3'</b>, where W can be T or A, Y can be C or T, and R can be A or G.
<br>
This sequence can be further extended to 17 bp ACS: <b>5'-WWWWTTTAYRTTTWGTT-3'</b>
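
As an illustration of the notation, the ambiguity codes W, Y and R can be translated into regular-expression character classes. The sketch below is not part of the original analysis: the helper `consensus_to_regex` and the toy sequence are made up for this example. It also shows that a match to the 17 bp ACS contains a match to the 11 bp ACS three bases further in.

```python
import re

# Ambiguity codes used by the ACS, mapped to regex character classes
IUPAC = {"W": "[AT]", "Y": "[CT]", "R": "[AG]"}


def consensus_to_regex(consensus):
    """Translate a consensus string with W/Y/R codes into a regular expression."""
    return "".join(IUPAC.get(base, base) for base in consensus)


pattern_11 = re.compile(consensus_to_regex("WTTTAYRTTTW"))
pattern_17 = re.compile(consensus_to_regex("WWWWTTTAYRTTTWGTT"))

# Toy sequence (made up for illustration) containing one 17 bp ACS match
toy = "GGC" + "ATATTTTATGTTTAGTT" + "CCG"
hit_17 = pattern_17.search(toy)
hit_11 = pattern_11.search(toy)

# The 11 bp core sits inside the 17 bp match, offset by the three extra W bases
print(hit_17.start(), hit_11.start())
```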

To find the origins of replication in our yeast genome, we use the following logic:<br>
1) We search for the 11 bp ACS given above. Matches to the 17 bp ACS are covered as well, since the 17 bp sequence is simply an extension of the 11 bp sequence.<br>
2) The 11 bp ACS has four ambiguous positions with two choices each, so 16 concrete sequences are possible. We store all of them in a list as the `ARS Motifs`.<br>
3) We iterate through the list of files and use the `SeqIO` module to parse each fasta file and obtain the genome sequence.<br>
4) We then find the indices of each ARS motif in turn using the `finditer` function of the `re` library, which returns an iterator over all non-overlapping occurrences of the given motif.<br>
5) The match positions are stored in a dictionary keyed by accession number, and a running total counts all matches.<br>
6) Since replication origins have been mapped to ARS sequences, the count of ARS motif matches is roughly equal to the number of ORIs. According to <a href = "https://bioresearch.byu.edu/cs418/BA-chap1and2.pdf">Finding Hidden Messages in DNA by Phillip Compeau and Pavel Pevzner</a>, there are over 400 different ORIs in the yeast genome, which is consistent with our output.
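
The 16 motifs from step 2 can also be generated rather than typed by hand. The following is a minimal sketch, assuming a hypothetical helper `expand_consensus` built on `itertools.product`. The toy sequence is made up for illustration, and the search mirrors step 4 with `re.finditer`.

```python
import re
from itertools import product

# Bases allowed at each ambiguity code of the ACS
IUPAC = {"W": "AT", "Y": "CT", "R": "AG"}


def expand_consensus(consensus):
    """Return every concrete sequence described by a consensus with W/Y/R codes."""
    choices = [IUPAC.get(base, base) for base in consensus]
    return ["".join(combo) for combo in product(*choices)]


def find_motif_positions(sequence, motifs):
    """Return the sorted 0-based start positions of all motif matches in sequence."""
    positions = []
    for motif in motifs:
        for match in re.finditer(motif, sequence):
            positions.append(match.start())
    return sorted(positions)


motifs = expand_consensus("WTTTAYRTTTW")
print(len(motifs))  # 2 * 2 * 2 * 2 = 16 combinations

# Toy sequence (made up) with two different motifs embedded in it
toy = "CC" + "ATTTACGTTTA" + "GG" + "TTTTATATTTT" + "C"
print(find_motif_positions(toy, motifs))
```

The generated list contains the same 16 sequences as a hand-written list, although in a different order.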

In [8]:
ars_count = 0
ars_pos = {}

# ARS Motifs
ars = ["TTTTATATTTT","TTTTATATTTA","TTTTATGTTTT","TTTTATGTTTA",
       "TTTTACATTTT","TTTTACATTTA","TTTTACGTTTT","TTTTACGTTTA",
       "ATTTATATTTT","ATTTATATTTA","ATTTATGTTTT","ATTTATGTTTA",
       "ATTTACATTTT","ATTTACATTTA","ATTTACGTTTT","ATTTACGTTTA"]

In [2]:
# Function to find ARS sequences from the motifs in each fasta file
def find_ars_sequences(file, file_no):
    global ars_count
    records = SeqIO.parse(file,"fasta")
    for record in records:
        sequence = str(record.seq)
        for motif in ars:
            matches = re.finditer(motif, sequence)
            for match in matches:
                ars_count += 1
                if file_no in ars_pos:
                    ars_pos[file_no].append(match.start()) # 0-based indexing followed
                else:
                    ars_pos[file_no] = [match.start()]

Positions are reported with 0-based indexing, so the first base of each sequence has index 0.
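
Genome browsers and annotation databases commonly report 1-based coordinates, so a reported position may need to be shifted before comparing it with external annotations. A minimal sketch of the conversion follows; the helper `to_one_based` and the toy sequence are hypothetical and used only for illustration.

```python
import re


def to_one_based(match):
    """Convert a regex match to 1-based, inclusive (start, end) coordinates."""
    # match.start() is 0-based and inclusive; match.end() is 0-based and exclusive,
    # which makes it equal to the 1-based inclusive end
    return match.start() + 1, match.end()


sequence = "GG" + "ATTTATATTTT" + "CC"  # toy sequence, made up for illustration
match = re.search("ATTTATATTTT", sequence)

print(match.start())        # 0-based start of the motif
print(to_one_based(match))  # 1-based, inclusive coordinates of the same motif
```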

In [9]:
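# Retrieve the list of all FASTA files in the output directory
fasta_files = [name for name in listdir("output_dir") if name.endswith(".fasta")]

for file_name in fasta_files:
    file_path = path.join("output_dir", file_name)
    accession = file_name[:-len(".fasta")]  # strip the extension to recover the accession number
    find_ars_sequences(file_path, accession)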

We iterate through our dictionary and print out the indices of ORIs in each file.

In [10]:
print("Total number of Origins of Replication (ORIs) found:", ars_count)
print()
for i in ars_pos:
    if i == "NC_001224.1":
        continue
    print("Number of ORIs in " + str(i) + ": " + str(len(ars_pos[i])))
    print("Position of ORIs in " + str(i) + ":")
    print(sorted(ars_pos[i]))
    print()
print("Number of ORIs in NC_001224.1: " + str(len(ars_pos["NC_001224.1"])))
print("Position of ORIs in NC_001224.1:")
print(sorted(ars_pos["NC_001224.1"]))

Total number of Origins of Replication (ORIs) found: 459

Number of ORIs in NC_001133.9: 7
Position of ORIs in NC_001133.9:
[17149, 159953, 171816, 176236, 176522, 208605, 229450]

Number of ORIs in NC_001134.8: 29
Position of ORIs in NC_001134.8:
[80, 53415, 122598, 189470, 195767, 231686, 238293, 246606, 256898, 326080, 326195, 368745, 381151, 403313, 420235, 424981, 543395, 562508, 568821, 603190, 622760, 632052, 665038, 676293, 755032, 777821, 784662, 792466, 812416]

Number of ORIs in NC_001135.5: 11
Position of ORIs in NC_001135.5:
[11256, 14700, 52343, 74520, 78863, 152629, 201845, 224863, 231261, 233373, 315820]

Number of ORIs in NC_001136.10: 57
Position of ORIs in NC_001136.10:
[42618, 50459, 67634, 77223, 104908, 111128, 117397, 156052, 210566, 232057, 233925, 263124, 340870, 347217, 405175, 420761, 422287, 427871, 439367, 443872, 470295, 477645, 480280, 521602, 521761, 548323, 561437, 609151, 628647, 636557, 655624, 677939, 688187, 709270, 807779, 913867, 963622, 1057898, 

##### References:
- <a href = "https://www.yeastgenome.org/locus/S000121252">Saccharomyces Genome Database</a>
- Deshpande, A. M., and C. S. Newlon. "The ARS Consensus Sequence Is Required for Chromosomal Origin Function in Saccharomyces Cerevisiae." Molecular and Cellular Biology 12, no. 10 (1992): 4305-4313. Accessed February 13, 2024. https://doi.org/10.1128/mcb.12.10.4305.
- <a href = "https://bioresearch.byu.edu/cs418/BA-chap1and2.pdf">Finding Hidden Messages in DNA by Phillip Compeau and Pavel Pevzner</a>
- Nieduszynski, Conrad A., Yvonne Knox, and Anne D. Donaldson. "Genome-wide Identification of Replication Origins in Yeast by Comparative Genomics." Genes & Development 20, no. 14 (2006): 1874-1879. Accessed February 13, 2024. https://doi.org/10.1101/gad.385306.