# Homework 2: Biogeography

#### **Please read the following instructions carefully before you continue.**

This template notebook is for Homework 2, due Thursday, April 18th, 2024.

To use this template, click `File` > `Save a copy in Drive`. You now have your own editable copy to which you can add your code. However, before you make changes, note that we've scaffolded this notebook to help get you started:

- We've already written `import` statements for all the modules you should need. This week, you could feasibly do everything without importing any external modules; do whatever is most helpful for you.
- We provide an introduction that lays out one potential approach for solving the problem, explaining what each logical "chunk" of your code ought to accomplish. You can use this as a starting point for writing your own code.  

Of course, if you'd rather do your own thing, you are not required to follow the path we've laid out, or use the modules we've recommended. However, a few things _are_ required (refer to the [Intro to Colab](https://colab.research.google.com/drive/1fq_HaiuYb1L18uGcoA3eGs6taiUafR-6?usp=sharing) notebook):

- _Literate style._ Dumping everything into a single, monstrous code cell is illegible and unacceptable. Remember to divide your code into reasonable, logical chunks, and to follow up each code cell with a text cell that explains and interprets the results.
- _Comments._ That said, writing "literate" code is not an excuse to avoid writing comments :)
- _Problem labels/numbers._ Please use text cells to clearly label where your solution to one problem ends, and the next begins.

Remember that a human is going to read and grade your notebook, so it is in your best interest to help them understand your work clearly. Your finished solution to a given problem should flow coherently from one code cell to the next. (Our "scaffold" helps you do this!)

If you get stuck, remember that [tutorials](https://bi1.caltech.edu/2024/tutorials) are held in person each week. Also, note that in accordance with course policy, [the use of generative AI tools is forbidden](https://bi1.caltech.edu/2024/policies) unless otherwise specified.

---

In [None]:
import numpy as np  # numeric computing

---

## Introduction

This week's code problem comes from Question 2a, which asks you to compare DNA sequence similarity between species of African grassland frogs.

- First, you need to read in the text files we've provided, and store the sequences in a variable, list, or other data structure. This might start like
```python
with open(path_to_file) as file:  # read in one of the fasta files
    for line in file:  # parse each line
        # find sequences and extract them
```

- Then, you should define a function that accepts two DNA sequences and scores their similarity. This might start like
```python
def simDNA(seq1, seq2):
    """This is a function that computes a similarity score between two aligned DNA sequences."""
    # your code here
```
- After checking that it works as expected, you can then incorporate that function into a `for` loop (or perhaps several) that applies it to every group of sequences the problem statement asks you to compare.

## Question 2a

In [None]:
def simDNA(seq1, seq2):
  """
  Calculate the similarity between two aligned DNA sequences based on the
  calculation scheme outlined in the homework problem.
  """
  if len(seq1) != len(seq2):
    print("Sequences must have the same length")
    return None
  numSame = 0
  totNum = 0
  for i in range(len(seq1)):
    # only compare positions where neither sequence has a gap ('-')
    if seq1[i] != '-' and seq2[i] != '-':
      totNum += 1
      if seq1[i] == seq2[i]:
        numSame += 1

  # avoid dividing by zero when no position can be compared
  if totNum == 0:
    print("Sequences share no gap-free positions")
    return None
  return numSame / totNum
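
Before looping over the real data, it can help to check the scoring on tiny hand-made alignments where the answer is known. The sketch below is a compact, standalone restatement of the same idea, assuming the scheme is "fraction of identical bases over positions where neither sequence has a gap". The name `similarity` and the toy sequences are made up for illustration and are not part of the homework files.

```python
def similarity(seq1, seq2):
    """Fraction of identical bases over positions where neither sequence has a gap."""
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must have the same length")
    # keep only aligned positions where both sequences have a base
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != '-' and b != '-']
    if not pairs:
        raise ValueError("No gap-free positions to compare")
    return sum(a == b for a, b in pairs) / len(pairs)

# identical sequences
print(similarity("ACGT", "ACGT"))  # 1.0
# one mismatch out of four compared positions
print(similarity("ACGT", "ACGA"))  # 0.75
# the gapped position is skipped, leaving 3 matches out of 3
print(similarity("AC-T", "ACGT"))  # 1.0
```

Unlike `simDNA`, this version raises an error on bad input rather than printing a message, which is one possible design choice and not a requirement.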

In [None]:
def readFile(filename):
  """
  Read a data file and store the names of the frogs and their corresponding
  DNA sequences in a 2D list. Each element of the list has two elements:
  the frog name (the header line) and its DNA sequence.
  """
  listFrogs = []

  with open(filename, 'r') as file:
    for line in file:
      line = line.strip()
      if not line:
        continue  # skip blank lines
      if line[0] == '>':
        # header line: start a new [name, sequence] entry
        listFrogs.append([line, ""])
      elif listFrogs:
        # sequence line: append to the most recent entry
        listFrogs[-1][1] += line

  return listFrogs
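
To see what the parsed structure looks like without touching the data files, the same header-then-sequence logic can be run on a small in-memory string. This is only a sketch: `parse_fasta` is a hypothetical helper, and the two records are made-up examples, not real frog data.

```python
import io

def parse_fasta(handle):
    """Return a list of [header, sequence] pairs from FASTA-formatted text."""
    records = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            records.append([line, ""])  # start a new record
        elif records:
            records[-1][1] += line  # a sequence may span several lines
    return records

# made-up example data, not from the homework files
example = io.StringIO(">frog A, somewhere\nACGT\nAC-T\n\n>frog B, elsewhere\nACGA\n")
records = parse_fasta(example)
print(records)
# [['>frog A, somewhere', 'ACGTAC-T'], ['>frog B, elsewhere', 'ACGA']]
```

The header keeps its leading `>` here, matching the names printed in the output below; slicing with `line[1:]` would drop it.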

In [None]:
"""
Read the names and DNA sequences of the Africa frogs and the Sao Tome frogs,
then rank the African frogs by similarity to each Sao Tome frog.
"""
africaFrogs = readFile("frogs_africa.txt")
STFrogs = readFile("frogs_st.txt")

k = 1
for j in STFrogs:
  # store (score, name) pairs so that tied scores are not overwritten
  result = []
  for i in africaFrogs:
    val = simDNA(i[1], j[1])
    result.append((val, i[0]))

  # print the ranking for this Sao Tome sample only
  print("For Sao Tome frog sample ", k, ", the most to least matching African frogs are: \n")
  k += 1
  for x, y in sorted(result, reverse=True):
    print(x, ": ", y, "\n")


For Sao Tome frog sample  1 , the most to least matching African frogs are: 

0.950592885375494 :  >Ptychadena nilotica, Tanzania, Kibebe farm 

0.9483747609942639 :  >Ptychadena nilotica, Uganda, Lake Victoria 

0.9464627151051626 :  >Ptychadena nilotica, Kenya, Mt Kenya 

0.945010183299389 :  >Ptychadena nilotica, Kenya, Taita Hills 

0.9445506692160612 :  >Ptychadena nilotica, Kenya, Makuru 

0.9429175475687104 :  >Ptychadena nilotica, Kenya, Taita Hills 

0.9329501915708812 :  >Ptychadena aff. mascareniensis, Uganda, Kampala 

0.921455938697318 :  >Ptychadena aff. mascareniensis, Guinea 

0.9184652278177458 :  >Ptychadena aff. mascareniensis, Central African Republic, Dzanga-Sangha Reserve 

0.9176245210727969 :  >Ptychadena mascareniensis, Madagascar, Nahampoana 

0.9022988505747126 :  >Ptychadena taenioscelis, Kenya, Kakamega 

0.875 :  >Ptychadena sp., Tanzania, Mikumi 

0.869757174392936 :  >Ptychadena pumilio, Guinea, Mont Bero 

0.862 :  >Ptychadena aff. uzungwensis, Tanzania


For every Sao Tome frog sample, the three closest-matching African frogs are the _Ptychadena nilotica_ samples from Tanzania (Kibebe farm), Uganda (Lake Victoria), and Kenya (Mt Kenya).
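
A claim like this can also be checked in code instead of by eye. The sketch below assumes each sample's results are stored as a list of `(score, name)` tuples. The helper `top_matches`, the sample labels, the site names, and the scores are all hypothetical values chosen for illustration.

```python
def top_matches(scores, n=3):
    """Return the names of the n highest-scoring entries.

    scores: list of (score, name) tuples for one Sao Tome sample.
    """
    ranked = sorted(scores, reverse=True)  # highest score first
    return [name for score, name in ranked[:n]]

# hypothetical scores for two samples (illustrative values only)
sample_scores = {
    "sample 1": [(0.95, "site A"), (0.94, "site B"), (0.93, "site C"), (0.90, "site D")],
    "sample 2": [(0.96, "site B"), (0.97, "site A"), (0.92, "site C"), (0.91, "site D")],
}

top_sets = {name: set(top_matches(scores)) for name, scores in sample_scores.items()}
# True when every sample shares the same top-three set
all_same = len(set(map(frozenset, top_sets.values()))) == 1
print(all_same)  # True
```

Keeping `(score, name)` tuples in a list means two frogs with identical scores both survive; a dictionary keyed by score would keep only one of them.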

---

### Submission instructions

When you're finished, recall the steps for submitting Colab notebooks:

1. Run all the cells from top-to-bottom, in order (`Runtime` > `Run all`).
2. Once the entire notebook has completed running and the output of all cells is visible, save it (`File` > `Save`).
3. Download the notebook in `.ipynb` format (`File` > `Download` > `Download .ipynb`).
4. Rename the file according to the usual convention (`lastname_firstname_hw#.ipynb`), if you haven't already.
5. Upload the file to Gradescope.
