Topic Notes Key points & Learning Objectives

## Week 2
### Control flow:
   `while` can be used with `break` and `continue`
   
   Example: print the index of the nucleotide where the first 'CA' is encountered

In [None]:
sequence = "AGCTGACATGCA"
index = 0

while index < len(sequence) - 1:
    if sequence[index:index+2] == 'CA':
        print(f"The first 'CA' is encountered at index {index}")
        break
    index += 1
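
The example above only uses `break`. A minimal sketch of `continue`, assuming a made-up sequence that contains ambiguous `N` bases which should be skipped while counting G and C (the sequence and the variable `gc_count` are illustrative, not from the course material):

```python
# Hypothetical sequence with ambiguous bases ('N'), for illustration only
sequence = "AGNCTNGACA"
index = 0
gc_count = 0

while index < len(sequence):
    base = sequence[index]
    index += 1  # advance before `continue`, otherwise the loop never ends
    if base == 'N':
        continue  # skip the rest of this iteration and test the condition again
    if base in 'GC':
        gc_count += 1

print(gc_count)
```

Note that the index is incremented before `continue`; placing the increment at the end of the loop body would cause an infinite loop on the first `N`.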
    

### Code efficiency
Use the `time` module to time a code block: call `time.time()` immediately before and after the block and subtract the two values. The result is the elapsed wall-clock time in seconds.

In [14]:
#assess efficiency of simulating 200 random numbers in the range of 0-10
from random import randint
import time
time_start = time.time()
list1 = []
for i in range(200):
    num = randint(0,10)
    list1.append(num)
time_end = time.time()

elapsed = time_end - time_start  # elapsed time in seconds
print(elapsed)

0.0002288818359375
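
A single `time.time()` measurement of such a short block is noisy. A hedged alternative sketch uses `time.perf_counter()` (a higher-resolution clock meant for timing) and `timeit` to average over repeated runs. The helper name `simulate` and the repeat count of 1000 are illustrative choices, not part of the original exercise:

```python
import time
import timeit
from random import randint

def simulate(n=200):
    """Generate n random integers between 0 and 10 (inclusive)."""
    return [randint(0, 10) for _ in range(n)]

# One-off measurement with a high-resolution clock
start = time.perf_counter()
numbers = simulate()
elapsed = time.perf_counter() - start
print(f"single run: {elapsed:.6f} s")

# Average over many runs to smooth out noise
total = timeit.timeit(simulate, number=1000)
print(f"mean over 1000 runs: {total / 1000:.6f} s")
```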


### String search
Can be done using:
1. the default algorithm in Python using `.find()`:

In [32]:
string1 = "ATGCACGTGTCA"
string2 = "GCAC"
string1.find(string2)

2

2. Brute force/manual method:
   Considerations:
   
   `pass` is a null operation: if the motif is longer than the sequence, the function does nothing and returns the default index of -1.
   
   `+1` is needed in the `range` call: for lengths 12 and 4, 12-4 = 8, and `range(8)` stops at 7, which would miss the last possible start position (index 8).

In [36]:
def brute_force(string1,string2):
    len1 = len(string1)
    len2 = len(string2)
    index = -1 #default is "no such substring"
    if len2 > len1:
        pass
    else:
        for i in range(len1-len2 +1):
            if string1[i:i+len2] == string2:
                index=i
                break
    return index

print(brute_force(string1,string2))

2


### K-mer counting
Find the most frequent k-mer in a sequence

In [41]:
def most_frequent_kmer(sequence,k):
    length=len(sequence)
    if k>length:
        return "k is bigger than the length of the sequence"
    else:
        if k==length:
            return sequence
        else:
            dic_count={}  #the dictionary to store the count of kmers
            for i in range(length-k+1):
                kmer=sequence[i:i+k]
                if kmer not in dic_count:
                    dic_count[kmer]=1
                else:
                    dic_count[kmer]+=1
            value_key_list= [(value, key) for key, value in dic_count.items()]
            return max(value_key_list)[1]

most_frequent_kmer("ATGCTGCCGTAATGCCGATCAACGTCGGACTATGC",4)

'ATGC'

Problem: Can you modify the above code to find the most frequent k-mer in a list of sequences rather than just one sequence?

In [44]:
def most_frequent_kmer_in_list(sequence_list,k):
    top_counts = {}
    for sequence in sequence_list:
        length=len(sequence)
        if k>length:
            continue
        else:
            if k==length:
                top_counts[sequence] = 1
            else:
                dic_count={}  #the dictionary to store the count of kmers
                for i in range(length-k+1):
                    kmer=sequence[i:i+k]
                    if kmer not in dic_count:
                        dic_count[kmer]=1
                    else:
                        dic_count[kmer]+=1
                value_key_list = [(value, key) for key, value in dic_count.items()]
                top_counts[max(value_key_list)[1]] = max(value_key_list)[0]
    top_counts_list = [(value, key) for key, value in top_counts.items()]
    return max(top_counts_list)[1]
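
The function above compares only each sequence's own top k-mer, so a k-mer that is moderately common in many sequences can lose to one that is frequent in a single sequence. If the intended answer is the k-mer with the highest total count across the whole list, a minimal sketch with pooled counts could look like this. The function name `most_frequent_kmer_pooled`, the `None` return for an empty result, and the example sequences are assumptions for illustration:

```python
from collections import Counter

def most_frequent_kmer_pooled(sequence_list, k):
    """Return the k-mer with the highest total count across all sequences.

    Returns None if no sequence is at least k long. Ties are broken by
    taking the alphabetically largest k-mer, as max() on (count, kmer)
    tuples does in the single-sequence version.
    """
    counts = Counter()
    for sequence in sequence_list:
        # range() is empty when k > len(sequence), so short sequences are skipped
        for i in range(len(sequence) - k + 1):
            counts[sequence[i:i + k]] += 1
    if not counts:
        return None
    return max((count, kmer) for kmer, count in counts.items())[1]

# 'GC' occurs 3 times in total (twice in "GCGC", once in "GCGA"); 'AA' only twice
print(most_frequent_kmer_pooled(["AAAT", "GCGC", "GCGA"], 2))
```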
    

### Edit distance
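Measures how different two sequences are, as the number of edits needed to turn one into the other.

Implement a function to calculate the edit distance of two sequences of the same length. The version below counts mismatched positions only (substitutions, i.e. the Hamming distance).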

In [9]:
def edit_distance_same_length(seq_a, seq_b):
    #ensure same length
    if len(seq_a) != len(seq_b):
        raise ValueError("Sequences must be of same length")
    #initialize distance
    distance = 0
    #calculate distance
    for i in range(len(seq_a)):
        if seq_a[i] != seq_b[i]:
            distance += 1
    return distance

print(edit_distance_same_length("ATGCC", "ACGCT"))

2


Implement a function to calculate the edit distance of two sequences of different lengths

In [21]:
def edit_distance_diff_lengths(seq_a, seq_b):
    #initialize distance
    d = 0
    #calc difference in length
    d_length = max([len(seq_a), len(seq_b)]) - min([len(seq_a), len(seq_b)])
    d += d_length

    #calculate distance for overlapping sequence
    for i in range(min([len(seq_a), len(seq_b)])):
        if seq_a[i] != seq_b[i]:
            d += 1
    return d

print(edit_distance_diff_lengths("ATGCCGT", "ACGCT"))

4
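
The function above compares positions from left to right and then adds the length difference, so it cannot shift one sequence relative to the other and may overestimate the number of edits. A hedged sketch of the standard dynamic-programming (Levenshtein) edit distance follows, assuming unit cost for each substitution, insertion, and deletion. The function name is illustrative:

```python
def levenshtein_distance(seq_a, seq_b):
    """Minimum number of substitutions, insertions and deletions (unit cost)."""
    # previous[j] = distance between the processed prefix of seq_a and seq_b[:j]
    previous = list(range(len(seq_b) + 1))
    for i, char_a in enumerate(seq_a, start=1):
        current = [i]  # turning seq_a[:i] into "" takes i deletions
        for j, char_b in enumerate(seq_b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

print(levenshtein_distance("ATGCCGT", "ACGCT"))  # 3
```

For the same pair of sequences this gives 3 rather than 4: substitute T with C at the second position, then delete the C and G that precede the final T.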


### Working with Biopython

Easy sequence manipulation. Note: Biopython was not yet installed in my conda environment when the cell below was run, hence the `ModuleNotFoundError`.

In [4]:
from Bio.Seq import Seq  # the module name is case-sensitive: Bio.Seq, not Bio.seq
my_seq = Seq("AGTACACTGGT")

print(my_seq)
print(my_seq.complement())
print(my_seq.reverse_complement())

ModuleNotFoundError: No module named 'Bio'
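
Until Biopython is installed, the same two operations can be sketched in plain Python. This is not Biopython's implementation; it assumes unambiguous DNA bases only (no IUPAC ambiguity codes), and the function names simply mirror the `Seq` methods used above:

```python
# Complement table for unambiguous DNA bases only (an assumption; no IUPAC codes)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def complement(seq):
    """Return the base-by-base complement of a DNA string."""
    return seq.translate(COMPLEMENT)

def reverse_complement(seq):
    """Return the complement read in the opposite direction."""
    return complement(seq)[::-1]

my_seq = "AGTACACTGGT"
print(my_seq)
print(complement(my_seq))
print(reverse_complement(my_seq))
```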

Parsing FASTA files with Biopython is straightforward and efficient. Biopython provides the SeqIO module, which is designed for reading and writing sequence file formats, including FASTA.

In [None]:
from Bio import SeqIO

#read in FASTA file
for record in SeqIO.parse("example.fasta", "fasta"):
    print(record.id)
    print(record.seq)
    print(len(record))

### New code I learned this week

`%s` is a placeholder for string formatting. It is used with the `%` operator, which replaces each placeholder with the corresponding value provided.

Ex: Print each element of seq_b together with its index.

In [8]:
seq_b = "ACTT"
def print_with_index(seq_b):
    for i in range(len(seq_b)):
        print("%s:%s" % (i, seq_b[i]))
print_with_index(seq_b)

0:A
1:C
2:T
3:T
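
For comparison, a sketch of the same output using `enumerate` and an f-string (as in the Week 2 control-flow example) instead of `range(len(...))` and `%s`. The function name `format_with_index` is illustrative:

```python
seq_b = "ACTT"

def format_with_index(seq):
    """Return one 'index:base' string per position in seq."""
    return [f"{i}:{base}" for i, base in enumerate(seq)]

for line in format_with_index(seq_b):
    print(line)
```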


## Week 3: Sequence Alignment

* Global vs Local Sequence Alignment.
* Evaluating alignments:
  1. Edit distance
  2. Scoring function
  3. Algorithm for inferring the best alignment (e.g. dynamic programming)
 
Needleman-Wunsch Algorithm:
> GLOBAL - Aligns the two sequences end-to-end, over their full lengths.
> 1. Initialization: Create a scoring matrix with dimensions based on the lengths of the two sequences. Initialize the first row and column with gap penalties.
> 2. Scoring: Fill in the matrix using a scoring scheme that includes match, mismatch, and gap penalties. Each cell is filled based on the maximum score achievable from the neighboring cells (diagonal, left, and above).
> 3. Traceback: Starting from the bottom-right cell, trace back to the top-left cell to determine the optimal alignment. The path taken during traceback represents the best alignment.
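
The three steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the scoring values (match +1, mismatch -1, linear gap penalty -1) and the function name are assumptions, and when several alignments are optimal only one is returned.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    def sub(i, j):
        # Score for aligning seq_a[i-1] with seq_b[j-1]
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # 1. Initialization: first row and column hold cumulative gap penalties
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # 2. Scoring: best of diagonal (match/mismatch), above (gap) and left (gap)
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # 3. Traceback: from the bottom-right cell back to the top-left cell
    aligned_a, aligned_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])  # gap in seq_b
            aligned_b.append('-')
            i -= 1
        else:
            aligned_a.append('-')  # gap in seq_a
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], ''.join(reversed(aligned_a)), ''.join(reversed(aligned_b))

print(needleman_wunsch("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```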

Smith-Waterman Algorithm:
> LOCAL - Finds the most similar subsequences within the larger sequences.
> 1. Initialization: Create a scoring matrix with dimensions based on the lengths of the two sequences. Initialize the first row and column with zeros.
> 2. Scoring: Fill in the matrix using a scoring scheme that includes match, mismatch, and gap penalties. Each cell is filled based on the maximum score achievable from the neighboring cells (diagonal, left, and above), but any negative scores are replaced with zero.
> 3. Traceback: Starting from the highest-scoring cell, trace back to a cell with a score of zero to determine the optimal local alignment. The path taken during traceback represents the best local alignment.
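
A matching sketch of the local version, under the same caveats: the scoring values (match +2, mismatch -1, linear gap penalty -1) and the function name are illustrative assumptions, and only the first highest-scoring cell is traced back.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    def sub(i, j):
        # Score for aligning seq_a[i-1] with seq_b[j-1]
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # 1. Initialization: first row and column stay at zero
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # 2. Scoring: as in the global case, but negative scores are replaced with zero
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # 3. Traceback: from the highest-scoring cell until a zero is reached
    aligned_a, aligned_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])  # gap in seq_b
            aligned_b.append('-')
            i -= 1
        else:
            aligned_a.append('-')  # gap in seq_a
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return best, ''.join(reversed(aligned_a)), ''.join(reversed(aligned_b))

# Only the shared 'CGT' region aligns; the flanking bases are ignored
print(smith_waterman("AAACGTAAA", "CCCCGTCCC"))  # (6, 'CGT', 'CGT')
```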
