# Week 3 Notes

## 3.1.1 Introduction to DNA Translation

We can think of DNA as a 1D string of nucleotides (of which there are 4: A, C, G, T).

One nucleotide triplet (a codon) corresponds to a single amino acid. Each protein is itself a string of amino acids, of which there are 20.

Central dogma of molecular biology: DNA is transcribed into RNA, which is then translated into protein.

Outline of case study:
1. Download DNA sequence as a text file
2. Translate DNA sequence into protein sequence
3. Download amino acid sequence as a text file as a reference

## 3.1.2 Downloading DNA Data

We're going to download DNA and protein sequences from the NCBI website.

Added dna.txt and protein.txt to the repository.

## 3.1.3 Importing DNA Data into Python

In [3]:
# Defining a str subclass with a method that removes line breaks and spaces and converts the string to uppercase
class MyString(str):
    def process(self):
        return self.replace("\n", "").replace("\r", "").replace(" ", "").upper()


# Importing DNA Data into Python
dna_file_in = open("dna.txt", "r")
dna_file_out = open("dna_processed.txt", "w")

nucleotide_triplet = ""  # Initialized once, outside the loop, so a triplet can span a line break

for line in dna_file_in:
    for nucleotide in MyString(line).process():  # process() returns a cleaned copy; strings are immutable, so line itself is unchanged
        if nucleotide in "ACGT":  # Keeping only valid nucleotides
            nucleotide_triplet += nucleotide  # Adding nucleotide to triplet

        if len(nucleotide_triplet) == 3:
            dna_file_out.write(nucleotide_triplet + "\n")  # Writing triplet to file
            nucleotide_triplet = ""  # Resetting nucleotide triplet

# A trailing incomplete triplet (fewer than 3 nucleotides) is discarded


# Closing files to flush buffered writes and release the file handles
dna_file_in.close()
dna_file_out.close()

## 3.1.4 Translating the DNA Sequence

In [4]:
# Opening DNA file and creating new file to write to for protein sequence
dna_file = open("dna_processed.txt", "r")
protein_file = open("protein_processed.txt", "w")

# Defining a table to translate nucleotides to amino acids
nucleotide_to_amino_acid = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
}


# Table lookup to translate nucleotides to amino acids
for triplet in dna_file:
    protein_file.write(
        nucleotide_to_amino_acid[MyString(triplet).process()]
    )


# Closing files to flush buffered writes and release the file handles
dna_file.close()
protein_file.close()
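
The same table lookup can also be done in memory, without the intermediate files. The following is a minimal sketch, not the method used above: the names `CODON_TABLE` and `translate` are made up for this example, and the table is built on the assumption that it follows the standard genetic code listed in T, C, A, G base order, with `_` marking stop codons as in the dictionary above.

```python
from itertools import product

# Assumption: standard genetic code in T, C, A, G base order,
# with "_" standing in for stop codons (same convention as the table above).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a DNA string into a protein string, one codon at a time."""
    usable_length = len(seq) - len(seq) % 3  # ignore a trailing incomplete codon
    return "".join(
        CODON_TABLE[seq[i:i + 3]] for i in range(0, usable_length, 3)
    )


print(translate("ATGGCCTAA"))  # prints MA_
```

Building the dictionary from two short strings keeps the example compact; the explicit 64-entry dictionary above is easier to read and check by eye.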

## 3.2.1 Introduction to Language Processing

Task: to process books written in English, French, German, and Portuguese and find out how book lengths and number of unique words compare between different authors and languages.

## 3.2.2 Counting Words

A quick and easy way to count words in Python is $\texttt{collections.Counter(line.split(" "))}$, where $\texttt{line}$ is the line of words to be counted.
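
As a small illustration of counting words with a `Counter`, here is a sketch on a made-up sentence (the sentence and variable names are only for this example):

```python
from collections import Counter

line = "the quick brown fox jumps over the lazy dog the end"
word_counts = Counter(line.split(" "))

print(word_counts["the"])          # number of times "the" appears: 3
print(len(word_counts))            # number of unique words: 9
print(word_counts.most_common(1))  # most frequent word and its count
```

Real text would normally be lowercased and stripped of punctuation first, otherwise "The" and "the," are counted as different words.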

## 3.2.3 Reading in a Book

UTF-8 is the most common encoding for text files.

You can open a file with UTF-8 encoding using the following code: $\texttt{with open("file.txt", "r", encoding="utf-8") as file:}$.
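
A minimal sketch of reading a file this way is shown below. The helper name `read_book` and the sample file are assumptions made for this example; the sample file is written to a temporary folder first so the snippet runs on its own.

```python
import os
import tempfile


def read_book(path):
    """Read a UTF-8 text file and return it as one string without line breaks."""
    with open(path, "r", encoding="utf-8") as file:
        text = file.read()
    return text.replace("\n", " ").replace("\r", "")


# Write a tiny sample file so the example is self-contained
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as file:
        file.write("Olá, mundo!\nGrüße aus Köln.")
    text = read_book(sample_path)

print(text)
```

Passing `encoding="utf-8"` explicitly avoids depending on the platform's default encoding, which matters for the accented characters in the French, German, and Portuguese books.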

## 3.2.5 Reading Multiple Files

We're using the $\texttt{os}$ module to read in multiple files.

We can use the $\texttt{os.listdir()}$ function to get a list of items (files or folders) in a directory.
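
The sketch below shows one way to walk a folder of books with `os.listdir()`. The folder layout (one subfolder per language) and the file names are hypothetical and are created in a temporary directory only so the example can run.

```python
import os
import tempfile

# Hypothetical layout: one folder per language, each holding book files
with tempfile.TemporaryDirectory() as book_dir:
    for language, title in [("English", "book_a.txt"), ("German", "book_b.txt")]:
        os.makedirs(os.path.join(book_dir, language))
        with open(os.path.join(book_dir, language, title), "w", encoding="utf-8") as f:
            f.write("placeholder text")

    found = []
    for language in sorted(os.listdir(book_dir)):  # listdir order is arbitrary, so sort
        for title in sorted(os.listdir(os.path.join(book_dir, language))):
            found.append((language, title))

print(found)
```

`os.listdir()` returns bare names, not full paths, which is why each name is joined to its parent folder with `os.path.join()` before being used.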

pandas is introduced here: it is especially useful for manipulating numerical tables and time series data.

We can use the $\texttt{pd.DataFrame(data)}$ function to create a dataframe from a dictionary. We can also create an empty table with known columns using $\texttt{pd.DataFrame(columns=("column1", "column2"))}$.

We can now use $\texttt{table.loc[row_index]}$ to get the row at the specified index or create a new row at the index.

The $\texttt{table.head()}$ method shows the first five rows of the table by default, and $\texttt{table.tail()}$ shows the last five rows.
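
To make the dataframe operations above concrete, here is a small sketch. The column names and all values are invented for illustration and are not real book statistics.

```python
import pandas as pd

# Empty table with known columns (names are made up for this example)
stats = pd.DataFrame(columns=("language", "title", "length", "unique"))

# Assigning to .loc with an index that does not exist yet creates a new row
stats.loc[1] = "English", "book_a", 1000, 300
stats.loc[2] = "German", "book_b", 1200, 350

print(stats.head())          # first rows (up to five)
print(stats.loc[2])          # the row stored at index 2
```

Growing a table one row at a time is convenient for a handful of books; for large numbers of rows it is usually faster to collect the rows in a list and build the dataframe once.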

## 3.2.6 Plotting Book Statistics

This is quite simple: if you have a dataframe $\texttt{table}$, you can use $\texttt{plt.plot(table["column1"], table["column2"])}$ to plot the data.

You can stratify data using boolean conditions, similar to the way we can filter items in a list.
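
Stratifying with a boolean condition can be sketched as follows. The table contents are invented for this example; the resulting subset could then be passed to a plotting call in the same way as the full table.

```python
import pandas as pd

# Toy values, not real book statistics
stats = pd.DataFrame({
    "language": ["English", "English", "German", "French"],
    "length": [1000, 1500, 1200, 900],
    "unique": [300, 400, 350, 280],
})

is_english = stats["language"] == "English"  # boolean Series, one entry per row
english_books = stats[is_english]            # keeps only rows where the mask is True

print(english_books)
```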

## 3.3.1 Introduction to kNN Classification

Statistical learning: the process of finding an estimate or prediction for a target variable based on a set of features.
- If the target variable is continuous and quantitative, we have a regression problem.
- If the target variable is categorical and qualitative, we have a classification problem.

This case study builds a classifier that categorizes an unknown object based on previously known, labeled objects. This is a k-Nearest Neighbors (kNN) classification problem.

In other words, a kNN classifier assigns each data point the most common class label among its k nearest neighbors.

## 3.3.2 Finding the Distance between Two Points
Given two points with coordinates $(x_1, y_1)$ and $(x_2, y_2)$, we can find the distance between them using the Pythagorean theorem: $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$.
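
A minimal sketch of this distance calculation with NumPy (the function name `distance` is an assumption for this example):

```python
import numpy as np


def distance(p1, p2):
    """Euclidean distance between two points given as array-likes."""
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    return np.sqrt(np.sum((p2 - p1) ** 2))


print(distance((1, 1), (4, 5)))  # 3-4-5 right triangle, so the distance is 5.0
```

Because the differences are squared and summed over all coordinates, the same function also works for points with more than two dimensions.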

## 3.3.3 Majority (more specifically, Plurality) Vote
In a plurality vote, the element with the most votes wins, even if it does not have an absolute majority. The problem: given a sequence of votes, we need to find the element with the most votes.

To find the mode of a NumPy array, we can use $\texttt{scipy.stats.mode(array)}$.
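
As an alternative sketch using only the standard library, the vote can be counted with `collections.Counter`. The function name `plurality_vote` and the choice to break ties at random are assumptions for this example.

```python
import random
from collections import Counter


def plurality_vote(votes):
    """Return the most common element; ties are broken at random."""
    counts = Counter(votes)
    max_count = max(counts.values())
    winners = [vote for vote, count in counts.items() if count == max_count]
    return random.choice(winners)


print(plurality_vote([1, 2, 3, 1, 2, 3, 3, 3]))  # 3 has four votes, so it wins
```

Random tie-breaking avoids systematically favoring one class; a mode function that always returns the smallest tied value does not have that property.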

## 3.3.4 Finding the Nearest Neighbors
$\texttt{numpy.argsort(array)}$ returns the indices that would sort $\texttt{array}$ in increasing order of its values.
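
Sorting the distances with `numpy.argsort` and keeping the first k indices gives the nearest neighbors. The sketch below assumes a function name `find_nearest_neighbors` and a small made-up grid of points.

```python
import numpy as np


def find_nearest_neighbors(p, points, k=5):
    """Return the indices of the k points closest to p."""
    distances = np.sqrt(np.sum((points - p) ** 2, axis=1))  # one distance per row
    return np.argsort(distances)[:k]


points = np.array([[1, 1], [1, 2], [1, 3],
                   [2, 1], [2, 2], [2, 3],
                   [3, 1], [3, 2], [3, 3]])
p = np.array([2.4, 2.2])

nearest = find_nearest_neighbors(p, points, k=2)
print(nearest)          # indices of the two closest points
print(points[nearest])  # their coordinates
```

The function returns indices rather than coordinates so that the same indices can be used to look up the class labels of the neighbors.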

## 3.3.5 Generating Synthetic Data

Synthetic data is data that is generated with the help of a computer model.

It is helpful to generate synthetic data in order to test the performance of a classifier.
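
One simple way to generate such data is to draw two classes of points from normal distributions with different means. This is only a sketch: the function name, the means (0 and 1), and the fixed seed are assumptions chosen for illustration.

```python
import numpy as np


def generate_synthetic_data(n=50, seed=0):
    """Return (points, outcomes) for two classes with n points each."""
    rng = np.random.default_rng(seed)  # seeded so the example is reproducible
    class_0 = rng.normal(loc=0.0, scale=1.0, size=(n, 2))
    class_1 = rng.normal(loc=1.0, scale=1.0, size=(n, 2))
    points = np.concatenate((class_0, class_1), axis=0)
    outcomes = np.concatenate((np.repeat(0, n), np.repeat(1, n)))
    return points, outcomes


points, outcomes = generate_synthetic_data(n=20)
print(points.shape, outcomes.shape)  # (40, 2) (40,)
```

Because the true class of every point is known by construction, the classifier's predictions can be checked directly against `outcomes`.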

## 3.3.6 Making a Prediction Grid

Instead of trying to make a prediction on a single random point, we can try to make a prediction on a grid of points.

The $\texttt{numpy.meshgrid}$ function takes two 1D arrays of x and y coordinates and returns two 2D arrays: one holding the x value and one holding the y value of every point on the grid spanned by those coordinates.

Calling $\texttt{enumerate}$ on a list returns an enumerate object that yields (index, value) pairs, one for each element in the list.
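
The sketch below combines both functions on a tiny grid. The stand-in "prediction" is just `x + y`, an assumption made so the example stays self-contained; in the case study this is where the kNN prediction for each grid point would go.

```python
import numpy as np

xs = np.arange(0, 3)    # x coordinates: 0, 1, 2
ys = np.arange(10, 12)  # y coordinates: 10, 11
xx, yy = np.meshgrid(xs, ys)  # both arrays have shape (len(ys), len(xs))

prediction_grid = np.zeros(xx.shape)
for i, x in enumerate(xs):
    for j, y in enumerate(ys):
        # Row index follows y, column index follows x, matching meshgrid's layout
        prediction_grid[j, i] = x + y

print(xx)
print(yy)
print(prediction_grid)
```

Note the index order `[j, i]`: with the default settings of `numpy.meshgrid`, rows correspond to y values and columns to x values.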

## 3.3.7 Plotting the Prediction Grid
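Bias-variance tradeoff: the tension between overfitting (high variance) and underfitting (high bias) when fitting a classifier to a data set.
- For kNN, this means that values of k that are too small or too large both yield a classifier that generalizes poorly to new data: a very small k overfits the training data, and a very large k underfits it. Intermediate values of k work best.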

## 3.3.8 Applying the kNN Method

All of our work above can be done using the $\texttt{KNeighborsClassifier}$ class from $\texttt{sklearn.neighbors}$. $\texttt{sklearn}$ (scikit-learn) is a Python module that contains a number of machine learning algorithms and datasets, including the $\texttt{iris}$ dataset.
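
A minimal sketch of this with scikit-learn is shown below. The choice of `n_neighbors=5` and the use of all four iris features are assumptions for this example, and the accuracy is measured on the training data itself, so it is an optimistic estimate.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
predictors, outcomes = iris.data, iris.target  # 150 flowers, 4 features, 3 classes

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(predictors, outcomes)
predictions = knn.predict(predictors)

accuracy = (predictions == outcomes).mean()
print(f"Training accuracy: {accuracy:.3f}")
```

For a fairer estimate of how well the classifier generalizes, the data would be split into separate training and test sets before fitting.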