# Lab 04 Files and Modules

# Deliverables (50 pts total):
 - sequenceAnalysis.py module - 31 total points
 - classes included: 
     - NucParams, 
     - ProteinParams, 
     - FastAreader
 - genomeAnalyzer.py - 15 total points
 - Lab 4 notebook with Inspection Intro Cell and Inspection Results Cell completed (4 pts)
 - possible extra credit - 10 additional points
 - Due: Monday May 8, 2023 11:55pm


Congratulations. You have started to build an inventory of some pretty useful functions.  Because these are written as classes, you can easily reuse them. Your ProteinParam class is your first deposit into your very own sequenceAnalysis toolkit.  In Python, these toolkits are called modules.

We are also going to start using real sequence files.  The fastA format, described here: en.wikipedia.org/wiki/FASTA_format is very convenient to use and fully capable of storing sequences of many types. You will be reading these from an input file for this assignment.

## Genomic analysis

There are a few things that we can do that mirror and extend the analyses that we did previously on protein sequences. We can calculate composition statistics on the genome (GC content, for example), we can calculate relative codon usage in the genome, and we can calculate amino acid composition by translating the codons used in the genome.

For this lab, I have provided a NucParams class, with the required methods that it implements (see below). You will need to design and write those methods, and these are to be placed in a file called sequenceAnalysis.py. This is a __*module*__ that you can use from now on.

You will also need to place the ProteinParams class from Lab 3 into this module. This class will not be used for this assignment, but place it into your toolbox.

I have written the FastAreader class.  It is included below. Keep it as part of your module for now; you may decide to keep it somewhere else later.

The input file for this assignment will be named testGenome.fa, and is available in Canvas. You will not need to submit testGenome.fa, but it will be necessary for your testing.  For development and testing, create a new directory (Lab04) and place in it the data file (testGenome.fa), your Lab04 notebook, your program (genomeAnalyzer.py), and your new module (sequenceAnalysis.py).


## Hints

 - Python modules have the .py extension as files, but when they are imported, the name without the extension is used in the import statement in your program.

 - File placement: Make sure to place your notebook, program, sequenceAnalysis module and the required data files in the same folder. This will allow Python to find them. Read over the FastAreader usage to see how to specify file names that you can use for your data.
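
As a small illustration of the first hint, here is a hedged sketch of how a file name maps to an import name. The module `toyToolkit` and its function `gcFraction` are hypothetical stand-ins (they are not part of the lab), written to a temporary folder only so that the example runs on its own. In the lab, sequenceAnalysis.py simply sits in the same folder as your program.

```python
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for sequenceAnalysis.py, created in a temporary folder.
moduleDir = Path(tempfile.mkdtemp())
(moduleDir / 'toyToolkit.py').write_text(
    "def gcFraction(seq):\n"
    "    seq = seq.upper()\n"
    "    return (seq.count('G') + seq.count('C')) / len(seq)\n"
)

# Python looks for modules in the folders listed on sys.path.
# The folder holding your program is normally already on that list.
sys.path.insert(0, str(moduleDir))

# The file is toyToolkit.py, but the import uses the name without '.py'.
import toyToolkit

print(toyToolkit.gcFraction('ACGT'))  # prints 0.5
```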

## Codon frequency calculations

Notice that NucParams does all of the counting you need. It is responsible for counts of codons and their translated amino acids.

Your genomeAnalyzer.py program has the task of determining which codons are preferred for each of the amino acids and calculating the relative percentage.  For any given amino acid, the relative codon usage (percentages) should sum to 100.0%. Notice that Methionine and Tryptophan each have only 1 codon, so each of those codons will have a relative codon usage of 100%.

For example: Lysine is coded by both AAA (607) and AAG (917) (example counts in parentheses).  From our aaComposition() method, we are given the aaComposition dictionary and we can look up 'K' to find 1524 counts (these came from those 607+917 codons).  We can then calculate 607/1524 for AAA and 917/1524 for AAG.  The associated percentages are thus: 39.8 for AAA and 60.2 for AAG.

AAA = 607/1524 * 100 = 39.8%

AAG = 917/1524 * 100 = 60.2%
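
The Lysine arithmetic above can be sketched in a few lines of Python. This is only an illustration using the example counts from the text; the variable names (`codonCounts`, `lysineTotal`, `relativeUsage`) are my own and are not required by the assignment.

```python
# Example counts from the text (not from a real genome).
codonCounts = {'AAA': 607, 'AAG': 917}

# The amino acid total is the sum over all codons that code for it.
lysineTotal = sum(codonCounts.values())   # 1524, the aaComposition count for 'K'

# Relative usage of each codon, as a percentage of the amino acid total.
relativeUsage = {codon: count / lysineTotal * 100
                 for codon, count in codonCounts.items()}

for codon in sorted(relativeUsage):
    print('{:s} = {:5.1f}%'.format(codon, relativeUsage[codon]))
```

Because every codon count is divided by the same total, the percentages for one amino acid always sum to 100.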


## Design specification - sequenceAnalysis.py

### NucParams class

#### \_\_init\_\_

The constructor of the class has one optional parameter, a sequence of type string. It may include upper or lower case letters of the set {ACGTUN} or whitespace.  These will be gene sequences and they begin in frame 1.  In other words the first 3 letters of the sequence encode the first AA of the sequence. Carefully consider in what form this class should maintain its data. Is a string the best structure? This class (NucParams) is intended to be very similar to ProteinParam. Make sure to read addSequence() before making this decision, and remember that objects of this class may need to handle an arbitrarily large number of sequences (hint:  dictionaries are good). As a second hint, notice that \_\_init\_\_ and addSequence are doing VERY similar things - you could just make one of them do most of the work.

#### addSequence() - 5 pts

This method must accept new sequences, from the {ACGTUN} alphabet, and each sequence can be presumed to start in frame 1. This data must be added to the data that you were given with the \_\_init\_\_ method (if any).
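
One possible way to let the constructor and addSequence() share their work is sketched below. `SequenceTally` and `baseCounts` are hypothetical names for a toy class, not the required NucParams design; the point is only that the constructor creates the running counts once and then hands its input to the same method that later additions use.

```python
class SequenceTally:
    """Hypothetical toy class: the constructor delegates to addSequence()."""

    def __init__(self, inString=''):
        self.baseCounts = {}          # running counts shared by all sequences
        self.addSequence(inString)    # reuse the same code path as later additions

    def addSequence(self, inSeq):
        # Remove whitespace and normalise case before counting.
        for base in ''.join(inSeq.split()).upper():
            self.baseCounts[base] = self.baseCounts.get(base, 0) + 1


tally = SequenceTally('acg t')   # first sequence arrives through the constructor
tally.addSequence('GGN')         # later sequences accumulate in the same dictionary
print(tally.baseCounts)
```

Because only counts are kept, the object stays small no matter how many sequences are added.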

#### aaComposition() - 6 pts

This method will return a dictionary of counts over the 20 amino acids and stop codons.  This dictionary is VERY similar to the lab 3 aaComposition, though you must decode the codon first. The translation table from codon to AA is provided. You are counting amino acids by translating from the proper codon table.
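
A hedged sketch of the translation step follows. It builds the standard RNA codon table compactly rather than using the table provided with the lab, and the function name `aaCountsFromCodons` and the AUG and UAA counts are made up for illustration.

```python
from collections import Counter
from itertools import product

# Build the standard RNA codon table compactly; '-' marks a stop codon.
bases = 'UCAG'
aminoAcids = 'FFLLSSSSYY--CC-WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
codonTable = {''.join(codon): aa
              for codon, aa in zip(product(bases, repeat=3), aminoAcids)}


def aaCountsFromCodons(codonCounts):
    """Translate a dictionary of codon counts into amino acid counts."""
    aaCounts = Counter()
    for codon, count in codonCounts.items():
        aaCounts[codonTable[codon]] += count   # decode the codon, then tally
    return dict(aaCounts)


# Lysine counts from the example in this lab; the other two are hypothetical.
print(aaCountsFromCodons({'AAA': 607, 'AAG': 917, 'AUG': 10, 'UAA': 1}))
```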

#### nucComposition() - 10 pts

This method returns a dictionary of counts of the valid nucleotides {ACGTUN} found in the analysis. If you were given RNA nucleotides, they should be counted as RNA nucleotides. If you were given DNA nucleotides, they should be counted as DNA nucleotides. Any N bases found should be counted also. Invalid bases are to be ignored in this dictionary.
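
As a rough illustration of these counting rules, the sketch below tallies only the valid bases and keeps T and U separate. The function name `countNucleotides` is hypothetical, and a real NucParams would accumulate these counts across many sequences.

```python
validBases = 'ACGTUN'


def countNucleotides(sequence):
    """Count only valid bases; DNA 'T' and RNA 'U' are kept separate."""
    counts = {base: 0 for base in validBases}
    for base in sequence.upper():
        if base in counts:        # whitespace and invalid characters are ignored
            counts[base] += 1
    return counts


counts = countNucleotides('acgt UUN x-')
print(counts)
print(sum(counts.values()))       # total number of valid bases
```

The sum over the dictionary is the total number of valid nucleotides seen.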

#### codonComposition() - 10 pts

This method returns a dictionary of codon counts. Presume that sequences start in frame 1, accept the alphabet {ACGTUN} and store codons in RNA format, along with their counts. __Any codons found with invalid bases should be discarded__. Discard codons that contain N bases. This means that all codon counts are stored as RNA codons, even if the input happens to be DNA. If you discard a codon, make sure to not alter the frame of subsequent codons.
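
The frame-preserving rule can be sketched as below. This is one possible reading of the specification, under the assumption that only whitespace is removed before the sequence is cut into codons; `countCodons` is a hypothetical name.

```python
def countCodons(sequence):
    """Count RNA codons in frame 1, skipping codons with N or invalid bases."""
    rnaSeq = ''.join(sequence.split()).upper().replace('T', 'U')
    codonCounts = {}
    # Step by 3 so that a discarded codon never shifts the frame of later codons.
    for start in range(0, len(rnaSeq) - 2, 3):
        codon = rnaSeq[start:start + 3]
        if set(codon).issubset('ACGU'):   # rejects N and any invalid character
            codonCounts[codon] = codonCounts.get(codon, 0) + 1
    return codonCounts


# 'ANN' is discarded, and the trailing 'tt' is too short to form a codon.
print(countCodons('atgANNaaaAAGtt'))
```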

#### nucCount()

This returns an integer value, summing every valid nucleotide {ACGTUN} found.  This value should exactly equal the sum over the nucleotide composition dictionary.

## Design specification - genomeAnalyzer.py

This program must import your sequenceAnalysis module.
It is responsible for preparing the summaries and final display of the data.

## Input must be from STDIN
Your FastAreader object will read data from sys.stdin if it is not given a filename. Notice that the filename is specified as a parameter to your main() function. See the last line of the cell containing main() in this notebook:
 - main( 'testGenome.fa').

When you move main() to genomeAnalyzer.py, you can delete the parameter:
 - main()
 This will cause the filename parameter to be *None* (the default) and your FastAreader object will interpret a filename of *None* as a request to use sys.stdin (see the doOpen() method for further info).

Your notebook will be used for inspections. For the notebook, you will need to specify the input filename. For genomeAnalyzer.py, you will need to delete the parameter.
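
The filename-or-STDIN switch can be illustrated with the sketch below. `openInput` mirrors the idea behind doOpen(), while `firstHeader` is a hypothetical helper added only to have something to call; STDIN is simulated with io.StringIO so that the example runs without piped input.

```python
import io
import sys


def openInput(fname=None):
    """Return sys.stdin when no filename is given, else an open file handle."""
    if fname is None:
        return sys.stdin
    return open(fname)


def firstHeader(fname=None):
    """Hypothetical helper: return the first FASTA header found, or None."""
    handle = openInput(fname)
    try:
        for line in handle:
            if line.startswith('>'):
                return line[1:].rstrip()
        return None
    finally:
        if handle is not sys.stdin:   # only close handles that we opened
            handle.close()


# Simulate piped input for the demonstration, then restore the real STDIN.
sys.stdin = io.StringIO('>toySeq demo\nACGT\n')
print(firstHeader())                  # no filename, so STDIN is read
sys.stdin = sys.__stdin__
```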


### Output format - 15 pts

The function to output the results of your analysis has specific formatting rules that you must follow to get full credit. These rules are as follows:

 - First line: sequence length = X.XX Mb with two digits after the decimal and labeled Mb (you need to calculate the number of bases in Mb).
 - second line is blank
 - third line: GC content = XX.X% as a percentage with one digit after the decimal
 - fourth line is blank
 - lines 5 - 68 are the output statistics on relative codon usage for each codon, ordered by codon within each amino acid group, in the form `XXX : A F (D)`

where XXX is the three letters for an RNA codon, A is the 1-letter amino acid code, F is relative codon frequency, use {:5.1f} for the format, and D is for codon count, use the format {:6d}. There is a single space between each of these fields.
For example (this is not representative of any real genome):
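
A minimal sketch of producing lines in this layout is shown here. The format string for the codon line is the one used in the main() framework in this notebook; the function name `formatCodonLine` and the base counts are hypothetical, and the Lysine numbers are the example counts from the codon frequency section.

```python
def formatCodonLine(codon, aa, frequency, count):
    """Format one codon line with single spaces between the fields."""
    return '{:s} : {:s} {:5.1f} ({:6d})'.format(codon, aa, frequency, count)


nucCount = 2210000        # hypothetical total number of bases
gcCount = 1230000         # hypothetical number of G plus C bases

print('sequence length = {:.2f} Mb'.format(nucCount / 1000000))
print()
print('GC content = {:.1f}%'.format(gcCount / nucCount * 100))
print()
print(formatCodonLine('AAA', 'K', 607 / 1524 * 100, 607))
```

Note that {:5.1f} and {:6d} pad on the left, so short numbers line up in columns.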

To get full credit on this assignment, your code needs to:
 - Run properly (execute and produce the correct output). 
 - Include any assumptions or design decisions you made in writing your code
 - Contain proper docstrings for the program, classes, modules and any public functions.
 - Contain in-line comments
 
## Extra credit - 10 pts possible

You now have a very powerful set of classes for evaluating genomes. Write a compareGenomes.py program that compares GC content, aaComposition and relative codon bias of 2 genomes. You will have a halophile genome and a hyperthermophile genome to compare.
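
As a starting point, a GC content comparison could look roughly like the sketch below. The two composition dictionaries hold made-up numbers, and `gcPercent` is a hypothetical helper; in a real compareGenomes.py the dictionaries would come from nucComposition() on each genome.

```python
def gcPercent(nucComp):
    """GC content as a percentage of all counted bases."""
    total = sum(nucComp.values())
    if total == 0:
        return 0.0
    return (nucComp.get('G', 0) + nucComp.get('C', 0)) / total * 100


# Hypothetical nucleotide compositions for two genomes (made-up numbers).
genomeA = {'A': 200, 'C': 300, 'G': 300, 'T': 200}
genomeB = {'A': 350, 'C': 150, 'G': 150, 'T': 350}

difference = gcPercent(genomeA) - gcPercent(genomeB)
print('{:.1f} {:.1f} {:+.1f}'.format(gcPercent(genomeA), gcPercent(genomeB), difference))
```

The same pattern (compute a statistic per genome, then report the difference) extends to amino acid composition and relative codon usage.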

Submit your code using Canvas.

Congratulations, you have finished your fourth lab assignment!

## Inspection Intro
Provide design level information for your inspection team here. How do you input data to avoid having to read every sequence from the genome into memory? Where are your composition dictionaries initialized? How does data get added to those composition dictionaries?

## Nuc Params

In [1]:
class NucParams:
    rnaCodonTable = {
    # RNA codon table
    # U
    'UUU': 'F', 'UCU': 'S', 'UAU': 'Y', 'UGU': 'C',  # UxU
    'UUC': 'F', 'UCC': 'S', 'UAC': 'Y', 'UGC': 'C',  # UxC
    'UUA': 'L', 'UCA': 'S', 'UAA': '-', 'UGA': '-',  # UxA
    'UUG': 'L', 'UCG': 'S', 'UAG': '-', 'UGG': 'W',  # UxG
    # C
    'CUU': 'L', 'CCU': 'P', 'CAU': 'H', 'CGU': 'R',  # CxU
    'CUC': 'L', 'CCC': 'P', 'CAC': 'H', 'CGC': 'R',  # CxC
    'CUA': 'L', 'CCA': 'P', 'CAA': 'Q', 'CGA': 'R',  # CxA
    'CUG': 'L', 'CCG': 'P', 'CAG': 'Q', 'CGG': 'R',  # CxG
    # A
    'AUU': 'I', 'ACU': 'T', 'AAU': 'N', 'AGU': 'S',  # AxU
    'AUC': 'I', 'ACC': 'T', 'AAC': 'N', 'AGC': 'S',  # AxC
    'AUA': 'I', 'ACA': 'T', 'AAA': 'K', 'AGA': 'R',  # AxA
    'AUG': 'M', 'ACG': 'T', 'AAG': 'K', 'AGG': 'R',  # AxG
    # G
    'GUU': 'V', 'GCU': 'A', 'GAU': 'D', 'GGU': 'G',  # GxU
    'GUC': 'V', 'GCC': 'A', 'GAC': 'D', 'GGC': 'G',  # GxC
    'GUA': 'V', 'GCA': 'A', 'GAA': 'E', 'GGA': 'G',  # GxA
    'GUG': 'V', 'GCG': 'A', 'GAG': 'E', 'GGG': 'G'  # GxG
    }
    dnaCodonTable = {key.replace('U','T'):value for key, value in rnaCodonTable.items()}

    def __init__ (self, inString=''):
        pass
        
    def addSequence (self, inSeq):
        pass
    def aaComposition(self):
        return self.aaComp
    def nucComposition(self):
        return self.nucComp
    def codonComposition(self):
        return self.codonComp
    def nucCount(self):
        return sum(self.nucComp.values())

## FastAreader

In [1]:
import sys
class FastAreader :
    ''' 
    Define objects to read FastA files.
    
    instantiation: 
    thisReader = FastAreader ('testTiny.fa')
    usage:
    for head, seq in thisReader.readFasta():
        print (head,seq)
    '''
    def __init__ (self, fname=None):
        '''constructor: saves attribute fname '''
        self.fname = fname
            
    def doOpen (self):
        ''' Handle file opens, allowing STDIN.'''
        if self.fname is None:
            return sys.stdin
        else:
            return open(self.fname)
        
    def readFasta (self):
        ''' Read an entire FastA record and return the sequence header/sequence'''
        header = ''
        sequence = ''
        
        with self.doOpen() as fileH:
            
            header = ''
            sequence = ''
            
            # skip to first fasta header
            line = fileH.readline()
            while not line.startswith('>') :
                line = fileH.readline()
            header = line[1:].rstrip()

            for line in fileH:
                if line.startswith ('>'):
                    yield header,sequence
                    header = line[1:].rstrip()
                    sequence = ''
                else :
                    sequence += ''.join(line.rstrip().split()).upper()

        yield header,sequence


## Main 
Here is a Jupyter framework that may come in handy:

In [None]:
def main (fileName=None):
    myReader = FastAreader(fileName) 
    myNuc = NucParams()
    for head, seq in myReader.readFasta() :
        myNuc.addSequence(seq)
        
    # sort codons in alpha order, by Amino Acid
    
    # calculate relative codon usage for each codon and print
    for nucI in nucs:
        ...
        print ('{:s} : {:s} {:5.1f} ({:6d})'.format(nuc, aa, val*100, thisCodonComp[nuc]))

if __name__ == "__main__":
    main('testGenome.fa') # make sure to change this in order to use stdin
    

## Inspection Intro

Composition dictionaries are initialized in the NucParams constructor, and the addSequence method updates them: nucleotides are tallied with str.count(), while each valid codon and its translated amino acid are incremented by 1.
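
On the memory question, the general idea of streaming records can be sketched with a generator. `readRecords` is a simplified, hypothetical stand-in for FastAreader.readFasta(), and the two-record input is invented for the demonstration.

```python
import io


def readRecords(handle):
    """Yield (header, sequence) one record at a time from an open FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)   # hand back the finished record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, ''.join(chunks)           # last record in the file


# Hypothetical two-record input standing in for a genome file.
fakeFile = io.StringIO('>gene1\nATG\naaa\n>gene2\nGGN\n')

totalBases = 0
for head, seq in readRecords(fakeFile):
    totalBases += len(seq)    # each sequence can be discarded once it is tallied
print(totalBases)
```

Only one record is held in memory at a time; the running totals are all that persist between records.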

## sequenceAnalysis.py

In [None]:
#!/usr/bin/env python3
# Name: Justin Jang (jjang12)
# Group Members: Aster Lathbury(mlathbur), Kimberly Magpantay(klmagpan), Faiz Khan(faahkhan)
"""
Analyze a genome.

Classes:
NucParams: Return counts of nucleotides, codons and their translated amino acids.
ProteinParam: Return aa count, pI, molar/mass extinction, and molecular weight
FastAreader: Read FastA files so we can take objects from the file to analyze in genomeAnalyzer
"""




import sys
class NucParams:
    """Track counts of codons and their translated amino acids.
     nucleotide > codon
     Methods:
     __init__(self, inString)
     addSequence(self,sequence)
     aaComposition(self)  (from NB)
     nucComposition(self)  (from NB)
     codonComposition(self)  (from NB)
     nucCount(self)  (from NB)
       """
    rnaCodonTable = {
    # RNA codon table
    # U  
    'UUU': 'F', 'UCU': 'S', 'UAU': 'Y', 'UGU': 'C',  # UxU
    'UUC': 'F', 'UCC': 'S', 'UAC': 'Y', 'UGC': 'C',  # UxC
    'UUA': 'L', 'UCA': 'S', 'UAA': '-', 'UGA': '-',  # UxA
    'UUG': 'L', 'UCG': 'S', 'UAG': '-', 'UGG': 'W',  # UxG
    # C
    'CUU': 'L', 'CCU': 'P', 'CAU': 'H', 'CGU': 'R',  # CxU
    'CUC': 'L', 'CCC': 'P', 'CAC': 'H', 'CGC': 'R',  # CxC
    'CUA': 'L', 'CCA': 'P', 'CAA': 'Q', 'CGA': 'R',  # CxA
    'CUG': 'L', 'CCG': 'P', 'CAG': 'Q', 'CGG': 'R',  # CxG
    # A
    'AUU': 'I', 'ACU': 'T', 'AAU': 'N', 'AGU': 'S',  # AxU
    'AUC': 'I', 'ACC': 'T', 'AAC': 'N', 'AGC': 'S',  # AxC
    'AUA': 'I', 'ACA': 'T', 'AAA': 'K', 'AGA': 'R',  # AxA
    'AUG': 'M', 'ACG': 'T', 'AAG': 'K', 'AGG': 'R',  # AxG
    # G
    'GUU': 'V', 'GCU': 'A', 'GAU': 'D', 'GGU': 'G',  # GxU
    'GUC': 'V', 'GCC': 'A', 'GAC': 'D', 'GGC': 'G',  # GxC
    'GUA': 'V', 'GCA': 'A', 'GAA': 'E', 'GGA': 'G',  # GxA
    'GUG': 'V', 'GCG': 'A', 'GAG': 'E', 'GGG': 'G'  # GxG
    }
    dnaCodonTable = {key.replace('U','T'):value for key, value in rnaCodonTable.items()}

    def __init__ (self, inString=''):
        '''Set up the composition dictionaries with all counts at zero, then tally inString (if any) through addSequence.'''
        
        
        #make dictionaries with keys of sequence composition, values counts set to 0
        self.nucComp = {'G': 0,'U': 0,'A': 0,'C': 0,'T': 0,'N': 0}  # nucleotide composition
        self.aaComp = {aa: 0 for aa in NucParams.rnaCodonTable.values()}  # amino acid composition, key is amino acid and value is set to 0
        self.codonComp = {codon: 0 for codon in NucParams.rnaCodonTable.keys()} #codon sequence composition, key is codon and value is set to 0
        self.addSequence(inString)  # call addSequence method with the inputted sequence
               
        
    def addSequence (self, sequence):
        '''Clean the input sequence, then update the nucleotide, codon and amino acid counts.

        Counts are stored in self.nucComp, self.codonComp and self.aaComp.'''

        sequence = ''.join(sequence.split()).upper()  # remove whitespace, convert to uppercase

        for base in self.nucComp:  # add the number of each valid nucleotide found in this sequence
            self.nucComp[base] += sequence.count(base)

        cleanedCodons = []  # list to hold 3-base codons
        for i in range(0, len(sequence), 3):  # step through the sequence 3 nucleotides at a time (frame 1)
            cleanedCodons.append(sequence[i:i+3])  # slices start at positions 0, 3, 6, 9 ...

        for codonSeq in cleanedCodons:  # iterate over the cleanedCodons list
            codonSeq = codonSeq.replace('T','U')  # convert DNA to RNA format because codonComp is keyed by RNA codons
            if 'N' in codonSeq:  # discard codons that contain N
                continue
            if codonSeq in self.codonComp:  # only valid, complete codons are keys of codonComp
                self.codonComp[codonSeq] += 1  # increment the codon count

                aa = self.rnaCodonTable[codonSeq]  # translate the codon
                self.aaComp[aa] += 1  # increment the amino acid count
    def aaComposition(self):
        '''Return a dictionary of amino acid (and stop codon) counts.'''
        return self.aaComp
    def nucComposition(self):
        '''Return a dictionary of nucleotide counts.'''
        return self.nucComp
    def codonComposition(self):
        '''Return a dictionary of RNA codon counts.'''
        return self.codonComp
    def nucCount(self):
        '''Return the total number of valid nucleotides counted.'''
        return sum(self.nucComp.values())

class ProteinParam :
    """Calculate statistics of a given protein sequence. Take a protein string and calculate the physical-chemical properties of a protein sequence.

    Return:
    - the number of amino acids and total molecular weight,
    - Molar extinction coefficient and Mass extinction coefficient,
    - theoretical isoelectric point (pI)
    - amino acid composition
    """

    # These tables are for calculating:
    #     molecular weight (aa2mw), along with the mol. weight of H2O (mwH2O)
    #     absorbance at 280 nm (aa2abs280)
    #     pKa of positively charged Amino Acids (aa2chargePos)
    #     pKa of negatively charged Amino acids (aa2chargeNeg)
    #     and the constants aaNterm and aaCterm for pKa of the respective termini
    #  Feel free to move these to appropriate methods as you like

    # As written, these are accessed as class attributes, for example:
    # ProteinParam.aa2mw['A'] or ProteinParam.mwH2O

    #dictionary of amino acids and corresponding molecular weights
    aa2mw = {
        'A': 89.093,  'G': 75.067,  'M': 149.211, 'S': 105.093, 'C': 121.158,
        'H': 155.155, 'N': 132.118, 'T': 119.119, 'D': 133.103, 'I': 131.173,
        'P': 115.131, 'V': 117.146, 'E': 147.129, 'K': 146.188, 'Q': 146.145,
        'W': 204.225,  'F': 165.189, 'L': 131.173, 'R': 174.201, 'Y': 181.189
    }

    #molecular weight water
    mwH2O = 18.015
    #AAs and their absorbance
    aa2abs280 = {'Y':1490, 'W': 5500, 'C': 125}

    #charged AAs and their pKa vals
    aa2chargePos = {'K': 10.5, 'R':12.4, 'H':6}
    aa2chargeNeg = {'D': 3.86, 'E': 4.25, 'C': 8.33, 'Y': 10}
    #pKas of N- and C- terminus
    aaNterm = 9.69
    aaCterm = 2.34

    def __init__ (self, proteinStr: str):
        """Initialize method for taking protein sequence as an inputted string"""

        # keep only the valid amino acid characters (keys of aa2mw), converted to uppercase
        allowedAminos = self.aa2mw.keys()
        self.newProteinStr = ''.join(char for char in proteinStr.upper() if char in allowedAminos)

        # dictionary holding the count of each of the 20 amino acids in the cleaned sequence
        self.aaComp = {}
        for aminoAcid in self.aa2mw.keys():
            self.aaComp[aminoAcid] = self.newProteinStr.count(aminoAcid)

    def aaCount (self):
        """Return a single integer count of valid amino acid characters found."""
        return (len(self.newProteinStr))
        # returns len of input self's newProteinStr object

    def _charge_ (self,pH):
        """Calculate the net charge of a protein based on its amino acid composition and the pH of the surrounding environment. Used by the pI method."""
        posCharge = sum(self.aaComp.get(aa, 0) * pow(10, ProteinParam.aa2chargePos.get(aa, 0)) / (pow(10, ProteinParam.aa2chargePos.get(aa, 0)) + pow(10, pH)) for aa in ['R', 'K', 'H'])  # calculate positive charge from equation
        negCharge = sum(self.aaComp.get(aa, 0) * pow(10, pH) / (pow(10, ProteinParam.aa2chargeNeg.get(aa, 0)) + pow(10, pH)) for aa in ['D', 'E', 'C', 'Y'])  # calculate negative charge from equation
        
        nTermChrg = pow(10, ProteinParam.aaNterm) / (pow(10,ProteinParam.aaNterm) + pow(10,pH))
        cTermChrg = pow(10,pH) / (pow(10,ProteinParam.aaCterm) + pow(10,pH))

        netCharge = posCharge - negCharge + nTermChrg - cTermChrg
        return netCharge

    def pI(self):
        """Calculate the  particular pH that yields a neutral net charge"""
        bestCharge = 100000000
        bestpH = 0
        for pH100 in range (0, 1400+1): # step in hundredths of a pH unit (0.00 to 14.00) so the pI is found to two decimal places
            pH = (pH100)/100 #gets the two decimal numbers (gives actual pH)
            currentCharge = self._charge_(pH) #runs pH through the charge method
            keepCharge = abs(currentCharge) #call charge method, gets absolute value 
            
            if keepCharge < bestCharge: # wants the closest value to 0, so keeps it if it is lower than previous
                bestCharge = keepCharge #defines keepCharge as new lower value, runs it through the loop again 
                bestpH =  pH
        return bestpH   
        
    

    def aaComposition (self) :
        """return a dictionary keyed by single letter Amino acid code, and having associated values that are the counts of those amino acids in the sequence."""
        return self.aaComp #returns the dictionary setup at the beginning

    def molarExtinction (self):
        """Return how much light a protein absorbs at a certain wavelength"""
        return (self.newProteinStr.count('Y') * self.aa2abs280['Y']
                     + self.newProteinStr.count('W') * self.aa2abs280['W']
                     + self.newProteinStr.count('C') * self.aa2abs280['C'])
        

    def massExtinction (self):
        """
        Calculate light absorbance divided by molecular weight of the input protein sequence.
        """
        myMW =  self.molecularWeight()
        return self.molarExtinction() / myMW if myMW else 0.0
    

    def molecularWeight (self):
        """Return molecular weight of the protein sequence."""
        if self.aaCount() == 0:  # no valid amino acids, so there is nothing to weigh
            return 0.0
        waterWeight = self.mwH2O * (self.aaCount() - 1)  # one water is released per peptide bond
        aaWeight = sum(count * self.aa2mw[aa] for aa, count in self.aaComp.items())
        molecularWeight = aaWeight - waterWeight
        return molecularWeight


class FastAreader :
    ''' 
    Define objects to read FastA files.
    
    instantiation: 
    thisReader = FastAreader ('testTiny.fa')
    usage:
    for head, seq in thisReader.readFasta():
        print (head,seq)
    '''
    def __init__ (self, fname=None):
        '''constructor: saves attribute fname '''
        self.fname = fname
            
    def doOpen (self):
        ''' Handle file opens, allowing STDIN.'''
        if self.fname is None:
            return sys.stdin
        else:
            return open(self.fname)
        
    def readFasta (self):
        ''' Read an entire FastA record and return the sequence header/sequence'''
        header = ''
        sequence = ''
        
        with self.doOpen() as fileH:
            
            header = ''
            sequence = ''
            
            # skip to first fasta header
            line = fileH.readline()
            while not line.startswith('>') :
                line = fileH.readline()
            header = line[1:].rstrip()

            for line in fileH:
                if line.startswith ('>'):
                    yield header,sequence
                    header = line[1:].rstrip()
                    sequence = ''
                else :
                    sequence += ''.join(line.rstrip().split()).upper()

        yield header,sequence



## genomeAnalyzer.py

In [1]:
#!/usr/bin/env python3
# Name: Justin Jang (jjang12)
# Group Members: Aster Lathbury(mlathbur), Kimberly Magpantay(klmagpan), Faiz Khan(faahkhan)
"""Analyze a genome from a FastA file."""

from sequenceAnalysis import NucParams, ProteinParam, FastAreader

class GenomeAnalyser:
    '''Display data from a Fasta file, in this case "testGenome.fa". Print sequence length, G-C content, and output statistics on 
    relative codon usage for each codon ordered by codon within each amino acid group. '''

    def __init__(self,fileName = 'testGenome.fa'):
        '''Constructor for GenomeAnalyser class. Take a file name and run the analyser method on it.

        A fileName of None is passed through to FastAreader, which then reads from sys.stdin.'''
        self.analyser(fileName)
        
    
    
    def sequenceLength(self,read):
        '''Run NucParams on the fastA file and return the sequence length in megabases along with the NucParams object.'''
        nucParams = NucParams()  # easier way to call NucParams methods
        for head,seq in read.readFasta():  # note to self: takes readFasta from FastAreader in sequenceAnalysis
            nucParams.addSequence(seq)

        length = nucParams.nucCount() / 1000000  # return megabases by dividing by 1,000,000
        return length, nucParams
    
    def gcContent(self, nucParams):
        '''Calculate the percentage of G and C nucleotides in sequence from the file.'''
        nucComp = nucParams.nucComposition() #see NucParams in sequenceAnalysis
        gc = nucComp['G'] + nucComp['C']
        gc = gc / nucParams.nucCount() * 100  #see NucParams in sequenceAnalysis

        return gc

    def analyser(self, fastAfile):
        '''Read the FastA file and print sequence length, GC content and relative codon usage.'''
        self.read = FastAreader(fastAfile)

        length, nucParams = self.sequenceLength(self.read)  # length in Mb, plus the NucParams object holding the composition counts
        print(f"sequence length = {length:.2f} Mb\n")

        self.GC = self.gcContent(nucParams)  # GC content as a percentage
        print(f"GC content = {self.GC:.1f} %\n")

        codonCount = nucParams.codonComposition()  # codon counts keyed by RNA codon
        aaCount = nucParams.aaComposition()  # amino acid counts keyed by 1-letter code

        # sort alphabetically by 1-letter amino acid code first, THEN by 3-letter codon
        for codon in sorted(codonCount.keys(), key=lambda c: (nucParams.rnaCodonTable[c], c)):
            cCount = codonCount[codon]  # count of current codon
            singLetter = nucParams.rnaCodonTable[codon]  # single-letter AA for this codon
            aminoCount = aaCount[singLetter]  # total count of that AA

            # relative codon usage; 0.0 when the amino acid was never seen
            finalFreq = (cCount / aminoCount) * 100 if aminoCount else 0.0
            print(f'{codon} : {singLetter}  {finalFreq:5.1f}% ({cCount:6d})')

        #print(codonCount)  run this to test raw count before showing frequencies
    
GenomeAnalyser()

#compare results with test.out

sequence length = 2.21 Mb

GC content = 55.7 %

UAA : -   32.6% (  1041)
UAG : -   38.6% (  1230)
UGA : -   28.8% (   918)
GCA : A   14.1% ( 10605)
GCC : A   40.5% ( 30524)
GCG : A   30.5% ( 22991)
GCU : A   14.9% ( 11238)
UGC : C   67.2% (  4653)
UGU : C   32.8% (  2270)
GAC : D   70.8% ( 22686)
GAU : D   29.2% (  9372)
GAA : E   23.0% ( 11602)
GAG : E   77.0% ( 38951)
UUC : F   60.4% ( 15880)
UUU : F   39.6% ( 10404)
GGA : G   13.5% (  7651)
GGC : G   47.9% ( 27193)
GGG : G   29.3% ( 16664)
GGU : G    9.3% (  5294)
CAC : H   79.6% (  9089)
CAU : H   20.4% (  2334)
AUA : I   45.1% ( 18134)
AUC : I   30.1% ( 12096)
AUU : I   24.8% (  9986)
AAA : K   32.8% ( 12810)
AAG : K   67.2% ( 26206)
CUA : L   15.2% ( 11753)
CUC : L   23.6% ( 18242)
CUG : L   21.2% ( 16349)
CUU : L   14.7% ( 11393)
UUA : L    8.0% (  6190)
UUG : L   17.3% ( 13335)
AUG : M  100.0% ( 13577)
AAC : N   74.6% ( 12962)
AAU : N   25.4% (  4423)
CCA : P   16.4% (  6142)
CCC : P   35.5% ( 13275)
CCG : P   30.2% ( 11301)
CC

<__main__.GenomeAnalyser at 0x15d23693710>

## Inspection Results
Who is your inspection team? What did they find? How did you decide to incorporate their suggestions?

**Faiz's Comments:** I'd recommend adding somewhere in your NucParams class a few lines that discard codons that contain N bases since the directions ask for it. As for what 'continue' does in my code, whenever the loop encounters a codon that contains the letter 'N', it skips that iteration of the loop and moves on to the next codon. Other than that your code is very organized and easy to read, good job!

**Aster's Comments:** Don't forget to add a program overview for both of your sequenceAnalysis and genomeAnalysis modules. We modified our codon strings differently, but I think it is interesting how you separated the codons every 3 characters. I think making each purpose of your genomeAnalysis into a method makes a lot of sense. I personally put it all into the main function, but creating a class makes your code more readable. Also, don't forget the inspection intro (included before NucParams). Great job!

**Kimberly's Comments**: In the \_\_init\_\_ method of your NucParams class, the only difference between our code is how you created the aaComp dictionary. I instead manually created it, similar to your nucComp, and could've probably done it this easier way. In your addSequence method, the way you counted the nucleotides is different in comparison to mine as you counted the number of instances. I incremented each base by 1, similar to your codon and aa compositions. Also, in your genomeAnalysis, it's interesting how you separated your different functions into different methods. It makes it look very organized. I instead added everything to a single "main()" function. Lastly, your code is very easy to read through, especially with your detailed comments.

Response: I added a check inside the for loop over the cleaned codons that skips any codon containing N. I also added the program overviews and the other items I had missed.