Add optimization and translation functions for sequences #11

Merged
merged 39 commits into from
Oct 22, 2020

Koeng101
Contributor

This pull request adds codon optimization and translation functions for sequences, plus default values for all of the NCBI codon tables.

The general idea is that you have a codonTable object that stores amino acids, codons, and the number of occurrences of any given codon in proteins (defaults to 0). That codonTable has methods associated with it to build simple mappings between amino acids <-> codons.
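A minimal sketch of the data model described above, for readers who want to see the shape of it. The field names `AminoAcids`, `Codons`, `Triplet`, and `Letter` appear in the diff excerpts in this conversation; `Weight` and the two method names are assumptions made for illustration and may not match the merged code.

```go
package main

// Codon pairs a triplet with the number of times it was observed in proteins.
// Weight is an assumed field name; the count defaults to 0.
type Codon struct {
	Triplet string
	Weight  int
}

// AminoAcid groups every codon that encodes one amino acid letter.
type AminoAcid struct {
	Letter string
	Codons []Codon
}

// CodonTable stores amino acids together with their codons.
type CodonTable struct {
	AminoAcids []AminoAcid
}

// translationMap builds the codon -> amino acid lookup (hypothetical name).
func (table CodonTable) translationMap() map[string]string {
	lookup := make(map[string]string)
	for _, aminoAcid := range table.AminoAcids {
		for _, codon := range aminoAcid.Codons {
			lookup[codon.Triplet] = aminoAcid.Letter
		}
	}
	return lookup
}

// codonMap builds the amino acid -> codons lookup (hypothetical name).
func (table CodonTable) codonMap() map[string][]string {
	lookup := make(map[string][]string)
	for _, aminoAcid := range table.AminoAcids {
		for _, codon := range aminoAcid.Codons {
			lookup[aminoAcid.Letter] = append(lookup[aminoAcid.Letter], codon.Triplet)
		}
	}
	return lookup
}
```

Keeping the table as a nested list and deriving the maps on demand means one object can serve both translation (codon to amino acid) and optimization (amino acid to codons).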

Some checks are not yet added:

  • Translate should require that the input length is divisible by 3
  • Creating an optimization table should require that every codon has at least 1 occurrence
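The two missing checks could look roughly like the sketch below. The function names `translate` and `checkWeights` and the plain-map inputs are assumptions for illustration, not the signatures used in this pull request.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// translate converts a coding sequence into amino acid letters using a
// codon -> amino acid map. It rejects input whose length is not a multiple of 3.
func translate(sequence string, translationMap map[string]string) (string, error) {
	if len(sequence)%3 != 0 {
		return "", errors.New("sequence length must be divisible by 3")
	}
	var aminoAcids strings.Builder
	for i := 0; i < len(sequence); i += 3 {
		codon := strings.ToUpper(sequence[i : i+3])
		letter, ok := translationMap[codon]
		if !ok {
			return "", fmt.Errorf("unknown codon %q at position %d", codon, i)
		}
		aminoAcids.WriteString(letter)
	}
	return aminoAcids.String(), nil
}

// checkWeights returns an error if any codon has fewer than 1 occurrence.
func checkWeights(weights map[string]int) error {
	for codon, weight := range weights {
		if weight < 1 {
			return fmt.Errorf("codon %s has %d occurrences, need at least 1", codon, weight)
		}
	}
	return nil
}
```

Returning an error instead of panicking lets command-line callers report a bad input and continue with the next sequence in a stream.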

Integration to think about:

  • How do we generate codonTable objects from GenBank files? How can we save them as JSON for later use?
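For the JSON half of that question, one possible approach is a plain round trip through Go's standard `encoding/json` package. The struct layout, the JSON key names, and the helper names `encodeCodonTable` and `decodeCodonTable` are all assumptions here; the conversation leaves the actual design to a follow-up issue.

```go
package main

import "encoding/json"

// Codon, AminoAcid, and CodonTable mirror the structure discussed in this
// pull request. Weight and the JSON key names are assumptions.
type Codon struct {
	Triplet string `json:"triplet"`
	Weight  int    `json:"weight"`
}

type AminoAcid struct {
	Letter string  `json:"letter"`
	Codons []Codon `json:"codons"`
}

type CodonTable struct {
	AminoAcids []AminoAcid `json:"amino_acids"`
}

// encodeCodonTable serializes a table to indented JSON.
func encodeCodonTable(table CodonTable) ([]byte, error) {
	return json.MarshalIndent(table, "", "  ")
}

// decodeCodonTable parses JSON produced by encodeCodonTable.
func decodeCodonTable(data []byte) (CodonTable, error) {
	var table CodonTable
	err := json.Unmarshal(data, &table)
	return table, err
}
```

The resulting bytes can be written to disk with `os.WriteFile` and read back with `os.ReadFile`, so an expensive table only has to be computed once.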

@TimothyStiles please review and comment with anything you think I should add into the pull request.

}

// Function to generate default codon tables from NCBI https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
func generateCodonTable(aas, starts string) CodonTable {
Collaborator

What does aas stand for here? Amino acid string?

Contributor Author

It stands for amino acids. The NCBI page at that link names the field "AAs", so I copied their nomenclature (which is also why starts is named starts).
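To make the "AAs" and "Starts" nomenclature concrete, here is a sketch of how the two 64-character strings map onto codons. It assumes the NCBI convention that codons are listed with each base cycling through T, C, A, G, and that an `M` in the starts string marks a start codon. `parseNCBITable` and `codonEntry` are hypothetical names, not the `generateCodonTable` from this diff, and the two constants are a transcription of the standard table (table 1) that should be checked against the linked page.

```go
package main

import "fmt"

// Standard genetic code (NCBI table 1), split into four rows of 16 codons.
// Transcribed by hand; verify against the NCBI page before relying on it.
const (
	standardAAs = "FFLLSSSSYY**CC*W" +
		"LLLLPPPPHHQQRRRR" +
		"IIIMTTTTNNKKSSRR" +
		"VVVVAAAADDEEGGGG"
	standardStarts = "---M------**--*-" +
		"---M------------" +
		"---M------------" +
		"----------------"
)

// codonEntry is one row of the expanded table (hypothetical type).
type codonEntry struct {
	Triplet   string
	AminoAcid byte // '*' marks a stop codon
	IsStart   bool
}

// parseNCBITable expands the AAs and Starts strings into 64 codon entries.
func parseNCBITable(aas, starts string) ([]codonEntry, error) {
	const bases = "TCAG" // assumed NCBI base order
	if len(aas) != 64 || len(starts) != 64 {
		return nil, fmt.Errorf("expected 64 characters each, got %d and %d", len(aas), len(starts))
	}
	entries := make([]codonEntry, 0, 64)
	i := 0
	for _, first := range bases {
		for _, second := range bases {
			for _, third := range bases {
				entries = append(entries, codonEntry{
					Triplet:   string([]rune{first, second, third}),
					AminoAcid: aas[i],
					IsStart:   starts[i] == 'M',
				})
				i++
			}
		}
	}
	return entries, nil
}
```

Calling `parseNCBITable(standardAAs, standardStarts)` yields entries in TTT, TTC, TTA, TTG, ... order, which can then be grouped by amino acid to fill a codon table.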

@TimothyStiles
Collaborator

@Koeng101 do you have any idea on how we would generate codon tables for arbitrary sequences? Are there any examples that we can cite/work off of?

@Koeng101
Contributor Author

@TimothyStiles Generation of codon tables for arbitrary sequences can be done by counting the codon occurrences for each CDS feature in that GenBank file. This is one of the reasons we were working on the "location" feature previously.
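The counting step described in that reply can be sketched as follows. It assumes each CDS has already been extracted as an in-frame string (which is what the "location" work would provide); `countCodons` is a hypothetical helper name.

```go
package main

import "strings"

// countCodons tallies in-frame codon occurrences across coding sequences.
// Each input string is assumed to be one CDS starting in reading frame;
// trailing bases that do not fill a whole codon are ignored.
func countCodons(codingSequences []string) map[string]int {
	counts := make(map[string]int)
	for _, cds := range codingSequences {
		cds = strings.ToUpper(cds)
		for i := 0; i+3 <= len(cds); i += 3 {
			counts[cds[i:i+3]]++
		}
	}
	return counts
}
```

The resulting counts are what would populate the per-codon occurrence numbers in a codon table, and counts from several files can be merged by summing the maps.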

	for _, aminoAcid := range codonTable.AminoAcids {
		for _, codon := range aminoAcid.Codons {
			translationMap[codon.Triplet] = aminoAcid.Letter
		}

Maybe I don't know Go very well, but is this a for loop in a for loop filling a map? >.>

Contributor Author

@Koeng101 Koeng101 left a comment

One function should be split up because AnnotatedSequences can get big!

Comment on lines 53 to 68
// Optimize takes an amino acid sequence and CodonTable and returns an optimized codon sequence
func Optimize(aminoAcids string, annotatedSequence AnnotatedSequence, codonTable CodonTable) string {
	var codons strings.Builder
	var sequenceBuffer strings.Builder
	for _, feature := range annotatedSequence.Features {
		if feature.Type == "CDS" {
			sequenceBuffer.WriteString(feature.getSequence())
		}
	}
	optimizationTable := codonTable.generateOptimizationTable(sequenceBuffer.String())
	for _, aminoAcid := range aminoAcids {
		codons.WriteString(optimizationTable[string(aminoAcid)].Pick().(string))
	}
	return codons.String()
}

Contributor Author

The optimization table should be an input to the Optimize function. AnnotatedSequences may be huge (human chromosomes, for example), and you may need to load multiple files to build the correct optimization table (again, as with multiple human chromosomes). Once the table is generated, you won't need to do that work again.

Ideally, there would be some kind of robust import / export function for JSON codon tables. But at minimum, split this function so that it can be added later.

Contributor Author

@Koeng101 Koeng101 left a comment

Looks good to me. I'll open a GitHub issue for JSON representations of the codon tables. <hold on, found a bug>

@TimothyStiles
Collaborator

Alright, this PR is getting to be a monster. I refactored the way command line commands are tested for easier debugging, and we now have two simple command line utilities that translate and optimize streams of sequence strings. The related library functions are included too.

I'm tired and I'm merging it y'all.

🎉 🎉 🎉

@TimothyStiles TimothyStiles merged commit 7b0947a into bebop:prime Oct 22, 2020