### Study session 6 - Functions and files
#### BIOINF 575 - Fall 2022



##### RECAP

##### IF - Allows us to execute code selectively.

```python
if [not] <condition>:
    <statements>
elif <condition>:
    <statements>
else:
    <statements>
```
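As a small illustration of the template above, the following sketch labels a GC fraction. The variable names and the thresholds are made up for this example and carry no biological meaning.

```python
# Hypothetical value, chosen only for illustration.
gc_fraction = 0.55

if not 0 <= gc_fraction <= 1:
    # The value is outside the valid range for a fraction.
    label = "invalid"
elif gc_fraction < 0.4:
    label = "low GC"
elif gc_fraction > 0.6:
    label = "high GC"
else:
    # None of the conditions above was true.
    label = "medium GC"

print(label)
```

Only the first branch whose condition is true runs; the `else` branch runs when none of them is.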


##### FOR or WHILE - execute code repeatedly without having to repeat the code

```python
for var in sequence:
    statements
```

The loop variable, `var`, takes the value of each element of the iterable in turn.     
Iterable:  An object capable of returning its members one at a time.    
https://docs.python.org/3/glossary.html    
https://docs.python.org/3/library/collections.abc.html?highlight=iterable#collections.abc.Iterable
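
A minimal sketch of both loop types, using made-up data. Strings and lists are both iterables, so a `for` loop can go through either of them; a `while` loop repeats for as long as its condition stays true.

```python
# A string is an iterable: the loop variable takes one character at a time.
sequence = "ATGC"
bases = []
for base in sequence:
    bases.append(base)

# A list is an iterable too: here we add up the lengths of its elements.
codons = ["ATG", "CCA", "TAA"]
total_length = 0
for codon in codons:
    total_length += len(codon)

# A while loop runs until its condition becomes false.
countdown = 3
steps = []
while countdown > 0:
    steps.append(countdown)
    countdown -= 1

print(bases)
print(total_length)
print(steps)
```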


##### FUNCTION - block of code that only runs when called and can be reused without having to copy the code

```python
def function_name(arguments):
    <statements>
    return result

# Call the function:
function_name(values)
```
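
For example, a function following this pattern could compute the GC content of a sequence. The function name `gc_content` is an assumption made for this sketch and is not part of the course material.

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    if not sequence:
        # Avoid dividing by zero for an empty sequence.
        return 0.0
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / len(sequence)

# Call the function with different values; the code is written only once.
print(gc_content("ATGC"))
print(gc_content("GGCC"))
```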

##### FILE - collection of data stored and identified as a unit by the operating system

```python

# Open file and return a stream.  Raise OSError upon failure.
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)


with open(file_name, "r") as file:
    for line in file:
        <statements>
```

The object returned by the `open` function is iterable: a `for` loop over it yields one line of the file at a time.    
Iterable:  An object capable of returning its members one at a time.    
https://docs.python.org/3/glossary.html    
https://docs.python.org/3/library/collections.abc.html?highlight=iterable#collections.abc.Iterable
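
A self-contained sketch of writing and then reading a file line by line. The file name and contents are made up, and the file is created in a temporary directory so that nothing in the working directory is touched.

```python
import os
import tempfile

# Create a small throwaway file (hypothetical name and contents).
tmp_dir = tempfile.mkdtemp()
file_name = os.path.join(tmp_dir, "example_sequences.txt")
with open(file_name, "w") as out_file:
    out_file.write("ATGCCA\nGGGTTT\nTAA\n")

# Read the file back one line at a time.
lines = []
with open(file_name, "r") as in_file:
    for line in in_file:
        # Each line still ends with the newline character, so strip it.
        lines.append(line.rstrip("\n"))

print(lines)
```

The `with` statement closes the file automatically when the block ends, even if an error occurs inside it.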


___
<b><font color = "green">Introductory information</font> <br></b>

"Proteins are the end products of the decoding process that starts with the information in cellular DNA. As workhorses of the cell, proteins compose structural and motor elements in the cell, and they serve as the catalysts for virtually every biochemical reaction that occurs in living things. This incredible array of functions derives from a startlingly simple code that specifies a hugely diverse set of structures."

"The building blocks of proteins are amino acids, which are small organic molecules that consist of an alpha (central) carbon atom linked to an amino group, a carboxyl group, a hydrogen atom, and a variable component called a side chain (see below). Within a protein, multiple amino acids are linked together by peptide bonds, thereby forming a long chain."

"Proteins are built from a set of only twenty amino acids, each of which has a unique side chain."

https://www.nature.com/scitable/topicpage/protein-structure-14122136/

<img src = "https://cdn.pixabay.com/photo/2013/07/12/17/38/dna-152136_1280.png" width = 450>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src = "https://sphweb.bumc.bu.edu/otlt/MPH-Modules/PH/DNA-Genetics/Transcription-TranslationDetails.png" width = 420>


https://cdn.pixabay.com/photo/2013/07/12/17/38/dna-152136_1280.png                    
https://sphweb.bumc.bu.edu/otlt/MPH-Modules/PH/DNA-Genetics/Transcription-TranslationDetails.png


https://ocw.mit.edu/courses/biology/7-01sc-fundamentals-of-biology-fall-2011/molecular-biology/        
https://ocw.mit.edu/courses/health-sciences-and-technology/hst-161-molecular-biology-and-genetics-in-modern-medicine-fall-2007/lecture-notes/     
https://sphweb.bumc.bu.edu/otlt/MPH-Modules/PH/DNA-Genetics/DNA-Genetics3.html





________
_______

<b><font color = "red">Exercise</font> <br></b>


The dictionary `DNA_amino_acids_map` maps each DNA codon (three bases) to its one-letter amino acid code, which allows us to translate a DNA sequence into a sequence of amino acids.
The string `DNA_sequence` contains several DNA sequences separated by spaces. We want to use the mapping in the dictionary to translate each of them into a string of amino acids.
- Write a function that translates a DNA sequence into a peptide (sequence of amino acids).
- Apply the function to each of the sequences in the `DNA_sequence` string.




In [None]:
DNA_amino_acids_map = {'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}

DNA_sequence = "GCCGCGCGT AGGATGCCTCCGCAA CCCCAGCGTGCT AAA AAACTA CGCGCGTAGGATGCCTCCGCAACCCCAGCGTGT"
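
Two building blocks that may help with this exercise are sketched below: splitting a string on whitespace and cutting a sequence into codons. This is one possible approach, not the expected solution; the input string and the small `codon_subset` dictionary are made up so the sketch stays self-contained.

```python
# Small subset of a codon table, repeated here only for illustration.
codon_subset = {"ATG": "M", "CCT": "P", "CCG": "P", "CAA": "Q"}

# Hypothetical input: two sequences separated by a space.
sequences = "ATGCCT CCGCAA".split()

# Cut the first sequence into consecutive, non-overlapping codons.
# Stopping at len(sequence) - 2 skips an incomplete codon at the end.
sequence = sequences[0]
codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]

# Look up the amino acid for the first codon.
first_aa = codon_subset[codons[0]]

print(sequences)
print(codons)
print(first_aa)
```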



___

<b> <font color = "red">Exercise</font></b>


##### Processing the worm genome file
You will extract the gene and mRNA information from the C. elegans genome and write it to new files.   

The GFF3 file is `worm_genome_short.gff3` and is also available in the GitHub repository (you should have it in the study session if you updated the repo). The GFF3 format is described at:
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md<br> 


After the comment and header lines (marked by "#"), each line in a GFF3 file (a row of a table) is composed of 9 tab-delimited fields (columns). The first 8 fields are atomic (each consists of only one element), so they can be stored directly in a dictionary of features. The dictionary needs a key for each line; one option is an integer that you generate as you read the file and add data to the dictionary.

The ninth field consists of tag-value pairs. **The tag-value pairs are separated by semicolons, ";". The tag and the value in a pair are separated by an equals sign, "=", and a value may consist of multiple comma-separated (",") entries.** 

From the definition of the GFF3 we have these fields:   
`seqid, source, type, start, end, score, strand, phase, attributes`

The type field has information about the type of the genomic feature.    
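
The field structure described above can be sketched as follows. The example line is invented for illustration and is not taken from `worm_genome_short.gff3`; the variable names are assumptions as well.

```python
# A made-up GFF3 feature line: 9 tab-delimited fields.
line = "I\tExampleSource\tgene\t100\t900\t.\t+\t.\tID=gene:example1;Name=example1;Alias=ex-1,ex-2\n"

field_names = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]

# Remove the trailing newline, then split on tabs.
fields = line.rstrip("\n").split("\t")
record = dict(zip(field_names, fields))

# Parse the ninth field: pairs split on ";", tag and value on "=",
# and multiple entries of a value on ",".
attributes = {}
for pair in record["attributes"].split(";"):
    if pair:  # skip empty pieces, e.g. after a trailing semicolon
        tag, value = pair.split("=", 1)
        attributes[tag] = value.split(",")

print(record["type"])
print(attributes)
```

Storing every value as a list keeps single and multiple entries uniform, at the cost of an extra `[0]` when only one entry is expected.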

* Process each line from the file that has a feature and extract the name from the lines that have the type gene.    
* Write those names in the file `worm_genes_short.txt`    
* Use a function to process a line and return the line if it meets the criteria or None otherwise.

* Update your analysis so that it also writes the ID of each mRNA to the file `worm_mRNAs_short.txt`



___
<b><font color = "red">Exercise</font> <br><br></b>
##### Gene regulatory network

"Formally speaking, a gene regulatory network or genetic regulatory network (GRN) is a collection of DNA segments in a cell which interact with each other (indirectly through their RNA and protein expression products) and with other substances in the cell, thereby governing the rates at which genes in the network are transcribed into mRNA. In general, each mRNA molecule goes on to make a specific protein (or set of proteins)."  
https://link.springer.com/referenceworkentry/10.1007%2F978-1-4419-9863-7_364

<img src = "https://media.springernature.com/original/springer-static/image/prt%3A978-1-4419-9863-7%2F7/MediaObjects/978-1-4419-9863-7_7_Part_Fig1-364_HTML.gif" width = 200/>




We have a gene regulatory network represented as a dictionary where the key is a gene from the network above and the value is a tuple of the genes the key gene directly regulates (through orange links only).    

- Write a function that uses the dictionary and computes and returns the number of genes that a given gene's expression is regulated by. 
    - In other words, the function would compute the number of incoming orange edges that the gene has in the above network.<BR><BR>

- Test the function for at least three cases.




In [None]:
network = {'Gene1': ('Gene2',),
 'Gene2': ('Gene6',),
 'Gene3': ('Gene1', 'Gene5'),
 'Gene4': ('Gene2',),
 'Gene5': ('Gene1',),
 'Gene6': ('Gene1',)}

In [None]:
network
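
To get familiar with the data structure, the sketch below counts edges in the opposite direction from the one the exercise asks for: the number of genes that a given gene directly regulates (outgoing edges). The function name `count_regulated_genes` is an assumption made for this example. Counting incoming edges would instead require looking through the values of every key.

```python
network = {'Gene1': ('Gene2',),
           'Gene2': ('Gene6',),
           'Gene3': ('Gene1', 'Gene5'),
           'Gene4': ('Gene2',),
           'Gene5': ('Gene1',),
           'Gene6': ('Gene1',)}

def count_regulated_genes(network, gene):
    """Return how many genes `gene` directly regulates (outgoing edges)."""
    # A gene that is not a key in the dictionary regulates nothing.
    return len(network.get(gene, ()))

print(count_regulated_genes(network, 'Gene3'))
print(count_regulated_genes(network, 'Gene1'))
print(count_regulated_genes(network, 'GeneX'))
```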

___
<b><font color = "red">Exercise</font> <br><br></b>

Explain the following code:



In [None]:
amino_acid_code = { 
    "P": ("CCA", "CCC", "CCT", "CCG"),
    "Q": ("CAA", "CAG"),
    "M": ("ATG",)
}

p = "MMPQM"
x = set()
s = ""
for aa in p:
    s = s + amino_acid_code.get(aa,("",))[0]
print(s)

_______