# Problem set 9
### We will utilize basic Python functionality to handle common tasks in bioinformatics
***
## Rules
* It is OK to search for template code online, but please don't just copy and paste it. **Try to rewrite it based on your understanding.**
* For each question that asks you to write a function, please provide **at least three examples** to show that your function works. When selecting examples, consider what the code does and try to pick examples that illustrate distinct cases.
***

## Q1
**Scenario**
* You receive a gene expression table that was manually typed in by another member of your lab
* The table was supposed to contain only numbers (which represent the expression levels of the genes)
* However, when you opened it, you found several types of errors:
  * Some values are strings that were intended for human readers (like `"not measured"` or `"cannot find data"`)
  * Some values are numbers that were stored as strings because of extra spaces (like `"  911.03 "`)

**Task**
* Write a function `clean_gene_expression()` that receives a `list` and returns a new `list` that contains all numerical data that can be salvaged from the input.
* For example, if the input is `["not measured", 0.123, "  2.381"]`, the output should be `[0.123, 2.381]`
* You can assume that all non-number entries in the input list follow either one of the two patterns described above 
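
The salvage step can be sketched as follows. This is only one possible approach, not the expected answer: it assumes that every salvageable string is something the built-in `float()` can parse, and it keeps entries that are already numbers unchanged.

```python
def clean_gene_expression(values):
    """Return the numeric entries of `values`, skipping text that is not a number."""
    cleaned = []
    for value in values:
        if isinstance(value, (int, float)):
            # Already a number: keep it as-is.
            cleaned.append(value)
            continue
        try:
            # float() ignores leading and trailing whitespace, so "  2.381" parses.
            cleaned.append(float(value))
        except ValueError:
            # Text such as "not measured" cannot be parsed and is dropped.
            pass
    return cleaned


print(clean_gene_expression(["not measured", 0.123, "  2.381"]))  # [0.123, 2.381]
```

Note that `float()` also accepts strings such as `"nan"` or `"inf"`; whether those should count as salvageable is a decision the sketch does not make.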

## Q2
**Scenario**
* You want to analyze a list of DNA sequences using a bioinformatics tool
* However, this tool requires that you convert the DNA sequences into RNA sequences (what a lazy programmer!)
* Furthermore, this tool will not consider the possibility that some of your input sequences were from the reverse strand. So you will have to generate the reverse complement sequences and include them as part of the input

**Task**
* Write a function `create_input_for_lazy_tool()` that receives a `list of DNA sequences` and returns a new `list of string` that contains both the converted RNA sequences and their reverse complements (also as RNA)
* For example, if the input is `["ACTG", "GGGG"]`, the output should be `["ACUG", "GGGG", "CAGU", "CCCC"]` (or any other arrangement of these 4 entries)
* The ordering of the entries in the output does not matter
* You can assume that the input DNA sequences contain only A, C, G, and T
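
A minimal sketch of the two conversions is shown below. The lookup table `DNA_TO_RNA_COMPLEMENT` is a hypothetical helper introduced here for illustration; the sketch assumes uppercase input containing only A, C, G, and T.

```python
# Hypothetical helper table: DNA base -> complementary RNA base.
DNA_TO_RNA_COMPLEMENT = {"A": "U", "C": "G", "G": "C", "T": "A"}


def create_input_for_lazy_tool(dna_sequences):
    """Return the RNA version of each sequence followed by the RNA reverse complements."""
    # Forward strand: transcription only swaps T for U.
    rna_forward = [seq.replace("T", "U") for seq in dna_sequences]
    # Reverse strand: read the sequence backwards and complement each base.
    rna_reverse = [
        "".join(DNA_TO_RNA_COMPLEMENT[base] for base in reversed(seq))
        for seq in dna_sequences
    ]
    return rna_forward + rna_reverse


print(create_input_for_lazy_tool(["ACTG", "GGGG"]))  # ['ACUG', 'GGGG', 'CAGU', 'CCCC']
```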

## Q3
**Scenario**
* A collaborator gave you a list of Ensembl accession IDs, some of which do not follow the expected Ensembl human gene ID format
* You want to flag the non-gene entries so that your collaborator can go back and double-check their data

**Task**
* Write a function `flag_nongene_accession()` that receives a `list of string` and returns a new `list of string` that contains only the entries that fail to match the expected Ensembl accession format for human genes
* Study the Ensembl accession format for human genes, like `"ENSG00000139618"`, carefully and define the rules that must be followed
* For example, if the input is `["ENSG00000123162", "ENST00000123162", "1231271"]`, the output should be `["ENST00000123162", "1231271"]`
* The ordering of the entries in the output does not matter
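
One way to express such a rule is a regular expression. The sketch below assumes the rule is "the prefix `ENSG` followed by exactly 11 digits", which is inferred from the example ID `"ENSG00000139618"`; it also assumes that IDs carry no version suffix. Adjust the pattern to whatever rules you define.

```python
import re

# Assumed rule: "ENSG" followed by exactly 11 digits, with nothing before or after.
ENSEMBL_HUMAN_GENE = re.compile(r"ENSG[0-9]{11}")


def flag_nongene_accession(accessions):
    """Return the entries that do NOT match the assumed human gene ID pattern."""
    return [
        accession
        for accession in accessions
        if ENSEMBL_HUMAN_GENE.fullmatch(accession) is None
    ]


print(flag_nongene_accession(["ENSG00000139618", "ENST00000380152", "1231271"]))
# ['ENST00000380152', '1231271']
```

`fullmatch()` is used rather than `match()` so that an ID with extra trailing characters is also flagged.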

## Q4
**Task**
* Write a function `count_dimer_GC()` that receives a `DNA sequence` and returns a `dictionary` that contains the frequencies of all nucleotide dimers found in the input sequence (i.e., overlapping pairs of adjacent nucleotides such as `AA`, `AC`, etc.) as well as `the GC content` of the input sequence, expressed as a fraction between 0 and 1
* For example, if the input is `"AAACTGCT"`, the output should be a dictionary `{"AA": 2, "AC": 1, "CT": 2, "TG": 1, "GC": 1}` and a number `0.375`
* You can assume that the input DNA sequence contains only A, C, G, and T
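
The counting can be sketched as below. Two choices here are assumptions rather than requirements: the two results are returned together as a tuple, and an empty sequence is given a GC content of `0.0`.

```python
def count_dimer_GC(sequence):
    """Return (dimer counts, GC fraction) for a DNA sequence."""
    dimer_counts = {}
    # Slide a window of width 2 along the sequence, one base at a time.
    for i in range(len(sequence) - 1):
        dimer = sequence[i:i + 2]
        dimer_counts[dimer] = dimer_counts.get(dimer, 0) + 1
    if sequence:
        gc_content = (sequence.count("G") + sequence.count("C")) / len(sequence)
    else:
        gc_content = 0.0  # assumed convention for an empty sequence
    return dimer_counts, gc_content


print(count_dimer_GC("AAACTGCT"))
# ({'AA': 2, 'AC': 1, 'CT': 2, 'TG': 1, 'GC': 1}, 0.375)
```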

## Q5
**Scenario**
* You want to cluster DNA sequences from a shotgun metagenomics study based on GC content and dimer profiles (i.e., binning)
* The first step is to be able to compute the difference in dimer profiles between two DNA sequences

**Task**
* Write a function `dimer_difference()` that receives `two dictionaries of dimer frequencies` and returns the difference between the two dictionaries. Here, the difference is the sum, over all dimers, of the absolute difference between the two frequencies; a dimer that is missing from a dictionary has a frequency of 0.
* For example, the difference between `{"AA": 2, "AC": 1, "CT": 2, "TG": 1, "GC": 1}` and `{"AA": 1, "AC": 2, "CT": 3, "TG": 1}` should be `4` because the frequencies of `"AA"`, `"AC"`, `"CT"`, and `"GC"` differ by 1 each.
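
The example above can be reproduced with a short sketch. It assumes that a dimer absent from one dictionary counts as frequency 0, which is what makes `"GC"` contribute 1 in the example.

```python
def dimer_difference(freq_a, freq_b):
    """Sum of absolute frequency differences over every dimer seen in either input."""
    all_dimers = set(freq_a) | set(freq_b)
    # dict.get(dimer, 0) treats a missing dimer as frequency 0.
    return sum(abs(freq_a.get(dimer, 0) - freq_b.get(dimer, 0)) for dimer in all_dimers)


first = {"AA": 2, "AC": 1, "CT": 2, "TG": 1, "GC": 1}
second = {"AA": 1, "AC": 2, "CT": 3, "TG": 1}
print(dimer_difference(first, second))  # 4
```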

## [Extra] Q6
**Task**
* Write a function `is_binnable()` that receives `two DNA sequences` of the same length and returns a boolean (`True` or `False`) indicating whether the two DNA sequences could be grouped for binning under the following criteria:
  * The difference in dimer frequencies (as defined in **Q5**) should be less than 25% of the length of the DNA sequences
  * The absolute difference in GC content should be less than 0.2
* The function should return `True` if both criteria are met, and return `False` otherwise
* Use your functions from **Q4** and **Q5** to help
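
To keep the following sketch self-contained, it uses two small stand-in helpers (`_dimer_counts` and `_gc_content`, both hypothetical names) in place of your own **Q4** and **Q5** functions. It assumes non-empty sequences and treats both thresholds as strict inequalities.

```python
from collections import Counter


def _dimer_counts(sequence):
    # Stand-in for the dimer part of Q4: count overlapping dimers.
    return Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))


def _gc_content(sequence):
    # Stand-in for the GC part of Q4: fraction of G and C bases.
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def is_binnable(seq_a, seq_b):
    """Return True if both binning criteria are met for two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    counts_a, counts_b = _dimer_counts(seq_a), _dimer_counts(seq_b)
    # A Counter returns 0 for a missing dimer, matching the Q5 definition.
    difference = sum(
        abs(counts_a[dimer] - counts_b[dimer]) for dimer in set(counts_a) | set(counts_b)
    )
    dimers_close = difference < 0.25 * len(seq_a)
    gc_close = abs(_gc_content(seq_a) - _gc_content(seq_b)) < 0.2
    return dimers_close and gc_close


print(is_binnable("AAACTGCT", "AAACTGCT"))  # True
print(is_binnable("AAAAAAAA", "GGGGGGGG"))  # False
```

A boundary case worth testing: `"AAACTGCT"` and `"AAACTGCA"` have a dimer difference of exactly 2, which is not less than 25% of 8, so the sketch returns `False`.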

## For Q7-Q9, you will need to read files into Python
Below is a template that will let you read the content of `filename.ext`, one line at a time, into a variable `line`

In [None]:
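with open('filename.ext') as fin:
    # Iterating over the file object yields one line at a time
    for line in fin:
        # Each line keeps its trailing newline, so strip it before printing
        print(line.rstrip('\n'))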

## Q7
**Task**
* Read [Surface-glycoprotein.fasta](https://github.com/cmb-chula/comp-biol-3000788/blob/main/problem-sets/Surface-glycoprotein.fasta) and output the following:
  * The number of DNA sequences in the file
  * The average length of DNA sequences in the file
* Don't use Biopython or any other module for reading FASTA files. Write your own code.
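
As a starting point, here is a sketch of how FASTA records could be tallied by hand. The function name `summarize_fasta` and the tiny `example_lines` input are made up for illustration; the sketch assumes that each record starts with a `>` header line and that a sequence may span several lines.

```python
def summarize_fasta(lines):
    """Return (number of sequences, average sequence length) from FASTA lines."""
    lengths = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            lengths.append(0)  # a header starts a new record
        elif lengths:
            lengths[-1] += len(line)  # sequence lines extend the current record
    count = len(lengths)
    average = sum(lengths) / count if count else 0.0
    return count, average


# Made-up input that mimics what iterating over an open FASTA file yields.
example_lines = [">seq1 demo\n", "ATGC\n", "ATG\n", ">seq2 demo\n", "ATGAAATAG\n"]
print(summarize_fasta(example_lines))  # (2, 8.0)
```

Because the function only needs an iterable of lines, an open file object (such as `fin` in the template above) can be passed to it directly.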

## Q8
**Scenario**
* You want to be sure that all DNA sequences in [Surface-glycoprotein.fasta](https://github.com/cmb-chula/comp-biol-3000788/blob/main/problem-sets/Surface-glycoprotein.fasta) are valid protein-coding sequences

**Task**
* Given an input DNA sequence, how would you check whether it is a valid protein-coding sequence? In other words, what are the criteria that the input DNA sequence must satisfy?

## Q9
**Scenario**
* You want to be sure that all DNA sequences in [Surface-glycoprotein.fasta](https://github.com/cmb-chula/comp-biol-3000788/blob/main/problem-sets/Surface-glycoprotein.fasta) are valid protein-coding sequences

**Task**
* Write a function `is_protein_coding()` that receives a `DNA sequence` as input and outputs a boolean (`True` or `False`) that indicates whether the input is a valid protein-coding sequence
* This function should follow the criteria that you formulated in **Q8**
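
The sketch below shows how a set of criteria can be turned into code. The criteria used here are only one possible choice under the standard genetic code, not the expected answer: the sequence contains only A, C, G, and T, its length is a multiple of 3, it starts with `ATG`, it ends with a stop codon, and it has no in-frame stop codon before the end. Your own criteria from **Q8** may differ.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def is_protein_coding(sequence):
    """Check a DNA sequence against the assumed criteria listed above."""
    sequence = sequence.upper()
    # At least a start and a stop codon, and a whole number of codons.
    if len(sequence) < 6 or len(sequence) % 3 != 0:
        return False
    # Only the four unambiguous DNA letters are allowed.
    if set(sequence) - set("ACGT"):
        return False
    codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # No premature stop codon in the reading frame.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


print(is_protein_coding("ATGAAATAG"))     # True
print(is_protein_coding("ATGAAATA"))      # False (length is not a multiple of 3)
print(is_protein_coding("ATGTAGAAATAG"))  # False (in-frame stop codon)
```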

## Q10
**Scenario**
* We will extend the number guessing function that we implemented in class into a more realistic game
* The game will keep asking the player to guess a number until the correct answer is entered (or the player gives up)
* A hint, whether the guess is too high or too low, will be displayed
* We will use the built-in `input()` function to receive inputs from the player

**Task**
* Some parts of the code have been provided below. Fill in the missing code to make the game work as intended

In [None]:
hidden_number = 123
game_message = 'Guess a number from 1-999. Or enter `Q` to give up: '

guess = input(game_message)

while guess != str(hidden_number):
    ## complete the missing code: handle `Q` and show a too-high / too-low hint

    guess = input(game_message)
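
The decision logic for a single guess can be kept in a separate function so that it can be tested without typing into `input()`. The sketch below is one possible design; the function name `hint_for_guess` and its return strings are hypothetical, and you still need to wire it into the loop yourself.

```python
def hint_for_guess(guess, hidden_number):
    """Classify one raw input string from the player."""
    text = guess.strip()
    if text.upper() == "Q":
        return "quit"
    try:
        number = int(text)
    except ValueError:
        # Anything that is neither Q nor an integer is rejected.
        return "invalid"
    if number > hidden_number:
        return "too high"
    if number < hidden_number:
        return "too low"
    return "correct"


print(hint_for_guess("500", 123))  # too high
print(hint_for_guess("7", 123))    # too low
print(hint_for_guess("q", 123))    # quit
```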