# Challenge: DNA matching strand

Write a function that finds the matching (complementary) strand of a DNA sequence for *Saccharomyces cerevisiae* contained in the `dna_sequence.txt` file in the Datasets folder. This is an actual DNA sequence, so you will need to handle the extra header lines at the top of the file using Python. For this challenge I recommend sticking to core Python, e.g. `open('dna_sequence.txt').read().split('\n')`. You can try Pandas if you want, but some steps might not be obvious at this point.
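
As a rough sketch of the read-and-skip-header step, the snippet below writes a small made-up file and then reads it back. The file name `sample_sequence.txt`, its header text, and the header length of three lines are all assumptions for illustration. The real `dna_sequence.txt` will have its own header, so inspect it before choosing the slice index.

```python
import os
import tempfile

# Hypothetical sample file: three header lines followed by two sequence lines.
# The real dna_sequence.txt may use a different number of header lines.
sample_text = (
    "LOCUS       sample\n"
    "DEFINITION  made-up header for illustration\n"
    "ORIGIN\n"
    "GATCC\n"
    "TCCAT\n"
)

with tempfile.TemporaryDirectory() as folder:
    file_path = os.path.join(folder, "sample_sequence.txt")
    with open(file_path, "w") as f:
        f.write(sample_text)

    # Read the whole file and split it into a list of lines
    with open(file_path) as f:
        lines = f.read().split("\n")

n_header_lines = 3                       # assumption for this sample only
dna = "".join(lines[n_header_lines:])    # join the sequence lines into one string
print(dna)
```

Joining with an empty string also absorbs the empty element that `split('\n')` leaves behind when the file ends with a newline.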

>Do not alter the file by hand. Doing so defeats the entire purpose of using code to generate reproducible research.

Use your knowledge of basic biology to match the DNA bases (i.e. A-T and C-G). For this exercise ignore potential base duplicates (e.g. G-G).
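
One possible way to write down the pairing rules is a dictionary used as a lookup table. This is only a sketch; the name `base_pairs` is an assumption and is not required by the assignment.

```python
# Base pairs stored as a lookup table: each key maps to its partner base
base_pairs = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

print(base_pairs['G'])      # partner of a single base
print('U' in base_pairs)    # membership test for a base that is not in the table
```

The membership test is a simple way to detect bases that do not belong to DNA before trying to look them up.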

## Submitting assignment

**Option 1**. Upload the Jupyter Lab notebook directly using Canvas (preferred)

**Option 2**. Send me the Jupyter Lab notebook via email (in case option 1 does not work).
File name must be your first and last name, e.g. andrespatrignani.ipynb

**Deadline is Monday, February 25 at 11:59 PM**


## Notebook format

For this challenge you need to submit your code in four different cells:

**Cell 1**: Import modules

**Cell 2**: Navigate to directory, load file, handle extra header lines in file

**Cell 3**: Define `dnamatching` function

**Cell 4**: Call the function and store function output into a variable. Avoid printing the output string in the final version since the DNA strand is long and will make the file hard to read.

> Avoid navigating the directory and loading the file within the function. This way you will be able to use the function with files in different folders. Functions should be short and do something very specific.

 
## Rules

- Function name must be `dnamatching`

- The function must accept a single, long `string` of DNA bases.

- The function must return the matching DNA strand as a single, long `string` of DNA bases.

- The function must not print the string

- If you use a loop, you cannot use the `range` function

- If the function finds an unknown base (e.g. U for Uracil), then the function must return a message and the unknown base. For instance, the message could look like this: `Unknown base found: U`. It's important to stop the function at this point and prevent execution of subsequent lines of code or loop iterations.

- Function must have the following documentation:
    - A brief description of the purpose of the function (20 words or less)
    - A brief description about the format of the input variable
    - Author's full name
    - Date of creation
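
The loop, early-exit, and documentation rules above can be sketched on a small helper. The function `find_unknown_base` is hypothetical and is not the function you need to submit; the author and date fields are placeholders to replace with your own.

```python
def find_unknown_base(strand):
    '''
    Return the first character that is not A, T, C, or G (None if all are valid).
    Input: string of DNA bases, e.g. 'GATC'
    Author: Your Full Name
    Date: DD-Mon-YYYY
    '''
    for base in strand:          # iterate over the characters directly, no range()
        if base not in 'ATCG':
            return base          # return ends the loop and the function right here
    return None                  # reached only if every base was valid


print(find_unknown_base('GATCCTCCAT'))
print(find_unknown_base('GAUC'))
```

Because `return` exits the function immediately, no later loop iterations or lines of code run once an unknown base is found.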


## Example

Given the following input and function call:

```python
dna = 'GATCCTCCAT'
dna_matching_strand = dnamatching(dna)
```

The result should be:

    CTAGGAGGTA


where 

```python
print(type(dna_matching_strand))
``` 

must print `<class 'str'>`, i.e. the result must be a Python string (`str`).

## Skills

- Directory navigation
- Read text file and handle headers
- Handle strings
- For loop
- If statement
- Define and use function


```python
# Import modules
import os
```

```python
# Load and handle file
dataset_dir = '/Users/andrespatrignani/Dropbox/Teaching/Scientific programming/introcoding-spring-2019/Datasets/'
os.chdir(dataset_dir)
dna = open('dna_sequence.txt').read().split('\n')
dna = ''.join(dna[8:])  # drop the first 8 (header) lines and join the rest into one string
```

```python
# Define function
def dnamatching(strand):
    '''
    Function that finds the matching strand of a sequence of DNA bases
    Input: string of DNA bases
    Author: Andres Patrignani
    Date: 22-Feb-2019
    '''
    matching_strand = ''
    for base in strand:
        if base == 'A':
            matching_strand += 'T'
        elif base == 'T':
            matching_strand += 'A'
        elif base == 'C':
            matching_strand += 'G'
        elif base == 'G':
            matching_strand += 'C'
        else:
            # Return the message (instead of printing it) and stop the function here
            return 'Unknown base found: ' + base

    return matching_strand
```
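
As an alternative to the `if`/`elif` chain, the same pairing can be sketched with Python's built-in `str.maketrans` and `str.translate`. This is not the reference solution: the names `COMPLEMENT_TABLE` and `dnamatching_translate` are hypothetical, and the validation loop is kept so that unknown bases are still reported.

```python
# Translation table built once: A->T, T->A, C->G, G->C
COMPLEMENT_TABLE = str.maketrans('ATCG', 'TAGC')


def dnamatching_translate(strand):
    '''
    Hypothetical variant of dnamatching based on str.translate.
    Input: string of DNA bases
    '''
    for base in strand:
        if base not in 'ATCG':
            # Stop at the first unknown base and return a message
            return 'Unknown base found: ' + base
    return strand.translate(COMPLEMENT_TABLE)


print(dnamatching_translate('GATCCTCCAT'))
print(dnamatching_translate('GAUC'))
```

For the example input `'GATCCTCCAT'` this gives `CTAGGAGGTA`, the same result shown in the example section.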

```python
# Call function
dna_matching_strand = dnamatching(dna)

print(dna_matching_strand)
print(type(dna_matching_strand))
```

CTAGGAGGTATATGTTGCCATAGAGGTGGAGTCCAAATCTAGAGTTGTTGCCTTGGTAACGGCTGTACTCTGTCAATCCATAGCAGCTCTCAATGTTCGATTTTGCTCGTCATCAGTCGAGACGTAGACTTCGGCGACTTCAAGATGATTCCCACCTATTGTAGTAGGCACGTTCTGGTTCTTGGCGGTTATCTGTTGTATACATTGTATAAATCCTATATGGAGCTTTTATTATTTGGCGGTGTGACAGTAATAATATTAATCTTTGTCTTGCGTTTTTAATAGGTGATATATTAAGTTTCTGCGCTTTTTTTTTCTTGTTGCGCAGTATCTTGAAAACCGTTAAGCGCAGTGTTTATTTAAAACCGTTGAATACAAAGGAGAAGCTCGTCATGAGCTCGGGACAGAGTTCTTACATTATTATGGGTAGCATCCATACCAATTTCTATCGTAGAGGTGTTGGAGTTTCGAGGAACGGCTCTCAGCGGGAGGAAACAGCTCATTAAAAGTGAAAAGTATACTCTTGAATAAAAGAATAAGAAATGAGAGTGTAGGACATCACTAACTGTGACGTTGTCGGTGGTAGTGATCTTCTTGTCTTGTTAATGAATTATCTTTTTAATATAGAAGGAGCTTTGCTAAAGGACGAAGGTTGTAGATGCATATAGTTCTTCGTAAGTGAATGGTACTGTGTCGAAGTCTAAAGTAATAACGACTGTCGATGATATAGTGATGAGGTAGATCATCACCGGTGCGGGATACTCCGTATAGGATAGCCTTTTGTTATGGGGGGTCACCGTTCTCAGTTACTTAGCAAATGTAAAGTTTAAAGGTTACTATGGATATTTAGCAGACATCTGTTCTGTCGAGTTTATTGTATGTTAACGAAGCTGAATGGCTCGACCGAAAGCAAACTGAGATCAAGATCTTGCAAGAGTCCACTTGGAAGAAGACTGAATGATAGACTACGCTTGTGGTGCAACATAAAGTTACATTATGA