# Lab 16: Motif Finding

**BACKGROUND:** You identified a human gene called **JUNB** that is relevant to your research. You now want to see how the gene is regulated by identifying transcription factors (TFs) that regulate the gene. You also want to identify any domains in its protein sequence.

### TASK 1 Objectives:

*Retrieve the promoter sequence and identify potential TF binding sites.*

**A.** Extract the promoter sequence from the gene.
1. Go to the [UCSC Table Browser](https://genome.ucsc.edu/cgi-bin/hgTables)
2. In the row that starts with the word "**clade**", choose "**human**" from the "**genome**" field
3. In the row starting with the word "**region**", enter **JUNB** in the *position* field, and click on the *Lookup* button
4. Click on the top link that references **JUNB**: 
   `JUNB (uc002mvc.4) at chr19:12791496-12793315 - Homo sapiens jun B proto-oncogene (JUNB), mRNA. (from RefSeq NM_002229)` -- this will bring you back to the Table Browser page.
5. When you are back on the Table Browser page, select the "position" radio button in the "**region**" row, and select *sequence* from the dropdown in the row starting with "**output format**"
6. Click the **get output** button, choose *genomic*, then press the **submit** button
7. Check the **Promoter/Upstream** checkbox, and enter 500 in the text box
8. Uncheck all other boxes and press the **get sequence** button
9. Copy and paste the FASTA result into a file for later use.
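
Once the FASTA result is saved, a quick sanity check is to parse it and confirm that the sequence length matches the number of upstream bases you requested. The sketch below is a minimal illustration under stated assumptions: `read_fasta` is a hypothetical helper (not part of any library), and `toy_fasta` is made-up text standing in for the contents of your saved file.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith('>'):
            header = line[1:]             # drop the leading '>'
            records[header] = ''
        elif header is not None:
            records[header] += line       # join wrapped sequence lines
    return records

# Made-up example; replace with the text read from your own FASTA file.
toy_fasta = """>toy_promoter made-up example
ACGTACGTAC
GGCC
"""

for header, seq in read_fasta(toy_fasta).items():
    print(header, len(seq))
```

For the real promoter file, the printed length should equal the value entered in the **Promoter/Upstream** text box.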

**B.** Look for transcription factor binding site (TFBS) motifs.
1. Go to the **POSSUM** web page: http://zlab.bu.edu/~mfrith/possum/
2. Paste the promoter sequence in the '**Enter DNA sequences**' box.
3. Select some/all of the cis-elements. *(Note: POSSUM won’t run unless you select cis-elements to search for.)*
4. Click **Go!**
5. POSSUM will display the cis-elements, the locations, the surrounding sequence, and the scores. 

**Question:** 
>What is the most common cis-element in the sequence?

### TASK 2 Objectives:

*Find motifs in protein sequences without looking at the promoter.*

**A.** Retrieve the protein sequence for **JUNB_HUMAN**. Search for it at: http://www.uniprot.org/

**B.** Save the sequence in a file; you will need it for **Task 3**. *Hint:* You can create new text files in Jupyter by choosing `New -> Text File` in the file browser.

**C.** Search for domains:
1. Go to the [HMMER web page](https://www.ebi.ac.uk/Tools/hmmer/search/hmmscan)
2. Paste the JUNB protein sequence from above into the first box
3. Click '**Submit**' to search for protein motifs in your protein sequence.
4. HMMER will display the protein sequence and the locations of significant protein motif hits.

**Question:**
>Are the results surprising? 

>Are there multiple protein motifs?

### TASK 3 Objectives:
    
**BACKGROUND**:
A fast way to search for leucine zipper domains in a protein sequence is to check whether it contains a short pattern with an “L” every 7 amino acids: an L followed by 6 other amino acids, then another L followed by 6 other amino acids, then another L, and so on. Use your Python skills to take a sequence and identify whether that pattern exists.

**Note**:
Regular expressions are very powerful and can get complicated. This is a very basic tutorial. For more in-depth knowledge, go to http://docs.python.org/library/re.html. You will use the regular expressions library for this exercise, but first you will practice basic expressions and syntax. A regular expression is a consensus-sequence-like pattern. Here are two example regular expressions:

| Regular Expression | Description | Matching sequences |
|-|-|-|
| ```motif = 'A[CT]G[AG]'``` |Position 1: A<br />Position 2: C or T<br />Position 3: G<br />Position 4: A or G | ACGA<br />ATGA<br />ACGG<br />ATGG |
| ```motif = 'A.G.'``` | Position 1: A<br />Position 2: A or C or T or G<br />Position 3: G<br />Position 4: A or C or T or G | AAGA, AAGT, AAGC, AAGG,<br />ATGA, ATGT, ATGC, ATGG,<br />ACGA, ACGT, ACGC, ACGG,<br />AGGA, AGGT, AGGC, AGGG |
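
The "Matching sequences" column can be reproduced by testing every 4-letter DNA string against each motif. This is a small sketch for illustration only; `matching_sequences` is a hypothetical helper name, and it assumes the alphabet is limited to A, C, G, and T.

```python
import itertools
import re

def matching_sequences(motif, length=4, alphabet="ACGT"):
    """Return every string of the given length over the alphabet that the motif matches in full."""
    candidates = ("".join(p) for p in itertools.product(alphabet, repeat=length))
    return [s for s in candidates if re.fullmatch(motif, s)]

print(matching_sequences("A[CT]G[AG]"))   # ['ACGA', 'ACGG', 'ATGA', 'ATGG']
print(len(matching_sequences("A.G.")))    # 16
```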

A dot (.) is a wildcard that stands for any character at a single position. Another useful
symbol often used with a dot is the asterisk (\*). A \* means zero or more repeats of whatever comes directly before it, so .\* matches any run of characters and
[ACGT]* matches any run of A, C, G, or T. The syntax in a Python script is:

```python
import re
sequence = 'TACACGTATAC'
motif = 'A.G.'
result = re.search(motif, sequence)
```

If the pattern is not found, result will be ‘None’. Search for ‘XYZ’ in the above sequence and print the result to see what happens when a pattern is not found. 

You can get the start and end positions of the motif match:

```python
start = result.start()
end = result.end()
```

To extract the exact motif in the sequence, you can slice it out:

```python
sequence[start:end]
```
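
Putting these pieces together, the sketch below searches for a motif, guards against the `None` case, and slices out the matched text. It is one possible arrangement rather than a required form; `find_motif` is a hypothetical helper name, and `'GGGG'` is simply a pattern that happens not to occur in this sequence.

```python
import re

sequence = 'TACACGTATAC'

def find_motif(motif, sequence):
    """Return (start, end, matched_text) for the first match, or None if there is no match."""
    result = re.search(motif, sequence)
    if result is None:
        return None                       # calling .start() on None would raise an error
    start, end = result.start(), result.end()
    return start, end, sequence[start:end]

print(find_motif('A.G.', sequence))   # (3, 7, 'ACGT')
print(find_motif('GGGG', sequence))   # None
```

Note that `start` is 0-based and `end` is one past the last matched character, which is why `sequence[start:end]` returns exactly the motif.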

### TASK 3 Objectives:

1. Write Python code below that operates on the sequence file `junb.fasta`.
2. Use regular expressions. To do this, you must first ```import re```. This needs to be done at the very beginning of your code.
3. Read the sequence from the file and save it in a variable, e.g. `sequence`.
4. Create your motif variable for leucine zippers using dots (.). Hint: a motif with a lysine (K) repeated twice with two amino acids in the middle would be: ```motif = 'K..K'```. Remember that the leucine zipper pattern is an L followed by 6 other amino acids, then another L followed by 6 other amino acids, then another L, and so on. Note: K is for lysine, L is for leucine!
5. Use ```re.search``` to check whether the leucine zipper pattern occurs in the sequence. See the syntax above.
6. Print the location and motif within the sequence. How does your Python script result compare with the online tool result?
7. Upload your Jupyter notebook to Blackboard for lab credit.
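
Counting dots by hand is error-prone, so one option is to build the dotted pattern with ordinary string operations. The sketch below uses the `K..K` hint on a made-up peptide; `spaced_motif` and `toy_protein` are hypothetical names chosen for illustration, and the peptide is not JUNB.

```python
import re

def spaced_motif(residue, gap, repeats):
    """Build a pattern such as 'K..K': `repeats` copies of `residue` separated by `gap` wildcard dots."""
    return ('.' * gap).join([residue] * repeats)

motif = spaced_motif('K', gap=2, repeats=2)   # 'K..K'
toy_protein = 'MAKTRKLLE'                     # made-up peptide for illustration

result = re.search(motif, toy_protein)
if result is None:
    print('No match for', motif)
else:
    print(motif, 'found at', result.start(), '-', result.end(), ':', result.group())
```

Changing the `residue`, `gap`, and `repeats` arguments yields other evenly spaced patterns without retyping the dots.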

```python
import re

proseq = ''    # collects the protein sequence
proID = ''     # collects the protein ID from the header line
with open('junb.fasta', 'r') as file:    # open and read the file
    for line in file:
        line = line.strip()
        if line.startswith('>'):
            proID = line[4:10]           # collect the protein ID
        else:
            proseq = proseq + line

print('Sequence :', proseq, "\n")

print('finditer function: ')

# Find every run of leucines spaced 7 residues apart with re.finditer()
results = list(re.finditer('(L......)+L', proseq))

for match in results:
    print('location:', (match.start(), match.end()), 'motif :', match.group())  # print location and motif
```

```text
Sequence : MCTKMEQPFYHDDSYTATGYGRAPGGLSLHDYKLLKPSLAVNLADPYRSLKAPGARGPGPEGGGGGSYFSGQGSDTGASLKLASSELERLIVPNSNGVITTTPTPPGQYFYPRGGGSGGGAGGAGGGVTEEQEGFADGFVKALDDLHKMNHVTPPNVSLGATGGPPAGPGGVYAGPEPPPVYTNLSSYSPASASSGGAGAAVGTGSSYPTTTISYLPHAPPFAGGHPAQLGLGRGASTFKEEPQTVPEARSRDATPPVSPINMEDQERIKVERKRLRNRLAATKCRKRKLERIARLEDKVKTLKAENAGLSSTAGLLREQVAQLKQKVMTHVSNGCQLLLGVKGHAF 

finditer function: 
location: (26, 34) motif : LSLHDYKL
location: (42, 50) motif : LADPYRSL
location: (79, 87) motif : LKLASSEL
location: (295, 324) motif : LEDKVKTLKAENAGLSSTAGLLREQVAQL
```


Compared with the online tool HMMER, our Python script also found four motifs and their locations.
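
When comparing locations, keep in mind that Python match positions are 0-based with an exclusive end, whereas sequence-analysis tools such as HMMER generally report 1-based, inclusive residue numbers. The sketch below shows the conversion on a made-up sequence; `to_one_based` is a hypothetical helper name, and the 1-based convention should be confirmed against the coordinates shown on the results page.

```python
import re

def to_one_based(match):
    """Convert a match's 0-based, end-exclusive span to 1-based, end-inclusive residue numbers."""
    return match.start() + 1, match.end()

toy = 'AALAAAAAALAA'                      # made-up sequence with two L's 7 residues apart
match = re.search('L......L', toy)

print('Python span   :', match.span())         # (2, 10)
print('1-based range :', to_one_based(match))  # (3, 10)
```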