# BF527: Applications in Bioinformatics

>**Note:** Please submit the Jupyter notebook through Blackboard. Your code should follow the guidelines laid out in class, including commenting. Partial credit will be given for nonfunctional code that is logical and well commented. This assignment must be completed on your own.

## Homework 7

### See [Blackboard](https://learn.bu.edu) for assignment and due dates

---

## Problem 7.1 (40%):

Explore the gene expression dataset in the **Gene Expression Omnibus (GEO)** with the accession **GSE4115**. In this study, the authors compared gene expression in histologically normal bronchial epithelium between 79 samples from smokers with lung cancer and 73 samples from smokers without lung cancer. The goal of the study was to identify a diagnostic gene expression profile that can distinguish between lung cancer and non-lung cancer samples.

Use GEO to identify the number of genes differentially expressed in smokers with lung cancer versus smokers without lung cancer. Use a two-tailed t-test with a significance level of 0.05.
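To make the test behind this step concrete, here is a minimal sketch of the two-sample t statistic for a single probe, written with the standard library only. It assumes equal variances in the two groups (Student's form). The function name `pooled_t_statistic` and the expression values are invented for illustration and are not taken from GSE4115.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Return Student's two-sample t statistic and its degrees of freedom.

    Assumes both groups share a common variance (pooled estimate).
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (mean(group_a) - mean(group_b)) / standard_error
    return t, df


# Invented log2 expression values for one probe (illustration only)
cancer = [7.9, 8.4, 8.1, 8.6, 8.3]
no_cancer = [7.2, 7.6, 7.4, 7.1, 7.7]

t, df = pooled_t_statistic(cancer, no_cancer)
print(f"t = {t:.2f}, df = {df}")
```

A two-tailed test at the 0.05 level calls the probe significant when the absolute value of t exceeds the critical value for its degrees of freedom. For the group sizes in this study (79 + 73 - 2 = 150 degrees of freedom) that critical value is roughly 1.98. Converting t to a p-value needs the t distribution, which a library routine such as `scipy.stats.ttest_ind` provides.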

Use a web tool of your choice to identify any significant pathways or biological processes that may be affected in lung cancer patients. Do you find anything interesting? Does it make sense? What would you do next as a follow-up experiment (bioinformatics or biology)?
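Many enrichment tools score a pathway with an over-representation test: given the differentially expressed genes, how surprising is their overlap with the pathway? The sketch below shows one common form of that calculation, a one-sided hypergeometric tail probability. It is not the method of any particular web tool, and the function name `enrichment_p_value` and all counts are invented for illustration.

```python
from math import comb


def enrichment_p_value(universe, pathway, selected, overlap):
    """Probability of seeing at least `overlap` pathway genes by chance.

    universe: total number of genes tested
    pathway:  number of those genes annotated to the pathway
    selected: number of differentially expressed genes
    overlap:  number of differentially expressed genes in the pathway
    """
    total = comb(universe, selected)
    upper = min(pathway, selected)
    # Sum the hypergeometric probabilities for overlap, overlap + 1, ..., upper
    tail = sum(
        comb(pathway, k) * comb(universe - pathway, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Invented toy counts: 20 genes tested, 5 in the pathway,
# 4 differentially expressed, 2 of which fall in the pathway
p = enrichment_p_value(universe=20, pathway=5, selected=4, overlap=2)
print(f"p = {p:.4f}")
```

Because a tool typically tests many pathways at once, the reported values are usually adjusted for multiple testing before a pathway is called significant.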

---

## Problem 7.2 (60%):

Mitogen-activated protein kinase 6 (MAPK6) is an enzyme that belongs to the Ser/Thr protein kinase family. Like other MAP kinases, MAPK6 is an extracellular signal-regulated kinase that is activated through protein phosphorylation. MAPK6 is known to contain one protein kinase domain (Pkinase), located at the N-terminus. This Pkinase domain is a linear motif-binding (LMB) domain and is known to bind the following short linear motifs (SLiMs):

<table>
  <tr>
    <th>LMB Domain</th>
    <th>SLiM</th> 
  </tr>
  <tr>
    <td rowspan="7">Pkinase</td>
    <td>N.E.K..N</td> 
  </tr>
  <tr>
    <td>N.Y....E</td>
  </tr>
  <tr>
    <td>S...D.PL</td>
  </tr>
  <tr>
    <td>S..SS</td>
  </tr>
  <tr>
    <td>S.S..S</td>
  </tr>
  <tr>
    <td>ST.S</td>
  </tr>
  <tr>
    <td>F.FP</td>
  </tr>
</table>
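The SLiMs in the table can be read directly as regular expressions, with `.` standing for any single residue and every other letter standing for itself. The short sketch below illustrates this on an invented toy sequence. The helper name `first_match_spans` is hypothetical, and the spans it reports are 0-based with an exclusive end, following Python's `re` module.

```python
import re

# The seven SLiMs from the table; "." matches any single residue
SLIMS = ['N.E.K..N', 'N.Y....E', 'S...D.PL', 'S..SS', 'S.S..S', 'ST.S', 'F.FP']


def first_match_spans(sequence, slims=SLIMS):
    """Map each SLiM that occurs in `sequence` to the span of its first match."""
    spans = {}
    for slim in slims:
        match = re.search(slim, sequence)
        if match is not None:
            spans[slim] = match.span()  # (start, end), end exclusive
    return spans


# Invented toy sequence, not a real protein
toy_sequence = "MKSTASAAFQFPLL"
for slim, (start, end) in first_match_spans(toy_sequence).items():
    print(slim, toy_sequence[start:end], start, end)
```

In this toy sequence `ST.S` matches the residues `STAS` and `F.FP` matches `FQFP`. The other five motifs do not occur.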

Write a Python script that integrates information about MAPK6’s interaction partners, their sequences, and the known motifs that the Pkinase domain binds, to determine how many of MAPK6’s interactions are mediated by an LMB domain-motif interaction. The file ```HW7.2_uniprot_proteins.fasta``` (available on Blackboard) is a FASTA file that contains proteins that are known to interact with MAPK6. Your code should print: **(1)** the UniProt ID of the binding partner; **(2)** the motif found in the binding partner; and **(3)** the location of the motif in the binding partner’s sequence. There may be more than one motif within one protein. There may also be no motifs in a protein.
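The UniProt ID requested in item (1) sits between the first two `|` characters of a UniProt FASTA header. A minimal sketch of pulling it out by splitting on `|` follows. Unlike a fixed character slice, this also copes with 10-character accessions. The function name `uniprot_accession` is hypothetical, and the second header in the example is invented.

```python
def uniprot_accession(header):
    """Extract the accession from a header such as '>sp|P30153|2AAA_HUMAN ...'."""
    fields = header.split('|')
    if len(fields) < 3:
        # Not in the '>db|ACCESSION|ENTRY_NAME ...' layout assumed here
        raise ValueError(f"unexpected FASTA header: {header!r}")
    return fields[1]


# Shortened form of a header from the homework file
print(uniprot_accession('>sp|P30153|2AAA_HUMAN Serine/threonine-protein phosphatase 2A'))

# Invented header with a 10-character accession (illustration only)
print(uniprot_accession('>tr|A0A123BCD4|TOY_HUMAN Hypothetical protein'))
```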


In [29]:
import re

# Parse the FASTA file into a dictionary that maps each header line to its sequence
proteins = {}
header = None
with open('uniprot_proteins.fasta', 'rt') as f:
    for line in f:
        line = line.rstrip()
        if line.startswith('>'):
            # A header line starts a new record
            header = line
            proteins[header] = ''
        elif header is not None:
            # Sequence lines are appended to the current record
            proteins[header] += line

proteins

{
 '>sp|P30153|2AAA_HUMAN Serine/threonine-protein phosphatase 2A 65 kDa regulatory subunit A alpha isoform OS=Homo sapiens GN=PPP2R1A PE=1 SV=4': 'MAAADGDDSLYPIAVLIDELRNEDVQLRLNSIKKLSTIALALGVERTRSELLPFLTDTIYDEDEVLLALAEQLGTFTTLVGGPEYVHCLLPPLESLATVEETVVRDKAVESLRAISHEHSPSDLEAHFVPLVKRLAGGDWFTSRTSACGLFSVCYPRVSSAVKAELRQYFRNLCSDDTPMVRRAAASKLGEFAKVLELDNVKSEIIPMFSNLASDEQDSVRLLAVEACVNIAQLLPQEDLEALVMPTLRQAAEDKSWRVRYMVADKFTELQKAVGPEITKTDLVPAFQNLMKDCEAEVRAAASHKVKEFCENLSADCRENVIMSQILPCIKELVSDANQHVKSALASVIMGLSPILGKDNTIEHLLPLFLAQLKDECPEVRLNIISNLDCVNEVIGIRQLSQSLLPAIVELAEDAKWRVRLAIIEYMPLLAGQLGVEFFDEKLNSLCMAWLVDHVYAIREAATSNLKKLVEKFGKEWAHATIIPKVLAMSGDPNYLHRMTTLFCINVLSEVCGQDITTKHMLPTVLRMAGDPVANVRFNVAKSLQKIGPILDNSTLQSEVKPILEKLTQDQDVDVKYFAQEALTVLSLA',
 '>sp|P01023|A2MG_HUMAN Alpha-2-macroglobulin OS=Homo sapiens GN=A2M PE=1 SV=3': 'MGKNKLLHPSLVLLLLVLLPTDASVSGKPQYMVLVPSLLHTETTEKGCVLLSYLNETVTVSASLESVRGNRSLFTDLEAENDVLHCVAFAVPKSSSNEEVMFLTVQVKGPTQEFKKRTTVMVKNEDSLVFVQTDKSIYKPGQTVKFRVVSMDENFHPLN

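In [30]:
# Search every protein for each known SLiM and print the ID, motif, start, and stop
motifs = ['N.E.K..N', 'N.Y....E', 'S...D.PL', 'S..SS', 'S.S..S', 'ST.S', 'F.FP']
hits = {}
for header, sequence in proteins.items():
    # Headers look like '>sp|P30153|2AAA_HUMAN ...', so characters 4-9 hold
    # a six-character UniProt accession
    uniprot_id = header[4:10]
    hits[uniprot_id] = []
    for motif in motifs:
        # re.search returns only the first occurrence of each motif;
        # start is 0-based and stop is exclusive
        match = re.search(motif, sequence)
        if match is not None:
            hits[uniprot_id].append((motif, match.start(), match.end()))

for uniprot_id, found in hits.items():
    print('Uniprot ID:', uniprot_id)
    for hit in found:
        print('Motif, Start, Stop:', hit)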



Uniprot ID: 
Uniprot ID: P30153
Uniprot ID: P01023
Motif, Start, Stop: ('S.S..S', 60, 66)
Motif, Start, Stop: ('ST.S', 781, 785)
Motif, Start, Stop: ('F.FP', 189, 193)
Uniprot ID: Q9Y478
Motif, Start, Stop: ('S..SS', 176, 181)
Uniprot ID: P42025
Uniprot ID: P63261
Uniprot ID: P49418
Uniprot ID: Q12955
Motif, Start, Stop: ('S..SS', 1522, 1527)
Motif, Start, Stop: ('S.S..S', 910, 916)
Motif, Start, Stop: ('ST.S', 1543, 1547)
Uniprot ID: Q9UJX4
Motif, Start, Stop: ('ST.S', 269, 273)
Uniprot ID: Q99767
Motif, Start, Stop: ('S.S..S', 235, 241)
Uniprot ID: P02647
Motif, Start, Stop: ('ST.S', 78, 82)
Uniprot ID: O15145
Uniprot ID: P18859
Uniprot ID: Q7Z3C6
Motif, Start, Stop: ('S..SS', 682, 687)
Uniprot ID: Q9HBU1
Motif, Start, Stop: ('S..SS', 94, 99)
Motif, Start, Stop: ('F.FP', 84, 88)
Uniprot ID: Q8TAM1
Motif, Start, Stop: ('S..SS', 154, 159)
Motif, Start, Stop: ('S.S..S', 585, 591)
Uniprot ID: O43570
Motif, Start, Stop: ('ST.S', 157, 161)
Uniprot ID: P55290
Uniprot ID: P27797
Uniprot ID: 
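The output above lists at most one position per motif per protein, because `re.search` stops at the first occurrence. The sketch below shows one way to report every occurrence, including overlapping ones, and to count how many proteins carry at least one motif. It is a hedged variant, not part of the submitted solution. The names `find_all_motifs` and `count_mediated` and the toy sequences are invented for illustration.

```python
import re

MOTIFS = ['N.E.K..N', 'N.Y....E', 'S...D.PL', 'S..SS', 'S.S..S', 'ST.S', 'F.FP']


def find_all_motifs(sequence, motifs=MOTIFS):
    """Return (motif, start, stop) for every occurrence, overlaps included."""
    hits = []
    for motif in motifs:
        # Wrapping the motif in a zero-width lookahead lets the scan resume
        # one residue after each match start, so overlaps are not skipped
        for match in re.finditer('(?=(' + motif + '))', sequence):
            # Group 1 holds the motif itself; start is 0-based, stop exclusive
            hits.append((motif, match.start(1), match.end(1)))
    return hits


def count_mediated(proteins, motifs=MOTIFS):
    """Count the proteins whose sequence contains at least one motif."""
    return sum(1 for seq in proteins.values() if find_all_motifs(seq, motifs))


# Invented toy sequences, not taken from the homework FASTA file
toy_proteins = {
    'TOY001': 'AASTASAAASTGSAA',  # ST.S occurs twice
    'TOY002': 'SAASSAASS',        # S..SS occurs twice, sharing one residue
    'TOY003': 'MKLVAAGG',         # no motif
}

for protein_id, seq in toy_proteins.items():
    print('Uniprot ID:', protein_id)
    for hit in find_all_motifs(seq):
        print('Motif, Start, Stop:', hit)

print('Proteins with at least one motif:', count_mediated(toy_proteins))
```

Whether overlapping occurrences should count separately is a judgment call. Dropping the lookahead and calling `re.finditer(motif, sequence)` directly would report non-overlapping matches only.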

---

## EXTRA CREDIT (5 points):

Watch the three webinars hosted by George Church on YouTube (~20 minutes):

1. http://www.youtube.com/watch?v=mVZI7NBgcWM
2. http://www.youtube.com/watch?v=2r9DpthvNKM
3. http://www.youtube.com/watch?v=mgXAO8pv-X4

Discuss the potential benefits/detriments of getting your genome sequenced and the potential benefits/detriments of making your genome sequence public for all to see.

Would you get your genome sequenced? Would you make it public? Why or why not?