# Extracting experimental DNA-binding sequences from UniProt data

This notebook demonstrates how to use python and pandas to extract experimental DNA-binding sequences.

Emma Hatton-Ellis, Jun 2017.

In [1]:
import pandas as pd

Search UniProt for entries with experimental DNA-binding site annotations, and retrieve in tab format. The fields retrieved are accession, entry name, gene name, DNA-binding feature data, and the canonical protein sequence.

In [2]:
url = 'https://www.uniprot.org/uniprot/?query=annotation:(type:dna_bind+evidence:experimental)&format=tab&columns=id,entry_name,genes(PREFERRED),feature(DNA+BINDING),sequence'

In [3]:
df = pd.read_csv(url, sep='\t')

In [4]:
df.head()

Unnamed: 0,Entry,Entry name,Gene names (primary ),DNA binding,Sequence
0,A4JS72,BV3F_BURVG,,"DNA_BIND 89..95; /evidence=""ECO:0000269|PubMe...",MPVQGRENMDPKSPGYLALIAQRESLDAQIIAARKAEREVAIGQIK...
1,U2UMQ6,CS12A_ACISB,cas12a,"DNA_BIND 599..607; /note=""PAM-binding on targ...",MTQFEGFTNLYQVSKTLRFELIPQGKTLKHIQEQGFIEEDKARNDH...
2,Q6NT76,HMBX1_HUMAN,HMBOX1,"DNA_BIND 267..341; /note=""Homeobox""; /eviden...",MLSSFPVVLLETMSHYTDEPRFTIEQIDLLQRLRRTGMTKHEILHA...
3,Q00958,LFY_ARATH,LFY,"DNA_BIND 233..237; /evidence=""ECO:0000244|PDB...",MDPEGFTSGLFRWNPTRALVQAPPPVPPPLQQQPVTPQTAAFGMRL...
4,P52952,NKX25_HUMAN,NKX2-5,"DNA_BIND 138..197; /note=""Homeobox""; /eviden...",MFPSPALTPTPFSVKDILNLEQQQRSLAAAGELSARLEATLAPSSC...


Extract the amino acid positions from the DNA binding column. The regex uses named capture groups ("from" and "to") which are automatically set as column names by pandas.

In [5]:
positions = df['DNA binding'].str.extract(r'DNA_BIND\s(?P<from>\d+)\.\.(?P<to>\d+);', expand=True)

Check that the positions look ok.

In [6]:
positions.tail()

Unnamed: 0,from,to
21,79,147
22,285,300
23,246,259
24,1,35
25,58,238


Convert the amino acid positions from string to numeric data type.

In [7]:
df['from'] = pd.to_numeric(positions['from'])
df['to'] = pd.to_numeric(positions['to'])
df.head()

Unnamed: 0,Entry,Entry name,Gene names (primary ),DNA binding,Sequence,from,to
0,A4JS72,BV3F_BURVG,,"DNA_BIND 89..95; /evidence=""ECO:0000269|PubMe...",MPVQGRENMDPKSPGYLALIAQRESLDAQIIAARKAEREVAIGQIK...,89,95
1,U2UMQ6,CS12A_ACISB,cas12a,"DNA_BIND 599..607; /note=""PAM-binding on targ...",MTQFEGFTNLYQVSKTLRFELIPQGKTLKHIQEQGFIEEDKARNDH...,599,607
2,Q6NT76,HMBX1_HUMAN,HMBOX1,"DNA_BIND 267..341; /note=""Homeobox""; /eviden...",MLSSFPVVLLETMSHYTDEPRFTIEQIDLLQRLRRTGMTKHEILHA...,267,341
3,Q00958,LFY_ARATH,LFY,"DNA_BIND 233..237; /evidence=""ECO:0000244|PDB...",MDPEGFTSGLFRWNPTRALVQAPPPVPPPLQQQPVTPQTAAFGMRL...,233,237
4,P52952,NKX25_HUMAN,NKX2-5,"DNA_BIND 138..197; /note=""Homeobox""; /eviden...",MFPSPALTPTPFSVKDILNLEQQQRSLAAAGELSARLEATLAPSSC...,138,197


Slice the sequence string using the amino acid positions, and apply this as a new column.

In [8]:
df['extracted_dna_binding'] = df.apply(lambda x: x['Sequence'][x['from']:x['to']], 1)

In [9]:
df[['Entry', 'Entry name', 'extracted_dna_binding']]

Unnamed: 0,Entry,Entry name,extracted_dna_binding
0,A4JS72,BV3F_BURVG,GRQPAW
1,U2UMQ6,CS12A_ACISB,DAAKMIPK
2,Q6NT76,HMBX1_HUMAN,RGSRFTWRKECLAVMESYFNENQYPDEAKREEIANACNAVIQKPGK...
3,Q00958,LFY_ARATH,EHPF
4,P52952,NKX25_HUMAN,RKPRVLFSQAQVYELERRFKQQRYLSAPERDQLASVLKLTSTQVKI...
5,P51023,PNT_DROME,QLWQFLLELLLDKTCQSFISWTGDGWEFKLTDPDEVARRWGIRKNK...
6,P23874,HIPA_ECOLI,VLR
7,P0A1S2,HNS_SALTY,GRTPA
8,P03685,NP_BPPH2,AKMMQREITKTTVNVAKM
9,P23873,HIPB_ECOLI,QQNGWTQSELAKKIGIKQATISNFEN
