## Data Download from UniProt

After running this notebook, you should have:
- `data/raw/uniprot_raw.tsv` with enzyme sequences
- Columns: `Entry`, `Sequence`, `EC number` (the UniProt labels for the requested `accession`, `sequence`, and `ec` fields)

In [5]:
# Load config file

import yaml

with open("../configs/data.yaml", "r") as f:
    cfg = yaml.safe_load(f)

query_url = cfg["query_url"]
query_url

'https://rest.uniprot.org/uniprotkb/stream?compressed=true&format=tsv&fields=accession,sequence,ec&query=(reviewed:true)%20AND%20(ec:*)'
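To see what this URL actually requests, it can be split into its query parameters with the standard library. This is only an illustrative sketch: the URL is copied from the output above, and the `params` name is introduced here for the example.

```python
from urllib.parse import urlsplit, parse_qs

# URL copied verbatim from the config output above
query_url = (
    "https://rest.uniprot.org/uniprotkb/stream"
    "?compressed=true&format=tsv&fields=accession,sequence,ec"
    "&query=(reviewed:true)%20AND%20(ec:*)"
)

parts = urlsplit(query_url)

# parse_qs returns a list per key; each key appears once here, so keep the first value
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

for key, value in params.items():
    print(f"{key:>10}: {value}")
```

Decoded, the `query` parameter reads `(reviewed:true) AND (ec:*)`, so the request is limited to reviewed entries that carry an EC annotation, returned as gzip-compressed TSV with three fields.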

In [6]:
# Download from UniProt

import requests
from io import BytesIO

response = requests.get(query_url)
response.raise_for_status()  # Stop here on an HTTP error instead of parsing an error page
bio = BytesIO(response.content)
bio

<_io.BytesIO at 0x1d02815f470>
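Because the buffer is later read with `compression='gzip'`, it may help to confirm that the payload really is gzip data before handing it to pandas. The sketch below is one possible guard, not part of the original notebook; `GZIP_MAGIC` and `to_gzip_buffer` are hypothetical names, and the sample bytes stand in for `response.content`.

```python
import gzip
from io import BytesIO

# Every gzip stream starts with these two bytes
GZIP_MAGIC = b"\x1f\x8b"


def to_gzip_buffer(content: bytes) -> BytesIO:
    """Wrap raw bytes in a BytesIO, rejecting payloads that are not gzip data."""
    if not content.startswith(GZIP_MAGIC):
        raise ValueError("Payload does not look gzip-compressed")
    return BytesIO(content)


# Stand-in for response.content: a tiny gzip-compressed TSV
sample = gzip.compress(b"Entry\tSequence\tEC number\n")
buffer = to_gzip_buffer(sample)
print(gzip.decompress(buffer.getvalue()).decode())
```

In the notebook, `bio = to_gzip_buffer(response.content)` would take the place of the plain `BytesIO(...)` call.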

In [7]:
# Load and clean raw data

import pandas
import os

os.makedirs("../data/raw", exist_ok=True)

df = pandas.read_csv(bio, compression='gzip', sep='\t')
df = df.dropna()  # Drop rows (proteins) that have a missing value in any column

df.to_csv("../data/raw/uniprot_raw.tsv", sep="\t", index=False) # Save raw data as tsv
df.head(3)

|   | Entry | Sequence | EC number |
|---|---|---|---|
| 0 | A0A009IHW8 | MSLEQKKGADIISKILQIQNSIGKTTSPSTLKTKLSEISRKEQENA... | 3.2.2.-; 3.2.2.6 |
| 1 | A0A023I7E1 | MRFQVIVAAATITMITSYIPGVASQSTSDGDDLFVPVSNFDPKSIF... | 3.2.1.39 |
| 2 | A0A024B7W1 | MKNPKKKSGGFRIVNMLKRGVARVSPFGGLKRLPAGLLLGHGPIRM... | 2.1.1.56; 2.1.1.57; 2.7.7.48; 3.4.21.91; 3.6.1... |
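The preview shows that a single `EC number` cell can hold several entries separated by `; `, and that an entry may be partial (for example `3.2.2.-`). A minimal sketch of splitting such a cell follows. The helpers `split_ec_numbers` and `ec_top_level` are hypothetical and not part of the notebook, and the sample values are the two untruncated cells from the preview.

```python
def split_ec_numbers(value: str) -> list[str]:
    """Split a semicolon-separated EC cell into individual EC numbers."""
    return [part.strip() for part in value.split(";") if part.strip()]


def ec_top_level(ec: str) -> int:
    """Return the first digit group of an EC number, e.g. 3 for '3.2.1.39'."""
    return int(ec.split(".")[0])


# The two complete EC cells shown in the preview
cells = ["3.2.2.-; 3.2.2.6", "3.2.1.39"]

for cell in cells:
    numbers = split_ec_numbers(cell)
    print(numbers, [ec_top_level(n) for n in numbers])
```

Applied to the DataFrame, the same function could be used as `df["EC number"].map(split_ec_numbers)`, assuming the column keeps the `EC number` label shown in the preview.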
