# Dataset Generation

This notebook shows how I gathered, from UniProt, all the data contained in this folder.

## Getting all the IDs

The first cell queries UniProt for all human protein entries that have been reviewed (reviewed entries are of higher quality than unreviewed ones) and walks through every page of results. The query I used was ```?query=*&fil=reviewed%3ayes+AND+organism%3a%22Homo+sapiens+(Human)+%5b9606%5d%22{}&columns=id%2centry+name%2creviewed%2cprotein+names%2cgenes%2corganism%2clength```, where `{}` is a placeholder for the paging offset (`&offset=...`).

The IDs were stored as strings in a set.
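
The paging arithmetic used in the cell below can be sketched on its own. This is only an illustration: `page_offsets` is a hypothetical helper that does not exist in the notebook, and the page size of 25 is taken from the value hard-coded in the cell.

```python
from math import ceil

PAGE_SIZE = 25  # assumption: entries per result page, as hard-coded in the notebook


def page_offsets(n_results, page_size=PAGE_SIZE):
    """Return the offset of the first entry on each result page."""
    n_pages = ceil(n_results / page_size)
    return [page * page_size for page in range(n_pages)]


offsets = page_offsets(20395)
print(len(offsets), offsets[:3], offsets[-1])
```

For the 20395 results reported below, this gives 816 pages, the last one starting at offset 20375.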

In [1]:
import time
from math import ceil
import requests
from scrapy import Selector


base_url = ("https://www.uniprot.org/uniprot/?query=*&fil=reviewed%3ayes+AND+or"
            "ganism%3a%22Homo+sapiens+(Human)+%5b9606%5d%22{}&columns=id%2centr"
            "y+name%2creviewed%2cprotein+names%2cgenes%2corganism%2clength")

# Selector is given the HTML text of the page explicitly
sel = Selector(text=requests.get(base_url.format("")).text)

results = int(sel.xpath('//strong[@class="queryResultCount"]/text()'
                        '').extract_first().replace(",",""))
n_pages = ceil(results/25)

print("Number of entries:", results, "",
      "Pages to request:", n_pages, "", "Expected time:", 
      # Expected time in format H:MM:SS (minutes and seconds zero-padded)
      f"{n_pages*3//3600}:{(n_pages*3%3600)//60:02d}:{(n_pages*3%3600)%60:02d}",
      "", "Pages fetched: ", sep = "\n")

all_human_up = set()
xid = '//td[@class="entryID"]/a/text()'

for i in range(n_pages):
    if i == 0:
        url = base_url.format("")
    else:
        url = base_url.format("&offset="+str(i*25))

    success = False
    while not success:
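        try:
            time.sleep(3)
            # Selector is given the HTML text of the page explicitly
            sel = Selector(text=requests.get(url).text)
            success = True
        except requests.RequestException:
            # Network error: report it and retry the same page
            print("[oops]", end = ", ")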
    
    all_human_up.update(sel.xpath(xid).extract()) # IDs are added
    
    if i % 20 == 19:
        print(i + 1, f"({len(all_human_up)} entries so far)", end = ", ")
    else:
        print(i + 1, end = ", ")

# IDs are written
with open("all_human_uniprot_ids_v2.csv", "w") as f:
    f.write("".join([str(n) + "\n" for n in all_human_up]))

Number of entries:
20395

Pages to request:
816

Expected time:
0:40:48

Pages fetched: 
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 (500 entries so far), 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40 (1000 entries so far), 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60 (1500 entries so far), 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80 (2000 entries so far), 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100 (2500 entries so far), 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120 (3000 entries so far), 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140 (3500 entries so far), 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160 (4000 entries so far), 161, 162, 163, 164, 165, 166, 167, 168, 169,
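
The retry loop in the first cell keeps trying forever if a page never loads. A bounded variant could look like the sketch below. `fetch_with_retries` is a hypothetical helper, not part of the notebook; the defaults of 4 attempts and a 3-second pause mirror the values used elsewhere in this notebook. The demo uses a simulated failure and no delay so that it runs instantly.

```python
import time


def fetch_with_retries(fetch, max_attempts=4, delay=3.0):
    """Call fetch() until it succeeds or max_attempts is reached.

    Returns the result of fetch(), or None if every attempt raised.
    """
    for attempt in range(1, max_attempts + 1):
        time.sleep(delay)  # pause before every request
        try:
            return fetch()
        except Exception as error:  # narrow this to the errors you expect
            print(f"[attempt {attempt} failed: {error}]", end=", ")
    return None


# Demo with a stand-in function that fails twice before succeeding
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated failure")
    return "ok"


result = fetch_with_retries(flaky, delay=0)
```

In the scraping loop, `fetch` would be something like `lambda: requests.get(url)`. Because `None` signals failure here, this sketch is only suitable when a successful `fetch()` never returns `None`.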

## Requesting all human entries in UniProt

Using the `Uniprot` class I wrote (imported from `Uniprot_v2` in the cells below), I gathered the following information for each entry:
- Protein name
- Gene name
- Species
- Description
- GO annotations
- PDB IDs
- PubMed IDs that appear in the Description
- Sequence
- Sites of interest in the protein Sequence

I did this for all 20395 reviewed human entries present in UniProt at the time this notebook was written.

First, I wrote a function to store the information cleanly in a CSV file (delimiter: ";"). For columns containing lists, the elements are separated by "&" (first level) and, for nested lists, by "+" (second level).
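
To make the nesting concrete, here is a sketch of how one such field could be decoded again. `split_field` is a hypothetical helper that is not part of the notebook, and the sample values are placeholders rather than real UniProt data.

```python
def split_field(field, nested=False):
    """Split one CSV field on "&", and each item on "+" if nested."""
    if not field:
        return []
    items = field.split("&")
    if nested:
        return [item.split("+") for item in items]
    return items


# Placeholder values, for illustration only
flat_items = split_field("a&b&c")
nested_items = split_field("a+b&c+d", nested=True)
print(flat_items, nested_items)
```

Which columns need `nested=True` depends on the attributes of the entry object, so that choice is left to the caller.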

In [2]:
import io


def write_up(data, name):
    """Write every entry in data as one ";"-delimited row of the file name."""
    rows = []
    for entry in data.values():
        fields = []
        # Public attributes only; the `page` attribute is skipped
        for attr in [a for a in dir(entry) if not a.startswith("_") and a != "page"]:
            value = getattr(entry, attr)
            field = ""
            if isinstance(value, str):
                field = value.replace(";", ",").replace("\n", "&")
            elif isinstance(value, (list, tuple)):
                field = "&".join(p.replace("&", "").replace(";", ",") for p in value)
            elif attr == "go":
                # One column for each of the first three values of the go dict;
                # items are joined by "&" and the parts of each item by "+"
                groups = list(value.values())[:3]
                field = ";".join("&".join("+".join(item) for item in group)
                                 for group in groups)
            elif isinstance(value, dict):
                # Dicts of strings take one column per value
                values = list(value.values())
                if values and isinstance(values[0], str):
                    field = ";".join(values)
            fields.append(field + ";")
        rows.append("".join(fields) + "\n")
    with io.open(name, "w", encoding="utf-8") as f:
        f.write("".join(rows))

Then I gathered the information for all entries:

In [3]:
from Uniprot_v2 import *

data = {}

with open("all_human_uniprot_ids.csv", "r") as f:
    upids = f.readlines()
    upids = [n.strip() for n in upids]

for i, upid in enumerate(upids):
    print(f"{upid} ({i+1})", end = ", ")

    # Each page is requested at most 4 times; after that the entry is
    # ignored to avoid remaining in a loop forever.
    for attempt in range(4):
        try:
            time.sleep(3)
            data[upid] = Uniprot(upid)
            break
        except Exception:
            print("a", end = "")

    # After 500 entries have been processed, their information is written to
    # a file and the data object discards it to avoid crashing due to memory
    # issues. This runs even when the current entry was skipped, so a failure
    # at a chunk boundary does not delay the write.
    if i % 500 == 499:
        print("CLEANING MEMORY...", end=", ")
        write_up(data, "up_data_{}.csv".format((i+1)//500))
        data = {}

# After all entries have been fetched, the last ones are also written to a file.
write_up(data, "up_data_42.csv")

19154), Q96A83 (19155), Q4G0I0 (19156), P04808 (19157), A0A1B0GUW6 (19158), Q9HC98 (19159), P41229 (19160), Q8NB50 (19161), Q15760 (19162), P23497 (19163), Q9HAN9 (19164), Q9BXW9 (19165), Q8IV38 (19166), Q8N2A8 (19167), P05141 (19168), O75554 (19169), A0A1B0GVG6 (19170), Q6ZPD9 (19171), Q9H0W8 (19172), P55895 (19173), Q8WWM1 (19174), P04075 (19175), Q9NXF7 (19176), Q96J84 (19177), Q8TBE1 (19178), Q8IUF8 (19179), Q8NGK2 (19180), P49326 (19181), O75448 (19182), Q969F2 (19183), Q9H190 (19184), P20339 (19185), Q05586 (19186), P63272 (19187), O00214 (19188), Q13976 (19189), P0DMR1 (19190), Q8N5C8 (19191), P05177 (19192), Q86VU5 (19193), Q96DC8 (19194), Q6GPH6 (19195), Q16762 (19196), Q3ZCQ2 (19197), Q6NVV1 (19198), P52848 (19199), Q99466 (19200), Q9UQL6 (19201), Q16864 (19202), A6NMN3 (19203), Q9UJX3 (19204), Q9GZZ1 (19205), F2Z333 (19206), Q6UX41 (19207), Q9UF83 (19208), Q6ECI4 (19209), Q3MJ62 (19210), O75452 (19211), Q8N594 (19212), P01889 (19213), P01574 (19214), P12645 (19215), Q96QT6 (

This is how all the data were gathered. Note that this code can easily be adapted to obtain more information about UniProt entries.
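
As one example of reusing the output, the chunk files can be read back with the standard `csv` module. This is a minimal sketch, assuming the files follow the `up_data_*.csv` naming used above and sit in the current directory. `read_up_chunks` is a hypothetical helper that is not part of the notebook, and no column names are assumed, since the column order depends on the attributes of the entry objects.

```python
import csv
import glob


def read_up_chunks(pattern="up_data_*.csv"):
    """Yield every row of every matching chunk file as a list of fields."""
    # Note: sorted() orders paths as text, so up_data_10 comes before up_data_2
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8", newline="") as f:
            # QUOTE_NONE: the files are written without quoting,
            # so quote characters are read as literal text
            reader = csv.reader(f, delimiter=";", quoting=csv.QUOTE_NONE)
            for row in reader:
                if row:
                    # Every row ends with ";", which yields an empty last field
                    yield row[:-1]


rows = list(read_up_chunks())
print(len(rows), "rows read")
```

Fields that hold lists still contain the "&" and "+" separators and need to be split afterwards.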