##### Extract peptide sequence from UniParc or UniRef

UniRef (UniProt Reference Clusters):
UniRef provides clustered sets of protein sequences that reduce redundancy and speed up sequence similarity searches. It consists of three databases:

UniRef100: Combines identical sequences and sub-fragments (11 or more residues) from UniProtKB (Swiss-Prot and TrEMBL) and selected UniParc records into a single entry, so that no sequence appears twice.
UniRef90: Built by clustering UniRef100 sequences so that each member has at least 90% sequence identity to, and 80% overlap with, the longest sequence in the cluster (the seed sequence).
UniRef50: Built by clustering UniRef90 seed sequences so that each member has at least 50% sequence identity to, and 80% overlap with, the longest sequence in the cluster.

The UniRef databases make it easier to search protein sequences and analyze large datasets by collapsing highly similar sequences into representative clusters.

UniParc (UniProt Archive):
UniParc is a comprehensive protein sequence archive that contains all publicly available protein sequences from various sources, such as UniProtKB, Ensembl, RefSeq, PDB, and many others. Its goal is to capture and store every protein sequence ever published, independent of any annotation or quality check.

UniParc assigns a unique identifier to each distinct protein sequence and tracks all updates or changes to that sequence over time. This allows researchers to access historical versions of protein sequences and follow the evolution of sequence data across different databases.

We only need sequences within a specific length range (at most 50 residues) from two data sources, UniRef and UniParc.

1. Selecting Data from UniRef

On the UniProt website, select UniRef as the database and search with one of the following queries:
(length:[* TO 50]) AND (identity:1.0)  #########  length up to 50 from UniRef100
(length:[* TO 50]) AND (identity:0.9)  #########  length up to 50 from UniRef90
(length:[* TO 50]) AND (identity:0.5)  #########  length up to 50 from UniRef50

2. Selecting Data from UniParc

For UniParc, the search is simpler because there is no identity level to choose:
(length:[* TO 50])               #########  length up to 50 from UniParc

3. Downloading the Results
Once the search is complete (expect millions of results, e.g. ~19M sequences in this example):
	1.	Click Download.
	2.	Set Format to TSV.
	3.	Under Customize Columns, select:
	•	Sequence ✅
	•	Length (optional, but may be useful for validation)
(We only require the sequence, but other columns can be kept for reference.)
	4.	Enable Compressed format for faster transfer.
	5.	Click Generate URL for API. This gives you a direct query URL, which we’ll call query_url.
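
The URL produced by Generate URL for API can also be assembled in code, which makes it easy to switch between UniRef100, UniRef90, UniRef50 and UniParc. The following is a minimal sketch using only the standard library. The helper names `uniref_query` and `build_stream_url` are hypothetical, and the endpoint layout and parameter names (`compressed`, `fields`, `format`, `query`) are taken from the query URLs used in this notebook, not from a full reading of the UniProt API documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://rest.uniprot.org"  # host used by the query URLs in this notebook


def uniref_query(identity, max_length=50):
    """Build the UniRef search string for one identity level ("1.0", "0.9" or "0.5")."""
    return f"((length:[* TO {max_length}]) AND (identity:{identity}))"


def build_stream_url(dataset, query, fields):
    """Assemble a stream URL for `dataset` ("uniref" or "uniparc")."""
    params = {
        "compressed": "true",
        "fields": ",".join(fields),
        "format": "tsv",
        "query": query,
    }
    # safe="*" leaves the wildcard unescaped, as in the URLs used in this notebook.
    return f"{BASE_URL}/{dataset}/stream?{urlencode(params, safe='*')}"


query_url = build_stream_url(
    "uniref", uniref_query("1.0"), ["id", "length", "identity", "sequence"]
)
print(query_url)
```

Passing `"uniparc"`, the query `((length:[* TO 50]))` and the fields `upi`, `length`, `sequence` gives the corresponding UniParc URL.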

In [None]:
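import os

from google.colab import drive

# Mount Google Drive so the downloaded files persist across Colab sessions.
drive.mount('/content/drive')
# Work inside the project folder on Drive.
os.chdir('/content/drive/MyDrive/peptideBERT_dataset_construction')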

Mounted at /content/drive


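#### Download data from UniParc with length up to 50

In [None]:
from io import BytesIO

import pandas
import requests

# API URL generated from the UniParc search (length:[* TO 50]); columns: upi, length, sequence.
query_url = 'https://rest.uniprot.org/uniparc/stream?compressed=true&fields=upi%2Clength%2Csequence&format=tsv&query=%28%28length%3A%5B*+TO+50%5D%29%29'
uniprot_request = requests.get(query_url)
uniprot_request.raise_for_status()  # stop here if the server returned an error

# The response body is a gzip-compressed TSV; parse it from memory.
bio = BytesIO(uniprot_request.content)
df = pandas.read_csv(bio, compression='gzip', sep='\t')
df.to_parquet('peptide_UniPrac_0_50.parquet')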
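
With results in the tens of millions of rows, holding the whole response in memory (as `uniprot_request.content` does) can exhaust a Colab session. One alternative, sketched here with the standard library's `urllib.request` in place of `requests`, is to copy the compressed stream straight to disk and parse the file afterwards. The function name `download_to_file` and the output file name are hypothetical.

```python
import shutil
import urllib.request


def download_to_file(url, dest_path, chunk_size=1024 * 1024):
    """Copy the body at `url` to `dest_path` chunk by chunk, without buffering it all in memory."""
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return dest_path


# Example call (hypothetical file name), commented out because it starts a large download:
# download_to_file(query_url, "uniparc_0_50.tsv.gz")
```

The saved file is still gzip-compressed, so it can be read afterwards with `pandas.read_csv("uniparc_0_50.tsv.gz", compression="gzip", sep="\t")`.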

#### Download data from UniRef100 with length up to 50

In [None]:
from io import BytesIO

import pandas
import requests

# API URL generated from the UniRef search (length:[* TO 50]) AND (identity:1.0).
query_url = 'https://rest.uniprot.org/uniref/stream?compressed=true&fields=id%2Clength%2Cidentity%2Csequence&format=tsv&query=%28%28length%3A%5B*+TO+50%5D%29+AND+%28identity%3A1.0%29%29'
uniprot_request = requests.get(query_url)
uniprot_request.raise_for_status()  # stop here if the server returned an error

# The response body is a gzip-compressed TSV; parse it from memory.
bio = BytesIO(uniprot_request.content)
df = pandas.read_csv(bio, compression='gzip', sep='\t')
df.to_parquet('peptide_UniRef100_0_50.parquet')

#### Download data from UniRef90 with length up to 50

In [None]:
from io import BytesIO

import pandas
import requests

# API URL generated from the UniRef search (length:[* TO 50]) AND (identity:0.9).
query_url = 'https://rest.uniprot.org/uniref/stream?compressed=true&fields=id%2Clength%2Cidentity%2Csequence&format=tsv&query=%28%28length%3A%5B*+TO+50%5D%29%29+AND+%28identity%3A0.9%29'
uniprot_request = requests.get(query_url)
uniprot_request.raise_for_status()  # stop here if the server returned an error

# The response body is a gzip-compressed TSV; parse it from memory.
bio = BytesIO(uniprot_request.content)
df = pandas.read_csv(bio, compression='gzip', sep='\t')
df.to_parquet('peptide_UniRef90_0_50.parquet')

#### Download data from UniRef50 with length up to 50

In [None]:
from io import BytesIO

import pandas
import requests

# API URL generated from the UniRef search (length:[* TO 50]) AND (identity:0.5).
query_url = 'https://rest.uniprot.org/uniref/stream?compressed=true&fields=id%2Clength%2Cidentity%2Csequence&format=tsv&query=%28%28length%3A%5B*+TO+50%5D%29%29+AND+%28identity%3A0.5%29'
uniprot_request = requests.get(query_url)
uniprot_request.raise_for_status()  # stop here if the server returned an error

# The response body is a gzip-compressed TSV; parse it from memory.
bio = BytesIO(uniprot_request.content)
df = pandas.read_csv(bio, compression='gzip', sep='\t')
df.to_parquet('peptide_UniRef50_0_50.parquet')
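
Since the Length column was kept for validation, one quick sanity check is to confirm that every row respects the length filter and that the reported length matches the sequence. The sketch below works on the raw gzip-compressed TSV bytes (for example `uniprot_request.content`) using only the standard library. The function name `count_invalid_rows` is hypothetical, and the column headers `Length` and `Sequence` are assumptions about the TSV header. UniRef exports may label the sequence column differently, so check the first line of the file and pass the matching names. The sample rows are made up for illustration.

```python
import csv
import gzip
import io


def count_invalid_rows(raw_bytes, max_length=50,
                       length_col="Length", sequence_col="Sequence"):
    """Count rows whose sequence exceeds `max_length` or disagrees with its length column."""
    invalid = 0
    # gzip.open accepts a file object; "rt" decodes the decompressed bytes as text.
    with gzip.open(io.BytesIO(raw_bytes), "rt", encoding="utf-8", newline="") as text:
        for row in csv.DictReader(text, delimiter="\t"):
            sequence = row[sequence_col]
            if len(sequence) > max_length or int(row[length_col]) != len(sequence):
                invalid += 1
    return invalid


# Made-up sample: the second row reports length 3 for a 4-residue sequence.
sample = gzip.compress(b"Entry\tLength\tSequence\nID_A\t4\tMKVL\nID_B\t3\tMKVL\n")
print(count_invalid_rows(sample))  # prints 1
```

A result of 0 on the real download indicates that the length filter and the Length column are consistent with the sequences.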