# Dataset Schema

This notebook documents where each dataset was obtained and how it was processed for use in our study.

---

***This notebook creates tables in a specified SQLite database.***

***RUNNING THIS NOTEBOOK MAY OVERWRITE EXISTING TABLES OR ADD DUPLICATE ROWS TO THEM.***
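
The `to_sql(..., if_exists="replace")` calls used in this notebook drop and recreate a table without asking. One way to guard against accidental overwrites is sketched below. The helper names `table_exists` and `write_table_safely` are hypothetical (they are not part of the notebook), and the demonstration runs on an in-memory database so that no file is touched.

```python
import sqlite3
import pandas as pd

def table_exists(conn, table_name):
    """Return True if a table with this name is present in the SQLite database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?",
        (table_name,),
    ).fetchone()
    return row is not None

def write_table_safely(df, table_name, conn, overwrite=False):
    """Hypothetical helper: refuse to replace an existing table unless asked to."""
    if table_exists(conn, table_name) and not overwrite:
        raise RuntimeError(
            f"Table {table_name!r} already exists; pass overwrite=True to replace it."
        )
    df.to_sql(table_name, conn, if_exists="replace", index=False)
    return len(df)

# Demonstration on an in-memory database (no files are touched).
demo_conn = sqlite3.connect(":memory:")
demo_df = pd.DataFrame({"uniprot_id": ["A0A024R1X5"], "start_index": [278]})
rows_written = write_table_safely(demo_df, "demo_table", demo_conn)

# A second write without overwrite=True is rejected.
try:
    write_table_safely(demo_df, "demo_table", demo_conn)
    second_write_blocked = False
except RuntimeError:
    second_write_blocked = True
```

Passing `overwrite=True` restores the notebook's current replace behaviour when a rebuild is intended.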

---

# UniRef50 All Representatives Database
- SQLite database file: "/cta/share/users/uniprot/uniref/uniref50_representatives.db"
- Only 1 table: **uniref50_representatives**, created from the UniRef50 FASTA file downloaded on 19.03.2025
  - ['id', 'unique_identifier', 'cluster_name', 'members', 'taxon', 'representative_member', 'sequence']
  - Contains ~70M protein sequences.

# UniProt Human Taxonomy Database

- SQLite database file: "/cta/share/users/uniprot/human/human.db"
- Main Tables:
  - **proteins (205104):** UniProt entries for the human taxon.
    - ['Entry', 'Reviewed', 'Entry Name', 'Protein names', 'Gene Names', 'Organism', 'Organism (ID)', 'Gene Ontology (GO)', 'Gene Ontology (biological process)', 'Gene Ontology (cellular component)', 'Gene Ontology (molecular function)', 'Subcellular location [CC]', 'Intramembrane', 'Topological domain', 'Transmembrane', 'Coiled coil', 'Compositional bias', 'Domain [CC]', 'Domain [FT]', 'Motif', 'Protein families', 'Region', 'Repeat', 'Zinc finger', 'Length', 'Sequence']
  - **ted_entries (250405):** TED entries for the proteins in the **proteins** table.
    - API access: https://ted.cathdb.info/access - https://ted.cathdb.info/api/v1/uniprot/summary/{uniprot_id}
    - script: create_ted_db.py
    - ['ted_id', 'uniprot_acc', 'md5_domain', 'consensus_level', 'chopping', 'nres_domain', 'num_segments', 'plddt', 'num_helix_strand_turn', 'num_helix', 'num_strand', 'num_helix_strand', 'num_turn', 'proteome_id', 'cath_label', 'cath_assignment_level', 'cath_assignment_method', 'packing_density', 'norm_rg', 'tax_common_name', 'tax_scientific_name']
  - **uniref50:** Stored in a separate database, '/cta/share/users/uniprot/uniref/uniref.db'. Created from '/cta/share/users/uniprot/uniref/uniref50.xml', downloaded from https://www.uniprot.org/help/downloads. Includes all proteins of UniRef50 (~30GB).
    - ['uniprot_id', 'uniprot_accession', 'uniparc_id', 'is_representative', 'representative_member', 'taxon_id', 'common_taxon_id']
  - **uniref90:** Stored in a separate database, '/cta/share/users/uniprot/uniref/uniref90.db'. Created from '/cta/share/users/uniprot/uniref/uniref90.xml', downloaded from https://www.uniprot.org/help/downloads. Includes all proteins of UniRef90 (~60GB).
    - ['uniprot_id', 'uniprot_accession', 'uniparc_id', 'is_representative', 'representative_member', 'taxon_id', 'common_taxon_id']

---

- Summary Tables
  - **uniprot_features (694082):** UniProt features extracted from the **proteins** table.
    - ['uniprot_id', 'type', 'start_index', 'end_index', 'note', 'evidence']
  - **interpro_entries (1624342):** InterPro entries for the proteins in the **proteins** table.
    - Distilled from the protein2ipr.dat.gz file (https://www.ebi.ac.uk/interpro/download/)
    - ['uniprot_id', 'interpro_id', 'description', 'cross_db_id', 'start_index', 'end_index']
  - **interpro_entries_v2 (1159453):** InterPro entries for human UniProt IDs, collected from the https://www.ebi.ac.uk/interpro/api/entry/interpro/protein/uniprot/{uniprot_id} API with the collect_interpro_human.py script.
    - ['uniprot_id', 'interpro_id', 'description', 'type', 'member_databases', 'start_index', 'end_index']
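  - **ted_entries_summary (288517):** Summary of the **ted_entries** table, with one row per domain segment: each `chopping` value is split into its segments and stored as `start_index` / `end_index`.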
    - ['uniprot_id', 'ted_id', 'consensus_level', 'plddt', 'cath_label', 'start_index', 'end_index']
  - **uniprot_quickgo_annotations (1296718)**: QuickGO annotations for the proteins in the **proteins** table. Collected using https://www.ebi.ac.uk/QuickGO/services/annotation/search.
    - ['uniprot_id', 'go_id', 'go_name', 'go_aspect', 'go_evidence', 'evidence_code', 'qualifier', 'assigned_by', 'date_created']
  - **interpro_go_mapping (30204)**: GO mappings of InterPro entries obtained from https://current.geneontology.org/ontology/external2go/interpro2go.
    - ['interpro_id', 'interpro_description', 'go_id', 'go_name']
  - **prosite_entries (620075)**: PROSITE entries for UniProt entries, obtained from https://ftp.expasy.org/databases/prosite/prosite_alignments.tar.gz.
    - ['prosite_id', 'uniprot_id', 'prosite_name', 'uniprot_name', 'sequence_start', 'sequence_end', 'score', 'sequence', 'aligned_sequence']
  - **uniref50_all:** UniRef50 proteins with taxon_id=9606. Filtered from the **uniref50** table of **uniref.db**.
    - ['uniprot_id', 'uniprot_accession', 'uniparc_id', 'is_representative', 'representative_member', 'taxon_id', 'common_taxon_id']
  - **uniref50_distilled (70901):** One human protein per UniRef50 cluster, distilled from **uniref50_all**. Rows without a uniprot_id and isoform accessions (those containing '-') are dropped first. If the cluster's representative member is among the remaining human proteins, that protein is kept; otherwise the first remaining member of the cluster is kept.
    - ['uniprot_id', 'uniprot_accession', 'uniparc_id', 'is_representative', 'representative_member', 'taxon_id', 'common_taxon_id']
  - **uniref90_all:** UniRef90 proteins with taxon_id=9606. Filtered from the **uniref90** table of **uniref90.db**.
    - ['uniprot_id', 'uniprot_accession', 'uniparc_id', 'is_representative', 'representative_member', 'taxon_id', 'common_taxon_id']
  - **uniref90_distilled (94019):** One human protein per UniRef90 cluster, distilled from **uniref90_all**. If the cluster's representative member is a human protein, that protein is kept; otherwise the first member of the cluster is kept.
    - ['uniprot_id', 'uniprot_accession', 'uniparc_id', 'is_representative', 'representative_member', 'taxon_id', 'common_taxon_id']
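
The "one protein per cluster" rule described for the distilled tables can be sketched on toy data as follows. This is an illustrative, vectorised variant and not the notebook's own step-by-step code; the function name `distill_one_per_cluster` and the accessions `ACC1` to `ACC4` are made up for the example. It assumes, as in the tables above, that `representative_member` identifies the cluster and that `is_representative` is stored as the strings `'True'` / `'False'`.

```python
import pandas as pd

def distill_one_per_cluster(df):
    """Keep one row per cluster: the representative if present, else the first member."""
    is_rep = df["is_representative"] == "True"  # flags are stored as strings
    # A stable sort moves representatives to the front while keeping the
    # original order within each group, so "first member" stays well defined.
    ranked = df.assign(_not_rep=~is_rep).sort_values("_not_rep", kind="stable")
    distilled = ranked.drop_duplicates(subset="representative_member", keep="first")
    return distilled.drop(columns="_not_rep").reset_index(drop=True)

# Toy input: cluster X has a representative, cluster Y does not.
toy = pd.DataFrame({
    "uniprot_accession": ["ACC1", "ACC2", "ACC3", "ACC4"],
    "is_representative": ["False", "True", "False", "False"],
    "representative_member": ["UniRef50_X", "UniRef50_X", "UniRef50_Y", "UniRef50_Y"],
})
distilled = distill_one_per_cluster(toy)
```

Here cluster X contributes its representative `ACC2`, and cluster Y, which has no representative in the input, contributes its first member `ACC3`.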
  

In [1]:
import sqlite3
import pandas as pd
import argparse
import random
from Bio import SeqIO
import pickle
import os
import json
from tqdm import tqdm
import ast
import xml.etree.ElementTree as ET

### Sample Queries

In [2]:
# Connect to DB
db_file = "/cta/share/users/uniprot/human/human.db"
conn = sqlite3.connect(db_file)

In [None]:
pd.read_sql(f"SELECT name FROM sqlite_master WHERE type='table'", conn)

Unnamed: 0,name
0,interpro_entries
1,ted_entries
2,uniprot_features
3,interpro_entries_v2
4,ted_entries_summary
5,uniref50_all
6,uniref90_all
7,uniref50_distilled
8,uniref90_distilled
9,uniref50_domain_sliced_plddt70
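
Once the table names are known, the column names of a table can be checked against the schema lists at the top of this notebook. A minimal sketch, using SQLite's `PRAGMA table_info` on an in-memory stand-in database; the helper name `list_columns` and the three-column demo table are assumptions for illustration only.

```python
import sqlite3

def list_columns(conn, table_name):
    """Return the column names of a table via SQLite's PRAGMA table_info."""
    # PRAGMA statements do not accept bound parameters, so the identifier
    # is double-quoted manually (embedded quotes are doubled).
    quoted = '"' + table_name.replace('"', '""') + '"'
    # Each result row is (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in conn.execute(f"PRAGMA table_info({quoted})")]

# In-memory stand-in with a reduced version of the ted_entries_summary schema.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE ted_entries_summary (uniprot_id TEXT, ted_id TEXT, start_index INTEGER)"
)
columns = list_columns(demo_conn, "ted_entries_summary")
```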


In [16]:
pd.read_sql(f"SELECT * FROM proteins", conn)

Unnamed: 0,Entry,Reviewed,Entry Name,Protein names,Gene Names,Organism,Organism (ID),Gene Ontology (GO),Gene Ontology (biological process),Gene Ontology (cellular component),...,Compositional bias,Domain [CC],Domain [FT],Motif,Protein families,Region,Repeat,Zinc finger,Length,Sequence
0,A0A024R1X5,unreviewed,A0A024R1X5_HUMAN,Beclin-1,BECN1 hCG_16958,Homo sapiens (Human),9606,autophagosome [GO:0005776]; cytosol [GO:000582...,amyloid-beta metabolic process [GO:0050435]; a...,autophagosome [GO:0005776]; cytosol [GO:000582...,...,,,"DOMAIN 105..129; /note=""Beclin-1 BH3""; /eviden...",,Beclin family,"REGION 48..72; /note=""Disordered""; /evidence=""...",,,450,MEGSKTSNNSTMQVSFVCQRCSQPLKLDTSFKILDRVTIQELTAPL...
1,A0A024R274,unreviewed,A0A024R274_HUMAN,Mothers against decapentaplegic homolog (MAD h...,SMAD4,Homo sapiens (Human),9606,centrosome [GO:0005813]; cytosol [GO:0005829];...,adrenal gland development [GO:0030325]; atriov...,centrosome [GO:0005813]; cytosol [GO:0005829];...,...,,,,,Dwarfin/SMAD family,,,,552,MDNMSITNTPTSNDACLSIVHSLMCHRQGGESETFAKRAIESLVKK...
2,A0A024R324,unreviewed,A0A024R324_HUMAN,Transforming protein RhoA,RHOA hCG_20136,Homo sapiens (Human),9606,cleavage furrow [GO:0032154]; cytosol [GO:0005...,alpha-beta T cell lineage commitment [GO:00023...,cleavage furrow [GO:0032154]; cytosol [GO:0005...,...,,,,,"Small GTPase superfamily, Rho family",,,,193,MAAIRKKLVIVGDGACGKTCLLIVFSKDQFPEVYVPTVFENYVADI...
3,A0A024R6A3,unreviewed,A0A024R6A3_HUMAN,Presenilin (EC 3.4.23.-),PSEN1,Homo sapiens (Human),9606,cell cortex [GO:0005938]; cell junction [GO:00...,amyloid-beta formation [GO:0034205]; apoptotic...,cell cortex [GO:0005938]; cell junction [GO:00...,...,,DOMAIN: The PAL motif is required for normal a...,,,Peptidase A22A family,,,,467,MTELPAPLSYFQNAQMSEDNHLSNTVRSQNDNRERQEHNDRRSLGH...
4,A0A024R7I7,unreviewed,A0A024R7I7_HUMAN,Ras-related protein Rab-3,,Homo sapiens (Human),9606,acrosomal vesicle [GO:0001669]; endosome [GO:0...,axonogenesis [GO:0007409]; calcium-ion regulat...,acrosomal vesicle [GO:0001669]; cytosol [GO:00...,...,,,,,"Small GTPase superfamily, Rab family",,,,220,MASATDSRYGQKESSDQNFDYMFKILIIGNSSVGKTSFLFRYADDS...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
205099,X6RLT1,unreviewed,X6RLT1_HUMAN,Negative elongation factor complex member C/D,NELFCD,Homo sapiens (Human),9606,nucleus [GO:0005634]; negative regulation of D...,negative regulation of DNA-templated transcrip...,nucleus [GO:0005634],...,"COMPBIAS 28..44; /note=""Acidic residues""; /evi...",,,,NELF-D family,"REGION 16..47; /note=""Disordered""; /evidence=""...",,,593,XEGMAGAVPGAIMDEDYYGSAAEWGDEADGGQQEDDSGEGEDDAEV...
205100,X6RLU5,unreviewed,X6RLU5_HUMAN,Calcium voltage-gated channel auxiliary subuni...,CACNA2D4,Homo sapiens (Human),9606,,,,...,,,"DOMAIN 111..189; /note=""Voltage-dependent calc...",,,,,,192,XELVREVLFDAVVTAPMEAYWTALALNMSEESEHVVDMAFLGTRAG...
205101,X6RLV5,unreviewed,X6RLV5_HUMAN,DEAD-box helicase 5,DDX5,Homo sapiens (Human),9606,,,,...,"COMPBIAS 1..16; /note=""Basic and acidic residu...",,,,,"REGION 1..39; /note=""Disordered""; /evidence=""E...",,,96,MSGYSSDRDRGRDRGFGAPRFGGSRAGPLSGKKFGNPGEKLVKKKW...
205102,X6RLY7,unreviewed,X6RLY7_HUMAN,Calcium voltage-gated channel auxiliary subuni...,CACNA2D4,Homo sapiens (Human),9606,,,,...,,,"DOMAIN 3..255; /note=""Voltage-dependent calciu...",,,,,,282,MKLEFLQRKFWAATRQCSTVDGPCTQSCEDSDLDCFVIDNNGFILI...


In [None]:
# Sequences of the distilled UniRef50 human proteins, keeping only those shorter than 3000 residues
df_protein = pd.read_sql("""SELECT Entry as uniprot_id, Sequence as sequence
                          FROM proteins
                          WHERE Entry IN (SELECT uniprot_accession FROM uniref50_distilled)""", conn)
df_protein = df_protein[df_protein['sequence'].str.len() < 3000].reset_index(drop=True)

# Rows of uniref50_domain_sliced_plddt70, restricted to the proteins that passed the length filter
df_protein_sliced = pd.read_sql("SELECT uniprot_id, sequence FROM uniref50_domain_sliced_plddt70", conn)
df_protein_sliced = df_protein_sliced[df_protein_sliced['uniprot_id'].isin(df_protein['uniprot_id'])].reset_index(drop=True)
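
The length filter can also be pushed into the SQL query, so that long sequences are never loaded into memory. The sketch below is an alternative to the pandas filter, not what the notebook runs; it uses an in-memory database with made-up entries `P1` to `P3`, and relies on SQLite's built-in `LENGTH` function and a bound parameter for the cutoff.

```python
import sqlite3
import pandas as pd

# In-memory stand-ins for the proteins and uniref50_distilled tables.
demo_conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "Entry": ["P1", "P2", "P3"],
    "Sequence": ["MKT", "M" * 3000, "MAGL"],
}).to_sql("proteins", demo_conn, index=False)
pd.DataFrame({"uniprot_accession": ["P1", "P2"]}).to_sql(
    "uniref50_distilled", demo_conn, index=False
)

# Same selection as the pandas version, with the length cutoff applied in SQL.
query = """
    SELECT Entry AS uniprot_id, Sequence AS sequence
    FROM proteins
    WHERE Entry IN (SELECT uniprot_accession FROM uniref50_distilled)
      AND LENGTH(Sequence) < ?
"""
df_short = pd.read_sql(query, demo_conn, params=(3000,))
```

In this toy data `P2` is dropped because its sequence is exactly 3000 residues long, and `P3` is dropped because it is not in the distilled table.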

In [19]:
df_protein

Unnamed: 0,uniprot_id,sequence
0,A0A087WZT3,MELSAEYLREKLQRDLEAEHVLPSPGGVGQVRGETAASETQLGS
1,A0A087X1C5,MGLEALVPLAMIVAIFLLLVDLMHRHQRWAARYPPGPLPLPGLGNL...
2,A0A087X296,MSRSLLLWFLLFLLLLPPLPVLLADPGAPTPVNPCCYYPCQHQGIC...
3,A0A0B4J2F0,MFRRLTFAQLLFATVLGIAGGVYIFQPVFEQYAKDQKELKEKMQLV...
4,A0A0C5B5G6,MRWQEMGYIFYPRKLR
...,...,...
70687,X6RL83,MLQEWLAAVGDDYAAVVWRPEGEPRFYPDEEGPKHWTKERHQFLME...
70688,X6RLN4,EVKGLFKSENCPKVISCEFAHNSNWYITFQSDTDAQQAFKYLREEV...
70689,X6RLR1,MAGLTDLQRLQARVEELERWVYGPGGARGSRKVADGLVKVQVALGN...
70690,X6RLV5,MSGYSSDRDRGRDRGFGAPRFGGSRAGPLSGKKFGNPGEKLVKKKW...


In [15]:
conn.close()

# TED Entries Summary


In [7]:
# Assumes `conn` is an open connection to human.db (reconnect first if it was closed above)
df_ted = pd.read_sql("SELECT * FROM ted_entries", conn)

In [8]:
df_ted.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250405 entries, 0 to 250404
Data columns (total 21 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   ted_id                  250405 non-null  object 
 1   uniprot_acc             250405 non-null  object 
 2   md5_domain              250405 non-null  object 
 3   consensus_level         250405 non-null  object 
 4   chopping                250405 non-null  object 
 5   nres_domain             250405 non-null  int64  
 6   num_segments            250405 non-null  int64  
 7   plddt                   250405 non-null  float64
 8   num_helix_strand_turn   250405 non-null  int64  
 9   num_helix               250405 non-null  int64  
 10  num_strand              250405 non-null  int64  
 11  num_helix_strand        250405 non-null  int64  
 12  num_turn                250405 non-null  int64  
 13  proteome_id             250405 non-null  int64  
 14  cath_label          

In [14]:
df_ted['num_segments'].sum()

np.int64(288517)

In [40]:
# Keep the summary columns and rename the accession column to match the other tables
df_ted_summary = df_ted[['uniprot_acc', 'ted_id', 'chopping', 'consensus_level', 'plddt', 'cath_label']].rename(columns={'uniprot_acc': 'uniprot_id'})
# 'chopping' lists a domain's segments as 'start-end' pairs joined by '_'; emit one row per segment
df_ted_summary['chopping'] = df_ted_summary['chopping'].str.split('_')
df_ted_summary = df_ted_summary.explode('chopping', ignore_index=True)
# Split each 'start-end' segment into integer start/end columns
df_ted_summary[['start_index', 'end_index']]  = df_ted_summary['chopping'].str.split('-', expand=True)
df_ted_summary = df_ted_summary.drop(columns=['chopping'])
df_ted_summary['start_index'] = df_ted_summary['start_index'].astype(int)
df_ted_summary['end_index'] = df_ted_summary['end_index'].astype(int)
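
As a worked example of the segment expansion, the sketch below applies the same split/explode steps to a single toy row. The `chopping` string is reconstructed from the three summary rows shown for `AF-A0A024R274-F1-model_v4_TED02` in the output further down, so it is an assumption about the raw value rather than a value copied from **ted_entries**.

```python
import pandas as pd

# One multi-segment domain; segments are 'start-end' pairs joined by '_'.
toy = pd.DataFrame({
    "ted_id": ["AF-A0A024R274-F1-model_v4_TED02"],
    "chopping": ["288-296_315-442_494-539"],
})

# One row per segment.
segments = toy.assign(chopping=toy["chopping"].str.split("_")).explode(
    "chopping", ignore_index=True
)
# Split 'start-end' into two integer columns.
segments[["start_index", "end_index"]] = (
    segments["chopping"].str.split("-", expand=True).astype(int)
)
segments = segments.drop(columns=["chopping"])
```

The single input row becomes three rows that share the same `ted_id`, which is why **ted_entries_summary** has more rows than **ted_entries**.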

In [57]:
df_ted_summary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288517 entries, 0 to 288516
Data columns (total 7 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   uniprot_id       288517 non-null  object 
 1   ted_id           288517 non-null  object 
 2   consensus_level  288517 non-null  object 
 3   plddt            288517 non-null  float64
 4   cath_label       288517 non-null  object 
 5   start_index      288517 non-null  int64  
 6   end_index        288517 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 15.4+ MB


In [66]:
df_ted_summary

Unnamed: 0,uniprot_id,ted_id,consensus_level,plddt,cath_label,start_index,end_index
0,A0A024R1X5,AF-A0A024R1X5-F1-model_v4_TED01,high,83.2282,1.10.418.40,278,448
1,A0A024R274,AF-A0A024R274-F1-model_v4_TED01,high,93.7552,3.90.520.10,15,134
2,A0A024R274,AF-A0A024R274-F1-model_v4_TED02,high,95.0662,2.60.200.10,288,296
3,A0A024R274,AF-A0A024R274-F1-model_v4_TED02,high,95.0662,2.60.200.10,315,442
4,A0A024R274,AF-A0A024R274-F1-model_v4_TED02,high,95.0662,2.60.200.10,494,539
...,...,...,...,...,...,...,...
288512,X6RLL4,AF-X6RLL4-F1-model_v4_TED02,medium,88.1399,-,149,296
288513,X6RLN4,AF-X6RLN4-F1-model_v4_TED01,high,92.1171,-,3,54
288514,X6RLR1,AF-X6RLR1-F1-model_v4_TED01,medium,81.0586,1.20.5,118,168
288515,X6RLY7,AF-X6RLY7-F1-model_v4_TED01,high,92.0818,3.30.450,33,89


In [64]:
# Write the DataFrame to the SQLite database
# df_ted_summary.to_sql("ted_entries_summary", conn, if_exists="replace", index=False)

288517

# UniProt Features

## uniprot_features table creation

Example raw values from each feature column of the **proteins** table:

- Intramembrane
  - INTRAMEM 362..373; /note="Helical; Name=Pore helix"; /evidence="ECO:0000250|UniProtKB:P63142"; INTRAMEM 374..381; /evidence="ECO:0000250|UniProtKB:P63142"
- Topological domain
  - TOPO_DOM 30..697; /note="Extracellular"; /evidence="ECO:0000255"; TOPO_DOM 719..950; /note="Cytoplasmic"; /evidence="ECO:0000255"  
- Transmembrane
  - TRANSMEM 30..56; /note="Helical"; /evidence="ECO:0000256|RuleBase:RU362117"; TRANSMEM 77..98; /note="Helical"; /evidence="ECO:0000256|RuleBase:RU362117"; 
- Coiled coil
  - COILED 3..63; /evidence="ECO:0000256|SAM:Coils"; COILED 256..290; /evidence="ECO:0000256|SAM:Coils" 
- Compositional bias
  - COMPBIAS 1..15; /note="Basic and acidic residues"; /evidence="ECO:0000256|SAM:MobiDB-lite" 
- Domain [FT]
  - DOMAIN 8..82; /note="MHC class II beta chain N-terminal"; /evidence="ECO:0000259|SMART:SM00921"
- Motif
  - MOTIF 45..73; /note="Q motif"; /evidence="ECO:0000256|PROSITE-ProRule:PRU00552"  
- Region
  - REGION 335..362; /note="Disordered"; /evidence="ECO:0000256|SAM:MobiDB-lite"
- Repeat
  - REPEAT 49..80; /note="WD"; /evidence="ECO:0000256|PROSITE-ProRule:PRU00221"
- Zinc finger
  - ZN_FING 2..29; /note="C3H1-type"; /evidence="ECO:0000256|PROSITE-ProRule:PRU00723"
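
The feature strings above all follow the pattern `IDENTIFIER start..end; /note="..."; /evidence="..."`, where both attributes are optional and a note may itself contain `; `. A minimal, self-contained sketch of parsing one such string with `re.finditer` is shown here; it is an illustrative alternative with a hypothetical function name (`parse_feature_string`), not the function used to build the table.

```python
import re

def parse_feature_string(text, identifier):
    """Parse 'IDENT a..b; /note="..."; /evidence="..."; IDENT c..d; ...' into dicts."""
    ident = re.escape(identifier)
    # Capture the location, then lazily take everything up to the next
    # '; IDENT <digit>' marker (or the end of the string) as the attributes.
    pattern = re.compile(rf'{ident} (\d+)\.\.(\d+)(.*?)(?=; {ident} \d|$)')
    rows = []
    for match in pattern.finditer(text):
        attributes = match.group(3)
        note = re.search(r'/note="(.*?)"', attributes)
        evidence = re.search(r'/evidence="(.*?)"', attributes)
        rows.append({
            "start_index": int(match.group(1)),
            "end_index": int(match.group(2)),
            "note": note.group(1) if note else None,
            "evidence": evidence.group(1) if evidence else None,
        })
    return rows

# The Intramembrane example from the list above.
example = (
    'INTRAMEM 362..373; /note="Helical; Name=Pore helix"; '
    '/evidence="ECO:0000250|UniProtKB:P63142"; '
    'INTRAMEM 374..381; /evidence="ECO:0000250|UniProtKB:P63142"'
)
rows = parse_feature_string(example, "INTRAMEM")
```

The first entry keeps its full note, including the embedded semicolon, and the second entry has no note at all. Locations that are not plain `start..end` integer ranges are not matched by this sketch.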

In [141]:
import pandas as pd
import re

column_name_to_identifier = {
    'Intramembrane': 'INTRAMEM',
    'Topological domain': 'TOPO_DOM',
    'Transmembrane': 'TRANSMEM',
    'Coiled coil': 'COILED',
    'Compositional bias': 'COMPBIAS',
    'Domain [FT]': 'DOMAIN',
    'Motif': 'MOTIF',
    'Region': 'REGION',
    'Repeat': 'REPEAT',
    'Zinc finger': 'ZN_FING'
}

def uniprot_structure_to_dict(row):
    """Parse every feature column of one proteins row into a flat list of feature dicts."""
    structure_list_of_dicts = []
    for col_name, identifier in column_name_to_identifier.items():
        # Skip empty cells (None or NaN)
        if pd.notna(row[col_name]):
            result = _uniprot_structure_string_to_dict(row[col_name], row['Entry'], col_name, identifier)
            structure_list_of_dicts.extend(result)
    return structure_list_of_dicts

def _uniprot_structure_string_to_dict(input_string, uniprot_id, col_name, identifier):
    # Split on each '{identifier} start..end; ' location marker. The capture group keeps the
    # markers, so odd indices hold locations and the following even indices hold their attributes.
    segments = re.split(rf'({identifier} \d+\.\.\d+); ', input_string)

    # Parse the segments into a structured list
    entries = []
    for i in range(1, len(segments), 2):
        location = segments[i]
        attributes = segments[i + 1] if i + 1 < len(segments) else ''

        # Extract start and end indices
        match = re.match(rf'{identifier} (\d+)\.\.(\d+)', location)
        start_index, end_index = match.groups() if match else (None, None)

        # Extract note and evidence
        note_match = re.search(r'/note="(.*?)"', attributes)
        evidence_match = re.search(r'/evidence="(.*?)"', attributes)
        note = note_match.group(1) if note_match else None
        evidence = evidence_match.group(1) if evidence_match else None

        # Append the parsed entry
        entries.append({
            'uniprot_id': uniprot_id,
            'type': col_name,
            'start_index': int(start_index),
            'end_index': int(end_index),
            'note': note,
            'evidence': evidence
        })
    return entries

# Example input
# input_string = 'INTRAMEM 362..373; /note="Helical; Name=Pore helix"; /evidence="ECO:0000250|UniProtKB:P63142"; INTRAMEM 374..381; /evidence="ECO:0000250|UniProtKB:P63142"'
# df_proteins is assumed to hold the full proteins table (e.g. loaded with SELECT * FROM proteins)
input_string = df_proteins['Region'].dropna().iloc[3]
pd.DataFrame(_uniprot_structure_string_to_dict(input_string, 'AA', 'Region', 'REGION'))

Unnamed: 0,uniprot_id,type,start_index,end_index,note,evidence
0,AA,Region,102,189,Disordered,ECO:0000256|SAM:MobiDB-lite
1,AA,Region,295,396,Disordered,ECO:0000256|SAM:MobiDB-lite
2,AA,Region,825,1081,Disordered,ECO:0000256|SAM:MobiDB-lite
3,AA,Region,1095,1587,Disordered,ECO:0000256|SAM:MobiDB-lite


In [None]:
uniprot_structure_dict_list = df_proteins.apply(uniprot_structure_to_dict, axis=1)
df_uniprot_structures = pd.json_normalize(uniprot_structure_dict_list.explode())
df_uniprot_structures = df_uniprot_structures.dropna(subset=['uniprot_id'], ignore_index=True)
df_uniprot_structures['start_index'] = df_uniprot_structures['start_index'].astype('int')
df_uniprot_structures['end_index'] = df_uniprot_structures['end_index'].astype('int')
df_uniprot_structures

Unnamed: 0,uniprot_id,type,start_index,end_index,note,evidence
0,A0A024R1X5,Coiled coil,145,267,,ECO:0000256|SAM:Coils
1,A0A024R1X5,Domain [FT],105,129,Beclin-1 BH3,ECO:0000259|Pfam:PF15285
2,A0A024R1X5,Domain [FT],135,261,Atg6/beclin coiled-coil,ECO:0000259|Pfam:PF17675
3,A0A024R1X5,Domain [FT],264,445,Atg6 BARA,ECO:0000259|Pfam:PF04111
4,A0A024R1X5,Region,48,72,Disordered,ECO:0000256|SAM:MobiDB-lite
...,...,...,...,...,...,...
694077,X6RLU5,Domain [FT],111,189,Voltage-dependent calcium channel alpha-2/delt...,ECO:0000259|Pfam:PF08473
694078,X6RLV5,Compositional bias,1,16,Basic and acidic residues,ECO:0000256|SAM:MobiDB-lite
694079,X6RLV5,Region,1,39,Disordered,ECO:0000256|SAM:MobiDB-lite
694080,X6RLY7,Domain [FT],3,255,Voltage-dependent calcium channel alpha-2/delt...,ECO:0000259|Pfam:PF08473


In [144]:
# Write the DataFrame to the SQLite database
# df_uniprot_structures.to_sql("uniprot_features", conn, if_exists="replace", index=False)

694082

In [87]:
df_uniprot_structures

Unnamed: 0,uniprot_id,type,start_index,end_index,note,evidence
0,A0A024R1X5,Coiled coil,145,267,,ECO:0000256|SAM:Coils
1,A0A024R1X5,Domain [FT],105,129,Beclin-1 BH3,ECO:0000259|Pfam:PF15285
2,A0A024R1X5,Domain [FT],135,261,Atg6/beclin coiled-coil,ECO:0000259|Pfam:PF17675
3,A0A024R1X5,Domain [FT],264,445,Atg6 BARA,ECO:0000259|Pfam:PF04111
4,A0A024R1X5,Region,48,72,Disordered,ECO:0000256|SAM:MobiDB-lite
...,...,...,...,...,...,...
722173,X6RLU5,Domain [FT],111,189,Voltage-dependent calcium channel alpha-2/delt...,ECO:0000259|Pfam:PF08473
722174,X6RLV5,Compositional bias,1,16,Basic and acidic residues,ECO:0000256|SAM:MobiDB-lite
722175,X6RLV5,Region,1,39,Disordered,ECO:0000256|SAM:MobiDB-lite
722176,X6RLY7,Domain [FT],3,255,Voltage-dependent calcium channel alpha-2/delt...,ECO:0000259|Pfam:PF08473


# InterPro Entries V2

InterPro entries for UniProt IDs, collected from https://www.ebi.ac.uk/interpro/api/entry/interpro/protein/uniprot/{uniprot_id} with the collect_interpro_human.py script.

In [156]:
# Extract relevant data
def extract_data(uniprot_id, data):
    # Return an empty list (not None) for empty responses so callers can safely extend() with it
    if not data or "results" not in data:
        return []
    extracted_data = []
    for result in data["results"]:
        metadata = result.get("metadata", {})
        proteins = result.get("proteins", [])
        fragments = []
        for protein in proteins:
            for location in protein.get("entry_protein_locations", []):
                fragments.extend(location.get("fragments", []))
        extracted_data.append({
            "uniprot_id": uniprot_id,
            "interpro_id": metadata.get("accession"),
            "description": metadata.get("name"),
            # "source_database": metadata.get("source_database"),
            "type": metadata.get("type"),
            "member_databases": str(metadata.get("member_databases")),
            "fragments": str(fragments)
        })
    return extracted_data

In [157]:
interpro_entries_v2_list = []

interpro_entries_json_dir = "/cta/share/users/uniprot/human/interpro_entries"
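for filename in tqdm(os.listdir(interpro_entries_json_dir)):
    with open(os.path.join(interpro_entries_json_dir, filename), 'r') as fp:
        interpro_entry = json.load(fp)
    # The file name without its 5-character extension is the UniProt ID
    uniprot_id = filename[:-5]
    # `or []` guards against an empty/None result for proteins without InterPro entries
    interpro_entries_v2_list.extend(extract_data(uniprot_id, interpro_entry) or [])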
df_interpro_entries_v2 = pd.DataFrame(interpro_entries_v2_list)

100%|██████████| 181639/181639 [00:10<00:00, 16539.51it/s]


In [158]:
df_interpro_entries_v2['member_databases'] = df_interpro_entries_v2['member_databases'].apply(ast.literal_eval)
df_interpro_entries_v2['member_databases'] = df_interpro_entries_v2['member_databases'].apply(lambda data: [list(inner_dict.keys())[0] for inner_dict in data.values()])
df_interpro_entries_v2['member_databases'] = df_interpro_entries_v2['member_databases'].apply(str) # to be able to write to db
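
The lambda above implies that `member_databases` is a dict of dicts: the outer keys are member database names and each inner dict is keyed by signature accession. The toy example below illustrates the flattening under that assumption. The database names `profile` and `pfam` and the inner values are made up for the example, the two accessions are taken from the output table further down, and the function name `first_signature_per_db` is hypothetical.

```python
def first_signature_per_db(member_databases):
    """Return the first signature accession listed under each member database."""
    if not member_databases:
        # Unlike the notebook's lambda, this sketch tolerates a missing value.
        return []
    return [next(iter(inner)) for inner in member_databases.values()]

# Assumed shape: {database name: {signature accession: description}}
example = {
    "profile": {"PS50268": "Cadherin"},
    "pfam": {"PF00028": "Cadherin"},
}
signatures = first_signature_per_db(example)
```

Only the first accession of each inner dict is kept, so a member database that lists several signatures for one InterPro entry contributes a single accession.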

In [191]:
df_interpro_entries_v2['fragments'] = df_interpro_entries_v2['fragments'].apply(ast.literal_eval)
df_interpro_entries_v2 = df_interpro_entries_v2.explode('fragments', ignore_index=True)
fragments_df = pd.json_normalize(df_interpro_entries_v2['fragments'])
fragments_df = fragments_df[['start','end']].rename(columns={'start':'start_index', 'end':'end_index'})
df_interpro_entries_v2 = pd.concat([df_interpro_entries_v2.drop(columns=['fragments']), fragments_df], axis=1)
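
The fragment flattening can be followed on a small example. The sketch below repeats the explode / `json_normalize` / `concat` steps on a toy frame whose coordinates are taken from the first rows of the output table below; the fragment dicts are reduced to `start` and `end`, which is an assumption about the minimum the API returns.

```python
import pandas as pd

# Two InterPro entries for one protein; the first has two fragments.
toy = pd.DataFrame({
    "uniprot_id": ["B4DNH0", "B4DNH0"],
    "interpro_id": ["IPR002126", "IPR014868"],
    "fragments": [
        [{"start": 53, "end": 203}, {"start": 205, "end": 329}],
        [{"start": 27, "end": 116}],
    ],
})

# One row per fragment, with a fresh 0..n-1 index.
exploded = toy.explode("fragments", ignore_index=True)
# Turn the fragment dicts into start/end columns.
coords = pd.json_normalize(exploded["fragments"].tolist())[["start", "end"]].rename(
    columns={"start": "start_index", "end": "end_index"}
)
# Both frames share the same 0..n-1 index, so a column-wise concat lines them up.
flat = pd.concat([exploded.drop(columns=["fragments"]), coords], axis=1)
```

The `ignore_index=True` in `explode` matters: without it the exploded frame would keep duplicated index labels and the column-wise `concat` would no longer align row by row.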

In [225]:
df_interpro_entries_v2

Unnamed: 0,uniprot_id,interpro_id,description,type,member_databases,start_index,end_index
0,B4DNH0,IPR002126,Cadherin-like,domain,"['PS50268', 'SM00112', 'PF00028', 'PR00205']",53,203
1,B4DNH0,IPR002126,Cadherin-like,domain,"['PS50268', 'SM00112', 'PF00028', 'PR00205']",205,329
2,B4DNH0,IPR002126,Cadherin-like,domain,"['PS50268', 'SM00112', 'PF00028', 'PR00205']",319,429
3,B4DNH0,IPR014868,Cadherin prodomain,domain,"['SM01055', 'PF08758']",27,116
4,B4DNH0,IPR015919,Cadherin-like superfamily,homologous_superfamily,['SSF49313'],27,173
...,...,...,...,...,...,...,...
1159448,A0A0A7C699,IPR001039,"MHC class I alpha chain, alpha1 alpha2 domains",domain,['PR01638'],156,174
1159449,A0A0A7C699,IPR011161,MHC class I-like antigen recognition-like,domain,['PF16497'],1,178
1159450,A0A0A7C699,IPR011162,MHC classes I/II-like antigen recognition protein,homologous_superfamily,['SSF54452'],1,180
1159451,A0A0A7C699,IPR037055,MHC class I-like antigen recognition-like supe...,homologous_superfamily,['G3DSA:3.30.500.10'],1,180


In [199]:
# df_interpro_entries_v2.to_sql("interpro_entries_v2", conn, if_exists="replace", index=False)

1159453

# UniRef50 XML Parsing

Script: uniref50_xml_to_db.py

In [None]:
db_file = '/cta/share/users/uniprot/uniref/uniref.db'
conn = sqlite3.connect(db_file)
df_uniref50_human = pd.read_sql(f"SELECT * FROM uniref50 WHERE taxon_id='9606'", conn)
conn.close()

In [None]:
df_uniref50_human

Unnamed: 0,uniprot_id,uniprot_accession,uniparc_id,is_representative,representative_member,taxon_id,common_taxon_id
0,TITIN_HUMAN,Q8WZ42,UPI00025287CD,True,UniRef50_Q8WZ42,9606,1
1,Q8WZ42-8,Q8WZ42-8,UPI000255FF54,False,UniRef50_Q8WZ42,9606,1
2,Q8WZ42-2,Q8WZ42-2,UPI000255FF50,False,UniRef50_Q8WZ42,9606,1
3,Q8WZ42-7,Q8WZ42-7,UPI000255FF53,False,UniRef50_Q8WZ42,9606,1
4,C0JYZ2_HUMAN,C0JYZ2,UPI0001981E0D,False,UniRef50_Q8WZ42,9606,1
...,...,...,...,...,...,...,...
325191,V9H0A7_HUMAN,V9H0A7,UPI000011E266,True,UniRef50_V9H0A7,9606,9606
325192,Q16427_HUMAN,Q16427,UPI000006DE79,True,UniRef50_Q16427,9606,9606
325193,A6QL42_HUMAN,A6QL42,UPI0001574DCB,True,UniRef50_A6QL42,9606,9606
325194,Q16175_HUMAN,Q16175,UPI000006CDA4,True,UniRef50_Q16175,9606,9606


In [None]:
db_file = "/cta/share/users/uniprot/human/human.db"
conn = sqlite3.connect(db_file)
df_uniref50_human.to_sql("uniref50_all", conn, if_exists="replace", index=False)
conn.close()
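
The same filter-and-copy step can be done inside SQLite without loading the rows into pandas, by attaching the source database to the target connection. This is an alternative sketch, not what the notebook runs: two in-memory databases stand in for human.db and uniref.db, the table is reduced to two columns, and the non-human accession `P00000` is made up.

```python
import sqlite3

# Stand-in for human.db (the target database).
demo_conn = sqlite3.connect(":memory:")
# Stand-in for uniref.db; with real files, pass the file path instead of ':memory:'.
demo_conn.execute("ATTACH DATABASE ':memory:' AS src")

demo_conn.execute("CREATE TABLE src.uniref50 (uniprot_accession TEXT, taxon_id TEXT)")
demo_conn.executemany(
    "INSERT INTO src.uniref50 VALUES (?, ?)",
    [("Q8WZ42", "9606"), ("P00000", "10090"), ("C0JYZ2", "9606")],
)

# Equivalent of to_sql(..., if_exists="replace"): drop, then recreate from the filtered rows.
demo_conn.execute("DROP TABLE IF EXISTS main.uniref50_all")
demo_conn.execute(
    "CREATE TABLE main.uniref50_all AS "
    "SELECT * FROM src.uniref50 WHERE taxon_id = '9606'"
)
demo_conn.commit()

copied = [
    row[0]
    for row in demo_conn.execute(
        "SELECT uniprot_accession FROM uniref50_all ORDER BY rowid"
    )
]
```

For a source table of this size, keeping the copy inside SQLite avoids materialising the filtered rows as a DataFrame first, at the cost of not being able to inspect them in the notebook before writing.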

---

In [114]:
df_uniref50_human_uniprot = df_uniref50_human.dropna(subset=['uniprot_id'])
df_uniref50_human_uniprot

Unnamed: 0,uniprot_id,uniprot_accession,uniparc_id,is_representative,representative_member,taxon_id,common_taxon_id
0,TITIN_HUMAN,Q8WZ42,UPI00025287CD,True,UniRef50_Q8WZ42,9606,1
1,Q8WZ42-8,Q8WZ42-8,UPI000255FF54,False,UniRef50_Q8WZ42,9606,1
2,Q8WZ42-2,Q8WZ42-2,UPI000255FF50,False,UniRef50_Q8WZ42,9606,1
3,Q8WZ42-7,Q8WZ42-7,UPI000255FF53,False,UniRef50_Q8WZ42,9606,1
4,C0JYZ2_HUMAN,C0JYZ2,UPI0001981E0D,False,UniRef50_Q8WZ42,9606,1
...,...,...,...,...,...,...,...
325191,V9H0A7_HUMAN,V9H0A7,UPI000011E266,True,UniRef50_V9H0A7,9606,9606
325192,Q16427_HUMAN,Q16427,UPI000006DE79,True,UniRef50_Q16427,9606,9606
325193,A6QL42_HUMAN,A6QL42,UPI0001574DCB,True,UniRef50_A6QL42,9606,9606
325194,Q16175_HUMAN,Q16175,UPI000006CDA4,True,UniRef50_Q16175,9606,9606


In [115]:
df_uniref50_human_uniprot = df_uniref50_human_uniprot[~df_uniref50_human_uniprot['uniprot_accession'].str.contains('-')]
df_uniref50_human_uniprot

Unnamed: 0,uniprot_id,uniprot_accession,uniparc_id,is_representative,representative_member,taxon_id,common_taxon_id
0,TITIN_HUMAN,Q8WZ42,UPI00025287CD,True,UniRef50_Q8WZ42,9606,1
4,C0JYZ2_HUMAN,C0JYZ2,UPI0001981E0D,False,UniRef50_Q8WZ42,9606,1
6,H0Y4J7_HUMAN,H0Y4J7,UPI0032CBC510,False,UniRef50_Q8WZ42,9606,1
7,A2TKE3_HUMAN,A2TKE3,UPI0000F0A0C2,False,UniRef50_Q8WZ42,9606,1
8,H7C0U7_HUMAN,H7C0U7,UPI0032D05761,False,UniRef50_Q8WZ42,9606,1
...,...,...,...,...,...,...,...
325191,V9H0A7_HUMAN,V9H0A7,UPI000011E266,True,UniRef50_V9H0A7,9606,9606
325192,Q16427_HUMAN,Q16427,UPI000006DE79,True,UniRef50_Q16427,9606,9606
325193,A6QL42_HUMAN,A6QL42,UPI0001574DCB,True,UniRef50_A6QL42,9606,9606
325194,Q16175_HUMAN,Q16175,UPI000006CDA4,True,UniRef50_Q16175,9606,9606


In [116]:
df_representative_members = df_uniref50_human_uniprot[df_uniref50_human_uniprot['is_representative']=='True']
df_representative_members

Unnamed: 0,uniprot_id,uniprot_accession,uniparc_id,is_representative,representative_member,taxon_id,common_taxon_id
0,TITIN_HUMAN,Q8WZ42,UPI00025287CD,True,UniRef50_Q8WZ42,9606,1
61,MUC16_HUMAN,Q8WXI7,UPI000065CA24,True,UniRef50_Q8WXI7,9606,9526
72,A0AA34QVW0_HUMAN,A0AA34QVW0,UPI002645ACA0,True,UniRef50_A0AA34QVW0,9606,314293
73,MUC3B_HUMAN,Q9H195,UPI00257942D1,True,UniRef50_Q9H195,9606,9606
78,A0A1W2PS37_HUMAN,A0A1W2PS37,UPI00097BA564,True,UniRef50_A0A1W2PS37,9606,117571
...,...,...,...,...,...,...,...
325191,V9H0A7_HUMAN,V9H0A7,UPI000011E266,True,UniRef50_V9H0A7,9606,9606
325192,Q16427_HUMAN,Q16427,UPI000006DE79,True,UniRef50_Q16427,9606,9606
325193,A6QL42_HUMAN,A6QL42,UPI0001574DCB,True,UniRef50_A6QL42,9606,9606
325194,Q16175_HUMAN,Q16175,UPI000006CDA4,True,UniRef50_Q16175,9606,9606


In [117]:
uniref50_human_uniprot_distilled_list = df_representative_members.to_dict('records')

In [118]:
representative_member_list = list(df_representative_members['representative_member'])

In [120]:
not_reprsented_proteins = df_uniref50_human_uniprot[df_uniref50_human_uniprot['is_representative']=='False']['representative_member'].apply(
    lambda x: x not in representative_member_list)

In [121]:
df_uniref50_human_uniprot[
    df_uniref50_human_uniprot['is_representative']=='False'
    ][not_reprsented_proteins].drop_duplicates(subset=['representative_member'])

Unnamed: 0,uniprot_id,uniprot_accession,uniparc_id,is_representative,representative_member,taxon_id,common_taxon_id
52,A0A0A0MRA3_HUMAN,A0A0A0MRA3,UPI0004621223,False,UniRef50_Q8WZ42-9,9606,117571
81,Q8WXY3_HUMAN,Q8WXY3,UPI000006DB3D,False,UniRef50_A0A091E0S6,9606,1437010
90,H0YKM8_HUMAN,H0YKM8,UPI00022F8581,False,UniRef50_A0AAV6PQT6,9606,117571
271,Q8WXY4_HUMAN,Q8WXY4,UPI000006DDCA,False,UniRef50_Q9QXZ0,9606,32524
275,Q86T18_HUMAN,Q86T18,UPI000019A3BD,False,UniRef50_A0A2J8SSR4,9606,9604
...,...,...,...,...,...,...,...
324322,A0A075B6W7_HUMAN,A0A075B6W7,UPI00021CF13D,False,UniRef50_A0A8D2ATL9,9606,1437010
324487,C1KEM4_HUMAN,C1KEM4,UPI0001998A5D,False,UniRef50_C1KEM2,9606,207598
324528,A0A075B6U9_HUMAN,A0A075B6U9,UPI00021CF15B,False,UniRef50_A0A0D9SE82,9606,9526
324532,A0AAQ5BIC9_HUMAN,A0AAQ5BIC9,UPI0032C69A6C,False,UniRef50_A0AAQ4VMV8,9606,314146


In [122]:
distilled_members_list = df_uniref50_human_uniprot[
    df_uniref50_human_uniprot['is_representative']=='False'
    ][not_reprsented_proteins].drop_duplicates(subset=['representative_member']).to_dict('records')

In [123]:
uniref50_human_uniprot_distilled_list.extend(distilled_members_list)

In [124]:
df_uniref50_human_uniprot_distilled = pd.DataFrame(uniref50_human_uniprot_distilled_list)
df_uniref50_human_uniprot_distilled

Unnamed: 0,uniprot_id,uniprot_accession,uniparc_id,is_representative,representative_member,taxon_id,common_taxon_id
0,TITIN_HUMAN,Q8WZ42,UPI00025287CD,True,UniRef50_Q8WZ42,9606,1
1,MUC16_HUMAN,Q8WXI7,UPI000065CA24,True,UniRef50_Q8WXI7,9606,9526
2,A0AA34QVW0_HUMAN,A0AA34QVW0,UPI002645ACA0,True,UniRef50_A0AA34QVW0,9606,314293
3,MUC3B_HUMAN,Q9H195,UPI00257942D1,True,UniRef50_Q9H195,9606,9606
4,A0A1W2PS37_HUMAN,A0A1W2PS37,UPI00097BA564,True,UniRef50_A0A1W2PS37,9606,117571
...,...,...,...,...,...,...,...
70896,A0A075B6W7_HUMAN,A0A075B6W7,UPI00021CF13D,False,UniRef50_A0A8D2ATL9,9606,1437010
70897,C1KEM4_HUMAN,C1KEM4,UPI0001998A5D,False,UniRef50_C1KEM2,9606,207598
70898,A0A075B6U9_HUMAN,A0A075B6U9,UPI00021CF15B,False,UniRef50_A0A0D9SE82,9606,9526
70899,A0AAQ5BIC9_HUMAN,A0AAQ5BIC9,UPI0032C69A6C,False,UniRef50_A0AAQ4VMV8,9606,314146


In [128]:
df_uniref50_human_uniprot_distilled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70901 entries, 0 to 70900
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   uniprot_id             70901 non-null  object
 1   uniprot_accession      70901 non-null  object
 2   uniparc_id             70901 non-null  object
 3   is_representative      70901 non-null  object
 4   representative_member  70901 non-null  object
 5   taxon_id               70901 non-null  object
 6   common_taxon_id        70901 non-null  object
dtypes: object(7)
memory usage: 3.8+ MB


In [129]:
# db_file = "/cta/share/users/uniprot/human/human.db"
# conn = sqlite3.connect(db_file)
# df_uniref50_human_uniprot_distilled.to_sql("uniref50_distilled", conn, if_exists="replace", index=False)
# conn.close()

---

# UniRef90 XML Parsing

Script: uniref90_xml_to_db.py

In [38]:
db_file = '/cta/share/users/uniprot/uniref/uniref90.db'
conn = sqlite3.connect(db_file)
df_uniref90_human = pd.read_sql("SELECT * FROM uniref90 WHERE taxon_id='9606'", conn)
conn.close()

In [39]:
df_uniref90_human

Unnamed: 0,uniprot_id,uniprot_accession,uniparc_id,is_representative,representative_member,taxon_id,common_taxon_id
0,Q8WZ42-12,Q8WZ42-12,UPI000264F4A1,True,UniRef90_Q8WZ42-12,9606,9604
1,A2TKE6_HUMAN,A2TKE6,UPI0000F0A0C5,False,UniRef90_Q8WZ42-12,9606,9604
2,C9JQJ2_HUMAN,C9JQJ2,UPI00046209C6,False,UniRef90_Q8WZ42-12,9606,9604
3,A0A0C4DG59_HUMAN,A0A0C4DG59,UPI0032CD4A7F,False,UniRef90_Q8WZ42-12,9606,9604
4,A0AAQ5BIC8_HUMAN,A0AAQ5BIC8,UPI0032D032FE,False,UniRef90_Q8WZ42-12,9606,9604
...,...,...,...,...,...,...,...
325191,V9H0A7_HUMAN,V9H0A7,UPI000011E266,True,UniRef90_V9H0A7,9606,9606
325192,Q16427_HUMAN,Q16427,UPI000006DE79,True,UniRef90_Q16427,9606,9606
325193,A6QL42_HUMAN,A6QL42,UPI0001574DCB,True,UniRef90_A6QL42,9606,9606
325194,Q16175_HUMAN,Q16175,UPI000006CDA4,True,UniRef90_Q16175,9606,9606


In [40]:
db_file = "/cta/share/users/uniprot/human/human.db"
conn = sqlite3.connect(db_file)
df_uniref90_human.to_sql("uniref90_all", conn, if_exists="replace", index=False)
conn.close()
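Because `if_exists="replace"` drops and recreates the target table, it can be useful to check what is already there before running a write cell. The following is a small sketch of such a check; `table_row_count` is a hypothetical helper, not part of the notebook, and the demo runs against an in-memory database rather than human.db.

```python
import sqlite3


def table_row_count(conn, table):
    """Return the row count of `table`, or None if the table does not exist."""
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone()
    if exists is None:
        return None
    # Table names cannot be bound as parameters, so the identifier is quoted instead.
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


# Demo on an in-memory database (the real target would be human.db).
conn = sqlite3.connect(":memory:")
before = table_row_count(conn, "uniref90_all")   # table not created yet
conn.execute("CREATE TABLE uniref90_all (uniprot_id TEXT)")
conn.execute("INSERT INTO uniref90_all VALUES ('TITIN_HUMAN')")
after = table_row_count(conn, "uniref90_all")
conn.close()
```

A non-`None` count before a write signals that the cell is about to overwrite existing rows.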


---

In [134]:
df_uniref90_human_uniprot = df_uniref90_human.dropna(subset=['uniprot_id'])
df_uniref90_human_uniprot

Unnamed: 0,uniprot_id,uniprot_accession,uniparc_id,is_representative,representative_member,taxon_id,common_taxon_id
0,Q8WZ42-12,Q8WZ42-12,UPI000264F4A1,True,UniRef90_Q8WZ42-12,9606,9604
1,A2TKE6_HUMAN,A2TKE6,UPI0000F0A0C5,False,UniRef90_Q8WZ42-12,9606,9604
2,C9JQJ2_HUMAN,C9JQJ2,UPI00046209C6,False,UniRef90_Q8WZ42-12,9606,9604
3,A0A0C4DG59_HUMAN,A0A0C4DG59,UPI0032CD4A7F,False,UniRef90_Q8WZ42-12,9606,9604
4,A0AAQ5BIC8_HUMAN,A0AAQ5BIC8,UPI0032D032FE,False,UniRef90_Q8WZ42-12,9606,9604
...,...,...,...,...,...,...,...
325191,V9H0A7_HUMAN,V9H0A7,UPI000011E266,True,UniRef90_V9H0A7,9606,9606
325192,Q16427_HUMAN,Q16427,UPI000006DE79,True,UniRef90_Q16427,9606,9606
325193,A6QL42_HUMAN,A6QL42,UPI0001574DCB,True,UniRef90_A6QL42,9606,9606
325194,Q16175_HUMAN,Q16175,UPI000006CDA4,True,UniRef90_Q16175,9606,9606


In [135]:
df_uniref90_human_uniprot = df_uniref90_human_uniprot[~df_uniref90_human_uniprot['uniprot_accession'].str.contains('-')]
df_uniref90_human_uniprot

Unnamed: 0,uniprot_id,uniprot_accession,uniparc_id,is_representative,representative_member,taxon_id,common_taxon_id
1,A2TKE6_HUMAN,A2TKE6,UPI0000F0A0C5,False,UniRef90_Q8WZ42-12,9606,9604
2,C9JQJ2_HUMAN,C9JQJ2,UPI00046209C6,False,UniRef90_Q8WZ42-12,9606,9604
3,A0A0C4DG59_HUMAN,A0A0C4DG59,UPI0032CD4A7F,False,UniRef90_Q8WZ42-12,9606,9604
4,A0AAQ5BIC8_HUMAN,A0AAQ5BIC8,UPI0032D032FE,False,UniRef90_Q8WZ42-12,9606,9604
12,H7C0U7_HUMAN,H7C0U7,UPI0032D05761,True,UniRef90_H7C0U7,9606,207598
...,...,...,...,...,...,...,...
325191,V9H0A7_HUMAN,V9H0A7,UPI000011E266,True,UniRef90_V9H0A7,9606,9606
325192,Q16427_HUMAN,Q16427,UPI000006DE79,True,UniRef90_Q16427,9606,9606
325193,A6QL42_HUMAN,A6QL42,UPI0001574DCB,True,UniRef90_A6QL42,9606,9606
325194,Q16175_HUMAN,Q16175,UPI000006CDA4,True,UniRef90_Q16175,9606,9606


In [138]:
df_representative_members = df_uniref90_human_uniprot[df_uniref90_human_uniprot['is_representative']=='True']
df_representative_members

Unnamed: 0,uniprot_id,uniprot_accession,uniparc_id,is_representative,representative_member,taxon_id,common_taxon_id
12,H7C0U7_HUMAN,H7C0U7,UPI0032D05761,True,UniRef90_H7C0U7,9606,207598
13,A0A1B0GXE3_HUMAN,A0A1B0GXE3,UPI0032CA12C3,True,UniRef90_A0A1B0GXE3,9606,9606
15,H0Y4J7_HUMAN,H0Y4J7,UPI0032CBC510,True,UniRef90_H0Y4J7,9606,9606
21,TITIN_HUMAN,Q8WZ42,UPI00025287CD,True,UniRef90_Q8WZ42,9606,9526
61,A0A3Q8B178_HUMAN,A0A3Q8B178,UPI000F859C53,True,UniRef90_A0A3Q8B178,9606,207598
...,...,...,...,...,...,...,...
325191,V9H0A7_HUMAN,V9H0A7,UPI000011E266,True,UniRef90_V9H0A7,9606,9606
325192,Q16427_HUMAN,Q16427,UPI000006DE79,True,UniRef90_Q16427,9606,9606
325193,A6QL42_HUMAN,A6QL42,UPI0001574DCB,True,UniRef90_A6QL42,9606,9606
325194,Q16175_HUMAN,Q16175,UPI000006CDA4,True,UniRef90_Q16175,9606,9606


In [139]:
uniref90_human_uniprot_distilled_list = df_representative_members.to_dict('records')

In [140]:
representative_member_list = list(df_representative_members['representative_member'])

In [141]:
# Flag non-representative members whose cluster has no representative in the filtered set
not_reprsented_proteins = ~df_uniref90_human_uniprot[df_uniref90_human_uniprot['is_representative']=='False']['representative_member'].isin(
    representative_member_list)

In [142]:
df_uniref90_human_uniprot[
    df_uniref90_human_uniprot['is_representative']=='False'
    ][not_reprsented_proteins].drop_duplicates(subset=['representative_member'])

Unnamed: 0,uniprot_id,uniprot_accession,uniparc_id,is_representative,representative_member,taxon_id,common_taxon_id
1,A2TKE6_HUMAN,A2TKE6,UPI0000F0A0C5,False,UniRef90_Q8WZ42-12,9606,9604
52,A0A0A0MRA3_HUMAN,A0A0A0MRA3,UPI0004621223,False,UniRef90_Q8WZ42-9,9606,1437010
79,Q8WXY3_HUMAN,Q8WXY3,UPI000006DB3D,False,UniRef90_A0A091E0S6,9606,1437010
82,E7ERX3_HUMAN,E7ERX3,UPI0001E8F3D2,False,UniRef90_A0A2J8SSS3,9606,1437010
83,Q8WXY4_HUMAN,Q8WXY4,UPI000006DDCA,False,UniRef90_A0A091EB58,9606,32524
...,...,...,...,...,...,...,...
324150,A0A3B3ITF9_HUMAN,A0A3B3ITF9,UPI000E6F2053,False,UniRef90_A0A5F5Y587,9606,1437010
324315,A0A075B6W7_HUMAN,A0A075B6W7,UPI00021CF13D,False,UniRef90_A0A8D2ATL9,9606,1437010
324485,C1KEM4_HUMAN,C1KEM4,UPI0001998A5D,False,UniRef90_C1KEM2,9606,207598
324527,A0A075B6U9_HUMAN,A0A075B6U9,UPI00021CF15B,False,UniRef90_A0A0D9SE82,9606,9526


In [143]:
distilled_members_list = df_uniref90_human_uniprot[
    df_uniref90_human_uniprot['is_representative']=='False'
    ][not_reprsented_proteins].drop_duplicates(subset=['representative_member']).to_dict('records')

In [144]:
uniref90_human_uniprot_distilled_list.extend(distilled_members_list)

In [145]:
df_uniref90_human_uniprot_distilled = pd.DataFrame(uniref90_human_uniprot_distilled_list)
df_uniref90_human_uniprot_distilled

Unnamed: 0,uniprot_id,uniprot_accession,uniparc_id,is_representative,representative_member,taxon_id,common_taxon_id
0,H7C0U7_HUMAN,H7C0U7,UPI0032D05761,True,UniRef90_H7C0U7,9606,207598
1,A0A1B0GXE3_HUMAN,A0A1B0GXE3,UPI0032CA12C3,True,UniRef90_A0A1B0GXE3,9606,9606
2,H0Y4J7_HUMAN,H0Y4J7,UPI0032CBC510,True,UniRef90_H0Y4J7,9606,9606
3,TITIN_HUMAN,Q8WZ42,UPI00025287CD,True,UniRef90_Q8WZ42,9606,9526
4,A0A3Q8B178_HUMAN,A0A3Q8B178,UPI000F859C53,True,UniRef90_A0A3Q8B178,9606,207598
...,...,...,...,...,...,...,...
94014,A0A3B3ITF9_HUMAN,A0A3B3ITF9,UPI000E6F2053,False,UniRef90_A0A5F5Y587,9606,1437010
94015,A0A075B6W7_HUMAN,A0A075B6W7,UPI00021CF13D,False,UniRef90_A0A8D2ATL9,9606,1437010
94016,C1KEM4_HUMAN,C1KEM4,UPI0001998A5D,False,UniRef90_C1KEM2,9606,207598
94017,A0A075B6U9_HUMAN,A0A075B6U9,UPI00021CF15B,False,UniRef90_A0A0D9SE82,9606,9526


In [146]:
df_uniref90_human_uniprot_distilled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94019 entries, 0 to 94018
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   uniprot_id             94019 non-null  object
 1   uniprot_accession      94019 non-null  object
 2   uniparc_id             94019 non-null  object
 3   is_representative      94019 non-null  object
 4   representative_member  94019 non-null  object
 5   taxon_id               94019 non-null  object
 6   common_taxon_id        94019 non-null  object
dtypes: object(7)
memory usage: 5.0+ MB


In [147]:
# db_file = "/cta/share/users/uniprot/human/human.db"
# conn = sqlite3.connect(db_file)
# df_uniref90_human_uniprot_distilled.to_sql("uniref90_distilled", conn, if_exists="replace", index=False)
# conn.close()
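The distillation in cells In [134] to In [145] keeps one UniProt entry per cluster: the representative when it survives the filters (UniProt entry, no isoform accession), otherwise the first remaining member of the cluster. The same selection can be sketched as a single SQL query. This is an equivalent formulation written for illustration, not the code that was run; it assumes the table layout of `uniref90_all`, relies on `rowid` reflecting the original row order, and uses a few rows from the outputs above plus one made-up row to exercise the logic.

```python
import sqlite3

DISTILL_SQL = """
WITH canonical AS (      -- UniProt entries only, isoform accessions ('-') excluded
    SELECT rowid AS rid, * FROM {table}
    WHERE uniprot_id IS NOT NULL AND instr(uniprot_accession, '-') = 0
),
kept_reps AS (           -- clusters whose representative survives the filter
    SELECT representative_member FROM canonical WHERE is_representative = 'True'
),
first_member AS (        -- first non-representative row of every cluster
    SELECT MIN(rid) AS rid FROM canonical
    WHERE is_representative = 'False'
    GROUP BY representative_member
)
SELECT uniprot_id, uniprot_accession, uniparc_id, is_representative,
       representative_member, taxon_id, common_taxon_id
FROM canonical
WHERE is_representative = 'True'
   OR (rid IN (SELECT rid FROM first_member)
       AND representative_member NOT IN (SELECT representative_member FROM kept_reps))
ORDER BY is_representative DESC, rid   -- representatives first, then fallbacks
"""


def distill(conn, table="uniref90_all"):
    """Return one row per cluster, following the notebook's selection rule."""
    return conn.execute(DISTILL_SQL.format(table=table)).fetchall()


# Demo rows: the first four appear in the outputs above, the last one is made up.
demo_rows = [
    ("Q8WZ42-12", "Q8WZ42-12", "UPI000264F4A1", "True", "UniRef90_Q8WZ42-12", "9606", "9604"),
    ("A2TKE6_HUMAN", "A2TKE6", "UPI0000F0A0C5", "False", "UniRef90_Q8WZ42-12", "9606", "9604"),
    ("C9JQJ2_HUMAN", "C9JQJ2", "UPI00046209C6", "False", "UniRef90_Q8WZ42-12", "9606", "9604"),
    ("H7C0U7_HUMAN", "H7C0U7", "UPI0032D05761", "True", "UniRef90_H7C0U7", "9606", "207598"),
    ("EXAMPLE_HUMAN", "EXAMPLE1", "UPI0000000000", "False", "UniRef90_H7C0U7", "9606", "207598"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uniref90_all (uniprot_id TEXT, uniprot_accession TEXT, uniparc_id TEXT, "
    "is_representative TEXT, representative_member TEXT, taxon_id TEXT, common_taxon_id TEXT)"
)
conn.executemany("INSERT INTO uniref90_all VALUES (?, ?, ?, ?, ?, ?, ?)", demo_rows)
distilled = distill(conn)
conn.close()
kept_accessions = [row[1] for row in distilled]
```

In the demo, the isoform representative `Q8WZ42-12` is filtered out, so its cluster falls back to its first remaining member (`A2TKE6`). `H7C0U7` is kept as a representative, and the made-up member of its cluster is dropped.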

## Generate FASTA Files

In [32]:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def df_to_fasta_biopython(df, output_file, id_col='uniprot_id', seq_col='sequence', description_cols=None):
    """
    Convert a DataFrame containing sequences to FASTA format using Biopython
    
    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame containing sequence data
    output_file : str
        Path to output FASTA file
    id_col : str
        Name of the column containing sequence identifiers
    seq_col : str
        Name of the column containing sequences
    description_cols : list
        List of column names to include in the description field
    """
    records = []
    
    for idx, row in df.iterrows():
        # Create description string if description columns are provided
        if description_cols:
            description = ' | '.join(f"{col}={row[col]}" for col in description_cols)
        else:
            description = ''
            
        # Create sequence record
        record = SeqRecord(
            Seq(row[seq_col]),
            id=str(row[id_col]),
            description=description
        )
        records.append(record)
    
    # Write to FASTA file
    SeqIO.write(records, output_file, "fasta")

In [25]:
# Uniref50: sequences of the distilled cluster members, restricted to fewer than 3000 residues
db_file = "/cta/share/users/uniprot/human/human.db"
conn = sqlite3.connect(db_file)
df_protein = pd.read_sql("""SELECT Entry as uniprot_id, Sequence as sequence
                          FROM proteins
                          WHERE Entry IN (SELECT uniprot_accession FROM uniref50_distilled)""", conn)
df_protein = df_protein[df_protein['sequence'].str.len() < 3000].reset_index(drop=True)

# Sliced sequences from uniref50_domain_sliced_plddt70, restricted to the proteins kept above
df_protein_sliced = pd.read_sql("SELECT * FROM uniref50_domain_sliced_plddt70", conn)
df_protein_sliced = df_protein_sliced[df_protein_sliced['uniprot_id'].isin(df_protein['uniprot_id'])].reset_index(drop=True)
# 1-based running count of the slices of each protein, in row order
df_protein_sliced['occurrence'] = df_protein_sliced.groupby('uniprot_id').cumcount() + 1

conn.close()


In [44]:
df_protein

Unnamed: 0,uniprot_id,sequence
0,A0A087WZT3,MELSAEYLREKLQRDLEAEHVLPSPGGVGQVRGETAASETQLGS
1,A0A087X1C5,MGLEALVPLAMIVAIFLLLVDLMHRHQRWAARYPPGPLPLPGLGNL...
2,A0A087X296,MSRSLLLWFLLFLLLLPPLPVLLADPGAPTPVNPCCYYPCQHQGIC...
3,A0A0B4J2F0,MFRRLTFAQLLFATVLGIAGGVYIFQPVFEQYAKDQKELKEKMQLV...
4,A0A0C5B5G6,MRWQEMGYIFYPRKLR
...,...,...
70687,X6RL83,MLQEWLAAVGDDYAAVVWRPEGEPRFYPDEEGPKHWTKERHQFLME...
70688,X6RLN4,EVKGLFKSENCPKVISCEFAHNSNWYITFQSDTDAQQAFKYLREEV...
70689,X6RLR1,MAGLTDLQRLQARVEELERWVYGPGGARGSRKVADGLVKVQVALGN...
70690,X6RLV5,MSGYSSDRDRGRDRGFGAPRFGGSRAGPLSGKKFGNPGEKLVKKKW...


In [33]:
df_to_fasta_biopython(df_protein, "../RSRC/uniref_50.fasta")

In [42]:
df_protein_sliced

Unnamed: 0,uniprot_id,sequence,source,occurrence
0,A0A1W2PS37,MSMERRMKIEETWRLW,out_of_domain,1
1,A8CLL2,VMDNPLVMHQLRCNGVLEGIRICRKGFPNRILYGDFRQ,IPR001609,1
2,Q68DN1,MELTPGAQQQGINYQELTSGWQDVKSMMLVPEPTRKFPSGPLLTSV...,out_of_domain,1
3,Q5W8V9,M,out_of_domain,1
4,Q5W8V9,PKLLQGVITVIDVFYQYATQHGEYDTLNKAELKELLENEFHQILKN...,IPR034325,2
...,...,...,...,...
236828,A0A075B6W7,XNAGNNRKLIWGLGTSLAVNP,out_of_domain,1
236829,C1KEM4,NVKSEGSGQRGGSMAVLVWLHM,out_of_domain,1
236830,A0A075B6U9,XNTGGTIDKLTFGKGTHVFIIS,out_of_domain,1
236831,A0AAQ5BIC9,MSDSDSRTEKRKKKRPNGKATF,out_of_domain,1


In [43]:
df_to_fasta_biopython(df_protein_sliced, "../RSRC/uniref_50_pretokenized.fasta", description_cols=['occurrence', 'source'])
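To make the layout of the pretokenized file concrete, the sketch below reproduces in plain Python what a record written with `description_cols=['occurrence', 'source']` is expected to look like. It assumes Biopython's default FASTA output (header `>id description`, sequence wrapped at 60 characters); `fasta_record` is a hypothetical stand-in written for this illustration and is not what `df_to_fasta_biopython` calls.

```python
def fasta_record(seq_id, sequence, description="", width=60):
    """Format one FASTA record: '>id description' followed by wrapped sequence lines."""
    header = f">{seq_id} {description}".rstrip()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header, *lines]) + "\n"


def describe(row, description_cols):
    """Build the description the same way df_to_fasta_biopython does."""
    return " | ".join(f"{col}={row[col]}" for col in description_cols)


# Two slices of the same protein, taken from the df_protein_sliced output above
# (the second sequence is shortened here for readability).
slices = [
    {"uniprot_id": "Q5W8V9", "sequence": "M", "source": "out_of_domain", "occurrence": 1},
    {"uniprot_id": "Q5W8V9", "sequence": "PKLLQGVITVIDVFYQYATQHGEYDTLNKAELKELLENEFHQILKN",
     "source": "IPR034325", "occurrence": 2},
]

records = [
    fasta_record(s["uniprot_id"], s["sequence"], describe(s, ["occurrence", "source"]))
    for s in slices
]
headers = [r.splitlines()[0] for r in records]
```

Both records share the ID `Q5W8V9`, so only the `occurrence` value in the description tells the slices apart. Tools that index records by ID alone would treat them as duplicates.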