# Training machine learning models on pairs of substrates in individual organisms

After creating the set of primary ChEBI substrates from a set of transmembrane transport proteins, the next task is to create a pipeline for generating these datasets automatically.

## Generalized dataset creation method

- apply general filters to the UniProt proteins
- filter again for proteins annotated with transmembrane transporter activity
- get network of GO terms that are descendants of transmembrane transporter activity
    - filter network for terms annotated to proteins in the dataset
    - annotate each term with its number of proteins
    - annotate each term with ChEBI substrates
- get ChEBI network
    - filter for terms linked to GO terms in the filtered GO network
- apply a 70% sequence identity threshold?
- feature generation
- SVM pipeline
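
The GO filtering and annotation steps above can be sketched on a toy graph. This is a minimal illustration, not the `preprocess_data` implementation: the term names, the `protein_to_terms` mapping, and the `n_proteins` attribute are hypothetical, and it assumes edges point from child to parent (as in graphs loaded with `obonet`) and that annotations are propagated to parent terms.

```python
import networkx as nx

# Toy GO-like graph with hypothetical term names.
# Assumption: edges point child -> parent ("is_a").
graph_go_toy = nx.DiGraph()
graph_go_toy.add_edges_from([
    ("transporter_activity", "molecular_function"),
    ("transmembrane_transporter_activity", "transporter_activity"),
    ("ion_transmembrane_transporter_activity", "transmembrane_transporter_activity"),
    ("sugar_transmembrane_transporter_activity", "transmembrane_transporter_activity"),
    ("amino_acid_transmembrane_transporter_activity", "transmembrane_transporter_activity"),
    ("potassium_channel_activity", "ion_transmembrane_transporter_activity"),
])

# Hypothetical protein -> directly annotated terms.
protein_to_terms = {
    "P1": {"potassium_channel_activity"},
    "P2": {"sugar_transmembrane_transporter_activity"},
    "P3": {"potassium_channel_activity", "ion_transmembrane_transporter_activity"},
    "P4": {"transporter_activity"},
}

root = "transmembrane_transporter_activity"

# With child -> parent edges, the ontology descendants of a term are
# its *ancestors* in the graph sense (all nodes with a path to it).
transport_terms = nx.ancestors(graph_go_toy, root) | {root}

# Collect proteins per term, propagating each annotation to parent terms.
term_to_proteins = {term: set() for term in transport_terms}
for protein, terms in protein_to_terms.items():
    for term in terms:
        parents_and_self = nx.descendants(graph_go_toy, term) | {term}
        for t in parents_and_self & transport_terms:
            term_to_proteins[t].add(protein)

# Keep only terms that occur in the dataset, then annotate with counts.
terms_with_proteins = {t for t, prots in term_to_proteins.items() if prots}
toy_go_filtered = graph_go_toy.subgraph(terms_with_proteins).copy()
nx.set_node_attributes(
    toy_go_filtered,
    {t: len(term_to_proteins[t]) for t in terms_with_proteins},
    "n_proteins",
)

for term, data in toy_go_filtered.nodes(data=True):
    print(term, data["n_proteins"])
```

If the real graph stores edges from parent to child instead, `nx.ancestors` and `nx.descendants` swap roles.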

End results:
- Annotated GO network for protein dataset
- Annotated ChEBI network for protein dataset
- Dataframe with accession, organism ID, sequence, transport GO terms, and transported substrate ChEBI term

In [12]:
from subpred.util import load_df
from subpred.graph import preprocess_data

In [13]:
dataset_name_to_organism_ids = {"human": {9606}, "athaliana": {3702}, "ecoli": {83333}, "yeast": {559292}}
dataset_name_to_organism_ids["all"] = {list(s)[0] for s in dataset_name_to_organism_ids.values() if len(s) == 1}
dataset_name_to_organism_ids

{'human': {9606},
 'athaliana': {3702},
 'ecoli': {83333},
 'yeast': {559292},
 'all': {3702, 9606, 83333, 559292}}

In [17]:
organism_ids = dataset_name_to_organism_ids["athaliana"]
df_uniprot, df_uniprot_goa, graph_go_filtered, graph_chebi_filtered = preprocess_data(
    organism_ids=organism_ids, datasets_folder_path="../data/datasets"
)


43248
164519
60547


Create sequence-based features

In [18]:
df_uniprot.sequence

Uniprot
A0A1P8AUY4    MAIDGSFNLKLALETFSVRCPKVAAFPCFTSILSKGGEVVDNEEVI...
A0MES8        MDPLASQHQHNHLEDNNQTLTHNNPQSDSTTDSSTSSAQRKRKGKG...
A1A6M1        MASSSLPLSLPFPLRSLTSTTRSLPFQCSPLFFSIPSSIVCFSTQN...
A7XGN8        MAEGVVLFGVHKLWELLNRESARLNGIGEQVDGLKRQLGRLQSLLK...
A8MQN2        MSDLYIHELGDYLSDEFHGNDDGIVPDSAYEDGGQFPILVSNRKKR...
                                    ...                        
P0C0B0        MTLKGALSVKFDVKCPADKFFSAFVEDTNRPFEKNGKTEIEAVDLV...
Q8RXM6        MATDDFFSLQREYWRQVAESDGFDIESVQIPPSMYGRINGLIPHNC...
Q9C7N3        MALSGGDDQVPPGGPMTVSTRLIDAGSTIMQRRLPVKQMNTHVSTF...
Q9LX56        MEANKKAKPSYGDVISNLPNDLLCRILSYLSTKEAALTSILSKRWS...
Q9SGS5        MDIKHHTSTSKPKNKNKLLKMLPKAMSFGHRVPPFSPGRDLHHNNH...
Name: sequence, Length: 7583, dtype: string

In [19]:
from subpred.compositions import calculate_aac, calculate_paac

df_aac = calculate_aac(df_uniprot.sequence)
df_paac = calculate_paac(df_uniprot.sequence)
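
For reference, amino acid composition (AAC) is commonly defined as the relative frequency of each of the 20 standard amino acids in a sequence. The sketch below illustrates that definition in plain Python; `aac_sketch` is a hypothetical helper, and `subpred.compositions.calculate_aac` may differ in details such as normalization or the handling of non-standard residues.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac_sketch(sequence: str) -> dict[str, float]:
    """Return the relative frequency of each standard amino acid.

    Assumption: non-standard residues (e.g. X, U) are ignored, so the
    frequencies sum to 1 for any sequence with at least one standard residue.
    """
    counts = {aa: sequence.count(aa) for aa in AMINO_ACIDS}
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: count / total for aa, count in counts.items()}


# Toy sequences with hypothetical accessions.
toy_sequences = {"X1": "MAAK", "X2": "MKKL"}
toy_aac = {acc: aac_sketch(seq) for acc, seq in toy_sequences.items()}
print(toy_aac["X1"]["A"])
```

Each sequence maps to a fixed-length vector of 20 values regardless of its length, which is what makes it usable as input for a classifier.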

In [22]:
from subpred.pssm import calculate_pssm_feature

df_pssm = calculate_pssm_feature(
    df_uniprot.sequence,
    tmp_folder="../data/intermediate/blast/pssm_uniref50_1it",
    blast_db="../data/raw/uniref/uniref50/uniref50.fasta",
    iterations=1,
    psiblast_threads=16,
    verbose=True
)

PSSM for accession A0A1P8AUY4 was not found in tmp folder ../data/intermediate/blast/pssm_uniref50_1it, calling psiblast


KeyboardInterrupt: 

In [15]:
graph_go.nodes()

NodeView(('GO:0022836', 'GO:0008381', 'GO:0015086', 'GO:0015106', 'GO:0008520', 'GO:0015180', 'GO:0042910', 'GO:0005283', 'GO:0061459', 'GO:0015450', 'GO:0044736', 'GO:0005351', 'GO:0015349', 'GO:0008513', 'GO:0015374', 'GO:0022841', 'GO:1905030', 'GO:0005297', 'GO:0080122', 'GO:0015651', 'GO:1901702', 'GO:0015233', 'GO:0015390', 'GO:0033285', 'GO:0005228', 'GO:0015098', 'GO:0042887', 'GO:0015218', 'GO:0015655', 'GO:0061768', 'GO:0005260', 'GO:0015228', 'GO:0044769', 'GO:0015095', 'GO:0015112', 'GO:0000064', 'GO:0022803', 'GO:0086057', 'GO:0008521', 'GO:0140831', 'GO:0015208', 'GO:1901682', 'GO:0099142', 'GO:0005343', 'GO:1901680', 'GO:0015552', 'GO:0022857', 'GO:0015271', 'GO:0015187', 'GO:0005452', 'GO:0000514', 'GO:0005427', 'GO:0008320', 'GO:0033283', 'GO:0005277', 'GO:0015234', 'GO:0015526', 'GO:0015279', 'GO:0046715', 'GO:0005221', 'GO:0043250', 'GO:0005223', 'GO:0015203', 'GO:0005326', 'GO:0015272', 'GO:0015665', 'GO:0015278', 'GO:0015139', 'GO:0005375', 'GO:0140930', 'GO:001519