In [1]:
from subpred.util import load_df

In [2]:
# TODO Different ways to filter for passive transporters, different definitions
# TODO What to do about multiple substrates? Create stats on them
# TODO Plots: Dim reduction, Hclust, Kmeans
# TODO stats on 2.B, 2.C, 2.D
# TODO possible to make bacteria-only dataset?

When clustering the entire dataset of *E. coli* transporters in the previous notebook, we discovered that clusters form based on transport mechanism or protein family, and not on substrate as we had wanted. The idea now is to look at only one type of transmembrane transporter, to find out whether that set can be divided by substrate. We already suspected in Manuscript 1 that the *A. thaliana* model worked too well because virtually all of the transporters in our dataset were secondary active.

In this notebook, we will try creating datasets that only contain secondary active transporters, which typically transport small hydrophilic molecules across membranes. This excludes channels (passive transporters), binding proteins from transport complexes such as ABC transporters, primary active transporters, and others. We will look at those in other notebooks.

Secondary active transporters are gradient-driven transporters, typically with alpha-helical structures. The biggest category should be the Porters, in the form of Uniporters, Symporters and Antiporters. The latter two come with the challenge that they fall into two or more substrate classes. We should create separate categories for the individual substrate combinations, and see whether they make good substrate classes. The mechanism usually works through a conformational change in the protein that is triggered by the presence of a substrate.
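One possible way to build categories for substrate combinations is sketched below. The protein names and substrates are made-up placeholders, not values from our dataset, and `combination_label` is a hypothetical helper. Each protein gets an order-independent label, so that "sodium + proline" and "proline + sodium" end up in the same class.

```python
from collections import Counter

# Hypothetical substrate annotations per protein (placeholders, not real data).
substrates_by_protein = {
    "protein_1": ["sodium", "proline"],
    "protein_2": ["proline", "sodium"],
    "protein_3": ["proton", "lactose"],
    "protein_4": ["glucose"],
}


def combination_label(substrates):
    """Return an order-independent label for a group of co-transported substrates."""
    return "+".join(sorted(set(substrates)))


# One label per protein, then the number of proteins per label.
labels = {acc: combination_label(subs) for acc, subs in substrates_by_protein.items()}
label_counts = Counter(labels.values())

for label, count in label_counts.most_common():
    print(label, count)
```

Counting the labels first should show which combinations have enough samples to be usable as classes.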

We will also explore four different ways of creating the secondary active transporter dataset: TCDB, Gene Ontology, Keywords and Interpro domains. The respective datasets will be compared in terms of size and overlap. TCDB might have the most biologically accurate data, but is not available for most organisms. *E. coli* is a rare example where most of the transporters have an entry.
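The size and overlap comparison could look roughly like the following sketch. The accession sets are placeholders, and the Jaccard index is only one of several reasonable overlap measures; it is an assumption here, not something we have settled on.

```python
from itertools import combinations

# Placeholder accession sets, one per annotation source (not real results).
datasets = {
    "tcdb": {"A", "B", "C", "D"},
    "go": {"B", "C", "D", "E"},
    "keywords": {"C", "D"},
    "interpro": {"A", "C", "F"},
}


def jaccard(a, b):
    """Intersection size divided by union size; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


sizes = {name: len(accessions) for name, accessions in datasets.items()}

# For every pair of sources: (number of shared proteins, Jaccard index).
overlaps = {
    (n1, n2): (len(datasets[n1] & datasets[n2]), jaccard(datasets[n1], datasets[n2]))
    for n1, n2 in combinations(datasets, 2)
}

print(sizes)
for pair, (shared, score) in overlaps.items():
    print(pair, shared, round(score, 2))
```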

## Protein dataset

Loading the Uniprot data for all organisms:

In [3]:
df_uniprot = load_df("uniprot")
df_uniprot

Uniprot,gene_names,protein_names,reviewed,protein_existence,sequence,organism_id
A0A0C5B5G6,MT-RNR1,Mitochondrial-derived peptide MOTS-c (Mitochon...,True,1,MRWQEMGYIFYPRKLR,9606
A0A1B0GTW7,CIROP LMLN2,Ciliated left-right organizer metallopeptidase...,True,1,MLLLLLLLLLLPPLVLRVAASRCLHDETQKSVSLLRPPFSQLPSKS...,9606
A0JNW5,BLTP3B KIAA0701 SHIP164 UHRF1BP1L,Bridge-like lipid transfer protein family memb...,True,1,MAGIIKKQILKHLSRFTKNLSPDKINLSTLKGEGELKNLELDEEVL...,9606
A0JP26,POTEB3,POTE ankyrin domain family member B3,True,1,MVAEVCSMPAASAVKKPFDLRSKMGKWCHHRFPCCRGSGKSNMGTS...,9606
A0PK11,CLRN2,Clarin-2,True,1,MPGWFKKAWYGLASLLSFSSFILIIVALVVPHWLSGKILCQTGVDL...,9606
...,...,...,...,...,...,...
X5L4R4,NOD-2,Nucleotide-binding oligomerization domain-cont...,False,2,MSPGCYKGWPFNCHLSHEEDKRRNETLLQEAETSNLQITASFVSGL...,586796
X5MBL2,GT34D,"Putative galacto(Gluco)mannan alpha-1,6-galact...",False,2,KVLYDRAFNSSDDQSALVYLLLKEKDKWADRIFIEHKYYLNGYWLD...,3352
X5MFI4,GT34D,"Putative galacto(Gluco)mannan alpha-1,6-galact...",False,2,MDEDVLCKGPLHGGSARSLKGSLKRLKRIMESLNDGLIFMGGAVSA...,3352
X5MI49,GT34A,"Putative galacto(Gluco)mannan alpha-1,6-galact...",False,2,MVNDSKLETISGNMVQKRKSFDGLPFWTVSIAGGLLLCWSLWRICF...,3352


Filtering for *E. coli* K12, since that strain has the highest number of functional annotations.

In [4]:
df_uniprot_ecoli = df_uniprot[df_uniprot.organism_id == 83333].drop("organism_id", axis=1)
df_uniprot_ecoli

Uniprot,gene_names,protein_names,reviewed,protein_existence,sequence
P00509,aspC b0928 JW0911,Aspartate aminotransferase (AspAT) (EC 2.6.1.1...,True,1,MFENITAAPADPILGLADLFRADERPGKINLGIGVYKDETGKTPVL...
P00803,lepB b2568 JW2552,Signal peptidase I (SPase I) (EC 3.4.21.89) (L...,True,1,MANMFALILVIATLVTGILWCVDKFFFAPKRRERQAAAQAAAGDSL...
P00804,lspA lsp b0027 JW0025,Lipoprotein signal peptidase (EC 3.4.23.36) (P...,True,1,MSQSICSTGLRWLWLVVVVLIIDLGSKYLILQNFALGDTVPLFPSL...
P00861,lysA b2838 JW2806,Diaminopimelate decarboxylase (DAP decarboxyla...,True,1,MPHSLFSTDTDLTAENLLRLPAEFGCPVWVYDAQIIRRQIAALKQF...
P00946,manA pmi b1613 JW1605,Mannose-6-phosphate isomerase (EC 5.3.1.8) (Ph...,True,1,MQKLINSVQNYAWGSKTALTELYGMENPSSQPMAELWMGAHPKSSS...
...,...,...,...,...,...
P76154,ydfK b1544 JW1537,Cold shock protein YdfK,True,2,MKSKDTLKWFPAQLPEVRIILGDAVVEVAKQGRPINTRTLLDYIEG...
P0AEG8,dsrB b1952 JW1936,Protein DsrB,True,2,MKVNDRVTVKTDGGPRRPGVVLAVEEFSEGTMYLVSLEDYPLGIWF...
P33668,ybbC b0498 JW0487,Uncharacterized protein YbbC,True,2,MKYSSIFSMLSFFILFACNETAVYGSDENIIFMRYVEKLHLDKYSV...
A0A7H2C7B0,speFL ECK4660 b4803,Leader peptide SpeFL (Arrest peptide SpeFL),False,2,MENNSRTMPHIRRTTHIMKFAHRNSFDFHFFNAR


What is the distribution of protein existence levels among Swiss-Prot and TrEMBL entries?

In [5]:
# Count proteins per combination of review status and protein existence level
df_uniprot_ecoli.groupby(["reviewed", "protein_existence"]).size().reset_index(name="count")

Unnamed: 0,reviewed,protein_existence,count
0,False,1,1
1,False,2,1
2,True,1,3118
3,True,2,164


This organism seems to be very well researched. TrEMBL adds only two samples to the dataset. Which ones are those?

In [6]:
df_uniprot_ecoli[~df_uniprot_ecoli.reviewed]

Uniprot,gene_names,protein_names,reviewed,protein_existence,sequence
A0A7H2C7B0,speFL ECK4660 b4803,Leader peptide SpeFL (Arrest peptide SpeFL),False,2,MENNSRTMPHIRRTTHIMKFAHRNSFDFHFFNAR
A0A0A6YVN8,D-tagatose 3-epimerase,D-tagatose 3-epimerase,False,1,MNKVGMFYTYWSTEWMVDFPATAKRIAGLGFDLMEISLGEFHNLSD...


A metal ion binding protein with otherwise unknown function or location, and an expression factor. Neither is related to transport, so we can remove them.

In [44]:
df_uniprot_ecoli = df_uniprot_ecoli[df_uniprot_ecoli.reviewed].drop(["reviewed"], axis=1)
df_uniprot_ecoli

Uniprot,gene_names,protein_names,protein_existence,sequence
P00509,aspC b0928 JW0911,Aspartate aminotransferase (AspAT) (EC 2.6.1.1...,1,MFENITAAPADPILGLADLFRADERPGKINLGIGVYKDETGKTPVL...
P00803,lepB b2568 JW2552,Signal peptidase I (SPase I) (EC 3.4.21.89) (L...,1,MANMFALILVIATLVTGILWCVDKFFFAPKRRERQAAAQAAAGDSL...
P00804,lspA lsp b0027 JW0025,Lipoprotein signal peptidase (EC 3.4.23.36) (P...,1,MSQSICSTGLRWLWLVVVVLIIDLGSKYLILQNFALGDTVPLFPSL...
P00861,lysA b2838 JW2806,Diaminopimelate decarboxylase (DAP decarboxyla...,1,MPHSLFSTDTDLTAENLLRLPAEFGCPVWVYDAQIIRRQIAALKQF...
P00946,manA pmi b1613 JW1605,Mannose-6-phosphate isomerase (EC 5.3.1.8) (Ph...,1,MQKLINSVQNYAWGSKTALTELYGMENPSSQPMAELWMGAHPKSSS...
...,...,...,...,...
P77564,ydhW b1672 JW1662,Uncharacterized protein YdhW,2,MGKMNHQDELPLAKVSEVDEAKRQWLQGMRHPVDTVTEPEPAEILA...
P76157,ynfN b1551 JW5254,Uncharacterized protein YnfN,2,MREYPNGEKTHLTVMAAGFPSLTGDHKVIYVAADRHVTSEEILEAA...
P76154,ydfK b1544 JW1537,Cold shock protein YdfK,2,MKSKDTLKWFPAQLPEVRIILGDAVVEVAKQGRPINTRTLLDYIEG...
P0AEG8,dsrB b1952 JW1936,Protein DsrB,2,MKVNDRVTVKTDGGPRRPGVVLAVEEFSEGTMYLVSLEDYPLGIWF...


It could also be interesting to have a bacteria-only dataset, and compare that to *E. coli*. Is there a tsv file that maps organism id to kingdom?
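If such a mapping exists, the filtering step itself would be simple. The sketch below assumes a lookup table with the hypothetical columns `organism_id` and `superkingdom`; in practice it might be read from a TSV file with `pd.read_csv(path, sep="\t")`. The protein table is a toy stand-in for `df_uniprot`.

```python
import pandas as pd

# Toy stand-in for the Uniprot table; accessions are placeholders.
df_proteins = pd.DataFrame(
    {"organism_id": [83333, 9606, 9606, 83333]},
    index=pd.Index(["protein_1", "protein_2", "protein_3", "protein_4"], name="Uniprot"),
)

# Hypothetical lookup table: the column names are assumptions.
df_taxonomy = pd.DataFrame(
    {
        "organism_id": [83333, 9606],
        "superkingdom": ["Bacteria", "Eukaryota"],
    }
)

# Map each protein's organism to its superkingdom, then keep the bacterial rows.
lookup = df_taxonomy.set_index("organism_id")["superkingdom"]
superkingdom = df_proteins["organism_id"].map(lookup)
df_bacteria = df_proteins[superkingdom == "Bacteria"]
print(df_bacteria)
```

Organisms missing from the lookup table map to `NaN` and are dropped by the comparison, so they should be counted separately before relying on the result.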

## Annotations for filtering out passive transporters

Different databases have different definitions of transporter classes, and different annotations.

#### TCDB:

The *TCDB class 2* represents the *Electrochemical Potential-driven Transporters*. These are split into the subclasses 2.A (Porters), 2.B, 2.C and 2.D. Class 2.A seems to correspond to the typical definition of secondary active transporters, so we should look at the remaining three subclasses.
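As a rough sketch of how the subclasses could be counted and how 2.A could be selected: the mapping below is a hypothetical accession-to-TCDB-ID dictionary with illustrative IDs in the TCDB format, and `tcdb_subclass` is a helper introduced only for this example.

```python
from collections import Counter

# Hypothetical accession -> TCDB ID mapping (illustrative IDs, placeholder proteins).
tcdb_ids = {
    "protein_1": "2.A.1.1.1",
    "protein_2": "2.A.21.2.1",
    "protein_3": "1.A.1.1.1",
    "protein_4": "3.A.1.1.1",
    "protein_5": "2.C.1.1.1",
}


def tcdb_subclass(tcdb_id):
    """Return the first two components of a TCDB ID, e.g. '2.A' for '2.A.1.1.1'."""
    return ".".join(tcdb_id.split(".")[:2])


# Number of proteins per subclass, useful for the stats on 2.B, 2.C and 2.D.
subclass_counts = Counter(tcdb_subclass(t) for t in tcdb_ids.values())

# Proteins in subclass 2.A (Porters).
porters = sorted(acc for acc, t in tcdb_ids.items() if tcdb_subclass(t) == "2.A")

print(subclass_counts)
print(porters)
```

Splitting on the dots instead of using a plain string prefix avoids accidental matches if the IDs are ever stored with extra text in front.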

#### Gene Ontology:

The term *secondary active transmembrane transporter activity* (GO:0015291) is a molecular function annotation. The definition is:

*Enables the transfer of a solute from one side of a membrane to the other, up its concentration gradient. The transporter binds the solute and undergoes a series of conformational changes. Transport works equally well in either direction and is driven by a chemiosmotic source of energy, not direct ATP coupling. Secondary active transporters include symporters and antiporters.*

The GO entry cites the TCDB paper as its source, so it seems to use the same definitions.

Child terms include: symporter activity, uniporter activity, antiporter activity.

We should look into electronically inferred annotations (IEA). How would they impact the number of secondary active transport annotations in our dataset?

Annotations should be filtered to the qualifier "enables", which means that the protein is directly responsible for the function.
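The effect of IEA annotations could be measured along the following lines. The annotation table is a toy example: the column names (`qualifier`, `go_term`, `evidence_code`) and all rows are assumptions, and the set of target terms is listed by hand here instead of being derived from the ontology.

```python
import pandas as pd

# Toy GO annotation table; column names and rows are assumptions.
df_go = pd.DataFrame(
    {
        "Uniprot": ["protein_1", "protein_1", "protein_2", "protein_3", "protein_4"],
        "qualifier": ["enables", "located_in", "enables", "enables", "contributes_to"],
        "go_term": [
            "symporter activity",
            "plasma membrane",
            "antiporter activity",
            "symporter activity",
            "antiporter activity",
        ],
        "evidence_code": ["IDA", "IDA", "IEA", "IMP", "IEA"],
    }
)

# Hand-listed terms; a real analysis would collect all descendants of GO:0015291.
target_terms = {
    "secondary active transmembrane transporter activity",
    "symporter activity",
    "antiporter activity",
    "uniporter activity",
}

# Keep only annotations where the protein directly enables a target function.
mask = df_go["go_term"].isin(target_terms) & (df_go["qualifier"] == "enables")

with_iea = set(df_go.loc[mask, "Uniprot"])
without_iea = set(df_go.loc[mask & (df_go["evidence_code"] != "IEA"), "Uniprot"])

# Proteins that would be lost when electronically inferred annotations are excluded.
only_iea = with_iea - without_iea
print(len(with_iea), len(without_iea), sorted(only_iea))
```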

#### Keywords:

- Symport (KW-0769): Protein involved in the transport of solutes across a biological membrane in one direction, which depends on the transport of another solute in the same direction. One molecule can move up an electrochemical gradient because the movement of the other molecule is more favorable. Example: the sodium/glucose co-transport.
- Antiport (KW-0050): Protein involved in the transport of a solute across a biological membrane coupled, directly, to the transport of a different solute in the opposite direction.
- There is no keyword for uniport. Looking at examples, uniporters seem to be annotated only with their substrate keyword and "Transport". Some uniporters in Uniprot seem to be annotated with Symport or Antiport in GO. Uniport typically depends on a concentration gradient between the two compartments, independently of the movement of any other molecular species. An example is the GLUT family of sugar transporters, which is responsible for sugar uptake in mammals (Reviews [1](https://doi.org/10.3390/ijms23158698), [2](https://doi.org/10.1007/s00424-020-02411-3)).
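A keyword-based filter following the points above might look like this sketch. It assumes keywords are available as one semicolon-separated string per protein; the proteins are placeholders. It also lists proteins that carry "Transport" but neither mechanism keyword, since that is where uniporters would be expected to end up.

```python
# Hypothetical accession -> keyword string (semicolon-separated, placeholder proteins).
keywords_by_protein = {
    "protein_1": "Cell membrane;Symport;Sugar transport;Transport",
    "protein_2": "Antiport;Ion transport;Transport",
    "protein_3": "Sugar transport;Transport",
    "protein_4": "Hydrolase",
}

MECHANISM_KEYWORDS = {"Symport", "Antiport"}


def split_keywords(keyword_string):
    """Split a semicolon-separated keyword string into a set of clean keywords."""
    return {kw.strip() for kw in keyword_string.split(";") if kw.strip()}


parsed = {acc: split_keywords(kws) for acc, kws in keywords_by_protein.items()}

# Proteins annotated with Symport or Antiport.
secondary_active = sorted(acc for acc, kws in parsed.items() if kws & MECHANISM_KEYWORDS)

# Transporters without a mechanism keyword: candidates that need manual review.
transport_only = sorted(
    acc
    for acc, kws in parsed.items()
    if "Transport" in kws and not kws & MECHANISM_KEYWORDS
)

print(secondary_active)
print(transport_only)
```

The second list will also contain channels and primary active transporters, so it cannot be used as a uniporter set without further filtering.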


#### Interpro:

The biggest superfamily of secondary active transporters is the Major Facilitator Superfamily (MFS). All members of that superfamily contain the same sequence domain. Generally, the TCDB is structured along protein families, which are typically defined by the existence of a particular domain. If we create a dataset of all secondary active transport families, we can use Interpro annotations as well.
