# PSSM Feature evaluation

During the dataset evaluation, we found that the E. coli transporters form their own cluster in the PCA plot. How does the model perform without the E. coli transporters?

# Imports

In [1]:
from subpred.transporter_dataset import create_dataset
from subpred.eval import (
    get_independent_test_set,
    optimize_hyperparams,
    preprocess_pandas,
    models_quick_compare,
    get_confusion_matrix,
    get_classification_report,
    full_test,
    get_cv_scores
)
from subpred.pssm import calculate_pssms_notebook

# Dataset

In [2]:
outliers = (
    ["Q9HBR0", "Q07837"]
    + ["O81775", "Q9SW07", "Q9FHH5", "Q8S8A0", "Q3E965", "Q3EAV6", "Q3E8L0"]
)
df = create_dataset(
    keywords_substrate_filter=["Amino-acid transport", "Sugar transport"],
    keywords_component_filter=["Transmembrane"],
    keywords_transport_filter=["Transport"],
    input_file="../data/raw/swissprot/uniprot-reviewed_yes.tab.gz",
    multi_substrate="integrate",
    verbose=True,
    tax_ids_filter=[3702, 9606, 559292],
    output_log="../logs/meta_amino_sugar_dataset.log",
    outliers=outliers,
    sequence_clustering=70
)
taxid_to_organism = {
    3702: "A. thaliana",
    9606: "Human",
    559292: "Yeast",
}
df = df.assign(organism=df.organism_id.map(taxid_to_organism))


cd-hit: clustered 314 sequences into 249 clusters at threshold 70


# Feature generation

In [3]:
labels = df.keywords_transport
labels.value_counts()

Sugar transport         134
Amino-acid transport    115
Name: keywords_transport, dtype: int64

In [4]:
df_pssm = calculate_pssms_notebook(df.sequence)
df_pssm

Uniprot,AA_50_1,AR_50_1,AN_50_1,AD_50_1,AC_50_1,AQ_50_1,AE_50_1,AG_50_1,AH_50_1,AI_50_1,...,VL_90_3,VK_90_3,VM_90_3,VF_90_3,VP_90_3,VS_90_3,VT_90_3,VW_90_3,VY_90_3,VV_90_3
Q9SFG0,0.784223,0.252900,0.327146,0.238979,0.394432,0.350348,0.276102,0.545244,0.227378,0.317865,...,0.434307,0.381387,0.421533,0.578467,0.357664,0.390511,0.392336,0.512774,0.656934,0.417883
Q08986,0.734091,0.259091,0.313636,0.220455,0.393182,0.295455,0.234091,0.529545,0.265909,0.415909,...,0.425047,0.345351,0.402277,0.584440,0.282732,0.351044,0.351044,0.605313,0.759013,0.387097
Q9BRV3,0.676768,0.488215,0.508418,0.464646,0.602694,0.511785,0.478114,0.565657,0.511785,0.612795,...,0.484375,0.403125,0.471875,0.706250,0.368750,0.443750,0.440625,0.568750,0.856250,0.478125
Q84WN3,0.664740,0.416185,0.462428,0.427746,0.624277,0.445087,0.456647,0.526012,0.479769,0.543353,...,0.383260,0.264317,0.374449,0.726872,0.215859,0.286344,0.312775,0.493392,0.982379,0.352423
O04249,0.735484,0.286022,0.352688,0.281720,0.479570,0.352688,0.318280,0.531183,0.279570,0.417204,...,0.476898,0.415842,0.471947,0.592409,0.387789,0.415842,0.422442,0.514851,0.702970,0.450495
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Q94EI9,0.807471,0.396552,0.425287,0.350575,0.718391,0.465517,0.410920,0.589080,0.393678,0.591954,...,0.469697,0.412121,0.463636,0.660606,0.378788,0.433333,0.445455,0.633333,0.903030,0.463636
Q92536,0.800000,0.343243,0.351351,0.278378,0.556757,0.354054,0.327027,0.581081,0.310811,0.483784,...,0.462547,0.425094,0.460674,0.597378,0.376404,0.425094,0.423221,0.528090,0.820225,0.458801
F4IHS9,0.745981,0.495177,0.520900,0.450161,0.649518,0.520900,0.485531,0.578778,0.469453,0.604502,...,0.533654,0.492788,0.543269,0.639423,0.492788,0.526442,0.524038,0.661058,0.713942,0.533654
Q04162,0.786925,0.305085,0.392252,0.295400,0.513317,0.372881,0.341404,0.624697,0.278450,0.411622,...,0.527721,0.501027,0.503080,0.694045,0.435318,0.525667,0.519507,0.624230,0.837782,0.517454


## Independent test set

In [5]:
X, y, feature_names, sample_names = preprocess_pandas(
    df_pssm, labels, return_names=True
)
(
    X_train,
    X_test,
    y_train,
    y_test,
    sample_names_train,
    sample_names_test,
) = get_independent_test_set(X, y, sample_names=sample_names, test_size=0.2)
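
How `get_independent_test_set` splits the data internally is not shown here. As a rough sketch, assuming it behaves like a stratified hold-out split, scikit-learn's `train_test_split` reproduces the sizes reported later (199 training and 50 test samples). The features and sample names below are random placeholders, not the real PSSM data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the class sizes reported above (134 sugar, 115 amino-acid).
rng = np.random.default_rng(0)
y = np.array(["Sugar transport"] * 134 + ["Amino-acid transport"] * 115)
X = rng.normal(size=(len(y), 8))  # random placeholder features, not real PSSM values
sample_names = np.array([f"P{i:05d}" for i in range(len(y))])  # made-up identifiers

# stratify=y keeps the class ratio roughly equal in both subsets.
X_train, X_test, y_train, y_test, names_train, names_test = train_test_split(
    X, y, sample_names, test_size=0.2, stratify=y, random_state=0
)
print(X_train.shape, X_test.shape)
```

Passing the sample names through the same call keeps them aligned with the rows of `X_train` and `X_test`, which is useful for tracing misclassified proteins afterwards.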



## Model comparison

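PSSM seems to work better than the sequence-based features. The SVM variants look the most promising: LinearSVC has the highest mean score (0.914), and the RBF SVC is slightly lower (0.894) but more consistent across folds.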

In [6]:
models_quick_compare(X_train, y_train)

est,0,1,2,3,4,mean,std
GaussianNB(),0.824,0.623,0.771,0.646,0.692,0.711,0.085
KNeighborsClassifier(),0.825,0.848,0.771,0.847,0.794,0.817,0.034
"LinearSVC(class_weight='balanced', max_iter=1000000.0, random_state=0)",0.975,0.774,0.975,0.949,0.897,0.914,0.085
"LinearSVC(max_iter=1000000.0, random_state=0)",0.975,0.774,0.975,0.949,0.897,0.914,0.085
"RandomForestClassifier(class_weight='balanced', random_state=0)",0.95,0.699,0.768,0.795,0.866,0.816,0.096
RandomForestClassifier(random_state=0),0.95,0.725,0.768,0.822,0.894,0.832,0.091
SGDClassifier(random_state=0),0.975,0.8,0.871,0.924,0.845,0.883,0.068
"SVC(class_weight='balanced', random_state=0)",0.925,0.825,0.873,0.899,0.897,0.884,0.038
SVC(random_state=0),0.95,0.85,0.873,0.899,0.897,0.894,0.037
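
A comparison table of this kind can be approximated with plain scikit-learn. The sketch below is not the implementation of `models_quick_compare`; the scaling step, the 5-fold setup, the `f1_macro` scorer, and the synthetic data are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic two-class data standing in for the PSSM feature matrix.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

estimators = [
    GaussianNB(),
    KNeighborsClassifier(),
    LinearSVC(max_iter=100000, random_state=0),
    SVC(random_state=0),
]

# One row per estimator, one column per cross-validation fold.
rows = {}
for est in estimators:
    pipe = make_pipeline(StandardScaler(), est)
    rows[type(est).__name__] = cross_val_score(
        pipe, X_demo, y_demo, cv=5, scoring="f1_macro"
    )

df_compare = pd.DataFrame.from_dict(rows, orient="index").round(3)
# Both summary columns are computed from the five fold columns only.
df_compare = df_compare.assign(
    mean=df_compare.mean(axis=1), std=df_compare.std(axis=1)
)
print(df_compare)
```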


## Parameter tuning

#### Custom transformer

Here, we try the multi-PSSM feature, which generates features for all combinations of the PSSM generation parameters and selects the best combination based on the training set. First, without the transformer:

In [7]:
gsearch = optimize_hyperparams(
    X_train,
    y_train,
    kernel="linear",
    dim_reduction=None,
    C=[0.01, 0.1, 1, 10],
)

{'linearsvc__C': 10, 'linearsvc__class_weight': 'balanced', 'linearsvc__dual': False, 'linearsvc__max_iter': 100000000.0}
0.929
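
The parameter names in the output (`linearsvc__C`, `linearsvc__dual`, ...) follow the naming scheme of a scikit-learn pipeline tuned by grid search. A minimal sketch of such a search is shown below, assuming a scaler followed by `LinearSVC`; the actual steps, grid, and scorer inside `optimize_hyperparams` may differ, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic two-class data standing in for X_train, y_train.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# make_pipeline names each step after its lowercased class name,
# which is where the "linearsvc__" prefix comes from.
pipe = make_pipeline(StandardScaler(), LinearSVC(max_iter=100000))
param_grid = {
    "linearsvc__C": [0.01, 0.1, 1, 10],
    "linearsvc__class_weight": ["balanced", None],
    "linearsvc__dual": [True, False],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1_macro")
search.fit(X_demo, y_demo)
print(search.best_params_)
print(round(search.best_score_, 3))
```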


The PSSMSelector increases the score further, from 0.929 to 0.949:

In [8]:
gsearch = optimize_hyperparams(
    X_train,
    y_train,
    kernel="linear",
    dim_reduction=None,
    feature_transformer="pssm", 
    feature_names = feature_names,
    C=[0.001, 0.01, 0.1, 1]
)

{'linearsvc__C': 0.1, 'linearsvc__class_weight': 'balanced', 'linearsvc__dual': True, 'linearsvc__max_iter': 100000000.0, 'pssmselector__iterations': 3, 'pssmselector__uniref_threshold': 50}
0.949
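
The selection step can be pictured as a transformer that keeps only the columns belonging to one parameter combination, so that the grid search can treat the combination as a hyperparameter. The class below is a hypothetical stand-in, not the project's `pssmselector`. It assumes that column names follow the pattern `<pair>_<uniref_threshold>_<iterations>` (as in `AA_50_1`) and that `"all"` disables the respective filter.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class PSSMColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: keep PSSM columns for one parameter combination."""

    def __init__(self, feature_names, uniref_threshold="all", iterations="all"):
        self.feature_names = feature_names
        self.uniref_threshold = uniref_threshold
        self.iterations = iterations

    def fit(self, X, y=None):
        mask = []
        for name in self.feature_names:
            # Assumed name layout: "<pair>_<threshold>_<iterations>".
            _, threshold, n_iter = name.split("_")
            keep_threshold = (
                self.uniref_threshold == "all"
                or int(threshold) == self.uniref_threshold
            )
            keep_iter = self.iterations == "all" or int(n_iter) == self.iterations
            mask.append(keep_threshold and keep_iter)
        self.mask_ = np.array(mask)
        return self

    def transform(self, X):
        # Column subsetting only; the values themselves are unchanged.
        return np.asarray(X)[:, self.mask_]


names = ["AA_50_1", "AR_50_1", "AA_50_3", "AR_50_3", "AA_90_3", "AR_90_3"]
X_demo = np.arange(12).reshape(2, 6)

X_sel = PSSMColumnSelector(names, uniref_threshold=50, iterations=3).fit_transform(X_demo)
X_all = PSSMColumnSelector(names, uniref_threshold="all", iterations=3).fit_transform(X_demo)
print(X_sel.shape, X_all.shape)
```

Because the selection happens inside `fit`, placing such a transformer in a pipeline means the combination is chosen using the training folds only.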


Without the selector, the RBF kernel scores slightly higher than the linear kernel (0.939 vs. 0.929):

In [9]:
gsearch = optimize_hyperparams(
    X_train,
    y_train,
    kernel="rbf",
    dim_reduction=None,
    C=[0.1, 1, 10, 100],
)

{'svc__C': 10, 'svc__class_weight': 'balanced', 'svc__gamma': 'scale'}
0.939


Overall, the linear and RBF kernels perform similarly on this dataset. The RBF kernel with the PSSMSelector:

In [10]:
gsearch = optimize_hyperparams(
    X_train,
    y_train,
    kernel="rbf",
    dim_reduction=None,
    C=[0.1, 1, 10, 100],
    feature_transformer="pssm",
    feature_names=feature_names,
)
best_estimator_rbf = gsearch

{'pssmselector__iterations': 3, 'pssmselector__uniref_threshold': 'all', 'svc__C': 10, 'svc__class_weight': 'balanced', 'svc__gamma': 'scale'}
0.949


With the PSSMSelector, the RBF kernel reaches 0.949, the same score as the linear kernel.

## Dimensionality reduction

In [11]:
gsearch = optimize_hyperparams(
    X_train,
    y_train,
    kernel="linear",
    dim_reduction="pca",
)

{'linearsvc__C': 10, 'linearsvc__class_weight': None, 'linearsvc__dual': False, 'linearsvc__max_iter': 100000000.0, 'pca__n_components': 0.97}
0.929
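
A fractional `pca__n_components` such as 0.97 tells scikit-learn's `PCA` to keep as many components as are needed to explain that share of the variance. The sketch below illustrates this on synthetic data; the scaling step and the remaining settings are assumptions, not the exact pipeline built by `optimize_hyperparams`.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic two-class data standing in for the PSSM feature matrix.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=40, n_informative=10, random_state=0
)

pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.97),  # a float in (0, 1) is a variance fraction
    LinearSVC(C=10, dual=False, max_iter=100000),
)
pipe.fit(X_demo, y_demo)

pca = pipe.named_steps["pca"]
explained = pca.explained_variance_ratio_.sum()
# n_components_ holds the number of components that were actually kept.
print(pca.n_components_, round(explained, 3))
```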


In [12]:
gsearch = optimize_hyperparams(
    X_train,
    y_train,
    kernel="linear",
    dim_reduction="pca",
    C=[10, 1, 0.1, 0.01],
    feature_transformer="pssm",
    feature_names=feature_names,
)
best_estimator_linearsvc_pca = gsearch

{'linearsvc__C': 0.1, 'linearsvc__class_weight': 'balanced', 'linearsvc__dual': True, 'linearsvc__max_iter': 100000000.0, 'pca__n_components': 0.98, 'pssmselector__iterations': 3, 'pssmselector__uniref_threshold': 50}
0.964


In [13]:
gsearch = optimize_hyperparams(
    X_train,
    y_train,
    kernel="rbf",
    dim_reduction="pca",
    C=[0.1, 1, 10, 100],
    # gamma=["scale"],
)

{'pca__n_components': 0.96, 'svc__C': 1, 'svc__class_weight': 'balanced', 'svc__gamma': 'scale'}
0.949


That already looks good. Now with the PSSMSelector:

In [14]:
gsearch = optimize_hyperparams(
    X_train,
    y_train,
    kernel="rbf",
    dim_reduction="pca",
    feature_transformer="pssm",
    feature_names=feature_names,
    # C=[1, 0.1, 10],
    # gamma=["scale"],
)
best_estimator_svc_pca = gsearch

{'pca__n_components': 0.99, 'pssmselector__iterations': 3, 'pssmselector__uniref_threshold': 'all', 'svc__C': 1, 'svc__class_weight': 'balanced', 'svc__gamma': 'scale'}
0.965


### Conclusion training set

The PSSMSelector with an RBF kernel and PCA leads to a good score with default SVM parameters (C=1, gamma="scale") while retaining 99% of the variance. In almost all cases, the model trained on just eukaryotes outperforms the model that also uses E. coli data.

## Validation


### RBF kernel without PCA

In [15]:
get_confusion_matrix(X_test, y_test, best_estimator_rbf, labels=labels)

observed \ predicted,Amino-acid transport,Sugar transport
Amino-acid transport,21,2
Sugar transport,2,25


In [16]:
get_classification_report(X_test, y_test, best_estimator_rbf, labels=labels)

,precision,recall,f1-score,support
Amino-acid transport,0.913,0.913,0.913,23
Sugar transport,0.926,0.926,0.926,27
macro avg,0.919,0.919,0.919,50
weighted avg,0.92,0.92,0.92,50


### Linear kernel with PCA


In [17]:
get_confusion_matrix(X_test, y_test, best_estimator_linearsvc_pca, labels=labels)

observed \ predicted,Amino-acid transport,Sugar transport
Amino-acid transport,20,3
Sugar transport,1,26


In [18]:
get_classification_report(X_test, y_test, best_estimator_linearsvc_pca, labels=labels)

,precision,recall,f1-score,support
Amino-acid transport,0.952,0.87,0.909,23
Sugar transport,0.897,0.963,0.929,27
macro avg,0.924,0.916,0.919,50
weighted avg,0.922,0.92,0.92,50


### RBF + PCA

The RBF kernel combined with PCA leads to the best model.


In [19]:
df_cm = get_confusion_matrix(X_test, y_test, best_estimator_svc_pca, labels=labels)
df_cm

observed \ predicted,Amino-acid transport,Sugar transport
Amino-acid transport,21,2
Sugar transport,1,26


In [20]:
from numpy import diag
acc = diag(df_cm) / df_cm.sum(axis=1)
acc.round(3)

observed
Amino-acid transport    0.913
Sugar transport         0.963
dtype: float64

In [21]:
get_classification_report(X_test, y_test, best_estimator_svc_pca, labels=labels, add_balanced_accuracy=True)

,precision,recall,f1-score,support
Amino-acid transport,0.955,0.913,0.933,23
Sugar transport,0.929,0.963,0.945,27
macro avg,0.942,0.938,0.939,50
weighted avg,0.941,0.94,0.94,50
balanced accuracy,,,0.938,50
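
The values in this report follow directly from the RBF + PCA confusion matrix above. As a worked check, the sketch below recomputes per-class recall, precision, F1, and balanced accuracy from the four counts (rows are observed classes, columns are predicted classes).

```python
import numpy as np
import pandas as pd

classes = ["Amino-acid transport", "Sugar transport"]
# Counts copied from the RBF + PCA confusion matrix.
cm = pd.DataFrame([[21, 2], [1, 26]], index=classes, columns=classes)

tp = np.diag(cm.to_numpy())          # correctly classified samples per class
recall = tp / cm.sum(axis=1)         # divide by observed totals (row sums)
precision = tp / cm.sum(axis=0)      # divide by predicted totals (column sums)
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = recall.mean()    # mean of the per-class recalls

print(recall.round(3).tolist())      # [0.913, 0.963]
print(precision.round(3).tolist())   # [0.955, 0.929]
print(f1.round(3).tolist())          # [0.933, 0.945]
print(round(balanced_accuracy, 3))   # 0.938
```

The per-class accuracy computed in the previous cell is therefore the same quantity as the recall column of the report.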


Training scores:

In [22]:
get_classification_report(X_train, y_train, best_estimator_svc_pca, labels=labels, add_balanced_accuracy=True)

,precision,recall,f1-score,support
Amino-acid transport,1.0,1.0,1.0,92
Sugar transport,1.0,1.0,1.0,107
macro avg,1.0,1.0,1.0,199
weighted avg,1.0,1.0,1.0,199
balanced accuracy,,,1.0,199


Training CV scores

In [23]:
df_cv_scores = get_cv_scores(X_train, y_train, gsearch.best_estimator_, labels=labels.unique())
df_cv_scores

,1,2,3,4,5,avg,sdev
fit_time,0.065,0.056,0.063,0.041,0.022,0.05,0.018
score_time,0.013,0.013,0.007,0.006,0.006,0.009,0.004
test_acc,1.0,0.925,0.975,0.925,1.0,0.965,0.038
test_acc_Amino-acid transport,1.0,0.895,0.944,0.944,1.0,0.957,0.044
test_acc_Sugar transport,1.0,0.952,1.0,0.909,1.0,0.972,0.041
test_acc_bal,1.0,0.924,0.972,0.927,1.0,0.965,0.038
test_f1_Amino-acid transport,1.0,0.919,0.971,0.919,1.0,0.962,0.041
test_f1_Sugar transport,1.0,0.93,0.978,0.93,1.0,0.968,0.035
test_f1_macro,1.0,0.925,0.975,0.925,1.0,0.965,0.038
test_precision_Amino-acid transport,1.0,0.944,1.0,0.895,1.0,0.968,0.047


### Conclusion

Without the E. coli transporters, all SVM kernels and preprocessing methods we tested performed well on both the training and test sets, always 0.02-0.05 better in average F1 score than with E. coli.

## Estimating validation variance

How much did the result depend on choosing the training and test sets?

Mean and standard deviation for randomly selected training and validation sets.
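
The idea can be sketched as a loop over random stratified splits that records the per-class test F1 each time. This is a simplified illustration on synthetic data, not the implementation of `full_test`: it uses a fixed RBF SVM and skips the hyperparameter search that appears to be repeated for each split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data standing in for the PSSM feature matrix.
X_demo, y_demo = make_classification(
    n_samples=250, n_features=20, n_informative=5, random_state=0
)

test_f1 = []
for seed in range(10):  # ten repetitions, each with a different random split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=seed
    )
    model = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))
    model.fit(X_tr, y_tr)
    # average=None returns one F1 score per class.
    test_f1.append(f1_score(y_te, model.predict(X_te), average=None))

test_f1 = np.array(test_f1)  # shape: (repetitions, classes)
print("Mean F1", test_f1.mean(axis=0).round(3))
print("Sdev F1", test_f1.std(axis=0, ddof=1).round(3))
```

A large standard deviation across repetitions indicates that a single train/test split gives an unreliable estimate of the test score.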

#### RBF+PCA 

In [24]:
df_scores, df_params = full_test(
    df_pssm,
    labels,
    dim_reduction="pca",
    kernel="rbf",
    repetitions=10,
    feature_transformer="pssm",
)
df_scores_gr = df_scores.groupby(["label", "dataset"], as_index=False)
print("Mean F1")
display(df_scores_gr.mean().pivot(index="label", columns="dataset", values="F1 score"))
print("Sdev F1")
display(df_scores_gr.std().pivot(index="label", columns="dataset", values="F1 score"))
print("Parameters")
display(df_params)

Mean F1


label \ dataset,test,train
Amino-acid transport,0.9332,0.964
Sugar transport,0.9513,0.97


Sdev F1


label \ dataset,test,train
Amino-acid transport,0.072278,0.010541
Sugar transport,0.044791,0.008718


Parameters


,0,1,2,3,4,5,6,7,8,9
pca__n_components,0.92,0.98,0.98,0.96,0.95,0.99,0.98,0.99,0.99,0.97
pssmselector__iterations,3,all,3,3,3,all,all,3,all,3
pssmselector__uniref_threshold,all,50,50,all,50,50,50,50,50,50
svc__C,1,10,10,1,10,1,1,10,10,10
svc__class_weight,,balanced,balanced,balanced,balanced,balanced,balanced,balanced,balanced,balanced
svc__gamma,0.1,0.01,scale,scale,scale,scale,0.01,scale,scale,scale


Removing E. coli from the dataset improves the PSSM model by up to 0.05 (F1).