<a href="https://colab.research.google.com/github/brenoslivio/MDNE_2024/blob/main/Project_1/Projeto1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## SCC5920 - Mineração de Dados Não Estruturados (Unstructured Data Mining) (2024)


**Project 1: Text Mining - Anticancer Peptide Classification**

Prof. Ricardo Marcacini

**Student:** Breno Livio Silva de Almeida

**NUSP:** 10276675

---

We start by installing the libraries that are not available by default on Google Colab: BioPython, used to read biological sequence data, and Optuna, a state-of-the-art framework for Bayesian hyperparameter optimization.

In [1]:
!pip install biopython
!pip install optuna



Next, we import the libraries we are going to use.

In [2]:
from Bio import SeqIO
import requests, io
import polars as pl
import optuna
import numpy as np
from sklearn.svm import SVC
from itertools import product
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from transformers import T5EncoderModel, T5Tokenizer
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, accuracy_score, f1_score, matthews_corrcoef, recall_score, confusion_matrix
import torch
import re
import joblib

Starting from the GitHub repository, we convert the FASTA files into a Polars DataFrame. FASTA is a plain-text format widely used to store biological sequences. BioPython is used to read the files.
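To make the format concrete: a FASTA record is a header line starting with `>` followed by one or more lines of sequence. The sketch below is a minimal pure-Python reader written only for illustration. It is not how BioPython implements `SeqIO.parse`, the function name `parse_fasta` is hypothetical, and the line wrap inside the second record is an assumption added to show that a sequence may span several lines.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence lines belonging to the current record are concatenated
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical file content; the second sequence is wrapped over two lines
example = """>ACP_2
GLFDIIKKIAESI
>ACP_3
GLLDIVKK
VVGAFGSL
"""

records = list(parse_fasta(example))
for header, sequence in records:
    print(header, sequence)
```

In the notebook itself this work is delegated to BioPython, which also handles details that this sketch ignores.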

In [3]:
labels = ["anticancer", "non"]

df_seqs = pl.DataFrame()

for label in labels:
    # Download the FASTA file of this class from the project repository
    response = requests.get(f"https://raw.githubusercontent.com/brenoslivio/MDNE_2024/refs/heads/main/Project_1/{label}.fasta")

    # Do not reuse the name `labels` here: it is the list being iterated over
    headers, seqs = [], []

    if response.status_code == 200:
        # Parse the downloaded text as FASTA records
        handle = io.StringIO(response.text)
        for record in SeqIO.parse(handle, "fasta"):
            headers.append(record.description)
            seqs.append(str(record.seq))
    else:
        print(f"Failed to download file. Status code: {response.status_code}")

    # Append the sequences of this class, tagging every row with its label
    df_seqs = pl.concat([df_seqs, pl.DataFrame({"header": headers, "sequence": seqs}).with_columns(label=pl.lit(label))])

df_seqs

| header | sequence | label |
| --- | --- | --- |
| str | str | str |
| ACP_1 | GLWSKIKEVGKEAAKAAAKAAGKAALGAVS… | anticancer |
| ACP_2 | GLFDIIKKIAESI | anticancer |
| ACP_3 | GLLDIVKKVVGAFGSL | anticancer |
| ACP_4 | GLFDIVKKVVGALGSL | anticancer |
| ACP_5 | GLFDIVKKVVGTLAGL | anticancer |
| … | … | … |
| non-ACP_202 | TDTPLDLAIQQLQNLAIESIPDPPTNTPEA… | non |
| non-ACP_203 | LRLIHFLHQTTDPYPQGPGTANQRRRR | non |
| non-ACP_204 | PVDTPLDLAIQQLQGLAIEELPDPPTSAPE… | non |
| non-ACP_205 | NTNVTPHLLAGMRLIAVQQPEDPLRVL | non |


### Bag of Words



Here we use the Bag of Words approach: each sequence is represented by the number of times each amino acid occurs in it.
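For intuition, the counting idea can be sketched on a single sequence with `collections.Counter`. This is a standalone illustration rather than the notebook's own function: the name `kmer_counts` is hypothetical, and unlike a dense representation it stores only the k-mers that actually occur. The sequence is that of ACP_2 from the dataset.

```python
from collections import Counter


def kmer_counts(seq, k=1):
    """Count overlapping k-mers; only k-mers that occur are stored."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


seq = "GLFDIIKKIAESI"  # sequence of ACP_2

aac = kmer_counts(seq, k=1)         # amino acid counts (Bag of Words)
dipeptides = kmer_counts(seq, k=2)  # the same idea with pairs of residues

print(aac["I"], aac["K"])           # 4 2
print(dipeptides["KK"])             # 1
```

With `k=1` the counts sum to the sequence length, and with `k=2` they sum to the length minus one, because the windows overlap.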

In [4]:
chars = ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I',
            'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V', 'X']

def dict_kmer(seq, k):
    counts = {''.join(comb): 0 for comb in product(chars, repeat= k)}
    L = len(seq)
    for i in range(L - k + 1):
        counts[seq[i:i+k]] += 1

    return counts

df_seqs_bow = df_seqs.with_columns(pl.col("sequence").map_elements(lambda x: dict_kmer(x, 1)).alias("aac")).unnest("aac")
df_seqs_bow



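| header | sequence | label | A | R | N | D | C | Q | E | G | H | I | L | K | M | F | P | S | T | W | Y | V | X |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | str | str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
| ACP_1 | GLWSKIKEVGKEAAKAAAKAAGKAALGAVS… | anticancer | 11 | 0 | 0 | 0 | 0 | 0 | 3 | 4 | 0 | 1 | 2 | 6 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 3 | 0 |
| ACP_2 | GLFDIIKKIAESI | anticancer | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 4 | 1 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| ACP_3 | GLLDIVKKVVGAFGSL | anticancer | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 0 | 1 | 3 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 3 | 0 |
| ACP_4 | GLFDIVKKVVGALGSL | anticancer | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 0 | 1 | 3 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 3 | 0 |
| ACP_5 | GLFDIVKKVVGTLAGL | anticancer | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 0 | 1 | 3 | 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 3 | 0 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| non-ACP_202 | TDTPLDLAIQQLQNLAIESIPDPPTNTPEA… | non | 3 | 0 | 2 | 4 | 1 | 3 | 2 | 0 | 0 | 3 | 5 | 0 | 0 | 0 | 5 | 1 | 4 | 0 | 0 | 0 | 0 |
| non-ACP_203 | LRLIHFLHQTTDPYPQGPGTANQRRRR | non | 1 | 5 | 1 | 1 | 0 | 3 | 0 | 2 | 2 | 1 | 3 | 0 | 0 | 1 | 3 | 0 | 3 | 0 | 1 | 0 | 0 |
| non-ACP_204 | PVDTPLDLAIQQLQGLAIEELPDPPTSAPE… | non | 3 | 0 | 1 | 4 | 0 | 3 | 3 | 1 | 0 | 2 | 6 | 0 | 0 | 0 | 7 | 1 | 2 | 0 | 0 | 2 | 0 |
| non-ACP_205 | NTNVTPHLLAGMRLIAVQQPEDPLRVL | non | 2 | 2 | 2 | 1 | 0 | 2 | 1 | 1 | 1 | 1 | 5 | 0 | 1 | 0 | 3 | 0 | 2 | 0 | 0 | 3 | 0 |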


With this structured representation, we use Optuna to tune an SVM classifier, maximizing the mean accuracy over 10-fold cross-validation.

In [5]:
def objective(trial):
    C = trial.suggest_float("svm_C", 1e-6, 1e3, log=True)
    gamma = trial.suggest_categorical("svm_gamma", ["scale", "auto"])
    kernel = trial.suggest_categorical("svm_kernel", ["linear", "poly", "rbf", "sigmoid"])

    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("svm", SVC(C=C, kernel=kernel, gamma=gamma, random_state=0))
    ])

    scores = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")

    return np.mean(scores)

X, y = df_seqs_bow.select(pl.nth(range(3, len(df_seqs_bow.columns)))).to_numpy(), df_seqs_bow["label"].to_numpy()

sampler = optuna.samplers.TPESampler(seed=0)
study1 = optuna.create_study(direction="maximize", sampler=sampler)
study1.optimize(objective, n_trials=100)

print("Best hyperparameters:", study1.best_params)
print(f"Best accuracy score: {study1.best_value:.4f}")

[I 2024-09-28 19:35:31,370] A new study created in memory with name: no-name-2831e9bd-a101-4aec-a202-13bb9f56ac54
[I 2024-09-28 19:35:31,597] Trial 0 finished with value: 0.747563025210084 and parameters: {'svm_C': 0.0869604013210559, 'svm_gamma': 'scale', 'svm_kernel': 'rbf'}. Best is trial 0 with value: 0.747563025210084.
[I 2024-09-28 19:35:31,735] Trial 1 finished with value: 0.7563865546218487 and parameters: {'svm_C': 106.15904599003998, 'svm_gamma': 'scale', 'svm_kernel': 'sigmoid'}. Best is trial 1 with value: 0.7563865546218487.
[I 2024-09-28 19:35:31,954] Trial 2 finished with value: 0.5988235294117648 and parameters: {'svm_C': 4.35837428774127e-06, 'svm_gamma': 'scale', 'svm_kernel': 'sigmoid'}. Best is trial 1 with value: 0.7563865546218487.
[I 2024-09-28 19:35:32,049] Trial 3 finished with value: 0.7564705882352941 and parameters: {'svm_C': 15.574964948467398, 'svm_gamma': 'auto', 'svm_kernel': 'sigmoid'}. Best is trial 3 with value: 0.7564705882352941.
[I 2024-09-28 19:35

Best hyperparameters: {'svm_C': 5.031188522375328, 'svm_gamma': 'auto', 'svm_kernel': 'rbf'}
Best accuracy score: 0.9189
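The search range for `svm_C` spans nine orders of magnitude, which is why it is declared with `log=True`: values are drawn from the log domain, so each decade is treated alike. The sketch below illustrates only that search domain using plain random sampling. It does not reproduce the TPE sampler, and `sample_log_uniform` is a hypothetical helper name.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(1e-6, 1e3, rng) for _ in range(10_000)]

# Six of the nine decades in [1e-6, 1e3] lie below 1, so about two thirds
# of the samples should fall there; a linear-scale draw would put almost none.
below_one = sum(s < 1.0 for s in samples) / len(samples)
print(f"fraction of samples below 1: {below_one:.2f}")
```

Sampling `C` on a linear scale over the same range would spend nearly every trial on values above 1 and would almost never try strongly regularized models.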


Using the hyperparameters of the best model found, we rebuild it and cross-validate it again to compute additional metrics: F1-score, MCC, sensitivity, and specificity.

In [6]:
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'f1': make_scorer(f1_score, pos_label='anticancer'),
    'mcc': make_scorer(matthews_corrcoef),
    'sensitivity': make_scorer(recall_score, pos_label='anticancer'),
    'specificity': make_scorer(recall_score, pos_label='non')
}

best_params = study1.best_params

best_model = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(C=best_params["svm_C"], kernel=best_params["svm_kernel"], gamma=best_params["svm_gamma"]))
])

cv_results = cross_validate(best_model, X, y, cv=10, scoring=scoring)

print(f"Accuracy: {cv_results['test_accuracy'].mean():.4f} +- {cv_results['test_accuracy'].std():.4f}")
print(f"F1-score: {cv_results['test_f1'].mean():.4f} +- {cv_results['test_f1'].std():.4f}")
print(f"MCC: {cv_results['test_mcc'].mean():.4f} +- {cv_results['test_mcc'].std():.4f}")
print(f"Sensitivity: {cv_results['test_sensitivity'].mean():.4f} +- {cv_results['test_sensitivity'].std():.4f}")
print(f"Specificity: {cv_results['test_specificity'].mean():.4f} +- {cv_results['test_specificity'].std():.4f}")

Accuracy: 0.9189 +- 0.0507
F1-score: 0.8955 +- 0.0620
MCC: 0.8358 +- 0.1005
Sensitivity: 0.8681 +- 0.0818
Specificity: 0.9517 +- 0.0710
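These metrics all derive from the four cells of a binary confusion matrix, with `anticancer` as the positive class. Sensitivity is the recall of the positive class and specificity is the recall of the negative class, which is why both are scored with `recall_score` and different `pos_label` values. The sketch below computes them by hand; the counts are hypothetical and stand for a single fold, and `binary_metrics` is an illustrative name.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute the reported metrics from confusion-matrix counts.

    Assumes both classes occur and at least one positive prediction is made,
    so no denominator except the MCC one needs a zero check.
    """
    sensitivity = tp / (tp + fn)  # recall of the positive class
    specificity = tn / (tn + fp)  # recall of the negative class
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


# Hypothetical fold: 14 anticancer peptides, 21 non-anticancer peptides
m = binary_metrics(tp=12, fn=2, tn=20, fp=1)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Unlike accuracy, MCC uses all four cells at once, which makes it informative when the two classes differ in size.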


### TF-IDF

Here we use the TF-IDF approach, treating each amino acid as a token, again with Optuna and SVM.

In [7]:
def amino_acid_tokenizer(sequence):
    return list(sequence)

def objective(trial):
    C = trial.suggest_float("svm_C", 1e-6, 1e3, log=True)
    gamma = trial.suggest_categorical("svm_gamma", ["scale", "auto"])
    kernel = trial.suggest_categorical("svm_kernel", ["linear", "poly", "rbf", "sigmoid"])

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(tokenizer=amino_acid_tokenizer, lowercase=False, token_pattern=None)),
        ("svm", SVC(C=C, kernel=kernel, gamma=gamma, random_state=0))
    ])

    scores = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")

    return np.mean(scores)

X, y = df_seqs["sequence"].to_numpy(), df_seqs["label"].to_list()

sampler = optuna.samplers.TPESampler(seed=0)
study2 = optuna.create_study(direction="maximize", sampler=sampler)
study2.optimize(objective, n_trials=100)

print("Best hyperparameters:", study2.best_params)
print(f"Best accuracy score: {study2.best_value:.4f}")

[I 2024-09-28 19:36:27,768] A new study created in memory with name: no-name-73a9c37f-f944-4c20-9af0-4f3882e06601
[I 2024-09-28 19:36:27,994] Trial 0 finished with value: 0.8437815126050421 and parameters: {'svm_C': 0.0869604013210559, 'svm_gamma': 'scale', 'svm_kernel': 'rbf'}. Best is trial 0 with value: 0.8437815126050421.
[I 2024-09-28 19:36:28,162] Trial 1 finished with value: 0.6750420168067227 and parameters: {'svm_C': 106.15904599003998, 'svm_gamma': 'scale', 'svm_kernel': 'sigmoid'}. Best is trial 0 with value: 0.8437815126050421.
[I 2024-09-28 19:36:28,405] Trial 2 finished with value: 0.5988235294117648 and parameters: {'svm_C': 4.35837428774127e-06, 'svm_gamma': 'scale', 'svm_kernel': 'sigmoid'}. Best is trial 0 with value: 0.8437815126050421.
[I 2024-09-28 19:36:28,569] Trial 3 finished with value: 0.8900840336134455 and parameters: {'svm_C': 15.574964948467398, 'svm_gamma': 'auto', 'svm_kernel': 'sigmoid'}. Best is trial 3 with value: 0.8900840336134455.
[I 2024-09-28 19:

Best hyperparameters: {'svm_C': 0.47286259734624975, 'svm_gamma': 'auto', 'svm_kernel': 'linear'}
Best accuracy score: 0.8958


Using the hyperparameters of the best model found, we rebuild it and cross-validate it again to compute additional metrics: F1-score, MCC, sensitivity, and specificity.

In [8]:
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'f1': make_scorer(f1_score, pos_label='anticancer'),
    'mcc': make_scorer(matthews_corrcoef),
    'sensitivity': make_scorer(recall_score, pos_label='anticancer'),
    'specificity': make_scorer(recall_score, pos_label='non')
}

best_params = study2.best_params

best_model = Pipeline([
    ("tfidf", TfidfVectorizer(tokenizer=amino_acid_tokenizer, lowercase=False, token_pattern=None)),
    ("svm", SVC(C=best_params["svm_C"], kernel=best_params["svm_kernel"], gamma=best_params["svm_gamma"]))
])

cv_results = cross_validate(best_model, X, y, cv=10, scoring=scoring)

print(f"Accuracy: {cv_results['test_accuracy'].mean():.4f} +- {cv_results['test_accuracy'].std():.4f}")
print(f"F1-score: {cv_results['test_f1'].mean():.4f} +- {cv_results['test_f1'].std():.4f}")
print(f"MCC: {cv_results['test_mcc'].mean():.4f} +- {cv_results['test_mcc'].std():.4f}")
print(f"Sensitivity: {cv_results['test_sensitivity'].mean():.4f} +- {cv_results['test_sensitivity'].std():.4f}")
print(f"Specificity: {cv_results['test_specificity'].mean():.4f} +- {cv_results['test_specificity'].std():.4f}")

Accuracy: 0.8958 +- 0.0495
F1-score: 0.8583 +- 0.0742
MCC: 0.7867 +- 0.1010
Sensitivity: 0.8121 +- 0.1160
Specificity: 0.9517 +- 0.0426
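To see how TF-IDF differs from raw counts, the weighting can be reproduced by hand on a few sequences. The sketch below follows the defaults of scikit-learn's `TfidfVectorizer` as documented (raw term counts, smoothed `idf = ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row). It is a simplified illustration, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Character-level TF-IDF with smoothed idf and L2-normalized rows."""
    n = len(docs)
    counts = [Counter(doc) for doc in docs]  # each character is one token
    vocab = sorted(set().union(*counts))
    # Document frequency: in how many sequences each amino acid appears
    df = {t: sum(t in c for c in counts) for t in vocab}
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for c in counts:
        raw = [c[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# Three sequences taken from the dataset (ACP_2, ACP_3, non-ACP_203)
docs = ["GLFDIIKKIAESI", "GLLDIVKKVVGAFGSL", "LRLIHFLHQTTDPYPQGPGTANQRRRR"]
vocab, rows = tfidf(docs)

for token, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(token, round(weight, 3))
```

Because of the row normalization, the representation no longer depends on sequence length, and amino acids that appear in every sequence are down-weighted relative to rarer ones.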


### Embeddings

Here we use the embedding approach. For this we use ProtT5-XL-U50, a protein language model based on the Transformer architecture and pre-trained on 2.1 billion protein sequences.
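The model returns one vector per token, so each sequence has to be reduced to a single fixed-length vector before it can be fed to an SVM. The notebook's `get_embeddings` does this by averaging the per-residue vectors while leaving out the final end-of-sequence token. The sketch below shows that pooling step on toy numbers with plain Python lists; the 2-dimensional vectors and the padding row are assumptions made for readability (the real embeddings have 1024 dimensions), and `mean_pool` is a hypothetical name.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the residue vectors, skipping the end-of-sequence token and padding.

    Assumes padding is on the right and the last real token is end-of-sequence.
    """
    n_tokens = sum(attention_mask)            # real (non-padding) tokens
    residues = token_vectors[: n_tokens - 1]  # drop the end-of-sequence token
    dim = len(residues[0])
    return [sum(vec[j] for vec in residues) / len(residues) for j in range(dim)]


token_vectors = [
    [1.0, 0.0],  # residue 1
    [3.0, 2.0],  # residue 2
    [2.0, 4.0],  # residue 3
    [9.0, 9.0],  # end-of-sequence token
    [0.0, 0.0],  # padding
]
attention_mask = [1, 1, 1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)
print(pooled)  # [2.0, 2.0]
```

Averaging gives every peptide a vector of the same size regardless of its length, at the cost of discarding the order of the residues.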

In [9]:
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc", do_lower_case=False)

model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc")

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

model = model.to(device)
model = model.eval()

def get_embeddings(sequence):
    sequences = [" ".join(list(re.sub(r"[UZOB]", "X", sequence)))]

    # `pad_to_max_length` is deprecated; `padding="longest"` pads to the longest sequence in the batch
    ids = tokenizer.batch_encode_plus(sequences, add_special_tokens=True, padding="longest")
    input_ids = torch.tensor(ids['input_ids']).to(device)
    attention_mask = torch.tensor(ids['attention_mask']).to(device)

    with torch.no_grad():
        embedding = model(input_ids=input_ids, attention_mask=attention_mask)[0]

    embedding = embedding.cpu().numpy()

    # Count the real (non-padding) tokens; the last one is the end-of-sequence token, so it is dropped
    seq_len = int(attention_mask[0].sum())
    seq_emd = embedding[0][:seq_len-1]

    avg_pool = seq_emd.mean(axis=0)

    dict_embeddings = {f"embedding_{i}": embed for i, embed in enumerate(avg_pool)}

    return dict_embeddings

df_seqs_embeddings = df_seqs.with_columns(pl.col("sequence").map_elements(lambda x: get_embeddings(x)).alias("embeddings")).unnest("embeddings")
df_seqs_embeddings



header,sequence,label,embedding_0,embedding_1,embedding_2,embedding_3,embedding_4,embedding_5,embedding_6,embedding_7,embedding_8,embedding_9,embedding_10,embedding_11,embedding_12,embedding_13,embedding_14,embedding_15,embedding_16,embedding_17,embedding_18,embedding_19,embedding_20,embedding_21,embedding_22,embedding_23,embedding_24,embedding_25,embedding_26,embedding_27,embedding_28,embedding_29,embedding_30,embedding_31,embedding_32,embedding_33,…,embedding_987,embedding_988,embedding_989,embedding_990,embedding_991,embedding_992,embedding_993,embedding_994,embedding_995,embedding_996,embedding_997,embedding_998,embedding_999,embedding_1000,embedding_1001,embedding_1002,embedding_1003,embedding_1004,embedding_1005,embedding_1006,embedding_1007,embedding_1008,embedding_1009,embedding_1010,embedding_1011,embedding_1012,embedding_1013,embedding_1014,embedding_1015,embedding_1016,embedding_1017,embedding_1018,embedding_1019,embedding_1020,embedding_1021,embedding_1022,embedding_1023
str,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,…,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""ACP_1""","""GLWSKIKEVGKEAAKAAAKAAGKAALGAVS…","""anticancer""",-0.019323,-0.002599,0.130414,-0.075928,0.001824,-0.20331,0.021721,-0.094691,-0.109846,-0.034279,-0.065482,-0.020153,-0.160478,-0.091489,0.064163,0.023135,0.098003,-0.030922,0.03341,-0.096055,-0.119683,0.048285,-0.073786,0.065923,0.117198,-0.054174,-0.102073,-0.11074,-0.150871,0.002395,-0.020469,-0.00795,-0.145422,-0.076635,…,0.106716,-0.073939,-0.002574,-0.040431,-0.061274,0.082331,-0.016459,-0.089693,0.027867,-0.041129,-0.088048,-0.152166,0.062855,0.010831,0.01739,0.043263,0.088654,0.130139,0.076391,0.0215,-0.083522,0.045705,-0.016611,0.074871,-0.106587,0.117661,-0.073917,-0.10261,0.076547,0.126747,-0.055616,-0.105252,-0.096375,-0.027492,0.022106,-0.027703,-0.014583
"""ACP_2""","""GLFDIIKKIAESI""","""anticancer""",0.026106,-0.047088,0.055684,-0.054381,0.095308,-0.104832,-0.059624,-0.166802,-0.064655,0.000479,0.021701,-0.006763,-0.272851,0.044901,0.057929,0.210747,0.189701,-0.035808,0.030314,-0.137399,-0.149178,0.163169,-0.037329,-0.028955,0.13306,-0.084929,-0.093998,-0.047631,-0.166771,-0.099899,0.146678,-0.03236,-0.170556,0.06241,…,0.083569,0.011731,0.025899,0.018802,-0.103451,0.117415,0.01803,-0.060621,-0.091746,-0.153904,-0.122273,-0.318796,-0.070174,-0.110321,-0.056395,-0.194036,0.176199,0.132554,-0.004925,0.17127,-0.00089,0.21939,0.117478,0.035164,-0.002884,0.133629,0.126259,-0.019074,0.021489,0.126369,-0.264691,-0.016279,-0.070543,0.072851,0.073174,-0.027636,-0.081745
"""ACP_3""","""GLLDIVKKVVGAFGSL""","""anticancer""",-0.04655,0.026907,0.093185,-0.013866,-0.006329,-0.096084,-0.089391,-0.117143,-0.077145,-0.048794,0.049056,0.001643,-0.316676,0.004841,0.094172,0.113139,0.191189,-0.063296,0.00156,-0.125597,-0.124917,0.159636,-0.036154,0.031103,0.055007,-0.073215,-0.056273,-0.015267,-0.056054,-0.162778,0.052015,-0.042005,-0.164715,0.017658,…,0.053228,0.037828,0.025897,0.030871,-0.058719,0.126066,0.009878,-0.063738,-0.135373,-0.139417,-0.070759,-0.282681,-0.07936,-0.060174,-0.122023,-0.234252,0.238804,0.131539,-0.006006,0.089226,-0.039177,0.095129,0.073832,0.094882,-0.079774,0.096696,0.184683,-0.072877,0.046215,0.039431,-0.210029,-0.121189,-0.03889,-0.014502,0.052977,-0.020669,-0.000186
"""ACP_4""","""GLFDIVKKVVGALGSL""","""anticancer""",-0.050253,0.032823,0.086164,-0.02196,0.002551,-0.116478,-0.09336,-0.10176,-0.090514,-0.045479,0.059444,-0.004367,-0.323056,-0.012521,0.080398,0.136752,0.197317,-0.058988,0.016893,-0.101754,-0.141592,0.151487,-0.036233,0.009952,0.061943,-0.083107,-0.03802,-0.037597,-0.065409,-0.160795,0.03377,-0.041164,-0.163096,0.011319,…,0.064586,0.019812,-0.001945,0.023871,-0.069566,0.131595,-0.012397,-0.061908,-0.13541,-0.139694,-0.10826,-0.28004,-0.056485,-0.063998,-0.116124,-0.222508,0.243304,0.132536,0.006766,0.090709,-0.059587,0.107153,0.05713,0.107943,-0.097312,0.117711,0.185513,-0.054115,0.043497,0.037987,-0.211409,-0.135884,-0.035141,-0.015229,0.052926,-0.026794,0.029529
"""ACP_5""","""GLFDIVKKVVGTLAGL""","""anticancer""",-0.023627,0.045314,0.087997,-0.039884,0.025747,-0.126297,-0.087218,-0.109523,-0.076332,-0.033944,0.065591,-0.027439,-0.331323,-0.030035,0.102651,0.108456,0.200854,-0.073057,-0.010352,-0.107437,-0.170394,0.129797,-0.029912,0.011345,0.066405,-0.066004,-0.0099,-0.027443,-0.041042,-0.16352,0.024214,-0.0478,-0.163505,0.020145,…,0.057572,0.004956,-0.037287,0.069639,-0.063601,0.128616,-0.024082,-0.050992,-0.156378,-0.116389,-0.106268,-0.288029,-0.088477,-0.091337,-0.07992,-0.227669,0.21035,0.132463,-0.013089,0.088801,-0.073826,0.083157,0.074469,0.111204,-0.082519,0.155004,0.167264,-0.036502,0.051365,0.056889,-0.206001,-0.126025,-0.028914,-0.011678,0.044418,-0.009174,0.045671
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""non-ACP_202""","""TDTPLDLAIQQLQNLAIESIPDPPTNTPEA…","""non""",0.050742,-0.059714,0.016372,-0.020049,0.042029,0.000091,0.034749,-0.033896,0.002914,-0.049319,-0.046869,0.04289,0.04548,0.025109,0.014048,0.031173,-0.035155,0.050002,-0.09718,-0.102892,0.021198,0.031281,-0.090418,0.085432,0.113974,-0.079858,-0.013431,-0.041602,-0.054379,-0.049968,0.101042,0.011318,-0.1613,-0.006619,…,-0.023743,-0.087017,0.01377,-0.087582,-0.050301,0.0894,-0.031598,0.031218,-0.049093,-0.129286,0.023503,-0.010029,-0.099032,-0.013515,0.07646,-0.127167,0.044473,0.14367,-0.016932,0.046787,-0.083339,-0.017095,0.064361,-0.076279,0.002779,0.069929,-0.019064,-0.007609,-0.007798,0.076769,-0.11821,0.057585,-0.01256,-0.049747,0.011665,0.044846,-0.031156
"""non-ACP_203""","""LRLIHFLHQTTDPYPQGPGTANQRRRR""","""non""",0.02803,0.036604,-0.091647,-0.044611,0.043603,-0.022159,-0.026853,-0.111063,-0.035969,-0.014594,-0.014162,0.089065,-0.054728,-0.031313,0.043199,0.025919,-0.07761,0.009925,-0.037679,-0.053494,-0.125051,-0.040866,-0.044984,0.08078,0.052453,-0.023825,-0.061118,-0.059617,-0.017638,-0.060611,0.066968,-0.05079,-0.171635,-0.131143,…,-0.002923,-0.028772,0.035465,-0.025414,-0.01245,0.126289,-0.04858,0.101017,-0.049296,-0.136719,0.082758,-0.113213,-0.026212,-0.016412,0.049403,-0.163355,0.108071,0.143516,-0.026324,0.123617,0.031383,-0.070451,0.064663,-0.045421,-0.031978,0.116602,0.008426,-0.005496,0.038938,0.068255,-0.147706,-0.00726,-0.044979,-0.058679,0.115736,-0.009271,0.028007
"""non-ACP_204""","""PVDTPLDLAIQQLQGLAIEELPDPPTSAPE…","""non""",0.032758,-0.061579,0.0367,-0.026298,0.054131,-0.016103,0.035231,-0.036354,-0.00412,-0.029976,-0.009819,0.013979,0.006027,0.023361,0.029725,-0.012117,-0.039289,0.045128,-0.089149,-0.069391,0.00291,0.050829,-0.098885,0.090713,0.09728,-0.093768,-0.003327,-0.03654,-0.069333,-0.064352,0.095182,0.019191,-0.158955,0.012327,…,-0.036704,-0.04229,-0.019134,-0.025342,-0.041917,0.077848,-0.049748,0.033543,-0.029412,-0.129531,0.031265,-0.045039,-0.068032,0.031738,0.05199,-0.151383,0.112019,0.142814,-0.033489,0.020125,-0.077296,0.015719,0.069621,-0.088167,0.019959,0.066854,-0.003788,0.014421,0.008973,0.068116,-0.131336,0.02278,0.008647,-0.039617,0.016245,0.069241,-0.026764
"""non-ACP_205""","""NTNVTPHLLAGMRLIAVQQPEDPLRVL""","""non""",0.055526,0.089413,-0.055178,-0.027356,-0.03772,-0.033324,-0.018848,-0.043817,0.013107,0.070539,-0.003536,0.02082,0.007038,0.110897,0.019737,-0.046077,0.006275,-0.00708,-0.069891,-0.039204,-0.081543,0.051694,-0.010941,0.027294,0.143106,-0.083454,-0.012298,-0.070407,-0.038823,-0.045802,0.062846,-0.035068,-0.15591,-0.021579,…,0.015802,-0.094142,0.005524,0.05276,0.008497,0.096758,0.049283,0.027458,-0.056929,-0.166345,-0.007201,-0.059847,-0.084699,0.050306,0.091511,-0.095444,0.129113,0.137334,0.000229,0.087269,-0.070135,-0.038336,0.050547,-0.117702,0.053361,0.041716,0.089981,-0.063646,0.012448,-0.036425,-0.052472,0.002024,-0.049243,0.011945,-0.016162,-0.013265,0.046174


With the embedding representation, we use Optuna with SVM to maximize the mean accuracy over 10-fold cross-validation.

In [10]:
def objective(trial):
    C = trial.suggest_float("svm_C", 1e-6, 1e3, log=True)
    gamma = trial.suggest_categorical("svm_gamma", ["scale", "auto"])
    kernel = trial.suggest_categorical("svm_kernel", ["linear", "poly", "rbf", "sigmoid"])

    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("svm", SVC(C=C, kernel=kernel, gamma=gamma, random_state=0))
    ])

    scores = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")

    return np.mean(scores)

X, y = df_seqs_embeddings.select(pl.nth(range(3, len(df_seqs_embeddings.columns)))).to_numpy(), df_seqs_embeddings["label"].to_numpy()

sampler = optuna.samplers.TPESampler(seed=0)
study3 = optuna.create_study(direction="maximize", sampler=sampler)
study3.optimize(objective, n_trials=100)

print("Best hyperparameters:", study3.best_params)
print(f"Best accuracy score: {study3.best_value:.4f}")

[I 2024-09-28 19:47:14,959] A new study created in memory with name: no-name-e194b296-a0e9-4cde-b65e-faf4d4220042
[I 2024-09-28 19:47:15,463] Trial 0 finished with value: 0.8603361344537814 and parameters: {'svm_C': 0.0869604013210559, 'svm_gamma': 'scale', 'svm_kernel': 'rbf'}. Best is trial 0 with value: 0.8603361344537814.
[I 2024-09-28 19:47:15,644] Trial 1 finished with value: 0.8836974789915967 and parameters: {'svm_C': 106.15904599003998, 'svm_gamma': 'scale', 'svm_kernel': 'sigmoid'}. Best is trial 1 with value: 0.8836974789915967.
[I 2024-09-28 19:47:16,102] Trial 2 finished with value: 0.5988235294117648 and parameters: {'svm_C': 4.35837428774127e-06, 'svm_gamma': 'scale', 'svm_kernel': 'sigmoid'}. Best is trial 1 with value: 0.8836974789915967.
[I 2024-09-28 19:47:16,286] Trial 3 finished with value: 0.8926050420168066 and parameters: {'svm_C': 15.574964948467398, 'svm_gamma': 'auto', 'svm_kernel': 'sigmoid'}. Best is trial 3 with value: 0.8926050420168066.
[I 2024-09-28 19:

Best hyperparameters: {'svm_C': 0.0002361810760149506, 'svm_gamma': 'scale', 'svm_kernel': 'linear'}
Best accuracy score: 0.9301


Using the hyperparameters of the best model found, we rebuild it and cross-validate it again to compute additional metrics: F1-score, MCC, sensitivity, and specificity.

In [11]:
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'f1': make_scorer(f1_score, pos_label='anticancer'),
    'mcc': make_scorer(matthews_corrcoef),
    'sensitivity': make_scorer(recall_score, pos_label='anticancer'),
    'specificity': make_scorer(recall_score, pos_label='non')
}

best_params = study3.best_params

best_model = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(C=best_params["svm_C"], kernel=best_params["svm_kernel"], gamma=best_params["svm_gamma"]))
])

cv_results = cross_validate(best_model, X, y, cv=10, scoring=scoring)

print(f"Accuracy: {cv_results['test_accuracy'].mean():.4f} +- {cv_results['test_accuracy'].std():.4f}")
print(f"F1-score: {cv_results['test_f1'].mean():.4f} +- {cv_results['test_f1'].std():.4f}")
print(f"MCC: {cv_results['test_mcc'].mean():.4f} +- {cv_results['test_mcc'].std():.4f}")
print(f"Sensitivity: {cv_results['test_sensitivity'].mean():.4f} +- {cv_results['test_sensitivity'].std():.4f}")
print(f"Specificity: {cv_results['test_specificity'].mean():.4f} +- {cv_results['test_specificity'].std():.4f}")

Accuracy: 0.9301 +- 0.0844
F1-score: 0.8801 +- 0.1787
MCC: 0.8581 +- 0.1728
Sensitivity: 0.8209 +- 0.2209
Specificity: 1.0000 +- 0.0000


### Using the Knowledge

We can deploy one of our most robust models as a Streamlit application. We use the Bag of Words model because it achieved the second-best accuracy and is computationally feasible to run on Streamlit.

In [12]:
X, y = df_seqs_bow.select(pl.nth(range(3, len(df_seqs_bow.columns)))).to_numpy(), df_seqs_bow["label"].to_numpy()

best_params = study1.best_params

best_model = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(C=best_params["svm_C"], kernel=best_params["svm_kernel"], gamma=best_params["svm_gamma"]))
])

model = best_model.fit(X, y)

joblib.dump(model, "best_model.pkl")

['best_model.pkl']

The application code, still to be developed, only needs to convert the input sequence to its Bag of Words representation and use the model for prediction.
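A minimal sketch of that conversion step is shown below, under stated assumptions. The feature order must match the column order used for training (the `chars` list), the function names `sequence_to_features` and `predict_peptide` are hypothetical, and rejecting unknown characters is a design choice of this sketch, not something the notebook specifies.

```python
# Same order as the `chars` list used to build the training columns
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYVX"


def sequence_to_features(sequence):
    """Convert a peptide sequence into the 21 amino acid counts."""
    seq = sequence.strip().upper()
    unknown = set(seq) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"Unsupported characters: {sorted(unknown)}")
    return [seq.count(aa) for aa in AMINO_ACIDS]


def predict_peptide(model, sequence):
    """Return the predicted label for one sequence.

    `model` is any fitted object with a scikit-learn style `predict` method,
    such as the pipeline saved to best_model.pkl.
    """
    return model.predict([sequence_to_features(sequence)])[0]


print(sequence_to_features("GLFDIIKKIAESI"))
```

In a Streamlit app, one possible wiring is to load the pipeline once with `joblib.load("best_model.pkl")`, read the sequence with `st.text_input`, and display the result of `predict_peptide` with `st.write`.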