# Benchmark of the effect of protein sequence preprocessing on Pfam family classification

The notebook below lets you experiment with the workflow parameters and obtain different results.
By default it uses the parameters described in the paper, which were also used to generate the plots.

In [19]:
import os

import ipywidgets as widgets
from IPython.display import display
from ipywidgets import interact, interactive, interact_manual, fixed
import sys
from urllib import request
from scripts.data_preparation_support import data_preparation, split_to_train_test
from scripts.shorten_encoding import compress_protein_data_original, compress_protein_data_singletons, compress_protein_data_triplets, compress_protein_data_sum_of_triplets, compress_protein_data_sum_of_k_mers
from scripts.prepare_vectors import split_data_to_classes, prepare_biovec_model, load_biovec_model_with_classes
from scripts.prepare_input_stats import input_analysis
from scripts.model_scripts.decision_trees import decision_tree_func
from scripts.model_scripts.random_tree import random_tree_func
from scripts.model_scripts.mlp import mlp_func
from scripts.model_scripts.nearest_neighbours import nearest_neighbours_func
from scripts.model_scripts.deep_learning import deep_learning_func
from scripts.plotting_support import results_plot_benchmark, plot_sizes
from time import time

## Dataset
This project ships with the cleaned dataset used for the report, so the download step below is commented out. If desired, the parameters can be tweaked to obtain an altered version.

The raw data is available from the Swiss-Prot section of the UniProt database.
The full file is around 250 MB, so a download link is provided instead of the file itself.

If the server cannot handle the download through Python, the tab-separated file can still be downloaded directly
from the website: [LINK](https://www.uniprot.org/uniprot/?query=reviewed%3Ayes&columns=id%2Cdatabase(Pfam)%2Corganism%2Csequence)

It is suggested, although not necessary, to put the downloaded file into the ./data/full directory, for clarity and so that the default values below work unchanged.

In [7]:
# url="https://www.uniprot.org/uniprot/?query=reviewed:yes&format=tab&columns=id,database(Pfam),organism,sequence"
# request.urlretrieve(url, "./data/full/uniprot-reviewed_yes.tab")
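
If the download is re-enabled, a small guard can avoid fetching the roughly 250 MB file twice. The sketch below is only an illustration and is not part of the project's scripts: `build_query_url` and `download_if_missing` are hypothetical helper names, and the base URL and query fields are simply the ones from the commented-out cell above.

```python
from pathlib import Path
from urllib import parse, request

# Base URL copied from the commented-out cell above.
BASE_URL = "https://www.uniprot.org/uniprot/"


def build_query_url(columns, base_url=BASE_URL):
    """Assemble the tab-separated export URL for reviewed entries."""
    params = {
        "query": "reviewed:yes",
        "format": "tab",
        "columns": ",".join(columns),
    }
    return f"{base_url}?{parse.urlencode(params)}"


def download_if_missing(url, dest, fetch=request.urlretrieve):
    """Download url to dest unless dest already exists; return True if fetched."""
    dest = Path(dest)
    if dest.exists():
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    fetch(url, str(dest))
    return True
```

Calling `download_if_missing(build_query_url(["id", "database(Pfam)", "organism", "sequence"]), "./data/full/uniprot-reviewed_yes.tab")` would then be a no-op on every run after the first.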

In [8]:
def data_prepare_widget():
    org_w = widgets.BoundedIntText(value=2000, min=1, max=5000, step=1, description="Organisms")
    fam_w = widgets.BoundedIntText(value=10, min=1, max=200, step=1, description="Families")
    infile = widgets.Text(value="./data/full/uniprot-reviewed_yes.tab",  description="Input path")
    outfile = widgets.Text(value="./data/data_file.fasta", description="Output path")

    widget = interact_manual.options(manual_name="Prepare data")
    widget(data_preparation, n_org=org_w, n_fam=fam_w, file_path=infile, outfile_path=outfile)

In [9]:
data_prepare_widget()


At this point the raw Swiss-Prot data has been filtered down to the most represented organisms, keeping only single-domain proteins from the largest families.

However, before splitting the data into training and test sets, we first want to address very similar sequences.
This can be done with the CD-HIT package.
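
As an alternative to building a shell string, the CD-HIT call can be passed as an argument list, which avoids quoting problems with unusual paths and makes failures visible. This is a minimal sketch under stated assumptions: `build_cdhit_command` and `run_cdhit_checked` are hypothetical names, the binary location is the one used in this notebook, and the 0.7 to 1.0 range simply mirrors the bounds of the widget defined below.

```python
import subprocess


def build_cdhit_command(input_f, output_f, c=0.99, binary="./CD-HIT/cd-hit"):
    """Return the CD-HIT call as an argument list (no shell involved)."""
    if not 0.7 <= float(c) <= 1.0:
        raise ValueError("identity threshold c must lie in [0.7, 1.0]")
    return [binary, "-i", str(input_f), "-o", str(output_f), "-c", str(c)]


def run_cdhit_checked(input_f, output_f, c=0.99):
    """Run CD-HIT and raise CalledProcessError on a non-zero exit status."""
    cmd = build_cdhit_command(input_f, output_f, c)
    return subprocess.run(cmd, check=True)
```

With `check=True`, a missing binary or a failed clustering run raises an exception instead of passing silently.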

In [10]:
def run_cdhit(input_f="./data/to_cluster/data_file.fasta", output_f="./data/to_cluster/data_file_outttt.fasta", c="0.99"):
    c = str(c)
    print("Begin CD-HIT")
    os.system(f"./CD-HIT/cd-hit -i {input_f} -o {output_f} -c {c}")


def run_cdhit_widget():
    c = widgets.BoundedFloatText(value=0.99, min=0.7, max=1, description="Similarity")
    infile = widgets.Text(value="./data/data_file.fasta",  description="Input path")
    outfile = widgets.Text(value="./data/clustering/data_file_clustered.fasta", description="Output path")

    widget = interact_manual.options(manual_name="Prepare data")
    widget(run_cdhit, c=c, input_f=infile, output_f=outfile)

In [11]:
run_cdhit_widget()


The data is now cleaned and ready for processing.

## Data processing

In this project the data is encoded in several ways:

 - Standard single-letter encoding
 - Conversion into numbers, stored as int8
 - Conversion into 3-mers, each 3-mer encoded as a number and stored as int16
 - Conversion into 3-mers, counting the occurrences of the most frequent triplets in each sequence
 - Conversion into k-mers, grouping them within an allowed edit distance and counting the fragments of each group present in a sequence (7-mers with edit distance 3)
 - The Biovec vector encoder
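
To make the int8 and int16 variants concrete, here is a small illustrative sketch. It is not the code in `scripts.shorten_encoding`; the alphabet ordering, the use of "X" as a catch-all symbol, the overlapping window, and the function names are all assumptions made for this example. With 21 symbols, the largest 3-mer code is 21³ − 1 = 9260, which fits comfortably in a signed 16-bit integer.

```python
from array import array

# 20 standard amino acids plus "X" as a catch-all for any other symbol (assumption).
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}
BASE = len(ALPHABET)  # 21


def encode_singletons(seq):
    """One signed 8-bit integer per residue."""
    return array("b", (INDEX.get(aa, INDEX["X"]) for aa in seq))


def encode_triplets(seq):
    """One signed 16-bit integer per overlapping 3-mer (base-21 positional code)."""
    codes = encode_singletons(seq)
    return array(
        "h",
        (
            codes[i] * BASE * BASE + codes[i + 1] * BASE + codes[i + 2]
            for i in range(len(codes) - 2)
        ),
    )
```

For example, `encode_triplets("ACDE")` yields the codes for `ACD` and `CDE`, namely 23 and 486.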

First we need to split the data into training and test parts.
To avoid dominance of the largest families, the training data contains the same number of sequences for each Pfam family.

### Warning!
Please inspect the histogram before choosing the number of training sequences per family.
Families with fewer representatives than the chosen number will be filtered out!

It can sometimes be beneficial to drop a few families with low coverage.
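
The balancing and filtering rule described above can be sketched as follows. This is an assumption about the general idea, not the implementation of `split_to_train_test`; the `(family, sequence)` record layout, the function name `balanced_split`, and the seeded shuffle are illustrative choices.

```python
import random
from collections import defaultdict


def balanced_split(records, n_train, seed=0):
    """records: iterable of (family, sequence). Returns (train, test) lists."""
    by_family = defaultdict(list)
    for family, seq in records:
        by_family[family].append(seq)

    rng = random.Random(seed)
    train, test = [], []
    for family in sorted(by_family):
        seqs = by_family[family]
        if len(seqs) < n_train:
            continue  # under-represented family: dropped entirely
        rng.shuffle(seqs)
        train += [(family, s) for s in seqs[:n_train]]
        test += [(family, s) for s in seqs[n_train:]]
    return train, test
```

Every surviving family contributes exactly `n_train` training sequences, and whatever remains goes to the test set, so the test set stays unbalanced.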

In [12]:
def histogram_widget():
    infile = widgets.Text(value="./data/clustering/data_file_clustered.fasta",  description="Input path")
    widget = interact_manual.options(manual_name="Prepare histogram")

    widget(input_analysis, input_file=infile)

In [13]:
histogram_widget()


In [14]:
def split_to_train_test_widget():
    train_val = widgets.BoundedIntText(value=790, min=1, max=5000, step=1, description="N. to train")
    infile = widgets.Text(value="./data/clustering/data_file_clustered.fasta",  description="Input path")
    outfile = widgets.Text(value="./data/clean_dataset.pkl", description="Output path")

    widget = interact_manual.options(manual_name="Prepare data")
    new_val = widget(split_to_train_test, train_val=train_val, infile=infile, outfile=outfile)
    return new_val

In [15]:
train_val_global = split_to_train_test_widget()


This step saves the data as a single pickle object. Instead of keeping four separate files, or splitting one file at load time, we can simply load a ready-to-use list.
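
A round trip through such a pickle file might look like the sketch below. The four-element layout `[x_train, y_train, x_test, y_test]` is an assumption based on the "four separate files" remark; the actual structure written by `split_to_train_test` may differ, and the toy sequences and labels are made up.

```python
import os
import pickle
import tempfile


def save_dataset(path, x_train, y_train, x_test, y_test):
    """Store the four parts as one list in a single pickle file."""
    with open(path, "wb") as fh:
        pickle.dump([x_train, y_train, x_test, y_test], fh)


def load_dataset(path):
    """Load the list back and unpack it into its four parts."""
    with open(path, "rb") as fh:
        x_train, y_train, x_test, y_test = pickle.load(fh)
    return x_train, y_train, x_test, y_test


# Round trip through a temporary file with toy data.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_dataset.pkl")
    save_dataset(toy_path, ["MKV"], ["fam_a"], ["MKL"], ["fam_a"])
    restored = load_dataset(toy_path)
```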

In [16]:
compress_protein_data_original(infile='./data/clean_dataset.pkl', outfile='./data/clean_dataset_original.pkl')

In [17]:
compress_protein_data_singletons(infile='./data/clean_dataset.pkl', outfile='./data/clean_dataset_singletons.pkl')

In [18]:
compress_protein_data_triplets(infile='./data/clean_dataset.pkl', outfile='./data/clean_dataset_triplets.pkl')

In [None]:
compress_protein_data_sum_of_triplets(infile='./data/clean_dataset.pkl', outfile='./data/clean_dataset_sum_triplets.pkl')
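
The idea behind the summed-triplet encoding, counting how often the most frequent 3-mers occur in each sequence, can be illustrated like this. The sketch is not the code of `compress_protein_data_sum_of_triplets`; the function names, the overlapping window, and the tie-breaking rule are assumptions.

```python
from collections import Counter


def triplets(seq):
    """All overlapping 3-mers of a sequence."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def top_triplets(sequences, n_top):
    """The n_top most frequent 3-mers across all sequences, in a fixed order."""
    counts = Counter(t for seq in sequences for t in triplets(seq))
    # Sort by descending count, then alphabetically, so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:n_top]]


def count_vector(seq, vocabulary):
    """Fixed-length feature vector: occurrences of each vocabulary 3-mer in seq."""
    counts = Counter(triplets(seq))
    return [counts[t] for t in vocabulary]
```

In a setup like this, the vocabulary would normally be built from the training sequences only and then reused unchanged for the test sequences.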

## Shorten sum k-mer

The widget below allows customized summed k-mer encoding. Parameters:

- Input and output files
- K-mer length: size of the moving, overlapping window
- Min. occurrences: how many times a k-mer must occur across all sequences to be taken into consideration (filters out highly mutated fragments and sets the required "popularity")
- Allowed edit distance: fragments whose edit distance to an already found fragment is equal to or lower than this value are merged into one group. Example: with a value of 1, AAAAA and AAAAB are merged, while AAAAA and AAABB are treated as separate groups.
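
The grouping rule can be sketched with a plain Levenshtein distance and a greedy pass over the k-mers. This is an illustration only: the real `compress_protein_data_sum_of_k_mers` may group differently, and comparing each k-mer only against group representatives (rather than against every member) is an assumption of this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # deletion
                    curr[j - 1] + 1,           # insertion
                    prev[j - 1] + (ca != cb),  # substitution (free on a match)
                )
            )
        prev = curr
    return prev[-1]


def group_kmers(kmers, max_edit):
    """Greedy grouping: a k-mer joins the first representative within max_edit."""
    representatives = []
    assignment = {}
    for kmer in kmers:
        for rep in representatives:
            if edit_distance(kmer, rep) <= max_edit:
                assignment[kmer] = rep
                break
        else:
            # No representative is close enough, so this k-mer starts a new group.
            representatives.append(kmer)
            assignment[kmer] = kmer
    return assignment
```

With `max_edit=1`, `AAAAB` joins the group of `AAAAA` (distance 1), while `AAABB` (distance 2 from `AAAAA`) starts its own group, matching the example in the parameter list.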

In [None]:
def shorten_kmer_widget():
    infile = widgets.Text(value="./data/clean_dataset.pkl",  description="Input path")
    outfile = widgets.Text(value="./data/clean_dataset_sum_k_mers.pkl", description="Output path")
    n_val = widgets.BoundedIntText(value=7, min=3, max=15, step=1, description="k-mer length")
    k_val = widgets.BoundedIntText(value=20, min=1, max=1000, step=1, description="min. occurrences")
    edit_val = widgets.BoundedIntText(value=2, min=1, max=3, step=1, description="allowed edit distance")

    widget = interact_manual.options(manual_name="Prepare data")
    widget(compress_protein_data_sum_of_k_mers, n=n_val, k=k_val, edit=edit_val, infile=infile, outfile=outfile)

In [None]:
shorten_kmer_widget()

## Biovec

The Biovec model is built on top of the original implementation, so data handling has to differ slightly to fit its authors' requirements.
Sequences must be split into one FASTA file per class, plus one combined file with identical sequence names.
The combined file is used to generate the model, which is then saved.
The saved model is then loaded again, this time together with the family information.

After this procedure the data is ready to be split into training and test sets and used for learning and prediction.


### Warning!
No widget is provided this time, because of the ./data/vectors/combined_corpus.fasta file.
If Biovec finds this file, it skips creating a new model.
The paths are therefore fixed here, to ensure that the old corpus file is deleted.
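
If the paths ever need to change, the stale corpus file can also be removed explicitly before training. The helper below is a hypothetical sketch (the name `remove_stale_corpus` is not part of the project's scripts); only the default path is taken from the warning above.

```python
from pathlib import Path


def remove_stale_corpus(corpus_path="./data/vectors/combined_corpus.fasta"):
    """Delete a leftover corpus file; return True if something was removed."""
    path = Path(corpus_path)
    if path.is_file():
        path.unlink()
        return True
    return False
```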

### Warning!
During loading, messages may appear saying that the model was not trained on certain triplets.
These concern fragments containing extended-alphabet symbols such as "X", which are ignored.

In [20]:
split_data_to_classes(infile="./data/clean_dataset.pkl", output_folder="./data/vectors/class_folder",
                          output_combined_file="./data/vectors/combined.fasta")

In [21]:
prepare_biovec_model(infile="./data/vectors/combined.fasta", outfile="./data/vectors/ProtVec_model.model")

In [22]:
def load_biovec_widget():
    train_val = widgets.BoundedIntText(value=790, min=1, max=5000, step=1, description="N. to train")

    widget = interact_manual.options(manual_name="Prepare data")

    new_val = widget(load_biovec_model_with_classes, train_size=train_val, input_model=fixed("./data/vectors/ProtVec_model.model"), class_folder=fixed("./data/vectors/class_folder"), outfile=fixed('./data/clean_dataset_biovec.pkl'))

In [23]:
load_biovec_widget()


## Evaluation

Now that all the data is prepared, we can evaluate each encoding in terms of both accuracy and runtime.

Several models are trained. Please note that not all of them are equally suitable for this kind of data; the point is to compare efficiency.

- Decision trees
- Random trees
- Nearest Neighbours
- MLP
- Simple Machine Learning with Dense layers
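
The timing pattern used for every model in the cells below can be condensed into one helper. This is a sketch, not code from the project: `benchmark` is a hypothetical name, and it assumes, as the cells do, that each model function takes the path of a pickle file and returns an accuracy. It uses `time.perf_counter`, a monotonic clock intended for measuring intervals.

```python
from time import perf_counter


def benchmark(model_func, files, ndigits=2):
    """Run model_func on every file; return (times, accuracies) as parallel lists."""
    times, accs = [], []
    for path in files:
        start = perf_counter()
        acc = model_func(path)
        times.append(round(perf_counter() - start, ndigits))
        accs.append(acc)
    return times, accs
```

A call such as `times, accs = benchmark(decision_tree_func, files)` would produce the two lists that are appended to `all_times` and `all_accs`.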

In [24]:
all_times = []
all_accs = []

In [25]:
s1 = time()
acc1 = decision_tree_func("./data/clean_dataset_original.pkl")
t1 = time() - s1

s2 = time()
acc2 = decision_tree_func("./data/clean_dataset_singletons.pkl")
t2 = time() - s2

s3 = time()
acc3 = decision_tree_func("./data/clean_dataset_triplets.pkl")
t3 = time() - s3

s4 = time()
acc4 = decision_tree_func("./data/clean_dataset_sum_triplets.pkl")
t4 = time() - s4

s5 = time()
acc5 = decision_tree_func("./data/clean_dataset_sum_k_mers.pkl")
t5 = time() - s5

s6 = time()
acc6 = decision_tree_func("./data/clean_dataset_biovec.pkl")
t6 = time() - s6

t1 = round(t1, 2)
t2 = round(t2, 2)
t3 = round(t3, 2)
t4 = round(t4, 2)
t5 = round(t5, 2)
t6 = round(t6, 2)

all_times.append([t1, t2, t3, t4, t5, t6])
all_accs.append([acc1, acc2, acc3, acc4, acc5, acc6])

Fitting 2 folds for each of 10 candidates, totalling 20 fits
Best model:
- max_depth: 70
----- Model accuracy: 0.237
Fitting 2 folds for each of 10 candidates, totalling 20 fits
Best model:
- max_depth: 70
----- Model accuracy: 0.709
Fitting 2 folds for each of 10 candidates, totalling 20 fits
Best model:
- max_depth: 40
----- Model accuracy: 0.664
Fitting 2 folds for each of 10 candidates, totalling 20 fits
Best model:
- max_depth: 40
----- Model accuracy: 0.842
Fitting 2 folds for each of 10 candidates, totalling 20 fits
Best model:
- max_depth: 100
----- Model accuracy: 0.327
Fitting 2 folds for each of 10 candidates, totalling 20 fits
Best model:
- max_depth: 20
----- Model accuracy: 0.681


In [26]:
print(f"Statistic | Original | Singletons | Triplets | Sum Triplets | Sum K-mers | Biovec")
print(f"----------|----------|------------|----------|--------------|------------|--------")
print(f"Time      | {t1} \t | {t2} \t | {t3} \t | {t4} \t | {t5} \t | {t6} \t")
print(f"Accuracy  | {acc1} \t | {acc2} \t | {acc3} \t | {acc4} \t | {acc5} \t | {acc6} \t")

Statistic | Original | Singletons | Triplets | Sum Triplets | Sum K-mers | Biovec
----------|----------|------------|----------|--------------|------------|--------
Time      | 16.03 	 | 11.45 	 | 7.56 	 | 34.54 	 | 52.46 	 | 0.39 	
Accuracy  | 0.2371 	 | 0.7086 	 | 0.6638 	 | 0.8419 	 | 0.3267 	 | 0.6809 	
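
Because tab stops depend on the width of each value, the columns printed above can drift out of alignment. A fixed-width formatter is one possible alternative; `format_results` is a hypothetical helper and the column width of 12 is an arbitrary choice.

```python
def format_results(names, times, accs, width=12):
    """Return an aligned text table; every column is padded to `width` characters."""
    def row(cells):
        return " | ".join(str(cell).ljust(width) for cell in cells).rstrip()

    header = ["Statistic", *names]
    rule = "-|-".join("-" * width for _ in header)
    body = [
        ["Time", *(f"{t:.2f}" for t in times)],
        ["Accuracy", *(f"{a:.4f}" for a in accs)],
    ]
    return "\n".join([row(header), rule, *(row(cells) for cells in body)])
```

For instance, `print(format_results(names, [t1, t2, t3, t4, t5, t6], [acc1, acc2, acc3, acc4, acc5, acc6]))` keeps every `|` in the same column regardless of the number of digits.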


In [27]:
s1 = time()
acc1 = random_tree_func("./data/clean_dataset_original.pkl")
t1 = time() - s1

s2 = time()
acc2 = random_tree_func("./data/clean_dataset_singletons.pkl")
t2 = time() - s2

s3 = time()
acc3 = random_tree_func("./data/clean_dataset_triplets.pkl")
t3 = time() - s3

s4 = time()
acc4 = random_tree_func("./data/clean_dataset_sum_triplets.pkl")
t4 = time() - s4

s5 = time()
acc5 = random_tree_func("./data/clean_dataset_sum_k_mers.pkl")
t5 = time() - s5

s6 = time()
acc6 = random_tree_func("./data/clean_dataset_biovec.pkl")
t6 = time() - s6

t1 = round(t1, 2)
t2 = round(t2, 2)
t3 = round(t3, 2)
t4 = round(t4, 2)
t5 = round(t5, 2)
t6 = round(t6, 2)

all_times.append([t1, t2, t3, t4, t5, t6])
all_accs.append([acc1, acc2, acc3, acc4, acc5, acc6])

Fitting 2 folds for each of 15 candidates, totalling 30 fits
Best model:
- random_state: 1234
- max_depth: 30
- max_features: 20
----- Model accuracy: 0.406
Fitting 2 folds for each of 15 candidates, totalling 30 fits
Best model:
- random_state: 1234
- max_depth: 40
- max_features: 15
----- Model accuracy: 0.798
Fitting 2 folds for each of 15 candidates, totalling 30 fits
Best model:
- random_state: 1234
- max_depth: 30
- max_features: 20
----- Model accuracy: 0.763
Fitting 2 folds for each of 15 candidates, totalling 30 fits
Best model:
- random_state: 1234
- max_depth: 20
- max_features: 20
----- Model accuracy: 0.974
Fitting 2 folds for each of 15 candidates, totalling 30 fits
Best model:
- random_state: 1234
- max_depth: 50
- max_features: 15
----- Model accuracy: 0.365
Fitting 2 folds for each of 15 candidates, totalling 30 fits


Best model:
- random_state: 1234
- max_depth: 20
- max_features: 10
----- Model accuracy: 0.889


In [28]:
print(f"Statistic | Original | Singletons | Triplets | Sum Triplets | Sum K-mers | Biovec")
print(f"----------|----------|------------|----------|--------------|------------|--------")
print(f"Time      | {t1} \t | {t2} \t | {t3} \t | {t4} \t | {t5} \t | {t6} \t")
print(f"Accuracy  | {acc1} \t | {acc2} \t | {acc3} \t | {acc4} \t | {acc5} \t | {acc6} \t")

Statistic | Original | Singletons | Triplets | Sum Triplets | Sum K-mers | Biovec
----------|----------|------------|----------|--------------|------------|--------
Time      | 26.09 	 | 22.49 	 | 26.34 	 | 29.85 	 | 36.25 	 | 6.0 	
Accuracy  | 0.4057 	 | 0.7981 	 | 0.7629 	 | 0.9743 	 | 0.3648 	 | 0.8886 	


In [29]:
s1 = time()
acc1 = nearest_neighbours_func("./data/clean_dataset_original.pkl")
t1 = time() - s1

s2 = time()
acc2 = nearest_neighbours_func("./data/clean_dataset_singletons.pkl")
t2 = time() - s2

s3 = time()
acc3 = nearest_neighbours_func("./data/clean_dataset_triplets.pkl")
t3 = time() - s3

s4 = time()
acc4 = nearest_neighbours_func("./data/clean_dataset_sum_triplets.pkl")
t4 = time() - s4

s5 = time()
acc5 = nearest_neighbours_func("./data/clean_dataset_sum_k_mers.pkl")
t5 = time() - s5

s6 = time()
acc6 = nearest_neighbours_func("./data/clean_dataset_biovec.pkl")
t6 = time() - s6

t1 = round(t1, 2)
t2 = round(t2, 2)
t3 = round(t3, 2)
t4 = round(t4, 2)
t5 = round(t5, 2)
t6 = round(t6, 2)

all_times.append([t1, t2, t3, t4, t5, t6])
all_accs.append([acc1, acc2, acc3, acc4, acc5, acc6])

Best model:
- weights: distance
- algorithm: auto
----- Model accuracy: 0.395
Best model:
- weights: distance
- algorithm: auto
----- Model accuracy: 0.758
Best model:
- weights: distance
- algorithm: auto
----- Model accuracy: 0.679
Best model:
- weights: distance
- algorithm: auto
----- Model accuracy: 0.681
Best model:
- weights: distance
- algorithm: auto
----- Model accuracy: 0.334


Best model:
- weights: distance
- algorithm: auto
----- Model accuracy: 0.866


In [30]:
print(f"Statistic | Original | Singletons | Triplets | Sum Triplets | Sum K-mers | Biovec")
print(f"----------|----------|------------|----------|--------------|------------|--------")
print(f"Time      | {t1} \t | {t2} \t | {t3} \t | {t4} \t | {t5} \t | {t6} \t")
print(f"Accuracy  | {acc1} \t | {acc2} \t | {acc3} \t | {acc4} \t | {acc5} \t | {acc6} \t")

Statistic | Original | Singletons | Triplets | Sum Triplets | Sum K-mers | Biovec
----------|----------|------------|----------|--------------|------------|--------
Time      | 8.95 	 | 4.93 	 | 2.51 	 | 10.27 	 | 8.43 	 | 0.31 	
Accuracy  | 0.3952 	 | 0.7581 	 | 0.679 	 | 0.681 	 | 0.3343 	 | 0.8659 	


In [31]:
s1 = time()
acc1 = mlp_func("./data/clean_dataset_original.pkl")
t1 = time() - s1

s2 = time()
acc2 = mlp_func("./data/clean_dataset_singletons.pkl")
t2 = time() - s2

s3 = time()
acc3 = mlp_func("./data/clean_dataset_triplets.pkl")
t3 = time() - s3

s4 = time()
acc4 = mlp_func("./data/clean_dataset_sum_triplets.pkl")
t4 = time() - s4

s5 = time()
acc5 = mlp_func("./data/clean_dataset_sum_k_mers.pkl")
t5 = time() - s5

s6 = time()
acc6 = mlp_func("./data/clean_dataset_biovec.pkl")
t6 = time() - s6

t1 = round(t1, 2)
t2 = round(t2, 2)
t3 = round(t3, 2)
t4 = round(t4, 2)
t5 = round(t5, 2)
t6 = round(t6, 2)

all_times.append([t1, t2, t3, t4, t5, t6])
all_accs.append([acc1, acc2, acc3, acc4, acc5, acc6])

Fitting 2 folds for each of 3 candidates, totalling 6 fits
Best model:
- hidden_layer_sizes: 128
- activation: relu
- solver: adam
----- Model accuracy: 0.297
Fitting 2 folds for each of 3 candidates, totalling 6 fits
Best model:
- hidden_layer_sizes: 128
- activation: relu
- solver: adam
----- Model accuracy: 0.75
Fitting 2 folds for each of 3 candidates, totalling 6 fits
Best model:
- hidden_layer_sizes: 128
- activation: relu
- solver: adam
----- Model accuracy: 0.412
Fitting 2 folds for each of 3 candidates, totalling 6 fits
Best model:
- hidden_layer_sizes: 64
- activation: relu
- solver: adam
----- Model accuracy: 0.993
Fitting 2 folds for each of 3 candidates, totalling 6 fits
Best model:
- hidden_layer_sizes: 64
- activation: relu
- solver: adam
----- Model accuracy: 0.37
Fitting 2 folds for each of 3 candidates, totalling 6 fits


Best model:
- hidden_layer_sizes: 128
- activation: relu
- solver: adam
----- Model accuracy: 0.825


In [32]:
print(f"Statistic | Original | Singletons | Triplets | Sum Triplets | Sum K-mers | Biovec")
print(f"----------|----------|------------|----------|--------------|------------|--------")
print(f"Time      | {t1} \t | {t2} \t | {t3} \t | {t4} \t | {t5} \t | {t6} \t")
print(f"Accuracy  | {acc1} \t | {acc2} \t | {acc3} \t | {acc4} \t | {acc5} \t | {acc6} \t")

Statistic | Original | Singletons | Triplets | Sum Triplets | Sum K-mers | Biovec
----------|----------|------------|----------|--------------|------------|--------
Time      | 231.46 	 | 160.0 	 | 46.54 	 | 75.01 	 | 192.07 	 | 1.98 	
Accuracy  | 0.2971 	 | 0.7505 	 | 0.4124 	 | 0.9933 	 | 0.3695 	 | 0.8249 	


In [33]:
s1 = time()
acc1 = deep_learning_func("./data/clean_dataset_original.pkl")
t1 = time() - s1

s2 = time()
acc2 = deep_learning_func("./data/clean_dataset_singletons.pkl")
t2 = time() - s2

s3 = time()
acc3 = deep_learning_func("./data/clean_dataset_triplets.pkl")
t3 = time() - s3

s4 = time()
acc4 = deep_learning_func("./data/clean_dataset_sum_triplets.pkl")
t4 = time() - s4

s5 = time()
acc5 = deep_learning_func("./data/clean_dataset_sum_k_mers.pkl")
t5 = time() - s5

s6 = time()
acc6 = deep_learning_func("./data/clean_dataset_biovec.pkl")
t6 = time() - s6

t1 = round(t1, 2)
t2 = round(t2, 2)
t3 = round(t3, 2)
t4 = round(t4, 2)
t5 = round(t5, 2)
t6 = round(t6, 2)

all_times.append([t1, t2, t3, t4, t5, t6])
all_accs.append([acc1, acc2, acc3, acc4, acc5, acc6])

  0%|          | 0/9 [00:00<?, ?it/s]

0.26476189494132996


 22%|██▏       | 2/9 [00:06<00:19,  2.86s/it]

0.4161904752254486


 33%|███▎      | 3/9 [00:07<00:13,  2.26s/it]
 44%|████▍     | 4/9 [00:09<00:09,  1.86s/it]
 56%|█████▌    | 5/9 [00:10<00:06,  1.64s/it]

0.44857141375541687


100%|██████████| 9/9 [00:16<00:00,  1.81s/it]
 41%|████      | 11/27 [00:21<00:40,  2.50s/it]

0.47238096594810486


100%|██████████| 27/27 [01:28<00:00,  3.28s/it]


----- Model accuracy: 0.472


 11%|█         | 1/9 [00:01<00:08,  1.08s/it]

0.4276190400123596


 22%|██▏       | 2/9 [00:02<00:07,  1.09s/it]

0.569523811340332


 56%|█████▌    | 5/9 [00:05<00:04,  1.09s/it]

0.5971428751945496


 89%|████████▉ | 8/9 [00:09<00:01,  1.29s/it]

0.6523809432983398


100%|██████████| 9/9 [00:11<00:00,  1.26s/it]
100%|██████████| 27/27 [00:36<00:00,  1.35s/it]


----- Model accuracy: 0.652


 11%|█         | 1/9 [00:00<00:07,  1.01it/s]

0.10000000149011612


 33%|███▎      | 3/9 [00:03<00:06,  1.03s/it]

0.11619047820568085


 56%|█████▌    | 5/9 [00:05<00:04,  1.02s/it]

0.2800000011920929


 67%|██████▋   | 6/9 [00:06<00:03,  1.05s/it]

0.3580952286720276


100%|██████████| 9/9 [00:10<00:00,  1.16s/it]
100%|██████████| 27/27 [00:33<00:00,  1.23s/it]


----- Model accuracy: 0.358


 11%|█         | 1/9 [00:01<00:10,  1.36s/it]

0.9933333396911621


 22%|██▏       | 2/9 [00:02<00:09,  1.38s/it]

0.9942857027053833


100%|██████████| 9/9 [00:13<00:00,  1.47s/it]
100%|██████████| 27/27 [00:43<00:00,  1.63s/it]


----- Model accuracy: 0.994


 11%|█         | 1/9 [00:01<00:09,  1.24s/it]

0.334285706281662


 22%|██▏       | 2/9 [00:02<00:08,  1.24s/it]

0.3438095152378082


 33%|███▎      | 3/9 [00:03<00:07,  1.27s/it]

0.345714271068573


 67%|██████▋   | 6/9 [00:07<00:03,  1.29s/it]

0.3619047701358795


100%|██████████| 9/9 [00:12<00:00,  1.36s/it]
 59%|█████▉    | 16/27 [00:22<00:14,  1.36s/it]

0.3685714304447174


100%|██████████| 27/27 [00:40<00:00,  1.51s/it]


----- Model accuracy: 0.369


 11%|█         | 1/9 [00:00<00:06,  1.15it/s]

0.2181372493505478


 22%|██▏       | 2/9 [00:01<00:06,  1.06it/s]

0.3061274588108063


 33%|███▎      | 3/9 [00:02<00:05,  1.10it/s]

0.46654412150382996


 67%|██████▋   | 6/9 [00:05<00:02,  1.12it/s]

0.7185049057006836


100%|██████████| 9/9 [00:08<00:00,  1.07it/s]
  0%|          | 0/27 [00:00<?, ?it/s]

KeyboardInterrupt



In [None]:
print(f"Statistic | Original | Singletons | Triplets | Sum Triplets | Sum K-mers | Biovec")
print(f"----------|----------|------------|----------|--------------|------------|--------")
print(f"Time      | {t1} \t | {t2} \t | {t3} \t | {t4} \t | {t5} \t | {t6} \t")
print(f"Accuracy  | {acc1} \t | {acc2} \t | {acc3} \t | {acc4} \t | {acc5} \t | {acc6} \t")

In [None]:
names = ["Original", "Singletons", "Triplets", "Sum Triplets", "Sum K-mers", "Biovec"]
tests = ["Decision trees", "Random trees", "Nearest neighbours", "MLP", "Machine Learning"]
print(all_times)
print(all_accs)
results_plot_benchmark(names, tests, all_times, all_accs)

In [None]:
files = ["./data/clean_dataset_original.pkl",
         "./data/clean_dataset_singletons.pkl",
         "./data/clean_dataset_triplets.pkl",
         "./data/clean_dataset_sum_triplets.pkl",
         "./data/clean_dataset_sum_k_mers.pkl",
         "./data/clean_dataset_biovec.pkl"]

plot_sizes(files, names, "./presentation/images/sizes.png")