# Intro
This Jupyter notebook validates a result set against a "truth" result set.

The result set used in this notebook was obtained by filtering with **BiG-SLICE-like features**.

This code is part of the BiG-SCAPE 2.0 thesis project.

Start by importing the required modules.

In [1]:
import multiprocessing
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

import plots.hist

import data
import truth.truth as truth
import input.paths as paths
import input.bigslice_hmm

import predictions
import predictions.tests

import validation

from sklearn.metrics.pairwise import cosine_similarity
from scipy import spatial

import math

The following code sets the paths to the input data and output. Make sure to change these to match your environment.
Not every path is needed in every case, e.g. if you are using the bigscape_distances.tsv that is included in this repository.

In [2]:

# Location of the full BiG-SCAPE distances file (tab-separated).
# This file is created when you load new values from a run using the path defined after this one.
# The default TSV included in the repository contains what are considered to be "truth" pairs:
# pairs generated by a stable BiG-SCAPE run, used as the reference to see how a different set performs.
FULL_TSV = "bigscape_distances.tsv"
# Base folder of Cluster files generated by BiG-SCAPE for a run.
BIGSCAPE_CLUSTER_FILES = "D:/Temp/2022-04-28_15-43-46_hybrids_global"

In [3]:
print("Loading test distances")
TEST_DISTANCES = truth.get_distances_from_results(BIGSCAPE_CLUSTER_FILES, "0.30")

TEST_PAIRS = validation.pairs_from_distances(TEST_DISTANCES)

Loading test distances
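
The implementation of `validation.pairs_from_distances` is not shown in this notebook. As a rough sketch of what such a step could look like, the hypothetical helper below partitions distance records into a set of included pairs and a set of excluded pairs. The function name `split_pairs_by_cutoff`, the `(bgc_a, bgc_b, distance)` record layout and the "distance at or below the cutoff means included" rule are all assumptions made for illustration, not the project's actual code.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]
Record = Tuple[str, str, float]


def split_pairs_by_cutoff(
    records: Iterable[Record], cutoff: float = 0.30
) -> Tuple[Set[Pair], Set[Pair]]:
    """Hypothetical helper: split distance records into (included, excluded) pair sets.

    Assumption: a pair is "included" when its distance is at or below the cutoff.
    """
    included: Set[Pair] = set()
    excluded: Set[Pair] = set()
    for bgc_a, bgc_b, distance in records:
        # Sort the two names so that (A, B) and (B, A) map to the same pair.
        pair = tuple(sorted((bgc_a, bgc_b)))
        if distance <= cutoff:
            included.add(pair)
        else:
            excluded.add(pair)
    return included, excluded


# Made-up example records, not data from the BiG-SCAPE run.
example_records = [
    ("BGC1", "BGC2", 0.12),
    ("BGC2", "BGC1", 0.12),  # same pair in reverse order
    ("BGC1", "BGC3", 0.85),
]
included, excluded = split_pairs_by_cutoff(example_records)
```

Storing each pair in a canonical order keeps the later set operations (intersection, difference) independent of the order in which the two BGCs were listed.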


In [4]:
print("Loading truth")
# load truth values
TRUTH_DISTANCES = truth.from_file(FULL_TSV)

TRUTH_PAIRS = validation.pairs_from_distances(TRUTH_DISTANCES)

Loading truth
Loading stored info from database
Done
2982 BGCs in database


In [14]:
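# Compare the truth pair sets with the predicted (included) pairs.
# Note: symmetric_difference counts pairs that are in exactly one of the two sets,
# so the last two figures are not strict exclusion counts. A set difference such as
# TRUTH_PAIRS[1] - TEST_PAIRS[0] would count only the truth pairs absent from the prediction.
print(len(TRUTH_PAIRS[0] & TEST_PAIRS[0]), "true inclusions")
print(len(TRUTH_PAIRS[1] & TEST_PAIRS[0]), "false inclusions")
print(len(TRUTH_PAIRS[1].symmetric_difference(TEST_PAIRS[0])), "true exclusions")
print(len(TRUTH_PAIRS[0].symmetric_difference(TEST_PAIRS[0])), "false exclusions")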

752 true inclusions
3 false inclusions
1343586 true exclusions
4092 false exclusions
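
The cell above derives its counts from set operations on pairs. The toy example below illustrates how intersection, difference and symmetric difference behave on small, made-up pair sets. The variable names and data are hypothetical and only assume that index 0 holds the included (true) pairs and index 1 the excluded (false) pairs.

```python
# Made-up pair sets, for illustration only.
truth_included = {("A", "B"), ("A", "C"), ("B", "C")}
truth_excluded = {("A", "D"), ("B", "D"), ("C", "D")}
predicted_included = {("A", "B"), ("A", "C"), ("A", "D")}

# Pairs the prediction includes and the truth also includes.
true_inclusions = truth_included & predicted_included
# Pairs the prediction includes although the truth excludes them.
false_inclusions = truth_excluded & predicted_included
# Pairs the truth excludes and the prediction does not include.
true_exclusions = truth_excluded - predicted_included
# Pairs the truth includes but the prediction misses.
false_exclusions = truth_included - predicted_included

# For comparison: the symmetric difference keeps pairs found in exactly one set,
# so it also contains predicted pairs that are absent from truth_excluded.
exactly_one = truth_excluded ^ predicted_included

print(len(true_inclusions), "true inclusions")
print(len(false_inclusions), "false inclusions")
print(len(true_exclusions), "true exclusions")
print(len(false_exclusions), "false exclusions")
print(len(exactly_one), "pairs in exactly one of truth_excluded / predicted_included")
```

On this toy data the set difference and the symmetric difference give different sizes, which is why the choice of operation matters when interpreting the exclusion counts.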


In [5]:
validation.print_full_stats(TRUTH_PAIRS, TEST_PAIRS)

Total pairs in truth:  1347682
Composition (T/F/U):  4843 1342839 0
Total pairs in prediction:  753
Composition (T/F/U):  753 0 0
True positives:  752
False positives:  3
True negatives:  0
False negatives:  0
Precision:  0.9960264900662251
Sensitivity:  1.0
Specificity:  0.0
Accuracy:  0.9960264900662251
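
For reference, the sketch below computes the four summary metrics from confusion counts using their standard definitions. Whether `validation.print_full_stats` computes them in exactly this way is an assumption; the helper name `classification_metrics` is hypothetical. The reported values above are consistent with these formulas when the printed counts (752, 3, 0, 0) are used as input.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Hypothetical helper: standard metrics from confusion-matrix counts."""

    def ratio(numerator: int, denominator: int) -> float:
        # Return NaN instead of raising when a metric is undefined.
        return numerator / denominator if denominator else float("nan")

    return {
        "precision": ratio(tp, tp + fp),      # TP / (TP + FP)
        "sensitivity": ratio(tp, tp + fn),    # TP / (TP + FN), also called recall
        "specificity": ratio(tn, tn + fp),    # TN / (TN + FP)
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Counts taken from the output printed above.
metrics = classification_metrics(tp=752, fp=3, tn=0, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

With zero true negatives and zero false negatives, sensitivity is trivially 1.0 and specificity 0.0, so precision is the only metric here that carries information about the prediction.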
