# BinaPs Demo

Copyright 2022 Bernardo C. Rodrigues

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public
License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later
version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the
implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
details. You should have received a copy of the GNU General Public License along with this program. If not, 
see <https://www.gnu.org/licenses/>. 

This notebook demonstrates how to use the BinaPs implementation provided by its authors. It consists of three
steps:

**1. Generate synthetic data:** With one of the provided scripts, we generate synthetic data in which a set of known
patterns are planted.

**2. Run BinaPs:** We run BinaPs over this synthetic dataset and get a list of inferred patterns.

**3. Compare inferred to original patterns:** We use another provided script to calculate the
[F1](https://en.wikipedia.org/wiki/F-score) value between the inferred and original patterns.

In [1]:
import torch

# Check if CUDA is available
if torch.cuda.is_available():
    print(f"CUDA Device: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device available")

CUDA Device: NVIDIA GeForce RTX 3080


## 1. Generate synthetic data

In this step we will end up with 4 files:

```
...
├── data.dat
├── data.dat_patterns.txt
├── data_itemOverlap.dat
├── data_itemOverlap.dat_patterns.txt
...
```

*data.dat* and *data_itemOverlap.dat* are the synthetic datasets, while *data.dat_patterns.txt* and
*data_itemOverlap.dat_patterns.txt* hold their respective root patterns. Noise is added to test the robustness of BinaPs.

In *data_itemOverlap.dat* the patterns may overlap (e.g. ABC CDE CEF), while in *data.dat* they may not (e.g. AB C DE F).
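
To make the difference between the two datasets concrete, here is a minimal sketch that checks whether any two patterns in a list share an item. The helper `patterns_overlap` is a hypothetical name used only for illustration; it is not part of the BinaPs code or of `lib`. Letters stand in for column indices, as in the examples above.

```python
from itertools import combinations


def patterns_overlap(patterns):
    """Return True if any two patterns share at least one item (hypothetical helper)."""
    return any(set(a) & set(b) for a, b in combinations(patterns, 2))


# Letters stand in for column indices, mirroring the examples in the text.
overlapping = ["ABC", "CDE", "CEF"]
disjoint = ["AB", "C", "DE", "F"]

print(patterns_overlap(overlapping))  # True: "ABC" and "CDE" share "C"
print(patterns_overlap(disjoint))     # False: no item appears twice
```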


In [168]:
from lib.BinapsWrapper import generate_synthetic_data
from lib.BinaryDataset import BinaryDataset

output_file = "data"

row_quantity = 1000
column_quantity = 10
max_pattern_size = 500
noise = 0.05
density = 0.03

generate_synthetic_data(row_quantity, column_quantity, output_file, max_pattern_size, noise, density)

Rscript binaps/Data/Synthetic_data/generate_toy.R AND 10 1000 500 data 0.05 0.03
[1] 2844
[1] 1000 2844
[1] "Added noise."
[1] "Converted to dat file."
[1] "Removed rows without content."
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 339077 18.2     641325 34.3   641325 34.3
Vcells 933429  7.2    8388608 64.0  8388493 64.0
[1] 1000 1705



In [169]:
binary_dataset = BinaryDataset.load_from_binaps_compatible_input(f"{output_file}.dat")
binary_dataset_item_overlap = BinaryDataset.load_from_binaps_compatible_input(f"{output_file}_itemOverlap.dat")

print(binary_dataset.shape, binary_dataset_item_overlap.shape)

(1000, 2844) (1000, 1705)


## 2. Run BinaPs

This step performs a complete run of the BinaPs autoencoder. We end up with the inferred patterns:
```
...
├── data.binaps.patterns
...
```

Note that item indices in this output start at 0, whereas the patterns generated by
*generate_synthetic_data* start at 1. In other words, the inferred pattern {0,1,5} corresponds to {1,2,6}
in the original dataset.
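
The index shift can be sketched as follows. `to_one_based` and `to_zero_based` are hypothetical helpers written for this illustration, assuming a pattern is a plain list of integer item indices; they are not functions from the BinaPs code or from `lib`.

```python
def to_one_based(pattern):
    """Shift a 0-based pattern (BinaPs output) to 1-based indices (hypothetical helper)."""
    return [item + 1 for item in pattern]


def to_zero_based(pattern):
    """Shift a 1-based pattern (synthetic generator) to 0-based indices (hypothetical helper)."""
    return [item - 1 for item in pattern]


inferred = [0, 1, 5]
print(to_one_based(inferred))                 # [1, 2, 6]
print(to_zero_based(to_one_based(inferred)))  # [0, 1, 5]
```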


In [151]:
from lib.BinapsWrapper import run_binaps

data_path = "data.dat"
hidden_dimension = -1
epochs = 1000

run_binaps(data_path, hidden_dimension, epochs)

python3 binaps/Binaps_code/main.py -i data.dat --hidden_dim -1 --epochs=1000
Data Sparsity:
0.09837105171815341
6, 298.853821 [output truncated]5, -2.3358154296875125
...
1000, 241.205566, 236.12779235839844, -1.6804656982421875


In [170]:
data_path = "data_itemOverlap.dat"
hidden_dimension = -1
epochs = 1000

run_binaps(data_path, hidden_dimension, epochs)

python3 binaps/Binaps_code/main.py -i data_itemOverlap.dat --hidden_dim -1 --epochs=1000
Data Sparsity:
0.08909677419354839
6, 138.799316, 1 [output truncated]-0.919677734375875
...
1000, 134.762115, 130.22630310058594, -0.1924285888671875


## 3. Compare inferred to original patterns

Get the F1 score for the inferred patterns.

*compare_datasets_based_on_f1* takes into account the difference in the patterns' starting index mentioned in
the previous step.
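
As a rough illustration of the metric, the sketch below computes the F1 value between one inferred pattern and one original pattern, treating each as a set of items. This is an assumption-laden simplification: how the provided script matches inferred patterns to original ones is not shown in this notebook, and `f1_between_patterns` and its `inferred_offset` parameter are hypothetical names.

```python
def f1_between_patterns(inferred, original, inferred_offset=1):
    """F1 between two item sets (hypothetical helper, not the provided script).

    `inferred_offset` shifts the 0-based inferred items so that they line up
    with the 1-based original items.
    """
    inferred_set = {item + inferred_offset for item in inferred}
    original_set = set(original)
    shared = len(inferred_set & original_set)
    if shared == 0:
        return 0.0
    precision = shared / len(inferred_set)  # share of inferred items that are correct
    recall = shared / len(original_set)     # share of original items that were found
    return 2 * precision * recall / (precision + recall)


print(f1_between_patterns([0, 1, 5], [1, 2, 6]))     # 1.0: identical after the shift
print(f1_between_patterns([0, 1, 5], [1, 2, 7, 8]))  # two shared items out of 3 and 4
```

Without the offset, the first pair would share only the item 1 and score well below 1, which is why the shift matters when comparing the two files.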

In [153]:
from lib.BinapsWrapper import compare_datasets_based_on_f1

estimated_patterns_file = "data.binaps.patterns"
real_patterns_file = "data.dat_patterns.txt"

compare_datasets_based_on_f1(estimated_patterns_file, real_patterns_file)

0.0



In [171]:
estimated_patterns_file = "data_itemOverlap.binaps.patterns"
real_patterns_file = "data_itemOverlap.dat_patterns.txt"

compare_datasets_based_on_f1(estimated_patterns_file, real_patterns_file)

0.0



In [172]:
from lib.BinapsWrapper import parse_binaps_patterns

filename = 'data.binaps.patterns'
with open(filename, 'r') as f:
    patterns = parse_binaps_patterns(f)

print(patterns[0])
print(patterns[1])
print(patterns[2])
print(patterns[3])

[1087, 1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103, 1104, 1106, 1107, 1108, 1109, 1110, 1111, 1112, 1113, 1114, 1115, 1116, 1117, 1118, 1119, 1120, 1121, 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130, 1132, 1133, 1134, 1135, 1136, 1137, 1138, 1139, 1140, 1141, 1142, 1143, 1144, 1145, 1146, 1147, 1148, 1149, 1150, 1151, 1152, 1153, 1154, 1155, 1156, 1157, 1158, 1159, 1160, 1161, 1162, 1163, 1164, 1165, 1166, 1167, 1168, 1169]
[1087, 1088, 1089, 1090, 1092, 1095, 1096, 1097, 1101, 1102, 1103, 1104, 1108, 1109, 1110, 1111, 1113, 1114, 1115, 1117, 1118, 1119, 1121, 1122, 1123, 1124, 1126, 1127, 1128, 1129, 1130, 1131, 1132, 1133, 1134, 1135, 1136, 1140, 1141, 1142, 1145, 1147, 1148, 1149, 1150, 1154, 1155, 1156, 1157, 1159, 1161, 1163, 1164, 1165, 1166, 1168, 1169]
[1087, 1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1100, 1101, 1102, 1103, 1104, 1105, 1106, 1107, 1109, 1110, 1111, 1112, 1113, 1114, 1115, 1116, 1

In [173]:
filename = 'data_itemOverlap.binaps.patterns'
with open(filename, 'r') as f:
    patterns_item_overlap = parse_binaps_patterns(f)

In [174]:
from lib.FormalConceptAnalysis import construct_context_from_binaps_patterns

context = construct_context_from_binaps_patterns(binary_dataset, patterns, True)
context_item_overlap = construct_context_from_binaps_patterns(binary_dataset_item_overlap, patterns_item_overlap, True)

In [158]:
from lib.FormalConceptAnalysis import get_factor_matrices_from_concepts
import numpy as np

Af, Bf = get_factor_matrices_from_concepts(context, binary_dataset.shape[0], binary_dataset.shape[1])
I = np.matmul(Af, Bf)

real_coverage = np.count_nonzero(I) / np.count_nonzero(binary_dataset._binary_dataset)

print(real_coverage)

0.0020324127491558076


In [175]:
Af, Bf = get_factor_matrices_from_concepts(context_item_overlap, binary_dataset_item_overlap.shape[0], binary_dataset_item_overlap.shape[1])
I = np.matmul(Af, Bf)

real_coverage = np.count_nonzero(I) / np.count_nonzero(binary_dataset_item_overlap._binary_dataset)

print(real_coverage)

0.08865117503785136


In [161]:
import random
num_patterns = len(patterns_item_overlap)

def generate_random_list(x, y):
    # Draw x item indices uniformly from 1..y (inclusive, repetitions allowed)
    return [random.randint(1, y) for _ in range(x)]

# Baseline: for each inferred pattern, build a random pattern of the same size
random_patterns = []
for pattern in patterns_item_overlap:
    random_patterns.append(generate_random_list(len(pattern), binary_dataset_item_overlap.shape[1]))

context_item_overlap = construct_context_from_binaps_patterns(binary_dataset_item_overlap, random_patterns, True)


Af, Bf = get_factor_matrices_from_concepts(context_item_overlap, binary_dataset_item_overlap.shape[0], binary_dataset_item_overlap.shape[1])
I = np.matmul(Af, Bf)

real_coverage = np.count_nonzero(I) / np.count_nonzero(binary_dataset_item_overlap._binary_dataset)

print(real_coverage)


0.02201173184652506


In [100]:
def assert_reconstruction(reconstructed, original):
    for i, row in enumerate(reconstructed):
        for j, cell in enumerate(row):
            if cell:
                assert original[i][j]

assert_reconstruction(I, binary_dataset_item_overlap._binary_dataset)
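
The check above verifies that every non-zero cell of the reconstruction is also set in the original data, i.e. that the reconstruction introduces no false positives. For larger matrices, a vectorized equivalent with NumPy might look like the sketch below. `has_no_false_positives` is a hypothetical helper, and the small arrays are made-up examples rather than data from this notebook.

```python
import numpy as np


def has_no_false_positives(reconstructed, original):
    """Return True if every non-zero reconstructed cell is also non-zero in the original."""
    reconstructed = np.asarray(reconstructed, dtype=bool)
    original = np.asarray(original, dtype=bool)
    # A false positive is a cell set in the reconstruction but not in the original.
    return not np.any(reconstructed & ~original)


# Made-up toy matrices for illustration only.
original = np.array([[1, 0, 1], [0, 1, 1]])
good = np.array([[1, 0, 0], [0, 1, 1]])  # covers a subset of the original ones
bad = np.array([[1, 1, 0], [0, 1, 1]])   # sets cell (0, 1), which is 0 in the original

print(has_no_false_positives(good, original))  # True
print(has_no_false_positives(bad, original))   # False
```

Casting to `bool` first means that counts greater than 1 produced by an integer matrix product are treated the same as 1.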