# BinaPs Demo

This notebook demonstrates how to use the BinaPs implementation provided by its authors. It consists of three
steps:

**1. Generate synthetic data:** With one of the provided scripts, we generate synthetic data in which a set of known
patterns are planted.

**2. Run BinaPs:** We'll run BinaPs over this synthetic dataset and get a list of inferred patterns.

**3. Compare inferred to original patterns**: We use another provided script to calculate the 
[F1](https://en.wikipedia.org/wiki/F-score) value between the inferred and original patterns.

Copyright 2022 Bernardo C. Rodrigues

See COPYING file for license details

In [None]:
import torch

if torch.cuda.is_available():
    print(f"CUDA Device: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device available. This will be slow.")

## 1. Generate synthetic data

In this step we will end up with 4 files:

```
...
├── data.dat
├── data.dat_patterns.txt
├── data_itemOverlap.dat
├── data_itemOverlap.dat_patterns.txt
...
```

*data.dat* and *data_itemOverlap.dat* are the synthetic datasets, while *data.dat_patterns.txt* and
*data_itemOverlap.dat_patterns.txt* are their respective root patterns. Arbitrary noise is added to test the robustness of BinaPs.

In *data_itemOverlap.dat* the patterns may overlap (e.g. ABC CDE CEF), while in *data.dat* they may not (e.g. AB C DE F).
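
To make the distinction concrete, the sketch below checks whether any two patterns in a list share an item. The helper `patterns_overlap` is a hypothetical name introduced only for this illustration; it is not part of the BinaPs scripts.

```python
from itertools import combinations


def patterns_overlap(patterns):
    """Return True if any two patterns share at least one item.

    Hypothetical helper for illustration; not part of BinaPs.
    """
    return any(set(a) & set(b) for a, b in combinations(patterns, 2))


# The two examples from the text, with one letter per item.
overlapping = ["ABC", "CDE", "CEF"]
disjoint = ["AB", "C", "DE", "F"]

print(patterns_overlap(overlapping))  # True: ABC and CDE share C
print(patterns_overlap(disjoint))     # False: no item appears twice
```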


In [None]:
import tempfile
from pathlib import Path
from binaps.binaps_wrapper import generate_synthetic_data

output_file = "data"

row_quantity = 1000
column_quantity = 10
max_pattern_size = 500
noise = 0.05
density = 0.03

tmp_dir = Path(tempfile.mkdtemp())

generate_synthetic_data(tmp_dir, row_quantity, column_quantity, output_file, max_pattern_size, noise, density)

In [None]:
from dataset.binary_dataset import BinaryDataset

binary_dataset = BinaryDataset.load_from_binaps_compatible_input(tmp_dir / f"{output_file}.dat")
binary_dataset_item_overlap = BinaryDataset.load_from_binaps_compatible_input(tmp_dir / f"{output_file}_itemOverlap.dat")

print("Shape of the binary dataset:", binary_dataset.shape)
print("Shape of the binary overlap dataset:", binary_dataset_item_overlap.shape)

## 2. Run BinaPs

A complete run of the BinaPs autoencoder. In this step we'll end up with the inferred patterns:
```
...
├── data.binaps.patterns
...
```

Note that item indices in this output start at 0, whereas the patterns generated by
*generate_synthetic_data* start at 1. In other words, the inferred pattern {0,1,5} corresponds to {1,2,6}
in the original dataset.
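
As a minimal sketch of that offset, converting between the two conventions only requires shifting every item index by one. The function names below are hypothetical and exist only for this illustration.

```python
def to_one_based(pattern):
    """Shift a 0-based pattern (BinaPs output) to 1-based item indices."""
    return {item + 1 for item in pattern}


def to_zero_based(pattern):
    """Shift a 1-based pattern (synthetic data) to 0-based item indices."""
    return {item - 1 for item in pattern}


inferred = {0, 1, 5}
print(sorted(to_one_based(inferred)))  # [1, 2, 6]
```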


In [None]:
from binaps.binaps_wrapper import run_binaps_cli

data_path = tmp_dir / "data.dat"
hidden_dimension = -1
epochs = 1000

run_binaps_cli(data_path=data_path, hidden_dimension=hidden_dimension, epochs=epochs)

In [None]:
data_path = tmp_dir / "data_itemOverlap.dat"
hidden_dimension = -1
epochs = 1000

run_binaps_cli(data_path=data_path, hidden_dimension=hidden_dimension, epochs=epochs)

## 3. Compare inferred to original patterns

Get the F1 score for the inferred patterns.

*compare_datasets_based_on_f1* takes into account the difference in the patterns' starting index mentioned in
the previous step.
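
For intuition, the sketch below computes the F1 score between a single inferred pattern and a single original pattern, treating each as a set of items. This is only one common way to score set overlap; the provided comparison script may match and aggregate patterns differently. `pattern_f1` and the example patterns are hypothetical.

```python
def pattern_f1(inferred, real):
    """F1 score between two patterns, each given as a collection of items.

    Hypothetical helper for illustration; assumes both patterns already
    use the same index base.
    """
    inferred, real = set(inferred), set(real)
    common = len(inferred & real)
    if common == 0:
        return 0.0
    precision = common / len(inferred)
    recall = common / len(real)
    return 2 * precision * recall / (precision + recall)


# Inferred pattern is 0-based, so shift it by one before comparing.
inferred = {item + 1 for item in {0, 1, 5}}
real = {1, 2, 6, 7}

print(round(pattern_f1(inferred, real), 3))  # 0.857
```

Here the precision is 3/3 and the recall is 3/4, which gives an F1 of 6/7.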

In [None]:
from binaps.binaps_wrapper import compare_datasets_based_on_f1

# tmp_dir is the temporary directory created in step 1.

estimated_patterns_file = tmp_dir / "data.binaps.patterns"
real_patterns_file = tmp_dir / "data.dat_patterns.txt"

compare_datasets_based_on_f1(estimated_patterns_file, real_patterns_file)

In [None]:
estimated_patterns_file = tmp_dir / "data_itemOverlap.binaps.patterns"
real_patterns_file = tmp_dir / "data_itemOverlap.dat_patterns.txt"

compare_datasets_based_on_f1(estimated_patterns_file, real_patterns_file)