# BinaPs Demo

Copyright 2022 Bernardo C. Rodrigues

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public
License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later
version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the
implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
details. You should have received a copy of the GNU General Public License along with this program. If not, 
see <https://www.gnu.org/licenses/>. 

This notebook demonstrates how to use the BinaPs implementation provided by its authors. It consists of three
steps:

**1. Generate synthetic data:** With one of the provided scripts, we generate synthetic data in which a set of known
patterns is planted.

**2. Run BinaPs:** We run BinaPs over this synthetic dataset and get a list of inferred patterns.

**3. Compare inferred to original patterns**: We use another provided script to calculate the 
[F1](https://en.wikipedia.org/wiki/F-score) value between the inferred and original patterns.

In [1]:
import torch

# Check if CUDA is available
if torch.cuda.is_available():
    print(f"CUDA Device: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device available")

CUDA Device: NVIDIA GeForce GTX 1050


## 1. Generate synthetic data

This step produces four files:

```
...
├── data.dat
├── data.dat_patterns.txt
├── data_itemOverlap.dat
├── data_itemOverlap.dat_patterns.txt
...
```

*data.dat* and *data_itemOverlap.dat* are the synthetic datasets, while *data.dat_patterns.txt* and
*data_itemOverlap.dat_patterns.txt* hold their respective root patterns. Noise is added to the data to test the
robustness of BinaPs.
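
The noise model used by the generator script is not shown in this notebook. As a rough illustration only, the sketch below assumes the common choice of flipping each cell of a binary matrix independently with probability `noise`. The helper `add_flip_noise` is hypothetical and is not part of `lib.utils` or BinaPs.

```python
import random


def add_flip_noise(rows, noise, seed=0):
    """Flip each 0/1 cell independently with probability `noise`.

    Assumption: this is one plausible noise model, not necessarily the one
    implemented by the generator script used in this notebook.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    return [
        [1 - cell if rng.random() < noise else cell for cell in row]
        for row in rows
    ]


# Two rows, each containing one planted pattern of size 2.
clean = [[1, 1, 0, 0], [0, 0, 1, 1]]
noisy = add_flip_noise(clean, noise=0.001)
print(noisy)
```

With a noise level as low as 0.001, almost all cells keep their original value, so the planted patterns remain clearly visible in the data.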

In *data_itemOverlap.dat* the patterns may overlap (e.g., ABC CDE CEF), while in *data.dat* they may not (e.g., AB C DE F).
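
To make the distinction concrete, here is a minimal sketch that checks whether any two patterns in a collection share an item. The function `patterns_overlap` is a hypothetical helper written for this illustration, and patterns are represented as strings of single-letter items to match the examples above.

```python
from itertools import combinations


def patterns_overlap(patterns):
    """Return True if any two patterns share at least one item."""
    return any(set(a) & set(b) for a, b in combinations(patterns, 2))


print(patterns_overlap(["ABC", "CDE", "CEF"]))    # True: C appears in all three
print(patterns_overlap(["AB", "C", "DE", "F"]))   # False: every item appears once
```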


In [2]:
from lib.utils import generate_synthetic_data

# Base name for the generated files (data.dat, data_itemOverlap.dat, ...)
output_file = "data"

row_quantity = 10000
column_quantity = 50
max_pattern_size = 10
noise = 0.001
density = 0.05

generate_synthetic_data(row_quantity, column_quantity, output_file, max_pattern_size, noise, density)

Rscript binaps/Data/Synthetic_data/generate_toy.R AND 50 10000 10 data 0.001 0.05
[1] 310
[1] 10000   310
[1] "Added noise."
[1] "Converted to dat file."
[1] "Removed rows without content."
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 380919 20.4     809590 43.3   641325 34.3
Vcells 892617  6.9    8388608 64.0  8388514 64.0
[1] 10000   186



## 2. Run BinaPs

A complete run of the BinaPs autoencoder. This step produces the inferred patterns:
```
...
├── data.binaps.patterns
...
```

Note that patterns in this output are indexed from 0, whereas patterns generated by
*generate_synthetic_data* are indexed from 1. In other words, the pattern {0,1,5} in this output corresponds
to {1,2,6} in the original dataset.
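
If you want to compare patterns by hand, the index shift amounts to adding 1 to every item. A minimal sketch, using a hypothetical helper `to_one_based` that is not part of `lib.utils`:

```python
def to_one_based(pattern):
    """Shift a 0-based pattern to the 1-based indexing of the original data."""
    return {item + 1 for item in pattern}


print(to_one_based({0, 1, 5}) == {1, 2, 6})  # True
```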


In [3]:
from lib.utils import run_binaps

data_path = "data.dat"
hidden_dimension = 20
epochs = 100

run_binaps(data_path, hidden_dimension, epochs)

python3 binaps/Binaps_code/main.py -i data.dat --hidden_dim 20 --epochs=100
Data Sparsity:
0.05372346044153375

Test set: Average loss: 85.639381, Accuracy: 0/946 (0%)


Test set: Average loss: 55.226379, Accuracy: 0/946 (0%)


Test set: Average loss: 51.845478, Accuracy: 0/946 (0%)


Test set: Average loss: 50.555359, Accuracy: 0/946 (0%)


Test set: Average loss: 51.086777, Accuracy: 1/946 (0%)


Test set: Average loss: 47.933315, Accuracy: 0/946 (0%)


Test set: Average loss: 46.696804, Accuracy: 1/946 (0%)


Test set: Average loss: 45.997536, Accuracy: 1/946 (0%)


Test set: Average loss: 46.891121, Accuracy: 0/946 (0%)


Test set: Average loss: 46.822929, Accuracy: 0/946 (0%)


Test set: Average loss: 46.331894, Accuracy: 3/946 (0%)


Test set: Average loss: 45.469749, Accuracy: 0/946 (0%)


Test set: Average loss: 46.719376, Accuracy: 3/946 (0%)


Test set: Average loss: 45.338291, Accuracy: 1/946 (0%)


Test set: Average loss: 45.951797, Accuracy: 0/946 (0%)


Test set: Average 

## 3. Compare inferred to original patterns

Get the F1 score for the inferred patterns.

*compare_datasets_based_on_f1* takes into account the difference in pattern indexing mentioned in
the previous step.
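
For intuition, the sketch below computes the F1 score between one inferred pattern and one original pattern, treating each as a set of items. This is only an illustration of the metric: how *compare_datasets_based_on_f1* matches and aggregates patterns across the two files is not shown in this notebook, and `f1_between` and its `inferred_is_zero_based` parameter are hypothetical names.

```python
def f1_between(inferred, original, inferred_is_zero_based=True):
    """F1 score between two patterns viewed as item sets.

    precision = |inferred & original| / |inferred|
    recall    = |inferred & original| / |original|
    F1        = 2 * precision * recall / (precision + recall)
    """
    inferred = set(inferred)
    original = set(original)
    if inferred_is_zero_based:
        # Align the 0-based inferred items with the 1-based original items.
        inferred = {item + 1 for item in inferred}

    overlap = len(inferred & original)
    if overlap == 0:
        return 0.0

    precision = overlap / len(inferred)
    recall = overlap / len(original)
    return 2 * precision * recall / (precision + recall)


# The inferred pattern recovers 3 of the 4 original items and adds no wrong ones.
print(f1_between({0, 1, 5}, {1, 2, 6, 7}))
```

In this example precision is 1 and recall is 0.75, so the score is about 0.857. A score of 1 would mean the inferred pattern matches the original exactly.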

In [4]:
from lib.utils import compare_datasets_based_on_f1

estimated_patterns_file = "data.binaps.patterns"
real_patterns_file = "data.dat_patterns.txt"

compare_datasets_based_on_f1(estimated_patterns_file, real_patterns_file)

0.1111111111111111



# Overlapping patterns example

In [5]:
data_path = "data_itemOverlap.dat"
hidden_dimension = 20
epochs = 100

run_binaps(data_path, hidden_dimension, epochs)

python3 binaps/Binaps_code/main.py -i data_itemOverlap.dat --hidden_dim 20 --epochs=100
Data Sparsity:
0.0821346239322735

Test set: Average loss: 92.769867, Accuracy: 0/933 (0%)


Test set: Average loss: 74.524017, Accuracy: 2/933 (0%)


Test set: Average loss: 72.694328, Accuracy: 1/933 (0%)


Test set: Average loss: 71.628136, Accuracy: 0/933 (0%)


Test set: Average loss: 69.832253, Accuracy: 4/933 (0%)


Test set: Average loss: 69.488716, Accuracy: 2/933 (0%)


Test set: Average loss: 69.991730, Accuracy: 3/933 (0%)


Test set: Average loss: 67.574120, Accuracy: 3/933 (0%)


Test set: Average loss: 70.088135, Accuracy: 1/933 (0%)


Test set: Average loss: 68.983467, Accuracy: 3/933 (0%)


Test set: Average loss: 68.657387, Accuracy: 2/933 (0%)


Test set: Average loss: 69.638840, Accuracy: 4/933 (0%)


Test set: Average loss: 72.530586, Accuracy: 6/933 (1%)


Test set: Average loss: 68.153419, Accuracy: 5/933 (1%)


Test set: Average loss: 70.161530, Accuracy: 5/933 (1%)


Test se

In [6]:
estimated_patterns_file = "data_itemOverlap.binaps.patterns"
real_patterns_file = "data_itemOverlap.dat_patterns.txt"

compare_datasets_based_on_f1(estimated_patterns_file, real_patterns_file)

0.1917808219178082

