# Hands-On Hecuba


## Part 1
Hecuba is built around two main data structures: `StorageObj` and `StorageDict`. A `StorageObj` is a regular Python object with a set of persistent attributes, each annotated as `@ClassField name type`, for example `@ClassField myattr int`.

A `StorageDict`, on the other hand, represents a dictionary. Its data model is written as `@TypeSpec dict<<key>,value>`, where key and value follow the format `name:type`. Keep in mind that a `StorageObj` can have many `ClassField` entries, while a `StorageDict` has exactly one `TypeSpec`.
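
To make the `name:type` format concrete, here is a minimal sketch in plain Python that splits a `@TypeSpec` line into its key and value fields. The helper `parse_typespec` is hypothetical and written only for illustration; it is not Hecuba's own parser, and it assumes simple types without nested brackets.

```python
import re


def parse_typespec(spec):
    """Split a '@TypeSpec dict<<keys>,values>' line into (keys, values).

    Illustrative helper only: this is NOT part of Hecuba, and it does not
    handle nested types such as tuples.
    """
    match = re.fullmatch(r"\s*@TypeSpec\s+dict<<(.*?)>,(.*)>\s*", spec)
    if match is None:
        raise ValueError(f"not a dict TypeSpec: {spec!r}")

    def fields(text):
        # Each field is written as name:type; fields are comma separated.
        pairs = []
        for item in text.split(","):
            name, _, type_name = item.strip().partition(":")
            pairs.append((name.strip(), type_name.strip()))
        return pairs

    return fields(match.group(1)), fields(match.group(2))


keys, values = parse_typespec("@TypeSpec dict<<id:int>,x:double,y:double,z:double>")
print("keys:  ", keys)
print("values:", values)
```

The part inside the inner `<...>` describes the key, and everything after it describes the values stored under that key.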


List of supported data types:
https://github.com/bsc-dd/hecuba/wiki/1:-User-Manual#immutable-types-supported

Naming conventions:
https://github.com/bsc-dd/hecuba/wiki/1:-User-Manual#hecuba-data-classes

### Exercise 1 - Define data models

Define a class that inherits from either `StorageObj` or `StorageDict`. Then, add a data model that uses more than one attribute for the StorageObj, or more than one value if you chose the `StorageDict`.

In [1]:
from hecuba import StorageObj
class Element(StorageObj):
    """
    @ClassField atomic_number int
    @ClassField mass double
    @ClassField symbol str
    """

Create one instance, using the empty constructor `MyClass()`. Add some data, and then, invoke the method `make_persistent("name")`. At this point, the data will be sent to persistent storage.

In [2]:
helium = Element()
helium.atomic_number = 2
helium.mass = 4.002602
helium.symbol = "He"
helium.make_persistent("helium")

You can also inspect the stored data with `cqlsh`, Cassandra's command-line shell, which runs SQL-like (CQL) commands. Run `cqlsh` from your terminal and explore the data. You can also run queries from the notebook, for example:

In [3]:
!cqlsh -e 'DESCRIBE my_app'


CREATE KEYSPACE my_app WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}  AND durable_writes = true;

CREATE TABLE my_app.experiment (
    id int PRIMARY KEY,
    x double,
    y double,
    z double
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

CREATE TABLE my_app.element (
    storage_id uuid PR

In [4]:
!cqlsh -e 'SELECT storage_id,name FROM hecuba.istorage LIMIT 10'


 storage_id                           | name
--------------------------------------+-------------------
 8d8e1e7a-d929-4d37-a728-09fe6157167b | my_app.experiment
 d4fca017-8e45-4534-ad7b-ea448e0cde03 | my_app.experiment
 6f6c55fa-7949-4793-a9a8-68cac424cc56 | my_app.experiment
 f9693a60-723c-4862-bcab-202de264262e | my_app.experiment
 7c9e3490-10f1-4ae2-8ff8-9655bb214988 | my_app.experiment
 1994f7ac-2433-41ac-851c-69ad27db6eda | my_app.experiment
 5d0d43ee-c5d4-4771-aced-d455651574e8 | my_app.experiment
 b15b5162-5cbe-4b7d-a769-0b107ab9c236 | my_app.experiment
 5fa684ff-f28d-4fcf-b5b7-4ee48ba949e7 | my_app.experiment
 245f44ec-d107-4f3f-a801-2890d58516e1 | my_app.experiment

(10 rows)


Finally, add a method to the class definition. The method should combine multiple attributes, or the values of a given key.

In [5]:
class Element(StorageObj):
    """
    @ClassField atomic_number int
    @ClassField mass double
    @ClassField symbol str
    """
    
    def print_info(self):
        print(f"Element '{self.symbol}' has an atomic number of {self.atomic_number} and {self.mass}u mass.")

Instantiate the object again, but this time use the same "name" previously used to make the data persistent. In this way, the object will be able to recover the previous data. You can also try to call the new method.

In [6]:
helium = Element("helium")
helium.print_info()

Element 'He' has an atomic number of 2 and 4.002602u mass.


### Exercise 2 - Let's parallelize workloads

Now, declare a class that inherits from `StorageDict`, and define a data model.

Then, declare one instance using the persistent constructor `MyClass("someid")`. Populate the object with data, say between 100k and 10 million key-value pairs.

In [7]:
from hecuba import StorageDict
class Particles(StorageDict):
    """
    @TypeSpec dict<<id:int>,x:double,y:double,z:double>
    """

dataset = Particles("experiment")
for i in range(10**6):
    dataset[i] = [i*10, i/0.2, i*0.5%0.8]

    

All Hecuba objects have a generator method, `split()`, that yields subsets of the object until all data has been fetched.
Try it: you will see that the data is split into blocks of seemingly random sizes, but all of it is there.

In [9]:
total = 0
for i, block in enumerate(dataset.split()):
    n_elements = len(block)
    total = total + n_elements
    print(f"Block {i} has {n_elements}")
print(f"We counted {total} elements.")

Block 0 has 30581
Block 1 has 36478
Block 2 has 28566
Block 3 has 41303
Block 4 has 39662
Block 5 has 25605
Block 6 has 25049
Block 7 has 17128
Block 8 has 33631
Block 9 has 25417
Block 10 has 23396
Block 11 has 38444
Block 12 has 56691
Block 13 has 30255
Block 14 has 22755
Block 15 has 38362
Block 16 has 28414
Block 17 has 18621
Block 18 has 29313
Block 19 has 29780
Block 20 has 30013
Block 21 has 25400
Block 22 has 30312
Block 23 has 36659
Block 24 has 29376
Block 25 has 37343
Block 26 has 45041
Block 27 has 34817
Block 28 has 25078
Block 29 has 40701
Block 30 has 20508
Block 31 has 22758
Block 32 has 2543
We counted 1000000 elements.
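
Because the blocks are disjoint and together cover the whole dataset, each one can be processed independently and the partial results combined at the end. The following is a minimal sketch of that pattern in plain Python. It uses an ordinary `dict` and a hypothetical `split_blocks` helper as a stand-in for `split()`, so it runs without Cassandra; how Hecuba actually assigns keys to blocks is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor


def split_blocks(data, n_blocks):
    """Yield n_blocks disjoint sub-dictionaries that together cover `data`.

    Hypothetical stand-in for Hecuba's split(), for illustration only.
    """
    blocks = [{} for _ in range(n_blocks)]
    for key, value in data.items():
        # Assign every key to exactly one block.
        blocks[hash(key) % n_blocks][key] = value
    yield from blocks


def sum_x(block):
    """Partial result for one block: the sum of the x column."""
    return sum(row[0] for row in block.values())


# Same row layout as the Particles example, but small and in memory.
data = {i: (i * 10, i / 0.2, i * 0.5 % 0.8) for i in range(1000)}

# Process the blocks independently, then combine the partial sums.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum_x, split_blocks(data, 8)))

total_x = sum(partial_sums)
print(f"{len(partial_sums)} blocks, total x = {total_x}")
```

The key point is that `sum_x` only ever sees one block, so nothing in it depends on how many blocks there are or on the order in which they are processed.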


## Part 2

### Exercise 3 - Parallelize a Hecuba app with COMPSs

We will take program X, which runs sequentially, and add the following to make it run in parallel.
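
As a rough idea of what such a change can look like, here is a minimal sketch using the PyCOMPSs `@task` decorator and `compss_wait_on`. It is not program X: the function `count_elements` and the helper `make_blocks` are hypothetical, and plain lists stand in for the blocks that a Hecuba object would provide. The `ImportError` fallback is an assumption added so the sketch also runs sequentially on a machine without PyCOMPSs.

```python
try:
    from pycompss.api.task import task
    from pycompss.api.api import compss_wait_on
except ImportError:
    # Fallback (assumption for this sketch): without PyCOMPSs installed,
    # tasks are plain functions and there is nothing to wait on.
    def task(**_kwargs):
        def decorator(function):
            return function
        return decorator

    def compss_wait_on(obj):
        return obj


@task(returns=1)
def count_elements(block):
    """Work done on a single block; each call can become a COMPSs task."""
    return len(block)


def make_blocks(n_items, n_blocks):
    """Hypothetical stand-in for the blocks of a Hecuba object."""
    return [list(range(start, n_items, n_blocks)) for start in range(n_blocks)]


blocks = make_blocks(100, 4)

# Launch one task per block; under COMPSs these calls return future objects.
partial_counts = [count_elements(block) for block in blocks]

# Synchronize: fetch the real values before combining them.
partial_counts = compss_wait_on(partial_counts)
total = sum(partial_counts)
print(f"We counted {total} elements.")
```

The sequential logic stays the same; the only additions are the decorator on the per-block function and the explicit synchronization before the results are used.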
