In [1]:
# using Pkg
# Pkg.activate(".")
# Pkg.add("WordTokenizers")
# Pkg.add("Languages")
# Pkg.instantiate()
# Pkg.add("SQLite")
# Pkg.add("DBInterface")
# Pkg.add("JSON3")
# Pkg.add("EasyConfig")
# Pkg.add("MurmurHash3")
# Pkg.add("TextAnalysis")
# Pkg.add("UUIDs")
# Pkg.add("PooledArrays")
include("src/lisa_store.jl")

using SQLite
using DBInterface
using MurmurHash3
using TextAnalysis
using JSON3
using PooledArrays
using UUIDs
using EasyConfig
using HDF5


# Domains and co-domains

Here we'll try to reduce the ambiguity of tracing tokens back from their HllSet representation.

The consistency of converting datasets to HllSets depends on two factors:

 1. The $p$ parameter, which sets the precision of the conversion through the number of bins ($2^p$);
 2. The hash function that we are using.

Specifically, the output of a hash function depends on the seed value used to initialize it. By applying different seed values we can control the generated hashes.
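
To illustrate both factors, here is a minimal sketch. It assumes Julia's built-in `hash(x, seed)` as a stand-in for the notebook's seeded hash, and it assumes that the bin index is taken from the top $p$ bits of the hash; `bin_index` is a hypothetical helper, and the actual HllSet implementation may choose bits differently.

```julia
# Assumption: Base.hash stands in for the seeded hash used in the notebook.
p = 10            # precision parameter
nbins = 2^p       # number of bins in the sketch

token = "Taxi"
h_std  = hash(token, UInt(0))    # hypothetical "standard" seed
h_seed = hash(token, UInt(42))   # alternative seed

# Hypothetical helper: use the top p bits of the hash as the bin index.
bin_index(h) = Int(h >> (8 * sizeof(h) - p))

println("number of bins: ", nbins)
println("std  hash: ", h_std,  " -> bin ", bin_index(h_std))
println("seed hash: ", h_seed, " -> bin ", bin_index(h_seed))
```

The same token produces a different hash value for each seed, while repeated calls with the same seed always return the same value.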

Here is the idea of the algorithm:

 1. We process the dataset as usual, using the standard hash function:

$$F_{(std)}: X_{(std)} \to Y_{(std)}$$

 2. Then we trace the original tokens by applying the reverse mapping:

$$G_{(std)}: Y_{(std)} \to X_{(std)}$$

 3. Next we process the same dataset using the same hash function but with a different seed value:

$$F_{(seed)}: X_{(seed)} \to Y_{(seed)}$$

 4. Finally, we trace back the results from the seeded hash function:

$$G_{(seed)}: Y_{(seed)} \to X_{(seed)}$$

It is possible that

 $$X_{(seed)} \neq X_{(std)}$$

for example, when tracing back also picks up tokens whose hashes collide with those of the original tokens. But the tokens from the original dataset must appear in both results.
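
The four steps can be sketched on a toy example. This is only an illustration under stated assumptions: `Base.hash` stands in for the notebook's seeded hash, a short code of `p` bits stands in for the lossy HllSet encoding, and `code`, `F`, `G`, and `vocabulary` are hypothetical names that are not part of `Store` or `Graph`.

```julia
# Short code of p bits: a stand-in for the lossy encoding of a token.
code(token, seed; p=4) = hash(token, UInt(seed)) >> (8 * sizeof(UInt) - p)

# F: dataset -> set of codes.
F(tokens, seed; p=4) = Set{UInt}(code(t, seed; p=p) for t in tokens)

# G: codes -> every vocabulary token whose code matches.
# Because the code is short, G may also return tokens that merely collide.
G(codes, vocabulary, seed; p=4) =
    Set{String}(t for t in vocabulary if code(t, seed; p=p) in codes)

vocabulary = ["Taxi", "Moped", "Car", "Van", "Bus", "Goods",
              "Pedal", "cycle", "horse", "coach", "Minibus", "Motorcycle"]
dataset = Set(["Taxi", "Car", "Bus"])

X_std  = G(F(dataset, 0), vocabulary, 0)     # steps 1 and 2
X_seed = G(F(dataset, 42), vocabulary, 42)   # steps 3 and 4
common = intersect(X_std, X_seed)

println("X_std:  ", X_std)
println("X_seed: ", X_seed)
println("common: ", common)
```

By construction every original token is in both `X_std` and `X_seed`, so it is also in `common`. Tokens that collide under one seed are unlikely to collide under the other, which is why intersecting the two traces can narrow the result.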

## Standard processing


In [2]:
db = Graph.DB("lisa_analytics.db")
db.sqlitedb

SQLite.DB("lisa_analytics.db")

In [3]:
Store.book_file(db, "/home/alexmy/JULIA/DEMO/sample/")

sha1: 0b90b1fee69c77ffa3efe57db7788112ef96dba6
sha1: 6be12bee4edf7c96016907e44bb520be80dc9232


In [4]:
uuid = string(uuid4())
df = Graph.set_lock!(db, 
    "/home/alexmy/JULIA/DEMO/sample", 
    "csv", 
    "book_file", 
    "ingest_csv", 
    "waiting", 
    "waiting", 
    uuid; result=true)

for row in eachrow(df)
    assign = Graph.Assignment(row) 
    col_uuid = string(uuid4())
    Store.ingest_csv_by_column(db, assign, col_uuid; limit=10000, offset=10)
end

Processed column: 8
Processed column: 15


In [5]:
ds_id = "3f9526f8d331b9519b8632a11b2d344ab7c647b6"
node = Graph.getnode(db, ds_id, :; table_name="t_nodes")

Node(3f9526f8d331b9519b8632a11b2d344ab7c647b6; ["csv_column"]; props: column_name="Vehicle type", file_sha1="0b90b1fee69c77ffa3efe57db7788112ef96dba6", column_type="String")

In [6]:
result = Graph.gettokens(db, "3f9526f8d331b9519b8632a11b2d344ab7c647b6", :)
tokens = Store.collect_tokens(db, result)
println(tokens)

Set(["Taxi", "mgw", "Taxi/Private", "Moped", "String", "M/cycle", "vehicle", "and", "seats", "over", "coach", "more", "Motor", "pass", "Minibus", "horse", "motor", "under", "hire", "cycle", "Motorcycle", "Private", "Vehicle", "Other", "Goods", "tonnes", "Bus", "gross", "Van", "Ridden", "Car", "weight", "car", "passenger", "from", "type", "maximum", "Pedal"])


## Processing with seeded hash

We'll go through the same steps with the same parameters, except for the database and the hash seed:
 - the database name is "db_seed.db";
 - the hash function is seeded with `seed=42`.

In [7]:
db_seed = Graph.DB("db_seed.db")

Graph.DB("db_seed.db") (34 assignments, 0 commits, 174 tokens, 0 nodes, 0 edges25 t_nodes, 23 t_edges)

In [8]:
Store.book_file(db_seed, "/home/alexmy/JULIA/DEMO/sample/"; seed=42, P=10)

sha1: 0b90b1fee69c77ffa3efe57db7788112ef96dba6
sha1: 6be12bee4edf7c96016907e44bb520be80dc9232


In [9]:
uuid = string(uuid4())
df = Graph.set_lock!(db_seed, 
    "/home/alexmy/JULIA/DEMO/sample", 
    "csv", 
    "book_file", 
    "ingest_csv", 
    "waiting", 
    "waiting", 
    uuid; result=true)

for row in eachrow(df)
    assign = Graph.Assignment(row) 
    col_uuid = string(uuid4())
    # Important: do not forget to set the HllSet precision parameter p to 10
    Store.ingest_csv_by_column(db_seed, assign, col_uuid; limit=10000, offset=10, p=10, seed=42)
end

Processed column: 8
Processed column: 15


### Getting the dataset directly from the **t_nodes** table of the "db_seed.db" database

We rely on the fact that the SHA1 node ID is not affected by changing the hash function used to encode the tokens.

In [10]:
ds_id = "3f9526f8d331b9519b8632a11b2d344ab7c647b6"
node_seed = Graph.getnode(db_seed, ds_id, :; table_name="t_nodes")

Node(3f9526f8d331b9519b8632a11b2d344ab7c647b6; ["csv_column"]; props: column_name="Vehicle type", file_sha1="0b90b1fee69c77ffa3efe57db7788112ef96dba6", column_type="String")

In [11]:
result = Graph.gettokens(db_seed, "3f9526f8d331b9519b8632a11b2d344ab7c647b6", :)  # query the seeded database
tokens_seed = Store.collect_tokens(db_seed, result)
println(tokens_seed)

Set(["Taxi", "mgw", "Taxi/Private", "Moped", "String", "M/cycle", "vehicle", "and", "seats", "over", "coach", "more", "Motor", "pass", "Minibus", "horse", "motor", "under", "hire", "cycle", "Motorcycle", "Private", "Vehicle", "Other", "Goods", "tonnes", "Bus", "gross", "Van", "Ridden", "Car", "weight", "car", "passenger", "from", "type", "maximum", "Pedal"])


In [12]:
intersection = intersect(tokens, tokens_seed)

Set{String} with 38 elements:
  "maximum"
  "Pedal"
  "Taxi/Private"
  "Moped"
  "String"
  "M/cycle"
  "over"
  "and"
  "seats"
  "vehicle"
  "coach"
  "more"
  "Motor"
  "pass"
  "Minibus"
  "horse"
  "motor"
  "under"
  "hire"
  ⋮ 

In [13]:
println("tokens size: ", length(tokens), 
    ";\ntokens_seed size: ", length(tokens_seed), 
    ";\nintersection size: ", length(intersection))

tokens size: 38;
tokens_seed size: 38;
intersection size: 38


### Let's check how using a seeded hash affected the HllSets

In [14]:
hll_std = SetCore.HllSet{10}()
hll_seed = SetCore.HllSet{10}()

dataset_std = node.dataset
dataset_seed = node_seed.dataset

println("dataset_std size: ", length(dataset_std), 
    ";\ndataset_seed size: ", length(dataset_seed))

# Restore both HllSets from the datasets stored in the nodes
hll_std = SetCore.restore(hll_std, Vector{UInt64}(dataset_std))
hll_seed = SetCore.restore(hll_seed, Vector{UInt64}(dataset_seed))

println("hll_std size: ", SetCore.count(hll_std), 
    ";\nhll_seed size: ", SetCore.count(hll_seed))

hll_intersection = intersect(hll_std, hll_seed)
# SetCore.count(hll_intersection)

println("hll_intersection size: ", SetCore.count(hll_intersection))

dataset_std size: 1024;
dataset_seed size: 1024
hll_std size: 36;
hll_seed size: 36
hll_intersection size: 1


### So, we are lucky: we got a non-empty intersection from two HllSets built using different hash functions. (Or maybe not, because $1$ is small and could be within the range of the estimation error.)

We can also see that the cardinality estimates in our case are not bad. In both cases the estimate ($36$) differs from the number of collected tokens ($38$) by $2$, or about $5.26$%.
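
The error can be checked directly from the numbers printed above. This short sketch takes the 38 collected tokens as the reference count and 36 as the HllSet estimate; the variable names are chosen here for illustration.

```julia
true_count = 38   # length(tokens) printed above
estimate   = 36   # SetCore.count(hll_std) printed above

abs_error = abs(estimate - true_count)
rel_error = abs_error / true_count

println("absolute error: ", abs_error)
println("relative error: ", round(100 * rel_error; digits=2), "%")
```

The relative error is the absolute difference divided by the reference count, here $2 / 38$.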

# Applying HllSets to tabular data structures

In [15]:
include("src/lisa_store.jl")

using SQLite
using DBInterface
using MurmurHash3
using TextAnalysis
using JSON3
using PooledArrays
using UUIDs
using EasyConfig
using SparseArrays



In [16]:
db = Graph.DB("lisa_analytics.db")

Graph.DB("lisa_analytics.db") (34 assignments, 0 commits, 174 tokens, 0 nodes, 0 edges125 t_nodes, 123 t_edges)

In [17]:
Store.book_file(db, "/home/alexmy/JULIA/DEMO/sample/"; column=false)

sha1: 0b90b1fee69c77ffa3efe57db7788112ef96dba6
sha1: 6be12bee4edf7c96016907e44bb520be80dc9232


In [18]:
uuid = string(uuid4())
df = Graph.set_lock!(db, 
    "/home/alexmy/JULIA/DEMO/sample", 
    "csv", 
    "book_file", 
    "ingest_csv", 
    "waiting", 
    "waiting", 
    uuid; result=true)

for row in eachrow(df)
    assign = Graph.Assignment(row)
    Store.ingest_csv_by_row(db, assign; limit=50, offset=10)
end

In [19]:
# SHA1 ID of the CSV file from which to extract the row and column nodes
source_id = "0b90b1fee69c77ffa3efe57db7788112ef96dba6"
# Extract the row and column nodes of the CSV file.
# Each entry of the resulting matrix is the cardinality of the intersection
# of the corresponding row node and column node.
matrix = Store.get_card_matrix(db, source_id)

for row in eachrow(matrix)
    println(row)
end

[2.0, 2.0, 2.0, 1.0, 2.0, 4.0, 3.0, 3.0]
[2.0, 2.0, 1.0, 1.0, 2.0, 4.0, 5.0, 4.0]
[2.0, 3.0, 2.0, 1.0, 1.0, 4.0, 4.0, 4.0]
[2.0, 2.0, 2.0, 1.0, 1.0, 4.0, 4.0, 3.0]
[2.0, 2.0, 2.0, 1.0, 2.0, 4.0, 3.0, 3.0]
[2.0, 3.0, 2.0, 1.0, 1.0, 4.0, 3.0, 3.0]
[2.0, 2.0, 2.0, 1.0, 2.0, 4.0, 3.0, 3.0]
[2.0, 3.0, 2.0, 1.0, 2.0, 4.0, 3.0, 3.0]
[2.0, 2.0, 1.0, 1.0, 2.0, 4.0, 3.0, 3.0]
[2.0, 3.0, 2.0, 1.0, 2.0, 4.0, 4.0, 3.0]
[2.0, 3.0, 2.0, 1.0, 2.0, 4.0, 3.0, 4.0]
[2.0, 3.0, 2.0, 1.0, 2.0, 4.0, 3.0, 3.0]
[2.0, 3.0, 2.0, 1.0, 2.0, 4.0, 3.0, 4.0]
[2.0, 3.0, 1.0, 1.0, 2.0, 4.0, 4.0, 3.0]
[2.0, 3.0, 1.0, 1.0, 2.0, 4.0, 3.0, 3.0]
[2.0, 3.0, 2.0, 1.0, 1.0, 4.0, 4.0, 3.0]
[2.0, 3.0, 2.0, 1.0, 2.0, 4.0, 4.0, 3.0]
[2.0, 2.0, 1.0, 1.0, 1.0, 4.0, 8.0, 3.0]
[2.0, 2.0, 1.0, 1.0, 1.0, 4.0, 3.0, 3.0]
[2.0, 2.0, 2.0, 1.0, 2.0, 4.0, 3.0, 3.0]
[2.0, 3.0, 2.0, 1.0, 2.0, 4.0, 3.0, 3.0]
[2.0, 3.0, 2.0, 1.0, 2.0, 4.0, 3.0, 3.0]
[2.0, 2.0, 2.0, 1.0, 2.0, 4.0, 3.0, 3.0]
[2.0, 3.0, 1.0, 1.0, 2.0, 4.0, 3.0, 3.0]
[2.0, 3.0, 1.0, 
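
To make the meaning of such a matrix concrete, here is a small sketch that builds the same kind of row-by-column cardinality matrix with exact `Set`s instead of HllSets. The table contents and the helper `cell_tokens` are made up for illustration, and `Store.get_card_matrix` may work differently internally.

```julia
# A made-up 2x3 table with single-token and multi-token cells.
table = ["Serious" "Male"   "Kensington and Chelsea";
         "Slight"  "Female" "Car and Taxi"]

# Hypothetical tokenizer: split a cell on whitespace and keep distinct tokens.
cell_tokens(cell) = Set(String.(split(cell)))

# One token set per row and one per column.
row_sets = [mapreduce(cell_tokens, union, table[i, :]) for i in axes(table, 1)]
col_sets = [mapreduce(cell_tokens, union, table[:, j]) for j in axes(table, 2)]

# Entry (i, j) is the number of tokens shared by row i and column j.
card_matrix = [Float64(length(intersect(r, c))) for r in row_sets, c in col_sets]

for row in eachrow(card_matrix)
    println(row)
end
```

With exact sets the entries are exact counts. With HllSets they are estimates, so individual entries can be off by a small amount.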

In [20]:
# Run get_node_matrix to extract the row and column nodes of the CSV file.
# Each cell of the resulting matrix holds a node that represents the intersection
# of the corresponding row and column of the original CSV file.
node_matrix = Store.get_node_matrix(db, source_id)

for row in eachrow(node_matrix)
    println(row)
end

Main.Graph.Node[Node(966ad5e5d23b8586ac43df4aa4264014ab4d4ce9; ["132ec1bf601f4f40e08442e1b0dfbf86637718a2_4b925186fe6753be2ed6908976e44e0f7630d40f"]; props: source="132ec1bf601f4f40e08442e1b0dfbf86637718a2", target="4b925186fe6753be2ed6908976e44e0f7630d40f"), Node(fb67889a1103ac16d991e732d87616e025d655f7; ["132ec1bf601f4f40e08442e1b0dfbf86637718a2_5c6ad737758c030a92351e80c15650712fa06108"]; props: source="132ec1bf601f4f40e08442e1b0dfbf86637718a2", target="5c6ad737758c030a92351e80c15650712fa06108"), Node(1a0994d3adeb82d9bdf8aaf490e77d46507e381c; ["132ec1bf601f4f40e08442e1b0dfbf86637718a2_1d29c5326d2292d8717e189a12ca4bd4cbac8b76"]; props: source="132ec1bf601f4f40e08442e1b0dfbf86637718a2", target="1d29c5326d2292d8717e189a12ca4bd4cbac8b76"), Node(4a9b66603d1b6446899a1e7fcbbd2bd8378d4e61; ["132ec1bf601f4f40e08442e1b0dfbf86637718a2_d87927e475744f6280feca8fd040dd42d07dba4a"]; props: source="132ec1bf601f4f40e08442e1b0dfbf86637718a2", target="d87927e475744f6280feca8fd040dd42d07dba4a"), Node(fce

In [21]:
# Finally, we recreate the original CSV file (a sample in our case)
# from the node matrix.
#
# Keep in mind that the contents of each cell in the matrix will not necessarily
# be the same as in the original CSV file.
# Possible discrepancies include a wrong order of tokens in multi-token cells,
# missing cells, etc.
# This is a natural result of the probabilistic approximation applied to the
# original CSV file: it was tokenized, compacted into HllSets, and then reconstructed.
value_matrix = Store.get_value_matrix(db, source_id)

for row in eachrow(value_matrix)
    println(row)
end

["[\"Serious\"]", "[\"Pedestrian\"]", "[\"Male\"]", "[]", "[\"Friday\"]", "[\"Kensington\",\"Chelsea\",\"and\"]", "[\"Motorcycle\",\"over\",\"and\"]", "[\"Not\",\"Pedestrian\",\"pedestrian\"]"]
["[\"Serious\"]", "[\"Pedestrian\"]", "[\"Female\"]", "[]", "[\"Wednesday\"]", "[\"Kensington\",\"Chelsea\",\"and\"]", "[\"car\",\"Taxi/Private\",\"and\",\"hire\"]", "[\"crossing\",\"Pedestrian\",\"Crossing\",\"ped.\",\"facility\"]"]
["[\"Serious\"]", "[\"rider\",\"Driver\"]", "[\"Male\"]", "[]", "[\"Monday\"]", "[\"Kensington\",\"Chelsea\",\"and\"]", "[\"cycle\",\"Motor\",\"and\",\"under\"]", "[\"crossing\",\"carriageway\",\"elsewhere\"]"]
["[\"Serious\"]", "[\"Pedestrian\"]", "[\"Male\"]", "[]", "[\"Monday\"]", "[\"Kensington\",\"Chelsea\",\"and\"]", "[\"cycle\",\"Motor\",\"and\",\"under\"]", "[\"Not\",\"Pedestrian\",\"pedestrian\"]"]
["[\"Serious\"]", "[\"Pedestrian\"]", "[\"Male\"]", "[]", "[\"Sunday\"]", "[\"Kensington\",\"Chelsea\",\"and\"]", "[\"Car\",\"and\"]", "[\"Not\",\"Pedestrian\",\
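
The loss of token order and of repeated tokens can be reproduced without HllSets at all. The sketch below uses a plain `Set` and a made-up cell value. It only illustrates why a set-based encoding cannot restore the original text exactly, not how `Store.get_value_matrix` is implemented.

```julia
cell = "Taxi/Private hire car and hire"   # made-up multi-token cell

# Tokenize the cell and keep only the distinct tokens, as a set-based encoding does.
cell_tokens = String.(split(cell))
distinct_tokens = Set(cell_tokens)

# A set has no order, so any reconstruction has to pick one; here we sort.
restored = join(sort(collect(distinct_tokens)), " ")

println("original: ", cell)
println("restored: ", restored)
```

The restored string contains the same distinct tokens as the original, but the repeated token appears only once and the original order is gone.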