Skip to content

Commit

Permalink
Update README.md
Browse files Browse the repository at this point in the history
Harmonized the readme file's examples and reflected some of the recent
additions and changes in the documentation and functionality,
especially the use of the new `hash_functions` module.
  • Loading branch information
cobaltine committed May 22, 2023
1 parent 3ab6e6f commit 1594a03
Showing 1 changed file with 87 additions and 63 deletions.
150 changes: 87 additions & 63 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,12 +11,12 @@ Create consistent and comparable fingerprints with secure hashes from unordered

A JSON fingerprint consists of three parts: the version of the underlying canonicalization algorithm, the hash function used and a hexadecimal digest of the hash function output. An example output could look like this: `jfpv1$sha256$5815eb0ce6f4e5ab0a771cce2a8c5432f64222f8fd84b4cc2d38e4621fae86af`.

| Fingerprint element | Description |
|:--------------------|:------------------------------------------------------------------------------------|
| $ | JSON fingerprint element separator |
| jfpv1 | JSON fingerprint version identifier: **j**son **f**inger**p**rint **v**ersion **1** |
| sha256 | Hash function identifier (sha256, sha384 or sha512) |
| 5815eb0c...1fae86af | The secure hash function output in hexadecimal format |
| Fingerprint element | Description |
|:--------------------|:-------------------------------------------------------------------------------------|
| jfpv1 | JSON fingerprint version identifier: **j**son **f**inger**p**rint **v**ersion **1** |
| $ | JSON fingerprint element separator |
| sha256 | Hash function identifier (sha256, sha384 or sha512) |
| 5815eb0c...1fae86af | The secure hash function output in hexadecimal format |


<!-- TOC titleSize:2 tabSpaces:2 depthFrom:2 depthTo:6 withLinks:1 updateOnSave:1 orderedList:0 skip:0 title:1 charForUnorderedList:* -->
Expand Down Expand Up @@ -77,20 +77,22 @@ JSON fingerprints can be created with the `create()` function, which requires th

```python
import json
import json_fingerprint

input_1 = json.dumps([3, 2, 1, [True, False], {"foo": "bar"}])
input_2 = json.dumps([2, {"foo": "bar"}, 1, [False, True], 3]) # Different order
fp_1 = json_fingerprint.create(input=input_1, hash_function="sha256", version=1)
fp_2 = json_fingerprint.create(input=input_2, hash_function="sha256", version=1)
print(f"Fingerpr. 1: {fp_1}")
print(f"Fingerpr. 2: {fp_2}")
import json_fingerprint
from json_fingerprint import hash_functions

json_1 = json.dumps([3, 2, 1, [True, False], {"foo": "bar"}])
json_2 = json.dumps([2, {"foo": "bar"}, 1, [False, True], 3]) # Different order
fp_1 = json_fingerprint.create(input=json_1, hash_function=hash_functions.SHA256, version=1)
fp_2 = json_fingerprint.create(input=json_2, hash_function=hash_functions.SHA256, version=1)
print(f"Fingerprint 1: {fp_1}")
print(f"Fingerprint 2: {fp_2}")
```
This will output two identical fingerprints regardless of the different order of the json elements:

```
Fingerpr. 1: jfpv1$sha256$2ecb0c919fcb06024f55380134da3bbaac3879f98adce89a8871706fe50dda03
Fingerpr. 2: jfpv1$sha256$2ecb0c919fcb06024f55380134da3bbaac3879f98adce89a8871706fe50dda03
This will output two identical fingerprints regardless of the different order of the json elements:
```text
Fingerprint 1: jfpv1$sha256$2ecb0c919fcb06024f55380134da3bbaac3879f98adce89a8871706fe50dda03
Fingerprint 2: jfpv1$sha256$2ecb0c919fcb06024f55380134da3bbaac3879f98adce89a8871706fe50dda03
```

Since JSON objects with identical data content and structure will always produce identical fingerprints, the fingerprints can be used effectively for various purposes. These include finding duplicate JSON data from a larger dataset, JSON data cache validation/invalidation and data integrity checking.
Expand All @@ -105,16 +107,16 @@ import json_fingerprint

fp = "jfpv1$sha256$2ecb0c919fcb06024f55380134da3bbaac3879f98adce89a8871706fe50dda03"
version, hash_function, hash = json_fingerprint.decode(fingerprint=fp)
print(f"Version (integer): {version}")
print(f"JSON fingerprint version: {version}")
print(f"Hash function: {hash_function}")
print(f"Secure hash: {hash}")
print(f"Hash hex digest: {hash}")
```
This will output the individual elements that make up a fingerprint as follows:

```
Version (integer): 1
This will output the individual elements that make up a fingerprint as follows:
```text
JSON fingerprint version: 1
Hash function: sha256
Secure hash: 164e2e93056b7a0e4ace25b3c9aed9cf061f9a23c48c3d88a655819ac452b83a
Hash hex digest: 2ecb0c919fcb06024f55380134da3bbaac3879f98adce89a8871706fe50dda03
```


Expand All @@ -124,20 +126,23 @@ The `match()` is another convenience function that matches JSON data against a f

```python
import json

import json_fingerprint

input_1 = json.dumps([3, 2, 1, [True, False], {"foo": "bar"}])
input_2 = json.dumps([3, 2, 1])
json_1 = json.dumps([3, 2, 1, [True, False], {"foo": "bar"}])
json_2 = json.dumps([3, 2, 1])
# "target_fp" contains the JSON fingerprint of "json_1"
target_fp = "jfpv1$sha256$2ecb0c919fcb06024f55380134da3bbaac3879f98adce89a8871706fe50dda03"
match_1 = json_fingerprint.match(input=input_1, target_fingerprint=target_fp)
match_2 = json_fingerprint.match(input=input_2, target_fingerprint=target_fp)
print(f"Fingerprint matches with input_1: {match_1}")
print(f"Fingerprint matches with input_2: {match_2}")
match_1 = json_fingerprint.match(input=json_1, target_fingerprint=target_fp)
match_2 = json_fingerprint.match(input=json_2, target_fingerprint=target_fp)
print(f"Fingerprint matches with json_1: {match_1}")
print(f"Fingerprint matches with json_2: {match_2}")
```

This will output the following:
```
Fingerprint matches with input_1: True
Fingerprint matches with input_2: False
```text
Fingerprint matches with json_1: True
Fingerprint matches with json_2: False
```


Expand All @@ -147,11 +152,13 @@ The `find_matches()` function takes a JSON string and a list of JSON fingerprint

```python
import json

import json_fingerprint

# Produces SHA256: jfpv1$sha256$d119f4d8...b1710d9f
# Produces SHA384: jfpv1$sha384$9bca46fd...fd0e2e9c
input = json.dumps({"foo": "bar"})
# "data" produces jfpv1 (SHA256): jfpv1$sha256$d119f4d8...b1710d9f
# "data produces jfpv1 (SHA384): jfpv1$sha384$9bca46fd...fd0e2e9c
json_data = json.dumps({"foo": "bar"})

fingerprints = [
# SHA256 match
"jfpv1$sha256$d119f4d8b802091520162b78f57a995a9ecbc88b20573b0c7e474072b1710d9f",
Expand All @@ -163,17 +170,20 @@ fingerprints = [
# SHA256, not a match
"jfpv1$sha256$73f7bb145f268c033ec22a0b74296cdbab1405415a3d64a1c79223aa9a9f7643",
]
matches = json_fingerprint.find_matches(input=input, fingerprints=fingerprints)
# Print raw matches, which include 2 same SHA256 fingerprints

# Output all matches, including the two identical SHA256 fingerprints
matches = json_fingerprint.find_matches(input=json_data, fingerprints=fingerprints)
print(*(f"\nMatch: {match[0:30]}..." for match in matches))
deduplicated_matches = json_fingerprint.find_matches(input=input,

# Output deduplicated matches, removing one redundant SHA256 match
deduplicated_matches = json_fingerprint.find_matches(input=json_data,
fingerprints=fingerprints,
deduplicate=True)
# Print deduplicated matches
print(*(f"\nDeduplicated match: {match[0:30]}..." for match in deduplicated_matches))
```

This will output the following results, first the list with a duplicate and the latter with deduplicated results:
```
```text
Match: jfpv1$sha256$d119f4d8b80209152...
Match: jfpv1$sha256$d119f4d8b80209152...
Match: jfpv1$sha384$9bca46fd7ef7aa2e1...
Expand All @@ -182,16 +192,19 @@ Deduplicated match: jfpv1$sha384$9bca46fd7ef7aa2e1...
Deduplicated match: jfpv1$sha256$d119f4d8b80209152...
```


## JSON normalization

The jfpv1 JSON fingerprint function transforms the data internally into a normalized (canonical) format before hashing the output.


### Alternative specifications

Most existing JSON normalization/canonicalization specifications and related implementations operate on three key aspects: data structures, values and data ordering. While the ordering of key-value pairs (objects) is straightforward, issues usually arise from the ordering of arrays.

The JSON specifications, including the most recent [RFC 8259](https://tools.ietf.org/html/rfc8259), have always considered the order of array elements to be _meaningful_. As data gets serialized, transferred, deserialized and serialized again throughout various systems, maintaining the order of array elements becomes impractical if not impossible in many cases. As a consequence, this makes the creation and comparison of secure hashes of JSON data across multiple systems a complex process.


### JSON Fingerprint v1 (jfpv1)

The jfpv1 specification takes a more _value-oriented_ approach toward JSON normalization and secure hash creation: values and value-related metadata bear most significance when JSON data gets normalized into the jfpv1 format. The original JSON data gets transformed into a flattened list of small objects, which are then hashed and sorted, and ultimately hashed again as a whole.
Expand All @@ -203,6 +216,7 @@ In practice, the jfpv1 specification purposefully ignores the original order of

In the case of arrays, each array gets a unique hash identifier based on the data elements it holds. This way, each flattened value "knows" to which array it belongs to. This identifier is called a _sibling hash_ because it is derived from each array element's value as well as its neighboring values.


## Performance

The JSON fingerprint v1 specification and its first implementation have been designed with a primary focus on functional utility over performance. There are some performance-related characteristics that are good to be aware of:
Expand All @@ -212,15 +226,18 @@ The JSON fingerprint v1 specification and its first implementation have been des

Below are some examples of the performance impact when processing different types of data structures.


### Example 1: flat data structures

Processing an array of arrays with the maximum depth of 2 levels for each datum:

```python
import json
import json_fingerprint
import time

import json_fingerprint
from json_fingerprint import hash_functions

data = json.dumps(
[
[1, 2],
Expand All @@ -234,28 +251,31 @@ data = json.dumps(
start_time = time.time_ns() # Measure time in nanoseconds
iterations = 1000
for i in range(iterations):
json_fingerprint.create(input=data, hash_function="sha256", version=1)
json_fingerprint.create(input=data, hash_function=hash_functions.SHA256, version=1)
end_time = time.time_ns()
elapsed_time_ms = round(((end_time - start_time) / iterations / 1000000), 2) # To milliseconds
print(f"Average processing time per JSON fingerprint: {elapsed_time_ms} milliseconds")
duration = round(((end_time - start_time) / iterations / 1000000), 2) # To milliseconds
print(f"Average processing time per JSON fingerprint: {duration} milliseconds")
```

Performance test results:
```commandline
```text
Average processing time per JSON fingerprint: 0.27 milliseconds
```

As seen in the test results, flat data structures perform well on modern computer hardware.


### Example 2: nested data structures

Processing a nested array of arrays with datums `11` and `12` 7 levels deep in the data structure:

```python
import json
import json_fingerprint
import time

import json_fingerprint
from json_fingerprint import hash_functions

data = json.dumps(
[
[1, 2, [3, 4, [5, 6, [7, 8, [9, 10, [11, 12]]]]]],
Expand All @@ -264,55 +284,59 @@ data = json.dumps(
start_time = time.time_ns() # Measure time in nanoseconds
iterations = 1000
for i in range(iterations):
json_fingerprint.create(input=data, hash_function="sha256", version=1)
json_fingerprint.create(input=data, hash_function=hash_functions.SHA256, version=1)
end_time = time.time_ns()
elapsed_time_ms = round(((end_time - start_time) / iterations / 1000000), 2) # To milliseconds
print(f"Average processing time per JSON fingerprint: {elapsed_time_ms} milliseconds")
duration = round(((end_time - start_time) / iterations / 1000000), 2) # To milliseconds
print(f"Average processing time per JSON fingerprint: {duration} milliseconds")
```

Performance test results:
```commandline
```text
Average processing time per JSON fingerprint: 2.75 milliseconds
```

Compared to the flat data structure with the same amount of data, the nesting of arrays increased the processing time tenfold.


### Example 3: big JSON objects

Processing a dynamically generated JSON object of three different sizes: `~256KiB`, `~512KiB`, and `~1MiB`:

```python
import json
import json_fingerprint
import time

import json_fingerprint
from json_fingerprint import hash_functions


def test_performance(data: str, size: str) -> None:
start_time = time.time_ns() # Measure time in nanoseconds
iterations = 1000
for i in range(iterations):
json_fingerprint.create(input=data, hash_function="sha256", version=1)
json_fingerprint.create(input=data, hash_function=hash_functions.SHA256, version=1)
end_time = time.time_ns()
elapsed_time_ms = round(((end_time - start_time) / iterations / 1000000), 2) # To milliseconds
print(f"Average processing time per JSON fingerprint ({size}): {elapsed_time_ms} milliseconds")
duration = round(((end_time - start_time) / iterations / 1000000), 2) # To milliseconds
print(f"Average processing time per JSON fingerprint ({size}): {duration} milliseconds")


text_block = "abcdefg " * 16384 # a single text element, total size ~128KiB
text_list = ["hijklmn " * 128 for i in range(128)] # 128 * 1KiB text elements, total size ~128KiB
test_performance(json.dumps({"text_block": text_block, "text_list": text_list}), "~256KiB")
test_performance(json.dumps({"text_block": text_block * 2, "text_list": text_list * 2}), "~512KiB")
test_performance(json.dumps({"text_block": text_block * 4, "text_list": text_list * 4}), "~1MiB")
text = "abcdefg " * 16384 # a single text element (~128KiB)
text_list = ["hijklmn " * 128 for i in range(128)] # 128 * 1KiB text elements (~128KiB)
test_performance(json.dumps({"text": text, "text_list": text_list}), "~256KiB")
test_performance(json.dumps({"text": text * 2, "text_list": text_list * 2}), "~512KiB")
test_performance(json.dumps({"text": text * 4, "text_list": text_list * 4}), "~1MiB")
```

Performance test result:
```commandline
```text
Average processing time per JSON fingerprint (~256KiB): 2.91 milliseconds
Average processing time per JSON fingerprint (~512KiB): 5.42 milliseconds
Average processing time per JSON fingerprint (~1MiB): 11.18 milliseconds
```

Processing fairly sizeable JSON objects with text content in a flat structure scales linearly.


## Running tests

The entire internal test suite of json-fingerprint is included in its distribution package. If you wish to run the internal test suite, install the package and run the following command:
Expand All @@ -321,10 +345,10 @@ The entire internal test suite of json-fingerprint is included in its distributi

If all tests ran successfully, this will produce an output similar to the following:

```
..............................
```text
...............................
----------------------------------------------------------------------
Ran 30 tests in 0.007s
Ran 31 tests in 0.008s
OK
```

0 comments on commit 1594a03

Please sign in to comment.