## Analysis and evaluation of PII regexes

In this notebook we do some exploratory analysis of our PII detection tools using the annotated dataset. We observe that:
1. For SSH & API keys detection:
* the detect-secrets tool has a very low recall (it detects 2 out of 37 keys). Update: this tool apparently needs context even though it uses regexes. TODO: rerun the evaluation on whole code files instead of isolated instances.
* our regex for keys: https://regex101.com/r/pndDnd/1 : 
    * has a high recall (detects 28 out of 37 keys, and probably more because some keys had dots inside so the regex split them into multiple keys)
    * has many false positives that are paths or words joined by ":" or "_"; you can find them in `./experiments/before_gibberish_keys.txt`
    * one way to increase precision is to use a **Gibberish detector**: if a detected key is not labeled as gibberish (i.e. it is word-like), we discard it.
    This removes 174 false positives, which you can find in `./experiments/before_gibberish_keys.txt`
    * **TO IMPROVE:**
    * Some hashes are not labelled as gibberish by the gibberish detector (=> not filtered). It is unclear whether they are really secrets; for an example see `./experiments/file_with_hashes.txt` (some other hashes from that file are filtered, though)
    * There are still some false positives, such as names and paths that are labeled as gibberish, in this format: "e2e_mask_rcnn_X-152-32x8d-FPN-IN5k_1.44x" and "//deno.land/std@0.142.0/testing/asserts.ts"
    * If there is an "=" or "id=" in front of the key, it is included in the detection
    * Some instances like "f47dbc9c:" and "dc22a3aa:" are detected; they seem to be IDs of patch releases, and their context is saved in `./experiments/short_keys_patch_releases.txt`
    * You can check all detected keys by looking for 'KEY' tags in `./experiments/list_detected_pii.txt`
* TODO: get precision numbers and try adding more filters (from detect-secrets, for example)
2. For email detection:
* **TO IMPROVE:**
* our regex https://regex101.com/r/8CsR5P/1 and the updated bigscience regex https://regex101.com/r/LNwpG1/1 both label many strings like "dusk-network/icon@4.5.0" as emails
* the updated bigscience regex does not reliably detect emails that are enclosed in "<" and ">", as in `<email>`.
* our regex detected noreply@127.0.0.1 as an email
* both regexes have a high recall on the list of emails we detected (without delimiters)
* TODO: more comparison of the two regexes and precision/recall numbers and use of context detection
3. IP addresses: TODO

In [1]:
from datasets import load_dataset

# this dataset has lists of PII without context
ds = load_dataset("bigcode/pii-instances", use_auth_token=True, split="train")



In [2]:
from utils.emails_ip_addresses_detection import detect_email_addresses
from utils.keys_detection import detect_keys

In [3]:
# small test
text = """this is a test example with an email random@hf.co and address 10.1.1.1
          aws_access_key_id=AKIAIOSFODNN7EXAMPLE
          aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
          randomstring=b3BlbnNzaC1rZXktdjEAAAAACmFlczI1Ni1jdHIAAAAGYmNyeXB0AAAAGAAAABDHn"""
keys = detect_keys(text)
emails_ip_adresses = detect_email_addresses(text, new_email_regex=False)

In [93]:
print(keys)

[{'tag': 'AWS Access Key', 'value': 'AKIAIOSFODNN7EXAMPLE', 'start': 99, 'end': 119}]


In [94]:
print(emails_ip_adresses)

[{'tag': 'EMAIL', 'value': 'random@hf.co', 'start': 37, 'end': 49}, {'tag': 'IP_ADDRESS', 'value': '10.1.1.1', 'start': 62, 'end': 70}]


In [4]:
other_keys = detect_email_addresses(text, tag_types={"KEY"}, new_email_regex=False)
other_keys

[{'tag': 'KEY',
  'value': '=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
  'start': 151,
  'end': 192},
 {'tag': 'KEY',
  'value': '=b3BlbnNzaC1rZXktdjEAAAAACmFlczI1Ni1jdHIAAAAGYmNyeXB0AAAAGAAAABDHn',
  'start': 215,
  'end': 281}]

## Tests
Let's test the pipelines on our PII instances

Due to an issue with LightTag, we can't download the annotations together with the reviews. We need to clean the samples manually and remove the empty samples that were added to the dataset to keep the same number of rows.

In [6]:
# select the correctly annotated API key samples
ds_api = ds.select([i for i in range(17) if i not in [7, 8, 13, 14, 15]] + [i for i in range(43, 80) if i not in [44, 63, 69, 72]])

In [7]:
len(ds_api["API_KEY"])

45

In [11]:
api_keys_clean = set(ds_api["API_KEY"])

In [52]:
len(api_keys_clean)

30

In [82]:
ssh_keys_clean = ds["SSH_KEY"][:7]
ssh_keys_clean

['6622386f8d83dc9efefb8c03a4dbfc18e7928d89ffc2ec3e2feb9473e8f410c9',
 '546d57b6c88c2be7517759c016c0bf0313dfcc14adfcb43967f3c5d24657f366',
 '76d8ae334545bbdf2db49414c25d2cfd8685e7b6187f119b28e93ad9c5118e9d',
 '43e0352fee07fa5b92dd22e557cb1d050ccde0cf97273e02f694930695b15134',
 'c9eb8a1102d0a68cafc93f22df73445b8f69706f3322285f9a2f623a28df0176',
 'eff634a68a01d081c0bdc51752dfa0709781f0e4',
 '4d986a461d1b24bb5776fb49063b9a1891939f336b306a6bc75f58d0a4e98bcb']

In [9]:
addresses_clean = ds["IP_ADDRESS"][:95]

Detect API keys

In [12]:
detect_secrets_results = []
regexes_results = []
detect_secrets_nb = 0
regexes_nb = 0
for key in api_keys_clean:
    output_1 = detect_keys(key)
    output_2 = detect_email_addresses(key, tag_types={"KEY"}, new_email_regex=False)
    if output_1:
        detect_secrets_nb += 1
        detect_secrets_results.append(output_1)
    if output_2:
        regexes_nb += 1
        regexes_results.append(output_2)

In [16]:
print(f"nb detected by detect-secrets: {detect_secrets_nb}")
print(f"nb detected by regexes: {regexes_nb}")
print(f"number true API keys: {len(api_keys_clean)}")

nb detected by detect-secrets: 2
nb detected by regexes: 21
number true API keys: 30


In [101]:
detect_secrets_results

[[{'tag': 'JSON Web Token',
   'value': 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpZCI6IjU5ODFmMTY3MjEyYjM0OGFlZDdmYTlmNSIsInNjb3BlIjpbImFkbWluIiwiZXZlbnRfbWFuYWdlciIsImV2ZW50X2xvZ2dlciIsImV2ZW50X3dhdGNoZXIiXSwiaWF0IjoxNTI1MDE0NDE3fQ.',
   'start': 0,
   'end': 192}],
 [{'tag': 'JSON Web Token',
   'value': 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpZCI6IjU5ODFmMTY3MjEyYjM0OGFlZDdmYTlmNSIsInNjb3BlIjpbImFkbWluIiwiZXZlbnRfbWFuYWdlciIsImV2ZW50X2xvZ2dlciIsImV2ZW50X3dhdGNoZXIiXSwiaWF0IjoxNTI1MDEzNTUxfQ.',
   'start': 0,
   'end': 192}]]

detect-secrets has a very low recall: 2 out of 30. Let's analyze the regex detections.

In [109]:
res = 0
values = []
for i, elem in enumerate(regexes_results):
    if len(elem) != 1:
        print(f"\ndetection was split at {i} for {elem}\n")
    else:
        value = elem[0]["value"]
        if value in api_keys_clean:
            res += 1
            values.append(value)
        else:
            print(f"\nwrong detection at {i} for {value}\n")
print(f"Number of correctly detected strings: {res}")


wrong detection at 6 for ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-RC4-SHA:ECDHE-RSA-AES128-SHA:AES128-GCM-SHA256:RC4:HIGH:


wrong detection at 7 for 476611152863-ltgqfk9jhq1vsenin5039n58ogkraltb


detection was split at 11 for [{'tag': 'KEY', 'value': 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9', 'start': 0, 'end': 36}, {'tag': 'KEY', 'value': 'eyJpZCI6IjU5ODFmMTY3MjEyYjM0OGFlZDdmYTlmNSIsInNjb3BlIjpbImFkbWluIiwiZXZlbnRfbWFuYWdlciIsImV2ZW50X2xvZ2dlciIsImV2ZW50X3dhdGNoZXIiXSwiaWF0IjoxNTI1MDE0NDE3fQ', 'start': 37, 'end': 191}, {'tag': 'KEY', 'value': 'D8ja66bnLxJ3bsJlaKRtOquu8XbibjNCyFxJpI7vafc', 'start': 192, 'end': 235}]


detection was split at 12 for [{'tag': 'KEY', 'value': 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9', 'start': 0, 'end': 36}, {'tag': 'KEY', 'value': 'eyJpZCI6IjU5ODFmMTY3MjEyYjM0OGFlZDdmYTlmNSIsInNjb3BlIjpbImFkbWluIiwiZXZlbnRfbWFuYWdlciIsImV2ZW50X2xvZ2dlciIsImV2ZW50X3dhdGNoZXIiXSwiaWF0IjoxNTI1MDEzNTUxfQ', 'start': 37, 'end': 191}, {'tag': 'KEY', 'value': '8X-fBRUHdrwtkTLcOFAsW-vvv

In [111]:
print("missed keys")
# three of them were just truncated because they contained dots inside: not sure they are real keys
for key in api_keys_clean:
    if key not in values:
        print(key)

missed keys
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMN0123456789
ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-RC4-SHA:ECDHE-RSA-AES128-SHA:AES128-GCM-SHA256:RC4:HIGH:!MD5:!aNULL:!EDH:!CAMELLIA
476611152863-ltgqfk9jhq1vsenin5039n58ogkraltb.apps.googleusercontent.com
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMN
mfxl'vmsdv';mfdb'fdamlmdsvfdkfnjn
eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpZCI6IjU5ODFmMTY3MjEyYjM0OGFlZDdmYTlmNSIsInNjb3BlIjpbImFkbWluIiwiZXZlbnRfbWFuYWdlciIsImV2ZW50X2xvZ2dlciIsImV2ZW50X3dhdGNoZXIiXSwiaWF0IjoxNTI1MDE0NDE3fQ.D8ja66bnLxJ3bsJlaKRtOquu8XbibjNCyFxJpI7vafc
eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpZCI6IjU5ODFmMTY3MjEyYjM0OGFlZDdmYTlmNSIsInNjb3BlIjpbImFkbWluIiwiZXZlbnRfbWFuYWdlciIsImV2ZW50X2xvZ2dlciIsImV2ZW50X3dhdGNoZXIiXSwiaWF0IjoxNTI1MDEzNTUxfQ.8X-fBRUHdrwtkTLcOFAsW-vvvqCzmkZKM2gQgHNkBKk
2(iwreobf4b(-=h_p=^!obgxdgn3_*s!17=_3wc4dun9_y^q+c
rSHvhgdOQUB4KMc5JS1alzhg
6595b64144ccf1df
AIzaasdf
0123456789abcdefghijklmno
ABCDEFGHIJKLMNabcdefghijklmnopqrstuvwxyz0123456789
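
The split detections above come from keys that contain dots (the JWT-like tokens are cut into three spans separated by a single "."). One possible post-processing step, sketched here under the assumption that detections are dicts with `value`, `start` and `end` as in the outputs above, is to merge consecutive spans when the only character between them is a dot. `merge_dot_split` is a hypothetical helper, not part of `utils`.

```python
def merge_dot_split(text, detections):
    """Merge consecutive detections separated by exactly one '.' in `text`.

    Hypothetical helper: assumes each detection is a dict with
    'value', 'start' and 'end' keys, where 'end' is exclusive.
    """
    merged = []
    for det in sorted(detections, key=lambda d: d["start"]):
        prev = merged[-1] if merged else None
        # Adjacent spans with a single dot in between belong to the same token
        if (
            prev is not None
            and det["start"] == prev["end"] + 1
            and text[prev["end"]] == "."
        ):
            prev["end"] = det["end"]
            prev["value"] = text[prev["start"]:prev["end"]]
        else:
            merged.append(dict(det))  # copy so the input list is not mutated
    return merged


token = "hdr.payload.sig"
parts = [
    {"tag": "KEY", "value": "hdr", "start": 0, "end": 3},
    {"tag": "KEY", "value": "payload", "start": 4, "end": 11},
    {"tag": "KEY", "value": "sig", "start": 12, "end": 15},
]
print(merge_dot_split(token, parts))
```

Whether merging is desirable depends on the answer to the open question of whether real API keys contain dots; it would also glue together unrelated dotted identifiers that happen to match the key regex.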


Detect SSH keys

In [17]:
detect_secrets_results = []
regexes_results = []
detect_secrets_nb = 0
regexes_nb = 0
for key in ssh_keys_clean:
    output_1 = detect_keys(key)
    output_2 = detect_email_addresses(key, tag_types={"KEY"}, new_email_regex=False)
    if output_1:
        detect_secrets_nb += 1
        detect_secrets_results.append(output_1)
    if output_2:
        regexes_nb += 1
        regexes_results.append(output_2)

In [18]:
print(f"nb detected by detect-secrets: {detect_secrets_nb}")
print(f"nb detected by regexes: {regexes_nb}")
print(f"number true ssh keys: {len(ssh_keys_clean)}")

nb detected by detect-secrets: 0
nb detected by regexes: 7
number true ssh keys: 7


In [124]:
res = 0
values = []
for i, elem in enumerate(regexes_results):
    if len(elem) != 1:
        print(f"\ndetection was split at {i} for {elem}\n")
    else:
        value = elem[0]["value"]
        if value in ssh_keys_clean:
            res += 1
            values.append(value)
        else:
            print(f"\nwrong detection at {i} for {value}\n")
print(f"number of correctly detected strings: {res}")

number of correctly detected strings: 7


Remarks & questions:
* some of the missed keys included dots; can API keys include dots?
* should we add this regex to the detect-secrets plugins so that we can use its filters on top of it?

Observations:
* detect-secrets is not able to detect most API keys and misses all SSH keys
* our regex for keys detects all SSH keys and 17 out of 30 API keys; 2 keys are split into 3 parts because they contain two dots, and most of the remaining keys may not be real API keys

=>
* detect-secrets has a very low recall (even with no filters); the other secret keywords have many false positives, so we can't add them.
* our regex seems to have a high recall (very few missed keys)
* let's measure its precision by running it on the original code files
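
The counts quoted above are exact-match recall: a key counts as found only if the detected value equals the annotated string. A minimal sketch of that computation follows; `exact_match_recall` is a hypothetical name and the sample strings are made up for illustration.

```python
def exact_match_recall(detected_values, reference_values):
    """Fraction of reference strings that appear verbatim among the detections."""
    refs = set(reference_values)
    if not refs:
        return 0.0  # convention chosen here for an empty reference set
    hits = refs & set(detected_values)
    return len(hits) / len(refs)


# Toy data, not taken from the dataset
references = ["key_a", "key_b", "key_c", "key_d"]
detected = ["key_a", "key_b", "not_a_key"]
print(exact_match_recall(detected, references))
```

With the numbers from this notebook, 17 exact matches out of 30 API keys gives a recall of about 0.57, and 7 out of 7 SSH keys gives 1.0. Exact matching undercounts keys that were detected but split or truncated.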

In [1]:
from datasets import load_dataset

ds_full = load_dataset("bigcode/pii-for-code", use_auth_token=True, split="train")

Using custom data configuration bigcode--pii-for-code-2810c83b744e2a86
Found cached dataset json (/Users/loubnabenallal/.cache/huggingface/datasets/bigcode___json/bigcode--pii-for-code-2810c83b744e2a86/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab)


In [2]:
from detect_pii import scan_pii_batch_viz

examples = ds_full.select(range(100))
outputs = scan_pii_batch_viz(examples, key_detector="regex", new_email_regex=False)

id: 6 for detectedt key: 'f47dbc9c:' in context: s@5.0.2
  - @dusk-network/button@5.0.2
  - @dusk-network/menu@5.0.2

## 5.0.1

### Patch Changes

- f47dbc9c: Release
- Updated dependencies [f47dbc9c]
  - @dusk-network/icon@5.0.1
  - @dusk-network/helpers@5.

id: 6 for detectedt key: 'dc22a3aa:' in context: s@3.0.7
  - @dusk-network/button@3.0.7
  - @dusk-network/menu@3.0.7

## 3.0.6

### Patch Changes

- dc22a3aa: testing changesets
- Updated dependencies [dc22a3aa]
  - @dusk-network/icon@3.0.6
  - @dusk-network



#### Gibberish detector
Adding the gibberish detector removes 173 false positives like:
* ar/www/rajkdjango2/bin/python, param_and_buffer_names_set:, Msf::Exploit::Remote::SNMPClient, d2/d24/interfaceZeebe_1_1Client_1_1Api_1_1Builder_1_1IAccessTokenSupplier

But also removes 8 hashes like these from files 31, 37:
* d3d43ab4e03fdf106b9191f4e0161cfcde3f040e, d3d43ab4e03fdf106b9191f4e0161cfcde3f040e 8d11fab63089a24c8b17063d29a4b0eac359fb41

Strings like `e2e_faster_rcnn_R-101-FPN_1x` are considered gibberish and are thus still detected as keys
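
The filtering rule (keep a regex match only if the detector calls it gibberish) can be sketched as below. This is an illustration, not the code in `utils/`: `normalize`, `filter_keys` and `toy_is_gibberish` are hypothetical names, and the toy predicate is only a stand-in so the example runs without a trained model. In practice the predicate would be `Detector.is_gibberish` from the `gibberish_detector` package, as used in the cells of this section.

```python
def normalize(value):
    # Mirrors the normalization used in one of the cells of this section:
    # separators become spaces and the string is lowercased
    return value.replace("_", " ").replace("-", " ").lower()


def filter_keys(detections, is_gibberish):
    """Keep only detections whose normalized value is judged gibberish."""
    return [d for d in detections if is_gibberish(normalize(d["value"]))]


def toy_is_gibberish(text):
    # ASSUMPTION: crude stand-in, not the real detector.
    # Treats strings with a high share of digits as gibberish.
    digits = sum(c.isdigit() for c in text)
    return digits / max(len(text), 1) > 0.3


candidates = [
    {"tag": "KEY", "value": "cb2f8b691ccf3eae9846c67735f413a49befea28"},
    {"tag": "KEY", "value": "param_and_buffer_names_set:"},
]
print(filter_keys(candidates, toy_is_gibberish))
```

Passing the predicate as an argument keeps the filter independent of the model file, which makes it easy to compare detectors or thresholds on the same list of candidates.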

In [15]:
len("37697547/e2e_keypoint_rcnn_R-50-FPN_1x")

38

In [19]:
text = "hello loubna\n is sha=anna but\n here is it"
first_index = text.index("anna")
lines = text[:first_index].splitlines()
lines[-1]

' is sha='

In [1]:
from gibberish_detector import detector
Detector = detector.create_from_model('gibberish_data/big.model')
Detector.is_gibberish("d5/d02/interfaceZeebe_1_1Client_1_1Api_1_1Commands_1_1IPublishMessageCommandStep3".replace("_", " ").replace("-", " ").lower())

False

In [18]:
text = "cb2f8b691ccf3eae9846c67735f413a49befea28"
Detector.is_gibberish(text.lower())

True

In [4]:

Detector.is_gibberish("27dcfe42e3fb3422b72ce48b48bf601c0a3e46e850ee72d9bdd17b5863b6e42c".replace("_", " ").replace(":", " ").lower())

True

### Email detection

* our current regex produces many false positives that are variants of: dusk-network/helpers@4.6.12
* the updated bigscience regex does not reliably detect emails in the format `<email>`, and it also labels dusk-network/helpers@4.6.12 as an email, see https://regex101.com/r/LNwpG1/1
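
One way to reduce the `package@version` false positives is a post-filter that rejects matches whose domain part is purely numeric. The sketch below is an assumption-laden illustration: `EMAIL_RE` is a simplified stand-in, not either of the regexes linked above, and `find_emails` is a hypothetical function that only imitates the output format of `detect_email_addresses`.

```python
import re

# ASSUMPTION: simplified pattern for illustration only
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z0-9-]+")


def has_numeric_domain(email):
    """True if every dot-separated label after '@' is made of digits only."""
    domain = email.rsplit("@", 1)[1]
    return all(label.isdigit() for label in domain.split("."))


def find_emails(text):
    results = []
    for match in EMAIL_RE.finditer(text):
        value = match.group()
        # Drops version strings (helpers@4.6.12) and IP literals (noreply@127.0.0.1)
        if has_numeric_domain(value):
            continue
        results.append(
            {"tag": "EMAIL", "value": value, "start": match.start(), "end": match.end()}
        )
    return results


for sample in ["random@hf.co", "dusk-network/helpers@4.6.12", "<loffjh@gmail.com>"]:
    print(sample, find_emails(sample))
```

This check does not cover versions with suffixes such as `4.6.12-beta`, so it would need to be combined with context-based rules to be useful on real files.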

In [2]:
from detect_pii import scan_pii_batch_viz

examples = ds_full.select(range(100))
# to use the updated BigScience regex, set new_email_regex=True
outputs = scan_pii_batch_viz(examples, key_detector="regex", new_email_regex=True)

context e2e_faster_rcnn_X-101-32x8d-FPN_1x: : "01_33_49.iAX0mXvW",
        "35857345/e2e_faster_rcnn_R-50-FPN_1x": "01_36_30.cUF7QR7I",
        "35857890/e2e_faster_rcnn_R-101-FPN_1x": "01_38_50.sNxI7sX7",
        "36761737/e2e_faster_rcnn_X-101-32x8d-FPN_1x": "06_31_39.5MIHi1fZ",
        "35858791/e2e_mask_rcnn_R-50-C4_1x": "01_45_57.ZgkA7hPB",
        "35858933/e2e_mask_rcnn_R-50-FPN_1x": "01_48_14.DzEQe4wC",
        "35861795/e2e_m
True
context e2e_keypoint_rcnn_R-50-FPN_1x: 843/e2e_mask_rcnn_X-101-32x8d-FPN_1x": "06_35_59.RZotkLKI",
        "37129812/e2e_mask_rcnn_X-152-32x8d-FPN-IN5k_1.44x": "09_35_36.8pzTQKYK",
        # keypoints
        "37697547/e2e_keypoint_rcnn_R-50-FPN_1x": "08_42_54.kdzV35ao"
    }

    @staticmethod
    def get(name):
        if name.startswith("Caffe2Detectron/COCO"):
            return ModelCatalog.get_c2_detectron_12_2017_base
True


Let's test the recall of emails

In [3]:
from datasets import load_dataset

# this dataset has lists of PII without context
ds = load_dataset("loubnabnl/pii-instances", use_auth_token=True, split="train")

Using custom data configuration loubnabnl--pii-instances-b56ea5fc2b13487c
Found cached dataset parquet (/Users/loubnabenallal/.cache/huggingface/datasets/loubnabnl___parquet/loubnabnl--pii-instances-b56ea5fc2b13487c/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


In [68]:
# filter samples from ds with an empty email column
ds_emails = ds.filter(lambda x: x["EMAIL"] != "")["EMAIL"]
print(len(ds_emails))

Loading cached processed dataset at /Users/loubnabenallal/.cache/huggingface/datasets/loubnabnl___parquet/loubnabnl--pii-instances-b56ea5fc2b13487c/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-05bc1724d8ac05a9.arrow


189


In [69]:
# sample 44 is a wrong annotation
ds_emails = ds_emails[:44] + ds_emails[45:]

In [70]:
# fix an issue with these annotations
ds_emails[142] = ds_emails[142][:-1]
ds_emails[108] = ds_emails[108].strip()

In [71]:
ds_emails = list(set(ds_emails))
len(ds_emails)

170

In [72]:
from utils.emails_ip_addresses_detection import detect_email_addresses

old_regex_results = []
new_regex_results = []
new_regex_nb = 0
old_regex_nb = 0
for key in ds_emails:
    output_1 = detect_email_addresses(key, tag_types={"EMAIL"}, new_email_regex=False)
    output_2 = detect_email_addresses(key, tag_types={"EMAIL"}, new_email_regex=True)
    if output_1:
        old_regex_nb += 1
        old_regex_results.append(output_1)
    if output_2:
        new_regex_nb += 1
        new_regex_results.append(output_2)

In [73]:
print(f"nb emails detected by old regex: {old_regex_nb}")
print(f"nb emails detected by new BS regex: {new_regex_nb}")
print(f"number true EMAILS: {len(ds_emails)}")

nb emails detected by old regex: 170
nb emails detected by new BS regex: 169
number true EMAILS: 170


In [74]:
def get_nb_detections(results, refs, mode="old"):
    res = 0
    values = []
    for i, elem in enumerate(results):
        assert len(elem) == 1
        value = elem[0]["value"]
        if value in refs:
            res += 1
            values.append(value)
        else:
            print(f"\nwrong detection of {mode} regex at {i} for {value}\n")
    return res, values

res, values = get_nb_detections(old_regex_results, ds_emails)
res_new, values_new = get_nb_detections(new_regex_results, ds_emails, mode="new")
print(f"number of correctly detected strings with old regex: {res}")
print(f"number of correctly detected strings with new regex: {res_new}")

number of correctly detected strings with old regex: 170
number of correctly detected strings with new regex: 169


In [78]:
print("missed emails with the new regex")
# reference emails that the new regex did not return as an exact match
for key in ds_emails:
    if key not in values_new:
        print(key)

missed emails with the new regex
noreply@127.0.0.1


This is an annotation error rather than a real miss: `noreply@127.0.0.1` should not count as an email.

In [80]:
for i in range(5):
    ds_emails[i] = "<" + ds_emails[i] + ">"
ds_emails[:7]

['<loffjh@gmail.com>',
 '<Chris.Mears@monash.edu>',
 '<pychuang@gwu.edu>',
 '<nguyenthieu2102@gmail.com>',
 '<robert.kausch@gmx.net>',
 'info@srampos.com',
 'mark.samman@gmail.com']

In [81]:
from utils.emails_ip_addresses_detection import detect_email_addresses

old_regex_results = []
new_regex_results = []
new_regex_nb = 0
old_regex_nb = 0
for key in ds_emails[:7]:
    output_1 = detect_email_addresses(key, tag_types={"EMAIL"}, new_email_regex=False)
    output_2 = detect_email_addresses(key, tag_types={"EMAIL"}, new_email_regex=True)
    if output_1:
        old_regex_nb += 1
        old_regex_results.append(output_1)
    if output_2:
        new_regex_nb += 1
        new_regex_results.append(output_2)

print(f"nb emails detected by old regex: {old_regex_nb}")
print(f"nb emails detected by new BS regex: {new_regex_nb}")
print(f"number true EMAILS: {len(ds_emails[:7])}")

nb emails detected by old regex: 7
nb emails detected by new BS regex: 2
number true EMAILS: 7


In [43]:
ref = [(1, 5), (6, 9)]
pred = [(1, 10)]

In [53]:
scores = {"TP": 0, "FN": 0, "FP": 0}
alpha = 0.8
beta = 0.8

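def overlapped(a, b, alpha=0.1, beta=0.1):
    """Returns True if the overlap covers more than `alpha` of interval b and more than `beta` of interval a.

    The loop below passes alpha = beta = 0.8 (80%); the defaults are 0.1 (10%).
    """
    # length of the intersection of the two intervals (0 if they are disjoint)
    size_overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    ref_overlap = size_overlap / (b[1] - b[0])
    pred_overlap = size_overlap / (a[1] - a[0])
    return (ref_overlap > alpha and pred_overlap > beta)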

# keep indices so that we can recover the original data
detection_indice = {"TP_pred": set(), "TP_ref": set(), "FN": set(), "FP": set()}
for i, interval in enumerate(pred):
    for j, target in enumerate(ref):
        if overlapped(interval, target, alpha, beta):
            # the prediction is a true positive
            scores["TP"] += 1
            detection_indice['TP_pred'].add(i)
            detection_indice['TP_ref'].add(j)
            break
    else:
        # the prediction is a false positive
        scores["FP"] += 1
        detection_indice['FP'].add(i)
# the rest of the targets that aren't detected are false negatives
detection_indice["FN"] = set(range(len(ref))) - detection_indice["TP_ref"]
scores["FN"] = len(detection_indice["FN"])

In [54]:
def compare_intervals(ref_intervals, pred_intervals):
    """Compare two lists of intervals and return the number of true positives, false positives and false negatives
    author : @copilot
    """
    ref_intervals = sorted(ref_intervals, key=lambda x: x[0])
    pred_intervals = sorted(pred_intervals, key=lambda x: x[0])
    ref_idx = 0
    pred_idx = 0
    ref_len = len(ref_intervals)
    pred_len = len(pred_intervals)
    ref_matched = [False] * ref_len
    pred_matched = [False] * pred_len
    while ref_idx < ref_len and pred_idx < pred_len:
        ref_interval = ref_intervals[ref_idx]
        pred_interval = pred_intervals[pred_idx]
        if overlapped(ref_interval, pred_interval):
            ref_matched[ref_idx] = True
            pred_matched[pred_idx] = True
        if ref_interval[1] < pred_interval[1]:
            ref_idx += 1
        else:
            pred_idx += 1
    metrics = {
        'TP_ref': sum(ref_matched),
        'TP_pred': sum(pred_matched),
        'FN': ref_len - sum(ref_matched),
        'FP': pred_len - sum(pred_matched),
    }
    return metrics, ref_matched, pred_matched

In [55]:
compare_intervals(ref, pred)

({'TP_ref': 2, 'TP_pred': 1, 'FN': 0, 'FP': 0}, [True, True], [True])

In [45]:
detection_indice

{'TP_pred': set(), 'TP_ref': set(), 'FN': {0, 1}, 'FP': {0}}

In [46]:
scores

{'TP': 0, 'FN': 2, 'FP': 1}
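
The `scores` counts above can be turned into span-level precision, recall and F1. The sketch below assumes a dict with `TP`, `FP` and `FN` keys like the one in this cell; `prf_from_counts` is a hypothetical helper, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def prf_from_counts(scores):
    """Compute precision, recall and F1 from TP/FP/FN counts."""
    tp, fp, fn = scores["TP"], scores["FP"], scores["FN"]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Counts from the toy example above: one prediction spanning two references
print(prf_from_counts({"TP": 0, "FN": 2, "FP": 1}))
```

For the toy example all three metrics are 0.0 with the 0.8 thresholds, because the single long prediction overlaps each reference but neither reference covers 80% of the prediction.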