# CC certificate id evaluation
This notebook evaluates our heuristics for certificate ID assignment
and canonicalization.

It looks at several issues:

1. Certificates with no ID assigned.
2. Duplicate certificate ID assignments (two or more certificates are assigned the same ID).
3. Certificates that share the same certification report document (an issue in the input data
   that explains some of the duplicate certificate ID assignments).

In [None]:
from sec_certs.dataset import CCDataset
import csv

In [None]:
# dset = CCDataset.from_web_latest()
dset = CCDataset.from_json("../../cc_dset/CommonCriteria_dataset.json")
# dset._compute_normalized_cert_ids()
# dset.to_json()

## Certificates with no id

Here we report the number of certificates in our dataset that we have no certificate ID for.

In [None]:
missing_id_dgsts = set()
for cert in dset:
    if not cert.heuristics.cert_id:
        print(cert.dgst, cert.heuristics.cert_id, cert.scheme)
        missing_id_dgsts.add(cert.dgst)
print(f"Total: {len(missing_id_dgsts)}")

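### Check manually evaluated missing IDs

The following cell lists certificates whose manually determined ID (from `missing_ids.csv`) differs
from the ID currently assigned by our heuristics.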

In [None]:
i = 0
with open("missing_ids.csv", "r") as f:
    reader = csv.DictReader(f)
    for line in reader:
        try:
            cert = dset[line["id"]]
        except KeyError:  # the certificate is no longer in the dataset
            continue
        if line["cert_id"] and line["cert_id"] != cert.heuristics.cert_id:
            i += 1
            print(line["id"], line["cert_id"], cert.heuristics.cert_id, line["reason"])
print(f"Total: {i}")
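
The comparison above reads only the `id`, `cert_id`, and `reason` columns of `missing_ids.csv`. As a rough illustration, the same check can be sketched on in-memory data. The sample rows, the digests, and the `current_ids` mapping below are made up; they stand in for the real CSV file and for `cert.heuristics.cert_id`.

```python
import csv
import io

# Hypothetical stand-in for missing_ids.csv (column names taken from the cell above).
sample_csv = io.StringIO(
    "id,cert_id,reason\n"
    "aaaa1111,CERT-0001,id only in report body\n"
    "bbbb2222,,no id in any document\n"
    "cccc3333,CERT-0003,scanned document\n"
)

# Hypothetical stand-in for the heuristic output, keyed by certificate digest.
current_ids = {"aaaa1111": "CERT-0001", "bbbb2222": None, "cccc3333": None}

mismatches = []
for row in csv.DictReader(sample_csv):
    if row["id"] not in current_ids:
        continue  # certificate is not in the (toy) dataset
    # Rows without a manually determined ID cannot be compared.
    if row["cert_id"] and row["cert_id"] != current_ids[row["id"]]:
        mismatches.append(row["id"])

print(mismatches)
```

Rows with an empty `cert_id` are skipped, so a certificate for which manual analysis found no ID either never counts as a mismatch.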


The following cell checks which manually analyzed missing certificate IDs have since been fixed.

In [None]:
manual_missing_ids = set()
i = 0
with open("missing_ids.csv", "r") as f:
    reader = csv.DictReader(f)
    for line in reader:
        manual_missing_ids.add(line["id"])
        if line["id"] not in missing_id_dgsts:
            i += 1
            print(",".join(line.values()))
print(f"Total: {i}")

The following cell lists certificates whose IDs *went missing* since the manual analysis.

In [None]:
new_missing_ids = missing_id_dgsts.difference(manual_missing_ids)
for idd in new_missing_ids:
    cert = dset[idd]
    print(idd, cert.heuristics.cert_id)
print(f"Total: {len(new_missing_ids)}")
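
The two cells above amount to set comparisons between the digests that are missing an ID now and the digests covered by the manual analysis. A minimal sketch with made-up digests (the set contents are hypothetical, and the names only mirror `missing_id_dgsts` and `manual_missing_ids`):

```python
# Hypothetical digests; in the notebook these sets come from the dataset and the CSV.
missing_now = {"a", "b", "e"}              # plays the role of missing_id_dgsts
manually_analyzed = {"a", "b", "c", "d"}   # plays the role of manual_missing_ids

fixed_since = manually_analyzed - missing_now    # analyzed before, not missing now
still_missing = manually_analyzed & missing_now  # analyzed before, still missing
newly_missing = missing_now - manually_analyzed  # missing now, never analyzed

print(sorted(fixed_since), sorted(still_missing), sorted(newly_missing))
```

Note that a digest also lands in the "fixed" group when its certificate is no longer part of the dataset, since it then cannot appear among the currently missing ones.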

## Duplicate certificate id assignment

Here we report the number of certificates in our dataset that have a duplicate certificate
ID assigned.

In [None]:
id_mapping = {}
for cert in dset:
    if cert.heuristics.cert_id is not None:
        c_list = id_mapping.setdefault(cert.heuristics.cert_id, [])
        c_list.append(cert.dgst)

duplicate_id_dgsts = set()
for idd, entries in id_mapping.items():
    if len(entries) > 1 and idd:
        print(idd, entries)
        duplicate_id_dgsts.update(entries)
print(f"Total: {len(duplicate_id_dgsts)}")
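
The grouping step above can also be expressed with `collections.defaultdict`. The following is a minimal sketch on made-up `(digest, cert_id)` pairs rather than on `dset`; the `records` list and its values are purely illustrative.

```python
from collections import defaultdict

# Hypothetical (digest, cert_id) pairs standing in for the dataset.
records = [("d1", "ID-1"), ("d2", "ID-2"), ("d3", "ID-1"), ("d4", None), ("d5", "")]

groups = defaultdict(list)
for dgst, cert_id in records:
    if cert_id:  # skips both None and the empty string
        groups[cert_id].append(dgst)

# Keep only the IDs that were assigned to more than one certificate.
duplicates = {cid: dgsts for cid, dgsts in groups.items() if len(dgsts) > 1}
print(duplicates)
```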

## Duplicate report documents

Some certificates have erroneously uploaded certification reports. Here we compare the report
hashes and list such duplicates in the input data.

In [None]:
duplicate_docs = {}

for cert in dset:
    if cert.state.report_pdf_hash is not None:
        r_list = duplicate_docs.setdefault(cert.state.report_pdf_hash, [])
        r_list.append(cert.dgst)

duplicate_doc_dgsts = set()
for pdf_hash, entries in duplicate_docs.items():  # avoid shadowing the built-in hash()
    if len(entries) > 1:
        print(pdf_hash, entries)
        duplicate_doc_dgsts.update(entries)
print(f"Total: {len(duplicate_doc_dgsts)}")

The following prints the number of certificate ID duplicates due to input data (two or more
certificates share a certification report document).

In [None]:
duplicate_ids_due_doc = duplicate_doc_dgsts.intersection(duplicate_id_dgsts)
print(len(duplicate_ids_due_doc))

The following prints the number of certificate ID duplicates that are not due to input data.

In [None]:
duplicate_ids_issue = duplicate_id_dgsts.difference(duplicate_doc_dgsts)
print(len(duplicate_ids_issue))
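
The intersection and difference used in the two cells above split the duplicate-ID digests into two disjoint groups. A small sketch with made-up digests (the set contents are hypothetical) shows the partition:

```python
# Hypothetical digests standing in for the sets computed earlier in the notebook.
dup_id = {"d1", "d2", "d3", "d4"}   # like duplicate_id_dgsts
dup_doc = {"d3", "d4", "d9"}        # like duplicate_doc_dgsts

explained = dup_id & dup_doc     # duplicate ID and a shared report document
unexplained = dup_id - dup_doc   # duplicate ID but a unique report document

# The two groups cover every duplicate-ID digest exactly once.
assert explained | unexplained == dup_id
assert not (explained & unexplained)

print(len(explained), len(unexplained))
```

The intersection works per certificate, not per pair: a certificate is counted as explained whenever it shares a report with any other certificate, even if that is not the one it shares an ID with. The first number is therefore best read as an upper bound.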

### Check manually evaluated duplicates

The following cell lists ID collisions that were manually analyzed in the past and whose assigned ID
still differs from the manually determined true ID. Certificates that now have the correct ID are not printed.

In [None]:
i = 0
with open("duplicate_ids.csv", "r") as f:
    reader = csv.DictReader(f)
    for line in reader:
        try:
            cert = dset[line["id"]]
        except KeyError:  # the certificate is no longer in the dataset
            continue
        # Only mismatches are reported, so no separate True/False column is printed.
        if line["true_id"] != cert.heuristics.cert_id:
            print(line["id"], line["result"], line["true_id"], cert.heuristics.cert_id, line["fixable"])
            i += 1
print(f"Total: {i}")

The following cell lists those duplicates that were fixed by changes since manual analysis.

In [None]:
manual_duplicate_ids = set()
i = 0
with open("duplicate_ids.csv", "r") as f:
    reader = csv.DictReader(f)
    for line in reader:
        manual_duplicate_ids.add(line["id"])
        if line["id"] not in duplicate_id_dgsts:
            i += 1
            print(",".join(line.values()))
print(f"Total: {i}")

The following cell lists duplicates that were *created* since manual analysis.

In [None]:
new_duplicate_ids = duplicate_id_dgsts.difference(manual_duplicate_ids)
for idd in new_duplicate_ids:
    cert = dset[idd]
    print(idd, cert.heuristics.cert_id)
print(f"Total: {len(new_duplicate_ids)}")