# PnID parsing with SDK-experimental demo

This notebook uses Publicdata PnIDs to demonstrate how to contextualize PnIDs using cognite-sdk-python-experimental.

## Accessing CDF
This tutorial assumes you have some basic knowledge of CDF and the Python SDK. If not, please follow the 'lab' tutorials first.

For this tutorial you need access to the publicdata project / tenant. If you don't have access, you can get an API key [here](https://content.cognite.com/open-industrial-data).

## Setup
These steps will import the SDK and some required packages.
You can then connect a client to CDF by pasting your API key into the popup field.

In [None]:
from getpass import getpass
from cognite.experimental import CogniteClient

In [None]:
from IPython.display import display_pdf, Image, display, display_svg

In [None]:
project = "publicdata" # put your project name as string here
api_key=getpass("Please enter API key: ")
client = CogniteClient(
    api_key=api_key,
    project=project,
    client_name="DSHub",
    base_url="https://api.cognitedata.com"
)

## Gather data
We will start by finding some PnIDs and corresponding asset names to use in the processing. The PnID parsing tool only supports PDF files, so we'll restrict the search to PDF mime type.

In [None]:
client.files.list(mime_type="application/pdf").to_pandas().head()

We will use the file named `PH-ME-P-0156-001.pdf` with id `230063753840368` in the following examples. 

We can now download the file information and the file content. The file content is not strictly necessary, but is useful to visualize.

In [None]:
f = client.files.retrieve(230063753840368)

# Display the PDF file by downloading the content (can be skipped)
display_pdf(client.files.download_bytes(f.id), raw=True)

The PnID contextualization requires a list of tags/strings to look for in the PnID. This helps the algorithm return only tags that actually exist. We will therefore download all asset and file data and use the asset names and file names as tags to look for.

Note: In a more realistic case, you should limit the list of potential names/tags as much as possible for better performance, e.g. by selecting only a subtree of the asset hierarchy.
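As a rough illustration of trimming the candidate list, the sketch below removes missing names, duplicates and very short strings before they are sent to the parser. The helper `build_entities` and its `min_length` threshold are hypothetical and not part of the SDK; pick a threshold that fits your own tag conventions.

```python
def build_entities(names, min_length=3):
    """Return unique, non-empty names in first-seen order.

    `min_length` is an illustrative threshold (an assumption, not an API rule):
    very short strings tend to match too many places in a drawing.
    """
    seen = set()
    entities = []
    for name in names:
        if not name:  # skip None and empty strings
            continue
        cleaned = name.strip()
        if len(cleaned) < min_length or cleaned in seen:
            continue
        seen.add(cleaned)
        entities.append(cleaned)
    return entities


# Made-up input list, for illustration only
example_entities = build_entities(
    ["23-TE-96114", None, "23-TE-96114", " PH-ME-P-0156-001.pdf ", "A"]
)
print(example_entities)
```

In the notebook, the same helper could be applied to the combined list of file and asset names before starting the parsing job.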

In [None]:
files = client.files.list(limit=None)
assets = client.assets.list(limit=None, partitions=10)
print(f"Retrieved {len(files)} files and {len(assets)} assets")

## Detect assets and files in PnID
Now we are ready to run the parser. The job takes two required parameters: `file_id`, the CDF id of the PnID file, and `entities`, a list of strings to look for.

In [None]:
# Create list of asset and file names to look for
entities = [f.name for f in files] + [a.name for a in assets]

In [None]:
# Start PnID parsing job
job = client.pnid_parsing.parse(file_id=f.id, entities=entities)

The parsing job runs asynchronously, and accessing `job.result` blocks until the job has finished. This might take a few seconds.

In [None]:
results = job.result

The results contain `items`, a list of detected tags, and `svgUrl`, which can be used to fetch a highlighted SVG version of the PnID. The URL is valid for 30 seconds.

In [None]:
detected_entities = results["items"]
svg = results["svgUrl"]
print(f"Found {len(detected_entities)} assets and files in PnID")

In [None]:
# Display the annotated PnID in SVG format
Image(url=svg)

In [None]:
# List the first 10 detected tags for further processing
for de in detected_entities[0:10]:
    print(f'{de["text"]: <15}', de["boundingBox"])
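A tag can appear several times in the same drawing, so a common next step is to count occurrences per tag. The sketch below assumes only that each detected item has a `text` field, as used in the loop above; the sample items are made up and other fields are omitted.

```python
from collections import Counter


def count_tags(items):
    """Count how many times each detected tag text occurs."""
    return Counter(item["text"] for item in items)


# Made-up sample in the same shape as `detected_entities` (only `text` shown)
sample_items = [
    {"text": "23-TE-96114"},
    {"text": "23-DB-9101"},
    {"text": "23-TE-96114"},
]

tag_counts = count_tags(sample_items)
for tag, count in tag_counts.most_common():
    print(f"{tag: <15}", count)
```

With real results, call `count_tags(detected_entities)` instead of using the sample list.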

## Advanced options

The parser also supports a few advanced options. For an up-to-date list of all available options, see [the API documentation](https://docs.cognite.com/api/playground/#operation/pnidDetect).

### 1 Fuzzy matching of entities
In some cases the tag names found in a PnID do not match the asset list exactly. The PnID parsing algorithm can still find matches by looking at substrings and similar-looking characters (e.g. `0` and `O`, `8` and `B`). To use fuzzy matching, set `partial_match=True`.

In [None]:
job = client.pnid_parsing.parse(file_id=f.id, entities=entities, partial_match=True)
fuzzy_results = job.result
print(f"Without partial_match enabled: {len(detected_entities)} assets and files detected")
print(f"With partial_match enabled:    {len(fuzzy_results['items'])} assets and files detected")
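To build some intuition for the "similar characters" idea, here is a small local sketch that treats `O`/`0` and `B`/`8` as interchangeable when comparing two tags. This is only an illustration and is not the algorithm used by the parsing service; the names `CONFUSABLE`, `normalize_tag` and `loosely_equal` are hypothetical.

```python
# Map visually similar characters to one canonical form (illustrative choice)
CONFUSABLE = str.maketrans({"O": "0", "B": "8"})


def normalize_tag(tag):
    """Upper-case the tag and collapse look-alike characters."""
    return tag.upper().translate(CONFUSABLE)


def loosely_equal(tag_a, tag_b):
    """Return True if two tags are equal after normalization."""
    return normalize_tag(tag_a) == normalize_tag(tag_b)


print(loosely_equal("23-DB-9101", "23-D8-91O1"))
print(loosely_equal("23-DB-9101", "23-DB-9102"))
```

A comparison like this can be handy when checking fuzzy results against your own asset list, but the service may apply different rules.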

### 2 Using asset aliases
In some cases, there is a known mapping between, for example, asset names and the tags found in the PnID. You can provide these aliases to the parsing algorithm as a dictionary called `name_mapping`.

In [None]:
job = client.pnid_parsing.parse(file_id=f.id, 
                                entities=["myawesomeasset"], 
                                name_mapping={"myawesomeasset":"23-DB-9101"}
                               )

In [None]:
job.result["items"]

## Detect entities in PnID based on regex pattern
This feature is useful if you don't have a pre-defined list of entities to look for in the PnID (like a list of asset tags), but you do know that all strings of interest follow one or more patterns. 

This method takes a CDF file_id as input, as well as a list of `patterns`. The regular expression patterns must follow the format defined in [the API documentation](https://docs.cognite.com/api/playground/#operation/pnidExtract).

The regular expression patterns are limited to simple expressions without wildcards, anchors or repetition symbols, and must include groups of letters or digits to look for. A group is defined by a pattern in parentheses `()` and can match either letters or digits, but not both. Valid groups are e.g. `([A-Z]{2,5})`, `([0-9]{4})`, `(TAG)`. The groups in the pattern can optionally be separated by a separator character like `-` or `_`. The resulting tags will then include this separator character between their matching groups.

In the PnID above, we see that many tags follow these patterns:\
`23-TE-96114-01`, i.e. (2 digits)-(2 letters)-(5 digits)-(2 digits)\
`23-TE-96114`, i.e. (2 digits)-(2 letters)-(5 digits)

We can represent this as the following expressions:\
`([0-9]{2})-([A-Z]{2})-([0-9]{5})-([0-9]{2})`\
`([0-9]{2})-([A-Z]{2})-([0-9]{5})`

In [None]:
# Define a regex-like pattern to search for
patterns = ["([0-9]{2})-([A-Z]{2})-([0-9]{5})-([0-9]{2})", "([0-9]{2})-([A-Z]{2})-([0-9]{5})"]
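Before sending the patterns to the API, it can be useful to sanity-check them locally against a few known tags. The sketch below uses Python's `re` module for this, which happens to accept these expressions; the API's own matching may differ from Python's, so treat this only as a quick check. The helper `matching_pattern` is hypothetical.

```python
import re

patterns = [
    "([0-9]{2})-([A-Z]{2})-([0-9]{5})-([0-9]{2})",
    "([0-9]{2})-([A-Z]{2})-([0-9]{5})",
]
compiled = [re.compile(p) for p in patterns]


def matching_pattern(tag):
    """Return the first pattern that matches the whole tag, or None."""
    for regex in compiled:
        if regex.fullmatch(tag):
            return regex.pattern
    return None


# The two tags come from the text above; the third is the PnID file name stem
for tag in ["23-TE-96114-01", "23-TE-96114", "PH-ME-P-0156-001"]:
    print(f"{tag: <18}", matching_pattern(tag))
```
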

NOTE: This method is not yet supported in the SDK, so we call the API directly.

In [None]:
response = client.post(f"/api/playground/projects/{project}/context/pnid/extractpattern", 
                       json={"fileId": f.id, "patterns": patterns}).json()

In [None]:
get_response = client.get(f"/api/playground/projects/{project}/context/pnid/extractpattern/{response['jobId']}").json()
# Wait until status=Completed
get_response["status"]
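Instead of re-running the status cell by hand, the check can be wrapped in a small polling loop. The sketch below is a minimal version under stated assumptions: `wait_for_job` is a hypothetical helper, `"Completed"` is the status named in this notebook, and `"Failed"` and `"Running"` are assumed status values that you should verify against the API documentation.

```python
import time


def wait_for_job(fetch_status, poll_interval=1.0, timeout=60.0):
    """Call `fetch_status()` until the job reaches a final status.

    `fetch_status` is any zero-argument callable returning the job's JSON
    response as a dict. "Failed" is an assumed terminal status.
    """
    deadline = time.monotonic() + timeout
    while True:
        response = fetch_status()
        if response.get("status") in ("Completed", "Failed"):
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError("Job did not finish within the timeout")
        time.sleep(poll_interval)


# Offline demonstration with fake responses ("Running" is a made-up status)
fake_responses = iter([{"status": "Running"}, {"status": "Completed", "items": []}])
demo_result = wait_for_job(lambda: next(fake_responses), poll_interval=0.0)
print(demo_result["status"])
```

In the notebook, `fetch_status` could be a lambda that wraps the `client.get(...).json()` call shown above.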

In [None]:
print(f"Detected {len(get_response['items'])} entities from patterns {patterns}")

In [None]:
# Display detected patterns in file
response = client.post(f"/api/playground/projects/{project}/context/pnid/convert", {"fileId": f.id, "items": get_response['items']})

In [None]:
get_response = client.get(f"/api/playground/projects/{project}/context/pnid/convert/{response.json()['jobId']}").json()
# Wait until status=Completed
get_response["status"]

In [None]:
Image(url=get_response["svgUrl"])

## Get raw OCR data from PnID file
The Context API stores the OCR data for each processed PnID, and this data can be retrieved for further processing.

NOTE: This method is not yet supported in the SDK, so we call the API directly.

In [None]:
ocr_response = client.post(f"/api/playground/projects/{project}/context/pnid/ocr", {"fileId": f.id}).json()

In [None]:
ocr_data = ocr_response["items"][0]["annotations"]
print(f"Extracted {len(ocr_data)} raw text elements from document")
ocr_data[0:5]

In [None]:
# Display OCR data in file
response = client.post(f"/api/playground/projects/{project}/context/pnid/convert", {"fileId": f.id, "items": ocr_data}).json()

In [None]:
get_response = client.get(f"/api/playground/projects/{project}/context/pnid/convert/{response['jobId']}").json()
# Wait until status=Completed
get_response["status"]

In [None]:
Image(url=get_response["svgUrl"])

## BETA: Detect common objects in PnIDs
The PnID API also supports a basic form of object detection, i.e. it can identify shapes representing common components in a PnID. We currently support 14 classes of objects.

This endpoint takes a single `file_id` parameter, and returns a list of detected objects.

In [None]:
response = client.post(f"/api/playground/projects/{project}/context/pnidobjects/findobjects", 
                       {"fileId": f.id}).json()

In [None]:
get_response = client.get(f"/api/playground/projects/{project}/context/pnidobjects/{response['jobId']}").json()
# Wait until status=Completed
get_response["status"]

In [None]:
print(f"Detected {len(get_response['items'])} objects in PnID")

In [None]:
# List the first 10 detected objects for further processing
for o in get_response['items'][0:10]:
    print(f'{o["type"]: <15}', o["boundingBox"])