# PnID parsing with SDK-experimental demo

This notebook uses PnIDs from the `publicdata` project to demonstrate how to contextualize PnIDs using cognite-sdk-python-experimental.

## Accessing CDF
This tutorial assumes you have some basic knowledge of CDF and the Python SDK. If not, please follow the 'lab' tutorials first.

For this tutorial you need access to the publicdata project / tenant. If you don't have access, you can get an API key [here](https://content.cognite.com/open-industrial-data).

## Setup
These steps will import the SDK and install some extra packages required for visualizing results in notebooks.
You can then connect a client to CDF by pasting your API key in the popup field.

In [None]:
!pip install Pillow numpy bounding_box pdf2image

In [None]:
from getpass import getpass
from cognite.experimental import CogniteClient # version~=0.41
from IPython.display import display_pdf, Image, display, display_svg
import time

In [None]:
project = "publicdata"  # you can also put your own project name here
api_key = getpass("Please enter API key: ")
client = CogniteClient(
    api_key=api_key,
    project=project,
    client_name="DSHub",
)

## Gather data
We will start by finding some PnIDs and corresponding asset names to use in the processing. The PnID parsing tool only supports PDF files, so we'll restrict the search to PDF mime type.

In [None]:
client.files.list(mime_type="application/pdf")

We select one of these files, 'PH-ME-P-0156-001.pdf' with id 230063753840368 to investigate.
We can now download the file information and the file content. The file content is not strictly necessary, but is useful to visualize.

In [None]:
file_id = 230063753840368
f = client.files.retrieve(file_id)

# Display the PDF file by downloading the content (can be skipped)
display_pdf(client.files.download_bytes(f.id), raw=True)

PnID contextualization requires a list of tags/strings to look for in the PnID. This helps the algorithm return only tags that actually exist. We will therefore download all asset and file data and use the asset names and file names as tags to look for.

Note: In a more realistic case, you should limit the list of potential names/tags as much as possible for better performance, e.g. by selecting only a subtree of the asset hierarchy.

In [None]:
files = client.files.list(limit=None)
assets = client.assets.list(limit=None, partitions=10)
print(f"Retrieved {len(files)} files and {len(assets)} assets")
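The note above about restricting the candidate tags to a subtree can be sketched offline on plain dictionaries. This is a minimal illustration, not part of the SDK: the helper `subtree_names` and the sample assets are hypothetical, and it assumes each dumped asset has `id`, `parent_id` and `name` keys.

```python
from collections import defaultdict


def subtree_names(assets, root_id):
    """Return the names of the root asset and all of its descendants."""
    by_id = {}
    children = defaultdict(list)
    for asset in assets:
        by_id[asset["id"]] = asset
        children[asset.get("parent_id")].append(asset["id"])

    names = []
    stack = [root_id]
    while stack:
        current = stack.pop()
        names.append(by_id[current]["name"])
        stack.extend(children[current])
    return names


# Hypothetical sample data standing in for `assets.dump()`
sample_assets = [
    {"id": 1, "parent_id": None, "name": "PLANT"},
    {"id": 2, "parent_id": 1, "name": "23-SYSTEM"},
    {"id": 3, "parent_id": 2, "name": "23-TE-96114"},
    {"id": 4, "parent_id": 1, "name": "45-SYSTEM"},
]

subtree = subtree_names(sample_assets, root_id=2)
print(subtree)
```

Passing only the names in `subtree` as entities keeps the candidate list short, which is the point of the note.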

## Detect assets and files in PnID
Now we are ready to run the parser. The job takes two required parameters: the CDF `file_id` or `file_external_id` of the PnID file, and a list of `entities` to look for. The entities are either a list of strings, or a list of objects together with a search field to use.

In [None]:
entities = files.dump() + assets.dump()
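When entities are passed as objects, each one needs a value in the chosen search field. The sketch below is a local sanity check you might run before starting a job. The sample entities and the variable names are hypothetical, and it makes no claim about how the service itself treats entities that lack the field.

```python
# Hypothetical entities standing in for `files.dump() + assets.dump()`
sample_entities = [
    {"id": 101, "name": "PH-ME-P-0156-001.pdf"},
    {"id": 202, "name": "23-TE-96114"},
    {"id": 303, "external_id": "no-name-here"},  # lacks the search field
]
search_field = "name"

# Keep only entities that have a non-empty value in the search field
usable = [e for e in sample_entities if e.get(search_field)]
search_strings = [e[search_field] for e in usable]
skipped = len(sample_entities) - len(usable)

print(f"{len(usable)} usable entities, {skipped} skipped")
```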

In [None]:
# Start PnID parsing job
job = client.pnid_parsing.detect(
    file_id=f.id,  # or `file_external_id` for file's external id
    entities=entities,
    search_field="name",
)

In [None]:
job

The parsing job runs asynchronously, and accessing `job.result` or any of the other helpers will wait until it has finished. This might take a few seconds.

In [None]:
results = job.result
print(f"{len(results['items'])} matches")

The following helper methods may be useful:

The results contain `items`, a list of detected tags, but it is often more convenient to use the `.matches` helper. It also resolves the detected text back to the entities passed to `detect`.

In [None]:
detected_entities = job.matches
detected_entities

In [None]:
detected_entities[0].text

In [None]:
detected_entities[0].confidence

In [None]:
detected_entities[0].entities

In [None]:
job.image # downloads file and draws bounding boxes with text

## Annotate detected items in SVG format

You can convert the result to SVG with the `convert` method.

In [None]:
convert_image_job = job.convert()
convert_image_job.image # returns an svg image

## Advanced options

The parsing also supports a few advanced options. For an up to date list of all available options, see [the API documentation](https://docs.cognite.com/api/playground/#operation/pnidDetect).

### 1 Fuzzy matching of entities
In some cases the tag names found in a PnID do not match the asset list exactly. The PnID parsing algorithm can still find matches by looking at substrings and visually similar characters (e.g. `0` and `O`, `8` and `B`). To use fuzzy matching, set `partial_match=True`.

In [None]:
job = client.pnid_parsing.detect(file_id=f.id, entities=entities, partial_match=True)
fuzzy_results = job.matches

In [None]:
print(f"Without partial_match enabled: {len(detected_entities)} assets and files detected")
print(f"With partial_match enabled:    {len(fuzzy_results)} assets and files detected")
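To build intuition for what matching on similar characters and substrings means, here is a small offline sketch. It is an illustration only and not the algorithm used by the service: the helpers `normalize`, `fuzzy_equal` and `fuzzy_contains` are hypothetical, and only the two character pairs mentioned above are treated as interchangeable.

```python
# Map easily confused characters onto one representative each
CONFUSABLE = str.maketrans({"O": "0", "B": "8"})


def normalize(tag):
    """Upper-case the tag and collapse confusable characters."""
    return tag.upper().translate(CONFUSABLE)


def fuzzy_equal(detected, entity_name):
    """True if the two strings are equal after normalization."""
    return normalize(detected) == normalize(entity_name)


def fuzzy_contains(detected, entity_name):
    """True if the entity name occurs as a substring of the detected text."""
    return normalize(entity_name) in normalize(detected)


print(fuzzy_equal("23-DB-91O1", "23-DB-9101"))        # letter O read instead of zero
print(fuzzy_contains("23-DB-9101-01", "23-DB-9101"))  # substring match
```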

### 2 Using asset aliases
In some cases, there is a known mapping between e.g. asset names and the tags found in the PnID. You can provide these aliases to the parsing algorithm as a dictionary called `name_mapping`.

In [None]:
job = client.pnid_parsing.detect(
    file_id=f.id,
    entities=["myawesomeasset"],
    name_mapping={"myawesomeasset": "23-DB-9101"},
)

In [None]:
job.matches
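The alias idea can be pictured as a reverse lookup from the tag printed in the PnID back to the entity name. The sketch below reuses the mapping from the cell above. The `resolve` helper is hypothetical and only illustrates the concept, not how the service applies `name_mapping` internally.

```python
# Entity name -> tag as it appears in the PnID (same mapping as above)
name_mapping = {"myawesomeasset": "23-DB-9101"}

# Invert the mapping so a detected tag can be traced back to its entity
alias_to_entity = {alias: entity for entity, alias in name_mapping.items()}


def resolve(detected_text):
    """Return the entity name for a detected tag, or the tag itself if no alias exists."""
    return alias_to_entity.get(detected_text, detected_text)


print(resolve("23-DB-9101"))
print(resolve("23-TE-96114"))
```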

## Detect entities in PnID based on regex pattern
This feature is useful if you don't have a pre-defined list of entities to look for in the PnID (like a list of asset tags), but you do know that all strings of interest follow one or more patterns. 

This method takes a CDF file_id as input, as well as a list of `patterns`. The regular expression patterns must follow the format defined in [the API documentation](https://docs.cognite.com/api/playground/#operation/pnidExtract).

The regular expression patterns are limited to simple expressions without wildcards, anchors or repetition symbols, and must include groups of letters or digits to look for. A group is defined by a pattern in parentheses `()` and can match either letters or digits, but not both. Valid groups are e.g. `([A-Z]{2,5})`, `([0-9]{4})`, `(TAG)`. The groups in the pattern can optionally be separated by a separator character like `-` or `_`. The resulting tags will then include this separator character between the matched groups.

In the above PnID, we see that many tags follow these patterns:\
`23-TE-96114-01`, i.e. (2 digits)-(2 letters)-(5 digits)-(2 digits)\
`23-TE-96114`, i.e. (2 digits)-(2 letters)-(5 digits)

We can represent this as the following expressions:\
`([0-9]{2})-([A-Z]{2})-([0-9]{5})-([0-9]{2})`\
`([0-9]{2})-([A-Z]{2})-([0-9]{5})`

In [None]:
# Define a regex-like pattern to search for
patterns = [
    "([0-9]{2})-([A-Z]{2})-([0-9]{5})-([0-9]{2})",
    "([0-9]{2})-([A-Z]{2})-([0-9]{5})",
]
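Before submitting the patterns, it can help to check them locally against a few tags you expect to find. The sketch below uses Python's built-in `re` module. The `matching_pattern` helper is hypothetical, and since Python regex syntax is broader than the restricted format accepted by the API, a local match is a sanity check only.

```python
import re

patterns = [
    "([0-9]{2})-([A-Z]{2})-([0-9]{5})-([0-9]{2})",
    "([0-9]{2})-([A-Z]{2})-([0-9]{5})",
]


def matching_pattern(tag, patterns):
    """Return the first pattern that matches the whole tag, or None."""
    for pattern in patterns:
        if re.fullmatch(pattern, tag):
            return pattern
    return None


for tag in ["23-TE-96114-01", "23-TE-96114", "23-te-96114"]:
    print(tag, "->", matching_pattern(tag, patterns))

# The groups correspond to the parts between the separators
groups = re.fullmatch(patterns[1], "23-TE-96114").groups()
print(groups)
```

The lower-case tag matches neither pattern because the letter group only allows `A-Z`.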

In [None]:
pattern_job = client.pnid_parsing.extract_pattern(file_id=f.id, patterns=patterns)

In [None]:
pattern_job.matches

In [None]:
# Count the matches from the pattern job (not the earlier detect job)
print(f"Detected {len(pattern_job.matches)} entities from patterns {patterns}")

## Get raw OCR data from PnID file
The Context API stores the OCR data for each processed PnID, and this data can be retrieved for further processing.

In [None]:
ocr_results = client.pnid_parsing.ocr(file_id=f.id)
ocr_results

In [None]:
ocr_results.image # PIL image with bounding boxes

You can also get this data from a detect job, or use these results in a convert job.

In [None]:
job.ocr()

In [None]:
ocr_image = job.convert(ocr=True)
ocr_image.image # svg image with highlights

## ALPHA: Detect common objects from PnIDs using SDK
The SDK also supports a basic form of object detection in PnIDs in single-page PDF format, i.e. it can identify shapes representing common components in a PnID. We currently support 20 classes of objects: valve, indicator, shared indicator, ball valve, diamond, tag, triangle, square with diagonal line, pump or centrifuge, flange, reducer, rotameter, slope, cloud, heat exchanger, note, logo, table, spectacle blind, and object.

This function takes a single `file_id` parameter and returns a list of detected objects. For API usage, please refer to `pnidobjects_demo_api.ipynb`.

In [None]:
object_job = client.pnid_object_detection.find_objects(file_id=f.id)

In [None]:
results = object_job.result
detected_objects = results["items"]
print(f"Found {len(detected_objects)} objects in PnID")
print(f"Types found: {list({de['type'] for de in detected_objects})}")

In [None]:
# List the first 10 detected objects for further processing
for de in detected_objects[0:10]:
    print(f'{de["type"]: <20}', f'score: {de["score"]: 15f}', de["boundingBox"])
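For further processing, a common next step is to count the detected objects per type and drop low-scoring detections. The sketch below works on hypothetical sample items that carry only the `type` and `score` keys used above. The `summarize` helper and the `min_score` threshold of 0.5 are assumptions for illustration, as is the idea that scores lie between 0 and 1.

```python
from collections import Counter

# Hypothetical items standing in for `results["items"]`
sample_objects = [
    {"type": "valve", "score": 0.91},
    {"type": "valve", "score": 0.78},
    {"type": "flange", "score": 0.66},
    {"type": "cloud", "score": 0.12},
]


def summarize(objects, min_score=0.5):
    """Count objects per type, keeping only those at or above min_score."""
    confident = [o for o in objects if o["score"] >= min_score]
    return Counter(o["type"] for o in confident)


summary = summarize(sample_objects)
for object_type, count in summary.most_common():
    print(f"{object_type: <20} {count}")
```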