# PnID parsing with SDK-experimental demo

This notebook uses Publicdata PnIDs to demonstrate how to contextualize PnIDs using cognite-sdk-python-experimental.

## Accessing CDF
This tutorial assumes you have some basic knowledge of CDF and the Python SDK. If not, please follow the 'lab' tutorials first.

For this tutorial you need access to the publicdata project / tenant. If you don't have access, you can get an API key [here](https://content.cognite.com/open-industrial-data).

## Setup
These steps will import the SDK and some required packages.
You can then connect a client to CDF by pasting your API key in the popup field.

In [1]:
from getpass import getpass
from cognite.experimental import CogniteClient # version=~0.22.3

In [2]:
from IPython.display import display_pdf, Image, display, display_svg
import time

In [3]:
project = "publicdata" # put your project name as string here
api_key=getpass("Please enter API key: ")
client = CogniteClient(
    api_key=api_key,
    project=project,
    client_name="DSHub",
    base_url="https://api.cognitedata.com"
)

Please enter API key:  ················································


## Gather data
We will start by finding some PnIDs and corresponding asset names to use in the processing. The PnID parsing tool only supports PDF files, so we'll restrict the search to PDF mime type.

In [4]:
client.files.list(mime_type="application/pdf").to_pandas().head()

| | name | source | mimeType | metadata | dataSetId | id | uploaded | uploadedTime | createdTime | lastUpdatedTime |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PH-ME-P-0156-001.pdf | Discovery | application/pdf | {'__COGNITE_PNID': 'true', '__COGNITE_PNID_PAR... | 2452112635370053 | 230063753840368 | True | 1586949730690 | 1586949728738 | 1587127451418 |
| 1 | PH-25578-P-4110119-001.pdf | Discovery | application/pdf | {'__COGNITE_PNID': 'true', '__COGNITE_PNID_PAR... | 2452112635370053 | 1173477928867194 | True | 1586949730124 | 1586949728733 | 1587127411349 |
| 2 | PH-ME-P-0152-001.pdf | Discovery | application/pdf | {'__COGNITE_PNID': 'true', '__COGNITE_PNID_PAR... | 2452112635370053 | 1296220000699223 | True | 1586949730652 | 1586949728743 | 1587127447626 |
| 3 | PH-ME-P-0004-001.pdf | Discovery | application/pdf | {'__COGNITE_PNID': 'true', '__COGNITE_PNID_PAR... | 2452112635370053 | 2240135229528595 | True | 1586949729424 | 1586949728742 | 1587127486739 |
| 4 | PH-ME-P-0156-002.pdf | Discovery | application/pdf | {'__COGNITE_PNID': 'true', '__COGNITE_PNID_PAR... | 2452112635370053 | 2694497441923100 | True | 1586949730695 | 1586949728743 | 1587127549067 |


We will use the file named `PH-ME-P-0156-001.pdf` with id `230063753840368` in the following examples. 

We can now download the file information and the file content. The file content is not strictly necessary, but is useful to visualize.

In [5]:
f = client.files.retrieve(230063753840368)

# Display the PDF file by downloading the content (can be skipped)
display_pdf(client.files.download_bytes(f.id), raw=True)

The PnID contextualization requires a list of tags/strings to look for in the PnID. This helps the algorithm return only tags that actually exist. We will therefore download all asset and file data and use the asset names and file names as tags to look for.

Note: In a more realistic case, you should limit the list of potential names/tags as much as possible for better performance, e.g. by selecting only a subtree of the asset hierarchy.

In [6]:
files = client.files.list(limit=None)
assets = client.assets.list(limit=None, partitions=10)
print(f"Retrieved {len(files)} files and {len(assets)} assets")

Retrieved 18 files and 1114 assets


## Detect assets and files in PnID
Now we are ready to run the parser. The job takes two required parameters: the CDF `file_id` of the PnID file and `entities`, a list of strings to look for.

In [7]:
# Create list of asset and file names to look for
entities = [f.name for f in files] + [a.name for a in assets]
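
As a rough sketch of the earlier note about keeping the candidate list small, the helper below deduplicates names and drops empty or very short strings before they are passed to the parser. The function name `build_entities` and the `min_length` threshold are illustrative assumptions and not part of the SDK.

```python
from typing import Iterable, List, Optional


def build_entities(names: Iterable[Optional[str]], min_length: int = 3) -> List[str]:
    """Return unique, non-trivial names in first-seen order.

    `min_length` is an arbitrary illustrative threshold, not an SDK setting.
    """
    seen = set()
    entities = []
    for name in names:
        if name is None:
            continue
        cleaned = name.strip()
        if len(cleaned) < min_length or cleaned in seen:
            continue
        seen.add(cleaned)
        entities.append(cleaned)
    return entities


# Example with made-up names; in this notebook you could pass
# [f.name for f in files] + [a.name for a in assets] instead.
sample = ["23-TE-96136-05", "23-TE-96136-05", " 23-XE-96124 ", "", None, "A"]
print(build_entities(sample))
```

Whether short names should be dropped depends on your tag conventions, so treat the threshold as something to tune rather than a recommendation.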

In [8]:
# Start PnID parsing job
job = client.pnid_parsing.detect(file_id=f.id, entities=entities)

The parsing job is run in an asynchronous manner, and the call to `job.result` will wait until it has finished. This might take a few seconds.

In [9]:
results = job.result

The results contain `items`, a list of detected tags, and `svgUrl`, which can be used to fetch a highlighted SVG version of the PnID. The URL is valid for 30 seconds.

In [10]:
detected_entities = results["items"]
print(f"Found {len(detected_entities)} assets and files in PnID")

Found 104 assets and files in PnID


In [11]:
# List the first 10 detected tags for further processing
for de in detected_entities[0:10]:
    print(f'{de["text"]: <15}', de["boundingBox"])

23-TE-96136-05  {'xMax': 0.2575055410034254, 'xMin': 0.2393713479750151, 'yMax': 0.24536905101168424, 'yMin': 0.22827016243944148}
23-FG-96101-02  {'xMax': 0.8351803344751159, 'xMin': 0.8170461414467056, 'yMax': 0.6933599316044458, 'yMin': 0.6762610430322029}
23-TE-96113-02  {'xMax': 0.535966149506347, 'xMin': 0.5162200282087447, 'yMax': 0.6568823026503278, 'yMin': 0.640068395554289}
23-YT-96134-01  {'xMax': 0.43904896232117674, 'xMin': 0.4209147692927665, 'yMax': 0.5651182673126247, 'yMin': 0.5477343972641778}
23-TE-96116-04  {'xMax': 0.7886359057021962, 'xMin': 0.7705017126737861, 'yMax': 0.28184667996580226, 'yMin': 0.2647477913935594}
23-YI-96120-02  {'xMax': 0.7289945597420915, 'xMin': 0.7106588756800323, 'yMax': 0.6047306925049872, 'yMin': 0.5876318039327444}
23-XE-96124     {'xMax': 0.5079588958291356, 'xMin': 0.4908321579689704, 'yMax': 0.5306355086919351, 'yMin': 0.5149615275007124}
23-TE-96111-04  {'xMax': 0.542010880515817, 'xMin': 0.5234737054201088, 'yMax': 0.2812767170133
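
The bounding boxes above all lie between 0 and 1. Assuming they are fractions of the page width and height measured from the top-left corner (an assumption worth checking against the API documentation), a sketch of converting one box to pixel coordinates of a rendered image could look like this. The helper name `to_pixels` and the page size are hypothetical.

```python
from typing import Dict, Tuple


def to_pixels(bbox: Dict[str, float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale a normalized bounding box to (left, top, right, bottom) pixels.

    Assumes xMin/xMax are fractions of the width and yMin/yMax fractions of
    the height, with the origin in the top-left corner.
    """
    left = round(bbox["xMin"] * width)
    right = round(bbox["xMax"] * width)
    top = round(bbox["yMin"] * height)
    bottom = round(bbox["yMax"] * height)
    return left, top, right, bottom


# Made-up box and page size, chosen so the arithmetic is easy to follow.
bbox = {"xMax": 0.5, "xMin": 0.25, "yMax": 0.75, "yMin": 0.5}
print(to_pixels(bbox, width=2000, height=1000))  # (500, 500, 1000, 750)
```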

## Annotate detected items in SVG format

In [12]:
# Display the annotated PnID in SVG format
job = client.pnid_parsing.convert(file_id=f.id, items=detected_entities)
results = job.result
svg = results["svgUrl"]
Image(url=svg)

## Advanced options

The parsing also supports a few advanced options. For an up to date list of all available options, see [the API documentation](https://docs.cognite.com/api/playground/#operation/pnidDetect).

### 1 Fuzzy matching of entities
In some cases the tag names found in a PnID do not match the asset list exactly. The PnID parsing algorithm can still find matches by looking at substrings and similar characters (e.g. `0` and `O`, `8` and `B`). To use fuzzy matching, set `partial_match=True`.

In [13]:
job = client.pnid_parsing.detect(file_id=f.id, entities=entities, partial_match=True)
fuzzy_results = job.result
print(f"Without partial_match enabled: {len(detected_entities)} assets and files detected")
print(f"With partial_match enabled:    {len(fuzzy_results['items'])} assets and files detected")

Without partial_match enabled: 104 assets and files detected
With partial_match enabled:    139 assets and files detected
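
To build some intuition for substring and similar-character matching, here is a small local sketch. It is not the service's algorithm: the character pairs are limited to the two examples mentioned above, and the helper names `normalize` and `fuzzy_equal` are hypothetical.

```python
# Illustrative only: how the service actually compares characters is not
# described in this notebook beyond the two example pairs.
CONFUSABLE = str.maketrans({"O": "0", "B": "8"})


def normalize(tag: str) -> str:
    """Map easily confused characters to one canonical form."""
    return tag.upper().translate(CONFUSABLE)


def fuzzy_equal(ocr_text: str, entity: str) -> bool:
    """True if, after normalization, one string contains the other."""
    a, b = normalize(ocr_text), normalize(entity)
    if not a or not b:
        return False
    return a in b or b in a


print(fuzzy_equal("23-TE-961O1", "23-TE-96101"))     # letter O read instead of zero
print(fuzzy_equal("23-TE-96101", "23-TE-96101-02"))  # substring match
print(fuzzy_equal("23-TE-96101", "23-FG-96101"))     # different tags
```

A looser comparison like this finds more candidates but can also produce false positives, which is consistent with the higher count reported when `partial_match` is enabled.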


### 2 Using asset aliases
In some cases, a known mapping exists between, for example, asset names and the tags found in the PnID. You can provide these aliases to the parsing algorithm as a dictionary called `name_mapping`.

In [14]:
job = client.pnid_parsing.detect(file_id=f.id, 
                                entities=["myawesomeasset"], 
                                name_mapping={"myawesomeasset":"23-DB-9101"}
                               )

In [15]:
job.result["items"]

[{'boundingBox': {'xMax': 0.611525287124723,
   'xMin': 0.5599435825105783,
   'yMax': 0.06611570247933884,
   'yMin': 0.05528640638358507},
  'confidence': 1.0,
  'text': 'myawesomeasset',
  'type': None},
 {'boundingBox': {'xMax': 0.5631674390489624,
   'xMin': 0.5115857344348177,
   'yMax': 0.40210886292390996,
   'yMin': 0.39184952978056437},
  'confidence': 1.0,
  'text': 'myawesomeasset',
  'type': None}]
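
Because the same entity can be detected at several locations, as in the output above, it can be handy to group detections by their `text`. The following is a minimal sketch on hand-written items shaped like that output; the helper name `group_by_text` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_text(items: List[dict]) -> Dict[str, List[dict]]:
    """Collect the bounding boxes of every detection, keyed by matched text."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["text"]].append(item["boundingBox"])
    return dict(grouped)


# Two detections shaped like the output above (coordinates rounded).
items = [
    {"text": "myawesomeasset", "confidence": 1.0,
     "boundingBox": {"xMin": 0.56, "xMax": 0.61, "yMin": 0.055, "yMax": 0.066}},
    {"text": "myawesomeasset", "confidence": 1.0,
     "boundingBox": {"xMin": 0.51, "xMax": 0.56, "yMin": 0.392, "yMax": 0.402}},
]
for text, boxes in group_by_text(items).items():
    print(text, len(boxes))
```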

## Detect entities in PnID based on regex pattern
This feature is useful if you don't have a pre-defined list of entities to look for in the PnID (like a list of asset tags), but you do know that all strings of interest follow one or more patterns. 

This method takes a CDF file_id as input, as well as a list of `patterns`. The regular expression patterns must follow the format defined in [the API documentation](https://docs.cognite.com/api/playground/#operation/pnidExtract).

The regular expression patterns are limited to simple expressions without wildcards, anchors or repetition symbols, and must include groups of letters or digits to look for. A group is defined by a pattern in parentheses `()` and can match either letters or numbers, but not both. Valid groups are e.g. `([A-Z]{2,5})`, `([0-9]{4})`, `(TAG)`. The groups in the pattern can optionally be separated by a separator character like `-` or `_`. The resulting tags will then include this separator character between their matching groups.

In the above PnID, we see that many tags follow these patterns:\
`23-TE-96114-01`, i.e. (2 digits)-(2 letters)-(5 digits)-(2 digits)\
`23-TE-96114`, i.e. (2 digits)-(2 letters)-(5 digits)

We can represent this as the following expressions:\
`([0-9]{2})-([A-Z]{2})-([0-9]{5})-([0-9]{2})`\
`([0-9]{2})-([A-Z]{2})-([0-9]{5})`
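
These two expressions also happen to be valid Python regular expressions, so you can sanity-check them locally with the standard `re` module before submitting a job. This is only a sketch: the service may match text differently from `re.fullmatch`, and the helper name `matching_pattern` is hypothetical.

```python
import re

patterns = [
    "([0-9]{2})-([A-Z]{2})-([0-9]{5})-([0-9]{2})",
    "([0-9]{2})-([A-Z]{2})-([0-9]{5})",
]
compiled = [re.compile(p) for p in patterns]


def matching_pattern(tag: str):
    """Return the index of the first pattern matching the whole tag, or None."""
    for index, regex in enumerate(compiled):
        if regex.fullmatch(tag):
            return index
    return None


# The last tag has only four digits in its third group, so nothing matches.
for tag in ["23-TE-96114-01", "23-TE-96114", "23-XE-9612"]:
    print(tag, matching_pattern(tag))
```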

In [16]:
# Define a regex-like pattern to search for
patterns = ["([0-9]{2})-([A-Z]{2})-([0-9]{5})-([0-9]{2})", "([0-9]{2})-([A-Z]{2})-([0-9]{5})"]

NOTE: This method is not yet supported in the SDK, so we are calling the API directly.

In [17]:
response = client.post(f"/api/playground/projects/{project}/context/pnid/extractpattern", 
                       json={"fileId": f.id, "patterns": patterns}).json()

In [18]:
api_url = f"/api/playground/projects/{project}/context/pnid/extractpattern"
get_response = client.get(f"{api_url}/{response['jobId']}").json()

while get_response["status"] not in ["Completed", "Failed"]:
    # Report the current status, wait, then poll the job again
    print(get_response["status"])
    time.sleep(5)
    get_response = client.get(f"{api_url}/{response['jobId']}").json()

Running
Running
Running
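
The same polling loop is needed for every job that is started through the raw API, so it may be worth wrapping it in a helper. The sketch below assumes only what the loop above shows, namely that the job response has a `status` field that ends in `Completed` or `Failed`. The name `wait_for_job`, the timeout and the fake responses are illustrative.

```python
import time
from typing import Callable, Dict


def wait_for_job(fetch: Callable[[], Dict], interval: float = 5.0, timeout: float = 300.0) -> Dict:
    """Call `fetch` until the job status is Completed or Failed, or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        response = fetch()
        if response["status"] in ("Completed", "Failed"):
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {response['status']} after {timeout} s")
        time.sleep(interval)


# Stand-in for the API call, so the sketch runs without a CDF connection.
fake_responses = iter([
    {"status": "Queued"},
    {"status": "Running"},
    {"status": "Completed", "items": []},
])
result = wait_for_job(lambda: next(fake_responses), interval=0.01)
print(result["status"])
```

In this notebook, `fetch` could be `lambda: client.get(f"{api_url}/{response['jobId']}").json()`. The timeout keeps a stuck job from blocking the notebook indefinitely.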


In [19]:
print(f"Detected {len(get_response['items'])} entities from pattern {patterns}")

Detected 84 entities from pattern ['([0-9]{2})-([A-Z]{2})-([0-9]{5})-([0-9]{2})', '([0-9]{2})-([A-Z]{2})-([0-9]{5})']


In [20]:
# Display detected patterns in file
response = client.post(f"/api/playground/projects/{project}/context/pnid/convert", {"fileId": f.id, "items": get_response['items']}).json()

In [21]:
api_url = f"/api/playground/projects/{project}/context/pnid/convert"
get_response = client.get(f"{api_url}/{response['jobId']}").json()

while get_response["status"] not in ["Completed", "Failed"]:
    # Report the current status, wait, then poll the job again
    print(get_response["status"])
    time.sleep(5)
    get_response = client.get(f"{api_url}/{response['jobId']}").json()

Running
Running


In [22]:
Image(url=get_response["svgUrl"])

## Get raw OCR data from PnID file
The Context API stores the OCR data for each processed PnID, and this data can be retrieved for further processing.

NOTE: This method is not yet supported in the SDK, so we are calling the API directly.

In [23]:
ocr_response = client.post(f"/api/playground/projects/{project}/context/pnid/ocr", {"fileId": f.id}).json()

In [24]:
ocr_data = ocr_response["items"][0]["annotations"]
print(f"Extracted {len(ocr_data)} raw text elements from document")
ocr_data[0:5]

Extracted 983 raw text elements from document


[{'boundingBox': {'xMax': 0.4521458795083619,
   'xMin': 0.44791456780173283,
   'yMax': 0.035907666001709894,
   'yMin': 0.02707324023938444},
  'confidence': 1.0,
  'text': '4',
  'type': None},
 {'boundingBox': {'xMax': 0.5673987507555914,
   'xMin': 0.5627644569816643,
   'yMax': 0.036477628954117984,
   'yMin': 0.02707324023938444},
  'confidence': 1.0,
  'text': '5',
  'type': None},
 {'boundingBox': {'xMax': 0.7977030022164013,
   'xMin': 0.7934716905097723,
   'yMax': 0.035907666001709894,
   'yMin': 0.02707324023938444},
  'confidence': 1.0,
  'text': '7',
  'type': None},
 {'boundingBox': {'xMax': 0.9129558734636309,
   'xMin': 0.9083215796897038,
   'yMax': 0.03619264747791394,
   'yMin': 0.02707324023938444},
  'confidence': 1.0,
  'text': '8',
  'type': None},
 {'boundingBox': {'xMax': 0.05500705218617771,
   'xMin': 0.04493250050372759,
   'yMax': 0.04274722143060701,
   'yMin': 0.03562268452550584},
  'confidence': 1.0,
  'text': 'ME',
  'type': None}]
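
The raw OCR elements include everything on the sheet, such as the border digits and the `ME` fragment shown above. One possible way to narrow them down is to keep only confident elements whose text looks like a tag. The sketch below uses made-up annotations and assumes a whole tag arrives as a single OCR element, which may not hold for every drawing. The names `TAG_LIKE` and `tag_like` are hypothetical.

```python
import re
from typing import List

# Tag shape taken from the examples in this notebook: 23-TE-96114 or 23-TE-96114-01.
TAG_LIKE = re.compile(r"[0-9]{2}-[A-Z]{2}-[0-9]{5}(-[0-9]{2})?")


def tag_like(annotations: List[dict], min_confidence: float = 0.9) -> List[str]:
    """Return texts that look like tags and meet the confidence threshold."""
    return [
        a["text"]
        for a in annotations
        if a["confidence"] >= min_confidence and TAG_LIKE.fullmatch(a["text"])
    ]


# Made-up annotations in the same shape as the OCR output above.
sample = [
    {"text": "4", "confidence": 1.0},
    {"text": "ME", "confidence": 1.0},
    {"text": "23-TE-96114-01", "confidence": 1.0},
    {"text": "23-XE-96124", "confidence": 0.5},
]
print(tag_like(sample))
```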

In [25]:
# Display OCR data in file
response = client.post(f"/api/playground/projects/{project}/context/pnid/convert", {"fileId": f.id, "items": ocr_data}).json()

In [26]:
api_url = f"/api/playground/projects/{project}/context/pnid/convert"
get_response = client.get(f"{api_url}/{response['jobId']}").json()

while get_response["status"] not in ["Completed", "Failed"]:
    # Report the current status, wait, then poll the job again
    print(get_response["status"])
    time.sleep(5)
    get_response = client.get(f"{api_url}/{response['jobId']}").json()

Queued
Queued
Queued
Queued
Queued
Queued
Queued
Queued
Queued
Queued
Queued
Queued
Queued
Queued
Queued
Queued
Running
Running


In [27]:
Image(url=get_response["svgUrl"])

## ALPHA: Detect common objects from PnIDs using SDK
The SDK also supports a basic form of object detection for P&IDs in single-page PDF format, i.e. it can identify shapes representing common components in a PnID. We currently support 20 classes of objects: valve, indicator, shared indicator, ball valve, diamond, tag, triangle, square with diagonal line, pump or centrifuge, flange, reducer, rotameter, slope, cloud, heat exchanger, note, logo, table, spectacle blind, and object.

This function takes a single `file_id` parameter and returns a list of detected objects. For API usage, please refer to `pnidobjects_demo_api.ipynb`.

In [28]:
job = client.pnid_object_detection.find_objects(file_id=f.id)

In [29]:
results = job.result
detected_objects = results["items"]
print(f"Found {len(detected_objects)} objects in PnID")
print(f"Types found: {list({de['type'] for de in detected_objects})}")

Found 208 objects in PnID
Types found: ['reducer', 'logo', 'tag', 'triangle', 'valve', 'shared indicator', 'indicator', 'note', 'pump or centrifuge', 'flange', 'object']
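
Beyond listing the distinct types, a per-type count gives a quick overview of what was detected. The sketch below uses `collections.Counter` on made-up detections with the same `type` and `score` fields as the results. The helper name `count_types` and the score threshold are illustrative assumptions.

```python
from collections import Counter
from typing import List


def count_types(objects: List[dict], min_score: float = 0.0) -> Counter:
    """Count detected objects per type, ignoring those below `min_score`."""
    return Counter(o["type"] for o in objects if o["score"] >= min_score)


# Made-up detections in the same shape as the results above.
sample = [
    {"type": "indicator", "score": 0.9999},
    {"type": "indicator", "score": 0.9998},
    {"type": "shared indicator", "score": 0.9998},
    {"type": "valve", "score": 0.42},
]
print(count_types(sample, min_score=0.5).most_common())
```

With the notebook's data you could call `count_types(detected_objects)` instead of using the sample list.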


In [30]:
# List the first 10 detected objects for further processing

for de in detected_objects[0:10]:
    print(f'{de["type"]: <20}', f'score: {de["score"]: 15f}', de["boundingBox"])

indicator            score:        0.999892 {'xMax': 0.3947058823529412, 'xMin': 0.37970588235294117, 'yMax': 0.7079866888519135, 'yMin': 0.6863560732113144}
indicator            score:        0.999878 {'xMax': 0.7647058823529411, 'xMin': 0.7502941176470588, 'yMax': 0.46256239600665555, 'yMin': 0.44093178036605657}
indicator            score:        0.999871 {'xMax': 0.5367647058823529, 'xMin': 0.5220588235294118, 'yMax': 0.5349417637271214, 'yMin': 0.5137271214642263}
shared indicator     score:        0.999848 {'xMax': 0.43470588235294116, 'xMin': 0.4188235294117647, 'yMax': 0.15931780366056572, 'yMin': 0.13643926788685523}
shared indicator     score:        0.999844 {'xMax': 0.5370588235294118, 'xMin': 0.5220588235294118, 'yMax': 0.5973377703826955, 'yMin': 0.5752911813643927}
shared indicator     score:        0.999830 {'xMax': 0.3788235294117647, 'xMin': 0.36323529411764705, 'yMax': 0.19841930116472545, 'yMin': 0.17637271214642264}
shared indicator     score:        0.999823 {'xMa