# Entity Matching with SDK-experimental demo

This notebook uses a small dummy data set to demonstrate how to do entity matching using cognite-sdk-python-experimental.

It aims to demonstrate most of the available capabilities, explain when the different parameter combinations are most suitable, and explain (in some detail) what happens in the background.

## Get access to CDF
We assume you have some basic knowledge of CDF and the SDK. If not, please follow the 'lab' tutorials first.

To do this tutorial you need access to a Cognite project / tenant. You can apply for one here.


## Import modules
We need to import some Python modules in order to interact with CDF. We will use the Python SDK with Experimental Extensions, which we refer to below as a client.

In [3]:
from cognite.client.exceptions import CogniteAPIError
from cognite.experimental import CogniteClient
from getpass import getpass

## Create a client

To get access to your project, replace "yourproject" with your project name in the next cell. 

When you create the CogniteClient below, getpass will ask for your API key in an extra password field. Simply paste it in and press shift+enter.

In [16]:
project = 'yourproject'
api_key = getpass("Please enter API key: ")
client = CogniteClient(project=project,
                       api_key=api_key,
                       client_name="dshub"
                      )

Please enter API key: ········


## Create dummy data

This tutorial uses a small dummy data set created below to demonstrate how to do entity matching using Python SDK with Experimental Extensions.

In [12]:
match_from = [
    {"id":0, "name" : "KKL_21AA1019CA.PV", "description": "correct"}, 
    {"id":1, "name" : "KKL_13FV1234BU.VW", "description": "ok"}
]
match_to = [
    {"id":0, "name" : "21AA1019CA", "description": "correct"}, 
    {"id":1, "name" : "21AA1019CA", "description": "wrong"}, 
    {"id":2, "name" : "13FV1234BU"},
    {"id":3, "name" : "13FV1234BU", "description": "ok"}
]
true_matches = [(0,0)]

## Fit a supervised ml-model and predict for the same data

The supervised model calculates one or more similarity measures between match-to and match-from items. Then it uses these calculated similarity measures as features and fits a classification model using the labeled data.

Note: before calculating the similarity measures and training a model, a set of candidate matches is selected. A pair of match-to and match-from items is considered a candidate if the two items have at least one token in common. Only the candidate match-from, match-to combinations are used in training. This is done to reduce computing time: calculating similarity measures for all possible combinations can be extremely heavy (10,000 time series and 30,000 assets -> 300,000,000 combinations).
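
The candidate selection described above can be sketched locally. This is only an illustration under stated assumptions: the `tokenize` helper (splitting names into runs of letters and runs of digits) and the `select_candidates` function are hypothetical and are not part of the SDK; the service's own tokenizer may differ.

```python
import re
from collections import defaultdict


def tokenize(name):
    """Split a name into runs of letters and runs of digits (assumed tokenizer)."""
    return re.findall(r"[A-Za-z]+|[0-9]+", name)


def select_candidates(match_from, match_to):
    """Return sorted (from_id, to_id) pairs whose names share at least one token."""
    # Inverted index: token -> ids of the match_to items that contain it.
    index = defaultdict(set)
    for target in match_to:
        for token in tokenize(target["name"]):
            index[token].add(target["id"])

    # Look up each match_from token instead of comparing every possible pair.
    candidates = set()
    for source in match_from:
        for token in tokenize(source["name"]):
            for target_id in index.get(token, ()):
                candidates.add((source["id"], target_id))
    return sorted(candidates)


demo_from = [
    {"id": 0, "name": "KKL_21AA1019CA.PV"},
    {"id": 1, "name": "KKL_13FV1234BU.VW"},
]
demo_to = [
    {"id": 0, "name": "21AA1019CA"},
    {"id": 1, "name": "21AA1019CA"},
    {"id": 2, "name": "13FV1234BU"},
    {"id": 3, "name": "13FV1234BU"},
]

demo_candidates = select_candidates(demo_from, demo_to)
print(demo_candidates)  # [(0, 0), (0, 1), (1, 2), (1, 3)]
print(len(demo_candidates), "of", len(demo_from) * len(demo_to), "possible pairs")
```

With the dummy data only half of the possible pairs are candidates. On realistic data the reduction is what makes the similarity calculation feasible.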

In [13]:
model = client.entity_matching.fit(match_from = match_from,
                                   match_to = match_to,
                                   true_matches = true_matches
)

## Predict

When `predict` is called without any data, predictions are made on the training data.

`num_matches` determines the number of matches to return for each `matchFrom` item. The default is 1.

In [7]:
job = model.predict(num_matches = 2)
matches = job.result
matches["items"]

[{'matchFrom': {'description': 'correct',
   'id': 0,
   'name': 'KKL_21AA1019CA.PV'},
  'matches': [{'matchTo': {'description': 'correct',
     'id': 0,
     'name': '21AA1019CA'},
    'score': 0.425},
   {'matchTo': {'description': 'wrong', 'id': 1, 'name': '21AA1019CA'},
    'score': 0.425}]},
 {'matchFrom': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'},
  'matches': [{'matchTo': {'id': 2, 'name': '13FV1234BU'}, 'score': 0.425},
   {'matchTo': {'description': 'ok', 'id': 3, 'name': '13FV1234BU'},
    'score': 0.425}]}]

Note: For both `matchFrom` items, the two matches returned have equal scores. Hence, the model is not able to distinguish between the correct and the incorrect match.
The scores are also quite low. When the data set is this small, an unsupervised model makes more sense.
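
Ties like these are easy to miss in a longer result list. The sketch below flags `matchFrom` items whose two best matches have (almost) the same score. The helper `tied_top_matches` and its `tolerance` parameter are hypothetical, and the sample data only mirrors the shape of `matches["items"]`, trimmed to the fields used.

```python
def tied_top_matches(items, tolerance=1e-9):
    """Return ids of matchFrom items whose two best matches score (almost) equally."""
    tied = []
    for item in items:
        scores = sorted((match["score"] for match in item["matches"]), reverse=True)
        if len(scores) >= 2 and abs(scores[0] - scores[1]) <= tolerance:
            tied.append(item["matchFrom"]["id"])
    return tied


# Same structure as matches["items"], reduced to ids and scores.
sample_items = [
    {
        "matchFrom": {"id": 0},
        "matches": [
            {"matchTo": {"id": 0}, "score": 0.425},
            {"matchTo": {"id": 1}, "score": 0.425},
        ],
    },
    {
        "matchFrom": {"id": 1},
        "matches": [
            {"matchTo": {"id": 2}, "score": 0.425},
            {"matchTo": {"id": 3}, "score": 0.425},
        ],
    },
]

print(tied_top_matches(sample_items))  # [0, 1]
```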

## Refit

Refit lets you retrain a model (using the same parameters) with additional labels/true matches. The new `true_matches` (1,3) are added to the `true_matches` list from the original model.

To fit a model using only the (1,3) label, a new model must be trained using `fit`.

In [14]:
model = model.refit(true_matches = [(1,3)])

In [15]:
# Predict on training data
job = model.predict(num_matches = 2)
matches = job.result
matches["items"]

[{'matchFrom': {'description': 'correct',
   'id': 0,
   'name': 'KKL_21AA1019CA.PV'},
  'matches': [{'matchTo': {'description': 'correct',
     'id': 0,
     'name': '21AA1019CA'},
    'score': 0.425},
   {'matchTo': {'description': 'wrong', 'id': 1, 'name': '21AA1019CA'},
    'score': 0.425}]},
 {'matchFrom': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'},
  'matches': [{'matchTo': {'id': 2, 'name': '13FV1234BU'}, 'score': 0.425},
   {'matchTo': {'description': 'ok', 'id': 3, 'name': '13FV1234BU'},
    'score': 0.425}]}]

Note: In this example the results are the same, because the new true match follows exactly the same pattern as the original one.

## Fit unsupervised model

If there are no `true_matches` included in the `fit` call, an unsupervised model is trained.

As for a supervised model, candidates are selected and similarity measures between the candidates are calculated. However, instead of training a classification model, the average of the similarity measures is calculated and returned as the score.

When there are no or few true matches (labeled data), an unsupervised model is preferred.

In [32]:
model = client.entity_matching.fit(match_from = match_from,
                                   match_to = match_to
)

In [33]:
# Predict on the training data
job = model.predict(num_matches = 2)
matches = job.result
matches["items"]  

[{'matchFrom': {'description': 'correct',
   'id': 0,
   'name': 'KKL_21AA1019CA.PV'},
  'matches': [{'matchTo': {'description': 'correct',
     'id': 0,
     'name': '21AA1019CA'},
    'score': 1.0},
   {'matchTo': {'description': 'wrong', 'id': 1, 'name': '21AA1019CA'},
    'score': 1.0}]},
 {'matchFrom': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'},
  'matches': [{'matchTo': {'id': 2, 'name': '13FV1234BU'}, 'score': 1.0},
   {'matchTo': {'description': 'ok', 'id': 3, 'name': '13FV1234BU'},
    'score': 1.0}]}]

Note: The scores have increased, but the model is still not able to distinguish between the correct and incorrect match. 

### Add additional keys
By default, only `name` in `match_from` and `name` in `match_to` are used to calculate similarity measures. The `keys_from_to` parameter lets you specify all combinations of fields in `match_from` and `match_to` that should be used to calculate features.

In this example, it looks like also comparing the `description` fields of `match_from` and `match_to` will improve the model.

Note: Calculating similarity measures can be time consuming. Therefore, avoid adding `keys_from_to` combinations that add little or no information to the model.

In [13]:
try:
    model = client.entity_matching.fit(match_from = match_from,
                                       match_to = match_to,
                                       keys_from_to = [("name", "name"), ("description", "description")]
                                      )
except CogniteAPIError as error:
    print(error)

Error in input data. Specified keys to match from and to are not in all items: Missing keys in some objects: missing key(s) in matchFrom: set() and in matchTo {'description'}. | code: 400 | X-Request-ID: ce0f5278-6eef-9a75-93b4-846a7da10fc0


The request results in an error because one of the items in `match_to` is missing `description`.
If `complete_missing` is set to `True`, missing values are replaced by empty strings.
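
The effect of `complete_missing` can be pictured with a small local sketch. The helper `complete_missing_keys` is hypothetical and only mimics the described behaviour (missing values become empty strings); the actual completion happens on the service side.

```python
def complete_missing_keys(items, keys):
    """Return copies of the items in which every key in `keys` exists, defaulting to ''."""
    defaults = {key: "" for key in keys}
    # Values already present in an item override the empty-string defaults.
    return [{**defaults, **item} for item in items]


raw_match_to = [
    {"id": 0, "name": "21AA1019CA", "description": "correct"},
    {"id": 1, "name": "21AA1019CA", "description": "wrong"},
    {"id": 2, "name": "13FV1234BU"},
    {"id": 3, "name": "13FV1234BU", "description": "ok"},
]

completed_match_to = complete_missing_keys(raw_match_to, ["name", "description"])
print(completed_match_to[2])  # {'name': '13FV1234BU', 'description': '', 'id': 2}
```

The original list is left untouched, so the same input can still be sent to the API with `complete_missing = True`.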

### Add `complete_missing`

In [34]:
model = client.entity_matching.fit(match_from = match_from,
                                   match_to = match_to,
                                   keys_from_to = [("name", "name"), ("description", "description")],
                                   complete_missing = True
)

In [35]:
# Predict on training data
job = model.predict(num_matches = 2)
matches = job.result
matches["items"]   

[{'matchFrom': {'description': 'correct',
   'id': 0,
   'name': 'KKL_21AA1019CA.PV'},
  'matches': [{'matchTo': {'description': 'correct',
     'id': 0,
     'name': '21AA1019CA'},
    'score': 1.0},
   {'matchTo': {'description': 'wrong', 'id': 1, 'name': '21AA1019CA'},
    'score': 0.5}]},
 {'matchFrom': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'},
  'matches': [{'matchTo': {'description': 'ok', 'id': 3, 'name': '13FV1234BU'},
    'score': 1.0},
   {'matchTo': {'description': '', 'id': 2, 'name': '13FV1234BU'},
    'score': 0.5}]}]

Note: The model now gives the correct matches a score of 1 and the incorrect matches a score of 0.5.
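
One way to build intuition for these numbers is to average a similarity measure over the `keys_from_to` pairs. The sketch below uses a toy stand-in measure (the fraction of `match_to` tokens that also occur in the `match_from` value). The functions `token_overlap` and `average_score` are hypothetical; the measures used by the service are not described here and will generally differ.

```python
import re


def tokenize(text):
    """Split text into runs of letters and runs of digits (assumed tokenizer)."""
    return re.findall(r"[A-Za-z]+|[0-9]+", text)


def token_overlap(from_text, to_text):
    """Toy similarity: fraction of match_to tokens found in the match_from text."""
    to_tokens = tokenize(to_text)
    if not to_tokens:
        return 0.0  # an empty (completed) value shares nothing
    from_tokens = set(tokenize(from_text))
    return sum(token in from_tokens for token in to_tokens) / len(to_tokens)


def average_score(source, target, keys_from_to):
    """Average the toy similarity over all (from_key, to_key) pairs."""
    similarities = [
        token_overlap(source.get(from_key, ""), target.get(to_key, ""))
        for from_key, to_key in keys_from_to
    ]
    return sum(similarities) / len(similarities)


source = {"id": 0, "name": "KKL_21AA1019CA.PV", "description": "correct"}
right = {"id": 0, "name": "21AA1019CA", "description": "correct"}
wrong = {"id": 1, "name": "21AA1019CA", "description": "wrong"}

name_only = [("name", "name")]
name_and_description = [("name", "name"), ("description", "description")]

print(average_score(source, right, name_only), average_score(source, wrong, name_only))
# 1.0 1.0
print(average_score(source, right, name_and_description),
      average_score(source, wrong, name_and_description))
# 1.0 0.5
```

With only the name pair both candidates tie. Adding the description pair halves the score of the candidate whose description does not match.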

## Additional match_to items

The data below is the same as in the previous examples, except that there are two new items in `match_to`.
Ids 10 and 13 are similar to 0 and 3 respectively, but the first letter combinations ("AA" and "FV") are replaced by the prefix of the `match_from` items ("KKL").

We will now see how this leads to difficulties if we use the default `feature_type` ("simple").

In [36]:
match_from = [
    {"id":0, "name" : "KKL_21AA1019CA.PV", "description": "correct"}, 
    {"id":1, "name" : "KKL_13FV1234BU.VW", "description": "ok"}
]
match_to = [
    {"id":0,  "name" : "21AA1019CA", "description": "correct"},
    {"id":10, "name" : "21KKL1019CA", "description": "correct"},
    {"id":1,  "name" : "21AA1019CA", "description": "wrong"}, 
    {"id":2,  "name" : "13FV1234BU"},
    {"id":3,  "name" : "13FV1234BU", "description": "ok"},
    {"id":13, "name" : "13KKL1234BU", "description": "ok"}
]
true_matches = [(0,0), (1,3)]

In [37]:
model = client.entity_matching.fit(match_from = match_from,
                                   match_to = match_to,
                                   keys_from_to = [("name", "name"), ("description", "description")],
                                   complete_missing = True
) 

In [38]:
# Predict on training data
job = model.predict(num_matches = 2)
matches = job.result
matches["items"]  

[{'matchFrom': {'description': 'correct',
   'id': 0,
   'name': 'KKL_21AA1019CA.PV'},
  'matches': [{'matchTo': {'description': 'correct',
     'id': 0,
     'name': '21AA1019CA'},
    'score': 0.9472135954999579},
   {'matchTo': {'description': 'correct', 'id': 10, 'name': '21KKL1019CA'},
    'score': 0.9472135954999579}]},
 {'matchFrom': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'},
  'matches': [{'matchTo': {'description': 'ok', 'id': 3, 'name': '13FV1234BU'},
    'score': 0.9472135954999579},
   {'matchTo': {'description': 'ok', 'id': 13, 'name': '13KKL1234BU'},
    'score': 0.9472135954999579}]}]

Note: The new `match_to` items have the same scores as the correct matches.
This is because with `feature_type = "simple"` only the number of matching tokens is considered.

For the `match_to` item with id 0, the tokens `21`, `AA`, `1019` and `CA` each match a token in the `match_from` item with id 0.
For the `match_to` item with id 10, the tokens `21`, `KKL`, `1019` and `CA` each match a token in the `match_from` item with id 0. Thus, the same number of tokens match.

The model does not take into account that the `match_to` item with id 0 has more and longer contiguous sequences of matching tokens.

The "bigram" `feature_type` does somewhat account for longer contiguous sequences of tokens. In addition to counting the number of matching tokens, it also looks at the number of matching bigrams, that is, the number of matches when each pair of adjacent tokens is combined into one unit.
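
The token and bigram counts discussed above can be reproduced with a few lines of Python. This is a sketch under assumptions: `tokenize`, `bigrams` and `count_shared` are hypothetical helpers, and the tokenizer (runs of letters and runs of digits) is inferred from the tokens listed in the text rather than taken from the service.

```python
import re


def tokenize(name):
    """Split a name into runs of letters and runs of digits (assumed tokenizer)."""
    return re.findall(r"[A-Za-z]+|[0-9]+", name)


def bigrams(tokens):
    """Pair each token with its right-hand neighbour."""
    return list(zip(tokens, tokens[1:]))


def count_shared(from_units, to_units):
    """Count how many match_to units also occur among the match_from units."""
    from_set = set(from_units)
    return sum(unit in from_set for unit in to_units)


source_tokens = tokenize("KKL_21AA1019CA.PV")

for name in ("21AA1019CA", "21KKL1019CA"):
    target_tokens = tokenize(name)
    shared_tokens = count_shared(source_tokens, target_tokens)
    shared_bigrams = count_shared(bigrams(source_tokens), bigrams(target_tokens))
    print(name, shared_tokens, shared_bigrams)
# 21AA1019CA 4 3
# 21KKL1019CA 4 1
```

Both names share four tokens with the `match_from` name, but only `21AA1019CA` keeps them in the same order, which shows up as three shared bigrams instead of one.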

### Change `feature_type` 

In [39]:
model = client.entity_matching.fit(match_from = match_from,
                                   match_to = match_to,
                                   keys_from_to = [("name", "name"), ("description", "description")],
                                   complete_missing = True,
                                   feature_type = "bigram"
)

In [40]:
# Predict on training data
job = model.predict(num_matches = 2)
matches = job.result
matches["items"] 

[{'matchFrom': {'description': 'correct',
   'id': 0,
   'name': 'KKL_21AA1019CA.PV'},
  'matches': [{'matchTo': {'description': 'correct',
     'id': 0,
     'name': '21AA1019CA'},
    'score': 0.9574603844233502},
   {'matchTo': {'description': 'correct', 'id': 10, 'name': '21KKL1019CA'},
    'score': 0.7236067977499789}]},
 {'matchFrom': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'},
  'matches': [{'matchTo': {'description': 'ok', 'id': 3, 'name': '13FV1234BU'},
    'score': 0.9574603844233502},
   {'matchTo': {'description': 'ok', 'id': 13, 'name': '13KKL1234BU'},
    'score': 0.7236067977499789}]}]

Note: The model gives a higher score to the correct matches.

## Predict on new data

To predict on new (unseen) data, simply pass this data in the `predict` call.

In [41]:
match_from = [
    {"id":100, "name" : "KKL_44AB45", "description": "ok"}
]
match_to = [
    {"id":100,  "name" : "44AB45", "description": "ok"},
    {"id":101, "name" : "44AB45", "description": "ok12"},
    {"id":102,  "name" : "44AB45"}
]

job = model.predict(match_from = match_from,
                    match_to = match_to,
                    num_matches = 3,
                    complete_missing = True)
matches = job.result
matches["items"] 

[{'matchFrom': {'description': 'ok', 'id': 100, 'name': 'KKL_44AB45'},
  'matches': [{'matchTo': {'description': 'ok', 'id': 100, 'name': '44AB45'},
    'score': 1.0},
   {'matchTo': {'description': 'ok12', 'id': 101, 'name': '44AB45'},
    'score': 0.8211142625940433},
   {'matchTo': {'description': '', 'id': 102, 'name': '44AB45'},
    'score': 0.5}]}]
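
When several matches are returned per item, it can be convenient to keep only those above a score threshold. The sketch below flattens a result of the same shape into `(from_id, to_id, score)` triples. The helper `confident_matches` and the threshold value are hypothetical; a suitable cut-off depends on your data.

```python
def confident_matches(items, min_score=0.8):
    """Flatten items into (from_id, to_id, score) triples with score >= min_score."""
    return [
        (item["matchFrom"]["id"], match["matchTo"]["id"], match["score"])
        for item in items
        for match in item["matches"]
        if match["score"] >= min_score
    ]


# Same structure as matches["items"], reduced to ids and scores.
new_items = [
    {
        "matchFrom": {"id": 100},
        "matches": [
            {"matchTo": {"id": 100}, "score": 1.0},
            {"matchTo": {"id": 101}, "score": 0.8211142625940433},
            {"matchTo": {"id": 102}, "score": 0.5},
        ],
    }
]

print(confident_matches(new_items, min_score=0.9))  # [(100, 100, 1.0)]
```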

## Get model info

If you have a model_id and want to know which parameters you used when training the model, use the `retrieve` method.

In [42]:
client.entity_matching.retrieve(model_id = model.model_id)

| | value |
|---|---|
| model_id | 2834864391038070 |
| status | Completed |
| request_timestamp | 1597134399047 |
| start_timestamp | 1597134399216 |
| status_timestamp | 1597134399491 |
| feature_type | bigram |
| keys_from_to | [[name, name], [description, description]] |
| model_type | Unsupervised |
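
The timestamps in this output can be made readable with the standard library. The sketch assumes that they are milliseconds since the Unix epoch (the usual CDF convention) and that `status_timestamp` marks the last status change. The `ms_to_utc` helper and the `model_info` dictionary are hypothetical and simply copy the values shown above.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_utc(timestamp_ms):
    """Convert milliseconds since the Unix epoch (assumed unit) to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=timestamp_ms)


# Values copied from the retrieve output above.
model_info = {
    "request_timestamp": 1597134399047,
    "start_timestamp": 1597134399216,
    "status_timestamp": 1597134399491,
}

requested_at = ms_to_utc(model_info["request_timestamp"])
queued_ms = model_info["start_timestamp"] - model_info["request_timestamp"]
running_ms = model_info["status_timestamp"] - model_info["start_timestamp"]

print(requested_at.isoformat())  # 2020-08-11T08:26:39.047000+00:00
print(queued_ms, "ms from request to start")  # 169 ms from request to start
print(running_ms, "ms from start to last status update")  # 275 ms from start to last status update
```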
