# Cognite contextualization lab 1 - Entity Matching
Estimated time: 30 minutes.

## Get access to CDF
We assume you have some basic knowledge of CDF and the SDK. If not, please follow the 'lab' tutorials first.

Unlike the previous tutorials, for this lab you *must* have access to a Cognite project (tenant). You can apply for one [here](https://cognitedata.atlassian.net/wiki/spaces/CSF/pages/1113523070/Creating+a+new+tenant).

## Import modules
We need to import a few Python modules to interact with CDF. We will use the Python SDK with Experimental Extensions; the SDK object we create from it is referred to below as the `client`.

In [None]:
from getpass import getpass
from cognite.experimental import CogniteClient

To get access to your project, replace `yourproject` with your project name in the next cell. If you are using a production tenant, remove the `server="greenfield"` argument.
When you create the `CogniteClient` below, `getpass` will ask for your API key in a separate password field. Paste it in and press Shift+Enter.

In [None]:
client = CogniteClient(
    api_key=getpass("Please enter API key: "),
    project="yourproject",
    server="greenfield"
)
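
If you switch between greenfield and production tenants, the optional `server` argument can be handled by a small helper. The sketch below is only an illustration: `client_kwargs` is a hypothetical helper name, and it merely builds the keyword arguments shown in the cell above without contacting CDF.

```python
def client_kwargs(api_key, project, server=None):
    """Build keyword arguments for CogniteClient (hypothetical helper).

    `server` is left out entirely when it is None, which mirrors removing
    server="greenfield" for a production tenant.
    """
    kwargs = {"api_key": api_key, "project": project}
    if server is not None:
        kwargs["server"] = server
    return kwargs


# Greenfield tenant: the server argument is included.
greenfield = client_kwargs("dummy-key", "yourproject", server="greenfield")
# Production tenant: no server argument at all.
production = client_kwargs("dummy-key", "yourproject")

print(sorted(greenfield))
print(sorted(production))
```

The resulting dictionary could then be unpacked into the constructor used above, for example `CogniteClient(**client_kwargs(getpass("Please enter API key: "), "yourproject"))`.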

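Let's define some example entities: `entities_from` holds the items we want to match, and `entities_to` holds the candidates they are matched against.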

In [None]:
entities_to = [
    {"id": 1, "name": "ENTITY_MATCHING_FUN_314"},
    {"id": 2, "name": "ENTITY_MATCHING_INTERESTING_42"},
    {"id": 3, "name": "ENTITY_MATCHING_CONFUSING_123"},
    {"id": 4, "name": "CONFUSING_NON_123_MATCHING"},
]
entities_from = [{"id": 100, "name": "INTERESTING_SENSOR_42"},
                 {"id": 200, "name": "SENSOR_CONFUSING_123"},
                 {"id": 300, "name": "FUN_314_SENSOR_SHOULD_GIVE_NON_MATCH_123"}]

## Simple Entity Matching

In [None]:
em = client.entity_matching.fit_ml(match_from=entities_from, match_to=entities_to)
em

## Predicting with the model
You can run predict on the model object; it will wait for the model to be ready and then submit the predict job. Likewise, asking for the result will wait for the job to finish. Because it only waits when necessary, you can work interactively and also easily submit a large number of jobs to be processed in parallel.
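
The "wait only when necessary" behaviour can be pictured with a small stand-alone sketch. This is not the SDK's implementation: `LazyJob` and `slow_match` are hypothetical names, and a thread pool with a short sleep stands in for the remote service. Submitting returns immediately, and only reading `result` blocks.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class LazyJob:
    """Hypothetical stand-in for a job object whose result blocks on access."""

    def __init__(self, future):
        self._future = future

    @property
    def result(self):
        # Blocks only if the background work has not finished yet.
        return self._future.result()


def slow_match(name):
    """Pretend remote call: sleeps briefly, then returns a fake match."""
    time.sleep(0.1)
    return {"input": name, "match": name.lower()}


names = ["INTERESTING_SENSOR_42", "SENSOR_CONFUSING_123"]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Submitting does not wait, so all jobs run concurrently.
    jobs = [LazyJob(pool.submit(slow_match, name)) for name in names]
    # Waiting happens here, on first access to each result.
    results = [job.result for job in jobs]

print(results)
```

Because the jobs are all submitted before any result is read, the total wait is roughly that of the slowest job rather than the sum of all of them.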

In [None]:
job = em.predict_ml(match_from=entities_from, num_matches=2)
job.result

As you can see, `SENSOR_CONFUSING_123` and `FUN_314_SENSOR_SHOULD_GIVE_NON_MATCH_123` each have two matches with identical scores. Let's try a different model type, `bigram`, which favors matches whose tokens appear in the same adjacent pairs.

In [None]:
em = client.entity_matching.fit_ml(match_from=entities_from, match_to=entities_to, model_type='bigram')
job = em.predict_ml(num_matches=2)
job.result

Now the top match is as expected! There are a number of different models and classifiers, but for many applications, the 'bigram' model strikes a good balance between performance and accuracy.
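
To build some intuition for why pairing tokens breaks the tie, here is a self-contained sketch that scores candidates first by counting shared tokens and then by counting shared adjacent token pairs. It is a deliberate simplification and not the model that CDF trains: the helpers `tokens`, `bigrams`, and `rank` are hypothetical, and raw overlap counts stand in for the learned scores.

```python
entities_to = [
    {"id": 1, "name": "ENTITY_MATCHING_FUN_314"},
    {"id": 2, "name": "ENTITY_MATCHING_INTERESTING_42"},
    {"id": 3, "name": "ENTITY_MATCHING_CONFUSING_123"},
    {"id": 4, "name": "CONFUSING_NON_123_MATCHING"},
]
entities_from = [
    {"id": 100, "name": "INTERESTING_SENSOR_42"},
    {"id": 200, "name": "SENSOR_CONFUSING_123"},
    {"id": 300, "name": "FUN_314_SENSOR_SHOULD_GIVE_NON_MATCH_123"},
]


def tokens(name):
    """Split a name into its underscore-separated tokens."""
    return name.split("_")


def bigrams(name):
    """Return the adjacent token pairs of a name, in order."""
    toks = tokens(name)
    return list(zip(toks, toks[1:]))


def rank(source, targets, features):
    """Rank targets by how many features they share with the source.

    Returns (overlap_count, target_id) pairs, highest overlap first.
    Ties keep the original order of `targets` (sorted() is stable).
    """
    src = set(features(source["name"]))
    scored = [(len(src & set(features(t["name"]))), t["id"]) for t in targets]
    return sorted(scored, key=lambda pair: -pair[0])


# Compare the two feature types for the two ambiguous sensors.
for source in entities_from[1:]:
    print(source["name"])
    print("  tokens :", rank(source, entities_to, tokens)[:2])
    print("  bigrams:", rank(source, entities_to, bigrams)[:2])
```

With plain token counts, each of the two sensors ties between two candidates. With pair counts, only one candidate shares an adjacent pair with each sensor, so a single best match remains.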

### Matching CDF resources directly
You can also run entity matching directly on CDF resources, specify which fields to match on, and fill in missing fields with an empty string. This assumes you have some interesting assets and time series in your tenant.

In [None]:
assets = client.assets.list(limit=10)
time_series = client.time_series.list(limit=10)
em = client.entity_matching.fit_ml(
    match_from=time_series, match_to=assets,
    keys_from_to=[('externalId', 'externalId'), ('description', 'description')],
    model_type='bigram', complete_missing=True,
)
em.predict_ml().result
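
As a rough picture of what selecting fields and completing missing values amounts to, the sketch below pulls the chosen fields out of plain dictionaries and substitutes an empty string where a field is absent or `None`. It is an illustration under stated assumptions, not the SDK's code: `select_fields` and the sample records are hypothetical, and raising an error when completion is switched off is a choice made for this sketch only.

```python
def select_fields(record, fields, complete_missing=True):
    """Keep only `fields` from `record` (hypothetical helper).

    Missing or None values become "" when complete_missing is True.
    Otherwise this sketch raises KeyError (an assumption, not SDK behaviour).
    """
    selected = {}
    for field in fields:
        value = record.get(field)
        if value is None:
            if not complete_missing:
                raise KeyError(f"missing field: {field}")
            value = ""
        selected[field] = value
    return selected


# Made-up records standing in for time series fetched from a tenant.
time_series_like = [
    {"externalId": "ts_pump_1", "description": "Pump 1 pressure", "unit": "bar"},
    {"externalId": "ts_pump_2"},  # no description
]
from_fields = ["externalId", "description"]

prepared = [select_fields(record, from_fields) for record in time_series_like]
print(prepared)
```

The second record ends up with an empty `description`, so it can still be compared field by field with records that do have one.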