<a href="https://colab.research.google.com/github/cognitedata/WiDS-2019/blob/master/WiDS_2019_Cognite_Interact_with_Assets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Classification Methods to Label Industrial Data

## What this notebook will achieve

* Extract data from an oil rig in the North Sea.
* Add more later




## Getting started

* Having a basic understanding of Python concepts will help to understand the process.

* Cognite has released *live* data to the public on the Cognite Data Platform streaming from [Valhall](https://www.akerbp.com/en/our-assets/production/valhall/), one of Aker's oil fields.

* To access the data, generate an API Key on [Open Industrial Data](https://openindustrialdata.com/). Get your key via the Google Access platform. You will be asked to fill out some personal information to generate your personal key.

* Visualize some of the machines (assets) on Valhall with Cognite's [Operational Intelligence](https://opint.cogniteapp.com/publicdata/infographics/-LOHKEJPLvt0eRIZu8mE) dashboard. This data on this page shows is streaming live data from the Valhall oil field located in the North Sea.

* To understand how to interact with the data using the Python SDK ([Docs](https://cognite-sdk-python.readthedocs-hosted.com/en/latest/)) follow along in this notebook.

## Environment Setup

#### Install the Cognite SDK package

In [1]:
!pip install cognite-sdk
!pip install cognite-datastudio
!pip install scikit-learn==0.20.3

Looking in indexes: https://pypi.python.org/simple, https://rebecca.wiborg.seyfarth:****@cognite.jfrog.io/cognite/api/pypi/snakepit/simple
Collecting cognite-sdk
  Downloading cognite_sdk-1.4.13-py3-none-any.whl (120 kB)
[K     |████████████████████████████████| 120 kB 2.4 MB/s eta 0:00:01
Installing collected packages: cognite-sdk
Successfully installed cognite-sdk-1.4.13
Looking in indexes: https://pypi.python.org/simple, https://rebecca.wiborg.seyfarth:****@cognite.jfrog.io/cognite/api/pypi/snakepit/simple


#### Import the required packages

In [26]:
import os
from getpass import getpass

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from cognite.client import CogniteClient
from util.helper_functions import FitCountVectorizer, FitSimilarityEncoder

#### Connect to the Cognite Data Platform
* This client object is how all queries will be sent to the Cognite API to retrieve data.

When prompted for your API key, use the key generated by open industrial data as mentioned in the Getting Started steps.

In [27]:
client = CogniteClient(api_key=getpass("Open Industrial Data API-KEY: "),
                       project="publicdata", client_name="OID_example")

Open Industrial Data API-KEY: ········


  debug=debug,


## Accessing Cognite Data Platform (CDP)
* The CDP organizes digital information about the physical world.
* There are 6 kinds of objects stored on the CDP. Each of these objects in the CDP are labelled with a unique ID. Information regarding a specific Asset, Event, etc are often retrieved using this ID.

  * [Assets](https://doc.cognitedata.com/api/0.5/#tag/Assets) are digital representations of physical objects or groups of objects, and assets are organized into an asset hierarchy. For example, an asset can represent a water pump which is part of a subsystem on an oil platform.
  
  * [Event](https://doc.cognitedata.com/api/0.5/#tag/Events) objects store complex information about multiple assets over a time period. For example, an event can describe two hours of maintenance on a water pump and some associated pipes.
  
  * A [File](https://doc.cognitedata.com/api/0.5/#tag/Files) stores a sequence of bytes connected to one or more assets. For example, a file can contain a piping and instrumentation diagram (P&IDs) showing how multiple assets are connected.
  
  * A [Time Series](https://doc.cognitedata.com/api/0.5/#tag/Time-series) consists of a sequence of data points connected to a single asset. For example: A water pump asset can have a temperature time series that records a data point in units of °C every second.
  
  * [Sequence](https://doc.cognitedata.com/api/0.5/#tag/Sequences) are similar to time series in that they are a key value pair, but rather than using a timestamp as the key, another measurment such as depth could be the key. For example, this is used in practice when drilling and taking measurments at various depths.
  
  * A [3D](https://doc.cognitedata.com/api/0.5/#tag/3D) model is typically built up by a hierarchical structure. This looks very similar to how we organize our internal asset hierarchy. 3D models are visualized via Cognite's dashboards.
  
* It is important to refer back to the [SDK](https://cognite-sdk-python.readthedocs-hosted.com/en/latest/cognite.html) for specific details on arguments on all avaiable methods on how to access these objects.

### Collecting Asset Information

#### Retrieve a list of all Assets

* There are thousands of Assets in the CDP, we can have a look at a few examples.

* This will generate a list of assets from the CDP with no particular filters, this is a random result. Generally we would want to apply filters when retrieving records.


In [4]:
client.assets.list().to_pandas().head()

Unnamed: 0,name,parentId,description,metadata,id,createdTime,lastUpdatedTime,rootId
0,23-TE-96116-04,3117826349444493,VRD - PH 1STSTGGEAR THRUST BRG OUT,"{'ELC_STATUS_ID': '1211', 'RES_ID': '525283', ...",702630644612,0,0,6687602007296940
1,23-TE-96148,8515799768286580,VRD - PH 1STSTG COMP SEAL GAS HTR,"{'ELC_STATUS_ID': '1211', 'RES_ID': '532924', ...",5156972057719,0,0,6687602007296940
2,23-YT-96117-01,3257705896277160,VRD - PH 1STSTGGEAR 1 JOURNBRG DE,"{'ELC_STATUS_ID': '1211', 'RES_ID': '446683', ...",8019487489463,0,0,6687602007296940
3,23-FI-96151,4239585628663887,SOFT TAG VRD - PH 1STSTG PRIM SEAL LEAK DE,"{'ELC_STATUS_ID': '1211', 'SOURCE_DB': 'workma...",9258567430091,0,0,6687602007296940
4,23-LT-92521,2069232457199305,VRD - PH 1STSTGSUCTSCRUBBER LEVEL,"{'ELC_STATUS_ID': '1211', 'RES_ID': '523206', ...",12670864495024,0,0,6687602007296940


#### Decide on which asset we want to explore
* To get started exploring data in the CDP, we must first decide on which Asset we want to gather information from.

* Some asset names may be retrieved from the [Op Int](https://opint.cogniteapp.com/publicdata/infographics/-LOHKEJPLvt0eRIZu8mE) dashboard.

* Here is a screehshot of the [OpInt Dashboard](https://drive.google.com/open?id=1f_7nJaJu5Xgr3Oq09mIZ0KwjBAYbzEUQ) incase the page does not load.

* Some example asset names are:
  * 23-HA-9103
  * 23-PV-92583
  * 23-VG-9101
  
The *fuzzy* search for an asset can be performed as followed


In [5]:
asset_name = "23-HA-9103"
asset_df = client.assets.search(name=asset_name).to_pandas()
asset_df.head()


Unnamed: 0,name,parentId,description,metadata,id,createdTime,lastUpdatedTime,rootId
0,23-HA-9103,2513266419866445,VRD - 1ST STAGE SUCTION COOLER,"{'ELC_STATUS_ID': '1211', 'RES_ID': '531306', ...",2861239574637735,0,0,6687602007296940
1,23-CB-9103A,2137557577165478,VRD - 1ST STAGE COMPRESSOR LUBE OIL FILTER A,"{'ELC_STATUS_ID': '1211', 'RES_ID': '746776', ...",8790823573167638,0,0,6687602007296940
2,23-CB-9103B,2137557577165478,VRD - 1ST STAGE COMPRESSOR LUBE OIL FILTER B,"{'ELC_STATUS_ID': '1211', 'RES_ID': '746776', ...",379497117972793,0,0,6687602007296940
3,23-HA-9107A,2137557577165478,VRD - 1ST STAGE COMPRESSOR LUBE OIL COOLER A,"{'ELC_STATUS_ID': '1211', 'RES_ID': '786890', ...",4965752723543746,0,0,6687602007296940
4,23-HA-9107B,2137557577165478,VRD - 1ST STAGE COMPRESSOR LUBE OIL COOLER B,"{'ELC_STATUS_ID': '1211', 'RES_ID': '786896', ...",6838563873305104,0,0,6687602007296940


#### Get information on the asset of interest

* We can filter the assets to get asset-specific details based on asset_name

* The *get_asset()* interface provides the same information for 1 specific asset based on the provided ID


In [6]:
asset_id = asset_df[asset_df["name"] == asset_name].iloc[0]['id']
asset = client.assets.retrieve(id=asset_id).to_pandas()
asset

Unnamed: 0,value
name,23-HA-9103
parentId,2513266419866445
description,VRD - 1ST STAGE SUCTION COOLER
id,2861239574637735
createdTime,0
lastUpdatedTime,0
rootId,6687602007296940
ELC_STATUS_ID,1211
RES_ID,531306
SOURCE_DB,workmate


#### How do we get Asset relationships?

* The interface *get_asset_subtree()* can be used to retrieve the *children* of an Asset. 

* Each Asset is given various properties, some of the useful ones for this method are:

  * Depth: The number of edges from the parent node
  
  * Description: Includes information such as the platform and type of sensor being monitored
  
We will generate a list of all children of the main asset of interest. This is done by specifying a depth of 1.

In [7]:
subtree_df = client.assets.retrieve_subtree(id=asset_id, depth=1).to_pandas()
subtree_df.head()

Unnamed: 0,name,parentId,description,metadata,id,createdTime,lastUpdatedTime,rootId
0,23-HA-9103,2513266419866445,VRD - 1ST STAGE SUCTION COOLER,"{'ELC_STATUS_ID': '1211', 'RES_ID': '531306', ...",2861239574637735,0,0,6687602007296940
1,45-HV-92510-01,2861239574637735,VRD - PH 1STSTGSUCTCOOL SHELL PSV IN,"{'ELC_STATUS_ID': '1225', 'RES_ID': '444134', ...",274450897701725,0,0,6687602007296940
2,23-ESDV-92501A,2861239574637735,VRD - PH 1STSTGSUCTCLR GAS IN,"{'ELC_STATUS_ID': '1211', 'RES_ID': '609895', ...",576308321452985,0,0,6687602007296940
3,45-HV-92510-03,2861239574637735,VRD - PH 1STSTGSUCTCOOL SHELL PSV OUT,"{'ELC_STATUS_ID': '1225', 'RES_ID': '510103', ...",619750565594754,0,0,6687602007296940
4,45-PT-92508,2861239574637735,VRD - PH 1STSTGSUCTCOOL COOLMED OUT,"{'ELC_STATUS_ID': '1211', 'RES_ID': '485917', ...",705952550422793,0,0,6687602007296940


# Finding type/classes of assets

We will now try to classify/group the assets into different types

## Get relevant data

We will use name and description to classify/group the assets.

In [4]:
# Get all assets
asset_list = client.assets.list(limit=-1).to_pandas()[["id", "name", "description"]]
print(f"There are {len(asset_list)} assets in total.")
asset_list.head()

There are 1107 assets in total.


Unnamed: 0,id,name,description
0,702630644612,23-TE-96116-04,VRD - PH 1STSTGGEAR THRUST BRG OUT
1,5156972057719,23-TE-96148,VRD - PH 1STSTG COMP SEAL GAS HTR
2,8019487489463,23-YT-96117-01,VRD - PH 1STSTGGEAR 1 JOURNBRG DE
3,9258567430091,23-FI-96151,SOFT TAG VRD - PH 1STSTG PRIM SEAL LEAK DE
4,12670864495024,23-LT-92521,VRD - PH 1STSTGSUCTSCRUBBER LEVEL


In [5]:
asset_list.isnull().any()

id             False
name           False
description    False
dtype: bool

### Get labeled data

We have created labels for some of the assets. We will use these labels to see how different features are able to separate the data into different groups.

In [6]:
labeled_data = pd.read_csv("data/oid_assets_types.csv")

In [7]:
all_assets_with_label = pd.merge(asset_list, labeled_data, how="left")
# Set type to unknown for the assets where we do not know the type.
all_assets_with_label["type"].fillna("unknown", inplace = True)
print(f"There are {sum(all_assets_with_label['type']!='unknown')} with known type and\
      {sum(all_assets_with_label['type']=='unknown')} with unknown type")
all_assets_with_label.head()

There are 171 with known type and      936 with unknown type


Unnamed: 0.1,id,name,description,Unnamed: 0,type
0,702630644612,23-TE-96116-04,VRD - PH 1STSTGGEAR THRUST BRG OUT,0.0,sensor
1,5156972057719,23-TE-96148,VRD - PH 1STSTG COMP SEAL GAS HTR,1.0,sensor
2,8019487489463,23-YT-96117-01,VRD - PH 1STSTGGEAR 1 JOURNBRG DE,2.0,transmitter
3,9258567430091,23-FI-96151,SOFT TAG VRD - PH 1STSTG PRIM SEAL LEAK DE,3.0,indicator
4,12670864495024,23-LT-92521,VRD - PH 1STSTGSUCTSCRUBBER LEVEL,4.0,transmitter


In [2]:
assets_with_label = pd.read_csv("util/assets_with_label.csv")
assets_with_label.head()

Unnamed: 0.2,Unnamed: 0,id,name,description,Unnamed: 0.1,type,length_name
0,0,702630644612,23-TE-96116-04,VRD - PH 1STSTGGEAR THRUST BRG OUT,0.0,sensor,14
1,1,5156972057719,23-TE-96148,VRD - PH 1STSTG COMP SEAL GAS HTR,1.0,sensor,11
2,2,8019487489463,23-YT-96117-01,VRD - PH 1STSTGGEAR 1 JOURNBRG DE,2.0,transmitter,14
3,3,9258567430091,23-FI-96151,SOFT TAG VRD - PH 1STSTG PRIM SEAL LEAK DE,3.0,indicator,11
4,4,12670864495024,23-LT-92521,VRD - PH 1STSTGSUCTSCRUBBER LEVEL,4.0,transmitter,11


In [8]:
# Summarize the different types
pd.DataFrame(
    {
        "Count": all_assets_with_label
        .groupby(["type"])
        .size()
    }
).sort_values(by="Count", ascending=False)

Unnamed: 0_level_0,Count
type,Unnamed: 1_level_1
unknown,936
alarm,54
indicator,34
transmitter,32
sensor,22
valve,21
controller,8


In [9]:
# Remove unknown for now
assets_with_label = all_assets_with_label[all_assets_with_label["type"]!="unknown"].reset_index(drop=True)

## Create features from the data
Before we test any supervised on unsupervised algorithms, we need to create features from the data.

### Basic features

First we will look at two basic features:
1. Length of the name
2. Number of special characters in the name

#### 1. Length of the name

In [10]:
assets_with_label["length_name"] = assets_with_label["name"].str.len()

In [11]:
assets_with_label.groupby("type")["length_name"].mean()

type
alarm          14.537037
controller     14.875000
indicator      12.441176
sensor         12.818182
transmitter    12.562500
valve          13.333333
Name: length_name, dtype: float64

In [12]:
# Plot distributions

#### 2. Number of special characters

In [13]:
assets_with_label['num_dach_name'] = [len(x.split('-')) -1 for x in assets_with_label['name']]

In [14]:
assets_with_label.groupby("type")["num_dach_name"].mean()

type
alarm          2.777778
controller     3.000000
indicator      2.411765
sensor         2.590909
transmitter    2.468750
valve          2.380952
Name: num_dach_name, dtype: float64

In [15]:
# Plot distribution

### Extract information from text/strings

There are many different methods for extracting information from text. 

For the description we will look at how to extract all tokens from the description, and use the count of the most common tokens in each description as features.

For the names we will create features from the similarity between each name and some fixed number of unique names.

### Count words/tokens in decsription

The description can be seen as a very short document. We will use CountVectorizer to extract tokens from the descriptions.

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

In [17]:
vectorizer = CountVectorizer(max_features=10)
vectorizer.fit(assets_with_label["description"])
# Get the words in the dict
vectorizer.get_feature_names()
count_vectorizer_desc = vectorizer.transform(assets_with_label["description"])
df_count_vectorizer_desc = pd.DataFrame(count_vectorizer_desc.toarray(),
                                        columns = vectorizer.get_feature_names())
df_count_vectorizer_desc["description"] = assets_with_label["description"]
df_count_vectorizer_desc.head()

Unnamed: 0,1ststg,alarm,brg,comp,gas,high,ph,soft,tag,vrd,description
0,0,0,1,0,0,0,1,0,0,1,VRD - PH 1STSTGGEAR THRUST BRG OUT
1,1,0,0,1,1,0,1,0,0,1,VRD - PH 1STSTG COMP SEAL GAS HTR
2,0,0,0,0,0,0,1,0,0,1,VRD - PH 1STSTGGEAR 1 JOURNBRG DE
3,1,0,0,0,0,0,1,1,1,1,SOFT TAG VRD - PH 1STSTG PRIM SEAL LEAK DE
4,0,0,0,0,0,0,1,0,0,1,VRD - PH 1STSTGSUCTSCRUBBER LEVEL


### Similarity between asset names

The names of the assets might look random, but it is not. There is a lot of information about the asset in the structure. With string features we would often create one dummy feature for each unique string, but all the names of the assets are unique. 

We would like to capture the similarity between the names without knowing what the different letter combinations actually means.

There are different methods for creating features that captures these similarities, today we will lokk at similarity encoding.

#### Cleaning the string
We do not care about difference in digits and will therefore convert all numbers to 1.

In [3]:
import re
assets_with_label["name_cleaned"] = [ re.sub(r"\d", "1", x) for x in assets_with_label["name"]]
assets_with_label[["name", "name_cleaned"]] .head()

Unnamed: 0,name,name_cleaned
0,23-TE-96116-04,11-TE-11111-11
1,23-TE-96148,11-TE-11111
2,23-YT-96117-01,11-YT-11111-11
3,23-FI-96151,11-FI-11111
4,23-LT-92521,11-LT-11111


In [19]:
print(f"Number of unique elements before cleaning {len(set(assets_with_label['name']))}")
print(f"Number of unique elements after cleaning {len(set(assets_with_label['name_cleaned']))}")

Number of unique elements before cleaning 171
Number of unique elements after cleaning 84


In [20]:
from dirty_cat import SimilarityEncoder

In [21]:
#Initialaze the similarity encoder
similarity_encoder = SimilarityEncoder(
similarity="ngram",
dtype=np.float32,
categories="most_frequent",
n_prototypes=10,
random_state=1006
)
    
    #Fit the similarity encoder and transform the data
similarity_encoder.fit(assets_with_label["name"].values.reshape(-1, 1))
sim_enc = similarity_encoder.transform(assets_with_label["name"].values.reshape(-1, 1))
sim_enc_df = pd.DataFrame(sim_enc, columns = list(similarity_encoder.categories_[0]))

In [22]:
sim_enc_df["name_cleaned"] = assets_with_label["name_cleaned"]
sim_enc_df["type"] = assets_with_label["type"]
sim_enc_df.head()

Unnamed: 0,48-PSE-96961,23-PCV-96176,23-PI-92504,23-PDT-96180,23-PDT-96146,23-PDT-92502,23-PDI-92534,23-PDI-92502,23-PDI-92501,23-PDAH-96155,name_cleaned,type
0,0.114286,0.181818,0.153846,0.181818,0.181818,0.098592,0.114286,0.098592,0.098592,0.173913,11-TE-11111-11,sensor
1,0.15,0.210526,0.118644,0.210526,0.277778,0.112903,0.112903,0.112903,0.112903,0.2,11-TE-11111,sensor
2,0.083333,0.2,0.102941,0.238095,0.238095,0.130435,0.098592,0.098592,0.147059,0.173913,11-YT-11111-11,transmitter
3,0.095238,0.210526,0.157895,0.210526,0.210526,0.112903,0.15,0.15,0.169492,0.263158,11-FI-11111,indicator
4,0.029851,0.112903,0.222222,0.15,0.15,0.277778,0.210526,0.210526,0.232143,0.107692,11-LT-11111,transmitter


## Clustering (unsupervised classification)

In a scenario where we do not have any labeled data we must used unsupervised methods such as clustering. Theree are different types of clustering. In this workshop we will use K-Means clustering to group the assets.


In [4]:
# Create features from name and description

vectorizer = FitCountVectorizer(col_name="description")
vectorizer.fit(df=assets_with_label)
count_vec_array = vectorizer.transform(df=assets_with_label)

sim_enc = FitSimilarityEncoder(col_name="name_cleaned")
sim_enc.fit(df=assets_with_label)
sim_enc_array = sim_enc.transform(df=assets_with_label)

X = np.concatenate((sim_enc_array, count_vec_array), axis=1)



In [5]:
#Cluster the data
from sklearn.cluster import KMeans
num_clusters = 10
kmeans = KMeans(n_clusters=num_clusters, random_state=1006).fit(X)

"""
Todo: This does not work well, should investigate the cluster and see if there is somthing we can add to 
get better clusters. Can also consider just clustering on some types.
"""

In [6]:
assets_with_label["cluster_label"] = kmeans.labels_
pd.DataFrame(
    {
        "Count": assets_with_label
        .groupby(["cluster_label", "type"])
        .size()
    }
).reset_index()

Unnamed: 0,cluster_label,type,Count
0,0,valve,5
1,1,alarm,5
2,1,indicator,1
3,1,sensor,2
4,1,transmitter,6
5,1,valve,4
6,2,alarm,6
7,2,controller,1
8,2,indicator,6
9,2,sensor,7


## Classification

Given that we have some training data we can train a classification model and try to predict the type for the rest of our data. There are many different classification algorithems in this tutorial we will use K nearest neighbors.


### Train test split
We will train on one part of our data and test our algorithm on the second part.


In [22]:
df_train, df_test, y_train, y_test = train_test_split(
    assets_with_label, assets_with_label["type"].values,
    train_size=0.7,
    stratify=assets_with_label["type"].values,
    random_state=1006,
)
df_train, df_test = df_train.reset_index(drop=True), df_test.reset_index(drop=True)



###  Create features
Fit on the training data and transform both the training and test data.

In [8]:
#Fit on train
vectorizer = FitCountVectorizer(col_name="description")
vectorizer.fit(df=df_train)

sim_enc = FitSimilarityEncoder(col_name="name_cleaned")
sim_enc.fit(df=df_train)

# Transform train
count_vec_array = vectorizer.transform(df=df_train)
sim_enc_array = sim_enc.transform(df=df_train)
X_train = np.concatenate((sim_enc_array, count_vec_array), axis=1)

#Transform test
count_vec_array = vectorizer.transform(df=df_test)
sim_enc_array = sim_enc.transform(df=df_test)
X_test = np.concatenate((sim_enc_array, count_vec_array), axis=1)


### Fit the classifier

In [10]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=5)
neigh.fit(X=X_train, y=y_train)
KNeighborsClassifier(...)
print(neigh.predict(X_test))

['indicator' 'indicator' 'alarm' 'transmitter' 'indicator' 'indicator'
 'indicator' 'controller' 'indicator' 'alarm' 'transmitter' 'indicator'
 'controller' 'indicator' 'indicator' 'alarm' 'alarm' 'indicator'
 'transmitter' 'indicator' 'transmitter' 'alarm' 'sensor' 'indicator'
 'alarm' 'sensor' 'indicator' 'transmitter' 'transmitter' 'alarm' 'alarm'
 'indicator' 'valve' 'indicator' 'indicator' 'alarm' 'indicator' 'valve'
 'transmitter' 'valve' 'indicator' 'indicator' 'indicator' 'controller'
 'alarm' 'indicator' 'indicator' 'valve' 'indicator' 'alarm' 'alarm'
 'indicator']


In [14]:
sum(neigh.predict(X_test)== y_test)/len(X_test)

0.6346153846153846

## Open set classification

### Motivation
In the situation above we knew in advance all the classes we were interested in and we also had examples of all the classes we were interested in. Let's see what happens if we for instance remove all the "alarm" examples from the training set.

In a situation where we do not know all the classes at training time we need an algorithm that not only correctly predicts the class of an item, but that is also able to return an unknown label.

In [15]:
from cognite.datastudio.resource_typing import ResourceTyping

In [35]:
# Get data on correct format
training_data = []
for i, target in enumerate(y_train):
    training_data.append({"data": list(df_train.loc[i, ["name", "description"]]), "target": target})
    
predict_data = []
for i in enumerate(y_test):
    predict_data.append({"data": list(df_test.loc[i, ["name", "description"]])})

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_lowerdim(tup)


In [37]:
matcher = ResourceTyping(client)
model = matcher.fit(training_data)

In [38]:
matches = model.predict(predict_data)
print(matches)

[{'data': ['name', 'description'], 'score': 0.03820703440031448, 'target': 'other'}, {'data': ['name', 'description'], 'score': 0.03820703440031448, 'target': 'other'}, {'data': ['name', 'description'], 'score': 0.03820703440031448, 'target': 'other'}, {'data': ['name', 'description'], 'score': 0.03820703440031448, 'target': 'other'}, {'data': ['name', 'description'], 'score': 0.03820703440031448, 'target': 'other'}, {'data': ['name', 'description'], 'score': 0.03820703440031448, 'target': 'other'}, {'data': ['name', 'description'], 'score': 0.03820703440031448, 'target': 'other'}, {'data': ['name', 'description'], 'score': 0.03820703440031448, 'target': 'other'}, {'data': ['name', 'description'], 'score': 0.03820703440031448, 'target': 'other'}, {'data': ['name', 'description'], 'score': 0.03820703440031448, 'target': 'other'}, {'data': ['name', 'description'], 'score': 0.03820703440031448, 'target': 'other'}, {'data': ['name', 'description'], 'score': 0.03820703440031448, 'target': '