### Entity Resolution in Python (Basic Overview)

This notebook walks through a typical entity resolution workflow at a basic level. We'll explore the following topics:
* What is an Entlet?
* Adding Entlets to an EntletMap
* Creating resolution Strategies
* Creating a Pipeline
* Understanding Results

Let's start by adding the project to your path so it's importable, and then import the Entlet object.

In [None]:
import sys
sys.path.append('../')   # Assuming you `git clone`d the repo - append the library root to syspath for importing

from entity_resolution import Entlet

### What is an Entlet?

An entlet is an entity according to a single source of information. It can be anything - a state, a person, a car...basically any noun.

You can assign whatever properties you want to the entlet using the .add() method, and each property can hold as many values as you like. Nested values (typically dictionaries, or other objects that contain one or more values) are respected, meaning their contents will "stay together" throughout resolution.

Certain properties (defined below in the Entlet IDs section) are "reserved," meaning they will only accept a single value that cannot be changed/updated after it is set.
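
To see why keeping nested values together matters, here is a small sketch in plain Python. It does not use the Entlet class at all, and the two locations are illustrative values chosen for this example. If each sub-field were stored separately, we could no longer tell which state belongs to which country.

```python
# Toy illustration only: plain dicts and sets, not the library's Entlet.
locations = [
    {"country": "US", "state": "CA"},  # California, United States
    {"country": "CA", "state": "BC"},  # British Columbia, Canada
]

# Flattened storage: every sub-field gets its own bag of values,
# so the country/state pairing is lost.
flattened = {"country": set(), "state": set()}
for location in locations:
    for key, value in location.items():
        flattened[key].add(value)

# Grouped storage: each nested value is kept as one unit.
grouped = {"location": list(locations)}

# "CA" now appears as both a country and a state, with nothing to say
# which state goes with which country.
print(sorted(flattened["country"]), sorted(flattened["state"]))

# The grouped form still knows that "BC" belongs with country "CA".
print(grouped["location"][1])
```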

In [None]:
entlet = Entlet()

entlet.add({
    "name": "San Mateo County",
    "location": {
        "country": "US",
        "state": "CA"
    }
})

##### Working with Entlet IDs

Entlet IDs are created for you, based on the information you added to the entlet. Generally, the goal of an Entlet ID is to:
* Remain consistent across runs of the entity resolver
* Uniquely identify that entlet within its source

There are three ways an Entlet ID can be defined:

**Defining a source unique id field** (Recommended)

If your data source already supplies a unique id, you can specify the field name of the unique id. If you use this method, the field you define will become "reserved," i.e., it will only accept a single value from the .add() method that cannot be overwritten.


**Defining your own unique id** 

If no unique id is provided, you can define your own. This generally makes it harder to keep the unique ids stable between runs of the entity resolver, so it is not recommended.


**Defining a combination of fields that uniquely identify the entlet**

Alternatively, you can specify a combination of fields that together define the entlet as unique. These fields still allow multiple values; the values provided will be hashed together to create the unique id.
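
As a rough sketch of the "hashed together" idea, the snippet below builds a stable id from the values of a few fields. The helper `toy_uid`, the use of SHA-1, and the JSON canonicalisation are all assumptions made for illustration; the library's actual hashing scheme is not shown in this notebook and may differ.

```python
import hashlib
import json


def toy_uid(values_by_field):
    """Hash the values of the chosen fields into a hex digest.

    Hypothetical helper for illustration; not part of entity_resolution.
    """
    # Sort field names and values so insertion order cannot change the id.
    canonical = {
        field: sorted(values)
        for field, values in sorted(values_by_field.items())
    }
    payload = json.dumps(canonical, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


a = toy_uid({
    "name": ["San Mateo County"],
    "location.state": ["CA"],
    "location.country": ["US"],
})
# Same values, different field order: the id stays the same.
b = toy_uid({
    "location.country": ["US"],
    "location.state": ["CA"],
    "name": ["San Mateo County"],
})
# A different name produces a different id.
c = toy_uid({
    "name": ["Santa Clara County"],
    "location.state": ["CA"],
    "location.country": ["US"],
})

print(a == b)  # True
print(a == c)  # False
```

Because the id depends only on the field values, re-running the resolver on the same source data yields the same id, which is the consistency goal described above.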

In [None]:
# Define a field (countyFIPS) that contains a unique id provided by the source (example dataset)
# Note that this is a classmethod
Entlet.define_source_uid_field("countyFIPS")

# Define your own unique id for this specific entlet
# entlet.define_individual_id(1)

# Define a combination of fields that uniquely identify the entlet
# Entlet.define_custom_uid_fields("location.country", "location.state", "name")

print("Fields required for unique ID creation: " + str(Entlet.UID_FIELDS))
print("The following fields behave as source-specific unique ID fields: " + str(Entlet.SOURCE_UID_FIELD))

You can add values to the entlet as you go. Each field (except for "reserved" fields - see below) is treated as a list, so new values are simply appended, and duplicate values are discarded.
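
The append-and-deduplicate behaviour can be pictured with a tiny stand-in class. `ToyEntlet` is hypothetical and exists only for this sketch. In particular, raising `ValueError` when a reserved field is set a second time is an assumption; the notebook only says that such a value cannot be changed.

```python
class ToyEntlet:
    """Minimal stand-in for illustration; not the library's Entlet."""

    def __init__(self, reserved_fields=()):
        self.reserved_fields = set(reserved_fields)
        self.values = {}

    def add(self, data):
        for field, value in data.items():
            if field in self.reserved_fields:
                # Reserved fields hold a single value that cannot be replaced.
                if field in self.values and self.values[field] != value:
                    raise ValueError(f"{field!r} is already set")
                self.values[field] = value
            else:
                # Ordinary fields behave like lists without duplicates.
                bucket = self.values.setdefault(field, [])
                if value not in bucket:
                    bucket.append(value)
        return self


toy = ToyEntlet(reserved_fields=["countyFIPS"])
toy.add({"name": "San Mateo County", "countyFIPS": "12345"})
toy.add({"name": "San Mateo County"})  # duplicate value, discarded
toy.add({"name": "San Mateo"})         # new value, appended

print(toy.values["name"])        # ['San Mateo County', 'San Mateo']
print(toy.values["countyFIPS"])  # 12345
```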

In [None]:
entlet.add({
    "countyFIPS": "12345",  # Reserved field, because we defined it as the Source UID field
    "location": {
        "country": "UK",
        "state": "Not alabama"
    }
})

### Reserved fields on the entlet

You *must* add string values for the following 2 fields on every entlet:
* ent_type
* data_source

Entlets with different ent_types ***will not resolve together***. Capitalization matters here.

In [None]:
entlet.add({
    "ent_type": "county",
    "data_source": "test",
})

### Creating an EntletMap

Now that we've created our first entlet, added some data to it, and defined how it should produce its unique id, we have to add it to a "pool" of entlets for resolution.

In [None]:
from entity_resolution import EntletMap

# We can simply instantiate EntletMap with one or more entlets,
# but later we'll use the .add() method to add additional ones
emap = EntletMap([entlet])

We've added one entlet to the EntletMap. Now let's add a few more!

In [None]:
entlet = Entlet()
entlet.add({
    "ent_type": "county",
    "data_source": "test",
    "countyFIPS": "23456",
    "name": "San Mateo",
    "location": {
        "country": "US",
        "state": "CA"
    }
})
emap.add(entlet)

entlet = Entlet()
entlet.add({
    "ent_type": "county",
    "data_source": "test",
    "countyFIPS": "34567",
    "name": "Santa Clara County",
    "location": {
        "country": "US",
        "state": "CA"
    }
})
emap.add(entlet)

## Building a Resolution Pipeline

### Building Strategies

Now that we've created a few entlets, we have to determine what makes two entlets similar enough that they should be resolved together.

You can define as many strategies as you need, and each strategy can contain as many comparisons as you want. Keep in mind, though, that computational cost grows with the number of comparisons.

Strategies consist of the following elements:
* A 'blocker'
* One or more ways to measure similarity between entlets
* A way to determine whether the similarity scores pass a threshold test
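
The three elements above fit together roughly as follows. This is a plain-Python sketch of the general idea, not the library's Strategy class: `run_toy_strategy`, the all-pairs blocker, the token-overlap metric, and the "every score at least 0.5" rule are all stand-ins invented for this example.

```python
from itertools import combinations


def run_toy_strategy(records, blocker, metrics, passes):
    """Return id pairs whose similarity scores pass the threshold test."""
    matches = []
    for left, right in blocker(records):                      # 1. blocker pairs records up
        scores = [metric(left, right) for metric in metrics]  # 2. metrics score each pair
        if passes(scores):                                    # 3. threshold test
            matches.append((left["id"], right["id"]))
    return matches


def all_pairs(records):
    # Naive blocker: compare everything with everything.
    return combinations(records, 2)


def same_state(left, right):
    return 1.0 if left["state"] == right["state"] else 0.0


def token_overlap(left, right):
    # Jaccard overlap between the lowercased name tokens.
    a = set(left["name"].lower().split())
    b = set(right["name"].lower().split())
    return len(a & b) / len(a | b)


def every_score_at_least_half(scores):
    return all(score >= 0.5 for score in scores)


records = [
    {"id": "a", "name": "San Mateo County", "state": "CA"},
    {"id": "b", "name": "San Mateo", "state": "CA"},
    {"id": "c", "name": "Santa Clara County", "state": "CA"},
]

matches = run_toy_strategy(
    records, all_pairs, [same_state, token_overlap], every_score_at_least_half
)
print(matches)  # [('a', 'b')]
```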

#### Defining a Blocker

Blockers determine how entlets get "paired up" for comparison - which implies that *not all entlets will be compared against each other*. Entity Resolution is computationally expensive, and there's always a small chance that a "bad" resolution will occur. To mitigate both of these things, we define a blocker to try and pair up entlets in the smartest possible way.

For this example, we'll use the SortedNeighborhood blocker, with a window size of 3. We tell the blocker to block based on the 'name' field, meaning the entlets will get sorted based on values in the 'name' field. The window size means that entlets will only be compared with their 3 closest alphabetical neighbors.
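
Here is a minimal sketch of the sorted-neighbourhood idea in plain Python. It assumes the classic formulation, where a window of `window_size` consecutive records slides over the sorted list, so each record is paired with the `window_size - 1` records that follow it. The library's SortedNeighborhood may count its window differently, and the extra county names are only there to make the list long enough to show the effect.

```python
def sorted_neighborhood_pairs(names, window_size):
    """Pair each name with the names that share a sliding window with it."""
    ordered = sorted(names)
    pairs = []
    for i, left in enumerate(ordered):
        # Only look at the next (window_size - 1) names in sorted order.
        for right in ordered[i + 1 : i + window_size]:
            pairs.append((left, right))
    return pairs


names = [
    "Santa Clara County",
    "Alameda County",
    "San Mateo",
    "San Mateo County",
    "Marin County",
]

pairs = sorted_neighborhood_pairs(names, window_size=2)
for pair in pairs:
    print(pair)

# Comparing all 5 names against each other would take 10 comparisons;
# with a window of 2 only adjacent names are compared.
print(len(pairs))  # 4
```

Note that "San Mateo" and "San Mateo County" end up next to each other after sorting, which is exactly the kind of pair we want the blocker to surface.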

In [None]:
from entity_resolution.blocking.sorted_value import SortedNeighborhood

blocker = SortedNeighborhood("name", window_size=3)

#### Defining Transforms and Similarity Measures

Now we need to determine how to compare values against each other. We'll define two similarity measures for this strategy:
* Exact match between values of 'location.state'
* Cosine similarity between the vector values of 'name'

Note that the second measure requires converting the value in the 'name' field to a vector. We can accomplish this using a transform that returns a vector. For this example, we'll use TF-IDF.
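
To make the TF-IDF plus cosine similarity combination concrete, here is a small pure-Python sketch. The whitespace tokenisation and the smoothed IDF formula are assumptions chosen for this example; the library's TfIdfTokenizedVector may tokenise and weight terms differently.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn each document into a sparse {token: weight} TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents each token appears in.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed inverse document frequency: rarer tokens get larger weights.
    idf = {
        token: math.log((1 + n_docs) / (1 + df)) + 1
        for token, df in doc_freq.items()
    }
    return [
        {token: count * idf[token] for token, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine_similarity(left, right):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * right.get(token, 0.0) for token, weight in left.items())
    norm_left = math.sqrt(sum(w * w for w in left.values()))
    norm_right = math.sqrt(sum(w * w for w in right.values()))
    if norm_left == 0 or norm_right == 0:
        return 0.0
    return dot / (norm_left * norm_right)


names = ["San Mateo County", "San Mateo", "Santa Clara County"]
vectors = tfidf_vectors(names)

sim_close = cosine_similarity(vectors[0], vectors[1])  # shares "san" and "mateo"
sim_far = cosine_similarity(vectors[0], vectors[2])    # shares only "county"

print(sim_close > sim_far)  # True
```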

In [None]:
from entity_resolution.similarity import CosineSimilarity, ExactMatch
from entity_resolution.transforms import TfIdfTokenizedVector

# Despite its name, this metric compares 'location.state', not 'name'
name_exact_match = ExactMatch("location.state")

tfidf = TfIdfTokenizedVector()
name_tfidf_similarity = CosineSimilarity("name", transform=tfidf)

#### Defining a scoring method

Since we have two different measurements of similarity, we have to tell the Strategy how to combine those scores into a single heuristic that can be tested against a threshold we set.

For this example, we'll treat each of the two scores as dimensions in a vector and treat that vector's magnitude as the overall score. Any combination of scores that produces a vector with magnitude greater than 1 will be considered a valid resolution.
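
The arithmetic behind this scoring method is short enough to sketch directly. The helper names below are hypothetical, and whether a magnitude of exactly 1 should pass is not stated here; this sketch treats the threshold as inclusive, which is an assumption.

```python
import math


def vector_magnitude(scores):
    """Euclidean length of the vector whose components are the scores."""
    return math.sqrt(sum(score * score for score in scores))


def passes_threshold(scores, minimum=1.0):
    # Inclusive comparison is an assumption made for this sketch.
    return vector_magnitude(scores) >= minimum


# [state exact match, name cosine similarity]
print(passes_threshold([1.0, 0.5]))  # True:  sqrt(1.25) is about 1.118
print(passes_threshold([0.0, 0.9]))  # False: magnitude is 0.9
print(passes_threshold([0.8, 0.8]))  # True:  sqrt(1.28) is about 1.131
```

One consequence worth noticing: with two scores that each range from 0 to 1, the largest possible magnitude is the square root of 2, and an exact state match contributes a full 1.0 on its own, so the name similarity only has to supply whatever is needed to clear the threshold.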

In [None]:
from entity_resolution.scoring import VectorMagnitude

scoring_method = VectorMagnitude(min=1.0)

In [None]:
from entity_resolution import Strategy

# Roll up the above blocker, similarity metrics, and scoring method into a Strategy
strategy = Strategy(
    blocker=blocker,
    metrics=[name_exact_match, name_tfidf_similarity],
    scoring_method=scoring_method
)

### Defining a Pipeline

The Pipeline is what actually runs Entity Resolution, and will try to minimize the computational overhead of the strategies that you provide using under-the-hood optimizations. You can define as many strategies, standardizers, and partitioners (TODO) as you like.

To run the pipeline against the entlets you've created, pass your EntletMap to the Pipeline's .resolve() method.

In [None]:
from entity_resolution import Pipeline

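# NOTE: `state_std` is not defined anywhere in this notebook. Create a standardizer
# and assign it to `state_std` before running this cell, or it will raise a NameError.
pipeline = Pipeline(
    strategies=[strategy],
    standardizers=[state_std]
)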

entity_map = pipeline.resolve(emap)

#### Working with Results

The Pipeline will give you back an EntityMap, where the keys represent a unique Entity ID (as opposed to the earlier Ent*let* ID), and the values represent the aggregated information corresponding to all of the underlying entlets.
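
To picture how entlets roll up into entities, here is a plain-Python sketch that groups matched entlet ids and merges their values. Everything in it is an assumption for illustration: the grouping by connected components, the `entity-N` id format, and the supposed match between the two San Mateo entlets. The real EntityMap may be built and keyed differently.

```python
def group_matches(entlet_ids, matched_pairs):
    """Group ids into connected components using a small union-find."""
    parent = {entlet_id: entlet_id for entlet_id in entlet_ids}

    def find(item):
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path compression
            item = parent[item]
        return item

    for left, right in matched_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = {}
    for entlet_id in entlet_ids:
        groups.setdefault(find(entlet_id), []).append(entlet_id)
    return list(groups.values())


# Toy entlets keyed by their countyFIPS values from earlier in the notebook.
entlets = {
    "12345": {"name": ["San Mateo County"]},
    "23456": {"name": ["San Mateo"]},
    "34567": {"name": ["Santa Clara County"]},
}

# Suppose resolution decided that the first two entlets match.
groups = group_matches(list(entlets), [("12345", "23456")])

toy_entity_map = {}
for index, members in enumerate(groups):
    merged_names = []
    for entlet_id in members:
        for name in entlets[entlet_id]["name"]:
            if name not in merged_names:
                merged_names.append(name)
    # Hypothetical entity id format, used only in this sketch.
    toy_entity_map[f"entity-{index}"] = {
        "entlet_ids": members,
        "name": merged_names,
    }

print(toy_entity_map["entity-0"]["name"])  # ['San Mateo County', 'San Mateo']
print(toy_entity_map["entity-1"]["name"])  # ['Santa Clara County']
```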

In [None]:
from pprint import pprint

pprint(entity_map)