In [1]:
import pycrfsuite
import pandas as pd
from address_compare.parsers import hyphen_parse
import json
from address_compare.feature_functions import WordFeatures1 as wf
from collections import OrderedDict
from numpy.random import uniform, seed
from importlib import reload
from address_compare.constants import DIRECTIONS, STREET_TYPES, UNIT_TYPES
from string import punctuation
from address_compare.feature_functions import FeatureFunctions, WordFeatures2, FullAddressFeatures, FeatureExtractor
from address_compare.crf_tagger import AddressTagger
import numpy as np
import re


# Training a Conditional Random Field-based address parser

In this notebook I will show how we trained a CRF model to parse addresses into their component parts.

## Introduction

How do we, as humans, know when two representations of an address refer to the same place? The strings "1401-88 W. HASTINGS ST" and "88 West Hastings Street, Ste. 1401" look very different, but they refer to the same physical location. Despite the large differences between the two strings, most people with even a little experience with addresses can quickly spot that they are the same. How?

One theory that we have is that comparisons are made component-wise. We can see that each string representation points to an address with unit number 1401 and street number 88. We also know that street names are in general case-insensitive, so "HASTINGS" and "Hastings" are _probably_ the same street name. We also know that the street type "ST" is often an abbreviation for the street type "Street", and that sometimes suite types such as "Ste." are omitted from address representations. Taking these observations together, a human can quickly surmise with high confidence that these strings refer to the same address.

In order to teach a computer to do the same, the first step is to teach it to parse addresses into their components.
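To make the component-wise idea concrete, here is a minimal sketch of how two already-parsed addresses might be compared. The abbreviation tables, the `normalize` and `same_address` helpers, and the choice to ignore `UNIT_TYPE` are all assumptions made for illustration; they are not part of the `address_compare` package.

```python
# Hypothetical abbreviation tables, for illustration only. The real project
# keeps its own lists, which are not reproduced here.
STREET_TYPE_ABBREVIATIONS = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}
DIRECTION_ABBREVIATIONS = {"WEST": "W", "EAST": "E", "NORTH": "N", "SOUTH": "S"}


def normalize(component, value):
    """Upper-case a component, strip punctuation, and shorten known long forms."""
    value = value.upper().strip(" .,#")
    if component == "STREET_TYPE":
        return STREET_TYPE_ABBREVIATIONS.get(value, value)
    if component in ("PRE_DIRECTION", "POST_DIRECTION"):
        return DIRECTION_ABBREVIATIONS.get(value, value)
    return value


def same_address(a, b, ignore=("UNIT_TYPE",)):
    """Compare two parsed addresses component by component."""
    keys = (set(a) | set(b)) - set(ignore)
    return all(normalize(k, a.get(k, "")) == normalize(k, b.get(k, "")) for k in keys)


first = {"UNIT_NUMBER": "1401", "STREET_NUMBER": "88", "PRE_DIRECTION": "W.",
         "STREET_NAME": "HASTINGS", "STREET_TYPE": "ST"}
second = {"STREET_NUMBER": "88", "PRE_DIRECTION": "West", "STREET_NAME": "Hastings",
          "STREET_TYPE": "Street", "UNIT_TYPE": "Ste.", "UNIT_NUMBER": "1401"}
print(same_address(first, second))
```

A comparison like this only works once each string has been split into labelled components, which is the job of the parser described in the rest of the notebook.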

## Training data

The training data was a list of tokenized addresses, each with one tag per token. We obtained the tags by having people label a list of 200 addresses on Amazon Mechanical Turk. Of the 200, 185 were usable at the time of this writing. Here is a look at the first few entries in the training file to give a feel for its structure.

In [2]:
with open('data/tagged_addresses.json') as f:
    td = json.load(f)
    
print(json.dumps(td[0:3], indent = 2))

[
  {
    "raw_address": "612 S ALASKA ST",
    "tags": [
      "STREET_NUMBER",
      "PRE_DIRECTION",
      "STREET_NAME",
      "STREET_TYPE"
    ],
    "tokens": [
      "612",
      "S",
      "ALASKA",
      "ST"
    ]
  },
  {
    "raw_address": "540 RONLEE LN NW STE B",
    "tags": [
      "STREET_NUMBER",
      "STREET_NAME",
      "STREET_TYPE",
      "POST_DIRECTION",
      "UNIT_TYPE",
      "UNIT_NUMBER"
    ],
    "tokens": [
      "540",
      "RONLEE",
      "LN",
      "NW",
      "STE",
      "B"
    ]
  },
  {
    "raw_address": "624 SUNSET PARK DR",
    "tags": [
      "STREET_NUMBER",
      "STREET_NAME",
      "STREET_NAME",
      "STREET_TYPE"
    ],
    "tokens": [
      "624",
      "SUNSET",
      "PARK",
      "DR"
    ]
  }
]


Tags were permitted to be one of `UNIT_TYPE`, `UNIT_NUMBER`, `STREET_NUMBER`, `PRE_DIRECTION`, `STREET_NAME`, `STREET_TYPE`, or `POST_DIRECTION`.
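Since the tags come from human annotators, a quick sanity check of the file can be useful before training. The sketch below is one possible check, not the procedure we actually used to decide which records were usable; `ALLOWED_TAGS` and `find_invalid_records` are names introduced here for illustration.

```python
ALLOWED_TAGS = {
    "UNIT_TYPE", "UNIT_NUMBER", "STREET_NUMBER", "PRE_DIRECTION",
    "STREET_NAME", "STREET_TYPE", "POST_DIRECTION",
}


def find_invalid_records(records):
    """Return (index, reason) pairs for records that look unusable for training."""
    problems = []
    for i, record in enumerate(records):
        tokens, tags = record["tokens"], record["tags"]
        # Every token needs exactly one tag.
        if len(tokens) != len(tags):
            problems.append((i, "token/tag length mismatch"))
        # Tags outside the permitted set are reported by name.
        unknown = sorted(set(tags) - ALLOWED_TAGS)
        if unknown:
            problems.append((i, "unknown tags: " + ", ".join(unknown)))
    return problems


sample = [
    {"raw_address": "612 S ALASKA ST",
     "tokens": ["612", "S", "ALASKA", "ST"],
     "tags": ["STREET_NUMBER", "PRE_DIRECTION", "STREET_NAME", "STREET_TYPE"]},
    {"raw_address": "624 SUNSET PARK DR",
     "tokens": ["624", "SUNSET", "PARK", "DR"],
     # One tag removed on purpose to trigger the length check.
     "tags": ["STREET_NUMBER", "STREET_NAME", "STREET_TYPE"]},
]
print(find_invalid_records(sample))
```

Running the same function on the list loaded from `data/tagged_addresses.json` would flag any record that the trainer could not consume.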

## The CRF model

Linear-chain conditional random field (CRF) models are discriminative classifiers for sequences of interdependent observations. They have been widely applied (see the link below) to sequence labeling tasks, the general problem of assigning a label to each symbol in a sequence. Well-known examples are _Named Entity Recognition_, which labels words as person, organization, or location names, and part-of-speech tagging, which labels words as Noun, Verb, and so on.

An advantage of CRF for this particular task over other similar models such as the Hidden Markov Model (HMM) is that CRF can incorporate a wide variety of features aside from just the given order of tags from training data. This is because CRF models, being discriminative, model the conditional probability distribution $p(y|\mathbf{x})$ directly, without ever modeling the joint distribution $p(\mathbf{x},y)$. This is good news when the features $\mathbf{x}$ are high dimensional and highly interdependent, which is likely true in the case of many named entity recognition tasks.

(Source)
http://homepages.inf.ed.ac.uk/csutton/publications/crftutv2.pdf
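For reference, the conditional distribution of a linear-chain CRF, in the form given in the tutorial linked above, can be written as follows. Here $\mathbf{y} = (y_1, \dots, y_T)$ is the whole tag sequence, $\mathbf{x}_t$ is the feature vector at position $t$, $f_k$ are feature functions, and $\theta_k$ are the learned weights. These symbols follow the tutorial's notation and are not used elsewhere in this notebook.

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \exp\left\{ \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right\}, \qquad Z(\mathbf{x}) = \sum_{\mathbf{y}'} \prod_{t=1}^{T} \exp\left\{ \sum_{k=1}^{K} \theta_k f_k(y'_t, y'_{t-1}, \mathbf{x}_t) \right\}$$

The normalizer $Z(\mathbf{x})$ sums over all possible tag sequences for the given input, which is why only the conditional distribution, and never a model of $\mathbf{x}$ itself, is needed.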

### Feature Extraction

For the address parsing problem, an individual observation $y_j$ is a sequence $(l_1, l_2, \dots, l_t)$ of $t$ labels. For each label $l_i$ in $y_j$, we also provide a sequence $(x_{i1}, x_{i2}, \dots, x_{if})$ of $f$ features, so the feature set for the whole address is $$\mathbf{x}_j = ((x_{11}, x_{12}, \dots, x_{1f}), \dots, (x_{t1}, x_{t2}, \dots, x_{tf}))$$

To put this into practice, it was convenient to have a way to compute many features at once. To achieve this (in perhaps a slight abuse of object-oriented programming), we created a base class, `FeatureFunctions`, whose method `exec_all(self, s)` applies every feature function defined on the class to `s` and returns the results in a dictionary. Feature functions are recognized as methods whose names start with the prefix `f_`. An example follows.

In [3]:
class SillyWordFF(FeatureFunctions):
    def f_constant(self, s):
        return "constant"
    
    def f_length(self, s):
        return len(s)
    
    def f_first_letter(self, s):
        return s[0]

The above code defines a `FeatureFunctions` subclass with three feature functions: one always returns the word "constant", one returns the length of the input, and one returns its first character. Instantiating the subclass gives us an easy way to apply all of them at once.

In [4]:
sillyff = SillyWordFF()
sillyff.exec_all("word")

{'f_constant': 'constant', 'f_first_letter': 'w', 'f_length': 4}

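This dictionary format, with feature names as keys and feature values as values, is what `pycrfsuite` expects. Using this class made it easy to experiment with different feature configurations. The same mechanism also works for features computed over the whole list of tokens rather than a single word, as in the toy example below. After experimentation, the combination of `WordFeatures2` and `FullAddressFeatures` configured after that example gave the best results.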

In [5]:
class SillySentenceFF(FeatureFunctions):
    
    def f_length(self, l):
        return len(l)
    
    def f_contains_colin(self, l):
        return "colin" in l

In [6]:
ssFF = SillySentenceFF()
ssFF.exec_all(["this", "is", "a", "list", "of", "tokens", "containing", "colin"])

{'f_contains_colin': True, 'f_length': 8}
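The prefix-based discovery used by both toy classes above could be implemented roughly as follows. This is a minimal sketch under the assumption that `exec_all` simply looks up every method whose name starts with `f_`; the class name `PrefixFeatureFunctions` is hypothetical and the actual `FeatureFunctions` implementation in `address_compare` may differ.

```python
class PrefixFeatureFunctions:
    """Hypothetical stand-in for the FeatureFunctions base class."""

    prefix = "f_"

    def exec_all(self, s):
        """Call every method whose name starts with the prefix; collect the results."""
        results = {}
        for name in dir(self):  # dir() returns attribute names in sorted order
            if not name.startswith(self.prefix):
                continue
            attribute = getattr(self, name)
            if callable(attribute):
                results[name] = attribute(s)
        return results


class SillyWordSketch(PrefixFeatureFunctions):
    def f_constant(self, s):
        return "constant"

    def f_length(self, s):
        return len(s)

    def f_first_letter(self, s):
        return s[0]

    def helper(self, s):
        # Not a feature function: the name lacks the f_ prefix, so it is skipped.
        return s.upper()


print(SillyWordSketch().exec_all("word"))
```

With this design, adding a feature means adding one method, and removing a feature means deleting or renaming it, which is what made experimenting with feature sets quick.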

In [7]:
wf = WordFeatures2()
sf = FullAddressFeatures()
fe = FeatureExtractor(word_features = wf, sentence_features=sf, lags=0, leads=0)

In [8]:
fe.extract_features(["123", "Main", "Street"])

[{'full_f_cty_rd': False,
  'self_f_has_digit': True,
  'self_f_is_digit1': False,
  'self_f_is_digit2': False,
  'self_f_is_digit3': True,
  'self_f_is_digit4': False,
  'self_f_is_digit5': False,
  'self_f_is_direction': False,
  'self_f_is_street_type_word': False,
  'self_f_length': 3,
  'self_f_pound': False,
  'self_f_unit_type': False},
 {'full_f_cty_rd': False,
  'self_f_has_digit': False,
  'self_f_is_digit1': False,
  'self_f_is_digit2': False,
  'self_f_is_digit3': False,
  'self_f_is_digit4': False,
  'self_f_is_digit5': False,
  'self_f_is_direction': False,
  'self_f_is_street_type_word': False,
  'self_f_length': 4,
  'self_f_pound': False,
  'self_f_unit_type': False},
 {'full_f_cty_rd': False,
  'self_f_has_digit': False,
  'self_f_is_digit1': False,
  'self_f_is_digit2': False,
  'self_f_is_digit3': False,
  'self_f_is_digit4': False,
  'self_f_is_digit5': False,
  'self_f_is_direction': False,
  'self_f_is_street_type_word': True,
  'self_f_length': 6,
  'self_f_poun

## Training and Testing

The code below creates a `pycrfsuite.Trainer` object, which interfaces with the `crfsuite` application to create a trained model. We used a fairly large holdout set of 30% to evaluate the model performance.

In [9]:
seed(1729)
training_size = 0.7  # use 70% for training

trainer2 = pycrfsuite.Trainer()

# group[i] is 1 when record i is used for training and 0 when it is held out
group = []

with open('data/tagged_addresses.json') as f:
    td = json.load(f)

for item in td:
    g = int(uniform() < training_size)
    features = fe.extract_features(item['tokens'])
    # report any record whose feature and tag sequences differ in length
    if len(features) != len(item['tags']):
        print(item['tokens'])
    trainer2.append(xseq=features, yseq=item['tags'], group=g)
    group.append(g)

In [10]:
trainer2.train('address_compare/trained_models/model4', holdout = 0)

Holdout group: 1

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 104
Seconds required: 0.005

L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 2147483647
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 1194.928398
Feature norm: 1.000000
Error norm: 981.166355
Active features: 104
Line search trials: 1
Line search step: 0.001110
Seconds required for this iteration: 0.001
Performance by label (#match, #model, #ref) (precision, recall, F1):
    STREET_NUMBER: (0, 0, 61) (0.0000, 0.0000, 0.0000)
    PRE_DIRECTION: (0, 0, 14) (0.0000, 0.0000, 0.0000)
    STREET_NAME: (71, 290, 71) (0.2448, 1.0000, 0.3934)
    STREET_TYPE: (0, 0, 63) (0.0000, 0.0000, 0.0000)
    POST_DIRECTION: (0, 0, 10) (0.0000, 0.0000, 0.0000)
    UNIT_TYPE: (0, 0, 32) (0.00
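Each label line in the log reports three counts followed by three scores. Assuming `#match` is the number of correctly predicted tags, `#model` the number of times the model predicted the label, and `#ref` the number of times the label occurs in the reference tags, the scores can be recomputed as in the sketch below. The function name `precision_recall_f1` is introduced here for illustration.

```python
def precision_recall_f1(n_match, n_model, n_ref):
    """Recompute precision, recall and F1 from the three counts in a log line."""
    precision = n_match / n_model if n_model else 0.0
    recall = n_match / n_ref if n_ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# STREET_NAME after iteration 1: (71, 290, 71)
print([round(v, 4) for v in precision_recall_f1(71, 290, 71)])
# → [0.2448, 1.0, 0.3934]
```

After the first iteration the model predicts `STREET_NAME` for every token, which is why recall for that label is 1.0 while every other label scores zero.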

In [11]:
tagger3 = pycrfsuite.Tagger()
tagger3.open('address_compare/trained_models/model4')
test_group = [d for g, d in zip(group, td) if g == 0]
predicted = [tagger3.tag(fe.extract_features(t['tokens'])) for t in test_group]
[(t['raw_address'], t['tags'], p) for t, p in zip(test_group, predicted) if t['tags'] != p]

[('298 POORMAN CREEK RD #A',
  ['STREET_NUMBER',
   'STREET_NAME',
   'STREET_NAME',
   'STREET_TYPE',
   'UNIT_NUMBER'],
  ['STREET_NUMBER', 'STREET_NAME', 'STREET_TYPE', 'UNIT_TYPE', 'UNIT_NUMBER']),
 ('#4 2360 HIGHWAY 99',
  ['UNIT_NUMBER', 'STREET_NUMBER', 'STREET_TYPE', 'STREET_NAME'],
  ['UNIT_NUMBER', 'STREET_NUMBER', 'STREET_NAME', 'STREET_NAME']),
 ('9512 COUNTY ROAD T',
  ['STREET_NUMBER', 'STREET_TYPE', 'STREET_TYPE', 'STREET_NAME'],
  ['STREET_NUMBER', 'STREET_NAME', 'STREET_TYPE', 'POST_DIRECTION']),
 ('15694 SOUTH COUNTY ROAD 3',
  ['STREET_NUMBER',
   'PRE_DIRECTION',
   'STREET_TYPE',
   'STREET_TYPE',
   'STREET_NAME'],
  ['STREET_NUMBER',
   'PRE_DIRECTION',
   'STREET_NAME',
   'STREET_TYPE',
   'STREET_NUMBER'])]
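Listing the mismatches shows where the model fails, but a summary number is often handy as well. The following sketch computes sequence-level and tag-level accuracy from two lists of tag sequences; `accuracy_summary` is a helper introduced here, and the two example sequences are taken from the outputs shown earlier in this notebook.

```python
def accuracy_summary(true_sequences, predicted_sequences):
    """Count fully correct sequences and individually correct tags."""
    pairs = list(zip(true_sequences, predicted_sequences))
    sequences_correct = sum(list(t) == list(p) for t, p in pairs)
    tag_pairs = [(a, b) for t, p in pairs for a, b in zip(t, p)]
    tags_correct = sum(a == b for a, b in tag_pairs)
    return {
        "sequences": (sequences_correct, len(pairs)),
        "tags": (tags_correct, len(tag_pairs)),
    }


true_tags = [
    ["STREET_NUMBER", "PRE_DIRECTION", "STREET_NAME", "STREET_TYPE"],  # 612 S ALASKA ST
    ["UNIT_NUMBER", "STREET_NUMBER", "STREET_TYPE", "STREET_NAME"],    # #4 2360 HIGHWAY 99
]
predicted_tags = [
    ["STREET_NUMBER", "PRE_DIRECTION", "STREET_NAME", "STREET_TYPE"],
    ["UNIT_NUMBER", "STREET_NUMBER", "STREET_NAME", "STREET_NAME"],
]
print(accuracy_summary(true_tags, predicted_tags))
# → {'sequences': (1, 2), 'tags': (7, 8)}
```

In the notebook, the same function could be called with `[t['tags'] for t in test_group]` and `predicted`.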

In [12]:
print(tagger3.tag(fe.extract_features(["#1401", "750", "JERVIS", "ST"])))
tagger3.probability(['UNIT_NUMBER', 'STREET_NUMBER', 'STREET_NAME', 'STREET_TYPE'])

['UNIT_NUMBER', 'STREET_NUMBER', 'STREET_NAME', 'STREET_TYPE']


0.34449896091621734

In [13]:
fe.extract_features(["6369", "MT", "BAKER", "HWY", "STE", "A"])

[{'full_f_cty_rd': False,
  'self_f_has_digit': True,
  'self_f_is_digit1': False,
  'self_f_is_digit2': False,
  'self_f_is_digit3': False,
  'self_f_is_digit4': True,
  'self_f_is_digit5': False,
  'self_f_is_direction': False,
  'self_f_is_street_type_word': False,
  'self_f_length': 4,
  'self_f_pound': False,
  'self_f_unit_type': False},
 {'full_f_cty_rd': False,
  'self_f_has_digit': False,
  'self_f_is_digit1': False,
  'self_f_is_digit2': False,
  'self_f_is_digit3': False,
  'self_f_is_digit4': False,
  'self_f_is_digit5': False,
  'self_f_is_direction': False,
  'self_f_is_street_type_word': False,
  'self_f_length': 2,
  'self_f_pound': False,
  'self_f_unit_type': False},
 {'full_f_cty_rd': False,
  'self_f_has_digit': False,
  'self_f_is_digit1': False,
  'self_f_is_digit2': False,
  'self_f_is_digit3': False,
  'self_f_is_digit4': False,
  'self_f_is_digit5': False,
  'self_f_is_direction': False,
  'self_f_is_street_type_word': False,
  'self_f_length': 5,
  'self_f_pou

The model with these feature functions performs _extremely_ well, correctly tagging 56/58 training examples, or 272/276 training tags. Macro-averaged Precision, Recall, and F1 are all over 0.99.

To inspect the model further, we create a `pycrfsuite.Tagger` object from the saved model (the file `model4` written by the training step above). The tagger's `info()` method returns an object that exposes the learned transition weights and state feature weights.

In [14]:
tagger3 = pycrfsuite.Tagger()
tagger3.open('address_compare/trained_models/model4')
model_info = tagger3.info()

The object returned by `Tagger.info()` is a little clunky to work with, but it contains a lot of information. The following code lists the tag-to-tag transitions in order from highest to lowest weight. These are weights of the log-linear model rather than probabilities, which is why some of them are negative.

In [15]:
sorted([(v, k) for k, v in model_info.transitions.items()], reverse = True)

[(3.947201, ('UNIT_TYPE', 'UNIT_NUMBER')),
 (2.182438, ('STREET_NUMBER', 'PRE_DIRECTION')),
 (2.031264, ('STREET_NAME', 'STREET_TYPE')),
 (1.903381, ('PRE_DIRECTION', 'STREET_NAME')),
 (1.682203, ('STREET_NUMBER', 'STREET_NAME')),
 (1.336618, ('POST_DIRECTION', 'UNIT_TYPE')),
 (1.266445, ('STREET_NAME', 'STREET_NAME')),
 (1.249476, ('STREET_TYPE', 'POST_DIRECTION')),
 (0.949792, ('STREET_NUMBER', 'STREET_TYPE')),
 (0.704972, ('STREET_TYPE', 'UNIT_TYPE')),
 (0.567012, ('POST_DIRECTION', 'UNIT_NUMBER')),
 (0.486103, ('STREET_NAME', 'POST_DIRECTION')),
 (0.483353, ('STREET_TYPE', 'UNIT_NUMBER')),
 (0.202377, ('PRE_DIRECTION', 'STREET_TYPE')),
 (0.162651, ('UNIT_NUMBER', 'STREET_NUMBER')),
 (0.067689, ('STREET_TYPE', 'STREET_NAME')),
 (-0.165347, ('UNIT_NUMBER', 'UNIT_TYPE')),
 (-0.60961, ('STREET_NUMBER', 'STREET_NUMBER')),
 (-1.055264, ('STREET_TYPE', 'STREET_TYPE')),
 (-1.086708, ('STREET_TYPE', 'STREET_NUMBER'))]
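To see how these weights favour some tag orders over others, the sketch below sums the transition weights along two candidate tag sequences. It covers only the transition part of the model's score (the state feature weights also contribute), and it assumes that a transition missing from the table contributes 0. The helper `transition_score` is introduced here for illustration.

```python
# A few transition weights copied from the output above.
TRANSITIONS = {
    ("STREET_NUMBER", "STREET_NAME"): 1.682203,
    ("STREET_NAME", "STREET_TYPE"): 2.031264,
    ("STREET_NUMBER", "STREET_TYPE"): 0.949792,
    ("STREET_TYPE", "STREET_NAME"): 0.067689,
}


def transition_score(tags, transitions, default=0.0):
    """Sum the weights of consecutive tag pairs; unseen pairs add `default`."""
    return sum(transitions.get(pair, default) for pair in zip(tags, tags[1:]))


usual = ["STREET_NUMBER", "STREET_NAME", "STREET_TYPE"]    # e.g. 123 Main Street
unusual = ["STREET_NUMBER", "STREET_TYPE", "STREET_NAME"]  # e.g. 2360 HIGHWAY 99
print(round(transition_score(usual, TRANSITIONS), 6))
print(round(transition_score(unusual, TRANSITIONS), 6))
```

The much lower score for the second ordering is consistent with the mis-tagged "HIGHWAY 99" and "COUNTY ROAD" examples in the holdout set.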

Below are the state feature weights for each (feature, tag) pair, again sorted from highest to lowest.

In [16]:
sorted([(v, k) for k, v in model_info.state_features.items()], reverse=True)

[(2.829808, ('self_f_is_street_type_word', 'STREET_TYPE')),
 (2.204629, ('self_f_unit_type', 'UNIT_TYPE')),
 (2.198446, ('self_f_is_direction', 'POST_DIRECTION')),
 (2.001193, ('self_f_has_digit', 'STREET_NUMBER')),
 (1.770391, ('self_f_is_direction', 'PRE_DIRECTION')),
 (1.617426, ('full_f_cty_rd', 'STREET_TYPE')),
 (1.487678, ('self_f_pound', 'UNIT_NUMBER')),
 (1.483526, ('self_f_is_digit4', 'STREET_NUMBER')),
 (1.139578, ('self_f_is_digit5', 'STREET_NUMBER')),
 (0.776225, ('self_f_is_digit3', 'STREET_NUMBER')),
 (0.646255, ('self_f_is_digit2', 'STREET_NAME')),
 (0.550178, ('self_f_has_digit', 'UNIT_NUMBER')),
 (0.539719, ('self_f_is_digit1', 'STREET_NAME')),
 (0.43181, ('self_f_length', 'STREET_NAME')),
 (0.237129, ('self_f_is_digit1', 'UNIT_NUMBER')),
 (0.176104, ('self_f_is_digit2', 'STREET_NUMBER')),
 (0.139914, ('self_f_length', 'STREET_TYPE')),
 (0.096437, ('self_f_is_digit3', 'UNIT_NUMBER')),
 (0.080844, ('full_f_cty_rd', 'STREET_NUMBER')),
 (0.080687, ('self_f_length', 'STREE

In [17]:
tagger3.tag(fe.extract_features(["45", "MOUNT", "MAKER", "HIGHWAY"]))

['STREET_NUMBER', 'STREET_NAME', 'STREET_NAME', 'STREET_TYPE']

In [18]:
tagger3.tag(fe.extract_features(["45", "HIGHWAY", "15"]))

['STREET_NUMBER', 'STREET_NAME', 'STREET_NAME']

In [19]:
tagger3.tag(fe.extract_features(["19228", "SUNSET", "BOULEVARD"]))

['STREET_NUMBER', 'STREET_NAME', 'STREET_TYPE']

In [20]:
at = AddressTagger()
at.tag("3433 COUNTY ROAD 28")

{'POST_DIRECTION': '',
 'PRE_DIRECTION': '',
 'STREET_NAME': 'COUNTY',
 'STREET_NUMBER': '3433 28',
 'STREET_TYPE': 'ROAD',
 'UNIT_NUMBER': '',
 'UNIT_TYPE': '',
 'UNKNOWN': ''}
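The dictionary above suggests that tokens sharing a tag are joined with spaces, which is how "3433" and "28" end up together under `STREET_NUMBER`. A minimal sketch of such a grouping step follows. It is a guess at the behaviour based on this one output, not the actual `AddressTagger` code, and the tag sequence used in the example is inferred from the output rather than printed by the tagger.

```python
TAG_NAMES = [
    "POST_DIRECTION", "PRE_DIRECTION", "STREET_NAME", "STREET_NUMBER",
    "STREET_TYPE", "UNIT_NUMBER", "UNIT_TYPE", "UNKNOWN",
]


def collapse(tokens, tags):
    """Group tokens by tag, joining tokens that share a tag with a space."""
    result = {name: "" for name in TAG_NAMES}
    for token, tag in zip(tokens, tags):
        key = tag if tag in result else "UNKNOWN"  # unexpected tags fall through
        result[key] = (result[key] + " " + token).strip()
    return result


tokens = ["3433", "COUNTY", "ROAD", "28"]
# Tag sequence assumed from the dictionary shown above.
tags = ["STREET_NUMBER", "STREET_NAME", "STREET_TYPE", "STREET_NUMBER"]
print(collapse(tokens, tags))
```

The example also shows the county-road weakness seen in the holdout errors: "28" is tagged as a second street number rather than as the street name.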