# Environment Setup

Source: https://github.com/elsevierlabs-os/soda

Note: SoDA matches at the Lucene word level. If a word itself is changed, SoDA will most likely not find the match.

Steps to start Solr:

1. cd to the Solr directory.
2. Enter the command "bin/solr start -p 8984".

Steps to start Jetty and SoDA:

1. cd to the SoDA directory.
2. Enter "sbt".
3. Enter "jetty:start".

In [137]:
import sodaclient
import pandas as pd
import numpy as np

In [8]:
# Establish a connection to the soda web client
# If running local web app, will need a "soda" extension to the URL - http://localhost:8080/soda/
client = sodaclient.SodaClient("http://localhost:8080/")

In [41]:
# Check the status of soda
ind_resp = client.index()
print("Soda status:", ind_resp)
# list the lexicons
print("Lexicons:", client.dicts())

Soda status: {'status': 'ok', 'message': 'SoDA accepting requests (Solr version 7.3.0)'}
Lexicons: {'status': 'ok', 'lexicons': [{'lexicon': 'companies_addr', 'count': 1000}, {'lexicon': 'companies_city', 'count': 1000}, {'lexicon': 'companies_code', 'count': 1000}, {'lexicon': 'companies_ctry', 'count': 1000}, {'lexicon': 'companies_dict', 'count': 1000}, {'lexicon': 'companies_name', 'count': 1000}]}
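
Before searching, it can help to confirm that every lexicon the later cells query is actually loaded. The following is a minimal sketch that uses the response printed above as a literal, so it runs without a server. The names `REQUIRED` and `missing_lexicons` are illustrative and not part of sodaclient.

```python
# Response copied from the client.dicts() output shown above.
dicts_resp = {
    "status": "ok",
    "lexicons": [
        {"lexicon": "companies_addr", "count": 1000},
        {"lexicon": "companies_city", "count": 1000},
        {"lexicon": "companies_code", "count": 1000},
        {"lexicon": "companies_ctry", "count": 1000},
        {"lexicon": "companies_dict", "count": 1000},
        {"lexicon": "companies_name", "count": 1000},
    ],
}

# Lexicons queried later in this notebook (hypothetical constant name).
REQUIRED = [
    "companies_name",
    "companies_addr",
    "companies_city",
    "companies_ctry",
    "companies_code",
]


def missing_lexicons(resp, required):
    """Return the required lexicon names that are absent or empty in resp."""
    loaded = {
        d["lexicon"] for d in resp.get("lexicons", []) if d.get("count", 0) > 0
    }
    return [name for name in required if name not in loaded]


print(missing_lexicons(dicts_resp, REQUIRED))  # prints []
```

An empty list means nothing needs to be reloaded before running the search cells.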


Load the TSV file containing the complete name and address strings, which will be used to return suggestions to the user.

In [85]:
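# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 is the closest equivalent
df = pd.read_csv('companies_dict.tsv', sep='\t', index_col=0)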

Add dictionaries. This was already done, so the cell below does not need to be run here.

In [14]:
# add_resp = client.add(lexicon_name, id, names, commit)

To add lexicons from the command line: in sbt, enter "run lexicon_name path_to_tsv number_of_players"

For example: "run companies_city Desktop/companies_city1.tsv 1"

# Get user entry

In [175]:
# Prompt the user to enter an address in separate fields
name = input("Enter the company name: ")
addr = input("Enter the company address: ")
city = input("Enter the city name: ")
ctry = input("Enter the country: ")
code = input("Enter the postal code: ")

Enter the company name: 1 mobile limited
Enter the company address: 30 city road street
Enter the city name: london
Enter the country: uk
Enter the postal code: ec


In [176]:
# Combine for one longer string
user_entry = name + " " + addr + " " + city + " " + ctry + " " + code
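
One caveat with plain concatenation: a blank field leaves doubled or trailing spaces, and a tokenizer that splits on a single space then produces empty tokens that count as extra edits. A small sketch of a more defensive join follows; `combine_fields` is a hypothetical helper name, not something the notebook defines.

```python
def combine_fields(*fields):
    """Join the non-empty fields with single spaces, dropping blank ones."""
    return " ".join(f.strip() for f in fields if f and f.strip())


# All five fields filled in, as in the prompt above.
print(combine_fields("1 mobile limited", "30 city road street", "london", "uk", "ec"))
# prints: 1 mobile limited 30 city road street london uk ec

# Two blank fields: no trailing or doubled spaces are produced.
print(repr(combine_fields("1 Limited Mobile", "30 City", "", "")))
# prints: '1 Limited Mobile 30 City'
```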

# Searching: Step 1- Reverse Lookup

RLookup allows non-streaming matching of phrases against entries in the dictionary. In other words, it takes a shorter string that is missing info and finds matches in the dictionary.

It is important here to decide which matching type to use. s3sort is beneficial because it accounts for entries that are out of order. stem2 and stem3, however, are useful because they account for special characters, where s3sort does not.
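
To see why a sort-based matching type tolerates words entered out of order, consider the toy normalization below. It is only an illustration of the idea and makes no claim about how SoDA implements s3sort internally; `sorted_key` is a hypothetical name.

```python
def sorted_key(text):
    """Lowercase the text, split it on whitespace, and sort the words."""
    return " ".join(sorted(text.lower().split()))


# The same words in a different order map to the same key.
print(sorted_key("1 Limited Mobile") == sorted_key("1 MOBILE LIMITED"))  # prints True

# Without sorting, the lowercased strings differ.
print("1 Limited Mobile".lower() == "1 MOBILE LIMITED".lower())  # prints False
```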

In [273]:
# First- check the two most telling fields- name and address
name = "1 Limited Mobile"
addr = "30 City"
city = ""
ctry = ""
name_rlook = (client.rlookup('companies_name', name, 's3sort'))
addr_rlook = (client.rlookup('companies_addr', addr, 's3sort'))

SoDA will give preference to the company name field. The first idea was to check whether only one match was found in the name lexicon and then see whether that ID also appeared in any of the other fields. If it did, return that address. However, there would be hundreds of matches in the city, country, and postal code lexicons, and rlookup returns only 10 matches, so this was not ideal.

New approach: if there is only one match in the name lexicon, calculate the edit distance between the user's input and the suggested address. If the edit distance is below some threshold, return the suggestion.

### Calculate the edit distance

Set the penalties so that a substitution costs 2 (the same as one deletion plus one insertion) rather than the Levenshtein cost of 1.

In [130]:
INSERTION_PENALTY = 1
DELETION_PENALTY = 1
# This substitution penalty differs from the Levenshtein cost (which would be 1)
SUBSTITUTION_PENALTY = 2
ALLOWED_LEVELS = ["word", "char"]
LEVEL = "word"

This function computes the value of one cell of the dynamic-programming array D: the cheapest of an insertion, a deletion, or a substitution (free when the two tokens match), each added to the cost already stored for the corresponding smaller subproblem.

In [131]:
def compute_cost(D, i, j, token_X, token_Y):
    relative_subst_cost = 0 if token_X == token_Y else SUBSTITUTION_PENALTY
    return min(D[i-1, j] + INSERTION_PENALTY, D[i, j-1] + DELETION_PENALTY, D[i-1, j-1] + relative_subst_cost)
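
For reference, the cell value computed above can be written as a recurrence. This is a restatement under the usual Wagner-Fischer conventions, where $x_i$ and $y_j$ denote the $i$-th and $j$-th tokens (1-based) and row or column 0 stands for the empty prefix. The symbols $c_{\text{ins}}$, $c_{\text{del}}$, and $c_{\text{sub}}$ are shorthand for the three penalty constants.

$$
D_{i,j} = \min
\begin{cases}
D_{i-1,j} + c_{\text{ins}} \\
D_{i,j-1} + c_{\text{del}} \\
D_{i-1,j-1} + c_{\text{sub}} \cdot [x_i \neq y_j]
\end{cases}
\qquad D_{i,0} = i, \quad D_{0,j} = j
$$

With $c_{\text{ins}} = c_{\text{del}} = 1$ and $c_{\text{sub}} = 2$, a substitution costs the same as a deletion followed by an insertion, so the distance between sequences of lengths $n$ and $m$ equals $n + m - 2\,\mathrm{LCS}(x, y)$, where $\mathrm{LCS}$ is the length of their longest common subsequence.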

This function splits the address string into tokens: separate words by default, or individual characters when level is "char".

In [132]:
def tokenize_string(string, level="word"):
    assert level in ALLOWED_LEVELS
    if level == "word":
        return string.split(" ")
    else:
        return list(string)

This function computes the edit distance between two strings by dynamically storing the edit distance of each smaller substring in a 2-D array.

In [133]:
def minimum_edit_distance(string1, string2, level="word"):
    """The function uses the dynamic programming approach from Wagner-Fischer to compute the minimum edit distance
    between two sequences.
    :param string1 first sequence
    :param string2 second sequence
    :param level defines on which granularity the algorithm will be applied. "word" specifies the token to
    be sequential words while "char" applies the algorithm on a character-by-character level"""
    # Call tokenize string on the two address strings that were passed to the method
    string1_tokens = tokenize_string(string1, level)
    string2_tokens = tokenize_string(string2, level)
    n = len(string1_tokens)
    m = len(string2_tokens)
   
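    # One extra row and column represent the empty prefix of each sequence,
    # so that the first token of each string is compared as well
    D = np.zeros((n + 1, m + 1))

    for i in range(n + 1):
        for j in range(m + 1):
            if j == 0:
                D[i, j] = i
            elif i == 0:
                D[i, j] = j
            else:
                # Row i and column j correspond to tokens i-1 and j-1
                D[i, j] = compute_cost(D, i, j, string1_tokens[i - 1], string2_tokens[j - 1])

    return string2_tokens, D[n, m]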
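
As a sanity check on the penalties, here is a standalone pure-Python version of the same word-level recurrence, using a table with one extra row and column for the empty prefix. It is a sketch for experimentation, not the notebook's function, and `word_edit_distance` is a hypothetical name. It also shows that the comparison is case sensitive: the lexicon entries in this notebook are upper case while the sample user entry is lower case, so normalizing case first changes the result considerably.

```python
def word_edit_distance(a, b, sub_cost=2):
    """Word-level edit distance with insert/delete cost 1 and the given substitution cost."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))  # row for the empty prefix of x
    for i, tx in enumerate(x, start=1):
        cur = [i]  # cost of deleting the first i words of x
        for j, ty in enumerate(y, start=1):
            subst = 0 if tx == ty else sub_cost
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + subst))
        prev = cur
    return prev[-1]


suggestion = "1 MOBILE LIMITED 30 CITY ROAD LONDON EC1Y 2AB"
entry = "1 mobile limited 30 city road street london uk ec"

# Case-sensitive: only "1" and "30" match.
print(word_edit_distance(suggestion, entry))  # prints 15

# Case-normalized: seven words match, three are extra, two are missing.
print(word_edit_distance(suggestion.lower(), entry.lower()))  # prints 5
```

Against a threshold of 25 both values would be accepted, which suggests that lowercasing both strings and choosing a tighter threshold may discriminate better.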

Case 1- There is only one match in the name dictionary. Get the address and return it if the edit distance between the suggestion and user entry is below the threshold.

In [272]:
response_soda = "None found!"
if len(name_rlook['entries']) == 1:
    name_id = name_rlook['entries'][0]['id']
    raw_id = int(name_id.split('_')[1])
    response_soda = (df.iloc[raw_id - 1]['NAME'])
    dist = minimum_edit_distance(response_soda, user_entry)
    # if the edit distance is above the threshold, discard the suggestion
    if (dist[1] > 25.0):
        response_soda = "None found!"
print(response_soda)

1 MOBILE LIMITED 30 CITY ROAD LONDON EC1Y 2AB


Case 2- There are multiple matches in the name. Find the IDs that appear in both the name matches and the address matches. The same limitation applies here: rlookup returns at most 10 matches.

In [256]:
# return the one with the largest combined confidence
response_soda = "None found!"
# Track the best combined confidence across all names, not per name
highest_confidence = 0
for name in name_rlook['entries']:
    name_id = name['id']
    addr_id = "ADDR_" + name_id.split('_')[1]
    raw_id = int(name_id.split('_')[1])
    for addr in addr_rlook['entries']:
        if addr['id'] == addr_id:
            conf = name['confidence'] + addr['confidence']
            # Keep only the pair with the highest combined confidence so far
            if conf > highest_confidence:
                highest_confidence = conf
                response_soda = df.iloc[raw_id - 1]['NAME']
print(response_soda)

None found!
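
The join on IDs can also be done with a dictionary lookup instead of a nested loop. The sketch below runs on made-up responses shaped like the rlookup output in this notebook. The confidence values and the helper name `best_joint_match` are hypothetical, and it assumes, as the cells above do, that the number after the underscore identifies the same company in every lexicon.

```python
# Hypothetical responses shaped like the rlookup output shown in this notebook.
name_resp = {"entries": [
    {"id": "NAME_12", "confidence": 0.25},
    {"id": "NAME_645", "confidence": 0.5},
]}
addr_resp = {"entries": [
    {"id": "ADDR_645", "confidence": 0.25},
    {"id": "ADDR_12", "confidence": 0.75},
    {"id": "ADDR_7", "confidence": 1.0},
]}


def best_joint_match(name_entries, addr_entries):
    """Return (row_number, combined_confidence) for the best ID in both lists, or None."""
    # Index address confidences by the numeric part of the ID.
    addr_conf = {e["id"].split("_")[1]: e["confidence"] for e in addr_entries}
    best = None
    for e in name_entries:
        num = e["id"].split("_")[1]
        if num in addr_conf:
            score = e["confidence"] + addr_conf[num]
            if best is None or score > best[1]:
                best = (int(num), score)
    return best


print(best_joint_match(name_resp["entries"], addr_resp["entries"]))  # prints (12, 1.0)
```

The returned row number follows the notebook's convention, so the full address would be read with `df.iloc[row_number - 1]['NAME']`.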


Case 3- If there is only one match in the address lexicon, return that full address. When there are multiple address matches, checking for corresponding IDs in city, country, and code would not be useful because of the 10-match maximum. If none of these 3 cases work, move on to step 2.

In [257]:
response_soda = "None found!"
if len(addr_rlook['entries']) == 1:
    addr_id = addr_rlook['entries'][0]['id']
    raw_id = int(addr_id.split('_')[1])
    response_soda = (df.iloc[raw_id - 1]['NAME'])
    dist = minimum_edit_distance(response_soda, user_entry)
    # if the edit distance is above the threshold, discard the suggestion
    if (dist[1] > 25.0):
        response_soda = "None found!"
print(response_soda)

None found!


# Searching: Step 2- Annotate
There are faults with rlookup, so we use the annotate method as a backup. For example, rlookup will only return a maximum of 10 matches, so vague searches or searches in the city, country, and postal code dictionaries will not be useful. If no results are found in rlookup, then we will use annotate, which is useful if the user enters information in the wrong field, or adds additional and unnecessary information to the address.

This implementation will find addresses such as 1 Mobile Limited Company London UK, which rlookup would not find.

Take the input as one large string and annotate it against each separate lexicon.

In [295]:
user_entry = "1 Mobile Limited Company in the city of london"
name_annot = client.annot('companies_name', user_entry, 'stem2')
addr_annot = client.annot('companies_addr', user_entry, 'stem2')

Case 1- There is only one match in the name dictionary. If that matching ID exists in some other lexicon match, then return the full address.

In [296]:
# case 1- there is only one match in the name dictionary
response_soda = "None found!"
if len(name_annot['annotations']) == 1:
    # Read the ID from this response first, then derive the row number from it
    name_id = name_annot['annotations'][0]['id']
    raw_id = int(name_id.split('_')[1])
    addr_id = "ADDR_" + name_id.split('_')[1]
    dict_id = "DICT_" + name_id.split('_')[1]
    #First, check if there is a matching address with the same ID
    found = False
    for entry in addr_annot['annotations']:
        if entry['id'] == addr_id:
            found = True
            # print the full name and address to recommend to user
            response_soda = (df.iloc[raw_id - 1]['NAME'])
    # Next, check city
    if found == False:
        city_id = "CITY_" + name_id.split('_')[1]
        city_annot = (client.annot('companies_city', user_entry, 'stem2'))
        for entry in city_annot['annotations']:
            if entry['id'] == city_id:
                found = True
                # print the full name and address to recommend to user
                response_soda = (df.iloc[raw_id - 1]['NAME'])
    # Then check country
    if found == False:
        ctry_id = "CTRY_" + name_id.split('_')[1]
        ctry_annot = (client.annot('companies_ctry', user_entry, 'stem2'))
        for entry in ctry_annot['annotations']:
            if entry['id'] == ctry_id:
                found = True
                # print the full name and address to recommend to user
                response_soda = (df.iloc[raw_id - 1]['NAME'])
    # Finally, check postal code
    if found == False:
        code_id = "CODE_" + name_id.split('_')[1]
        code_annot = (client.annot('companies_code', user_entry, 'stem2'))
        for entry in code_annot['annotations']:
            if entry['id'] == code_id:
                found = True
                # print the full name and address to recommend to user
                response_soda = (df.iloc[raw_id - 1]['NAME'])
                
print(response_soda)

1 MOBILE LIMITED 30 CITY ROAD LONDON EC1Y 2AB
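
The four near-identical checks above could be expressed as one loop over the lexicons in priority order. The following is a sketch under the assumption that each lexicon uses the ID prefixes seen in this notebook (ADDR, CITY, CTRY, CODE). `confirm_name_match`, `FALLBACK_LEXICONS`, and the stand-in `fake_annot` are hypothetical names, with the stand-in used so the example runs without a server.

```python
# Lexicons to consult, in priority order, paired with the ID prefix each one uses.
FALLBACK_LEXICONS = [
    ("ADDR", "companies_addr"),
    ("CITY", "companies_city"),
    ("CTRY", "companies_ctry"),
    ("CODE", "companies_code"),
]


def confirm_name_match(annotate, name_id, text, matching="stem2"):
    """Return the first lexicon whose annotations contain the name's row number, or None.

    `annotate` is any callable taking (lexicon, text, matching) and returning
    a dict with an 'annotations' list, for example client.annot.
    """
    row = name_id.split("_")[1]
    for prefix, lexicon in FALLBACK_LEXICONS:
        wanted = prefix + "_" + row
        resp = annotate(lexicon, text, matching)
        if any(a["id"] == wanted for a in resp.get("annotations", [])):
            return lexicon
    return None


# Stand-in for client.annot with made-up data: only the city lexicon matches.
def fake_annot(lexicon, text, matching):
    canned = {"companies_city": [{"id": "CITY_3"}, {"id": "CITY_1"}]}
    return {"status": "ok", "annotations": canned.get(lexicon, [])}


entry = "1 Mobile Limited Company in the city of london"
print(confirm_name_match(fake_annot, "NAME_1", entry))  # prints companies_city
```

Because the loop stops at the first confirming lexicon, later lexicons are not queried, which matches the early-exit behavior of the `found` flag above.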


Case 2- There are multiple name matches found. Find corresponding IDs in the addr matches.

In [278]:
# return the one with the largest combined confidence
response_soda = "None found!"
# Track the best combined confidence across all names, not per name
highest_confidence = 0
for name in name_annot['annotations']:
    name_id = name['id']
    addr_id = "ADDR_" + name_id.split('_')[1]
    raw_id = int(name_id.split('_')[1])
    for addr in addr_annot['annotations']:
        if addr['id'] == addr_id:
            conf = name['confidence'] + addr['confidence']
            # Keep only the pair with the highest combined confidence so far
            if conf > highest_confidence:
                highest_confidence = conf
                response_soda = df.iloc[raw_id - 1]['NAME']
print(response_soda)

1 MOBILE LIMITED 30 CITY ROAD LONDON EC1Y 2AB


Case 3- There is only one match in the address

In [279]:
response_soda = "None found!"
if len(addr_annot['annotations']) == 1:
    addr_id = addr_annot['annotations'][0]['id']
    raw_id = int(addr_id.split('_')[1])
    response_soda = (df.iloc[raw_id - 1]['NAME'])
    dist = minimum_edit_distance(response_soda, user_entry)
    # if the edit distance is above the threshold, discard the suggestion
    if (dist[1] > 25.0):
        response_soda = "None found!"
print(response_soda)

None found!


# Step 3- Combination

The last case to account for is when too much information is entered in one field but too little in another, such as 1 Mobile Limited Company 30 City. Here, we must use a combination of Reverse Lookup and Annotate.

Start by checking if there is a match for name through the annotation method. Run those IDs against the matches found with the addresses reverse lookup method. Return the match with the highest confidence.

In [262]:
response_soda = "None found!"
# Track the best combined confidence across all names, not per name
highest_confidence = 0
for name in name_annot['annotations']:
    name_id = name['id']
    addr_id = "ADDR_" + name_id.split('_')[1]
    raw_id = int(name_id.split('_')[1])
    # rlookup responses list their matches under 'entries', not 'annotations'
    for addr in addr_rlook['entries']:
        if addr['id'] == addr_id:
            conf = name['confidence'] + addr['confidence']
            # Keep only the pair with the highest combined confidence so far
            if conf > highest_confidence:
                highest_confidence = conf
                response_soda = df.iloc[raw_id - 1]['NAME']
print(response_soda)

1 MOBILE LIMITED 30 CITY ROAD LONDON EC1Y 2AB
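
Taken together, the three steps form a fallback chain: try reverse lookup, then annotate, then the combination, and stop at the first suggestion. A minimal sketch of that orchestration follows. The three strategy functions are hypothetical stand-ins that return canned values so the example runs without a server; in practice each would wrap the corresponding cells above.

```python
def first_suggestion(strategies, user_entry, default="None found!"):
    """Run each strategy in order and return the first non-None suggestion."""
    for strategy in strategies:
        result = strategy(user_entry)
        if result is not None:
            return result
    return default


# Hypothetical stand-ins for Step 1, Step 2, and Step 3.
def reverse_lookup_step(entry):
    return None  # pretend rlookup found nothing


def annotate_step(entry):
    if "mobile" in entry.lower():
        return "1 MOBILE LIMITED 30 CITY ROAD LONDON EC1Y 2AB"
    return None


def combination_step(entry):
    return None


steps = [reverse_lookup_step, annotate_step, combination_step]
print(first_suggestion(steps, "1 Mobile Limited Company in the city of london"))
# prints: 1 MOBILE LIMITED 30 CITY ROAD LONDON EC1Y 2AB
print(first_suggestion(steps, "unknown company"))
# prints: None found!
```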


# Extra Work and Notes
Examples of stem vs. s3sort are below.

In [314]:
name2 = "Bull stúdios" #ORIGINAL IS "THE BULL STUDIO LTD"
print(client.rlookup('companies_name', name2, 'stem3'))

{'status': 'ok', 'entries': [{'id': 'NAME_645', 'lexicon': 'companies_name', 'text': 'THE BULL STUDIO LTD', 'confidence': 0.5263157894736842}]}


In [315]:
name2 = "Bull studio ltd." #ORIGINAL IS "THE BULL STUDIO LTD"
print(client.rlookup('companies_name', name2, 's3sort'))

{'status': 'ok', 'entries': [{'id': 'NAME_645', 'lexicon': 'companies_name', 'text': 'THE BULL STUDIO LTD', 'confidence': 0.7368421052631579}]}


In case lexicons need to be deleted and reloaded:

In [316]:
# resp = client.delete('companies_name', "*")
# resp1 = client.delete('companies_addr', "*")
# resp2 = client.delete('companies_city1', "*")
# resp3 = client.delete('companies_zip', "*")