# Entity Resolution
* Xavier Ignacio Gonzalez (xig2000@columbia.edu)
* Woojin Kim (wk2246@columbia.edu)
* Diego Miguel Llarrull (dml2189@columbia.edu)

## Techniques 
The approach we designed to resolve identical entities has three matching levels.
1. We matched entries that share the same normalized phone number.
2. We matched the remaining unmatched entries by matching normalized venue names.
3. We matched as many of the remaining venues as possible with a cross-validated AdaBoost classifier trained on pairwise similarity scores for each venue feature.

### Processing
As the description suggests, before any analysis we preprocessed the dataset by normalizing its features to make them comparable. This involved
* removing conflicting Unicode characters and punctuation marks,
* shifting all characters to lowercase,
* preserving only the domain names of the URLs,
* expanding often-used acronyms,
* and formalizing null values by converting them to Python's `None`.
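
The normalization steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own cleanup code (which uses pandas regex replacements further below); the function names are hypothetical and the sample inputs are made up.

```python
import re
import unicodedata

def normalize_name(name):
    """Lowercase, strip accents and punctuation, and drop whitespace."""
    if not name:
        return None
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().replace("&", "and")
    # Keep only ASCII letters and digits.
    return re.sub(r"[^a-z0-9]", "", text) or None

def normalize_phone(phone):
    """Keep digits only, so differently formatted numbers compare equal."""
    if not phone:
        return None
    return re.sub(r"\D", "", phone) or None

def normalize_website(url):
    """Reduce a URL to the first label of its host name, lowercased."""
    if not url:
        return None
    host = re.sub(r"^https?://(www\.)?", "", url.strip().lower())
    return host.split("/")[0].split(".")[0] or None

print(normalize_name(u"Caf\u00e9 & Bar's"))
print(normalize_phone("(212) 555-0100"))
print(normalize_website("http://www.Example.com/menu"))
```

Returning `None` for empty inputs keeps missing values distinguishable from real strings, which matters later when a "missing" flag is computed per feature.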

Once preprocessed, the dataset was extended using the `street-address` Python address normalization library. This library parses addresses and normalizes their structure into five fields: *house*, *street_name*, *street_type*, *suite_num* and *suite_type*. Rather than dropping the original address field, we kept it and added these five fields as new columns.

To avoid scoring all pairwise comparisons, we used phone numbers to screen out easily matched entities. On the training set, phone numbers alone gave over 99% precision and found 225 of the 360 entries in the match file. We then looked for entities with a unique, exact name match to find more easy matches. Requiring uniqueness avoids pairing franchises and other repeated names that would otherwise match. Added on top of the phone number matching, this yields 333 matched pairs against the 360 labeled matches, still at over 99% precision. Consequently, phone numbers and unique names were the most important features powering our technique.
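
The 99% figure measured in this notebook is the share of proposed pairs that are correct (precision); the share of the 360 labeled matches that are recovered (recall) is a separate number. The sketch below works both out from the counts. The 360 and 225 come from the text and the 333 from the notebook output; the 227 proposed phone pairs and the 331 correct pairs after name matching are assumptions inferred from the printed ratios 0.9912 and 0.9940.

```python
def precision_recall(correct, proposed, total_true):
    """Precision = correct / proposed, recall = correct / total_true."""
    precision = correct / float(proposed)
    recall = correct / float(total_true)
    return precision, recall

TOTAL_TRUE = 360  # rows in the training match file

# 227 and 331 are inferred from the printed ratios, not read from the data.
phone_p, phone_r = precision_recall(225, 227, TOTAL_TRUE)
name_p, name_r = precision_recall(331, 333, TOTAL_TRUE)

print("phone only:   precision=%.4f recall=%.4f" % (phone_p, phone_r))
print("phone + name: precision=%.4f recall=%.4f" % (name_p, name_r))
```

Under these assumptions the two rule-based stages recover roughly 92% of the labeled matches, which leaves the classifier with the small, harder remainder.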

### Machine learning classification
After this, we used machine learning to classify the remainder of the entries. We created a pairwise comparison row for every combination of the remaining Locu and Foursquare entries. For each pair we calculated the distance between the latitude/longitude coordinates, a string similarity score for most of the fields, and the length of the longest common substring of the two names.
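
As a rough illustration of the features described here, the sketch below builds a feature dictionary for a single candidate pair using only the standard library. The record layout (plain dicts), the function names, and the sample venues are hypothetical. The name-overlap feature is the length of the longest contiguous block shared by the two names, which is what `calc_lcs` in this notebook computes.

```python
import math
from difflib import SequenceMatcher

def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def longest_common_substring(a, b):
    """Length of the longest contiguous block shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size

def pair_features(lc, fs):
    """Feature dict for one (Locu, Foursquare) pair of normalized records."""
    features = {
        "distance": math.hypot(lc["latitude"] - fs["latitude"],
                               lc["longitude"] - fs["longitude"]),
        "name_sim": similarity(lc["name"], fs["name"]),
        "lcs": longest_common_substring(lc["name"], fs["name"]),
    }
    # Optional fields get a similarity plus a flag that records absence.
    for field in ("postal_code", "website"):
        left, right = lc.get(field), fs.get(field)
        present = bool(left and right)
        features[field + "_sim"] = similarity(left, right) if present else 0.0
        features[field + "_missing"] = 0 if present else 1
    return features

# Made-up records for illustration only.
lc = {"name": "joespizza", "latitude": 40.7398, "longitude": -73.9896,
      "postal_code": "10003", "website": "joespizza"}
fs = {"name": "joespizzeria", "latitude": 40.7399, "longitude": -73.9897,
      "postal_code": "10003", "website": None}
print(pair_features(lc, fs))
```

Pairing each similarity with a `_missing` flag lets a tree-based model tell "fields disagree" apart from "field absent", since both would otherwise show up as a similarity of 0.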

We tried a variety of classification techniques, including random forest, extra trees, bagging, AdaBoost, and decision trees, all of which yielded similar results. After repeated cross-validation to search for the best hyperparameters, we used an AdaBoost classifier to produce our final submission.

### Results
Our submission for the test set resulted in the following scores on the Instabase leaderboard:
* Recall: 99.57%
* Precision: 96.67%
* F1 score: 98.10%
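
As a quick consistency check, the F1 score is the harmonic mean of precision and recall, so the third figure can be recomputed from the first two. The helper name below is ours, not part of the leaderboard tooling.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

f1 = f1_score(0.9667, 0.9957)
print("F1 = %.2f%%" % (100 * f1))  # prints "F1 = 98.10%"
```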


## Data loading and preprocessing

In [1]:
import csv
import json
import pandas as pd
import math
import numpy as np
import streetaddress as sa
from difflib import SequenceMatcher

PATH = "Prakhar/er-assignment/fs/Instabase%20Drive/files/datasets/"
FILES = {
    "foursquare_test": "foursquare_test_hard.json",
    "locu_test": "locu_test_hard.json",
    "matches": "matches_train_hard.csv",
    "foursquare_train": "foursquare_train_hard.json",
    "locu_train": "locu_train_hard.json"
}

#Instabase load
# fs_train = pd.read_json(ib.open(PATH + FILES["foursquare_train"]))
# fs_test = pd.read_json(ib.open(PATH + FILES["foursquare_test"]))
# lc_train = pd.read_json(ib.open(PATH + FILES["locu_train"]))
# lc_test = pd.read_json(ib.open(PATH + FILES["locu_test"]))
# matches = pd.read_csv(ib.open(PATH + FILES["matches"]))

#Local load
fs_train = pd.read_json('data/foursquare_train_hard.json')
fs_test = pd.read_json('data/foursquare_test_hard.json')
lc_train = pd.read_json('data/locu_train_hard.json')
lc_test = pd.read_json('data/locu_test_hard.json')
matches = pd.read_csv('data/matches_train_hard.csv')

### Miscellaneous Functions

In [2]:
def find_distance(pt1, pt2):
    return math.sqrt( (pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2 )
    
def string_similarity(str1, str2):
    return SequenceMatcher(None, str1, str2).ratio()

# Length of the longest common substring (contiguous run) shared by s1 and s2
def calc_lcs(s1, s2):
    m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in xrange(1, 1 + len(s1)):
        for y in xrange(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
                
    return len(s1[x_longest - longest: x_longest])

# Normalizes street addresses using the streetaddress library. All normalized fields are added as columns
def addr_parse(address):
    if address is not None: 
        addr_parser = sa.StreetAddressParser()
        addr = addr_parser.parse(address)
        format = {'house': [addr['house']],
                  'street_name': [addr['street_name']],
                  'street_type': [addr['street_type']],
                  'suite_num': [addr['suite_num']],
                  'suite_type': [addr['suite_type']] }   
    else: 
        format = {'house': [None],
                  'street_name': [None],
                  'street_type': [None],
                  'suite_num': [None],
                  'suite_type': [None] }
    rv = pd.DataFrame(data = format)
    return rv

### Cleanup

In [3]:
data_list = {'fs_train': fs_train,
             'fs_test': fs_test,
             'lc_train': lc_train,
             'lc_test': lc_test }

fs_train_phone_dir, fs_test_phone_dir = {}, {}
lc_train_phone_dir, lc_test_phone_dir = {}, {}
phone_dir = {'fs_train': fs_train_phone_dir,
                   'fs_test': fs_test_phone_dir,
                   'lc_train': lc_train_phone_dir,
                   'lc_test': lc_test_phone_dir }

for df_name, df in data_list.iteritems():
    df.drop(['country', 'region', 'locality'], inplace=True, axis=1)
    
    df.replace([''], [None], inplace=True)
    
    df['id'] = df['id'].astype('str')
    df['latitude'] = pd.to_numeric(df['latitude'])
    df['longitude'] = pd.to_numeric(df['longitude'])
    
    # Unicode chars to replace
    df['name'].replace([u"\xe9"], ['e'], regex=True, inplace=True)
    df['name'].replace([u"\xed"], ['i'], regex=True, inplace=True)
    df['name'].replace([u'\u2019'], [''], regex=True, inplace=True)
    df['name'].replace([u'\xc7'], ['c'], regex=True, inplace=True)
    df['name'].replace([u'\u2013'], ['-'], regex=True, inplace=True)
    
    df['name'].replace([r':|\'|,|\.|-'], [''], regex=True, inplace=True)
    df['name'].replace(['&'], ['and'], regex=True, inplace=True)
    df['name'].replace(['\s+|\/'], [' '], regex=True, inplace=True)
    df['name'].replace(['\s'], [''], regex=True, inplace=True)

    df['name'] = df['name'].astype(str).str.lower()
    
    df['phone'].replace([r'\(|\)|\s|-'], [''], regex=True, inplace=True)
    
    # Make a phone directory
    current_phone_dir = phone_dir[df_name]
    for i, row in df.iterrows():
        if row['phone'] != None:
            current_phone_dir[row['phone']] = row['id']
            
    df['street_address'].replace([r'<sup>|<\/sup>'], [''], regex=True, inplace=True)
    df['street_address'].replace([r'\.'], [''], regex=True, inplace=True)
    df['street_address'].replace([r'Jfk'], ['John F Kennedy'], regex=True, inplace=True)
    df['street_address'] = df['street_address'].astype(str)
    
    df['website'].replace([u"\u200e"], [''], regex=True, inplace=True)
    df['website'].replace([r'http(s)?://(www.)?|\\u200e'], [''], regex=True, inplace=True)
    df['website'].replace([r'\..*'], [''], regex=True, inplace=True)
    df['website'] = df['website'].astype(str).str.lower()
    df['website'].replace(['None'], [None], inplace=True)
    
    
c = fs_train['street_address'].apply(addr_parse)
cols = pd.concat([i for i in c]).reset_index(drop=True)
fs_train = pd.concat([fs_train,cols], axis = 1)

c = fs_test['street_address'].apply(addr_parse)
cols = pd.concat([i for i in c]).reset_index(drop=True)
fs_test = pd.concat([fs_test,cols], axis = 1)

c = lc_train['street_address'].apply(addr_parse)
cols = pd.concat([i for i in c]).reset_index(drop=True)
lc_train = pd.concat([lc_train,cols], axis = 1)

c = lc_test['street_address'].apply(addr_parse)
cols = pd.concat([i for i in c]).reset_index(drop=True)
lc_test = pd.concat([lc_test,cols], axis = 1)

600


## Phone number matching
Entities with matching phone numbers almost always refer to the same venue, so we process these first and reduce the size of the testing set.

In [4]:
matches_train = {}
for lc_phone in lc_train_phone_dir:
    if lc_phone in fs_train_phone_dir:
        lc_id = lc_train_phone_dir[lc_phone]
        fs_id = fs_train_phone_dir[lc_phone]
        matches_train[lc_id] = fs_id

matches_test = {}
for lc_phone in lc_test_phone_dir:
    if lc_phone in fs_test_phone_dir:
        lc_id = lc_test_phone_dir[lc_phone]
        fs_id = fs_test_phone_dir[lc_phone]
        matches_test[lc_id] = fs_id

Here we confirm that, on the training set, nearly 100% of the pairs found by phone number matching alone are correct:

In [5]:
how_true = []
for lc_id, fs_id in matches_train.iteritems():
    fs_match_id = matches[matches['locu_id'] == lc_id]['foursquare_id']
    if len(fs_match_id) > 0:
        how_true.append(fs_match_id.iloc[0] == fs_id)
print(sum(how_true) / float(len(matches_train)))

0.991189427313


In [7]:
# Data sets with phone matches removed
lc_train_not_matching = [not x for x in lc_train['id'].isin(matches_train.keys())]
fs_train_not_matching = [not x for x in fs_train['id'].isin(matches_train.values())]
lc_test_not_matching = [not x for x in lc_test['id'].isin(matches_test.keys())]
fs_test_not_matching = [not x for x in fs_test['id'].isin(matches_test.values())]

lc_train = lc_train[lc_train_not_matching]
fs_train = fs_train[fs_train_not_matching]
lc_test = lc_test[lc_test_not_matching]
fs_test = fs_test[fs_test_not_matching]

## Name matching
Entities whose normalized names match uniquely on both sides almost always refer to the same venue, so we also process these in advance to further reduce the size of the testing set.

In [8]:
data_list = {'fs_train': fs_train,
             'fs_test': fs_test,
             'lc_train': lc_train,
             'lc_test': lc_test }

fs_train_name_dir, fs_test_name_dir = {}, {}
lc_train_name_dir, lc_test_name_dir = {}, {}
name_dir = {'fs_train': fs_train_name_dir,
            'fs_test': fs_test_name_dir,
            'lc_train': lc_train_name_dir,
            'lc_test': lc_test_name_dir }

for df_name, df in data_list.iteritems():
    # Make a name directory
    current_name_dir = name_dir[df_name]
    for i, row in df.iterrows():
        if row['name'] != None:
            if row['name'] in current_name_dir.keys(): 
                current_name_dir[row['name']] = current_name_dir[row['name']] + [row['id']]
            else:
                current_name_dir[row['name']] = [row['id']]


added = 0
for lc_name in lc_train_name_dir.keys():
    if lc_name in fs_train_name_dir.keys():
        lc_ids = lc_train_name_dir[lc_name]
        fs_ids = fs_train_name_dir[lc_name]
        if (len(lc_ids) == 1) and (len(fs_ids) == 1):
            matches_train[lc_ids[0]] = fs_ids[0]
            added = added + 1

print len(matches_train)
            
added = 0            
for lc_name in lc_test_name_dir.keys():
    if lc_name in fs_test_name_dir.keys():
        lc_ids = lc_test_name_dir[lc_name]
        fs_ids = fs_test_name_dir[lc_name]
        if (len(lc_ids) == 1) and (len(fs_ids) == 1):
            matches_test[lc_ids[0]] = fs_ids[0]
            added = added + 1
            
print len(matches_test)

333
226
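
The uniqueness condition in the cell above is what keeps franchises out of the rule-based matches. A toy sketch with made-up ids shows the effect: a name that occurs more than once on either side is skipped and left for the classifier. The function name and the `(id, name)` tuple layout are assumptions for illustration.

```python
from collections import defaultdict

def unique_name_matches(lc_rows, fs_rows):
    """Pair records whose normalized name occurs exactly once on each side."""
    def index_by_name(rows):
        index = defaultdict(list)
        for row_id, name in rows:
            if name is not None:
                index[name].append(row_id)
        return index

    lc_index, fs_index = index_by_name(lc_rows), index_by_name(fs_rows)
    matches = {}
    for name, lc_ids in lc_index.items():
        fs_ids = fs_index.get(name, [])
        if len(lc_ids) == 1 and len(fs_ids) == 1:
            matches[lc_ids[0]] = fs_ids[0]
    return matches

# Two Locu rows share the name "starbucks", so that name is ambiguous.
lc_rows = [("lc1", "starbucks"), ("lc2", "starbucks"), ("lc3", "yorkgrill")]
fs_rows = [("fs1", "starbucks"), ("fs2", "yorkgrill")]
print(unique_name_matches(lc_rows, fs_rows))  # {'lc3': 'fs2'}
```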


Here we confirm that, on the training set, over 99% of the pairs found by phone number and name matching combined are correct:

In [9]:
how_true = []
for lc_id, fs_id in matches_train.iteritems():
    fs_match_id = matches[matches['locu_id'] == lc_id]['foursquare_id']
    if len(fs_match_id) > 0:
        how_true.append(fs_match_id.iloc[0] == fs_id)
print(sum(how_true) / float(len(matches_train)))


0.993993993994


In [10]:
# Data sets with phone and name matches removed
lc_train_not_matching = [not x for x in lc_train['id'].isin(matches_train.keys())]
fs_train_not_matching = [not x for x in fs_train['id'].isin(matches_train.values())]
lc_test_not_matching = [not x for x in lc_test['id'].isin(matches_test.keys())]
fs_test_not_matching = [not x for x in fs_test['id'].isin(matches_test.values())]

lc_train = lc_train[lc_train_not_matching]
fs_train = fs_train[fs_train_not_matching]
lc_test = lc_test[lc_test_not_matching]
fs_test = fs_test[fs_test_not_matching]

## Create Row Combinations for Machine Learning
Append a prefix to identify the columns when concatenated:

In [11]:
for df in [fs_train, fs_test]:
    df.columns = ['fs_' + str(i) for i in list(df.columns)]
for df in [lc_train, lc_test]:
    df.columns = ['lc_' + str(i) for i in list(df.columns)]

Each LC row is repeated once per FS row, and the whole FS table is tiled once per LC row. Concatenating the two side by side creates the combo data frame, with one row per (LC, FS) pair.

In [12]:
# Repeat each LC row len(FS) times and tile the FS table len(LC) times,
# so the result is a full cross product even if the two tables differ in size
train_left =  lc_train.loc[np.repeat(lc_train.index.values, len(fs_train))].reset_index(drop=True)
train_right =  pd.concat([fs_train]*len(lc_train), ignore_index=True)
train = pd.concat([train_left, train_right], axis=1)

test_left =  lc_test.loc[np.repeat(lc_test.index.values, len(fs_test))].reset_index(drop=True)
test_right =  pd.concat([fs_test]*len(lc_test), ignore_index=True)
test = pd.concat([test_left, test_right], axis=1)
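
The repeat-and-tile construction is a Cartesian product written out by hand. A toy sketch with plain lists (the ids are made up) shows the row order it produces. When the two tables differ in size, the repeat count for the left side must be the size of the right table, and the tile count for the right side must be the size of the left table.

```python
from itertools import product

lc_ids = ["lc1", "lc2"]
fs_ids = ["fs1", "fs2", "fs3"]

# Left column: each Locu id repeated once per Foursquare row.
left = [lc for lc in lc_ids for _ in fs_ids]
# Right column: the whole Foursquare list tiled once per Locu row.
right = fs_ids * len(lc_ids)

pairs = list(zip(left, right))
# Same pairs, in the same order, as the Cartesian product.
assert pairs == list(product(lc_ids, fs_ids))
print(pairs)
```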

### Add match status

In [13]:
# Match dictionary
match_dict = {}
for i, row in matches.iterrows():
    match_dict[row['locu_id']] = row['foursquare_id']

In [14]:
match_column = []
for i, row in train.iterrows():
    lc_id = row['lc_id']
    fs_id = row['fs_id']
    if (lc_id in match_dict) and (match_dict[lc_id] == fs_id):
        match_column.append(1)
    else:
        match_column.append(0)
match_column = np.array(match_column)

### Calculate various distances

In [15]:
data_list = [train, test]

for d_i, df in enumerate(data_list):
    print("#####\nStarting iteration #{}".format(d_i))
    
    print('Processing distances...')
    distance = []
    for i, row in df.iterrows():
        lc_loc = (row['lc_latitude'], row['lc_longitude'])
        fs_loc = (row['fs_latitude'], row['fs_longitude'])
        distance.append(find_distance(lc_loc, fs_loc))

    print('Processing names...')
    name_dist = []
    for i, row in df.iterrows():
        lc_name = row['lc_name']
        fs_name = row['fs_name']
        name_dist.append(string_similarity(lc_name, fs_name))

    print('Processing ZIP codes...')
    zip_dist = []
    zip_missing = []
    for i, row in df.iterrows():
        lc_zip = row['lc_postal_code']
        fs_zip = row['fs_postal_code']
        if lc_zip and fs_zip:
            zip_dist.append(string_similarity(lc_zip, fs_zip))
            zip_missing.append(0)
        else:
            zip_dist.append(np.nan)
            zip_missing.append(1)

    print('Processing phone numbers...')
    phone_dist = []
    phone_missing = []
    for i, row in df.iterrows():
        lc_phone = row['lc_phone']
        fs_phone = row['fs_phone']
        if lc_phone and fs_phone:
            phone_dist.append(string_similarity(lc_phone, fs_phone))
            phone_missing.append(0)
        else:
            phone_dist.append(np.nan)
            phone_missing.append(1)

    print('Processing URLs...')
    url_dist = []
    url_missing = []
    for i, row in df.iterrows():
        lc_url = row['lc_website']
        fs_url = row['fs_website']
        if lc_url and fs_url:
            url_dist.append(string_similarity(lc_url, fs_url))
            url_missing.append(0)
        else:
            url_dist.append(np.nan)
            url_missing.append(1)
            
    print('Processing street addresses...')
    house_sim, house_missing = [], []
    street_name_sim, street_name_missing = [], []
    street_type_sim, street_type_missing = [], []
    suite_num_sim, suite_num_missing = [], []
    suite_type_sim, suite_type_missing = [], []
    for i, row in df.iterrows():
        lc_house = row['lc_house']
        fs_house = row['fs_house']
        
        lc_street_name = row['lc_street_name']
        fs_street_name = row['fs_street_name']
        
        lc_street_type = row['lc_street_type']
        fs_street_type = row['fs_street_type']
        
        lc_suite_num = row['lc_suite_num']
        fs_suite_num = row['fs_suite_num']
        
        lc_suite_type = row['lc_suite_type']
        fs_suite_type = row['fs_suite_type']
        
        if lc_house and fs_house:
            house_sim.append(string_similarity(lc_house, fs_house))
            house_missing.append(0)
        else:
            house_sim.append(np.nan)
            house_missing.append(1)
        
        if lc_street_name and fs_street_name:
            street_name_sim.append(string_similarity(lc_street_name, fs_street_name))
            street_name_missing.append(0)
        else:
            street_name_sim.append(np.nan)
            street_name_missing.append(1)
            
        if lc_street_type and fs_street_type:
            street_type_sim.append(string_similarity(lc_street_type, fs_street_type))
            street_type_missing.append(0)
        else:
            street_type_sim.append(np.nan)
            street_type_missing.append(1)
        
        if lc_suite_num and fs_suite_num:
            suite_num_sim.append(string_similarity(lc_suite_num, fs_suite_num))
            suite_num_missing.append(0)
        else:
            suite_num_sim.append(np.nan)
            suite_num_missing.append(1)
        
        if lc_suite_type and fs_suite_type:
            suite_type_sim.append(string_similarity(lc_suite_type, fs_suite_type))
            suite_type_missing.append(0)
        else:
            suite_type_sim.append(np.nan)
            suite_type_missing.append(1)
            
    print('Processing LCS...')
    lcs = []
    for i, row in df.iterrows():
        lc_name = row['lc_name']
        fs_name = row['fs_name']
        lcs.append(calc_lcs(lc_name, fs_name))
    
    d = {'distance': distance,
         'name_sim': name_dist,
         'zip_sim': zip_dist,  'zip_missing': zip_missing,
         'phone_sim': phone_dist, 'phone_missing': phone_missing,
         'url_sim': url_dist, 'url_missing': url_missing,
         'house_sim': house_sim, 'house_missing': house_missing,
         'street_name_sim': street_name_sim, 'street_name_missing': street_name_missing,
         'street_type_sim': street_type_sim, 'street_type_missing': street_type_missing,
         'suite_num_sim': suite_num_sim, 'suite_num_missing': suite_num_missing,
         'suite_type_sim': suite_type_sim, 'suite_type_missing': suite_type_missing,
         'lcs': lcs }
    
    if d_i == 0:
        train_data = pd.DataFrame(d).fillna(0)
    else:
        test_data = pd.DataFrame(d).fillna(0)

print("#####\nProcessing Finished!")

#####
Starting iteration #0
Processing distances...
Processing names...
Processing ZIP codes...
Processing phone numbers...
Processing URLs...
Processing street addresses...
Processing LCS...
#####
Starting iteration #1
Processing distances...
Processing names...
Processing ZIP codes...
Processing phone numbers...
Processing URLs...
Processing street addresses...
Processing LCS...
#####
Processing Finished!


### Impute missing values (replace with mean value)

In [16]:
# Disabled: missing similarities are already filled with 0 via fillna(0) above
# from sklearn.preprocessing import Imputer

# ip = Imputer(missing_values = 'NaN')
# ip.fit(pd.concat([train_data, test_data], axis=0))

# train_data = pd.DataFrame(ip.fit_transform(train_data))
# test_data = pd.DataFrame(ip.fit_transform(test_data))

## Model Training
* To-do: Hyperparameter Tuning

### Classifiers

In [17]:
from sklearn.cross_validation import StratifiedKFold
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
#from xgboost import XGBClassifier

def cv_run_ada(train_data, train_labels, test_data, test_labels):
    model = AdaBoostClassifier(random_state=1).fit(train_data, train_labels)
    return model.predict(test_data)

def cv_run_bag(train_data, train_labels, test_data, test_labels):
    model = BaggingClassifier(max_features=1.0, random_state=1).fit(train_data, train_labels)
    return model.predict(test_data)

def cv_run_et(train_data, train_labels, test_data, test_labels):
    model = ExtraTreesClassifier(n_estimators=100, max_features=None, random_state=1).fit(train_data, train_labels)
    return model.predict(test_data)

def cv_run_rf(train_data, train_labels, test_data, test_labels):
    model = RandomForestClassifier(n_estimators=100, max_features=None, n_jobs=-1, random_state=1).fit(train_data, train_labels)
    return model.predict(test_data)

def cv_run_dt(train_data, train_labels, test_data, test_labels):
    model = DecisionTreeClassifier(max_features=None, random_state=1).fit(train_data, train_labels)
    return model.predict(test_data)

#def cv_run_xg(train_data, train_labels, test_data, test_labels):
#    model = XGBClassifier().fit(train_data, train_labels)
#    return model.predict(test_data)

### Cross-validation

In [18]:
skf = StratifiedKFold(match_column, n_folds=5, random_state=1, shuffle=True)

overall_corr = 0
wrong_indices = []
for train_index, test_index in skf:
    cv_train_data = train_data.loc[train_index]
    cv_train_labels = match_column[train_index]
    cv_test_data = train_data.loc[test_index]
    cv_test_labels = match_column[test_index]
    
    preds = cv_run_rf(cv_train_data, cv_train_labels, cv_test_data, cv_test_labels)

    fold_corr = sum(preds[cv_test_labels == 1])
    overall_corr += fold_corr
    
    # Collect wrong indices to check
    wrong_ix = [not x for x in preds[cv_test_labels == 1]]
    wrong_indices += list(cv_test_data[cv_test_labels == 1][wrong_ix].index)
        
    fold_acc = fold_corr / float(sum(cv_test_labels))
    print(fold_acc)
    
print("Overall Recall: {}".format(float(overall_corr) / sum(match_column)))

0.5
0.5
0.4
1.0
1.0
Overall Recall: 0.666666666667


In [43]:
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from scipy.stats import randint as sp_randint
import random

params = {'n_estimators': sp_randint(10,200),
         'learning_rate': [random.uniform(0.1, 1.0) for i in range(1,20)],
         'random_state': [1]}

classifier = AdaBoostClassifier()
cv_classifier = RandomizedSearchCV(classifier, param_distributions=params, n_iter=20, cv=5, verbose=3)
model = cv_classifier.fit(train_data, match_column)

Fitting 5 folds for each of 20 candidates, totalling 100 fits
[CV] n_estimators=33, learning_rate=0.473577366272, random_state=1 ...
[CV]  n_estimators=33, learning_rate=0.473577366272, random_state=1, score=0.999719 -   1.3s
[CV] n_estimators=33, learning_rate=0.473577366272, random_state=1 ...
[CV]  n_estimators=33, learning_rate=0.473577366272, random_state=1, score=0.999719 -   1.3s
[CV] n_estimators=33, learning_rate=0.473577366272, random_state=1 ...
[CV]  n_estimators=33, learning_rate=0.473577366272, random_state=1, score=0.999790 -   1.3s
[CV] n_estimators=33, learning_rate=0.473577366272, random_state=1 ...
[CV]  n_estimators=33, learning_rate=0.473577366272, random_state=1, score=0.999860 -   1.3s
[CV] n_estimators=33, learning_rate=0.473577366272, random_state=1 ...
[CV]  n_estimators=33, learning_rate=0.473577366272, random_state=1, score=0.999860 -   1.3s
[CV] n_estimators=163, learning_rate=0.893313110297, random_state=1 ..
[CV]  n_estimators=163, learning_rate=0.8933131

[Parallel(n_jobs=1)]: Done  31 tasks       | elapsed:  1.3min
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:  6.7min finished





### Misclassified examples

In [48]:
train_data.iloc[wrong_indices, :].transpose()

| feature | 10577 | 38693 | 50111 | 47997 | 54951 | 67663 | 11535 | 29727 | 42480 |
|---|---|---|---|---|---|---|---|---|---|
| distance | 0.006468 | 0.08644 | 0.066754 | 0.000234 | 0.004729 | 0.02508 | 0.008889 | 0.062195 | 0.034435 |
| house_missing | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| house_sim | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| lcs | 17.0 | 4.0 | 3.0 | 7.0 | 8.0 | 5.0 | 4.0 | 6.0 | 9.0 |
| name_sim | 0.878049 | 0.421053 | 0.375 | 0.533333 | 0.5 | 0.47619 | 0.555556 | 0.634146 | 1.0 |
| phone_missing | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| phone_sim | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| street_name_missing | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| street_name_sim | 0.727273 | 0.666667 | 0.0 | 0.0 | 0.166667 | 0.2 | 0.090909 | 0.333333 | 0.0 |
| street_type_missing | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [49]:
train.iloc[wrong_indices, :].transpose()

| field | 10577 | 38693 | 50111 | 47997 | 54951 | 67663 | 11535 | 29727 | 42480 |
|---|---|---|---|---|---|---|---|---|---|
| lc_id | 825acefd3e298274a150 | edeba23f215dcc702220 | e3f9d84c0c989f2e7928 | f7bb0b23ce99cddcd5c3 | 212dffb393f745df801a | 493f5e2798de851ec3b2 | c170270283ef870d546b | 5f3fd107090d0ddc658b | 66ef54d76ff989a91d52 |
| lc_latitude | 40.6438 | 40.7776 | 40.7746 | 40.7223 | 40.7398 | 40.7582 | 40.7662 | 40.714 | 40.762 |
| lc_longitude | -73.782 | -73.9457 | -73.9573 | -73.988 | -73.9896 | -73.9923 | -73.9778 | -73.9969 | -73.9785 |
| lc_name | greenwichvillagebistro | yorkgrill | lukes | karaokebohoorchard | brioflatiron | pickabagel | exhalespa | tsungsunsocialclub | starbucks |
| lc_phone | 7187512890 | 2127720291 | 2122497070 | 2127770102 | 2126732121 | 2127928008 | 2125617400 | 2122269414 | 2122658610 |
| lc_postal_code | 11430 | 10128 | 10075 | 10002 | 10003 | 10036 | 10019 | 10002 | 10105 |
| lc_street_address | John F Kennedy International Airport | 1690 York Ave | 1394 3rd Ave | 196 Orchard St | 920 Broadway | 360 W 42nd St | 150 Central Park South | 11 Division St | 1345 6th Ave |
| lc_website | none | yorkgrillnyc | lukesbarandgrill | karaokeboho | brioflatiron | pickabagel42ndstreetnyc | exhalespa | none | starbucks |
| lc_house |  | 1690 | 1394 | 196 | 920 | 360 | 150 | 11 | 1345 |
| lc_street_name | John F Kennedy International Airport | York | 3rd | Orchard | Broadway | W 42nd | Central Park South | Division | 6th |


## Prediction preparation
### Model training and prediction

In [50]:
#model = RandomForestClassifier(n_estimators=100, max_features=None, n_jobs=-1, random_state=1).fit(train_data, match_column)
labels = model.predict(test_data)

### Combine with the phone matching set then export

In [51]:
# Build and export the file
# `labels` holds the classifier predictions computed in the previous cell
lc_col = test['lc_id'][labels.astype(bool)]
fs_col = test['fs_id'][labels.astype(bool)]

for lc_id, fs_id in matches_test.iteritems():
    lc_col = lc_col.append(pd.Series(lc_id))
    fs_col = fs_col.append(pd.Series(fs_id))

output = pd.concat([lc_col, fs_col], axis=1)
output.columns = ['locu_id', 'foursquare_id']

#with open('20160417.csv', 'w') as f:
#    output.to_csv(f, index=False, columns = ['locu_id', 'foursquare_id'])
    
# Instabase version
username = "diegoll2k"
repo = "entity-resolution-csds"
with ib.open('/{0}/{1}/fs/Instabase%20Drive/files/matches_4.csv'.format(username,repo)) as f:
  output.to_csv(f, index=False, columns = ['locu_id', 'foursquare_id'])