# Company Name Matching
### Approach
1. Preprocess to normalize names: strip accents, lowercase, trim whitespace, and remove special characters.
2. Build TF-IDF name vectors.
3. Index using [Annoy](https://github.com/spotify/annoy).
4. Evaluate using accuracy on records whose `company_id` is not -1.



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import make_pipeline
from sklearn import svm, linear_model
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import roc_curve, auc,f1_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from collections import defaultdict
import unicodedata
import random
random.seed(37)
from annoy import AnnoyIndex
import nltk
import re

In [2]:
pd.set_option('display.max_colwidth', -1)

In [3]:
data_path = "data/"
file_path = "files/"

In [4]:
companies = pd.read_csv(data_path+"G.csv", delimiter="|")
train = pd.read_csv(data_path+"STrain.csv", delimiter="|")

In [5]:
companies.head()

Unnamed: 0,company_id,name
0,634022,PRIMCOM SA
1,324497,The David Isaacs Fund
2,280848,Bramor Enterprises Limited
3,432662,NAVEXIM S.A.
4,524224,Magal Group SA


In [6]:
companies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450256 entries, 0 to 450255
Data columns (total 2 columns):
company_id    450256 non-null int64
name          450256 non-null object
dtypes: int64(1), object(1)
memory usage: 6.9+ MB


In [7]:
train.head()

Unnamed: 0,train_index,name,company_id
0,0,TRATTAMENTO Ltd RIFIUTI METROPOLITANI SPA SIGLABILE TRM SPA,177358
1,1,A IRL Fuund,568472
2,2,BMR-500 Kendall LLC 1 Mezz GmbH,195692
3,3,Solich GmbH KG,-1
4,4,Drzyzzga Funds Logi. sp. z oo,404178


In [8]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
train_index    100000 non-null int64
name           100000 non-null object
company_id     100000 non-null int64
dtypes: int64(2), object(1)
memory usage: 2.3+ MB


In [9]:
test = pd.read_csv(data_path+"STest.csv", delimiter=",")
test.head()

Unnamed: 0,test_index,name
0,0,THEking'S ROYAL HUSSARS OFFI. TRUST' TRUST
1,1,Southern Powe rcompany SICAV
2,2,BMO S&P/TSX Ladde. Share ETF Index
3,3,PaI
4,4,Clearview Two


In [10]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 2 columns):
test_index    100000 non-null int64
name          99999 non-null object
dtypes: int64(1), object(1)
memory usage: 1.5+ MB


In [11]:
print train.columns
print test.columns

Index([u'train_index', u'name', u'company_id'], dtype='object')
Index([u'test_index', u'name'], dtype='object')


In [12]:
print train.shape
print train["name"].nunique()

(100000, 3)
99365


In [13]:
print test.shape
print test["name"].nunique()

(100000, 2)
99323


In [14]:
print len(set(test["name"]) & set(train["name"]))

1075


### Sample Data
Here I use TF-IDF vectors, which are sparse and very high-dimensional (~15,000 dimensions). Although the vectors are stored as SciPy sparse matrices, I could not get the sparse format to work with Annoy, so I built the index from dense vectors. This made the index too large to manage, so I decided to work with only a sample of the data.

1. Sample 50,000 companies (roughly 1/9 of the original 450,256) from the company set.
2. To build the train sample:
    * select all records whose company ID is present in the companies sample
    * take a 1/9 fraction of the -1 records
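
To make the size problem concrete, here is a rough back-of-the-envelope estimate of the raw storage the dense vectors need. It assumes 4 bytes per value (Annoy stores 32-bit floats, as far as I know) and uses the approximate dimension of 15,000 quoted above. The helper name `dense_bytes` is only for illustration, and the overhead of the index trees is ignored.

```python
def dense_bytes(n_items, n_dims, bytes_per_value=4):
    """Approximate bytes needed to store n_items dense vectors of n_dims values."""
    return n_items * n_dims * bytes_per_value

# Assumption: ~15,000 dimensions, 4 bytes per value.
full = dense_bytes(450256, 15000)    # every company in G.csv
sample = dense_bytes(50000, 15000)   # the 50,000-company sample

print("full set: %.1f GB" % (full / 1e9))
print("sample:   %.1f GB" % (sample / 1e9))
```

Under these assumptions the full company set needs roughly 27 GB of vector data, against about 3 GB for the sample.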

In [15]:
companies = companies.sample(50000, random_state=37)

In [16]:
companies = companies.reset_index(drop=True)
companies.head()

Unnamed: 0,company_id,name
0,619393,Metzler Vermögensverwaltungsfonds 70
1,587104,Hakvoort Professional B.V.
2,482637,RUDOLF WOLFF SYSTEMATIC FUND LIMITED
3,301258,Western Alliance Bancorporation
4,499801,Ivar Fors & Co AB


In [17]:
train_0 = train.loc[train["company_id"]==-1]
print train_0.shape
train_1 = train.loc[train["company_id"]>=0]

print train_1.shape

(30256, 3)
(69744, 3)


In [18]:
train = train_1.loc[train_1["company_id"].isin(companies["company_id"])]
print train.shape

(7708, 3)


In [19]:
train0_sample = train_0.sample(int((1/9.0)*len(train_0)), random_state=37)
print len(train0_sample)

3361


In [20]:
train = pd.concat([train,train0_sample])

In [21]:
train.shape

(11069, 3)

In [22]:
train = train.reset_index(drop=True)
train.head()

Unnamed: 0,train_index,name,company_id
0,5,BERK ELEY LLP,438615
1,9,RBPA Leeuw,526042
2,19,Waddell & Reed Advisors Funds - Waddell & Reed Advisors Small Cap Fund,281405
3,22,ESCO TORE S.R.L.,79550
4,72,Russell Total Return Fund.) Ltd.,351412


### Pre-process data

In [23]:
def remove_accents(df,**kw_args):
    old_col =  kw_args["old_col"]
    new_col =  kw_args["new_col"]
    def remove_accents_inner(input_str):
        nfkd_form = unicodedata.normalize('NFKD', unicode(input_str, 'utf8'))
        return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])
    
    df[new_col] = df[old_col].apply(remove_accents_inner)
    return df

def make_accent_transformer(old_col, new_col):
    return FunctionTransformer(remove_accents, validate=False,
                                         kw_args={"old_col":old_col,"new_col":new_col})

In [24]:
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /home/akash/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [25]:
REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]|@,;\-\\\+#~!$%^]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z ]')
STOPWORDS = set(stopwords.words('english'))

In [26]:
s = "for from india and add+g-ddc.b.g,c\g-hh ak|ash#sin/gh dee$pt^a su%d"
s=re.sub(REPLACE_BY_SPACE_RE," ",s)
s= re.sub(BAD_SYMBOLS_RE,"",s)
s= ' '.join(word for word in s.split() if word not in STOPWORDS)
print s

india add g ddcbg c g hh ak ash sin gh dee pt su


In [27]:
def clean_name(df,**kw_args):
    old_col =  kw_args["old_col"]
    new_col =  kw_args["new_col"]
    
    def regex_clean(text):
        text = re.sub(REPLACE_BY_SPACE_RE, " ", text)
        text = re.sub(BAD_SYMBOLS_RE, "", text)
        #text= ' '.join(word for word in text.split() if word not in STOPWORDS)
        return text
    
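    df[new_col] = df[old_col].str.lower().str.strip()
    # Clean the lowercased column; reading old_col again would discard the lowercasing when old_col != new_col
    df[new_col] = df[new_col].apply(regex_clean)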
    
    return df

def make_clean_name_transformer(old_col, new_col):
    return FunctionTransformer(clean_name, validate=False, kw_args={"old_col":old_col,"new_col":new_col})
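
The two transformers above can also be read as a single string-to-string function. The following is a minimal sketch of that combined normalization, written for Python 3 (where `str` is already Unicode) rather than as a pandas transformer. The function name `normalize_name` is hypothetical, and the final whitespace collapsing is an extra step that the notebook code does not perform.

```python
import re
import unicodedata

REPLACE_BY_SPACE_RE = re.compile(r"[/(){}\[\]|@,;\-\\+#~!$%^]")
BAD_SYMBOLS_RE = re.compile(r"[^0-9a-z ]")

def normalize_name(name):
    """Strip accents, lowercase, and remove special characters from one name."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    lowered = stripped.lower().strip()
    # Separator-like symbols become spaces; everything else non-alphanumeric is dropped.
    spaced = REPLACE_BY_SPACE_RE.sub(" ", lowered)
    cleaned = BAD_SYMBOLS_RE.sub("", spaced)
    # Assumption: collapse runs of spaces (not done in the notebook code).
    return " ".join(cleaned.split())

print(normalize_name("Bauschütz GmbH & Co. KG"))
print(normalize_name("A.J. BRUNT LIMITED"))
```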

#### Make name processing pipeline

In [28]:
name_pipeline = make_pipeline(make_accent_transformer("name", "clean_name"),
                                    make_clean_name_transformer("clean_name", "clean_name"))


In [29]:
companies = name_pipeline.transform(companies)
companies.sample(10)

Unnamed: 0,company_id,name,clean_name
37157,78742,PowerShares S&P 500 High Dividend Low Volatility Index ETF,powershares sp 500 high dividend low volatility index etf
15724,571443,Bauschutz GmbH & Co. KG,bauschutz gmbh co kg
32459,50788,DAJIMA Holding B.V.,dajima holding bv
23109,435011,A.J. BRUNT LIMITED,aj brunt limited
37703,244562,FLEX.DOCCIA S.R.L.,flexdoccia srl
33840,254844,Pacific Funds Series Trust - PF Mid-Cap Value Fund,pacific funds series trust pf mid cap value fund
4904,400061,Apotheke zur Post Inh. Anette Penz e.K,apotheke zur post inh anette penz ek
26736,59118,Clipper Oil,clipper oil
31961,156980,T. Rowe Price Funds SICAV - Global Focused Growth Equity Fund,t rowe price funds sicav global focused growth equity fund
24117,472930,Metropolitan Tower Life Insurance Company,metropolitan tower life insurance company


In [30]:
print companies.shape
print companies["clean_name"].nunique()

(50000, 3)
49963


#### Pre-process train

In [31]:
train.head()

Unnamed: 0,train_index,name,company_id
0,5,BERK ELEY LLP,438615
1,9,RBPA Leeuw,526042
2,19,Waddell & Reed Advisors Funds - Waddell & Reed Advisors Small Cap Fund,281405
3,22,ESCO TORE S.R.L.,79550
4,72,Russell Total Return Fund.) Ltd.,351412


In [32]:
print train.shape

(11069, 3)


In [33]:
train = name_pipeline.transform(train)

In [34]:
train.head()

Unnamed: 0,train_index,name,company_id,clean_name
0,5,BERK ELEY LLP,438615,berk eley llp
1,9,RBPA Leeuw,526042,rbpa leeuw
2,19,Waddell & Reed Advisors Funds - Waddell & Reed Advisors Small Cap Fund,281405,waddell reed advisors funds waddell reed advisors small cap fund
3,22,ESCO TORE S.R.L.,79550,esco tore srl
4,72,Russell Total Return Fund.) Ltd.,351412,russell total return fund ltd


#### Check top occurring words

In [35]:
from collections import Counter
count_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)
# X is a sparse document-term matrix: one row per name, one column per vocabulary token, values are token counts
X = count_vectorizer.fit_transform(companies["clean_name"])

# Vocabulary
vocab = list(count_vectorizer.get_feature_names())

# Column-wise sum of X gives the total count of each token.
# .A1 flattens the 1 x n matrix returned by sum() into a 1-D array.
# For details, see http://stackoverflow.com/questions/3337301/numpy-matrix-to-array
# and http://stackoverflow.com/questions/13567345/how-to-calculate-the-sum-of-all-columns-of-a-2d-numpy-array-efficiently
counts = X.sum(axis=0).A1

freq_distribution = Counter(dict(zip(vocab, counts)))

In [36]:
print (freq_distribution.most_common(100))

[(u'llc', 4877), (u'fund', 4281), (u'limited', 3836), (u'gmbh', 3293), (u'inc', 2717), (u'bv', 2320), (u'srl', 2307), (u'trust', 2225), (u'ltd', 1795), (u'sa', 1675), (u'co', 1445), (u'lp', 1338), (u'global', 1237), (u'de', 1146), (u'kg', 1135), (u'holding', 1124), (u'the', 1077), (u'company', 1071), (u'funds', 1068), (u'ab', 1014), (u'investment', 894), (u'sl', 831), (u'capital', 803), (u'international', 782), (u'of', 758), (u'equity', 755), (u'spa', 753), (u'di', 749), (u'holdings', 742), (u'as', 692), (u'ii', 668), (u'group', 661), (u'sicav', 660), (u'management', 602), (u'investments', 600), (u'portfolio', 564), (u'aps', 562), (u'partners', 543), (u'bond', 521), (u'societa', 521), (u'invest', 490), (u'sro', 482), (u'and', 478), (u'oy', 461), (u'bank', 446), (u'master', 435), (u'corporation', 434), (u'nv', 419), (u'services', 417), (u'plc', 412), (u'income', 411), (u'ag', 383), (u'fonds', 381), (u'beheer', 380), (u'us', 377), (u'spoka', 375), (u'settlement', 368), (u'mbh', 361), (u'

### Use TfidfVectorizer to generate company name vectors

In [37]:
# token_pattern only applies when analyzer="word", so it is omitted for the character analyzer
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, ngram_range=(2, 3), analyzer="char")

In [38]:
company_vectors = tfidf_vectorizer.fit_transform(companies["clean_name"])

In [39]:
type(company_vectors)

scipy.sparse.csr.csr_matrix

In [40]:
company_vectors.shape

(50000, 15375)
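
To show why character n-grams tolerate typos, here is a small pure-Python sketch of character 2- and 3-gram TF-IDF with cosine similarity. It is an illustration only, not scikit-learn's implementation: it ignores `max_df`, the helper names are hypothetical, and the IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1`, which I believe matches scikit-learn's default.

```python
import math
from collections import Counter

def char_ngrams(text, n_min=2, n_max=3):
    """All character n-grams of length n_min..n_max, spaces included."""
    return [text[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

def build_idf(names):
    """Smoothed inverse document frequency for every n-gram seen in names."""
    doc_freq = Counter()
    for name in names:
        doc_freq.update(set(char_ngrams(name)))
    n_docs = len(names)
    return {g: math.log((1.0 + n_docs) / (1.0 + df)) + 1.0
            for g, df in doc_freq.items()}

def tfidf_vector(name, idf):
    """L2-normalized sparse vector (dict); n-grams unseen at fit time are dropped."""
    counts = Counter(char_ngrams(name))
    weights = {g: c * idf[g] for g, c in counts.items() if g in idf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {g: w / norm for g, w in weights.items()} if norm else {}

def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())

names = ["esco torre srl", "berkeley partners llp", "cornel bv"]
idf = build_idf(names)
query = tfidf_vector("esco tore srl", idf)   # misspelled name from the train set
scores = [cosine(query, tfidf_vector(n, idf)) for n in names]
best = names[scores.index(max(scores))]
print(best)
```

Because the misspelled query still shares most of its n-grams with the correct name, that name scores highest among the three.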

In [41]:
companies.tail()

Unnamed: 0,company_id,name,clean_name
49995,243419,State Street Bank and Trust Company World Index Common Trust Funds - State Street 1-3 Year U.S. Agency Index Non-Lending Common Trust Fund,state street bank and trust company world index common trust funds state street 1 3 year us agency index non lending common trust fund
49996,272569,Autobedrijf Vander Stichele NV,autobedrijf vander stichele nv
49997,17447,INVESTEC FUNDS SERIES II - AMERICAN FUND,investec funds series ii american fund
49998,431383,GLG Technology Fund,glg technology fund
49999,240428,MERSEYSIDE ESTATES LIMITED,merseyside estates limited


In [42]:
type(company_vectors[0].todense())
print np.squeeze(np.asarray(company_vectors[0].todense())).shape

(15375,)


### Indexing using Annoy.
Annoy performs approximate nearest-neighbour search using random projections. Once built, an index is a static file that can be shared across processes.
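
As a toy illustration of the idea (not Annoy's actual implementation, which builds many trees and searches them together), the sketch below builds one tree by repeatedly splitting the points with a hyperplane placed halfway between two randomly chosen points. A query descends to a single leaf and only compares against the items there, which is what makes the search fast but approximate. All names here are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_tree(points, ids, rng, leaf_size=4):
    """Recursively split item ids with random hyperplanes until leaves are small."""
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    a, b = rng.sample(ids, 2)
    # Hyperplane halfway between two randomly chosen points.
    normal = [x - y for x, y in zip(points[a], points[b])]
    midpoint = [(x + y) / 2.0 for x, y in zip(points[a], points[b])]
    offset = dot(normal, midpoint)
    left = [i for i in ids if dot(normal, points[i]) < offset]
    right = [i for i in ids if dot(normal, points[i]) >= offset]
    if not left or not right:  # degenerate split, e.g. duplicate points
        return {"leaf": ids}
    return {"normal": normal, "offset": offset,
            "left": build_tree(points, left, rng, leaf_size),
            "right": build_tree(points, right, rng, leaf_size)}

def query(tree, points, vector):
    """Descend to one leaf and return the closest id found there."""
    node = tree
    while "leaf" not in node:
        side = "left" if dot(node["normal"], vector) < node["offset"] else "right"
        node = node[side]
    return min(node["leaf"],
               key=lambda i: sum((p - q) ** 2 for p, q in zip(points[i], vector)))

rng = random.Random(37)
points = [[rng.random() for _ in range(8)] for _ in range(200)]
tree = build_tree(points, list(range(len(points))), rng)
print(query(tree, points, points[17]))
```

An indexed vector always finds itself, because it follows the same splits it took at build time. A new vector may land in a leaf that does not contain its true nearest neighbour, which is why Annoy combines several trees.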

#### Make map of company_ids to Annoy_index_ids

In [43]:
# annoy_ids run from 0 to len(company_ids) - 1.
# annoy2company is a list: the list index is the annoy_id and the value is the corresponding company_id.
# company2annoy is a dict: the key is the company_id and the value is the annoy_id.
def build_company_annoy_maps(company_ids):
    annoy2company = []
    company2annoy = defaultdict(lambda:0)
    for c_id in company_ids:
        if c_id not in company2annoy:
            annoy2company.append(c_id)
            company2annoy[c_id] = len(annoy2company)-1
    return annoy2company, company2annoy  
        

In [44]:
annoy2company, company2annoy = build_company_annoy_maps(companies["company_id"].values)

In [45]:
print len(annoy2company)
print len(company2annoy)

50000
50000


In [46]:
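import random
for i in range(5):
    # randint includes both endpoints, so the upper bound must be the last valid index
    annoy_id = random.randint(0, len(annoy2company) - 1)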
    assert annoy_id ==  company2annoy[annoy2company[annoy_id]]
    print annoy_id, annoy2company[annoy_id], company2annoy[annoy2company[annoy_id]]

34100 482478 34100
4580 73643 4580
30891 185217 30891
42096 185101 42096
41728 188097 41728
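
One caveat about the mapping above: `defaultdict(lambda: 0)` silently returns position 0 for a company ID that was never indexed. The sketch below is a possible variant that uses a plain dict, so an unknown ID is visible as a missing key. The function name `build_id_maps` is hypothetical.

```python
def build_id_maps(company_ids):
    """Return (position -> company_id list, company_id -> position dict)."""
    position_to_company = []
    company_to_position = {}
    for company_id in company_ids:
        if company_id not in company_to_position:
            company_to_position[company_id] = len(position_to_company)
            position_to_company.append(company_id)
    return position_to_company, company_to_position

# IDs taken from the companies.head() output; the duplicate is kept only once.
position_to_company, company_to_position = build_id_maps([634022, 324497, 634022, 280848])
print(position_to_company)
print(company_to_position.get(-1))  # None: -1 was never indexed
```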


#### Build annoy Index
Uncomment the code below to build the index.

In [47]:
# index_size = len(tfidf_vectorizer.vocabulary_)
# table = AnnoyIndex(index_size)

# for i in range(company_vectors.shape[0]):
#     if i%10000==0:
#         print "indexed %s items"%i
#     table.add_item(i,np.squeeze(np.asarray(company_vectors[i].todense())))
    
# table.build(100)
# table.save(file_path+"annoy_index.ann")

### Load index and search

In [48]:
index_size = len(tfidf_vectorizer.vocabulary_)
table = AnnoyIndex(index_size)
table.load(file_path+"annoy_index.ann")

True

#### Find neighbours of train set

##### Create training name dense vectors

In [49]:
train = train.merge(companies, how="left", on="company_id", suffixes=('','_truth'))

In [50]:
def get_nearest_neighbours(vectors, num_neighbours=1, search_nodes=-1, include_distances=True):
    neighbours = np.empty(shape=(len(vectors),num_neighbours), dtype=np.int32)
    distances = np.empty(shape=(len(vectors),num_neighbours))
#     print len(vectors)
#     print neighbours.shape
#     print distances.shape
    for idx,v in enumerate(vectors):
        annoy_ids, annoy_distances = table.get_nns_by_vector(v, n=num_neighbours, search_k=search_nodes, include_distances=include_distances)
        neighbours[idx,:] = annoy_ids
        distances[idx,:] = annoy_distances
    return neighbours, distances

#### Use the single nearest neighbour as the predicted company

In [51]:
%%time
train_vectors = tfidf_vectorizer.transform(train["clean_name"])
train_vectors_dense = np.squeeze(np.asarray(train_vectors.todense()))
neighbours, distances = get_nearest_neighbours(train_vectors_dense)
neighbour_company_ids= np.vectorize(lambda x: annoy2company[x])(neighbours)

CPU times: user 9min 51s, sys: 543 ms, total: 9min 51s
Wall time: 9min 52s


In [52]:
train_vectors.shape

(11069, 15375)

In [53]:
rank_1 = neighbour_company_ids[:,0]
rank_1

array([438615, 526042, 281405, ...,  26071, 305713, 471477])

In [54]:
train["predicted_id"] = rank_1

In [55]:
train= train.merge(companies[['company_id', 'name']], left_on="predicted_id", 
                   right_on="company_id", suffixes=("","_predicted"))

In [56]:
train[["train_index", "name", "company_id", "predicted_id", "name_predicted"]]

Unnamed: 0,train_index,name,company_id,predicted_id,name_predicted
0,5,BERK ELEY LLP,438615,438615,BERKELEY PARTNERS LLP
1,9,RBPA Leeuw,526042,526042,RBPA Leeuw
2,19,Waddell & Reed Advisors Funds - Waddell & Reed Advisors Small Cap Fund,281405,281405,Waddell & Reed Advisors Funds - Waddell & Reed Advisors Small Cap Fund
3,22,ESCO TORE S.R.L.,79550,79550,ESCO TORRE S.R.L.
4,72,Russell Total Return Fund.) Ltd.,351412,351412,Russell Total Return Fund (Quarterly) Ltd.
5,83,Cornel Global B.V.,370746,370746,Cornel B.V.
6,57000,Cornel Inves. B.V.,370746,370746,Cornel B.V.
7,123,Tokio Marine iLfe Insurance Singapore Ltd.,285532,285532,Tokio Marine Life Insurance Singapore Ltd.
8,141,RIjSTO RANTE LO DI DI SPV E C. - S.A.S.,211110,211110,RISTORANTE LO STRETTOIO DI PEDRO DI VITO E C. - S.A.S.
9,22804,RISTORANTE FACEC OOK DI HU C. SNC,-1,211110,RISTORANTE LO STRETTOIO DI PEDRO DI VITO E C. - S.A.S.


In [57]:
total = train.shape[0]
correct_predictions = train.loc[train["company_id"]==train["predicted_id"]].shape[0]
incorrect_predictions = train.loc[train["company_id"]!=train["predicted_id"]].shape[0]
print "Total %d"%total
print "Accuracy | Correct predictions %f"% (float(correct_predictions)/total)
print "Incorrect predictions %f"% (float(incorrect_predictions)/total)

Total 11069
Accuracy | Correct predictions 0.551721
Incorrect predictions 0.448279


In [58]:
train.loc[train["company_id"]==-1].shape

(3361, 9)

#### Accuracy excluding -1 cases

In [59]:
train_sub = train.loc[train["company_id"]!=-1]
print np.sum(train_sub["company_id"] == train_sub["predicted_id"])/float(len(train_sub))

0.7922937208095485
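
The gap between the two accuracy figures follows directly from the -1 records: the nearest-neighbour prediction is always a real company, so every -1 record counts as an error in the overall figure. The sketch below shows both metrics on toy labels and then checks the relationship with the counts printed above. The helper names are hypothetical, and the number of correct predictions is inferred from the printed accuracy rather than read from the notebook.

```python
def accuracy(true_ids, predicted_ids):
    """Fraction of records whose predicted id equals the true id."""
    pairs = list(zip(true_ids, predicted_ids))
    return sum(t == p for t, p in pairs) / float(len(pairs))

def accuracy_excluding(true_ids, predicted_ids, unmatched=-1):
    """Same as accuracy, but records labelled `unmatched` are left out."""
    kept = [(t, p) for t, p in zip(true_ids, predicted_ids) if t != unmatched]
    return sum(t == p for t, p in kept) / float(len(kept))

# Toy labels: predictions are never -1, so both -1 records count as errors.
true_ids = [10, 20, -1, 30, -1]
pred_ids = [10, 99, 10, 30, 20]
overall = accuracy(true_ids, pred_ids)
matched_only = accuracy_excluding(true_ids, pred_ids)

# Same relationship with the counts printed in this notebook.
total, unmatched_count = 11069, 3361
matched = total - unmatched_count
correct = int(round(0.7922937208095485 * matched))  # inferred, not printed above
overall_from_counts = correct / float(total)
print("%d correct of %d matched, overall %.6f" % (correct, matched, overall_from_counts))
```

The overall figure recovered this way agrees with the 0.551721 reported earlier.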


That is all!