All data is provided as plain-text, column-separated files:
* columns are separated by ‘|’ (bar), with one line per entry;
* the test set will have the same size as the STrain set provided to you now (hence, no memory issues during testing).
We provide you up front with three datasets: G, STrain and sample_submission.
* G has as columns:
    * company_id: the id of the company in our ground-truth administration 
    * name: the name of the company in our ground-truth administration
* STrain has as columns:
    * train_index: an index of the company in the external source dataset;
    * name: the name of the company as represented in the external source dataset;
    * company_id: the correct match of this entry in G. It is -1 if the correct label is ‘not in G’; otherwise it equals the company_id of the matching entry in G.
* sample_submission has as columns:
    * test_index: index of the company in an external source dataset (note: this index differs from the one in STrain; the full STest will only be provided during the interview);
    * company_id: the predicted match of this entry to G. Note that in this file, these are randomly generated predictions.

You need to design, code and train a model that predicts company_id, minimizing the cost function specified on the previous page.
Please make sure that your trained model:

* accepts as input a plain text file in the STest format, containing the two columns test_index and name;
* runs from the command line, taking the path to the STest file as input;
* predicts in ‘near real time’, i.e. about 1 minute for 10,000 entries (on a regular laptop);
* returns a file in the plain text format described above, with the columns test_index and company_id (an example submission is provided).
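
The requirements listed above could be met by a small command-line entry point along the following lines. This is only a minimal sketch: `predict_company_ids`, `run`, the `--output` flag and the default file name are hypothetical names chosen for illustration, and the placeholder predictor simply answers -1 (‘not in G’) for every entry.

```python
import argparse

import pandas as pd


def predict_company_ids(names):
    """Hypothetical placeholder for the trained matcher: predicts -1 for every name."""
    return [-1] * len(names)


def run(input_path, output_path):
    # STest is bar-separated and has the columns test_index and name.
    s_test = pd.read_csv(input_path, sep="|")
    submission = pd.DataFrame({
        "test_index": s_test["test_index"],
        "company_id": predict_company_ids(s_test["name"]),
    })
    # Write the result in the same bar-separated format as sample_submission.
    submission.to_csv(output_path, sep="|", index=False)
    return submission


def main(argv=None):
    parser = argparse.ArgumentParser(description="Predict company_id for an STest file.")
    parser.add_argument("stest_path", help="path to the bar-separated STest file")
    parser.add_argument("--output", default="submission.csv", help="where to write predictions")
    args = parser.parse_args(argv)
    run(args.stest_path, args.output)

# When saved as a script, finish the file with:
# if __name__ == "__main__":
#     main()
```

Keeping the file handling in `run` and the matching logic in a separate function means the predictor can be swapped out without touching the command-line handling.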

In [1]:
import csv
import pandas as pd
import numpy as np

In [2]:
groundTruth = pd.read_csv("Datasets/G.csv", sep='|')
print(list(groundTruth.columns.values))
print(groundTruth.head())
print(groundTruth.shape)

['company_id', 'name']
   company_id                        name
0      634022                  PRIMCOM SA
1      324497       The David Isaacs Fund
2      280848  Bramor Enterprises Limited
3      432662                NAVEXIM S.A.
4      524224              Magal Group SA
(450256, 2)


In [3]:
companyCounts = groundTruth['company_id'].value_counts()
print(companyCounts.describe())
topCompanyCounts = companyCounts.nlargest(20)

count    450256.0
mean          1.0
std           0.0
min           1.0
25%           1.0
50%           1.0
75%           1.0
max           1.0
Name: company_id, dtype: float64


In [5]:
sTrain = pd.read_csv("Datasets/STrain.csv", sep='|')
print(list(sTrain.columns.values))
print(sTrain.head())
print(sTrain.shape)

['train_index', 'name', 'company_id']
   train_index                                               name  company_id
0            0                        ATRION Immo bilien & Co. KG          -1
1            1                            MyTyme Inve stments Inc      356624
2            2                                     Financial USI.      510805
3            3  FlexShares Trust - FlexShares Morningstar Emer...      523467
4            4                                    Health Sinai SF      231108
(100000, 3)


In [6]:
sTrainFilter = sTrain[(sTrain.company_id!=-1)]
print(sTrainFilter.shape)
matchingRecords = sTrainFilter.join(
    groundTruth.set_index("company_id"), on="company_id", lsuffix='_train', rsuffix='_GT')

(69652, 3)


In [6]:
print(matchingRecords.shape)
repeatedCounts = matchingRecords['company_id'].value_counts()
#print(repeatedCounts.describe())
matchingRecords.head()

(69652, 4)


| | train_index | name_train | company_id | name_GT |
|---|---|---|---|---|
| 1 | 1 | MyTyme Inve stments Inc | 356624 | MyTyme Investments Inc |
| 2 | 2 | Financial USI. | 510805 | UBS Financial Services Inc. |
| 3 | 3 | FlexShares Trust - FlexShares Morningstar Emer... | 523467 | FlexShares Trust - FlexShares Morningstar Emer... |
| 4 | 4 | Health Sinai SF | 231108 | Sinai Health System Foundation |
| 6 | 6 | LLC TEBS Fund II, ATAX | 277891 | ATAX TEBS II, LLC |


In [9]:
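# Caution: `x in <Series>` tests the Series index labels, not its values, so this
# counts ids that happen to be row labels of groundTruth. For value membership,
# use sTrainFilter['company_id'].isin(groundTruth['company_id']) instead.
test = [x for x in np.array(sTrainFilter['company_id']) if x in groundTruth['company_id']]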
len(test)

48764
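
The membership check above relies on the `in` operator applied to a pandas Series, which looks at index labels rather than values. The following toy example, with made-up candidate values, is a small sketch of the difference and of the value-based alternative `Series.isin`.

```python
import pandas as pd

# Three ids taken from the head of G; the default index is 0, 1, 2.
ids = pd.Series([634022, 324497, 280848], name="company_id")

# `in` on a Series checks the index labels, not the values.
label_hit = 1 in ids          # 1 is a row label
value_miss = 634022 in ids    # 634022 is a value, not a row label

# Value membership: Series.isin returns a boolean mask over the candidates.
candidates = pd.Series([634022, 1, 280848])
mask = candidates.isin(ids)
n_found = int(mask.sum())
```

Here `label_hit` is true, `value_miss` is false, and `n_found` counts the two candidates that really occur among the ids.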

#### TF-IDF with N-grams

In [13]:
import re

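def ngrams(string, n=3):
    # Drop commas, hyphens, dots and slashes, plus any whitespace followed by "BD".
    # The hyphen is escaped so it is not read as a character range.
    string = re.sub(r'[,\-./]|\sBD', r'', string)
    # Slide a window of length n over the cleaned string.
    grams = zip(*[string[i:] for i in range(n)])
    return [''.join(gram) for gram in grams]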

print('All 3-grams in "McDonalds":')
ngrams('McDonalds')

All 3-grams in "McDonalds":


['McD', 'cDo', 'Don', 'ona', 'nal', 'ald', 'lds']

In [14]:
# Can be parallelized
from sklearn.feature_extraction.text import TfidfVectorizer

company_names = groundTruth['name']
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
ground_truth_matrix = vectorizer.fit_transform(company_names)
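
To make the shape and scaling of such a TF-IDF matrix concrete, here is a small sketch on three names from G. The helper `char_trigrams` and the `toy_*` names are hypothetical stand-ins for the notebook's `ngrams` analyzer and fitted vectorizer.

```python
import re

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def char_trigrams(text, n=3):
    # Same idea as the ngrams analyzer: strip punctuation, then slide a window of length n.
    text = re.sub(r'[,\-./]|\sBD', '', text)
    return [text[i:i + n] for i in range(len(text) - n + 1)]


toy_names = ["Magal Group SA", "NAVEXIM S.A.", "PRIMCOM SA"]
toy_vectorizer = TfidfVectorizer(min_df=1, analyzer=char_trigrams)
toy_matrix = toy_vectorizer.fit_transform(toy_names)

# One row per name, one column per distinct trigram seen during fitting.
n_rows, n_features = toy_matrix.shape

# TfidfVectorizer uses norm='l2' by default, so every non-empty row has unit length
# and the dot product of two rows equals their cosine similarity.
squared = np.asarray(toy_matrix.multiply(toy_matrix).sum(axis=1)).ravel()
row_norms = np.sqrt(squared)
```

Because the rows are unit length, similarity against the whole matrix can be computed with a plain sparse matrix product.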

#### Pairwise similarity implementation

In [11]:
from sklearn.metrics.pairwise import cosine_similarity
# A function that given an input query item returns the top-k most similar items 
# by their cosine similarity.
def find_similar(query_vector, td_matrix, top_k = 5):
    cosine_similarities = cosine_similarity(query_vector, td_matrix).flatten()
    # Sort best-first and keep only the top_k indices before building the result list.
    related_doc_indices = cosine_similarities.argsort()[::-1][:top_k]
    return [(index, cosine_similarities[index]) for index in related_doc_indices]
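
Sorting every similarity score is more work than needed when only the top few matches matter. As a sketch of one possible alternative, the hypothetical `find_similar_topk` below uses `np.argpartition` to isolate the `top_k` best scores and sorts only those. It assumes `top_k` is at least 1; the toy vectors are made up.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def find_similar_topk(query_vector, td_matrix, top_k=5):
    scores = cosine_similarity(query_vector, td_matrix).ravel()
    top_k = min(top_k, scores.size)
    # argpartition places the top_k largest scores in the last top_k slots, unordered.
    candidates = np.argpartition(scores, -top_k)[-top_k:]
    # Only those candidates are sorted, best first.
    ordered = candidates[np.argsort(scores[candidates])[::-1]]
    return [(int(i), float(scores[i])) for i in ordered]


# Toy data: four 2-d "documents" and one query pointing along the first axis.
toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.1]])
result = find_similar_topk(np.array([[1.0, 0.0]]), toy, top_k=2)
```

The result has the same form as that of `find_similar`: a list of `(index, score)` pairs, best match first.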

In [16]:
# Transform a single query name using the fitted vocabulary
# (the variable is called `name` to avoid shadowing the built-in `str`)
name = sTrain['name'][0]
print(name)
query = vectorizer.transform([name])

print("\nsimilar:")
for index, score in find_similar(query, ground_truth_matrix, 1):
    print(score, index, groundTruth.iloc[index])

ATRION Immo bilien & Co. KG

similar:
0.4982011041510661 342766 company_id                               250537
name          ATRION Immobilien Verwaltung GmbH
Name: 342766, dtype: object


In [18]:
import time
t1 = time.time()
# Match only the first 11 rows (idx 0..10) to time the row-by-row approach;
# compare the elapsed time with the target of about 1 minute per 10,000 entries.
for idx, row in sTrain.iterrows():
    query = vectorizer.transform([row['name']])
    index, score = find_similar(query, ground_truth_matrix, 1)[0]
    match = groundTruth.iloc[index]
    sTrain.loc[idx, 'match_name'] = match['name']
    sTrain.loc[idx, 'match_company_id'] = match['company_id']
    sTrain.loc[idx, 'match_score'] = score
    if idx == 10:
        break
t = time.time() - t1
print("SELFTIMED:", t)

SELFTIMED: 9.223871946334839
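
The loop above computes the similarity against the full ground-truth matrix once per name. One possible way to reduce that per-row overhead is to vectorize the names in chunks and take a sparse matrix product. The following is a sketch under assumptions, not a measured improvement: `best_matches` and `chunk_size` are hypothetical names, the toy data is made up from names shown earlier, and scikit-learn's built-in character analyzer stands in for the notebook's `ngrams` function. `chunk_size` would need tuning to the available memory.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def best_matches(query_names, vectorizer, reference_matrix, chunk_size=1000):
    """Return (best_index, best_score) arrays with one entry per query name."""
    best_idx, best_score = [], []
    for start in range(0, len(query_names), chunk_size):
        chunk = vectorizer.transform(query_names[start:start + chunk_size])
        # Rows are L2-normalized (TfidfVectorizer default), so this sparse product
        # holds the cosine similarity of every chunk row to every reference row.
        sims = (chunk @ reference_matrix.T).tocsr()
        best_idx.append(np.asarray(sims.argmax(axis=1)).ravel())
        best_score.append(sims.max(axis=1).toarray().ravel())
    return np.concatenate(best_idx), np.concatenate(best_score)


reference = ["MyTyme Investments Inc", "Sinai Health System Foundation", "Magal Group SA"]
toy_vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
reference_matrix = toy_vectorizer.fit_transform(reference)

queries = ["MyTyme Inve stments Inc", "Magal Group"]
idx, score = best_matches(queries, toy_vectorizer, reference_matrix)
```

With the real data, `idx` would index rows of `groundTruth`, so the matched names and ids can be looked up in one step with `groundTruth.iloc[idx]`.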


In [19]:
sTrain.head()

| | train_index | name | company_id | match_name | match_company_id | match_score |
|---|---|---|---|---|---|---|
| 0 | 0 | ATRION Immo bilien & Co. KG | -1 | ATRION Immobilien Verwaltung GmbH | 250537.0 | 0.498201 |
| 1 | 1 | MyTyme Inve stments Inc | 356624 | MyTyme Investments Inc | 356624.0 | 0.901331 |
| 2 | 2 | Financial USI. | 510805 | DS Financial, LLC | 152602.0 | 0.592161 |
| 3 | 3 | FlexShares Trust - FlexShares Morningstar Emer... | 523467 | FlexShares Trust - FlexShares Morningstar Emer... | 523467.0 | 0.962004 |
| 4 | 4 | Health Sinai SF | 231108 | Sinai Health System Foundation | 231108.0 | 0.541243 |
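
The nearest neighbour always returns some company, while the task also needs the answer -1 for names that are not in G. One simple option, sketched here, is to reject matches whose score falls below a cut-off. The function `apply_threshold` and the value 0.6 are hypothetical; a real threshold would have to be tuned on STrain against the cost function. The three rows are copied from the table above.

```python
import pandas as pd


def apply_threshold(matches, threshold=0.6):
    """Keep the matched id when the score reaches the threshold, else predict -1."""
    accepted = matches["match_score"] >= threshold
    predicted = matches["match_company_id"].where(accepted, -1)
    return predicted.astype("int64")


# Rows 0, 1 and 4 of the table above.
matches = pd.DataFrame({
    "company_id": [-1, 356624, 231108],
    "match_company_id": [250537.0, 356624.0, 231108.0],
    "match_score": [0.498201, 0.901331, 0.541243],
})
matches["predicted_id"] = apply_threshold(matches, threshold=0.6)
accuracy = (matches["predicted_id"] == matches["company_id"]).mean()
```

On these rows the cut-off correctly rejects the first entry but also rejects the correct match for ‘Health Sinai SF’, which shows the trade-off any single threshold has to balance.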
