All data is provided as plain-text, column-separated files:
* columns are separated by ‘|’ (bar), with one line per entry;
* the test set will have the same size as the STrain set provided to you now (hence, no memory issues during testing).

We provide you up front with three datasets: G, STrain and sample_submission.
* G has as columns:
    * company_id: the id of the company in our ground-truth administration 
    * name: the name of the company in our ground-truth administration
* STrain has as columns:
    * train_index: an index of the company in the external source dataset;
    * name: the name of the company as represented in the external source dataset;
    * company_id: the correct match of this entry to G. It is -1 if the correct label is ‘not in G’; otherwise it corresponds to a company_id in G.
* sample_submission has as columns:
    * test_index: an index of the company in an external source dataset (note: this index is different from that of STrain; the full STest will only be provided during the interview);
    * company_id: the predicted match of this entry to G. Note that in this file, these are randomly generated predictions.

You need to design, code and train a model that predicts company_id, minimizing the cost function specified on the previous page.
Please make sure that your trained model:

* accepts as input a plain text file of the format STest, containing two columns test_index and name;
* runs from the command line, using as input the path to the file STest;
* predicts in ‘near real time’, i.e. about 1 minute for 10,000 entries (on a regular laptop);
* returns a file in the above plain text format, including the columns test_index and company_id (an example submission is provided).
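
The input and output requirements above can be sketched as a small command-line script. This is only a skeleton under stated assumptions: `predict_company_id` is a hypothetical placeholder that always answers ‘not in G’, and the `--output` option and its default file name are illustrative choices, not part of the assignment.

```python
import argparse
import csv


def predict_company_id(name):
    """Hypothetical placeholder: always answers 'not in G' (-1)."""
    return -1


def run(input_path, output_path):
    """Read a '|'-separated STest file and write test_index|company_id."""
    with open(input_path, newline="", encoding="utf-8") as src, \
         open(output_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src, delimiter="|")
        writer = csv.writer(dst, delimiter="|", lineterminator="\n")
        writer.writerow(["test_index", "company_id"])
        for row in reader:
            writer.writerow([row["test_index"], predict_company_id(row["name"])])


def main(argv=None):
    parser = argparse.ArgumentParser(description="Predict company_id for STest.")
    parser.add_argument("stest_path", nargs="?",
                        help="path to the '|'-separated STest file")
    parser.add_argument("--output", default="submission.csv",
                        help="where to write the predictions (assumed name)")
    args = parser.parse_args(argv)
    if args.stest_path is None:
        # Nothing to do without an input file; show how to call the script.
        parser.print_usage()
        return
    run(args.stest_path, args.output)


if __name__ == "__main__":
    main()
```

Saved under a name of your choice, it would be called with the path to STest as its only required argument; the real model replaces the placeholder function.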

In [1]:
import pandas as pd
import numpy as np
import time
import warnings
warnings.filterwarnings('ignore')

In [2]:
groundTruth = pd.read_csv("Datasets/G.csv", sep='|')
print(list(groundTruth.columns.values))
print(groundTruth.head())
print(groundTruth.shape)

['company_id', 'name']
   company_id                        name
0      634022                  PRIMCOM SA
1      324497       The David Isaacs Fund
2      280848  Bramor Enterprises Limited
3      432662                NAVEXIM S.A.
4      524224              Magal Group SA
(450256, 2)


In [3]:
companyCounts = groundTruth['company_id'].value_counts()
print(companyCounts.describe())
topCompanyCounts = companyCounts.nlargest(20)

count    450256.0
mean          1.0
std           0.0
min           1.0
25%           1.0
50%           1.0
75%           1.0
max           1.0
Name: company_id, dtype: float64


In [4]:
sTrain = pd.read_csv("Datasets/STrain.csv", sep='|')
print(list(sTrain.columns.values))
print(sTrain.head())
print(sTrain.shape)

['train_index', 'name', 'company_id']
   train_index                                               name  company_id
0            0                        ATRION Immo bilien & Co. KG          -1
1            1                            MyTyme Inve stments Inc      356624
2            2                                     Financial USI.      510805
3            3  FlexShares Trust - FlexShares Morningstar Emer...      523467
4            4                                    Health Sinai SF      231108
(100000, 3)


In [5]:
sTrainFilter = sTrain[(sTrain.company_id!=-1)]
print(sTrainFilter.shape)
matchingRecords = sTrainFilter.join(
    groundTruth.set_index("company_id"), on="company_id", lsuffix='_train', rsuffix='_GT')

(69652, 3)


In [6]:
print(matchingRecords.shape)
repeatedCounts = matchingRecords['company_id'].value_counts()
#print(repeatedCounts.describe())
matchingRecords.head()

(69652, 4)


,train_index,name_train,company_id,name_GT
1,1,MyTyme Inve stments Inc,356624,MyTyme Investments Inc
2,2,Financial USI.,510805,UBS Financial Services Inc.
3,3,FlexShares Trust - FlexShares Morningstar Emer...,523467,FlexShares Trust - FlexShares Morningstar Emer...
4,4,Health Sinai SF,231108,Sinai Health System Foundation
6,6,"LLC TEBS Fund II, ATAX",277891,"ATAX TEBS II, LLC"


In [7]:
# Count the matched company_ids found in G.
# Caveat: `in` on a pandas Series tests the index labels, not the values.
test = [x for x in np.array(sTrainFilter['company_id']) if x in groundTruth['company_id']]
len(test)

48764
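
The count above comes from `x in groundTruth['company_id']`, which pandas evaluates against the index labels of the Series rather than its values. The difference can be seen on toy data; the small frame and the `labels` series below are made up for illustration, and `isin` is shown as one way to test against values.

```python
import pandas as pd

# Toy stand-ins for G and for the labels in STrain (hypothetical values).
ground_truth = pd.DataFrame({
    "company_id": [634022, 324497, 280848],
    "name": ["PRIMCOM SA", "The David Isaacs Fund", "Bramor Enterprises Limited"],
})
labels = pd.Series([324497, 2, 999999])

# `in` looks at the index labels (0, 1, 2 here), not at the stored ids.
by_index = [x for x in labels if x in ground_truth["company_id"]]

# `isin` compares against the values of the column.
by_value = labels[labels.isin(ground_truth["company_id"])].tolist()

print(by_index)  # [2]
print(by_value)  # [324497]
```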

#### TF-IDF with N-grams

In [8]:
import re

def ngrams(string, n=3):
    # Strip ',', '-', '.', '/' and any ' BD' sequence, then slide a window of n characters.
    string = re.sub(r'[,\-./]|\sBD', r'', string)
    grams = zip(*[string[i:] for i in range(n)])
    return [''.join(gram) for gram in grams]

print('All 3-grams in "McDonalds":')
ngrams('McDonalds')

All 3-grams in "McDonalds":


['McD', 'cDo', 'Don', 'ona', 'nal', 'ald', 'lds']

In [9]:
# Can be parallelized
from sklearn.feature_extraction.text import TfidfVectorizer
company_names = groundTruth['name']
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
ground_truth_matrix = vectorizer.fit_transform(company_names)

In [15]:
# Copy the slice so that columns added later do not write into a view of sTrain.
sTrainFil = sTrain[0:10000].copy()
train_matrix = vectorizer.transform(sTrainFil['name'])
train_matrix

<10000x78817 sparse matrix of type '<class 'numpy.float64'>'
	with 223152 stored elements in Compressed Sparse Row format>

In [16]:
# Compute the similarity of each training name to every name in G and keep the best match.
# TfidfVectorizer L2-normalizes rows by default, so the dot product is the cosine similarity.
# TODO: implement LSA
def get_top_sim(sparse_row):
    # No shared n-grams with any name in G: no candidate match.
    if sparse_row.getnnz() == 0:
        return (0.0, None, -1)
    arg_index = np.argmax(sparse_row.data)
    # Column positions of the similarity matrix are row positions of groundTruth.
    match = groundTruth.iloc[sparse_row.indices[arg_index]]
    return (sparse_row.data[arg_index], match['name'], match['company_id'])

def cosine_similarities(trainMat, groundTruthMat):
    sim = trainMat.dot(groundTruthMat.T)
    return [get_top_sim(row) for row in sim]

In [17]:
import time
t1 = time.time()
res = cosine_similarities(train_matrix, ground_truth_matrix)
t = time.time()-t1
print("Time taken for computing similarities:", t)

Time taken for computing similarities: 36.37892985343933
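
Part of this time is spent in the Python loop over rows of the similarity matrix. A possible alternative, sketched here and not timed on this dataset, takes the row-wise maximum with SciPy's sparse `max` and `argmax` and processes the queries in batches. The function name `top_matches` and the `batch_size` default are assumptions for illustration.

```python
import numpy as np
from scipy import sparse


def top_matches(query_matrix, reference_matrix, batch_size=1000):
    """Return (best_row, best_score) for each query row.

    best_row is the row position in reference_matrix, or -1 when the
    query shares no feature with any reference row.
    """
    n_queries = query_matrix.shape[0]
    best_row = np.full(n_queries, -1, dtype=np.int64)
    best_score = np.zeros(n_queries)
    reference_t = reference_matrix.T.tocsr()
    for start in range(0, n_queries, batch_size):
        stop = min(start + batch_size, n_queries)
        # Dot products of L2-normalized TF-IDF rows are cosine similarities.
        sim = (query_matrix[start:stop] @ reference_t).tocsr()
        scores = sim.max(axis=1).toarray().ravel()
        rows = np.asarray(sim.argmax(axis=1)).ravel()
        # A zero maximum means no overlap at all: keep -1 for that query.
        best_row[start:stop] = np.where(scores > 0, rows, -1)
        best_score[start:stop] = scores
    return best_row, best_score


# Tiny hand-made example: three reference rows, three queries.
reference = sparse.csr_matrix(np.array([[1.0, 0.0, 0.0],
                                        [0.0, 1.0, 0.0],
                                        [0.0, 0.6, 0.8]]))
queries = sparse.csr_matrix(np.array([[0.0, 0.0, 1.0],
                                      [0.0, 0.0, 0.0],
                                      [1.0, 0.0, 0.0]]))
rows, scores = top_matches(queries, reference, batch_size=2)
print(rows, scores)
```

With the notebook's variables, the returned row positions would be mapped back to ids through `groundTruth['company_id'].to_numpy()` for every position that is not -1.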


In [18]:
match_score, match_name , match_company_id = zip(*res)
sTrainFil['match_name'] = np.array(match_name)
sTrainFil['match_company_id'] = np.array(match_company_id)
sTrainFil['match_score'] = np.array(match_score)

In [20]:
sTrainFil

,train_index,name,company_id,match_name,match_company_id,match_score
0,0,ATRION Immo bilien & Co. KG,-1,ATRION Immobilien Verwaltung GmbH,250537,0.498201
1,1,MyTyme Inve stments Inc,356624,MyTyme Investments Inc,356624,0.901331
2,2,Financial USI.,510805,"DS Financial, LLC",152602,0.592161
3,3,FlexShares Trust - FlexShares Morningstar Emer...,523467,FlexShares Trust - FlexShares Morningstar Emer...,523467,0.962004
4,4,Health Sinai SF,231108,Sinai Health System Foundation,231108,0.541243
5,5,Auto markt Dinser sGmbH,-1,Baumarkt Bender GmbH,259391,0.287415
6,6,"LLC TEBS Fund II, ATAX",277891,"ATAX TEBS II, LLC",277891,0.634353
7,7,Windermere reds Acquisitions LLC,205639,Breds Windermere Acquisitions L.L.C.,205639,0.822906
8,8,Les Eoliennes de Saint Fraigne,-1,Eoliennes de Lorraine SA,371703,0.542665
9,9,"Tatra Asset Management, spárv. spol., a.s., de...",589052,"Tatra Asset Management, správ. spol., a.s., de...",589052,0.671565
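
In the rows above, the entries labelled -1 still receive a best match, typically with a lower score than the correct matches. One way to turn such cases into a ‘not in G’ prediction is a score threshold. The sketch below picks the threshold by plain accuracy, which is only a stand-in for the cost function from the previous page; the function names and the candidate thresholds are assumptions.

```python
import numpy as np


def apply_threshold(match_ids, scores, threshold):
    """Keep the matched id when its score reaches the threshold, else -1."""
    return np.where(np.asarray(scores) >= threshold, np.asarray(match_ids), -1)


def best_threshold(true_ids, match_ids, scores, candidates):
    """Return the candidate threshold with the highest plain accuracy."""
    true_ids = np.asarray(true_ids)
    accuracies = [np.mean(apply_threshold(match_ids, scores, t) == true_ids)
                  for t in candidates]
    return candidates[int(np.argmax(accuracies))], max(accuracies)


# The ten rows shown above: true label, best match and its score.
true_ids = [-1, 356624, 510805, 523467, 231108, -1, 277891, 205639, -1, 589052]
match_ids = [250537, 356624, 152602, 523467, 231108,
             259391, 277891, 205639, 371703, 589052]
scores = [0.498201, 0.901331, 0.592161, 0.962004, 0.541243,
          0.287415, 0.634353, 0.822906, 0.542665, 0.671565]

threshold, accuracy = best_threshold(true_ids, match_ids, scores,
                                     candidates=[0.0, 0.3, 0.5, 0.6, 0.7])
print(threshold, accuracy)
```

Ten rows are far too few to settle on a value; in practice the threshold would be tuned on a held-out part of STrain against the actual cost function.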
