In what follows, I create functions that predict the ID of a retailer's product from its feature set (the product's name, the retailer's origin, and the brand name), based on a products database that associates each product's features with its unique ID.

The first approach uses a machine-learning (ML) algorithm that exploits the feature set of a product and finds the best match for a retailer's product according to a specified rule. Specifically, I use the "bag of characters" of a product's name, the name's length, and the origin of its retailer. I hypothesize that two names with similar "bags of characters" are likely to share the same ID. I also penalize the length of the candidate name against which a retailer's product is compared, so that long names do not win merely by containing more characters. The last factor in the model is a dummy variable that equals 1 if the retailers of the two products share a common origin. To compare two names, I define a linear score function of these product features (it can easily be replaced by another functional form) and estimate its parameters on a training sample. Finally, I estimate the accuracy of the model with a hold-out validation sample. This could also be done by, say, K-fold cross-validation, but I do not expect the results to differ considerably in this case, so I skip that estimation to save computation time.
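
The scoring rule described above can be illustrated on a toy pair of candidates. This is a minimal sketch, not the notebook's own code: the helper name `linear_score`, the product names, and the parameter values are all hypothetical and chosen only for illustration.

```python
def linear_score(query, candidate, same_origin, lam1, lam2, lam3):
    """Score a candidate name against a query name (both are lists of characters)."""
    shared = set(query) & set(candidate)
    # Count every occurrence in the query of a character that also appears in the candidate
    matched = sum(query.count(ch) for ch in shared)
    # Linear rule: matched characters, common-origin dummy, candidate-length penalty
    return lam1 * matched + lam2 * int(same_origin) + lam3 * len(candidate)

# Hypothetical names with spaces already removed
query = list("AcuvueOasys")
cand_a = list("AcuvueOasys6")   # near-identical name, same origin
cand_b = list("AirOptixAqua")   # different product, different origin

score_a = linear_score(query, cand_a, True, lam1=10, lam2=2, lam3=-2)
score_b = linear_score(query, cand_b, False, lam1=10, lam2=2, lam3=-2)
print(score_a, score_b)
```

With these illustrative weights the near-identical name receives the higher score, so its ID would be the one assigned to the query.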

The second approach uses the built-in Python function "get_close_matches" from the "difflib" library, which finds the best matches for a word in a list of other words. To predict the ID of a retailer's product given the products database, I run through the database, search for the product name that best matches the name of the retailer's product, and assign its ID to the retailer's product.
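
As a small illustration of this lookup, the sketch below matches a name against a three-item database. The database contents and the helper `predict_id` are assumptions made for the example; only `difflib.get_close_matches` and its `n` and `cutoff` parameters come from the standard library.

```python
import difflib

# Hypothetical product database: product name -> product ID
database = {
    "Acuvue Oasys 6 pack": 101,
    "Air Optix Aqua": 102,
    "Biofinity Toric": 103,
}

def predict_id(name, database, cutoff=0.6):
    """Return the ID of the closest database name, or None if nothing is close enough."""
    matches = difflib.get_close_matches(name, list(database), n=1, cutoff=cutoff)
    return database[matches[0]] if matches else None

pred = predict_id("Acuvue Oasys (6 pk)", database)
miss = predict_id("zzzz", database)
print(pred, miss)
```

Raising `cutoff` makes the matcher stricter (more unmatched names, fewer wrong matches); lowering it does the opposite.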

In [2]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
import difflib

In [3]:
# Load the dataset and convert the dataframe to a list of rows (header row excluded)
df = pd.read_csv(r'train_lenses_ds_task.csv', encoding = 'utf8')
Products_list = df.values.tolist()

In [4]:
# Extract each product's ID and its retailer's origin (the name is processed in the next cell)
product_id = [new[0] for new in Products_list ]
origin = [new[2] for new in Products_list ]

In [5]:
# Remove spaces from product names so that each name becomes a list of characters ("bag of characters")
NameSplit = [[ch for ch in new[1] if ch != ' '] for new in Products_list]

In [6]:
# Combine product ID, name characters and origin into one record per product
products =[]
for i in range(len(product_id)):
    products.append([product_id[i], NameSplit[i], origin[i]])

# First Approach (via Machine Learning)

In [7]:
# Check whether the same product ID appears under retailers from different countries
found = 0
for i in range(len(products)):
    for j in range(i,len(products)):
        if products[i][2] != products[j][2] and products[i][0] == products[j][0]:
            found = 1
            break
    if found == 1:
        print("Product ID is found in different countries!")
        break 
    if i == len(products) - 1:
        print("Each product ID is domestic!")   

Product ID is found in different countries!
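
The pairwise check above compares every pair of records. A single pass that groups origins by ID gives the same answer and also reports which IDs are affected. This is a sketch on toy data; the function name `ids_with_multiple_origins` and the records are hypothetical, with the same `[id, name_chars, origin]` layout as `products`.

```python
from collections import defaultdict

def ids_with_multiple_origins(records):
    """Return the set of IDs that occur with more than one origin."""
    origins_by_id = defaultdict(set)
    for pid, _name, origin in records:
        origins_by_id[pid].add(origin)
    return {pid for pid, origins in origins_by_id.items() if len(origins) > 1}

# Toy records: ID 1 is sold from two countries, ID 2 from one
toy = [[1, list("abc"), "US"], [1, list("abd"), "DE"], [2, list("xyz"), "US"]]
shared_ids = ids_with_multiple_origins(toy)
print(shared_ids)
```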


In [8]:
# Compute the model's leave-one-out accuracy on the training sample for given parameters lam1, lam2, lam3
def train(trainSam,lam1,lam2,lam3):
    predictedID = []
    for obsDraw in trainSam:
        trainRem = trainSam.copy()
        trainRem.remove(obsDraw)
        score = []
        for obsRem in trainRem:
            # Compare countries
            if obsDraw[2] == obsRem[2]:
                countryDummy = 1
            else:
                countryDummy = 0
            # Create list of matched characters of product names
            matchedChars = list(set(obsDraw[1]) & set(obsRem[1])) 
            # Compute total number of matched characters
            matchedCharsNum = 0  
            for newchar in matchedChars:
                matchedCharsNum += obsDraw[1].count(newchar)
            # Length of the candidate name; a negative lam3 penalizes long candidates
            nameLength = len(obsRem[1])
            # Define linear score-function with arguments: number of matched characters, country of retailer and the product name's length  
            score.append(lam1*matchedCharsNum + lam2*countryDummy + lam3*nameLength) 
        # Find the first occurrence of the maximum score in the candidate list
        indMax = score.index(max(score))
        # Retrieve the product ID of the best-scoring candidate in the remaining sample
        IDpred = trainRem[indMax][0]
        # Append the prediction to the list of predicted IDs
        predictedID.append([IDpred,obsDraw[1]])
    # Compute number of correctly predicted IDs
    numPred = 0
    for i in range(len(predictedID)):
        if predictedID[i][0] == trainSam[i][0]:
            numPred += 1
    # Return the share of correctly predicted IDs
    return numPred/len(trainSam)

In [9]:
# Find the model's score on test sample
def validate(trainSam, testSam, lam1, lam2, lam3):
    predictedID = []
    for obsTest in testSam:
        score = []
        for obsTrain in trainSam:
            # Compare countries
            if obsTrain[2] == obsTest[2]:
                countryDummy = 1
            else:
                countryDummy = 0
            # Create list of matched characters of product names
            matchedChars = list(set(obsTest[1]) & set(obsTrain[1])) 
            # Compute total number of matched characters
            matchedCharsNum = 0  
            for newchar in matchedChars:
                matchedCharsNum += obsTest[1].count(newchar)
            nameLength = len(obsTrain[1])
            # Compute linear score-function given estimated parameters lam1,lam2,lam3 from train sample  
            score.append(lam1*matchedCharsNum + lam2*countryDummy + lam3*nameLength) 
        # Find the first occurrence of the maximum score in the candidate list
        indMax = score.index(max(score))
        # Retrieve the product ID of the best-scoring training item
        IDpred = trainSam[indMax][0]
        # Append the prediction to the list of predicted IDs
        predictedID.append([IDpred,obsTest[1]])
    # Find distinct IDs in training sample which are present in test sample    
    matchedIDs = list(set([x[0] for x in testSam] ) & set([x[0] for x in trainSam]))
    # Total number of IDs in test sample which can be matched with IDs in train sample
    matchedIDsNum = 0
    for newID in matchedIDs:
        matchedIDsNum += [new[0] for new in testSam].count(newID)
    # Compute number of correctly predicted IDs
    corIDsPred = 0
    for i in range(len(predictedID)):
        if predictedID[i][0] == testSam[i][0]:
            corIDsPred += 1
    # Return the share of correct predictions among test items whose ID also occurs in the training sample
    score = corIDsPred/matchedIDsNum
    return score
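
The score returned above divides by the number of test items whose ID also occurs in the training sample, not by the size of the test sample. The sketch below contrasts the two denominators on toy IDs. Both function names and the data are hypothetical, and the guard against an empty denominator is an addition of the sketch.

```python
def matchable_accuracy(true_ids, predicted_ids, train_ids):
    """Correct predictions divided by the number of test items whose ID exists in training."""
    train_set = set(train_ids)
    matchable = sum(1 for t in true_ids if t in train_set)
    correct = sum(1 for t, p in zip(true_ids, predicted_ids) if t == p)
    # Guard added in this sketch: no matchable items means the score is undefined
    return correct / matchable if matchable else float("nan")

def plain_accuracy(true_ids, predicted_ids):
    """Correct predictions divided by the number of all test items."""
    correct = sum(1 for t, p in zip(true_ids, predicted_ids) if t == p)
    return correct / len(true_ids)

true_ids = [1, 2, 3, 4]    # ID 4 never occurs in training, so it cannot be predicted
predicted = [1, 2, 9, 9]
train_ids = [1, 2, 3]

acc_matchable = matchable_accuracy(true_ids, predicted, train_ids)
acc_plain = plain_accuracy(true_ids, predicted)
print(acc_matchable, acc_plain)
```

The first measure is never lower than the second, because test items with unseen IDs are removed from the denominator while they can never add to the numerator.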

In [10]:
# Specify share of data to be used in test sample
test_size = 0.4
datatrain, datatest = train_test_split(products, test_size = test_size)
 
# Train model
scores = []
trainResults = []
# Next, I try a coarse grid of model parameters: two ad-hoc values for each of lam1, lam2 and lam3
# (the training share is 1 - test_size = 0.6). There is a clear trade-off between estimation time and grid fineness.
 
# Loop over the grid of parameter lam1
for lam1 in (10,16):
    # Loop over the grid of parameter lam2
    for lam2 in (0,2):
        # Loop over the grid of parameter lam3
        for lam3 in (-2,-0.5):
            score = train(datatrain,lam1,lam2,lam3)
            trainResults.append([lam1,lam2,lam3,score])
            scores.append(score)
# Find the parameters lam1,lam2,lam3 with the highest training score (first occurrence in case of ties)
maxScoreInd = scores.index(max(scores))
hatLam1, hatLam2, hatLam3 = trainResults[maxScoreInd][:3]
 
# Given optimal model parameters lam1,lam2,lam3 validate the model and find the model's score             
testScore = validate(datatrain, datatest,hatLam1,hatLam2,hatLam3)
print(testScore)

0.6957746478873239
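
The nested loops over the parameter grid can also be written with `itertools.product`, which makes it easier to refine the grid or add parameters. This is a sketch only: `grid_search` is a hypothetical helper, and `toy_objective` is a stand-in for a call such as `train(datatrain, lam1, lam2, lam3)` so that the example runs without the dataset.

```python
from itertools import product

def grid_search(objective, grid):
    """Evaluate objective on every grid point and return the best (params, score)."""
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = objective(**params)
        # Strict '>' keeps the first maximiser, like scores.index(max(scores))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Stand-in objective with a known maximum at (16, 2, -0.5)
def toy_objective(lam1, lam2, lam3):
    return -(lam1 - 16) ** 2 - (lam2 - 2) ** 2 - (lam3 + 0.5) ** 2

grid = {"lam1": (10, 16), "lam2": (0, 2), "lam3": (-2, -0.5)}
best_params, best_score = grid_search(toy_objective, grid)
print(best_params, best_score)
```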


# Second Approach (via Text Analysis)

In [11]:
# Define score via built-in function get_close_matches
def scorer(trainSam, testSam):
    predictedID = []
    # Names of the training items, computed once and reused for every test item
    trainNames = [x[1] for x in trainSam]
    for obsTest in testSam:
        # Find the best-matching training name for the test item's name
        # (by default get_close_matches returns up to 3 candidates with similarity >= 0.6, best first)
        bestMatch = difflib.get_close_matches(obsTest[1], trainNames)
        # If no close match is found, record an empty prediction, which never counts as correct
        if not bestMatch:
            predictedID.append([None,obsTest[1]])
        # Otherwise assign the ID of the best match to the test item
        else:
            indMatched = trainNames.index(bestMatch[0])  # First occurrence of the matched name
            indPred = trainSam[indMatched][0]
            predictedID.append([indPred,obsTest[1]])
    # Find distinct IDs in training sample found in test sample    
    matchedIDs = list(set([x[0] for x in testSam] ) & set([x[0] for x in trainSam]))
    # Total number of all such IDs in test sample 
    matchedIDsNum = 0
    for newID in matchedIDs:
        matchedIDsNum += [new[0] for new in testSam].count(newID)
    # Compute total number of correctly predicted IDs
    corIDsPred = 0
    for i in range(len(predictedID)):
        if predictedID[i][0] == testSam[i][0]:
            corIDsPred += 1
    # Return the share of correct predictions among test items whose ID also occurs in the training sample
    score = corIDsPred/matchedIDsNum
    return score

In [12]:
# Hold-out validation on a single random split (10% test share here, versus 40% in the first approach)
test_size = 0.1
datatrain, datatest = train_test_split(products, test_size = test_size)
print(scorer(datatrain, datatest))

0.8704225352112676
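
The score above comes from one random split, so it varies from run to run. K-fold cross-validation, mentioned in the introduction, averages the score over several splits. Below is a minimal pure-Python sketch: `k_fold_splits` and `cross_validate` are hypothetical helpers, and the scoring function in the example is a placeholder, not the notebook's `scorer`.

```python
import random

def k_fold_splits(data, k, seed=0):
    """Yield (train, test) pairs; every observation lands in exactly one test fold."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for fold in folds:
        held_out = set(fold)
        train = [data[i] for i in indices if i not in held_out]
        test = [data[i] for i in fold]
        yield train, test

def cross_validate(data, k, score_fn):
    """Average score_fn(train, test) over the k folds."""
    scores = [score_fn(train, test) for train, test in k_fold_splits(data, k)]
    return sum(scores) / len(scores)

toy = list(range(10))
splits = list(k_fold_splits(toy, 5))
# Placeholder score: ratio of test size to train size in each fold
mean_score = cross_validate(toy, 5, lambda train, test: len(test) / len(train))
print(mean_score)
```

With the notebook's data, a call such as `cross_validate(products, 10, scorer)` would play the same role as the single split, at roughly ten times the running time.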
