#### For this task, we were asked to assign each raw city name its “true” (normalized) city using the Levenshtein distance, with any resources allowed. We accept a match only when the distance between the two words is strictly below a threshold that grows with the length of the raw word (`len(rawcity)/2 - 1` in the code below, i.e. just under half the word length). We chose a length-dependent threshold because a 5-letter word at distance 3 from another could be a completely different word, while a 10-letter word at distance 3 is most likely the same word.
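
To make the threshold reasoning concrete, here is a minimal sketch of the comparison used in `citymatch`, applied to a distance of 3 for a 5-letter and a 10-letter word. The helper name `within_threshold` and the two sample strings are illustrative assumptions, not part of the original notebook.

```python
def within_threshold(distance: int, raw_word: str) -> bool:
    """Return True when an edit distance is small enough to count as a match.

    Mirrors the comparison in `citymatch`: distance < len(raw_word) / 2 - 1.
    """
    return distance < len(raw_word) / 2 - 1


# Hypothetical inputs: only their lengths matter here (5 and 10 characters).
short_word = "abcde"
long_word = "abcdefghij"

# Cutoff for 5 letters is 1.5, so a distance of 3 is rejected.
print(within_threshold(3, short_word))
# Cutoff for 10 letters is 4.0, so a distance of 3 is accepted.
print(within_threshold(3, long_word))
```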

In [1]:
pip install Levenshtein

Collecting Levenshtein
  Downloading Levenshtein-0.18.1-cp38-cp38-macosx_10_9_x86_64.whl (242 kB)
Collecting rapidfuzz<3.0.0,>=2.0.1
  Downloading rapidfuzz-2.0.11-cp38-cp38-macosx_10_9_x86_64.whl (1.6 MB)
Collecting jarowinkler<1.1.0,>=1.0.2
  Downloading jarowinkler-1.0.2-cp38-cp38-macosx_10_9_x86_64.whl (72 kB)
Installing collected packages: jarowinkler, rapidfuzz, Levenshtein
Successfully installed Levenshtein-0.18.1 jarowinkler-1.0.2 rapidfuzz-2.0.11
Note: you may need to restart the kernel to use updated packages.


In [2]:
import Levenshtein as lev
import pandas as pd
import json

In [4]:
rawcities = pd.read_csv('raw_cities.csv')
normalizedcities = pd.read_csv('normalized_cities.csv')

In [5]:
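def citymatch(rawcity, normalizedcities):
    """Return a normalized city within the distance threshold of rawcity, or 'None'."""
    wordmatched = 'None'
    try:
        # Accept a candidate only if its distance is below half the raw length minus one.
        threshold = len(rawcity) / 2 - 1
        for ncity in normalizedcities["city"]:
            distance = lev.distance(ncity, rawcity)
            if distance < threshold:
                # Note: a later candidate overwrites an earlier one, so the result
                # is the last city under the threshold, not necessarily the closest.
                wordmatched = ncity
    except TypeError:
        # Non-string entries (for example NaN from pandas) have no len().
        wordmatched = 'None'
    return wordmatched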

In [6]:
def checkkey(wordkey: str, wordtomatch: str, dictionary: dict):
    # Store the match only for keys that already exist in the dictionary.
    if wordkey in dictionary:
        dictionary[wordkey] = wordtomatch
    else:
        print("none")
    return dictionary

In [7]:
def listcities(normalized, dictionary: dict):
    # Work on a copy so the input dictionary is not modified while we iterate over it.
    newdict = dict(dictionary)
    for city in dictionary:
        newdict = checkkey(city, citymatch(city, normalized), newdict)
    return newdict

In [9]:
original_dictionary = {}
for city in rawcities["city"]:
    original_dictionary[city] = ""

#original_dictionary

In [10]:
citiesmatched= listcities(normalizedcities, original_dictionary)
print(citiesmatched)

{'cleron': 'cleon', 'aveillans': 'avillers', 'paray-vieille-poste': 'paray vieille poste', 'issac': 'assac', 'rians': 'rians', 'rebais': 'rebais', 'sevran': 'sevran', 'brindas': 'brindas', 'houchin': 'bouchon', 'vendome': 'vendome', 'hossegor': 'None', 'huningue': 'huningue', 'laissaud': 'laissaud', 'la creche': 'lamarche', 'mamoudzou': 'mamoudzou', 'plumaugat': 'plauzat', 'pouzauges': 'pouzauges', 'bedarrides': 'bedarrides', 'larmor-plage': 'lamorlaye', 'le rochereau': 'None', 'le poet laval': 'poët laval', 'porto vecchio': 'porto vecchio', 'aulnay sous bois': 'jagny sous bois', 'marnay sur seine': 'mery sur oise', 'fleury-les-aubrais': 'fleury les aubrais', 'bussy saint georges': 'saint georges', 'salon de provence |': 'salon de provence', 'saint jean de paracole': 'saint jean de beauregard', 'gouy': 'gouy', 'theys': 'theys', 'vitre': 'titre', 'essert': 'essert', 'parigny': 'perigny', 'bailleul': 'bailleul', 'montrabe': 'montrabot', 'espinasse': 'espinas', 'porticcio': 'None', 'riedi

In [11]:
with open("dictcitiesmatched.json", "w") as write_file:
    json.dump(citiesmatched, write_file, indent=4)

#### Q. Write your conclusion: does this method work?
Looking at the results, this method seems to work well in most cases. The raw names left without a match are usually names for which a manual check finds no counterpart among the normalized cities either. The threshold can be adjusted depending on the accuracy required. If we are certain that every raw name is meant to match one of the normalized cities, we could remove the threshold and have the program return the most similar city. We could also modify it to return every city whose distance is below the threshold, but a human would then need to review the candidates and pick one. However, this task involved matching names that are expected to be spelled almost identically. The technique may not work well when the words to match share an origin or meaning but are spelled with different letters.
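
The two variants discussed here (returning the single most similar city, or listing every candidate for human review) can be sketched with one ranking function. This is only an illustration under assumptions: `edit_distance` is a plain dynamic-programming stand-in for `lev.distance` so the snippet runs without the library, `ranked_candidates` is a hypothetical helper name, and the three-city list is a made-up sample rather than the real normalized file.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def ranked_candidates(raw_city: str, cities: List[str],
                      use_threshold: bool = True) -> List[Tuple[int, str]]:
    """Return (distance, city) pairs sorted from closest to farthest."""
    scored = sorted((edit_distance(city, raw_city), city) for city in cities)
    if use_threshold:
        # Same cutoff as citymatch: keep only distances below len/2 - 1.
        limit = len(raw_city) / 2 - 1
        scored = [pair for pair in scored if pair[0] < limit]
    return scored


# Made-up sample of normalized names, for illustration only.
sample_cities = ["vitre", "titre", "sevran"]

# Every candidate under the threshold, closest first, for a human to review.
print(ranked_candidates("vitre", sample_cities))

# Without a threshold, the first entry is simply the most similar city.
print(ranked_candidates("hossegor", sample_cities, use_threshold=False)[0])
```

Taking the first element of the ranked list gives the closest city rather than the last one found under the threshold, which is the behaviour one would probably want when an exact match and a near match are both present.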


#### Q. Could you do better work with additional resources?

As partially mentioned above, this algorithm could be improved with other techniques. Adding a database that links equivalent words such as ‘soccer’ and ‘football’ would improve the results of the program. Cleaning the data beforehand would also help, because some characters, such as accented letters, may be stored in a different Unicode form, which can drastically increase the distance between two words.
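
As an illustration of the cleaning step, the sketch below normalizes a name before any distance is computed, using only the standard library. The function name `clean_city` and the exact rules (lowercasing, dropping accents, turning punctuation into spaces) are assumptions chosen for this example; the sample inputs are taken from the output shown earlier.

```python
import re
import unicodedata


def clean_city(name: str) -> str:
    """Lowercase, strip accents, and collapse punctuation into single spaces."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Replace every run of characters that are not a-z or 0-9 with one space.
    spaced = re.sub(r"[^a-z0-9]+", " ", without_accents.lower())
    return spaced.strip()


print(clean_city("paray-vieille-poste"))  # paray vieille poste
print(clean_city("poët laval"))           # poet laval
print(clean_city("salon de provence |"))  # salon de provence
```

Applying the same cleaning to both the raw and the normalized names before calling the distance function would make hyphens, stray symbols, and accents stop counting as edits.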
