# Introduction
This notebook analyzes the `name` column to consolidate similar words and catch spelling errors.  
**TODOS**:  
* [X] Hard-coded fuzzy matching
* [X] fuzzywuzzy library
* [X] comparison of performance
* [X] implement into preprocessing code

In [1]:
import numpy as np
from fuzzywuzzy import process, fuzz

from src.utils import import_dataset, lazy_ldist, cluster_fuzz

In [2]:
dat = import_dataset()
dat = dat['name']

# 0. Quick test - manually implemented recursive Levenshtein distance
Implemented as `lazy_ldist` in `src/utils.py`. Later sections compute string distances with the `fuzzywuzzy` library instead.
```python
>>> lazy_ldist('test_insertion', 'test_insertion_extra')
6
>>> lazy_ldist('test_deletion', 'test')
9
>>> lazy_ldist('asdf', 'dsdf')
1
>>> lazy_ldist('test_replacement', 'xesd_replacement')
# Should be 2, but the call never finishes because of the insertion case. The same happens with the actual data.
```
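
The non-terminating case above can be avoided with an iterative dynamic-programming formulation. The sketch below is not the `lazy_ldist` from `src/utils.py`; `levenshtein` is a hypothetical stand-in that fills the distance table row by row, so it always finishes after `len(a) * len(b)` steps.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # previous[j] is the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein('test_replacement', 'xesd_replacement'))  # 2
```

The other three quick-test inputs give the same results as `lazy_ldist` (6, 9 and 1).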

# 1. Raw method: Identify similar models and assign top model
Group similar models together and assume that the most frequent model in each cluster is the correct spelling.
1. Create `model_dict` with each model's frequency and turn each (model, freq) pair into a Node.
2. Create one Link for each pair of Nodes.
3. Start with each Node in its own cluster, then iterate through the Links in descending order of `fuzz.ratio` score and merge clusters.
4. A Node is added to a cluster when one of its Links has a `fuzz.ratio` above `ratio_threshold`.
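
The steps above can be sketched as follows. This is not the `cluster_fuzz` from `src/utils.py`, whose internals are not shown here. `cluster_by_ratio` and `ratio` are hypothetical names, `difflib.SequenceMatcher` stands in for `fuzz.ratio`, and the merge rule (join two clusters as soon as any link between them exceeds the threshold) is an assumption.

```python
from difflib import SequenceMatcher
from itertools import combinations


def ratio(a: str, b: str) -> int:
    """Similarity score in 0-100; stand-in for fuzz.ratio."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def cluster_by_ratio(freq: dict, ratio_threshold: int = 70) -> dict:
    """Map each cluster's most frequent word to the other words in that cluster."""
    # Step 1: every word (node) starts in its own cluster.
    parent = {word: word for word in freq}

    def find(word):
        # Follow parent pointers up to the cluster root.
        while parent[word] != word:
            word = parent[word]
        return word

    # Step 2: one link per pair of words, scored by similarity.
    links = [(ratio(a, b), a, b) for a, b in combinations(freq, 2)]

    # Steps 3-4: walk links from the highest score down, merging above the threshold.
    for score, a, b in sorted(links, reverse=True):
        if score <= ratio_threshold:
            break
        parent[find(a)] = find(b)

    # Group words by cluster root, then label each cluster with its most frequent word.
    groups = {}
    for word in freq:
        groups.setdefault(find(word), []).append(word)
    result = {}
    for members in groups.values():
        top = max(members, key=freq.get)
        result[top] = sorted(w for w in members if w != top)
    return result


sample = {'chevrolet': 43, 'chevroelt': 1, 'volkswagen': 15, 'vw': 6, 'ford': 51}
print(cluster_by_ratio(sample))
```

On this sample, `chevroelt` is attached to `chevrolet`, while `vw` and `volkswagen` stay in separate clusters.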

In [3]:
# Take the first word of each name as the model and count how often each model occurs.
models = dat.apply(lambda x: x.split(' ')[0])

model_dict = {}
for model in models:
    model_dict[model] = model_dict.get(model, 0)+1
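
The same frequency table can also be built with `collections.Counter`. The sketch below uses a small hypothetical list of names in place of `dat`, and calls `str.split()` without arguments so that leading or repeated spaces do not produce an empty first token.

```python
from collections import Counter

# Hypothetical sample; in the notebook the names come from dat.
names = ['ford pinto', 'ford torino', 'toyota corolla', 'toyouta corona mark ii']

# The first whitespace-separated token of each name is taken as the model.
model_counts = Counter(name.split()[0] for name in names)
print(model_counts.most_common(1))  # [('ford', 2)]
```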

In [4]:
cluster_fuzz(model_dict, ratio_threshold=70)

There are 32 clusters
Variables for correlation threshold 70
---------------------------------------
1 ford: 51 | CLUSTER: []
2 chevrolet: 43 | CLUSTER: [(89, 'chevroelt')]
1 plymouth: 31 | CLUSTER: []
1 dodge: 28 | CLUSTER: []
1 amc: 28 | CLUSTER: []
2 toyota: 25 | CLUSTER: [(92, 'toyouta')]
1 datsun: 23 | CLUSTER: []
1 buick: 17 | CLUSTER: []
1 pontiac: 16 | CLUSTER: []
2 volkswagen: 15 | CLUSTER: [(95, 'vokswagen')]
1 honda: 13 | CLUSTER: []
1 mercury: 11 | CLUSTER: []
2 mazda: 10 | CLUSTER: [(80, 'maxda')]
1 oldsmobile: 10 | CLUSTER: []
1 fiat: 8 | CLUSTER: []
1 peugeot: 8 | CLUSTER: []
1 audi: 7 | CLUSTER: []
1 volvo: 6 | CLUSTER: []
1 vw: 6 | CLUSTER: []
1 chrysler: 6 | CLUSTER: []
1 renault: 5 | CLUSTER: []
1 subaru: 4 | CLUSTER: []
1 opel: 4 | CLUSTER: []
1 saab: 4 | CLUSTER: []
1 chevy: 3 | CLUSTER: []
1 bmw: 2 | CLUSTER: []
1 cadillac: 2 | CLUSTER: []
2 mercedes-benz: 2 | CLUSTER: [(76, 'mercedes')]
1 triumph: 1 | CLUSTER: []
1 capri: 1 | CLUSTER: []
1 hi: 1 | CLUSTER: []
1 n

Neither volkswagen & vw nor chevrolet & chevy are clustered together using Levenshtein distance: the abbreviations differ from the full names by too many edits to score above the threshold of 70.
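
To see why the abbreviations are missed, the sketch below scores both pairs with `difflib.SequenceMatcher`, used here as a stand-in for `fuzz.ratio`, and then folds them together with an explicit alias table. `ALIASES` and `merge_aliases` are hypothetical names, not part of `src/utils.py`; the counts are copied from the output above.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Similarity score in 0-100; stand-in for fuzz.ratio."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


for short, full in [('vw', 'volkswagen'), ('chevy', 'chevrolet')]:
    print(short, full, ratio(short, full))  # both scores fall below 70

# Hypothetical alias table: abbreviation -> canonical spelling.
ALIASES = {'vw': 'volkswagen', 'chevy': 'chevrolet'}


def merge_aliases(freq: dict, aliases: dict) -> dict:
    """Add the count of each alias to the count of its canonical spelling."""
    merged = {}
    for word, count in freq.items():
        canonical = aliases.get(word, word)
        merged[canonical] = merged.get(canonical, 0) + count
    return merged


counts = {'volkswagen': 15, 'vw': 6, 'chevrolet': 43, 'chevy': 3}
print(merge_aliases(counts, ALIASES))  # {'volkswagen': 21, 'chevrolet': 46}
```

Applying such a table to `model_dict` before clustering would be one way to handle abbreviations that no edit-distance threshold can catch.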

# 2. Affinity Propagation to identify similar models

In [53]:
import numpy as np
from sklearn.cluster import AffinityPropagation
    
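words = list(model_dict.keys())  # list, so that words can be indexed by position
# Pairwise fuzz.ratio scores (0-100, higher means more similar), negated.
lev_similarity = -1*np.array([[fuzz.ratio(w1,w2) for w1 in words] for w2 in words])

# With affinity="euclidean", each row of the matrix is treated as a feature vector.
affprop = AffinityPropagation(affinity="euclidean", max_iter=100, damping=0.5)
affprop.fit(lev_similarity)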
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = [words[i] for i in list(np.nonzero(affprop.labels_==cluster_id)[0])]
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))

 - *chevrolet:* chevrolet, chevy, chrysler, chevroelt
 - *datsun:* datsun, audi, fiat, renault, honda, subaru, nissan
 - *volkswagen:* volkswagen, oldsmobile, vokswagen
 - *opel:* ford, dodge, peugeot, opel, volvo
 - *toyouta:* plymouth, pontiac, toyota, toyouta
 - *maxda:* saab, mazda, maxda
 - *capri:* buick, amc, hi, capri, cadillac, triumph
 - *vw:* bmw, vw
 - *mercedes:* mercury, mercedes-benz, mercedes
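
Several of these clusters mix unrelated names (for example `opel` with ford, dodge and volvo). One possible contributor is that `affinity="euclidean"` makes scikit-learn compare whole rows of the matrix rather than use the pairwise scores directly. The sketch below is an untested alternative, not a result from this notebook: it passes the scores as a similarity matrix with `affinity="precomputed"`. It uses `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`, a short hypothetical word list, and an arbitrary `random_state`.

```python
from difflib import SequenceMatcher

import numpy as np
from sklearn.cluster import AffinityPropagation

# Hypothetical subset of the model names seen above.
words = ['chevrolet', 'chevroelt', 'toyota', 'toyouta', 'mazda', 'maxda', 'ford']

# Similarity matrix: higher means more similar, so no negation is applied.
similarity = np.array([[100.0 * SequenceMatcher(None, a, b).ratio() for b in words]
                       for a in words])
similarity = (similarity + similarity.T) / 2  # make the matrix symmetric

affprop = AffinityPropagation(affinity="precomputed", damping=0.5,
                              max_iter=200, random_state=0)
affprop.fit(similarity)

# If the algorithm does not converge, there are no exemplars and nothing is printed.
for cluster_id, center in enumerate(affprop.cluster_centers_indices_):
    members = [w for w, label in zip(words, affprop.labels_) if label == cluster_id]
    print(" - *%s:* %s" % (words[center], ", ".join(members)))
```

The number of clusters depends on the `preference` parameter, which defaults to the median similarity, so it would likely need tuning on the full `model_dict`.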


