# Demo 3 | String Matching using `nltk.metrics`
<hr>

Using the Lazada CIKM 2017 dataset, this demo explores a few string matching techniques built on the `nltk.metrics` package. Each algorithm is explained briefly.

In [1]:
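import pandas as pd
from IPython.display import display  # explicit import so the cells also run outside a notebook

# Show up to 80 characters per column and do not cap the number of rows displayed
pd.options.display.max_colwidth = 80
pd.options.display.max_rows = None

# Helper module providing read_lazada_csv()
import lzd_utils

from nltk.metrics.distance import (
    binary_distance,
    edit_distance,
    jaccard_distance,
    jaro_winkler_similarity,
    masi_distance,
    demo,
)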
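
The imported metrics fall into two groups: `edit_distance` and `jaro_winkler_similarity` compare strings character by character, while `binary_distance`, `jaccard_distance` and `masi_distance` compare sets of labels. As a rough illustration of the set-based idea, here is a minimal pure-Python sketch of the Jaccard and binary distances applied to two product titles. The `tokens` helper and the `*_sketch` function names are hypothetical and are not part of NLTK; splitting on whitespace is only one possible way to turn a title into a set.

```python
def tokens(title):
    """Lower-case a title and split it on whitespace into a set of tokens."""
    return set(title.lower().split())


def jaccard_distance_sketch(set1, set2):
    """1 - |intersection| / |union|: 0.0 for identical sets, 1.0 for disjoint sets."""
    union = set1 | set2
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return (len(union) - len(set1 & set2)) / len(union)


def binary_distance_sketch(label1, label2):
    """0.0 if the two labels are equal, 1.0 otherwise."""
    return 0.0 if label1 == label2 else 1.0


a = tokens("Apple iPhone 6 plus 16GB (Gold)")
b = tokens("Apple iPhone 7 Plus 256GB (Rose Gold)")

print(jaccard_distance_sketch(a, b))  # 3 shared tokens out of 10 distinct ones -> 0.7
print(binary_distance_sketch(a, b))   # the two sets differ -> 1.0
```

Note that `(gold)` and `gold)` count as different tokens here, so stripping punctuation before tokenising would likely bring the two titles closer.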

In [2]:
# Read the dataset from CSV and keep only the Singapore listings
df = lzd_utils.read_lazada_csv()
df = df[df.country == 'sg']

# Filter to the Mobiles category, then sort the titles and re-index from 0
df_mobiles = df[df.category_lvl_2 == 'Mobiles'].copy()
mobiles_titles = df_mobiles['title'].sort_values(ascending=True)
mobiles_titles = mobiles_titles.reset_index(drop=True)
print(mobiles_titles.count())
display(mobiles_titles)

51


0                      (EXPORT) Sony Xperia E Dual C1605 4GB Unlocked Smartphone Gold
1                                        (EXPORT) Sony Xperia V LT25i 8GB Phone Black
2                                 (Refurbished) Apple iPhone 5 16GB Black(Black 16GB)
3                                           (Refurbished) Apple iPhone 5S 32GB (Gold)
4     AIEK X6 1.0 inch Quad Band Card Phone Bluetooth 3.0 FM Audio Player (Black) ...
5                                                     Apple iPhone 6 plus 16GB (Gold)
6                                            Apple iPhone 6S 4.7inch 64GB (Rose Gold)
7                                               Apple iPhone 7 Plus 256GB (Rose Gold)
8                                         BlackBerry P100 PlayBook 16GB Wi-Fi (Black)
9      DOOGEE Titans T3 Smartphone 4.7'' Android 6.0 3GB RAM+32GB ROM EU Plug (Brown)
10                                                    Galaxy A7 2017 Peach(Pink 32GB)
11    HTC Desire 830 -DUAL SIM 4G - ANDROID - 5.5GHZ -

In [3]:
# Filter to the Laptops category, then sort the titles and re-index from 0
df_laptops = df[df.category_lvl_2 == 'Laptops'].copy()
laptops_titles = df_laptops['title'].sort_values(ascending=True)
laptops_titles = laptops_titles.reset_index(drop=True)
print(laptops_titles.count())
display(laptops_titles)

69


0                          (REFURBISHED) Lenovo X201 Core i5 4GB RAM 320GB HDD 12.1".
1     (Refurbished) Dell Precision M4700 15.6" 3rd Gen Intel Core i7 16GB RAM 750G...
2                           (Refurbished) Lenovo Yoga 2 13.3" Intel Core i7-4510U 8GB
3                                           (Refurbished) MacBook 12-inch (5JY32LL/A)
4     13-inch Apple MacBook Air MMGF2 ENGLISH KEYBOARD- Intel Core i5 1.6 GHz Dual...
5     3-Button 3D USB Optical Scroll LED Mice Mouse for Notebook Laptop Desktop (E...
6     ASUS K401UQ-FA074T i7-7500U GT940MX 2GB DDR3 14INCH FHD 8GB 1TB 24GB SSD WIN 10
7                                       ASUS T100CHI 4GB/128GB ( REFURBISHED )(Black)
8                       ASUS T300CHI-FH011H 8GB Intel Core M-5Y71 2.9Ghz 12.5" Laptop
9     Acer Predator 17 X (GX-791-779W) 17.3"/i7-6820HK/32GB DDR4/512GB SSD+1TB/Nvi...
10                                   Acer Switch Alpha 12 (SA5-271-57A0) with WIN Pro
11          Acer Switch Alpha 12 (SA5-271P-76GA) - 12"
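
To give a feel for how titles like these could be matched, here is a minimal sketch that looks up the closest title to a query by plain Levenshtein (edit) distance. It is written in pure Python so that it runs without the dataset. The `levenshtein` and `closest_title` helpers and the query string are hypothetical, and the three candidate titles are copied from the listing above.

```python
def levenshtein(s1, s2):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(s2) + 1))  # distances from the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute, free if the characters match
            ))
        previous = current
    return previous[-1]


def closest_title(query, titles):
    """Return the title with the smallest edit distance to the query."""
    return min(titles, key=lambda title: levenshtein(query, title))


candidates = [
    "(Refurbished) MacBook 12-inch (5JY32LL/A)",
    "ASUS T100CHI 4GB/128GB ( REFURBISHED )(Black)",
    "Acer Switch Alpha 12 (SA5-271-57A0) with WIN Pro",
]
query = "Acer Switch Alpha 12 SA5-271-57A0 WIN Pro"  # hypothetical user input

best = closest_title(query, candidates)
print(best, levenshtein(query, best))
```

Raw edit distance grows with title length, so for titles of very different lengths it may be worth dividing by the length of the longer string before comparing scores.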

### Appendices

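`demo()` runs NLTK's built-in demonstration, which prints several distance and similarity metrics (including edit distance, Jaro and Jaro-Winkler) for a few sample string pairs.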

In [4]:
demo()

Edit distance btwn 'rain' and 'shine': 3
Edit dist with transpositions btwn 'rain' and 'shine': 3
Jaro similarity btwn 'rain' and 'shine': 0.6333333333333333
Jaro-Winkler similarity btwn 'rain' and 'shine': 0.6333333333333333
Jaro-Winkler distance btwn 'rain' and 'shine': 0.3666666666666667
Edit distance btwn 'abcdef' and 'acbdef': 2
Edit dist with transpositions btwn 'abcdef' and 'acbdef': 1
Jaro similarity btwn 'abcdef' and 'acbdef': 0.9444444444444444
Jaro-Winkler similarity btwn 'abcdef' and 'acbdef': 0.95
Jaro-Winkler distance btwn 'abcdef' and 'acbdef': 0.050000000000000044
Edit distance btwn 'language' and 'lnaguaeg': 4
Edit dist with transpositions btwn 'language' and 'lnaguaeg': 2
Jaro similarity btwn 'language' and 'lnaguaeg': 0.9166666666666666
Jaro-Winkler similarity btwn 'language' and 'lnaguaeg': 0.9249999999999999
Jaro-Winkler distance btwn 'language' and 'lnaguaeg': 0.07500000000000007
Edit distance btwn 'language' and 'lnaugage': 3
Edit dist with transpositions btwn 'l
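
To make the Jaro and Jaro-Winkler numbers above less opaque, here is a minimal pure-Python sketch of the textbook definitions. It is an illustration, not NLTK's own code: the function names are hypothetical, and the scaling factor `p=0.1` and the 4-character prefix cap are the commonly used defaults, assumed here.

```python
def jaro_sketch(s1, s2):
    """Jaro similarity: 1.0 for identical strings, 0.0 when nothing matches."""
    if not s1 or not s2:
        return 0.0
    # Characters only count as matching if they are at most `window` positions apart
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half-transpositions
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler_sketch(s1, s2, p=0.1, max_prefix=4):
    """Boost the Jaro score for strings that share a common prefix."""
    jaro = jaro_sketch(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:max_prefix], s2[:max_prefix]):
        if c1 != c2:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)


for pair in [("rain", "shine"), ("abcdef", "acbdef"), ("language", "lnaguaeg")]:
    print(pair, jaro_sketch(*pair), jaro_winkler_sketch(*pair))
```

For these three pairs the sketch should agree with the `demo()` output up to floating-point rounding. 'rain' and 'shine' share no prefix, so their Jaro and Jaro-Winkler scores coincide, whereas the other two pairs share one leading character and get a small boost.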

**References**

- [GitHub / minhcp](https://github.com/minhcp/CIKMCup17) for the dataset
- [nltk.metrics package](https://www.nltk.org/api/nltk.metrics.html)