# Shingling, lemmatization and Jaccard's similarity

Practical course material for the ASDM Class 09 (Text Mining) by Florian Leitner.

© 2017 Florian Leitner. All rights reserved.

## Shingling: generating n-grams

### Definition

In [1]:
def shingle(s, k, joiner='_'):
    """Generate `k`-sized n-grams of a string or list of strings `s`."""
    k = min(len(s), k)
    merge = lambda i: i
    
    if joiner is not None and isinstance(s, list):
        merge = lambda i: joiner.join(i)
    
    for i in range(len(s) - k + 1):
        yield merge(s[i:i+k])

### Examples

Create all character trigrams of a string:

In [2]:
list(shingle('string', 3))

['str', 'tri', 'rin', 'ing']

Create all word trigrams of a tokenized sentence:

In [3]:
list(shingle(['This', 'is', 'a', 'tokenized', 'sentence', '.'], 3))

['This_is_a', 'is_a_tokenized', 'a_tokenized_sentence', 'tokenized_sentence_.']

To keep the n-grams of a token sequence as lists instead of joined strings, pass `None` as the joiner:

In [4]:
list(shingle(['This', 'is', 'a', 'tokenized', 'sentence', '.'], 3, None))

[['This', 'is', 'a'],
 ['is', 'a', 'tokenized'],
 ['a', 'tokenized', 'sentence'],
 ['tokenized', 'sentence', '.']]
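
The definition clamps `k` to the input length and only joins `list` inputs, which leads to a few edge cases worth knowing. The following is a minimal sketch that repeats the `shingle` definition from above so it runs on its own; the variable names `short`, `empty` and `pairs` are only illustrative.

```python
def shingle(s, k, joiner='_'):
    """Same generator as defined above, repeated so this block is self-contained."""
    k = min(len(s), k)
    merge = lambda i: i
    if joiner is not None and isinstance(s, list):
        merge = lambda i: joiner.join(i)
    for i in range(len(s) - k + 1):
        yield merge(s[i:i+k])

# k larger than the input: k is clamped to len(s), so the whole input is the only shingle.
short = list(shingle('ab', 3))
print(short)

# Empty input: k becomes 0 and range(1) still yields one (empty) slice.
empty = list(shingle('', 3))
print(empty)

# Tuples are not lists, so their n-grams are returned as tuples and never joined.
pairs = list(shingle(('a', 'b', 'c'), 2))
print(pairs)
```

If empty shingles are unwanted downstream, filtering them out before building sets is one option.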

### Shingle output is not unique

In [5]:
seq = list(shingle('amabama', 2))
print("len(", type(seq), ") =", len(seq))

len( <class 'list'> ) = 6


In [6]:
seq = set(shingle('amabama', 2))
print("len(", type(seq), ") =", len(seq))

len( <class 'set'> ) = 4
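
To see which shingles repeat, the multiplicities can be counted with `collections.Counter` from the standard library. This is a small sketch that repeats the `shingle` definition from above so it runs on its own; `counts` is just an illustrative name.

```python
from collections import Counter

def shingle(s, k, joiner='_'):
    """Same generator as defined above, repeated so this block is self-contained."""
    k = min(len(s), k)
    merge = lambda i: i
    if joiner is not None and isinstance(s, list):
        merge = lambda i: joiner.join(i)
    for i in range(len(s) - k + 1):
        yield merge(s[i:i+k])

# Count how often each character bigram occurs in the string.
counts = Counter(shingle('amabama', 2))
print(counts)

# The total is the list length; the number of distinct keys is the set size.
print(sum(counts.values()), len(counts))
```

The bigrams `am` and `ma` each occur twice, which accounts for the difference between the list length (6) and the set size (4).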


## Lemmatization with spaCy

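To run these examples, you need `spacy` installed and the small English web model downloaded. Note that the imports used below (`spacy.lemmatizer` and the `LEMMA_*` tables) belong to an early spaCy 2.x API and were removed in later releases: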

```bash
pip3 install spacy
python3 -m spacy download en_core_web_sm
```

In [7]:
import spacy
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES

Load spaCy's English NLP model and instantiate a Lemmatizer object.

In [8]:
nlp = spacy.load("en_core_web_sm")
lemmatize = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)

In [9]:
help(lemmatize)

Help on Lemmatizer in module spacy.lemmatizer object:

class Lemmatizer(builtins.object)
 |  Lemmatizer(index=None, exceptions=None, rules=None, lookup=None)
 |  
 |  The Lemmatizer supports simple part-of-speech-sensitive suffix rules and
 |  lookup tables.
 |  
 |  DOCS: https://spacy.io/api/lemmatizer
 |  
 |  Methods defined here:
 |  
 |  __call__(self, string, univ_pos, morphology=None)
 |      Call self as a function.
 |  
 |  __init__(self, index=None, exceptions=None, rules=None, lookup=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  adj(self, string, morphology=None)
 |  
 |  is_base_form(self, univ_pos, morphology=None)
 |      Check whether we're dealing with an uninflected paradigm, so we can
 |      avoid lemmatization entirely.
 |  
 |  lookup(self, string)
 |  
 |  noun(self, string, morphology=None)
 |  
 |  punct(self, string, morphology=None)
 |  
 |  verb(self, string, morphology=None)
 |  
 |  ----------------------------------

Notice that calling `lemmatize` returns a *list* of values: where ambiguity remains, spaCy can return several possible lemmas for the given input.

In [10]:
for token in nlp("I was here"):
    print(token, token.pos_, lemmatize(str(token), token.pos), sep="\t")

I	PRON	['i']
was	VERB	['be']
here	ADV	['here']


## Jaccard's similarity

### Definition

In [11]:
def jaccard(seq1, seq2):
    """The Jaccard similarity between two sequences: |∩(X,Y)| / |∪(X,Y)|."""
    x = set(seq1)
    y = set(seq2)
    # In Python 3, `/` is true division and returns a float even for two ints
    # See "Banker's rounding, probabilities, and underflows"
    return len(x & y) / len(x | y)
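
One edge case of this definition: if both inputs are empty, the union is empty and the division raises a `ZeroDivisionError`. Below is a hedged sketch of a guarded variant. The name `jaccard_safe` and the parameter `empty_value` are hypothetical, and treating two empty sets as identical (similarity 1.0) is an assumed convention, not something the definition above prescribes.

```python
def jaccard_safe(seq1, seq2, empty_value=1.0):
    """Jaccard similarity with a guard for two empty inputs.

    `jaccard_safe` and `empty_value` are hypothetical names for this sketch;
    returning 1.0 for two empty sets is an assumed convention.
    """
    x = set(seq1)
    y = set(seq2)
    union = x | y
    if not union:
        # Both inputs are empty: avoid dividing by zero.
        return empty_value
    return len(x & y) / len(union)

print(jaccard_safe('', ''))
print(jaccard_safe('', 'a'))
print(jaccard_safe('string', 'strang'))
```

For non-empty inputs the result is the same as with the plain `jaccard` function.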

### Examples

In [12]:
jaccard('string', 'string')

1.0

In [13]:
jaccard('string', 'strang')

0.7142857142857143

In [14]:
jaccard('string', 'other')

0.2222222222222222

### Issues

In [15]:
jaccard('string', 'other') # too large?

0.2222222222222222

In [16]:
jaccard('karlos', 'carol') # ok-ish, but high?

0.5714285714285714

In [17]:
jaccard('word', 'ford') # great

0.6

In [18]:
jaccard('alabama', 'malba') # really bad!

1.0
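
The perfect score for `'alabama'` and `'malba'` follows from how sets work: converting a string to a set discards both character order and repetition. A short sketch, repeating the `jaccard` definition from above so it runs on its own (the names `a` and `b` are illustrative):

```python
def jaccard(seq1, seq2):
    """Same function as defined above, repeated so this block is self-contained."""
    x = set(seq1)
    y = set(seq2)
    return len(x & y) / len(x | y)

# Both strings reduce to the same four distinct characters.
a = set('alabama')
b = set('malba')
print(sorted(a), sorted(b))

# Identical sets mean intersection == union, hence a similarity of 1.0.
print(a == b, jaccard('alabama', 'malba'))
```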

Therefore, it is typically better to use higher-order n-grams, for example character **bigram** similarity:

In [19]:
jaccard(shingle('string', 2), shingle('other', 2)) # great

0.0

In [20]:
jaccard(shingle('karlos', 2), shingle('carol', 2)) # too small?

0.125

In [21]:
jaccard(shingle('word', 2), shingle('ford', 2)) # a bit too small?

0.5

In [22]:
jaccard(shingle('alabama', 2), shingle('malba', 2)) # still too large?

0.42857142857142855

Or even character **trigrams**:

In [23]:
jaccard(shingle('string', 3), shingle('other', 3)) # great

0.0

In [24]:
jaccard(shingle('karlos', 3), shingle('carol', 3)) # really bad!

0.0

In [25]:
jaccard(shingle('word', 3), shingle('ford', 3)) # too small?

0.3333333333333333

In [26]:
jaccard(shingle('alabama', 3), shingle('malba', 3)) # great

0.0

### Multigram Jaccard similarity

In [27]:
def multigram(seq):
    """Combine the unigrams and bigrams of `seq` into one list."""
    return list(seq) + list(shingle(seq, 2))
                                                   
print(sorted(multigram('string')))
print(sorted(multigram(['This', 'is', 'a', 'tokenized', 'sentence', '.'])))

['g', 'i', 'in', 'n', 'ng', 'r', 'ri', 's', 'st', 't', 'tr']
['.', 'This', 'This_is', 'a', 'a_tokenized', 'is', 'is_a', 'sentence', 'sentence_.', 'tokenized', 'tokenized_sentence']


In [28]:
jaccard(multigram('karlos'), multigram('carol')) # reasonable...

0.3333333333333333

In [29]:
jaccard(multigram('word'), multigram('ford')) # reasonable...

0.5555555555555556

In [30]:
jaccard(multigram('alabama'), multigram('malba')) # reasonable...

0.6363636363636364

In [31]:
jaccard(multigram('string'), multigram('other')) # reasonable...

0.1111111111111111

In [32]:
jaccard(multigram('string'), multigram('string')) # of course (but do check!)

1.0

In [33]:
jaccard(multigram('string'), multigram('strang')) # reasonable...

0.5714285714285714

### Conclusion

In the end, whether to use unigram, bigram, or perhaps trigram sets (or a combination of them) is a problem-specific decision that depends on the scenario.
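
To support that decision, it can help to view all variants side by side for the same word pairs. The following sketch repeats the three definitions from above so it runs on its own; the `pairs` list, the `features` mapping and the `scores` dictionary are illustrative helpers, not part of the course code.

```python
def shingle(s, k, joiner='_'):
    """Same generator as defined above."""
    k = min(len(s), k)
    merge = lambda i: i
    if joiner is not None and isinstance(s, list):
        merge = lambda i: joiner.join(i)
    for i in range(len(s) - k + 1):
        yield merge(s[i:i+k])

def jaccard(seq1, seq2):
    """Same function as defined above."""
    x = set(seq1)
    y = set(seq2)
    return len(x & y) / len(x | y)

def multigram(seq):
    """Same function as defined above: unigrams plus bigrams."""
    return list(seq) + list(shingle(seq, 2))

# Word pairs used throughout this notebook.
pairs = [('string', 'other'), ('karlos', 'carol'),
         ('word', 'ford'), ('alabama', 'malba')]

# Each entry maps a label to a function that turns a string into features.
features = {
    'unigram': lambda s: s,
    'bigram': lambda s: shingle(s, 2),
    'trigram': lambda s: shingle(s, 3),
    'multigram': multigram,
}

scores = {}
for a, b in pairs:
    scores[(a, b)] = {name: jaccard(f(a), f(b)) for name, f in features.items()}
    row = "  ".join(f"{name}={value:.2f}" for name, value in scores[(a, b)].items())
    print(f"{a}/{b}: {row}")
```

Each printed row shows how strongly the score for one pair depends on the chosen n-gram order.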