# What is NLP?

Natural language processing (NLP) is a branch of artificial intelligence focused on enabling computers to understand and process human language, with the goal of bringing them closer to a human-level understanding of it.


# Tokens

Given a character sequence and a defined document unit, tokenization is the task of chopping the sequence into pieces called tokens, perhaps discarding certain characters, such as punctuation, at the same time. Here is an example of tokenization:

Input: Friends, Romans, Countrymen, lend me your ears; 
Output: Friends | Romans | Countrymen | lend | me | your | ears
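
The example above can be reproduced with a few lines of Python. This is only a minimal sketch that assumes a token is any run of word characters; the `tokenize` helper is a name made up for illustration, and real tokenizers (such as the NLTK one used later in this notebook) handle many more cases.

```python
import re

def tokenize(text):
    """Split text into word tokens, discarding punctuation and whitespace."""
    # \w+ matches runs of letters, digits, and underscores
    return re.findall(r"\w+", text)

tokens = tokenize("Friends, Romans, Countrymen, lend me your ears;")
print(" | ".join(tokens))
# Friends | Romans | Countrymen | lend | me | your | ears
```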

# Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form, which is generally a written word form. Examples:

* Walking, walked, walks, walk: Walk
* Construction, constructed, constructor: Construct
* Catlike, catty, cats: Cat
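
To make the idea concrete, here is a deliberately naive suffix-stripping stemmer. It is a toy sketch, not a real stemming algorithm: the suffix list, the `toy_stem` name, and the minimum stem length are all assumptions chosen so that the "walk" and "construct" examples above work. Established stemmers such as the Porter stemmer use far more elaborate rules.

```python
# Suffixes to strip, checked in order (hypothetical and far from complete)
SUFFIXES = ("ing", "ion", "ed", "or", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    # No suffix matched: the word is treated as its own stem
    return word

for w in ["Walking", "walked", "walks", "walk",
          "Construction", "constructed", "constructor"]:
    print(f"{w} -> {toy_stem(w)}")
```

Because the rules look only at the end of the string, a stemmer like this never needs a dictionary, which is also why it cannot relate irregular forms to each other.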

# Lemmas

Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

Unlike stemming, lemmatization depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document. As a result, developing efficient lemmatization algorithms is an open area of research.

Examples:

1. The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.
2. The word "walk" is the base form of the word "walking", so this is matched by both stemming and lemmatization.
3. The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context, e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatization attempts to select the correct lemma depending on the context.
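
The three examples above can be mimicked with a dictionary look-up keyed by both the word and its part of speech. This is a toy sketch: the `LEXICON` table, the part-of-speech labels, and the `toy_lemmatize` name are invented for illustration, and the part of speech is passed in by hand rather than inferred from context, which is the hard part a real lemmatizer has to solve.

```python
# Hypothetical miniature lexicon: (surface form, part of speech) -> lemma
LEXICON = {
    ("better", "ADJ"): "good",
    ("walking", "VERB"): "walk",
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for a word given its part of speech."""
    word = word.lower()
    # Fall back to the word itself when the pair is not in the lexicon
    return LEXICON.get((word, pos), word)

print(toy_lemmatize("better", "ADJ"))    # good
print(toy_lemmatize("meeting", "NOUN"))  # meeting
print(toy_lemmatize("meeting", "VERB"))  # meet
```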

# N-Grams

In general, an n-gram is a contiguous sequence of n items taken from a string. For the string "abcde", the bigrams are ab, bc, cd, and de; the trigrams are abc, bcd, and cde; and the 4-grams are abcd and bcde.

Another example is the phrase "I like to eat pancakes in the morning". If we tokenize it and group the tokens in runs of 3, that is, into 3-grams or trigrams, we get the following:
1. I like to
2. like to eat
3. to eat pancakes
4. eat pancakes in
5. pancakes in the
6. in the morning
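
Both the character-level and the word-level examples can be generated with the same sliding-window helper. This is a minimal sketch; the `ngrams` function below is written from scratch for illustration and is not the NLTK function of the same name used later in this notebook.

```python
def ngrams(items, n):
    """Return every contiguous run of n items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Character bigrams of a string
char_bigrams = ["".join(g) for g in ngrams("abcde", 2)]
print(char_bigrams)  # ['ab', 'bc', 'cd', 'de']

# Word trigrams of a whitespace-tokenized phrase
words = "I like to eat pancakes in the morning".split()
word_trigrams = [" ".join(g) for g in ngrams(words, 3)]
for trigram in word_trigrams:
    print(trigram)
```

A sequence of length m yields m - n + 1 n-grams, so the 8-word phrase produces 6 trigrams.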

# Similarity Functions

## Jaccard Index
The Jaccard index is a statistic used for gauging the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

<img src="images/jaccard_formula.svg">

<img src="images/visual_jaccard.png">

The Jaccard index is not limited to NLP applications. It is also used in computer vision, where it is commonly called intersection over union (IoU):
<img src="images/stop_sign_jaccard.jpg">

### Practical examples


Consider the following phrases:

1. The bird is singing
2. The dog is barking
3. The lady is singing

#### Example 1
If we were to calculate the Jaccard Index similarity between phrases 1 and 2 we would get:

* The union of all tokens: The | bird | dog | is | singing | barking
* The intersection of the two sets: The | is

We can then calculate the Jaccard Index by dividing the size of the intersection set by the size of the union set:

Jaccard Index = 2/6 ≈ 0.333 = *33% Similar*

#### Example 2

If we were to calculate the Jaccard Index similarity between phrases 1 and 3 we would get:

* The union of all tokens: The | bird | lady | is | singing 
* The intersection of the two sets: The | is | singing

We can then calculate the Jaccard Index by dividing the size of the intersection set by the size of the union set:

Jaccard Index = 3/5 = 0.6 = *60% Similar*
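
The two worked examples can be checked with a short function over Python sets. This is a sketch under simple assumptions: tokens are obtained by splitting on whitespace, and the `jaccard_index` helper (including its choice to treat two empty sets as identical) is defined here for illustration only.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(union)

phrases = [
    "The bird is singing",
    "The dog is barking",
    "The lady is singing",
]
token_sets = [set(p.split()) for p in phrases]

sim_1_2 = jaccard_index(token_sets[0], token_sets[1])
sim_1_3 = jaccard_index(token_sets[0], token_sets[2])
print(f"Phrases 1 and 2: {sim_1_2:.3f}")  # 0.333
print(f"Phrases 1 and 3: {sim_1_3:.3f}")  # 0.600
```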

### Going further

Now, consider the following phrases:

> I won then I lost

> I lost then I won

If we were to just tokenize them and calculate the Jaccard Index between them, we would get 100% similarity (an index of 1.0), because both phrases contain exactly the same set of words. A person, however, can tell that they do not carry the same meaning, because the order of the words differs. To capture some of that order, we can compare N-Grams instead of single tokens.

#### Using Trigrams
* The union of all trigrams: I won then | won then I | then I lost | I lost then | lost then I | then I won
* The intersection of the two sets: empty (no trigram appears in both phrases)

Jaccard Index = 0/6 = 0.0 = *0% Similar*
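
The contrast between single tokens and trigrams can be verified in plain Python. This is a minimal sketch; `ngram_set` and `jaccard_index` are helper names introduced here for illustration, and the phrases are tokenized by splitting on whitespace.

```python
def ngram_set(tokens, n):
    """Set of all contiguous n-token tuples in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_index(a, b):
    """Intersection size over union size (both sets assumed non-empty)."""
    return len(a & b) / len(a | b)

first = "I won then I lost".split()
second = "I lost then I won".split()

unigram_sim = jaccard_index(ngram_set(first, 1), ngram_set(second, 1))
trigram_sim = jaccard_index(ngram_set(first, 3), ngram_set(second, 3))

print(f"Unigram similarity: {unigram_sim}")  # 1.0
print(f"Trigram similarity: {trigram_sim}")  # 0.0
```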

# Let's get to coding!

## 1. Basic spelling checker using Jaccard Index

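We are writing our own text editor and we would like to implement a basic spelling checker that can catch typos and suggest corrections. We rank the candidate words by their Jaccard distance to the typo, which is 1 minus the Jaccard index, so lower values mean closer matches.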

In [15]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Charlie\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [11]:
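mistake = "ligting"

# Candidate dictionary words to compare against the typo
words = ['apple', 'bag', 'drawing', 'listing', 'linking', 'living', 'lighting', 'orange', 'walking', 'zoo']

suggestions = []

for word in words:
    # Compare the sets of characters in each word; a lower distance means more similar
    jd = nltk.jaccard_distance(set(mistake), set(word))
    suggestions.append((word, jd))

# Sort candidates from most to least similar
suggestions.sort(key=lambda sug: sug[1])

for word, jd in suggestions:
    print(f"{word}: {jd}")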

listing: 0.16666666666666666
lighting: 0.16666666666666666
linking: 0.3333333333333333
living: 0.3333333333333333
walking: 0.5
drawing: 0.6666666666666666
orange: 0.7777777777777778
bag: 0.8571428571428571
apple: 0.875
zoo: 1.0
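
Note that "listing" and "lighting" tie, because comparing sets of single characters ignores their order. One possible refinement, sketched below under the assumption that character bigrams are a reasonable unit, is to compare sets of bigrams instead. The helpers `char_ngrams` and `jaccard_distance` are written in plain Python for illustration and are not NLTK functions.

```python
def char_ngrams(word, n=2):
    """Set of all contiguous n-character substrings of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard_distance(a, b):
    """1 minus the Jaccard index (both sets assumed non-empty)."""
    return 1 - len(a & b) / len(a | b)

mistake = "ligting"
words = ['apple', 'bag', 'drawing', 'listing', 'linking', 'living',
         'lighting', 'orange', 'walking', 'zoo']

# Rank candidates by the distance between their bigram sets and the typo's
ranked = sorted(
    ((w, jaccard_distance(char_ngrams(mistake), char_ngrams(w))) for w in words),
    key=lambda pair: pair[1],
)

for word, dist in ranked[:3]:
    print(f"{word}: {dist}")
```

With bigrams, "lighting" (distance 0.375) now ranks ahead of "listing" (distance 0.5), since it shares the "ig" bigram with the typo.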


## 2. Using N-Grams with Jaccard

Let's check how similar two texts are. We'll first calculate the distance using their unigrams, and then see how trigrams let us carry contextual meaning (word order) into our distance calculations.

In [12]:
sentences = [
    "I won then I lost",
    "I lost then I won",
    "It might help to re-install Python if possible.",
    "It can help to install Python again if possible.",
]

In [16]:
tokens = [nltk.word_tokenize(s) for s in sentences]

print(f"Tokens: {tokens}")

Tokens: [['I', 'won', 'then', 'I', 'lost'], ['I', 'lost', 'then', 'I', 'won'], ['It', 'might', 'help', 'to', 're-install', 'Python', 'if', 'possible', '.'], ['It', 'can', 'help', 'to', 'install', 'Python', 'again', 'if', 'possible', '.']]


In [27]:
unigrams = [set(nltk.ngrams(tk, n=1)) for tk in tokens]

trigrams = [set(nltk.ngrams(tk, n=3)) for tk in tokens]

print(f"Unigrams: {unigrams}")

print("\n*********************\n")

print(f"Trigrams: {trigrams}")

Unigrams: [{('I',), ('won',), ('then',), ('lost',)}, {('I',), ('won',), ('then',), ('lost',)}, {('help',), ('Python',), ('.',), ('if',), ('to',), ('It',), ('might',), ('re-install',), ('possible',)}, {('help',), ('Python',), ('.',), ('if',), ('to',), ('can',), ('It',), ('install',), ('possible',), ('again',)}]

*********************

Trigrams: [{('I', 'won', 'then'), ('won', 'then', 'I'), ('then', 'I', 'lost')}, {('I', 'lost', 'then'), ('then', 'I', 'won'), ('lost', 'then', 'I')}, {('help', 'to', 're-install'), ('It', 'might', 'help'), ('Python', 'if', 'possible'), ('might', 'help', 'to'), ('re-install', 'Python', 'if'), ('to', 're-install', 'Python'), ('if', 'possible', '.')}, {('Python', 'again', 'if'), ('install', 'Python', 'again'), ('It', 'can', 'help'), ('to', 'install', 'Python'), ('help', 'to', 'install'), ('can', 'help', 'to'), ('again', 'if', 'possible'), ('if', 'possible', '.')}]


In [29]:
# Jaccard distance between sentence 1 and 2

print("Comparing the following phrases:")
print(sentences[0])
print(sentences[1])

print("\n*********\n")

print(f"Jaccard distance between sentence 1 and 2 using unigrams: {nltk.jaccard_distance(unigrams[0], unigrams[1])}")
print(f"Jaccard distance between sentence 1 and 2 using trigrams: {nltk.jaccard_distance(trigrams[0], trigrams[1])}")

Comparing the following phrases:
I won then I lost
I lost then I won

*********

Jaccard distance between sentence 1 and 2 using unigrams: 0.0
Jaccard distance between sentence 1 and 2 using trigrams: 1.0


In [30]:
# Jaccard distance between sentence 3 and 4

print("Comparing the following phrases:")
print(sentences[2])
print(sentences[3])

print("\n*********\n")

print(f"Jaccard distance between sentence 3 and 4 using unigrams: {nltk.jaccard_distance(unigrams[2], unigrams[3])}")
print(f"Jaccard distance between sentence 3 and 4 using trigrams: {nltk.jaccard_distance(trigrams[2], trigrams[3])}")

Comparing the following phrases:
It might help to re-install Python if possible.
It can help to install Python again if possible.

*********

Jaccard distance between sentence 3 and 4 using unigrams: 0.4166666666666667
Jaccard distance between sentence 3 and 4 using trigrams: 0.9285714285714286
