Welcome to the first installment of my NLP series. My main objective in creating this series is to introduce the main tasks of NLP in an understandable fashion. I will follow the order of [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdfstart) by Jurafsky and Martin. I hope readers of this series find it useful.

#1. Regex

Regular expressions (regex, for short) are a fundamental tool that every NLP practitioner should know. We use them to search for patterns in text and to modify the matches as needed, most often for text preprocessing and normalization.

In [1]:
import re
import nltk
import textblob
import spacy


nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to /root/nltk_data...

[nltk_data]   Package wordnet is already up-to-date!

[nltk_data] Downloading package punkt to /root/nltk_data...

[nltk_data]   Package punkt is already up-to-date!


True

Here is our first search. I'll increase the complexity slowly, so please be patient if the first examples feel too easy. We will use the opening lines of Harry Potter:

In [2]:
sentence = "Mr and Mrs Dursley, of number four, Privet Drive, were proud to say \
that they were perfectly normal, thank you very much. They \
were the last people you’d expect to be involved in anything \
strange or mysterious, because they just didn’t hold with such \
nonsense."

In [3]:
re.search("Dursley",sentence)

<re.Match object; span=(11, 18), match='Dursley'>

The output tells us that the word we're looking for starts at index 11 and ends just before index 18 (the end of the span is exclusive). Now let's say I want to find Mr and Mrs at the same time. Since we are searching for multiple matches, we will use `.finditer()`:

In [4]:
for word in re.finditer("Mr.",sentence):
  print(word)

<re.Match object; span=(0, 3), match='Mr '>

<re.Match object; span=(7, 10), match='Mrs'>


`.` matches exactly one character of any kind (except a newline) after Mr. As you may have noticed, it has also captured the whitespace after Mr. We can correct this:

In [5]:
for word in re.finditer(r"Mr\w{0,1}",sentence):
  print(word)

<re.Match object; span=(0, 2), match='Mr'>

<re.Match object; span=(7, 10), match='Mrs'>


Perfect. `\w` matches any word character (a letter, a digit, or an underscore) and `{0,1}` means zero or one occurrence. Now imagine that there are multiple Mr and Mrs, but I only want to capture Mr when it is at the beginning of the string:

In [6]:
trial = "Mr Dursley Mrs Dursley, Mr Ariely."
for word in re.finditer("^Mr",trial):
  print(word)

<re.Match object; span=(0, 2), match='Mr'>


or at the end of the string:

In [7]:
trial = "Mr Dursley Mrs Dursley, Mr"
for word in re.finditer("Mr$",trial):
  print(word)

<re.Match object; span=(24, 26), match='Mr'>
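Note that `^` and `$` anchor to the start and end of the whole string, not of each sentence or line. A minimal sketch, assuming a made-up two-line string (`lines` is a hypothetical variable, not one of the examples above), shows how the `re.MULTILINE` flag makes `^` match at every line start instead:

```python
import re

# Hypothetical two-line input, made up for illustration.
lines = "Mr Dursley\nMr Ariely"

# Without flags, ^ only matches at the very start of the string.
single = re.findall(r"^Mr", lines)
print(len(single))

# With re.MULTILINE, ^ also matches right after each newline.
multi = re.findall(r"^Mr", lines, flags=re.MULTILINE)
print(len(multi))
```

The same flag changes `$` so that it matches at the end of every line.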


Maybe I want the search to ignore the case of the first letter:

In [8]:
trial = "Mr Dursley Mrs Dursley, Mr Ariely and cat sound mr." # Don't think about the meanings for now I've just made them up.
for word in re.finditer("[Mm]r",trial):
  print(word)

<re.Match object; span=(0, 2), match='Mr'>

<re.Match object; span=(11, 13), match='Mr'>

<re.Match object; span=(24, 26), match='Mr'>

<re.Match object; span=(48, 50), match='mr'>
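The character class `[Mm]` only relaxes the first letter. If the whole pattern should ignore case, the `re.IGNORECASE` flag is an alternative. Here is a minimal sketch on the same made-up string:

```python
import re

trial = "Mr Dursley Mrs Dursley, Mr Ariely and cat sound mr."

# re.IGNORECASE makes every letter in the pattern case-insensitive,
# so "MR" or "mR" would match as well, unlike with "[Mm]r".
matches = [m.group() for m in re.finditer(r"mr", trial, flags=re.IGNORECASE)]
print(matches)
```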


Now let's proceed with more complex patterns. Let's say I want to find all the percentages within the text.

In [9]:
trial = "20% 100% 1000% ema%" # Don't think about the meanings for now I've just made them up.
for word in re.finditer(r"\w{2,3}%",trial):
  print(word)

<re.Match object; span=(0, 3), match='20%'>

<re.Match object; span=(4, 8), match='100%'>

<re.Match object; span=(10, 14), match='000%'>

<re.Match object; span=(15, 19), match='ema%'>


But I don't want to catch ema%. Moreover, 000 should be 1000. So I use `\d` or `[0-9]` (for plain ASCII digits like these, they behave the same).

In [10]:
trial = "20% 100% 1000% ema%" # Don't think about the meanings for now I've just made them up.
for word in re.finditer(r"\d{2,3}%",trial):
  print(word)

<re.Match object; span=(0, 3), match='20%'>

<re.Match object; span=(4, 8), match='100%'>

<re.Match object; span=(10, 14), match='000%'>


Now we have solved the first part of the problem, but we still have 000 instead of 1000 in the output. If we don't know how many digits precede the percent sign, a fixed range may fail to catch every pattern correctly. For these situations we use `+` (one or more occurrences).

In [11]:
trial = "20% 100% 1000% ema%" # Don't think about the meanings for now I've just made them up.
for word in re.finditer(r"\d+%",trial):
  print(word)

<re.Match object; span=(0, 3), match='20%'>

<re.Match object; span=(4, 8), match='100%'>

<re.Match object; span=(9, 14), match='1000%'>


I can get the same matches on this string with a different pattern:

In [12]:
trial = "20% 100% 1000% ema%" # Don't think about the meanings for now I've just made them up.
for word in re.finditer(r"\w*%\s",trial):
  print(word)

<re.Match object; span=(0, 4), match='20% '>

<re.Match object; span=(4, 9), match='100% '>

<re.Match object; span=(9, 15), match='1000% '>


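`\s` stands for whitespace and `*` stands for zero or more occurrences. Note that `ema%` is skipped here only because it is not followed by whitespace; `\w*` would otherwise match its letters. Don't forget to strip the trailing whitespace afterwards.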

In [13]:
trial = "20% 100% 1000% ema%" # Don't think about the meanings for now I've just made them up.
for word in re.finditer(r"\w*%\s",trial):
  print(word.group()[:-1])

20%

100%

1000%
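Slicing off the last character works, but the pattern itself can avoid consuming the whitespace. One option, sketched here under the assumption that a percentage must be followed by whitespace or the end of the string, is a lookahead `(?=\s|$)`, which checks what comes next without including it in the match:

```python
import re

trial = "20% 100% 1000% ema%"

# \d+%      -> one or more digits followed by a percent sign
# (?=\s|$)  -> must be followed by whitespace or end of string (not consumed)
percentages = re.findall(r"\d+%(?=\s|$)", trial)
print(percentages)
```

Because `\d+` is used instead of `\w*`, `ema%` is rejected for the right reason rather than by its position in the string.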


We just saw that `^` anchors the pattern to the *start* of the string. Inside square brackets (`[^...]`), `^` instead means *not*. Here is an example:

In [14]:
trial = " Mars, mars" # Don't think about the meanings for now I've just made them up.
for word in re.finditer(r"[^A-Z]\s\w+",trial):
  print(word)

<re.Match object; span=(5, 11), match=', mars'>


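Let's decode it piece by piece.
* [^A-Z] -> exactly one character that is not a capital letter
* \s -> a whitespace character
* \w+ -> one or more word characters

The comma satisfies the first rule, so `, mars` is captured. *Mars* is not captured because nothing before it fits the first two rules: the only preceding character is the leading space, and it is not followed by another whitespace character.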
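A negated class such as `[^A-Z]` always consumes exactly one character. The small sketch below makes this visible and shows one way to pick out words that start with a lowercase letter, using the word boundary `\b`. The short sample string and the alternative pattern are my own assumptions.

```python
import re

# [^A-Z] matches any single character that is not an uppercase ASCII letter.
non_capitals = re.findall(r"[^A-Z]", "aBc")
print(non_capitals)

# \b marks a word boundary, so this finds words whose first letter is lowercase.
trial = " Mars, mars"
lowercase_words = re.findall(r"\b[a-z]\w*", trial)
print(lowercase_words)
```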

One last thing that I want to show is `|` (or). Here is the most basic application:

In [15]:
trial = "mars,martini" # Don't think about the meanings for now I've just made them up.
for word in re.finditer("martini|mars",trial):
  print(word)

<re.Match object; span=(0, 4), match='mars'>

<re.Match object; span=(5, 12), match='martini'>


We can combine `|` with parentheses to define a pattern rather than a whole word:

In [16]:
trial = "mars,martini" # Don't think about the meanings for now I've just made them up.
for word in re.finditer("(mar)(tini|s)",trial):
  print(word)

<re.Match object; span=(0, 4), match='mars'>

<re.Match object; span=(5, 12), match='martini'>
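Parentheses also create capture groups, so each part of a match can be read separately. A minimal sketch on the same string, where `(?:...)` is the non-capturing form for when only the grouping is needed:

```python
import re

trial = "mars,martini"

parts = []
for match in re.finditer(r"(mar)(tini|s)", trial):
    # group(0) is the whole match; group(1) and group(2) are the parenthesized parts.
    parts.append((match.group(0), match.group(1), match.group(2)))
print(parts)

# (?:...) groups without capturing, so findall returns the whole matches.
whole_words = re.findall(r"mar(?:tini|s)", trial)
print(whole_words)
```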


Congratulations! You've just completed the first part. In my experience, regex is easily forgotten if you don't use it regularly. Whenever I need it, I first visit this [website](https://www.w3schools.com/python/python_regex.asp) to refresh my memory about the meanings of the symbols, and then I write my patterns.

#2. Text Normalization

Language is complex. Therefore, we often need to preprocess a text before giving it to a model. Here are the main preprocessing steps:


## 1.1. Case folding
Case folding usually means lowercasing the text. By doing that we can overcome problems caused by case sensitivity.



In [17]:
sentence = "Mr and Mrs Dursley, of number four, Privet Drive, were proud to say \
that they were perfectly normal, thank you very much. They \
were the last people you’d expect to be involved in anything \
strange or mysterious, because they just didn’t hold with such \
nonsense."
print(sentence.lower())

mr and mrs dursley, of number four, privet drive, were proud to say that they were perfectly normal, thank you very much. they were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense.
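Python strings also offer `casefold()`, a more aggressive variant of `lower()` intended for caseless comparison. For English text the two usually give the same result. The German sample word below is my own assumption, chosen only to show where they differ:

```python
# "Straße" is a sample word picked for illustration; it is not from the text above.
word = "Straße"

lowered = word.lower()      # keeps the sharp s
folded = word.casefold()    # expands the sharp s to "ss"

print(lowered, folded)

# Caseless comparison: both sides are folded before comparing.
same = folded == "STRASSE".casefold()
print(same)
```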


## 1.2. Tokenization

Tokenization means splitting a text into tokens, usually words and punctuation marks. We can do it manually or with a library. I recommend using a library because it handles many edge cases for us. For illustration purposes, let's first try a couple of imperfect tokenizations:

In [18]:
sentence = "Mr and Mrs Dursley, of number four, Privet Drive, were proud to say \
that they were perfectly normal, thank you very much. They \
were the last people you’d expect to be involved in anything \
strange or mysterious, because they just didn’t hold with such \
nonsense."

print(sentence.split())

['Mr', 'and', 'Mrs', 'Dursley,', 'of', 'number', 'four,', 'Privet', 'Drive,', 'were', 'proud', 'to', 'say', 'that', 'they', 'were', 'perfectly', 'normal,', 'thank', 'you', 'very', 'much.', 'They', 'were', 'the', 'last', 'people', 'you’d', 'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange', 'or', 'mysterious,', 'because', 'they', 'just', 'didn’t', 'hold', 'with', 'such', 'nonsense.']


As you may have noticed, the punctuation marks are stuck to the words. So if I want to use `split()`, I should first remove the punctuation or separate it from the words. Here is a way to exclude punctuation using `re`:

In [19]:
print(re.findall(r"\w+",sentence))

['Mr', 'and', 'Mrs', 'Dursley', 'of', 'number', 'four', 'Privet', 'Drive', 'were', 'proud', 'to', 'say', 'that', 'they', 'were', 'perfectly', 'normal', 'thank', 'you', 'very', 'much', 'They', 'were', 'the', 'last', 'people', 'you', 'd', 'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange', 'or', 'mysterious', 'because', 'they', 'just', 'didn', 't', 'hold', 'with', 'such', 'nonsense']


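Since punctuation marks are not word characters, they were excluded. Note that the apostrophes were dropped as well, which split you’d and didn’t into two tokens each. Alternatively, we can keep the punctuation and put a space before each mark: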

In [20]:
import string

char_list = []
for char in sentence:
  if char in string.punctuation:
    char_list.append(' ')

  char_list.append(char)

punctuation_spaced = ''.join(char_list)
print(punctuation_spaced)



Mr and Mrs Dursley , of number four , Privet Drive , were proud to say that they were perfectly normal , thank you very much . They were the last people you’d expect to be involved in anything strange or mysterious , because they just didn’t hold with such nonsense .


Now I can tokenize:

In [21]:
print(punctuation_spaced.split())

['Mr', 'and', 'Mrs', 'Dursley', ',', 'of', 'number', 'four', ',', 'Privet', 'Drive', ',', 'were', 'proud', 'to', 'say', 'that', 'they', 'were', 'perfectly', 'normal', ',', 'thank', 'you', 'very', 'much', '.', 'They', 'were', 'the', 'last', 'people', 'you’d', 'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange', 'or', 'mysterious', ',', 'because', 'they', 'just', 'didn’t', 'hold', 'with', 'such', 'nonsense', '.']
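The two manual approaches can also be merged into a single regex that keeps punctuation as separate tokens and leaves contractions intact. This is only a sketch: the pattern and the shortened sample sentence are my own assumptions, and real tokenizers handle many more cases.

```python
import re

sample = "They just didn’t hold with such nonsense."

# \w+(?:[’']\w+)*  -> a word, optionally continued by an apostrophe plus more word characters
# [^\w\s]          -> any single character that is neither a word character nor whitespace
tokens = re.findall(r"\w+(?:[’']\w+)*|[^\w\s]", sample)
print(tokens)
```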


Now let's proceed to some more common ways to tokenize a text. We have several modules for this task.

In [22]:
#use nltk tokenizer
print(nltk.word_tokenize(sentence))

['Mr', 'and', 'Mrs', 'Dursley', ',', 'of', 'number', 'four', ',', 'Privet', 'Drive', ',', 'were', 'proud', 'to', 'say', 'that', 'they', 'were', 'perfectly', 'normal', ',', 'thank', 'you', 'very', 'much', '.', 'They', 'were', 'the', 'last', 'people', 'you', '’', 'd', 'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange', 'or', 'mysterious', ',', 'because', 'they', 'just', 'didn', '’', 't', 'hold', 'with', 'such', 'nonsense', '.']


In [23]:
#use spacy tokenizer
nlp = spacy.load("en_core_web_sm")

tokenized_text = nlp(sentence)

In [24]:
tokens = []
for token in tokenized_text:
  tokens.append(token)

print(tokens)

[Mr, and, Mrs, Dursley, ,, of, number, four, ,, Privet, Drive, ,, were, proud, to, say, that, they, were, perfectly, normal, ,, thank, you, very, much, ., They, were, the, last, people, you, ’d, expect, to, be, involved, in, anything, strange, or, mysterious, ,, because, they, just, did, n’t, hold, with, such, nonsense, .]


## 1.3. Lemmatization

Lemmatization is the process of reducing a word to its dictionary form (its lemma) or, more formally,

*to reduce the different forms of a word to one single form, for example, reducing "builds", "building", or "built" to the lemma "build"* [1](https://dictionary.cambridge.org/dictionary/english/lemmatize)

In [25]:
texts = ["builds", "building", "built"]
[textblob.Word(text).lemmatize() for text in texts]

['build', 'building', 'built']

As you may have noticed, the results are not the same as in the dictionary definition. This is because we didn't specify the part of speech. Is the word a noun, a verb, or an adjective? Since we did not say, the lemmatizer fell back to its default, which is noun. We know that these should be verbs, so let's pass `'v'` and get the results.

In [26]:
texts = ["builds", "building", "built"]
[textblob.Word(text).lemmatize('v') for text in texts]

['build', 'build', 'build']

## 1.4. Stemming

Stemming is similar to lemmatization, but instead of using a dictionary it uses heuristics to chop off word endings. This may produce stems that are not actual words. To compare the two, I picked 10 random *nouns* from a [random noun generator](https://randomwordgenerator.com/noun.php) and processed the words in both ways:

In [27]:
nouns = ["industry", "agency","hearing","promotion","opportunity","mom","manufacturer","database","skill","hotel"]
print("LEMMA ----- STEM\n")
for noun in nouns:
  print(textblob.Word(noun).lemmatize(),"-----",textblob.Word(noun).stem())

LEMMA ----- STEM



industry ----- industri

agency ----- agenc

hearing ----- hear

promotion ----- promot

opportunity ----- opportun

mom ----- mom

manufacturer ----- manufactur

database ----- databas

skill ----- skill

hotel ----- hotel
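To make the idea of heuristic chopping concrete, here is a deliberately naive suffix stripper. It is only a toy under my own assumptions: the suffix list and the `naive_stem` name are made up, and it is far simpler than the rule-based stemmers that libraries ship, so its results can differ from the table above.

```python
# Hypothetical suffix list, checked in order; real stemmers use many more rules.
SUFFIXES = ("ing", "ion", "er", "s")

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

for noun in ["hearing", "promotion", "manufacturer", "skill", "hotel"]:
    print(noun, "-----", naive_stem(noun))
```

No dictionary is consulted at any point, which is why a stem such as `promot` does not have to be a real word.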


## 1.5. Sentence Segmentation

Sentence segmentation is simply splitting the text into sentences. Again, several modules support this. Here is how you can do it with textblob:

In [28]:
text = "Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say \
that they were perfectly normal, thank you very much. They were the last \
people you'd expect to be involved in anything strange or mysterious, \
because they just didn't hold with such nonsense. \
Mr. Dursley was the director of a firm called Grunnings, which made \
drills. He was a big, beefy man with hardly any neck, although he did \
have a very large mustache. Mrs. Dursley was thin and blonde and had \
nearly twice the usual amount of neck, which came in very useful as she \
spent so much of her time craning over garden fences, spying on the \
neighbors. The Dursleys had a small son called Dudley and in their \
opinion there was no finer boy anywhere. "

text = textblob.TextBlob(text)
print(text.sentences)

[Sentence("Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much."), Sentence("They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense."), Sentence("Mr. Dursley was the director of a firm called Grunnings, which made drills."), Sentence("He was a big, beefy man with hardly any neck, although he did have a very large mustache."), Sentence("Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors."), Sentence("The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere.")]
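It is worth seeing why a library is helpful here. A naive splitter that breaks after every `.`, `!` or `?` followed by whitespace trips over abbreviations such as "Mr.". The sketch below uses a shortened sample and a pattern of my own choosing:

```python
import re

sample = "Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man."

# Split on whitespace that follows ., ! or ? (the lookbehind keeps the punctuation).
naive_sentences = re.split(r"(?<=[.!?])\s+", sample)
print(naive_sentences)
```

The first "sentence" comes out as just `Mr.`, whereas the textblob output above keeps "Mr. Dursley" together.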


**Summary:** Text normalization is a standardization process. It often involves case folding, tokenization, lemmatization, stemming, and sentence segmentation.

#3. N-grams

An n-gram is a sequence of n consecutive tokens. N-grams are important in NLP applications and are used in many models. I find them similar to the sliding window concept in time series. Let's build a custom bigram function first.

In [29]:
def create_bigrams(text):
  bigrams = []
  tokenized_text = re.findall(r"\w+",text)
  for i in range(len(tokenized_text)-1):
    bigrams.append((tokenized_text[i],tokenized_text[i+1]))

  return bigrams

In [30]:
text = "Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say \
that they were perfectly normal, thank you very much. They were the last \
people you'd expect to be involved in anything strange or mysterious, \
because they just didn't hold with such nonsense. \
Mr. Dursley was the director of a firm called Grunnings, which made \
drills. He was a big, beefy man with hardly any neck, although he did \
have a very large mustache. Mrs. Dursley was thin and blonde and had \
nearly twice the usual amount of neck, which came in very useful as she \
spent so much of her time craning over garden fences, spying on the \
neighbors. The Dursleys had a small son called Dudley and in their \
opinion there was no finer boy anywhere. "

bigrams = create_bigrams(text)
print(bigrams)


[('Mr', 'and'), ('and', 'Mrs'), ('Mrs', 'Dursley'), ('Dursley', 'of'), ('of', 'number'), ('number', 'four'), ('four', 'Privet'), ('Privet', 'Drive'), ('Drive', 'were'), ('were', 'proud'), ('proud', 'to'), ('to', 'say'), ('say', 'that'), ('that', 'they'), ('they', 'were'), ('were', 'perfectly'), ('perfectly', 'normal'), ('normal', 'thank'), ('thank', 'you'), ('you', 'very'), ('very', 'much'), ('much', 'They'), ('They', 'were'), ('were', 'the'), ('the', 'last'), ('last', 'people'), ('people', 'you'), ('you', 'd'), ('d', 'expect'), ('expect', 'to'), ('to', 'be'), ('be', 'involved'), ('involved', 'in'), ('in', 'anything'), ('anything', 'strange'), ('strange', 'or'), ('or', 'mysterious'), ('mysterious', 'because'), ('because', 'they'), ('they', 'just'), ('just', 'didn'), ('didn', 't'), ('t', 'hold'), ('hold', 'with'), ('with', 'such'), ('such', 'nonsense'), ('nonsense', 'Mr'), ('Mr', 'Dursley'), ('Dursley', 'was'), ('was', 'the'), ('the', 'director'), ('director', 'of'), ('of', 'a'), ('a', 

Or we can extend it to n-grams of any length.

In [31]:
def create_n_grams(text,n):
  ngrams = []
  tokenized_text = re.findall(r'\w+',text)
  for i in range(len(tokenized_text)-n+1):
    ngrams.append(tuple(tokenized_text[i:i+n]))

  return ngrams


In [32]:
text = "Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say \
that they were perfectly normal, thank you very much. They were the last \
people you'd expect to be involved in anything strange or mysterious, \
because they just didn't hold with such nonsense. \
Mr. Dursley was the director of a firm called Grunnings, which made \
drills. He was a big, beefy man with hardly any neck, although he did \
have a very large mustache. Mrs. Dursley was thin and blonde and had \
nearly twice the usual amount of neck, which came in very useful as she \
spent so much of her time craning over garden fences, spying on the \
neighbors. The Dursleys had a small son called Dudley and in their \
opinion there was no finer boy anywhere. "

ngrams = create_n_grams(text,3)
print(ngrams)

[('Mr', 'and', 'Mrs'), ('and', 'Mrs', 'Dursley'), ('Mrs', 'Dursley', 'of'), ('Dursley', 'of', 'number'), ('of', 'number', 'four'), ('number', 'four', 'Privet'), ('four', 'Privet', 'Drive'), ('Privet', 'Drive', 'were'), ('Drive', 'were', 'proud'), ('were', 'proud', 'to'), ('proud', 'to', 'say'), ('to', 'say', 'that'), ('say', 'that', 'they'), ('that', 'they', 'were'), ('they', 'were', 'perfectly'), ('were', 'perfectly', 'normal'), ('perfectly', 'normal', 'thank'), ('normal', 'thank', 'you'), ('thank', 'you', 'very'), ('you', 'very', 'much'), ('very', 'much', 'They'), ('much', 'They', 'were'), ('They', 'were', 'the'), ('were', 'the', 'last'), ('the', 'last', 'people'), ('last', 'people', 'you'), ('people', 'you', 'd'), ('you', 'd', 'expect'), ('d', 'expect', 'to'), ('expect', 'to', 'be'), ('to', 'be', 'involved'), ('be', 'involved', 'in'), ('involved', 'in', 'anything'), ('in', 'anything', 'strange'), ('anything', 'strange', 'or'), ('strange', 'or', 'mysterious'), ('or', 'mysterious', 'b

Here is an easier way to do it:

In [33]:
text = "Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say \
that they were perfectly normal, thank you very much. They were the last \
people you'd expect to be involved in anything strange or mysterious, \
because they just didn't hold with such nonsense. \
Mr. Dursley was the director of a firm called Grunnings, which made \
drills. He was a big, beefy man with hardly any neck, although he did \
have a very large mustache. Mrs. Dursley was thin and blonde and had \
nearly twice the usual amount of neck, which came in very useful as she \
spent so much of her time craning over garden fences, spying on the \
neighbors. The Dursleys had a small son called Dudley and in their \
opinion there was no finer boy anywhere. "

text = textblob.TextBlob(text)
print(text.ngrams(n = 2))

[WordList(['Mr', 'and']), WordList(['and', 'Mrs']), WordList(['Mrs', 'Dursley']), WordList(['Dursley', 'of']), WordList(['of', 'number']), WordList(['number', 'four']), WordList(['four', 'Privet']), WordList(['Privet', 'Drive']), WordList(['Drive', 'were']), WordList(['were', 'proud']), WordList(['proud', 'to']), WordList(['to', 'say']), WordList(['say', 'that']), WordList(['that', 'they']), WordList(['they', 'were']), WordList(['were', 'perfectly']), WordList(['perfectly', 'normal']), WordList(['normal', 'thank']), WordList(['thank', 'you']), WordList(['you', 'very']), WordList(['very', 'much']), WordList(['much', 'They']), WordList(['They', 'were']), WordList(['were', 'the']), WordList(['the', 'last']), WordList(['last', 'people']), WordList(['people', 'you']), WordList(['you', "'d"]), WordList(["'d", 'expect']), WordList(['expect', 'to']), WordList(['to', 'be']), WordList(['be', 'involved']), WordList(['involved', 'in']), WordList(['in', 'anything']), WordList(['anything', 'strange'
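Since n-grams are mostly used through their frequencies, a natural next step is to count them. The sketch below is one possible way to do it with the standard library. The `zip_ngrams` helper and the short sample sentence are my own assumptions and are not part of textblob.

```python
import re
from collections import Counter

def zip_ngrams(tokens, n):
    """Return n-grams by zipping n shifted views of the token list."""
    return list(zip(*(tokens[i:] for i in range(n))))

sample = "they were proud and they were normal"
tokens = re.findall(r"\w+", sample)

# Count how often each bigram occurs and show the most frequent one.
bigram_counts = Counter(zip_ngrams(tokens, 2))
print(bigram_counts.most_common(1))
```

`zip` stops at the shortest of the shifted lists, so the last window always has exactly n tokens.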