# Pre-processing

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">1. Case folding</p>

This is the act of converting every token to be uniformly **lower case** or **upper case**. 

<center>
<img src='https://i.postimg.cc/JhQRD6Jk/case-folding.jpg' width=600>
</center>
    
This can be beneficial because it **reduces the number of unique tokens** in a corpus, i.e. the size of the **vocabulary**, which makes processing these tokens more memory and computationally efficient. The downside, however, is **information loss**.

For example, `"Green"` (name) has a different meaning from `"green"` (colour), but both would get the **same token** if case folding is applied. Whether it makes sense to use case folding **depends on the application** (is speed or accuracy more important?).
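To make the trade-off concrete, here is a minimal sketch in plain Python (no spaCy). The toy corpus and the whitespace tokenization are assumptions made purely for illustration. It compares the vocabulary size before and after lower-casing.

```python
# Toy corpus (hypothetical), split on whitespace for simplicity
corpus = "The green car passed Mr Green while the Green party met in the park"
tokens = corpus.split()

# The vocabulary is the set of unique tokens
vocab_raw = set(tokens)
vocab_folded = {t.lower() for t in tokens}

# Vocabulary size before and after case folding
print(len(vocab_raw), len(vocab_folded))

# "Green" (name) and "green" (colour) collapse into a single entry
print("Green" in vocab_raw, "green" in vocab_raw, "Green" in vocab_folded)
```

The folded vocabulary is smaller, but the name and the colour can no longer be told apart.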

In [1]:
# Import spacy library
import spacy

# Load language model
nlp = spacy.load("en_core_web_sm")

To case fold to lower case, we can use the `.lower_` attribute.

In [2]:
# Tokenize
s = "The train to London leaves at 10am on Tuesday."
doc = nlp(s)

# Case fold
print([t.lower_ for t in doc])

['the', 'train', 'to', 'london', 'leaves', 'at', '10', 'am', 'on', 'tuesday', '.']


We might want to be **more granular** and only case fold if certain conditions are met. For example, we could **skip the first word** in a sentence.

In [3]:
# Conditional case folding
print([t.lower_ if not t.is_sent_start else t.text for t in doc])

['The', 'train', 'to', 'london', 'leaves', 'at', '10', 'am', 'on', 'tuesday', '.']


## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">2. Stop word removal</p>

Stop words are words that **appear commonly** but **carry little information**. Examples include `"a"`, `"the"`, `"of"`, `"an"`, `"this"` and `"that"`. Similar to case folding, removing stop words can **improve efficiency** but comes at the cost of **losing contextual information**.

<center>
<img src='https://i.postimg.cc/B6XY2bkG/stop-word-removal.jpg' width=600>
</center>

The choice of whether to use stop word removal depends on the task being performed. For a task like **topic modelling** (identifying topics in text), contextual information is **less important** than for a task like **sentiment analysis**, where the stop word `"not"` can change the sentiment completely.
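The effect of dropping `"not"` can be sketched in plain Python. The stop word list, the `remove_stop_words` helper and its `keep` parameter are all hypothetical names introduced for this illustration; they are not part of any library.

```python
# Small illustrative stop word list (hypothetical, not a library's list)
stop_words = {"the", "was", "a", "not", "this", "is"}

def remove_stop_words(tokens, stop_words, keep=frozenset()):
    """Drop stop words, except those explicitly listed in `keep`."""
    return [
        t for t in tokens
        if t.lower() not in stop_words or t.lower() in keep
    ]

tokens = "The film was not good".split()

# Plain removal: the negation disappears
naive = remove_stop_words(tokens, stop_words)

# Tuned removal: "not" is protected, so the negation survives
tuned = remove_stop_words(tokens, stop_words, keep={"not"})

print(naive)
print(tuned)
```

With plain removal the sentence reads as positive, which is why a sentiment task might protect negation words.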

Also note that different libraries have **different** stop word lists, so you might want to **tune** your list depending on the application. spaCy's language model has **over 300 stop words**.

In [4]:
# Print spacy's stop word list
print(nlp.Defaults.stop_words)
print(len(nlp.Defaults.stop_words))

{'me', 'part', 'yours', '‘m', 'other', 'whose', 'whom', 'put', 'too', 'nowhere', 'using', 'either', 'whoever', 'towards', 'hundred', 'formerly', 'thus', 'those', 'please', 'which', 'nor', 'noone', 'you', 'used', 'get', 'whatever', 'do', 'not', 'although', 'whereafter', 'serious', 'four', 'so', 'due', 'hers', 'show', 'when', 'in', 'still', 'amongst', 'should', 'at', 'however', 'his', 'thereupon', 'five', 'its', 'whenever', 'whereas', 'as', 'must', '‘s', 'further', 'latterly', 'three', 'into', 'hereafter', 'empty', 'sixty', 'twelve', 'us', 'why', '‘ve', 'two', 'up', 'anyhow', 'enough', 're', 'namely', 'these', 'beforehand', 'much', 'indeed', 'to', 'being', 'until', "n't", 'become', 'here', 'though', 'ours', 'was', 'alone', 'thereby', "'ve", 'say', 'forty', 'how', 'along', 'through', 'yet', 'around', 'seem', 'nine', 'we', 'then', 'meanwhile', '’ll', 'nobody', 'yourself', 'never', "'re", 'your', 'among', 'my', 'whither', 'even', 'does', 'did', 'also', 'third', 'nothing', 'if', 'some', 'aga

To remove stop words, we use the `.is_stop` attribute.

In [5]:
# Stop word removal
print([t.text for t in doc if not t.is_stop])

['train', 'London', 'leaves', '10', 'Tuesday', '.']


Depending on the application, you might want to **customize spaCy's** stop word list. This can be done as follows.

In [6]:
# Add a custom stop word
nlp.Defaults.stop_words.add("ergo")

# Remove an existing stop word
nlp.Defaults.stop_words.remove("whatever")

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">3. Stemming</p>

Stemming is the act of **reducing a word to its stem** by **removing suffixes** and sometimes prefixes depending on the language.

For example, the words `"developed"` and `"developing"` both have the stem `"develop"`.

While this technique also reduces the size of the vocabulary, it can result in **invalid words**; for example, `"studies"` might be stemmed to `"studi"`. For this reason, stemming is rarely used these days. It turns out there is a **better alternative**, called **lemmatization**, which we'll look at next.
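Since this section has no code of its own, here is a minimal sketch of suffix stripping in plain Python. The rule list and the `toy_stem` function are assumptions written for this illustration; they are a toy and not the Porter algorithm or any library's stemmer.

```python
# Ordered (suffix, replacement) rules; a toy rule set, NOT the Porter algorithm
RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word, min_stem_len=3):
    """Apply the first matching suffix rule.

    A rule only fires if at least `min_stem_len` characters
    remain before the suffix.
    """
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["developed", "developing", "studies", "sing"]:
    print(w, "->", toy_stem(w))
```

Both forms of "develop" share a stem, while "studies" becomes the invalid word "studi", because the rules operate on characters rather than on a dictionary.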

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">4. Lemmatization</p>

Lemmatization reduces a word down to its **lemma**, i.e. dictionary form. 

While this is similar to stemming, it also takes into account things like **tenses** and **irregular forms**. For example, the words `"did"`, `"done"` and `"doing"` would be converted to the base form `"do"`.

<center>
<img src='https://i.postimg.cc/0NwRqt5S/lemmatization.jpg' width=600>
</center>
    
It also takes into account whether a word is a **noun**, **verb** or **adjective** when deciding whether to lemmatize. For example, it might not modify some adjectives so as not to change their meaning (`"energetic"` is different to `"energy"`).

Lemmatization is generally preferred to stemming because it is **more accurate and robust** while still offering the same benefit of vocabulary size reduction. It does, however, remove your ability to distinguish different **tenses**, which may be important for some applications.
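The idea that the lemma depends on both the word and its part of speech can be sketched with a small lookup table in plain Python. The `LEMMAS` table and `toy_lemmatize` function are hypothetical and hand-written for this illustration; real lemmatizers rely on much larger dictionaries and rules.

```python
# Tiny hand-written lookup table keyed by (word, part of speech)
LEMMAS = {
    ("did", "VERB"): "do",
    ("done", "VERB"): "do",
    ("doing", "VERB"): "do",
    ("meeting", "VERB"): "meet",     # "we are meeting tomorrow"
    ("meeting", "NOUN"): "meeting",  # "the meeting ran late"
}

def toy_lemmatize(word, pos):
    """Return the dictionary form for (word, pos), or the lower-cased word if unknown."""
    return LEMMAS.get((word.lower(), pos), word.lower())

# Irregular verb forms map to the same lemma
print([toy_lemmatize(w, "VERB") for w in ["did", "done", "doing"]])

# The same surface form gets a different lemma depending on its tag
print(toy_lemmatize("meeting", "VERB"), toy_lemmatize("meeting", "NOUN"))
```

Unlike suffix stripping, a lookup of this kind always returns a valid word, at the cost of needing a dictionary and a part-of-speech tag.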

In [7]:
# Tokenize
s = "She was the fastest swimmer."
doc = nlp(s)

We can view the lemmatization using the `.lemma_` attribute.

In [8]:
# Lemmatization
print([(t.text,t.lemma_) for t in doc])

[('She', 'she'), ('was', 'be'), ('the', 'the'), ('fastest', 'fast'), ('swimmer', 'swimmer'), ('.', '.')]


## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">5. Part-of-speech tagging</p>

Part-of-speech tagging is the method of **classifying how a word is used in a sentence**, for example, **noun, verb, adjective**.

<center>
<img src='https://i.postimg.cc/Jh5wnsJ7/pos-tagging.jpg' width=600>
</center>

This is very helpful because it can help us understand the **intent or action** of an ambiguous word. For example, when we say `"Hand me a hammer."`, the word `"hand"` is a **verb** (doing word) as opposed to `"The hammer is in my hand."` where it is a **noun** (thing) and has a different meaning. 

We can access the part-of-speech tags using the `.pos_` attribute.

In [9]:
# Part-of-speech
print([(t.text,t.pos_) for t in doc])

[('She', 'PRON'), ('was', 'AUX'), ('the', 'DET'), ('fastest', 'ADJ'), ('swimmer', 'NOUN'), ('.', 'PUNCT')]


A full description of the tags can be found using `spacy.explain`.

In [10]:
print([(t.pos_,spacy.explain(t.pos_)) for t in doc])

[('PRON', 'pronoun'), ('AUX', 'auxiliary'), ('DET', 'determiner'), ('ADJ', 'adjective'), ('NOUN', 'noun'), ('PUNCT', 'punctuation')]
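Once tokens are tagged, the tags can drive later processing steps. As a minimal sketch in plain Python, the pairs below are copied from the `.pos_` output shown earlier in this section; the choice of which tags count as "content" tags is an assumption.

```python
from collections import Counter

# (text, tag) pairs copied from the part-of-speech output above
tagged = [("She", "PRON"), ("was", "AUX"), ("the", "DET"),
          ("fastest", "ADJ"), ("swimmer", "NOUN"), (".", "PUNCT")]

# Keep only content-bearing tags (this particular set is an assumption)
CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}
content_words = [text for text, tag in tagged if tag in CONTENT_TAGS]

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in tagged)

print(content_words)
print(tag_counts)
```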


## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">6. Named Entity Recognition</p>

Named Entity Recognition (NER) is the act of tagging **named entities** in text. 

A **named entity** is anything that can be referred to by a **proper name**, and it usually has the **proper noun** tag. Common examples include people, cities, countries and companies. Note that it is common to extend entities to include money, time, dates, etc.

<br>
<br>
<center>
<img src='https://i.postimg.cc/VNbYfqBx/ner.png' width=600>
</center>
<br>
<br>


NER can help **categorize and organize** a corpus. It is especially useful, for example, in helping **chatbots** raise accurate support tickets depending on the customer problem. 

Some of the **challenges** to building a state-of-the-art NER model include **type ambiguity**, where one word can have multiple meanings (e.g. Amazon - river or company?), and the fact that **entities can span multiple tokens** (e.g. John Smith). Luckily, spaCy has a very good NER model that we can utilize.

In [11]:
# Tokenize
s = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(s)

There are two ways to do NER in spaCy. The **first way** is via the `.ent_type_` attribute.

In [12]:
# Named Entity Recognition
print([(t.text,t.ent_type_) for t in doc])

[('Apple', 'ORG'), ('is', ''), ('looking', ''), ('at', ''), ('buying', ''), ('U.K.', 'GPE'), ('startup', ''), ('for', ''), ('$', 'MONEY'), ('1', 'MONEY'), ('billion', 'MONEY')]


In [13]:
# Only print entities
print([(t.text, t.ent_type_) for t in doc if t.ent_type_])

[('Apple', 'ORG'), ('U.K.', 'GPE'), ('$', 'MONEY'), ('1', 'MONEY'), ('billion', 'MONEY')]


Like before, we can use `spacy.explain` to understand each tag.

In [14]:
# Entity explanation
print('ORG:', spacy.explain('ORG'))
print('GPE:', spacy.explain('GPE'))
print('MONEY:', spacy.explain('MONEY'))

ORG: Companies, agencies, institutions, etc.

GPE: Countries, cities, states

MONEY: Monetary values, including unit


The **second way** to do NER in spaCy is to use the `.ents` attribute.

In [15]:
# Named Entity Recognition
print([(ent.text, ent.label_) for ent in doc.ents])

[('Apple', 'ORG'), ('U.K.', 'GPE'), ('$1 billion', 'MONEY')]


Notice this time how `$1 billion` is grouped into **one entity**, whereas before each of its tokens was tagged separately.
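The relationship between the two views can be sketched in plain Python by merging consecutive tokens that share a label. This is a simplified illustration, not how spaCy builds `doc.ents`: two adjacent entities of the same type would be merged here, whereas spaCy tracks entity boundaries per token (exposed through the `.ent_iob_` attribute). The pairs are copied from the token-level output earlier in this section.

```python
from itertools import groupby

# (text, label) pairs copied from the token-level NER output above
token_tags = [("Apple", "ORG"), ("is", ""), ("looking", ""), ("at", ""),
              ("buying", ""), ("U.K.", "GPE"), ("startup", ""), ("for", ""),
              ("$", "MONEY"), ("1", "MONEY"), ("billion", "MONEY")]

# Merge runs of consecutive tokens that carry the same non-empty label
spans = []
for label, group in groupby(token_tags, key=lambda pair: pair[1]):
    if label:  # skip runs of untagged tokens
        text = " ".join(tok for tok, _ in group)
        spans.append((text, label))

print(spans)
```

Joining with spaces gives `"$ 1 billion"` rather than `"$1 billion"`, because the original whitespace is not stored in these pairs.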

Finally, we can **visualize** the entities using a spaCy built-in function.

In [16]:
# Import function
from spacy import displacy

# Visualize entities
displacy.render(doc, style='ent', jupyter=True)