# Module 2: Text Normalisation - Preprocessing
<br>


## Table of Contents
<br>

<a href="#Module 2: Text Normalisation - Preprocessing"><font size="+1">Module 2: Text Normalisation - Preprocessing</font></a>
<ol>
  <li>What is Text Normalisation?</li>
  <li>Lowercasing</li>
  <li>Remove Punctuation</li>
  <li>Tokenize</li>
  <li>Stemming</li>
  <li>Lemmatization</li>
  <li>Stopword Removal</li>
  <li>Remove Specific Values</li>
  <li>Remove the words having length less than 3</li>
  <li>Using Stanza (not available for ONS devices)</li>
  <li>Challenges</li>
</ol>

**Learning Outcomes:** 

* Explain the concept of normalisation, 

* Apply the following preprocessing steps to a dataset using nltk: 

    + Lowercase 

    + Tokenize 

    + Lemmatization 
    
    + Stemming 

    + Removing stop words and punctuation
 
 
*  Differentiate between lemmatization and stemming. 
<br>

Additionally you should be able to:

* Execute tokenisation and lemmatisation using the Stanford Stanza library

<br>


In [None]:
# Import the libraries used in this module
import nltk
nltk.data.path.append("../local_packages/nltk_data")
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import string
import re

In [None]:
# Use of stanza is optional
# it will not work on some locked down devices (such as ONS machines)
#import stanza

We will load in a data set that will be used to explore text processing techniques.

In [None]:
# Our data is stored in a pickle, a filetype that lets us store python objects
patents = pd.read_pickle('../data/Patent_Dataset.pkl')

# our index is incorrect, quickly reset it
patents = patents.reset_index(drop=True)

patents.head()

### 2.1 What is Text Preprocessing?

<br>

Pre-processing text means converting it to a more convenient, standard form suited to your specific application. It aims to put all text on a level playing field. It requires being aware of what type of text is to be normalized and how it is to be processed afterwards; there is no all-purpose, one-size-fits-all preprocessing procedure. 
<br>
To normalise text some key preprocessing steps could be undertaken, for example:

<ul>
  <li>Lowercasing</li>
  <li>Removing all irrelevant characters (Numbers and Punctuation)</li>
  <li>Removing words that aren't useful to us - "Stopwords" </li>
  <li>Stemming</li>
  <li>Lemmatization</li>
</ul>

The above is often referred to as *cleaning* the text. It is also part of an NLP *pipeline*.

<br>
Typically, the text also undergoes <strong>tokenisation</strong> (segmenting text into words) to aid the normalisation process.
<br>

After preprocessing, your text becomes more predictable and easier to analyse.  

The pre-processing steps undertaken depend on the task. It is not necessary to do any or all of the steps shown below.

There is a range of related terminology used in NLP pre-processing:

* Pre-processing - the broad term for steps to make text ready for analysis
* Text Normalisation - reducing the variance in text by converting similar words to the same word
* Text Cleaning - removing unnecessary parts of text such as irrelevant words or characters

Text preprocessing choices are highly dependent on the application. Pre-processing also includes feature representation, which involves converting text to a numerical representation; this is covered within the "Intermediate NLP" course.

### 2.2 Lowercasing
<br>
Lowercasing all the text data is one of the simplest and most effective forms of text normalisation. 

Lowercasing is applicable to some NLP tasks and significantly helps with consistency of the output.

It is useful because in a lot of applications where we are trying to capture meaning, capitalised words have the same meaning as those in lowercase. For example `Cat` (noun) at the beginning of a sentence is referring to the same thing as `cat` in the middle of a sentence.

<br>

In [None]:
# Viewing the text before we lowercase it
# using the column "abstract" from the data frame and
# observing the first row in that column (index 0)
patents["abstract"][0]

In [None]:
# lowercase the text using pandas string methods
patents['abstract_lower'] = patents['abstract'].str.lower()

In [None]:
# viewing the text after lowercasing
patents["abstract_lower"][0]

Warning: lowercasing can be really useful for reducing the variation between words in a text. However, this is not always what we want.

Capitalisation can also be informative. For example, if we were looking to extract all "Named Entities", such as people's names, locations, products or organisations, then having a capital letter could be really useful!

For example:


<i>"A turkey may march in Turkey in May or March"</i>


In the above we will want to treat "Turkey" and "turkey" differently, as well as May/may and March/march.

We decide whether to lowercase based on the task at hand, and whether doing so would help us.
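
As a rough illustration of this trade-off, the sketch below counts the distinct tokens in the example sentence with and without lowercasing. It uses only the standard library, and splitting on whitespace is an assumed stand-in for a real tokenizer.

```python
from collections import Counter

sentence = "A turkey may march in Turkey in May or March"

# Split on whitespace; a simple stand-in for a real tokenizer
tokens = sentence.split()

# Count distinct tokens with the original casing
cased_counts = Counter(tokens)

# Count again after lowercasing every token
lowered_counts = Counter(token.lower() for token in tokens)

print(len(cased_counts), "distinct tokens before lowercasing")
print(len(lowered_counts), "distinct tokens after lowercasing")

# "turkey" (the bird) and "Turkey" (the country) are now the same token
print(lowered_counts["turkey"])
```

Lowercasing shrinks the vocabulary, but the bird and the country can no longer be told apart.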

### 2.3 Remove Punctuation
<br>
Punctuation is sometimes irrelevant to the task at hand (such as counting word frequencies). By removing all or certain punctuation we remove the 'quirks' of individual sentences, standardising the text.

Python provides a constant called `string.punctuation` that provides a great list of punctuation characters. 

<br>

We are going to remove the punctuation from our pandas text data. To do this we need to know what the punctuation characters are. We then need a function that will remove punctuation from a string.

In [None]:
# display the standard string punctuation
print(string.punctuation)

In [None]:
# we need to create a regular expression (covered in the next chapter)
# which captures all the above punctuation characters
# re.escape makes characters such as "\", "]" and "^" match literally
"[{}]".format(re.escape(string.punctuation))

In [None]:
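# Below is a function that uses regex to remove punctuation from strings
def remove_punct(ptext):
    # re.escape makes every punctuation character literal inside the
    # character class; without it "\]" is read as an escaped bracket
    # and backslashes are left in the text
    # replace any punctuation with nothing "", effectively removing it
    ptext = re.sub(string=ptext,
                   pattern="[{}]".format(re.escape(string.punctuation)), 
                   repl="")
    return ptext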

# by making a function that works for one piece of text
# we can then apply the function to all the pandas text

In [None]:
# viewing our text before removing punctuation
patents["abstract"][0]

In [None]:
# apply removing punctuation function to all elements in the column "abstract"
patents['abstract_no_punct'] = patents['abstract'].apply(remove_punct)

In [None]:
patents['abstract_no_punct'][0]

We can see above that there are no longer any punctuation marks. As a result, the text may no longer make perfect grammatical sense!
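
Regular expressions are not the only option. As a minimal sketch, the standard library's `str.maketrans` and `str.translate` can remove the same characters; `remove_punct_translate` is a hypothetical name used only for this illustration.

```python
import string

# Build a translation table that deletes every character in string.punctuation
punct_table = str.maketrans("", "", string.punctuation)


def remove_punct_translate(text):
    """Remove ASCII punctuation without using a regular expression."""
    return text.translate(punct_table)


example = "Hello, world! It's a (simple) test."
print(remove_punct_translate(example))
```

Note that `string.punctuation` only covers ASCII punctuation, so characters such as curly quotes would be left in place by either approach.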

Removing punctuation is one approach to cleaning text. A broader approach is removing non-alphanumeric text. This would cover punctuation, numbers and any other text that is not letters.

We can use basic approaches, like removing all text that isn't letters. Or we could use more advanced methods that take into account the relationships between words.

For example, if we simply removed all non-alphanumeric text from `"let's"` we would get `"lets"` which has a different meaning to `"let us"`.
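
The sketch below shows this effect on a short string, and one possible (assumed) mitigation: keeping apostrophes so that contractions survive. Both helper functions are hypothetical and only handle ASCII letters.

```python
import re


def keep_letters(text):
    """Drop everything except ASCII letters and whitespace."""
    cleaned = re.sub(r"[^A-Za-z\s]", "", text)
    # collapse the double spaces left behind by removed tokens
    return " ".join(cleaned.split())


def keep_letters_and_apostrophes(text):
    """As above, but apostrophes are kept so contractions stay intact."""
    cleaned = re.sub(r"[^A-Za-z'\s]", "", text)
    return " ".join(cleaned.split())


text = "At 10:30, let's go!"
print(keep_letters(text))                  # At lets go
print(keep_letters_and_apostrophes(text))  # At let's go
```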

### 2.4 Tokenize
<br>
Tokenization is the process of splitting the given text into smaller pieces called tokens. Words, numbers, punctuation marks and others can be considered as tokens. Tokenization makes it easier to process the text, e.g. to count how often each word occurs. 


Consider the following text:

`"This is the text given"`

By tokenising the text we receive the following tokens:

`"This", "is", "the", "text", "given"` 

In python this often means we go from a text string to a list of strings.

Tokenizers can range in complexity, from using `.split(" ")` on text, to more rule based or machine learning-based implementations. Below we use `nltk`'s word tokenizer.

As some tokenisers take punctuation into account, we do not always need to remove punctuation first.
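
To make the difference concrete, the sketch below compares splitting on spaces with a small punctuation-aware regular expression. The pattern is an assumption written for this illustration; it is not the rule set that `nltk` uses, and `nltk`'s output may differ (for example in how contractions are split).

```python
import re

text = "Don't panic, it's only a test."

# Naive approach: split on single spaces; punctuation stays glued to words
space_tokens = text.split(" ")

# Regex approach: a run of word characters (optionally followed by an
# apostrophe and more word characters) is one token, and any other
# non-space character becomes a token of its own
regex_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(space_tokens)
print(regex_tokens)
```

With the regex approach the comma and full stop become separate tokens, so they can be kept or filtered out later as the task requires.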

We are going to first look at a basic example of tokenization, then use a more advanced method from `nltk`.

In [None]:
# Basic tokenization example
original_text = "This is the text given"

# Create tokens by splitting the text on each " " space
tokens = original_text.split(" ")

print("Original text:\n\t", original_text)
print("Original text 'type':\n\t", type(original_text).__name__)
print("Tokenized text:\n\t", tokens)
print("Tokenized text 'type':\n\t", type(tokens).__name__)

In [None]:
# Instead of using split, we can tokenise using functions from nltk, 
# other libraries, or create our own
# For our basic text example this will provide the same result
# For more complex/larger text this will be better than splitting
nltk_tokens = nltk.word_tokenize(original_text)
nltk_tokens

In [None]:
# text before tokenization
patents["abstract"][0]

In [None]:
# Apply to 'abstract' column in dataframe
patents['abstract_tokens'] = patents['abstract'].apply(nltk.word_tokenize)

In [None]:
patents["abstract_tokens"][0]

Text can also be split to identify sentences.

This can be done using the **sent_tokenize()** function from nltk.

Think about what separates sentences: typically punctuation, which can be used to determine where sentences start and end.

In [None]:
# Apply sentence tokenisation to abstract column in dataframe
patents['abstract_sentences'] = patents['abstract'].apply(nltk.sent_tokenize)

In [None]:
# text before sentence segmentation
patents["abstract"][0]

In [None]:
# text after sentence segmentation
patents["abstract_sentences"][0]

The above abstract is one long sentence! Let's look at another abstract to see the result.

In [None]:
patents["abstract_sentences"][1]

There we go, each sentence has become its own element in a list. After tokenizing the sentences we could then tokenize the individual words within each sentence string.
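
The two-stage idea can be sketched with the standard library alone. The splitter below is a deliberately naive assumption (split after `.`, `!` or `?` when followed by whitespace) and is far simpler than `nltk.sent_tokenize`; `naive_sent_tokenize` is a hypothetical helper.

```python
import re


def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


text = "Tokenising is useful. It splits text! Does it always work?"

# Stage 1: text -> list of sentences
sentences = naive_sent_tokenize(text)

# Stage 2: each sentence -> list of word tokens (whitespace split)
words_per_sentence = [sentence.split() for sentence in sentences]

print(sentences)
print(words_per_sentence)
```

A rule this simple will break on abbreviations that contain full stops, which is one reason to prefer a dedicated sentence tokenizer.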

### 2.5 Stemming
<br>
<strong> The process of reducing inflected (or sometimes derived) words to their word stem; that is, their base or root form.</strong>  

<br>

For example, the words *argue, argued, argues, arguing* reduce to the stem *argu*. 

Usually stemming is a crude heuristic process that chops off the ends of words in the hope of achieving the root correctly most of the time.

Stemming aims to remove the excess part of the word to be able to identify words that are similar.

There are different algorithms that can be used in the stemming process, but the most common in English is the <strong>Porter Stemmer.</strong>  The rules contained in this algorithm are divided into different phases. The purpose of these rules is to reduce the words to their root.

The danger here lies in the possibility of overstemming where words like “universe” and “university” are reduced to the same root of “univers”.
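
To illustrate how crude suffix chopping leads to overstemming, the sketch below uses a deliberately simple toy stemmer. It is not the Porter algorithm: the suffix list and the minimum stem length of three are assumptions chosen only for this illustration, and `toy_stem` is a hypothetical helper.

```python
# Suffixes are tried in order, so longer ones are listed first
SUFFIXES = ("ity", "ing", "ed", "es", "e", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Related words collapse to one stem, which is the intended behaviour
print([toy_stem(w) for w in ["argue", "argued", "argues", "arguing"]])

# Unrelated words can collapse too, which is overstemming
print(toy_stem("universe"), toy_stem("university"))
```

Real stemmers add many more rules and conditions, but the underlying trade-off between merging related words and merging unrelated ones remains.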

Below we are going to look at the results of using a stemmer on text. The resulting strings will not necessarily be real words, but they will reduce the diversity of the words in the text. We apply stemming to each word in a text, which means each word needs to be separate from its neighbours. This is done using tokenising, so we can take a string of words, break them up, then apply a process to each independently.

In [None]:
# Using the PorterStemmer from nltk
words_to_normalise = "The Ones and twos argues who is winning in todays matches"

# generate tokens
tokens = nltk.word_tokenize(words_to_normalise)

# note we have not applied any other preprocessing to the text
tokens

In [None]:
# Loop through each token in the list and apply the stemming
tokens_stemmed = [PorterStemmer().stem(token) for token in tokens]

# A list of stemmed words
print(tokens_stemmed)
# The stemmer has clearly applied some processing to the text, beyond chopping
# off an ending. This PorterStemmer has also lowercased the text

Throughout this course we are going to follow a similar code structure to the below:

* write a function that performs what we want on a single piece of data (such as a string, or list depending on context)
* apply this function to every row in the data set

There are often ways to do string manipulation with pandas itself, which may be more efficient than creating a custom function. However, for some data structures, such as lists, the pandas solution is harder to understand.

In [None]:
## Applying stemming to the pandas data
# Define stemming function

def stemming(ptoken):
    # create stemming object
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in ptoken]    

In [None]:
# tokens pre-stemming
patents['abstract_tokens'][0]

In [None]:
patents['abstract_tokens_stemmed'] = patents['abstract_tokens'].apply(stemming)

In [None]:
# tokens post stemming
patents['abstract_tokens_stemmed'][0]

In [None]:
# comparing the pre and post stemmed tokens
list(zip(patents['abstract_tokens'][0], patents['abstract_tokens_stemmed'][0]))

Looking at those results we can see that the number of different tokens will reduce in a big corpus. However, this does make some of the resulting tokens less informative. There's a tradeoff here between how understandable our resulting words are, and how much we reduce the diversity.

### 2.6 Lemmatization
<br>

Lemmatisation uses vocabulary and morphological analysis of words to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the <strong> lemma.</strong>  Most lemmatisers achieve this using a lookup table, so when you have large volumes of text this process may be slower than the alternative, stemming. However, it is often a recommended approach in a variety of applications due to its accuracy.

Also, sometimes, the same word can have multiple different lemmas. So, based on the context it’s used in, you should identify the ‘part-of-speech’ (POS) tag for the word in that specific context and extract the appropriate lemma. POS tagging will be discussed in more detail in later modules.

A part of speech refers to the type of a word. For example we can have nouns, verbs, adjectives and so on to describe a word. We can however go into much more depth to describe a word's part of speech by introducing more and more types. Below a basic example is shown. When we know what part of speech a word is we can better lemmatize it.

<img src="../pics/posTag.png" alt="Example tagging">

The WordNet lemmatizer shown below is a "lexical" lemmatizer. It contains a wide range of words and their lemmas. When called, it takes whatever word is given and returns the lemma if it has an entry for it. If it doesn't have an entry it will just return the original word.
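
The lookup behaviour can be sketched with a plain dictionary. The table below is a tiny, hand-written assumption for illustration only; WordNet itself is far larger and also applies morphological rules.

```python
# A toy lemma table (illustrative entries only)
LEMMA_TABLE = {
    "matches": "match",
    "mice": "mouse",
    "tufts": "tuft",
    "feet": "foot",
}


def lookup_lemmatise(tokens, table=LEMMA_TABLE):
    """Return the lemma when the table has an entry, else the original token."""
    return [table.get(token, token) for token in tokens]


print(lookup_lemmatise(["mice", "chased", "tufts", "quickly"]))
# ['mouse', 'chased', 'tuft', 'quickly']
```

Words without an entry ("chased", "quickly") pass through untouched, which is why a lexical lemmatizer always returns real words but can leave many tokens unchanged.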

<br>

Below we are going to follow a similar set of steps to the stemming above. We will analyse the effect of lemmatising on our toy example, then on the pandas patents data set.

In [None]:
# Using the WordNetLemmatizer from nltk
# without using the parts of speech tagging it doesn't perform incredibly well
# But it does always result in valid words where it runs

# Run tokenizer
tokens = nltk.word_tokenize(words_to_normalise)
print(words_to_normalise)
tokens

In [None]:
tokens_lemmed = [WordNetLemmatizer().lemmatize(token) for token in tokens]
print(tokens_lemmed)
type(tokens_lemmed)

Compared to the stemmer we have more words that have been missed / not touched. However, "todays" -> "today", "matches" -> "match" and so on, which are good normalisations. The words that have been lemmatized are now actual words, in comparison with the stemming method.

In [None]:
# Define the lemmatise() function

def lemmatise(ptokens):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in ptokens]

In [None]:
# tokens before lemmatization
patents["abstract_tokens"][0]

In [None]:
# apply lemmatisation to all tokens in column
patents['abstract_tokens_lemmatised'] = patents['abstract_tokens'].apply(lemmatise)

In [None]:
# tokens after lemmatization
patents["abstract_tokens_lemmatised"][0]

In [None]:
# Comparison of normalisation
list(zip(patents["abstract_tokens"][0], patents["abstract_tokens_lemmatised"][0]))

The above shows that in this case lemmatizing has not been particularly effective (if you look closely, "tufts" has changed to "tuft"). To normalize words effectively, a lemmatizer needs to know more than just the words in isolation. The reason for this poor performance is that the lemmatizer assumes every token is a noun. Fewer changes are needed to normalize nouns than other parts of speech, such as verbs. For example, the verb "running" should be converted to "run". But if the lemmatizer thinks "running" is a noun, it will leave the word unchanged.

The above code is a simple example of how to use the wordnet lemmatizer on words and sentences.
Performance can be improved if we add the correct ‘part-of-speech’ tag (POS tag) as the second argument to lemmatize().
An example of this will be provided later.
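
Until then, the role of the POS tag can be sketched with a toy lookup keyed on both the word and its tag. The table and the `lemmatise_with_pos` helper are assumptions for illustration; the default tag of `"n"` mirrors the noun default of `WordNetLemmatizer.lemmatize`.

```python
# Toy table keyed on (word, part-of-speech tag); illustrative entries only
POS_LEMMAS = {
    ("running", "v"): "run",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("better", "a"): "good",
}


def lemmatise_with_pos(word, pos="n"):
    """Look up the lemma for a word given its tag; default to noun."""
    return POS_LEMMAS.get((word, pos), word)


# Treated as a noun (the default), "running" is left unchanged
print(lemmatise_with_pos("running"))

# Treated as a verb, it is reduced to its lemma
print(lemmatise_with_pos("running", pos="v"))
```

The same word can therefore map to different lemmas depending on the tag supplied, which is why accurate POS tagging improves lemmatisation.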


**Lemmatization and Stemming:** Stemming operates on each word without considering the context, and it cannot discriminate between different word meanings. Lemmatization, however, takes into account the part of speech and the context. Stemming is explained in more detail in section 2.5 above.

Caring -> Lemmatization -> ‘Care’ 
<br>
Caring -> Stemming (crudely stripping the "ing" suffix) -> ‘Car’


**Example:**
 
<i>"better"</i>: has <i>"good"</i> as its lemma and "better" as its stem <br>
<i>"walking"</i>: has  <i>"walk"</i> as its lemma and stem <br>
<i>"meeting"</i>: can be either a noun or a form of the verb <i>"meet"</i> depending on the context, e.g. <i>"in our last meeting"</i> or <i>"We are meeting again tomorrow"</i>. 
<br>
Lemmatization can select the appropriate lemma based on the context, unlike stemming.
<br>



<img src="../pics/stemlemm.png" alt="Comparison of stemming and lemmatization outcomes">

<br>

Stemming is often a faster process, but less accurate / useful. So the specific use case is important in deciding which method to choose.

### 2.7 Stopword Removal

A stop word is a commonly used word (such as “the”, “a”, “an”, “in”). Stop words can be filtered from the text to be processed. There is no universal list of stop words in NLP, however the nltk module contains a list of stop words. Removal of stopwords can boost performance in machine learning classification tasks.

Stop word removal is commonly applied in search systems, text classification applications, topic modeling, topic extraction and others. Stop word lists can come from pre-established sets or you can create a custom one for your domain.

Often we will need to select our own stopwords for a task, because not all text is the same and all language can be different. What is "unimportant" in one analysis/application may be crucial in another.
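
A custom stopword list can be as simple as a Python set. The sketch below uses a small hand-picked set (an assumption for illustration, not the `nltk` list) and shows that membership checks are case-sensitive, so the casing of the tokens matters.

```python
# A small custom stopword set; every entry is lowercase
CUSTOM_STOPWORDS = {"the", "a", "an", "in"}


def remove_stopwords(tokens, stop_words=CUSTOM_STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stop_words]


tokens = "The patent describes a method in detail".split()

# "The" is kept because only lowercase "the" is in the set
print(remove_stopwords(tokens))

# Lowercasing first lets every stopword be matched
print(remove_stopwords([token.lower() for token in tokens]))
```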

This often works because stopwords often do not add meaning to the sentence. For example if you wanted to work out what was important in the sentence below:

"The cat ate a mouse"

The same sentence without stopwords conveys most of the meaning:

"cat ate mouse"

Below we are going to work through an example of removing stopwords from a text.

In [None]:
# Display the basic stopwords given by nltk
print(stopwords.words('english'))

In [None]:
# you can set the language for stopwords
# We have used a set() below, which is a data structure
# that is useful for checking if some data is present somewhere else efficiently 
# (checking membership)
stop_words_french = set(stopwords.words('french'))
print(stop_words_french)

In [None]:
del stop_words_french
# let's ensure we are using English stopwords for this project
stop_words = set(stopwords.words('english'))

**Note:** the default stopwords from `nltk`, while useful, were created using domain knowledge and algorithms rather than a statistical method alone, and the information on how the list was curated is patchy at best. It was created for the [Snowball project](https://snowballstem.org/projects.html). Stopword collections are best custom made and specific to an individual application.

In [None]:
# Define a function to remove stopwords from list of tokens
def clean_stopwords(tokens):
    # define stopwords
    stop_words = set(stopwords.words('english'))
    # loop through each token and if the word isn't in the set 
    # of stopwords keep it
    return [item for item in tokens if item not in stop_words]

In [None]:
# Pre stopword removal
patents['abstract_tokens'][0]

In [None]:
patents['tokens_no_stops'] = patents['abstract_tokens'].apply(clean_stopwords)

In [None]:
# Post stopword removal
patents['tokens_no_stops'][0]

In [None]:
# These were the stopwords that were removed
found_stopwords = []

# go through each unique token
for token in set(patents['abstract_tokens'][0]):
    if token in stop_words:
        found_stopwords.append(token)
        
found_stopwords

### 2.8 Remove Specific Values

We need to be able to remove specific parts of strings, such as numbers, apostrophes or other non-alphanumeric data. This can be done in a range of ways. 

We can either loop through tokens and remove / edit those with the values we don't want. Or we can use regular expressions (explained in the next chapter) to substitute our specific values in a string. Or a combination of both!

Below we are going to show the looping method, as we haven't yet explored regex in detail. We will look at removing tokens that are not purely alphabetic.

In [None]:
# define a function that keeps only alphabetic text
# creates a new list of only the tokens where isalpha() == True

def remove_num(ptokens):
    return [token for token in ptokens if token.isalpha()]

In [None]:
# create some sample tokens, only one contains no digits
alphanumeric = ["Hello", "H3llo", "He11o", "Hell0"]

# remove_num detects the digits and removes those tokens
remove_num(alphanumeric)

In [None]:
# Removing numbers this way will keep only words that have no digits or punctuation in them.
print(" ".join(remove_num("007 Not sure@ if this % was #fun! 558923 What do# you think** of it.? $500USD!".split(" "))))

We need to be careful in the situation above that we are only removing what we want to remove. How we alter / clean our text will impact the results of our analysis.

If we wanted to remove tokens that were only digits, we could use the `token.isdigit()` method. There is rarely a one size fits all approach to cleaning, with trade offs in each approach.
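
The sketch below compares three possible rules on the same tokens: keep purely alphabetic tokens, drop tokens made only of digits, or strip the digits and keep what is left. The token list is invented for illustration.

```python
import re

tokens = ["Hello", "H3llo", "2021", "-4", "1x10", "fun!"]

# Rule 1: keep only tokens made entirely of letters
only_alpha = [t for t in tokens if t.isalpha()]

# Rule 2: drop tokens made entirely of digits ("-4" stays: "-" is not a digit)
without_pure_digits = [t for t in tokens if not t.isdigit()]

# Rule 3: strip digit characters, then drop any token that becomes empty
digits_stripped = [re.sub(r"\d", "", t) for t in tokens]
digits_stripped = [t for t in digits_stripped if t]

print(only_alpha)
print(without_pure_digits)
print(digits_stripped)
```

Each rule keeps a different set of tokens, so the choice should follow from what the later analysis needs.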

In [None]:
# lets look at a new text, row 21, which contains some digits
patents["abstract"][21]

In [None]:
# Apply removal function
patents['abstract_no_nums'] = patents['abstract_tokens'].apply(remove_num)

In [None]:
# ensure you only use remove_num on tokenized text, otherwise it treats every character as a token.
patents["abstract_tokens"][21]

In [None]:
# We can see there are no longer any tokens with digits in them
# "1x10", "-4" have been removed
patents["abstract_no_nums"][21]

### 2.9 Remove the words having length less than 3

Remove words with two or fewer characters from a document. This could be useful in removing further words that are "semantically empty". This crosses over into stopword removal, but is more heavy handed.

Some common words with only two characters, such as "to" and "it", carry very little meaning in some circumstances.

We are going to create a function that keeps only strings of length $\gt$ 2. This will be applied to example text, and then the patents data set. Depending on the distribution of word length, a processing step like this could have a big impact!

In [None]:
def remove_short_tokens(ptokens):
    return [token for token in ptokens if len(token) > 2]

In [None]:
# define a sentence with varying word length
lyric = "I'm blue, da ba dee da ba daa"
lyric_tokens = nltk.word_tokenize(lyric)
lyric_tokens

In [None]:
# only "longer" words are kept
remove_short_tokens(lyric_tokens)

Note how the way our text has been normalised may have an impact on whether the token is kept. For example, if lemmatisation maps "are" -> "be", the new token would be removed by this rule.

In [None]:
# pre short word removal
patents["abstract_tokens"][0]

In [None]:
# use the short word removal
patents['abstract_no_small'] = patents['abstract_tokens'].apply(remove_short_tokens)

In [None]:
# after removal
patents["abstract_no_small"][0]

The resulting text above looks, in a way, "cleaner": it contains mostly words which have a semantic meaning, no punctuation and few nonsensical tokens. But is that useful to us? It depends on what we are trying to do. If we were analysing the frequency of punctuation, probably not.

### Perform all the preprocessing steps wrapped into one function

We can combine all of our preprocessing steps into one function for ease of use and reproducibility. It's important to note that the order we call each processing function matters here. If we remove short tokens early, we may find it harder to lemmatize the text. Without punctuation, we would be unable to tokenize sentences. Stopwords can be removed, but if the tokens have not been normalized somehow then they will not necessarily be picked up. We need to consider how we process text, and the effect of that processing carefully.
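
The effect of ordering can be seen in a small sketch that applies the same two steps in both orders. The stopword set and helper names are assumptions for illustration and are separate from the functions defined earlier in this module.

```python
STOP_WORDS = {"the", "is", "on"}  # lowercase entries only


def lowercase(tokens):
    return [token.lower() for token in tokens]


def drop_stopwords(tokens):
    return [token for token in tokens if token not in STOP_WORDS]


tokens = ["The", "cat", "is", "on", "THE", "mat"]

# Stopwords removed first: "The" and "THE" are not matched, so they survive
stop_then_lower = lowercase(drop_stopwords(tokens))

# Lowercased first: every stopword is matched and removed
lower_then_stop = drop_stopwords(lowercase(tokens))

print(stop_then_lower)  # ['the', 'cat', 'the', 'mat']
print(lower_then_stop)  # ['cat', 'mat']
```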

In [None]:
def preprocessing_with_lemmatisation(raw_data):
    """Function to perform all preprocessing steps with lemmatisation"""
    ptext = raw_data.lower()
    ptext = remove_punct(ptext)
    ptext = nltk.word_tokenize(ptext)
    ptext = lemmatise(ptext)
    ptext = remove_num(ptext)
    ptext = clean_stopwords(ptext)
    ptext = remove_short_tokens(ptext)

    return ptext

In [None]:
def preprocessing_with_stemming(raw_data):
    """Function to perform all preprocessing steps with stemming"""
    ptext = raw_data.lower()
    ptext = remove_punct(ptext)
    ptext = nltk.word_tokenize(ptext)
    ptext = stemming(ptext)
    ptext = remove_num(ptext)
    ptext = clean_stopwords(ptext)
    ptext = remove_short_tokens(ptext)
        
    return ptext

In [None]:
# Perform all processing at once
patents['processed_with_lem'] = patents['abstract'].apply(preprocessing_with_lemmatisation)

patents["processed_with_lem"][0]

In [None]:
patents['processed_with_stem'] = patents['abstract'].apply(preprocessing_with_stemming)

patents["processed_with_stem"][0]

Compare the results of the different approaches.

### Pipelines

Often in practice we can combine many of our preprocessing, cleaning and normalisation steps into a single pipeline that will reproducibly apply the same steps to different data.

We can:

* Combine steps into functions as shown above
* Use packages such as `sklearn` to build analysers and processors such as in [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) (more covered in Intermediate NLP).
* Use a spaCy pipeline (module 5)
* Use a Stanza pipeline (shown below)
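
As a minimal sketch of the first option, steps can be composed into a single reusable callable. `make_pipeline` and the step functions here are hypothetical helpers written with the standard library only; they are not part of `nltk`, `sklearn`, spaCy or Stanza.

```python
import re
import string
from functools import reduce


def lowercase(text):
    return text.lower()


def strip_punct(text):
    # re.escape keeps every punctuation character literal inside [...]
    return re.sub("[{}]".format(re.escape(string.punctuation)), "", text)


def tokenize(text):
    return text.split()


def drop_short(tokens):
    return [token for token in tokens if len(token) > 2]


def make_pipeline(*steps):
    """Return a function that applies each step, in order, to its input."""
    def run(data):
        return reduce(lambda value, step: step(value), steps, data)
    return run


preprocess = make_pipeline(lowercase, strip_punct, tokenize, drop_short)
print(preprocess("The cat, it seems, ate a MOUSE!"))
# ['the', 'cat', 'seems', 'ate', 'mouse']
```

Because the order of steps is fixed in one place, the same `preprocess` function can be applied reproducibly to different data.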

### 2.10 Using Stanza (optional)

**Note:** The Stanza parts of this course are optional based on software requirements. The Stanza material is useful to be aware of but not crucial to your understanding of basic NLP.

Stanza is a Python natural language analysis package. It contains tools which can be used in a pipeline in order to:

* convert a string containing text into lists of sentences and words.
* generate base forms of words.
* generate parts of speech and morphological features.
* give a syntactic structure dependency parse.
* recognize named entities.

(*adapted from Stanford University, 2020*).
<br>

<img src="../pics/stanzapipeline.png" alt="Flow chart of a typical Stanza pipeline" width=550>


<br>

Some examples of how to use Stanza for language processing are shown below. Further examples are highlighted in the language syntax and structure module.

#### Pipeline

To start annotating text with Stanza, you would typically start by building a Pipeline that contains Processors, each fulfilling a specific NLP task you desire (e.g., tokenization, part-of-speech tagging, syntactic parsing, etc). The pipeline takes in raw text or a Document object that contains partial annotations, runs the specified processors in succession, and returns an annotated Document.

#### Processors

Processors are units of the neural pipeline that perform specific NLP functions and create different annotations for a Document. The neural pipeline supports the following processors:

tokenize, mwt (expands multi-word tokens), pos (part of speech), lemma, depparse (dependency parsing),
ner (named entity recognition)


In [None]:
# the English model download below requires ~0.5 Gb of memory
# For some networked devices the code below will not run


In [None]:
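# To run this cell, uncomment `import stanza` at the top of the notebook
# and the two lines below; otherwise `nlp` is not defined
#stanza.download('en') # download English model
#nlp = stanza.Pipeline('en') # initialize English neural pipeline
doc = nlp("Barack Obama was born in Hawaii.") # run annotation over a sentence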

In [None]:
print(doc) 

#### Tokenisation and sentence splitting

In [None]:
nlp = stanza.Pipeline(lang='en', processors='tokenize')
doc = nlp('This is a test sentence for stanza. This is another sentence.')

for index, sentence in enumerate(doc.sentences, start=1):
    print(f'====== Sentence {index} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

#### Lemmatisation

In [None]:
nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma')
doc = nlp('Barack Obama was born in Hawaii.')
print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')

### 2.11 Challenges

As mentioned previously, with each processing technique there are challenges associated or areas where the technique performs poorly. It is important to be aware of what these could be for each step in your processing pipeline.
<br>

#### 2.12 Punctuation
Words like Ph.D. contain a full stop that does not end the sentence, so they need to be handled as exceptions. Words like "don't" and "won't" also need to be handled with caution.

We can create rules for each of these situations (many are built in to modern preprocessing methods), learn the rules (with machine learning methods), or create lexicons (files containing text data) which have mappings for specific cases.
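
A lexicon-based approach can be sketched with two small lookups: one listing abbreviations that must not end a sentence, and one mapping contractions to their expanded form. Both lexicons and both helper functions are hypothetical and far from exhaustive.

```python
# Illustrative lexicons; real ones would be much larger
ABBREVIATIONS = {"Ph.D.", "Dr.", "e.g."}
CONTRACTIONS = {"don't": "do not", "won't": "will not"}


def split_sentences(text):
    """End a sentence at '.', '!' or '?' unless the token is an abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences


def expand_contractions(text):
    """Replace known contractions (matched in lowercase) word by word."""
    return " ".join(CONTRACTIONS.get(w.lower(), w) for w in text.split())


print(split_sentences("She has a Ph.D. in physics. Dr. Smith agrees!"))
print(expand_contractions("They won't stop"))
```

Expanding contractions before punctuation is removed avoids turning "won't" into "wont".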

#### 2.13 Consistency
Different lemmatization methods may give different results. Staying consistent throughout your work makes processing easier and keeps your results comparable. 

Good programming documentation can help with this, as well as exploratory analysis and evaluation of our chosen methods. There is rarely a one size fits all choice for text normalisation. 

#### 2.14 Stemming
Usually stemming is not preferred. If you do want to use stemming to help you find more words that are closely related, it is better to keep both the stemmed and the unstemmed version of each word. This will help you present the results at the end.
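
One way to keep both versions is a mapping from each stem to the original words that produced it. In the sketch below `crude_stem` is a hypothetical stand-in for a real stemmer, so any stemming function could be passed to `group_by_stem` instead.

```python
from collections import defaultdict


def crude_stem(word):
    """Toy stemmer used only for this illustration (not Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def group_by_stem(tokens, stem_func):
    """Map each stem to the set of original tokens that reduce to it."""
    groups = defaultdict(set)
    for token in tokens:
        groups[stem_func(token)].add(token)
    return groups


tokens = ["argued", "arguing", "argues", "walks", "walked"]
groups = group_by_stem(tokens, crude_stem)

for stem, originals in sorted(groups.items()):
    print(stem, "<-", sorted(originals))
```

Analysis can then run on the stems, while results are reported using the readable original words.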


<br>

#### Exercises
<br>

<ol>
  <li>Import the Hep Dataset and perform the following preprocessing steps to the "Text" column.</li>
    
        Lowercasing
        Remove Punctuation
        Tokenize
        Lemmatization
        Stemming
        Stopword Removal
        Remove Numbers
        Remove the words having length less than 3
        Tokenise, sentence splitting and lemmatisation using Stanza
 
</ol>

Guidelines: 

* Change the "Text" column from a list to a string before undertaking pre-processing.  <br>
* Perform the preprocessing steps in the same way as was done for the patent dataset abstract column. <br>
* Once punctuation removal, tokenisation, lemmatisation and stemming have been carried out, put the results in new columns in the df. <br> 
* Apply lemmatisation and stemming to text that has been tokenised. <br> 
* Make a copy of the df once it is loaded in, using copy(). <br>



In [None]:
hep = pd.read_pickle("../data/Hep_Dataset.pkl")

In [None]:
hep.head()

In [None]:
hep.shape

In [None]:
# Write your code here


#### References


https://mc.ai/text-preprocessing-for-nlp-and-machine-learning-tasks/