<img src='images/header.png' style='height: 50px; float: left'>

## Introduction to Computational Social Science methods with Python

# Session C3. Natural Language Processing
The field of study that focuses on the interactions between human language and computers is called natural language processing (NLP).

NLP is a field of artificial intelligence in which computers analyze, understand, and derive meaning from human language in a smart and useful way. NLP systems exploit the signals in our language to predict a range of personal attributes: people’s age (Nguyen et al., 2011; Rosenthal & McKeown, 2011), gender (Alowibdi et al., 2013; Ciot et al., 2013; Liu & Ruths, 2013), personality (Park et al., 2015), job title (Preoţiuc-Pietro et al., 2015a), income (Preoţiuc-Pietro et al., 2015b), and much more (Volkova et al., 2014, 2015).

Word embeddings have been at the forefront of recent progress in NLP, which has expanded to include flexible model architectures (Hovy, 2021). The most publicly visible example of this shift is probably the translation quality of services like Google Translate (Wu et al., 2016).

A collection of fundamental tasks appears frequently across various NLP projects (Vajjala et al., 2020). Let’s briefly introduce them (Figure 1):

<img src='images/nlp_tasks.png' style='height: 400px; float: right'>

*Language modeling* is the task of predicting what the next word in a sentence will be based on the history of previous words. The goal of this task is to learn the probability of a sequence of words appearing in a given language. Language modeling is useful for building solutions for a wide variety of problems, such as **speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction**.

*Text classification* is the task of bucketing the text into a known set of categories based on its content. Text classification is by far the most popular task in NLP and is used in a variety of tools, from **email spam identification** to **sentiment analysis**.

*Information extraction* is the task of extracting relevant information from text, such as **calendar events from emails** or the **names of people mentioned** in a social media post.

*Information retrieval* is the task of finding documents relevant to a user query from a large collection. Applications like **Google Search** are well-known use cases of information retrieval.

*Conversational agent* is the task of building dialogue systems that can converse in human languages. **Alexa** and **Siri** are some common applications of this task.

*Text summarization* aims to create short summaries of longer documents while retaining the **core content** and preserving the **overall meaning** of the text.

*Question answering* is the task of building a system that can automatically answer questions posed in natural language.

*Machine translation* is the task of converting a piece of text from one language to another. Tools like **Google Translate** are common applications of this task.

*Topic modeling* is the task of uncovering the topical structure of a large collection of documents. Topic modeling is a common text-mining tool and is used in a wide range of domains, from **literature** to **bioinformatics**.



Understanding human language is considered a difficult task due to its complexity. For example, there is an infinite number of different ways to arrange words in a sentence.

Also, words can have several meanings, and contextual information is necessary to interpret sentences correctly, because every language is unique and ambiguous. Ambiguity can be lexical or syntactic.

- In lexical ambiguity, a single word has two or more possible meanings. For example, in "I saw bats", the word "bats" can refer to animals or to sports equipment.
- In syntactic ambiguity, a single sentence or a sequence of words has multiple possible meanings. For example, "The chicken is ready to eat" can mean that the chicken is about to eat or that the chicken is ready to be eaten.

This session will help you understand basic and advanced NLP concepts and show you how to implement them using popular NLP libraries, such as <a href="https://spacy.io/">spaCy</a> and <a href="https://radimrehurek.com/gensim/">Gensim</a>.

<!-- – <a href="https://www.nltk.org/">NLTK</a>, <a href="https://spacy.io/">spaCy</a>, <a href="https://radimrehurek.com/gensim/">Gensim</a>, and <a href="https://huggingface.co/">Hugging Face</a>. -->

<div class='alert alert-block alert-success'>
<b>In this session</b>, 

you will learn the basics of natural language processing. In subsession **6.1**, we will extract useful information / facts (communication symbols) from tweets. In **6.2**, we will show how to implement a text preprocessing pipeline using XXX data, at the end of which stands a document-term matrix that is ready for analysis (such as topic modeling). In **6.3**, we will deal with word and document similarities using similarity metrics and word/document embeddings (not the pretrained ones), and we will also look at Zipf's law.
</div>

## 6.1. Extracting entities from tweet texts

### 6.1.1. Extracting patterns using Regular Expressions

<a href="https://docs.python.org/3/howto/regex.html">Regular expressions</a> (also called regex, regexes, regex pattern, regexp, or REs) are a sequence of characters that define a search pattern. They are used in programming and text processing to match and manipulate strings of text based on a specific pattern.
A regular expression is usually composed of a combination of ordinary characters, symbols, and metacharacters. Metacharacters are special characters that have a specific meaning in regular expressions, for example, the period (.), which matches any single character, or the asterisk (*), which matches zero or more occurrences of the preceding character.
Regular expressions can be used to perform a variety of operations on text data, such as searching for specific patterns, replacing text with other text, or extracting specific information from a text string.
Common use cases include matching email addresses, phone numbers, dates, and URLs.
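
To make the metacharacters described above concrete, here is a minimal sketch using Python's `re` module. The example strings are made up for illustration, and the date pattern is a deliberately simple one that only covers the `YYYY-MM-DD` layout.

```python
import re

# '.' matches any single character (except a newline by default);
# 'boot' is not matched because it has two characters between 'b' and 't'
dot_matches = re.findall(r"b.t", "bat bit bot boot")

# '*' matches zero or more repetitions of the preceding character
star_matches = re.findall(r"lo*l", "ll lol loool")

# character classes and quantifiers combined into a simple date pattern
date_matches = re.findall(r"\d{4}-\d{2}-\d{2}", "Posted on 2020-05-27, updated 2020-06-01.")

print(dot_matches)
print(star_matches)
print(date_matches)
```

Writing the patterns as raw strings (`r"..."`) keeps Python from interpreting the backslashes before the regex engine sees them.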

Regular expressions can be complex and difficult to read, but they are a powerful tool for manipulating and processing text data. Luckily, there are many resources that can help us write the correct regular expression for our task. Also, Python has a built-in module (`re`) for working with regular expressions.

<img src='images/Regular_Expressions_Cheat_Sheet.png'>

In the following examples, we will use the top 500 retweeted tweets from the TweetsCOV19 dataset, which was introduced in [Session 2: Data handling and visualization](https://github.com/gesiscss/css_methods_python/blob/main/a_introduction/2_data_handling_and_visualization.ipynb). To read and practice with this data, we need to import the necessary libraries below. If you have difficulties with importing or installing them, please check out [Session 1: Computing environment](https://github.com/gesiscss/css_methods_python/blob/main/a_introduction/1_computing_environment.ipynb) for further installation information.


In [1]:
import pandas as pd
import re
import emoji
import string

Let's import the data and visualize the first rows:

In [2]:
tweets_df = pd.read_csv('./data/TweetsCOV19/top_500_retweeted_tweets.csv', encoding = "utf-8")
tweets_df.head() 

Unnamed: 0,tweet_id,text,retweets
0,1265465820995411973,"This was me, and I want to make one thing clea...",257467
1,1266553959973445639,Mike Pence caught on hot mic delivering empty ...,135818
2,1258750892448387074,THE PANDEMIC IS STILL HAPPENING. THE PANDEMIC ...,88667
3,1263579286201446400,"This just happened on live tv. Wow, what a dou...",82495
4,1266546753182056453,Mask on,66604


In this example, we will extract all URLs from the text of the tweets. A possible regular expression to match a URL is:

`http[s]*\S+`

This regular expression matches every string that starts with `http`, optionally followed by one or more `s` characters (which covers `https`), and then by at least one non-whitespace character.
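
The pattern is intentionally permissive. As a small sketch of what that means in practice, the snippet below compares it with a slightly stricter variant that requires `://` after the scheme. The stricter pattern and the example sentence are assumptions made for illustration and are not used elsewhere in this session.

```python
import re

text = "Docs at https://t.co/349TZijtD8 and http://example.org, served by httpd."

# the permissive pattern from above: also picks up the word 'httpd.'
loose = re.findall(r"http[s]*\S+", text)

# a stricter sketch: 'http' or 'https', then '://', then non-whitespace characters
strict = re.findall(r"https?://\S+", text)

print(loose)
print(strict)
```

Both patterns keep trailing punctuation such as the comma after `http://example.org`, because `\S+` matches any non-whitespace character. For tweets, where links are usually shortened `t.co` URLs separated by spaces, the permissive version is often good enough.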

We will use the `findall` function from the Python module `re` to match all URLs in text of the tweets:

In [3]:
# create a new column that stores all URLs found in each tweet (the raw string avoids invalid-escape warnings)
tweets_df['urls'] = tweets_df['text'].apply(lambda x: re.findall(r"http[s]*\S+", x))
tweets_df['urls'].values[0]

['https://t.co/349TZijtD8']

We can also extract **mentions and hashtags** by applying appropriate regular expressions:

In [4]:
tweets_df['mentions'] = tweets_df['text'].apply(lambda x: re.findall("@[a-zA-Z0-9_]{1,50}", x))
print(tweets_df['mentions'].values[-1])

tweets_df['hashtags'] = tweets_df['text'].apply(lambda x: re.findall("#[a-zA-Z0-9_]{1,50}", x))
print(tweets_df['hashtags'].values[-5])

['@realDonaldTrump']
['#COVID']


For **emoji** extraction, in addition to regex, we will use the library called emoji (if not installed before, please install it before running the following cell). This library helps us transform emojis into the related codes (i.e., texts). Once the emojis are converted to text, we apply the same logic applied so far with regex to find them. 

The full list of emojis and related codes is available here: https://unicode.org/emoji/charts/full-emoji-list.html

Let's look at an example:

In [5]:
emoji.demojize("😂")

':face_with_tears_of_joy:'

We can apply this approach to the whole dataset:

In [6]:
def extract_emojis(text, return_codes=False):
    # first turn emojis into related text code
    text_de = emoji.demojize(text)
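    # second, find all emoji text codes; note that the class [\a-z] spans every
    # character from the bell character (\a) up to 'z', so it also matches spaces,
    # digits, uppercase letters, and ':' and can capture ordinary text between colons
    emojis_list_de = re.findall(r'(:[\a-z]+:)', text_de)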
    # reconvert text code to emojis
    list_emoji = [emoji.emojize(x) for x in emojis_list_de]

    if return_codes:
        return emojis_list_de
    else:
        return list_emoji

tweets_df['emoji'] = tweets_df['text'].apply(extract_emojis)
tweets_df['emoji_text'] = tweets_df['text'].apply(extract_emojis, return_codes=True)

tweets_df.tail()

Unnamed: 0,tweet_id,text,retweets,urls,mentions,hashtags,emoji,emoji_text
495,1264986843948277760,"People who say ‘well, he’s doing the best he c...",9033,[https://t.co/5POEhfB6vi],[],[#COVID],[],[]
496,1260425005483073538,This young woman was killed in her home for no...,9021,[https://t.co/JzPgOzm4Rm],[],[#BreonnaTaylor],[],[]
497,1259587972728533000,I be like “oh shit my mask” like I’m Batman or...,8994,[],[],[],[😂😂],[:face_with_tears_of_joy::face_with_tears_of_j...
498,1266251584461090816,Really disappointed by @SAfridiOfficial‘s comm...,8984,[],"[@SAfridiOfficial, @narendramodi]",[],[🇮🇳],[:India:]
499,1266728243236950018,Let's be clear about what's happening:\n\n→ Am...,8974,[],[@realDonaldTrump],[],[],[]


Let's see the final results from our extraction example and sort values according to mentions.

In [7]:
tweets_df.sort_values(by='mentions', ascending=False)

Unnamed: 0,tweet_id,text,retweets,urls,mentions,hashtags,emoji,emoji_text
489,1258617080430997505,A Black New York State Senator (@zellnor4ny) a...,9151,[https://t.co/NoT8g4uAli],"[@zellnor4ny, @YourFavoriteASW]",[],[],[]
464,1266956300908363776,NEW: A volunteer on Kushner's coronavirus resp...,9327,[https://t.co/jvs2h4IfNQ],[@yabutaleb7],[],[: A volunteer on Kushner's coronavirus respon...,[: A volunteer on Kushner's coronavirus respon...
347,1260559563972960256,Wow! The Front Page @washingtonpost Headline r...,11591,[],[@washingtonpost],[],[],[]
360,1262940294305071104,it would appear that @vp was joking about carr...,11196,[https://t.co/hI9cO4lxcX],[@vp],[],[],[]
412,1261718681882693632,Very happy to present this unseen image of @ta...,10245,[https://t.co/3dzvynlUq3],"[@tarak9999, @DabbooRatnani]","[#HappyBirthdayNTR, #StayHomeStaySafe]",[😎\n\n📸 By @DabbooRatnani \n\n#HappyBirthdayNT...,[:smiling_face_with_sunglasses:\n\n:camera_wit...
...,...,...,...,...,...,...,...,...
167,1256717572373913605,Update: Got her permission with a fuck yeah. T...,19289,[https://t.co/MqV0QJ0D8h],[],[],[],[]
165,1265624335898869760,"Y'all, the mask goes OVER your nose.",19351,[],[],[],[],[]
164,1258599146522464256,Because if its Baghdad its okay for this to ha...,19457,[https://t.co/UdFy61zoT5],[],[],[],[]
163,1266343312304324608,I gotta be honest the worst looting I've ever ...,19527,[],[],[],[],[]
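
Row 464 in the output above shows ordinary text (`: A volunteer on Kushner's coronavirus respon...`) in the emoji columns. This happens because the character class `[\a-z]` in `extract_emojis` covers every character from the bell character up to `z`, which includes spaces, uppercase letters, and the colon itself. The sketch below reproduces the effect on a made-up, already demojized string and compares it with a stricter pattern. The stricter class is an assumption for illustration: it would still match something like `:30:` inside a time stamp, and it misses emoji codes that contain other characters.

```python
import re

# a hypothetical, already demojized tweet text
demojized = "NEW: A volunteer on the response team: more soon :face_with_tears_of_joy:"

# pattern used in extract_emojis: greedy, and the class includes spaces and ':'
loose = re.findall(r"(:[\a-z]+:)", demojized)

# stricter sketch: only letters, digits, underscores, and hyphens between the colons
strict = re.findall(r"(:[A-Za-z0-9_\-]+:)", demojized)

print(loose)
print(strict)
```

The loose pattern returns one long match that starts at the first colon and ends at the last one, while the stricter pattern returns only the emoji code.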


As a final exercise, let's remove URLs, hashtags, mentions, and emojis from the text to prepare it for further analysis.

In [8]:
def remove_urls(text):
    # find all URLs in text using regex
    urls = re.findall(r"http[s]*\S+", text)
    # iterate through the URLs and remove them
    for url in urls:
        text = text.replace(url, "")
    return text


def remove_hashtags(text):
    # find all hashtags (starting with '#') in text using regex
    hashtags = re.findall(r"#[a-zA-Z0-9_]{1,50}", text)
    # iterate through the hashtags and remove them
    for hashtag in hashtags:
        text = text.replace(hashtag, "")
    return text


def remove_mentions(text):
    # find all mentions (starting with '@') in text using regex
    mentions = re.findall(r"@[a-zA-Z0-9_]{1,50}", text)
    # iterate through the mentions and remove them
    for mention in mentions:
        text = text.replace(mention, "")
    return text


def remove_emojis(text):
    # find all emoji in text
    emojis = extract_emojis(text, return_codes=False)
    # iterate through the emojis and remove them
    # (the loop variable is not called `emoji` to avoid shadowing the emoji module)
    for emo in emojis:
        text = text.replace(emo, "")
    return text


def clean_text(text):
    # create a cleaning pipeline 
    text = remove_urls(text)
    text = remove_hashtags(text)
    text = remove_mentions(text)
    text = remove_emojis(text)
    return text

tweets_df['cleaned_text'] = tweets_df['text'].apply(lambda x: clean_text(x))

Let's see how it worked:

In [9]:
print('Original Tweet:', tweets_df.text.values[412])
print('\n\nCleaned Tweet:', tweets_df.cleaned_text.values[412])

Original Tweet: Very happy to present this unseen image of @tarak9999 .. I hope you all like it 😎

📸 By @DabbooRatnani 

#HappyBirthdayNTR 🎉

#StayHomeStaySafe 🙏🏼 https://t.co/3dzvynlUq3


Cleaned Tweet: Very happy to present this unseen image of  .. I hope you all like it  
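
The find-then-replace loops above can also be written with `re.sub`, which substitutes every match in a single call. The following is a minimal sketch under a few assumptions: the helper name `strip_entities` is hypothetical, the whitespace normalization at the end is an extra step that the pipeline above does not perform, and emojis are left untouched because removing them relies on the `emoji` library.

```python
import re

# the same patterns as in the cleaning functions above, compiled once
URL_RE = re.compile(r"http[s]*\S+")
MENTION_RE = re.compile(r"@[a-zA-Z0-9_]{1,50}")
HASHTAG_RE = re.compile(r"#[a-zA-Z0-9_]{1,50}")


def strip_entities(text):
    # replace every URL, mention, and hashtag with an empty string
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub("", text)
    # collapse the runs of whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()


sample = ("Very happy to present this unseen image of @tarak9999 .. "
          "#StayHomeStaySafe https://t.co/3dzvynlUq3")
cleaned = strip_entities(sample)
print(cleaned)
```

Unlike `str.replace`, `pattern.sub` removes exactly the spans that the regular expression matched, so a short mention is never removed from inside a longer one.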


### 6.1.2. Extracting named entities

A named entity is a real-life object which can be identified and denoted with a proper name. Named entities can be a place, person, organization, time, object, or geographic entity. For example, named entities would be Joe Biden, New York City, and Congress. Named entities are usually instances of entity types. For example, Joe Biden is an instance of a politician/person, New York City is an instance of a place, and Congress is an instance of an organization.

**Named Entity Recognition** (NER) is the NLP task of identifying and classifying named entities. Named entities are detected in raw or structured text and classified into persons, organizations, places, money, time, etc. NER systems are developed with various linguistic approaches, as well as statistical and machine learning methods.

An NER model first identifies an entity and then assigns it to the most suitable class. Some common types of named entities are listed below; others can be found in the example of a Wikipedia page text that follows.

1. Organisations : NASA, CERN

2. Places: Istanbul, Germany

3. Money: 1 Billion Dollars, 50 Euros

4. Date: 24th January 2023, season 4

5. Person: Richard Feynman, George Floyd
 
<img src='images/NER.png' style='height: 500px; float: left'>

<div class='alert-info'>
<big><b>Insight</b></big>

    
For NLP tasks like NER, POS tagging, dependency parsing, word vectors and more, <a href="https://spacy.io/">spaCy</a> offers features that provide a clear advantage for processing and modeling text data. It is currently one of the most popular and advanced free, open-source libraries for NLP in Python.

An important point about NER models is that their ability to recognize named entities depends on the data they have been trained on. NER has many applications. For example, it can be used for content classification: the named entities in a text can be collected, and the content themes can be inferred from them.

spaCy makes NER tasks easy to run, and its pretrained models generally perform well on many types of text data. However, for research, commercial, or business-specific needs, we should consider training a model on our own data.
    
</div>

As usual, let's import necessary libraries and packages and start with a toy example from our tweets dataframe, which is the second line of the text column. 

In [10]:
import spacy 

# before loading the model, we need to download it once via: !python -m spacy download en_core_web_sm
NER = spacy.load("en_core_web_sm")

# Print the second tweet of our dataset
raw_text = tweets_df.cleaned_text[1]
print(raw_text)

Mike Pence caught on hot mic delivering empty boxes of PPE for a PR stunt. 


Now, we print the data on the Named Entities found in this raw text sample from our dataset.

In [11]:
# extract the entities using the spaCy pipeline loaded above
NER_text = NER(raw_text)

# show all the entities extracted from the text
for word in NER_text.ents:
    print(word.text, word.label_)

Mike Pence PERSON
PPE ORG


<div class='alert-info'>
<big><b>Insight</b></big>
    
Here, PPE is labeled as an organization, but whether that label is correct depends on the context. In the COVID-19 context of our example, it stands for "personal protective equipment", which is not an organization. On the other hand, as an abbreviation of the Philosophy, Politics, and Economics Society, PPE could correctly be labeled as an organization.
</div>  

Now, let's run NER on the full dataset and find out which people are mentioned most often:

In [35]:
import spacy
from collections import Counter

# Load the pre-trained model with NER
nlp = spacy.load("en_core_web_sm")

# Define a list to store the people cited
persons_cited = []

# Loop over each text and analyze it with spaCy's NER
for text in tweets_df.cleaned_text:
    # use the pipeline loaded in this cell
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            # If the entity is a person, add it to the list
            persons_cited.append(ent.text)

citations_count = Counter(persons_cited)
citations_count.most_common(5)

[('Twitter', 11),
 ('George Floyd', 9),
 ('Donald Trump', 7),
 ('Flynn', 7),
 ('Fauci', 7)]

## 6.2. Text representation: Implementing a preprocessing pipeline

## 6.2.1. Word Descriptors

To refer to the entire collection of documents/observations, we use the word *corpus* (plural *corpora*). Raw text data, often referred to as a *text corpus*, contains punctuation, suffixes, and stop words that carry little information. Text preprocessing prepares the text corpus so that it provides more useful information for NLP tasks. Let's start with the basic terminology of NLP.

### Tokens and splitting 

The set of all the unique terms in our data is called the vocabulary. Each element in this set is called a type. Each occurrence of a type in the data is called a token. 

Let's practice: Our sentence is

>“Today is a great day with learning NLP, such a powerful tool!”

This sentence has 14 tokens ('Today', 'is', 'a', 'great', 'day', 'with', 'learning', 'NLP', ',', 'such', 'a', 'powerful', 'tool', '!') but only 13 types, because 'a' occurs twice. Note that types can also include punctuation marks and multiword expressions.

In other words, tokens are the units of a text document/file that are separated by spaces and punctuation.
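
The token and type counts for a sentence like this can be checked with a few lines of Python. This is only a sketch: the regular expression below is a naive tokenizer assumed for illustration (runs of word characters, or single punctuation marks), not the tokenizer spaCy uses.

```python
import re

sentence = "Today is a great day with learning NLP, such a powerful tool!"

# naive tokenizer: runs of word characters, or single non-space punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", sentence)

# the types are the unique tokens (case-sensitive)
types = set(tokens)

print("Tokens:", len(tokens))
print("Types:", len(types))
```

Because the comparison is case-sensitive, 'Today' and 'today' would count as two different types; lowercasing the tokens first is a common design choice when that distinction does not matter.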

#### What is tokenization?
The process of extracting tokens from a text file/document is referred to as tokenization. Let's see an example of tokenization using spaCy:

In [13]:
import spacy

# Load the small English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# As a text example we will use a tweet from the previous dataset
text = tweets_df.cleaned_text[1]

# Process the text with spaCy
doc = nlp(text)

# Print the original and tokenized text
print('Original text:', text)
print('\nTokens in the text:',)

for token in doc:
    print('\t', token.text)

print('\nTotal tokens:', len(doc))


Original text: Mike Pence caught on hot mic delivering empty boxes of PPE for a PR stunt. 

Tokens in the text:
	 Mike
	 Pence
	 caught
	 on
	 hot
	 mic
	 delivering
	 empty
	 boxes
	 of
	 PPE
	 for
	 a
	 PR
	 stunt
	 .

Total tokens: 16


We can also push our analysis further and extract the vocabulary from the corpus of tweets. Since the vocabulary of a text corpus is the collection of unique tokens present in that corpus, we just need to tokenize every tweet and keep the unique tokens:

In [20]:
from collections import Counter

# Process all tweets with spaCy and extract all tokens
tokens = []
for text in tweets_df.cleaned_text:
    doc = nlp(text)
    for token in doc:
        tokens.append(token.text)

# Count the occurrences of each token and create a vocabulary of unique tokens
vocabulary = Counter(tokens)

# Print the extracted vocabulary
print("Size of extracted vocabulary: {0}".format(len(vocabulary)))

Size of extracted vocabulary: 3281


### Lemmatization
When we look up a word in a dictionary, we usually just look for the base form. This dictionary base form is called the **lemma**.
For instance, we might see forms like “go”, “goes”, “went”, “gone”, or “going”, and we would look all of them up in the dictionary under the lemma "go" (Hovy, 2020). Although these forms differ in meaning, in some contexts it is not essential to distinguish them. On the contrary, it is often more convenient to trace them back to their lemma, which can simplify the analysis and make it easier to extract relevant information from the text. Let's see an example of lemmatization applied to a tweet from our corpus using spaCy:

In [21]:
# Load the small English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# As a text example we will use a tweet from the previous dataset
text = tweets_df.cleaned_text[1]

# Process the text with spaCy and perform lemmatization
doc = nlp(text)

# Print words and extracted lemmas
for token in doc:
    print("{0} -> {1}".format(token.text, token.lemma_))

# Finally we can recover the text of the tweet after lemmatization
print('\n\nOriginal text:', text)
lemmatized_text = " ".join([token.lemma_ for token in doc])
print('Lemmatized text:', lemmatized_text)

Mike -> Mike
Pence -> Pence
caught -> catch
on -> on
hot -> hot
mic -> mic
delivering -> deliver
empty -> empty
boxes -> box
of -> of
PPE -> PPE
for -> for
a -> a
PR -> pr
stunt -> stunt
. -> .


Original text: Mike Pence caught on hot mic delivering empty boxes of PPE for a PR stunt. 
Lemmatized text: Mike Pence catch on hot mic deliver empty box of PPE for a pr stunt .


### Stemming 

Another strategy to reduce different forms of a word to a common base or root form is stemming. Stemming involves removing the suffixes of words to create a simplified form of the word. For example, a stemmer reduces "running" and "runs" to the stem "run". This can be achieved using several algorithms, like the one developed by Porter (1980). This algorithm defines a number of suffixes and the order in which they should be removed or replaced. These actions are then applied iteratively until a word is reduced to its stem.

Note that, although similar, stemming and lemmatization are different and give different results. Generally speaking, lemmatization tends to produce more accurate and meaningful results than stemming. Nonetheless, stemming is often faster and simpler to implement, which makes it useful for tasks that require real-time processing or have limited computational resources.
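
To illustrate the idea of ordered suffix rules, here is a toy stemmer. It is a sketch only and is not the Porter algorithm: the rule lists, the function name `toy_stem`, and the minimum stem length are assumptions chosen for illustration, loosely modeled on how rule-based stemmers apply their rules step by step.

```python
# each step is an ordered list of (suffix, replacement) rules
STEPS = [
    # step 1: plural endings
    [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")],
    # step 2: verb endings
    [("ing", ""), ("ed", "")],
]


def toy_stem(word, min_stem_len=3):
    word = word.lower()
    for rules in STEPS:
        for suffix, replacement in rules:
            if word.endswith(suffix):
                candidate = word[: len(word) - len(suffix)] + replacement
                # keep the word unchanged if the stem would become too short
                if len(candidate) >= min_stem_len:
                    word = candidate
                break  # only the first matching rule of a step is applied
    return word


words = ["caresses", "ponies", "cats", "walking", "walked", "sing"]
stems = [toy_stem(w) for w in words]
print(dict(zip(words, stems)))
```

The example also shows why stems are not always dictionary words ("ponies" becomes "poni") and why a guard such as the minimum stem length is needed (otherwise "sing" would be reduced to "s").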

An implementation of the Porter stemmer is available in the Python library NLTK. Let's see an example:

In [22]:
# run this to install NLTK
# !pip install nltk

In [23]:
# download popular NLTK data
# !python -m nltk.downloader popular

In [24]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# As a text example we will use a tweet from the previous dataset
text = tweets_df.cleaned_text[1]

# This performs tokenization on the text (NLTK equivalent of what we did with spaCy)
tokens = word_tokenize(text)

# Create a PorterStemmer object
stemmer = PorterStemmer()

# Apply stemming to each word in the text
stemmed_tokens = [stemmer.stem(token) for token in tokens]

# Let's see results 
for token, stem in zip(tokens, stemmed_tokens):
    print("{0} -> {1}".format(token, stem))

# Finally we can recover the text of the tweet after stemming
print('\n\nOriginal text:', text)
stemmed_text = " ".join(stemmed_tokens)
print('Stemmed text:', stemmed_text)

Mike -> mike
Pence -> penc
caught -> caught
on -> on
hot -> hot
mic -> mic
delivering -> deliv
empty -> empti
boxes -> box
of -> of
PPE -> ppe
for -> for
a -> a
PR -> pr
stunt -> stunt
. -> .


Original text: Mike Pence caught on hot mic delivering empty boxes of PPE for a PR stunt. 
Stemmed text: mike penc caught on hot mic deliv empti box of ppe for a pr stunt .


### N-grams

In natural language processing (NLP), **N-grams** are contiguous sequences of n elements from a given text sample, where an element can be a word, a character, or part of speech. In most cases, n-grams are created from a text by dragging a window of size n over the text and extracting the sequences of n elements that fall within that window.

N-grams are used in a variety of NLP tasks such as language modeling, machine translation, and text classification. By extracting n-grams from a text, it is possible to capture the local context of a word or word sequence, which can help improve the accuracy of many NLP tasks.

For example, a bigram (n=2) is "natural language", a trigram (n=3) is "natural language processing", and a 4-gram (n=4) is "natural language processing task". By examining the frequency of different n-grams in a text or corpus, it is possible to gain insight into the distribution of words and their relationships.

N-grams can also be used to generate new texts through techniques such as n-gram language modeling. In this approach, the probabilities of different N-grams in a text are used to generate a new text that is similar in style and content to the original text.

However, it should be noted that n-grams can be constrained by the sparsity problem, especially for larger values of n. That is, as the value of n increases, the number of unique n-grams in a text can increase rapidly, making it difficult to capture meaningful patterns or relationships. Therefore, choosing an appropriate value of n is an important consideration in many NLP tasks.
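
The idea behind n-gram language modeling can be sketched with plain Python. The example below estimates bigram probabilities from a toy corpus of three made-up sentences (loosely inspired by the tweets) and uses them to predict the next word. The sentence boundary markers `<s>` and `</s>` and the helper name `next_word_probs` are assumptions for illustration.

```python
from collections import Counter, defaultdict

# toy corpus, made up for illustration
corpus = [
    "the pandemic is still happening",
    "the pandemic is not over",
    "the mask goes over your nose",
]

# count how often each word follows each other word
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[prev][nxt] += 1


def next_word_probs(prev):
    # relative frequency of each word that follows `prev` in the corpus
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


probs_the = next_word_probs("the")
most_likely = max(probs_the, key=probs_the.get)
print(probs_the)
print("Most likely word after 'the':", most_likely)
```

A word that never appears in the corpus gets an empty dictionary here, which is a small-scale version of the sparsity problem described above: the model knows nothing about sequences it has not seen.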

Let's see an example of n-gram extraction applied to a tweet from our corpus using spaCy:

In [25]:
import spacy

# Load the small English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# As a text example we will use a tweet from the previous dataset
text = tweets_df.cleaned_text[1]

# Process the text with spaCy
doc = nlp(text)

# Define the function to extract n-grams
def extract_ngrams(doc, n):
    ngrams = []
    for i in range(len(doc) - n + 1):
        ngram = " ".join([doc[j].text for j in range(i, i + n)])
        ngrams.append(ngram)
    return ngrams

# Extract unigrams, bigrams, and trigrams from the text
unigrams = extract_ngrams(doc, 1)
bigrams = extract_ngrams(doc, 2)
trigrams = extract_ngrams(doc, 3)

# Print the extracted n-grams
print("Unigrams:", unigrams)
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)


Unigrams: ['Mike', 'Pence', 'caught', 'on', 'hot', 'mic', 'delivering', 'empty', 'boxes', 'of', 'PPE', 'for', 'a', 'PR', 'stunt', '.']
Bigrams: ['Mike Pence', 'Pence caught', 'caught on', 'on hot', 'hot mic', 'mic delivering', 'delivering empty', 'empty boxes', 'boxes of', 'of PPE', 'PPE for', 'for a', 'a PR', 'PR stunt', 'stunt .']
Trigrams: ['Mike Pence caught', 'Pence caught on', 'caught on hot', 'on hot mic', 'hot mic delivering', 'mic delivering empty', 'delivering empty boxes', 'empty boxes of', 'boxes of PPE', 'of PPE for', 'PPE for a', 'for a PR', 'a PR stunt', 'PR stunt .']


Alternatively, we can use Gensim, another popular library for NLP, to automatically extract the most common n-grams:

In [26]:
import gensim

# gensim expects tokenized texts as input
texts = []
for text in tweets_df.cleaned_text.values:
    texts.append(word_tokenize(text))

# extract bigrams
bigrams = gensim.models.Phrases(texts, min_count=5, threshold=100)
texts_bigrams = [bigrams[text] for text in texts]

# visualize the extracted bigrams
extracted_bigrams = []
for text in texts_bigrams:
    for el in text:
        if "_" in el:
            extracted_bigrams.append(el)

extracted_bigrams = set(extracted_bigrams)
print(extracted_bigrams)


{'&_amp', 'BREAKING_:', '_Candy', 'White_House', 'Dr._Fauci', 'George_Floyd', 'IS_STILL', 'United_States', '_Truth_Or_Dare', 'THE_PANDEMIC', 'social_distancing', 'tear_gas', '큥이_에리_기가막힌_케미스트리'}


## 6.2.2. Stopwords

In natural language processing (NLP), stop words refer to words that are frequently used in a language but usually do not have much meaning or semantic value when used in context. Examples of stop words in English are "the", "a", "an", "and", "in", "on", "is", "are", "for", "with", and so on.

Stop words are usually removed from text during preprocessing in NLP tasks such as text classification, sentiment analysis, and information retrieval. The reason is that they do not contribute much to the overall meaning or topic of a text and can potentially degrade algorithm performance by adding noise to the data. Removing stop words can also help reduce the size of vocabulary and improve the efficiency of text processing algorithms.

However, there are certain cases where the inclusion of stop words in the analysis may be useful or even necessary. For example, stop words can be useful in tasks such as authorship attribution or identifying common themes and writing styles. In such cases, it is important to carefully consider the use of stop words and their potential impact on the analysis.

We will now see a simple example of how to remove stop words from a text using spaCy:

In [27]:
from spacy.lang.en.stop_words import STOP_WORDS

# Load the small English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# As a text example we will use a tweet from the previous dataset
text = tweets_df.cleaned_text[1]

# Process the text with spaCy
doc = nlp(text)

# Define the list of stop words
stop_words = list(STOP_WORDS)

# Split the tokens into those we keep and the stop words we remove
filtered_text = [token.text for token in doc if token.text not in stop_words]
stop_words_removed = [token.text for token in doc if token.text in stop_words]

# Print the original and filtered text, and the stop words removed
print("Original tokens: ", [token.text for token in doc])
print("Filtered tokens:", filtered_text)
print("\nStop words removed: ", stop_words_removed)


Original tokens:  ['Mike', 'Pence', 'caught', 'on', 'hot', 'mic', 'delivering', 'empty', 'boxes', 'of', 'PPE', 'for', 'a', 'PR', 'stunt', '.']
Filtered tokens: ['Mike', 'Pence', 'caught', 'hot', 'mic', 'delivering', 'boxes', 'PPE', 'PR', 'stunt', '.']

Stop words removed:  ['on', 'empty', 'of', 'for', 'a']


Depending on the task, one can also add custom stop words. This can be easily done by appending additional words to the stop words list:

In [28]:
print(len(stop_words))
stop_words.append("place")
print(len(stop_words))

326
327
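
Two details are worth keeping in mind when filtering with a custom list. Comparing `token.text` against lowercase stop words is case-sensitive, so a capitalized "The" would not be removed, and membership tests are faster on a set than on a list. The sketch below illustrates both points with plain Python; the small stop word set and the token list are hypothetical and stand in for spaCy's `STOP_WORDS` and a tokenized tweet.

```python
# hypothetical, hand-picked stop words (a set gives fast membership tests)
custom_stop_words = {"the", "a", "of", "over", "your"}

tokens = ["The", "mask", "goes", "OVER", "your", "nose", "."]

# case-sensitive comparison: 'The' and 'OVER' are kept
case_sensitive = [t for t in tokens if t not in custom_stop_words]

# case-insensitive comparison: lowercase each token before the lookup
case_insensitive = [t for t in tokens if t.lower() not in custom_stop_words]

# adding a task-specific stop word to a set uses add() instead of append()
custom_stop_words.add("mask")
after_adding = [t for t in tokens if t.lower() not in custom_stop_words]

print(case_sensitive)
print(case_insensitive)
print(after_adding)
```

Whether to lowercase before the lookup depends on the task: for named entities or acronyms, case can carry information that should be preserved.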


## 6.2.3. Parts of Speech

**Part-of-speech (POS) tagging** is the process of assigning a part of speech to each word in a sentence, such as noun, verb, adjective, or adverb. POS tagging is an important step in many NLP applications, such as named entity recognition, sentiment analysis, and machine translation.

The goal of POS tagging is to identify the grammatical structure of a sentence by labelling each word with its corresponding part of speech. This information can then be used to extract meaning and context from the text. For example, knowing whether a word is a noun or a verb can help determine the subject and predicate of a sentence.

POS tagging is typically performed using machine learning algorithms, such as hidden Markov models, conditional random fields, or neural networks. These algorithms are trained on annotated text corpora in which each word is labelled with the corresponding word type. After training, the algorithm can then predict the word type for a new unseen text.

POS tagging is not always an easy task, as some words may have multiple possible word types depending on the context. For example, "run" can be a verb ("I run every morning") or a noun ("I went for a run"). In these cases, the algorithm must use contextual clues to determine the most likely part of speech for the word.
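
To make the need for context concrete, here is a minimal sketch of a most-frequent-tag baseline. It is deliberately simpler than the hidden Markov models, conditional random fields, or neural networks mentioned above, and it is not how spaCy works: it only memorizes which tag each word received most often in a toy annotated corpus. The corpus and the `baseline_tag` helper are hypothetical and written for this illustration.

```python
from collections import Counter, defaultdict

# Toy annotated corpus (hypothetical): each sentence is a list of (word, tag) pairs
annotated = [
    [("I", "PRON"), ("run", "VERB"), ("every", "DET"), ("morning", "NOUN")],
    [("They", "PRON"), ("run", "VERB"), ("fast", "ADV")],
    [("I", "PRON"), ("went", "VERB"), ("for", "ADP"), ("a", "DET"), ("run", "NOUN")],
]

# "Training": count how often each word appears with each tag
tag_counts = defaultdict(Counter)
for sentence in annotated:
    for word, tag in sentence:
        tag_counts[word.lower()][tag] += 1


def baseline_tag(words):
    """Assign each word its most frequent training tag, ignoring context."""
    tagged = []
    for word in words:
        counts = tag_counts.get(word.lower())
        # "X" is a placeholder for words never seen during training
        tag = counts.most_common(1)[0][0] if counts else "X"
        tagged.append((word, tag))
    return tagged


print(baseline_tag(["I", "went", "for", "a", "run"]))
```

Because "run" appears twice as a verb and once as a noun in the toy corpus, the baseline tags it as `VERB` even in "I went for a run". A context-aware tagger is needed to get this case right.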

Overall, POS tagging is an important technique in NLP that helps extract meaning and context from texts by identifying the grammatical structure of sentences.

English has nine main parts of speech:

- verb: expresses an action or a state of being. E.g. jump, is, write, become
- noun: names a person, a place, a thing, or an idea (common noun), or a particular one of these (proper noun). E.g. man, house, happiness
- pronoun: can replace a noun or noun phrase. E.g. she, we, they, it
- determiner: is placed in front of a noun to express a quantity or clarify what the noun refers to; in short, a noun introducer. E.g. my, that, the, many
- adjective: modifies a noun or a pronoun. E.g. pretty, old, blue, smart
- adverb: modifies a verb, an adjective, or another adverb. E.g. gently, extremely, carefully, well
- preposition: connects a noun or pronoun to other parts of the sentence. E.g. by, with, about, until
- conjunction: glues words, clauses, and sentences together. E.g. and, but, or, while, because
- interjection: expresses emotion in a sudden or exclamatory way. E.g. oh!, wow!, oops!


<img src='images/POS.png' style='height: 200px; float: center'>




We will now see a simple example of how to perform POS tagging on a text using spaCy:

In [29]:
import spacy

# load the English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# As a text example we will use a tweet from the previous dataset
text = tweets_df.cleaned_text[1]

# Process the text with spaCy
doc = nlp(text)

# iterate over each token in the doc and print its text and POS tag
for token in doc:
    print(token.text, token.pos_)


Mike PROPN
Pence PROPN
caught VERB
on ADP
hot ADJ
mic NOUN
delivering VERB
empty ADJ
boxes NOUN
of ADP
PPE PROPN
for ADP
a DET
PR NOUN
stunt NOUN
. PUNCT
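
Once tokens are tagged, the tags can be used to summarize or filter a text. The sketch below works on the (token, tag) pairs printed above, copied by hand so that it runs without spaCy. It counts how often each tag occurs and keeps only nouns and proper nouns, a filter that is sometimes applied before content analysis. The choice of `content_tags` is an assumption made for this illustration.

```python
from collections import Counter

# (token, POS) pairs copied from the output above
tagged = [
    ("Mike", "PROPN"), ("Pence", "PROPN"), ("caught", "VERB"), ("on", "ADP"),
    ("hot", "ADJ"), ("mic", "NOUN"), ("delivering", "VERB"), ("empty", "ADJ"),
    ("boxes", "NOUN"), ("of", "ADP"), ("PPE", "PROPN"), ("for", "ADP"),
    ("a", "DET"), ("PR", "NOUN"), ("stunt", "NOUN"), (".", "PUNCT"),
]

# How many tokens carry each tag?
pos_counts = Counter(pos for _, pos in tagged)

# Keep only nouns and proper nouns
content_tags = {"NOUN", "PROPN"}
nouns = [text for text, pos in tagged if pos in content_tags]

print(pos_counts.most_common())
print(nouns)
```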


If the meaning of a POS tag is not clear, we can ask spaCy to explain it:

In [30]:
spacy.explain("PROPN")

'proper noun'

Finally, let's see how spaCy POS tagging works on more tricky examples:

In [31]:
# here the word run is used as verb
text1 = "I run every morning"

# here the word run is used as a noun
text2 = "I went for a run"

# POS tagging of sentence 1
for token in nlp(text1):
    print(token.text, token.pos_)

print("\n")
# POS tagging of sentence 2
for token in nlp(text2):
    print(token.text, token.pos_)


I PRON
run VERB
every DET
morning NOUN


I PRON
went VERB
for ADP
a DET
run NOUN


We can see that spaCy correctly tags the word "run" differently in these two examples.

## 6.2.4. Bag-of-words and TF-IDF Preprocessing

So far we have introduced tokenization, stemming, lemmatization, stop words, n-grams, and part-of-speech tagging. As we have seen, these are all preprocessing techniques that aim at cleaning the text, removing unnecessary information, and extracting structure from it. As a last step of this section, we will introduce two techniques: the Bag-of-Words (BoW) approach and its natural extension, TF-IDF (Term Frequency-Inverse Document Frequency). These approaches combine the techniques described so far and prepare the texts for the actual analysis, such as topic modeling, text classification, and sentiment analysis.


In natural language processing (NLP), a **"bag of words"** is a representation of a text document that describes the occurrence of words in it. It is a simple and commonly used approach to convert text data into a numerical format that can be used for analysis and machine learning.
The bag-of-words model ignores the order and structure of the text and only considers the frequency of occurrence of each word in the document. The resulting representation is a "bag" of words in which each word is represented as a separate feature, and the value of each feature is the count of the corresponding word in the document.
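
As a minimal illustration of this idea, the sketch below builds a bag of words with Python's `collections.Counter`. The `bag_of_words` helper and the two example sentences are made up for this illustration, and whitespace splitting stands in for proper tokenization. The two sentences mean different things but contain the same words, so their bags are identical.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word."""
    return Counter(text.lower().split())


doc_a = "the dog bites the man"
doc_b = "the man bites the dog"

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

print(bow_a)
print(bow_a == bow_b)  # True: word order is discarded
```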

The bag-of-words representation of a corpus is usually stored in a matrix called the **document-term matrix**, where each row represents a document and each column represents a term (i.e., a word). The value in each cell is the number of occurrences of the corresponding term in the corresponding document. The document-term matrix is generally a sparse matrix, meaning that it contains a large number of zero elements and few non-zero ones. Indeed, most documents only contain a small fraction of the possible words in a language, and most words occur in only a subset of the documents. This means that the vast majority of the entries in the document-term matrix are zero.
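
The following sketch builds a small dense document-term matrix by hand and measures how many of its cells are zero. The three toy documents are hypothetical. Even with so few documents more than half of the cells are empty, and the share grows quickly with the size of the vocabulary.

```python
# Toy corpus of already tokenized documents (hypothetical)
docs = [
    ["iran", "attack", "base"],
    ["olympic", "skier", "medal"],
    ["attack", "troop", "base", "base"],
]

# Vocabulary: one column per distinct term, in sorted order
vocab = sorted({term for doc in docs for term in doc})
col = {term: j for j, term in enumerate(vocab)}

# Dense document-term matrix: rows are documents, columns are terms
dtm = [[0] * len(vocab) for _ in docs]
for i, doc in enumerate(docs):
    for term in doc:
        dtm[i][col[term]] += 1

# Share of cells that are zero
n_cells = len(docs) * len(vocab)
n_zero = sum(cell == 0 for row in dtm for cell in row)
sparsity = n_zero / n_cells

print(vocab)
for row in dtm:
    print(row)
print(f"Sparsity: {sparsity:.2f}")
```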

We will now see how to derive the document-term matrix using spaCy and Gensim. For this task, we will use another dataset that contains news articles. Let's import the data:



In [32]:
import pandas as pd
import numpy as np 
news = pd.read_csv("./data/news_subset.csv")
news.head()

| Unnamed: 0 | link | headline | category | short_description | authors | date |
|---|---|---|---|---|---|---|
| 0 | https://www.huffpost.com/entry/eleven-american... | 11 American Troops Injured In Iran Attack On I... | WORLD NEWS | The United States military originally said no ... | Eric Beech, Reuters | 2020-01-17 |
| 1 | https://www.huffingtonpost.com/entry/gus-kenwo... | Olympian Gus Kenworthy Burns Ivanka Trump: 'TF... | SPORTS | The first daughter led the U.S. delegation dur... | Alana Horowitz Satlin | 2018-02-25 |
| 2 | https://www.huffingtonpost.com/entry/watch-ins... | WATCH: Inspiring Woman Living with Spinal Musc... | WELLNESS | When Alyssa was just 5 months old, she was dia... | HooplaHa - Only Good News, Contributor\nHoopla... | 2013-05-20 |
| 3 | https://www.huffingtonpost.com/entry/dad-deliv... | Brent Farrell, Dad, Knocked Down Locked Door T... | PARENTING | A week before Henry's quick delivery, Katherin... | Jessica Samakow | 2012-04-09 |
| 4 | https://www.huffingtonpost.com/entry/how-polit... | How Politically Correct Culture Influences My ... | PARENTING | I may not abandon my child in the wilderness, ... | Toni Nagy, Contributor\nwriter, podcaster, ton... | 2014-01-24 |


Before deriving the document-term matrix, we preprocess the articles using the techniques described before: tokenization, lemmatization (or stemming), and stop word and punctuation removal:

In [36]:
import gensim
import spacy
import re
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from spacy.lang.en.stop_words import STOP_WORDS
nlp = spacy.load("en_core_web_sm")

def clean_text(text):

    # remove punctuation and special characters
    pattern = r"[^\w\s]"
    text_clean = re.sub(pattern, "", text)

    # remove numbers
    pattern = r"\d+"
    text_clean = re.sub(pattern, "", text_clean)

    # remove all non-ASCII characters
    pattern = r"[^\x00-\x7F]+"
    text_clean = re.sub(pattern, "", text_clean)

    # replace new line characters with spaces (whitespace is collapsed below)
    text_clean = text_clean.replace("\n", " ")

    # remove empty spaces left by regex
    text_clean = ' '.join(text_clean.split())
    
    return text_clean


def tokenization(texts):
    #return [text.split(" ") for text in texts]
    return [word_tokenize(text) for text in texts]


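def remove_stop_words(texts, stop_words=None):
    # default to spaCy's English stop words when no list is given
    if not stop_words:
        stop_words = STOP_WORDS
    # a set makes the membership test fast
    stop_words = set(stop_words)
    return [[word for word in doc if word.lower() not in stop_words] for doc in texts]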


def add_bigrams(texts):
    bigrams = gensim.models.Phrases(texts, min_count=5, threshold=100)
    return [bigrams[text] for text in texts]


def stemming(texts):
    stemmer = PorterStemmer()
    return [[stemmer.stem(word) for word in doc] for doc in texts]


def lemmatization(texts):
    texts_lemma = []
    for text in texts:
        doc = nlp(" ".join(text)) 
        texts_lemma.append([token.lemma_ for token in doc])
    return texts_lemma


def pipeline(corpus):
    print("Cleaning text...")
    corpus = [clean_text(text) for text in corpus]

    print("Tokenization...")
    corpus = tokenization(corpus)

    print("Lowercasing...")
    corpus = [[el.lower() for el in text] for text in corpus]

    print("Stop Words removal...")
    corpus = remove_stop_words(corpus)

    print("Extract bigrams...")
    corpus = add_bigrams(corpus)

    print("Stemming...")
    corpus = stemming(corpus)

    print("Stop Words removal after stemming...")
    corpus = remove_stop_words(corpus)

    print("Removing tokens that are too short...")
    corpus = [[c for c in text if len(c) > 2] for text in corpus]

    return corpus
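
Regex-based cleaning has side effects that are easy to overlook. The sketch below applies the same four patterns as `clean_text` to a made-up string and keeps every intermediate result. The `clean_steps` helper and the example string are hypothetical and only serve to make the effect of each step visible.

```python
import re


def clean_steps(text):
    """Apply the same regex steps as clean_text and keep every intermediate result."""
    steps = {"original": text}
    text = re.sub(r"[^\w\s]", "", text)         # punctuation and special characters
    steps["no punctuation"] = text
    text = re.sub(r"\d+", "", text)             # digits
    steps["no digits"] = text
    text = re.sub(r"[^\x00-\x7F]+", "", text)   # non-ASCII characters
    steps["ascii only"] = text
    text = " ".join(text.split())               # collapse whitespace, including newlines
    steps["normalized"] = text
    return steps


# "\u00e9" is the letter e with an acute accent, as in "Cafe" with an accent
example = "U.S. troops: 11 injured!\nCaf\u00e9 re-opens"
for name, value in clean_steps(example).items():
    print(f"{name:>15}: {value!r}")
```

Here "U.S." becomes "US", "re-opens" becomes "reopens", and the accented word loses its last letter. Whether such changes are acceptable depends on the corpus and on the task.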

Run the preprocessing pipeline:

In [37]:
corpus = []
for index, row in news.iterrows():
    corpus.append(row.headline + ". " + row.short_description)
corpus = np.array(corpus)

# run the preprocessing pipeline
corpus = pipeline(corpus)

Cleaning text...
Tokenization...
Lowercasing...
Stop Words removal...
Extract bigrams...
Stemming...
Stop Words removal after stemming...
Removing tokens that are too short...


We are now ready to extract the document-term matrix from the preprocessed corpus. We will use an implementation of the BoW approach in Gensim:

In [38]:
from gensim.corpora import Dictionary

# we create a dictionary
dictionary = Dictionary(corpus)

# optionally, filter very rare and very common words (disabled here)
#dictionary.filter_extremes(no_below=10, no_above=0.3)

# convert the corpus to bag-of-words format
document_term_matrix = [dictionary.doc2bow(text) for text in corpus]

In the previous cell we have created a dictionary (i.e., a mapping between integer ids and all the words appearing in the corpus) and the document-term matrix. Let's see these outputs:

In [39]:
print("Number of words in the dictionary: {0}".format(len(dictionary)))
print("Dictionary first 5 elements (id, token):", list(dictionary.items())[:5])

print("\nFirst document in bag-of-words format (raw):", document_term_matrix[0])
print("First document in bag-of-words format (word, frequency):", [[dictionary[id], freq] for id, freq in document_term_matrix[0]])

Number of words in the dictionary: 40358
Dictionary first 5 elements (id, token): [(0, 'alasad'), (1, 'american'), (2, 'attack'), (3, 'base'), (4, 'erbil')]

First document in bag-of-words format (raw): [(0, 1), (1, 1), (2, 2), (3, 2), (4, 1), (5, 1), (6, 1), (7, 2), (8, 1), (9, 1), (10, 1), (11, 3), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1)]
First document in bag-of-words format (word, frequency): [['alasad', 1], ['american', 1], ['attack', 2], ['base', 2], ['erbil', 1], ['hurt', 1], ['injur', 1], ['iran', 2], ['iraq', 1], ['iraqi', 1], ['jan', 1], ['militari', 3], ['missil', 1], ['origin', 1], ['said', 1], ['service_memb', 1], ['troop', 1], ['united_st', 1]]
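
The raw output above stores each document as a list of `(id, count)` pairs and omits every term with a zero count, which is how the sparsity of the document-term matrix is exploited. The plain-Python sketch below mimics the shape of this output on a toy corpus. It is not Gensim's implementation: the `token2id` mapping and the `to_bow` helper are hypothetical, and the way ids are assigned here (order of first appearance) may differ from what `Dictionary` does.

```python
from collections import Counter

# Toy preprocessed corpus (hypothetical tokens)
texts = [
    ["militari", "attack", "base", "militari"],
    ["attack", "troop"],
]

# Token -> integer id, assigned in order of first appearance
token2id = {}
for text in texts:
    for token in text:
        token2id.setdefault(token, len(token2id))


def to_bow(text):
    """Return sorted (token_id, count) pairs, storing only the non-zero counts."""
    counts = Counter(token2id[token] for token in text)
    return sorted(counts.items())


bows = [to_bow(text) for text in texts]
print(token2id)
print(bows)
```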


We also save the output for future analysis:

In [40]:
import pickle as pkl

with open("./output/dict_gensim.pkl", "wb") as file:
    pkl.dump(dictionary, file)

with open("./output/corpus.pkl", "wb") as file:
    pkl.dump(corpus, file)

with open("./output/document_term_matrix.pkl", "wb") as file:
    pkl.dump(document_term_matrix, file)
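
Reading the saved objects back mirrors the writing step. The sketch below shows a round trip on a small stand-in object and writes to a temporary folder so that it does not touch `./output`. The `bow_example` values are made up for illustration.

```python
import os
import pickle as pkl
import tempfile

# Stand-in object with the same shape as the bag-of-words corpus (hypothetical values)
bow_example = [[(0, 1), (1, 2)], [(1, 1), (3, 1)]]

# Write to a temporary folder so the sketch does not touch ./output
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "document_term_matrix.pkl")

    with open(path, "wb") as file:
        pkl.dump(bow_example, file)

    # Reading mirrors writing: open in binary mode ("rb") and call load
    with open(path, "rb") as file:
        restored = pkl.load(file)

print(restored == bow_example)  # True
```

Note that `open` raises an error if the target folder does not exist, so `os.makedirs("./output", exist_ok=True)` can be called before saving. Pickle files should only be loaded from sources you trust.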

Despite its simplicity, the BoW approach is a powerful technique for turning collections of text into a numerical format that can then be fed into a variety of models for several applications, including topic modeling. Nonetheless, it has some limitations, such as:

- Importance of rare words: the bag-of-words model counts every occurrence in the same way, so a rare but informative word receives no extra weight.
- Discrimination of common words: words that are frequent across the whole corpus receive high counts in most documents, although they are not very informative and do little to distinguish one document from another.

These limitations can be addressed by using the **Term Frequency-Inverse Document Frequency** (TF-IDF) matrix instead of the simple document-term matrix. The idea behind TF-IDF is to assign a weight to each word in a document based on how frequently it occurs in the document and how important it is in the overall corpus. This weight is calculated by multiplying two factors:

- **Term Frequency (TF)**: this is a measure of how often a word occurs in a document. It is calculated by dividing the number of occurrences of a word in a document by the total number of words in the document. The TF value for a word is high if it occurs very often in a document, and low if it occurs only a few times. In mathematical terms, writing $f_{t,d}$ for the number of occurrences of word $t$ in document $d$, the TF value is:

$$
tf(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}
$$

- **Inverse Document Frequency (IDF)**: this is a measure of how informative a word is across the corpus. It is calculated by dividing the total number of documents in the corpus by the number of documents containing the word. The IDF value for a word is high if it occurs in few documents and low if it occurs in many documents. In practice, the logarithm of this ratio is used. If a word appears in only a very small number of documents, the raw ratio can be very large, so the TF-IDF weight of the word would be dominated by its IDF value even if its term frequency is relatively low. Taking the logarithm compresses the range of possible IDF values and reduces the impact of very high ones. In mathematical terms, the IDF value for word $t$ in a corpus $D$ of $N$ documents is:

$$
idf(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}
$$
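
As a worked example of the compression effect, consider a hypothetical corpus of $N = 1000$ documents and three words: $t_1$ appears in every document, $t_2$ in 10 documents, and $t_3$ in a single document. The raw ratios are 1, 100, and 1000. Assuming a base-10 logarithm (the formula above does not fix the base), the IDF values are:

```latex
$$
idf(t_1, D) = \log_{10}\frac{1000}{1000} = 0, \qquad
idf(t_2, D) = \log_{10}\frac{1000}{10} = 2, \qquad
idf(t_3, D) = \log_{10}\frac{1000}{1} = 3
$$
```

A word that occurs in every document gets an IDF of zero, and the gap between the two rarer words shrinks from a factor of ten to a difference of one.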


The TF-IDF weighting $w_{t, d, D}$ for a word $t$ is then calculated by multiplying the TF value and the IDF value for that word:

$$
w_{t, d, D} = tf(t, d) \times idf(t, D)
$$
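
The three formulas can be sketched in a few lines of plain Python. This is a direct transcription of the definitions above on a hypothetical toy corpus, not the implementation used by any library. The natural logarithm is an assumption, since the formula does not fix the base.

```python
import math
from collections import Counter

# Toy tokenized corpus (hypothetical)
docs = [
    ["iran", "attack", "base", "attack"],
    ["attack", "troop"],
    ["olympic", "medal"],
]


def tf(term, doc):
    """Occurrences of term in doc divided by the number of tokens in doc."""
    return Counter(doc)[term] / len(doc)


def idf(term, docs):
    """Log of (number of documents / number of documents containing term).

    Assumes the term occurs in at least one document.
    """
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / df)


def tf_idf(term, doc, docs):
    """TF-IDF weight of term in doc, relative to the corpus docs."""
    return tf(term, doc) * idf(term, docs)


for term in ["attack", "iran"]:
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

In the first document "attack" occurs twice and "iran" once, but "attack" also appears in a second document, so "iran" ends up with the higher weight.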

The higher the TF-IDF weight, the more characteristic the word is of that document relative to the rest of the corpus. We can derive the TF-IDF matrix with Gensim, starting from the bag-of-words representation of the preprocessed corpus obtained previously:




In [41]:
from gensim.models import TfidfModel

# fit TF-IDF model
model = TfidfModel(document_term_matrix)
tf_idf = model[document_term_matrix]

Let's see the output:

In [42]:
print("\nFirst document in TF-IDF format (raw):", tf_idf[0])
print("First document in TF-IDF format (word, weight):", [[dictionary[token_id], weight] for token_id, weight in tf_idf[0]])


First document in TF-IDF format (raw): [(0, 0.3048256577720754), (1, 0.10267055678042152), (2, 0.249209864967932), (3, 0.29483215702461396), (4, 0.3048256577720754), (5, 0.15325778767051162), (6, 0.17868562942168534), (7, 0.33670071042346267), (8, 0.175367327477358), (9, 0.2023439280457968), (10, 0.18208408221404682), (11, 0.44882891302624855), (12, 0.20161212056132805), (13, 0.16219714724153844), (14, 0.0900998951919583), (15, 0.22500552666505164), (16, 0.17774119448204392), (17, 0.14904502951897228)]
First document in TF-IDF format (word, weight): [['alasad', 0.3048256577720754], ['american', 0.10267055678042152], ['attack', 0.249209864967932], ['base', 0.29483215702461396], ['erbil', 0.3048256577720754], ['hurt', 0.15325778767051162], ['injur', 0.17868562942168534], ['iran', 0.33670071042346267], ['iraq', 0.175367327477358], ['iraqi', 0.2023439280457968], ['jan', 0.18208408221404682], ['militari', 0.44882891302624855], ['missil', 0.20161212056132805], ['origin', 0.16219714724153844], ['said', 0.0900998951919583], ['service_memb', 0.22500552666505164], ['troop', 0.17774119448204392], ['united_st', 0.14904502951897228]]
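
The weights printed above do not follow the textbook formula literally. As far as we know, with its default settings Gensim's `TfidfModel` uses the raw count as term frequency, a base-2 logarithm for the IDF, and then rescales each document vector to unit Euclidean length. The last point can be checked directly on the printed values, which are copied below.

```python
import math

# TF-IDF weights of the first document, copied from the output above
weights = [
    0.3048256577720754, 0.10267055678042152, 0.249209864967932,
    0.29483215702461396, 0.3048256577720754, 0.15325778767051162,
    0.17868562942168534, 0.33670071042346267, 0.175367327477358,
    0.2023439280457968, 0.18208408221404682, 0.44882891302624855,
    0.20161212056132805, 0.16219714724153844, 0.0900998951919583,
    0.22500552666505164, 0.17774119448204392, 0.14904502951897228,
]

# Euclidean (L2) norm of the document vector
norm = math.sqrt(sum(w * w for w in weights))
print(round(norm, 6))
```

A norm of one means that weights are comparable across documents of different lengths, which plays a role similar to dividing by the document length in the TF formula.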

Finally, we save the TF-IDF object for further analysis:

In [43]:
with open("./output/tf_idf_gensim.pkl", "wb") as file:
    pkl.dump(tf_idf, file)

# References

Hovy, D. (2020). Text analysis in Python for social scientists: Discovery and exploration. Cambridge University Press.

Hovy, D. (2021). Text Analysis in Python for Social Scientists: Prediction and Classification. Cambridge University Press.

Vajjala, S., Majumder, B., Gupta, A., & Surana, H. (2020). Practical natural language processing: a comprehensive guide to building real-world NLP systems. O'Reilly Media.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

He, R., & McAuley, J. (2016, April). Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web (pp. 507-517). http://jmcauley.ucsd.edu/data/amazon/index_2014.html


https://www.machinelearningplus.com/nlp/natural-language-processing-guide/



Image Credits

[1] Vajjala, S., Majumder, B., Gupta, A., & Surana, H. (2020). Practical natural language processing: a comprehensive guide to building real-world NLP systems. O'Reilly Media. Chapter 1: https://www.oreilly.com/library/view/practical-natural-language/9781492054047/ch01.html

POS image credit: https://machinelearningknowledge.ai/tutorial-on-spacy-part-of-speech-pos-tagging/
spacy pipeline image credit: https://spacy.io/usage/linguistic-features

<div class='alert alert-block alert-success'>
<b>Document information</b>

Contact and main author: N. Gizem Bacaksizlar Turbi & Nicolò Gozzi

Contributors: Haiko Lietz & Pouria Mirelmi & ..?

Acknowledgements: ...

Version date: 27 April 2023

License: ...
</div>