<img src='images/gesis.png' style='height: 50px; float: left'>
<img src='images/social_comquant.png' style='height: 50px; float: left; margin-left: 40px'>

## Introduction to Computational Social Science methods with Python

# Session 6: Natural Language Processing
The field of study that focuses on the interactions between human language and computers is called natural language processing (NLP).

NLP is a field of artificial intelligence in which computers analyze, understand, and derive meaning from human language. NLP systems exploit the signals in our language to predict a wide range of author attributes: people’s age (Nguyen et al., 2011; Rosenthal & McKeown, 2011), gender (Alowibdi et al., 2013; Ciot et al., 2013; Liu & Ruths, 2013), personality (Park et al., 2015), job title (Preoţiuc-Pietro et al., 2015a), income (Preoţiuc-Pietro et al., 2015b), and much more (Volkova et al., 2014, 2015).

Word embeddings have been at the forefront of recent progress in NLP, which has expanded to include flexible model architectures (Hovy, 2021). The most publicly visible example of this progress is probably the translation quality of services like Google Translate (Wu et al., 2016).

A collection of fundamental tasks appears frequently across various NLP projects (Vajjala et al., 2020). Let’s briefly introduce them (Figure 1):

<img src='images/nlp_tasks.png' style='height: 400px; float: right'>

*Language modeling* is the task of predicting what the next word in a sentence will be based on the history of previous words. The goal of this task is to learn the probability of a sequence of words appearing in a given language. Language modeling is useful for building solutions for a wide variety of problems, such as **speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction**.

*Text classification* is the task of bucketing the text into a known set of categories based on its content. Text classification is by far the most popular task in NLP and is used in a variety of tools, from **email spam identification** to **sentiment analysis**.

*Information extraction* is the task of extracting relevant information from text, such as **calendar events from emails** or the **names of people mentioned** in a social media post.

*Information retrieval* is the task of finding documents relevant to a user query from a large collection. Applications like **Google Search** are well-known use cases of information retrieval.

*Conversational agent* is the task of building dialogue systems that can converse in human languages. **Alexa** and **Siri** are some common applications of this task.

*Text summarization* aims to create short summaries of longer documents while retaining the **core content** and preserving the **overall meaning** of the text.

*Question answering* is the task of building a system that can automatically answer questions posed in natural language.

*Machine translation* is the task of converting a piece of text from one language to another. Tools like **Google Translate** are common applications of this task.

*Topic modeling* is the task of uncovering the topical structure of a large collection of documents. Topic modeling is a common text-mining tool and is used in a wide range of domains, from **literature** to **bioinformatics**.



Understanding human language is considered a difficult task due to its complexity. For example, there are infinitely many ways to arrange words into sentences.

Words can also have several meanings, so contextual information is needed to interpret sentences correctly. Ambiguity can be lexical or syntactic:

- In lexical ambiguity, a single word has two or more possible meanings. For example, in "I saw bats", the word "bats" can refer to animals or to baseball bats.
- In syntactic ambiguity, a sentence or a sequence of words has multiple possible meanings. For example, "The chicken is ready to eat" can mean that the chicken is about to eat or that it is ready to be eaten.

This session will help you understand basic and advanced NLP concepts and show you how to implement them using popular NLP libraries such as <a href="https://spacy.io/">spaCy</a> and <a href="https://radimrehurek.com/gensim/">Gensim</a>.

<!-- – <a href="https://www.nltk.org/">NLTK</a>, <a href="https://spacy.io/">spaCy</a>, <a href="https://radimrehurek.com/gensim/">Gensim</a>, and <a href="https://huggingface.co/">Hugging Face</a>. -->

<div class='alert alert-block alert-success'>
<b>In this session</b>, 

you will learn about the basics of natural language processing. In subsession **6.1**, we will extract useful information and facts (communication symbols) from tweets. In **6.2**, we will show how to implement a text preprocessing pipeline using XXX data, at the end of which stands a document-term matrix that is ready for analysis (such as topic modeling). In **6.3**, we will deal with word and document similarities using similarity metrics and word/document embeddings (not pretrained ones), as well as Zipf's law.
</div>

<div class='alert alert-block alert-danger'>
<big><b>Reminder</b></big>
    
Use pip ONLY if there is no Conda option. Please make sure that ALL packages we need are installed. If you need further information on how to install and import packages and libraries, please check out 
<a href="https://github.com/gesiscss/css_methods_python/blob/main/a_introduction/1_computing_environment.ipynb"> Session 1: Computing environment </a>.
</div>

## 6.1. Extracting entities from tweet texts


### 6.1.1. Extracting facts using regular expressions

<a href="https://docs.python.org/3/howto/regex.html">Regular expressions</a> (called regex, regexes, regex pattern, regexp, or REs) specify search patterns. Typical examples of regular expressions are the patterns for matching email addresses, phone numbers, and credit card numbers.

Regular expressions are essentially a specialized programming language embedded in Python. You can work with them via the built-in `re` module, which provides several functions for matching a pattern against a string:

- `match()`
- `search()`
- `findall()`
- `finditer()`
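
The differences between these four functions are easiest to see side by side. The following is a minimal sketch on a made-up tweet; the example text and the simple hashtag pattern `#\w+` are assumptions for illustration and are not taken from the dataset.

```python
import re

# Hypothetical example tweet (not from the TweetsCOV19 dataset)
text = "Stay safe! #COVID19 update from @WHO: https://example.org #StayHome"
hashtag = re.compile(r"#\w+")

# match() only succeeds if the pattern matches at the very start of the string
print(hashtag.match(text))  # None: the text does not start with '#'

# search() scans the whole string and returns the first match object (or None)
first = hashtag.search(text)
print(first.group(), first.span())

# findall() returns all non-overlapping matches as a list of strings
all_hashtags = hashtag.findall(text)
print(all_hashtags)

# finditer() yields match objects, which also carry the match positions
for m in hashtag.finditer(text):
    print(m.group(), m.start(), m.end())
```

In the rest of this subsession we mostly rely on `findall()`, because we want every occurrence of a pattern in a tweet rather than just the first one.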

Pattern... character set...


We will use the top 500 retweeted tweets from the TweetsCOV19 dataset, which was introduced in [Session 2: Data handling and visualization](https://github.com/gesiscss/css_methods_python/blob/main/a_introduction/2_data_handling_and_visualization.ipynb). To read and work with this data, we need to import the necessary libraries below. If you have difficulties importing or installing them, please check out [Session 1: Computing environment](https://github.com/gesiscss/css_methods_python/blob/main/a_introduction/1_computing_environment.ipynb) for further installation information.


In [1]:
import pandas as pd
import re
import emoji
import string

In [2]:
tweets_df = pd.read_csv('./data/TweetsCOV19/top_500_retweeted_tweets.csv', encoding = "utf-8")

In [3]:
tweets_df.head() 

Unnamed: 0,tweet_id,text,retweets
0,1265465820995411973,"This was me, and I want to make one thing clea...",257467
1,1266553959973445639,Mike Pence caught on hot mic delivering empty ...,135818
2,1258750892448387074,THE PANDEMIC IS STILL HAPPENING. THE PANDEMIC ...,88667
3,1263579286201446400,"This just happened on live tv. Wow, what a dou...",82495
4,1266546753182056453,Mask on,66604


Let's start by using the `findall` function from the `re` module to extract **URLs, mentions, and hashtags** from the tweets (i.e., from the `text` column of our dataframe).

In [4]:
# Raw strings (r"...") pass backslashes such as \S to the regex engine unchanged
tweets_df['urls'] = tweets_df['text'].apply(lambda x: re.findall(r"https?://\S+", x))

tweets_df['mentions'] = tweets_df['text'].apply(lambda x: re.findall(r"@([a-zA-Z0-9_]{1,50})", x))
# To keep the @ sign together with the mentioned user name, use the following line instead
#tweets_df['mentions'] = tweets_df['text'].apply(lambda x: re.findall(r'@\w+', x))

tweets_df['hashtags'] = tweets_df['text'].apply(lambda x: re.findall(r"#([a-zA-Z0-9_]{1,50})", x))
# To keep the # sign together with the hashtag string, use the following line instead
#tweets_df['hashtags'] = tweets_df['text'].apply(lambda x: re.findall(r'#\w+', x))


In [5]:
tweets_df.head()

Unnamed: 0,tweet_id,text,retweets,urls,mentions,hashtags
0,1265465820995411973,"This was me, and I want to make one thing clea...",257467,[https://t.co/349TZijtD8],[],[]
1,1266553959973445639,Mike Pence caught on hot mic delivering empty ...,135818,[https://t.co/IduvGhiPwj],[],[]
2,1258750892448387074,THE PANDEMIC IS STILL HAPPENING. THE PANDEMIC ...,88667,[],[],[]
3,1263579286201446400,"This just happened on live tv. Wow, what a dou...",82495,[https://t.co/dQKheEcCvb],[],[]
4,1266546753182056453,Mask on,66604,[],[],[]
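
The order of extraction steps can matter. A URL may itself contain a `#` (a fragment identifier), which a hashtag pattern applied to the raw text would pick up as a hashtag. The sketch below illustrates this on a made-up tweet; the text and the URL are hypothetical, and the patterns are the ones used in this notebook's extraction cell in spirit rather than a definitive recipe.

```python
import re

# Hypothetical tweet: the URL contains a '#' fragment
tweet = "Read https://example.org/page#section now #COVID19"

hashtag_pattern = r"#([a-zA-Z0-9_]{1,50})"
url_pattern = r"https?://\S+"

# Extracting hashtags from the raw text also picks up the URL fragment
naive_hashtags = re.findall(hashtag_pattern, tweet)

# Removing the URLs first avoids this false positive
text_without_urls = re.sub(url_pattern, " ", tweet)
clean_hashtags = re.findall(hashtag_pattern, text_without_urls)

print(naive_hashtags)
print(clean_hashtags)
```

For the same reason, punctuation should only be stripped after the URLs have been extracted, since removing `:`, `/`, and `.` first would break the URL pattern.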


For **emoji** extraction, we will use the `emoji` library in addition to regular expressions (if it is not installed yet, please install it before running the following cell). This library converts emojis into their text codes. Once the emojis are converted to text, we can find them with a regular expression, just as we did for URLs, mentions, and hashtags.

The full list of emojis and related codes is available here: https://unicode.org/emoji/charts/full-emoji-list.html

In [6]:
def extract_emojis(text, return_codes=False):
    # first turn emojis into related text code
    text_de = emoji.demojize(text)
    # second find all emojis text code
    emojis_list_de = re.findall(r'(:[!_\-\w]+:)', text_de)
    # reconvert text code to emojis
    list_emoji = [emoji.emojize(x) for x in emojis_list_de]

    if return_codes:
        return emojis_list_de
    else:
        return list_emoji

tweets_df['emoji'] = tweets_df['text'].apply(extract_emojis)
tweets_df['emoji_text'] = tweets_df['text'].apply(extract_emojis, return_codes=True)

tweets_df.tail()

Unnamed: 0,tweet_id,text,retweets,urls,mentions,hashtags,emoji,emoji_text
495,1264986843948277760,"People who say ‘well, he’s doing the best he c...",9033,[https://t.co/5POEhfB6vi],[],[COVID],[],[]
496,1260425005483073538,This young woman was killed in her home for no...,9021,[https://t.co/JzPgOzm4Rm],[],[BreonnaTaylor],[],[]
497,1259587972728533000,I be like “oh shit my mask” like I’m Batman or...,8994,[],[],[],"[😂, 😂]","[:face_with_tears_of_joy:, :face_with_tears_of..."
498,1266251584461090816,Really disappointed by @SAfridiOfficial‘s comm...,8984,[],"[SAfridiOfficial, narendramodi]",[],[🇮🇳],[:India:]
499,1266728243236950018,Let's be clear about what's happening:\n\n→ Am...,8974,[],[realDonaldTrump],[],[],[]


Let's see the final results from our extraction example and sort values according to mentions.

In [7]:
tweets_df.sort_values(by='mentions', ascending=False)

Unnamed: 0,tweet_id,text,retweets,urls,mentions,hashtags,emoji,emoji_text
489,1258617080430997505,A Black New York State Senator (@zellnor4ny) a...,9151,[https://t.co/NoT8g4uAli],"[zellnor4ny, YourFavoriteASW]",[],[],[]
464,1266956300908363776,NEW: A volunteer on Kushner's coronavirus resp...,9327,[https://t.co/jvs2h4IfNQ],[yabutaleb7],[],[],[]
347,1260559563972960256,Wow! The Front Page @washingtonpost Headline r...,11591,[],[washingtonpost],[],[],[]
360,1262940294305071104,it would appear that @vp was joking about carr...,11196,[https://t.co/hI9cO4lxcX],[vp],[],[],[]
412,1261718681882693632,Very happy to present this unseen image of @ta...,10245,[https://t.co/3dzvynlUq3],"[tarak9999, DabbooRatnani]","[HappyBirthdayNTR, StayHomeStaySafe]","[😎, 📸, 🎉, 🙏🏼]","[:smiling_face_with_sunglasses:, :camera_with_..."
...,...,...,...,...,...,...,...,...
167,1256717572373913605,Update: Got her permission with a fuck yeah. T...,19289,[https://t.co/MqV0QJ0D8h],[],[],[],[]
165,1265624335898869760,"Y'all, the mask goes OVER your nose.",19351,[],[],[],[],[]
164,1258599146522464256,Because if its Baghdad its okay for this to ha...,19457,[https://t.co/UdFy61zoT5],[],[],[],[]
163,1266343312304324608,I gotta be honest the worst looting I've ever ...,19527,[],[],[],[],[]
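
Note that the `mentions` column holds lists, and lists are compared element by element, so the sorting above is alphabetical rather than by the number of mentions. The following sketch, on a small hypothetical stand-in for the column, shows the difference and how to rank by list length instead. In pandas, a comparable ranking could presumably be obtained by sorting on `tweets_df['mentions'].str.len()`.

```python
# Hypothetical stand-in for the 'mentions' column: one list per tweet
mentions_per_tweet = {
    "tweet_a": ["vp"],
    "tweet_b": [],
    "tweet_c": ["tarak9999", "DabbooRatnani"],
}

# Lists compare element by element, so ['vp'] sorts above the two-element list
alphabetical_wins = ["vp"] > ["tarak9999", "DabbooRatnani"]
print(alphabetical_wins)

# Sorting by list length ranks tweets by how many users they mention
ranked = sorted(
    mentions_per_tweet,
    key=lambda tweet: len(mentions_per_tweet[tweet]),
    reverse=True,
)
print(ranked)
```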


As a final exercise, let's clean the text of URLs, hashtags, mentions, and emojis for further text analysis. Note that the function below replaces each emoji with a `!` placeholder rather than deleting it.

In [8]:
def clean_links(text):
    # Raw string so that escapes such as \w and \d reach the regex engine unchanged
    link_regex = re.compile(r'((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)', re.DOTALL)
    links = re.findall(link_regex, text)
    for link in links:
        text = text.replace(link[0], ' ') # replace each link with a space
    return text

def clean_all_entities(text):
    entity_prefixes = ['#', '@'] # prefixes of hashtags and mentions
    # replace all punctuation except the entity prefixes with spaces
    for separator in string.punctuation:
        if separator not in entity_prefixes:
            text = text.replace(separator, ' ')
    # replace each emoji with a '!' placeholder (once, after the punctuation loop)
    text = emoji.replace_emoji(text, replace="!")
    # drop all words that start with '#' or '@', i.e., hashtags and mentions
    words = []
    for word in text.split():
        if word[0] not in entity_prefixes:
            words.append(word)
    return ' '.join(words)

tweets_df['cleaned_text'] = tweets_df['text'].apply(lambda x: clean_all_entities(clean_links(x)))


In [9]:
print('Original Tweet:', tweets_df.text.values[412])
print('\n\nCleaned Tweet:', tweets_df.cleaned_text.values[412])

Original Tweet: Very happy to present this unseen image of @tarak9999 .. I hope you all like it 😎

📸 By @DabbooRatnani 

#HappyBirthdayNTR 🎉

#StayHomeStaySafe 🙏🏼 https://t.co/3dzvynlUq3


Cleaned Tweet: Very happy to present this unseen image of I hope you all like it ! ! By ! !


### 6.1.2. Extracting named entities

A named entity is a real-world object that can be identified and denoted with a proper name. Named entities can be places, persons, organizations, times, objects, or geographic entities. Examples are Joe Biden, New York City, and Congress. Named entities are usually instances of entity types. For example, Joe Biden is an instance of a politician/person, New York City is an instance of a place, and Congress is an instance of an organization.

**Named Entity Recognition** (NER) is the NLP task of identifying and classifying named entities. Named entities are detected in raw or structured text and classified into categories such as persons, organizations, places, money, and time. NER systems are developed with various linguistic approaches as well as statistical and machine learning methods.

An NER model first identifies an entity and then assigns it to the most suitable class. Some common types of named entities are listed below; others appear in the example of a Wikipedia page text that follows.

1. Organisations : NASA, CERN

2. Places: Istanbul, Germany

3. Money: 1 Billion Dollars, 50 Euros

4. Date: 24th January 2023, season 4

5. Person: Richard Feynman, George Floyd
 
<img src='images/NER.png' style='height: 500px; float: left'>

<div class='alert-info'>
<big><b>Insight</b></big>

    
For NLP tasks like NER, POS tagging, dependency parsing, and word vectors, <a href="https://spacy.io/">spaCy</a> offers features that make processing and modeling text data convenient. It is a popular free and open-source library for implementing NLP in Python.
    
An important point about NER models is that their ability to recognize named entities depends on the data they have been trained on. NER has many applications. For example, it can be used for content classification: the named entities of a text are collected, and the content themes are inferred from them.
    
spaCy makes NER tasks easy. Its pretrained models generally perform well on many types of text data; however, for specific research, commercial, or business needs, we should consider training a model on our own data.
    
</div>

As usual, let's import the necessary libraries and packages and start with a toy example from our tweets dataframe: the second row of the `cleaned_text` column.

In [10]:
import spacy 

# Before loading the model, install it once via: !python -m spacy download en_core_web_sm
NER = spacy.load("en_core_web_sm")

# Print the second tweet of our dataset
raw_text = tweets_df.cleaned_text[1]
print(raw_text)

Mike Pence caught on hot mic delivering empty boxes of PPE for a PR stunt


Now, we print the data on the Named Entities found in this raw text sample from our dataset.

In [11]:
# extract the entities using the spaCy pipeline object (NER) defined in the previous cell
NER_text = NER(raw_text)

# show all the entities extracted from the text
for word in NER_text.ents:
    print(word.text, word.label_)

Mike Pence PERSON
PPE ORG


<div class='alert-info'>
<big><b>Insight</b></big>
    
Here, PPE is labeled as an organization, but the correct label depends on the context. In the COVID-19 context of our example, it stands for "personal protective equipment", which is not an organization. As an abbreviation of the Philosophy, Politics, and Economics Society, on the other hand, PPE could be labeled as an organization.
</div>  

Now, let's run NER on the full dataset and print all named entities that are found. This lets us see, for example, which persons are mentioned most often:

In [12]:
for tweet in tweets_df.cleaned_text:
    for word in NER(tweet).ents:
        print(word.text, word.label_)

Mike Pence PERSON
PPE ORG
3 CARDINAL
less than 30 CARDINAL
season 2 DATE
season 1 DATE
2 years ago DATE
one day DATE
SHE ORG
Costco ORG
a few weeks ago DATE
500 billion dollars MONEY
1200 CARDINAL
Baghdad GPE
Giving GPE
SNES Edition Nintendo Switch Just RETWEET WORK_OF_ART
24 hours TIME
Barclays ORG
ER ORG
Update FedEx ORG
19 CARDINAL
Trump ORG
That’s Simply Not True RT WORK_OF_ART
November DATE
one CARDINAL
NYPD ORG
NYPD ORG
Donald Trump PERSON
69 CARDINAL
Playbook PERSON
the Pandemic Preparedness Office ORG
75 CARDINAL
today DATE
Costco ORG
World ORG
CoronaVirus ORG
China GPE
American NORP
COVID ORG
today DATE
Manhattan GPE
Manhattan GPE
911 CARDINAL
1 4 DATE
10 years ago DATE
Mark Zuckerberg PERSON
CNN ORG
China GPE
THIS WEEK DATE
HOME Practice ORG
Watermelon Sugar Video Out ORG
China GPE
Joe Biden PERSON
the United States GPE
decades DATE
Candace Owens PERSON
Michigan GPE
Twitter PERSON
today DATE
HAIR ORG
HAIR ORG
Twitter PERSON
the 13 year olds DATE
Covid PERSON
the China Virus V
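
Reading the most frequently mentioned person off a long printout is tedious. A minimal sketch of counting instead of printing is shown below. It works on a short, hypothetical list of `(text, label)` pairs in the same form as the output above; with spaCy, such pairs could be collected as `(word.text, word.label_)` inside the loop.

```python
from collections import Counter

# Hypothetical (text, label) pairs in the same form as the printed output
entities = [
    ("Mike Pence", "PERSON"),
    ("PPE", "ORG"),
    ("Donald Trump", "PERSON"),
    ("Joe Biden", "PERSON"),
    ("Donald Trump", "PERSON"),
    ("China", "GPE"),
]

# Count only the entities that were labeled as persons
person_counts = Counter(text for text, label in entities if label == "PERSON")
print(person_counts.most_common(3))
```

Because the model also mislabels some strings (in the output above, for example, "Twitter" and "Covid" are tagged as PERSON), such counts should be inspected before they are interpreted.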

## 6.2. Text representation: Implementing a preprocessing pipeline




## 6.2.1. Word Descriptors

We use the word *corpus* (plural: corpora) to refer to an entire collection of documents or observations. Raw text data, often called a *text corpus*, contains punctuation, suffixes, and stop words that often carry little information. Text preprocessing prepares the corpus so that it provides more useful input for NLP tasks. Let's start with some basic NLP terminology.

### Tokens and splitting 

The set of all the unique terms in our data is called the vocabulary. Each element in this set is called a type. Each occurrence of a type in the data is called a token. 

Let's practice with the following sentence:

>“Today is a great day with learning NLP, such a powerful tool!”

This sentence has 14 tokens ('Today', 'is', 'a', 'great', 'day', 'with', 'learning', 'NLP', ',', 'such', 'a', 'powerful', 'tool', '!') but only 13 types, because 'a' occurs twice. Note that types can also include punctuation marks and multiword expressions.

In other words, tokens are the units of a text document, such as words and punctuation marks, that we obtain by splitting the text at spaces and punctuation.
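
The token and type counts for the example sentence can be checked with a few lines of Python. The sketch below uses a simple regular expression as a stand-in tokenizer; this is an assumption for illustration and not how spaCy tokenizes text.

```python
import re

sentence = "Today is a great day with learning NLP, such a powerful tool!"

# Simple rule: a token is either a run of word characters or one punctuation mark
tokens = re.findall(r"\w+|[^\w\s]", sentence)

# The types are the unique tokens
types = set(tokens)

print(len(tokens), "tokens")
print(len(types), "types")
```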

#### What is tokenization?
The process of extracting tokens from a text file or document is referred to as tokenization. Let's see an example of tokenization using spaCy:

In [13]:
import spacy

# Load the small English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# As a text example we will use a tweet from the previous dataset
text = tweets_df.cleaned_text[1]

# Process the text with spaCy
doc = nlp(text)

# Print the original and tokenized text
print('Original text:', text)
print('\nTokens in the text:')

for token in doc:
    print('\t', token.text)

print('\nTotal tokens:', len(doc))


Original text: Mike Pence caught on hot mic delivering empty boxes of PPE for a PR stunt

Tokens in the text:
	 Mike
	 Pence
	 caught
	 on
	 hot
	 mic
	 delivering
	 empty
	 boxes
	 of
	 PPE
	 for
	 a
	 PR
	 stunt

Total tokens: 15


We can also push our analysis further and extract the vocabulary from the corpus of tweets. Since the vocabulary of a text corpus is the collection of unique tokens in that corpus, we just need to tokenize every tweet and keep each token once. A `Counter` does this for us and additionally records how often each token occurs:

In [14]:
from collections import Counter

# Process all tweets with spaCy and extract all tokens
tokens = []
for text in tweets_df.cleaned_text:
    doc = nlp(text)
    for token in doc:
        tokens.append(token.text)

# Count the occurrences of each token and create a vocabulary of unique tokens
vocabulary = Counter(tokens)

# Print the extracted vocabulary
print(vocabulary)
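
Printing the full `Counter` quickly becomes unreadable for a real corpus. The sketch below shows, on a small hypothetical token list standing in for the tweet tokens, how the vocabulary size and the most frequent types can be inspected, and how lowercasing changes the vocabulary.

```python
from collections import Counter

# Hypothetical token list standing in for the tokens collected from all tweets
tokens = ["the", "mask", "goes", "over", "the", "nose", "Mask", "on"]

vocabulary = Counter(tokens)
print(len(vocabulary))             # number of types in the vocabulary
print(vocabulary.most_common(2))   # most frequent types with their counts

# Lowercasing merges 'Mask' and 'mask' into a single type
lowercase_vocabulary = Counter(token.lower() for token in tokens)
print(len(lowercase_vocabulary))
```

Whether to lowercase is a design choice: it shrinks the vocabulary, but it also removes information that can matter, for example for recognizing proper names.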



### Lemmatization
When we look up a word in a dictionary, we usually look for its base form. This dictionary base form is called the **lemma**.
For instance, we might encounter forms like “go”, “goes”, “went”, “gone”, or “going”, but we look them all up under the lemma "go" (Hovy, 2020). Although these forms differ in meaning, in some contexts it is not essential to distinguish them, and it is more convenient to trace them back to their lemma. This can simplify the analysis and make it easier to extract relevant information from the text. Let's see an example of lemmatization applied to a tweet from our corpus using spaCy:

In [15]:
# Load the small English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# As a text example we will use a tweet from the previous dataset
text = tweets_df.cleaned_text[1]

# Process the text with spaCy and perform lemmatization
doc = nlp(text)

# Print words and extracted lemmas
for token in doc:
    print("{0} -> {1}".format(token.text, token.lemma_))

# Finally we can recover the text of the tweet after lemmatization
print('\n\nOriginal text:', text)
lemmatized_text = " ".join([token.lemma_ for token in doc])
print('Lemmatized text:', lemmatized_text)

Mike -> Mike
Pence -> Pence
caught -> catch
on -> on
hot -> hot
mic -> mic
delivering -> deliver
empty -> empty
boxes -> box
of -> of
PPE -> PPE
for -> for
a -> a
PR -> pr
stunt -> stunt


Original text: Mike Pence caught on hot mic delivering empty boxes of PPE for a PR stunt
Lemmatized text: Mike Pence catch on hot mic deliver empty box of PPE for a pr stunt


### Stemming 

Another strategy to reduce different forms of a word to a common base or root form is stemming. Stemming removes the suffixes of words to create a simplified form of the word. For example, the stem of the words "running", "runs", and "run" is "run". This can be achieved with several algorithms, such as the one developed by Porter (1980). This algorithm defines a number of suffixes and the order in which they should be removed or replaced. These rules are then applied step by step until the word is reduced to its stem.

Although similar, stemming and lemmatization are different and give different results. Generally speaking, lemmatization tends to produce more accurate and meaningful results than stemming, because stems need not be dictionary words and irregular forms are not resolved. Nonetheless, stemming is often faster and simpler to implement, which makes it useful for tasks that require real-time processing or have limited computational resources.
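
To make the idea of ordered suffix rules concrete, here is a deliberately simplified sketch. It is NOT the Porter algorithm: the rule list, the function name `toy_stem`, and the minimum stem length are assumptions chosen for illustration only.

```python
# Ordered (suffix, replacement) rules; a toy rule set, not the Porter algorithm
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

def toy_stem(word, min_stem_length=3):
    """Lowercase the word and apply the first matching suffix rule."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            # Keep the word unchanged if the stem would become too short
            return stem if len(stem) >= min_stem_length else word
    return word

for word in ["delivering", "jumped", "ponies", "caresses", "sing", "went"]:
    print(word, "->", toy_stem(word))
```

The irregular form "went" stays unchanged, because no suffix rule can map it to "go". A lemmatizer, in contrast, can resolve such forms because it draws on vocabulary and grammatical information.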

An implementation of the Porter stemmer is available in the Python library NLTK. Let's see an example:

In [16]:
# run this to install NLTK
# !pip install nltk

In [17]:
# download popular NLTK data
# !python -m nltk.downloader popular

In [18]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# As a text example we will use a tweet from the previous dataset
text = tweets_df.cleaned_text[1]

# This performs tokenization on the text (NLTK equivalent of what we did with spaCy)
tokens = word_tokenize(text)

# Create a PorterStemmer object
stemmer = PorterStemmer()

# Apply stemming to each word in the text
stemmed_tokens = [stemmer.stem(token) for token in tokens]

# Let's see results 
for token, stem in zip(tokens, stemmed_tokens):
    print("{0} -> {1}".format(token, stem))

# Finally we can recover the text of the tweet after stemming
print('\n\nOriginal text:', text)
stemmed_text = " ".join(stemmed_tokens)
print('Stemmed text:', stemmed_text)

Mike -> mike
Pence -> penc
caught -> caught
on -> on
hot -> hot
mic -> mic
delivering -> deliv
empty -> empti
boxes -> box
of -> of
PPE -> ppe
for -> for
a -> a
PR -> pr
stunt -> stunt


Original text: Mike Pence caught on hot mic delivering empty boxes of PPE for a PR stunt
Stemmed text: mike penc caught on hot mic deliv empti box of ppe for a pr stunt


### N-grams

In natural language processing (NLP), **N-grams** are contiguous sequences of n elements from a given text sample, where an element can be a word, a character, or part of speech. In most cases, n-grams are created from a text by dragging a window of size n over the text and extracting the sequences of n elements that fall within that window.

N-grams are used in a variety of NLP tasks such as language modeling, machine translation, and text classification. By extracting n-grams from a text, it is possible to capture the local context of a word or word sequence, which can help improve the accuracy of many NLP tasks.

For example, a bigram (n=2) is "natural language", a trigram (n=3) is "natural language processing", and a 4-gram (n=4) is "natural language processing task". By examining the frequency of different n-grams in a text or corpus, it is possible to gain insight into the distribution of words and their relationships.

N-grams can also be used to generate new texts through techniques such as n-gram language modeling. In this approach, the probabilities of different N-grams in a text are used to generate a new text that is similar in style and content to the original text.

However, it should be noted that n-grams can be constrained by the sparsity problem, especially for larger values of n. That is, as the value of n increases, the number of unique n-grams in a text can increase rapidly, making it difficult to capture meaningful patterns or relationships. Therefore, choosing an appropriate value of n is an important consideration in many NLP tasks.
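
The sparsity problem can be made visible even on a tiny example. The sketch below counts n-grams in a hypothetical toy corpus, tokenized by whitespace, and reports for each n how many distinct n-grams exist and how many of them occur more than once. The helper name `ngrams` and the corpus are assumptions for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Hypothetical toy corpus, tokenized by whitespace
tokens = "the mask goes over the nose and the mask stays on".split()

for n in (1, 2, 3):
    counts = Counter(ngrams(tokens, n))
    repeated = sum(1 for count in counts.values() if count > 1)
    print(f"n={n}: {len(counts)} distinct n-grams, {repeated} occur more than once")
```

As n grows, fewer and fewer n-grams repeat, so frequency estimates for longer n-grams rest on very little evidence unless the corpus is large.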

Let's see an example of n-gram extraction applied to a tweet from our corpus using spaCy:

In [20]:
import spacy

# Load the small English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# As a text example we will use a tweet from the previous dataset
text = tweets_df.cleaned_text[1]

# Process the text with spaCy
doc = nlp(text)

# Define the function to extract n-grams
def extract_ngrams(doc, n):
    ngrams = []
    for i in range(len(doc) - n + 1):
        ngram = " ".join([doc[j].text for j in range(i, i + n)])
        ngrams.append(ngram)
    return ngrams

# Extract unigrams, bigrams, and trigrams from the text
unigrams = extract_ngrams(doc, 1)
bigrams = extract_ngrams(doc, 2)
trigrams = extract_ngrams(doc, 3)

# Print the extracted n-grams
print("Unigrams:", unigrams)
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)


Unigrams: ['Mike', 'Pence', 'caught', 'on', 'hot', 'mic', 'delivering', 'empty', 'boxes', 'of', 'PPE', 'for', 'a', 'PR', 'stunt']
Bigrams: ['Mike Pence', 'Pence caught', 'caught on', 'on hot', 'hot mic', 'mic delivering', 'delivering empty', 'empty boxes', 'boxes of', 'of PPE', 'PPE for', 'for a', 'a PR', 'PR stunt']
Trigrams: ['Mike Pence caught', 'Pence caught on', 'caught on hot', 'on hot mic', 'hot mic delivering', 'mic delivering empty', 'delivering empty boxes', 'empty boxes of', 'boxes of PPE', 'of PPE for', 'PPE for a', 'for a PR', 'a PR stunt']


## 6.2.2. Stopwords

In natural language processing (NLP), stop words refer to words that are frequently used in a language but usually do not have much meaning or semantic value when used in context. Examples of stop words in English are "the", "a", "an", "and", "in", "on", "is", "are", "for", "with", and so on.

Stop words are usually removed from text during preprocessing in NLP tasks such as text classification, sentiment analysis, and information retrieval. The reason is that they do not contribute much to the overall meaning or topic of a text and can potentially degrade algorithm performance by adding noise to the data. Removing stop words can also help reduce the size of vocabulary and improve the efficiency of text processing algorithms.

However, there are cases in which including stop words in the analysis is useful or even necessary. For example, stop words can be useful for tasks such as authorship attribution or for identifying common themes and writing styles. In such cases, it is important to carefully consider the use of stop words and their potential impact on the analysis.

We will now see a simple example of how to remove stop words from a text using spaCy:

In [28]:
import spacy

# Load the small English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# As a text example we will use a tweet from the previous dataset
text = tweets_df.cleaned_text[1]

# Process the text with spaCy
doc = nlp(text)

# The default English stop word set of the pipeline (for inspection only;
# the filtering below relies on the token attribute is_stop)
stop_words = nlp.Defaults.stop_words

# Remove stop words from the text
filtered_text = [token.text for token in doc if not token.is_stop]
stop_words_removed = [token.text for token in doc if token.is_stop]

# Print the original and filtered text, and the stop words removed
print("Original tokens: ", [token.text for token in doc])
print("Filtered tokens:", filtered_text)
print("\nStop words removed: ", stop_words_removed)


Original tokens:  ['Mike', 'Pence', 'caught', 'on', 'hot', 'mic', 'delivering', 'empty', 'boxes', 'of', 'PPE', 'for', 'a', 'PR', 'stunt']
Filtered tokens: ['Mike', 'Pence', 'caught', 'hot', 'mic', 'delivering', 'boxes', 'PPE', 'PR', 'stunt']

Stop words removed:  ['on', 'empty', 'of', 'for', 'a']

Note that spaCy's stop list also contains the word "empty", so it is removed here although it carries meaning in this tweet. This is one reason to inspect the stop list before applying it.


## 6.2.3. Parts of Speech

**Part-of-speech (POS) tagging** is the process of assigning a part of speech, such as noun, verb, adjective, or adverb, to each word in a sentence. POS tagging is an important step in many NLP applications, such as named entity recognition, sentiment analysis, and machine translation.

The goal of POS tagging is to identify the grammatical structure of a sentence by labelling each word with its corresponding part of speech. This information can then be used to extract meaning and context from the text. For example, knowing whether a word is a noun or a verb can help determine the subject and predicate of a sentence.

POS tagging is typically performed using machine learning algorithms, such as hidden Markov models, conditional random fields, or neural networks. These algorithms are trained on annotated text corpora in which each word is labelled with its part of speech. After training, the model can predict the part of speech of each word in new, unseen text.

POS tagging is not always an easy task, as some words can belong to more than one part of speech depending on the context. For example, "run" can be a verb ("I run every morning") or a noun ("I went for a run"). In these cases, the algorithm must use contextual clues to determine the most likely part of speech for the word.
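
To make the role of context concrete, here is a minimal sketch of a tagger trained on a tiny hand-annotated toy corpus (invented for this illustration). It compares a baseline that always picks a word's most frequent tag with a variant that also looks at the previously predicted tag. Both are strong simplifications and should not be read as a description of how spaCy's tagger works internally.

```python
from collections import Counter, defaultdict

# Tiny hand-annotated toy corpus (hypothetical; real taggers use large treebanks).
train = [
    [("i", "PRON"), ("run", "VERB"), ("every", "DET"), ("morning", "NOUN")],
    [("i", "PRON"), ("went", "VERB"), ("for", "ADP"), ("a", "DET"), ("run", "NOUN")],
    [("they", "PRON"), ("run", "VERB"), ("fast", "ADV")],
    [("we", "PRON"), ("run", "VERB")],
    [("a", "DET"), ("run", "NOUN"), ("is", "AUX"), ("fun", "ADJ")],
]

word_counts = defaultdict(Counter)     # word -> tag counts
context_counts = defaultdict(Counter)  # (previous tag, word) -> tag counts

for sentence in train:
    prev_tag = "<s>"  # marker for the start of a sentence
    for word, tag in sentence:
        word_counts[word][tag] += 1
        context_counts[(prev_tag, word)][tag] += 1
        prev_tag = tag

def tag_unigram(words):
    """Baseline: give each word its most frequent tag, ignoring context."""
    return [word_counts[w].most_common(1)[0][0] if w in word_counts else "NOUN" for w in words]

def tag_with_context(words):
    """Use the previously predicted tag as context; fall back to the baseline."""
    tags, prev_tag = [], "<s>"
    for w in words:
        counts = context_counts.get((prev_tag, w)) or word_counts.get(w)
        tag = counts.most_common(1)[0][0] if counts else "NOUN"  # default for unknown words
        tags.append(tag)
        prev_tag = tag
    return tags

sentence = "i went for a run".split()
print(tag_unigram(sentence))
print(tag_with_context(sentence))
```

In the toy corpus "run" is more often a verb, so the baseline tags it as a verb everywhere. The contextual variant tags it as a noun after the determiner "a".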

Overall, POS tagging is an important technique in NLP that helps extract meaning and context from texts by identifying the grammatical structure of sentences.

English has nine main parts of speech:

- verb: expresses an action or a state of being. E.g. jump, is, write, become
- noun: identifies a person, place, thing, or idea; a noun that names a particular one of these is a proper noun. E.g. man, house, happiness
- pronoun: replaces a noun or noun phrase. E.g. she, we, they, it
- determiner: is placed in front of a noun to express a quantity or to clarify what the noun refers to (in short, a noun introducer). E.g. my, that, the, many
- adjective: modifies a noun or a pronoun. E.g. pretty, old, blue, smart
- adverb: modifies a verb, an adjective, or another adverb. E.g. gently, extremely, carefully, well
- preposition: connects a noun or pronoun to other parts of the sentence. E.g. by, with, about, until
- conjunction: glues words, clauses, and sentences together. E.g. and, but, or, while, because
- interjection: expresses emotion in a sudden or exclamatory way. E.g. oh!, wow!, oops!


<img src='images/POS.png' style='height: 200px; float: center'>




We will now see a simple example of how to perform POS tagging on a text using spaCy:

In [33]:
import spacy

# load the English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# As a text example we will use a tweet from the previous dataset
text = tweets_df.cleaned_text[1]

# Process the text with spaCy
doc = nlp(text)

# iterate over each token in the doc and print its text and POS tag
for token in doc:
    print(token.text, token.pos_)


Mike PROPN
Pence PROPN
caught VERB
on ADP
hot ADJ
mic NOUN
delivering VERB
empty ADJ
boxes NOUN
of ADP
PPE PROPN
for ADP
a DET
PR NOUN
stunt NOUN


If the meaning of a POS tag is not clear, we can ask spaCy to explain it:

In [34]:
spacy.explain("PROPN")

'proper noun'

Finally, let's see how spaCy POS tagging works on more tricky examples:

In [40]:
# here the word run is used as verb
text1 = "I run every morning"

# here the word run is used as a noun
text2 = "I went for a run"

# POS tagging of sentence 1
for token in nlp(text1):
    print(token.text, token.pos_)

print("\n")
# POS tagging of sentence 2
for token in nlp(text2):
    print(token.text, token.pos_)


I PRON
run VERB
every DET
morning NOUN


I PRON
went VERB
for ADP
a DET
run NOUN


We can see that spaCy correctly tags the word "run" differently in these two examples.

<div class='alert-info'>
<big><b>Insight</b></big>
    
The <a href="https://docs.python.org/3/library/string.html">Python string</a> module contains constants, utility functions, and classes for string manipulation. It is part of the standard library, so it does not need to be installed, but we have to import it before using any of its constants and classes.

<a href="https://spacy.io/">spaCy</a> is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more that we will practice in the following sections.
</div>

<img src='images/spacy_pipeline.svg' style='height: 200px; float: right'>

In this stage, we will use spaCy to lemmatize our tweets dataset and remove stop words from it. We will also list the stop words for inspection: depending on our research questions, we can keep some of them (e.g., 'us' vs. 'them') or remove them all.

In [None]:
import spacy
import en_core_web_sm
from spacy.lang.en.stop_words import STOP_WORDS

# Build a list of spaCy's English stop words for filtering
stopwords = list(STOP_WORDS)
print(stopwords)

Now, we will define the 'lemmatize_text' function, which lemmatizes the tokens of a text and filters out stop words and punctuation, while pronouns are kept in their original (lowercased) form. To do so, the raw text is first converted into a spaCy object by passing it to `nlp`.

In [None]:
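import string
import en_core_web_sm

# Punctuation characters to filter out (a set gives exact membership tests)
punctuations = set(string.punctuation)

# Load the small English pipeline
nlp = en_core_web_sm.load()

def lemmatize_text(text):
    """Return the lowercased lemmas of `text` without stop words and punctuation."""
    doc = nlp(text)
    # spaCy v2 lemmatizes every pronoun to "-PRON-"; in that case keep the lowercased word
    tokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in doc ]
    # `stopwords` is the list built in the previous cell; empty strings are dropped as well
    tokens = [ word for word in tokens if word and word not in stopwords and word not in punctuations ]
    return tokens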

With the following line, we apply the 'lemmatize_text' function, which removes stop words and lemmatizes tokens, to ...

INTRODUCE TEXT CORPUS NOW (AMAZON REVIEWS?) -- DON'T USE TWEETS

In [None]:
#df['lemmatized_text'] = df['cleaned_text'].apply(lambda x: lemmatize_text(x))

In [None]:
#print(df['lemmatized_text'])

## 6.2.3. Parts of Speech
**Part-of-speech tagging** (POS tagging) is the process of assigning each word in a text to a particular part of speech based on both its definition and its context. Put simply, POS tagging identifies words as nouns, pronouns, verbs, adjectives, etc.



<div class='alert alert-block alert-danger'>
<b>Gizem's note:</b>
    
Pouria, can you please check other images for POS tagging that you find it beautiful and comprehensive that shows all the tags? I found this now but we can change. 
</div>

To apply POS tagging to our dataset, we first tag the tokens of each text. We then build a dataframe with the columns "text", "pos", "tag", and "explain_tag" and add one row per token.

In [None]:
cols = ["text", "pos", "tag", "explain_tag"]
rows = []

for text in df['text']:
    doc = nlp(text)
    for token in doc:
        # coarse-grained POS, fine-grained tag, and a readable explanation of the tag
        row = (token.text, token.pos_, token.tag_, spacy.explain(token.tag_))
        rows.append(row)

# one row per token
pos_df = pd.DataFrame(rows, columns=cols)
 

In [None]:
pos_df.head(50)

### 6.2.4. n-gram extraction

<div class='alert-info'>
<big><b>Insight</b></big>
    
<a href="https://textacy.readthedocs.io/en/0.12.0/api_reference/extract.html">textacy</a> is a Python library for performing a variety of natural language processing (NLP) tasks, built on the high-performance spaCy library. With the fundamentals (tokenization, part-of-speech tagging, dependency parsing, etc.) delegated to spaCy, textacy focuses primarily on the tasks that come before and after. We will use textacy to extract basic n-grams. If you have not yet installed this library, please do so before moving on to the following exercise.
     
</div>
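
To clarify what an n-gram is, here is a minimal pure-Python sketch (not textacy's implementation) that slides a window of n tokens over a made-up sentence and keeps the n-grams that occur at least a given number of times. The function names `extract_ngrams` and `frequent_ngrams` are hypothetical helpers introduced for this illustration. textacy offers additional filtering options that this sketch leaves out.

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def frequent_ngrams(tokens, n, min_freq=1):
    """Count n-grams and keep those occurring at least min_freq times."""
    counts = Counter(extract_ngrams(tokens, n))
    return {ngram: c for ngram, c in counts.items() if c >= min_freq}

# Hypothetical example sentence, whitespace-tokenized
tokens = "new york is big and new york is busy".split()

print(extract_ngrams(tokens, 2)[:3])           # the first three bigrams
print(frequent_ngrams(tokens, 3, min_freq=2))  # trigrams that occur at least twice
```

Only the trigram ("new", "york", "is") appears twice in this sentence, so it is the only one that survives the frequency filter.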

In [None]:
import textacy

ngrams_list = []

for text in df['cleaned_text']:
    doc = nlp(text)
    # the second argument is n (here 3, i.e. trigrams); change it depending on your question
    # min_freq=2 keeps only n-grams that occur at least twice in the document
    ngrams = list(textacy.extract.basics.ngrams(doc, 3, min_freq=2))
    if ngrams:
        ngrams_list.append(ngrams)

In [None]:
ngrams_list

<div class='alert alert-block alert-danger'>
<b>Gizem's note:</b>
    
Or, we can introduce gensim here??... another option is shown as well. we can discuss. 
</div>

In [None]:
import pprint

import gensim
from gensim.parsing.preprocessing import preprocess_string, remove_stopwords, stem_text

# tokenize documents with gensim's tokenize() function
tokens = [list(gensim.utils.tokenize(doc, lower=True)) for doc in df['text']]

# pre-process the tokens first (remove stop words, then stem) so that the
# bigram model is trained on the same tokens it is later applied to
# NOTE: this can be done better
CUSTOM_FILTERS = [remove_stopwords, stem_text]
tokens = [preprocess_string(" ".join(doc), CUSTOM_FILTERS) for doc in tokens]

# build the bigram model; low min_count and threshold values make it merge word pairs readily
bigram_mdl = gensim.models.phrases.Phrases(tokens, min_count=1, threshold=2)

# apply the bigram model to the tokens
bigrams = bigram_mdl[tokens]

pprint.pprint(list(bigrams))

In [None]:
# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
for text in bigrams:
    for token in text:
        frequency[token] += 1

# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in bigrams]
pprint.pprint(processed_corpus)

In [None]:
from gensim import corpora

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

In [None]:
pprint.pprint(dictionary.token2id)


### Missing: tf*idf, systematic pipeline, save document-term matrix at each step to TSV

<div class='alert alert-block alert-danger'>
<b>Gizem's note:</b>
Pouria, can you add here visualizations for the word count after we talk with Haiko about preprocessing texts if we need more cleaning before plotting?
  
</div>

## 6.3. Understanding the meaning: Similarity of words and documents 

<div class='alert alert-block alert-danger'>
<big><b>Notes from the content document(remove after finalizing it):</b></big>
    
(we need a large corpus for this; https://ai.googleblog.com/2017/01/a-large-corpus-for-supervised-word.html proposes disambiguation of “stock”: pick the largest pages from https://en.wikipedia.org/wiki/Stock_(disambiguation))
Cosine
Word2vec (not pretrained embeddings); using gensim (Hovy)
</div>
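
The notes above list cosine similarity as one of the measures for this section. As a minimal sketch, assuming documents are represented as raw word-count vectors (the word2vec embeddings also mentioned in the notes are not shown), it can be computed with the standard library alone. The three example sentences are made up and pick up the ambiguity of "stock" mentioned above.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    # Only words that occur in both documents contribute to the dot product
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)

# Hypothetical example documents, whitespace-tokenized
doc1 = "the stock market fell".split()
doc2 = "the stock market rose".split()
doc3 = "we cooked chicken stock".split()

print(cosine_similarity(doc1, doc2))
print(cosine_similarity(doc1, doc3))
```

The two finance sentences share three of their four words and score higher than the finance and cooking pair, which shares only the ambiguous word "stock". Word counts alone cannot tell the two senses of "stock" apart.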

Some suggested inspiration:

Measurement of Text Similarity A Survey: https://www.mdpi.com/2078-2489/11/9/421

https://medium.com/@adriensieg/text-similarities-da019229c894

https://newscatcherapi.com/blog/ultimate-guide-to-text-similarity-with-python

<div class='alert alert-block alert-danger'>
<big><b>Gizem's note:</b></big>
    
From this point on, old notebook notes are here that I didn't want to remove in case we would like to reuse some of the information below.
</div>

<div class='alert alert-block alert-warning'>
<b>Additional resources</b>

#### NLTK (Natural Language Toolkit)

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum. For more details, check out <a href="https://www.nltk.org/">NLTK</a>'s webpage.
    
#### gensim

Gensim is a free open-source Python library for representing documents as semantic vectors as efficiently (computer-wise) as possible. It was developed for topic modelling and supports NLP tasks such as word embeddings, <a href="https://radimrehurek.com/gensim/models/ldamodel.html">LDA Topic Modeling</a>, and the detection of <a href="https://radimrehurek.com/gensim/models/phrases.html">Bigrams/Trigrams</a>. For more details, check out <a href="https://radimrehurek.com/gensim/auto_examples/index.html#documentation">Gensim</a>'s webpage.
    
#### Hugging Face Transformers

Transformers is a library developed by <a href="https://huggingface.co/">Hugging Face</a> that provides state-of-the-art pretrained models for high-level NLP tasks. It is one of the most widely used libraries in the NLP community and natively supports PyTorch- and TensorFlow-based models, which increases its applicability in the deep learning community. <a href="https://arxiv.org/abs/1810.04805">BERT</a> and <a href="https://arxiv.org/abs/1907.11692">RoBERTa</a> are two of the best-known models available through the library, which is used for machine translation, question answering, and many other applications.

The Hugging Face pipeline API provides a quick and simple way to perform a range of NLP tasks, and the library supports GPUs, which can speed up training and inference considerably. Check out their <a href="https://huggingface.co/docs/transformers/main_classes/pipelines">Pipelines</a> to see the more than ten tasks that can be performed essentially as one-liners. Their model repository is vast.
    
</div>

<div class='alert-info'>
<big><b>Haiko</b></big>

At this point it occurs to me that just providing the commands for how things can be done is not enough. We also want to teach how users can preprocess their corpus and save intermediate steps like "corpus.txt" > "corpus_stemmed.txt" > "corpus_stemmed_nostopwords.txt" > ...

In general, do we need spacy, nltk, and gensim to work along this pipeline? Even if not, we should tell in the session why we introduce all packages.
</div>

### Word frequency analysis

<div class='alert-info'>
<big><b>Haiko</b></big>

Yes, I think it should be here. Hovy has it in the "text representation" section. It is then a statistic of the document-term matrix for example.
</div>




<div class='alert alert-block alert-danger'>
<b>Gizem's note:</b>

To session 10! Hugging Face's Transformers: State-of-the-art NLP
Currently, Hugging Face Transformers supports PyTorch and TensorFlow 2.0. We can use the Hugging Face transformers to implement summarization, text generation, language translation, chatbots...

</div>

# References

Hovy, D. (2020). Text analysis in Python for social scientists: Discovery and exploration. Cambridge University Press.

Hovy, D. (2021). Text analysis in Python for social scientists: Prediction and classification. Cambridge University Press.

Vajjala, S., Majumder, B., Gupta, A., & Surana, H. (2020). Practical natural language processing: A comprehensive guide to building real-world NLP systems. O'Reilly Media.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

He, R., & McAuley, J. (2016, April). Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web (pp. 507-517). http://jmcauley.ucsd.edu/data/amazon/index_2014.html


https://www.machinelearningplus.com/nlp/natural-language-processing-guide/



Image Credits

[1] Vajjala, S., Majumder, B., Gupta, A., & Surana, H. (2020). Practical natural language processing: a comprehensive guide to building real-world NLP systems. O'Reilly Media. Chapter 1: https://www.oreilly.com/library/view/practical-natural-language/9781492054047/ch01.html

POS image credit: https://machinelearningknowledge.ai/tutorial-on-spacy-part-of-speech-pos-tagging/
spacy pipeline image credit: https://spacy.io/usage/linguistic-features

<div class='alert alert-block alert-success'>
<b>Document information</b>

Contact and main author: N. Gizem Bacaksizlar Turbic & ..?

Contributors: Haiko Lietz & Pouria Mirelmi & ..?

Acknowledgements: ...

Version date: XX. February 2023

License: ...
</div>

#### Notes to be removed before publication

Reviewers: Indira, Olya, or Veronika?

Review intro

Review and finish red boxes

Add insight boxes more?