<img src='social_comquant.png' style='height: 60px; float: left'>
<img src='gesis.png' style='height: 60px; float: right; margin-right: 40px'>
<img src='isi.png' style='height: 60px; float: right; margin-right: 20px'>  

-------------------------------------------------------------------------------------------------------------------------------

# Section C: Data preprocessing methods | Session 6: Natural Language Processing

### Authors: N. Gizem Bacaksizlar Turbic, Haiko Lietz, Pouria Mirelmi, and ..

### Date: ? October 2022

-------------------------------------------------------------------------------------------------------------------------------

# 1. Introduction

The field of study that focuses on the interactions between human language and computers is called natural language processing (NLP).

NLP is a field of artificial intelligence in which computers analyze, understand, and derive meaning from human language in a smart and useful way. NLP systems exploit the signals in our language to predict characteristics of the people who produce it, such as their age (Nguyen et al., 2011; Rosenthal & McKeown, 2011), gender (Alowibdi et al., 2013; Ciot et al., 2013; Liu & Ruths, 2013), personality (Park et al., 2015), job title (Preoţiuc-Pietro et al., 2015a), income (Preoţiuc-Pietro et al., 2015b), and much more (Volkova et al., 2014, 2015).

Word embeddings have been at the forefront of recent progress in NLP, which has expanded to include flexible model architectures (Hovy, 2021). The most publicly visible example of this progress is probably the translation quality of services like Google Translate (Wu et al., 2016).

NLP has a wide range of real-world applications:

- Sentiment Analysis
- Speech Recognition
- Text Classification
- Machine Translation
- Semantic Search
- News/article Summarization
- Question Answering

This session will help you understand basic and advanced NLP concepts and show you how to implement them using popular NLP libraries: <a href="https://www.nltk.org/">`NLTK`</a>, <a href="https://spacy.io/">`spaCy`</a>, <a href="https://pypi.org/project/gensim/">`Gensim`</a>, and <a href="https://huggingface.co/">`Hugging Face`</a>.

Notes: Should we start with regex here? Or should it be in the second session?
https://github.com/gesiscss/css_methods_python/blob/main/a_introduction/2_data_handling_and_visualization.ipynb

## 1.1. NLTK

In [1]:
# Let's first make sure that nltk is installed.
!pip install nltk



In [2]:
# import nltk and download the tokenizer data
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\bacaksgm\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## 1.2. spaCy

<a href="https://spacy.io/">`spaCy`</a> is a popular and actively developed library for implementing NLP. Its features give it clear advantages for processing and modeling text data.

In [3]:
# again, make sure that spacy is installed and uncomment the code line below
# !pip install spacy

In [4]:
# import spacy and create nlp object with loading the models and data for English Language

import spacy
nlp = spacy.load('en_core_web_sm') # if you get an error, run python -m spacy download en_core_web_sm in your Anaconda prompt

## 1.3. gensim

<a href="https://pypi.org/project/gensim/">`Gensim`</a> was developed for topic modelling and also supports other NLP tasks such as word embeddings. Text summarization was available in versions before 4.0.

In [5]:
# again, make sure that gensim is installed and uncomment the code line below
# !pip install gensim

In [6]:
# import gensim
import gensim

## 1.4. Transformers

The `transformers` library was developed by <a href="https://huggingface.co/">`Hugging Face`</a> and provides state-of-the-art models. It is an advanced library known for its transformer modules.

In [7]:
# Install the package if you haven't done so, please uncomment the code line below
# !pip install transformers

# 2. Text Preprocessing methods

To refer to the entire collection of documents/observations, we use the word corpus (plural corpora). The raw text data, often referred to as a *text corpus*, contains punctuation, suffixes, and stop words that carry little useful information. Text preprocessing prepares the text corpus so that it provides more useful information for NLP tasks.

Let's walk through some basic steps of preprocessing of a raw text corpus!

In [8]:
# Raw text can be converted into a spaCy object with the nlp object created earlier (1.2. spaCy).
# Convert raw text into a spaCy object
raw_text = 'Today is a great day with learning NLP, such a powerful tool!'
text_doc = nlp(raw_text)

## 2.1. Word Descriptors

### 2.1.1 Tokens and splitting 

### What is a Token?

The set of all the unique terms in our data is called the vocabulary. Each element in this set is called a type. Each occurrence of a type in the data is called a token. 

Let's practice: Our sentence “Today is a great day with learning NLP, such a powerful tool!” has 14 tokens (namely, 'Today', 'is', 'a', 'great', 'day', 'with', 'learning', 'NLP', ',', 'such', 'a', 'powerful', 'tool', '!') but only 13 types, because 'a' occurs twice. Note that types can also include punctuation marks and multiword expressions.

In other words, the words of a text document/file, separated by spaces and punctuation, are called tokens.

### What is Tokenization?
The process of extracting tokens from a text file/document is referred to as tokenization.
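
As a minimal sketch of tokenization and of the difference between tokens and types, the cell below splits our example sentence with a regular expression from Python's standard library. The pattern and the variable names are our own simplifying assumptions for illustration; this is not how spaCy or NLTK tokenize internally.

```python
import re

example_sentence = "Today is a great day with learning NLP, such a powerful tool!"

# \w+ matches a run of word characters (a word);
# [^\w\s] matches a single character that is neither a word character
# nor whitespace (i.e., a punctuation mark)
regex_tokens = re.findall(r"\w+|[^\w\s]", example_sentence)

# the vocabulary is the set of unique tokens; each element is a type
vocabulary = set(regex_tokens)

print(regex_tokens)
print(len(regex_tokens), "tokens,", len(vocabulary), "types")
```

The sentence yields 14 tokens but 13 types because 'a' occurs twice. Library tokenizers apply many more rules (for example for contractions such as "Let's"), so this pattern should only be read as a first approximation.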



In [9]:
# We can print text of the tokens by accessing token.text while using spaCy
# printing tokens
tokens = []
for token in text_doc:
    print(token.text)
    tokens.append(token.text)
    
len(tokens)    

Today
is
a
great
day
with
learning
NLP
,
such
a
powerful
tool
!


14

In [10]:
# What if we want to check whether each token consists of alphabetic characters?

for token in text_doc:
    print(token.text, token.is_alpha)


Today True
is True
a True
great True
day True
with True
learning True
NLP True
, False
such True
a True
powerful True
tool True
! False


In [11]:
# What if we want to know whether a token is a space, a stop word, or punctuation?
print("Text".ljust(10), ' ', "Alpha", "Space", "Stop", "Punct")
for token in text_doc:
    print(token.text.ljust(10), ':', token.is_alpha, token.is_space, token.is_stop, token.is_punct)

Text         Alpha Space Stop Punct
Today      : True False False False
is         : True False True False
a          : True False True False
great      : True False False False
day        : True False False False
with       : True False True False
learning   : True False False False
NLP        : True False False False
,          : False False False True
such       : True False True False
a          : True False True False
powerful   : True False False False
tool       : True False False False
!          : False False False True


#### Let's try tokenization with nltk.
nltk does not ship with all of its data; you need to call nltk.download() to fetch the missing resources. We will also keep using spaCy to show the similar features that these libraries have.

In [12]:
import nltk
# nltk.download('punkt')

sentence = "At 10:30 o'clock on Monday mornings, we have Social ComQuant meetings. Let's have our meeting another time."
tokens = nltk.tokenize.word_tokenize(sentence)
print(tokens)
print()
print(" ".join(tokens))

['At', '10:30', "o'clock", 'on', 'Monday', 'mornings', ',', 'we', 'have', 'Social', 'ComQuant', 'meetings', '.', 'Let', "'s", 'have', 'our', 'meeting', 'another', 'time', '.']

At 10:30 o'clock on Monday mornings , we have Social ComQuant meetings . Let 's have our meeting another time .


In [13]:
# convert string to upper case characters
sentence.upper() 

"AT 10:30 O'CLOCK ON MONDAY MORNINGS, WE HAVE SOCIAL COMQUANT MEETINGS. LET'S HAVE OUR MEETING ANOTHER TIME."

In [14]:
# convert string to lower case characters
sentence.lower() 

"at 10:30 o'clock on monday mornings, we have social comquant meetings. let's have our meeting another time."

#### Let's remove punctuation and stop words. But, wait, what do we mean by stop words?
Stop words are a set of commonly used words in a language. For example, in English, “the”, “is”, and “and” would easily qualify as stop words. In NLP and text mining applications, stop word lists are used to filter out these uninformative words, allowing applications to focus on the important words instead.
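
Before turning to the library stop word lists, here is a minimal, library-free sketch of the idea. The tiny stop word set `TOY_STOP_WORDS` is a hand-picked assumption for illustration only; real lists, such as the ones shipped with nltk and spaCy, are much longer.

```python
import re

# a hand-picked toy stop word set (an assumption, not a standard list)
TOY_STOP_WORDS = {"the", "is", "and", "a", "with", "such"}

toy_sentence = "Today is a great day with learning NLP, such a powerful tool!"

# lowercase the text and keep only runs of word characters (this drops punctuation)
toy_tokens = re.findall(r"\w+", toy_sentence.lower())

# keep only the tokens that are not in the stop word set
content_tokens = [t for t in toy_tokens if t not in TOY_STOP_WORDS]

print(content_tokens)
```

Lowercasing before the lookup matters here because the toy set only contains lowercase entries.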

In [15]:
# Access the built-in stop words in nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bacaksgm\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [16]:
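# Access the built-in stop words in spaCy
# import the set directly so that we do not overwrite nltk's `stopwords` imported above
from spacy.lang.en.stop_words import STOP_WORDS
list_stopwords = list(STOP_WORDS)

# print the first five entries; STOP_WORDS is a set, so the order may differ between runs
for word in list_stopwords[:5]:
    print(word)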

herein
namely
see
and
can


In [17]:
# Filter out the stopwords
filtered_text= [token for token in text_doc if not token.is_stop]

# Count the tokens after removal of stopwords
token_count_without_stopwords=0
for token in filtered_text:
    print(token)
    token_count_without_stopwords+=1

Today
great
day
learning
NLP
,
powerful
tool
!


In [18]:
# Remove punctuation
filtered_text = [token for token in filtered_text if not token.is_punct]

token_count_without_stop_and_punct = 0

# Count the remaining tokens
for token in filtered_text:
    print(token)
    token_count_without_stop_and_punct += 1

Today
great
day
learning
NLP
powerful
tool


### 2.1.2. Lemmatization

When we look up a word in a dictionary, we usually just look for the base form. This dictionary base form is called the lemma.
For instance, we might see forms like “go,” “goes,” “went,” “gone,” or “going”, but in a dictionary we look them up under the lemmatized form "go" (Hovy, 2020).
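
To make the idea of a dictionary lookup concrete, the following toy sketch maps the inflected forms from the example to their lemma. The lookup table `TOY_LEMMA_TABLE` and the function `toy_lemmatize` are hypothetical names made up for this illustration; real lemmatizers rely on large lexicons and on part-of-speech information.

```python
# hypothetical lookup table: inflected form -> lemma (illustration only)
TOY_LEMMA_TABLE = {
    "goes": "go",
    "went": "go",
    "gone": "go",
    "going": "go",
}

def toy_lemmatize(token):
    """Return the lemma from the toy table, or the lowercased token if it is unknown."""
    key = token.lower()
    return TOY_LEMMA_TABLE.get(key, key)

forms = ["go", "goes", "went", "gone", "going"]
lemmas = [toy_lemmatize(form) for form in forms]
print(lemmas)
```

Note that an irregular form such as "went" can only be resolved by a lookup of this kind; removing word endings alone would never turn it into "go".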

In [19]:
# Let's give an example with nltk
# import nltk already imported
# import the lemmatizer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

# Remember our sentence with the Social ComQuant meeting.
sentence = "At 10:30 o'clock on Monday mornings, we have Social ComQuant meetings. Let's have our meeting another time."

WNL = WordNetLemmatizer() # declaring an instance of our preprocessor.
tokens = nltk.tokenize.word_tokenize(sentence)

lemmatized_tokens = []
for t in tokens:
    t_lemma = WNL.lemmatize(t)
    lemmatized_tokens.append(t_lemma)
print(lemmatized_tokens)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\bacaksgm\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['At', '10:30', "o'clock", 'on', 'Monday', 'morning', ',', 'we', 'have', 'Social', 'ComQuant', 'meeting', '.', 'Let', "'s", 'have', 'our', 'meeting', 'another', 'time', '.']


In [20]:
# Let's give an example with spaCy
new_sentence = "What about going to festivals? You said you like dancing."

text_doc = nlp(new_sentence)
for token in text_doc:
    print(token.text, '----->', token.lemma_)


What -----> what
about -----> about
going -----> go
to -----> to
festivals -----> festival
? -----> ?
You -----> you
said -----> say
you -----> you
like -----> like
dancing -----> dance
. -----> .


### 2.1.3. Stemming 

From Hovy's book: "Rather than reducing a word to the lemma, we strip away everything but the irreducible morphological core (the stem). For example, for a word like “anticonstitutionalism”, which can be analyzed as “anti+constitut+ion+al+ism,” we remove everything but “constitut.” The most famous and commonly used stemming tool is based on the algorithm developed by Porter (1980). For each language, it defines a number of suffixes (i.e., word endings) and the order in which they should be removed or replaced. By repeatedly applying these actions, we reduce all words to their stems. In our example, all words derive from the stem “constitut–” by attaching different endings. Again, a version of the Porter stemmer is already available in Python, in the nltk library (Loper & Bird, 2002), but we have to specify the language."
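
The idea of an ordered list of suffixes that is applied repeatedly can be sketched in a few lines. The rules in `TOY_RULES` and the function `toy_stem` are our own assumptions for illustration; they are far simpler than the rules of the actual Porter algorithm and will produce different stems.

```python
# ordered (suffix, replacement) rules: a hypothetical toy list, NOT Porter's rules
TOY_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word, min_stem_len=3):
    """Apply the first matching rule again and again until no rule applies."""
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix, replacement in TOY_RULES:
            # only strip a suffix if at least min_stem_len characters remain
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
                word = word[: len(word) - len(suffix)] + replacement
                changed = True
                break  # restart from the first rule
    return word

for w in ["stories", "editors", "delivered", "meetings", "is"]:
    print(w, "->", toy_stem(w))
```

The word "meetings" shows why the rules are applied repeatedly: the first pass removes "s", the second pass removes "ing". The minimum stem length is a simple guard that keeps short words such as "is" intact.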

In [21]:
# import stemmer
from nltk.stem.porter import PorterStemmer

sentence = """Every weekday evening, our editors guide you through the biggest stories of the day,
help you discover new ideas, and surprise you with moments of delight. Subscribe to get this delivered to your inbox.."""

stemmer = PorterStemmer()
tokens = nltk.tokenize.word_tokenize(sentence)

stemmed_tokens = []
# loop over each token in tokens
for token in tokens:
    # add the stemmed version of the token to the new list
    stemmed_tokens.append(stemmer.stem(token))
# join a list of tokens into one string
stemmed_sentence = " ".join(stemmed_tokens) 
print(stemmed_sentence)

everi weekday even , our editor guid you through the biggest stori of the day , help you discov new idea , and surpris you with moment of delight . subscrib to get thi deliv to your inbox ..


### 2.1.4. n-grams
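
An n-gram is a contiguous sequence of n tokens: bigrams for n = 2, trigrams for n = 3, and so on. As a minimal sketch, assuming the text has already been tokenized, n-grams can be built with a sliding window. The helper name `make_ngrams` is our own; nltk offers a comparable function, `nltk.util.ngrams`.

```python
def make_ngrams(tokens, n):
    """Return all contiguous windows of n tokens as a list of tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

example_tokens = ["Today", "is", "a", "great", "day"]

bigrams = make_ngrams(example_tokens, 2)
trigrams = make_ngrams(example_tokens, 3)

print(bigrams)
print(trigrams)
```

A list of k tokens yields k - n + 1 n-grams, so the helper returns an empty list when n is larger than the number of tokens.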

### 2.1.5. Word frequency analysis (not sure if this should be here?)
### 2.1.6. regex?

## 2.2. Parts of speech

## 2.3. Named entities
NER with NLTK

NER with spaCy

# 3. NLP tasks implementation

Text summarization (gensim + spaCy)


# 4. Hugging Face's transformers: State-of-the-art NLP

Currently, the `transformers` library from Hugging Face supports PyTorch and TensorFlow 2.0. We can use it to implement summarization, text generation, language translation, chatbots, and more.

In [24]:
# if you haven't installed yet, please install torch first
# !pip install torch
import torch
# !pip install transformers --upgrade

from transformers import pipeline

# 5. Sentiment Analysis with BERT?

# References

Hovy, D. (2020). Text Analysis in Python for Social Scientists: Discovery and Exploration. Cambridge University Press.

Hovy, D. (2021). Text Analysis in Python for Social Scientists: Prediction and Classification. Cambridge University Press.

https://www.nltk.org/book/ch01.html

https://www.machinelearningplus.com/nlp/natural-language-processing-guide/

