<img src='images/gesis.png' style='height: 50px; float: left'>
<img src='images/social_comquant.png' style='height: 50px; float: left; margin-left: 40px'>

#### Notes to be removed before publication

Still a work in progress.
Reviewers: Arnim (author)?, more GESIS people, or SCQ summer school/workshops? Yelena, Malak, Ahmed

Review intro

Review and finish red boxes

Add insight boxes more?

## Introduction to Computational Social Science methods with Python

# Session 6: Natural Language Processing
The field of study that focuses on the interactions between human language and computers is called natural language processing (NLP).

NLP is a field of artificial intelligence in which computers analyze, understand, and derive meaning from human language. NLP systems exploit the signals in our language to predict a range of author attributes: people’s age (Nguyen et al., 2011; Rosenthal & McKeown, 2011), gender (Alowibdi et al., 2013; Ciot et al., 2013; Liu & Ruths, 2013), personality (Park et al., 2015), job title (Preoţiuc-Pietro et al., 2015a), income (Preoţiuc-Pietro et al., 2015b), and much more (Volkova et al., 2014, 2015).

In NLP, word embeddings have been at the forefront of this progress, which has expanded to include flexible model architectures (Hovy, 2021). The most publicly visible example of this shift is probably the translation quality of services like Google Translate (Wu et al., 2016).

A collection of fundamental tasks appears frequently across various NLP projects (Vajjala et al., 2020). Let’s briefly introduce them (Figure 1):

*Language modeling* is the task of predicting what the next word in a sentence will be based on the history of previous words. The goal of this task is to learn the probability of a sequence of words appearing in a given language. Language modeling is useful for building solutions for a wide variety of problems, such as **speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction**.

*Text classification* is the task of bucketing the text into a known set of categories based on its content. Text classification is by far the most popular task in NLP and is used in a variety of tools, from **email spam identification** to **sentiment analysis**.

*Information extraction* is the task of extracting relevant information from text, such as **calendar events from emails** or the **names of people mentioned** in a social media post.

*Information retrieval* is the task of finding documents relevant to a user query from a large collection. Applications like **Google Search** are well-known use cases of information retrieval.

*Conversational agent* is the task of building dialogue systems that can converse in human languages. **Alexa** and **Siri** are some common applications of this task.

*Text summarization* aims to create short summaries of longer documents while retaining the **core content** and preserving the **overall meaning** of the text.

*Question answering* is the task of building a system that can automatically answer questions posed in natural language.

*Machine translation* is the task of converting a piece of text from one language to another. Tools like **Google Translate** are common applications of this task.

*Topic modeling* is the task of uncovering the topical structure of a large collection of documents. Topic modeling is a common text-mining tool and is used in a wide range of domains, from **literature** to **bioinformatics**.

<img src='images/nlp_tasks.png' style='height: 500px; float: right'>

Understanding human language is considered a difficult task because of its complexity. For example, there is an infinite number of ways to arrange words in a sentence.

Also, words can have several meanings, so contextual information is necessary to interpret sentences correctly. Ambiguity can be lexical or syntactic.

- In lexical ambiguity, a single word has two or more possible meanings. For example, in "I saw bats", "bats" can refer to the animals or to baseball bats.
- In syntactic ambiguity, a single sentence or sequence of words has multiple possible meanings. For example, "The chicken is ready to eat" can mean that the chicken is about to eat or that it is about to be eaten.

This session will help you understand basic and advanced NLP concepts and show you how to implement them using some of the most popular NLP libraries: <a href="https://www.nltk.org/">NLTK</a>, <a href="https://spacy.io/">spaCy</a>, <a href="https://radimrehurek.com/gensim/">Gensim</a>, and <a href="https://huggingface.co/">Hugging Face</a>.

<div class='alert alert-block alert-success'>
<b>In this session</b>, 

you will learn the basics of Natural Language Processing. In subsession **6.1**, **6.2**, we will show .. **6.3**,... Finally, in subsession **6.4**, we will compare these libraries and talk about the challenges and data privacy approaches.
</div>

<div class='alert-info'>
<big><b>Reminder:</b></big>
    
Use pip ONLY if there is no Conda option. Please make sure that ALL packages we need are installed as described in
<a href="https://github.com/gesiscss/css_methods_python/blob/main/a_introduction/1_computing_environment.ipynb"> Session 1 </a>.
</div>

### NLTK (Natural Language Toolkit)

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum. For more details, check out <a href="https://www.nltk.org/">NLTK</a>'s webpage.

In [None]:
# import nltk
import nltk
nltk.download('punkt')

### spaCy

spaCy is a popular, free, open-source library for advanced NLP in Python. Its features include named entity recognition (NER), part-of-speech (POS) tagging, dependency parsing, word vectors, and more, which give it clear advantages for processing and modeling text data. For more details, check out <a href="https://spacy.io/">spaCy</a>'s webpage.

In [None]:
# import spaCy and create the nlp object by loading the English model and its data
import spacy
nlp = spacy.load('en_core_web_sm') # if you get an error, run: python -m spacy download en_core_web_sm

### gensim

Gensim is a free, open-source Python library for representing documents as semantic vectors as efficiently (computer-wise) as possible. It was developed for topic modelling and supports NLP tasks such as word embeddings, <a href="https://radimrehurek.com/gensim/models/ldamodel.html">LDA Topic Modeling</a>, and the detection of <a href="https://radimrehurek.com/gensim/models/phrases.html">Bigrams/Trigrams</a>. Note that the text summarization module was removed in Gensim 4.0. For more details, check out <a href="https://radimrehurek.com/gensim/auto_examples/index.html#documentation">Gensim</a>'s webpage.

In [None]:
# import gensim
import gensim

### Hugging Face Transformers

Transformers was developed by <a href="https://huggingface.co/">Hugging Face</a> and provides state-of-the-art pretrained models for high-level NLP tasks. It is one of the most widely used libraries in the NLP community and natively supports both PyTorch and TensorFlow models, which increases its applicability in the deep learning community. <a href="https://arxiv.org/abs/1810.04805">BERT</a> and <a href="https://arxiv.org/abs/1907.11692">RoBERTa</a> are two well-known models available through the library, which is used for machine translation, question answering, and many other applications.

The Hugging Face pipeline API provides a quick and simple way to perform a range of NLP operations. The library also supports GPUs, which can speed up training and inference considerably. Check out their <a href="https://huggingface.co/docs/transformers/main_classes/pipelines">Pipelines</a> documentation for the 10+ tasks that can be performed with essentially one line of code. Their model repository is vast.

<div class='alert-info'>
<big><b>Reminder:</b></big>
    
If you haven't installed the package yet, run the following line in a code cell:
    
!pip install transformers

For the conda option, check out <a href="https://github.com/gesiscss/css_methods_python/blob/main/a_introduction/1_computing_environment.ipynb"> Session 1 </a>.
</div>


# 6.2. Text Preprocessing methods

We use the word corpus (plural: corpora) to refer to an entire collection of documents or observations. Raw text data, often referred to as a *text corpus*, contains punctuation, suffixes, and stop words that carry little information. Text preprocessing prepares the text corpus so that it provides more useful information for NLP tasks.

Let's walk through some basic steps of preprocessing of a raw text corpus!

<div class='alert-info'>
<big><b>Haiko</b></big>

So far the session is very technical. But what is the teaching goal beyond coding? Could we introduce "language models" here as a larger concept that allows us to structure content here? Wikipedia lists four main models (https://en.wikipedia.org/wiki/Language_model). How do the simple text processing techniques we discuss here connect to the actual models (e.g., Markov models); how do bag of words techniques fit in; what's the relationship to those learned representation techniques? This session should provide the answers.

We could structure the session around Hovy's section 8: start with language models and proceed with the necessary steps in that context (following the logic of his subsections).

This session could apply an n-gram model to a large corpus and discuss the meaning of the n-grams.

This session should also prepare the ground for the later session on supervised text mining. We could ask ourselves "what is needed later on?" and make the connection...
</div>

Using the `nlp` object created in the spaCy step above, we can convert raw text into a spaCy object. Let's first try this with a single sentence before we load our dataset.

In [None]:
raw_text = 'Today is a great day with learning NLP, such a powerful tool!'
text_doc = nlp(raw_text)

<div class='alert alert-block alert-danger'>
<b>Gizem's note:</b>
I might remove all toy examples and continue with the reviews from Amazon. What do you think?
</div>

Let's use the publicly available Amazon review data from He & McAuley (2016): http://jmcauley.ucsd.edu/data/amazon/index_2014.html

The dataset that we will work on has already been downloaded for this session. It is in the data folder under the filename "Industrial_and_Scientific_5.json.gz".


In [None]:
# import the necessary libraries (see Session 1 for further installation information)

import json
import gzip
import pandas as pd

In [None]:
data = []
# open the gzip file in text mode and parse one JSON record per line
with gzip.open('./data/Industrial_and_Scientific_5.json.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line.strip()))

# total length of the list; each record is one review
print(len(data))

# first record of the list
print(data[0])


In [None]:
# convert the list of records into a pandas dataframe
df = pd.DataFrame(data)

print(len(df))

In [None]:
# show the first five rows of the dataframe
df.head()

<div class='alert-info'>
<big><b>Reminder:</b></big>
    
If you need more time to learn about pandas, please go back to <a href="https://github.com/gesiscss/css_methods_python/blob/main/a_introduction/2_data_handling_and_visualization.ipynb"> Session 2 </a>.
</div>


## 6.2.1. Word Descriptors

### Tokens and splitting 

#### What is a Token?

The set of all the unique terms in our data is called the vocabulary. Each element in this set is called a type. Each occurrence of a type in the data is called a token. 

Let's practice: our sentence “Today is a great day with learning NLP, such a powerful tool!” has 14 tokens ('Today', 'is', 'a', 'great', 'day', 'with', 'learning', 'NLP', ',', 'such', 'a', 'powerful', 'tool', '!') but only 13 types, because 'a' occurs twice. Note that types can also include punctuation marks and multiword expressions.

In other words, the words of a text document or file, separated by spaces and punctuation, are called tokens.
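
The difference between tokens and types can be checked with a few lines of plain Python. This is a minimal sketch: the regular expression below is a simplified stand-in for a real tokenizer (an assumption for illustration, not spaCy's or nltk's rules), but it happens to produce the same split for our sentence.

```python
import re

# The example sentence used in this session.
sentence = "Today is a great day with learning NLP, such a powerful tool!"

# Simplified tokenizer (an assumption, not spaCy's rules):
# a token is either a run of word characters or a single punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", sentence)

# Types are the unique tokens; together they form the vocabulary.
types = set(tokens)

print(tokens)
print(len(tokens), "tokens,", len(types), "types")
```

Because the type set is case-sensitive here, 'Today' and 'today' would count as two different types unless the text is lower-cased first.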

#### What is Tokenization?
The process of extracting tokens from a text file or document is referred to as tokenization.



In [None]:
# We can print text of the tokens by accessing token.text while using spaCy
# printing tokens
tokens = []
for token in text_doc:
    print(token.text)
    tokens.append(token.text)
    
len(tokens)    

In [None]:
# What if we want to check whether a token consists of alphabetic characters only?

for token in text_doc:
    print(token.text, token.is_alpha)


In [None]:
# What if we want to know if the particular token is space, or a stop word or punctuation?
print("Text".ljust(10), ' ', "Alpha", "Space", "Stop", "Punct")
for token in text_doc:
    print(token.text.ljust(10), ':', token.is_alpha, token.is_space, token.is_stop, token.is_punct)

#### Let's try tokenization with nltk.
nltk doesn't come with all of its resources installed; you need to call nltk.download() to fetch the ones that are missing. We use both nltk and spaCy to show the similar features that these libraries have.

In [None]:
import nltk
# nltk.download('punkt')

sentence = "At 10:30 o'clock on Monday mornings, we have Social ComQuant meetings. Let's have our meeting another time."
tokens = nltk.tokenize.word_tokenize(sentence)
print(tokens)
print()
print(" ".join(tokens))

In [None]:
# convert string to upper case characters
sentence.upper() 

In [None]:
# convert string to lower case characters
sentence.lower() 

#### Let's remove punctuation and stop words. But, wait, what do we mean by stop words?
Stop words are a set of commonly used words in any language. For example, in English, “the”, “is”, and “and” would easily qualify as stop words. In NLP and text mining applications, stop word lists are used to filter out these uninformative words, allowing applications to focus on the important words instead.

In [None]:
# Access the built-in stop words in nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

print(stopwords.words('english'))

In [None]:
# Access the built-in stop words in spaCy
# (importing STOP_WORDS directly keeps nltk's `stopwords` module from being overwritten)
from spacy.lang.en.stop_words import STOP_WORDS
list_stopwords = list(STOP_WORDS)

# print the first five entries of the list
for word in list_stopwords[:5]:
    print(word)

In [None]:
# Filter out the stop words
filtered_text = [token for token in text_doc if not token.is_stop]

# Print the remaining tokens and count them
for token in filtered_text:
    print(token)
token_count_without_stopwords = len(filtered_text)
print(token_count_without_stopwords)

In [None]:
# Remove punctuation
filtered_text = [token for token in filtered_text if not token.is_punct]

# Print the remaining tokens and count them
for token in filtered_text:
    print(token)
token_count_without_stop_and_punct = len(filtered_text)
print(token_count_without_stop_and_punct)

### Lemmatization

When we look up a word in a dictionary, we usually just look for the base form. This dictionary base form is called the lemma.
For instance, we might see forms like “go,” “goes,” “went,” “gone,” or “going”, but in a dictionary we look them all up under the lemma “go” (Hovy, 2020).

In [None]:
# Let's give an example with nltk (already imported above)
# import the lemmatizer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

# Remember our sentence with the Social ComQuant meeting.
sentence = "At 10:30 o'clock on Monday mornings, we have Social ComQuant meetings. Let's have our meeting another time."

WNL = WordNetLemmatizer() # create an instance of the lemmatizer
tokens = nltk.tokenize.word_tokenize(sentence)

# lemmatize() treats every token as a noun unless a pos argument is given,
# so "meetings" becomes "meeting" while verb forms are generally left unchanged
lemmatized_tokens = [WNL.lemmatize(t) for t in tokens]
print(lemmatized_tokens)

<div class='alert-info'>
<big><b>Haiko</b></big>

At this point it occurs to me that just providing the commands how things can be done is not enough. We also want to teach how users can preprocess their corpus and save intermediate steps like "corpus.txt" > "corpus_stemmed.txt" > "corpus_stemmed_nostopwords.txt" > ...

In general, do we need spacy, nltk, and gensim to work along this pipeline? Even if not, we should tell in the session why we introduce all packages.
</div>

In [None]:
# Let's give an example with spaCy
new_sentence = "What about going to festivals? You said you like dancing."

text_doc = nlp(new_sentence)
for token in text_doc:
    print(token.text, '----->', token.lemma_)


### Stemming 

From Hovy's book: "Rather than reducing a word to the lemma, we strip away everything but the irreducible morphological core (the stem). For example, for a word like “anticonstitutionalism”, which can be analyzed as “anti+constitut+ion+al+ism,” we remove everything but “constitut.” The most famous and commonly used stemming tool is based on the algorithm developed by Porter (1980). For each language, it defines a number of suffixes (i.e., word endings) and the order in which they should be removed or replaced. By repeatedly applying these actions, we reduce all words to their stems. In our example, all words derive from the stem “constitut–” by attaching different endings. Again, a version of the Porter stemmer is already available in Python, in the nltk library (Loper & Bird, 2002), but we have to specify the language."

In [None]:
# import stemmer
from nltk.stem.porter import PorterStemmer

sentence = """Every weekday evening, our editors guide you through the biggest stories of the day,
help you discover new ideas, and surprise you with moments of delight. Subscribe to get this delivered to your inbox.."""

stemmer = PorterStemmer()
tokens = nltk.tokenize.word_tokenize(sentence)

stemmed_tokens = []
# loop over each token in tokens
for token in tokens:
    # add the stemmed version of the token to the new list
    stemmed_tokens.append(stemmer.stem(token))
# join a list of tokens into one string
stemmed_sentence = " ".join(stemmed_tokens) 
print(stemmed_sentence)
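
To illustrate the principle of ordered suffix rules described above, here is a toy sketch. It is not the Porter algorithm: the rule list `SUFFIX_RULES`, the function `toy_stem`, and the `min_stem_length` parameter are all hypothetical names made up for this illustration, and each word is reduced by at most one rule instead of several repeated passes.

```python
# Hypothetical, ordered suffix rules: (suffix, replacement).
# Longer suffixes come first so that they are tried before shorter ones.
SUFFIX_RULES = [
    ("ational", "ate"),
    ("ization", "ize"),
    ("ing", ""),
    ("ies", "i"),
    ("ed", ""),
    ("s", ""),
]

def toy_stem(word, min_stem_length=3):
    """Apply the first matching suffix rule, keeping a minimum stem length."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["editors", "stories", "delivered", "surprise", "relational"]:
    print(w, "->", toy_stem(w))
```

The minimum stem length keeps short words such as "is" from being reduced to a single letter. The real Porter stemmer uses a more refined measure of the remaining stem for the same purpose.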

### n-grams
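
An n-gram is a sequence of n consecutive tokens. The sketch below extracts bigrams from a made-up sentence and uses their counts for a very simple language model that estimates how likely a word is to follow another one. The helper names `ngrams` and `next_word_probability` and the example sentence are assumptions for illustration only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a list of tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Made-up example sentence, split on whitespace.
tokens = "we have our meeting and we have our coffee".split()

bigrams = ngrams(tokens, 2)
print(bigrams[:3])

# Count how often each bigram and each single token occurs.
bigram_counts = Counter(bigrams)
unigram_counts = Counter(tokens)

def next_word_probability(previous, following):
    """Estimate P(following | previous) from raw counts.

    The denominator is the count of `previous` in the text, which slightly
    underestimates the probability if `previous` is the very last token.
    """
    return bigram_counts[(previous, following)] / unigram_counts[previous]

print(next_word_probability("have", "our"))
print(next_word_probability("our", "meeting"))
```

On a large corpus, the same counts show which word pairs occur together unusually often. This is also the idea behind the phrase detection in gensim.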

### regex

<a href="https://docs.python.org/3/howto/regex.html">Regular expressions</a> (called regex, regexes, regex pattern, regexp, or REs) specify search patterns. Typical examples of regular expressions are the patterns for matching email addresses, phone numbers, and credit card numbers.

Regular expressions are essentially a specialized programming language embedded in Python. You can work with them via Python's built-in `re` module, which offers several functions that look for a pattern in a string:

- `match()`
- `search()`
- `findall()`
- `finditer()`
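
The four functions listed above differ in where they look and in what they return. The following sketch runs them on a made-up sentence. The email pattern is deliberately simple and only meant for illustration; it is not a complete email validator.

```python
import re

# Made-up example text with two email-like strings.
text = "Contact us at info@example.org or support@example.org before 10:30."

# Hypothetical, deliberately simple pattern for email-like strings.
email_pattern = r"[\w.]+@[\w.]+\.\w+"

# match() only succeeds if the pattern matches at the start of the string.
print(re.match(email_pattern, text))  # None: the text starts with "Contact"

# search() returns the first match anywhere in the string.
first = re.search(email_pattern, text)
print(first.group())

# findall() returns all non-overlapping matches as a list of strings.
print(re.findall(email_pattern, text))

# finditer() yields match objects, which also carry the match positions.
for m in re.finditer(email_pattern, text):
    print(m.start(), m.end(), m.group())
```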

Pattern... character set...

In [None]:
# import packages
from tika import parser # Tika is written in Java, so a Java runtime must be installed
import re
import pprint

# read the speech data
raw = parser.from_file("data/king_dreamspeech.pdf")

# remove backslashes and convert to lower case
text_corpus = raw['content'].replace("\\", "").lower()

search_keywords = ['but', 'because', 'while', 'as']

# split the text into lines
sentences = text_corpus.split('\n')

# for each keyword, collect the lines that contain it as a whole word
for keyword in search_keywords:
    matches = [s for s in sentences if re.search(r'\b' + re.escape(keyword) + r'\b', s)]
    print(matches)



### Word frequency analysis

<div class='alert-info'>
<big><b>Haiko</b></big>

Yes, I think it should be here. Hovy has it in the "text representation" section. It is then a statistic of the document-term matrix for example.
</div>




In [None]:
import gensim, pprint # if you haven't done so

# tokenize documents with gensim's tokenize() function
tokens = [list(gensim.utils.tokenize(doc, lower=True)) for doc in sentences]

# pre-process the tokens first (remove stop words, stem)
from gensim.parsing.preprocessing import preprocess_string, remove_stopwords, stem_text
CUSTOM_FILTERS = [remove_stopwords, stem_text]
tokens = [preprocess_string(" ".join(doc), CUSTOM_FILTERS) for doc in tokens]

# build the bigram model on the pre-processed tokens, so that it is trained on
# the same token forms it is later applied to
bigram_mdl = gensim.models.phrases.Phrases(tokens, min_count=1, threshold=2)

# apply the bigram model and store the result as a list
bigrams = list(bigram_mdl[tokens])

pprint.pprint(bigrams)

In [None]:
# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
for text in bigrams:
    for token in text:
        frequency[token] += 1

# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in bigrams]
pprint.pprint(processed_corpus)
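
The same counting can also be sketched with `collections.Counter` from the standard library, which additionally returns the most frequent tokens directly. The small `documents` list below is a hypothetical stand-in for the pre-processed corpus, so the example runs without gensim or the speech data.

```python
from collections import Counter

# Hypothetical toy corpus: each document is a list of already pre-processed tokens.
documents = [
    ["dream", "nation", "freedom"],
    ["freedom", "ring", "freedom"],
    ["dream", "today"],
]

# Counter.update() adds the counts of every token in a document.
frequency = Counter()
for doc in documents:
    frequency.update(doc)

# The two most frequent tokens, most frequent first.
print(frequency.most_common(2))

# Keep only tokens that appear more than once in the whole corpus.
filtered = [[tok for tok in doc if frequency[tok] > 1] for doc in documents]
print(filtered)
```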

In [None]:
from gensim import corpora

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

In [None]:
pprint.pprint(dictionary.token2id)


<div class='alert alert-block alert-danger'>
<b>Gizem's note:</b>

Add here freq visualizations, word count --> dictionary based, freqs (Haiko's suggested paper on uncertainty)
principle behind the algorithms, create your own dictionary (25 words, cite the paper) extracting URLs, hashtags and emojis can come after that

tweets for emojis

over time analysis?

multi corpus for tfidf
</div>



## 6.2.2. Parts of speech

## 6.2.3. Named entities
NER with NLTK

NER with spaCy

<img src='images/NER.png' style='height: 500px; float: left'>

# 6.3. NLP tasks implementation

Text summarization (gensim + spaCy)
Text similarity cosine similarity
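
As a first idea of what text similarity with cosine similarity could look like, here is a minimal sketch that compares two texts through their raw word counts (a bag-of-words representation). The function name `cosine_similarity` and the three short review texts are made up for this illustration. A real analysis would typically use weighted vectors (e.g., tf-idf) or embeddings.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the vocabulary shared by both texts.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    # An empty text has no direction, so we define its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up review snippets, split on whitespace.
review_1 = "great product works great".split()
review_2 = "great product".split()
review_3 = "arrived broken".split()

print(cosine_similarity(review_1, review_2))
print(cosine_similarity(review_1, review_3))
```

Texts that share no words get a similarity of 0, and identical texts get a similarity of 1, regardless of their length.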

Use Amazon reviews: https://nijianmo.github.io/amazon/index.html 2018


Justifying recommendations using distantly-labeled reviews and fine-grained aspects
Jianmo Ni, Jiacheng Li, Julian McAuley
Empirical Methods in Natural Language Processing (EMNLP), 2019

<div class='alert alert-block alert-danger'>
<b>Gizem's note:</b>

To session 10? Hugging Face’s Transformers: State-of-the-art NLP
Currently, Hugging Face Transformers supports PyTorch and TensorFlow 2.0. We can use the transformers of Hugging Face to implement Summarization, Text Generation, Language Translation, ChatBot...

</div>

<div class='alert-info'>
<big><b>Haiko</b></big>

pytorch is available in Anaconda, torch is not. Can we use pytorch as we have the policy that we prioritize packages available in Anaconda?
</div>

In [None]:
# if you haven't installed yet, please install torch first
# !pip install torch
import torch
# !pip install transformers --upgrade

from transformers import pipeline

# References

Hovy, D. (2020). Text analysis in Python for social scientists: Discovery and exploration. Cambridge University Press.

Hovy, D. (2021). Text Analysis in Python for Social Scientists: Prediction and Classification. Cambridge University Press.

Vajjala, S., Majumder, B., Gupta, A., & Surana, H. (2020). Practical natural language processing: a comprehensive guide to building real-world NLP systems. O'Reilly Media.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

He, R., & McAuley, J. (2016, April). Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web (pp. 507-517). http://jmcauley.ucsd.edu/data/amazon/index_2014.html


https://www.nltk.org/book/ch01.html

https://www.machinelearningplus.com/nlp/natural-language-processing-guide/



Image Credits

[1] Vajjala, S., Majumder, B., Gupta, A., & Surana, H. (2020). Practical natural language processing: a comprehensive guide to building real-world NLP systems. O'Reilly Media. Chapter 1: https://www.oreilly.com/library/view/practical-natural-language/9781492054047/ch01.html

<div class='alert alert-block alert-success'>
<b>Document information</b>

Contact and main author: N. Gizem Bacaksizlar Turbic & ..?

Contributors: Haiko Lietz & Pouria Mirelmi & ..?

Acknowledgements: ...

Version date: XX. December 2022

License: ...
</div>