<a href="https://colab.research.google.com/github/ameryusuf/BotBuilder/blob/master/2-advanced-topics/text-analysis/intro-text-analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Text Analysis

Text analysis is the process of extracting meaningful information from text data, uncovering insights that would otherwise remain buried in large text corpora.

This session is an **introduction** to text analysis. We'll be covering the following topics:

1. Regex and character patterns in text data
1. Text data pre-processing
1. Counting words
1. Text classification

The session assumes previous knowledge of Python and Pandas, and some knowledge of data visualization using seaborn.

We'll use the following libraries in this notebook:

- **pandas** for dataframe operations
- **re** for regular expressions
- **spacy** for text data preparation
- **seaborn** for data visualization
- **sklearn** for data classification

## (some) Data exploration

We'll start by getting familiar with our dataset. We'll use a structured tabular dataset of working papers obtained from the WB Documents API.

Run the following line to load the dataset:

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/worldbank/dec-python-course/main/2-advanced-topics/text-analysis/data/papers.csv')

In [None]:
len(df)

In [None]:
df.head()

The data is a corpus of working papers from the WB Policy Research Working Paper series. For each paper, we have:

- A paper identifier
- The Title
- Two URLs
- The topics of the paper, separated by commas
- An abstract
- A text

Let's take a closer look at the columns `url`, `url_text`, and `text`:

In [None]:
df['url'][0]

In [None]:
df['url_text'][0]

In [None]:
df['text'][0]

`url` contains the paper URL, `url_text` is the URL to the actual text content, and `text` is the text of the paper.

Now that we know what the data is about, we can start planning what to do with it. In general, the tasks we'll do involve data augmentation, basic description, and classification. This is a summary of what we'll do:

1. Generate new features (columns) based on the text
1. Count the words and most used words
1. Build a topic classifier from our corpus

For the first task, we'll augment the dataset using existing patterns in the text.

## Patterns

Let's take another look at the text. This time we'll use the function `print()`, so that space characters are properly rendered and the text is easier to read.

In [None]:
print(df['text'][0])

Note that there are a number of information elements that seem to follow some patterns in the text:

- The WP number is in the last sequence of non-space characters in the first line
- The authors' names are a series of contiguous lines after the paper title
- Abstract: lines after the word "Abstract" in the beginning of the text. All of them seem to have a big space in the middle of the sentence
- Keywords: separated by semicolons in a line that starts with "Keywords"
- JEL Codes: an uppercase character followed by two digits, separated by commas
- Authors' emails: non-space sequences of characters containing an "at" sign ("@") and ending in ".org" or ".com"
- Bibliography elements: last lines of the text

We're going to take advantage of the patterns of JEL codes to extract them in a new column and augment our original dataframe. We'll use regular expressions for this.

**Important:** We're only checking one observation (the first) when inferring these patterns. If you want to augment an entire dataframe and not a single observation, you'd have to make sure the same pattern exists in the rest of the texts of your corpus. We'll take it for granted in this session for the sake of time, but you should note that manually exploring different observations of your corpus is needed to infer possible patterns in your texts.

### Regular expressions

In programming, regular expressions (regex) are sequences of characters that define a search pattern to be matched against text. A simple example:

In [None]:
import re

In [None]:
text = 'The ID number of participant 1 is 30551. They were born on July 01, 1996. Participant 2 has ID 71098.'

# Pattern for capturing IDs in this text: sequences of five digits.
# The r prefix makes this a raw string, so Python passes the backslash to re unchanged
pattern = r'\d{5}'

# Capturing IDs
ids = re.findall(pattern, text)
print(ids)

Some notes about this code:
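- `\d` is a character class that matches one digit (0-9). For plain ASCII text, this is the same as `[0-9]`
- `{5}` means that the previous element in the pattern must appear exactly five times in a row
- A variation of this pattern could be `\d{4}`, which could be used to capture years. In the example above, however, it would return `['3055', '1996', '7109']`, because it also matches the first four digits of each ID. Wrapping it in word boundaries, `\b\d{4}\b`, returns only `1996`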
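
A quick check of what a four-digit pattern captures in the sample sentence, comparing `\d{4}` with a variant wrapped in word boundaries (`\b`). This is only a sketch for illustration; the variable names `loose` and `strict` are ours.

```python
import re

text = ('The ID number of participant 1 is 30551. They were born on '
        'July 01, 1996. Participant 2 has ID 71098.')

# Without boundaries, \d{4} also matches the first four digits of each ID
loose = re.findall(r'\d{4}', text)

# \b marks a word boundary, so only standalone four-digit numbers match
strict = re.findall(r'\b\d{4}\b', text)

print(loose)   # ['3055', '1996', '7109']
print(strict)  # ['1996']
```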

In regex, there is a wildcard for almost everything. Some examples:

- Character wildcards:
    + `\d` --> digits (0-9)
    + `\w` --> any word character (uppercase and lowercase a-z, digits, and underscore ("_") ). The uppercase version, `\W`, matches the opposite: any character that is not a word character
    + `\n` --> newline characters
    + `\s` --> whitespace characters, including newline
    + `.` --> any character except newline
- Character repetition:
    + `{a}` --> the previous character, repeated "a" times
    + `{a,b}` --> the previous character, repeated between "a" and "b" times
    + `*` --> the previous character, repeated zero or more times
    + `+` --> the previous character, repeated one or more times
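
To see several of these wildcards and repetition markers in action, here is a small sketch on a made-up two-line string (the string and variable names are invented for illustration):

```python
import re

sample = 'Paper_7 was revised 3 times.\nIt has 12 pages.'

digits = re.findall(r'\d+', sample)          # runs of one or more digits
words = re.findall(r'\w+', sample)           # runs of word characters
spaces = re.findall(r'\s', sample)           # whitespace, including the newline
a_pairs = re.findall(r'a.', sample)          # "a" plus any non-newline character
two_digits = re.findall(r'\d{2}', sample)    # exactly two digits in a row
short_runs = re.findall(r'\d{1,2}', sample)  # one or two digits in a row

print(digits)      # ['7', '3', '12']
print(words[:3])   # ['Paper_7', 'was', 'revised']
print(a_pairs)     # ['ap', 'as', 'as', 'ag']
print(two_digits)  # ['12']
```

Note that `Paper_7` is captured as a single word because the underscore and the digit both count as word characters.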
    
Regex can match a very wide range of patterns. However, working with regex can be complex for beginners. For the purpose of this session, we've introduced regex so you know it exists and can be used to augment datasets that contain a corpus of documents. Don't worry for now if you haven't fully grasped how the patterns work, but if you're interested in learning more about regex, we recommend the following resources:

- A nice regex tutorial is [here](https://regexone.com/)
- A great regex visualizer tool is [here](https://jex.im/regulex/#!flags=&re=www%5C.%5Ba-zA-Z0-9-%5D%2B%5C.(%3F%3Acom%7Cnet%7Corg))
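
As one more illustration, the email pattern noted earlier (a non-space sequence containing "@" and ending in ".org" or ".com") could be sketched roughly as follows. The snippet and the addresses are invented, and the pattern is a simplification rather than a full email validator.

```python
import re

# Hypothetical snippet; the addresses are invented for illustration
snippet = 'Contact: jdoe@example.org and a.smith@example.com (corresponding author).'

# \S+ matches one or more non-space characters,
# (?:org|com) matches either ending without creating a capture group,
# and \b stops the match right after the ending
email_pattern = r'\S+@\S+\.(?:org|com)\b'

emails = re.findall(email_pattern, snippet)
print(emails)  # ['jdoe@example.org', 'a.smith@example.com']
```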

### Extracting information using patterns

Remember we said that JEL codes in the text looked like a pattern of one uppercase letter followed by two digits? We'll use this to extract the JEL codes of each paper into a new column in the dataframe.

In [None]:
pattern = r'[A-Z]\d{2}'

This pattern captures one uppercase alphabetic character (`[A-Z]`), followed by exactly two digits (`\d{2}`).

Now we'll define a helper function that looks for this pattern in a text and returns all captures in a list:

In [None]:
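def capture_jel(text):
    """Return every match of the JEL-code pattern found in `text`, as a list."""

    # Raw string so the backslash reaches the regex engine unchanged
    pattern = r'[A-Z]\d{2}'
    result = re.findall(pattern, text)

    return result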

Lastly, we'll map this function using Pandas' `apply()` method to create a new column in the dataframe:

In [None]:
df['jel'] = df['text'].apply(capture_jel)

In [None]:
df.head()

Now we have augmented our dataset. Great!
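
One caveat: applied to a whole text, this pattern can also match inside longer tokens that are not JEL codes. Below is a hedged sketch of a stricter variant that only looks at lines starting with "JEL" and uses word boundaries. The sample text, its layout, and the function name `capture_jel_strict` are assumptions made for illustration, so check them against your own corpus before relying on them.

```python
import re

# Hypothetical first-page text; the layout is an assumption for illustration
sample_text = (
    'Policy Research Working Paper WPS1234\n'
    'Keywords: trade; poverty; growth\n'
    'JEL Codes: F14, I32, O40\n'
)

def capture_jel_strict(text):
    """Return JEL-style codes found on lines that start with 'JEL'."""
    codes = []
    for line in text.splitlines():
        if line.strip().startswith('JEL'):
            # \b keeps the pattern from matching inside longer tokens
            codes.extend(re.findall(r'\b[A-Z]\d{2}\b', line))
    return codes

loose_codes = re.findall(r'[A-Z]\d{2}', sample_text)
strict_codes = capture_jel_strict(sample_text)

print(loose_codes)   # ['S12', 'F14', 'I32', 'O40']
print(strict_codes)  # ['F14', 'I32', 'O40']
```

The loose pattern picks up `S12` from the invented paper number, while the stricter version returns only the three codes on the JEL line.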

For the next part of the session, we'll start properly analyzing and getting insights from the text contents. The final result of the next part will be a count of the most used words in each text and we'll also count the total number of words in each text during the process.

## Text data pre-processing

Before we start, we need to think of the following:

- Our texts are in a very raw state. Shouldn't we "clean" them a bit before counting words?
- Using regex to capture words so we can count them sounds possible, but perhaps there is an easier way?
- Texts in English usually repeat many words that are not very insightful about the content, such as prepositions or pronouns. Can we get rid of some of them before the word count?
- Lastly, shouldn't we count in the same category words that are not exactly the same but have a very similar meaning? For example:
    + different conjugations of the same verb
    + singular and plural forms of the same noun
    
The answer to all of these questions is yes. We'll do this in the data pre-processing step, which is extremely important in text analysis. Skipping it changes the results of downstream tasks: for example, word counts end up dominated by stop words and split across different forms of the same word.

Data pre-processing can consist of multiple tasks. We'll apply the following for our corpus:

- Transform to lowercase
- Tokenization: transform texts into lists of words
- Remove stop words (words that are not very insightful, such as prepositions)
- Lemmatization: transform different forms of words into a common word that conveys a similar meaning. This is useful to "normalize" conjugations of verbs or plural forms of words
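
To make these four steps concrete, here is a deliberately crude, standard-library-only sketch. The stop-word list is hand-written and the "lemmatizer" only strips a trailing "s", so this is a toy stand-in for illustration and not how a real NLP library does it. All names here (`STOP_WORDS`, `toy_lemma`, `toy_preprocess`) are hypothetical.

```python
import re

# Tiny hand-written stop-word list; real lists are much longer
STOP_WORDS = {'the', 'of', 'on', 'in', 'a', 'and', 'are', 'is'}

def toy_lemma(word):
    """Very crude stand-in for lemmatization: strip a trailing 's'."""
    if word.endswith('s') and len(word) > 3:
        return word[:-1]
    return word

def toy_preprocess(text):
    text = text.lower()                                  # 1. lowercase
    tokens = re.findall(r'[a-z]+', text)                 # 2. tokenize into alphabetic words
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3. drop stop words
    return [toy_lemma(t) for t in tokens]                # 4. "lemmatize"

result = toy_preprocess('The effects of tariffs on the exports of small firms')
print(result)  # ['effect', 'tariff', 'export', 'small', 'firm']
```

A rule this simple breaks quickly (it would turn "analysis" into "analysi"), which is why a proper language model is used for the real work.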

Fortunately, there is a very useful Python library we can use for this: [spaCy](https://spacy.io/). SpaCy makes available pre-existing NLP models that tokenize, lemmatize, and detect stop words and non-word characters (such as digits or punctuation), so we can easily transform a text into a list of "meaningful" lemmatized words that we can use for word counts.

### Working with spaCy

First we need to install spaCy. Uncomment the line below, run it, and then comment it again with `#`.

In [None]:
#!pip install spacy

In [None]:
import spacy

Now we need to **download** spaCy's NLP model. Uncomment the line below, run it only once, and then comment it out again to make sure you won't run it again accidentally.

In [None]:
#!python -m spacy download en_core_web_sm

Now we **load** the model so it's available in this Python notebook:

In [None]:
nlp = spacy.load('en_core_web_sm')

Then, we'll build a function that:

1. Reads a text
1. Transforms it to lowercase
1. Loads it into the model
1. For each word, obtains the lemmatized versions of words that are not:
    - Stop words
    - Punctuation
    - Numbers
    - Spaces
    - Very short tokens (two characters or fewer)
1. Finally, the function returns a list of the lemmatized words

In [None]:
def word_tokenization_normalization(text):

    text = text.lower() # lowercase
    doc = nlp(text)     # loading text into model

    words_normalized = []
    for word in doc:
        if word.text != '\n' \
        and not word.is_stop \
        and not word.is_punct \
        and not word.like_num \
        and len(word.text.strip()) > 2:
            word_lemmatized = str(word.lemma_)
            words_normalized.append(word_lemmatized)

    return words_normalized

To get a better idea of what the function does, let's take a look at the result for one paper:

In [None]:
text = df['text'][10]
doc_tokenized = word_tokenization_normalization(text)

In [None]:
doc_tokenized

The result is a list of normalized words for the text.

You might have also noticed that this takes some time to run. To avoid having to wait, we'll apply the function to tokenize and normalize only the **abstracts**. We'll again use the Pandas method `apply()`.

In [None]:
df['abstract_tokenized'] = df['abstract'].apply(word_tokenization_normalization)

In [None]:
df.tail()

The downside of applying the tokenization and normalization to the abstracts is that they might not be long enough to make word repetition very insightful. In a non-training setting, we would use the full texts, leave the code running while we do other things or go for coffee, and come back to work with the results once it finishes.

## Counting words

Now that the texts are normalized, we can count words! We'll do two things:

1. Generate a column with the number of words
1. Generate a column with a dictionary where each word is a key and the number of times it appears is the value. This will look like `{'word1': n1, 'word2': n2, ...}`

For the first task, we can directly create a new column with the result in the dataframe:

In [None]:
df['n_words_abstract'] = df['abstract_tokenized'].apply(len)

In [None]:
df.head()

Just out of curiosity, let's pause for a minute to see the distribution in the number of words.

In [None]:
# Uncomment and run this line if you don't have seaborn:
#!pip install seaborn

In [None]:
import seaborn as sns

In [None]:
sns.histplot(data=df, x='n_words_abstract')

For the second task, we need to define a helper function that builds the dictionary from each tokenized abstract.

In [None]:
def word_counts(tokenized_text):

    count = {}

    for word in tokenized_text:
        if word in count:
            count[word] += 1
        else:
            count[word] = 1

    return count

We'll first apply the function to only one text to make sure the result looks correct.

In [None]:
abstract_tokenized = df['abstract_tokenized'][42]
count = word_counts(abstract_tokenized)
count

This looks interesting, but it's not very meaningful unless we spend some time looking at the result. We'll transform this into a barplot for easier interpretation, keeping only the words that appear more than twice.

In [None]:
count_trimmed = {}
for word, value in count.items():
    if value > 2:
        count_trimmed[word] = value

In [None]:
sns.barplot(count_trimmed, orient='h')

Now we'll apply the function `word_counts()` to all the abstracts.

In [None]:
df['abstract_word_count'] = df['abstract_tokenized'].apply(word_counts)

In [None]:
df.tail()
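
As an aside, Python's standard library includes `collections.Counter`, which does the same counting as `word_counts()` and can also sort words by frequency. A minimal sketch on an invented token list (the tokens are hypothetical, standing in for one row of `abstract_tokenized`):

```python
from collections import Counter

# Hypothetical tokenized abstract, standing in for one row of `abstract_tokenized`
tokens = ['trade', 'policy', 'trade', 'tariff', 'trade', 'policy', 'firm']

count = Counter(tokens)         # dictionary-like object: word -> number of occurrences
top_two = count.most_common(2)  # most frequent words first

# Keep only words that appear at least twice (threshold lowered for this tiny example)
count_trimmed = {word: n for word, n in count.items() if n >= 2}

print(top_two)        # [('trade', 3), ('policy', 2)]
print(count_trimmed)  # {'trade': 3, 'policy': 2}
```

Since `Counter` is a subclass of `dict`, it could be passed to `apply()` in place of `word_counts` if you prefer it.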

## Text classification

For the last part of the session, we'll do a simple text classification example. We're calling this "simple" because far more sophisticated, state-of-the-art text classification techniques are now available, but they are not suitable for a 90-minute session. See the section on LLMs at the end of this notebook if you want to explore them.

Simply put, text classification consists of assigning a text to a pre-defined group. If you're familiar with machine learning, this is exactly a supervised machine learning classification task. For our exercise, the pre-defined groups will be the first topic of the column `topics`.

In [None]:
df['first_topic'] = df['topics'].apply(lambda x: x.split(',')[0].lower())
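
If some rows have a missing value in `topics`, or stray spaces around the topic names, the lambda above may fail or produce near-duplicate labels. Whether that happens in this dataset is something we have not checked, so treat the following as a hedged alternative using pandas' `.str` accessor on invented values:

```python
import pandas as pd

# Hypothetical values standing in for the `topics` column
topics = pd.Series(['Trade Policy ,Poverty', ' Health,Education', None])

# The .str accessor propagates missing values instead of raising an error,
# and .str.strip() removes stray spaces around the topic name
first_topic = topics.str.split(',').str[0].str.strip().str.lower()

print(first_topic.tolist())
```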

Now we'll tabulate the result:

In [None]:
df['first_topic'].value_counts()

In [None]:
len(df['first_topic'].unique())

In [None]:
len(df)

There are 198 topics for a total of 399 papers (!), which means that a lot of topics have only one or two papers. We'll keep only topics that have at least five papers so that there are at least a few observations in each topic to build a classifier. This will reduce the size of our dataframe.

In [None]:
topic_counts = df['first_topic'].value_counts()
topics_to_keep = topic_counts[topic_counts >= 5]

In [None]:
topics_to_keep.sum()

Our resulting dataframe will have only 148 observations. This is not enough to generate a good classifier, but we'll still go ahead and use it for the exercise as an example of how the text classification method is applied.

In [None]:
df2 = df[df['first_topic'].isin(topics_to_keep.index)].reset_index(drop=True)

In [None]:
len(df2)

### Text encoding

Our classifier will be built (trained) using the tokenized and normalized abstracts. However, we first need to convert them into numbers so a classifier can work with them. This operation is called **encoding**.

There are several ways of encoding texts. We'll use term-frequency inverse-document frequency (TF-IDF). TF-IDF transforms a collection of words into a numeric vector where each word has a weight. It gives high weight to words that show up a lot in a given document, but rarely across documents in the corpus (so they are more distinctive for the document only).
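
To make the weighting idea concrete, here is a small standard-library sketch of one common textbook variant of TF-IDF, computed by hand on three invented documents. Note the assumption: scikit-learn's `TfidfVectorizer` uses a smoothed version of the formula and normalizes each vector by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three hypothetical tokenized documents
docs = [
    ['trade', 'tariff', 'trade', 'policy'],
    ['health', 'policy', 'clinic'],
    ['education', 'policy', 'school', 'school'],
]

n_docs = len(docs)
# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Textbook TF-IDF: (count / doc length) * log(n_docs / doc_freq)."""
    counts = Counter(doc)
    return {
        word: (n / len(doc)) * math.log(n_docs / doc_freq[word])
        for word, n in counts.items()
    }

weights = tfidf(docs[0])
print(weights)
```

In the first document, "policy" gets a weight of zero because it appears in every document, while "trade" gets the highest weight because it is frequent in that document and absent from the others.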

In [None]:
df2.head()

We'll start by installing the library we'll use for the encoding and text classification: scikit-learn.

In [None]:
# Uncomment the line below for the installation:
#!pip install scikit-learn

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Generating the encoder
corpus = list(df2['abstract_tokenized'].apply(lambda x: ' '.join(x)))
encoder = TfidfVectorizer(stop_words = ['paper'], max_features=1000)
vectors = encoder.fit_transform(corpus)

In [None]:
vectors.shape

For an easier understanding of the text encoding, we'll transform this back into a dataframe:

In [None]:
words_encoded = encoder.get_feature_names_out()

In [None]:
vectors_data = vectors.toarray()  # dense NumPy array; fine for a corpus this small

In [None]:
df_tfidf = pd.DataFrame(data=vectors_data, columns=words_encoded)
df_tfidf.insert(0, 'title', df2['title']) # inserting the paper title

In [None]:
df_tfidf.tail()

### Training a classifier

Now that our data is ready, we can train a classifier with it. We'll use a multinomial Naive Bayes classifier in this example, but other types of classifiers are available in the library we're using (scikit-learn).

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
classifier = MultinomialNB()

In [None]:
df2

In [None]:
x = df_tfidf.drop(columns = 'title')
y = df2['first_topic']
classifier.fit(x, y)

After this, `classifier` has been trained with the data in `x` to know which patterns in it produce the results in `y`.

### Classification

Now we'll classify our texts with the classifier we trained. Given that it was trained with encoded normalized words, the input for any classification should also be encoded normalized words. We'll use the same data from `df_tfidf` to produce a classification and compare it to the actual values to get a sense of how well this classifier performs.

In [None]:
predictions = classifier.predict(x)

In [None]:
# .copy() gives an independent dataframe, avoiding pandas' SettingWithCopyWarning
df_predictions = df2[['title', 'first_topic']].copy()
df_predictions['predictions'] = predictions

In [None]:
df_predictions.head()

In [None]:
# True where the predicted topic matches the actual first topic
df_predictions['correct'] = df_predictions['first_topic'] == df_predictions['predictions']

In [None]:
df_predictions['correct'].value_counts()

Some notes on this result:

- Our classifier is only 50% accurate. This is not good performance, but we had to work with a dataset small enough to manage in a short training session. In a real setting, you should ideally work with 1,000+ observations and compare different types of classifiers.
- We are evaluating our classifier on the same data we used to train it. In a real setting, this is bad practice because it hides overfitting: a classifier can work well on the data it was trained on and still fail to generalize to out-of-sample cases, and accuracy measured on the training data cannot reveal that. The way to avoid this is to split your data into a training set and a test set, use the training set for training, and use the test set for evaluating performance.
- You probably noticed that the dataframe `df_tfidf` is mostly zeros, that is, a sparse matrix. Sparse matrices are a common result of TF-IDF encoding. You can use principal component analysis (PCA) to reduce the matrix to a few meaningful components and work with those, which makes the computation easier.
- Moreover, you can augment the PCA vectors or TF-IDF matrix with other data that probably has predictive power for the variable we classify. Remember the JEL codes we extracted before? Those are probably good predictors in this case.
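
The train/test separation mentioned above could look roughly like this with scikit-learn. This is a sketch under stated assumptions: the eight mini "abstracts" and their labels are invented, and with so little data the accuracy number itself means nothing. The point is only the workflow of fitting on one part of the data and scoring on the other.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus: 8 short "abstracts" in two topics
texts = [
    'tariff export trade firm', 'trade agreement import tariff',
    'export market trade cost', 'import trade barrier firm',
    'clinic health vaccine child', 'health hospital patient care',
    'vaccine health mortality child', 'patient clinic health worker',
]
labels = ['trade'] * 4 + ['health'] * 4

# Hold out 25% of the documents; stratify keeps both topics in each split
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline fits the TF-IDF encoder on the training texts only
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(x_train, y_train)

predictions = model.predict(x_test)
accuracy = model.score(x_test, y_test)  # share of correct test predictions
print(list(predictions), accuracy)
```

Wrapping the encoder and the classifier in one pipeline is a design choice that keeps information from the test texts out of the TF-IDF vocabulary and weights.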

## Final notes

### Other languages

These exercises used a corpus in English. However, the principles for working with other languages are the same for all of these tasks. SpaCy has NLP models available in other languages; you can check them [here](https://spacy.io/usage/models).

### Other text analysis tasks

This was an overview of possibly the simplest text analysis tasks. Other tasks are:

- Named entity recognition: detecting mentions of a meaningful entity (places, names of people, dates, etc) in texts
- Cluster classification and topic modeling: classifying texts into groups based on their similarity
- Vector spaces and word embeddings: transforming texts or words into vectors of "meanings". You can then work with them for other tasks, such as comparing the proximity of texts based on meanings
- Generative AI with texts: generating texts based on prompts or previous text.

### Large Language Models (LLMs)

We didn't cover LLMs because they're not part of an introductory session. If you're interested in learning more about them, we recommend these readings:

- BERT was one of the first LLMs to be publicly released. This article explains well how it works: [BERT Explained: State of the art language model for NLP](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270)
- This is a tutorial of how to work with BERT to fine-tune it for specific NLP/text analysis tasks: [BERT Fine-Tuning Tutorial with PyTorch](https://mccormickml.com/2019/07/22/BERT-fine-tuning/)