# NLP V: Introduction to Topic Modeling and Sentiment Analytics

This notebook covers what you need to classify texts by topic and to detect their sentiment polarity.

## Topic Modeling

In this Jupyter Notebook, we will delve into the realms of topic modeling and conversational agents. Our aim is to explore how algorithms can be employed to automatically classify sentences into various topics, which can be particularly useful for topic moderation among other applications.

### Data Preprocessing Overview

First, we start by converting a set of sentences into a matrix. In this matrix, each column represents a unique word, and each row corresponds to a sentence. We utilize 1s and 0s to indicate whether or not a particular word is present in a given sentence.
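
As a minimal sketch of that idea, the snippet below builds such a presence matrix by hand for two sentences. The helper name `presence_matrix` and the plain whitespace tokenization are assumptions made for illustration; the notebook itself relies on scikit-learn's `CountVectorizer` for this step.

```python
def presence_matrix(sentences):
    """Return (vocabulary, matrix) where matrix[i][j] is 1 if word j occurs in sentence i."""
    tokenized = [sentence.lower().split() for sentence in sentences]
    # One column per unique word, in alphabetical order.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    # One row per sentence: 1 if the word is present, 0 otherwise.
    matrix = [[1 if word in tokens else 0 for word in vocab] for tokens in tokenized]
    return vocab, matrix

toy_vocab, toy_matrix = presence_matrix(["me gustan las vacas", "odio los perros"])
print(toy_vocab)
for row in toy_matrix:
    print(row)
```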

Let us start by importing the libraries we need.

In [None]:
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import numpy as np

# Don't forget to import your own library (the `tokenization` module used later); you will need it.
# Type your code here:


### Load the corpus texts

The corpus is similar to what we use for tagging; it consists of a set of sentences.

In [None]:
trainCorpus = ["Me gustan las vacas",
               "Me gustan los caballos",
               "odio los perros",
               "odio los caballos",
               "me gustan las ranas",
               "me gusta el helado",
               "no quiero comer",
               "los helados son cremosos"]

### Vectorize the corpus texts

The next step is to vectorize the sentences, i.e. the corpus texts. This means converting each text into a vector of word frequencies.

In [None]:
# The Spanish stopword list needs a one-time download: nltk.download('stopwords')
puncts = list(string.punctuation)
spanish_stopwords = nltk.corpus.stopwords.words('spanish') + puncts
vectorizer = CountVectorizer(stop_words=spanish_stopwords)

Keeping a list of stopwords and punctuation signs at hand lets us clean the dataset of elements that add no value. Stopwords are commonly occurring words (in English, 'the', 'and', 'is'; in Spanish, 'el', 'la', 'de'), and punctuation signs are marks such as commas, periods, and exclamation marks. Neither tends to be meaningful in the analysis.

**Reminder:** We're implementing this cleaning step because we are operating within a free context schema, where we're not restricted by any specific field or subject matter for our text data. In such a case, removing these stopwords and punctuation signs helps us focus on the words that actually carry significance.
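
To make the cleaning step concrete, here is a small sketch that removes punctuation and stopwords from a sentence. The stopword set `TOY_STOPWORDS` and the helper `remove_noise` are hypothetical and deliberately tiny; the notebook uses NLTK's full Spanish list instead.

```python
import string

# Hypothetical, deliberately tiny stopword set for illustration only.
TOY_STOPWORDS = {"me", "los", "las", "el", "no", "son"}

def remove_noise(sentence, stopwords=TOY_STOPWORDS):
    """Lowercase, strip ASCII punctuation, and drop stopwords."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in stopwords]

print(remove_noise("Me gustan las vacas!"))
print(remove_noise("los helados son cremosos."))
```

Note that `string.punctuation` only covers ASCII marks, so Spanish opening marks such as '¡' and '¿' would need to be added separately.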

In [None]:
vectorizer.fit(trainCorpus)

Fitting the vectorizer builds its vocabulary: the set of distinct words that survive the stopword and punctuation filter.

What it has actually learned is the following:

In [None]:
tf_feature_names = vectorizer.get_feature_names_out()
print('Features:', tf_feature_names)

We transform the documents into a document-term matrix (often abbreviated as DTM) using term frequency (TF).

**Little Exercise:** Print the shape of the `tfMatrix` matrix. Do you understand what each dimension means?

In [None]:
tfMatrix = vectorizer.transform(trainCorpus)
# print(tfMatrix)

print('Matrix:\n', tfMatrix.toarray())

# Type your code here:


**Carefully examine the matrix and compare it with both: `tf_feature_names` and the sentences in the corpus.** 

The `tfMatrix` is our dataset, where:
- each row represents a sample (a document in the corpus).
- each column represents an attribute (the frequency of a word in that document).
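
One way to carry out that comparison is to map a row back to the words it contains. The sketch below does so for a made-up row and feature list; `words_in_row` is a hypothetical helper, and in the notebook you could pass it a row of `tfMatrix.toarray()` together with `tf_feature_names`.

```python
def words_in_row(row, feature_names):
    """Return the feature names whose count in this row is greater than zero."""
    return [name for name, count in zip(feature_names, row) if count > 0]

toy_features = ["caballos", "gustan", "odio"]  # made-up vocabulary
toy_row = [1, 0, 1]                            # made-up document vector
print(words_in_row(toy_row, toy_features))     # ['caballos', 'odio']
```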

**Room for Improvement?**

Absolutely! You could implement stemming to group similar word forms together. For example, this would allow the algorithm to recognize 'gustan' as similar to 'gusta', or 'helados' as similar to 'helado', and so on. To keep this notebook from becoming overly lengthy, let's move on to some more key concepts.
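
To give a feel for what stemming buys us, here is a toy suffix stripper. It is not a real stemming algorithm: `toy_stem` and its three suffix rules are assumptions chosen only so that the word pairs mentioned above collapse to the same form. In practice a proper Spanish stemmer, such as the Snowball stemmer shipped with NLTK, would be the sensible choice.

```python
def toy_stem(word, min_length=4):
    """Strip a few plural/verb endings, keeping at least `min_length` characters."""
    word = word.lower()
    for suffix in ("es", "s", "n"):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_length:
            return word[: -len(suffix)]
    return word

for pair in [("gustan", "gusta"), ("helados", "helado")]:
    print(pair, "->", [toy_stem(word) for word in pair])
```

The stemmed words would then replace the originals before vectorizing, so that both forms share a single column of the matrix.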

### The LDA algorithm for extracting topics

The next algorithm we'll explore allows us to identify the distinct characteristics of various themes or topics.

We'll use the **Latent Dirichlet Allocation (LDA)** model for topic modeling. Here's a short explanation.

LDA is a type of probabilistic model used for topic modeling. Topic modeling is the task of automatically identifying topics present in a text corpus. The idea is to uncover the hidden thematic structure in a large collection of documents.

**How Does LDA Work?**

1) **Initialization:** You specify the number of topics you want the model to find in your corpus. For example, our corpus seems to revolve around two themes, "love" and "hate", so we ask for two topics. Other parameters control the learning process.
2) **Random Assignment**: Each word in each document is initially assigned to a random topic.
3) **Iterative Refinement:** The model goes through multiple iterations, reassigning each word to a topic based on:
    - How often the topic occurs in the document.
    - How often the word occurs in the topic across all documents.
    
4) **Convergence:** The algorithm refines these assignments until they converge to a stable state or until a fixed number of iterations is reached. You can change this limit if desired (the `max_iter` parameter in the code below).
5) **Output:** The end result is a set of topics, each represented as a collection of words, and the weight or probability of each word belonging to a given topic.
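
The steps above read like the Gibbs-sampling formulation of LDA, so here is a compact sketch of a collapsed Gibbs sampler that follows them literally. Everything in it is an assumption made for illustration: the function name `gibbs_lda`, the smoothing values `alpha` and `beta`, and the toy documents. scikit-learn's `LatentDirichletAllocation`, which we use in this notebook, takes a different route (variational Bayes), so this is not what runs under the hood there.

```python
import random

def gibbs_lda(docs, n_topics, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # topic counts per document
    topic_word = [[0] * n_words for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                           # total words per topic
    assignments = []

    # Random assignment: every word occurrence starts in a random topic.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(n_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    # Iterative refinement for a fixed number of sweeps over the corpus.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w, k_old = word_id[word], assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][k_old] -= 1
                topic_word[k_old][w] -= 1
                topic_total[k_old] -= 1
                # Weight each topic by (topic in document) * (word in topic).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                k_new = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][w] += 1
                topic_total[k_new] += 1

    return vocab, doc_topic, topic_word

toy_docs = [["gustan", "vacas"], ["gustan", "caballos"],
            ["odio", "perros"], ["odio", "caballos"]]
toy_vocab, toy_doc_topic, toy_topic_word = gibbs_lda(toy_docs, n_topics=2)
for k, counts in enumerate(toy_topic_word):
    print("Topic", k, dict(zip(toy_vocab, counts)))
```

The returned count tables play the same role as the output described in step 5: `toy_topic_word` says how strongly each word belongs to each topic, and `toy_doc_topic` says how each document is split across topics.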

In [None]:
# Define the number of topics and initialize the Latent Dirichlet Allocation model
topics = 2
lda_model = LatentDirichletAllocation(
    n_components=topics,      # Number of topics
    max_iter=5,               # Maximum number of iterations for the optimization
    learning_method='online', # Learning method, online variational Bayes
    learning_offset=50.,      # A parameter to downweigh early iterations in online learning
    random_state=0            # Random seed
).fit(tfMatrix)               # Fit the model to the term-frequency matrix

# The H matrix contains the words associated with each topic
# The components_ attribute gives us the topics found
H = lda_model.components_

# The W matrix contains the topic distribution for each document
# The transform method gives us the topic distribution for each document
W = lda_model.transform(tfMatrix)

# Define the number of top words to display for each topic and the number of top documents for each topic, respectively
n_top_words = 2
n_top_documents = 3

# Function to display the topics
def display_topics(H, W, feature_names, documents, n_top_words, n_top_documents):
    for topic_idx, topic in enumerate(H):
        print('-------------')
        print('Topic', topic_idx)
        # Print the top words in the topic
        for i in topic.argsort()[:-n_top_words - 1:-1]:
            print(feature_names[i])
        # Print the documents that have the highest probability for this topic
        top_doc_indices = np.argsort(W[:, topic_idx])[::-1][0:n_top_documents]
        for doc_index in top_doc_indices:
            print(documents[doc_index])

In [None]:
display_topics(H, W, tf_feature_names, trainCorpus, n_top_words, n_top_documents)

It's worth mentioning that since I set the algorithm to return strictly three documents for each topic, and there are only two documents that closely match these topics, the algorithm returns 'no quiero comer' as one of the top documents.
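
This behaviour is a property of any "top n" selection: it returns n items no matter how low their scores are. The sketch below reproduces it with made-up scores and shows one possible remedy, a minimum-score filter. The helper `top_documents`, the scores, and the 0.6 cut-off are all hypothetical.

```python
def top_documents(scores, n, min_score=0.0):
    """Indices of the n highest-scoring documents, optionally dropping weak matches."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:n] if scores[i] >= min_score]

topic_scores = [0.91, 0.12, 0.88, 0.50]      # made-up topic weights for four documents
print(top_documents(topic_scores, 3))        # [0, 2, 3]
print(top_documents(topic_scores, 3, 0.6))   # [0, 2]
```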

## Sentiment Analysis using NLTK's VADER

Sentiment Analysis is the computational study of people's opinions, sentiments, emotions, and attitudes. This technique is commonly applied to customer reviews, social media comments, and survey responses to make data-driven decisions. In the field of Natural Language Processing (NLP), it holds a significant place.

In this notebook, we will use VADER (Valence Aware Dictionary and sEntiment Reasoner), a pre-built sentiment analysis tool that ships with the Natural Language Toolkit (NLTK). VADER is particularly good at handling social media text, short sentences, and even emoticons. Manual sentiment analysis is of course an option, but since we have already invested significant effort in manual methods, leveraging VADER here is both practical and efficient.

The goal is to classify sentences or texts into sentiment categories: positive, negative, or neutral. VADER is rule-based and lexicon-based, so no training step is involved.

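### How VADER Works

VADER splits the text into individual words (tokens) and looks each one up in its predefined lexicon, where every entry carries a positive or negative valence score; words missing from the lexicon count as neutral. Rule-based heuristics then adjust these scores for punctuation, capitalization, intensifiers, negation, and contrastive conjunctions such as 'but'. The overall sentiment of the sentence is calculated from the sum of the adjusted scores.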
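
The lexicon-and-sum idea can be sketched in a few lines. The lexicon `TOY_LEXICON` and its valence values are made up for illustration, and none of VADER's heuristics (capitalization, negation, intensifiers, and so on) are implemented. The normalization `x / sqrt(x² + 15)` is, as far as I can tell, the one VADER's reference implementation uses to squeeze the summed valence into the range -1 to 1, but treat that as an assumption rather than a specification.

```python
import math

# Hypothetical valence values, for illustration only.
TOY_LEXICON = {"good": 1.9, "funny": 1.9, "bad": -2.5, "horrible": -2.5}

def toy_compound(sentence, lexicon=TOY_LEXICON, alpha=15):
    """Sum the valence of known words and normalize the total into (-1, 1)."""
    total = sum(lexicon.get(word.strip(".,!?").lower(), 0.0) for word in sentence.split())
    return total / math.sqrt(total * total + alpha)

for text in ["The book was good.", "A really bad, horrible book.", ""]:
    print(repr(text), round(toy_compound(text), 4))
```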

Importing NLTK Libraries:

In [None]:
# The VADER lexicon needs a one-time download: nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

### Corpus for Analysis

In the corpus that we've prepared for analysis, we've included a variety of sentences with differing levels of complexity. These range from simple statements to more nuanced expressions, in order to test the capabilities of our sentiment analysis model.

In [None]:
corpus = ["VADER is smart, handsome, and funny.", # example of a positive sentence
"VADER is smart, handsome, and funny!", # Detection of exclamatory emphasis (increased intensity of feeling).
"VADER is very smart, handsome, and funny.", # detection of augmentative words (increased intensity)
"VADER is VERY SMART, handsome, and FUNNY.", # emphasis derived from capitalization
"VADER is VERY SMART, handsome, and FUNNY!!!",# combination of the above
"VADER is VERY SMART, really handsome, and INCREDIBLY FUNNY!!!",# combination of the above to the highest level
"The book was good.", # positive sentence
"The book was kind of good.", # decrease of positivity (intensity adjustment)
"The plot was good, but the characters are uncompelling and the dialog is not great.", # negation
"A really bad, horrible book.", # negative sentence with intensity enhancers
"At least it isn't a horrible book.", # negation of negativity
":) and :D", # emoticons
"", # empty strings are treated correctly
"Today kinda sux", # slang word detection
"Today KINDA SUX!", # combination of slang with capitalisation and exclamation (increases sentiment)
"I'll get by", # this sentence is neutral
"Today kinda sux, think I'll get by", # This example serves to compare with the following sentence    
"Today kinda sux, but I'll get by" # 'but' softens the negativity of the previous sentence.
]

### Execute Pre-Built Function

We will make use of the `SentimentIntensityAnalyzer` class from VADER. Its `polarity_scores()` method will help us quickly and efficiently determine the sentiment of each sentence in our corpus.
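
VADER returns scores rather than labels, so turning a result into "positive", "negative", or "neutral" takes one more step. A common convention, which I believe comes from VADER's own documentation, is to threshold the `compound` score at ±0.05. The helper `label_from_compound` below is a hypothetical sketch of that convention; adjust the threshold to your data.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

for score in (0.44, -0.3, 0.0):
    print(score, "->", label_from_compound(score))
```

With an analyzer instance at hand, the input would be the `'compound'` entry of the dictionary returned by `polarity_scores()`.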

In [None]:
sid = SentimentIntensityAnalyzer()

Now, let's analyze the sentences in our example.

In [None]:
# Iterate through each sentence in the corpus
for sentence in corpus:

    # Print the sentence that is about to be analyzed
    print(sentence)
    
    # Calculate the sentiment scores of the sentence and store them in a dictionary 'ss'
    ss = sid.polarity_scores(sentence)
    
    # Sort and print each component of the sentiment score
    # 'compound' is the aggregated sentiment score
    # 'neg' represents negativity
    # 'neu' represents neutrality
    # 'pos' represents positivity
    for k in sorted(ss):
        print(f"{k}: {ss[k]}")

### Handmade Sentiment Analyzer: A Custom Approach

In the following section, we will create our own Sentiment Analyzer. We will choose a custom set of words that can be associated with a positive or a negative sentiment. We'll also associate these custom words to sentiment scores.

**Context Matters**

The context in which you use this model is crucial. For instance, in a restaurant review, the term 'chorizo' is likely neutral or even positive. However, in the context of a Twitter post discussing corruption, 'chorizo' could be decidedly negative.
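
One simple way to encode this is to keep a separate lexicon per domain. The sketch below is only an illustration: the domain names, the scores, and the helper `word_score` are all made up.

```python
# Hypothetical per-domain lexicons: the same word carries a different score in each.
DOMAIN_LEXICONS = {
    "restaurant": {"chorizo": 0.5, "malo": -0.7},
    "politics": {"chorizo": -1.0, "corrupto": -1.0},
}

def word_score(word, domain):
    """Look up a word in the lexicon of the given domain; unknown words score 0."""
    return DOMAIN_LEXICONS.get(domain, {}).get(word.lower(), 0.0)

print(word_score("chorizo", "restaurant"))
print(word_score("chorizo", "politics"))
```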

**The Big Picture**

The overarching goal of this approach is to sift through a large set of data and categorize it into different thematic clusters or sentiment groups. We define one set with positive sentiments, another set with negative sentiments, and perhaps another set for neutral sentiments.

A trainable algorithm learns these categorizations during its training phase and applies this knowledge to unseen data. The quality of its predictions depends heavily on the size and quality of the training data. This versatile approach can be adapted to a wide range of applications, from predicting the mood of a song to determining the sentiment conveyed in a photograph.

So, whether you're employing VADER or a more advanced sentiment analyzer such as a BERT-based classifier, the customizable approach outlined below provides invaluable insights. Faced with something complex? Simply break it down into manageable and understandable components.
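
The analyzer below calls `tokenization.tokenize` and `tokenization.clean_sw` from your own library. If you do not have that library at hand, the following stand-in lets the cells run. It is a guess at the interface, not your actual code: the lowercasing, the regular expression, and the tiny stopword set are all assumptions.

```python
import re
from types import SimpleNamespace

# Hypothetical stopword set; your own library probably uses a fuller list.
_TOY_STOPWORDS = {"es", "un", "una", "el", "la", "los", "las", "de", "que",
                  "se", "pero", "como", "al", "en", "sus", "nos"}

def _tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())

def _clean_sw(tokens):
    """Remove stopwords from a list of tokens."""
    return [token for token in tokens if token not in _TOY_STOPWORDS]

# Stand-in object exposing the two functions under the names the notebook expects.
tokenization = SimpleNamespace(tokenize=_tokenize, clean_sw=_clean_sw)

print(tokenization.clean_sw(tokenization.tokenize("Mortadelo es un político feo pero malo")))
```

Skip this cell if you have already imported your own `tokenization` module, since it would replace it.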

In [None]:
# Predefined positive and negative words
p_pos = ['inteligente', 'simpático', 'bueno', 'amable']
p_neg = ['malo', 'feo', 'ladrón', 'corrupto', 'chorizo']

def sentiment(_frase):

    # Tokenization and cleaning the sentence using predefined functions from your library
    tokens = tokenization.tokenize(_frase)
    tokens = tokenization.clean_sw(tokens)

    # Initializing counters for positive, neutral and negative words
    pos = 0
    neu = 0
    neg = 0

    # Loop through the tokens and increment counters based on the word sentiment
    for token in tokens:
        if token in p_pos:
            pos += 1
        elif token in p_neg:
            neg += 1
        else:
            neu += 1
    
    # Applying weights to positive and negative counts to get a polarity score
    pol = (pos * 0.6) - (neg * 0.7)
    neu_score = neu * 0.1  # Assign some weight to neutral words, keeping the raw count intact

    # Calculate the final sentiment score
    res = pol + neu_score

    # Display the results
    print('Sentence:', _frase)
    print('--------------------')
    print('Sentiment:', res)
    print('Positive:', pos)
    print('Neutral:', neu)
    print('Negative:', neg)

    

In [None]:
sentiment('Mortadelo es un político feo pero malo')

The modifications are endless! Building upon our original custom sentiment analyzer, we introduce an innovative feature—dynamic weighting. This added complexity enables a more nuanced understanding of sentiment in the provided text.

In the original version, each positive and negative word received a static weight, potentially leading to imprecise sentiment scoring. However, with dynamic weighting, the significance of each positive or negative word is influenced by its frequency relative to the total number of tokens in the sentence.
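
To see the effect in numbers, the sketch below compares both scoring rules on the same word counts. The two helper functions are hypothetical; the constants (0.6 and 0.7 for the static version, 0.6 and 0.4 as base weights for the dynamic one, 0.1 for neutral words) are the ones used in this notebook's code cells.

```python
def static_score(pos, neg, neu):
    """Fixed weights: every positive or negative word counts the same."""
    return pos * 0.6 - neg * 0.7 + neu * 0.1

def dynamic_score(pos, neg, neu):
    """Weights grow with the share of positive/negative words in the sentence."""
    total = pos + neg + neu
    prop_pos = pos / total if total else 0
    prop_neg = neg / total if total else 0
    return pos * (0.6 + prop_pos) - neg * (0.4 + prop_neg) + neu * 0.1

# One positive, two negative and one neutral word.
print("static: ", round(static_score(1, 2, 1), 2))   # -0.7
print("dynamic:", round(dynamic_score(1, 2, 1), 2))  # -0.85
```

For the dynamic version, half of the tokens are negative, so each negative word weighs 0.4 + 0.5 = 0.9 and the positive word weighs 0.6 + 0.25 = 0.85, giving 0.85 - 1.8 + 0.1 = -0.85.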

In [None]:
# Predefined positive and negative words
p_pos = ['inteligente', 'simpático', 'carismático', 'amable']
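# Tokens are single words, so a multi-word entry such as 'cambio climático' could never match; 'climático' stands in for it.
p_neg = ['extinguiremos', 'megaminería', 'climático', 'corrupto', 'lamentablemente', 'choreó', 'terroríficas']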

def sentiment(_frase):
    tokens = tokenization.tokenize(_frase)
    tokens = tokenization.clean_sw(tokens)
    
    # Initialize counters
    pos = 0
    neu = 0
    neg = 0
    
    # Total number of tokens in the sentence
    total_tokens = len(tokens)
    
    # Increment counters based on the word sentiment
    for token in tokens:
        if token in p_pos:
            pos += 1
        elif token in p_neg:
            neg += 1
        else:
            neu += 1
    
    # Calculate the proportion of positive and negative tokens
    if total_tokens == 0:  # To avoid division by zero
        prop_pos = 0
        prop_neg = 0
    else:
        prop_pos = pos / total_tokens
        prop_neg = neg / total_tokens
    
    # Dynamic weighting based on the proportion of positive and negative tokens
    weight_pos = 0.6 + prop_pos
    weight_neg = 0.4 + prop_neg
    
    # Calculating polarity score with dynamic weights
    pol = (pos * weight_pos) - (neg * weight_neg)

    # Apply some weight to neutral words
    neu_score = neu * 0.1 
    
    # Calculate the final sentiment score
    res = pol + neu_score
    
    # Display the results
    print('Sentence:', _frase)
    print('---------------------------------------')
    print('Positive Words:', pos)
    print('Neutral Words:', neu)
    print('Negative Words:', neg)
    print('Sentiment Score:', res)

In [None]:
sentiment('Salchichón se choreó la partida presupuestaria de la obra pública. Sin embargo, como es un político carismático ganará la elecciones. Todos estamos al tanto de sus terroríficas políticas en post de la megaminería, que contribuyen al cambio climático. Lamentablemente, nos extinguiremos como raza.')

**Final Remarks: The Power of Creative Combinations**

As we close this tutorial, it's crucial to note that the tools and techniques we've explored are not isolated entities but components of a greater analytical framework. The true power of Natural Language Processing and Sentiment Analysis lies in creatively combining these elements to tailor solutions to specific problems or datasets.

Whether it's enhancing a simple Sentiment Analyzer with dynamic weighting, incorporating Latent Dirichlet Allocation for topic modeling, or even merging different machine learning models—your creativity is the limit.

Each method complements the others, and together they can provide more nuanced and accurate insights into textual data than any single approach could achieve on its own. Feel free to experiment, innovate, and most importantly, have fun exploring the endless possibilities that these tools offer.