# Tutorial on Analyzing Group Chats

Reading all the messages from a group chat can be a tedious task, especially when you receive over 100 text messages while you are away. `txtanalyzer` can alleviate your heavy reading load by providing an accessible gateway to Natural Language Processing (NLP) and supporting you with fast, efficient analyzing tools. This is a short tutorial on performing several Natural Language Processing (NLP) tasks on group chats using `txtanalyzer`. 

For this tutorial, we will introduce and guide you through four popular NLP tasks leveraging the functions in `txtanalyzer`:
- Extract keywords
- Extract topics
- Sentiment analysis
- Detect language patterns


## Import

First, let's import `txtanalyzer` as below. We can check the version of `txtanalyzer` by using the attribute `.__version__`.

In [1]:
import txtanalyzer

print(txtanalyzer.__version__)

0.1.0


Below is a list of sample texts that cover a diverse range of conversations among peers. We will work on this sample texts for the majority of the tutorial.

In [2]:
sample_text = [
    "Hey, has anyone watched the latest episode of that new series?",
    "Not yet! Is it worth watching?",
    "Totally! The twists this season are insane.",
    "Okay, now I’m intrigued. Adding it to my watchlist.",
    "I’m thinking of baking cookies today. Any flavor suggestions?",
    "Chocolate chip, always a classic!",
    "Good choice! I’ll try adding some sea salt on top for extra flavor.",
    "Ooh, sea salt on cookies? That sounds amazing. Let us know how it turns out."
    "What’s the best vacation spot you’ve been to?",
    "Bali, hands down. The beaches are incredible.",
    "Oh, I’ve always wanted to visit Bali. Did you go to any of the waterfalls there?",
    "Yes! Tegenungan Waterfall was breathtaking.",
    "Does anyone else feel like AI is moving so fast?",
    "For sure. It’s exciting but also a little scary sometimes.",
    "Agreed. I mean, just look at how tools like ChatGPT are changing how we work.",
    "True, but it’s also helping with so many tasks I used to find tedious.",
    "I just got a new gaming headset, and it’s a game-changer!",
    "Nice! Which one did you get?",
    "The HyperX Cloud II. The sound quality is amazing, and it’s super comfortable.",
    "I’ve heard good things about that one. Great pick!"
]


## Keyword Extraction

Keyword extraction is a technique to identify and extract the most relevant words or phrases from a given set of text messages. This function supports keyword extraction by using the Term Frequency-Inverse Document Frequency (TF-IDF). It is helpful in summarizing text, identifying key terms, or preprocessing data for further text analysis tasks.

To use the keyword extraction function, import `extract_keywords` from `txtanalyzer.extract_keywords`.

In [3]:
from txtanalyzer.extract_keywords import extract_keywords

Below extracts top keywords from the list of messages using TF-IDF. Each message's keywords are determined based on their importance relative to the entire group chat. This is made possible by specifying the parameter `method`="tfidf" and `num_keywords` = 3.

In [4]:
keywords = extract_keywords(sample_text, method="tfidf", num_keywords=3)

print(keywords[:5])

[['watched', 'series', 'latest'], ['worth', 'watching', 'yes'], ['twists', 'totally', 'season'], ['watchlist', 'okay', 'intrigued'], ['today', 'thinking', 'suggestions']]


From the above results, we can see that the group talks a lot about watching TV series.

## Topic Modeling

Another way to summarize texts is topic modeling. Topic modeling is a tool to identify and extract different topics mentioned in a text and represent these topics with a group of words or phrases originated from the text. Our topic modeling function leverages the algorithm of Non-negative Matrix Factorization to reduce the text corpus to multiple topics. This application is helpful in summarizing and identifying common themes in long texts.

To use the topic modeling function in our package, import `topic_modeling` from `txtanalyzer.topic_modeling` as below.

In [5]:
from txtanalyzer.topic_modeling import topic_modeling

Now we can apply topic modeling to our sample texts. 

Below returns 10 topics via topic modeling, where each topic is represented by 3 words selected from the sample texts. This is made possible by specifying the parameter `n_topics` = 10 and `n_words` = 3. 

Note: A runtime warning might be returned but it is expected when the number of topics requested exceeds the maximum number of topics that Non-negative Matrix Factorization will extracts. It will still return as many topics as requested while throwing a warning.

In [6]:
topic_modeling(sample_text, n_topics = 10, n_words = 3)

{'Topic 1': ['ii', 'quality', 'hyperx'],
 'Topic 2': ['yes', 'tegenungan', 'breathtaking'],
 'Topic 3': ['incredible', 'hands', 'beaches'],
 'Topic 4': ['like', 'ai', 'fast'],
 'Topic 5': ['insane', 'twists', 'season'],
 'Topic 6': ['flavor', 'adding', 'sea'],
 'Topic 7': ['worth', 'watching', 'suggestions'],
 'Topic 8': ['little', 'sure', 'exciting'],
 'Topic 9': ['did', 'nice', 'oh'],
 'Topic 10': ['new', 'latest', 'episode']}

It seems like using 3 representative words for each topic is not too insightful. We have topics like `Topic 5`, i.e. a topic on insane twists in a TV series season, that are easy to comprehend, but we also have topics like `Topic 9` that do not tell us anything useful. Let's try adding more words for each topic.

Thus, we extract 5 words from the sample texts to represent 10 topics. The `random_state` parameter is also specified to ensure reproducibility when rerunning the function. The default of `random_state` is set to `123`.

In [7]:
topic_modeling(sample_text, n_topics = 10, n_words = 5, random_state = 456)



{'Topic 1': ['incredible', 'hands', 'beaches', 'bali', 'little'],
 'Topic 2': ['new', 'latest', 'watched', 'episode', 'hey'],
 'Topic 3': ['insane', 'season', 'totally', 'twists', 'chip'],
 'Topic 4': ['okay', 'watchlist', 'intrigued', 'adding', 'try'],
 'Topic 5': ['yes', 'waterfall', 'tegenungan', 'breathtaking', 'helping'],
 'Topic 6': ['like', 'ai', 'feel', 'fast', 'moving'],
 'Topic 7': ['cookies', 'flavor', 'sea', 'salt', 'suggestions'],
 'Topic 8': ['pick', 'heard', 'things', 'great', 'good'],
 'Topic 9': ['did', 'nice', 'waterfalls', 'oh', 'wanted'],
 'Topic 10': ['sound', 'hyperx', 'super', 'quality', 'comfortable']}

This is more informative and detailed than the previous result. For instance, we can now deduce that the group talked about the incredible beaches in Bali from `Topic 1`.

## Sentiment Analysis

## Detect Language Patterns

Detecting language patterns is another great way to get a better understanding of text messages, especially when you have an international group of peers that speak in different languages. 

The `detect_language_patterns`function spots patterns like common n-grams (word combinations), frequently used characters, or the mix of languages in a dataset. These patterns can help you see key trends and details in the text, like often-mentioned terms, writing styles, or the overall language makeup.


To use the language pattern detection function, import `detect_language_patterns` from `txtanalyzer.detect_language_patterns` as shown below.

In [8]:
from txtanalyzer.detect_language_patterns import detect_language_patterns

We are using a different sample text below to test the function's ability to read a mix of languages. The sample covers a mix of themes, including artificial intelligence and meditation, and a mix of languages, including English, French, and Chinese.

In [9]:
mix_text = [
    "Artificial intelligence and machine learning are transforming industries around the globe.",
    "The basketball team secured a thrilling victory in the final seconds of the game.",
    "Yoga and meditation are excellent for reducing stress and improving mental health.",
    "Exploring the hidden beaches of Bali is an unforgettable experience for any traveler.",
    "Quantum computing is expected to revolutionize data processing and cryptography.",
    "L'intelligence artificielle et l'apprentissage automatique transforment les industries du monde entier.",
    "L'équipe de basket-ball a remporté une victoire passionnante dans les dernières secondes du match.",
    "Le yoga et la méditation sont excellents pour réduire le stress et améliorer la santé mentale.",
    "L'exploration des plages cachées de Bali est une expérience inoubliable pour tout voyageur.",
    "L'informatique quantique devrait révolutionner le traitement des données et la cryptographie.",
    "人工智能和机器学习正在改变全球各行各业。",
    "篮球队在比赛的最后几秒钟取得了激动人心的胜利。",
    "瑜伽和冥想是减压和改善心理健康的绝佳方式。",
    "探索巴厘岛隐秘的海滩对任何旅行者来说都是一次难忘的经历。",
    "量子计算有望彻底改变数据处理和密码学。"
]

The example below demonstrates how to use the **`detect_language_patterns`** function to analyze the sample text.

1. **Language Detection**  
   The first part detects the language of each message in the sample text by setting the parameter `method="language"`. The result is a list of detected languages, where each entry corresponds to the language of a sentence in the sample text.

In [10]:
# Detect the language of each message in the sample text
result = detect_language_patterns(mix_text, method="language")
print(result)

['en', 'en', 'en', 'en', 'en', 'fr', 'fr', 'fr', 'fr', 'fr', 'zh-cn', 'zh-cn', 'zh-cn', 'zh-cn', 'zh-cn']


Each detected language (`en` for English, `fr` for French, and `zh-cn` for Chinese) corresponds to a sentence in the sample_text.

2. **Bigram Extraction**  
   The second part identifies the top 5 most common bigrams (two-word combinations) in the sample text by setting the parameters `method="ngrams"`, `n=2`, and `top_n=5`. The output shows the bigrams along with their frequencies.


In [11]:
# Extract the top 5 most common bigrams (two-word combinations)
result = detect_language_patterns(sample_text, method="ngrams", n=2, top_n=5)
print(result)


[('sea salt', np.int64(2)), ('salt on', np.int64(2)), ('did you', np.int64(2)), ('and it', np.int64(2)), ('hey has', np.int64(1))]


The bigram `et la` appears twice in the French sentences, while other bigrams occur once in the English sentences.