Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel $\rightarrow$ Restart) and then **run all cells** (in the menubar, select Cell $\rightarrow$ Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and email below:

In [None]:
# Full name
NAME = ""
# Institutional email (hm.edu or hmtm.de)
EMAIL = ""

---

# Day 6 - Analyzing & visualizing text messages

+ **AI in Culture and Arts - Tech Crash Course**
+ **Date:** 13.06.2024
+ **Author:** Lenny Martinez Dominguez, Ph.D candidate at Sorbonne Université

<a href="https://colab.research.google.com/github/aica-wavelab/aica-assignments/blob/main/A6_conversation_analysis_and_visualization/1_conversation_analysis.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## 0. Getting Started

### Introduction 
This sixth day of class will teach you:
- How to explore and analyze your text data using ML and conventional methods.
- How to visualize text data using Seaborn.

### Content of the repository
- `utils.py`: A Python script with some pre-written functions that we will use in this notebook.
- `data`: The folder that will contain the data of interest. As an example, we provide a famous conversation between Romeo and Juliet (the balcony scene).
- `1_conversation_analysis.ipynb`: This notebook, in which we will go through the processing and analysis of conversation data.
- `2_conversation_visualization.ipynb`: A second notebook that specifically demonstrates how to plot graphs and visualize data.

### Assignment

Sketch and implement a creative data visualization of a conversation of your choice. You can use the provided data (Romeo and Juliet) or import a private WhatsApp conversation of your choice.

<div class="alert alert-block alert-warning">
<b>Instruction:</b> Do not share any of your conversation data in the assignment! We will also teach you about data anonymization in this notebook.
</div>

### Installation required
Run the following cell to make sure you have all the packages you need.

In [None]:
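# General purpose packages
# (pprint is part of Python's standard library and does not need to be installed;
# beautifulsoup4 provides the BeautifulSoup library used for web scraping)
!pip install pandas numpy requests beautifulsoup4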

# Packages for ML
!pip install sentence-transformers umap-learn transformers nltk bertopic

---

## 1. Importing a conversation

We will be working with conversation data today. This section will guide you through the process of importing data, either from:
1. A famous Shakespearean dialogue: the balcony scene from Romeo and Juliet.
2. A WhatsApp conversation with a relative of yours.

### 1.1 Import the balcony dialogue from Romeo and Juliet

We will start by importing the dialogue from the balcony scene of Romeo and Juliet. We will use a public [website](https://shakespeare.mit.edu/romeo_juliet/romeo_juliet.2.2.html) to extract the dialogue into a Python variable.

The extraction will follow these steps:
1. Extract HTML content from the url, using the `requests` library.
2. Parse the HTML content using the `BeautifulSoup` library, which is a Python library for pulling data out of HTML and XML files. This process is called web scraping.
3. Store the extracted content in a `pandas.DataFrame` object with the following columns:
    - `date_time`: The date and time of the message. In our case, we will only use an integer (`int`) to represent the order of the message.
    - `sender`: The name of the message sender.
    - `message`: The message content.
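
To make steps 2 and 3 more concrete, here is a minimal sketch of how tagged HTML can be turned into rows with the three columns above. It is not the implementation found in `utils.py`: it swaps BeautifulSoup for Python's built-in `html.parser` so that it runs without extra packages, and the sample markup (the `speaker` and `line` classes) is made up for illustration and does not match the real page.

```python
from html.parser import HTMLParser

# Made-up HTML snippet; the real page uses different markup
SAMPLE_HTML = """
<html><body>
<p class="speaker">ROMEO</p><p class="line">But, soft! what light through yonder window breaks?</p>
<p class="speaker">JULIET</p><p class="line">O Romeo, Romeo! wherefore art thou Romeo?</p>
</body></html>
"""

class DialogueParser(HTMLParser):
    """Collect one row per spoken line, remembering who is speaking."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current_class = None  # class of the <p> tag we are currently inside
        self._sender = None         # last speaker name seen

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._current_class = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "p":
            self._current_class = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._current_class == "speaker":
            self._sender = text
        elif self._current_class == "line":
            # The message order doubles as the `date_time` value
            self.rows.append(
                {"date_time": len(self.rows), "sender": self._sender, "message": text}
            )

parser = DialogueParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)` to obtain a table with one column per key.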

In [None]:
# STEP 1
import requests
import pprint # A library to print data in a more readable format
url = 'https://shakespeare.mit.edu/romeo_juliet/romeo_juliet.2.2.html'
html = requests.get(url).text
pprint.pprint(html)


<div class="alert alert-info">
<b>Instruction:</b> What is HTML? Besides the dialogue, what other information about the website can be extracted from this HTML file?
</div>

YOUR ANSWER HERE

To process the HTML content, we will use a function `html2dataFrame` we programmed for you. This function is imported from the `utils.py` script.

Here is the header of the function:
```python
def html2dataFrame(html_content : str, save_csv : bool = True) -> pd.DataFrame:
    # the code...
```

<div class="alert alert-info">
<b>Instruction:</b> What is the type of the `html_content` argument? What is the type of the `save_csv` argument? What is the type of the function output?
</div>

YOUR ANSWER HERE

<div class="alert alert-info">
<b>Instruction:</b> Write the docstring of the function `html2dataFrame` in `utils.py`.
</div>

Let's now use the function to extract the dialogue from the balcony scene of Romeo and Juliet.

In [None]:
# STEP 2 and 3
from utils import html2dataFrame
df = html2dataFrame(html, False)
df.head()

### 1.2 Importing a conversation from WhatsApp

As an alternative, we suggest that you import a WhatsApp conversation with one of your relatives.

Be aware that such data is highly personal and sensitive. It should be treated with care, as another person besides you is involved! **Importantly, such data should never be shared as part of an assignment, or on any other platform**.

Such data is a good opportunity to learn about two important concepts in data science: **data anonymization** and **data protection**.

#### a. Export the conversation from your phone

Follow this guide to export a conversation from WhatsApp:
- [English](https://faq.whatsapp.com/1180414079177245/?locale=en_US&cms_platform=android)
- [German](https://faq.whatsapp.com/1180414079177245/?locale=de_DE&cms_platform=android&cms_id=1180414079177245&draft=false)

Export **Without Media** to minimize the size of the downloaded file.

By default, WhatsApp exports a file named `_chat.txt`. Find that file on your computer and note its path.

#### b. Import the conversation as a DataFrame

In the file `utils.py`, we have written a function `whatsapp2dataFrame` that imports the conversation into a `pandas.DataFrame` for you.

In [None]:
from utils import whatsapp2dataFrame

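whatsapp_archive_path = 'path/to/your/whatsapp/archive.txt' # Warning: keep this file outside the notebook's folder to minimize the risk of sharing it with your assignment
# Uncomment the next line to load your own conversation
# (this assumes the function takes the file path as its argument; check `utils.py`):
# df = whatsapp2dataFrame(whatsapp_archive_path)
df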

<div class="alert alert-info">
<b>Instruction:</b> What is data anonymization?
</div>

YOUR ANSWER HERE

<div class="alert alert-info">
<b>Instruction:</b> What is the difference between anonymization and pseudonymization?
</div>

YOUR ANSWER HERE

<div class="alert alert-info">
<b>Instruction:</b> What column needs to be anonymized in order to protect the identity of the people in the conversation? 
</div>

YOUR ANSWER HERE

To replace the names of the senders in the `sender` column, we can use the `.replace()` method of that column (a `pandas.Series` object). This method can take a dictionary as its argument, with:
- The values to be replaced as the dictionary's keys
- The new values as the dictionary's values

Here is an example:
```python
mapping : dict = {
    'ROMEO': 'Person 1',
    'JULIET': 'Person 2'
}

df['sender'] = df['sender'].replace(mapping)
```



In [None]:
# Define the anonymization mapping below:
mapping : dict = {} # TODO : Fill in the mapping

df["sender"] = df["sender"].replace(mapping)
df.head()

<div class="alert alert-info">
<b>Instruction:</b> Is renaming the values of the `sender` column enough to ensure the anonymity of the data? What else should be done?
</div>

YOUR ANSWER HERE

---

## 2. Extracting features from the conversation

Now that we have a pseudonymized conversation (still personal and sensitive data!), we can start extracting features for our (creative) visualization.

<div class="alert alert-info">
<b>Brainstorming:</b> Think about 6 features that could be extracted from a conversation. According to you, which of those features require machine learning to be extracted? Which of those features can be extracted with explicit programming?
</div>

YOUR ANSWER HERE

Many different features can be implemented. In this notebook, we will implement only a few of them, including:
- `char_count` : the length of each message in characters
- `question_count`: the number of questions in a message
- `sentiment_score`: positivity or negativity score of a message (between -1 and 1)

For WhatsApp conversations, we can also implement:
- `time_diff_seconds` (whatsapp only): the time difference in seconds from the previous message
- `media_sent` (whatsapp only): the presence of media (audio, file, document, photo, sticker, video) in a message
- `emoji_count` (whatsapp only): the number of emojis in a message

Furthermore, we will also perform a topic modeling analysis on the conversation using the `BERTopic` library.

### 2.2 `char_count` : the length of each message

As you have seen, each row of the `pandas.DataFrame` object represents a message. We can call the `.apply()` method on a column of the `pandas.DataFrame` object to apply a function to each element of that column.

The `.apply()` method takes a **function** as its argument and calls it on each element of the column. Here, we store the results in a new column called `char_count`. Look carefully at the example below:

In [None]:
df["char_count"] = df["message"].apply(len)
df.head()
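
The same pattern works with any function you define yourself. The sketch below uses a small made-up DataFrame and a hypothetical `word_count` helper (neither is part of `utils.py`) to show `.apply()` calling a custom function once per message.

```python
import pandas as pd

# Toy data standing in for the real conversation (made up for illustration)
toy_df = pd.DataFrame({
    "sender": ["ROMEO", "JULIET"],
    "message": ["But, soft! what light", "O Romeo, Romeo!"],
})

def word_count(message: str) -> int:
    """Return the number of whitespace-separated words in a message."""
    return len(message.split())

# .apply() calls word_count once per message and collects the results
toy_df["word_count"] = toy_df["message"].apply(word_count)
print(toy_df)
```

Note that the function is passed by name, without parentheses: `.apply(word_count)`, not `.apply(word_count())`.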

### 2.3 `question_count` : the number of questions in each message

<div class="alert alert-info">
<b>Instruction:</b> Define a function that takes a string as input and returns the number of question marks in the string.
</div>

In [None]:
def question_count(message : str ) -> int:
    # YOUR CODE HERE
    raise NotImplementedError()

<div class="alert alert-info">
<b>Instruction:</b> Apply your function `question_count` to every message in the `message` column of the `pandas.DataFrame` object.
</div>

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### 2.4 `sentiment_score` : the positivity or negativity score of a message (machine learning)



Extracting sentiment from text is a common task in Natural Language Processing (NLP) and can be considered a sub-task of text classification.

Our goal will be to classify text as positive, negative, or neutral. The values of the `sentiment_score` feature will range from -1 (negative) through 0 (neutral) to 1 (positive).

Many pre-trained models for this task are available on the Hugging Face Hub platform introduced on Day 5. The model below is somewhat limited but supports multiple languages:

> [lxyuan/distilbert-base-multilingual-cased-sentiments-student](https://huggingface.co/lxyuan/distilbert-base-multilingual-cased-sentiments-student)

Alternatively, this model seems to perform better, but with English text only:

> [cardiffnlp/twitter-roberta-base-sentiment-latest](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest)

<div class="alert alert-info">
<b>Instruction:</b> Read the documentation of the model of interest and load the model in a pipeline.
</div>

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

<div class="alert alert-info">
<b>Instruction:</b> Implement a function `sentiment_score` that takes a string (message) and a pipeline object (model) as inputs, and returns the output of the model for that message.

Calling a sentiment analysis pipeline on a single string usually returns a list containing one dictionary with two keys, `label` (a string) and `score` (a float). Your function should return that dictionary:
    {
        'label': 'POSITIVE',
        'score': 0.99
    }
</div>

In [None]:
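from transformers import Pipeline  # Base class of pipeline objects, used here as a type hint

def sentiment_score(message : str, pipeline : Pipeline) -> dict:
    # YOUR CODE HERE
    raise NotImplementedError()

# `sentiment_pipeline` is the pipeline you loaded in the previous cell
message = df["message"][10]
print(message)
sentiment_score(message, sentiment_pipeline)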

To simplify the encoding, we provide the function `label2score` below, which converts the label and its confidence into a single signed score:

In [None]:
def label2score(output : dict) -> float:
    if output["label"].lower() == "negative":
        return -output["score"]
    elif output["label"].lower() == "positive":
        return output["score"]
    else:
        return 0

label2score(sentiment_score(message, sentiment_pipeline))
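
To see what the conversion does, the sketch below runs `label2score` on a few hand-written dictionaries in the format described above. These are illustrative values, not real model predictions, and the function is repeated only so that the example runs on its own.

```python
def label2score(output: dict) -> float:
    # Same logic as the cell above, repeated so this example is self-contained
    if output["label"].lower() == "negative":
        return -output["score"]
    elif output["label"].lower() == "positive":
        return output["score"]
    else:
        return 0

# Hand-written outputs (not real predictions)
examples = [
    {"label": "positive", "score": 0.92},
    {"label": "NEGATIVE", "score": 0.75},
    {"label": "neutral", "score": 0.60},
]

scores = [label2score(output) for output in examples]
print(scores)  # [0.92, -0.75, 0]
```

Because of `.lower()`, the conversion works whether a model writes its labels in upper or lower case. Note also that the confidence of a neutral prediction is discarded: every neutral message receives exactly 0.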

Let's now apply the functions `sentiment_score` and `label2score` to the `message` column of the `pandas.DataFrame` object, and store the result in a new column `sentiment_score`.

So that a single failing message does not stop the whole run, we will use a `for` loop with `try` and `except` statements to catch errors that might occur during the process. Messages that cause an error receive a missing value (`None`) instead of a score.

In [None]:
sentiment_scores = []
n_messages = len(df["message"])
for i, message in enumerate(df["message"], start=1):
    # Progress indicator, overwritten in place thanks to end="\r"
    print(f"Processing message {i}/{n_messages}", end="\r")
    try:
        sentiment_scores.append(label2score(sentiment_score(message, sentiment_pipeline)))
    except RuntimeError:
        # The model failed on this message: store a missing value instead
        sentiment_scores.append(None)

df["sentiment_score"] = sentiment_scores


### 2.5 `time_diff_seconds` : the time difference in seconds from the previous message

Let's now implement the `time_diff_seconds` feature. This feature is specific to Whatsapp conversations, as it requires the `date_time` column to be in a specific format.

In [None]:
# Skip the computation for the Romeo and Juliet data, where `date_time` is only an integer index
if df["date_time"].dtype != "int64":
    df["time_diff_seconds"] = df["date_time"].diff().dt.total_seconds()
df.head(10)
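
The sketch below shows what `.diff()` and `.dt.total_seconds()` produce on three made-up timestamps. It assumes the `date_time` column has already been converted to datetimes, here with `pd.to_datetime`.

```python
import pandas as pd

# Made-up timestamps for illustration
toy_df = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2024-06-13 10:00:00",
        "2024-06-13 10:00:30",
        "2024-06-13 10:05:30",
    ]),
    "sender": ["Person 1", "Person 2", "Person 1"],
})

# .diff() subtracts the previous row from each row.
# The first row has no predecessor, so its value is missing (NaN).
toy_df["time_diff_seconds"] = toy_df["date_time"].diff().dt.total_seconds()
print(toy_df)
```

The second message arrives 30 seconds after the first, and the third 300 seconds after the second. Keep the missing first value in mind when you compute averages or draw plots.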

### 2.6 `media_sent` : the presence of media in a message

The `utils.py` file also contains a pre-programmed function that detects the presence of media in a message.

In [None]:
from utils import check_media_sent

df["media_sent"] = df["message"].apply(check_media_sent)
df.head(10)

### 2.7 `emoji_count` : the number of emojis in a message



In [None]:
from utils import count_emojis

df["emoji_count"] = df["message"].apply(count_emojis)
df.head(10)
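
If you are curious how emojis can be counted, here is a rough sketch using a regular expression. It is not the implementation of `count_emojis` in `utils.py`; the function name `count_emojis_sketch` and the chosen Unicode ranges are assumptions that cover many common emojis but not all of them (flags, for example, are missing).

```python
import re

# Rough approximation: two Unicode ranges that contain many common emojis
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport and newer symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]"
)

def count_emojis_sketch(message: str) -> int:
    """Count the characters of a message that fall into the ranges above."""
    return len(EMOJI_PATTERN.findall(message))

# The escape sequences below stand for a heart-eyes face and a crescent moon
print(count_emojis_sketch("See you tonight \U0001F60D\U0001F319"))  # 2
print(count_emojis_sketch("No emojis here!"))  # 0
```

This approach counts individual characters, so an emoji built from several characters (such as a family emoji) is counted more than once.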

### 2.8 Final anonymization and export

Now that we have extracted all the features we wanted, we can finalize the anonymization by **deleting** the messages and **anonymizing** the `sender` column.

In [None]:
# Deletion of the message column
df_anonymized = df.drop(columns=["message"])

# Anonymization of the senders' names
# TODO : Fill in the mapping to replace senders' names
mapping = {
    "ROMEO": "Person 1",
    "JULIET": "Person 2",
    "Nurse": "Person 3"
}
# Note: this raises a KeyError if a sender is missing from the mapping
df_anonymized["sender"] = df["sender"].apply(lambda x: mapping[x])

# Save the anonymized data frame to a csv file
df_anonymized.to_csv("data/anonymized_conversation_features.csv", index=False)
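
Before sharing an exported file, it is worth checking what it actually contains. The sketch below illustrates two possible checks on a made-up DataFrame, writing to an in-memory buffer instead of a real file. The column values and the set of allowed pseudonyms are assumptions to adapt to your own data.

```python
import io
import pandas as pd

# Toy stand-in for `df_anonymized` (made-up values)
toy_anonymized = pd.DataFrame({
    "date_time": [0, 1],
    "sender": ["Person 1", "Person 2"],
    "char_count": [21, 15],
})

# Write to an in-memory buffer instead of a real file, then read it back
buffer = io.StringIO()
toy_anonymized.to_csv(buffer, index=False)
buffer.seek(0)
exported = pd.read_csv(buffer)

# Checks that should hold before sharing anything
no_message_column = "message" not in exported.columns
only_pseudonyms = set(exported["sender"]) <= {"Person 1", "Person 2"}
print(no_message_column, only_pseudonyms)
```

To check your real export, replace the buffer with `pd.read_csv("data/anonymized_conversation_features.csv")`.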

Now that we have obtained numerical features from the conversation data, let's have a look at other analysis techniques.

## 3. Most common words in the conversation

Another popular text analysis method is to look for the most common words. You may have seen a word cloud at some point: that is how this kind of analysis is usually visualized. You can try one out here: [https://monkeylearn.com/word-cloud](https://monkeylearn.com/word-cloud). Try putting all your messages in there to see what it's like.

We've written a function for you to get the most common words. If you want to get data for a specific time period (or a specific part of a scene), you'll have to filter the DataFrame before calling the function. Below is the function and an example of running it on an entire corpus:

In [None]:
from collections import Counter
import re
import string
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')

def common_words(df, n_words=10, languages=["english", "german"]):
    words = " ".join(df["message"]).lower()
    words = re.sub(f"[{re.escape(string.punctuation)}]", "", words)
    words = words.split()

    all_stop_words = set()
    for lang in languages:
        all_stop_words.update(set(stopwords.words(lang)))
    
    words = [word for word in words if word not in all_stop_words]

    return Counter(words).most_common(n_words)

In [None]:
# Store the result under a different name so that the function `common_words` is not overwritten
top_words = common_words(df, n_words=10)
top_words
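
As mentioned above, you need to filter the DataFrame first if you want the most common words of one sender or one part of the conversation. The sketch below shows two ways to filter, using a made-up DataFrame with the same columns as `df`.

```python
import pandas as pd

# Made-up data with the same columns as `df`
toy_df = pd.DataFrame({
    "date_time": [0, 1, 2, 3],
    "sender": ["ROMEO", "JULIET", "ROMEO", "JULIET"],
    "message": ["He jests at scars", "Ay me!", "She speaks", "O Romeo, Romeo!"],
})

# Keep only the messages of one sender
juliet_only = toy_df[toy_df["sender"] == "JULIET"]

# Keep only the first part of the scene (message order 0 to 1, both included)
first_part = toy_df[toy_df["date_time"].between(0, 1)]

print(juliet_only)
print(first_part)
```

A filtered DataFrame such as `juliet_only` can then be passed to the function in place of `df`. For a WhatsApp conversation, `date_time` holds real timestamps, so the bounds given to `.between()` would be dates instead of integers.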

---

## 4. Topic Modeling using BERTopic

The goal of topic modeling is to use unsupervised ML to identify clusters of similar words within a corpus. This can be useful if you don't yet know what to look into for your visualization.

BERTopic is quite popular, and it lets us visualize the results quickly. Below is the code for running it on a WhatsApp conversation.

In [None]:
from bertopic import BERTopic

docs = df["message"].tolist()
topic_model = BERTopic(language="multilingual")
topics, probs = topic_model.fit_transform(docs)

We can then view the results and see how many topics we have. Note that Topic `-1` includes all the things that didn't fit in the other topics.

In [None]:
topic_model.get_topic_info()

A table isn't the best way to understand the topics. 

We can also visualize the topics overall using `.visualize_topics()`

In [None]:
topic_model.visualize_topics()

And we can also see what words appear in which topics. This gives us an idea about what each topic relates to.

In [None]:
topic_model.visualize_barchart(top_n_topics=10)

There are more ways to customize the visualizations and the modeling to get better results. Check out the GitHub page: https://github.com/MaartenGr/BERTopic