# WhatsApp group sentiment analysis
In this work I use a sentiment model to determine the polarity of the messages in a WhatsApp group.

Along the way I analyse different aspects of the group's dynamics, highlighting the activity of the different users, the types of messages, and the topics discussed with their corresponding polarity.

This is my final project for the data analytics bootcamp. I chose it because I believe that the spontaneity and volume of the messages make them an extremely valuable resource for understanding people's opinions on relevant topics.

# 1. Import basic libraries & read the dataset

In [None]:
import pandas as pd
import re
from datetime import timedelta
from datetime import datetime

In [None]:
# Specify the path to your text file
file_path = 'WhatsApp.txt'

# Initialize lists to store parsed information
dates = []
times = []
senders = []
messages = []

# Define a regular expression pattern to extract information
pattern = re.compile(r'(\d+/\d+/\d+ \d+:\d+) - ([^:]+): (.+)')

# Open the file in read mode
with open(file_path, 'r', encoding='utf-8') as file:
    # Iterate through each line in the file
    for line in file:
        # Use the regular expression to match and extract information
        match = pattern.match(line.strip())
        if match:
            # Extract date, time, sender, and message
            datetime_str, sender, message = match.groups()

            # Convert date and time to a datetime object
            datetime_obj = datetime.strptime(datetime_str, '%d/%m/%y %H:%M')

            # Append information to the respective lists
            dates.append(datetime_obj.date())
            times.append(datetime_obj.time())
            senders.append(sender)
            messages.append(message)

# Create a DataFrame for easier analysis
df = pd.DataFrame({
    'Date': dates,
    'Time': times,
    'Sender': senders,
    'Message': messages
})
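
The loop above keeps only the lines that start with a timestamp, so the continuation lines of a multi-line message are skipped. As a minimal sketch of one way to keep them (the function name `parse_chat_lines` and the sample lines are hypothetical, not part of the original notebook), unmatched lines can be appended to the previous message:

```python
import re
from datetime import datetime

# Same pattern as in the cell above
pattern = re.compile(r'(\d+/\d+/\d+ \d+:\d+) - ([^:]+): (.+)')

def parse_chat_lines(lines):
    """Parse exported chat lines; unmatched lines are appended to the previous message."""
    records = []
    for raw in lines:
        line = raw.strip()
        match = pattern.match(line)
        if match:
            datetime_str, sender, message = match.groups()
            stamp = datetime.strptime(datetime_str, '%d/%m/%y %H:%M')
            records.append({'Date': stamp.date(), 'Time': stamp.time(),
                            'Sender': sender, 'Message': message})
        elif records and line:
            # No timestamp: treat the line as a continuation of the last message
            records[-1]['Message'] += ' ' + line
    return records

# Hypothetical sample lines in the export format
sample_lines = [
    '12/03/23 21:45 - Ana: Hola a todos',
    'segunda línea del mismo mensaje',
    '12/03/23 21:46 - Luis: Buenas!',
]
parsed = parse_chat_lines(sample_lines)
print(len(parsed), parsed[0]['Message'])
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(parsed)`.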

In [None]:
# Read the nickname mapping from CSV into a DataFrame
nickname_mapping = pd.read_csv('sender_nickname.csv')

# Create a mapping dictionary from 'Full Name' to 'Nickname'
name_mapping_dict = dict(zip(nickname_mapping['Full Name'], nickname_mapping['Nickname']))

# Replace values in the 'Sender' column of the original DataFrame (df) using the mapping
df['Sender'] = df['Sender'].replace(name_mapping_dict)

In [None]:
# Display the DataFrame
df.head()

# 2. Data Preparation and Cleaning

## Understanding the data
Before we start cleaning the dataset, we need to understand the business context and the data structure:
* the dataset has 4 columns (Date, Time, Sender and Message)
* it contains 39972 messages

In [None]:
# Display the dataframe shape
df.shape

In [None]:
# Display the dataframe columns
df.columns

## Cleaning messages 
This step is critical in the sentiment analysis process: we need to make sure that the words and messages make sense to the model, so that they reach the sentiment analysis with meaning and do not cause confusion.
To achieve this, we need to clean the messages of jargon, including all kinds of onomatopoeias, emojis and multimedia notes that are so common in WhatsApp communication.
Stop words will be removed later. They are kept for now because the multilingual BERT model needs them to interpret the messages.

In [None]:
# Create column to categorize MessageCount
msg_counts = df['Message'].value_counts().reset_index()
msg_counts.columns = ['Message','repetition']
msg_counts['repetition_cat'] = pd.cut(msg_counts['repetition'], bins=[0, 1, 2, 3, float('inf')],
                                       labels=['1 msg', '2 msg', '3 msg', '>3 msg'])

# Function to count words in a given text
def count_words(text):
    """
    Count the number of words in a given text.
    Parameters:
        text (str): The input text.
    Returns:
        int: The number of words in the text.
    """
    if isinstance(text, str):
        return len(text.split())
    else:
        return 0

# Apply the function to the 'Message' column and create a new 'word_count' column
msg_counts['word_count'] = msg_counts['Message'].apply(count_words)

msg_counts['word_count_cat'] = pd.cut(
    msg_counts['word_count'],
    bins=[0, 4, 10, float('inf')],  # right-closed bins: (0, 4], (4, 10], (10, inf)
    labels=['1-4 words', '5-10 words', '>10 words']
)

msg_counts['word_count_group'] = pd.cut(
    msg_counts['word_count'],
    bins=[0, 10, float('inf')],
    labels=['short msg', 'regular msg']
)
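
`pd.cut` builds right-closed intervals by default, so each upper edge belongs to the lower category and a word count of 0 falls outside every bin. A small sketch with made-up word counts shows where the boundaries land when the edges are 4 and 10:

```python
import pandas as pd

# Hypothetical word counts chosen to sit on the bin edges
word_counts = pd.Series([1, 4, 5, 10, 11])

# Bins are right-closed by default: (0, 4], (4, 10], (10, inf)
categories = pd.cut(
    word_counts,
    bins=[0, 4, 10, float('inf')],
    labels=['1-4 words', '5-10 words', '>10 words'],
)
print(categories.tolist())
```

Checking the edges this way is a quick test that the labels describe the intervals they are attached to.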


In [None]:
# Create a pivot table with totals
result_pivot = pd.pivot_table(
    msg_counts,
    values='repetition',
    index='repetition_cat',
    columns=['word_count_group'],
    aggfunc='sum',
    fill_value=0,
    margins=True,
    margins_name='Total'
)

result_pivot_filtered = result_pivot.loc[:, (result_pivot != 0).any(axis=0)]

# Display the filtered result_pivot
result_pivot_filtered

In [None]:
# Create a pivot table with totals
result_pivot = pd.pivot_table(
    msg_counts,
    values='repetition',
    index='repetition_cat',
    columns=['word_count_cat'],
    aggfunc='sum',
    fill_value=0,
    margins=True,
    margins_name='Total'
)

result_pivot_filtered = result_pivot.loc[:, (result_pivot != 0).any(axis=0)]

# Display the filtered result_pivot
result_pivot_filtered

## Cleaning Messages | Short messages of 1 to 4 words

In [None]:
msg_counts[
    (msg_counts['word_count_cat']=='1-4 words') & 
    (msg_counts['repetition_cat'] !='1 msg')
].head(15)

Short messages have particular characteristics. They appear very often because they are brief, and it is important to understand their nature: most are short answers with no meaning of their own, some carry a double meaning (irony), and others use signs or forms that the model cannot interpret. The main cases are:
* notes for multimedia messages
* notes for deleted messages
* slang or jargon
* onomatopoeias written with letters and marks (Jajajaja, Jajaj, etc.)
* words with double meanings
* emojis, which have to be analysed separately
* null values

## Cleaning Messages | Replace notes

In [None]:
def clean_msg_to_replace(text, elements_to_replace):
    if isinstance(text, str):
        for element in elements_to_replace:
            text = text.replace(element, '')
        return text.strip().lower()
    else:
        return text
    
message_to_replace = [
    ('<Multimedia omitido>'),
    ('Se eliminó este mensaje.'),
    ('ubicación en tiempo real compartida'),
    ('Eliminaste este mensaje.'),
    ('null'),
    ('\n'),
]

# Apply the clean_msg_to_replace function to the 'Message' column
df['clean_msg'] = df['Message'].apply(lambda x: clean_msg_to_replace(x, message_to_replace))

In [None]:
df['clean_msg'].value_counts().head(15)

## Cleaning Messages | Replace emoticons and onomatopoeias

In [None]:
def apply_regex_patterns(text, regex_patterns):
    if isinstance(text, str):
        for pattern, replacement in regex_patterns:
            text = re.sub(pattern, replacement, text)
        return text
    else:
        return text
    
# List of regex patterns and replacements
words_to_replace = [
    (r'[^\w\s]|_', ''),          # Remove everything that is not a letter, digit or whitespace (punctuation, emojis, underscores)
    (r'http\S+', ''),            # Remove URLs
    (r'[0-9]+',''),               # Remove numbers
    (r'\s+', ' '),               # Replace multiple whitespaces with a single space
    (r'^\d+$',''),               # Replace strings consisting entirely of digits
    (r'([a-zA-Z])\1\1', '\\1'),  # Collapse three identical consecutive letters into a single letter
]

# List of regex patterns and replacements
other_words_to_replace = [
    (r'\!{2,}', '!'),            # Replace repetition of !!! with !
    (r'\!', ''),                 # Remove !
    (r'\?{2,}', '?'),            # Replace repetition of ??? with ?
    (r'\.{2,}', ''),             # Remove repetition of ...
    (r'^\?\s*$', ''),            # Remove ? to treat it as emoji
    (r'^\!\s*$', ''),            # Remove ! to treat it as emoji
    (r'^:\-\)$', ''),            # Remove :-) to treat it as emoji
    (r'^;\-\)$', ''),            # Remove ;-) to treat it as emoji
    (r'\b(?:j[aj]*a[aj]*j[aj]*|ja(?:j[aj]*a[aj]*)*|ja(?:j[ak]*a[aj]*)*|ja+)\b', ''),  # Remove repetition of ja = :)
    (r'\b(?:j[ej]*e[ej]*j[ej]*|je(?:j[ej]*e[ej]*)*)\b', ''),                          # Remove repetition of je = :)
    (r'\b(?:j[oj]*o[oj]*j[oj]*|jo(?:j[oj]*o[oj]*)*|jo(?:j[ok]*o[oj]*)*|jo+)\b', ''),  # Remove repetition of jo = :)
    (r'\b(?:j[uj]*u[uj]*j[uj]*|ju(?:j[uj]*u[uj]*)*|ju(?:j[uk]*u[uj]*)*|ju+)\b', ''),  # Remove repetition of ju = :)
    (r'\b(jiji|jijij)\b', ''),                                                        # Remove repetition of ji = :) 
]

# Apply the function to the 'clean_msg' column
df['clean_msg'] = df['clean_msg'].apply(lambda x: apply_regex_patterns(x, words_to_replace))
df['clean_msg'] = df['clean_msg'].apply(lambda x: apply_regex_patterns(x, other_words_to_replace))

In [None]:
df['clean_msg'].value_counts().head(15)
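
The patterns are applied in list order, so each one sees the output of the previous one. A minimal sketch on a made-up message, using only three of the patterns above, shows the effect step by step (the sample text and the name `sample_patterns` are illustrative assumptions):

```python
import re

def apply_regex_patterns(text, regex_patterns):
    """Apply (pattern, replacement) pairs in order; non-string values pass through."""
    if isinstance(text, str):
        for pattern, replacement in regex_patterns:
            text = re.sub(pattern, replacement, text)
    return text

# A reduced pattern list, in the same order as in the cell above
sample_patterns = [
    (r'[^\w\s]|_', ''),   # punctuation, emojis and underscores
    (r'[0-9]+', ''),      # numbers
    (r'\s+', ' '),        # runs of whitespace
]

sample = 'hola!!! 😂 nos vemos a las 10'
cleaned = apply_regex_patterns(sample, sample_patterns).strip()
print(cleaned)
```

Because punctuation is stripped by the first pattern, any later pattern that looks for `!`, `?` or `.` no longer finds anything to match.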

## Cleaning Messages | Replace short messages with double meaning
At this point I replace the short messages that have a double meaning with a phrase that states the real meaning of the message.

In [None]:
def apply_patterns(series, patterns):
    for pattern, replacement in patterns:
        series = series.str.replace(pattern, replacement, regex=True, flags=re.IGNORECASE)
    return series

word_replacements = pd.read_csv('word_replacements.csv')
word_replacements_list = list(zip(word_replacements['Pattern'], word_replacements['Replacement']))

# Empty CSV cells are read as NaN; treat them as empty strings
cleaned_replacements_list = [
    (rf'^\s*{"" if pd.isna(pattern) else pattern}\s*$', '' if pd.isna(replacement) else str(replacement))
    for pattern, replacement in word_replacements_list
]

df['clean_msg'] = apply_patterns(df['clean_msg'], cleaned_replacements_list)

In [None]:
df['clean_msg'].value_counts().head(15)
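
Wrapping each pattern in `^\s*` and `\s*$` means a replacement only fires when the pattern is the whole message, not when it appears inside a longer sentence. A small sketch with a hypothetical replacement pair (the real pairs live in `word_replacements.csv`, which is not shown here):

```python
import re
import pandas as pd

# Hypothetical (pattern, replacement) pair standing in for word_replacements.csv
toy_replacements = [('dale', 'estoy de acuerdo')]

# Anchor each pattern so that it only matches when it is the whole message
anchored = [(rf'^\s*{pattern}\s*$', replacement) for pattern, replacement in toy_replacements]

toy_messages = pd.Series(['Dale', 'dale que voy'])
for pattern, replacement in anchored:
    toy_messages = toy_messages.str.replace(pattern, replacement, regex=True, flags=re.IGNORECASE)
print(toy_messages.tolist())
```

Only the first message is rewritten; the second one keeps its original wording because the pattern is just a part of it.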

## Cleaning Messages | Replace other words and marks 

In [None]:
def apply_replacements(df, patterns):
    for pattern, replacement in patterns:
        df['clean_msg'] = df['clean_msg'].str.replace(pattern, replacement, regex=True, flags=re.IGNORECASE)

# Define your replacement patterns
replacement_patterns = [
    (r'^Bata\s+\w+$', '(risa)'),
    (r'^Basta\s+\w+$', '(risa)'),
    (r'^y( si)+$', 'confirmo que si'),
    (r'\b(gracias)+\b', 'gracias'),
    (r'\bgracias+\b', 'gracias'),
    (r'^Ok\.?\s*$', 'estoy de acuerdo'),
    (r'^se+\s*$', 'estoy de acuerdo'),
    (r'^Nah+\s*$', 'me sorprende'),
    (r'^No+\s*$', 'me sorprende'),
    (r'^\s*a+migos\s*$', 'amigos'),
    (r'^\s*([a-zA-Z])\s*$', ''),
    (r'^@\d+$', ''),
    (r'^\s+$', ''),
    (r'^\s*', ''),
    (r'^\s*si\b(?: si)+\s*$', 'confirmo que si'),
    (r'^(no\s)+no$', 'confirmo que no'),
    (r'\bchiques\b', 'amigos'),
    (r'\bo estoy crazy macaya\b', ''),
    (r'\bno es joda\b', 'hablo en serio'),
    (r'\b(Perdon)+\b', 'perdón'),
    (r'\b(cumple+)+\b', 'cumpleaños'),    
    (r'\b(ojo+)+\b', ''),
    (r'\b(epa+)+\b', ''),
    (r'\b(apa+)+\b', ''),
]    

# Apply replacements using the function
apply_replacements(df, replacement_patterns)

In [None]:
df['clean_msg'].value_counts().head(15)

## Cleaning Messages | Short messages of 5 to 10 words

In [None]:
msg_counts[
    (msg_counts['word_count_cat']=='5-10 words') & 
    (msg_counts['repetition_cat'] !='1 msg')
].head(10)

Messages in this category (5 to 10 words) have a clear meaning and do not need modification. The repetitions occur because these are common ways of communicating. The use of multiple exclamation marks for emphasis is also observed.

## Cleaning Messages | Messages longer than 10 words
As we saw in the pivot table at the beginning, some messages of normal length are also repeated. Unlike the short messages, these are genuine duplicates, which we can confirm below.


In [None]:
msg_counts[
    (msg_counts['word_count_cat']=='>10 words') & 
    (msg_counts['repetition_cat'] =='2 msg')
].head(15)

These are normal-sized messages, so there is no reason for them to repeat by chance.
The repetition most likely happens when messages are forwarded. These duplicates should be removed.

## Cleaning Messages | Drop regular messages duplicates

In [None]:
df['word_count'] = df['Message'].apply(lambda x: count_words(x))

# Creating word_count_group column based on word_count
df['word_count_group'] = pd.cut(
    df['word_count'],
    bins=[0, 7, float('inf')],
    labels=['short msg', 'regular msg']
)

df[(df.duplicated(subset=['Message'], keep=False)) & (df['word_count_group'] == 'regular msg')]

In [None]:
# Keep the first occurrence of each regular message and drop the repeats
is_regular = df['word_count_group'] == 'regular msg'
is_repeat = df.duplicated(subset=['Message'], keep='first')
df = df[~(is_regular & is_repeat)].copy()
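
The idea of removing duplicates only among the longer messages, while leaving repeated short replies untouched, can be checked on a toy DataFrame. This is a sketch with made-up messages and the same 7-word threshold used above:

```python
import pandas as pd

# Made-up messages: a repeated short reply and a repeated long message
long_text = 'uno dos tres cuatro cinco seis siete ocho'
toy = pd.DataFrame({'Message': ['ok', 'ok', long_text, long_text]})

toy['word_count'] = toy['Message'].str.split().str.len()
is_regular = toy['word_count'] > 7                            # more than 7 words
is_repeat = toy.duplicated(subset=['Message'], keep='first')  # True for later copies

# Drop a row only when it is both a regular message and a repeat
deduplicated = toy[~(is_regular & is_repeat)]
print(deduplicated['Message'].tolist())
```

Both copies of the short reply survive, while the long message is kept once.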

In [None]:
regular_msg_condition = (df['word_count_group'] == 'regular msg') & df['Message'].str.contains(
    'Los nuevos rufianes kirchneristas: DANIEL VILA...|'
    'Totalmente negro. Sabias palabras. Una forma d...|'
    'Muchachos, debo dejar el grupo por problemas p...',
    case=False
)

df[['Date', 'Time', 'Sender', 'Message', 'clean_msg']][regular_msg_condition]

# 3. Feature engineering
At this stage we prepare the data for exploration:
* We separate the emojis from the messages in order to analyse the polarity of each one.
* We separate the date from the time in order to take advantage of both variables.
* Finally, we analyse the polarity of the messages.

## Message engineering | Splitting emojis from messages

In [None]:
Emoticon_to_replace = [
    (r'^\?\s*$', '❓'),          # Replace ? with emoji
    (r'^\!\s*$', '❗'),          # Replace ! with emoji
    (r'^:\-\)$', '😂'),          # Replace :-) with emoji
    (r'^;\-\)$', '😉'),          # Replace ;-) with emoji

    (r'\b(ojo+)+\b', '⚠️'),      # Replace comment with emoji
    (r'\b(epa+)+\b', '⚠️'),      # Replace comment with emoji
    (r'\b(apa+)+\b', '⚠️'),      # Replace comment with emoji

    (r'\b(?:j[aj]*a[aj]*j[aj]*|ja(?:j[aj]*a[aj]*)*|ja(?:j[ak]*a[aj]*)*|ja+)\b', '😂'),  # Replace jajaja with emoji
    (r'\b(?:j[ej]*e[ej]*j[ej]*|je(?:j[ej]*e[ej]*)*)\b', '😂'),                          # Replace jejeje with emoji
    (r'\b(?:j[oj]*o[oj]*j[oj]*|jo(?:j[oj]*o[oj]*)*|jo(?:j[ok]*o[oj]*)*|jo+)\b', '😂'),  # Replace jojojo with emoji
    (r'\b(?:j[uj]*u[uj]*j[uj]*|ju(?:j[uj]*u[uj]*)*|ju(?:j[uk]*u[uj]*)*|ju+)\b', '😂'),  # Replace jujuju with emoji
    (r'\b(jiji|jijij)\b', '😂'),                                                        # Replace jijiji with emoji
]

# Apply the function to the 'Message' column 
df['Message'] = df['Message'].apply(lambda x: apply_regex_patterns(x, Emoticon_to_replace)) 

In [None]:
import emoji

# Emoji extraction
df['emoji'] = df['Message'].apply(lambda x: ''.join(c for c in str(x) if c in emoji.EMOJI_DATA))

In [None]:
df['emoji'].value_counts().iloc[0:10]

In [None]:
# Remove ♂ and replace ⚠ with ⚠️ in the emoji column
df['emoji'] = df['emoji'].str.replace('♂', '', regex=False)
df['emoji'] = df['emoji'].str.replace('⚠', '⚠️', regex=False)
df['emoji'].value_counts().head(10)

## Message engineering | Dropping empty messages
So far I have only replaced wrong or meaningless values with ''. In this last step I filter out all the rows where the value is '' and delete them. Then I do the same with the null values.

In [None]:
# Count the rows where both columns (clean_msg and emoji) are empty
df[(df['clean_msg'] == '') & (df['emoji'] == '')].shape

In [None]:
# Create a mask for rows where both 'clean_msg' and 'emoji' are empty
mask = (df['clean_msg'] == '') & (df['emoji'] == '')

# Drop the rows that meet the condition
df.drop(df[mask].index, inplace=True)

In [None]:
# check null values
df.isnull().sum()

In [None]:
# drop rows with null values
df.dropna(subset=['clean_msg'], inplace=True)

In [None]:
df['clean_msg'].value_counts().head()

In [None]:
df['emoji'].value_counts().head(15)

In [None]:
# Replace remaining 1084 '' in column Clean_msg and 19295 in column Emoji
white_space_to_replace = [(r'^\s*$', '_')]   

# Apply the function to the 'emoji' and 'clean_msg' columns
df['emoji'] = df['emoji'].apply(lambda x: apply_regex_patterns(x, white_space_to_replace)) 
df['clean_msg'] = df['clean_msg'].apply(lambda x: apply_regex_patterns(x, white_space_to_replace))

## Date engineering | Date format and extraction of year, month and day of week
At this point, I modify the date format because I need to filter messages by date, as my plan is to analyse from 01/2022 onwards.

In [None]:
# Add a new column 'rDate' with the datetime values
df['rDate'] = pd.to_datetime(df['Date'])

# Extract month, year, and weekday information
df['Month'] = df['rDate'].dt.month
df['Year'] = df['rDate'].dt.year
df['Weekday'] = df['rDate'].dt.day_name()

# Handle missing values in 'Month'
df['Month'] = df['Month'].fillna(-1)  # Replace NaN with -1 or any suitable value
df['Month'] = df['Month'].astype(int)

# Handle missing values in 'Year' (if needed)
df['Year'] = df['Year'].fillna(-1)  # Replace NaN with -1 or any suitable value
df['Year'] = df['Year'].astype(int)

# Display the updated DataFrame
selected_columns = ['rDate', 'Month', 'Year', 'Weekday', 'Time', 'Sender', 'clean_msg', 'emoji', 'word_count_group']
df = df[selected_columns]
df.head()

## Time engineering | Split Time to get Daytime

In [None]:
# Work on a copy to avoid SettingWithCopyWarning after the column selection above
df = df.copy()

# 'Time' already holds datetime.time objects, so the hour can be read directly
df['hour'] = df['Time'].apply(lambda t: t.hour)

# Create indicator columns for morning (6-12), afternoon (13-19), and night (20-5)
df['Morning'] = ((df['hour'] >= 6) & (df['hour'] < 13)).astype(int)
df['Afternoon'] = ((df['hour'] >= 13) & (df['hour'] < 20)).astype(int)
df['Night'] = ((df['hour'] >= 20) | (df['hour'] < 6)).astype(int)

# Combine morning, afternoon, and night into a single column 'Time-Day'
def classify_time(row):
    if row['Morning'] == 1:
        return 'Morning'
    elif row['Afternoon'] == 1:
        return 'Afternoon'
    elif row['Night'] == 1:
        return 'Night'
    else:
        return None

df['Time-Day'] = df.apply(classify_time, axis=1)

# Display the updated DataFrame
selected_columns = ['rDate', 'Month', 'Year','Weekday','Time','hour','Time-Day','Morning','Afternoon','Night',
                    'Sender','clean_msg','emoji','word_count_group']
df = df[selected_columns]
df.head()
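
The three periods never overlap and together cover the whole day, so the same classification can also be written as a single function of the hour. A minimal sketch (the name `time_of_day` is hypothetical) that makes the boundaries easy to test:

```python
def time_of_day(hour):
    """Map an hour (0-23) to the same three periods used above."""
    if 6 <= hour < 13:
        return 'Morning'
    if 13 <= hour < 20:
        return 'Afternoon'
    return 'Night'

# Check the hours that sit on the boundaries
periods = [time_of_day(h) for h in (0, 6, 12, 13, 19, 20, 23)]
print(periods)
```

With a function like this, the column could be built in one step with `df['hour'].apply(time_of_day)`.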

## Sentiment analysis | Apply sentiment analysis to clean msg
In this step we run the sentiment analysis on the messages. First we filter the messages from 2023 to focus the analysis on a recent period (the dataset covers 2019-2023).
Then we import the necessary libraries and run the sentiment analysis.

In [None]:
data = df[df['Year']== 2023]
data.shape

In [None]:
# Save the filtered data (data_copy is only created in a later cell)
data.to_csv('data.csv', index=False)

In [None]:
# Import necessary libraries
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

In [None]:
# Import pretrained model
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

In [None]:
def get_sentiment_token(text):
    # Tokenize the text, truncating it to the model's 512-token limit
    tokens = tokenizer.encode(text, return_tensors='pt', truncation=True, max_length=512)

    # Forward pass through the model (no gradients are needed for inference)
    with torch.no_grad():
        output = model(tokens)

    # The model predicts a rating from 1 to 5 stars; argmax yields 0-4, so add 1
    return int(torch.argmax(output.logits)) + 1

In [None]:
import timeit
start_time = timeit.default_timer()

get_sentiment_token('no me gusta eso')

end_time = timeit.default_timer()
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")

In [None]:
get_sentiment_token('me gusta eso')

In [None]:
import timeit
start_time = timeit.default_timer()

# Create a copy of the DataFrame to avoid SettingWithCopyWarning
data_copy = data.copy()

# Apply sentiment analysis to each row of the 'clean_msg' column
data_copy['Sentiment_Polarity'] = data_copy['clean_msg'].apply(lambda x: get_sentiment_token(x))

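# Map the 1-5 star rating to labels; ratings of 2 and 4 are grouped with their
# nearest extreme so that no message is left without a label
data_copy['Sentiment_Label'] = data_copy['Sentiment_Polarity'].map(
    {1: 'Negative', 2: 'Negative', 3: 'Neutral', 4: 'Positive', 5: 'Positive'})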

# Save the labeled data to a CSV file
data_copy.to_csv('data_labeled.csv', index=False)

end_time = timeit.default_timer()
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
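
Running the model once per row is slow, and many chat messages are exact repeats of each other. One possible speed-up, shown here only as a sketch, is to memoise the scoring function so that each distinct text is scored once. `fake_sentiment` is a placeholder invented for this example; in the notebook its role would be played by `get_sentiment_token`.

```python
from functools import lru_cache

call_log = []

def fake_sentiment(text):
    """Placeholder scorer (an assumption); the real call would be get_sentiment_token."""
    call_log.append(text)
    return 5 if 'gracias' in text else 3

@lru_cache(maxsize=None)
def cached_sentiment(text):
    # Each distinct message is scored once; repeats are served from the cache
    return fake_sentiment(text)

messages = ['gracias', 'nos vemos', 'gracias', 'gracias']
scores = [cached_sentiment(m) for m in messages]
print(scores, len(call_log))
```

The gain depends on how many repeated messages the chat contains, so it is worth measuring with the same timer used above.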

## Emoji | Apply emosent analysis to emoji

In [None]:
# Import the necessary libraries
import pandas as pd
import re
from datetime import timedelta
from datetime import datetime

data_labeled = pd.read_csv('data_labeled.csv')
data_labeled.shape

In [None]:
from emosent import get_emoji_sentiment_rank

def emosent_score(emojis):
    score, count = 0, 0
    for e in set(emojis):
        try:
            score += get_emoji_sentiment_rank(e)['sentiment_score']
            count += 1
        except Exception:
            # Skip characters that emosent has no ranking for (for example the '_' placeholder)
            continue

    # Average the sentiment score over the distinct emojis that were found
    return score / count if count != 0 else score

# Apply sentiment analysis to each row of the 'emoji' column
data_labeled['Emosent_Polarity'] = data_labeled['emoji'].apply(lambda x: emosent_score(x))
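
The score of a message is the mean over its distinct emojis, so repeating the same emoji does not change the result. The sketch below reproduces that logic with a small lookup table instead of `emosent`; the scores in `toy_scores` are made up for illustration and are not the library's values.

```python
# Made-up scores for illustration only; the real values come from emosent
toy_scores = {'😂': 0.25, '👍': 0.75}

def mean_emoji_score(emojis, scores):
    """Average the scores of the distinct emojis that have a known score."""
    known = [scores[e] for e in set(emojis) if e in scores]
    return sum(known) / len(known) if known else 0

print(mean_emoji_score('😂😂👍', toy_scores))
print(mean_emoji_score('_', toy_scores))
```

A message with no scored emoji gets 0, which the label mapping in the next cell treats as Neutral.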

In [None]:
# Map sentiment polarity to labels
def map_emotion_label(polarity):
    if polarity > 0:
        return 'Positive'
    elif polarity < 0:
        return 'Negative'
    else:
        return 'Neutral'

# Apply the function to create the 'Emosent_Label' column
data_labeled['Emosent_Label'] = data_labeled['Emosent_Polarity'].apply(map_emotion_label)

In [None]:
data_labeled['is_msg'] = data_labeled['clean_msg'].apply(lambda x: '-' if pd.isna(x) or x == '' else 'msg')
data_labeled['clean_msg'] = data_labeled['clean_msg'].fillna('-')

data_labeled['is_emoji'] = data_labeled['emoji'].apply(lambda x: '-' if pd.isna(x) or x == '' else 'emoji')
data_labeled['emoji'] = data_labeled['emoji'].fillna('-')

In [None]:
data_labeled.shape

# 4. Let's start with Exploratory Data Analysis (EDA)

In [None]:
import plotly.graph_objs as go

## Most common words | Word-cloud

### Word-cloud | Remove Stop-words 
At this point I make sure the words have the correct sentiment polarity, and I remove the stop words so that the most used words can be seen.

In [None]:
# Remove stop-words (a set makes the membership test fast)
data_filtered = data_labeled
stopwords_list = set(pd.read_csv('stopwords.csv', encoding='ISO-8859-1')['words'].tolist())
data_filtered['clean_words'] = data_filtered['clean_msg'].apply(lambda x: ' '.join([word for word in str(x).split() if word.lower() not in stopwords_list]))
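
Stop-word removal is a word-by-word filter: the message is split on whitespace, each word is lowercased for the comparison, and the survivors are joined back together. A minimal sketch with a hypothetical stop-word set (the notebook loads its own list from `stopwords.csv`):

```python
# Hypothetical stop-words; the notebook loads its list from stopwords.csv
toy_stopwords = {'de', 'la', 'que', 'el'}

def remove_stopwords(text, stopwords):
    """Drop every word whose lowercase form is in the stop-word set."""
    return ' '.join(word for word in str(text).split() if word.lower() not in stopwords)

print(remove_stopwords('La fiesta de ayer estuvo genial', toy_stopwords))
```

The comparison is case-insensitive, but the words that remain keep their original casing.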

### Word-cloud | Remove remaining Stop-words manually
Some words not included in the stop-words file are still in the list of positive words. I remove them to leave only nouns, adjectives and other words that really represent the positive meaning of the messages.

In [None]:
# Extra stop-words that are missing from the stop-words file
extra_stopwords = ['o', 'ayer', 'mucha', 'muchas', 'mucho', 'ahi', 'esos', 'otro', 'hoy', 'nada',
                   'está', 'estás', 'ahora', 'esto', 'tanto', 'cómo', 'ver', 'x', 'dice', 'creo']
extra_pattern = r'\b(?:' + '|'.join(extra_stopwords) + r')\b'

# The word clouds are built from 'clean_words', so the removal is applied to that column
data_filtered['clean_words'] = data_filtered['clean_words'].str.replace(extra_pattern, '', regex=True)
data_filtered['clean_words'] = data_filtered['clean_words'].str.replace(r'\b(cumpleee)+\b', 'cumple', regex=True)

In [None]:
import wordcloud
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud

In [None]:
def show_wordcloud(data, title = None):
    wordcloud = WordCloud(
        background_color = 'white',
        max_words = 200,
        max_font_size = 40, 
        scale = 3,
        random_state = 42
    ).generate(str(data))
    #).generate(data)

    fig = plt.figure(1, figsize = (20, 20))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize = 20)
        fig.subplots_adjust(top = 2.3)

    plt.imshow(wordcloud)
    plt.show()

In [None]:
Positive = data_filtered[data_filtered["Sentiment_Label"] == 'Positive']
# Join all the positive messages into a single string
concatenated_message = ' '.join(Positive['clean_words']).strip()
show_wordcloud(concatenated_message)

In [None]:
Negative = data_filtered[data_filtered["Sentiment_Label"] == 'Negative']
# Join all the negative messages into a single string
Negative_concatenated = ' '.join(Negative['clean_words']).strip()
show_wordcloud(Negative_concatenated)

## Conversation stats

In [None]:
selected_columns = ['rDate', 'Month', 'Year', 'Weekday', 'Time', 'Time-Day', 'Morning',
       'Afternoon', 'Night', 'Sender', 'clean_msg', 'emoji','word_count_group',
       'Sentiment_Label', 'Emosent_Label', 'is_msg', 'is_emoji','clean_words']

chat = data_labeled[selected_columns]

# Rename the columns to the names used in the rest of the analysis
chat = chat.rename(columns={
    'rDate': 'date',
    'Time': 'hour',
    'Sentiment_Label': 'sentiment',
    'Sender': 'username',
    'clean_msg': 'message',
    'Emosent_Label': 'emosent',
    'word_count_group': 'msg_categ',
})

chat.columns

In [None]:
chat.shape

In [None]:
chat.head()

## Use of emojis
The results show that 😂 Face with Tears of Joy was the most commonly used emoji, followed by 👍 Thumbs Up, 🙏 Folded Hands, and 🥰 Smiling Face with Hearts. According to Emojipedia, these emojis convey positive emotions, which suggests that the chat messages with emojis tended to be more positive in tone.

Next, I count the messages sent by each member and list the emojis each one used.

In [None]:
chat.groupby('username').agg({'message': 'count',
                              'emoji': lambda x: ' '.join(set(emoji for emojis in x.dropna() for emoji in emojis))
                              }).sort_values(by='message', ascending=False)

In [None]:
from collections import Counter
import pandas as pd

emoji_counter = Counter()

# Iterate over each message in the 'emoji' column
for message in chat['emoji']:
    # Check if the message is not NaN and is a string
    if not pd.isna(message) and isinstance(message, str):
        # Exclude "_" emoji and update the counter
        emoji_counter.update(emoji for emoji in message if emoji != "_")

# Create a DataFrame from the Counter
emoji_df = pd.DataFrame(emoji_counter.most_common(), columns=['emoji', 'count'], index=range(1, len(emoji_counter) + 1))

chat['emoji'] = chat['emoji'].replace({'🏻':'🤷'})
chat['emoji'] = chat['emoji'].replace({'⚠':'⚠️'})

# Display the top 15 emojis
emoji_df.head(15)
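
The counting step treats each cell of the `emoji` column as a sequence of characters and skips the `_` placeholder that marks a message without emojis. A small sketch on made-up cells shows the result:

```python
from collections import Counter

# Hypothetical values of the 'emoji' column; '_' marks a message without emojis
toy_emoji_column = ['😂😂', '_', '👍', '😂👍']

emoji_counter = Counter()
for cell in toy_emoji_column:
    # Count every character except the placeholder
    emoji_counter.update(ch for ch in cell if ch != '_')

print(emoji_counter.most_common(2))
```

Because the count is per character, an emoji made of several code points (for example one with a skin-tone modifier) is counted as separate pieces.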

In [None]:
chat['is_emoji_empty'] = chat['emoji'].apply(lambda x: 0 if pd.isna(x) or x == '_' else 1)
grouped_chat = chat.groupby('is_emoji_empty').size().reset_index(name='count')

chat['is_emoji'] = chat['emoji'].apply(lambda x: True if x != '_' else False)
grouped_chat = chat.groupby('is_emoji').size().reset_index(name='count')

In [None]:
import plotly.graph_objects as go

# Create a pie chart using Plotly
fig = go.Figure(data=go.Pie(
    labels=['Chats without emoji', 'Chats with emoji'],
    values=grouped_chat['count'],
    hole=0.4,marker=dict(colors=['#25D366', '#075E54']),
    title=dict(text='<b>Overall</b>', font=dict(size=16))))

fig.update_traces(hoverinfo='label+value')

In [None]:
fig = go.Figure(data=go.Pie(labels=['Negative', 'Neutral', 'Positive'],
                            values=chat.groupby('sentiment').count()[['message']].reset_index()['message'],
                             hole=.4, marker=dict(colors=['#075E54', '#dcf8c6', '#25D366']),
                             title=dict(text='<b>Overall</b>', font=dict(size=16))))

fig.update_traces(hoverinfo='label+value')

In [None]:
chat.columns

In [None]:
from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])

fig.add_trace(go.Pie(labels=['Negative', 'Neutral', 'Positive'],
                     values=chat[chat.msg_categ == 'short msg'].groupby('sentiment').count()[['message']].reset_index()['message'],
                     marker=dict(colors=['#075E54','#dcf8c6', '#25D366', ]),
                     title=dict(text='<b>short msg</b>', font=dict(size=16))), 1, 1)

fig.add_trace(go.Pie(labels=['Negative', 'Neutral', 'Positive'],
                     values=chat[chat.msg_categ == 'regular msg'].groupby('sentiment').count()[['message']].reset_index()['message'],
                     hole=.4, marker=dict(colors=['#075E54','#dcf8c6', '#25D366', ]),
                     title=dict(text='<b>regular msg</b>', font=dict(size=16))), 1, 2)

fig.update_traces(hole=.4, hoverinfo='label+value')

In [None]:
from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])

fig.add_trace(go.Pie(labels=['Negative', 'Neutral', 'Positive'],
                     values=chat[chat.is_emoji_empty == 0].groupby('sentiment').count()[['message']].reset_index()['message'],
                     marker=dict(colors=['#075E54','#dcf8c6', '#25D366', ]),
                     title=dict(text='<b>without Emoji</b>', font=dict(size=16))), 1, 1)

fig.add_trace(go.Pie(labels=['Negative', 'Neutral', 'Positive'],
                     values=chat[chat.is_emoji_empty == 1].groupby('sentiment').count()[['message']].reset_index()['message'],
                     hole=.4, marker=dict(colors=['#075E54','#dcf8c6', '#25D366', ]),
                     title=dict(text='<b>with Emoji</b>', font=dict(size=16))), 1, 2)

fig.update_traces(hole=.4, hoverinfo='label+value')
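
One caveat with the pie charts above: `groupby('sentiment').count()` only returns the sentiments that actually occur in the filtered subset, so if one category is missing, the hard-coded labels no longer line up with the values. A minimal sketch of a safer count follows. It assumes the `sentiment` column holds exactly the strings 'Negative', 'Neutral' and 'Positive'; the helper `sentiment_counts` and the toy data are hypothetical, not part of the notebook.

```python
import pandas as pd

LABELS = ['Negative', 'Neutral', 'Positive']


def sentiment_counts(frame, labels=LABELS):
    """Count messages per sentiment, returning one value per label (0 if absent)."""
    return frame['sentiment'].value_counts().reindex(labels, fill_value=0)


# Toy data standing in for `chat`; the values are invented for illustration.
toy = pd.DataFrame({
    'message': ['ok', 'great!', 'thanks', 'see you'],
    'sentiment': ['Neutral', 'Positive', 'Positive', 'Neutral'],
})

counts = sentiment_counts(toy)
print(counts.tolist())  # [0, 2, 2] -> 'Negative' is kept with a count of 0
```

The resulting series could be passed as `values=` to `go.Pie` together with `labels=LABELS`, so each slice keeps its intended label and color.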

## Message length
Next, I analyzed the word count of the messages sent by each member to understand their communication styles.

In [None]:
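from whatstk import WhatsAppChat, FigureBuilder

# The boxplot is built from message length, so each message is replaced by a
# string with one space per word: the plotted length then equals the word count.
# '<Media omitted>' placeholders become empty strings (length 0).
word_proxy = chat['message'].apply(
    lambda x: '' if x == '<Media omitted>' else ' ' * len(x.split()))

fig = FigureBuilder(chat.assign(message=word_proxy)).user_msg_length_boxplot(
    title='User message length', xlabel=None)
fig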
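
To complement the boxplot with numbers, the word counts can also be summarized per member. This is a minimal sketch on invented toy data; the column name `n_words` and the helper `word_count` are hypothetical, and I assume media placeholders should count as zero words, as in the cell above.

```python
import pandas as pd

# Toy data standing in for `chat`; usernames and messages are invented.
toy = pd.DataFrame({
    'username': ['Ana', 'Ana', 'Ben', 'Ben'],
    'message': ['hello there', '<Media omitted>',
                'see you at eight tonight', 'ok'],
})


def word_count(message):
    """Number of whitespace-separated words; media placeholders count as 0."""
    return 0 if message == '<Media omitted>' else len(message.split())


toy['n_words'] = toy['message'].apply(word_count)

# Median and maximum word count per member.
summary = toy.groupby('username')['n_words'].agg(['median', 'max'])
print(summary)
```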

## Message activity
The conversation stats made it clear that some members were more active than others. To understand the dynamics of the group, however, I also needed to see how message frequency changed over time, so I looked at activity by day, by member and by hour.

## Activity by day

In [None]:
chat['date'] = pd.to_datetime(chat['date'])
# chat.info()

In [None]:
fig = FigureBuilder(chat).user_interventions_count_linechart(title=None, xlabel=None, all_users=True)
fig
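
The daily totals behind a chart like this can also be computed directly with pandas. The sketch below uses invented toy data and assumes `date` is a datetime column, as in the cell above; `resample('D')` also returns days without messages, with a count of 0.

```python
import pandas as pd

# Toy data standing in for `chat`; timestamps are invented.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-01 09:00',
                            '2023-01-01 21:30',
                            '2023-01-03 12:15']),
    'message': ['a', 'b', 'c'],
})

# Number of messages per calendar day, including empty days.
daily = toy.set_index('date').resample('D')['message'].count()
print(daily.tolist())  # [2, 0, 1]
```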

## Member interventions

In [None]:
fig = FigureBuilder(chat).user_interventions_count_linechart(date_mode='date', title=None, xlabel=None)
fig

In [None]:
fig = FigureBuilder(chat).user_interventions_count_linechart(cumulative=True, title=None, xlabel=None)
fig

## Activity by hour 

In [None]:
chat.columns

In [None]:
chat.dtypes

In [None]:
chat['hour'] = pd.to_datetime(chat['hour']).dt.hour

In [None]:
pivot = pd.pivot_table(chat, index='hour', columns='Time-Day', values='message', aggfunc='count').fillna(0)
pivot

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Number of messages sent in each hour of the day
plt.figure(figsize=(12, 8))
sns.countplot(x='hour', data=chat, palette='viridis')
plt.title('Distribution of messages')
plt.xlabel('Hour')
plt.ylabel('Messages')
plt.show()


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Messages per hour, split by time of day (the 'Time-Day' column)
plt.figure(figsize=(12, 8))
sns.countplot(x='hour', hue='Time-Day', data=chat, palette='viridis')
plt.title('Distribution of messages')
plt.xlabel('Hour')
plt.ylabel('Quantity')
plt.legend(title='Time-Day', title_fontsize=12)
plt.show()


In [None]:
pivot = pd.pivot_table(chat, index='hour', columns='Weekday', values='message', aggfunc='count').fillna(0)
pivot

In [None]:
pivot = pd.pivot_table(chat, index='hour', columns='Weekday', values='message', aggfunc='count').fillna(0)
heatmap = go.Heatmap(z=pivot.values,
                     x=pivot.columns,
                     y=pivot.index,
                     hovertemplate='Interventions at %{y}-hour<extra>%{z}</extra>',
                     colorscale='Greens')
fig = go.Figure(data=[heatmap]).update_layout(xaxis={'categoryorder': 'array',
                                                     'categoryarray': ['Monday', 'Tuesday', 'Wednesday',
                                                                       'Thursday', 'Friday', 'Saturday', 'Sunday']})
fig
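
The `categoryarray` setting orders the weekdays on the plot, but the pivot table itself stays in alphabetical order and omits any weekday without messages. A minimal sketch of ordering the table itself, on invented toy data and assuming `Weekday` holds English day names (the constant `WEEKDAYS` is hypothetical):

```python
import pandas as pd

WEEKDAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
            'Friday', 'Saturday', 'Sunday']

# Toy data standing in for `chat`; values are invented.
toy = pd.DataFrame({
    'hour': [9, 9, 21, 21],
    'Weekday': ['Monday', 'Sunday', 'Monday', 'Monday'],
    'message': ['a', 'b', 'c', 'd'],
})

pivot = (pd.pivot_table(toy, index='hour', columns='Weekday',
                        values='message', aggfunc='count')
         .reindex(columns=WEEKDAYS)  # calendar order, adds missing days
         .fillna(0)                  # no messages -> 0
         .astype(int))
print(pivot)
```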

In [None]:
pivot = pd.pivot_table(chat, index='Time-Day', columns='Weekday', values='message', aggfunc='count').fillna(0)
heatmap = go.Heatmap(z=pivot.values,
                     x=pivot.columns,
                     y=pivot.index,
                     hovertemplate='Interventions at %{y}-Time-Day<extra>%{z}</extra>',
                     colorscale='Greens')
fig = go.Figure(data=[heatmap]).update_layout(xaxis={'categoryorder': 'array',
                                                     'categoryarray': ['Monday', 'Tuesday', 'Wednesday',
                                                                       'Thursday', 'Friday', 'Saturday', 'Sunday']})
fig

In [None]:
hour_chat = chat.groupby('hour').size().reset_index(name='count')
hour_chat.head()

In [None]:
fig = FigureBuilder(chat).user_interventions_count_linechart(
    date_mode='weekday',
    title=None,
    xlabel=None).update_layout(xaxis={'tickvals': [0, 1, 2, 3, 4, 5, 6],
    'ticktext': ['Monday', 'Tuesday', 'Wednesday','Thursday', 'Friday', 'Saturday', 'Sunday']})
fig

## Member interaction

In [None]:
fig = FigureBuilder(chat).user_message_responses_heatmap(title=None)
fig

In [None]:
fig = FigureBuilder(chat).user_message_responses_flow(title=None)
fig

## How everyone’s feeling

In [None]:
# Share of each sentiment per user (every user's column sums to 1)
pivot = pd.pivot_table(chat, index='sentiment',
                       columns='username',
                       values='message',
                       aggfunc='count').apply(lambda x: x/x.sum(), axis=0)
heatmap = go.Heatmap(z=pivot.values,
                     x=pivot.columns,
                     y=pivot.index,
                     hovertemplate='Interventions<extra>%{z:.2%}</extra>',
                     colorscale='Greens')
fig = go.Figure(data=[heatmap])
fig
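
The same per-user shares can be obtained with `pd.crosstab` and `normalize='columns'`. This is a sketch of an alternative, not the method used above, shown on invented toy data. Unlike the pivot table, `crosstab` fills a sentiment that a user never shows with 0 instead of NaN.

```python
import pandas as pd

# Toy data standing in for `chat`; usernames and sentiments are invented.
toy = pd.DataFrame({
    'username': ['Ana', 'Ana', 'Ana', 'Ana', 'Ben', 'Ben'],
    'sentiment': ['Positive', 'Neutral', 'Neutral', 'Neutral',
                  'Negative', 'Positive'],
})

# Rows: sentiment, columns: user, values: share of that user's messages.
shares = pd.crosstab(toy['sentiment'], toy['username'], normalize='columns')
print(shares)
```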

## Conclusion
After analyzing our WhatsApp group chat, I conclude that it was a fun and supportive space where members showed appreciation for one another and shared a good laugh.
Although most messages were neutral in sentiment, the use of emojis added a positive touch. The topics varied widely, from casual banter to serious discussions, making for an engaging and diverse conversation.
Overall, the analysis provided valuable insights into the dynamics of the group.