# Exploratory Data Analysis
In this notebook, we perform basic data wrangling and exploratory data analysis (EDA) on a WhatsApp group conversation. For this case study, we use a classroom WhatsApp group of 207 members. The group mostly discusses AI/ML topics and assignments.

<br>This notebook serves as groundwork for the second notebook, where we perform extensive topic modelling.

In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import re

# Note: matplotlib 3.6+ renamed this style to "seaborn-v0_8-whitegrid".
plt.style.use("seaborn-whitegrid")
plt.rc("figure", autolayout=False, dpi=100)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)

sns.set()

## What does an exported WhatsApp text file look like?

In [46]:
!head -n 10 'exported_whatsapp_chat.txt'

21/07/2020, 22:22 - Messages and calls are end-to-end encrypted. No one outside of this chat, not even WhatsApp, can read or listen to them. Tap to learn more.
21/07/2020, 22:22 - Glad: hii tribhav..
21/07/2020, 22:22 - Tribhav: Hello
21/07/2020, 22:24 - Tribhav: <Media omitted>
21/07/2020, 22:27 - Glad: that's the perfect shape..
21/07/2020, 22:27 - Tribhav: 🤗
21/07/2020, 22:27 - Glad: hope I'm there one day..
21/07/2020, 22:27 - Tribhav: You'll
21/07/2020, 22:28 - Glad: far I've gone is oval.. that's why I like making naan.. no one will question me that way..
21/07/2020, 22:28 - Tribhav: Hahahah


Each line in the exported text file has three distinct parts: the date and time of the message, the sender's name or phone number, and the message text. The file also contains notifications such as '<Media omitted>' or the end-to-end encryption notice. We exclude these lines from our analysis.
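
As a rough illustration of that exclusion, the sketch below flags lines that carry no user-written text. The helper name `is_notification` and the constant `MEDIA_PLACEHOLDER` are illustrative, not part of the notebook, and the check assumes the `date, time - sender: text` layout shown in the sample above. A system notice that happens to contain `": "` would slip through.

```python
MEDIA_PLACEHOLDER = "<Media omitted>"  # placeholder text seen in the sample export

def is_notification(line):
    """Return True for timestamped lines that carry no user-written text."""
    # Everything after the first " - " is either "sender: text" or a system notice.
    _, sep, rest = line.strip().partition(" - ")
    if not sep:
        return False  # not a timestamped line; left to the caller
    _, colon, text = rest.partition(": ")
    if not colon:
        return True  # no "sender: " part, e.g. the encryption notice
    return text.strip() == MEDIA_PLACEHOLDER

print(is_notification("21/07/2020, 22:24 - Tribhav: <Media omitted>"))
print(is_notification("21/07/2020, 22:22 - Glad: hii tribhav.."))
```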

## Preprocessing

In this section, we extract the datetime, the sender's phone number or name, and the message text from each line, and write them to a csv file with these three fields as the header.

In [49]:
def separator(msg):
    """
    Extract (datetime, name, message text) from one line of the exported .txt file.
    Returns '' for lines that do not match, e.g. system notifications.
    """
    msg = msg.strip()

    # we define three groups: datetime, person-name/phone number, message
    result = re.search(r'(\d{1,2}/\d{1,2}/\d{4},\s+\d{1,2}:\d{1,2})\s+-\s+([+0-9a-zA-Z\s]+):\s+(.*)', msg)

    # ignore lines that don't have the three groups above, and return an empty string
    if result is None:
        return ''

    return result.groups()

separator('20/05/2021, 21:24 - +91 9999 99999: Check this ans https://stackoverflow.com/questions/41567895/will-scikit-learn-utilize-gpu')

('20/05/2021, 21:24',
 '+91 9999 99999',
 'Check this ans https://stackoverflow.com/questions/41567895/will-scikit-learn-utilize-gpu')
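
One limitation worth noting: `separator` returns `''` for any line that does not contain a timestamp, so the second and later lines of a multi-line message are dropped. A minimal sketch of one way to fold such continuation lines back into the preceding message before parsing is shown below. The names `MESSAGE_START` and `merge_continuation_lines` are hypothetical, the sender names in the usage example are made up, and the sketch assumes every new message starts with a `dd/mm/yyyy, hh:mm - ` prefix as in the sample export.

```python
import re

# Assumed prefix of every new message: "dd/mm/yyyy, hh:mm - "
MESSAGE_START = re.compile(r"\d{1,2}/\d{1,2}/\d{4},\s+\d{1,2}:\d{2}\s+-\s+")

def merge_continuation_lines(lines):
    """Join lines that do not start with a timestamp onto the previous message."""
    merged = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if MESSAGE_START.match(line) or not merged:
            merged.append(line)
        else:
            # No timestamp: treat it as a continuation of the last message.
            merged[-1] = f"{merged[-1]} {line}"
    return merged

sample = [
    "20/05/2021, 21:24 - Asha: first line\n",
    "second line of the same message\n",
    "20/05/2021, 21:25 - Ravi: ok\n",
]
for message in merge_continuation_lines(sample):
    print(message)
```

The merged lines can then be passed to `separator` one by one, exactly as the raw lines are.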

In [50]:
import csv

def txt_to_csv(fn):
    """
    Convert the exported .txt file to a csv file with three headers: datetime, id, message.
    Accepts `fn`, the filename of the exported text file, as argument.
    """
    # the export contains emoji, so read and write as UTF-8;
    # newline='' lets the csv module control line endings
    with open(fn, encoding='utf-8') as f, \
         open(f'{fn[:-4]}.csv', 'w', encoding='utf-8', newline='') as wf:
        out = csv.writer(wf)
        out.writerow(['datetime', 'id', 'message'])
        # write only the lines that separator() could parse
        for line in map(separator, f):
            if line:
                out.writerow(line)

txt_to_csv('original_chats.txt')

To protect the members' privacy, phone numbers and real names have been replaced with random names from a random-name-generator API. <b>This masking is applied only in the EDA and topic modelling notebooks, not in the web app.</b>

In [44]:
!pip install getindianname > /dev/null
from getindianname import randname

# read the original data to mask the real ids
chats = pd.read_csv('original_chats.csv')  # the unmasked dataset is not made available beyond this point

# get all unique real ids
real_ids = chats['id'].unique().tolist()

# generate one random first name per real id
# (note: names may repeat, which would merge distinct members under one name)
generated_ids = [randname().split()[0] for _ in range(len(real_ids))]

# map each real id to a random name, and save the masked data
real_to_generated = dict(zip(real_ids, generated_ids))
chats['id'] = chats['id'].map(real_to_generated)
chats.to_csv('group_chats.csv', index=False)
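
Because the random names are drawn independently, two different members can receive the same pseudonym, which would merge their messages in later analysis. The sketch below shows one possible way to keep the mapping one-to-one. `build_pseudonyms`, `name_source`, and the `Member<n>` fallback are hypothetical names introduced for illustration, and a small fixed list of made-up names stands in for the name-generator API.

```python
import random

def build_pseudonyms(real_ids, name_source, max_tries=1000):
    """Map each real id to a distinct pseudonym drawn from `name_source()`."""
    mapping = {}
    used = set()
    for real_id in real_ids:
        for _ in range(max_tries):
            candidate = name_source()
            if candidate not in used:
                break
        else:
            # No unused name found: fall back to a numbered alias.
            n = len(mapping) + 1
            candidate = f"Member{n}"
            while candidate in used:
                n += 1
                candidate = f"Member{n}"
        used.add(candidate)
        mapping[real_id] = candidate
    return mapping

# Stand-in generator with a deliberately tiny pool to force collisions.
pool = ["Asha", "Ravi", "Meera"]
mapping = build_pseudonyms(["id1", "id2", "id3", "id4"], lambda: random.choice(pool))
print(len(set(mapping.values())) == len(mapping))
```

With the library used in this notebook, `name_source` could be `lambda: randname().split()[0]`.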

In [54]:
chats = pd.read_csv('group_chats.csv', parse_dates=['datetime'])
chats

| | datetime | id | message |
|---|---|---|---|
| 0 | 2021-10-05 11:16:00 | Udaya | hi |
| 1 | 2021-10-05 11:30:00 | Udaya | Has anyone completed transfer learning assignm... |
| 2 | 2021-10-05 11:30:00 | Krishna | hi all |
| 3 | 2021-10-05 16:11:00 | Mahendra | <Media omitted> |
| 4 | 2021-10-05 16:13:00 | Sherya | df.age.isna() or df.age.isnull() |
| ... | ... | ... | ... |
| 6315 | 2021-11-12 21:50:00 | Raghvendra | Very nice article. 👏👏👏 |
| 6316 | 2021-11-12 23:44:00 | Rohit | Anyone who is appearing or appeared for interv... |
| 6317 | 2021-12-12 17:22:00 | Sherya | Guys please anyone can share sample resume on ... |
| 6318 | 2021-12-12 17:25:00 | Ashish | Thank you :) |
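
The export writes timestamps day-first (`20/05/2021, 21:24` in the earlier example), so a string such as `11/12/2021` is ambiguous, and a parser that is not told the order can read it month-first. The jump from `2021-11-12` to `2021-12-12` between consecutive rows above would be consistent with that, although this is an inference rather than something the notebook states. A minimal sketch of parsing with an explicit format, using only the standard library (`EXPORT_FORMAT` and `parse_timestamp` are illustrative names, not part of the notebook):

```python
from datetime import datetime

# Assumed layout of the export's timestamps: day/month/year, 24-hour clock.
EXPORT_FORMAT = "%d/%m/%Y, %H:%M"

def parse_timestamp(text):
    """Parse a WhatsApp export timestamp, always reading the day first."""
    return datetime.strptime(text.strip(), EXPORT_FORMAT)

stamp = parse_timestamp("11/12/2021, 21:50")
print(stamp.day, stamp.month)  # prints "11 12": 11 December, not 12 November
```

In pandas, passing `dayfirst=True` to `pd.read_csv` alongside `parse_dates`, or calling `pd.to_datetime` with `format="%d/%m/%Y, %H:%M"`, expresses the same intent.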
