# WhatsApp Chat Data Analysis

In this EDA project, I explore how my friends and I use a WhatsApp group chat: for example, the hours at which we are most active and the emojis we use most often. Let's get started!


**Importing the required libraries for this project**

In [None]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import emoji
from wordcloud import WordCloud, STOPWORDS
from collections import Counter

In [None]:
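def rawToDf(file, key):
    """Parse an exported WhatsApp chat (.txt) into a DataFrame with date_time, user and msg columns."""
    # Regex matching the timestamp prefix of each message, per export format
    split_formats = {
        '12hr' : r'\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}\s[APap][mM]\s-\s',
        '24hr' : r'\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}\s-\s',
        'custom' : ''
    }
    # Matching strptime formats used to parse that timestamp prefix
    datetime_formats = {
        '12hr' : '%d/%m/%Y, %I:%M %p - ',
        '24hr' : '%d/%m/%Y, %H:%M - ',
        'custom': ''
    }

    with open(file, 'r', encoding='utf-8') as raw_data:
        # Join all lines with spaces so multi-line messages stay in one record
        raw_string = ' '.join(raw_data.read().split('\n'))
        # Text between timestamps; element 0 (anything before the first timestamp) is dropped
        user_msg = re.split(split_formats[key], raw_string)[1:]
        date_time = re.findall(split_formats[key], raw_string)
        df = pd.DataFrame({'date_time': date_time, 'user_msg': user_msg})

    df['date_time'] = pd.to_datetime(df['date_time'], format=datetime_formats[key])

    usernames = []
    msgs = []
    for i in df['user_msg']:
        # Split "user: message" on the first ": " only, so colons inside the message are kept
        a = re.split(r'([\w\W]+?):\s', i, maxsplit=1)
        if a[1:]:
            usernames.append(a[1])
            msgs.append(a[2])
        else:
            # No "user: " prefix, so treat the record as a group notification
            usernames.append("grp_notif")
            msgs.append(a[0])

    df['user'] = usernames
    df['msg'] = msgs

    df.drop('user_msg', axis=1, inplace=True)

    return df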
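
To see what the splitting logic in `rawToDf` does, here is a minimal sketch that runs the same `12hr` pattern on a small in-memory string instead of a file. The sample chat, the names `Alice`, `Bob` and `Carol`, and the variable names are made up for illustration; real exports may differ in date order or spacing.

```python
import re
from datetime import datetime

# Hypothetical three-record export in the 12-hour format (not real chat data).
sample = (
    "12/01/2023, 9:15 PM - Alice: Hello everyone\n"
    "12/01/2023, 9:16 PM - Bob: Reminder: bring notes\nsee you there\n"
    "12/01/2023, 9:20 PM - Alice added Carol\n"
)

# Same timestamp pattern as the '12hr' entry in rawToDf.
pattern = r'\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}\s[APap][mM]\s-\s'

# Flatten line breaks so a multi-line message stays in one record.
raw_string = ' '.join(sample.split('\n'))
stamps = re.findall(pattern, raw_string)      # one timestamp prefix per record
bodies = re.split(pattern, raw_string)[1:]    # text following each timestamp

rows = []
for stamp, body in zip(stamps, bodies):
    when = datetime.strptime(stamp, '%d/%m/%Y, %I:%M %p - ')
    # Split on the first ": " only, so "Reminder: ..." stays inside the message.
    parts = re.split(r'([\w\W]+?):\s', body, maxsplit=1)
    if parts[1:]:
        rows.append((when, parts[1], parts[2]))
    else:
        # No "user: " prefix, so this is a group notification.
        rows.append((when, 'grp_notif', parts[0]))

for row in rows:
    print(row)
```

Each message keeps a trailing space because the line breaks were replaced by spaces, which is why the media filter later in the notebook compares against `"<Media omitted> "` with a space at the end.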

In [None]:
df = rawToDf("/kaggle/input/whatsaap-chat-data/WhatsApp Chat with MBA23 (4th sem official).txt",'12hr')
df.head()

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.info()

In [None]:
me = "Aanchal jayal"

# Data Cleaning

**Number of images; shared media appears in the export as `<Media omitted>`**

In [None]:
images = df[df['msg']=="<Media omitted> "] 
images.shape

In [None]:
df["user"].unique()    # unique users in the chat (includes "grp_notif")

**Remove images**

In [None]:
df.drop(images.index,inplace = True)

In [None]:
df.shape

# Exploratory Analysis & Visualization

**1. Who is the most active member of the group, and who is the least active?**

In [None]:
df.groupby("user")["msg"].count().sort_values(ascending=False)

**2. The person who uses WhatsApp the most**

In [None]:
df.groupby('user').count().head()


In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x="user",data = df)
plt.show()

**3. Plotting activity per hour**

In [None]:
df['hour'] = df['date_time'].dt.hour
# Number of messages I sent in each hour of the day
my_hourly = df[df['user'] == me].groupby('hour').size().sort_index()
my_hourly.plot(kind='bar')
plt.show()

**4. Top 5 emojis used in the group**

In [None]:

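emojis = []
for message in df['msg']:
    # Character-by-character check; emojis made of several code points are not matched as a whole
    emojis.extend([c for c in message if c in emoji.EMOJI_DATA])

# Five most frequent emojis with their counts
pd.DataFrame(Counter(emojis).most_common(5), columns=['emoji', 'count'])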

**5. The most active hour on WhatsApp**

In [None]:
df1 = df.copy()
df1['msg'] = [1] * df.shape[0]
df1['hours'] = df1['date_time'].apply(lambda x: x.hour)
time_df = df1.groupby('hours').count().reset_index().sort_values(by = 'hours')
time_df
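
The table above lists message counts for every hour. As a small sketch of how the single busiest hour could be read off directly, the example below uses a toy DataFrame in place of `df`; the timestamps and the names `toy`, `hourly_counts` and `busiest_hour` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical timestamps standing in for df['date_time'].
toy = pd.DataFrame({'date_time': pd.to_datetime([
    '2023-01-12 21:15', '2023-01-12 21:40', '2023-01-13 09:05',
    '2023-01-13 21:02', '2023-01-14 09:30', '2023-01-14 21:59',
])})

# Count messages per hour of the day, ordered by hour.
hourly_counts = toy['date_time'].dt.hour.value_counts().sort_index()

# idxmax returns the index label (the hour) with the highest count.
busiest_hour = hourly_counts.idxmax()
print(busiest_hour, hourly_counts.max())
```

On the real data, the same two lines applied to `df['date_time']` should give the hour with the most messages, assuming the media rows have already been dropped as in the cleaning step.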
