# A - Analysis of a WhatsApp chat
### GROUP G - Authors: Alessia Bernacchia, Alexandra Biddiscombe

## 1. Problem statement

Our project consists of extracting chat data from WhatsApp, and visualising a variety of metrics about texting styles by age. As a goal of the project, we would like to prove that there are a number of differences between the ways people of different ages text, both based on our preconceptions and personal experience, and through the exploration of the data. Our various hypothesis tests and linear regression visualisations will focus mostly on the differences between people under the age of 30, and people over the age of 30.

## 2. Description of the dataset

*We asked a set of different people with different ages to provide the Whatsapp chats. All data was submitted with consent of the participants, who were requested to also submit their age for statistics purposes.* 

- **source of dataset**  
    Multiple Whatsapp chats from people of various ages
- **number of observation**  
    40 chats between various age groups, for a total of 460 different people
- **number of variables per observation**   
    Each observation is one person's chatting information, extracted from the collected chats logs
    We start with around 250 different people, and clean it down to 191
- **meaning and type of the different variables**  
    Each person (variable) is stored along with the chatting information, which in our end goal state inlcudes:
    - **Author** : A short name acronym to separate each chatter, followed by and underscore and their age
    - **Age** : the author's age, arguably the most important piece of information
    - **Age_group** : which group of range 5 years their age fits into (i.e. 20 - 25)
    - **n_letters** : the total count of letters this person sent
    - **n_word** : the total word count sent by this person
    - **n_emoji** : the total emoji count sent by  this person
    - **n_Url** : the total URL count sent by this person
    - **Media_Count** : the total number of images, audio, and other media sent by t his person
    - **Message_Count** : the total number of distinct messages sent by this person
    - **mean_letters_x_mess** : the average number of letters per message, calculated by taking the n_letters and dividing by the Message_Count
    - **mean_words_x_mess** : average number of words per message
    - **mean_emojis_x_mess** : average number of emoji per message
  

## Dataframe creation:

#### Sourcing the raw datasets:

In [1]:
# source of dataset
# All raw WhatsApp exports are saved locally in a folder named "chats"

android_chats_to_analyse = ["chats/JB_56_MKB_59.txt", "chats/OB_48_LC_53.txt", "chats/PP_35_AY_37.txt",
                            "chats/FB_39_TL_34.txt", "chats/FB_39_SO_34.txt", "chats/FB_39_JB_35.txt",
                            "chats/GROUP_OVER_A.txt", "chats/GROUP_OVER_B.txt", "chats/GROUP_OVER_C.txt",
                            "chats/SB_23_EG_35.txt", "chats/RCM_21_AM_63.txt", "chats/RCM_21_MB_32.txt",
                            "chats/MB_21_CAS_21.txt", "chats/MB_21_HK_21.txt", "chats/VAL_21_LES_21.txt", "chats/VAL_21_MB_21.txt",
                            "chats/AB_21_YS_26.txt", "chats/GS_21_AM_20.txt", "chats/GS_21_CM_21.txt", "chats/GS_21_KM_22.txt",
                            "chats/GS_21_SF_21.txt", "chats/GS_21_VM_26.txt",
                            "chats/GROUP_UNDER_A.txt", "chats/GROUP_UNDER_B.txt", "chats/GROUP_UNDER_C.txt", "chats/GROUP_UNDER_D.txt",
                            "chats/GROUP_UNDER_E.txt", "chats/GROUP_UNDER_F.txt", "chats/GROUP_UNDER_G.txt", "chats/GROUP_UNDER_H.txt"]

ios_chats_to_analyse = ["chats/AZ_43_SP_48.txt", "chats/CL_54_NC_47.txt", "chats/GR_60_SP_48.txt", "chats/VS_36_SP_48.txt",
                        "chats/FP_21_LL_22.txt", "chats/MM_19_AB_23.txt", "chats/MM_19_SM_21.txt",
                        "chats/GROUP_UNDER_I.txt", "chats/GROUP_UNDER_L.txt", "chats/GROUP_UNDER_M.txt"]


#### Importing the libraries:

The first block are libraries used to create the dataframes.\
The second block are libraries for plotting and visualising the data.\
The last block is made up of the libraries used for language analysis. (reminder to remove, we most likely will not use this as it is not a letter / message counting based analysis).

In [2]:
# Calculation and utility imports
import pandas as pd
import numpy as np
import re
import emoji
import os
from datetime import datetime

## 3. Data cleanup
read the dataset in Python and take care of:

When creating our chat dataframes, we want to extract the data from our initial datasets by using the system whatsapp uses to export the data. The data is exported differently when done by an Android and an iOS device, so we have to define separate code pieces to handle the different exportation methods.

In both cases, we can use regex to look for the following information:
- **Date**, the first piece of information of each interaction, which will always define the start of a datapoint;
- **Author**, between the date and the message, which is not always present (see encryption announcements, join announcements, etc.);
- **Message**, the body of the message, usually found after te author.

To explain how to separate these pieces of information, let's take as an example the chats exported by an Android device:
- The **date** is separated from the rest of the information by a dash ("-"), and will be used by future code in the format dd/mm/yy;
- The **author** is separated from the message by a colon (":");
- The **message** is any additional part of the line and future lines that do not start with a date.

To make sure we obtain the correct information for each interaction, we need to be sure the regex will pick up the correct pieces each time, so we need to convert all dates to dd/mm/yy, make sure there are no ":" in any contact name (otherwise if will split the contact name before the colon and add all the rest to the message), and add any line that does not start with a date to the text of the message above. \
The code for these cleaning steps is placed underneath the dataframe creation, so as to be used in the cells where it is needed.

Through further data exploration and visualisation, we came to the conclusion that all analyses should be made on statistics between the ages of the people in our dataframe, and so we later remove a great deal of outside information, such as the times.

#### Initial dataframe creation:
The "parse_wa_chat" functions create an initial dataframe, using the cleaning methods described above. This is in no means the final dataframe, as we need to anonymise the data, remove all message content and add the age groups before we can perform our final analysis, so we will continue to refine the dataframe further with additional cleaning, reorganising and selection of the final data.

In [3]:
import nbimporter
from changer_names import change_contact_names_android, change_contact_names_ios

In [4]:
# Takes as input an exported whatsapp .txt file and outputs a pandas dataframe containing the columns 
# Date, Time, Author, and Message

# Expected input format: a WhatsApp chat exported using an Android device.
def parse_wa_chat_android(wa_chat_filename):
    
    parsed_data = [] # A list for storing the data to then be used in the pandas dataframe
    
    with open(wa_chat_filename, encoding="utf-8") as chat:
        chat.readline() # skip the first line, it contains information about message encoding
        message_buffer = [] # holds any lines that are not the start of a new message
        date, time, author = None, None, None # initialise important variables
        while True:
            line = chat.readline()
            if not line:
                break # There are no more lines, file ended, stop loop
            line = line.strip() # remove spaces at start and end
            
            line = fix_faulty_contacts(line) # added because this contact name is problematic, contains ":"

            line = change_contact_names_android(line)
            
            if starts_with_date_and_time(line):
                # Normalise all lines by setting date and time to same format for all files
                if not exact_date_and_time(line):
                    line = line[:6] + line[8:]
                    
                # This checks if the line starts with a timestamp
                if len(message_buffer) > 0:
                    parsed_data.append([date, time, author, " ".join(message_buffer)])
                message_buffer.clear()
                date, time, author, message = get_data_point(line)
                message_buffer.append(message)
            else:
                message_buffer.append(line)
    chat_df = pd.DataFrame(parsed_data, columns = ["Date", "Time", "Author", "Message"])
    
    return chat_df


# Checks if a given line starts with a date and time, using regex to determine
def starts_with_date_and_time(line):
    
    pattern = "^([0-9]+)(/)([0-9]+)(/)([0-9]+), ([0-9]+):([0-9]+) -"
    result = re.match(pattern, line)
    
    if result:
        return True
    else:
        return False

    
# Checks if the date format has a year of length 2 digits, which is the format "datetime" uses
def exact_date_and_time(line):
    
    pattern = "^([0-9]+)(/)([0-9]+)(/)([0-9][0-9]), ([0-9]+):([0-9]+) -"
    result = re.match(pattern, line)
    
    if result:
        return True
    else:
        return False
    
    
# Takes a line of the chat and log and returns the elements Date, Time, Author, and Message
def get_data_point(line):
    
    split_line = line.split(" - ", 1)
    date_time = split_line[0]
    
    date, time = date_time.split(", ", 1)
    message = " ".join(split_line[1:])
    if(":" in message): # This indicates there is an author 
        split_message = message.split(": ", 1)
        author = split_message[0]
        message = " ".join(split_message[1:])
    else:
        author = None
    
    return date, time, author, message


# Fix the known mistake of a contact with ":" in it by replacing it with a functional name
def fix_faulty_contacts(line):
    if("******" in line):
        line = line.replace("******", "SB_23")
    elif("*******" in line):
        line = line.replace("******", "RCM_21")
    return line

*Names that gave us issues in the creation of the dataset, in the function "fix_faulty_contacts", have been redacted for privacy resons, and replaced with "\*\*\*\*\*\*\*".*

In [5]:
# Takes as input an exported whatsapp .txt file and outputs a pandas dataframe containing the columns 
# Date, Time, Author, and Message

# Expected input format: a WhatsApp chat exported using an iOS device.
def parse_wa_chat_ios(name_wa_chat_file):

    # Creating a dataframe and storing all data inside that dataframe.
    parsed_data = [] # List to keep track of data so it can be used by a Pandas dataframe
    
    # Uploading exported chat file
    with open(name_wa_chat_file, encoding="utf-8") as fp:
        # Skipping first line of the file because contains information related to something about end-to-end encryption
        fp.readline() 
        message_buffer = [] 
        date, time, author = None, None, None
        
        while True:
            line = fp.readline()
            #line = line.replace(r"\u200u", "") 
            string_encode = line.encode("ascii", "ignore")
            line = string_encode.decode()

            if not line: 
                break # file ended 
                
            line = line.strip() #remove first and last space
            line = change_contact_names_ios(line)

            try:
                if starts_with_date_and_time_ios(line): 
                    if len(message_buffer) > 0: 
                        parsed_data.append([date, time, author, ' '.join(message_buffer)])
                    message_buffer.clear() 
                    date, time, author, message = get_data_point_ios(line) 
                    message_buffer.append(message) 
                    
                else:
                    message_buffer.append(line)
                    
            except:
                pass
    
    chat_df = pd.DataFrame(parsed_data, columns=['Date', 'Time', 'Author', 'Message']) # Initialising a pandas Dataframe.

    return chat_df

def starts_with_date_and_time_ios(s):
    patterm = '^([0-9]+)(/)([0-9]+)(/)([0-9][0-9]), ([0-9]+):([0-9][0-9]) -'
    pattern = '^\[([0-9]+)([\/-])([0-9]+)([\/-])([0-9]+)[,]? ([0-9]+):([0-9][0-9]):([0-9][0-9])?[ ]?(AM|PM|am|pm)?\]' 

    result = re.match(pattern, s)
    #print(result)
    if result:
        return True
    return False

def get_data_point_ios(line):
    split_line = line.split('] ')
    date_time = split_line[0]
    message = ' '.join(split_line[1:])

    if(': ' in message ):
        split_message = message.split(':')
        author = split_message[0]
        message = ' '.join(split_message[1:])
    else:
        author = None

    date, time = date_time.split(', ')
    if len(date) > 9:
        date = date[1:7] + date[9:]
    else:
        date = date[1:]
    
    # Fixing the mistake that is the american date system
    american_contacts = ["VC_36", "SP_48", "AZ_43", "GR_60"]
    if author in american_contacts:
        month, day, year = date.split("/")
        if len(month) == 1:
            month = '0' + month
        if len(day) == 1:
            day = '0' + day
        date = day + '/' + month + '/' + year

    # Making the iOS exported time system match Android method
    hour, minute, second = time.split(':')
    if len(hour) == 1:
        hour = '0' + hour
    if 'AM' in time or 'am' in time:
        if hour == '12':
            hour = '00'
    if 'PM' in time or 'pm' in time:
        hour = str(int(hour) + 12)
    time = hour + ':' + minute     


    # Omitted media in the iOS format contains precise forms of media omitted, which does not match with the Android
    # system, such as differentiating "Omitted audio" and "Omitted image". To fix this problem, we simply take any 
    # message with the common word "omitted" and set it to the same format as Android
    if "omitted" in message:
        message = "<Media omitted>"

    return date, time, author, message

#### Additional data cleaning:
We would like to ideally remove any recognisable information from the chat dataframes, such as content of the messages and names, using a name replacement function.       
The code usedfor the nmae replacement is stored in a different file, for a little more anonymity, as all the contact names need to be listed to be changed.

### Additional data cleaning steps:
In this code segment, we aim to make the dataframes closer to the final use case that we are aiming for, in the following ways:
- **Dropping all NaN** rows
- Adding additional columns holding information about the **date**:
    - **Day of the week**
- Adding additional columns holding information about the  **message**:
    - **Letters in each message**
    - **Words in each message**
    - **Number of URLs**
    - **Number of media**
    - **A list of emoji**
    - **A list of words**
    - **Counter for the number of messages**, at the moment this will always display 1 but will be useful for further analysis


methods to create and populate a global dataframe of all the chats

In [6]:
# Takes a simple dataframe as input and outputs another that is slightly better fit for the final analysis        
def expand_df(df):
    
    # Drop NaN values from dataset
    df = df.dropna()
    df = df.reset_index(drop=True)
    
    # Change datetype of the "Date" column
    df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%y")
    
    # Adding a day column
    weeks = {0 : 'Monday', 1 : 'Tuesday', 2 : 'Wednesday', 3 : 'Thursday', 4 : 'Friday', 5 : 'Saturday', 6: 'Sunday'}
    df['Day'] = df['Date'].dt.weekday.map(weeks)
    
    # Rearrange columns for better readability
    df = df[["Date", "Day", "Time", "Author", "Message"]]
    
    # This is in the original code but I'm gonna skip it for now to see if it still works without:
#     # Changing the datatype of column "Day".
#     df['Day'] = df['Day'].astype('category')

    # Add a letter count column for each message
    df["n_letters"] = del_omitted_media(df).apply(lambda row: len(row.Message), axis=1)

    # Counting number of word's in each message, it will add extra column and store information in it.
    df['n_word'] = del_omitted_media(df).apply( lambda row: len(row.Message.split(' ')), axis=1)

    # Function to count number of links in dataset, it will add extra column and store information in it.
    URLPATTERN = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    df['n_Url'] = df.Message.apply(lambda x: re.findall(URLPATTERN, x)).str.len()

    # Count number of media in chat.
    media_pattern_1 = r'<Media omessi>'
    media_pattern_2 = r'<Media omitted>'
    df['Media_Count'] = df.Message.apply(lambda x : re.findall(media_pattern_1, x)).str.len() + df.Message.apply(lambda x : re.findall(media_pattern_2, x)).str.len()

    # emoji list by message
    df["emoji"] = df["Message"].apply(lambda x: split_count(x))

    # create word list by message
    df["word"] = del_omitted_media(df)["Message"].apply(lambda x: split_in_words(x))

    # usefull to count
    df['MessageCount'] = 1
    
    # We decided to keep the media so as to be able to plot the frequency of messages
    # df = del_omitted_media(df)
    
    return df


# Creates a list containing all the emojis in the message
def split_count(text):
    
    emoji_list = []
    
    data = list(text.strip(" "))

    for word in data:
        if any(char in emoji.EMOJI_DATA for char in word):
            emoji_list.append(word)
    return emoji_list


# Creates a list of all the words in the message
def split_in_words(text):
    
    #remove non alphabetical
    text = re.sub("'", " ", text)
    URLPATTERN = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    text = re.sub(URLPATTERN, "", text)
    regex = re.compile('[^a-zA-Z àèéìòù]')
    text = regex.sub('', text)
    text = re.sub(' +', ' ', text)
    
    #all in lowercase
    text = text.lower()

    #create and populate list
    words_list = text.split(" ")
    while("" in words_list) :
        words_list.remove("")

    return words_list


# Takes a text and returns a list of words
def split_in_words_by_lib(text):
   
    langdata = simplemma.load_data('it')
    return simplemma.text_lemmatizer(text, langdata)


# Takes a dataframe as input and returns another, stripped of omitted media (for english and italian), as output
def del_omitted_media(df):
    df = df[df["Message"] != "<Media omitted>"]
    df = df[df["Message"] != "<Media omessi>"]
    return df

Defining which files to read
and printing out some data as proof of concept

In [None]:
all_df = []
for chat in android_chats_to_analyse:
    all_df.append(expand_df(parse_wa_chat_android(chat)))
for chat in ios_chats_to_analyse:
    all_df.append(expand_df(parse_wa_chat_ios(chat)))

for i in range(3):
    display(all_df[i].head())

*Removed above output for anonymity and data privacy reasons, as it showed some message content.*

#### Creation of the dataset
that contains all the messages of all the chats

In [None]:
chats = []
for chat in all_df:
    chats.append(chat)
    
dataset = pd.concat(chats, ignore_index=True)
dataset

*Removed above output for anonymity and data privacy reasons, as it showed some message content.*

## 4. Preprocessing

At this point we have a large dataframe with all the messages of our dataset.   
We need to organise all of them in a way useful for our analysis.

#### delete the contact with Author = ''
in the groups from ios chats probably there is a strange exportation of the first line (where the group is created) and create a Author with name ''   
we have to exclude this fake-author from the statistic computation

In [9]:
dataset = dataset.drop(dataset[dataset['Author'] == ''].index)

#### Creation of a dataframe that contains in each line the data of one person
* **Author**: an _unique identifier_ of each person, it works as an index in this case, it is a code that contains the crypted name of a person and his age
* **age**: it is _continuous numerical variable_ that directly records a person's age
* **age-group**: it is _an ordinal categorical variable_ dividing the population into eight groups: 15-20 years, 21-25 years, 26-30 years, 31-35 years, 36-40 years, 41-45 years, 46-50 years, 50+ years
* number of total **letters sent**: it is a _continuous numerical values_ that indicates the total amount of letters sent    
* number of total **words sent**: it is a _continuous numerical values_ that indicates the total amount of words sent    
* number of total **urls sent**: it is a _continuous numerical values_ that indicates the total amount of urls sent   
* number of total **media sent**: it is a _continuous numerical values_ that indicates the total amount of medias sent   
* number of total **emojis sent**: it is a _continuous numerical values_ that indicates the total amount of emojis sent   
        
* number of total **messages sent**: it is a _continuous numerical values_ that indicates the total amount of messages sent   
       
* ***average of letters for message***: it is a _continuous numerical values_ that indicates the average of letters in the messages sent   
* ***average of words for message***: it is a _continuous numerical values_ that indicates the average of words in the messages sent   
* ***average of emojis for message***: it is a _continuous numerical values_ that indicates the average of emojis in the messages sent

In [10]:
# create an array with unique name of Authors
def array_of_people(dataset):
    people = dataset['Author'].unique()
    return people

In [11]:
# counter of the emoji in each line
# because in the dataset each message contains a list of emojis,
# that could be empty
def count_len(list_of_lists):
    counter = 0
    for l in list_of_lists:
        counter += len(l)
    return counter

# method that obtain the age from the name of the Author
# it uses the special saving mode to extrapolate the age from the last 2 letters of the name
def obtain_age(string_code_name):
    number = string_code_name[-2:]
    return int(number)

In [12]:
# creation of the dataframe where each line contains the informations of one author
def create_dataframe_groupbyAuthor(dataset):

    # initialisation with n_letters, n_word, n_url, Media_Counter, MessageCount
    dataframe = dataset.groupby("Author")[["n_letters", "n_word", "n_Url", "Media_Count", "MessageCount"]].agg(sum)
    
    # create and insert the column age in the correct position
    dataframe.insert(0, "age", dataframe.index)
    dataframe["age"] = dataframe["age"].apply(obtain_age)
    
    # create and insert the column of 'age_group'
    bins = [15, 20, 25, 30, 35, 40, 45, 50, float('inf')]
    labels = ['15-20', '21-25', '26-30', '31-35', '36-40', '41-45', '46-50', '50+']
    age_label = pd.cut(dataframe['age'], bins=bins, labels=labels)
    dataframe.insert(1, "age_group", age_label)
    
    # create a dataframe with all the emoji's lists
    n_emojis = dataset.groupby("Author")[["emoji"]].agg(list)
    n_emojis['emoji'] = n_emojis['emoji'].apply(count_len)
    # insert the column in the correct position
    # it is a list of lists that contains emojis
    dataframe.insert(4, "n_emoji", dataset.groupby("Author")[["emoji"]].agg(list))

    # modify the dataframe applying to the column the aggregation made
    dataframe['n_emoji'] = dataframe['n_emoji'].apply(count_len)

    # add the mean of letters used for message
    dataframe['mean_letters_x_mess'] = round(dataframe['n_letters'] / dataframe['MessageCount'], 2)

    # add the mean of words used for message
    dataframe['mean_words_x_mess'] = round(dataframe['n_word'] / dataframe['MessageCount'], 2)

    # add the mean of emojis used for message
    dataframe['mean_emojis_x_mess'] = round(dataframe['n_emoji'] / dataframe['MessageCount'], 2)


    return dataframe

In [13]:
len(array_of_people(dataset))

460

In [14]:
# creation of the dataframe
dataframe = create_dataframe_groupbyAuthor(dataset)

  dataframe = dataset.groupby("Author")[["n_letters", "n_word", "n_Url", "Media_Count", "MessageCount"]].agg(sum)


#### delete the people with too small number of messages
we choose to accept the authors that have written more than 30 messages

In [15]:
# number of total lines in the initial dataframe
dataframe.count()[0]

  dataframe.count()[0]


460

In [16]:
# define the number of messages that a person need at least to be considered in the statistics
min_messages = 25

In [17]:
dataframe = dataframe[dataframe["MessageCount"]>=min_messages]

#### split the dataframe in two dataframes (over and under 30)
creation of two differents dataframes only with the people who have enough number of messages

In [18]:
# number of total lines in the initial dataframe
dataframe.count()[0]


  dataframe.count()[0]


191

In [19]:
# dataframe for over 30s
df_over_30 = dataframe[(dataframe["age"]>=30) & (dataframe["MessageCount"]>=min_messages)]
print(df_over_30.count()[0])

55


  print(df_over_30.count()[0])


In [20]:
df_over_30.describe()

Unnamed: 0,age,n_letters,n_word,n_emoji,n_Url,Media_Count,MessageCount,mean_letters_x_mess,mean_words_x_mess,mean_emojis_x_mess
count,55.0,55.0,55.0,55.0,55.0,55.0,55.0,55.0,55.0,55.0
mean,42.327273,9880.509091,1932.327273,32.472727,3.090909,36.472727,167.127273,69.923273,13.219455,0.322909
std,7.198391,16360.187475,3372.212352,59.392871,6.441396,79.885996,295.096912,40.968667,7.542703,0.427532
min,30.0,890.0,173.0,0.0,0.0,0.0,25.0,8.73,1.98,0.0
25%,37.5,1765.0,340.0,4.0,0.0,3.0,34.5,37.385,7.48,0.065
50%,42.0,3918.0,744.0,11.0,1.0,7.0,62.0,65.69,12.35,0.22
75%,46.5,9301.0,1930.0,37.0,4.0,27.0,143.5,91.315,17.22,0.44
max,63.0,91943.0,18809.0,352.0,43.0,508.0,1780.0,207.19,39.73,2.71


In [21]:
# dataframe for under 30s
df_under_30 = dataframe[(dataframe["age"]<30) & (dataframe["MessageCount"]>=min_messages)]
print(df_under_30.count()[0])

136


  print(df_under_30.count()[0])


In [22]:
df_under_30.describe()

Unnamed: 0,age,n_letters,n_word,n_emoji,n_Url,Media_Count,MessageCount,mean_letters_x_mess,mean_words_x_mess,mean_emojis_x_mess
count,136.0,136.0,136.0,136.0,136.0,136.0,136.0,136.0,136.0,136.0
mean,21.117647,29443.79,5983.610294,84.926471,9.639706,82.882353,799.139706,36.434412,7.521176,0.075588
std,2.322383,120182.4,23791.42621,424.858349,55.788048,292.859771,2803.645381,17.993952,3.208407,0.169953
min,16.0,499.0,117.0,0.0,0.0,0.0,25.0,13.78,3.18,0.0
25%,20.0,1767.25,360.25,0.0,0.0,3.0,55.0,23.9925,5.2775,0.0
50%,21.0,4403.0,883.0,0.0,1.0,9.0,116.5,32.47,6.95,0.0
75%,21.0,9918.5,2217.5,5.25,3.0,25.75,298.5,41.98,8.4425,0.05
max,29.0,1030826.0,201162.0,3398.0,627.0,1882.0,24675.0,113.98,21.62,1.15


#### creation of a dataset where each line is one of the group-age
* index is the **age-group**: the indexing variable that is _an ordinal categorical variable_ dividing the population into eight groups: 15-20 years, 21-25 years, 26-30 years, 31-35 years, 36-40 years, 41-45 years, 46-50 years, 50+ years
    
* number of **people**: a _continuous numerical variable_ that indicates the sample size of each group of age
* number of total **letters sent**:  a _continuous numerical values_ that indicates the total amount of letters sent by the group    
* number of total **words sent**:  a _continuous numerical values_ that indicates the total amount of words sent by the group   
* number of total **urls sent**:  a _continuous numerical values_ that indicates the total amount of urls sent by the group   
* number of total **media sent**:  a _continuous numerical values_ that indicates the total amount of media sent by the group   
* number of total **emojis sent**:  a _continuous numerical values_ that indicates the total amount of emojis sent by the group   
        
* number of total **messages sent**:  a _continuous numerical values_ that indicates the total amount of messages sent by the group   
       
* ***average of letters for message***: a _continuous numerical values_ that indicates the average of letters in the messages sent by the group    
* ***average of words for message***: a _continuous numerical values_ that indicates the average of words in the messages sent by the group   
* ***average of emojis for message***: a _continuous numerical values_ that indicates the emojis of letters in the messages sent by the group

In [23]:
# creation of the dataframe where each line contains the informations of one author
def create_dataframe_groupbyAgeGroup(dataset):

    # initialisation with n_letters, n_word, n_url, Media_Counter, MessageCount
    dataframe = dataset.groupby("age_group")[["n_letters", "n_word", "n_Url", "Media_Count", "n_emoji", "MessageCount"]].agg(sum)

    # add the mean of letters used for message
    dataframe['mean_letters_x_mess'] = round(dataframe['n_letters'] / dataframe['MessageCount'], 2)

    # add the mean of words used for message
    dataframe['mean_words_x_mess'] = round(dataframe['n_word'] / dataframe['MessageCount'], 2)

    # add the mean of emojis used for message
    dataframe['mean_emojis_x_mess'] = round(dataframe['n_emoji'] / dataframe['MessageCount'], 2)
    
    # create and insert the column people in the correct position
    # create a serie with a list of information for all group of age
    # and after applies the len on one column --> to obtain the number of people
    n_people = dataset.groupby("age_group").agg(list)["n_letters"].apply(len)
    dataframe.insert(0, "n_people", n_people)

    return dataframe

In [24]:
df_age = create_dataframe_groupbyAgeGroup(dataframe)
df_age

  dataframe = dataset.groupby("age_group")[["n_letters", "n_word", "n_Url", "Media_Count", "n_emoji", "MessageCount"]].agg(sum)
  dataframe = dataset.groupby("age_group")[["n_letters", "n_word", "n_Url", "Media_Count", "n_emoji", "MessageCount"]].agg(sum)
  n_people = dataset.groupby("age_group").agg(list)["n_letters"].apply(len)


Unnamed: 0_level_0,n_people,n_letters,n_word,n_Url,Media_Count,n_emoji,MessageCount,mean_letters_x_mess,mean_words_x_mess,mean_emojis_x_mess
age_group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
15-20,58,572276.0,119591.0,769,2666,10,17723,32.29,6.75,0.0
21-25,67,3376660.0,683010.0,518,8474,11338,89201,37.85,7.66,0.13
26-30,13,57648.0,11627.0,25,132,202,1831,31.48,6.35,0.11
31-35,7,70842.0,14108.0,13,302,432,2079,34.08,6.79,0.21
36-40,13,99423.0,19118.0,32,902,657,2262,43.95,8.45,0.29
41-45,17,103752.0,18987.0,49,224,510,1237,83.87,15.35,0.41
46-50,9,139038.0,28118.0,12,304,104,1636,84.99,17.19,0.06
50+,7,128144.0,25490.0,63,274,83,1906,67.23,13.37,0.04


### Export of the dataset

In [25]:
dataframe = dataframe.copy()
dataframe.sort_values(by=['age'], inplace= True)

df_under_30 = df_under_30.copy()
df_under_30.sort_values(by=['age'], inplace= True)

df_over_30 = df_over_30.copy()
df_over_30.sort_values(by=['age'], inplace= True)

In [26]:
if not os.path.exists("dataset"):
    os.mkdir('dataset')

In [27]:
dataframe.to_csv('dataset/complete_dataframe.csv')
df_age.to_csv('dataset/age_data.csv')
df_over_30.to_csv('dataset/over30_data.csv')
df_under_30.to_csv('dataset/under30_data.csv')