# Convert WhatsApp .txt exports to the .csv format

The code that analyses the conversation reads the .csv format, so the .txt file exported by WhatsApp first has to be converted to .csv. This code does that conversion. Because WhatsApp uses different export formats depending on the language and version of the app, the code asks you which format your file uses. Only two formats are currently supported, but adapting the code to another format should not be difficult.

The .csv must use ";" as separator and contain the following columns:
- UserName: name of the sender of the message (str)
- MessageBody: content of the message (str)
- Date2: date on which the message was sent (datetime, format YYYY-MM-DD)
- Time: time of day at which the message was sent (timedelta)
- Date1: the date for the first message of each day, NaN for the other messages (str, the whole column can be NaN)
- UserPhone: phone number of the user (str, can be NaN)
- MediaType: type of media sent with the message (str, can be NaN)
- MediaLink: link to this media (str, can be NaN)
- QuotedMessage: content of the quoted message, if any (str, can be NaN)
- QuotedMessageDate: date on which the quoted message was sent (str, can be NaN)
- QuotedMessageTime: time at which the quoted message was sent (str, can be NaN)
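
As a quick sanity check, the column list above can be turned into a small validation helper. This is only a sketch: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced for illustration and are not part of the original code.

```python
import pandas as pd

# Hypothetical constant: the columns listed above, in the same order
EXPECTED_COLUMNS = [
    "UserName", "MessageBody", "Date2", "Time", "Date1", "UserPhone",
    "MediaType", "MediaLink", "QuotedMessage", "QuotedMessageDate",
    "QuotedMessageTime",
]

def missing_columns(df):
    """Return the expected columns that are absent from df."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]

# Toy frame that only has two of the expected columns
example = pd.DataFrame({"UserName": ["User1"], "MessageBody": ["Oki"]})
print(missing_columns(example))
```

An empty list means that every expected column is present.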

## 0. General settings

In [1]:
# Parameters
FOLDER_NAME = "Example" # name of the folder containing the conversation
FILE_NAME = "chat_en.txt" # name of the file containing the discussion

In [2]:
# Select message pattern
pattern = int(input("Open your .txt file and look at the structure of the messages.\n\n\
Do your messages look like '29/05/2017, 12:00 - Ivan-Daniel Sievering: Oki' (1),\n\
like '29.05.17 à 12:00 - Ivan-Daniel Sievering: Oki' (2),\n\
or something else (3)?\n\
\nPlease type the number corresponding to your format: "))

if pattern == 1:
    # 29/05/2017, 12:00 - Ivan-Daniel Sievering: Oki (most likely if your WhatsApp is in English)
    normal_msg_pattern = r"[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9], [0-9][0-9]:[0-9][0-9] - +"
    date_hour_sep = ", "
    date_format = '%d/%m/%Y'
elif pattern == 2:
    # 29.05.17 à 12:00 - Ivan-Daniel Sievering: Oki (most likely if your WhatsApp is in French)
    # The dots are escaped so that they only match a literal "."
    normal_msg_pattern = r"[0-9][0-9]\.[0-9][0-9]\.[0-9][0-9] à [0-9][0-9]:[0-9][0-9] - +"
    date_hour_sep = " à "
    date_format = '%d.%m.%y'
else:
    # Stop here: the rest of the code needs one of the two supported formats
    raise ValueError("For now, the code only supports these two formats, feel free to adapt the code for your usage!")

Open your .txt file and look at the structure of the messages.

Do your messages look like '29/05/2017, 12:00 - Ivan-Daniel Sievering: Oki' (1),
like '29.05.17 à 12:00 - Ivan-Daniel Sievering: Oki' (2),
or something else (3)?

Please type the number corresponding to your format: 1
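
Instead of asking the user, the format could also be guessed from the file itself. The sketch below is one possible approach and not part of the original code: `KNOWN_PATTERNS` and `guess_pattern` are hypothetical names, and the patterns simply restate the two supported formats.

```python
import re

# Hypothetical lookup table: format number -> pattern matching the start of a message
KNOWN_PATTERNS = {
    1: r"[0-9]{2}/[0-9]{2}/[0-9]{4}, [0-9]{2}:[0-9]{2} - ",
    2: r"[0-9]{2}\.[0-9]{2}\.[0-9]{2} à [0-9]{2}:[0-9]{2} - ",
}

def guess_pattern(lines):
    """Return the number of the first known format matching any line, or 3 for "other"."""
    for line in lines:
        for number, pattern in KNOWN_PATTERNS.items():
            if re.match(pattern, line):
                return number
    return 3

print(guess_pattern(["29/05/2017, 12:00 - Ivan-Daniel Sievering: Oki"]))  # 1
print(guess_pattern(["29.05.17 à 12:00 - Ivan-Daniel Sievering: Oki"]))   # 2
```

Passing the first few lines of the .txt file should be enough in most cases, since every message starts with the same date layout.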


In [3]:
# Import libraries
import os
import pandas as pd
import re

## 1. Adapt the .txt file

Messages that span several lines would break the creation of the df, because each line would end up on its own row. The content of each message is therefore rewritten onto a single row.

In [4]:
# Load the file (the with-block closes it once the lines are read)
with open(os.path.join(FOLDER_NAME, FILE_NAME), "r", encoding="utf8") as my_file:
    content_list = my_file.readlines()

# Compile the pattern used to recognise the first line of a message
regex = re.compile(normal_msg_pattern)

# Append each continuation line to the message it belongs to
# !! not optimal, may be optimised !!
last_match_idx = 0
for i, line in enumerate(content_list):
    if not regex.match(line):
        # not the start of a message: add the content to the last normal message
        # (the trailing "\n" is removed to avoid issues when creating the df)
        previous_line = content_list[last_match_idx].rstrip("\n")
        content_list[last_match_idx] = previous_line + ". " + line # add a "." to keep the notion of sentence
    else:
        last_match_idx = i

# Keep only the lines that are the beginning of a message or a whole message
# !! not optimal, may be optimised !!
content_list_cleaned = [s for s in content_list if regex.match(s)]

# Save the new file (so that this operation only has to be done once)
with open(os.path.join(FOLDER_NAME, 'chat_cleaned.txt'), 'w', encoding="utf8") as f:
    f.writelines(content_list_cleaned)
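
The merging step can be tried on a few lines without touching any file. The sketch below is a simplified variant rather than the code above: it builds a new list in a single pass, returns the lines without their trailing newline, and `merge_continuation_lines`, `sample` and `example_pattern` are hypothetical names.

```python
import re

def merge_continuation_lines(lines, pattern):
    """Join every line that does not start a message onto the previous message."""
    start_of_message = re.compile(pattern)
    merged = []
    for line in lines:
        text = line.rstrip("\n")
        if start_of_message.match(text):
            merged.append(text)           # a new message starts here
        elif merged:
            merged[-1] += ". " + text     # continuation of the previous message
        # lines before the first message are dropped
    return merged

sample = [
    "29/05/2017, 12:00 - User1: Hello\n",
    "how are you?\n",
    "29/05/2017, 12:01 - User2: Fine\n",
]
example_pattern = r"[0-9]{2}/[0-9]{2}/[0-9]{4}, [0-9]{2}:[0-9]{2} - "
print(merge_continuation_lines(sample, example_pattern))
```

The first two lines of `sample` end up in one entry, joined by ". " as in the original loop.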

## 2. Load the .txt as a df

In [5]:
df_txt = pd.read_csv(os.path.join(FOLDER_NAME, 'chat_cleaned.txt'), sep=":", header=None,
                     names=["Date_part1", "Date_part2", "MessageBody"],
                     dtype={'Date_part1': str, 'Date_part2': str, 'MessageBody': str}) # enforce str for flexible interpretation
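
Because ":" is used as the separator, the time is split between `Date_part1` and `Date_part2`, and a message body that itself contains ":" produces more fields than the three expected columns. One possible alternative, shown here only as a sketch for format (1), is to parse each line with a regular expression. `MESSAGE_RE` and `parse_line` are hypothetical names, and the second example line is a made-up system message.

```python
import re

# Hypothetical pattern for format (1): date, time, user name, then the message body
MESSAGE_RE = re.compile(
    r"(?P<date>[0-9]{2}/[0-9]{2}/[0-9]{4}), (?P<time>[0-9]{2}:[0-9]{2}) - "
    r"(?P<user>[^:]+): (?P<body>.*)"
)

def parse_line(line):
    """Return a dict with date, time, user and body, or None for non-message lines."""
    match = MESSAGE_RE.match(line.rstrip("\n"))
    return match.groupdict() if match else None

# A body containing ":" stays in one piece
print(parse_line("29/05/2017, 12:00 - User1: Meet at 18:30?"))
# A line without "user: body" is not a normal message
print(parse_line("29/05/2017, 12:00 - Messages are end-to-end encrypted."))
```

The resulting dicts could then be passed to `pd.DataFrame` to build the df directly.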

In [6]:
# Look for lines with issues and show them
df_txt_not_standard = df_txt[df_txt["MessageBody"].isnull()]
print("The following lines will be removed:")
print("If the lines refer to security, they are just information from WhatsApp and can be removed")
df_txt_not_standard

The following lines will be removed:
If the lines refer to security, they are just information from WhatsApp and can be removed


Unnamed: 0,Date_part1,Date_part2,MessageBody


In [7]:
# Remove them
df_txt = df_txt.dropna()
print("Number of rows removed: "+str(len(df_txt_not_standard)))

Number of rows removed: 0


In [8]:
df_txt.sample(5)

Unnamed: 0,Date_part1,Date_part2,MessageBody
10,"30/05/2017, 11",52 - User2,I am sad now
2,"29/05/2017, 12",52 - User2,How do you do ?
12,"02/06/2017, 11",53 - User2,I don't need to go there... I already found m...
13,"02/06/2017, 11",55 - User1,"Nah, I am too ugly for that. ."
8,"30/05/2017, 10",52 - User1,But you are cute


## 3. Adapt the df to our format

In [9]:
# Extract date and hour from the different columns
separate_date_hour_df = df_txt["Date_part1"].str.split(date_hour_sep, n=1, expand=True)
df_txt["Date2"] = separate_date_hour_df[0]
df_txt["Hour"] = separate_date_hour_df[1]
separate_user_minute_df = df_txt["Date_part2"].str.split(" - ", n=1, expand=True)
df_txt["UserName"] = separate_user_minute_df[1]
df_txt["Minute"] = separate_user_minute_df[0]
df_txt = df_txt.drop(columns=["Date_part1","Date_part2"])
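
To see what the two splits produce, here is a small self-contained sketch on a toy frame. The values are taken from the sample rows shown earlier, `toy` is a hypothetical name, and the separator ", " corresponds to format (1).

```python
import pandas as pd

# One row shaped like the output of read_csv with sep=":"
toy = pd.DataFrame({
    "Date_part1": ["29/05/2017, 12"],
    "Date_part2": ["52 - User2"],
})

# n=1 splits on the first occurrence only, expand=True returns one column per part
date_hour = toy["Date_part1"].str.split(", ", n=1, expand=True)
minute_user = toy["Date_part2"].str.split(" - ", n=1, expand=True)

print(date_hour.iloc[0].tolist())    # ['29/05/2017', '12']
print(minute_user.iloc[0].tolist())  # ['52', 'User2']
```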

In [10]:
# Convert the date to datetime, then to a string with the correct separators (YYYY-MM-DD)
df_txt["Date2"] = pd.to_datetime(df_txt["Date2"], format=date_format).dt.strftime('%Y-%m-%d')
# Merge the time components together
df_txt["Time"] = df_txt["Hour"] + ":" + df_txt["Minute"] + ":00"
# Drop columns that are now useless
df_txt = df_txt.drop(columns=["Hour", "Minute"])
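
The date and time conversions can be checked on a few toy values. This sketch uses format (2), with a two-digit year and dots as separators; `dates`, `hours` and `minutes` are hypothetical stand-ins for the df columns.

```python
import pandas as pd

# Toy values in format (2)
dates = pd.Series(["29.05.17", "02.06.17"])
hours = pd.Series(["12", "11"])
minutes = pd.Series(["00", "55"])

# Parse with the format of the export, then write the dates back as YYYY-MM-DD
iso_dates = pd.to_datetime(dates, format="%d.%m.%y").dt.strftime("%Y-%m-%d")
# Rebuild a HH:MM:SS string; the export has no seconds, so ":00" is appended
times = hours + ":" + minutes + ":00"

print(iso_dates.tolist())  # ['2017-05-29', '2017-06-02']
print(times.tolist())      # ['12:00:00', '11:55:00']
```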

In [11]:
# Add the missing columns to the df (filled with NaNs, these columns are not used later on)
df_txt["Date1"] = "NaN"
df_txt["UserPhone"] = "NaN"
df_txt["MediaType"] = "NaN"
df_txt["MediaLink"] = "NaN"
df_txt["QuotedMessage"] = "NaN"
df_txt["QuotedMessageDate"] = "NaN"
df_txt["QuotedMessageTime"] = "NaN"

# Convert the Time column to timedelta for further operations
df_txt['Time'] = pd.to_timedelta(df_txt['Time'])

# Remove leading and trailing spaces in the user names
df_txt["UserName"] = df_txt["UserName"].str.strip()

In [12]:
df_txt.sample(5)

Unnamed: 0,MessageBody,Date2,UserName,Time,Date1,UserPhone,MediaType,MediaLink,QuotedMessage,QuotedMessageDate,QuotedMessageTime
13,"Nah, I am too ugly for that. .",2017-06-02,User1,0 days 11:55:00,,,,,,,
10,I am sad now,2017-05-30,User2,0 days 11:52:00,,,,,,,
4,Hmmm... What?,2017-05-29,User1,0 days 13:22:00,,,,,,,
9,"Ow, I thought that you were single",2017-05-30,User2,0 days 11:52:00,,,,,,,
3,I want to say something important about us...,2017-05-29,User2,0 days 12:54:00,,,,,,,


## 4. Save the df to .csv

In [13]:
df_txt.to_csv(os.path.join(FOLDER_NAME,"data.csv"), sep=";")
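
When the file is read back, the separator has to be given again and the `Time` column comes back as text. The sketch below shows one way to reload it, using an in-memory buffer instead of a real file; `toy`, `buffer` and `reloaded` are hypothetical names, and the analysis code may load the file differently.

```python
import io
import pandas as pd

# Toy frame mimicking two of the saved columns
toy = pd.DataFrame({
    "UserName": ["User1"],
    "Time": pd.to_timedelta(["11:55:00"]),
})

# Write to an in-memory buffer with the same separator as above
buffer = io.StringIO()
toy.to_csv(buffer, sep=";")
buffer.seek(0)

# index_col=0 restores the index that to_csv wrote as the first, unnamed column
reloaded = pd.read_csv(buffer, sep=";", index_col=0)
# The timedelta values were stored as text, so they are converted back
reloaded["Time"] = pd.to_timedelta(reloaded["Time"])
print(reloaded.loc[0, "Time"])
```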

---------