**Imports and Functions**

**Deal With The Devil First**

I was worried that my friends' names would be displayed on graph axes exactly as I had saved them in my contacts, so I modified the raw data file so that their names would show up more formally. The following script produces the list of names as I originally saved them.


In [6]:
import re
import pandas as pd

# Step 1: Read the chat file
file_path = "_chat.txt" 
with open(file_path, "r", encoding="utf-8") as file:
    chat_data = file.readlines()

# Step 2: Define regex pattern for parsing
message_pattern = r"^\[(\d{1,2}\.\d{1,2}\.\d{4}, \d{1,2}:\d{2}:\d{2})\] ([^:]+): (.*)"
system_message_pattern = r"^\[(\d{1,2}\.\d{1,2}\.\d{4}, \d{1,2}:\d{2}:\d{2})\] ([^\n]+)$"

# Step 3: Create lists to hold structured data
timestamps = []
senders = []
messages = []
message_types = []  # To differentiate between text and media/system messages

# Step 4: Parse chat lines
for line in chat_data:
    message_match = re.match(message_pattern, line)
    system_message_match = re.match(system_message_pattern, line)
    
    if message_match:
        timestamps.append(message_match.group(1))
        senders.append(message_match.group(2))
        messages.append(message_match.group(3))
        if "<attached:" in message_match.group(3) or "<image omitted>" in message_match.group(3):
            message_types.append("Media")
        else:
            message_types.append("Text")
    elif system_message_match:
        timestamps.append(system_message_match.group(1))
        senders.append(None)  # System messages have no sender
        messages.append(system_message_match.group(2))
        message_types.append("System")

# Step 5: Create a DataFrame
df = pd.DataFrame({
    "Timestamp": timestamps,
    "Sender": senders,
    "Message": messages,
    "Type": message_types
})

# Convert Timestamp to datetime for further analysis
df["Timestamp"] = pd.to_datetime(df["Timestamp"], format="%d.%m.%Y, %H:%M:%S")

# Step 6: Extract unique senders and save the list
unique_senders = df["Sender"].dropna().unique()

# Display the unique senders
print("Unique senders in the chat:")
print(unique_senders)

# Save the unique senders to a file for your reference
pd.DataFrame(unique_senders, columns=["Sender"]).to_csv("unique_senders.csv", index=False)

# Function to replace names
def replace_senders(dataframe, name_mapping):
    dataframe["Sender"] = dataframe["Sender"].replace(name_mapping)
    return dataframe

# Save the structured data for reuse
df.to_csv("structured_chat_data.csv", index=False)
print("Chat data has been structured and saved to 'structured_chat_data.csv'.")


Unique senders in the chat:
['sohbet' 'Kocamm❣️🙈🐣' 'Benim Manit🫠💜yeni' 'Ponçik poğaçam' 'Sarı Çiyan'
 'Isırmalık Turşu🥹❤️' 'Azra Gülboo' 'Duygu Akar']
Chat data has been structured and saved to 'structured_chat_data.csv'.


In [10]:
# Step 7: Define the name mapping
name_mapping = {
    "sohbet": "sohbet",
    "Kocamm❣️🙈🐣": "Gülbin",
    "Benim Manit🫠💜yeni": "Defne",
    "Ponçik poğaçam": "Eslem",
    "Sarı Çiyan": "Zeynep",
    "Isırmalık Turşu🥹❤️": "Duru",
    "Azra Gülboo": "Azra",
    "Duygu Akar": "Ben",
}

# Apply the replacements
df = replace_senders(df, name_mapping)

# Save the updated DataFrame to a new CSV file
df.to_csv("structured_chat_data_with_replacements.csv", index=False)

print("Sender names have been replaced and saved to 'structured_chat_data_with_replacements.csv'.")


Sender names have been replaced and saved to 'structured_chat_data_with_replacements.csv'.
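
Because `replace` leaves any value without a mapping entry untouched, it can help to check, before applying the replacements, that every sender in the chat is covered. The following is a minimal sketch on made-up data; the helper `find_unmapped_senders` and the toy names are hypothetical and not part of the original notebook.

```python
import pandas as pd

def find_unmapped_senders(dataframe, name_mapping):
    """Return sender names that have no entry in name_mapping (hypothetical helper)."""
    present = set(dataframe["Sender"].dropna().unique())
    return sorted(present - set(name_mapping))

# Toy data standing in for the parsed chat (illustrative only)
toy_df = pd.DataFrame({"Sender": ["Duygu Akar", "Azra Gülboo", None, "Someone New"]})
toy_mapping = {"Duygu Akar": "Ben", "Azra Gülboo": "Azra"}

unmapped = find_unmapped_senders(toy_df, toy_mapping)
print(unmapped)
```

An empty result would mean that every saved name gets replaced; any name that is listed would otherwise still appear unchanged on the graphs.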


Now that I was free of embarrassment, I moved on to parsing the file, building up the structured format step by step.

**Regex Parsing**

Regex (regular expression) parsing is a technique for extracting specific patterns of text from unstructured or semi-structured data. In this project, regex parsing lets us identify and extract key elements such as:
    Timestamp: The date and time when a message was sent.
    Sender: The person who sent the message.
    Message Content: The actual message text or a media placeholder.

The regex patterns used in this context are as follows:


Pattern 1: "^\[(\d{1,2}\.\d{1,2}\.\d{4}, \d{1,2}:\d{2}:\d{2})\] ([^:]+): (.*)"

1) \[(\d{1,2}\.\d{1,2}\.\d{4}, \d{1,2}:\d{2}:\d{2})\]: Captures the timestamp inside square brackets.
2) ([^:]+): Captures the sender's name (text before the colon).
3) (.*): Captures the message content.
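
To make the three capture groups concrete, here is a small sketch that applies the pattern to a single line. The sample line is made up to follow the same layout as the export; it is not taken from the real chat.

```python
import re

message_pattern = r"^\[(\d{1,2}\.\d{1,2}\.\d{4}, \d{1,2}:\d{2}:\d{2})\] ([^:]+): (.*)"

# Made-up line in the same layout as the export
sample_line = "[5.3.2023, 14:07:09] Azra: See you at 18:30!"

match = re.match(message_pattern, sample_line)
timestamp, sender, content = match.groups()

print(timestamp)  # group 1: the text between the square brackets
print(sender)     # group 2: everything up to the first colon after the timestamp
print(content)    # group 3: the rest of the line
```

Since the sender group stops at the first colon, a colon inside the message itself (as in "18:30") stays in the message content.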



In [17]:
import re

# Read chat file
file_path = "_chat.txt"  # Replace with your file path
with open(file_path, "r", encoding="utf-8") as file:
    chat_data = file.readlines()

# Define regex patterns
message_pattern = r"^\[(\d{1,2}\.\d{1,2}\.\d{4}, \d{1,2}:\d{2}:\d{2})\] ([^:]+): (.*)"
system_message_pattern = r"^\[(\d{1,2}\.\d{1,2}\.\d{4}, \d{1,2}:\d{2}:\d{2})\] ([^\n]+)$"

# Lists to hold parsed components
timestamps = []
senders = []
messages = []
message_types = []

# Parse each line in the chat file
for line in chat_data:
    message_match = re.match(message_pattern, line)
    system_message_match = re.match(system_message_pattern, line)

    if message_match:
        timestamps.append(message_match.group(1))
        senders.append(message_match.group(2))
        messages.append(message_match.group(3))
        message_types.append("Text")
    elif system_message_match:
        timestamps.append(system_message_match.group(1))
        senders.append(None)  # System messages have no sender
        messages.append(system_message_match.group(2))
        message_types.append("System")


The above code:
    Matches each line against the two regex patterns:
        message_pattern: Extracts regular messages.
        system_message_pattern: Extracts system messages.
    Appends the parsed components (timestamp, sender, message, type) to separate lists for further processing.
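
Lines that match neither pattern are skipped by the loop above. If the export contains messages that span several lines (an assumption about the file, not something shown in this notebook), everything after the first line of such a message would be lost. One possible way to keep it is to append unmatched lines to the previous message. The helper `parse_with_continuations` below is hypothetical, uses made-up sample lines, and leaves out system messages for brevity.

```python
import re

message_pattern = r"^\[(\d{1,2}\.\d{1,2}\.\d{4}, \d{1,2}:\d{2}:\d{2})\] ([^:]+): (.*)"

def parse_with_continuations(lines):
    """Parse user messages, appending unmatched lines to the previous message."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        match = re.match(message_pattern, line)
        if match:
            timestamp, sender, content = match.groups()
            records.append({"Timestamp": timestamp, "Sender": sender, "Message": content})
        elif records:
            # No timestamp at the start: treat the line as part of the last message
            records[-1]["Message"] += "\n" + line
    return records

# Made-up lines in the export's layout
sample_lines = [
    "[5.3.2023, 14:07:09] Azra: Shopping list:\n",
    "milk\n",
    "[5.3.2023, 14:08:00] Ben: ok\n",
]
parsed = parse_with_continuations(sample_lines)
print(len(parsed))
print(repr(parsed[0]["Message"]))
```

Whether this is needed depends on how often multi-line messages occur in the chat being analyzed.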


**Handling Media Messages**

In the unprocessed data, media files are not recognized. To differentiate between text and media, we handle the placeholders.

Media messages in the chat are denoted by placeholders such as:

    <attached: file-name>: Indicates a media file was shared (e.g., images, videos, stickers).
    <image omitted>: Represents omitted or inaccessible media.

These placeholders can be recognized using string patterns or regular expressions.

Earlier, every message type was identified as "Text"; we now check for the placeholders and identify those messages as "Media".
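
As an illustration of the regular-expression option, the check can be wrapped in a small function. This is only a sketch: `classify_content` is a hypothetical helper, the file name in the example is made up, and the pattern is slightly stricter than a plain substring test because it expects a closing `>` after the file name.

```python
import re

# Placeholders named in the text above; other exports may use different ones
media_placeholder = re.compile(r"<attached: [^>]+>|<image omitted>")

def classify_content(content):
    """Return "Media" if the content holds a known placeholder, else "Text"."""
    return "Media" if media_placeholder.search(content) else "Text"

print(classify_content("<attached: 00000012-PHOTO.jpg>"))
print(classify_content("<image omitted>"))
print(classify_content("see you tomorrow"))
```

A function like this keeps the classification rule in one place, so a new placeholder only has to be added to the pattern.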

In [None]:
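# Step 4: Parse each line in the chat file
# Reset the lists first so that running this cell after the previous one does not duplicate entries
timestamps, senders, messages, message_types = [], [], [], []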
for line in chat_data:
    message_match = re.match(message_pattern, line)
    system_message_match = re.match(system_message_pattern, line)

    if message_match:
        timestamps.append(message_match.group(1))
        senders.append(message_match.group(2))
        message_content = message_match.group(3)

        # Check for media placeholders
        if "<attached:" in message_content or "<image omitted>" in message_content:
            message_types.append("Media")
        else:
            message_types.append("Text")

        messages.append(message_content)

    elif system_message_match:
        timestamps.append(system_message_match.group(1))
        senders.append(None)  # System messages have no sender
        messages.append(system_message_match.group(2))
        message_types.append("System")

# Output disabled: printing every parsed line takes a long time; the final output will be displayed later.
# print("Parsed Messages with Media Handling:")
# for i in range(len(timestamps)):
#     print(f"Timestamp: {timestamps[i]}, Sender: {senders[i]}, Message: {messages[i]}, Type: {message_types[i]}")

**System Messages**

Now we need a way to separate the system messages. They are identified with a regex in a similar way, although they can also be spotted by eye from the following patterns.

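System messages:

    Lack a sender and the colon (:) that separates sender and message.
    Often describe actions like:
        Group creation: "sohbet: Group created"
        User addition/removal: "sohbet added [user]"
        Name/icon change: "sohbet changed group name to [name]"

Note that a line which does contain a colon after the group name, as in the group-creation example, is matched by the message pattern instead. This is why "sohbet" shows up in the list of unique senders above.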

The system message regex consists of two parts:

1) \[(\d{1,2}\.\d{1,2}\.\d{4}, \d{1,2}:\d{2}:\d{2})\]: Captures the timestamp inside square brackets. Matches dates in DD.MM.YYYY, HH:MM:SS format.
2) ([^\n]+): Captures the entire message after the timestamp (since there is no colon or sender).
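
One detail worth noting is that the system pattern is very permissive: it matches any line that starts with a timestamp, including ordinary user messages. The parsing loop therefore has to test the message pattern first and fall back to the system pattern only when that fails. The sketch below shows this on made-up lines that follow the export's layout.

```python
import re

message_pattern = r"^\[(\d{1,2}\.\d{1,2}\.\d{4}, \d{1,2}:\d{2}:\d{2})\] ([^:]+): (.*)"
system_message_pattern = r"^\[(\d{1,2}\.\d{1,2}\.\d{4}, \d{1,2}:\d{2}:\d{2})\] ([^\n]+)$"

# Made-up lines in the export's layout
user_line = "[5.3.2023, 14:07:09] Azra: hello"
system_line = "[1.1.2023, 10:00:00] sohbet added Azra"
colon_line = "[1.1.2023, 09:59:00] sohbet: Group created"

def matches(pattern, line):
    """Return True if the pattern matches at the start of the line."""
    return re.match(pattern, line) is not None

print(matches(message_pattern, user_line), matches(system_message_pattern, user_line))
print(matches(message_pattern, system_line), matches(system_message_pattern, system_line))

# A line with a colon after the group name is treated as a user message
colon_sender = re.match(message_pattern, colon_line).group(2)
print(colon_sender)
```

The user line matches both patterns, while the line without a colon matches only the system pattern, so swapping the order of the two checks would label every message as "System".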

In [None]:
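# Parse each line in the chat file
# Reset the lists first so that running this cell after the previous ones does not duplicate entries
timestamps, senders, messages, message_types = [], [], [], []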
for line in chat_data:
    message_match = re.match(message_pattern, line)
    system_message_match = re.match(system_message_pattern, line)

    if message_match:
        timestamps.append(message_match.group(1))
        senders.append(message_match.group(2))
        message_content = message_match.group(3)

        # Check for media placeholders
        if "<attached:" in message_content or "<image omitted>" in message_content:
            message_types.append("Media")
        else:
            message_types.append("Text")

        messages.append(message_content)

    elif system_message_match:
        timestamps.append(system_message_match.group(1))
        senders.append("System")  # Explicitly mark as system
        messages.append(system_message_match.group(2))
        message_types.append("System")


In [None]:
# Output parsed results for verification
print("Parsed System Messages:")
for i in range(len(timestamps)):
    if message_types[i] == "System":
        print(f"Timestamp: {timestamps[i]}, Message: {messages[i]}, Type: {message_types[i]}")


**Data Frame Creation**

Now we organize the data into a structured DataFrame format for further analysis. Each message becomes a row and each parsed component becomes a column, which prepares the data for visual analysis.

Here, the columns are:
    Timestamp: Contains the date and time for each message.
    Sender: Indicates the sender (or "System" for system messages).
    Message: The text content of the message or a media placeholder.
    Type: Specifies if the message is "Text", "Media", or "System".

Then, two key steps take place:
    Data is stored as a Python dictionary with column names as keys and lists of values as values.
    This dictionary is passed to pd.DataFrame() to create the tabular structure.



In [None]:
# Step 5: Create a DataFrame
data = {
    "Timestamp": timestamps,
    "Sender": senders,
    "Message": messages,
    "Type": message_types,
}

df = pd.DataFrame(data)

# Step 6: Convert Timestamp to datetime
df["Timestamp"] = pd.to_datetime(df["Timestamp"], format="%d.%m.%Y, %H:%M:%S")

# Save the DataFrame to a CSV file
df.to_csv("final_preprocessed_data.csv", index=False)
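
A point to keep in mind when the saved file is loaded again later: CSV stores the timestamps as plain text, so the datetime type is not restored unless it is requested. The sketch below demonstrates this with a one-row toy frame (made-up values) and an in-memory buffer in place of a real file.

```python
import io
import pandas as pd

# Toy frame in the same shape as the parsed chat (values are made up)
toy_df = pd.DataFrame({
    "Timestamp": pd.to_datetime(["5.3.2023, 14:07:09"], format="%d.%m.%Y, %H:%M:%S"),
    "Sender": ["Azra"],
    "Message": ["hello"],
    "Type": ["Text"],
})

# Write to an in-memory buffer to keep the sketch self-contained
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)  # Timestamp is read back as text
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["Timestamp"])  # Timestamp parsed as datetime

print(plain["Timestamp"].dtype)
print(restored["Timestamp"].dtype)
```

Passing `parse_dates=["Timestamp"]` when reading the saved CSV avoids having to repeat the conversion step by hand.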


**Finally! The Final Re-Structured Data**

The full code to parse the chat, classify messages (e.g., "Text", "Media", "System"), and save the resulting DataFrame to a CSV file is as follows:

In [52]:
import re
import pandas as pd

# Step 1: Read the chat file
file_path = "_chat.txt"  # Replace with your file path
with open(file_path, "r", encoding="utf-8") as file:
    chat_data = file.readlines()

# Step 2: Define regex patterns for parsing
message_pattern = r"^\[(\d{1,2}\.\d{1,2}\.\d{4}, \d{1,2}:\d{2}:\d{2})\] ([^:]+): (.*)"
system_message_pattern = r"^\[(\d{1,2}\.\d{1,2}\.\d{4}, \d{1,2}:\d{2}:\d{2})\] ([^\n]+)$"

# Step 3: Create lists to hold structured data
timestamps = []
senders = []
messages = []
message_types = []  # To differentiate between text, media, and system messages

# Step 4: Parse chat lines
for line in chat_data:
    # Match user messages
    message_match = re.match(message_pattern, line)
    # Match system messages (e.g., group creation or user addition)
    system_message_match = re.match(system_message_pattern, line)
    
    if message_match:
        # Extract timestamp, sender, and message content
        timestamps.append(message_match.group(1))
        senders.append(message_match.group(2))
        message_content = message_match.group(3)
        messages.append(message_content)
        
        # Check if the message contains media placeholders
        if "<attached:" in message_content or "<image omitted>" in message_content:
            message_types.append("Media")
        else:
            message_types.append("Text")
    
    elif system_message_match:
        # Extract timestamp and system message content
        timestamps.append(system_message_match.group(1))
        senders.append("System")  # Explicitly mark as system
        messages.append(system_message_match.group(2))
        message_types.append("System")

# Step 5: Define name mapping to replace sender names with aliases
name_mapping = {
    "sohbet": "System",
    "Kocamm❣️🙈🐣": "Gülbin",
    "Benim Manit🫠💜yeni": "Defne",
    "Ponçik poğaçam": "Eslem",
    "Sarı Çiyan": "Zeynep",
    "Isırmalık Turşu🥹❤️": "Duru",
    "Azra Gülboo": "Azra",
    "Duygu Akar": "Ben",
}

# Replace sender names with mapped names
mapped_senders = [name_mapping.get(sender, sender) for sender in senders]

# Step 6: Create a DataFrame
df = pd.DataFrame({
    "Timestamp": timestamps,
    "Sender": mapped_senders,  # Use mapped sender names
    "Message": messages,
    "Type": message_types
})

# Convert Timestamp to datetime for better analysis
df["Timestamp"] = pd.to_datetime(df["Timestamp"], format="%d.%m.%Y, %H:%M:%S")

# Step 7: Save the structured data for reuse
df.to_csv("final_chat_data.csv", index=False)

print("Chat data has been structured and saved to 'final_chat_data.csv'.")


Chat data has been structured and saved to 'final_chat_data.csv'.
