### Data Preparation for WhatsApp Chat Finetuning

This notebook processes a WhatsApp chat export file to prepare it for finetuning a language model. The main steps include:

1. Reading the WhatsApp chat export file
2. Extracting and concatenating messages from specific users
3. Formatting the data into a structure suitable for model training
4. Saving the processed data in a format ready for finetuning

The goal is to create a dataset of conversations between two specific users, which will be used to train a model to mimic one user's communication style.
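
As a rough illustration of the target format, each line of the finished dataset is one JSON object holding a short exchange. The sketch below builds one such record by hand. The message texts and the `make_entry` helper are invented for illustration and are not part of the notebook.

```python
import json

def make_entry(prompt, reply):
    """Build one record: a prompt (user role) followed by the reply to imitate (assistant role)."""
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": reply},
        ]
    }

# Hypothetical exchange, invented for illustration
entry = make_entry("Are you coming tonight?", "Yes!\nSee you at 8")

# json.dumps escapes the newline inside the reply, so the record stays on one line
line = json.dumps(entry)
print(line)
```

Consecutive messages from the same person end up in a single `content` string separated by newlines, which is why the reply above contains `\n`.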

#### To use this notebook:
1. Replace 'Your Name' with your actual name as it appears in the WhatsApp chat
2. Replace 'Friend' with the name of the person you're chatting with
3. Ensure your WhatsApp chat export file is named 'messages.txt' and placed in the 'data/whatsapp/' directory
4. Run all cells in this notebook to process the data
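
Because the names must match the export exactly, it may help to list the speakers the parser would see before editing the variables. This is a minimal sketch, assuming the bracketed `[timestamp] Name: message` line layout that the parsing cell's regex expects. `count_speakers` and the sample lines are hypothetical.

```python
import re
from collections import Counter

def count_speakers(lines):
    """Count messages per speaker for lines shaped like '[timestamp] Name: message'."""
    counts = Counter()
    for line in lines:
        match = re.match(r'\[.*?\] (.*?): (.*)', line)
        if match:
            counts[match.group(1).strip()] += 1
    return counts

# Invented sample lines; a real run would pass the lines of messages.txt instead
sample = [
    "[01/02/2024, 10:15:32] Friend: Are you coming tonight?",
    "[01/02/2024, 10:16:01] Your Name: Yes!",
    "[01/02/2024, 10:16:05] Your Name: See you at 8",
    "a line without a timestamp is ignored",
]
speaker_counts = count_speakers(sample)
print(speaker_counts.most_common())
```

If a name does not show up in the counts, the export probably uses a different spelling or a different line layout than the regex assumes.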

In [None]:
# Import necessary libraries
import json
import re

# Variables
name = "Your Name"
friend_name = "Friend"

In [None]:
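# Read the chat export, merge consecutive messages from the same speaker, and convert them to JSONL

# WhatsApp exports are typically UTF-8; an explicit encoding keeps emoji and non-ASCII names intact on every platform
with open('data/whatsapp/messages.txt', 'r', encoding='utf-8') as file:
    text = file.read()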

# Extract the data from the text file
lines = text.split('\n')

# Initialize variables to hold concatenated messages and the previous speaker
user_messages = []
assistant_messages = []
previous_speaker = None

# Process each line
for line in lines:
    # Match "[timestamp] speaker: message"; the timestamp is matched but not captured
    match = re.match(r'\[.*?\] (.*?): (.*)', line)
    if match:
        speaker = match.group(1).strip()
        message = match.group(2).strip()

        # The system prompt added later says "You are {name}", so `name` is the
        # speaker the model imitates (assistant) and `friend_name` is the user
        if speaker.startswith(name):
            if previous_speaker == name:
                # Continue the assistant's message
                assistant_messages[-1] += "\n" + message
            elif previous_speaker == friend_name:
                # New assistant message replying to the friend's last turn
                assistant_messages.append(message)
                previous_speaker = name
            # Messages from `name` before the friend's first turn have no prompt and are skipped
        elif speaker.startswith(friend_name):
            if previous_speaker == friend_name:
                # Continue the user's message
                user_messages[-1] += "\n" + message
            else:
                # New user message
                user_messages.append(message)
                previous_speaker = friend_name

# Combine user and assistant messages into JSONL format
jsonl_output = []
for user_msg, assistant_msg in zip(user_messages, assistant_messages):
    entry = {
        "messages": [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": assistant_msg}
        ]
    }
    jsonl_output.append(json.dumps(entry))

# Write the JSONL entries to disk, one JSON object per line
with open('data/whatsapp/output.jsonl', 'w') as jsonl_file:
    for entry in jsonl_output:
        jsonl_file.write(entry + '\n')
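
The loop above ignores any line that does not start with a bracketed timestamp, so the second and later lines of a message that itself contains line breaks are dropped. One possible way to keep them is sketched below, under the assumption that such lines always belong to the most recent matched message. `parse_messages` is a hypothetical helper, not part of the notebook.

```python
import re

LINE_RE = re.compile(r'\[.*?\] (.*?): (.*)')

def parse_messages(lines):
    """Return (speaker, message) pairs, folding unmatched lines into the previous message."""
    parsed = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            parsed.append((match.group(1).strip(), match.group(2).strip()))
        elif parsed and line.strip():
            # No timestamp: treat the line as a continuation of the last message
            speaker, message = parsed[-1]
            parsed[-1] = (speaker, message + "\n" + line.strip())
    return parsed

# Invented sample with one multi-line message
sample = [
    "[01/02/2024, 10:15:32] Friend: Shopping list:",
    "- milk",
    "- eggs",
    "[01/02/2024, 10:16:01] Your Name: Got it",
]
parsed = parse_messages(sample)
print(parsed)
```

The resulting pairs could then be fed through the same speaker-merging logic as before, in place of the raw lines.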

In [None]:
# Load the existing JSONL data
with open('data/whatsapp/output.jsonl', 'r') as part1_file:
    lines = part1_file.readlines()

# Calculate the split boundaries: first 60% train, next 20% validation, final 20% test
train_split_index = int(len(lines) * 0.6)  # end of the training slice (60% mark)
valid_split_index = int(len(lines) * 0.8)  # end of the validation slice (80% mark); the rest is test

# Every entry gets the same system message prepended
system_message = {"role": "system", "content": f"You are {name}, a student at the University of Hong Kong."}

# Map each output file to its slice of the data
splits = {
    'data/whatsapp/train.jsonl': lines[:train_split_index],
    'data/whatsapp/valid.jsonl': lines[train_split_index:valid_split_index],
    'data/whatsapp/test.jsonl': lines[valid_split_index:],
}

# Write the training, validation, and testing data to their files
for path, split_lines in splits.items():
    with open(path, 'w') as split_file:
        for line in split_lines:
            # Add the system message to each line
            user_assistant_message = json.loads(line)
            user_assistant_message['messages'].insert(0, system_message)
            split_file.write(json.dumps(user_assistant_message) + '\n')
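
The split above relies on integer truncation, which behaves well for large files but can leave a split empty when the chat is short. The sketch below reproduces the same arithmetic so the sizes can be inspected before any files are written. `split_bounds` and `split_sizes` are hypothetical helpers, not part of the notebook.

```python
def split_bounds(n, train_end_frac=0.6, valid_end_frac=0.8):
    """Return (train_end, valid_end) slice boundaries for n entries, truncating like the cell above."""
    return int(n * train_end_frac), int(n * valid_end_frac)

def split_sizes(n):
    """Return the (train, valid, test) entry counts produced by those boundaries."""
    train_end, valid_end = split_bounds(n)
    return train_end, valid_end - train_end, n - valid_end

# Compare a few dataset sizes; very small ones lose a split entirely
for n in (1, 2, 3, 10):
    print(n, split_sizes(n))
```

With these fractions a two-entry file yields no validation data, so it may be worth checking the entry count before training.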

### Validate Data Splits and Integrity

This section checks if we have successfully created train, valid, and test splits, and ensures all the data is valid.

We will:
1. Verify the existence of train, valid, and test files
2. Check if each file contains valid JSONL entries
3. Confirm that each entry has the correct structure (messages list with system, user, and assistant roles)
4. Report the number of valid entries in each file

The code below performs these checks and provides a summary of the validation results.

In [None]:
# Test data validity
import json
import os

def validate_jsonl_file(file_path):
    if not os.path.exists(file_path):
        print(f"Error: The file {file_path} does not exist.")
        return 0
    else:
        with open(file_path, 'r') as file:
            valid_entries = []
            for line_number, line in enumerate(file, start=1):
                try:
                    entry = json.loads(line)
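                    # Check if the entry has the required structure:
                    # a 'messages' list whose roles are system, user, assistant, in that order
                    messages = entry.get('messages') if isinstance(entry, dict) else None
                    roles = [m.get('role') for m in messages if isinstance(m, dict)] if isinstance(messages, list) else None
                    if roles == ['system', 'user', 'assistant']: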
                        valid_entries.append(entry)
                    else:
                        print(f"Invalid entry found at line {line_number}: {line.strip()}")
                except json.JSONDecodeError:
                    print(f"Error decoding JSON for line {line_number}: {line.strip()}")
        print(f"Validation complete. {len(valid_entries)} valid entries found in {file_path}.")
        return len(valid_entries)

# Validate the train, valid, and test files
file_paths = ['data/whatsapp/train.jsonl', 'data/whatsapp/valid.jsonl', 'data/whatsapp/test.jsonl']
for file_path in file_paths:
    validate_jsonl_file(file_path)