# Discord export processing

One challenge I faced when training a model on my own messages was the lack of context. A Discord export contains only my messages, not other people's replies, so I approximate context by grouping together messages that were sent in the same channel within a short time of each other.
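The grouping idea can be sketched in plain Python before touching the real export. This is only an illustration under stated assumptions: the `(channel, timestamp, text)` tuples, the channel names, and the `group_messages` helper are hypothetical, and the 15-minute gap mirrors the threshold used later in this notebook.

```python
from datetime import datetime, timedelta


def group_messages(messages, gap=timedelta(minutes=15)):
    """Group (channel, timestamp, text) tuples into per-channel bursts.

    A new group starts when a message arrives more than `gap` after the
    previous message in the *same* channel. Hypothetical helper, not part
    of the notebook's pipeline.
    """
    last_seen = {}  # channel -> timestamp of the previous message
    current = {}    # channel -> list of texts in the currently open group
    groups = []     # (channel, texts) pairs in order of creation
    for channel, ts, text in sorted(messages, key=lambda m: m[1]):
        prev = last_seen.get(channel)
        if prev is None or ts - prev > gap:
            current[channel] = []
            groups.append((channel, current[channel]))
        current[channel].append(text)
        last_seen[channel] = ts
    return [(channel, "\n".join(texts)) for channel, texts in groups]


# Made-up sample data with two interleaved channels.
sample = [
    ("general", datetime(2023, 1, 5, 10, 0), "hello"),
    ("random", datetime(2023, 1, 5, 10, 1), "unrelated"),
    ("general", datetime(2023, 1, 5, 10, 5), "still here"),
    ("general", datetime(2023, 1, 5, 11, 0), "new topic"),
]
result = group_messages(sample)
for channel, text in result:
    print(channel, repr(text))
```

In this sample the first two `general` messages end up in one group even though a `random` message sits between them, while the 11:00 message starts a new group because it arrives 55 minutes after the previous one.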

In [34]:
# SPDX-License-Identifier: MIT

import polars as pl
import numpy as np

In [35]:
np.random.seed(42)

In [36]:
# convert the raw data to jsonl format
!./convert_raw_data_to_jsonl.sh

In [37]:
df = pl.scan_ndjson('./raw_data/**/messages.jsonl', include_file_paths="channel").collect().sort('channel')

In [38]:
# categorize data by channel and convert the timestamp to datetime
df = df.with_columns([
    pl.col('channel').str.replace('raw_data', '').str.replace_all('/','').str.replace('messages.jsonl', '').cast(pl.Categorical),
    pl.col('Timestamp').str.strptime(pl.Datetime, '%Y-%m-%d %H:%M:%S')
])
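
The two transformations in the cell above can be checked on a single value with the standard library. This is a rough sketch, assuming each channel lives in its own directory directly under `raw_data`; the channel id `c123` and both helper names are made up for illustration. The timestamp format is the one used in the cell above.

```python
from datetime import datetime
from pathlib import PurePosixPath


def channel_from_path(path: str) -> str:
    """Return the name of the directory that directly contains messages.jsonl."""
    return PurePosixPath(path).parent.name


def parse_timestamp(raw: str) -> datetime:
    """Parse a timestamp in the '%Y-%m-%d %H:%M:%S' format."""
    return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")


example_path = "raw_data/c123/messages.jsonl"  # hypothetical path
print(channel_from_path(example_path))
print(parse_timestamp("2023-01-05 14:30:00"))
```

Unlike the polars expression, which strips every `/` and so concatenates nested directory names, this helper keeps only the immediate parent directory; the two agree only when there is exactly one directory level per channel.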

In [39]:
# sort by datetime
df = df.sort('Timestamp')

In [40]:
# calculate the time elapsed since the previous message in the same channel (as a Duration);
# this relies on the rows already being sorted by Timestamp
df = df.with_columns(
    pl.col('Timestamp').diff().over('channel').fill_null(pl.duration(seconds=0)).alias('time_diff')
)

# start a new group whenever the gap to the previous message in the same channel exceeds 15 minutes
# (adjust this threshold to control how aggressively messages are grouped);
# the running count is taken per channel so that activity in other channels does not split a group
df = df.with_columns(
    (pl.col('time_diff') > pl.duration(minutes=15)).cum_sum().over('channel').alias('group')
)

# group by the new group column and aggregate
grouped = (
    df.group_by('channel', 'group')
    .agg([
        pl.col('Timestamp').min().alias('start_time'),
        pl.col('Timestamp').max().alias('end_time'),
        # join the messages of a group with newlines (str.concat is deprecated in favor of str.join)
        pl.col('Contents').str.join("\n").alias('GroupContents')
    ])
).sort('end_time')

# select and rename columns for output
outframe = grouped.select(
    pl.col('GroupContents').alias('text')
)

In [41]:
# write all data to jsonl
outframe.write_ndjson('./data/all.jsonl')

In [42]:
# shuffle for train/val split (reproducible because of the NumPy seed set above)
outframe = (
    outframe
    .with_columns(pl.Series(name="random", values=np.random.rand(outframe.height)))
    .sort("random")
    .drop("random")
)

# calculate 90% of the row count for training, the rest for validation
train_size = int(0.9 * outframe.height)
val_size = outframe.height - train_size
outframe.head(train_size).write_ndjson('./data/train.jsonl')
outframe.tail(val_size).write_ndjson('./data/valid.jsonl')
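
The split arithmetic can be sanity-checked on a toy list without polars. A minimal sketch, assuming a seeded shuffle from the standard library in place of the NumPy-based one above; `split_rows` is a hypothetical helper and its shuffle order will not match the notebook's.

```python
import random


def split_rows(rows, train_fraction=0.9, seed=42):
    """Shuffle rows reproducibly and split them into (train, valid)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # int() truncates, so any fractional row goes to the validation set
    train_size = int(train_fraction * len(shuffled))
    return shuffled[:train_size], shuffled[train_size:]


train, valid = split_rows(range(25))
print(len(train), len(valid))  # 22 3
```

Every row lands in exactly one of the two sets, which is the same property the `head(train_size)` / `tail(val_size)` pair gives on the shuffled frame.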