
Doppelganger

Fine-tuning an LLM on my Telegram chats. You can read the full story in my blog.

Dataset Preparation

First, we have to get the data. Open Telegram, go to 'Settings' -> 'Advanced' -> 'Export Telegram Data' and unselect everything except 'Personal chats' and 'Private groups' (don't select 'Only my messages there'). As the output format, choose 'Machine-readable JSON'. The export will produce a result.json file.

Use prepare_dataset.py to transform result.json to JSON with a list of sessions:

python prepare_dataset.py "./data/result.json" "./data/messages.json"

There are some flags available for this script; you can read more with --help:

python prepare_dataset.py --help
output
NAME
    prepare_dataset.py - Transforms chat histories from .json telegram export to .json with a list of sessions. Session is a list of messages, where each message is a dict with fields 'author' and 'text'.

SYNOPSIS
    prepare_dataset.py INPUT OUTPUT <flags>

DESCRIPTION
    Transforms chat histories from .json telegram export to .json with a list of sessions. Session is a list of messages, where each message is a dict with fields 'author' and 'text'.

POSITIONAL ARGUMENTS
    INPUT
        Type: str
        Path to .json telegram export, usually called result.json
    OUTPUT
        Type: str
        Path to output .json file

FLAGS
    -t, --target_name=TARGET_NAME
        Type: Optional[str | None]
        Default: None
        The name of the person to target. This person will be present in every session. If empty, it will be detected from "Saved Messages"
    -l, --last_x_months=LAST_X_MONTHS
        Type: int
        Default: 24
        Number of last months to use messages from
    -s, --session_minutes_threshold=SESSION_MINUTES_THRESHOLD
        Type: int
        Default: 10
        Threshold in minutes where messages will belong to the same session
    -c, --concat_one_user_messages_delimeter=CONCAT_ONE_USER_MESSAGES_DELIMETER
        Type: str
        Default: '\n>>> '
        Users might type several messages one after another. They are concatenated using this delimiter

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS
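
For reference, the session-splitting logic can be sketched roughly as follows. This is a minimal sketch based on the flag descriptions above, not the actual code from prepare_dataset.py:

from datetime import timedelta

SESSION_GAP = timedelta(minutes=10)  # --session_minutes_threshold
DELIMITER = "\n>>> "                 # --concat_one_user_messages_delimeter

def split_into_sessions(messages):
    """messages: list of {'author', 'text', 'date'} dicts sorted by date (datetime)."""
    sessions, current = [], []
    for msg in messages:
        if current and msg["date"] - current[-1]["date"] > SESSION_GAP:
            sessions.append(current)  # gap is too large, start a new session
            current = []
        if current and current[-1]["author"] == msg["author"]:
            # consecutive messages from the same user are merged with the delimiter
            current[-1]["text"] += DELIMITER + msg["text"]
            current[-1]["date"] = msg["date"]
        else:
            current.append(dict(msg))
    if current:
        sessions.append(current)
    # only 'author' and 'text' are kept in the output file
    return [[{"author": m["author"], "text": m["text"]} for m in s] for s in sessions]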

If you are interested, Telegram has several types of messages, which should be handled differently:

default text message
{
 "id": 123,
 "type": "message",
 "date": "2023-10-31T15:23:38",
 "date_unixtime": "1698746018",
 "from": "Username",
 "from_id": "user123",
 "text": "ты где?",
 "text_entities": [
  {
   "type": "plain",
   "text": "ты где?"
  }
 ]
}
multiple text entities
{
 "id": 345,
 "type": "message",
 "date": "2023-10-25T01:56:50",
 "date_unixtime": "1698179210",
 "from": "Username",
 "from_id": "user456",
 "text": [
  "California suspends GM Cruise's autonomous vehicle deployment | Hacker News\n",
  {
   "type": "link",
   "text": "https://news.ycombinator.com/item?id=38002752"
  }
 ],
 "text_entities": [
  {
   "type": "plain",
   "text": "California suspends GM Cruise's autonomous vehicle deployment | Hacker News\n"
  },
  {
   "type": "link",
   "text": "https://news.ycombinator.com/item?id=38002752"
  }
 ]
}
sticker
{
 "id": 789,
 "type": "message",
 "date": "2023-10-30T23:24:20",
 "date_unixtime": "1698688460",
 "from": "Username",
 "from_id": "user789",
 "file": "(File not included. Change data exporting settings to download.)",
 "thumbnail": "(File not included. Change data exporting settings to download.)",
 "media_type": "sticker",
 "sticker_emoji": "🤗",
 "width": 512,
 "height": 501,
 "text": "",
 "text_entities": []
}
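
One way to normalize these cases to plain text (a hedged sketch, not necessarily how prepare_dataset.py handles it):

from typing import Optional

def extract_text(msg: dict) -> Optional[str]:
    """Return the plain text of an exported message, or None if there is none (e.g. stickers)."""
    if msg.get("type") != "message":
        return None  # skip service messages
    text = msg.get("text", "")
    if isinstance(text, list):
        # 'text' can mix plain strings and entity dicts (links, mentions, ...)
        text = "".join(p if isinstance(p, str) else p.get("text", "") for p in text)
    text = text.strip()
    return text or None  # empty text means a sticker, photo or other media-only message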

Training

The final versions of the models were trained with the parameters that are the defaults in the training scripts. Training logs can be accessed on WandB.

LoRA fine-tune

To launch a LoRA fine-tune with my default params, you will need a GPU with 20 GB of VRAM. An RTX 3090 is a good option for the money. You may reduce micro_batch_size or max_seq_length if you want to lower the amount of VRAM required. To get the full list of parameters, run:

python finetune_lora.py --help

To train LoRA, run:

python finetune_lora.py
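
For example, assuming finetune_lora.py exposes its parameters as command-line flags in the same way as prepare_dataset.py does (the values below are only illustrative), you could lower VRAM usage like this:

python finetune_lora.py --micro_batch_size=1 --max_seq_length=512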

Full fine-tune

To list available params with their default values, run:

python finetune_full.py --help

To train:

torchrun --nnodes=1 --nproc_per_node=NUMBER_OF_GPUS finetune_full.py
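
For example, on a single machine with 2 GPUs:

torchrun --nnodes=1 --nproc_per_node=2 finetune_full.py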

Launching

Use oobabooga/text-generation-webui. If you used LoRA, clone ehartford/dolphin-2.2.1-mistral-7b (or whatever model you used as the base) and put the trained LoRA adapters into the ./loras/ folder within text-generation-webui. If you did a full fine-tune, copy the training result to ./models/.