In [1]:
import pandas as pd
from datasets import load_dataset
import json
from tqdm import tqdm

## The Prepared data

Honeycomb data conversations

In [2]:
!head -n1 sample_data/alpaca_synth_queries.jsonl | python -m json.tool --json-lines

{
    "conversations": [
        {
            "from": "system",
            "value": "Honeycomb is an observability platform that allows you to write queries to inspect trace data. You are an assistant that takes a natural language query (NLQ) and a list of valid columns and produce a Honeycomb query."
        },
        {
            "from": "human",
            "value": "\n\nNLQ: \"group by HTTP method\"\n\nColumns: ['query_string_num_tokens', 'query_string_length', 'data_queries', 'http.target', 'task.id', 'trace_root.http.target', 'topic', 'http.host', 'total_hits', 'db.user', 'domain_types', 'db.name', 'graphql.document', 'history', 'http.scheme', 'http.method', 'frontend.version', 'disposition_for_dBVVysC8x4Ymwg9rtjMckgw9', 'db.system', 'event_name', 'organization', 'auth.logout', 'organizations', 'name', 'net.transport', 'db.operation', 'disposition_for_UvsPPBVUn9FDuzDjsjYCqopq', 'disposition_for_1RUGSd7GdnP5tuKdgqBRZUm2', 'process.pid', 'disposition_for_6uyAoBc3PuvEcTTPFgPM3Rt

Cypher data conversations

In [7]:
dataset = load_dataset("vedana17/text-to-cypher")

In [8]:
(dataset.items())

dict_items([('train', Dataset({
    features: ['query', 'schema', 'result'],
    num_rows: 235
}))])

Synthesize data

In [10]:
OUT_JSONL = "sample_data/alpaca_synth_cypher.jsonl"


with open(OUT_JSONL, 'w') as outfile:
    for split, data in dataset.items():
        for q, s, r in tqdm(zip(data['query'], data['schema'], data['result'])):
            alpaca_dict = {
                "conversations": [
                    {
                        "from": "system", 
                        "value": "You are an assistant that takes a natural language query (NLQ) and a graph database schema to produce a Neo4J Cypher query."
                    }, 
                    {
                        "from": "human", 
                        "value": f"\n\nNLQ: {q} \n\nSchema: {s}"
                    }, 
                    {
                        "from": "gpt", 
                        "value": r
                    }
                ]
            }
            json.dump(alpaca_dict, outfile)
            outfile.write('\n')

# alpaca_dict.json_dumps(f"sample_data/cypher_alp_synth_test.jsonl")

235it [00:00, 1179.58it/s]
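
A quick sanity check on the written file can catch malformed records before preprocessing. The sketch below assumes each line is a JSON object whose `conversations` list holds `system`, `human`, and `gpt` turns in that order, as the loop above writes them. `check_sharegpt_jsonl` is a hypothetical helper name, not part of axolotl.

```python
import json

EXPECTED_ROLES = ["system", "human", "gpt"]  # order written by the loop above


def check_sharegpt_jsonl(path):
    """Return (line_number, problem) tuples for records that look malformed."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((lineno, "record is not a JSON object"))
                continue
            turns = record.get("conversations", [])
            roles = [turn.get("from") for turn in turns]
            if roles != EXPECTED_ROLES:
                problems.append((lineno, f"unexpected roles: {roles}"))
            elif any(
                not isinstance(turn.get("value"), str) or not turn["value"].strip()
                for turn in turns
            ):
                problems.append((lineno, "empty or non-string value"))
    return problems
```

Calling `check_sharegpt_jsonl(OUT_JSONL)` should return an empty list if every record has the expected shape.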


In [7]:
!head -n2 sample_data/alpaca_synth_cypher.jsonl | python -m json.tool --json-lines

{
    "conversations": [
        {
            "from": "system",
            "value": "You are an assistant that takes a natural language query (NLQ) and a graph database schema to produce a Neo4J Cypher query."
        },
        {
            "from": "human",
            "value": "\n\nNLQ: Find all Officers whose name contains 'Dupond' and their associated entities, addresses and relationships. \n\nSchema: Node properties are the following: \":Entity {countries: STRING, lastEditTimestamp: STRING, ibcRUC: STRING, valid_until: STRING, country_codes: STRING, service_provider: STRING, address: STRING, inactivation_date: STRING, struck_off_date: STRING, status: STRING, jurisdiction_description: STRING, incorporation_date: STRING, original_name: STRING, jurisdiction: STRING, internal_id: STRING, name: STRING, node_id: INTEGER, sourceID: STRING, former_name: STRING, tax_stat_description: STRING, company_type: STRING, note: STRING, dorm_date: STRING, type: STRING, closed_date: STRING, compan

## The Config

Pay close attention to `datasets` and `train_on_inputs`.

In [1]:
!cat cypher.yml

base_model: deepseek-ai/deepseek-coder-1.3b-instruct
# base_model: Qwen/CodeQwen1.5-7B-Chat
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
is_mistral_derived_model: false

load_in_8bit: false
load_in_4bit: true
strict: false

lora_fan_in_fan_out: false
data_seed: 49
seed: 49

datasets:
  - path: sample_data/alpaca_synth_cypher.jsonl
    type: sharegpt
    conversation: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.1
output_dir: ./qlora-alpaca-deepseek-1.3b-inst
# hub_model_id: jermyn/CodeQwen1.5-7B-Chat-NLQ2Cypher

adapter: qlora
lora_model_dir:

sequence_len: 896
sample_packing: false
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

# If you added new tokens to the tokenizer, you may need to save some LoRA modules because they need to know the new tokens.
# For LLaMA and Mistral,

### HF & WandB

You need to change the following things in your config:

```yaml
wandb_project: hc-axolotl-mistral
wandb_entity: hamelsmu
hub_model_id: hamel/hc-mistral-alpaca
```

## Do the Preprocessing

In [2]:
! python -m axolotl.cli.preprocess cypher.yml --debug

This can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64

[2024-06-08 02:03:57,879] [INFO] [datasets.<module>:58] [PID:477] PyTorch version 2.1.2+cu118 available.
[2024-06-08 02:03:58,868] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-08 02:03:58,941] [INFO] [root.spawn:38] [PID:477] gcc -pthread -B /root/miniconda3/envs/py3.10/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.10/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.10/include -fPIC -c /tmp/tmpehl2vuc6/test.c -o /tmp/tmpehl2vuc6/test.o
[2024-06-08 02

In [None]:
# ! python -m axolotl.cli.preprocess hc.yml

## Debug

### See All The commands

In [3]:
! python -m axolotl.cli.preprocess cypher.yml --help

This can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64

[2024-05-28 00:44:26,483] [INFO] [datasets.<module>:58] [PID:207] PyTorch version 2.1.2+cu118 available.
[2024-05-28 00:44:27,572] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-28 00:44:27,649] [INFO] [root.spawn:38] [PID:207] gcc -pthread -B /root/miniconda3/envs/py3.10/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.10/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.10/include -fPIC -c /tmp/tmpxrbwq46s/test.c -o /tmp/tmpxrbwq46s/test.o
[2024-05-28 00

### Text Only

I often have problems with `debug_text_only`, so I do things manually.

In [3]:
!ls -lah last_run_prepared/

total 4.0K
drwxr-xr-x  4 root root   86 Jun  8 02:04 .
drwxr-xr-x 10 root root 4.0K Jun  8 02:04 ..
drwxr-xr-x  3 root root  108 May 28 00:41 26d9a15e77efc1aa96977ca0958caf0b
drwxr-xr-x  2 root root   82 Jun  8 02:04 51918e080cb6de35a8c982ddc7c95741
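
The listing shows more than one prepared dataset, each under a hash-named directory. One way to avoid hardcoding the hash is to pick the most recently modified subdirectory. This is a sketch under the assumption that the newest directory is the one produced by the current config; `latest_prepared_dir` is a hypothetical helper, not an axolotl function.

```python
from pathlib import Path


def latest_prepared_dir(root="last_run_prepared"):
    """Return the most recently modified subdirectory of `root`."""
    subdirs = [p for p in Path(root).iterdir() if p.is_dir()]
    if not subdirs:
        raise FileNotFoundError(f"no prepared datasets under {root!r}")
    # Directory mtime is a heuristic; double-check against the `ls` output.
    return max(subdirs, key=lambda p: p.stat().st_mtime)
```

The result can then be passed to `load_from_disk(str(latest_prepared_dir()))`.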


In [4]:
import json, yaml
from transformers import AutoTokenizer
from datasets import load_from_disk


with open('cypher.yml', 'r') as f:
    cfg = yaml.safe_load(f)

model_id = cfg['base_model']
tok = AutoTokenizer.from_pretrained(model_id)
ds = load_from_disk('last_run_prepared/51918e080cb6de35a8c982ddc7c95741/')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Below is the assembled text in its flattened format. Notice the spaces that axolotl adds; I'll come back to this at the end.

This makes me paranoid about differences between how the prompt is assembled for training and how it is assembled at inference. You just have to make sure it's the same at inference!


In [9]:
print(tok.decode(ds['input_ids'][15]))

<｜begin▁of▁sentence｜>You are an assistant that takes a natural language query (NLQ) and a graph database schema to produce a Neo4J Cypher query.

### Instruction: 

NLQ: Find the police officers who investigated a crime 

Schema: Node properties are the following: ":Person {surname: STRING, nhs_no: STRING, name: STRING, age: STRING},:Location {latitude: FLOAT, postcode: STRING, longitude: FLOAT, address: STRING},:Phone {phoneNo: STRING},:Email {email_address: STRING},:Officer {badge_no: STRING, rank: STRING, name: STRING, surname: STRING},:PostCode {code: STRING},:Area {areaCode: STRING},:PhoneCall {call_duration: STRING, call_time: STRING, call_date: STRING, call_type: STRING},:Crime {date: STRING, id: STRING, type: STRING, last_outcome: STRING, note: STRING, charge: STRING},:Object {description: STRING, id: STRING, type: STRING},:Vehicle {model: STRING, reg: STRING, make: STRING, year: STRING}" Relationship properties are the following: ":CURRENT_ADDRESS {},:HAS_PHONE {},:HAS_EMAIL {
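
One way to check that a prompt built at inference time matches training is to compare token ids rather than strings, since whitespace differences are easy to miss by eye. The sketch below assumes the inference prompt should tokenize to a prefix of the training example (the training example also contains the response). `first_mismatch` is a hypothetical helper.

```python
def first_mismatch(train_ids, infer_ids):
    """Return the first index where the two id sequences differ.

    Returns None when `infer_ids` is a prefix of `train_ids`, which is
    the outcome you want for an inference prompt.
    """
    for i, (a, b) in enumerate(zip(train_ids, infer_ids)):
        if a != b:
            return i
    if len(infer_ids) > len(train_ids):
        # Everything matched, but the inference prompt runs past the example.
        return len(train_ids)
    return None
```

For instance, `first_mismatch(ds['input_ids'][15], tok(prompt)['input_ids'])`, where `prompt` is your own inference-time string for the same example, points at the first diverging token. Whether the tokenizer adds a beginning-of-sentence token on its own is something to verify for your model.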

### Other Notes

- Seeing the flattened version often helps you spot issues in your prompt. These can be hard to notice in JSONL format.
- Check multiple examples!
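
To make checking multiple examples less ad hoc, a small reproducible sampler helps. This is a minimal sketch; `sample_indices` is a hypothetical helper, and the commented usage assumes the `tok` and `ds` objects loaded earlier.

```python
import random


def sample_indices(num_rows, k=5, seed=0):
    """Pick up to k distinct row indices, reproducibly, in ascending order."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_rows), min(k, num_rows)))


# Usage with the objects loaded above:
# for i in sample_indices(len(ds)):
#     print(tok.decode(ds[i]["input_ids"]))
#     print("-" * 80)
```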

### Verbose debugging

This helps you check things like:
1. Ignoring inputs (`train_on_inputs: false`): notice the `red` color, which indicates tokens that are ignored.
2. Token ids (ex: what are those spaces right before `##`?)
3. The logs tell you what the special tokens are.
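
To make the first item concrete: when inputs are ignored, the prompt tokens stay in `input_ids` but their labels are replaced with a value the loss skips. The sketch below is conceptual and not axolotl's implementation. It assumes the common convention of `-100`, which is the default `ignore_index` of PyTorch's cross-entropy loss; `mask_prompt_labels` and `prompt_len` are hypothetical names.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_prompt_labels(input_ids, prompt_len):
    """Build labels from input_ids, ignoring the first `prompt_len` tokens."""
    if not 0 <= prompt_len <= len(input_ids):
        raise ValueError("prompt_len must be between 0 and len(input_ids)")
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])
```

The tokens shown in red in the debug output correspond to the positions that would carry the ignore value here.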

In [10]:
# ! python -m axolotl.cli.preprocess hc.yml --debug

## Look at special tokens

Ex: What is `<0x0A>`?

In [15]:
tok.decode([0])

'<unk>'

**But where is the space coming from?**

In [42]:
tok.decode(774)

'###'
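
When hunting for stray whitespace, it helps to decode one token at a time and print the `repr`, so spaces and newlines become visible. This is a minimal sketch; `show_tokens` is a hypothetical helper that takes any decode callable, such as `tok.decode`.

```python
def show_tokens(ids, decode):
    """Return (token_id, repr_of_decoded_text) pairs, one token at a time."""
    return [(i, repr(decode([i]))) for i in ids]


# Usage with the objects loaded above:
# for token_id, text in show_tokens(ds['input_ids'][15][:40], tok.decode):
#     print(token_id, text)
```

Decoding a single token can itself hide a leading space with some tokenizers, so comparing against `tok.convert_ids_to_tokens` is worth doing as well.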

**It's pretty confusing!  See [this blog post](https://hamel.dev/notes/llm/finetuning/05_tokenizer_gotchas.html)**

What does Wing think?