# **Demo: Data preparation**

# **Description**
In this tutorial, you will walk through the process of preparing data for fine-tuning an LLM.

# **Steps to perform:**

1. Import necessary libraries
2. Load and prepare the dataset
3. Tokenize a single example
4. Handle long sequences
5. Tokenize the instruction dataset
6. Tokenize the entire dataset
7. Add labels
8. Prepare test/train splits



# **Step 1: Import necessary libraries**


In [4]:
import pandas as pd
import datasets
from pprint import pprint # Pretty Print
from transformers import AutoTokenizer

In [2]:
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")




# **Step 2: Load and prepare the dataset**



In [6]:
# Load the data first

import pandas as pd

data = "lamini_docs.jsonl"

instruction_dataset = pd.read_json(data, lines=True)
instruction_dataset

# Converting the data into dictionary format

examples = instruction_dataset.to_dict()

examples['question'][0]

# Extracting the text data from the Dataset

if "question" in examples and "answer" in examples:
  text = examples["question"][0] + examples["answer"][0]
elif "instruction" in examples and "response" in examples:
  text = examples["instruction"][0] + examples["response"][0]
elif "input" in examples and "output" in examples:
  text = examples["input"][0] + examples["output"][0]
else:
  text = examples["text"][0]

# Formatting the data for fine tuning

prompt_template = """### Question:
{question}

### Answer:"""

# Preparing the data for fine tuning dataset

num_examples = len(examples["question"])

finetuning_data = [] # This will save the formatted training data.

# Iterating over the entire dataset to format each example

for i in range(num_examples):
  question = examples["question"][i]
  answer = examples["answer"][i]
  text_with_prompt_template = prompt_template.format(question=question)
  finetuning_data.append({"question": text_with_prompt_template, "answer": answer})

from pprint import pprint
print("One datapoint in the finetuning dataset:")
print(finetuning_data[5])
print(finetuning_data[6])
print(finetuning_data[8])
print(finetuning_data[66])

One datapoint in the finetuning dataset:
{'question': '### Question:\nHow frequently is the documentation updated to reflect changes in the code?\n\n### Answer:', 'answer': 'Documentation on such a fast moving project is difficult to update regularly - that’s why we’ve built this model to continually update users on the status of our product.'}
{'question': '### Question:\nIs there a community or support channel mentioned in the documentation where I can ask questions or seek help?\n\n### Answer:', 'answer': 'You can always reach out to us at support@lamini.ai.'}
{'question': '### Question:\nIs there a troubleshooting guide or a list of common issues and their solutions?\n\n### Answer:', 'answer': 'All our public documentation is available here https://lamini-ai.github.io/'}
{'question': '### Question:\nHow does Lamini decide what answers or information to give when we use its functions?\n\n### Answer:', 'answer': 'Lamini uses a language model to analyze the input question and generate

In [7]:
finetuning_data[5]["question"]

'### Question:\nHow frequently is the documentation updated to reflect changes in the code?\n\n### Answer:'

In [8]:
finetuning_data[5]["answer"]

'Documentation on such a fast moving project is difficult to update regularly - that’s why we’ve built this model to continually update users on the status of our product.'
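The formatting loop in Step 2 can be wrapped in a small helper so that the same template is applied to any list of records. The sketch below is illustrative only: `format_example` and the two sample records are hypothetical, and only the prompt template is taken from this demo.

```python
# Prompt template copied from the Step 2 cell.
PROMPT_TEMPLATE = """### Question:
{question}

### Answer:"""


def format_example(question: str, answer: str) -> dict:
    """Return one fine-tuning record with the question wrapped in the template."""
    return {
        "question": PROMPT_TEMPLATE.format(question=question),
        "answer": answer,
    }


# Hypothetical records, used only for illustration.
sample_records = [
    {"question": "What is tokenization?", "answer": "Splitting text into tokens."},
    {"question": "What is padding?", "answer": "Filling short sequences to a common length."},
]

formatted = [format_example(r["question"], r["answer"]) for r in sample_records]
print(formatted[0]["question"])
```

Keeping the answer out of the template means the prompt ends at `### Answer:`, which is the point at which the model is expected to start generating.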

In [9]:
instruction_dataset

| | question | answer |
|---|---|---|
| 0 | What are the different types of documents avai... | Lamini has documentation on Getting Started, A... |
| 1 | What is the recommended way to set up and conf... | Lamini can be downloaded as a python package a... |
| 2 | How can I find the specific documentation I ne... | You can ask this model about documentation, wh... |
| 3 | Does the documentation include explanations of... | Our documentation provides both real-world and... |
| 4 | Does the documentation provide information abo... | External dependencies and libraries are all av... |
| ... | ... | ... |
| 1395 | What is Lamini and what is its collaboration w... | Lamini is a library that simplifies the proces... |
| 1396 | How does Lamini simplify the process of access... | Lamini simplifies data access in Databricks by... |
| 1397 | What are some of the key features provided by ... | Lamini automatically manages the infrastructur... |
| 1398 | How does Lamini ensure data privacy during the... | During the training process, Lamini ensures da... |
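The `if`/`elif` chain in the Step 2 cell picks a prompt column and a completion column depending on which names the dataset uses. A minimal sketch of the same lookup as a reusable function follows; `CANDIDATE_PAIRS`, `first_example_text`, and the sample dictionaries are hypothetical names introduced here for illustration.

```python
# Column pairs checked in the same order as the Step 2 cell.
CANDIDATE_PAIRS = [
    ("question", "answer"),
    ("instruction", "response"),
    ("input", "output"),
]


def first_example_text(examples: dict) -> str:
    """Concatenate prompt and completion of the first example, whatever the schema."""
    for prompt_key, completion_key in CANDIDATE_PAIRS:
        if prompt_key in examples and completion_key in examples:
            return examples[prompt_key][0] + examples[completion_key][0]
    # Fall back to a single free-text column.
    return examples["text"][0]


# Hypothetical column-oriented data, shaped like the `examples` dictionary.
qa_style = {"question": ["What is 2 + 2?"], "answer": [" 4"]}
text_style = {"text": ["Plain text with no prompt/completion split."]}

print(first_example_text(qa_style))    # What is 2 + 2? 4
print(first_example_text(text_style))  # Plain text with no prompt/completion split.
```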


# **Step 3: Tokenize a single example**


*   Before tokenizing the entire dataset, first tokenize a single example to understand the process. Use the Pythia-70m tokenizer for this.


In [10]:
# Reuse the EOS token as the padding token so that short sequences can be padded.
tokenizer.pad_token = tokenizer.eos_token

text = finetuning_data[0]["question"] + finetuning_data[0]["answer"]

tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    padding=True
)
print(tokenized_inputs["input_ids"])

[[ 4118 19782    27   187  1276   403   253  1027  3510   273  7177  2130
    275   253 18491   313    70    15    72   904 12692  7102    13  8990
  10097    13 13722   434  7102  6177   187   187  4118 37741    27    45
   4988    74   556 10097   327 27669 11075   264    13  5271 23058    13
  19782 37741 10031    13 13814 11397    13   378 16464    13 11759 10535
   1981    13 21798 12989    13   285   966 10097   327 21708    46 10797
   2130   387  5987  1358    77  4988    74    14  2284    15  7280    15
    900 14206]]
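To make the roles of `input_ids`, padding, and the attention mask concrete, here is a toy sketch that uses a whitespace "tokenizer" instead of the Pythia tokenizer. Everything in it (`build_vocab`, `encode_batch`, the `<pad>` entry, the sample texts) is a hypothetical stand-in; real subword tokenizers produce different IDs.

```python
def build_vocab(texts):
    """Assign an integer ID to every distinct whitespace-separated word."""
    vocab = {"<pad>": 0}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode_batch(texts, vocab, pad_id=0):
    """Encode texts and pad each one to the longest sequence in the batch."""
    encoded = [[vocab[word] for word in text.split()] for text in texts]
    longest = max(len(ids) for ids in encoded)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in encoded]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in encoded]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


texts = ["what is lamini", "hello"]
vocab = build_vocab(texts)
batch = encode_batch(texts, vocab)
print(batch["input_ids"])       # [[1, 2, 3], [4, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

Padding only matters when a batch holds sequences of different lengths. With a single text, as in the Step 3 cell, `padding=True` pads to the longest sequence in the batch, which is the text itself, so no padding tokens are added.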


# **Step 4: Handle long sequences**


*   If the tokenized input is longer than the model’s maximum sequence length, you need to truncate it.



In [11]:
max_length = 2048
max_length = min(
    tokenized_inputs["input_ids"].shape[1],
    max_length,
)


In [12]:
tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    truncation=True,
    max_length=max_length
)

In [13]:
tokenized_inputs["input_ids"]

array([[ 4118, 19782,    27,   187,  1276,   403,   253,  1027,  3510,
          273,  7177,  2130,   275,   253, 18491,   313,    70,    15,
           72,   904, 12692,  7102,    13,  8990, 10097,    13, 13722,
          434,  7102,  6177,   187,   187,  4118, 37741,    27,    45,
         4988,    74,   556, 10097,   327, 27669, 11075,   264,    13,
         5271, 23058,    13, 19782, 37741, 10031,    13, 13814, 11397,
           13,   378, 16464,    13, 11759, 10535,  1981,    13, 21798,
        12989,    13,   285,   966, 10097,   327, 21708,    46, 10797,
         2130,   387,  5987,  1358,    77,  4988,    74,    14,  2284,
           15,  7280,    15,   900, 14206]])
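Truncation keeps at most `max_length` token IDs and discards the rest from one side. The sketch below shows the idea on a plain Python list; `truncate` is a hypothetical helper, not the tokenizer's implementation, and the sample IDs are the first eight values from the output above.

```python
def truncate(ids, max_length, side="right"):
    """Keep at most max_length IDs, dropping tokens from the given side."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if len(ids) <= max_length:
        return list(ids)
    if side == "left":
        # Drop tokens from the start and keep the end of the sequence.
        return list(ids[-max_length:])
    # Drop tokens from the end and keep the start of the sequence.
    return list(ids[:max_length])


ids = [4118, 19782, 27, 187, 1276, 403, 253, 1027]
print(truncate(ids, 5))               # [4118, 19782, 27, 187, 1276]
print(truncate(ids, 5, side="left"))  # [187, 1276, 403, 253, 1027]
print(truncate(ids, 2048) == ids)     # True
```

In the Step 4 cells, `max_length` is the smaller of the sequence length and 2048, so a short example like this one passes through unchanged.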

# **Step 5: Tokenize the instruction dataset**





In [14]:
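def tokenize_function(examples):
    """Tokenize the first example of a batch (intended for use with batch_size=1)."""

    # Concatenate prompt and completion, whichever column names are present.
    if "question" in examples and "answer" in examples:
        text = examples["question"][0] + examples["answer"][0]
    elif "input" in examples and "output" in examples:
        text = examples["input"][0] + examples["output"][0]
    else:
        text = examples["text"][0]

    # Reuse the EOS token as the padding token.
    tokenizer.pad_token = tokenizer.eos_token

    # First pass: tokenize without truncation to measure the sequence length.
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        padding=True,
    )

    # Cap the length at 2048 tokens.
    max_length = min(
        tokenized_inputs["input_ids"].shape[1],
        2048
    )

    # Second pass: truncate from the left, so the end of the text is kept.
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=max_length
    )

    return tokenized_inputs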

# **Step 6: Tokenize the entire dataset**



In [15]:
from datasets import load_dataset

finetuning_data = load_dataset("json", data_files=data, split="train")

tokenized_dataset = finetuning_data.map(
    tokenize_function,
    batched=True,
    batch_size=1,
    drop_last_batch=True
)

print(tokenized_dataset)

Dataset({
    features: ['question', 'answer', 'input_ids', 'attention_mask'],
    num_rows: 1400
})


In [16]:
tokenized_dataset

Dataset({
    features: ['question', 'answer', 'input_ids', 'attention_mask'],
    num_rows: 1400
})
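With `batched=True`, the mapped function receives each batch as a dictionary that maps a column name to a list of values, which is why `tokenize_function` indexes `[0]`. The following plain-Python sketch imitates that batching so the shapes are easy to inspect; `iter_batches` and the sample rows are hypothetical and do not reproduce the `datasets` library's internals.

```python
def iter_batches(rows, batch_size, drop_last_batch=False):
    """Yield column-oriented batches: one dict mapping column name to a list of values."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        if drop_last_batch and len(chunk) < batch_size:
            return  # Skip the final batch when it is incomplete.
        yield {key: [row[key] for row in chunk] for key in chunk[0]}


# Hypothetical rows, used only for illustration.
rows = [
    {"question": "Q1", "answer": "A1"},
    {"question": "Q2", "answer": "A2"},
    {"question": "Q3", "answer": "A3"},
]

batches_of_one = list(iter_batches(rows, batch_size=1, drop_last_batch=True))
batches_of_two = list(iter_batches(rows, batch_size=2, drop_last_batch=True))

print(batches_of_one[0])    # {'question': ['Q1'], 'answer': ['A1']}
print(len(batches_of_one))  # 3
print(len(batches_of_two))  # 1
```

With `batch_size=1` every batch holds exactly one example, so indexing `[0]` covers the whole batch and no batch is ever incomplete. With a larger batch size, a function that only reads index `[0]` would ignore the other examples in the batch.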

# **Step 7: Add labels**



In [17]:
tokenized_dataset = tokenized_dataset.add_column("labels", tokenized_dataset["input_ids"])

In [18]:
tokenized_dataset

Dataset({
    features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 1400
})
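For next-token prediction, the labels are a copy of `input_ids`: each position is trained to predict the token that follows it. Causal language models in the Transformers library are generally understood to apply this one-position shift internally, so the stored labels stay identical to the inputs. The sketch below only illustrates the pairing; `next_token_pairs` is a hypothetical helper.

```python
def next_token_pairs(input_ids):
    """Pair each token with the token the model should predict after it."""
    labels = list(input_ids)  # Labels start as a plain copy of the inputs.
    # Shift by one position: the token at index i is used to predict labels[i + 1].
    return list(zip(input_ids[:-1], labels[1:]))


ids = [4118, 19782, 27, 187]
pairs = next_token_pairs(ids)
print(pairs)  # [(4118, 19782), (19782, 27), (27, 187)]
```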

# **Step 8: Prepare test/train splits**



In [19]:
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, shuffle=True, seed=123)
print(split_dataset)

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})
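The row counts above follow from the split fraction: 10% of 1400 rows is 140 test rows, which leaves 1260 for training. A minimal sketch of a seeded shuffle-and-split over row indices is shown below. `train_test_split_indices` is a hypothetical helper that uses Python's `random` module, so the specific rows it selects will differ from those chosen by `train_test_split`.

```python
import random


def train_test_split_indices(num_rows, test_size, seed):
    """Shuffle row indices with a fixed seed and split off a test fraction."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # Same seed gives the same order.
    num_test = round(num_rows * test_size)
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = train_test_split_indices(1400, 0.1, 123)
print(len(train_idx), len(test_idx))      # 1260 140
print(set(train_idx) & set(test_idx))     # set()
```

Fixing the seed makes the split reproducible, so later evaluation runs compare against the same held-out rows.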


In [20]:
print(split_dataset["train"][0:5])

{'question': ['How can I evaluate the performance and quality of the generated text from Lamini models?', "Can I find information about the code's approach to handling long-running tasks and background jobs?", 'How does Lamini AI handle requests for generating text that requires reasoning or decision-making based on given information?', 'Does the `submit_job()` function expose any advanced training options such as learning rate schedules or early stopping?', 'Does the `add_data()` function support different data augmentation techniques or preprocessing options for training data?'], 'answer': ["There are several metrics that can be used to evaluate the performance and quality of generated text from Lamini models, including perplexity, BLEU score, and human evaluation. Perplexity measures how well the model predicts the next word in a sequence, while BLEU score measures the similarity between the generated text and a reference text. Human evaluation involves having human judges rate the 

In [21]:
train_df = pd.DataFrame(split_dataset["train"])
test_df = pd.DataFrame(split_dataset["test"])

In [22]:
train_df

| | question | answer | input_ids | attention_mask | labels |
|---|---|---|---|---|---|
| 0 | How can I evaluate the performance and quality... | There are several metrics that can be used to ... | [2347, 476, 309, 7472, 253, 3045, 285, 3290, 2... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [2347, 476, 309, 7472, 253, 3045, 285, 3290, 2... |
| 1 | Can I find information about the code's approa... | Yes, the code includes methods for submitting ... | [5804, 309, 1089, 1491, 670, 253, 2127, 434, 2... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [5804, 309, 1089, 1491, 670, 253, 2127, 434, 2... |
| 2 | How does Lamini AI handle requests for generat... | Lamini AI offers features for generating text ... | [2347, 1057, 418, 4988, 74, 14980, 6016, 9762,... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [2347, 1057, 418, 4988, 74, 14980, 6016, 9762,... |
| 3 | Does the `submit_job()` function expose any ad... | It is unclear which `submit_job()` function is... | [10795, 253, 2634, 21399, 64, 17455, 42702, 11... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [10795, 253, 2634, 21399, 64, 17455, 42702, 11... |
| 4 | Does the `add_data()` function support differe... | No, the `add_data()` function does not support... | [10795, 253, 2634, 1911, 64, 2203, 42702, 1159... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [10795, 253, 2634, 1911, 64, 2203, 42702, 1159... |
| ... | ... | ... | ... | ... | ... |
| 1255 | Does the documentation provide guidelines for ... | There is no mention of memory caching or evict... | [10795, 253, 10097, 2085, 9600, 323, 39793, 25... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [10795, 253, 10097, 2085, 9600, 323, 39793, 25... |
| 1256 | Does Lamini provide any mechanisms for model e... | Yes, Lamini provides mechanisms for model ense... | [10795, 418, 4988, 74, 2085, 667, 6297, 323, 1... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [10795, 418, 4988, 74, 2085, 667, 6297, 323, 1... |
| 1257 | Is Lamini owned by Tesla? | No, Lamini AI is an independent company workin... | [2513, 418, 4988, 74, 9633, 407, 27876, 32, 23... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [2513, 418, 4988, 74, 9633, 407, 27876, 32, 23... |
| 1258 | What is the process for suggesting edits or im... | You can suggest edits or improvements to the L... | [1276, 310, 253, 1232, 323, 7738, 1407, 953, 3... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [1276, 310, 253, 1232, 323, 7738, 1407, 953, 3... |




# **Conclusion:**
This concludes the data preparation process for fine-tuning a large language model (LLM). The next steps are to set up the model, fine-tune it on the training split, and evaluate its performance on the test split.
