# Prepare a Dataset of Examples for Fine-tuning an LLM

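The goal is a dataset of `[Question, Abstract, Answer]` triples, where the abstract is the passage of context that the question is drawn from.

At inference time, our model will take a user's query, retrieve relevant passages from those available, and pass them as context to the generator to create an answer.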

We'll generate the questions and answers for each piece of context synthetically, using OpenAI's `gpt-3.5-turbo` through `llama-index`.


## Generating Synthetic Training Data with LlamaIndex

Let's generate some synthetic data using LlamaIndex. We'll do this with `gpt-3.5-turbo` and then use the resulting data for fine-tuning. First, install the dependencies for this process.

In [4]:
# The imports in this notebook use the pre-0.10 llama_index package layout
!pip install -U -q "llama-index<0.10" pypdf

### Loading Data

In [30]:
TRAINING_FILES = ["./book/the_culture_difference.pdf"]

Now that we have our data, let's organize it into the format we need for generating synthetic questions and responses.

In [31]:
from llama_index import SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser
from llama_index.schema import MetadataMode
import re
import pandas as pd

def load_corpus(files, verbose=False):
    if verbose:
        print(f"Loading files {files}")

    reader = SimpleDirectoryReader(input_files=files)
    docs = reader.load_data()
    if verbose:
        print(f'Loaded {len(docs)} docs')

    parser = SimpleNodeParser.from_defaults()
    nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)

    if verbose:
        print(f'Parsed {len(nodes)} nodes')

    # Clean the text of each node
    processed_corpus = {}
    for node in nodes:
        text = node.get_content(metadata_mode=MetadataMode.NONE)
        # Lowercase, then replace line breaks and collapse repeated whitespace
        text = text.lower()
        text = re.sub(r'\n', ' ', text)
        text = re.sub(r'\s+', ' ', text).strip()
        processed_corpus[node.node_id] = text

    return processed_corpus
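
To see what the cleaning step inside `load_corpus` does to a node's text, here is a minimal standalone sketch. The helper name `normalize_text` and the sample string are made up for illustration; only the lowercasing and whitespace handling mirror the function above.

```python
import re


def normalize_text(text: str) -> str:
    """Lowercase the text and collapse every run of whitespace into one space."""
    text = text.lower()
    # \s+ also matches newlines, so line breaks become single spaces too
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample resembling text extracted from a PDF page
sample = "The Culture\nMap:\n\n  Breaking   Through "
print(normalize_text(sample))  # the culture map: breaking through
```

Because the cleaned text is what later ends up in the `Abstract` column, the lowercasing carries through to the training data.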


In [32]:
train_corpus = load_corpus(TRAINING_FILES, verbose=True)

Loading files ['/Users/ulianaliakh/Documents/DS/Final Project/book/meyer_erin_the_culture_map_breaking_through_the_invisible_bo_1-18/meyer_erin_the_culture_map_breaking_through_the_invisible_bo-2.pdf']
Loaded 148 docs


Parsing documents into nodes:   0%|          | 0/148 [00:00<?, ?it/s]

Parsed 148 nodes


### Creating Synthetic QA Pairs

We use OpenAI's `gpt-3.5-turbo` model to help us generate some QA pairs.

In [26]:
import re
import uuid

from llama_index.llms import OpenAI
from tqdm.notebook import tqdm

In [31]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key: ")

### Generating Queries

Let's use a helper function to generate questions for each chunk of context.

We're going to use this prompt:

```
Context information is below.
    
---------------------
{context_str}
---------------------

Given the context information and not prior knowledge.
generate only questions based on the below query.

You are a Teacher/ Professor. Your task is to setup \
{num_questions_per_chunk} questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided.
```

We can control how many questions are generated per chunk through `num_questions_per_chunk`, and change the persona used to write them by passing a custom `prompt_template`.

The rest of the helper function is simply parsing the questions!

In [41]:
def generate_queries(
    corpus,
    num_questions_per_chunk=2,
    prompt_template=None,
    verbose=False,
):
    """
    Automatically generate hypothetical questions that could be answered with
    doc in the corpus.
    """
    llm = OpenAI(model='gpt-3.5-turbo')

    prompt_template = prompt_template or """\
    Context information is below.

    ---------------------
    {context_str}
    ---------------------

    Given the context information and not prior knowledge.
    generate only questions based on the below query.

    You are a Teacher/ Professor. Your task is to setup \
    {num_questions_per_chunk} questions for an upcoming \
    quiz/examination. The questions should be diverse in nature \
    across the document. Restrict the questions to the \
    context information provided.
    """

    queries = {}
    relevant_docs = {}
    for node_id, text in tqdm(corpus.items()):
        query = prompt_template.format(context_str=text, num_questions_per_chunk=num_questions_per_chunk)
        response = llm.complete(query)

        result = str(response).strip().split("\n")
        questions = [
            re.sub(r"^\d+[\).\s]", "", question).strip() for question in result
        ]
        questions = [question for question in questions if len(question) > 0]

        for question in questions:
            question_id = str(uuid.uuid4())
            queries[question_id] = question
            relevant_docs[question_id] = [node_id]
    return queries, relevant_docs
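
The parsing step at the end of `generate_queries` can be tried on its own, without calling the API. The sketch below applies the same regular expression to a made-up response; `parse_questions` and the sample text are hypothetical and only illustrate how list numbering and blank lines are removed.

```python
import re


def parse_questions(response_text: str) -> list:
    """Split a response into lines, strip leading list numbering, drop empty lines."""
    lines = response_text.strip().split("\n")
    questions = [re.sub(r"^\d+[\).\s]", "", line).strip() for line in lines]
    return [question for question in questions if len(question) > 0]


# Hypothetical model output with mixed numbering styles and a blank line
sample_response = (
    "1. What is low-context communication?\n"
    "2) How do cultures differ in giving feedback?\n"
    "\n"
    "3 Why does trust matter?"
)

for question in parse_questions(sample_response):
    print(question)
```

Every non-empty line is treated as a question, so any extra line the model adds (an introduction, for instance) would also be kept as a question.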

Generate the questions for every node in the corpus:

In [52]:
train_queries, train_relevant_docs = generate_queries(train_corpus, 1)

  0%|          | 0/148 [00:00<?, ?it/s]

In [53]:
train_dataset = {
    'Question': train_queries,
    'Corpus': train_corpus,
    'Abstract': train_relevant_docs,
}

In [54]:
dataset = train_dataset

corpus = dataset['Corpus']
queries = dataset['Question']
relevant_docs = dataset['Abstract']

examples = []
for query_id, query in queries.items():
    node_id = relevant_docs[query_id][0]
    text = corpus[node_id]
    example = {"Question" : query, "Abstract" : text}
    examples.append(example)
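
The three dictionaries have different shapes, which is easy to lose track of. The following sketch repeats the join above on toy data: `queries` maps a question ID to its text, `relevant_docs` maps a question ID to a list of node IDs, and `corpus` maps a node ID to the passage. The IDs, texts, and the function name `build_examples` are placeholders, not values from the real dataset.

```python
# Hypothetical toy data with the same shapes as the real dictionaries
corpus = {"node-a": "text of passage a", "node-b": "text of passage b"}
queries = {
    "q-1": "Question about a?",
    "q-2": "Another question about a?",
    "q-3": "Question about b?",
}
relevant_docs = {"q-1": ["node-a"], "q-2": ["node-a"], "q-3": ["node-b"]}


def build_examples(queries, relevant_docs, corpus):
    """Pair each question with the text of the first node it was generated from."""
    examples = []
    for query_id, query in queries.items():
        node_id = relevant_docs[query_id][0]
        examples.append({"Question": query, "Abstract": corpus[node_id]})
    return examples


examples = build_examples(queries, relevant_docs, corpus)
print(len(examples))  # 3
```

Several questions can point at the same node, so the same abstract may appear in more than one row.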

In [55]:
question_abstract_pair_df = pd.DataFrame(examples)

In [56]:
question_abstract_pair_df.to_csv("./question_abstract_pair_2.csv")

In [57]:
question_abstract_pair_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458 entries, 0 to 1457
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Question  1458 non-null   object
 1   Abstract  1458 non-null   object
dtypes: object(2)
memory usage: 22.9+ KB


### Generating Answers

We'll repeat the process and create an answer for each question as well.

In [10]:
import pandas as pd
df = pd.read_csv('question_abstract_pair_2.csv')
df = df.drop(['Unnamed: 0'], axis=1)
df = df.sample(frac=1).reset_index(drop=True)  # Shuffle the rows

In [11]:
list_of_dicts = df.to_dict(orient='records')  # Convert the DataFrame into a list of dictionaries
#examples = []
#examples = list_of_dicts[:300]

In [5]:
import re

from llama_index.llms import OpenAI

def generate_answer(
    query,
    context,
    prompt_template=None,
    verbose=False,
):
    """
    Generate an answer to `query` using only the passage in `context`.
    """
    llm = OpenAI(model='gpt-3.5-turbo')

    prompt_template = prompt_template or """\
    Context information is below.

    ---------------------
    {context_str}
    ---------------------

    Given the context information and not prior knowledge.
    generate only answers based on the below query.

    You are a Teacher/ Professor. Your task is to answer \
    questions for an upcoming quiz/examination. Restrict \
    your answers based on the context information provided. \
    If you do not know the answer, simply answer: "I don't know"

    Query: {query_str}
    """
    # The passage fills {context_str}; the question fills {query_str}
    full_query = prompt_template.format(context_str=context, query_str=query)
    response = llm.complete(full_query)

    # Keep only the first non-empty line of the response, minus any list numbering
    result = str(response).strip().split("\n")
    answers = [
        re.sub(r"^\d+[\).\s]", "", answer).strip() for answer in result
    ]
    answers = [answer for answer in answers if len(answer) > 0]
    return answers[0] if answers else ""
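
Before spending API calls, it can help to check that both the passage and the question actually end up in the text sent to the model. The sketch below only builds the prompt string. The template wording, the `{query_str}` placeholder name, and the helper `build_answer_prompt` are assumptions made for illustration, not the notebook's prompt.

```python
# Assumed template: one placeholder for the passage, one for the question
ANSWER_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Answer the query using only the context above.\n"
    "Query: {query_str}\n"
)


def build_answer_prompt(query, context, template=ANSWER_TEMPLATE):
    """Fill the passage and the question into the template."""
    return template.format(context_str=context, query_str=query)


prompt = build_answer_prompt(
    query="What does the passage say about feedback?",
    context="placeholder passage text",
)
print(prompt)
```

Printing one or two prompts this way makes it obvious if a placeholder is missing or if an argument was passed in the wrong slot.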

In [32]:
# Only the last 80 rows (index 1378 onward) are processed in this run
for example in tqdm(list_of_dicts[1378:]):
    example["Answer"] = generate_answer(example["Question"], example["Abstract"])

  0%|          | 0/80 [00:00<?, ?it/s]
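
Slicing by a fixed index works, but it has to be updated by hand after every interruption. One possible alternative, sketched here under assumptions, is to skip any record that already has an answer. `fill_missing_answers` and `fake_answer` are hypothetical names; `fake_answer` stands in for `generate_answer` so the sketch runs without an API key.

```python
def fill_missing_answers(records, answer_fn, key="Answer"):
    """Call answer_fn only for records without an answer; return how many were filled."""
    filled = 0
    for record in records:
        if record.get(key):  # already answered in an earlier run
            continue
        record[key] = answer_fn(record["Question"], record["Abstract"])
        filled += 1
    return filled


def fake_answer(question, abstract):
    """Stand-in for the real LLM call."""
    return f"stub answer to: {question}"


# Hypothetical records: one finished, one never processed, one with an empty answer
records = [
    {"Question": "q1", "Abstract": "a1", "Answer": "done earlier"},
    {"Question": "q2", "Abstract": "a2"},
    {"Question": "q3", "Abstract": "a3", "Answer": ""},
]

print(fill_missing_answers(records, fake_answer))  # 2
```

Running the same call a second time fills nothing, which makes the loop safe to restart.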

### Convert to DataFrame Format

In [35]:
import pandas as pd

train_df = pd.DataFrame(list_of_dicts)

In [36]:
train_df.to_csv("./culture_train_full.csv")

In [37]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458 entries, 0 to 1457
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Question  1458 non-null   object
 1   Abstract  1458 non-null   object
 2   Answer    1458 non-null   object
dtypes: object(3)
memory usage: 34.3+ KB


In [50]:
repeated_answer_count = 0

# value_counts() yields how many times each distinct answer occurs
for count in train_df['Answer'].value_counts():
    if count > 1:
        repeated_answer_count += 1

print(f'{repeated_answer_count} distinct answers appear more than once in the Answer column')

9 distinct answers appear more than once in the Answer column


In [55]:
value_counts = train_df['Answer'].value_counts()
values_to_remove = value_counts[value_counts > 1].index
values_to_remove

Index(['I don't know.',
       'As a Teacher/Professor, I can provide the following answer based on the context information provided:',
       'As a Teacher/Professor, I can provide the following answer based on the given context information:',
       'As a Teacher/Professor, I would answer the question as follows:',
       'I don't know',
       'As a teacher/professor, I can provide the following answer based on the context information provided:',
       'As a Teacher/Professor, based on the context information provided, I can provide the following answer:',
       'As a Teacher/Professor, I can provide the following answer based on the given context:',
       'As a teacher/professor, I would answer the question as follows:'],
      dtype='object', name='Answer')

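Remove every example whose answer appears more than once. As the list above shows, these repeated answers are either "I don't know" or a preamble with no actual answer.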

In [56]:
train_df = train_df[~train_df['Answer'].isin(values_to_remove)]
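
The same filter can be written in plain Python, which makes the rule explicit: count how often each answer occurs, then keep only the rows whose answer occurs exactly once. This is a sketch of the logic, not a replacement for the pandas call; `drop_repeated_answers` and the sample records are hypothetical.

```python
from collections import Counter


def drop_repeated_answers(records, key="Answer"):
    """Keep only the records whose answer text is unique across the whole list."""
    counts = Counter(record[key] for record in records)
    return [record for record in records if counts[record[key]] == 1]


# Hypothetical records: two share the same non-answer
records = [
    {"Question": "q1", "Answer": "I don't know"},
    {"Question": "q2", "Answer": "A placeholder answer with real content."},
    {"Question": "q3", "Answer": "I don't know"},
]

kept = drop_repeated_answers(records)
print([record["Question"] for record in kept])  # ['q2']
```

The match is on the exact string, so "I don't know" and "I don't know." count as different answers, and a refusal that occurs only once would stay in the data.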

In [60]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 982 entries, 1 to 1457
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Question  982 non-null    object
 1   Abstract  982 non-null    object
 2   Answer    982 non-null    object
dtypes: object(3)
memory usage: 30.7+ KB


In [61]:
train_df.to_csv("./culture_train_982_clean.csv")