In [2]:
import configparser, os
import numpy as np
import openai
import pandas as pd
import pickle
import tiktoken

# Load the configuration file and set the API key
config = configparser.ConfigParser()
config.read('../config.ini')
openai.api_key = config['openai']['api_key']
os.environ['OPENAI_API_KEY'] = config['openai']['api_key']

# Model Fine-Tuning
In this notebook we fine-tune GPT-3 on an example dataset.
We can then ask the fine-tuned model questions about that dataset.

![Model Selection](../images/fine_tuning_model_selection.png)


*Illustrative examples of text classification performance on the Stanford Natural Language Inference (SNLI) Corpus, in which ordered pairs of sentences are classified by their logical relationship: either contradicted, entailed (implied), or neutral. Default fine-tuning parameters were used when not otherwise specified.*

For complex tasks that require subtle interpretation, reasoning, prior knowledge, or coding ability, the performance gaps between models can be larger, and more capable models such as curie or text-davinci-002 may be the best fit.

**A single project might end up trying all models. One illustrative development path might look like this:**
- Test code using the cheapest & fastest model (ada)
- Run a few early experiments to check whether your dataset works as expected with a middling model (curie)
- Run a few more experiments with the best model to see how far you can push performance (text-davinci-002)
- Once you have good results, do a training run with all models to map out the price-performance frontier and select the model that makes the most sense for your use case (ada, babbage, curie, davinci, text-davinci-002)
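
As a rough illustration of the last step, the price-performance frontier can be read as the set of models that no other model beats on both cost and quality. The sketch below assumes you have already measured a cost and an accuracy per model; the names `model_a` to `model_d`, the numbers, and the helper `price_performance_frontier` are made-up placeholders, not measured results.

```python
# Placeholder (relative cost, accuracy) pairs: invented for illustration only.
results = {
    "model_a": (1.0, 0.80),
    "model_b": (1.5, 0.78),
    "model_c": (5.0, 0.86),
    "model_d": (50.0, 0.90),
}


def price_performance_frontier(results):
    """Return the models that are not dominated by any other model.

    A model is dominated when another model is at least as cheap and at
    least as accurate, and strictly better on one of the two.
    """
    frontier = []
    for name, (cost, acc) in results.items():
        dominated = any(
            o_cost <= cost and o_acc >= acc and (o_cost < cost or o_acc > acc)
            for other, (o_cost, o_acc) in results.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    # Cheapest first, so the list reads along the frontier.
    return sorted(frontier, key=lambda n: results[n][0])


frontier = price_performance_frontier(results)
```

With these placeholder numbers `model_b` drops out, because `model_a` is both cheaper and more accurate; the remaining models are the candidates worth weighing against your budget.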

**Another possible development path that uses multiple models could be:**
- Starting with a small dataset, train the best possible model (text-davinci-002)
- Use this fine-tuned model to generate many more labels and expand your dataset by multiples
- Use this new dataset to train a cheaper model (ada)


More info about model fine-tuning can be found [here](https://docs.google.com/document/d/1rqj7dkuvl7Byd5KQPUJRxc19BJt8wo0yHNwK84KfU3Q/edit#).

The dataset used in this notebook is FAQ data from https://coolblue.be. The questions we ask the model later on all relate to this data.

## Step 1: Preprocess the FAQ data
The preprocessing step is crucial for ensuring that the FAQ data is in a format that can be easily processed by the model. Here are some common preprocessing steps for text data:

1. **Lowercase the text**: Converting all text to lowercase can help reduce the size of the vocabulary and make the model more robust to variations in capitalization.

2. **Tokenize the text**: Tokenization involves splitting the text into individual tokens (e.g., words or subwords) that can be fed into the model. This can be done using a tokenizer, such as the one provided by the Hugging Face Transformers library.

3. **Remove punctuation and special characters**: Removing punctuation and special characters can help reduce the size of the vocabulary and make the model more robust to variations in the input.

4. **Remove stop words**: Stop words are common words that are unlikely to carry much meaning, such as "a", "an", "the", etc. Removing stop words can reduce the size of the vocabulary and improve the efficiency of the model.

5. **Convert words to IDs**: The model operates on numbers, not words, so we need to convert each word in the text to a unique integer ID. This can be done using a vocabulary, which maps each word to an ID.
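
The five steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the preprocessing used by any particular model: the stop-word list is a tiny hand-picked set, tokenization is a simple whitespace split, and the helper names `preprocess` and `build_vocab` are hypothetical. Hosted models such as the ones used later in this notebook accept raw text and tokenize it themselves.

```python
import string

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "i", "my"}


def preprocess(text):
    """Lowercase, tokenize, strip punctuation, and drop stop words."""
    tokens = text.lower().split()                      # steps 1 and 2
    tokens = [t.strip(string.punctuation) for t in tokens]  # step 3
    return [t for t in tokens if t and t not in STOP_WORDS]  # step 4


def build_vocab(token_lists):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


questions = ["Where can I find my invoice?", "Can I pay afterwards?"]
tokenized = [preprocess(q) for q in questions]
vocab = build_vocab(tokenized)
ids = [[vocab[t] for t in tokens] for tokens in tokenized]  # step 5
```

Here `tokenized` holds the cleaned word lists and `ids` holds the same sentences as integer sequences, with the shared word "can" mapped to the same ID in both.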

In [3]:
COMPLETIONS_MODEL = "text-davinci-003"
EMBEDDING_MODEL = "text-embedding-ada-002"

In [5]:
prompt = """Answer the question only when the answer is in the training data, and if you're unsure of the answer, say "I can't help you with this, please contact customer support at support@coolblue.be".

Context:
You can pay online with Bancontact (card and app), credit card (Visa or MasterCard), PayPal, Apple Pay, Coolblue gift cards, or a bank transfer.
In the store, you can easily pay with Bancontact, credit card, cash, Apple Pay, Coolblue gift cards, Consumption Passes, and EcoCheques.

Q: What kind of payment methods are supported?
A:"""


openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

'You can pay online with Bancontact (card and app), credit card (Visa or MasterCard), PayPal, Apple Pay, Coolblue gift cards, or a bank transfer. In the store, you can easily pay with Bancontact, credit card, cash, Apple Pay, Coolblue gift cards, Consumption Passes, and EcoCheques.'

In [18]:
df = pd.read_csv("../data_files/coolblue_faq.csv")
df = df.set_index(["Title","Subject"])
print(f"{len(df)} rows in the data.")
df.sample(5)

9 rows in the data.


| Title | Subject | Answer |
|---|---|---|
| Invoices | Where can I find my invoice? | We always include the invoice in the shipping ... |
| Invoices | Can I have products delivered at a VAT rate of 0%? | No, it's not possible to order products at a V... |
| Warranty and repairs\n | Can I have my product repaired in the store? | Yes, you can drop off your product in one of o... |
| Ordering | Can I pay afterwards? | Yes, via a credit card. When you place an orde... |
| Ordering | What does Second Chance mean? | A Second Chance product is a product that has ... |


In [23]:
def get_embedding(text: str, model: str=EMBEDDING_MODEL) -> list[float]:
    result = openai.Embedding.create(
      model=model,
      input=text
    )
    return result["data"][0]["embedding"]

def compute_doc_embeddings(df: pd.DataFrame) -> dict[tuple[str, str], list[float]]:
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.
    
    Return a dictionary that maps the index of each row, a (Title, Subject) pair, to the embedding vector of that row's answer.
    """
    return {
        idx: get_embedding(r.Answer) for idx, r in df.iterrows()
    }

In [24]:
def load_embeddings(fname: str) -> dict[tuple[str, str], list[float]]:
    """
    Read the document embeddings and their keys from a CSV.
    
    fname is the path to a CSV with exactly these named columns: 
        "Title", "Subject", "0", "1", ... up to the length of the embedding vectors.
    """
    
    df = pd.read_csv(fname, header=0)
    # Every column other than the two key columns is an embedding dimension named by its index.
    max_dim = max(int(c) for c in df.columns if c not in ("Title", "Subject"))
    return {
        (r.Title, r.Subject): [r[str(i)] for i in range(max_dim + 1)] for _, r in df.iterrows()
    }
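
To avoid recomputing embeddings on every run, they could be written to a CSV in the column layout that `load_embeddings` expects. The helper below is a hypothetical sketch, not part of the original notebook; it assumes a non-empty dictionary keyed by (Title, Subject) pairs whose vectors all have the same length.

```python
import csv


def save_embeddings(embeddings, fname):
    """Write {(title, subject): vector} to a CSV with columns Title, Subject, 0, 1, ..."""
    if not embeddings:
        raise ValueError("no embeddings to save")
    # Assumption: every vector has the same length as the first one.
    dim = len(next(iter(embeddings.values())))
    with open(fname, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Title", "Subject"] + [str(i) for i in range(dim)])
        for (title, subject), vector in embeddings.items():
            writer.writerow([title, subject] + list(vector))
```

A call such as `save_embeddings(document_embeddings, "embeddings.csv")` (the file name is only an example) would then produce a file that can be read back on a later run.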

In [25]:
document_embeddings = compute_doc_embeddings(df)

In [31]:
# An example embedding:
example_entry = list(document_embeddings.items())[0]
print(f"{example_entry[0]} : {example_entry[1][:5]}... ({len(example_entry[1])} entries)")

('Warranty and repairs\n', 'What type of warranty do I have?') : [0.018135065212845802, -0.005953091196715832, -0.010298511013388634, -0.01627850905060768, -0.035086240619421005]... (1536 entries)


So we have split our document library into sections, and encoded them by creating embedding vectors that represent each chunk. Next we will use these embeddings to answer our users' questions.

## Step 2: Find the most similar document embeddings to the question embedding

To answer a user's question, we first compute the embedding of the question and use it to find the most similar document sections. Since this is a small example, we store and search the embeddings locally. If you have a larger dataset, consider using a vector search engine like [Pinecone](https://www.pinecone.io/) or [Weaviate](https://github.com/semi-technologies/weaviate) to power the search.
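
A local search can be as small as scoring every stored vector against the query and keeping the best few. The sketch below is one possible way to do that with the standard library; `top_k_sections` is a hypothetical helper and the two-dimensional vectors are toy placeholders standing in for real 1536-dimensional embeddings.

```python
import heapq


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def top_k_sections(query_embedding, contexts, k=3):
    """Return the k (similarity, key) pairs with the highest dot product, best first."""
    scored = ((dot(query_embedding, emb), key) for key, emb in contexts.items())
    return heapq.nlargest(k, scored)


# Toy unit-length vectors (placeholders, not real embeddings).
toy_contexts = {
    ("Ordering", "Can I pay afterwards?"): [1.0, 0.0],
    ("Invoices", "Where can I find my invoice?"): [0.6, 0.8],
    ("Warranty", "Is my repair free?"): [0.0, 1.0],
}
top = top_k_sections([1.0, 0.0], toy_contexts, k=2)
```

`heapq.nlargest` avoids sorting the whole collection when only a handful of results are needed, which is usually enough for a few thousand sections before a dedicated vector store becomes worthwhile.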

In [32]:
def vector_similarity(x: list[float], y: list[float]) -> float:
    """
    Returns the similarity between two vectors.
    
    Because OpenAI Embeddings are normalized to length 1, the cosine similarity is the same as the dot product.
    """
    return np.dot(np.array(x), np.array(y))

def order_document_sections_by_query_similarity(query: str, contexts: dict[tuple[str, str], list[float]]) -> list[tuple[float, tuple[str, str]]]:
    """
    Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings
    to find the most relevant sections. 
    
    Return the list of document sections, sorted by relevance in descending order.
    """
    query_embedding = get_embedding(query)
    
    document_similarities = sorted([
        (vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in contexts.items()
    ], reverse=True)
    
    return document_similarities
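
The claim in the `vector_similarity` docstring, that the dot product equals the cosine similarity for unit-length vectors, can be checked with a small standalone sketch. The helper names and the example vectors are illustrative only.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    return math.sqrt(dot(x, x))


def cosine_similarity(x, y):
    """Dot product divided by the product of the vector lengths."""
    return dot(x, y) / (norm(x) * norm(y))


def normalize(x):
    """Scale a vector to length 1."""
    n = norm(x)
    return [a / n for a in x]


raw_a, raw_b = [3.0, 4.0], [1.0, 2.0]
unit_a, unit_b = normalize(raw_a), normalize(raw_b)

# For unit vectors both norms are 1, so the division changes nothing.
same_for_unit = math.isclose(dot(unit_a, unit_b), cosine_similarity(unit_a, unit_b))
# For unnormalized vectors the two quantities differ.
same_for_raw = math.isclose(dot(raw_a, raw_b), cosine_similarity(raw_a, raw_b))
```

If embeddings from another source are not normalized, the plain dot product would favour longer vectors, so dividing by the norms (or normalizing once up front) would be needed.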

In [33]:
order_document_sections_by_query_similarity("What payment methods are available?", document_embeddings)[:5]

[(0.8182337267862931, ('Ordering', 'Can I pay afterwards?')),
 (0.7814901015767994, ('Invoices', 'Where can I find my invoice?')),
 (0.7755437010341466,
  ('Invoices', 'Can I have products delivered at a VAT rate of 0%?')),
 (0.7627375648790945, ('Invoices', "What if I didn't receive an invoice?")),
 (0.7482394073512559, ('Ordering', 'How do I cancel my order?'))]

In [34]:
order_document_sections_by_query_similarity("What types of warranty are there ?", document_embeddings)[:5]

[(0.8437506899031103,
  ('Warranty and repairs\n', 'What type of warranty do I have?')),
 (0.8123131144105407, ('Warranty and repairs\n', 'Is my repair free?')),
 (0.7774787621501347,
  ('Warranty and repairs\n', 'Can I have my product repaired in the store?')),
 (0.740717769143366, ('Ordering', 'What does Second Chance mean?')),
 (0.7406376137237811, ('Ordering', 'Can I pay afterwards?'))]

## Step 3: Add the most relevant document sections to the query prompt

Once we've calculated the most relevant pieces of context, we construct a prompt by simply prepending them to the supplied query. A separator between the sections helps the model distinguish the separate pieces of text.

In [None]:
MAX_SECTION_LEN = 500
SEPARATOR = "\n* "
ENCODING = "p50k_base"  # encoding for text-davinci-003

encoding = tiktoken.get_encoding(ENCODING)
separator_len = len(encoding.encode(SEPARATOR))

f"Context separator contains {separator_len} tokens"
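
With a section budget and a separator defined, one way to assemble the prompt is to add sections in order of relevance until the budget is used up. The sketch below is an assumption about how that could look, not the notebook's own implementation: `build_context`, `build_prompt`, and `count_words` are hypothetical names, and the word-count function is a crude stand-in for a real tokenizer. In the notebook, a function such as `lambda s: len(encoding.encode(s))` could be passed instead.

```python
def build_context(ranked_keys, section_texts, count_tokens, max_len=500, separator="\n* "):
    """Concatenate the highest-ranked sections until the token budget is used up."""
    separator_len = count_tokens(separator)
    chosen, used = [], 0
    for key in ranked_keys:
        text = section_texts[key].replace("\n", " ")
        cost = count_tokens(text) + separator_len
        if used + cost > max_len:
            break  # stop at the first section that would exceed the budget
        chosen.append(separator + text)
        used += cost
    return "".join(chosen)


def build_prompt(question, context):
    """Prepend the selected context to the question."""
    return f"Context:{context}\n\nQ: {question}\nA:"


def count_words(text):
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


toy_sections = {"a": "one two three", "b": "four five six seven", "c": "eight"}
context = build_context(["a", "b", "c"], toy_sections, count_words, max_len=6)
prompt = build_prompt("What comes after two?", context)
```

Stopping at the first section that does not fit keeps the selection strictly in relevance order; skipping it and trying shorter, lower-ranked sections would be an alternative design.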