In [2]:
import configparser, os
import numpy as np
import openai
import pandas as pd
import pickle
import tiktoken

# Load the configuration file and set the API key
config = configparser.ConfigParser()
config.read('../config.ini')
openai.api_key = config['openai']['api_key']
os.environ['OPENAI_API_KEY'] = config['openai']['api_key']

# Model Fine-Tuning
In this notebook we fine-tune a GPT-3 model on an example dataset.
We can then ask the fine-tuned model questions related to that dataset.

![Model Selection](../images/fine_tuning_model_selection.png)


*Illustrative examples of text classification performance on the Stanford Natural Language Inference (SNLI) Corpus, in which ordered pairs of sentences are classified by their logical relationship: either contradicted, entailed (implied), or neutral. Default fine-tuning parameters were used when not otherwise specified.*

For complex tasks that require subtle interpretation, reasoning, prior knowledge, or coding ability, the performance gaps between models can be larger, and more capable models such as curie or text-davinci-002 may be the best fit.

**A single project might end up trying all models. One illustrative development path might look like this:**
- Test code using the cheapest & fastest model (ada)
- Run a few early experiments to check whether your dataset works as expected with a middling model (curie)
- Run a few more experiments with the best model to see how far you can push performance (text-davinci-002)
- Once you have good results, do a training run with all models to map out the price-performance frontier and select the model that makes the most sense for your use case (ada, babbage, curie, davinci, text-davinci-002)

**Another possible development path that uses multiple models could be:**
- Starting with a small dataset, train the best possible model (text-davinci-002)
- Use this fine-tuned model to generate many more labels and expand your dataset by multiples
- Use this new dataset to train a cheaper model (ada)


More information about model fine-tuning can be found in [this guide](https://docs.google.com/document/d/1rqj7dkuvl7Byd5KQPUJRxc19BJt8wo0yHNwK84KfU3Q/edit#).

The dataset used in this notebook is FAQ data from https://coolblue.be. After fine-tuning, we ask the model questions related to that data.
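As a rough illustration of what fine-tuning data for the GPT-3 base models can look like, the sketch below turns question/answer pairs into JSON Lines records with `prompt` and `completion` fields. The separator, the stop sequence, the helper names (`to_finetune_record`, `to_jsonl`) and the example questions are assumptions made for this sketch; the answers are taken from the FAQ context used later in this notebook.

```python
import json

# Assumed conventions for this sketch: a fixed separator that ends every prompt
# and a fixed stop sequence that ends every completion.
PROMPT_SEPARATOR = "\n\n###\n\n"
STOP_SEQUENCE = " END"


def to_finetune_record(question, answer):
    """Build one prompt/completion record (hypothetical helper)."""
    return {
        "prompt": question.strip() + PROMPT_SEPARATOR,
        # The completion starts with a space and ends with the stop sequence.
        "completion": " " + answer.strip() + STOP_SEQUENCE,
    }


def to_jsonl(pairs):
    """Serialize (question, answer) pairs as JSON Lines, one record per line."""
    return "\n".join(
        json.dumps(to_finetune_record(q, a), ensure_ascii=False) for q, a in pairs
    )


# Illustrative pairs: the questions are invented for this example.
faq_pairs = [
    (
        "How can I pay online?",
        "You can pay online with Bancontact (card and app), credit card "
        "(Visa or MasterCard), PayPal, Apple Pay, Coolblue gift cards, "
        "or a bank transfer.",
    ),
    (
        "How can I pay in the store?",
        "In the store, you can easily pay with Bancontact, credit card, cash, "
        "Apple Pay, Coolblue gift cards, Consumption Passes, and EcoCheques.",
    ),
]

jsonl_text = to_jsonl(faq_pairs)
print(jsonl_text)
```

Each printed line is one training example; writing `jsonl_text` to a `.jsonl` file would give a file in the shape the fine-tuning tooling expects, assuming the conventions above match the model being tuned.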

## Step 1: Preprocess the FAQ data
The preprocessing step is crucial for ensuring that the FAQ data is in a format that can be easily processed by the model. Here are some common preprocessing steps for text data:

1. **Lowercase the text**: Converting all text to lowercase can help reduce the size of the vocabulary and make the model more robust to variations in capitalization.

2. **Tokenize the text**: Tokenization involves splitting the text into individual tokens (e.g., words or subwords) that can be fed into the model. This can be done using a tokenizer, such as the one provided by the Hugging Face Transformers library.

3. **Remove punctuation and special characters**: Removing punctuation and special characters can help reduce the size of the vocabulary and make the model more robust to variations in the input.

4. **Remove stop words**: Stop words are common words that are unlikely to carry much meaning, such as "a", "an", and "the". Removing stop words can reduce the size of the vocabulary and improve the efficiency of the model.

5. **Convert words to IDs**: The model operates on numbers, not words, so we need to convert each word in the text to an integer ID. This can be done using a vocabulary, which maps each distinct word to a unique ID.
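The steps above can be sketched with the standard library alone. This is a minimal illustration, not the tokenizer any particular model uses: whitespace splitting stands in for a real subword tokenizer, punctuation is removed before splitting for simplicity, and `STOP_WORDS`, `preprocess`, `build_vocab`, and `encode` are names invented for this example. The sample sentences are adapted from the FAQ context used later in the notebook.

```python
import re

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the"}


def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocab(token_lists):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Map a list of tokens to their integer IDs."""
    return [vocab[t] for t in tokens]


docs = [
    "Where can I find my invoice ?",
    "In the store, you can pay with cash.",
]
tokenized = [preprocess(d) for d in docs]
vocab = build_vocab(tokenized)
encoded = [encode(tokens, vocab) for tokens in tokenized]

print(tokenized)
print(encoded)
```

Tokens shared between sentences, such as "can", receive the same ID in both, which is the point of building a single vocabulary over the whole dataset.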

In [3]:
COMPLETIONS_MODEL = "text-davinci-003"
EMBEDDING_MODEL = "text-embedding-ada-002"

In [24]:
prompt = """Answer the question only when the answer is in the training data, and if you're unsure of the answer, say "Sorry, I don't know".

Context:
You can pay online with Bancontact (card and app), credit card (Visa or MasterCard), PayPal, Apple Pay, Coolblue gift cards, or a bank transfer.
In the store, you can easily pay with Bancontact, credit card, cash, Apple Pay, Coolblue gift cards, Consumption Passes, and EcoCheques.

Q: Where can I find my invoice ?
A:"""


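# Query the completions model; temperature=0 keeps the answer as deterministic as possible.
# Note: openai.Completion is the legacy interface of the openai package (versions before 1.0).
openai.Completion.create(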
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

"Sorry, I don't know."
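The prompt in the cell above is written out by hand. As a small sketch, the same layout (instruction, context passages, then the question) could be assembled by a helper so that different questions and contexts can be tried quickly. `INSTRUCTION` repeats the instruction text from the cell above; `build_prompt` is a hypothetical helper introduced for this example, and no API call is made here.

```python
INSTRUCTION = (
    "Answer the question only when the answer is in the training data, "
    "and if you're unsure of the answer, say \"Sorry, I don't know\"."
)


def build_prompt(context_lines, question):
    """Assemble instruction, context passages, and question into one prompt."""
    context = "\n".join(line.strip() for line in context_lines)
    return f"{INSTRUCTION}\n\nContext:\n{context}\n\nQ: {question.strip()}\nA:"


context_lines = [
    "You can pay online with Bancontact (card and app), credit card "
    "(Visa or MasterCard), PayPal, Apple Pay, Coolblue gift cards, "
    "or a bank transfer.",
    "In the store, you can easily pay with Bancontact, credit card, cash, "
    "Apple Pay, Coolblue gift cards, Consumption Passes, and EcoCheques.",
]

example_prompt = build_prompt(context_lines, "Where can I find my invoice ?")
print(example_prompt)
```

The resulting string can be passed as the `prompt` argument in the call shown above. Since the context only covers payment methods, a question about invoices falls outside it, which is consistent with the "Sorry, I don't know." answer.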