# GPT3 Text Generation

This notebook fine-tunes a GPT-3 model (davinci) on a custom dataset through the OpenAI API. The fine-tuned model, which builds on the pretrained base model, is then used to generate predictions.

## Installing required packages



In [None]:
!pip install --upgrade openai

Installing collected packages: multidict, frozenlist, charset-normalizer, async-timeout, yarl, aiosignal, aiohttp, openai
Successfully installed aiohttp-3.8.4 aiosignal-1.3.1 async-timeout-4.0.2 charset-normalizer-3.1.0 frozenlist-1.3.3 multidict-6.0.4 openai-0.27.2 yarl-1.8.2


## Importing required packages

In [None]:
import os 
import pandas as pd
import openai
import json
# Never commit a real key; replace the placeholder at runtime or set the variable outside the notebook
os.environ['OPENAI_API_KEY'] = '<YOUR_OPENAI_API_KEY>'
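
A key written into a notebook cell is easy to leak when the notebook is shared. The following is a minimal sketch of reading the key from the environment and failing early when it is absent. The helper name `load_api_key` is hypothetical and is not part of the `openai` library.

```python
import os


def load_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {name} environment variable before running this notebook."
        )
    return key
```

In the notebook this could be used as `openai.api_key = load_api_key()`, assuming the variable was set before the kernel started.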

## Collecting data

In [None]:
# Sample training data
data = {'prompt': ["The Taktika of Nikephoros Ouranos", 
                   "By Murray Dahm"],
        'completion': ["The tradition of writing didactic military handbooks stretched back to the fourth century BC; some even considered that it began with Homer.",
                       "Nikephoros’ text is 178 chapters long, compiled from classical and previous Byzantine authors"]}
# create dataframe and save the data as csv
data_df = pd.DataFrame(data)
data_df.to_csv('data.csv', index = False)

## Loading data

In [None]:
data = pd.read_csv('data.csv')

In [None]:
# Display data
data.head()

Unnamed: 0,prompt,completion
0,The Taktika of Nikephoros Ouranos,The tradition of writing didactic military han...
1,By Murray Dahm,"Nikephoros’ text is 178 chapters long, compiled..."


### Data formatting

In [None]:
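data['prompt'] = data.prompt.apply(lambda x: x + " \n\n###\n\n")  # append the separator " \n\n###\n\n" to each prompt
data['completion'] = data.completion.apply(lambda x: " " + x + " END")  # prepend a space and append the stop sequence " END"
# Note: these changes live only in memory. Write them back with data.to_csv('data.csv', index=False)
# before running prepare_data; otherwise the CLI reads the unformatted file.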

In [None]:
data.head()

Unnamed: 0,prompt,completion
0,The Taktika of Nikephoros Ouranos \n\n###\n\n,The tradition of writing didactic military ha...
1,By Murray Dahm \n\n###\n\n,"Nikephoros’ text is 178 chapters long, compile..."
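
The separator and stop suffixes only reach the fine-tuning job if they end up in the uploaded file. The sketch below applies the same two suffixes and writes the records directly as JSONL using only the standard library. The names `SEPARATOR`, `STOP`, `to_jsonl_records`, and `write_jsonl` are hypothetical helpers, not part of any OpenAI tooling.

```python
import json

SEPARATOR = " \n\n###\n\n"  # same suffix the formatting cell appends to prompts
STOP = " END"               # same suffix the formatting cell appends to completions


def to_jsonl_records(pairs):
    """Turn (prompt, completion) pairs into formatted fine-tuning records."""
    return [
        {"prompt": prompt + SEPARATOR, "completion": " " + completion + STOP}
        for prompt, completion in pairs
    ]


def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A file written this way has the one-object-per-line layout that `data_prepared.jsonl` uses, so it could presumably be passed to `fine_tunes.create -t` in the same way.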


## Data formatting for fine-tuning - preparing the dataset

In [None]:
!openai tools fine_tunes.prepare_data -f /content/data.csv 

Analyzing...

- Based on your file extension, your file is formatted as a CSV file
- Your file contains 2 prompt-completion pairs. In general, we recommend having at least a few hundred examples. We've found that performance tends to linearly increase for every doubling of the number of examples
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- Your data does not contain a common ending at the end of your completions. Having a common ending string appended to the end of the completion makes it clearer to the fine-tuned model where the completion should end. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset f

In [None]:
# Display a sample; file_path is the path of the prepared data
file_path = '/content/data_prepared.jsonl'

# Read the file line by line and parse each JSON line into a Python dict
with open(file_path, 'r') as f:
    data_prepared = [json.loads(line) for line in f]

# The records are now available in the `data_prepared` list
print(data_prepared[0])

{'prompt': 'The Taktika of Nikephoros Ouranos', 'completion': 'The tradition of writing didactic military handbooks stretched back to the fourth century BC; some even considered that it began with Homer.'}
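
The record printed above carries neither the prompt separator nor the ` END` suffix, so it is worth checking the prepared file before paying for a fine-tune. The following is a small sketch of such a check. `check_record` is a hypothetical helper, and the separator and stop strings are assumed to be the ones used in the formatting cell.

```python
def check_record(record, separator="\n\n###\n\n", stop=" END"):
    """Return a list of formatting problems found in one prompt/completion record."""
    problems = []
    prompt = record.get("prompt", "")
    completion = record.get("completion", "")
    if not prompt.endswith(separator):
        problems.append("prompt does not end with the separator")
    if not completion.startswith(" "):
        problems.append("completion does not start with a space")
    if not completion.endswith(stop):
        problems.append("completion does not end with the stop sequence")
    return problems
```

Running it over every entry, for example `[check_record(r) for r in data_prepared]`, gives an empty list for each record that is ready for training.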


## Fine-tuning the model - davinci

In [None]:
# Fine-tune the model: pass the path of the JSONL file (-t), the base model (-m), and the number of epochs
!openai api fine_tunes.create -t /content/data_prepared.jsonl -m davinci --n_epochs 1

Found potentially duplicated files with name 'data_prepared.jsonl', purpose 'fine-tune' and size 340 bytes
file-WE6BwyACagxxpSC5qaRLOPBY
Enter file ID to reuse an already uploaded file, or an empty string to upload this file anyway: 
Upload progress: 100% 340/340 [00:00<00:00, 407kit/s]
Uploaded file from /content/data_prepared.jsonl: file-Ed0DcFOfHZmQUiOaBPb33CnS
Created fine-tune: ft-eZacmxPBKPfMcJsrP7GlcIDF
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-03-11 10:51:04] Created fine-tune: ft-eZacmxPBKPfMcJsrP7GlcIDF

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i ft-eZacmxPBKPfMcJsrP7GlcIDF



In [None]:
# Follow the fine-tuning job until it completes
!openai api fine_tunes.follow -i ft-eZacmxPBKPfMcJsrP7GlcIDF

[2023-03-11 10:51:04] Created fine-tune: ft-eZacmxPBKPfMcJsrP7GlcIDF
[2023-03-11 10:56:06] Fine-tune costs $0.00
[2023-03-11 10:56:07] Fine-tune enqueued. Queue number: 0
[2023-03-11 10:56:07] Fine-tune is in the queue. Queue number: 0
[2023-03-11 10:56:09] Fine-tune started
[2023-03-11 10:58:19] Completed epoch 1/1
[2023-03-11 10:58:52] Uploaded model: davinci:ft-personal-2023-03-11-10-58-52
[2023-03-11 10:58:53] Uploaded result file: file-PSi01EojhRiCmE6M2eDSGdkh
[2023-03-11 10:58:54] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tuned model:

openai api completions.create -m davinci:ft-personal-2023-03-11-10-58-52 -p <YOUR_PROMPT>
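
Copying the model name out of the log by hand is error-prone. The sketch below pulls it out of captured log text with a regular expression. It assumes the log contains an `Uploaded model:` line in the format shown above, and `extract_model_name` is a hypothetical helper.

```python
import re

# Matches the model identifier that follows "Uploaded model:" in the follow log
MODEL_LINE = re.compile(r"Uploaded model:\s*(\S+)")


def extract_model_name(log_text):
    """Return the fine-tuned model name found in the log, or None if absent."""
    match = MODEL_LINE.search(log_text)
    return match.group(1) if match else None
```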


## Testing the fine-tuned model

In [None]:
openai.api_key = os.getenv("OPENAI_API_KEY")

In [None]:
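def generate_text(FINE_TUNED_MODEL, MAX_TOKENS, TEMPERATURE, PROMPT):
    # Request a completion from the fine-tuned model
    response = openai.Completion.create(
        model=FINE_TUNED_MODEL,
        prompt=PROMPT,
        max_tokens=MAX_TOKENS,    # upper bound on the number of generated tokens
        temperature=TEMPERATURE,
        stop=[" END"],            # stop sequence appended to completions during formatting
    )
    return response.choices[0].text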

In [None]:
FINE_TUNED_MODEL = "davinci:ft-personal-2023-03-11-10-58-52"
MAX_TOKENS = 1024
TEMPERATURE = 0
PROMPT = "Who invented www? \n\n###\n\n"  # end with the same separator used in the training prompts
generate_text(FINE_TUNED_MODEL, MAX_TOKENS, TEMPERATURE, PROMPT)
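
At inference time the prompt should end with exactly the separator used during training, and any stop marker left in the output should be trimmed. The following is a minimal sketch of two helpers for that. `build_prompt` and `clean_completion` are hypothetical names, and the separator and stop strings are assumed to match the formatting cell.

```python
SEPARATOR = " \n\n###\n\n"  # assumed to match the training-time prompt suffix
STOP = " END"               # assumed to match the training-time completion suffix


def build_prompt(text):
    """Strip trailing whitespace and append the training separator exactly once."""
    return text.rstrip() + SEPARATOR


def clean_completion(raw):
    """Drop everything from the stop marker onward and trim surrounding spaces."""
    return raw.split(STOP, 1)[0].strip()
```

With these, the last cell could call `generate_text(FINE_TUNED_MODEL, MAX_TOKENS, TEMPERATURE, build_prompt("Who invented www?"))` and pass the result through `clean_completion`.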