<a href="https://colab.research.google.com/github/alinealinealine/GPT-Pilot/blob/main/src/Finetuning_with_GPT3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tune GPT-3 for AIMM narrative ex-ante

OpenAI's GPT-3 is a natural language model trained on a large corpus of text. It can be used for various tasks, including generating text.

However, the model is a generalist and so is not suited to specialised tasks in its original, or vanilla, version. With a bit of fine-tuning, it can be used for more specialised tasks such as generating AIMM text.

The fine-tuning is done through OpenAI's fine-tuning API.

## Installing dependencies and libraries

In [None]:
!pip install -Uq openai wandb


In [None]:
import openai
import wandb
from pathlib import Path
import pandas as pd
import numpy as np
import json
from tqdm import tqdm
from sklearn.model_selection import train_test_split

In [None]:
#Entering API Credentials
openai.api_key_path = "./api.txt"

## Dataset Preparation

The dataset was processed in R by scraping relevant documents and cleaning them into the JSON format required to fine-tune the model. The datasets are split by sector and by the portion of the AIMM narrative the model is expected to generate.

1. Sector:
  1. FIG
  2. MAS
  3. CDF
  4. INR
2. Section of AIMM narrative
  1. Project narrative
  2. Market narrative 
  3. Indicators

In addition, different prompt variations are explored, each producing a different model.

## Model naming convention

To keep track of the models, they are named using the following convention: "SSS-IN-GEN-XXXX"
* SSS: Refers to Sector of the model's focus
* IN: Refers to model input, can be BP for Board Papers and GE for Generic documents
* GEN: Refers to which section the model is trying to generate. Can be one of the following:
  * PRO: Project narrative
  * MAR: Market narrative
  * IND: Indicators
* XXXX: Refers to the number of the model, as several models might be created to accommodate various prompts. This can also be alphanumeric.
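
The convention above can be sketched as a small helper that assembles a name and rejects unknown codes. This is only an illustration: the function name `build_model_name` is hypothetical, the allowed codes are copied from the lists in this notebook, and the four-character length of the final part is an assumption read off the "XXXX" placeholder.

```python
import re

# Codes taken from the lists in this notebook.
SECTORS = {"FIG", "MAS", "CDF", "INR"}
INPUTS = {"BP", "GE"}
SECTIONS = {"PRO", "MAR", "IND"}


def build_model_name(sector, source, section, number):
    """Assemble an 'SSS-IN-GEN-XXXX' name, rejecting unknown codes."""
    if sector not in SECTORS:
        raise ValueError(f"unknown sector code: {sector!r}")
    if source not in INPUTS:
        raise ValueError(f"unknown input code: {source!r}")
    if section not in SECTIONS:
        raise ValueError(f"unknown section code: {section!r}")
    # Integers are zero-padded to four characters; strings pass through so
    # that alphanumeric identifiers remain possible.
    suffix = f"{number:04d}" if isinstance(number, int) else str(number)
    # Assumption: the final part is exactly four alphanumeric characters.
    if not re.fullmatch(r"[A-Za-z0-9]{4}", suffix):
        raise ValueError(f"model number must be 4 alphanumeric characters: {suffix!r}")
    return f"{sector}-{source}-{section}-{suffix}"


print(build_model_name("FIG", "BP", "PRO", 1))  # FIG-BP-PRO-0001
```

Validating the codes up front keeps a typo such as `FGI` from silently creating a new project name in W&B.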




In [None]:
project_name = 'FIG-BP-PRO'
model_name = 'FIG-BP-PRO-0001'

# Fine Tuning 

Fine tuning involves the following steps:
1. Preparing the dataset: The dataset is split into training and validation sets. Before the split, the prompts are also checked with OpenAI's data preparation tool to see whether they meet the requirements for fine-tuning.
2. Fine-tuning: The split datasets are uploaded to OpenAI to fine-tune the GPT model. The final model is saved and can be accessed both here and in the OpenAI Playground.

## Preparing the dataset

In [None]:
# create a job for splitting dataset
run = wandb.init(project=project_name, job_type='split dataset')

wandb: Currently logged in as: gauravrpjain. Use `wandb login --relogin` to force relogin


In [None]:
# path to the full dataset
dataset_path = "./dataset_1.json"

In [None]:
!head $dataset_path

{"prompt":"IFC has entered into Memorandum of Understanding (MoUs) with i) Kerala Infrastructure Investment Fund Board (KIFB); ii) PPP Department, Government of Goa; and iii) Gujarat Power Corporation Limited (GPCL). \n\nIFC will support KIFB and the Government of Goa in identification and screening of Public-Private Partnership (PPP) projects across infrastructure sectors and undertake pre-feasibility assessments of select projects. \n\nIFC will also support GPCL to conduct a pre-feasibility assessment for a potential pilot project to produce clean hydrogen-based renewable energy at one of GPCL's sites in Gujarat. \nDevelopment Impact: \n\n###\n\n","completion":" \tIdentification of at least one climate friendly PPP transaction based on the screening and pre-feasibility assessments being undertaken with multiple entities\n\tMobilization of private sector investment\n\tCreation of jobs\n##\n"}
{"prompt":" The proposed investment comprises of (i) a 3-year senior unsecured loan of up to 

In [None]:
!openai tools fine_tunes.prepare_data -f $dataset_path

Analyzing...

- Your file contains 5189 prompt-completion pairs
- There are 2 examples that are very long. These are rows: [4562, 4803]
For conditional generation, and for classification the examples shouldn't be longer than 2048 tokens.
- All prompts end with suffix ` \nDevelopment Impact: \n\n###\n\n`. This suffix seems very long. Consider replacing with a shorter suffix, such as `\n\n===\n\n`
- All completions end with suffix `\n##\n`

Based on the analysis we will perform the following actions:
- [Recommended] Remove 2 long examples [Y/n]: Y


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified file to `./dataset_1_prepared.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "./dataset_1_prepared.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string ` \nDevelopment Impact: \n\n###\n\n` for the model to start generating completions, rather than continuing

In [None]:
dataset_path = "./dataset_1_prepared.jsonl"
# check number of samples
!wc -l $dataset_path

5187 ./dataset_1_prepared.jsonl
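
The count matches the analysis: 5189 pairs minus the 2 long examples that were removed leaves 5187. Beyond counting lines, it can be useful to confirm that every line still parses as a prompt/completion pair. Below is a minimal sketch using only the standard library; `check_jsonl` is a hypothetical helper, and the demo file content is made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def check_jsonl(path):
    """Count records and collect line numbers that are not prompt/completion pairs."""
    count, bad_lines = 0, []
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            count += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(line_no)
                continue
            if not isinstance(record, dict) or set(record) != {"prompt", "completion"}:
                bad_lines.append(line_no)
    return count, bad_lines


# Tiny demonstration on a throwaway file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.jsonl"
    demo.write_text(
        '{"prompt": "a\\n\\n###\\n\\n", "completion": " b\\n##\\n"}\n'
        '{"prompt": "missing completion"}\n',
        encoding="utf-8",
    )
    demo_result = check_jsonl(demo)

print(demo_result)  # (2, [2])
```

On the real file the call would be `check_jsonl("./dataset_1_prepared.jsonl")`, where an empty second element means every row has exactly the two expected keys.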


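Splitting it into training and validation sets, with 25% held out for validation. Because the cell below passes `shuffle = False`, the split is sequential rather than random: the last 25% of the rows form the validation set.
* Training Set = 75%
* Validation Set = 25%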

Also, logging the files into W&B for recordkeeping. 
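
To make the effect of the shuffle setting concrete, here is a standard-library sketch of a 75/25 split. It is not scikit-learn's implementation; `split_rows` is a hypothetical helper, and rounding the held-out count up is an assumption about how the fraction is turned into a row count.

```python
import math
import random


def split_rows(rows, test_fraction=0.25, shuffle=False, seed=42):
    """Split rows into (train, test); the held-out count is rounded up."""
    rows = list(rows)
    if shuffle:
        random.Random(seed).shuffle(rows)  # seeded, so the split is repeatable
    n_test = math.ceil(len(rows) * test_fraction)
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]


rows = list(range(5187))  # stand-in for the 5187 prepared records
train, test = split_rows(rows)
print(len(train), len(test))  # 3890 1297
print(test[0])                # 3890: without shuffling, the test set is the tail of the file
```

If the prepared file is ordered in some way (for example by document or by date), an unshuffled split holds out only the tail of that ordering, so the validation loss may not reflect the whole dataset. Passing `shuffle=True` in the sketch gives a seeded random split instead.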

In [None]:
df = pd.read_json(dataset_path, orient='records', lines=True)
df_train, df_test = train_test_split(df,test_size = 0.25,random_state = 42, shuffle = False)
df_train.to_json("./dataset_1_train.jsonl", orient='records', lines=True)
df_test.to_json("./dataset_1_test.jsonl", orient='records', lines=True)

#Logging the files and tables into W&B 
table_train = wandb.Table(dataframe=df_train)
table_valid = wandb.Table(dataframe=df_test)

# Create artifacts
artifact_train = wandb.Artifact('dataset_1_train.jsonl', type='training_files', metadata={'samples': df_train.shape[0]})
artifact_train.add_file('dataset_1_train.jsonl')
artifact_train.add(table_train, 'df_1_train')

artifact_valid = wandb.Artifact('dataset_1_test.jsonl', type='validation_files', metadata={'samples': df_test.shape[0]})
artifact_valid.add_file('dataset_1_test.jsonl')
artifact_valid.add(table_valid, 'df_1_test')

# Log files
run.log_artifact(artifact_train)
run.log_artifact(artifact_valid)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f8df387ad10>

Closing our dataprep run

In [None]:
# keep entity for reference of artifact later 
entity = wandb.run.entity
wandb.finish()

## Fine Tuning the model


In [None]:
train_file = "./dataset_1_train.jsonl"
valid_file = "./dataset_1_test.jsonl"

Defining hyperparameters:

Using OpenAI's default hyperparameters, with the base model set to `curie` in the cell below.

In [None]:
#Defining hyper parameters (using the default ones)
model = 'curie'  # base model to fine-tune (ada is the cheapest option)
n_epochs = 4
batch_size = 4
learning_rate_multiplier = 0.1
prompt_loss_weight = 0.1

In [None]:
import os
os.environ["OPENAI_API_KEY"] = "<API KEY>"

In [None]:
!openai api fine_tunes.create \
    -t $train_file \
    -v $valid_file \
    -m $model \
    --n_epochs $n_epochs \
    --batch_size $batch_size \
    --learning_rate_multiplier $learning_rate_multiplier \
    --prompt_loss_weight $prompt_loss_weight \
    --suffix $model_name

Upload progress: 100% 5.99M/5.99M [00:00<00:00, 7.89Git/s]
Uploaded file from ./dataset_1_train.jsonl: file-xfIlHxiG7BRHO6uQmvzbz47r
Upload progress: 100% 2.40M/2.40M [00:00<00:00, 3.04Git/s]
Uploaded file from ./dataset_1_test.jsonl: file-njPKfzbNv5hG1mS5AQxIe4hY
Created fine-tune: ft-fNi5iGbSjUvAhWw9Mw3BIAiB
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2022-11-22 05:21:31] Created fine-tune: ft-fNi5iGbSjUvAhWw9Mw3BIAiB
[2022-11-22 05:21:44] Fine-tune costs $1.91
[2022-11-22 05:21:44] Fine-tune enqueued. Queue number: 0
[2022-11-22 05:21:46] Fine-tune started

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i ft-fNi5iGbSjUvAhWw9Mw3BIAiB
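
A multi-line shell command drops every flag that follows a line without a trailing backslash. One way to avoid that is to assemble the `fine_tunes.create` call as a Python list. The sketch below only builds and prints the command; `build_fine_tune_command` is a hypothetical helper, and it uses only the flags that already appear in this notebook.

```python
import shlex


def build_fine_tune_command(train_file, valid_file, model, suffix, **hyperparams):
    """Return the `openai api fine_tunes.create` call as an argument list."""
    cmd = ["openai", "api", "fine_tunes.create",
           "-t", train_file, "-v", valid_file, "-m", model]
    for name, value in hyperparams.items():
        cmd += [f"--{name}", str(value)]
    cmd += ["--suffix", suffix]
    return cmd


cmd = build_fine_tune_command(
    "./dataset_1_train.jsonl", "./dataset_1_test.jsonl", "curie", "FIG-BP-PRO-0001",
    n_epochs=4, batch_size=4, learning_rate_multiplier=0.1, prompt_loss_weight=0.1,
)
print(shlex.join(cmd))
# To launch it, pass the list to subprocess.run(cmd, check=True).
```

Printing the joined command before running it makes it easy to confirm that `--suffix` and the model name are really part of the call.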



In [None]:
!openai api fine_tunes.follow -i ft-fNi5iGbSjUvAhWw9Mw3BIAiB

[2022-11-22 05:21:31] Created fine-tune: ft-fNi5iGbSjUvAhWw9Mw3BIAiB
[2022-11-22 05:21:44] Fine-tune costs $1.91
[2022-11-22 05:21:44] Fine-tune enqueued. Queue number: 0
[2022-11-22 05:21:46] Fine-tune started
[2022-11-22 05:27:54] Completed epoch 1/4
[2022-11-22 05:33:43] Completed epoch 2/4
[2022-11-22 05:39:33] Completed epoch 3/4
[2022-11-22 05:45:22] Completed epoch 4/4
[2022-11-22 05:45:40] Uploaded model: ada:ft-personal-2022-11-22-05-45-38
[2022-11-22 05:45:42] Uploaded result file: file-etqWtnfTghm7ZULNBkNSBzkV
[2022-11-22 05:45:42] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tuned model:

openai api completions.create -m ada:ft-personal-2022-11-22-05-45-38 -p <YOUR_PROMPT>
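
When trying the model, the prompt has to end with the same indicator string used in training, and the completion suffix can serve as a stop sequence. Below is a minimal sketch that only builds the request arguments. `build_completion_request` is a hypothetical helper, `max_tokens=256` is an assumed value, and the example description is made up.

```python
PROMPT_SUFFIX = " \nDevelopment Impact: \n\n###\n\n"  # indicator string reported by prepare_data
STOP_SEQUENCE = "\n##\n"  # suffix that ends every training completion


def build_completion_request(model, description, max_tokens=256):
    """Return keyword arguments for a completion call against the fine-tuned model."""
    return {
        "model": model,
        "prompt": description.rstrip() + PROMPT_SUFFIX,
        "max_tokens": max_tokens,  # assumed value, tune to the expected length
        "stop": [STOP_SEQUENCE],
    }


request = build_completion_request(
    "ada:ft-personal-2022-11-22-05-45-38",
    "IFC will support a hypothetical client in screening PPP projects.",
)
print(request["prompt"])
# With the pre-1.0 `openai` Python package, the call would look like:
# response = openai.Completion.create(**request)
# print(response["choices"][0]["text"])
```

Without the stop sequence the model may keep generating past the end marker, so the output would need trimming by hand.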


### Syncing FineTune Jobs to W&B
 
Logging the fine-tune to W&B for later use.
 

In [None]:
!openai wandb sync
wandb.finish()

wandb: Currently logged in as: gauravrpjain. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.13.5
wandb: Run data is saved locally in /content/wandb/run-20221122_055012-ft-fNi5iGbSjUvAhWw9Mw3BIAiB
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run ft-fNi5iGbSjUvAhWw9Mw3BIAiB
wandb: ⭐️ View project at https://wandb.ai/gauravrpjain/GPT-3
wandb: 🚀 View run at https://wandb.ai/gauravrpjain/GPT-3/runs/ft-fNi5iGbSjUvAhWw9Mw3BIAiB
File file-xfIlHxiG7BRHO6uQmvzbz47r could not be retrieved. Make sure you are allowed to download training/validation files
File file-njPKfzbNv5hG1mS5AQxIe4hY could not be retrieved. Make sure you are allowed to download training/validation files
wandb: Waiting for W&B process to finish... (success).
wandb:
[34m[1mw