<a href="https://colab.research.google.com/github/alinealinealine/GPT-Pilot/blob/main/src/Finetuning_with_GPT3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tune GPT-3 for AIMM narrative ex-ante

OpenAI's GPT-3 is a natural language model trained on a large corpus of text. It can be used for various tasks, including generating text.

However, the model is a generalist and is therefore not well suited to specialised tasks in its original (vanilla) form. With a bit of fine-tuning, it can be adapted to more specialised tasks such as generating AIMM text.

Fine-tuning is done through OpenAI's fine-tuning API.

## Installing dependencies and libraries

In [1]:
!pip install --upgrade pip

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pip
  Downloading pip-23.0.1-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.4
    Uninstalling pip-22.0.4:
      Successfully uninstalled pip-22.0.4
Successfully installed pip-23.0.1


In [2]:
!pip install -Uq openai wandb

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
  Preparing metadata (setup.py) ... done

In [3]:
import openai
import wandb
from pathlib import Path
import pandas as pd
import numpy as np
import json
from tqdm import tqdm
from sklearn.model_selection import train_test_split

In [4]:
# Load the OpenAI API key from a local file (keep this file out of version control)
openai.api_key_path = "./api.txt"

## Dataset Preparation

The dataset was prepared in R by scraping the relevant documents and cleaning them into the JSON format required to fine-tune the model. The datasets are split by sector and by the portion of the AIMM narrative the model is expected to generate.

1. Sector:
  1. FIG
  2. MAS
  3. CDF
  4. INR
2. Section of AIMM narrative
  1. Project narrative
  2. Market narrative 
  3. Indicators

In addition, different prompt variations are explored, each producing a different model.
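As a rough illustration of the target format, the sketch below builds one prompt-completion record in Python. The separator `\n\n###\n\n` and the stop sequence `\n[END]` are the suffixes that OpenAI's data preparation tool reports for this dataset further down; the helper name `make_record` and the sample strings are hypothetical, and the actual preprocessing was done in R.

```python
import json

PROMPT_SEPARATOR = "\n\n###\n\n"  # marks the end of every prompt
STOP_SEQUENCE = "\n[END]"         # marks the end of every completion


def make_record(prompt_text, completion_text):
    """Return one training example in prompt/completion form (hypothetical helper)."""
    return {
        "prompt": prompt_text.strip() + PROMPT_SEPARATOR,
        # The leading space mirrors the sample record shown in this notebook.
        "completion": " " + completion_text.strip() + STOP_SEQUENCE,
    }


record = make_record(
    "The proposed project consists of an equity investment in a fund.",  # made-up sample
    "Assessment of Project Outcomes Summary: ...",                       # made-up sample
)
print(json.dumps(record))  # one line of a JSONL file
```

Writing one such JSON object per line gives a JSONL file that can be passed to the preparation tool.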

## Model naming convention

To keep track of the models, they are named using the following convention: "SSS-IN-GEN-XXXX"
* SSS: Refers to the sector the model focuses on: FIG, MAS, CDF, INR, or ALL for a sector-agnostic model
* IN: Refers to model input, can be BP for Board Papers and GE for Generic documents
* GEN: Refers to which section the model is trying to generate. Can be one of the following:
  * PRO: Project narrative
  * MAR: Market narrative
  * IND: Indicators
* XXXX: Refers to the model number, since several models may be created to accommodate different prompts. This can also be alphanumeric.
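The convention above can be sketched as a pair of small helpers that build and parse model names. The function names `build_model_name` and `parse_model_name` are hypothetical and not part of the original notebook; the allowed codes are taken from the list above, and the length of the XXXX part is deliberately not enforced.

```python
import re

SECTORS = {"FIG", "MAS", "CDF", "INR", "ALL"}
INPUTS = {"BP", "GE"}
SECTIONS = {"PRO", "MAR", "IND"}


def build_model_name(sector, model_input, section, number):
    """Assemble a name such as 'ALL-BP-PRO-0001' and reject unknown codes."""
    if sector not in SECTORS:
        raise ValueError(f"unknown sector: {sector}")
    if model_input not in INPUTS:
        raise ValueError(f"unknown input type: {model_input}")
    if section not in SECTIONS:
        raise ValueError(f"unknown section: {section}")
    if not re.fullmatch(r"[A-Za-z0-9]+", number):
        raise ValueError(f"model number must be alphanumeric: {number}")
    return "-".join([sector, model_input, section, number])


def parse_model_name(name):
    """Split a model name back into its four components."""
    sector, model_input, section, number = name.split("-")
    build_model_name(sector, model_input, section, number)  # reuse the validation
    return {"sector": sector, "input": model_input, "section": section, "number": number}


name = build_model_name("ALL", "BP", "PRO", "0001")
print(name)                    # ALL-BP-PRO-0001
print(parse_model_name(name))
```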

### Models trained so far
1. FIG-BP-PRO-0001: Uses the AIMM summary as prompt and the project narrative as completion. Trained on FIG projects only (~300 samples).
2. FIG-BP-MAR-0001: Uses the AIMM summary as prompt and the market narrative as completion. Trained on FIG projects only (~300 samples).
3. ALL-BP-PRO-0001: Uses the project description as prompt and the project narrative as completion. Trained on all sectors (~700 samples).
4. ALL-BP-MAR-0001: Uses the project description as prompt and the market narrative as completion. Trained on all sectors (~700 samples).


In [5]:
project_name = 'ALL-BP-PRO'
model_name = project_name+'-0001'

# Fine Tuning 

Fine tuning involves the following steps:
1. Preparing the dataset: The dataset is split into training and validation sets. Before the split, the prompts are checked with OpenAI's data preparation tool to confirm that they meet the requirements for fine-tuning.
2. Fine-tuning: The split datasets are uploaded to OpenAI to fine-tune the GPT model. The final model is saved and can be accessed both here and in the OpenAI Playground.

## Preparing the dataset

In [6]:
# create a job for splitting dataset
run = wandb.init(project=project_name, job_type='split dataset')

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [7]:
# path to the full dataset file
dataset_path = "./"+model_name+".json"

In [11]:
!head $dataset_path

{"prompt":"The proposed project consists of an equity investment of up to USD30 million AfricInvest IV LLC ('AF IV' or the 'Fund'), a generalist, closed-end private equity fund domiciled in Mauritius with a target size of USD500 million. Fund will invest in mid-market growth capital transactions in Africa.\n\n###\n\n","completion":" Assessment of Project Outcomes Summary: The Project outcome rating is Very Strong with a Low likelihood of achievement. The likelihood of achievement is based on: (i) the uncertain development track record of AF III (vintage 2014), which took time to identify investable opportunities (approx. 45 percent of investment made in 2018); (ii) the risks associated with the evolution of the fund’s investment strategy to transition into the higher mid-cap segment, which could cause further issues with deal flow and investee growth, especially in North Africa; and (iii) economic/political turbulence which can affect domestic demand and growth within the region. Proje

In [12]:
!openai tools fine_tunes.prepare_data -f $dataset_path

Analyzing...

- Your JSON file appears to be in a JSONL format. Your file will be converted to JSONL format
- Your file contains 735 prompt-completion pairs
- There are 56 examples that are very long. These are rows: [16, 102, 146, 154, 162, 163, 184, 186, 190, 194, 195, 200, 210, 211, 218, 228, 250, 251, 255, 266, 267, 278, 354, 381, 492, 508, 524, 529, 530, 538, 542, 548, 549, 550, 551, 558, 578, 582, 592, 593, 596, 608, 618, 619, 648, 650, 652, 660, 684, 685, 700, 701, 718, 719, 720, 730]
For conditional generation, and for classification the examples shouldn't be longer than 2048 tokens.
- All prompts end with suffix `\n\n###\n\n`
- All completions end with suffix `\n[END]`

Based on the analysis we will perform the following actions:
- [Necessary] Your format `JSON` will be converted to `JSONL`
- [Recommended] Remove 56 long examples [Y/n]: Y


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified file to `./ALL-BP-PRO-0001_prepared.jsonl`
Feel free to ta

In [13]:
dataset_path = "./"+model_name+"_prepared.jsonl"
# check number of samples
!wc -l $dataset_path

679 ./ALL-BP-PRO-0001_prepared.jsonl
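The count of 679 rows is consistent with the 735 original pairs minus the 56 long examples that were removed. As an additional sanity check, the sketch below scans JSONL lines and reports rows whose prompt or completion lacks the suffixes listed by the preparation tool. The helper `find_bad_rows` is hypothetical, and the sample rows are made up so that the block runs without the real file.

```python
import json

PROMPT_SEPARATOR = "\n\n###\n\n"
STOP_SEQUENCE = "\n[END]"


def find_bad_rows(jsonl_lines):
    """Return indices of rows whose prompt or completion lacks the expected suffix."""
    bad = []
    for i, line in enumerate(jsonl_lines):
        row = json.loads(line)
        ok = (row["prompt"].endswith(PROMPT_SEPARATOR)
              and row["completion"].endswith(STOP_SEQUENCE))
        if not ok:
            bad.append(i)
    return bad


# Made-up rows: the second one is missing the prompt separator.
sample = [
    json.dumps({"prompt": "A" + PROMPT_SEPARATOR, "completion": " B" + STOP_SEQUENCE}),
    json.dumps({"prompt": "no separator", "completion": " C" + STOP_SEQUENCE}),
]
bad_rows = find_bad_rows(sample)
print(bad_rows)  # [1]
```

With the real data, the same function could be called on the open file, for example `find_bad_rows(open(dataset_path))`.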


Split the data at random into training and validation sets, with 25% held out for validation:
* Training set = 75%
* Validation set = 25%

The files are also logged to W&B for record keeping.
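To make the proportions concrete, the sketch below works out how many rows each set receives for the 679 prepared examples. It assumes the held-out size is rounded up and the training set takes the remainder, which is my understanding of how scikit-learn treats a fractional `test_size`; the helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_valid) for a fractional hold-out size."""
    n_valid = math.ceil(n_samples * test_size)  # hold-out size is rounded up
    return n_samples - n_valid, n_valid


n_train, n_valid = split_sizes(679, 0.25)
print(n_train, n_valid)  # 509 170
```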

In [14]:
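# Load the prepared JSONL file (one prompt-completion pair per line)
df = pd.read_json(dataset_path, orient='records', lines=True)
# Shuffle before splitting: with shuffle=False the random_state is ignored
# and the split is sequential rather than random
df_train, df_test = train_test_split(df, test_size=0.25, random_state=42, shuffle=True)
df_train.to_json("./"+model_name+"_train.jsonl", orient='records', lines=True)
df_test.to_json("./"+model_name+"_test.jsonl", orient='records', lines=True)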

#Logging the files and tables into W&B 
table_train = wandb.Table(dataframe=df_train)
table_valid = wandb.Table(dataframe=df_test)

# Create artifacts
artifact_train = wandb.Artifact(model_name+"_train.jsonl", type='training_files', metadata={'samples': df_train.shape[0]})
artifact_train.add_file(model_name+"_train.jsonl")
artifact_train.add(table_train, model_name+"_train.jsonl")

artifact_valid = wandb.Artifact(model_name+"_test.jsonl", type='validation_files', metadata={'samples': df_test.shape[0]})
artifact_valid.add_file(model_name+"_test.jsonl")
artifact_valid.add(table_valid, model_name+"_test.jsonl")

# Log files
run.log_artifact(artifact_train)
run.log_artifact(artifact_valid)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f37a4b94af0>

Closing our dataprep run

In [15]:
# keep entity for reference of artifact later 
entity = wandb.run.entity
wandb.finish()

## Fine Tuning the model


In [16]:
train_file = "./"+model_name+"_train.jsonl"
valid_file = "./"+model_name+"_test.jsonl"

Defining hyperparameters:

Using OpenAI's default hyperparameters, with the model replaced by Davinci (the code sets `model = 'davinci'`).

In [None]:
#Defining hyper parameters (using the default ones)
model = 'davinci'  # using the best model : davinci
n_epochs = 4
batch_size = 4
learning_rate_multiplier = 0.1
prompt_loss_weight = 0.1
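For a feel of what these settings imply, the sketch below estimates the number of optimisation steps. It assumes about 509 training examples (75% of the 679 prepared rows) and that a final partial batch counts as a full step; how OpenAI actually batches the data is not documented here, so treat the result as a rough estimate only.

```python
import math

n_train_examples = 509  # assumption: 75% of the 679 prepared rows
n_epochs = 4
batch_size = 4

# Assumption: a partial final batch still counts as one step.
steps_per_epoch = math.ceil(n_train_examples / batch_size)
total_steps = steps_per_epoch * n_epochs
print(steps_per_epoch, total_steps)  # 128 512
```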

In [None]:
import os
# Placeholder: replace "API-Key" with your own key and never commit a real key
os.environ["OPENAI_API_KEY"] = "API-Key"

In [None]:
!openai api fine_tunes.create \
    -t $train_file \
    -v $valid_file \
    -m $model \
    --n_epochs $n_epochs \
    --batch_size $batch_size \
    --learning_rate_multiplier $learning_rate_multiplier \
    --prompt_loss_weight $prompt_loss_weight \
    --suffix $model_name

Upload progress: 100% 1.23M/1.23M [00:00<00:00, 1.19Git/s]
Uploaded file from ./FIG-BP-MAR-0001_train.jsonl: file-obxdzBRLTkhPV76L4sZfsop2
Upload progress: 100% 483k/483k [00:00<00:00, 622Mit/s]
Uploaded file from ./FIG-BP-MAR-0001_test.jsonl: file-S7GiRGimKHn625wcmDWjeuSy
Created fine-tune: ft-anN7qjd46Y8vth8qfA3RWI3h
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-03-08 22:07:12] Created fine-tune: ft-anN7qjd46Y8vth8qfA3RWI3h

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i ft-anN7qjd46Y8vth8qfA3RWI3h
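Outside a notebook, where `$variable` interpolation is not available, the same command could be assembled in Python. The sketch below only builds and prints the argument list, using the flags from the cell above; the helper name `build_fine_tune_command` is hypothetical, and nothing is sent to OpenAI.

```python
import shlex


def build_fine_tune_command(train_file, valid_file, model, n_epochs, batch_size,
                            learning_rate_multiplier, prompt_loss_weight, suffix):
    """Return the fine_tunes.create CLI call as a list of arguments."""
    return [
        "openai", "api", "fine_tunes.create",
        "-t", train_file,
        "-v", valid_file,
        "-m", model,
        "--n_epochs", str(n_epochs),
        "--batch_size", str(batch_size),
        "--learning_rate_multiplier", str(learning_rate_multiplier),
        "--prompt_loss_weight", str(prompt_loss_weight),
        "--suffix", suffix,
    ]


cmd = build_fine_tune_command(
    "./ALL-BP-PRO-0001_train.jsonl", "./ALL-BP-PRO-0001_test.jsonl",
    "davinci", 4, 4, 0.1, 0.1, "ALL-BP-PRO-0001",
)
print(shlex.join(cmd))  # shell-quoted form of the command
```

The list could then be handed to `subprocess.run(cmd, check=True)` once the API key is set in the environment.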



In [None]:
!openai api fine_tunes.follow -i ft-anN7qjd46Y8vth8qfA3RWI3h

[2023-03-08 22:07:12] Created fine-tune: ft-anN7qjd46Y8vth8qfA3RWI3h
[2023-03-08 22:10:39] Fine-tune costs $27.89
[2023-03-08 22:10:39] Fine-tune enqueued
[2023-03-08 22:33:59] Fine-tune is in the queue. Queue number: 31
[2023-03-08 22:35:05] Fine-tune is in the queue. Queue number: 30
[2023-03-08 22:36:01] Fine-tune is in the queue. Queue number: 29
[2023-03-08 22:36:07] Fine-tune is in the queue. Queue number: 28
[2023-03-08 22:37:30] Fine-tune is in the queue. Queue number: 27
[2023-03-08 22:39:46] Fine-tune is in the queue. Queue number: 26
[2023-03-08 22:43:34] Fine-tune is in the queue. Queue number: 25
[2023-03-08 22:46:09] Fine-tune is in the queue. Queue number: 24
[2023-03-08 22:46:43] Fine-tune is in the queue. Queue number: 23
[2023-03-08 22:46:45] Fine-tune is in the queue. Queue number: 22
[2023-03-08 22:47:27] Fine-tune is in the queue. Queue number: 21
[2023-03-08 22:49:01] Fine-tune is in the queue. Queue number: 20
[2023-03-08 22:50:55] Fine-tune is in the queue. Queu

### Syncing FineTune Jobs to W&B
 
Log the fine-tune job to W&B for later use.
 

In [None]:
!openai wandb sync
wandb.finish()

wandb: Currently logged in as: gjain5 (cdi). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.13.11
wandb: Run data is saved locally in /content/wandb/run-20230308_232738-ft-anN7qjd46Y8vth8qfA3RWI3h
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run ft-anN7qjd46Y8vth8qfA3RWI3h
wandb: ⭐️ View project at https://wandb.ai/cdi/GPT-3
wandb: 🚀 View run at https://wandb.ai/cdi/GPT-3/runs/ft-anN7qjd46Y8vth8qfA3RWI3h
wandb: Waiting for W&B process to finish... (success).
wandb: 
wandb: Run history:
wandb:             elapsed_examples ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:               elapsed_tokens ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:                training_loss ▆█