<a href="https://colab.research.google.com/github/alinealinealine/GPT-Pilot/blob/main/src/Finetuning_with_GPT3_market.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tune GPT-3 for AIMM narrative ex-ante

OpenAI's GPT-3 is a natural language model trained on a large set of training data. It can be used for various tasks, including text generation.

However, the model is generalist in nature and therefore not well suited to specialised tasks in its original (vanilla) form. With a bit of fine-tuning, it can be adapted to more specialised tasks such as generating AIMM text.

Fine-tuning is done through OpenAI's fine-tuning API for GPT-3.

## Installing dependencies and libraries

In [None]:
!pip install --upgrade pip
!pip install -Uq openai wandb

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pip
  Downloading pip-23.0.1-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.4
    Uninstalling pip-22.0.4:
      Successfully uninstalled pip-22.0.4
Successfully installed pip-23.0.1
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done

In [None]:
import openai
import wandb
from pathlib import Path
import pandas as pd
import numpy as np
import json
from tqdm import tqdm
from sklearn.model_selection import train_test_split

In [None]:
# Point the client at a file that contains the OpenAI API key
openai.api_key_path = "./api.txt"


'./api.txt'
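
The key file can also be read explicitly, so that a missing or empty file fails early. This is a minimal sketch, assuming the key is stored as a single line of text; `load_api_key` is a hypothetical helper, not part of the `openai` package.

```python
from pathlib import Path


def load_api_key(path: str) -> str:
    """Read an API key from a text file and strip surrounding whitespace."""
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        # Fail early instead of sending an empty key to the API
        raise ValueError(f"No API key found in {path}")
    return key


# Example usage (assumes ./api.txt exists):
# api_key = load_api_key("./api.txt")
```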

## Dataset Preparation

The dataset was processed in R by scraping relevant documents and cleaning them into the JSON format required to fine-tune the model. The dataset is split by sector and by the portion of the AIMM narrative the model is expected to generate.

1. Sector:
   1. FIG
   2. MAS
   3. CDF
   4. INR
2. Section of AIMM narrative:
   1. Project narrative
   2. Market narrative
   3. Indicators

In addition, different prompt variations are explored, each producing a different model.

## Model naming convention

To keep track of the models, each one is named using the following convention: "SSS-IN-GEN-XXXX"
* SSS: Refers to the sector the model focuses on
* IN: Refers to the model input, either BP for Board Papers or GE for Generic documents
* GEN: Refers to the section the model is trying to generate. Can be one of the following:
  * PRO: Project narrative
  * MAR: Market narrative
  * IND: Indicators
* XXXX: Refers to the number of the model, as several models might be created to accommodate different prompts. This can also be alphanumeric.
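
The convention above can be checked mechanically. The following is a minimal sketch, not part of the original workflow: `parse_model_name` is a hypothetical helper, the `ALL` sector code is an assumption taken from the model name used later in this notebook, and the model number is assumed to be any run of letters and digits.

```python
import re

# Sector codes from the dataset section; "ALL" is an assumption based on
# the model name used later in this notebook.
SECTORS = {"FIG", "MAS", "CDF", "INR", "ALL"}
INPUTS = {"BP", "GE"}             # Board Papers, Generic documents
SECTIONS = {"PRO", "MAR", "IND"}  # Project, Market, Indicators

NAME_PATTERN = re.compile(r"^([A-Z]{3})-([A-Z]{2})-([A-Z]{3})-([A-Za-z0-9]+)$")


def parse_model_name(name: str) -> dict:
    """Split a model name into its parts and check each against the convention."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow SSS-IN-GEN-XXXX")
    sector, source, section, number = match.groups()
    if sector not in SECTORS or source not in INPUTS or section not in SECTIONS:
        raise ValueError(f"{name!r} uses an unknown sector, input, or section code")
    return {"sector": sector, "input": source, "section": section, "number": number}


print(parse_model_name("ALL-BP-MAR-0001"))
```

A name without the trailing model number, such as `FIG-BP-PRO`, is rejected by this check.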




In [None]:
project_name = 'FIG-BP-PRO'
model_name = 'ALL-BP-MAR-0001'

# Fine Tuning 

Fine-tuning involves the following steps:
1. Preparing the dataset: The dataset is split into training and validation sets. Before the split, the prompt-completion pairs are run through OpenAI's data preparation tool to check that they meet the format requirements for fine-tuning.
2. Fine-tuning: The split datasets are uploaded to OpenAI to fine-tune the GPT model. The final model is saved and can be accessed both here and in the OpenAI Playground.

## Preparing the dataset

In [None]:
# create a job for splitting dataset
run = wandb.init(project=project_name, job_type='split dataset')

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [None]:
# path to the full dataset (JSON file named after the model)
dataset_path = "./"+model_name+".json"

In [None]:
!head $dataset_path

{"prompt":"The proposed project consists of an equity investment of up to USD30 million AfricInvest IV LLC ('AF IV' or the 'Fund'), a generalist, closed-end private equity fund domiciled in Mauritius with a target size of USD500 million. Fund will invest in mid-market growth capital transactions in Africa. The Project is in IFC sector P-BA - Growth Equity Fund and in Africa Region.\n\n###\n\n","completion":" Assessment of Contribution to Market Creation Summary: The Contribution to Market Creation rating is Moderate with a Medium likelihood of achievement. The likelihood assessment is based on: (i) the success of the project itself in generating sustained growth for its investees in relatively challenging markets and in delivering proof of concept to investors in this region; and tempered by (ii) the complex nature of African PE markets where the risk perception of private institutional investors has remained relatively high, despite the success of multiple interventions (incl. by othe

In [None]:
!openai tools fine_tunes.prepare_data -f $dataset_path

Analyzing...

- Your JSON file appears to be in a JSONL format. Your file will be converted to JSONL format
- Your file contains 736 prompt-completion pairs
- There are 57 examples that are very long. These are rows: [15, 38, 66, 155, 171, 172, 173, 201, 202, 217, 218, 235, 236, 241, 242, 260, 263, 264, 341, 348, 364, 419, 434, 451, 473, 490, 494, 511, 523, 527, 529, 530, 539, 543, 549, 555, 565, 597, 599, 600, 603, 604, 649, 651, 655, 656, 683, 684, 685, 686, 687, 688, 701, 703, 707, 719, 721]
For conditional generation, and for classification the examples shouldn't be longer than 2048 tokens.
- All prompts end with suffix `.\n\n###\n\n`
- All completions end with suffix `\n[END]`

Based on the analysis we will perform the following actions:
- [Necessary] Your format `JSON` will be converted to `JSONL`
- [Recommended] Remove 57 long examples [Y/n]: Y


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified file to `./ALL-BP-MAR-0001_prepared.jsonl`
Feel free t
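
The suffix checks reported above can be reproduced locally before uploading. This is a minimal sketch, assuming each record is a dict with `prompt` and `completion` keys and the separators reported by the tool; `check_record` is a hypothetical helper, the two records are made up for illustration, and token lengths are not checked.

```python
import json

PROMPT_SUFFIX = "\n\n###\n\n"  # separator reported by the prepare_data tool
COMPLETION_SUFFIX = "\n[END]"  # stop sequence reported by the prepare_data tool


def check_record(record: dict) -> list:
    """Return a list of format problems found in one prompt-completion pair."""
    problems = []
    if not record.get("prompt", "").endswith(PROMPT_SUFFIX):
        problems.append("prompt is missing the separator suffix")
    completion = record.get("completion", "")
    if not completion.startswith(" "):
        problems.append("completion does not start with a space")
    if not completion.endswith(COMPLETION_SUFFIX):
        problems.append("completion is missing the stop suffix")
    return problems


# Two made-up JSONL lines, for illustration only
lines = [
    json.dumps({"prompt": "A project.\n\n###\n\n", "completion": " A summary.\n[END]"}),
    json.dumps({"prompt": "A project.", "completion": "A summary."}),
]
for line in lines:
    print(check_record(json.loads(line)))
```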

In [None]:
dataset_path = "./"+model_name+"_prepared.jsonl"
# check number of samples
!wc -l $dataset_path

679 ./ALL-BP-MAR-0001_prepared.jsonl


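Split the prepared data into training and validation sets, holding out 25% for validation:
* Training set = 75%
* Validation set = 25%

Note that the call below passes `shuffle=False`, so the split is sequential rather than random: the first 75% of rows form the training set, the last 25% form the validation set, and `random_state` has no effect.

The files are also logged to W&B for record keeping.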

In [None]:
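# Load the prepared JSONL file (one prompt-completion pair per line)
df = pd.read_json(dataset_path, orient='records', lines=True)
# shuffle=False keeps row order: the first 75% is used for training, the last 25% for validation
# (random_state is ignored when shuffle=False; set shuffle=True for a random split)
df_train, df_test = train_test_split(df, test_size=0.25, random_state=42, shuffle=False)
df_train.to_json("./"+model_name+"_train.jsonl", orient='records', lines=True)
df_test.to_json("./"+model_name+"_test.jsonl", orient='records', lines=True)

# Build W&B tables from the two splits so they can be logged as artifacts
table_train = wandb.Table(dataframe=df_train)
table_valid = wandb.Table(dataframe=df_test)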

<wandb.sdk.wandb_artifacts.Artifact at 0x7f0a5db9a0d0>
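
Because the call above uses `shuffle=False`, the split is a plain cut of the rows in file order. The sketch below illustrates the resulting sizes for the 679 prepared examples in pure Python. It assumes the validation size is rounded up and the remainder goes to training, which is how scikit-learn is understood to behave here; `sequential_split` is a hypothetical stand-in, not the library function.

```python
import math


def sequential_split(rows, test_size):
    """Split rows in order: the head goes to training, the tail to validation."""
    n_test = math.ceil(len(rows) * test_size)  # assumed rounding rule
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]


rows = list(range(679))  # stand-in for the 679 prepared examples
train, valid = sequential_split(rows, 0.25)
print(len(train), len(valid))  # 509 170
```

If the source file is ordered (for example by sector), a sequential cut can leave the validation set unrepresentative, which is a reason to consider shuffling.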

In [None]:
# Create artifacts
artifact_train = wandb.Artifact(model_name+"_train.jsonl", type='training_files', metadata={'samples': df_train.shape[0]})
artifact_train.add_file(model_name+"_train.jsonl")
artifact_train.add(table_train, model_name+"_train.jsonl")

artifact_valid = wandb.Artifact(model_name+"_test.jsonl", type='validation_files', metadata={'samples': df_test.shape[0]})
artifact_valid.add_file(model_name+"_test.jsonl")
artifact_valid.add(table_valid, model_name+"_test.jsonl")

# Log files
run.log_artifact(artifact_train)
run.log_artifact(artifact_valid)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f0a4f6b8520>

Closing the data-preparation run.

In [None]:
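# keep entity for reference of artifact later
# (use the `run` handle returned by wandb.init above; `wandb.run` is None when no
# run is active, which is a likely cause of an AttributeError on this line)
entity = run.entity
run.finish()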

## Fine Tuning the model


In [None]:
train_file = "./"+model_name+"_train.jsonl"
valid_file = "./"+model_name+"_test.jsonl"

Defining hyperparameters:

OpenAI's default hyperparameters are used as the starting point, and the base model is set to `davinci` (the base model, not `text-davinci-003`).

In [None]:
# Define hyperparameters
model = 'davinci'  # base model to fine-tune, chosen as the most capable option
n_epochs = 4
batch_size = 4
learning_rate_multiplier = 0.1
prompt_loss_weight = 0.1
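
To get a feel for what these settings imply, the number of optimisation steps can be estimated from the training-set size. This is a back-of-the-envelope sketch under stated assumptions: the training set is taken to be 509 examples (75% of the 679 prepared examples), and a partial final batch is counted as one step, which may not match how the service batches internally.

```python
import math

n_train = 509   # assumed training-set size (75% of 679 prepared examples)
n_epochs = 4
batch_size = 4

# One step per batch; a partial final batch is assumed to count as a step
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * n_epochs
print(steps_per_epoch, total_steps)  # 128 512
```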

In [None]:
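import os
# The openai CLI reads the key from this environment variable;
# replace the placeholder with your own key before running
os.environ["OPENAI_API_KEY"] = "<API KEY>"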

In [None]:
!openai api fine_tunes.create \
    -t $train_file \
    -v $valid_file \
    -m $model \
    --n_epochs $n_epochs \
    --batch_size $batch_size \
    --learning_rate_multiplier $learning_rate_multiplier \
    --prompt_loss_weight $prompt_loss_weight \
    --suffix $model_name

Upload progress: 100% 2.74M/2.74M [00:00<00:00, 2.20Git/s]
Uploaded file from ./ALL-BP-MAR-0001_train.jsonl: file-l8WaNrBdadm0SiyIWsBE951A
Upload progress: 100% 1.03M/1.03M [00:00<00:00, 1.11Git/s]
Uploaded file from ./ALL-BP-MAR-0001_test.jsonl: file-3fVKE0Kgu2XaWC6ijQsWN7jd
Created fine-tune: ft-orTwLKMNydHLhx7jVrksWTMZ
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-03-09 17:44:51] Created fine-tune: ft-orTwLKMNydHLhx7jVrksWTMZ

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i ft-orTwLKMNydHLhx7jVrksWTMZ



In [None]:
!openai api fine_tunes.follow -i ft-orTwLKMNydHLhx7jVrksWTMZ

[2023-03-09 17:44:51] Created fine-tune: ft-orTwLKMNydHLhx7jVrksWTMZ
[2023-03-09 18:03:46] Fine-tune costs $63.01
[2023-03-09 18:03:47] Fine-tune enqueued
[2023-03-09 18:43:11] Fine-tune is in the queue. Queue number: 31
[2023-03-09 18:44:21] Fine-tune is in the queue. Queue number: 30
[2023-03-09 18:44:45] Fine-tune is in the queue. Queue number: 28
[2023-03-09 18:44:46] Fine-tune is in the queue. Queue number: 28
[2023-03-09 18:46:03] Fine-tune is in the queue. Queue number: 27
[2023-03-09 18:47:08] Fine-tune is in the queue. Queue number: 26
[2023-03-09 18:48:20] Fine-tune is in the queue. Queue number: 25
[2023-03-09 18:49:26] Fine-tune is in the queue. Queue number: 24
[2023-03-09 18:49:35] Fine-tune is in the queue. Queue number: 23
[2023-03-09 18:50:37] Fine-tune is in the queue. Queue number: 22
[2023-03-09 18:51:18] Fine-tune is in the queue. Queue number: 21
[2023-03-09 18:51:21] Fine-tune is in the queue. Queue number: 20
[2023-03-09 18:51:22] Fine-tune is in the queue. Queu

### Syncing fine-tune jobs to W&B

Log the fine-tune jobs to W&B so the results can be used later.


In [None]:
!openai wandb sync
wandb.finish()

wandb: Currently logged in as: gjain5 (cdi). Use `wandb login --relogin` to force relogin
No new successful fine-tunes were found
🎉 wandb sync completed successfully
