<a href="https://colab.research.google.com/github/alinealinealine/GPT-Pilot/blob/main/Finetuning_with_GPT3_Summary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tune GPT-3 for AIMM narrative ex-ante

OpenAI's GPT-3 is a natural language model trained on a large set of training data. It can be used for various tasks, including generating text.

However, the model is generalist in nature and thus not fit for specialised tasks in its original or vanilla version. With a bit of fine-tuning, it can be used for more specialised tasks such as generating AIMM text.

The fine-tuning is done through OpenAI's fine-tuning API for GPT-3.

## Installing dependencies and libraries

In [1]:
!pip install --upgrade pip

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pip
  Downloading pip-23.0.1-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.4
    Uninstalling pip-22.0.4:
      Successfully uninstalled pip-22.0.4
Successfully installed pip-23.0.1


In [2]:
!pip install -Uq openai wandb


In [4]:
import openai
import wandb
from pathlib import Path
import pandas as pd
import numpy as np
import json
from tqdm import tqdm
from sklearn.model_selection import train_test_split

In [5]:
#Entering API Credentials
openai.api_key_path = "./api.txt"

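The cell above points the client at a key file, and a later cell sets `OPENAI_API_KEY` through the environment. A minimal sketch of one way to combine the two without writing the key into the notebook is shown below. The helper name `load_api_key` and its fallback order are assumptions for illustration, not part of the `openai` library.

```python
import os
from pathlib import Path


def load_api_key(env_var: str = "OPENAI_API_KEY", key_file: str = "./api.txt") -> str:
    """Hypothetical helper: return the key from the environment, else from a file."""
    key = os.environ.get(env_var)
    if key and key.strip():
        return key.strip()
    path = Path(key_file)
    if path.is_file():
        # The file is assumed to contain only the key, possibly with a trailing newline.
        return path.read_text(encoding="utf-8").strip()
    raise RuntimeError(f"No API key found in ${env_var} or {key_file}")


# Possible usage with the legacy client used in this notebook:
# openai.api_key = load_api_key()
```
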
## Dataset Preparation

The dataset was prepared in R by scraping the relevant documents and cleaning them into the JSON format required to fine-tune the model. The data are split by sector and by the portion of the AIMM narrative the model is expected to generate.

1. Sector:
  1. FIG
  2. MAS
  3. CDF
  4. INR
2. Section of AIMM narrative
  1. Project narrative
  2. Market narrative 
  3. Indicators

In addition, different prompt variations are explored, each producing a different model.

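The actual preprocessing was done in R, so the following is only a Python sketch of the target record format. It assumes the prompt separator and end marker that the data-preparation tool reports later in this notebook (`\n\n###\n\n` and `\n[END]`) and the leading space seen in the sample completion; `make_record` and `to_jsonl` are hypothetical helper names.

```python
import json

# Separator and end marker as reported by the data-preparation output in this notebook.
PROMPT_SEPARATOR = "\n\n###\n\n"
COMPLETION_END = "\n[END]"


def make_record(prompt_text: str, completion_text: str) -> dict:
    """Build one prompt-completion pair in the fine-tuning format."""
    return {
        "prompt": prompt_text.strip() + PROMPT_SEPARATOR,
        # The sample record starts its completion with a single space.
        "completion": " " + completion_text.strip() + COMPLETION_END,
    }


def to_jsonl(records) -> str:
    """Serialise records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


record = make_record(
    "The proposed project consists of an equity investment.",
    "DEVELOPMENT IMPACT Summary",
)
print(to_jsonl([record]))
```
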
## Model naming convention

To keep track of the models, they are named using the following convention: "SSS-IN-GEN-XXXX"
* SSS: The sector the model focuses on: FIG, MAS, CDF, INR, or ALL for a sector-agnostic model
* IN: The model input: BP for Board Papers or GE for Generic documents
* GEN: The section the model is trying to generate. One of the following:
  * PRO: Project narrative
  * MAR: Market narrative
  * IND: Indicators
* XXXX: The number of the model, since several models may be created to accommodate different prompts. This can also be alphanumeric.

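The convention above can be sketched as a small builder that rejects unknown codes. The function name `build_model_name` is hypothetical, and the four-character length of the final part is an assumption read off the "XXXX" placeholder. The code sets contain only the codes listed above; they would need extending for other codes (this notebook's own `ALL-BP-SUM` project, for instance, uses a `SUM` code that is not in the list).

```python
import re

# Codes taken from the naming convention described above.
SECTORS = {"FIG", "MAS", "CDF", "INR", "ALL"}
INPUTS = {"BP", "GE"}
SECTIONS = {"PRO", "MAR", "IND"}


def build_model_name(sector: str, input_type: str, section: str, number: str) -> str:
    """Return 'SSS-IN-GEN-XXXX' or raise ValueError for an unknown part."""
    if sector not in SECTORS:
        raise ValueError(f"Unknown sector: {sector!r}")
    if input_type not in INPUTS:
        raise ValueError(f"Unknown input type: {input_type!r}")
    if section not in SECTIONS:
        raise ValueError(f"Unknown section: {section!r}")
    # Assumption: exactly four alphanumeric characters, following the XXXX placeholder.
    if not re.fullmatch(r"[A-Za-z0-9]{4}", number):
        raise ValueError(f"Model number must be 4 alphanumeric characters: {number!r}")
    return f"{sector}-{input_type}-{section}-{number}"


print(build_model_name("FIG", "BP", "PRO", "0001"))
```
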
### Models trained so far
1. FIG-BP-PRO-0001: Uses the AIMM summary as the prompt and the project narrative as the completion. Trained on FIG projects only (~300 samples).
2. FIG-BP-MAR-0001: Uses the AIMM summary as the prompt and the market narrative as the completion. Trained on FIG projects only (~300 samples).
3. ALL-BP-PRO-0001: Uses the project description as the prompt and the project narrative as the completion. Trained on projects from all sectors (~700 samples).
4. ALL-BP-MAR-0001: Uses the project description as the prompt and the market narrative as the completion. Trained on projects from all sectors (~700 samples).


In [6]:
project_name = 'ALL-BP-SUM'
model_name = project_name+'-0001'

# Fine Tuning 

Fine tuning involves the following steps:
1. Preparing the dataset: The dataset is split into training and validation sets. Before the split, it is checked with OpenAI's data-preparation tool to confirm that the prompts and completions meet the requirements for fine-tuning.
2. Fine-tuning: The split datasets are uploaded to OpenAI to fine-tune the GPT model. The final model is saved and can be accessed both from this notebook and from the OpenAI Playground.

## Preparing the dataset

In [7]:
# create a job for splitting dataset
run = wandb.init(project=project_name, job_type='split dataset')


[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [8]:
# path to the full dataset (expected in the working directory)
dataset_path = "./"+model_name+".json"

In [9]:
!head $dataset_path

{"prompt":"The proposed project consists of an equity investment of up to USD30 million AfricInvest IV LLC ('AF IV' or the 'Fund'), a generalist, closed-end private equity fund domiciled in Mauritius with a target size of USD500 million. Fund will invest in mid-market growth capital transactions in Africa. The Project is in IFC sector P-BA - Growth Equity Fund and in Africa Region.\n\n###\n\n","completion":" DEVELOPMENT IMPACT Summary Summary: The Project has an Anticipated Impact Measurement and Monitoring (AIMM) rating of Good, based on the AIMM score of 45. On an unadjusted basis (i.e. without likelihood factor), the full potential AIMM score could reach 70. The Project will focus on enhancing the provision of private equity for mid-cap companies across Africa; AfricInvest has a long and positive track record of supporting businesses based in countries with very limited exposure to private equity investments, for example in Tunisia, Algeria and Botswana, and it is expected that appr

In [10]:
!openai tools fine_tunes.prepare_data -f $dataset_path

Analyzing...

- Your JSON file appears to be in a JSONL format. Your file will be converted to JSONL format
- Your file contains 735 prompt-completion pairs
- There are 3 examples that are very long. These are rows: [358, 542, 720]
For conditional generation, and for classification the examples shouldn't be longer than 2048 tokens.
- All prompts end with suffix `.\n\n###\n\n`
- All completions end with suffix `\n[END]`

Based on the analysis we will perform the following actions:
- [Necessary] Your format `JSON` will be converted to `JSONL`
- [Recommended] Remove 3 long examples [Y/n]: Y


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified file to `./ALL-BP-SUM-0001_prepared.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "./ALL-BP-SUM-0001_prepared.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `.\n\n###\n\n` for the model to start generating co

In [11]:
dataset_path = "./"+model_name+"_prepared.jsonl"
# check number of samples
!wc -l $dataset_path

732 ./ALL-BP-SUM-0001_prepared.jsonl

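The preparation tool reported that every prompt ends with the separator and every completion ends with `\n[END]`. A rough way to re-check this on a prepared file is sketched below; `find_bad_rows` is a hypothetical helper, and the suffixes are the ones printed in the tool output above.

```python
import json


def find_bad_rows(jsonl_text: str,
                  prompt_suffix: str = "\n\n###\n\n",
                  completion_suffix: str = "\n[END]") -> list:
    """Return the line indices whose prompt or completion lacks the expected suffix."""
    bad_rows = []
    for index, line in enumerate(jsonl_text.split("\n")):
        if not line.strip():
            continue  # skip blank lines such as a trailing newline
        row = json.loads(line)
        prompt_ok = row["prompt"].endswith(prompt_suffix)
        completion_ok = row["completion"].endswith(completion_suffix)
        if not (prompt_ok and completion_ok):
            bad_rows.append(index)
    return bad_rows


sample = "\n".join([
    json.dumps({"prompt": "A project.\n\n###\n\n", "completion": " Summary\n[END]"}),
    json.dumps({"prompt": "No separator", "completion": " Summary\n[END]"}),
])
print(find_bad_rows(sample))
```

To check the real file, the text could be read with `Path(dataset_path).read_text()` and passed to the same function.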

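Splitting the prepared data into training and validation sets, with 25% going to the validation set. Because the `train_test_split` call passes `shuffle=False`, the rows are divided in file order rather than at random: the first 75% form the training set and the last 25% the validation set.
* Training set = 75%
* Validation set = 25%

The files are also logged to W&B for record-keeping.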

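The `train_test_split` call in this notebook passes `shuffle=False`, so the split follows row order. The sketch below estimates the resulting index ranges for the 732 prepared rows. It assumes scikit-learn's rounding rule (the test count is rounded up and the training set takes the rest); `sequential_split` is a hypothetical helper, not a scikit-learn function.

```python
import math


def sequential_split(n_samples: int, test_fraction: float):
    """Return (train_indices, test_indices) for an unshuffled split."""
    n_test = math.ceil(test_fraction * n_samples)  # assumed rounding rule
    n_train = n_samples - n_test
    return range(0, n_train), range(n_train, n_samples)


train_idx, test_idx = sequential_split(732, 0.25)
print(len(train_idx), len(test_idx))  # 549 183
```

If the source file is ordered (for example by sector), an unshuffled split may leave the validation set unrepresentative, which is worth checking before training.
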
In [12]:
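df = pd.read_json(dataset_path, orient='records', lines=True)
# shuffle=False keeps the file order: the first 75% of rows become the training set
# and the last 25% the validation set (random_state has no effect without shuffling).
df_train, df_test = train_test_split(df, test_size=0.25, random_state=42, shuffle=False)
df_train.to_json("./"+model_name+"_train.jsonl", orient='records', lines=True)
df_test.to_json("./"+model_name+"_test.jsonl", orient='records', lines=True)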

#Logging the files and tables into W&B 
table_train = wandb.Table(dataframe=df_train)
table_valid = wandb.Table(dataframe=df_test)

# Create artifacts
artifact_train = wandb.Artifact(model_name+"_train.jsonl", type='training_files', metadata={'samples': df_train.shape[0]})
artifact_train.add_file(model_name+"_train.jsonl")
artifact_train.add(table_train, model_name+"_train.jsonl")

artifact_valid = wandb.Artifact(model_name+"_test.jsonl", type='validation_files', metadata={'samples': df_test.shape[0]})
artifact_valid.add_file(model_name+"_test.jsonl")
artifact_valid.add(table_valid, model_name+"_test.jsonl")

# Log files
run.log_artifact(artifact_train)
run.log_artifact(artifact_valid)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f8849ff3af0>

Closing our dataprep run

In [13]:
# keep entity for reference of artifact later 
entity = wandb.run.entity
wandb.finish()


## Fine Tuning the model


In [14]:
train_file = "./"+model_name+"_train.jsonl"
valid_file = "./"+model_name+"_test.jsonl"

Defining hyperparameters:

The run uses OpenAI's default hyperparameters, with the base model set to `davinci`.

In [15]:
#Defining hyper parameters (using the default ones)
model = 'davinci'  # using the best model : davinci
n_epochs = 4
batch_size = 4
learning_rate_multiplier = 0.1
prompt_loss_weight = 0.1

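The hyperparameters above map one-to-one onto flags of the `fine_tunes.create` command run later in this notebook. As a sketch, the same command line can be assembled in Python, which makes the values easy to log or reuse. `build_fine_tune_command` is a hypothetical helper; the flag names are the ones used in this notebook's CLI call and belong to the legacy `openai` command-line tool.

```python
import shlex


def build_fine_tune_command(train_file, valid_file, model, suffix, **hparams):
    """Assemble the argument list for the legacy `openai api fine_tunes.create` call."""
    cmd = ["openai", "api", "fine_tunes.create",
           "-t", train_file,
           "-v", valid_file,
           "-m", model]
    for name, value in hparams.items():
        # Each hyperparameter becomes a `--name value` pair.
        cmd += [f"--{name}", str(value)]
    cmd += ["--suffix", suffix]
    return cmd


cmd = build_fine_tune_command(
    "./ALL-BP-SUM-0001_train.jsonl",
    "./ALL-BP-SUM-0001_test.jsonl",
    model="davinci",
    suffix="ALL-BP-SUM-0001",
    n_epochs=4,
    batch_size=4,
    learning_rate_multiplier=0.1,
    prompt_loss_weight=0.1,
)
print(shlex.join(cmd))
```
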
In [17]:
import os
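# "API-KEY" is a placeholder: substitute the real key at runtime and keep it out of version control
os.environ["OPENAI_API_KEY"] = "API-KEY"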

In [18]:
!openai api fine_tunes.create \
    -t $train_file \
    -v $valid_file \
    -m $model \
    --n_epochs $n_epochs \
    --batch_size $batch_size \
    --learning_rate_multiplier $learning_rate_multiplier \
    --prompt_loss_weight $prompt_loss_weight \
    --suffix $model_name

Upload progress: 100% 1.09M/1.09M [00:00<00:00, 426Mit/s]
Uploaded file from ./ALL-BP-SUM-0001_train.jsonl: file-wygPRDcIomBpvm9nTIc54HTx
Upload progress: 100% 446k/446k [00:00<00:00, 485Mit/s]
Uploaded file from ./ALL-BP-SUM-0001_test.jsonl: file-InshcNxiJSsWvJc9d9rGzZHu
Created fine-tune: ft-qtN3QMqxtKlCJ7mqY47qhWZn
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-03-09 21:30:41] Created fine-tune: ft-qtN3QMqxtKlCJ7mqY47qhWZn

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i ft-qtN3QMqxtKlCJ7mqY47qhWZn



In [1]:
!openai api fine_tunes.follow -i ft-qtN3QMqxtKlCJ7mqY47qhWZn

/bin/bash: openai: command not found


### Syncing fine-tune jobs to W&B

The fine-tune job is synced to W&B so that its results can be used later.


In [2]:
!openai wandb sync
wandb.finish()

/bin/bash: openai: command not found


NameError: ignored