# Instruction Tuning with GPT-4

This notebook produces the pie chart figures (as HTML) in the GPT-4-LLM paper. It analyzes the output that GPT-4 generates when following the instructions.

```
"Instruction Tuning with GPT-4" (https://arxiv.org/abs/2304.03277)
Baolin Peng*, Chunyuan Li*, Pengcheng He*, Michel Galley, Jianfeng Gao (*Equal Contribution)
```

- Project: https://instruction-tuning-with-gpt-4.github.io/
- Github Repo: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM

If you have any questions, please open an issue in the GitHub repo.



*Note: The original script comes from the [self-instruct repo](https://github.com/yizhongw/self-instruct/blob/main/self_instruct/instruction_visualize.ipynb). It uses the Berkeley Neural Parser to parse the generated instructions and visualizes the results with Plotly. Please install benepar by following its [installation instructions](https://github.com/nikitakit/self-attentive-parser#installation).*

In [31]:
import benepar, spacy

!python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')
doc = nlp("The time for action is now. It's never too late to do something.")

import matplotlib.pyplot as plt


Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


## 1. Generate Verb-Noun Pairs for GPT4 Output

:warning: Warning: The full pre-processing takes about 20 minutes to run and saves its result to a CSV file. You may skip it and load our pre-processed verb-noun CSV file in the next step.
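
One way to avoid repeating the slow pass is to cache its result and reuse the file when it already exists. The following is a minimal sketch of that pattern, not the notebook's own code: `load_or_build`, `slow_build`, and the temporary path are hypothetical names, and it uses the standard `csv` module instead of pandas to stay dependency-free.

```python
import csv
import os
import tempfile


def load_or_build(csv_path, build_rows):
    """Return cached rows from csv_path, or build them once and cache them."""
    fields = ["verb", "noun", "instruction_output"]
    if os.path.exists(csv_path):
        with open(csv_path, newline="") as f:
            return list(csv.DictReader(f))
    rows = build_rows()  # the slow step; runs only when no cache exists
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
    return rows


# Toy stand-in for the slow parsing pass (hypothetical data).
calls = []


def slow_build():
    calls.append(1)
    return [{"verb": "write", "noun": "story",
             "instruction_output": "Write me a story about education."}]


cache = os.path.join(tempfile.mkdtemp(), "verb_noun_output.csv")
first = load_or_build(cache, slow_build)   # builds and writes the CSV
second = load_or_build(cache, slow_build)  # served from the CSV
print(len(calls), second[0]["verb"], second[0]["noun"])
```

The second call reads the file instead of invoking `slow_build` again, which is the behavior you get by skipping the pre-processing cells and loading the provided CSV.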

In [32]:
def find_root_verb_and_its_dobj(tree_root):
    # first check if the current node and its children satisfy the condition
    if tree_root.pos_ == "VERB":
        for child in tree_root.children:
            if child.dep_ == "dobj" and child.pos_ == "NOUN":
                return tree_root.lemma_, child.lemma_
        return tree_root.lemma_, None
    # if not, recurse into the children; note that the return below exits on
    # the first child, so later siblings are never visited
    for child in tree_root.children:
        return find_root_verb_and_its_dobj(child)
    # reached only when the node has no children
    return None, None

def find_root_verb_and_its_dobj_in_string(s):
    doc = nlp(s)
    first_sent = list(doc.sents)[0]
    return find_root_verb_and_its_dobj(first_sent.root)

find_root_verb_and_its_dobj_in_string("Write me a story about education.")

('write', 'story')
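
To see how the traversal behaves without loading a spaCy model, the sketch below runs the same function on hand-built trees. `Tok` is a hypothetical stand-in that exposes only the attributes the function reads (`lemma_`, `pos_`, `dep_`, `children`), and the tags on the toy tokens are illustrative assumptions rather than real parser output.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Tok:
    """Hypothetical stand-in for the spaCy token attributes used below."""
    lemma_: str
    pos_: str
    dep_: str = ""
    children: List["Tok"] = field(default_factory=list)


def find_root_verb_and_its_dobj(tree_root) -> Tuple[Optional[str], Optional[str]]:
    if tree_root.pos_ == "VERB":
        for child in tree_root.children:
            if child.dep_ == "dobj" and child.pos_ == "NOUN":
                return tree_root.lemma_, child.lemma_
        return tree_root.lemma_, None
    for child in tree_root.children:
        return find_root_verb_and_its_dobj(child)  # first child only
    return None, None


# Verb root with a noun direct object.
write = Tok("write", "VERB", "ROOT",
            [Tok("me", "PRON", "dative"), Tok("story", "NOUN", "dobj")])
# Verb root whose direct object is not a noun.
explain = Tok("explain", "VERB", "ROOT", [Tok("it", "PRON", "dobj")])
# Non-verb root whose verb sits in the second child: it is never reached.
noun_root = Tok("tip", "NOUN", "ROOT",
                [Tok("good", "ADJ", "amod"), Tok("sleep", "VERB", "acl")])

print(find_root_verb_and_its_dobj(write))      # ('write', 'story')
print(find_root_verb_and_its_dobj(explain))    # ('explain', None)
print(find_root_verb_and_its_dobj(noun_root))  # (None, None)
```

The last case shows that a verb is found only if it lies on the path of first children below a non-verb root, which is worth keeping in mind when reading the counts.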

In [6]:
import json
with open("./dataset/medical_chat_data.json", 'r') as f:
    data_general = json.load(f)
with open("./dataset/quora_chat_data.json", 'r') as f:
    data_general += json.load(f)
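# Collect the first human turn of each conversation as the instruction.
gpt4_machine_generated_tasks = []
for data in data_general:
    # keep the text between the first "[|Human|]" marker and the following "[|AI|]" marker
    instruction = data['input'].split('\n[|Human|]')[1].split('\n[|AI|]')[0].strip()
    gpt4_machine_generated_tasks.append({'instruction': instruction})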
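
The chained `split` calls above can be read as "take what follows the first `[|Human|]` marker, then cut at the next `[|AI|]` marker". Here is a small sketch of that step in isolation; `first_human_turn` is a hypothetical helper name and the sample transcript is made up to match the layout the cell assumes.

```python
def first_human_turn(conversation: str) -> str:
    """Return the text of the first [|Human|] turn in a transcript."""
    after_human = conversation.split("\n[|Human|]")[1]
    return after_human.split("\n[|AI|]")[0].strip()


# Hypothetical sample in the assumed transcript layout.
sample = (
    "The conversation between human and AI assistant."
    "\n[|Human|] What are common symptoms of anemia?"
    "\n[|AI|] Fatigue and pale skin are common."
    "\n[|Human|] "
)
print(first_human_turn(sample))  # What are common symptoms of anemia?
```

A transcript without a `[|Human|]` marker would raise an `IndexError` at index `[1]`, so real data may need a guard.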
    

In [40]:
import pandas as pd
import json
import tqdm


generated_data_path = "./dataset/pmc_llama_instructions/release.json"

with open(generated_data_path, 'r') as fin:
    gpt4_machine_generated_tasks = json.load(fin)

len(gpt4_machine_generated_tasks)  # number of loaded tasks
# print(gpt4_machine_generated_tasks[0])

# Deduplicate the instructions; change the key to study another field of each task.
instruction_outputs = {task["instruction"] for task in gpt4_machine_generated_tasks}
print(len(instruction_outputs))

raw_phrases = []
for out in tqdm.tqdm(instruction_outputs):
    try:
        verb, noun = find_root_verb_and_its_dobj_in_string(out)
        raw_phrases.append({
            "verb": verb,
            "noun": noun,
            "instruction_output": out
        })
    except Exception as e:
        print(e)
        print(out)

70


100%|██████████| 70/70 [00:00<00:00, 194.64it/s]


In [41]:
len(raw_phrases)
raw_phrases = pd.DataFrame(raw_phrases)
raw_phrases.to_csv(r'data/pmc_verb_noun_output.csv')

## 2. Pie Chart Creation on Verb-Noun

Load our pre-processed verb-noun CSV file and create the HTML file with Plotly.

In [42]:
import pandas as pd
raw_phrases = pd.read_csv(r'data/pmc_verb_noun_output.csv')
# drop rows with missing values (e.g. no direct object found), then count each (verb, noun) pair
phrases = raw_phrases.dropna()
count_list = phrases[["verb", "noun"]].groupby(["verb", "noun"]).size().sort_values(ascending=False)


In [43]:
len(count_list)
# count_list[:25]

# count_list[:25].plot.barh()
# plt.ylabel('verb, noun')
# plt.xlabel('frequency')
# plt.show()

6

In [45]:
top_verbs = phrases[["verb"]].groupby(["verb"]).size().nlargest(20).reset_index()

df = phrases[phrases["verb"].isin(top_verbs["verb"].tolist())]
# df = df[~df["noun"].isin(["I", "what"])]
# df = phrases
# df[~df["verb"].isin(top_verbs["verb"].tolist())]["verb"] = "other"
# df[~df["verb"].isin(top_verbs["verb"].tolist())]["noun"] = "other"
df = df.groupby(["verb", "noun"]).size().reset_index().rename(columns={0: "count"}).sort_values(by=["count"], ascending=False)
# df = df[df["count"] > 10]
df = df.groupby("verb").apply(lambda x: x.sort_values("count", ascending=False).head(4)).reset_index(drop=True)
df

| | verb | noun | count |
|---|---|---|---|
| 0 | address | query | 7 |
| 1 | evaluate | description | 7 |
| 2 | provide | answer | 7 |
| 3 | provide | insight | 7 |
| 4 | provide | response | 7 |
| 5 | use | description | 7 |
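
The selection logic in the cell above (count pairs, keep the most frequent verbs, then keep the most frequent nouns for each of them) can be sketched in plain Python. This is an illustrative equivalent using `collections.Counter`, not the pandas code itself; the pairs are hypothetical and the cut-offs are reduced to 2 so the toy data stays small.

```python
from collections import Counter, defaultdict

# Hypothetical (verb, noun) pairs; None marks a missing direct object.
pairs = [
    ("provide", "answer"), ("provide", "answer"), ("provide", "insight"),
    ("write", "story"), ("write", "story"), ("write", None),
    ("explain", "concept"),
]

# Drop incomplete pairs, mirroring dropna().
complete = [(v, n) for v, n in pairs if v is not None and n is not None]

pair_counts = Counter(complete)
verb_counts = Counter(v for v, _ in complete)

top_verbs = [v for v, _ in verb_counts.most_common(2)]  # the notebook keeps 20

# For each top verb, keep its most frequent nouns (the notebook keeps 4).
nouns_by_verb = defaultdict(Counter)
for (v, n), c in pair_counts.items():
    nouns_by_verb[v][n] = c
rows = [(v, n, c) for v in top_verbs for n, c in nouns_by_verb[v].most_common(2)]
print(rows)
# [('provide', 'answer', 2), ('provide', 'insight', 1), ('write', 'story', 2)]
```

Each resulting row corresponds to one outer (verb) and inner (noun) segment of the sunburst chart.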


In [46]:
# Calculate the total counts
total_counts = df['count'].sum()

# Add a new column for probability
df['probability'] = df['count'] / total_counts

import numpy as np

# Compute the Shannon entropy of the verb-noun pair distribution:
# H(X) = -sum(p(x) * log2(p(x))), where p(x) is the probability of each pair

# Calculate entropy
entropy = -np.sum(df['probability'] * np.log2(df['probability']))
entropy

2.584962500721156
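
As a sanity check on this value: when all n outcomes are equally likely, the entropy reduces to log2(n). The table above lists six pairs with the same count, so the result should be log2(6), roughly 2.585 bits. The sketch below recomputes it with the standard `math` module, with the six counts copied from that table.

```python
import math

counts = [7, 7, 7, 7, 7, 7]  # the six pair counts from the table above
total = sum(counts)
probs = [c / total for c in counts]

# Shannon entropy in bits
entropy = -sum(p * math.log2(p) for p in probs)
print(entropy, math.log2(len(counts)))
```

Both printed numbers agree up to floating-point rounding, which is the maximum entropy possible for six categories.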

In [33]:

import plotly.graph_objects as go
import plotly.express as px

# df["blank"] = "ROOT"
# df = phrases.groupby(["verb", "noun"]).size().sort_values(ascending=False).head(5).reset_index().rename(columns={0: "count"})

df = df[df["count"] > 10]  # keep only pairs that occur more than 10 times
fig = px.sunburst(df, path=['verb', 'noun'], values='count')
# fig.update_layout(uniformtext=dict(minsize=10, mode='hide'))
fig.update_layout(
    margin=dict(l=0, r=0, t=0, b=0),
    font_family="Times New Roman",
    font_size=40,
)
# fig.show()
fig.write_html("data/baize_verb_noun_output.html")
# fig.write_image("output/verb_noun.pdf")  # static export; needs the kaleido package

In [32]:
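df['count'].sum() / 52002  # share covered by the plotted pairs; 52002 is presumably the hard-coded total number of instructions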

0.22447213568708896

In [6]:
### view plot
import json
import pandas as pd

generated_data_path = "data/gpt4_generations/filtered_medical_instruction.jsonl"

# Each line of the JSONL file is one JSON-encoded task.
with open(generated_data_path, "r") as fin:
    dataset = [json.loads(line) for line in fin]

len(dataset)

# Collect one list per metadata field; view and type are lower-cased before counting.
view_outputs = [task["view"].lower() for task in dataset]
type_outputs = [task["type"].lower() for task in dataset]
difficulty_outputs = [task["difficulty"] for task in dataset]
topic_outputs = [task["topic"] for task in dataset]
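
For readers unfamiliar with the JSONL layout, the sketch below parses a tiny in-memory example the same way. The two records are hypothetical (only the field names come from the cell above), and `io.StringIO` stands in for the real file so the snippet runs on its own.

```python
import io
import json

# Hypothetical two-record JSONL payload using the field names read above.
jsonl_text = (
    '{"view": "Patient", "type": "Chat", "difficulty": 2, "topic": "Genetics"}\n'
    '\n'
    '{"view": "Nurse", "type": "Rewrite", "difficulty": 4, "topic": "Anatomy"}\n'
)

with io.StringIO(jsonl_text) as f:  # stands in for open(path, "r")
    # one JSON object per line; blank lines are skipped
    dataset = [json.loads(line) for line in f if line.strip()]

view_outputs = [task["view"].lower() for task in dataset]
type_outputs = [task["type"].lower() for task in dataset]
print(len(dataset), view_outputs, type_outputs)
# 2 ['patient', 'nurse'] ['chat', 'rewrite']
```

Skipping blank lines is an addition in this sketch; `json.loads` would otherwise raise on an empty line.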

In [21]:
view_series = pd.Series(view_outputs)

# Use value_counts() to get the frequency of each view
view_counts = view_series.value_counts()

# Keep the 20 most frequent views
top_20_names = view_counts.head(20)

# Convert the Series to a DataFrame
df_view = pd.DataFrame({"Name": top_20_names.index, "Frequency": top_20_names.values})

# To keep every view instead of the top 20:
# df_view = pd.DataFrame({"Name": view_counts.index, "Frequency": view_counts.values})

# Reset the index so that it starts from 0
df_view.reset_index(drop=True, inplace=True)

In [22]:
df_view

| | Name | Frequency |
|---|---|---|
| 0 | medical student | 7219 |
| 1 | patient | 6919 |
| 2 | clinician | 2530 |
| 3 | pharmacist | 2465 |
| 4 | genetic counselor | 1584 |
| 5 | nurse | 1447 |
| 6 | expert | 1368 |
| 7 | physician | 1288 |
| 8 | radiologist | 1252 |
| 9 | student | 1209 |


In [26]:
# df_view= df_view[df_view["Frequency"] > 50]
fig = px.sunburst(df_view, path=['Name'], values='Frequency')
# fig.update_layout(uniformtext=dict(minsize=10, mode='hide'))
fig.update_layout(
    margin=dict(l=0, r=0, t=0, b=0),
    font_family="Times New Roman",
    font_size=80,
)
# fig.show()
fig.write_html("data/instruction_view.html")

In [24]:
type_series = pd.Series(type_outputs)

# Use value_counts() to get the frequency of each task type
type_counts = type_series.value_counts()

# Keep the 20 most frequent task types
top_names = type_counts.head(20)

# Convert the Series to a DataFrame
df_type = pd.DataFrame({"Name": top_names.index, "Frequency": top_names.values})
# To keep every task type instead of the top 20:
# df_type = pd.DataFrame({"Name": type_counts.index, "Frequency": type_counts.values})
# Reset the index so that it starts from 0
df_type.reset_index(drop=True, inplace=True)


df_type

| | Name | Frequency |
|---|---|---|
| 0 | text generation | 9142 |
| 1 | chat | 9082 |
| 2 | open q&a | 7278 |
| 3 | single-hop reasoning | 6624 |
| 4 | summarization | 5277 |
| 5 | multiple-hop reasoning | 4961 |
| 6 | classification | 4436 |
| 7 | usmle style q&a | 3545 |
| 8 | rewrite | 3267 |
| 9 | multiple-choice q&a | 3010 |


In [25]:
# df_type= df_type[df_type["Frequency"] > 50]
fig = px.sunburst(df_type, path=['Name'], values='Frequency')
# fig.update_layout(uniformtext=dict(minsize=10, mode='hide'))
fig.update_layout(
    margin=dict(l=0, r=0, t=0, b=0),
    font_family="Times New Roman",
    font_size=80,
)
# fig.show()
fig.write_html("data/instruction_type.html")

In [27]:
df_view["Frequency"].sum()/len(dataset)

0.553053613053613

In [28]:
df_type["Frequency"].sum()/len(dataset)

0.9724630924630925
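
These ratios measure how much of the dataset the categories kept for a chart account for: the frequencies of the top categories are summed and divided by the number of records. A minimal sketch of the same computation with `collections.Counter` follows; the labels are hypothetical and the cut-off is reduced to 2 for the toy data.

```python
from collections import Counter

# Hypothetical labels; the notebook uses the "view" and "type" fields.
labels = ["chat"] * 5 + ["rewrite"] * 3 + ["summarization"] + ["other"]

counts = Counter(labels)
top_k = counts.most_common(2)  # analogous to value_counts().head(k)
coverage = sum(freq for _, freq in top_k) / len(labels)
print(top_k, coverage)
# [('chat', 5), ('rewrite', 3)] 0.8
```

A coverage near 1 means the chart shows almost all records, while a lower value means a long tail of rarer categories is left out.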

In [90]:
df_type["Frequency"].sum()

56622

In [13]:
# Note: this cell reuses the view_* variable names for the topic field.
view_series = pd.Series(topic_outputs)

# Use value_counts() to get the frequency of each topic
view_counts = view_series.value_counts()

# Keep the 20 most frequent topics
top_20_names = view_counts.head(20)

# Convert the Series to a DataFrame
df_view = pd.DataFrame({"Name": top_20_names.index, "Frequency": top_20_names.values})

# To keep every topic instead of the top 20:
# df_view = pd.DataFrame({"Name": view_counts.index, "Frequency": view_counts.values})

# Reset the index so that it starts from 0
df_view.reset_index(drop=True, inplace=True)

In [14]:
df_view

| | Name | Frequency |
|---|---|---|
| 0 | Pharmacology | 5690 |
| 1 | Epidemiology | 5633 |
| 2 | Pathophysiology | 5309 |
| 3 | Genetics | 5183 |
| 4 | Medical Education | 4239 |
| 5 | Anatomy | 3929 |
| 6 | Diseases | 3718 |
| 7 | Treatment | 2750 |
| 8 | Diagnoses | 2617 |
| 9 | Neurology | 1017 |


In [15]:
# df_view= df_view[df_view["Frequency"] > 50]


# import plotly.graph_objects as go
import plotly.express as px

fig = px.sunburst(df_view, path=['Name'], values='Frequency')
# fig.update_layout(uniformtext=dict(minsize=10, mode='hide'))
fig.update_layout(
    margin=dict(l=0, r=0, t=0, b=0),
    font_family="Times New Roman",
    font_size=80,
)
# fig.show()
fig.write_html("data/instruction_topic.html")