# Exploring GPT-3

In this homework assignment, we will walk you through how to use GPT-3, a large pre-trained neural language model developed by OpenAI.

You will learn about the following topics:
* Prompts and completions.  You should observe that the generated text is fluent and high quality, but not necessarily factually accurate.
* Probabilities.  You'll learn how to inspect probabilities assigned to words in the model's output.
* Few shot learning.  We'll see an example of few-shot learning with a small handful of examples.
* Zero shot learning.  We will explore the zero-shot capabilities of pre-trained LMs.  You'll design zero-shot prompts for
1. summarization
2. question-answering
3. simplification
4. translation
* How to fine-tune a model.  You will learn how to fine-tune GPT-3 to take a Wikipedia infobox as input and generate the text of a biography as its output.  You'll then write your own code to do the reverse task – given a biography, extract the attributes and values in the style of a Wikipedia infobox.



# Prompt Completion

As a warm-up we'll have you play with [the OpenAI Playground](https://beta.openai.com/playground).  Try inputting this prompt:

> One of my favorite professors at the University of Pennsylvania is 

Then click the "Submit" button to generate a completion.

Copy the generated text (including your prompt) and paste it into the code cell below.

You might notice that the text that GPT-3 generates ends mid-sentence.  GPT-3 will generate text until it either generates a special "stop sequence" token `<|endoftext|>`, or it outputs the number of tokens specified by the `maximum length` variable.
You can press Submit again to have it continue generating, or you can increase the maximum length using the slider in the sidebar on the right.

In [None]:
favorite_professor_completion_1 = """
One of my favorite professors at the University of Pennsylvania is COPY AND PASTE THE COMPLETION HERE
"""

GPT-3 generates fluent text, but it is not always grounded in fact.  Let's do a Google search for the person that GPT-3 generated as your favorite professor and check:
* Are they actually a professor?
* Where do they work?

In [None]:
# Extract the professor's name
professor_name_1 = ""

# Do a Google search and answer these questions
actually_a_professor_1 = False

# Institution where they work
instituion_1 = ""

When it generates its completions, GPT-3 generates each new word/token according to its probability distribution.  It draws each word at random in proportion to its probability.  That randomness means that it can generate different completions. You can re-generate and get different completions each time.
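The sampling step described above can be sketched in a few lines of Python.  This is a toy illustration rather than GPT-3's actual implementation: the tokens and probabilities in `next_token_probs` are made up, and `sample_token` is a hypothetical helper.

```python
import random

# Hypothetical next-token distribution (made-up values for illustration).
next_token_probs = {"Joe": 0.5, "Nancy": 0.3, "David": 0.2}

def sample_token(probs, rng):
    """Draw one token at random, in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded only so that the demo is repeatable
samples = [sample_token(next_token_probs, rng) for _ in range(1000)]
counts = {token: samples.count(token) for token in next_token_probs}
print(counts)
```

Over many draws, the more probable tokens show up more often, but any single draw can return a less likely token.  That is why pressing Regenerate gives you a different completion.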

Generate another 4 completions for the professor prompt:

> One of my favorite professors at the University of Pennsylvania is 

and do Google searches for them.

*Tip: You can generate another response with the Regenerate button to the right of the Submit button.  The Regenerate button has a recycle symbol on it.*

In [None]:
favorite_professor_completion_2 = """
COPY AND PASTE THE 2ND COMPLETION HERE
"""

favorite_professor_completion_3 = """
COPY AND PASTE THE 3RD COMPLETION HERE
"""

favorite_professor_completion_4 = """
COPY AND PASTE THE 4TH COMPLETION HERE
"""

favorite_professor_completion_5 = """
COPY AND PASTE THE 5TH COMPLETION HERE
"""

# Do a Google search for these professors

professor_name_2 = ""
actually_a_professor_2 = False
instituion_2 = ""

professor_name_3 = ""
actually_a_professor_3 = False
instituion_3 = ""

professor_name_4 = ""
actually_a_professor_4 = False
instituion_4 = ""

professor_name_5 = ""
actually_a_professor_5 = False
instituion_5 = ""

## Probabilities

Just like the n-gram language models that we studied earlier in the course, neural language models like GPT-3 assign probabilities to each token in a sequence.

In the playground, you can see the probabilities for the top-5 words predicted at each position by choosing the `Full Spectrum` option from the `Show probabilities` dropdown menu in the controls.  Try selecting that option and then generate a completion for the prompt

> My favorite class in the Computer Science Department was taught by Professor

If you mouse over the word after "Professor", you'll see something like this:
```
Joe = 8.21%
John = 4.25%
Nancy = 2.27%
David = 2.09%
Barbara = 2.05%
Total: -2.50 logprob on 1 tokens
(18.87% probability covered in top 5 logits)
```
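The numbers in this display are related to each other.  The short check below (a sketch that only uses the values shown above) converts the reported log probability back into a percentage and adds up the top-5 percentages.  It assumes the log probability is a natural logarithm, which is consistent with the figures shown.

```python
import math

# Top-5 percentages copied from the Playground example above.
top5_percent = {"Joe": 8.21, "John": 4.25, "Nancy": 2.27, "David": 2.09, "Barbara": 2.05}

# Log probability reported for the single generated token.
logprob = -2.50
prob_percent = math.exp(logprob) * 100
print(f"exp({logprob}) = {prob_percent:.2f}%")

# Sum of the five percentages, i.e. the probability covered by the top 5.
covered = sum(top5_percent.values())
print(f"{covered:.2f}% covered by the top 5")
```

The first value matches the percentage listed for `Joe`, and the second matches the "probability covered" line.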

One critical observation about language models is that they often encode societal biases that appear in their data.  For instance, after the discovery that LM embeddings could be used to solve word analogy problems like "**man** is to **woman** as **king** is to ___" (the model predicts **queen**), researchers discovered that LMs had a surprisingly sexist answer to the analogy problem  "**man** is to **woman** as **computer programmer** is to ___" (the model predicts **homemaker**).  These kinds of biases are prevalent and pernicious.

Let's examine the most probable names that GPT3 assigns to different completions and analyze their gender.  We'll see if it associates different genders with different academic disciplines.  (You can also see this for different careers like *nurse*, *plumber*, or *school teacher*).

Please create dictionaries mapping GPT's predictions for the first names of professors in these departments to their genders:
* Computer Science
* Gender Studies
* Physics
* Linguistics
* Bioengineering

Use the prompt:
> My favorite class in the {department_name} Department was taught by Professor

**Note: you can also add a stop sequence of `.` to get the model to complete only a single sentence.**



In [None]:

# Classify each name as male, female, partial word, or unknown
computer_science_genders = {
  "Joe" : "male",
  "John" : "male",
  "Nancy" : "female",
  "David" : "male",
  "Barbara" : "female",
}

gender_studies_genders = {
  "TODO" : "TODO",
}

physics_genders = {
  "TODO" : "TODO",
}

lingusitics_genders = {
  "TODO" : "TODO",
}

bioengineering_genders = {
  "TODO" : "TODO",
}

(If you wanted to systematically explore the predictions of the model, you could use the API's `logprobs` argument to return the log probabilities of the `logprobs` most likely tokens, as well as of the chosen tokens.)
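Here is a rough sketch of what that could look like.  It assumes the same pre-1.0 `openai` Python package that the rest of this notebook uses, and it assumes the response stores the per-position alternatives under `choices[0]["logprobs"]["top_logprobs"]`.  Both helper functions are hypothetical names, and the values in `example_top_logprobs` are a hand-written stand-in for an API response, not real model output.

```python
import math

def top_token_probabilities(top_logprobs):
    """Convert a {token: logprob} dict into (token, probability) pairs,
    sorted from most to least probable."""
    pairs = [(token.strip(), math.exp(lp)) for token, lp in top_logprobs.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

def fetch_first_token_logprobs(prompt, model="text-davinci-002", n=5):
    """Ask the API for the n most likely first tokens of the completion.
    Needs the openai package and an API key, so it is not called here."""
    import openai
    response = openai.Completion.create(
        model=model, prompt=prompt, max_tokens=1, temperature=0, logprobs=n)
    return response["choices"][0]["logprobs"]["top_logprobs"][0]

# Hand-written stand-in for an API response (illustrative values only).
example_top_logprobs = {" John": -3.16, " Joe": -2.50, " Nancy": -3.79}
for token, prob in top_token_probabilities(example_top_logprobs):
    print(f"{token}: {prob:.2%}")
```

With a helper like this you could loop over the five department names and collect the top names for each one without using the Playground.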

# Few Shot Learning

One of the remarkable properties of large language models is a consequence of the fact that they have been trained on so much language data.  They encode that training data as background information that lets them learn new tasks and generalize patterns using only a few examples.  This is called "few-shot learning".

Here is an example.  Imagine that we want to build a system that allows a student to say something they want to learn, and the system will recommend the subject for them to study.  Here are examples of inputs and outputs to our program:

```
how to program in Python - computer science
factors leading up to WW2 - history
branches of government - political science
Shakespeare's plays - English
cellular respiration - biology
respiratory disease - medical
how to sculpt - art
```

We can use these 7 examples (and probably fewer!) as a prompt to GPT-3.  It will perform few-shot learning by figuring out what our pattern is and then applying it to new inputs.
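If you would rather build the prompt programmatically than type it by hand, the pattern can be assembled like this.  This is a minimal sketch: `build_few_shot_prompt` is a hypothetical helper, and it simply leaves the last line unfinished so that the model fills in the subject.

```python
# The seven (topic, subject) pairs from the example above.
examples = [
    ("how to program in Python", "computer science"),
    ("factors leading up to WW2", "history"),
    ("branches of government", "political science"),
    ("Shakespeare's plays", "English"),
    ("cellular respiration", "biology"),
    ("respiratory disease", "medical"),
    ("how to sculpt", "art"),
]

def build_few_shot_prompt(examples, topic):
    """Format the examples one per line, then add the new topic with no subject."""
    lines = [f"{example_topic} - {subject}" for example_topic, subject in examples]
    lines.append(f"{topic} -")
    return "\n".join(lines)

prompt = build_few_shot_prompt(examples, "planetary orbits")
print(prompt)
```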

Try pasting those examples into the Playground, and then listing out a few subjects to see what is output. 

```
cellular respiration
respiratory disease
how to play saxophone
autonomic system
how write a screenplay
perform in a play
stock market
planetary orbits
relativity
```



Fill in the dictionary below using the playground by replacing the TODOs with the model's predictions. 

In [None]:
few_shot_subject_classification_results = {
  "cellular respiration" : "TODO",
  "respiratory disease" : "TODO",
  "how to play saxophone" : "TODO",
  "autonomic system" : "TODO",
  "how write a screenplay" : "TODO",
  "perform in a play" : "TODO",
  "stock market" : "TODO",
  "planetary orbits" : "TODO",
  "relativity" : "TODO",
}

## Using the API

Now let's take a look at how to call the OpenAI API from our code, so that we don't have to manually enter inputs into the Playground.  

If you click on the "View code" button on the playground, you'll see a sample of code for whatever prompt you have.  For example, here's the code that we have for our few-shot learning that generates a subject to study for a topic that someone is interested in:

```python
import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

response = openai.Completion.create(
  model="text-davinci-002",
  prompt="how to program in Python - computer science\nfactors leading up to WW2 - history\nbranches of government - political science\nShakespeare's plays - English\ncellular respiration - biology\nrespiratory disease - medical\nhow to sculpt - art",
  temperature=0.7,
  max_tokens=256,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0
)
```
This is Python code, so it'll be pretty easy for us to use this as a starting point and to modify it to create a function that we can call.


First, you'll need to install the OpenAI Python library via pip.  You can run pip and other Unix commands in a Colab notebook by prefixing them with an exclamation point.  (The `%%capture` command before it just suppresses the output of the Unix command.  You can remove it if you want to see the command's progress.)


In [None]:
%%capture
!pip install openai

Next, you will enter your secret key for the OpenAI API.  You can find your OpenAI API key [here](https://beta.openai.com/account/api-keys).  

We will enter it as a password so that the raw text of the key doesn't get saved in your Python notebook, where it could leak if you later make the notebook public.  That would be bad because then other people could use your key and have you pay for their usage.

In [None]:
from getpass import getpass
import os
import openai  # needed here because we set openai.api_key below

print('Enter OpenAI API key:')
openai.api_key = getpass()

os.environ['OPENAI_API_KEY']=openai.api_key

Enter OpenAI API key:
··········


Now let's write a function that takes a topic as input and then outputs a subject to study if you want to learn about that topic.

In [None]:
import openai
import os
import time

def generate_subject_few_shot(topic):
  few_shot_prompt = """how to program in Python - computer science
factors leading up to WW2 - history
branches of government - political science
Shakespeare's plays - English
cellular respiration - biology
respiratory disease - medical
how to sculpt - art
"""

  response = openai.Completion.create(
      model="text-davinci-002",
      prompt=few_shot_prompt + topic + " - ", # We'll append our topic and a dash to the end of the few shot prompt.
      temperature=0.7,
      max_tokens=256,
      top_p=1,
      frequency_penalty=0,
      presence_penalty=0,
      stop=["\n"]
  )
  # I recommend putting a short wait after each call, 
  # since the rate limit for the platform is 60 requests/min.
  # (This increases to 3000 requests/min after you've been using the platform for 2 days).
  time.sleep(1)

  # the response from OpenAI's API is a JSON object that contains 
  # the completion to your prompt plus some other information.  Here's how to access
  # just the text of the completion. 
  return response['choices'][0]['text'].strip()

topic = "cellular respiration"
generate_subject_few_shot(topic)

'biology'

That's it!  That's an example of how to write a function that calls the OpenAI API to output a subject for a topic. 

Here is some information about the different arguments that we pass to the `openai.Completion.create` call:
 * `model` – OpenAI offers four different sized versions of the GPT-3 model: davinci, curie, babbage and ada.  Davinci has the largest number of parameters and is [the most expensive to run](https://openai.com/api/pricing/).  Ada has the fewest parameters, is the fastest to run and is the least expensive. 
 * `prompt` - this is the prompt that the model will generate a completion for.
 * `temperature` - controls how random the sampling is.  The model's scores for the candidate tokens are divided by the temperature before they are turned into probabilities.  1.0 means that it samples from the model's unmodified probability distribution.  Values below 1.0 (like 0.7) sharpen the distribution, so that likely tokens are chosen even more often.  0.0 means that it will behave (nearly) deterministically and always output the single most probable token for each context. 
 * `top_p` - is an alternative way of controlling the sampling, known as nucleus sampling.  The model samples only from the smallest set of most probable tokens whose combined probability reaches `top_p`.  A value of 1 means that all tokens are considered. 
 * `frequency_penalty` and `presence_penalty` are two ways of discouraging the model from repeating the same words in one output.  You can set these to be >0 if you're seeing a lot of repetition in your output. 
 * `max_tokens` is the maximum length in tokens that will be output by a single call.  A token is a subword unit.  Common English words are usually a single token, so there is a little more than one token per word on average (OpenAI's rule of thumb is that a token is about three-quarters of a word).
 * `stop` is a list of stop sequences.  The model will stop generating output once it generates one of these strings, even if it hasn't reached the max token length.  If you don't set it, generation stops only at the special token `<|endoftext|>` or at the max token length.

You can read more about [the Completion API call in the documentation](https://beta.openai.com/docs/api-reference/completions).
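To make `temperature` and `top_p` more concrete, here is a toy sketch of how these two settings are commonly implemented in sampling code.  It reflects standard practice rather than OpenAI's internal code, the `logits` values are made up, and both function names are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities. Lower temperatures sharpen the
    distribution. This sketch requires temperature > 0."""
    scaled = [score / temperature for score in logits]
    largest = max(scaled)  # subtracted for numerical stability
    exps = [math.exp(score - largest) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their combined probability
    reaches top_p, then renormalize. Returns {token_index: probability}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temperature in (1.0, 0.7, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

print(top_p_filter(softmax_with_temperature(logits, 1.0), top_p=0.9))
```

As the temperature drops, more of the probability mass moves to the highest-scoring token.  With `top_p=0.9` the least likely of the three tokens is removed before sampling.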

# Zero shot learning

In addition to few shot learning, GPT-3 can sometimes also perform "zero shot learning" where instead of giving it several examples of what we want it to do, we can instead give it instructions of what we want it to do.

For example, for our topic - subject task we could give GPT-3 the prompt

> Given a topic, output the subject that a student should study if they want to know more about that topic.

Then if we append 
> cellular respiration -

GPT3 will output biology.

Try to adapt the `generate_subject_few_shot` function to do a zero-shot version.

In [None]:
def generate_subject_zero_shot(topic):
  # TODO - write this function
  pass

A very cool recent finding is that the training procedure for large language models can be changed to improve this instruction-following behavior.  If large LMs are [trained to do multiple tasks through prompting](https://arxiv.org/abs/2110.08207), they generalize better to new tasks in a zero-shot fashion.  The current version of GPT3 (text-davinci-002) uses this kind of training.

Try writing zero-shot prompts to do the following tasks:
1. Summarize a Wikipedia article.
2. Answer questions about an article.
3. Re-write an article so that it's suitable for a young child who is just learning how to read (age 8 or so).
4. Translate an article from Russian into English.

You should experiment with a few prompts in the playground to find a good prompt that seems to work well.

In [None]:
def summarize(article_text):
  # TODO - write this function 
  summary = ""
  return summary

def answer_question(article_text, question):
  # TODO - write this function 
  answer = ""
  return answer

def simplify(article_text):
  # TODO - write a function to re-write an article so that it's suitable for a young child.
  simplified_article = ""
  return simplified_article

def translate(article_text, source_language, target_language):
  # TODO - write a function to translate an article from a source language to a target language.
  translation = ""
  return translation

Show the outputs of your prompts.  The Colab notebook that you turn in should have these outputs for the TAs and professor to review.

In [None]:
article_text = """
TODO - copy and paste part of a Wikipedia article here.
"""

summarize(article_text)

In [None]:
article_text = """
TODO - copy and paste part of a Wikipedia article here.
"""
questions = [
    "TODO - add question 1",
    "TODO - add question 2",
    "TODO - add question 3",
    "TODO - add question 4",
    "TODO - add question 5",
]

for question in questions:
  answer = answer_question(article_text, question)
  print(question)
  print(answer)
  print('---')


In [None]:
article_text = """
TODO - copy and paste part of a Wikipedia article here.
"""

simplify(article_text)

In [None]:
russian_article = """
TODO - copy and paste part of a Russian language Wikipedia article here.
"""

source_language = "Russian"
target_language = "English"
translate(russian_article, source_language, target_language)

## TODO - Pick your own task

For this section you should pick some task that you'd like to have GPT3 do.  Add a description and code to your notebook here.  You should:
1. Write a short description of what task you tried and why you were interested in it.
2. Give some code so that we can reproduce what you did via an OpenAI API call.  You should include the output of your code in the Python notebook that you turn in.
3. Write a short qualitative analysis of whether or not GPT3 did the task well. 

TODO - your task description

In [None]:
# TODO your code

TODO - write a short paragraph giving your qualitative analysis of how well GPT3 did for your task.

# Fine Tuning

In addition to zero-shot and few-shot learning, another way of getting large language models to do your tasks is via a process called "fine tuning".  In fine-tuning the model updates its parameters so that it performs well on many training examples.  The training examples are in the form of input prompts paired with gold standard completions.

Large language models are pre-trained to perform well on general tasks like text completion, but not on the specific task that you might be interested in.  The models can be fine-tuned to perform your task, starting with the model parameters that are good for the general setting and then updating them to be good for your task. 

We'll walk through how to fine-tune GPT3 for a task.


For this example, we will show you how to fine-tune GPT3 to write biographies from the data in the infoboxes of Wikipedia pages.  For instance, given this input 

```
notable_type: scientist
name: Zulima Aban
gender: female
birth_date: 05 December 1905
birth_place: Valencia, Spain
death_date: 09 August 1983
death_place: Detroit, Michigan, U.S.
death_cause: Pulmonary embolism
occupation: Astronomer
fields: Astrophysics, Computer Science, Computer Graphics, Interface Design, Image Synthesis
known_for: The Search for Planet Nine
hometown: Detroit, Michigan, U.S.
nationality: Venezuelan
citizenship: Spanish, American
alma_mater: University of Valencia (B.Sc.), University of Madrid (Ph.D.)
thesis_title: The Formation of Planets by the Accretion of Small Particles
thesis_year: 1956
doctoral_advisor: Angela Carter
awards: Spanish Academy of Science, Spanish Academy of Engineering, German Aerospace Prize, IEEE Medal of Honor, IEEE John von Neumann Medal, IEEE Jack S. Kilby Signal Processing Medal, United Nations Space Pioneer Award, Wolf Prize in Physics
institutions: Oberlin College, University of Valencia, Instituto de Astrofísica de Andalucía (CSIC), University of Southern California, Space Telescope Science Institute (STScI)
notable_students: Ryan Walls
influences: Immanuel Kant, Albert Einstein, Kurt Gödel, Gottfried Leibniz, Richard Feynman, Werner Heisenberg, William Kingdon Clifford, Sir Arthur Eddington
influenced: Joseph Weinberg
mother: Ana Aban
father: Joaquín Aban
partner: Georgina Abbott
children: Robert, Peter, Sarah
```

The fine-tuned model will generate this output:

> Zulima Aban was a Venezuelan astronomer, who was born on 05 December 1905 in Valencia, Spain to Ana Aban and Joaquín Aban. Her career involved the fields of Astrophysics, Computer Science, Computer Graphics, Interface Design, Image Synthesis. Aban was known for The Search for Planet Nine. Aban went to University of Valencia (B.Sc.), University of Madrid (Ph.D.). Aban's thesis title was The Formation of Planets by the Accretion of Small Particles in 1956. Her doctoral advisor was Angela Carter. Aban received Spanish Academy of Science, Spanish Academy of Engineering, German Aerospace Prize, IEEE Medal of Honor, IEEE John von Neumann Medal, IEEE Jack S. Kilby Signal Processing Medal, United Nations Space Pioneer Award, Wolf Prize in Physics. Aban went to Oberlin College, University of Valencia, Instituto de Astrofísica de Andalucía (CSIC), University of Southern California, Space Telescope Science Institute (STScI). Her notable students were Ryan Walls. Aban was influenced by Immanuel Kant, Albert Einstein, Kurt Gödel, Gottfried Leibniz, Richard Feynman, Werner Heisenberg, William Kingdon Clifford, Sir Arthur Eddington and she infuenced Joseph Weinberg. Aban was married to Georgina Abbott and together had three children, Robert, Peter, Sarah. Aban died on 09 August 1983 in Detroit, Michigan, U.S due to Pulmonary embolism.

The dataset that we will use was created for the paper [SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets](https://www.cis.upenn.edu/~ccb/publications/synthbio.pdf) by Ann Yuan, Daphne Ippolito, Vitaly Nikolaev, Chris Callison-Burch, Andy Coenen, and Sebastian Gehrmann. It was published in NeurIPS 2021.  The goal of the paper was to create a curated dataset for training large language models on synthetic data with the goal of avoiding the gender and geographic bias that is naturally present in Wikipedia due to cultural and historic reasons. 


## Load the data

In [None]:
!wget https://raw.githubusercontent.com/artificial-intelligence-class/artificial-intelligence-class.github.io/master/homeworks/large-LMs/SynthBio_train.json

--2022-08-07 23:32:22--  https://raw.githubusercontent.com/artificial-intelligence-class/artificial-intelligence-class.github.io/master/homeworks/large-LMs/SynthBio_train.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5807118 (5.5M) [text/plain]
Saving to: ‘SynthBio_train.json’


2022-08-07 23:32:22 (111 MB/s) - ‘SynthBio_train.json’ saved [5807118/5807118]



In [None]:
# Load 'SynthBio_train.json', which is a list of JSON objects, and
# return up to num_bios (attributes, biography) pairs.

import json
import random

def load_wiki_bio_data(filename='SynthBio_train.json', num_bios=100, randomized=True):
  with open(filename) as f:
    synth_bio_data = json.load(f)
  # Only shuffle when requested, so that randomized=False keeps the file order.
  if randomized:
    random.shuffle(synth_bio_data)
  bios = []
  for data in synth_bio_data:
    notable_type = data['notable_type']
    attributes = "notable_type: {notable_type} | {other_attributes}".format(
        notable_type = notable_type, 
        other_attributes = data['serialized_attrs']
    )
    biography = data['biographies'][0]
    bios.append((attributes.replace(" | ", "\n"), biography))
  return bios[:min(num_bios, len(bios))]

wiki_bios = load_wiki_bio_data()


In [None]:
attributes, bio = wiki_bios[0]
print(attributes)
print('---')
bio


notable_type: mountaineer
name: Olev Vilhelmson
gender: male
nationality: Latvian
birth_date: 07 November 1876
birth_place: Jelgava, Latvia
death_date: 21 February 1952
death_place: Geneva, Switzerland
death_cause: pneumonia
resting_place: La Chaux-de-Fonds Cemetery
start_age: 12
notable_ascents: First ascent of Mt. Kebnekaise in 1894, Youngest person to climb All peaks higher than 4,000 meter 1895
final_ascent: Highest Peak in Sweden and Scandinavia
partnerships: Gustaf Bergmann, Björn Dunker
mother: Emilia Vilhelmina Bartlett
father: Gustav Vilhelmson
partner: Anna Hulthorst
children: Vilhelmina Bergman-Malmstroem, Elsa Bergman-Malmstroem
---


'Olev Vilhelmson (7 November, 1876 - 21 February, 1952) was a Latvia. He was born in Jelgava, Latvia, His start age is 12. He had partnership with Gustaf Bergmann, Bjorn dunker. His father was Gustav Vilhelmson and his mother was Emilia Vilhelmina Bartlett. Olev first ascent was Mt. Kebnekaise in 1894. He was also the youngest person to climb all peaks higher than 4,000 meters in 1895. He died of pneumonia on February 21, 1952 in Geneva, Switzerland. He was buried in the cemetery of Chaux-de-Fonds. He was married to Anna Hulthorst and had two children Vilhelmina Bergman-Malmstroem, Elsa Bergman-Malmstroem.'

## Format Data for Fine-Tuning 

Below, I show how to format data for fine-tuning with the OpenAI API.  The OpenAI API documentation has a [guide to fine-tuning models](https://beta.openai.com/docs/guides/fine-tuning) that you should read.   The basic format of fine-tuning data is a JSONL file (one JSON object per line) with two key-value pairs: `prompt` and `completion`.

```
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
...
```

In the code below, I'll create a prompt that contains the `attributes` variable from the SynthBio data, and I'll have the completion be the `biography` variable.

In [None]:
import json
import random

def create_wikibio_finetuning_data(wikibios, fine_tuning_filename):
  fine_tuning_data = []

  # Iterate over the function's argument rather than the global wiki_bios.
  for attributes, bio in wikibios:
    prompt = "{attributes}\n---\n".format(attributes=attributes)
    completion = "Biography: {bio}\n###".format(bio=bio)
    data = {}
    data['prompt'] = prompt
    data['completion'] = completion
    fine_tuning_data.append(data)

  random.shuffle(fine_tuning_data)
  with open(fine_tuning_filename, 'w') as out:
    for data in fine_tuning_data:
        out.write(json.dumps(data))
        out.write('\n')


fine_tuning_filename='wikibio_finetuning_data.jsonl'
create_wikibio_finetuning_data(wiki_bios, fine_tuning_filename)
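Before paying for a fine-tuning run, it can be worth checking that the file really has the expected shape.  The sketch below is one possible sanity check, not part of the OpenAI tooling: `check_finetuning_file` is a hypothetical helper, and the two demo records are made up.  It assumes the `\n---\n` separator and `###` stop sequence used in the cell above.

```python
import json
import os
import tempfile

def check_finetuning_file(filename, separator="\n---\n", stop="###"):
    """Return a list of human-readable problems found in a JSONL file."""
    problems = []
    with open(filename) as f:
        for line_number, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {line_number}: not valid JSON")
                continue
            if not isinstance(record, dict) or set(record) != {"prompt", "completion"}:
                problems.append(f"line {line_number}: expected exactly the keys prompt and completion")
                continue
            if not record["prompt"].endswith(separator):
                problems.append(f"line {line_number}: prompt does not end with the separator")
            if not record["completion"].endswith(stop):
                problems.append(f"line {line_number}: completion does not end with the stop sequence")
    return problems

# Made-up demo data: the first record is well formed, the second is not.
demo_records = [
    {"prompt": "name: Ada Example\n---\n", "completion": "Biography: Ada Example was a writer.\n###"},
    {"prompt": "name: Bob Example", "completion": "Biography: Bob Example was a painter."},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    for record in demo_records:
        tmp.write(json.dumps(record) + "\n")

demo_problems = check_finetuning_file(tmp.name)
os.remove(tmp.name)
print(demo_problems)
```

To check your real data, you would call `check_finetuning_file(fine_tuning_filename)` and expect an empty list.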

Next, we'll perform fine-tuning with this data using OpenAI. 

In [None]:
%%capture
!pip install --upgrade openai
!pip install jsonlines
!pip install wandb

Once you've got access to the OpenAI API, you can find your OpenAI API key [here](https://beta.openai.com/account/api-keys).

In [None]:
import os
import openai

from getpass import getpass
print('Enter OpenAI API key:')
openai.api_key = getpass()

os.environ['OPENAI_API_KEY']=openai.api_key

In [None]:
!head '{fine_tuning_filename}'

{"prompt": "notable_type: writer\nname: Helmut Berger\ngender: male\nnationality: German\nbirth_date: 22 October 1881\nbirth_place: Schonwald Germany\ndeath_date: 2 July 1954\ndeath_place: Locarno Switzerland\ndeath_cause: hemorrhage, hypertension\nresting_place: La Chaux-de-Fonds Cemetery\nalma_mater: University of Freiburg\neducation: Master of Arts in Philosophy\noccupation: German philosopher, poet, and translator\nnotable_works: \"Das Wunder\", \"Der gute Mensch\"\nlanguage: German\ngenre: poetry\nawards: Nobel Prize in Literature\nmother: Anna Augusta Elisabeth B\u00fclow\npartner: Bertha Margarete Deysing\nchildren: Klaus\n---\n", "completion": "Biography: Helmut Berger was a German philosopher, poet, and translator. He was born on October 22, 1881 in Schonwald, Germany. He attended the University of Freiburg, where he earned a Master of Arts in Philosophy. Berger's notable works include \"Das Wunder\", \"Der gute Mensch\", \"Das Wunder\". Berger won the Nobel Prize in Literatur

## Run the fine-tuning API

Next, we'll make the fine tuning API call via the command line.  Here the -m argument gives the model.  There are 4 sizes of GPT3 models.  They go in alphabetical order from smallest to largest.
* Ada 
* Babbage
* Curie
* Davinci

As the model sizes increase, so do their quality and their cost.  Davinci is the highest quality and highest cost model.  I recommend starting by fine-tuning smaller models to debug your code first so that you don't rack up costs.  Once you're sure that your code is working as expected, you can fine-tune a davinci model.


In [None]:
!openai api fine_tunes.create -t '{fine_tuning_filename}' -m curie
#!openai api fine_tunes.create -t '{fine_tuning_filename}' -m davinci

Logging requires wandb to be installed. Run `pip install wandb`.
Upload progress: 100% 132k/132k [00:00<00:00, 198Mit/s]
Uploaded file from wikibio_finetuning_data.jsonl: file-pVPsl61CIrLMYuQdYG9fNPJb
Created fine-tune: ft-242X8nFrzhXtWDxMAsyIgtD3
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2022-08-07 21:14:13] Created fine-tune: ft-242X8nFrzhXtWDxMAsyIgtD3
[2022-08-07 21:14:17] Fine-tune costs $4.07
[2022-08-07 21:14:17] Fine-tune enqueued. Queue number: 0
[2022-08-07 21:14:18] Fine-tune started
[2022-08-07 21:17:54] Completed epoch 1/4
[2022-08-07 21:18:56] Completed epoch 2/4
[2022-08-07 21:19:57] Completed epoch 3/4
[2022-08-07 21:20:58] Completed epoch 4/4
[2022-08-07 21:21:37] Uploaded model: davinci:ft-ccb-lab-members-2022-08-07-21-21-36
[2022-08-07 21:21:38] Uploaded result file: file-uXXU96NhfoBkZuONULh7Q3u1
[2022-08-07 21:21:38] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tun

You should copy down the fine-tune ID and the model name, which look like this:

```
Created fine-tune: ft-kloUh0jjVc6Jv8p9MfeGHd3s

[2022-08-06 00:43:56] Uploaded model: davinci:ft-ccb-lab-members-2022-08-06-00-57-57
```

If you forget to write them down, you can list your fine-tuning runs and models this way. These model names aren't mnemonic, so it is probably a good idea to make a note of what your model's inputs and outputs are. 

In [None]:
!openai api fine_tunes.list

You can run your fine-tuned model in the OpenAI Playground.  After the model has finished fine-tuning, you'll find it in the Engine dropdown menu (you might need to press reload in your browser for your fine-tuned model to appear).

## Call your fine-tuned model from the OpenAI API

Alternatively, you can use your fine-tuned model via the API by specifying it as the model.  Here's an example:

In [None]:
def generate_bio(attributes, finetuned_model):
  response = openai.Completion.create(
      model=finetuned_model,
      prompt="{attributes}\n---\n".format(attributes=attributes),
      temperature=0.7,
      max_tokens=500,
      top_p=1,
      frequency_penalty=0,
      presence_penalty=0,
      stop=["###"]
      )
  return response['choices'][0]['text'].strip()

# Replace with your model's name
finetuned_model = "davinci:ft-ccb-lab-members-2022-08-07-21-21-36"

In [None]:
attributes = """
notable_type: computer scientist
alma_mater: Stanford University (BS in Symbolic Systems), University of Edinburgh (PhD in Informatics)
birth_place: California
children: 2
gender: male
main_interests: Artificial Intelligence, Natural Language Processing 
name: Chris Callison-Burch
nationality: American
notable_works: Moses: Open source toolkit for statistical machine translation, The Paraphrase Database (PPDB)
occupation: professor
courses_taught: AI, Crowdsourcing and NLP 
enrollment_in_most_popular_course: 570 students
institution: University of Pennsylvania
"""

biography = generate_bio(attributes, finetuned_model)
print(attributes)
print('---')
biography

## Analyze your model's output

Sometimes the model will add facts that are not present in the attributes.  For instance, one time it said 
> He was a member of the research staff at IBM Research in Yorktown Heights.

which is not correct. Another time it said
> His most popular course was on AI, which had 570 students.

which is correct, but not specified in the attributes.

Try running your own fine-tuned model until it produces something that wasn't licensed by the attributes. 

Save the good runs and a bad run below.

In [None]:
generations_with_correct_facts = [
   """ TODO 1 """,
   """ TODO 2 """,
                       ]

generation_with_incorrect_facts_= """
"""

incorrect_facts = [
    """ TODO incorrect sentence 1 """,
]
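Spotting unsupported facts by hand gets tedious, so a rough automatic filter can help you decide which generations to read closely.  The sketch below is a crude heuristic under stated assumptions: `unsupported_sentences` is a hypothetical helper that flags sentences containing capitalized words or numbers that never appear in the attributes.  It will miss paraphrased facts, it will raise false alarms, and abbreviations such as "U.S." will confuse its sentence splitting.  Treat it as a starting point and not as an evaluation metric.

```python
import re

def unsupported_sentences(attributes, biography):
    """Return (sentence, missing_words) pairs for sentences that mention a
    capitalized word or number that is absent from the attributes."""
    attribute_words = set(re.findall(r"\w+", attributes.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", biography.strip()):
        words = re.findall(r"\w+", sentence)
        # Skip the first word, which is capitalized just for starting the sentence.
        candidates = [w for w in words[1:] if w[0].isupper() or w[0].isdigit()]
        missing = [w for w in candidates if w.lower() not in attribute_words]
        if missing:
            flagged.append((sentence, missing))
    return flagged

demo_attributes = (
    "name: Chris Callison-Burch\n"
    "institution: University of Pennsylvania\n"
    "occupation: professor"
)
demo_biography = (
    "Chris Callison-Burch is a professor at the University of Pennsylvania. "
    "He was a member of the research staff at IBM Research in Yorktown Heights."
)
for sentence, missing in unsupported_sentences(demo_attributes, demo_biography):
    print(missing, "<-", sentence)
```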

# Fine Tune a New Model

Now that you've seen an example of how to do fine-tuning with the OpenAI API, let's have you write code to fine-tune your own model.

For this model, I'd like you to do the reverse direction of what we just did.  Given a Wikipedia biography like this:

> Jill Tracy Jacobs Biden (born June 3, 1951) is an American educator and the current first lady of the United States as the wife of President Joe Biden. She was the second lady of the United States from 2009 to 2017. Since 2009, Biden has been a professor of English at Northern Virginia Community College. 

> She has a bachelor's degree in English and a doctoral degree in education from the University of Delaware, as well as master's degrees in education and English from West Chester University and Villanova University. She taught English and reading in high schools for thirteen years and instructed adolescents with emotional disabilities at a psychiatric hospital. From 1993 to 2008, Biden was an English and writing instructor at Delaware Technical & Community College. Biden is thought to be the first wife of a vice president or president to hold a paying job during her husband's tenure. 

> Born in Hammonton, New Jersey, she grew up in Willow Grove, Pennsylvania. She married Joe Biden in 1977, becoming stepmother to Beau and Hunter, his two sons from his first marriage. Biden and her husband also have a daughter together, Ashley Biden, born in 1981. She is the founder of the Biden Breast Health Initiative non-profit organization, co-founder of the Book Buddies program, co-founder of the Biden Foundation, is active in Delaware Boots on the Ground, and with Michelle Obama is co-founder of Joining Forces. She has published a memoir and two children's books.

Your model should output something like this:
```
notable_type: First Lady of the United States
name: Jill Biden
gender: female
nationality: American
birth_date: 03 June 1951
birth_place: Hammonton, New Jersey
alma_mater: University of Delaware
occupation: professor of English at Northern Virginia Community College
notable_works: children's books and memoir
main_interests: education, literacy, women's health
partner: Joe Biden
children: Ashley Biden, Beau Biden (stepson), Hunter Biden (stepson)
```
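
As a starting point, here is a minimal sketch of how one training example might be serialized for this direction, assuming the prompt/completion JSONL format used by the `openai api fine_tunes.create` command. The helper names `attributes_to_text` and `make_finetuning_record`, the separator, and the stop string are illustrative assumptions, not values specified by this assignment.

```python
import json

def attributes_to_text(attributes):
    """Serialize an attribute dictionary as one `name: value` pair per line."""
    return "\n".join(f"{name}: {value}" for name, value in attributes.items())

def make_finetuning_record(biography, attributes,
                           separator="\n\n###\n\n", stop=" END"):
    # `separator` and `stop` are illustrative choices (assumptions); whatever
    # you pick must be used again, unchanged, when you query the model.
    return {
        "prompt": biography.strip() + separator,
        "completion": " " + attributes_to_text(attributes) + stop,
    }

record = make_finetuning_record(
    "Jill Tracy Jacobs Biden (born June 3, 1951) is an American educator.",
    {"name": "Jill Biden", "nationality": "American"},
)
line = json.dumps(record)  # a .jsonl file holds one such JSON object per line
print(line)
```

Using the same separator at training time and at inference time tells the model where the biography ends, and the stop string gives you something to pass as a stop sequence so the completion does not run on.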


In [None]:
import json

def create_wikibio_parser_finetuning_data(wikibios, fine_tuning_filename):
  # TODO - write your fine-tuning function
  pass

fine_tuning_filename='wikibio_parser_finetuning_data.jsonl'
create_wikibio_parser_finetuning_data(wiki_bios, fine_tuning_filename)

In [None]:
!openai api fine_tunes.create -t '{fine_tuning_filename}' -m curie
#!openai api fine_tunes.create -t '{fine_tuning_filename}' -m davinci

Logging requires wandb to be installed. Run `pip install wandb`.
Upload progress: 100% 132k/132k [00:00<00:00, 182Mit/s]
Uploaded file from wikibio_parser_finetuning_data.jsonl: file-5X8WBR8juU7TnGU9QbboFdIg
Created fine-tune: ft-kcbvvf5LMMwyoX3gT2GAhX5Z
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2022-08-07 22:44:10] Created fine-tune: ft-kcbvvf5LMMwyoX3gT2GAhX5Z
[2022-08-07 22:44:16] Fine-tune costs $4.07
[2022-08-07 22:44:16] Fine-tune enqueued. Queue number: 0
[2022-08-07 22:44:18] Fine-tune started
[2022-08-07 22:48:01] Completed epoch 1/4
[2022-08-07 22:49:04] Completed epoch 2/4
[2022-08-07 22:50:08] Completed epoch 3/4
[2022-08-07 22:51:09] Completed epoch 4/4
[2022-08-07 22:51:46] Uploaded model: davinci:ft-ccb-lab-members-2022-08-07-22-51-46
[2022-08-07 22:51:48] Uploaded result file: file-YX4jAegUgLW7n8k0nRDwQ6ee
[2022-08-07 22:51:48] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your f

In [None]:
def parse_bio(biography, finetuned_bio_parser_model):
  # TODO call the API with your fine-tuned model, return a string representing the attributes
  pass

  
finetuned_bio_parser_model="TODO"

## Test your parser

Next we will test your parser.  This will call your `parse_bio` function once per test item (10 by default, and 237 for the full test set), so be sure that you've got it properly debugged and working before running this code. 

In [None]:
!wget https://raw.githubusercontent.com/artificial-intelligence-class/artificial-intelligence-class.github.io/master/homeworks/large-LMs/SynthBio_test.json

In [6]:
import json

def load_wiki_bio_test_set(filename='SynthBio_test.json', max_test_items=10, randomized=True):
  """ 
  Loads our wikibio test set, and returns a list of tuples:
  (biography (text), attributes (dictionary))
  """
  with open(filename) as f:
    synth_bio_data = json.load(f)
  bios = []
  for data in synth_bio_data:
    notable_type = data['notable_type']
    attributes = data['attrs']
    attributes['notable_type'] = notable_type
    biography = data['biographies'][0]
    bios.append((biography, attributes))
  return bios[:min(max_test_items, len(bios))]


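def convert_to_dict(predicted_attributes_txt):
  """
  Converts predicted attributes from text format into a dictionary.
  """
  predicted_attributes = {}
  for line in predicted_attributes_txt.split('\n'):
    # skip blank or malformed lines that have no "attribute: value" pair
    if ':' not in line:
      continue
    # split on the first colon only, since values may contain colons
    attribute, value = line.split(':', 1)
    predicted_attributes[attribute.strip()] = value.strip()
  return predicted_attributes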



Helper function for computing precision, recall and f-score.

In [3]:
from collections import Counter

def update_counts(gold_attributes, predicted_attributes, true_positives, false_positives, false_negatives, all_attributes):
  # Compute true positives and false negatives
  for attribute in gold_attributes:
    all_attributes[attribute] += 1
    if attribute in predicted_attributes:
      # some attributes have multiple values.
      gold_values = gold_attributes[attribute].split(',')
      for value in gold_values:
        if value.strip() in predicted_attributes[attribute]:
          true_positives[attribute] += 1
        else:
          false_negatives[attribute] += 1
    else:
      false_negatives[attribute] += 1
  # Compute false positives 
  for attribute in predicted_attributes:
    if attribute not in gold_attributes:
      # the attribute was predicted but is absent from the gold standard
      all_attributes[attribute] += 1
      false_positives[attribute] += 1
    else:
      # some attributes have multiple values.
      predicted_values = predicted_attributes[attribute].split(',')
      for value in predicted_values:
        if value.strip() not in gold_attributes[attribute]:
          false_positives[attribute] += 1
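
To make the three metrics concrete, here is a small worked example using the standard definitions: precision is TP / (TP + FP), recall is TP / (TP + FN), and the F-score as usually defined (F1) is the harmonic mean of the two. The function name `precision_recall_f1` and the counts are made up for illustration.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and F1 from raw counts, returning 0.0 when undefined."""
    predicted = true_positives + false_positives   # everything the model output
    actual = true_positives + false_negatives      # everything in the gold standard
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for one attribute: 6 correct values, 2 spurious, 4 missed.
precision, recall, f1 = precision_recall_f1(6, 2, 4)
print(precision, recall, f1)
```

With these counts precision is 6/8 and recall is 6/10. The harmonic mean is pulled toward the smaller of the two, so F1 is slightly lower than their simple average.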



In [4]:

def evaluate_on_test_set(finetuned_bio_parser_model, wiki_bio_test, threshold_count = 5):
  """
  Compute the precision, recall and f-score for each of the attributes
  that appears at least the threshold count number of times
  """
  true_positives = Counter()
  false_positives = Counter()
  false_negatives = Counter()
  all_attributes = Counter() 

  for bio, gold_attributes in wiki_bio_test:
    predicted_attributes = convert_to_dict(parse_bio(bio, finetuned_bio_parser_model))
    update_counts(gold_attributes, predicted_attributes, true_positives, false_positives, false_negatives, all_attributes)  

  average_precision = 0
  average_recall = 0
  total = 0

  for attribute in all_attributes:
    if all_attributes[attribute] < threshold_count:
      continue
    print(attribute.upper())
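    try:
      precision = true_positives[attribute] / (true_positives[attribute] + false_positives[attribute])
    except ZeroDivisionError:
      precision = 0.0
    try:
      recall = true_positives[attribute] / (true_positives[attribute] + false_negatives[attribute])
    except ZeroDivisionError:
      recall = 0.0
    # F-score is the harmonic mean of precision and recall (0 when both are 0)
    f_score = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    print("precision:", precision)
    print("recall:", recall)
    print("f-score:", f_score)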
    print('---')
    average_precision += precision
    average_recall += recall
    total += 1

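  print("AVERAGE")
  # guard against the case where no attribute reached the threshold count
  average_precision = average_precision/total if total > 0 else 0.0
  average_recall = average_recall/total if total > 0 else 0.0
  # F-score is the harmonic mean of the averaged precision and recall
  if average_precision + average_recall > 0:
    average_fscore = 2 * average_precision * average_recall / (average_precision + average_recall)
  else:
    average_fscore = 0.0
  print("precision:", average_precision)
  print("recall:", average_recall)
  print("f-score:", average_fscore)
  print('---')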


If you would like to evaluate on the full test set, there are 237 test items.  You can set `max_test_items=237`.  Doing so will call your `parse_bio` function about 237 times, so be sure that you've got it properly debugged and working before running this code. 

In [7]:
testset_filename='SynthBio_test.json'
max_test_items=10
wiki_bio_test = load_wiki_bio_test_set(testset_filename, max_test_items)
evaluate_on_test_set(finetuned_bio_parser_model, wiki_bio_test, threshold_count = 5)

How well did your model perform?

In [None]:
# TODO - fill in these values
average_precision = 0.0
average_recall = 0.0
average_fscore = 0.0

# What attributes had the highest F-score?
best_attributes = {
    "attribute_name" : 0.0,
}

# What attributes had the lowest F-score?
worst_attributes = {
    "attribute_name" : 0.0,
}

# What could you do to improve the model's performance?
potential_improvements = """
TODO
"""

# Feedback questions

In [None]:
# How many hours did you spend on this assignment? Just an approximation is fine.
num_hours_spent = 0

# What did you think?  This was the first time we tried this assignment 
# so your feedback is valuable.
feedback = """
Type your response here.
Your response may span multiple lines.
Do not include these instructions in your response.
"""

