In [1]:
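# standard library
import json
import os
import random
import unicodedata
from collections import Counter, defaultdict

# third-party
import jsonlines
import openai
import spacy
from dotenv import load_dotenv

load_dotenv()  # load environment variables such as OPENAI_API_KEY from a .env file
nlp = spacy.load('en_core_web_sm')  # small English pipeline, used for sentence splitting
openai.api_key = os.environ['OPENAI_API_KEY']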

In [2]:
%load_ext blackcellmagic

In [3]:
with open('./data/jokes.json') as f:
    data = json.load(f)

In [4]:
EXCLUDE_IDS = {88572, 99457, 99483}  # IDs of duplicate entries

jokes = [d['body'] for d in data if d['id'] not in EXCLUDE_IDS]

print('\n\n'.join(jokes[:5]))

Yes, I know what you guys are thinking, "Hey, it's the guy from Twitter."

I'm glad to be on cable. The truth is, I've dreamed of being a talk show host on basic cable ever since I was 46.

I'm happy to report that we're already #1 in TBS's key demographic—people who can't afford HBO.

It's not easy doing a late-night show on a channel without a lot of money and that viewers have trouble finding. So that's why I left NBC.

That's right—the whitest man in show business is back…on the second blackest channel on TV.


In [5]:
counts = defaultdict(int)
jokes_cleaned = []

for joke in jokes:
    # skip jokes that mention Conan by name
    if "Conan" in joke:
        continue

    joke_clean = (
        unicodedata.normalize("NFKD", joke)
        .replace("  ", " ")
        .replace("—", "--")
        .strip()
    )

    # split the cleaned joke into sentences with spaCy
    doc = nlp(joke_clean)
    sentences = [sent.text for sent in doc.sents]
    sentence_ct = len(sentences)

    entry = {"text": joke_clean, "sentences": sentences, "sentence_ct": sentence_ct}
    jokes_cleaned.append(entry)
    counts[sentence_ct] += 1

print(counts)

defaultdict(<class 'int'>, {1: 844, 2: 8783, 3: 347, 4: 20, 7: 1, 5: 6})
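
The cleaning step in the loop above relies on NFKD normalization plus two string replacements. The following is a minimal, standalone sketch of that step, written to show what NFKD does and does not change. `clean_text` is a hypothetical helper name, not part of the notebook, and the regex-based space collapsing is a swapped-in variant of the single `.replace("  ", " ")` pass (which would leave two spaces behind when given four).

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Hypothetical helper: NFKD-normalize, swap em dashes, collapse space runs."""
    normalized = unicodedata.normalize("NFKD", text)
    # NFKD leaves the em dash (U+2014) untouched, so it is replaced explicitly.
    normalized = normalized.replace("\u2014", "--")
    # Collapse any run of two or more spaces into one space.
    return re.sub(r" {2,}", " ", normalized).strip()


# The ellipsis character (U+2026) has a compatibility decomposition.
print(clean_text("Wait\u2026 what?"))
# The curly apostrophe (U+2019) has none, so it is kept as-is.
print(clean_text("I\u2019m back\u2014again"))
print(clean_text("too    many spaces "))
```

If straight apostrophes are wanted in the training data, an extra `.replace("\u2019", "'")` would be needed, since normalization alone does not convert them.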


In [6]:
# sample five jokes with a given number of sentences
sentence_ct = 1
filtered = [j["text"] for j in jokes_cleaned if j["sentence_ct"] == sentence_ct]

print("\n\n".join(random.sample(filtered, 5)))

My favorite new song is either that one from the Samsung commercial, or the one from the Budweiser commercial.

I don’t take my career for granted, which is why every April I renew my commercial crab fisherman’s license.

I’ve won Fantasy Football by never ever playing it.

In line to visit Santa Claus, and the kid ahead of us just told Santa he only wants one thing for Christmas: "the goddamn truth."

Turns out the proper response to "How’s it hanging?" is NOT "via a complex network of pulleys and trusses."


In [8]:
i, MAX_ENTRIES = 0, 9000
fname = "./data/training.jsonl"

# format and output training data
with jsonlines.open(fname, mode="w") as f:
    for joke in jokes_cleaned:
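        # skip single-sentence jokes: nothing would be left for the completion
        if joke["sentence_ct"] < 2:
            continue

        # stop once MAX_ENTRIES records have been written
        i += 1
        if i > MAX_ENTRIES:
            break

        # prompt: first sentence plus the "\n\n###\n\n" separator
        # completion: leading space, remaining sentences, then the " END" stop marker
        prompt = "{}\n\n###\n\n".format(joke["sentences"][0])
        completion = " {} END".format(" ".join(joke["sentences"][1:]))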

        f.write({"prompt": prompt, "completion": completion})
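
To make the record format concrete, here is a small sketch of how one joke maps to a prompt/completion pair. `build_record`, `SEPARATOR`, and `STOP` are hypothetical names introduced for illustration; the separator and stop strings are the ones used in the cell above, and the sentences are placeholders rather than real data.

```python
SEPARATOR = "\n\n###\n\n"  # prompt suffix, as in the cell above
STOP = " END"              # completion suffix, as in the cell above


def build_record(sentences):
    """Hypothetical helper: first sentence is the prompt, the rest the completion."""
    if len(sentences) < 2:
        raise ValueError("need at least two sentences")
    prompt = sentences[0] + SEPARATOR
    # The completion starts with a space and ends with the stop marker.
    completion = " " + " ".join(sentences[1:]) + STOP
    return {"prompt": prompt, "completion": completion}


record = build_record(["Setup sentence.", "Punchline one.", "Punchline two."])
print(record)
```

Keeping the separator and stop marker in named constants is one way to make sure the same strings are reused at inference time.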

In [9]:
!openai tools fine_tunes.prepare_data -f "./data/training.jsonl"

Analyzing...

- Your file contains 9000 prompt-completion pairs
- All prompts end with suffix `\n\n###\n\n`
- All completions end with suffix ` END`

No remediations found.

You can use your file for fine-tuning:
> openai api fine_tunes.create -t "./data/training.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `\n\n###\n\n` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=[" END"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 2.1 hours to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.
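
The tool output above reports that every prompt ends with the separator and every completion ends with ` END`. A similar check can be sketched locally with the standard library alone. This is only an illustration and not a replacement for `prepare_data`; `find_bad_records` is a hypothetical helper, and the leading-space check on completions is an assumption based on how the training file was written in this notebook.

```python
import json

SEPARATOR = "\n\n###\n\n"
STOP = " END"


def find_bad_records(lines):
    """Return indices of JSONL lines whose prompt/completion affixes are off."""
    bad = []
    for idx, line in enumerate(lines):
        record = json.loads(line)
        ok = (
            record["prompt"].endswith(SEPARATOR)
            and record["completion"].startswith(" ")
            and record["completion"].endswith(STOP)
        )
        if not ok:
            bad.append(idx)
    return bad


# Two placeholder records: the second is missing the prompt separator.
sample = [
    json.dumps({"prompt": "Setup." + SEPARATOR, "completion": " Punchline." + STOP}),
    json.dumps({"prompt": "Setup.", "completion": " Punchline." + STOP}),
]
print(find_bad_records(sample))
```

To run it against the real file, an open file handle can be passed directly, for example `find_bad_records(open("./data/training.jsonl"))`, since iterating a file yields one JSON line at a time.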


In [None]:
# kick off fine-tuning
!openai api fine_tunes.create \
  -t "./data/training.jsonl" \
  -m curie \
  --suffix "conan" \
  --n_epochs 2 \
  --learning_rate_multiplier 0.1

In [None]:
!openai api fine_tunes.list

In [None]:
!openai api fine_tunes.get -i TODO_ID

In [28]:
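# pick a multi-sentence joke so the prompt matches the training format
# (note: the pick may be a joke the model saw during fine-tuning)
test_entry = random.choice([j for j in jokes_cleaned if j["sentence_ct"] >= 2])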
print("ORIGINAL JOKE:\n{}\n".format(test_entry["text"]))

prompt = "{}\n\n###\n\n".format(test_entry["sentences"][0])
print("PROMPT:\n{}".format(prompt))

results = openai.Completion.create(
    model=os.environ["OPENAI_MODEL"],
    prompt=prompt,
    max_tokens=100,
    temperature=0.5,
    n=3,
    stop=[" END"],
)
print("MODEL RESPONSE:\n")
print(
    "\n\n".join(
        [
            "{}: {}".format(i + 1, r["text"])
            for i, r in enumerate(results["choices"])
        ]
    )
)

ORIGINAL JOKE:
During a TV special magician David Blaine performed magic for Kanye West. Blaine performed an amazing trick where he got Kanye to not talk about Kanye for eight seconds.

PROMPT:
During a TV special magician David Blaine performed magic for Kanye West.

###


MODEL RESPONSE:

1:  Afterwards, Kanye said, "I'm not sure what just happened, but I'm pretty sure I'm a genius."

2:  Apparently, Blaine made Kanye believe that he was a man who can stay dry for more than an hour.

3:  Afterwards, Kanye said, "I'm not impressed with your magic and I'm not impressed with your face."
