# Assignment 9 - GPT

In this assignment, you will use various transformer models for semantic search and for language generation. We will be using the `transformers` Python package from Hugging Face. **Note** that this package automatically downloads language models as required the first time the code is run, and they can be quite large. (The entire assignment might download a few GB.) You might want to do this on campus, depending on your internet situation.

This assignment is to be done individually. You may discuss the project with your classmates, but the work you turn in should be your own.


# Using Generative Language Models

## Goal

To learn about how generative language models can be used in practice, focusing on GPT-2, which is feasible to run locally without a graphics card.

## Setup

This part uses the `transformers` package, which can be installed with conda or pip.

## Questions (100 pts)

1. Write a script that generates a "story" using a local GPT-2 model. Your story should: 1) be at least 100 words long; 2) not have repeated phrases; and 3) be the same every time your script is run. It might be nonsensical and/or hilarious. Use the skeleton code provided below as a starting point, and <https://huggingface.co/blog/how-to-generate> as a reference document.
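
The three requirements above can be checked mechanically before you submit. The following is a minimal sketch, not part of the provided skeleton: the helper names `word_count`, `repeated_ngrams`, and `is_reproducible` are hypothetical, and it assumes that a "word" is a whitespace-separated token and a "repeated phrase" is a run of `n` consecutive words that occurs more than once.

```python
import re
from collections import Counter


def word_count(text):
    """Count whitespace-separated words (one simple definition of a 'word')."""
    return len(text.split())


def repeated_ngrams(text, n=3):
    """Return the word n-grams that occur more than once, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    return sorted(" ".join(gram) for gram, count in counts.items() if count > 1)


def is_reproducible(generate_fn, runs=3):
    """Call a zero-argument generator several times and check that all outputs match."""
    outputs = {generate_fn() for _ in range(runs)}
    return len(outputs) == 1


# small hand-written example; replace it with your generated story
sample = "the cat sat on the mat and the cat sat on the rug"
print(word_count(sample))
print(repeated_ngrams(sample, n=3))
```

To apply this to your own script, pass the generated story to `word_count` and `repeated_ngrams`, and pass a zero-argument wrapper around your generation function to `is_reproducible`. Note that the generation arguments `max_length` and `min_length` count tokens rather than words, so a word count is the more direct check for requirement 1.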

## Part 2 Deliverables

Submit your notebook as an attachment on OWL as well as a PDF version of the notebook.

---

# Checklist

Your OWL submission should include the following attachments and no additional files:
```
Assignment9.ipynb
Assignment9.pdf
```

In [4]:
%pip install -q transformers
%pip install -q ipywidgets

Note: you may need to restart the kernel to use updated packages.


In [5]:
import torch
from IPython.display import display, Markdown
from transformers import AutoTokenizer, GPT2LMHeadModel, set_seed

# run on the GPU when one is available, otherwise on the CPU
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# use the EOS token as the PAD token to avoid warnings, and move the model
# to the same device as the input tensors
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id).to(torch_device)

In [7]:
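# function to generate a story
# note: max_length and min_length are measured in tokens (prompt included), not words
def generate_story(seed=42, max_length=120, min_length=100):
    # fix the random seed so that sampling returns the same story on every run
    set_seed(seed)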
    
    # prompt
    prompt = "Once upon a time in a distant land, there was a mysterious forest"
    
    # tokenize input prompt
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(torch_device)
    
    # generate text
    outputs = model.generate(
        input_ids,
        max_length=max_length,
        min_length=min_length,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        repetition_penalty=2.0
    )
    
    # decode and return generated text
    story = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return story

def show_decoded_tokens(dt):
    display(Markdown(dt))

print("My GPT-2 Story:")
print("---------------")

## Generate the story and render it as Markdown, which wraps the text
## to make it easier to read
show_decoded_tokens(generate_story())

My GPT-2 Story:
---------------


Once upon a time in a distant land, there was a mysterious forest with strange trees and bushes. It has the look of an alien species like those found on Earth but it's actually not as unusual for that sort-of plant to grow here or anywhere else." – Dr Serenity
The creature is called Joggerbundi by its neighbors who believe this mythical beast originated from some unknown ancient planet out across space! When asked why we know these things exist they all told him because he believes them now? That being so… maybe after awhile their belief faded away once more due towards