# Week 10: Language Models

## Setup

In [1]:
#setup
import warnings; warnings.simplefilter('ignore')
import pandas as pd
import numpy as np

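# load the cleaned cases from a gzip-compressed pickle
df = pd.read_pickle('sc_cases_cleaned.pkl', compression='gzip')
# encode each opinion author as an integer code
df = df.assign(author_id=df['authorship'].astype('category').cat.codes)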
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 768 entries, 0 to 819
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   case_name       768 non-null    object        
 1   opinion_type    768 non-null    object        
 2   date_standard   768 non-null    datetime64[ns]
 3   authorship      768 non-null    object        
 4   x_republican    768 non-null    float64       
 5   maj_judges      768 non-null    object        
 6   dissent_judges  768 non-null    object        
 7   topic_id        768 non-null    float64       
 8   cite_count      768 non-null    float64       
 9   opinion_text    768 non-null    object        
 10  year            768 non-null    int64         
 11  log_cite_count  768 non-null    float64       
 12  author_id       768 non-null    int8          
dtypes: datetime64[ns](1), float64(4), int64(1), int8(1), object(6)
memory usage: 78.8+ KB


## GPT-2 and Language Generation


In [1]:
# load the pretrained GPT-2 tokenizer and model; the EOS token doubles as the pad token

from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )


In [3]:
input_ids = tokenizer.encode('I enjoy generating', return_tensors='pt')

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))


I enjoy generating and sharing my own ideas and ideas for the future. I'm always looking for new ways to make my life better. I'm always looking for ways to make my life better.

I'm always looking for ways to make my
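
The repeated sentence in the greedy output above is a common side effect of always taking the single most likely next token. The sketch below reproduces that behaviour on a toy next-token table. The table `NEXT_TOKEN_PROBS`, its probabilities, and the helper `greedy_decode` are all hypothetical, and the toy model looks only at the previous token, whereas GPT-2 conditions on the whole context.

```python
# Hypothetical next-token probabilities (illustrative numbers, not taken from GPT-2).
NEXT_TOKEN_PROBS = {
    "I": {"am": 0.6, "like": 0.4},
    "am": {"always": 0.7, "here": 0.3},
    "always": {"looking": 0.9, "here": 0.1},
    "looking": {"for": 1.0},
    "for": {"ways": 0.8, "I": 0.2},
    "ways": {"I": 0.55, "for": 0.45},
}


def greedy_decode(start, max_length):
    """Repeatedly append the most probable next token until max_length tokens."""
    tokens = [start]
    while len(tokens) < max_length:
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not candidates:  # no known continuation: stop early
            break
        tokens.append(max(candidates, key=candidates.get))
    return tokens


print(" ".join(greedy_decode("I", 12)))
```

Because the argmax choice is deterministic, the toy decoder falls into a cycle as soon as it revisits a token, which is the same failure mode the sampling cells later in the notebook try to avoid.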


In [4]:
# activate beam search and early_stopping

beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, # to avoid repetitions of the same word sequences
    early_stopping=True
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

I enjoy generating my own content, so I'm always looking for ways to improve it.

If you have any questions or comments, feel free to leave them in the comments below.
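
To make the idea behind `num_beams` concrete, here is a minimal beam search sketch over a toy model. The table `PROBS` and the function `beam_search` are hypothetical illustrations, not the `transformers` implementation, and n-gram blocking and early stopping are left out for brevity.

```python
import math

# Hypothetical toy model: next-token probabilities depend only on the previous token.
PROBS = {
    "<s>": {"the": 0.5, "a": 0.4, "an": 0.1},
    "the": {"dog": 0.4, "cat": 0.3, "end": 0.3},
    "a": {"dog": 0.9, "cat": 0.1},
    "an": {"owl": 1.0},
}


def beam_search(start, steps, num_beams):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, p in PROBS.get(tokens[-1], {}).items():
                candidates.append((tokens + [token], score + math.log(p)))
        if not candidates:  # no beam can be extended any further
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


greedy_like = beam_search("<s>", steps=2, num_beams=1)[0]
wider = beam_search("<s>", steps=2, num_beams=2)[0]
print(greedy_like[0], math.exp(greedy_like[1]))
print(wider[0], math.exp(wider[1]))
```

With one beam the search behaves like greedy decoding and commits to "the" first. With two beams it keeps "a" alive long enough to find the more probable two-token continuation.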


In [5]:
# activate sampling; top_k=0 disables top-k filtering, so tokens are drawn from the full distribution

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0
)

print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


I enjoy generating both cheats and side-splice hunts for items (e.g., armor) in dragons encounter lists! It's great to: 1) decide where to keep the item, 2) influence tree-based walkthrough options (
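
The incoherent output above follows from how unrestricted sampling works: every token, however unlikely, keeps a non-zero chance of being drawn. A minimal sketch, assuming a four-word vocabulary with made-up logits (the names `softmax`, `sample_token`, `vocab`, and `logits` are hypothetical):

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(vocab, logits, rng):
    """Draw one token in proportion to its softmax probability."""
    return rng.choices(vocab, weights=softmax(logits), k=1)[0]


vocab = ["content", "ideas", "images", "cheats"]
logits = [2.0, 1.5, 1.0, -1.0]  # hypothetical scores, not GPT-2 outputs

rng = random.Random(0)  # seeded so that the run is reproducible
draws = [sample_token(vocab, logits, rng) for _ in range(1000)]
counts = {token: draws.count(token) for token in vocab}
print(counts)
```

Over many draws the low-scoring token still appears occasionally, and in a long generation a few such picks are enough to derail the text.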


In [10]:
# top-p (nucleus) sampling: sample only from the smallest set of tokens whose cumulative probability reaches 0.92

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

# top-p (nucleus) sampling is arguably the best general-purpose decoding strategy

I enjoy generating animation. You get a glimpse into the mind of the animated character, but also this realm of imagination. Don't have a chance of unlocking animation that will pique your interest while you're young. Focus on the thought processes. Keep
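
The `top_p` filter can be sketched in a few lines. The helper `top_p_filter` and the example distribution below are hypothetical. The library applies the same idea to logits over the full vocabulary, so treat this as an illustration rather than its implementation.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize the survivors."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Hypothetical next-token distribution
probs = {"content": 0.5, "ideas": 0.3, "images": 0.15, "cheats": 0.05}
print(top_p_filter(probs, top_p=0.92))
```

Here the three most likely tokens already cover 95% of the probability mass, so the unlikely fourth token is cut before sampling. The size of the kept set changes from step to step, depending on how peaked the distribution is.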


In [24]:
# typical sampling: https://arxiv.org/pdf/2202.00666.pdf

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    typical_p=0.1,  # generate() names this parameter typical_p
    top_k=0
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

I enjoy generating and maintaining fair and just content both physically and digitally, but providing no vision and effort in an effort to control/loss our views," PCUSA VP of Media Relations Jason Varner told the site. "Even where iOS also makes websites
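
Typical sampling ranks tokens by how close their surprisal, `-log p`, is to the entropy of the whole distribution, then keeps the closest ones until a target probability mass is covered. A minimal sketch of that filter follows. The helper `typical_filter` and the example distribution are hypothetical, and the sketch follows the idea in the linked paper rather than the library code.

```python
import math


def typical_filter(probs, mass):
    """Keep the tokens whose surprisal is closest to the entropy until
    their cumulative probability reaches `mass`, then renormalize."""
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    ranked = sorted(
        (token for token, p in probs.items() if p > 0),
        key=lambda token: abs(-math.log(probs[token]) - entropy),
    )
    kept, cumulative = {}, 0.0
    for token in ranked:
        kept[token] = probs[token]
        cumulative += probs[token]
        if cumulative >= mass:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Hypothetical next-token distribution
probs = {"content": 0.5, "ideas": 0.3, "images": 0.15, "cheats": 0.05}
print(typical_filter(probs, mass=0.1))
print(typical_filter(probs, mass=0.9))
```

With a small mass such as 0.1, only the token nearest the entropy survives, and it need not be the most probable one. That is the main difference from top-p filtering, which always keeps the most likely token.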


**GPT-Neo**

In [12]:
from transformers import GPT2Tokenizer, GPTNeoForCausalLM

model_name = "EleutherAI/gpt-neo-125M"

tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPTNeoForCausalLM.from_pretrained(model_name, pad_token_id=tokenizer.eos_token_id)

input_ids = tokenizer.encode('I enjoy generating', return_tensors='pt')

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


I enjoy generating images to use in desktop environment; I find myself frustrated every time I try to change file paths, but these are simply too quick and efficient to a certain extent.

First of all, I would like to ask some questions about


**Conditional Text Generation**

In [13]:
input_ids = tokenizer.encode('Donald Trump:', return_tensors="pt")

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))



Donald Trump: Jobs are everywhere, every American has some role in creating jobs. But if you haven't been involved in raising funds for progressive goals and economic reform, you’re not a human being at all, you’re a worker


In [14]:
input_ids = tokenizer.encode('Joe Biden:', return_tensors="pt")

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))



Joe Biden: Standing in the White House. Photo: Margareta Porter/Getty Images

And there is one thing that Barack Obama, the candidate of good intentions, and Christopher Dodd, the Federal Reserve Board chairman, are unappealing to


In [15]:
input_ids = tokenizer.encode('Justice Ruth Bader Ginsburg:', return_tensors="pt")

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


Justice Ruth Bader Ginsburg: Goldman Sachs has lost its own national cooperation with Britain’s corporate executive, Uli Bond, on reports that Goldman Sachs had been engaged in a co-optation of the Rhoads, a video game
