# Week 10: Language Modeling

## Setup

In [1]:
# setup
import warnings
warnings.simplefilter('ignore')
import pandas as pd
import numpy as np

# load the cleaned cases data and encode each opinion author as an integer id
df = pd.read_pickle('sc_cases_cleaned.pkl', compression='gzip')
df = df.assign(author_id=df['authorship'].astype('category').cat.codes)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 768 entries, 0 to 819
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   case_name       768 non-null    object        
 1   opinion_type    768 non-null    object        
 2   date_standard   768 non-null    datetime64[ns]
 3   authorship      768 non-null    object        
 4   x_republican    768 non-null    float64       
 5   maj_judges      768 non-null    object        
 6   dissent_judges  768 non-null    object        
 7   topic_id        768 non-null    float64       
 8   cite_count      768 non-null    float64       
 9   opinion_text    768 non-null    object        
 10  year            768 non-null    int64         
 11  log_cite_count  768 non-null    float64       
 12  author_id       768 non-null    int8          
dtypes: datetime64[ns](1), float64(4), int64(1), int8(1), object(6)
memory usage: 78.8+ KB


## GPT-2 and Language Generation


In [2]:
# load GPT2

from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

In [3]:
input_ids = tokenizer.encode('I enjoy generating', return_tensors='pt')

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))


I enjoy generating and sharing my own ideas and ideas for the future. I'm always looking for new ways to make my life better. I'm always looking for ways to make my life better.

I'm always looking for ways to make my
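
The repetition in the greedy output above is typical: always taking the single most likely next token can trap the model in a loop. The sketch below reproduces the effect on a toy next-token table. The `TOY_MODEL` probabilities and the `greedy_decode` helper are hypothetical illustrations, not GPT-2 internals.

```python
# Toy next-token table (made-up probabilities, for illustration only).
TOY_MODEL = {
    "I": {"am": 0.6, "like": 0.4},
    "am": {"always": 0.7, "happy": 0.3},
    "always": {"looking": 0.9, "happy": 0.1},
    "looking": {"for": 1.0},
    "for": {"ways": 0.8, "I": 0.2},
    "ways": {"I": 0.55, "for": 0.45},
}


def greedy_decode(start, max_length):
    """Repeatedly append the most probable next token until max_length tokens."""
    tokens = [start]
    while len(tokens) < max_length:
        candidates = TOY_MODEL.get(tokens[-1])
        if not candidates:  # no known continuation: stop early
            break
        tokens.append(max(candidates, key=candidates.get))
    return tokens


output = greedy_decode("I", max_length=14)
print(" ".join(output))
# I am always looking for ways I am always looking for ways I am
```

Because the choice at each step is deterministic, once the sequence returns to a token it has already visited it repeats the same path indefinitely.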


In [4]:
# activate beam search and early_stopping

beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, # to avoid repetitions of the same word sequences
    early_stopping=True
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

I enjoy generating my own content, so I'm always looking for ways to improve it.

If you have any questions or comments, feel free to leave them in the comments below.
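
Beam search keeps the `num_beams` most probable partial sequences at every step instead of only one, so it can find a continuation whose overall probability is higher even if its first token is not the top choice. A minimal sketch on a toy table follows. The `TOY_MODEL` numbers and the `beam_search` helper are hypothetical, and `generate` additionally handles length limits, `no_repeat_ngram_size`, and early stopping, which are omitted here.

```python
import math

# Toy next-token table (made-up probabilities, for illustration only).
TOY_MODEL = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"is": 0.5, "drives": 0.3, "turns": 0.2},
}


def beam_search(start, steps, num_beams):
    """Return the best (tokens, log_probability) pair after `steps` expansions."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, prob in TOY_MODEL.get(tokens[-1], {}).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        if not candidates:  # nothing left to expand
            break
        # keep only the num_beams highest-scoring hypotheses
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]


for width in (1, 2):
    tokens, score = beam_search("The", steps=2, num_beams=width)
    print(width, " ".join(tokens), round(math.exp(score), 2))
```

With a single beam the search is greedy and settles on "The nice woman" (probability 0.2), while two beams recover "The dog has" (probability 0.36).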


In [5]:
# activate sampling; top_k=0 disables top-k filtering, so tokens are drawn from the full distribution

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0
)

print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


I enjoy generating more information (aka lower efficiency when we do so) and am doing this right now because it's so genuinely empowering to so many developers, is what makes me feel passionate about this series; that actually provides benefits to the performance of the
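
With `do_sample=True` the next token is drawn at random in proportion to its probability rather than chosen by argmax, which is why the output above is more varied but also less coherent. The sketch below contrasts the two choices on one toy distribution. The `NEXT_TOKEN_PROBS` values and both helper functions are hypothetical illustrations.

```python
import random
from collections import Counter

# Toy distribution over the next token (made-up probabilities).
NEXT_TOKEN_PROBS = {"content": 0.5, "ideas": 0.3, "reports": 0.15, "xylophones": 0.05}


def greedy_pick(probs):
    """Always return the single most probable token."""
    return max(probs, key=probs.get)


def sample_pick(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
counts = Counter(sample_pick(NEXT_TOKEN_PROBS, rng) for _ in range(10_000))

print(greedy_pick(NEXT_TOKEN_PROBS))
print(counts.most_common())
```

Greedy selection never produces the unlikely token, whereas unrestricted sampling picks it roughly 5% of the time; over a 50-token generation, a few such draws are enough to derail the text.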


In [6]:
# nucleus (top-p) sampling: sample only from the smallest set of most likely tokens whose cumulative probability reaches 0.92

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

# of the decoding strategies tried here, top-p sampling arguably gives the most natural-sounding text

I enjoy generating company-wide consensus like this is a pleasant experience, where just one or two guys want to share in the optimism, love and belief without necessarily being wrong.

Leadership and people can also back their projects as a good practice
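
Top-p (nucleus) sampling trims the unlikely tail before sampling: tokens are ranked by probability, the shortest prefix whose cumulative probability reaches `top_p` is kept, and the kept probabilities are renormalized. The `top_p_filter` helper and the toy distribution below are hypothetical; this is a sketch of the idea, not the library's exact implementation.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # renormalize so the surviving probabilities sum to 1
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Toy distribution over the next token (made-up probabilities).
next_token_probs = {"content": 0.5, "ideas": 0.3, "reports": 0.15, "xylophones": 0.05}

nucleus = top_p_filter(next_token_probs, top_p=0.92)
print(sorted(nucleus))
```

Here the three most likely tokens already cover 0.95 of the probability mass, so the least likely one is dropped and can never be sampled. The size of the kept set adapts to the distribution: a confident prediction keeps few tokens, a flat one keeps many.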


**GPT-Neo**

In [7]:
from transformers import GPT2Tokenizer, GPTNeoForCausalLM

model_name = "EleutherAI/gpt-neo-125M"

tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPTNeoForCausalLM.from_pretrained(model_name, pad_token_id=tokenizer.eos_token_id)

input_ids = tokenizer.encode('I enjoy generating', return_tensors='pt')

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


I enjoy generating script on command line tools / pasted. a iam.htm is well known in the world, its good for good functioning and your site will allow you to save it automatically. right click on page title of iam.htm


**Conditional Text Generation**

In [8]:
input_ids = tokenizer.encode('Donald Trump:', return_tensors="pt")

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))



Donald Trump: Mr. Roger Stone is “a right-wing man” who wants to be “not a lawyer,” “not a politician.” I know he sees you’re making a mistake; I


In [9]:
input_ids = tokenizer.encode('Joe Biden:', return_tensors="pt")

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))



Joe Biden: Vote for the One - 4/35/2012

MIAMI (AP) -- Vice President Joe Biden is advancing a new effort to bring up a bill that also prevents all ticket voting. That bill will no longer be called a


In [10]:
input_ids = tokenizer.encode('Justice Ruth Bader Ginsburg:', return_tensors="pt")

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


Justice Ruth Bader Ginsburg: Awardrobe Night Ralls Days

$9.00

Brushbone Ball Set (GMAZ It’s Promised $9.00, not the “later” price
