In [1]:
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "3"

from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset
import pandas as pd
import torch
from pathlib import Path

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = "../models/experiment3/acyclic_no_reason_deterministic/checkpoint-13873"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path).to(device)
model.warnings_issued = {}

Some weights of the model checkpoint at ../models/experiment3/acyclic_no_reason_deterministic/checkpoint-13873 were not used when initializing LlamaForCausalLM: ['v_head.summary.bias', 'v_head.summary.weight']
- This IS expected if you are initializing LlamaForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LlamaForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
dataset_path = "../dataset/cnn_dailymail/validation.csv"
df = pd.read_csv(dataset_path)
dataset = Dataset.from_pandas(df)

In [4]:
display(dataset[1940])
print(len(dataset))

{'id': 'd7a7071275362adb4e49cdd29da6b343fc39eaa2',
 'article': "A 7-year-old boy from North Caronia has died after his 11-year-old brother accidentally shot him in the neck with an arrow. The incident happened March 15 at the family's home in Wake County. Daniel Velazquez succumbed to his injuries four days later. According to sheriff’s deputies, the 7-year-old was struck in the throat with an arrow shot from a bow by his big brother, Israel. Freak accident: Daniel Velazquez (right), 7, and his big brother, Israel (left), were playing with a bow and arrow when the 11-year-old accidentally shot and mortally wounded his sibling . Close to home: The accident happened March 15 in the woods near the family's mobile home in Wake County, North Carolina . Daniel was rushed to WakeMed in critical condition, but despite doctors’ efforts he could not be saved. The brothers' uncle Javier Velazquez told ABC11 last week the boys’ parents were home at the time of the accident, which took place at aro

13368


In [12]:
article = dataset[1941]["article"]
inputs = tokenizer(f'{article}\nTL;DR:\n', return_tensors="pt").to(device)
tokens = model.generate(**inputs)
tokenizer.decode(tokens[0])

"<s> Tens of thousands of comic book fans decked out in fancy dress gathered in London today for one of the UK's biggest celebration of animated culture. The London Super Comic Book Convention, which kicked off last today, saw fans donning elaborate and often skimpy costumes in homage to their comic book favourites. Comic super-fan and television presenter Jonathan Ross was among the celebrity guests for day one alongside Storage Wars' Sean Kelly and John Romita Junior- son of comic book legend by the same name- who draws for DC comics. There were a number of tributes to Guardians of the Galaxy's tree-like character Groot and characters from the Lego Movie. Among the best-dressed were also a trio who posed as Batman and his nemeses the Joker and Harley Quinn. A total of 25,000 fans are expected to pay the Excel Centre, in east London, a visit over the course of the event, which lasts until tomorrow. The annual event, now in its four year, is the biggest comic book themed event in the U

In [6]:
print(article)

A 7-year-old boy from North Caronia has died after his 11-year-old brother accidentally shot him in the neck with an arrow. The incident happened March 15 at the family's home in Wake County. Daniel Velazquez succumbed to his injuries four days later. According to sheriff’s deputies, the 7-year-old was struck in the throat with an arrow shot from a bow by his big brother, Israel. Freak accident: Daniel Velazquez (right), 7, and his big brother, Israel (left), were playing with a bow and arrow when the 11-year-old accidentally shot and mortally wounded his sibling . Close to home: The accident happened March 15 in the woods near the family's mobile home in Wake County, North Carolina . Daniel was rushed to WakeMed in critical condition, but despite doctors’ efforts he could not be saved. The brothers' uncle Javier Velazquez told ABC11 last week the boys’ parents were home at the time of the accident, which took place at around 12.45pm March 15. According to Velazquez, Daniel and Israel 

In [15]:
# Quick check: run one article through the chat template and generate a summary
article = dataset[1939]["article"]
messages = [{"role": "user", "content": f'{article}\nTL;DR:\n'}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

output_ids = model.generate(input_ids, max_new_tokens=400)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(output_text)

<|user|>
(CNN)First, the women. Now, the men. It seems everyone is getting in on the "Ghostbusters" reboot action. But not everyone is happy about it. Deadline reports that Sony Pictures is planning an all-male "Ghostbusters" reboot starring Channing Tatum and directed and produced by brothers Joe Russo and Anthony Russo. The studio is also creating a production company, Ghostcorps, which will include Ivan Reitman and Dan Aykroyd, who were among the original "Ghostbusters" team, according to Deadline. "We want to expand the Ghostbusters universe in ways that will include different films, TV shows, merchandise, all things that are part of modern filmed entertainment," Deadline quoted Reitman as saying. The film "has a wonderful idea that builds" on the premise of an all-female "Ghostbusters" reboot expected to begin filming this summer. That film is expected to star Melissa McCarthy,  Kristen Wiig, Leslie Jones and Kate McKinnon. "It's just the beginning of what I hope will be a lot of 

In [8]:
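def get_summary_no_prompt(content, model):
    """Generate from `content` as-is (no TL;DR instruction) via the chat template.

    Uses the global `tokenizer` and `device`. Returns the generated text and
    the number of newly generated tokens.
    """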
    message = [{"role": "user", "content": content}]
    prompt = tokenizer.apply_chat_template(
        message, tokenize=False, add_generation_prompt=True
    )

    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    attention_mask = torch.ones_like(input_ids)

    output_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_new_tokens=500,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        top_k=50,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
    )
    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    assistant_tag = "<|assistant|>"
    if assistant_tag in output_text:
        output_text = output_text.split(assistant_tag, 1)[1].strip()

    return output_text, output_ids.shape[-1] - input_ids.shape[-1]

In [9]:
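def get_summary(content, model):
    """Generate a summary of `content` using a "TL;DR:" prompt in the chat template.

    Uses the global `tokenizer` and `device`. Returns the generated text and
    the number of newly generated tokens.
    """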
    message = [
        {
            "role": "user",
            # "content": f"Summarize the following text in a TL;DR style in one sentence\n\n{content}\n",
            "content": f"TL;DR: \n\n{content}\n",
        }
    ]
    prompt = tokenizer.apply_chat_template(
        message, tokenize=False, add_generation_prompt=True
    )

    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    attention_mask = torch.ones_like(input_ids)

    output_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_new_tokens=500,
        do_sample=True,
        temperature=1.3,
        top_p=0.95,
        top_k=50,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
    )
    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    assistant_tag = "<|assistant|>"
    if assistant_tag in output_text:
        output_text = output_text.split(assistant_tag, 1)[1].strip()

    return output_text, output_ids.shape[-1] - input_ids.shape[-1]
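
Both helpers above recover the summary by splitting the decoded text on `<|assistant|>`, which only works if that tag survives `skip_special_tokens=True`. A possible alternative, sketched below, is to slice off the prompt tokens before decoding. `continuation_ids` is a hypothetical helper, not part of this notebook. It is shown on plain nested lists with made-up ids so it runs without a model; the same indexing should apply to the tensor returned by `model.generate`, assuming the output starts with the prompt tokens.

```python
def continuation_ids(output_ids, prompt_length):
    """Return only the tokens generated after the prompt, for the first sequence.

    output_ids: batch of sequences (nested list or 2-D tensor), prompt first.
    prompt_length: number of prompt tokens, e.g. input_ids.shape[-1].
    """
    if prompt_length < 0:
        raise ValueError("prompt_length must be non-negative")
    return output_ids[0][prompt_length:]


# Toy example with made-up ids: 3 prompt tokens followed by 2 generated tokens.
fake_output = [[1, 15, 27, 301, 2]]
new_ids = continuation_ids(fake_output, prompt_length=3)
print(new_ids, len(new_ids))
```

In the notebook this might be used as `tokenizer.decode(continuation_ids(output_ids, input_ids.shape[-1]), skip_special_tokens=True)`; the length of the slice equals the token count the helpers already return.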

In [10]:
num_articles = 1000
total_length = 0
for i in range(num_articles):
    content = dataset[i]["article"]
    # Token count per article, including any special tokens the tokenizer adds.
    total_length += len(tokenizer.encode(content))
print(f"Average length of articles: {total_length / num_articles}")

Token indices sequence length is longer than the specified maximum sequence length for this model (2447 > 2048). Running this sequence through the model will result in indexing errors


Average length of articles: 992.633
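
The warning above shows that at least one article (2447 tokens) exceeds the model's 2048-token limit. One way to handle this, sketched here under stated assumptions, is to cut an article's token ids down to a budget that leaves room for generation. `truncate_to_budget` and the `template_overhead` value of 32 are hypothetical; the 2048 limit comes from the warning, and 500 matches `max_new_tokens` in the helpers above.

```python
def truncate_to_budget(token_ids, context_window=2048, max_new_tokens=500,
                       template_overhead=32):
    """Keep only as many article tokens as fit alongside the generated output.

    template_overhead is an assumed allowance for chat-template and prompt
    tokens; measure it for the real template before relying on it.
    """
    budget = context_window - max_new_tokens - template_overhead
    if budget <= 0:
        raise ValueError("no room left for article tokens")
    return list(token_ids[:budget])


# Toy example: a 2447-token article (ids are placeholders) is cut to the budget.
long_article_ids = list(range(2447))
kept = truncate_to_budget(long_article_ids)
print(len(kept))
```

With a real tokenizer, this could be applied as `tokenizer.decode(truncate_to_budget(tokenizer.encode(content, add_special_tokens=False)))` before building the chat message. Truncating from the end is only one choice; it drops the tail of long articles.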


In [11]:
for k in [2, 3]:
    print("prompt", k)
    print(dataset[k]["article"])
    for i in range(1, 6):
        print(f"summary {i}")
        summary, num_new_tokens = get_summary(dataset[k]["article"], model)
        print(summary, "\n", num_new_tokens)

prompt 2
A man convicted of killing the father and sister of his former girlfriend in a fiery attack on the family's Southern California home was sentenced to death on Tuesday. Iftekhar Murtaza, 30, was sentenced for the murders of Jay Dhanak, 56, and his daughter Karishma, 20, in May 2007, the Orange County district attorney's office said. Murtaza was convicted in December 2013 of killing the pair in an attempt to reunite with his then-18-year-old ex-girlfriend Shayona Dhanak. She had ended their relationship citing her Hindu family's opposition to her dating a Muslim. To be executed: Iftekhar Murtaza, 30, was sentenced to death Tuesday for the May 21, 2007 murders of his ex-girlfriend's father and sister and the attempted murder of her mother . Authorities said Murtaza and a friend torched the family's Anaheim Hills home and kidnapped and killed Dhanak's father and sister, leaving their stabbed bodies burning in a park 2 miles from Dhanak's dorm room at the University of California, 