## Llama 3.1

Llama 3.1 is a model family that Meta released in July 2024. Check out [the model card](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) for further details. Its weights are openly available under Meta's community license, but the repository is gated: to use it, you need to log in to your Hugging Face account and request access. We're using the 8-billion-parameter instruct version, quantized to 4 bits so that it has a much smaller memory footprint.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2024-fall-main/blob/master/materials/lesson_notebooks/lesson_7_summarization_LLM.ipynb)

In [1]:
!pip install -q -U bitsandbytes flash_attn

In [2]:
import torch
from transformers import pipeline
from pprint import pprint

Here's some text from the introduction to [The Prompt Report: A Systematic Survey of Prompting Techniques](https://arxiv.org/pdf/2406.06608).  Let's have the model summarize it.

In [3]:
ARTICLE = "Scope of Study We create a broad directory of prompting techniques, which can be quickly understood and easily implemented for rapid experimentation by developers and researchers. To this end, we limit our study to focus on discrete prefix prompts (Shin et al., 2020a) rather than cloze prompts (Petroni et al., 2019; Cui et al., 2021), because modern LLM architectures (especially decoder-only models), which use prefix prompts, are widely used and have robust support for both consumers and researchers. Additionally, we refined our focus to hard (discrete) prompts rather than soft (continuous) prompts and leave out papers that make use of techniques using gradient-based updates (i.e. fine-tuning). Finally, we only study task-agnostic techniques. These decisions keep the work approachable to less technical readers and maintain a manageable scope. "

ARTICLE += "Sections Overview We conducted a machine-assisted systematic review grounded in the PRISMA process (Page et al., 2021) (Section 2.1) to identify 58 different text-based prompting techniques, from which we create a taxonomy with a robust terminology of prompting terms (Section 1.2) While much literature on prompting focuses on English-only settings, we also discuss multilingual techniques (Section 3.1). Given the rapid growth in multimodal prompting, where prompts may include media such as images, we also expand our scope to multimodal techniques (Section 3.2). Many multilingual and multimodal prompting techniques are direct extensions of English text-only prompting techniques. "

ARTICLE += "As prompting techniques grow more complex, they have begun to incorporate external tools, such as Internet browsing and calculators. We use the term ‘agents‘ to describe these types of prompting techniques (Section 4.1). It is important to understand how to evaluate the outputs of agents and prompting techniques to ensure accuracy and avoid hallucinations."

len(ARTICLE)

1898

In [4]:
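from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with nested ("double") quantization of the
# quantization constants; matrix multiplications are computed in float16.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)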
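
As a rough, back-of-the-envelope sketch of why 4-bit quantization shrinks the memory footprint, the arithmetic below estimates the storage needed for the weights alone. The parameter count is taken as a nominal 8 billion (an assumption, not the exact figure), and the helper `weight_memory_gib` is hypothetical. The estimate ignores quantization constants, any layers that stay unquantized, activations, and the key-value cache, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB (2**30 bytes)."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 8e9  # nominal "8 billion" parameters (assumed, not exact)

for label, bits in [("float16 / bfloat16", 16), ("8-bit", 8), ("4-bit NF4", 4)]:
    print(f"{label:>20}: ~{weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Going from 16 bits to 4 bits per weight cuts the weight storage to roughly a quarter, which is what lets the model fit on a single modest GPU.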

Let's load the Llama 3.1 model using the pipeline abstraction from Hugging Face. Before we have it summarize, let's ask it a question and see how well it answers. Do you think the answer is accurate? How might we evaluate the generated answer?

In [5]:
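import transformers

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Call transformers.pipeline explicitly: assigning the result to the name
# `pipeline` shadows the function imported earlier, so a bare pipeline(...)
# call would no longer build a new pipeline if this cell were run again.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16, "quantization_config": quantization_config},
    device_map="auto",
)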

messages = [
    {"role": "system", "content": "You are a science communicator who makes technology accessible to everyone!"},
    {"role": "user", "content": "Please write a five sentence explanation of how LLMs do knowledge representation."},
]

outputs = pipeline(
    messages,
    max_new_tokens=512,
)

pprint(outputs[0]["generated_text"][-1], compact=True)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


{'content': 'Large Language Models (LLMs) represent knowledge through a '
            'complex network of interconnected nodes, where each node '
            'corresponds to a unique concept or piece of information. This '
            'network, known as a knowledge graph, is built by training the '
            'model on vast amounts of text data, allowing it to learn '
            'relationships and patterns between concepts. As the model '
            'processes new input, it generates a set of "embeddings" – '
            'numerical vectors that capture the semantic meaning of the input '
            'and its connections to other concepts in the graph. These '
            'embeddings enable the model to reason about the input, make '
            'predictions, and generate responses that reflect its '
            'understanding of the underlying knowledge. By leveraging this '
            'knowledge graph, LLMs can represent and manipulate complex '
            'knowledge structures, 
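
One inexpensive check, before judging factual accuracy, is whether the model followed the format instruction in the prompt (a five-sentence explanation). The sketch below counts sentences with a deliberately naive regular expression. The helper `count_sentences` and the sample string are hypothetical, and the splitter will miscount text containing abbreviations such as "e.g." or decimal numbers.

```python
import re


def count_sentences(text: str) -> int:
    """Naively count sentences by splitting after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


# Hypothetical stand-in for a generated answer; in the notebook you would
# pass the "content" field of the model's reply instead.
sample_answer = "LLMs store knowledge in their weights. They do not use a database. Is that all?"

n = count_sentences(sample_answer)
print(f"{n} sentences; five-sentence instruction followed: {n == 5}")
```

A check like this says nothing about whether the content is correct; it only verifies instruction following, which is one of several dimensions worth evaluating.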

Now let's try it for abstractive summarization. Note that generating answers takes a while because this model has 8 billion parameters. The next cell can take up to 2 minutes to complete.

How good is the output from Llama 3.1? How can we measure its performance? What are all the elements we would need in order to run, say, ROUGE?
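
To make the ROUGE question concrete, here is a minimal sketch of ROUGE-N written from scratch. It shows the elements involved: a candidate summary (the model output), a human-written reference summary, a tokenization choice, an n-gram order, and a decision about which of precision, recall, or F1 to report. The function `rouge_n` and both example strings are hypothetical; in particular, this notebook does not include a reference summary, so the one below is invented purely for illustration. In practice you would more likely use an established package such as `rouge_score` or the Hugging Face `evaluate` library, which also handle stemming and ROUGE-L.

```python
import re
from collections import Counter


def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """Compute ROUGE-N precision, recall and F1 from clipped n-gram overlap."""

    def ngrams(text: str) -> Counter:
        # Simplistic tokenization: lowercase, keep runs of word characters.
        tokens = re.findall(r"\w+", text.lower())
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Both strings are made up for illustration only.
candidate = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(rouge_n(candidate, reference, n=1))
print(rouge_n(candidate, reference, n=2))
```

Because ROUGE only measures n-gram overlap with the reference, a fluent but factually wrong summary can still score well, and a faithful paraphrase can score poorly.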

In [6]:
messages = [
    {"role": "system", "content": "You are an expert on natural language processing.  Please summarize the following content for a fifth grader. Your summary should be no longer than five sentences."},
    {"role": "user", "content": ARTICLE},
]

prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Let's set some values to have more control over the output.
outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
pprint(outputs[0]["generated_text"][len(prompt):], compact=True)

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


('Imagine you have a super smart computer program that can understand and '
 'respond to questions. To help it understand what we want to do, we give it '
 'special instructions called "prompts." We\'re studying different ways to '
 "create these prompts so we can understand how to make them better. We're "
 'focusing on simple prompts that use words, not pictures or math problems. '
 "We're also looking at how to make these prompts work in different languages "
 'and with different tools, like the internet or calculators. This will help '
 "us make sure the computer program gives us accurate answers and doesn't make "
 "up things that aren't true.")

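The `temperature=0.6` and `top_p=0.9` settings used above control how the next token is sampled. The toy sketch below illustrates the idea on four made-up logits: temperature rescales the logits before the softmax, and top-p (nucleus) filtering keeps the smallest set of most-probable tokens whose cumulative probability reaches `top_p`. The functions `softmax_with_temperature` and `top_p_filter` are hypothetical simplifications; the Hugging Face implementation works on logits over the full vocabulary and differs in its details.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four candidate tokens

for temperature in (1.0, 0.6):
    probs = softmax_with_temperature(toy_logits, temperature)
    nucleus = top_p_filter(probs, top_p=0.9)
    print(f"temperature={temperature}: candidate token ids {sorted(nucleus)}")
```

Lowering the temperature concentrates probability on the top token, so fewer tokens survive the top-p cut and the output becomes less varied.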

Try it yourself.  You can fill in the system and the user portion of the prompt.  See what kinds of questions it can answer and see how well it summarizes content.

In [None]:
messages = [
    {"role": "system", "content": "Your Value Here"},
    {"role": "user", "content": "Your Value Here"},
]

prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Let's set some values to have more control over the output.
outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
pprint(outputs[0]["generated_text"][len(prompt):], compact=True)