# **Open-Source Large Language Models for Structured Information Extraction**

Open-source large language models can be used to extract structured information from unstructured text. This notebook demonstrates doing so "locally" with the `llama.cpp` library, through its `llama-cpp-python` bindings.


Points for speaker:
- Why are we using Colab?


In [None]:
from pathlib import Path

working_dir = Path(
    "/nvme/storage_michiel/llm_workshop"
)  # /content when working with remote runtime

In [None]:
# @title Connect to Google Drive
from google.colab import drive

drive.mount("/content/gdrive")

In [None]:
# @title Imports and downloads
%%capture
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
#!wget https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q5_K_M.gguf -P $working_dir
# !wget https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q5_K_M.gguf -P $working_dir
!wget https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf -P $working_dir

In [None]:
# @title Instantiate the local LLM
%%capture
from llama_cpp import Llama

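llm = Llama(
    # Must match the file downloaded in the previous cell.
    model_path=str(working_dir / "openhermes-2.5-mistral-7b.Q4_K_M.gguf"),
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=8192,  # context window size in tokens
    seed=42,  # llama-cpp-python names this parameter `seed`
)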
llm.verbose = False

In [None]:
# @title Define helper functions
from pprint import pformat, pp, pprint

template = """
<|im_start|>user
{prompt}
<|im_end|>
<|im_start|>assistant
"""


def local_llm(
    prompt, verbose=False, apply_template=True, temperature=0.7, max_tokens=None
):
    if apply_template:
        prompt = template.format(prompt=prompt)
    if verbose:
        print(f"Prompt:\n{prompt}")
    response = llm(prompt, max_tokens=max_tokens, temperature=temperature, top_p=0.95)
    return response["choices"][0]["text"]

- Overview of different models, sizes
- Foundation/base models vs chat / instruction models
- "Access / Privacy"
- `llama-cpp`!
- quantization
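
To make the quantization point concrete, the sketch below estimates the size of a model file from its parameter count and the average number of bits stored per weight. The bit widths are illustrative round numbers, not the exact values of any GGUF quantization scheme, and `approx_model_size_gb` is a hypothetical helper.

```python
def approx_model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough size in gigabytes: parameters x bits per weight, converted to bytes."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7e9  # a "7B" model, as used in this notebook

# Illustrative bit widths (assumption): real quantized files mix precisions
# and carry metadata, so actual file sizes differ somewhat.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("5-bit", 5), ("4-bit", 4)]:
    print(f"{label}: ~{approx_model_size_gb(n_params, bits):.1f} GB")
```

The smaller the file, the more likely the model fits into the memory of a single consumer GPU, usually at some cost in output quality.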


# Prompting basics

In [None]:
response = local_llm(
    "Write me promotional material for a workshop demonstrating use cases of open-source large language models"
)
print(response)

- Explain what happened - we called a local LLM!
- Chat template

## Chat templates

In [None]:
response = local_llm(
    "In what city is Campus Fryslan located?",
    verbose=True,
)
print(response)

In [None]:
response = local_llm(
    "In what city is Campus Fryslan located?",
    apply_template=False,
    verbose=True,
    temperature=0.0,
)
print(response)
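
The template used by `local_llm` wraps a single user message in ChatML markers. As a sketch of how the same format extends to conversations with several turns, the hypothetical helper `to_chatml` below renders a list of role/content messages and leaves the final assistant turn open for the model to complete. It mirrors the layout of the notebook's `template` string; other models may expect a different chat format.

```python
def to_chatml(messages):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}\n<|im_end|>\n"
        )
    # Leave the assistant turn open so the model continues from here.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chat_prompt = to_chatml(
    [{"role": "user", "content": "In what city is Campus Fryslan located?"}]
)
print(chat_prompt)
```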

## Temperature

In [None]:
prompt = """
I'm organizing a workshop on using LLMs to extract structured information from
texts / corpora for non-technical researchers at a university.
Could you suggest a few catchy titles, free of jargon?
"""

response = local_llm(prompt, temperature=0.0)
print(response)

In [None]:
response = local_llm(prompt, temperature=0.0)
print(response)

In [None]:
response = local_llm(prompt, temperature=0.9)
print(response)

In [None]:
response = local_llm(prompt, temperature=0.9)
print(response)
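
A minimal sketch of what temperature does during sampling, using made-up scores (logits) for three candidate tokens: the scores are divided by the temperature before being turned into probabilities. Low temperatures concentrate probability on the top-scoring token, while high temperatures flatten the distribution. This is a simplified illustration, not the sampling code of `llama.cpp`, which also applies settings such as `top_p`.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature == 0:
        # Treat temperature 0 as greedy decoding: all mass on the best token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    highest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.0, 0.7, 0.9, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 2) for p in probs])
```

This is why repeated runs at temperature 0.0 tend to give the same answer, while runs at 0.9 vary.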

## Number of input / output tokens

- What is a token?


In [None]:
response = local_llm(prompt, max_tokens=20)
print(response)

In [None]:
!wget https://www.gutenberg.org/cache/epub/100/pg100.txt -P $working_dir
long_text = (working_dir / "pg100.txt").read_text(encoding="utf-8")

In [None]:
print(long_text[:500])

In [None]:
long_prompt = "Please summarize the following: \n" + long_text
# Expected to raise an error: the full text needs far more tokens than n_ctx=8192 allows.
response = local_llm(long_prompt)
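
The request above is expected to fail because the prompt needs far more tokens than the context window configured with `n_ctx`. A rough pre-check can catch this before calling the model. The sketch below uses a rule of thumb of about four characters per token for English text, which is an assumption rather than a property of this model's tokenizer. With the `llm` object from this notebook, an exact count should be obtainable with `len(llm.tokenize(text.encode("utf-8")))`. The helper names and the stand-in texts are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate; the ratio is a rule of thumb, not exact."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(text, n_ctx=8192, reserved_for_output=512):
    """Check whether the prompt leaves room for the response inside the context window."""
    return estimate_tokens(text) + reserved_for_output <= n_ctx


short_text = "Please summarize the following: \nTo be, or not to be."
long_text_stub = "word " * 100_000  # stand-in for a book-length text

print(fits_in_context(short_text))
print(fits_in_context(long_text_stub))
```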

# Prompt Engineering 101

- Zero shot learning
- Few shot learning
- Chain of thought


In [None]:
# @title Zero-shot prompting
prompt = """
Classify the text into neutral, negative or positive.
Text: I think the workshop is okay.
"""
print(local_llm(prompt))

In [None]:
# @title One-shot prompting
prompt = """
A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is:
We were traveling in Africa and we saw these very cute whatpus.

To do a "farduddle" means to jump up and down really fast. Please give an example of a sentence that uses the word farduddle.
"""
print(local_llm(prompt))
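
With more than one demonstration, the same pattern becomes few-shot prompting. A minimal sketch, assuming a sentiment task like the zero-shot example earlier: the hypothetical helper `build_few_shot_prompt` places labelled examples before the new text and ends with an open `Sentiment:` line for the model to complete. The example texts and labels are invented for illustration.

```python
def build_few_shot_prompt(instruction, examples, new_text):
    """Combine an instruction, labelled examples and a new input into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Text: {new_text}")
    lines.append("Sentiment:")
    return "\n".join(lines)


few_shot_prompt = build_few_shot_prompt(
    "Classify the text into neutral, negative or positive.",
    [
        ("The speaker was fantastic.", "positive"),
        ("The room was far too cold.", "negative"),
    ],
    "I think the workshop is okay.",
)
print(few_shot_prompt)
```

The resulting string can be passed to `local_llm` in the same way as the other prompts in this section.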

In [None]:
# @title Chain-of-thought prompting

prompt_no_cot_formatted = """
<|im_start|>user
Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls.
How many tennis balls does he have now?
<|im_end|>
<|im_start|>assistant
The answer is 11.
<|im_end|>
<|im_start|>user
The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
<|im_end|>
<|im_start|>assistant
"""
print(local_llm(prompt_no_cot_formatted, apply_template=False, verbose=True))

In [None]:
prompt_cot_formatted = """
<|im_start|>user
Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls.
How many tennis balls does he have now?
<|im_end|>
<|im_start|>assistant
Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
<|im_end|>
<|im_start|>user
The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
<|im_end|>
<|im_start|>assistant
"""
print(local_llm(prompt_cot_formatted, apply_template=False))

In [None]:
# @title Zero-shot chain-of-thought

prompt_cot_zs = """
<|im_start|>user
The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
<|im_end|>
<|im_start|>assistant
Let's think step by step: """
print(local_llm(prompt_cot_zs, apply_template=False))


# Scaling up

- Prompt template
- Structured output
- Retry until structure is valid
- External APIs


In [None]:
# @title Fetch Data and Load Into Pandas
%%capture

!wget "http://datascience.web.rug.nl/llm_parliamentary_sample.csv" -P $working_dir

import pandas as pd

df = pd.read_csv(working_dir / "llm_parliamentary_sample.csv")

In [None]:
df.query("votes_diff > 0").head()

In [None]:
first_row = df.query("votes_diff > 0").iloc[0]

In [None]:
# @title Prompt templates, structuring outputs

formatted_prompt_template = """
<|im_start|>user
I will provide you with a question and a response given in a parliamentary setting.

The question:
*********
{question}
*********

The answer:
*********
{answer}
*********

Does the response sufficiently answer the question?

Return your answer as a valid JSON object with a single field `final_answer`
holding a boolean value with your final answer, like {{"final_answer": …}}.
<|im_end|>
<|im_start|>assistant
"""

formatted_prompt = formatted_prompt_template.format(
    question=first_row["question_text"].strip(), answer=first_row["answer_text"].strip()
)

response = local_llm(formatted_prompt, apply_template=False, verbose=True)
print("\nLLM answer: ")
print(response)

response = local_llm(
    formatted_prompt + "Let's think step by step: ", apply_template=False, verbose=True
)
print("\nLLM answer (Zero-shot CoT): ")
print(response)
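
Note the doubled braces in the prompt template: `str.format` treats single braces as placeholders, so literal braces in the JSON example have to be written as `{{` and `}}`. A short illustration with a hypothetical, shortened template:

```python
mini_template = 'Question: {question}\nAnswer as JSON, like {{"final_answer": true}}.'

# Only {question} is substituted; the doubled braces become single literal braces.
rendered = mini_template.format(question="Was the question answered?")
print(rendered)
```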

# Parsing the answer from the response

In [None]:
# @title Define helper functions


import json
import re
from json import JSONDecodeError

from tqdm import tqdm

json_expression = re.compile(r"\{.+?\}", re.DOTALL)


def can_parse(model_output, output_arguments, output_types=None):
    if output_types is None:
        output_types = dict()
    answers = json_expression.findall(model_output)
    if len(answers) != 1:
        return False
    answer = answers[0]
    try:
        output = json.loads(answer)
        for arg in output_arguments:
            value = output[arg]
            if arg in output_types:
                if not isinstance(value, output_types[arg]):
                    return False
        return True
    except (JSONDecodeError, KeyError):
        return False


def parse_output(model_output):
    answers = json_expression.findall(model_output)
    answer = answers[0]
    return json.loads(answer)


def annotation_loop(
    input_df, apply_template, expected_keys, expected_types=None, n_retries=10
):
    df = input_df.copy()
    df["can_parse"] = False
    for _ in range(n_retries):
        not_parseable = ~df["can_parse"]
        responses = [
            local_llm(prompt, apply_template=apply_template)
            for prompt in tqdm(df.loc[not_parseable, "formatted_prompt"])
        ]
        df.loc[not_parseable, "response"] = responses
        df.loc[not_parseable, "can_parse"] = df.loc[not_parseable, "response"].apply(
            can_parse, args=(expected_keys, expected_types)
        )
        if df["can_parse"].all():
            break
    parseable = df["can_parse"]
    df.loc[parseable, "json"] = df.loc[parseable, "response"].apply(parse_output)
    for key in expected_keys:
        df.loc[parseable, key] = df.loc[parseable, "json"].apply(lambda x: x[key])
    return df.drop("json", axis="columns")
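
To see what the validation step accepts, the self-contained sketch below applies the same regular expression to three invented model outputs. It is a condensed variant of `can_parse`, written as a hypothetical `extract_answer` helper that returns the parsed value or `None`. Because the pattern is non-greedy and stops at the first closing brace, it only handles flat JSON objects without nested braces.

```python
import json
import re

json_pattern = re.compile(r"\{.+?\}", re.DOTALL)


def extract_answer(model_output, key="final_answer", expected_type=bool):
    """Return the value of `key` if exactly one valid JSON object is found, else None."""
    candidates = json_pattern.findall(model_output)
    if len(candidates) != 1:
        return None
    try:
        value = json.loads(candidates[0])[key]
    except (json.JSONDecodeError, KeyError):
        return None
    return value if isinstance(value, expected_type) else None


examples = [
    'The minister avoids the question. {"final_answer": false}',  # valid
    'I believe the answer is {"final_answer": "yes"}',  # wrong type
    "The response is sufficient.",  # no JSON at all
]
for text in examples:
    print(extract_answer(text))
```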

In [None]:
df_sampled = pd.concat(
    (
        df.sort_values("votes_diff").iloc[:5],
        df.sort_values("votes_diff", ascending=False).iloc[:5],
    )
).copy()


n_retries = 10

expected_keys = ["final_answer"]
expected_types = {"final_answer": bool}

for idx, row in df_sampled.iterrows():
    df_sampled.loc[idx, "formatted_prompt"] = (
        formatted_prompt_template.format(
            question=row.question_text.strip(), answer=row.answer_text.strip()
        )
        + "Let's think step by step: "
    )

print(df_sampled["formatted_prompt"].iloc[0])

In [None]:
df_annotated = annotation_loop(
    df_sampled,
    apply_template=False,
    expected_keys=expected_keys,
    expected_types=expected_types,
    n_retries=n_retries,  # use the retry budget defined above
)

In [None]:
df_annotated[["final_answer", "votes_diff"]]