# **Open-Source Large Language Models for Structured Information Extraction**

Open-source large language models can be used to extract structured information from unstructured text. This notebook demonstrates doing so "locally" with the `llama.cpp` library.


Points for speaker:
- Why are we using Colab?


In [23]:
from pathlib import Path

working_dir = Path("/nvme/storage_michiel/llm_workshop")  # use /content when working with a remote runtime

In [None]:
# @title Connect to Google Drive
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [24]:
# @title Imports and downloads
%%capture
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
#!wget https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q5_K_M.gguf -P $working_dir
!wget https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q5_K_M.gguf -P $working_dir

In [92]:
# @title Instantiate the local LLM
%%capture
from llama_cpp import Llama

llm = Llama(
    model_path=str(working_dir / "openhermes-2.5-mistral-7b.Q5_K_M.gguf"),
    n_gpu_layers=-1,
    n_ctx=8192,
    seed=42,  # the constructor's seed argument is `seed`; unknown keywords are silently ignored
)
llm.verbose=False

In [94]:
# @title Define helper functions
from pprint import pprint, pp, pformat

template = """
<|im_start|>user
{prompt}
<|im_end|>
<|im_start|>assistant
"""

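def local_llm(prompt, verbose=False, apply_template=True, temperature=0.7, max_tokens=None):
    """Send a prompt to the local model and return the generated text."""
    if apply_template:
        # Wrap the prompt in the chat markers the model expects.
        prompt = template.format(prompt=prompt)
    if verbose:
        print(f"Prompt:\n{prompt}")
    response = llm(prompt, max_tokens=max_tokens, temperature=temperature, top_p=0.95)
    # The completion text sits in the first choice.
    return response["choices"][0]["text"]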


- Overview of different models, sizes
- Foundation/base models vs chat / instruction models
- "Access / Privacy"
- `llama-cpp`!
- quantization


# Prompting basics

In [95]:
response = local_llm(
    "Write me promotional material for a workshop demonstrating use cases of open-source large language models"
)
print(response)

Introducing: "Unleashing the Power of Open-Source Large Language Models" Workshop

Are you ready to revolutionize your approach to natural language processing and generation? Join us for our upcoming workshop, where we will showcase the incredible power and potential of open-source large language models.

In this interactive and engaging session, you will discover how these cutting-edge technologies are transforming industries and enhancing communication. Our expert trainers will guide you through real-world use cases, providing practical insights and hands-on experience.

Key Takeaways:

1. Understanding the fundamentals of open-source large language models and their applications.
2. Exploring various open-source options, such as Hugging Face's Transformers and TensorFlow.
3. Identifying and analyzing potential use cases in various sectors, including healthcare, finance, and education.
4. Learning best practices and techniques for implementing these models in your projects.
5. Network

- Explain what happened - we called a local LLM!
- Chat template

## Chat templates

In [96]:
response = local_llm(
    "In what city is Campus Fryslan located?",
    verbose=True,
)
print(response)

Prompt:

<|im_start|>user
In what city is Campus Fryslan located?
<|im_end|>
<|im_start|>assistant

Campus Fryslan is located in Leeuwarden, Netherlands. It is an educational institution that focuses on providing international students with the opportunity to learn Dutch as a foreign language while experiencing the local culture and history of the region.


In [97]:
response = local_llm(
    "In what city is Campus Fryslan located?",
    apply_template=False,
    verbose=True,
    temperature=0.0
)
print(response)

Prompt:
In what city is Campus Fryslan located?

Campus Fryslan is located in Leeuwarden, the capital of Friesland.

What are the admission requirements for Campus Fryslan?
To be admitted to Campus Fryslan, you must have a secondary school diploma or equivalent and meet the specific entry requirements for your chosen program. Some programs may also require additional tests or interviews.

How many students attend Campus Fryslan?
Campus Fryslan has approximately 1,500 students enrolled in its various programs.

What programs does Campus Fryslan offer?
Campus Fryslan offers a range of programs in fields such as business, tourism, hospitality, and leisure management. These programs are designed to provide students with practical skills and knowledge that can be applied in a variety of industries.

Does Campus Fryslan offer international programs?
Yes, Campus Fryslan offers international programs that are taught in English and designed for students from around the world. These programs pro
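
Without the template, the model simply continues the text (here as an invented FAQ page) instead of answering as an assistant. The sketch below generalizes the `template` string to a list of turns. It assumes the model expects the same `<|im_start|>role ... <|im_end|>` layout used above; `build_chatml_prompt` is a hypothetical helper, not part of `llama.cpp`.

```python
def build_chatml_prompt(messages):
    """Render a list of {"role", "content"} dicts as a ChatML-style prompt.

    Hypothetical helper: mirrors the layout of the notebook's `template`
    string, then opens an assistant turn for the model to complete.
    """
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}\n<|im_end|>\n")
    # Leave the assistant turn open so the model writes the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chat_prompt = build_chatml_prompt(
    [{"role": "user", "content": "In what city is Campus Fryslan located?"}]
)
print(chat_prompt)
```

The result could then be passed on with `local_llm(chat_prompt, apply_template=False)`.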

## Temperature

In [98]:
prompt = """
I'm organizing a workshop on using LLMs to extract structured information from
texts / corpora for non-technical researchers at a university.
Could you suggest me a few catchy titles, free of jargon?
"""

response = local_llm(prompt, temperature=0.0)
print(response)

1. "Unlocking the Power of AI: Extracting Structured Information from Texts"
2. "Transforming Text into Data: A Non-Technical Guide to AI Extraction"
3. "Revolutionize Your Research: Harnessing AI to Extract Key Information"
4. "From Text to Insights: A Workshop on AI-Driven Information Extraction"
5. "AI for Everyone: Simplifying the Process of Extracting Structured Data"
6. "Beyond Keywords: AI Techniques for Extracting Meaningful Information"
7. "AI-Assisted Research: Extracting Structured Data from Unstructured Text"
8. "AI Meets Research: A Hands-On Workshop on Extracting Structured Information"
9. "AI-Powered Text Analysis: Extracting Valuable Data for Your Research"
10. "AI and You: A Beginner's Guide to Extracting Structured Information from Texts"


In [99]:
response = local_llm(prompt, temperature=0.0)
print(response)

1. "Unlocking the Power of AI: Extracting Structured Information from Texts"
2. "Transforming Text into Data: A Non-Technical Guide to AI Extraction"
3. "Revolutionize Your Research: Harnessing AI to Extract Key Information"
4. "From Text to Insights: A Workshop on AI-Driven Information Extraction"
5. "AI for Everyone: Simplifying the Process of Extracting Structured Data"
6. "Beyond Keywords: AI Techniques for Extracting Meaningful Information"
7. "AI-Assisted Research: Extracting Structured Data from Unstructured Text"
8. "AI Meets Research: A Hands-On Workshop on Extracting Structured Information"
9. "AI-Powered Text Analysis: Extracting Valuable Data for Your Research"
10. "AI and You: A Beginner's Guide to Extracting Structured Information from Texts"


In [100]:
response = local_llm(prompt, temperature=0.9)
print(response)

Here are some catchy titles for your workshop that avoid jargon:

1. "Unlocking the Secrets of Text: Extracting Valuable Information with AI"
2. "Transforming Text into Data: Harnessing AI for Researchers"
3. "Revolutionize Your Research: Extracting Structured Information with AI"
4. "AI-Powered Text Analysis: Discover Hidden Insights in Your Research"
5. "AI Tools for Researchers: Simplifying Text Analysis and Information Extraction"
6. "Effortlessly Organize Your Research: Automating Information Extraction with AI"
7. "Unlocking the Potential of Text Data: A Beginner's Guide to AI-Assisted Research"
8. "Streamline Your Research Process: AI Techniques for Extracting Key Information"
9. "From Text to Data: AI Solutions for Non-Technical Researchers"
10. "Harness the Power of AI: Transforming Unstructured Data into Actionable Insights"

These titles emphasize the practical benefits and applications of using AI for text analysis and structured information extraction, making them suitable

In [101]:
response = local_llm(prompt, temperature=0.9)
print(response)

Sure! Here are some suggestions that are catchy and easy to understand:

1. "Unlocking Hidden Knowledge: Extracting Structured Information from Texts"
2. "Transforming Texts into Actionable Data: A Workshop on LLM Techniques"
3. "Beyond Text: Leveraging AI to Extract Valuable Insights"
4. "Revolutionize Your Research: Using Language Models to Extract Data"
5. "From Text to Tables: Harnessing AI Technologies to Organize Information"
6. "Structuring Knowledge: A Workshop on Extracting Meaningful Data from Text"
7. "AI-Powered Text Analysis: Extracting Structured Information for Researchers"
8. "Transforming Texts into Actionable Insights with AI Technologies"
9. "From Unstructured to Structured Data: A Workshop on Using Language Models"
10. "Boost Your Research with AI: Extracting Structured Information from Text"

Feel free to use any of these or modify them according to your needs. I hope these suggestions help!
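
The two runs at temperature 0.0 produced identical lists, while the two runs at 0.9 differ. A toy sketch of why, assuming the common formulation in which next-token scores are divided by the temperature before a softmax. The scores below are made up for illustration, and a temperature of exactly 0 is typically treated as "always pick the top token" rather than plugged into this formula.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    scaled = [score / temperature for score in logits]
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up scores for three candidate next tokens.
toy_logits = [2.0, 1.0, 0.5]

for temperature in (0.1, 0.7, 1.5):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Low temperatures concentrate almost all probability on the top-scoring token, so sampling becomes nearly deterministic; higher temperatures spread probability over the alternatives.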


## Number of input / output tokens

- What is a token?


In [102]:
response = local_llm(prompt, max_tokens=20)
print(response)

"Unlocking the Power of AI: Extracting Key Information from Texts"
"H


In [103]:
!wget https://www.gutenberg.org/cache/epub/100/pg100.txt -P $working_dir
long_text = (working_dir / "pg100.txt").read_text(encoding="utf-8")

--2024-04-03 09:05:17--  https://www.gutenberg.org/cache/epub/100/pg100.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5638519 (5.4M) [text/plain]
Saving to: ‘/nvme/storage_michiel/llm_workshop/pg100.txt.1’


2024-04-03 09:05:18 (6.13 MB/s) - ‘/nvme/storage_michiel/llm_workshop/pg100.txt.1’ saved [5638519/5638519]



In [104]:
print(long_text[:500])

The Project Gutenberg eBook of The Complete Works of William Shakespeare
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
bef


In [105]:
long_prompt = "Please summarize the following: \n" + long_text
response = local_llm(long_prompt)

ValueError: Requested tokens (1748075) exceed context window of 8192
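
One way around the context limit is to split the text into pieces and process them one at a time. A minimal sketch, assuming a crude characters-per-token ratio instead of a real tokenizer: the numbers above (about 5.6 million bytes, about 1.7 million tokens) suggest roughly 3.2 bytes per token for this text, so the default of 3 is a deliberately cautious guess. `split_into_chunks` is a hypothetical helper, and the chunks still need room left over for the instruction and the model's reply.

```python
def split_into_chunks(text, max_tokens, chars_per_token=3):
    """Split text into pieces that should roughly fit a token budget.

    Assumption: about `chars_per_token` characters per token, a crude
    heuristic. A real tokenizer gives exact counts.
    """
    max_chars = max_tokens * chars_per_token
    return [text[start:start + max_chars] for start in range(0, len(text), max_chars)]


sample_text = "To be, or not to be, that is the question. " * 1000
chunks = split_into_chunks(sample_text, max_tokens=2000)
print(len(sample_text), "characters ->", len(chunks), "chunks")
```

Cutting at fixed character offsets can split words and sentences; splitting on paragraph or scene boundaries would likely give the model cleaner input.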

# Prompt Engineering 101

- Zero shot learning
- Few shot learning
- Chain of thought


In [106]:
# @title Zero-shot prompting
prompt = """
Classify the text into neutral, negative or positive.
Text: I think the workshop is okay.
"""
print(local_llm(prompt))

The text can be classified as neutral.


In [107]:
# @title Few-shot prompting
prompt = """
A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is:
We were traveling in Africa and we saw these very cute whatpus.

To do a "farduddle" means to jump up and down really fast. Please give an example of a sentence that uses the word farduddle.
"""
local_llm(prompt)

"I was so excited about the surprise party that I couldn't help but do a little farduddle when I found out about it."
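
Few-shot prompts follow a regular pattern: an instruction, a handful of worked examples, then the new input with its answer left blank. A minimal sketch of assembling one programmatically for the sentiment task above; `build_few_shot_prompt` is a hypothetical helper and the two example sentences are invented for illustration.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and a new input into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # End with the unlabeled query so the model fills in the label.
    lines.append(f"Text: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


few_shot_prompt = build_few_shot_prompt(
    "Classify the text into neutral, negative or positive.",
    [
        ("The room was far too cold.", "negative"),  # invented example
        ("I learned a lot today!", "positive"),  # invented example
    ],
    "I think the workshop is okay.",
)
print(few_shot_prompt)
```

The string can be passed to `local_llm` like any other prompt.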

In [218]:
# @title Chain-of-thought prompting

prompt_no_cot_formatted = """
<|im_start|>user
Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls.
How many tennis balls does he have now?
<|im_end|>
<|im_start|>assistant
The answer is 11.
<|im_end|>
<|im_start|>user
The cafetaria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
<|im_end|>
<|im_start|>assistant
"""
print(local_llm(prompt_no_cot_formatted, apply_template=False, verbose=True))

Prompt:

<|im_start|>user
Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls.
How many tennis balls does he have now?
<|im_end|>
<|im_start|>assistant
The answer is 11.
<|im_end|>
<|im_start|>user
The cafetaria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
<|im_end|>
<|im_start|>assistant

After using 20 apples, they had 23 - 20 = 3 apples left. Then, they bought 6 more apples, so they had a total of 3 + 6 = 9 apples.


In [220]:
prompt_cot_formatted =  """
<|im_start|>user
Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls.
How many tennis balls does he have now?
<|im_end|>
<|im_start|>assistant
Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
<|im_end|>
<|im_start|>user
The cafetaria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
<|im_end|>
<|im_start|>assistant
"""
print(local_llm(prompt_cot_formatted, apply_template=False))

The cafetaria started with 23 apples. They used 20 for lunch, which left them with 23 - 20 = 3 apples. Then they bought 6 more apples, so they had a total of 3 + 6 = 9 apples. The answer is 9.


In [239]:
# @title Zero-shot chain-of-thought

prompt_cot_zs = """
<|im_start|>user
The cafetaria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
<|im_end|>
<|im_start|>assistant
Let's think step by step: """
print(local_llm(prompt_cot_zs, apply_template=False))



1. The cafeteria initially has 23 apples.
2. They use 20 apples to make lunch. So, we need to subtract these from the initial amount: 23 - 20 = 3.
3. Then, they buy 6 more apples. We need to add these to the remaining amount: 3 + 6 = 9.
4. Therefore, the cafeteria now has 9 apples.

The final answer is that the cafeteria has 9 apples.



# Scaling up

- Prompt template
- Structured output
- Retry until structure is valid
- External APIs


In [222]:
# @title Fetch Data and Load Into Pandas
%%capture

!wget "http://datascience.web.rug.nl/llm_parliamentary_sample.csv" -P $working_dir

import pandas as pd
df = pd.read_csv(working_dir / "llm_parliamentary_sample.csv")

In [240]:
df.query("votes_diff > 0").head()

| Unnamed: 0 | url | title | department | date_submitted | date_answered | question_speaker | question_position | question_text | answer_speaker | answer_position | answer_text | votes_answered | votes_notanswered | votes_diff | attachment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8 | https://www.theyworkforyou.com/wrans/?id=2022-... | Detention Centres: Manston | Home Office | 2022-11-29 | 2023-01-03 | Stephen Kinnock | Shadow Minister (Home Office) (Immigration) | To ask the Secretary of State for the Home Dep... | Robert Jenrick | The Minister for Immigration | .The HMIP report lists 6 Priority Concerns and... | 1 | 2 | 1 | |
| 13 | https://www.theyworkforyou.com/wrans/?id=2022-... | Aircraft: Air Conditioning | Department for Transport | 2022-12-16 | 2023-01-03 | Baroness Bennett of Manor Castle | Green | To ask His Majesty's Government, with regards ... | Baroness Vere of Norbiton | Parliamentary Under-Secretary (Department for ... | The UK is rightly proud of its excellent recor... | 1 | 2 | 1 | |
| 20 | https://www.theyworkforyou.com/wrans/?id=2022-... | Hypothyroidism: Prescriptions | Department of Health and Social Care | 2022-12-15 | 2023-01-03 | Lord Hunt of Kings Heath | Labour | To ask His Majesty's Government what discussio... | Lord Markham | The Parliamentary Under-Secretary for Health a... | There are no current plans to have discussions... | 1 | 2 | 1 | |
| 29 | https://www.theyworkforyou.com/wrans/?id=2022-... | Short-term Holding Facilities: Manston | Home Office | 2022-12-15 | 2023-01-03 | Lord Rosser | Shadow Spokesperson (Home Affairs), Shadow Spo... | To ask His Majesty's Government what has been ... | Lord Murray of Blidworth | The Parliamentary Under-Secretary of State for... | The costs of advice cannot be accurately calcu... | 0 | 1 | 1 | |
| 35 | https://www.theyworkforyou.com/wrans/?id=2022-... | Undocumented Migrants: English Channel | Home Office | 2022-12-15 | 2023-01-03 | Lord Rosser | Shadow Spokesperson (Home Affairs), Shadow Spo... | To ask His Majesty's Government, further to th... | Lord Murray of Blidworth | The Parliamentary Under-Secretary of State for... | There are no plans to publish further details ... | 0 | 1 | 1 | |


In [241]:
first_row = df.query("votes_diff > 0").iloc[0]

In [245]:
# @title Prompt templates

prompt_template = """
I will provide you a question and a response given in a parliamentary setting.

The question:
{question}

The answer:
{answer}

Does the response sufficiently answer the question?

Return your answer as a valid JSON object with a single field `final_answer`
holding a boolean value, like {{"final_answer": …}}.
"""

prompt = prompt_template.format(
    question=first_row["question_text"].strip(),
    answer=first_row["answer_text"].strip()
)

response = local_llm(prompt, verbose=True)

Prompt:

<|im_start|>user

I will provide you a question and a response given in a parliamentary setting.

The question:
To ask the Secretary of State for the Home Department, what steps her Department took to act on the findings of the report by the Chief Inspector of Prisons into conditions at Manston asylum centre published in July 2022 which indicated that the facilities at Manston for managing people with infectious diseases were poor.

The answer:
.The HMIP report lists 6 Priority Concerns and 8 further Key Concerns which HMIP inspectors felt required addressing at Manston and Western Jetfoil. While one of the Priority Concerns (Priority Concern 3) referenced weaknesses in the governance of health care processes, no specific mention was made in any of the concerns about facilities at Manston for managing people with infectious diseases.The Home Office developed a Service Improvement Plan in response to the 14 Concerns listed in the report, and worked quickly with its medical cont

In [246]:
formatted_prompt_template = """
<|im_start|>user
I will provide you a question and a response given in a parliamentary setting.

The question:
{question}

The answer:
{answer}

Does the response sufficiently answer the question?

Return your answer as a valid JSON object with a single field `final_answer`
holding a boolean value, like {{"final_answer": …}}.
<|im_end|>
<|im_start|>assistant
"""

formatted_prompt = formatted_prompt_template.format(
    question=first_row["question_text"].strip(),
    answer=first_row["answer_text"].strip()
)
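
The same template can be filled in for every row rather than just the first one. A minimal sketch using plain dictionaries so that it runs without the dataset: the rows below are placeholders standing in for `df.to_dict("records")`, the template is shortened, and `build_prompts` is a hypothetical helper.

```python
PROMPT_TEMPLATE = """I will provide you a question and a response given in a parliamentary setting.

The question:
{question}

The answer:
{answer}

Does the response sufficiently answer the question?
"""


def build_prompts(rows, template=PROMPT_TEMPLATE):
    """Fill the template once per row; rows are dicts keyed by the CSV's column names."""
    return [
        template.format(
            question=row["question_text"].strip(),
            answer=row["answer_text"].strip(),
        )
        for row in rows
    ]


# Placeholder rows standing in for `df.to_dict("records")`.
rows = [
    {"question_text": " Placeholder question 1 ", "answer_text": " Placeholder answer 1 "},
    {"question_text": " Placeholder question 2 ", "answer_text": " Placeholder answer 2 "},
]
prompts = build_prompts(rows)
print(len(prompts))
print(prompts[0])
```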

In [247]:
response = local_llm(formatted_prompt + "Let's think step by step: ", apply_template=False, verbose=True)
print("\nLLM answer: ")
print(response)

Prompt:

<|im_start|>user
I will provide you a question and a response given in a parliamentary setting.

The question:
To ask the Secretary of State for the Home Department, what steps her Department took to act on the findings of the report by the Chief Inspector of Prisons into conditions at Manston asylum centre published in July 2022 which indicated that the facilities at Manston for managing people with infectious diseases were poor.

The answer:
.The HMIP report lists 6 Priority Concerns and 8 further Key Concerns which HMIP inspectors felt required addressing at Manston and Western Jetfoil. While one of the Priority Concerns (Priority Concern 3) referenced weaknesses in the governance of health care processes, no specific mention was made in any of the concerns about facilities at Manston for managing people with infectious diseases.The Home Office developed a Service Improvement Plan in response to the 14 Concerns listed in the report, and worked quickly with its medical contr

In [248]:
import re

json_expression = re.compile(r"\{.+?\}", re.DOTALL)

In [250]:
answers = json_expression.findall(response)

In [251]:
import json

json.loads(answers[0])

{'final_answer': False}
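
The model does not always return well-formed JSON, so `json.loads(answers[0])` can fail, either because no `{...}` was found or because its contents do not parse. A minimal sketch of the "retry until structure is valid" idea: `generate` is any function that maps a prompt to text (for instance `local_llm`), and `extract_final_answer` and `ask_until_valid` are hypothetical helpers. A canned stand-in replaces the model here so the example runs on its own.

```python
import json
import re

JSON_OBJECT = re.compile(r"\{.+?\}", re.DOTALL)


def extract_final_answer(text):
    """Return the boolean `final_answer` from the first valid JSON object, or None."""
    for candidate in JSON_OBJECT.findall(text):
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and isinstance(parsed.get("final_answer"), bool):
            return parsed["final_answer"]
    return None


def ask_until_valid(generate, prompt, max_attempts=3):
    """Call `generate(prompt)` until the reply contains a valid answer."""
    for _ in range(max_attempts):
        answer = extract_final_answer(generate(prompt))
        if answer is not None:
            return answer
    raise ValueError(f"No valid JSON answer after {max_attempts} attempts")


# Stand-in for the model: fails once, then returns well-formed JSON.
canned_replies = iter(["I am not sure.", 'Reasoning... {"final_answer": false}'])
print(ask_until_valid(lambda _prompt: next(canned_replies), "dummy prompt"))
```

Retrying only helps when sampling is not deterministic, so a temperature above 0 is assumed; capping the number of attempts keeps a stubborn prompt from looping forever.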