## **LLM exploration - Alternative to OpenAI GPT API**

This notebook evaluates open-source LLMs hosted on Hugging Face as alternatives to OpenAI's GPT APIs. The goal is to cut costs by using free, open-source models while maintaining accuracy on the summarization task. In particular, the notebook focuses on summarizing news articles.

### **I. Sample Data**

Hard-code sample news data extracted by the original notebook of my project.

In [1]:
news_data = [
    {
        "Title": "2nd Senate panel backs phaseout of Pogos",
        "Content": "\n The Senate committee on public order and dangerous drugs is supporting the proposed phaseout of Philippine offshore gaming operators (Pogos) but is leaving the decision to Malacañang, which can opt to retain them under more stringent regulations, Sen. Ronald dela Rosa said on Saturday.\n In a radio interview, Dela Rosa, committee chair, said that while he agrees with the findings of the Senate committee on ways and means for the phaseout of Pogos, his panel is recommending a longer period of one to two years to allow Pogos to wind down their operations.\n “I signed the [ways and means] committee report to allow it to be endorsed to the plenary for discussions, but I have manifested some reservation and that I will interpellate because what it recommended is outright banning,” he said.\n Dela Rosa’s committee is the second of two panels that conducted a joint investigation of the Pogo industry to look into the economic benefits derived from it and weighed them against the crimes caused by its presence.\n The senator said that while his committee has yet to release the committee report, it was inclined to support a gradual phaseout of Pogos, but for a period longer than the three months proposed by the other committee, chaired by Sen. Sherwin Gatchalian.\n “I’m still for the phasing out of Pogo, but not outright. It must have a phasing out period,” he said in a radio interview.\n According to Dela Rosa, he is recommending an outright ban on illegal Pogo firms, but the legitimate ones must be given a one to two year extension to give them time to “adjust” and find alternatives to their industry.\n Dela Rosa said he would withdraw his signature to the Gatchalian panel’s report if it pushes an immediate ban on Pogos, saying it would be “unfair” to Pogo investors.\n “For my part, I will not vote for outright banning because I am for a gradual phaseout to allow all stakeholders to adjust,” he said.\n In the event Malacañang allows the continued operation of Pogos, Dela Rosa said his panel would propose that all its operations be done inside a “Pogo zone” for a more stringent law enforcement.\n “All regulatory activities by the [Bureau of] Immigration, Pagcor (Philippine Gaming and Amusement Corp.) Philippine National Police and National Bureau of Investigation shall be undertaken within this Pogo zone,” he said.\n On Tuesday evening, Gatchalian filed his committee report after months of delay.\n The committee concluded that the social costs brought by Pogo operations outweighed the purported benefits from them in the form of tax payments and income from business rentals.\n ",
    },
    {
        "Title": "AFP expects 2 million reserves per year from ROTC revival",
        "Content": "\n Around 2 million will be added every year to the reserve force of the Armed Forces of the Philippines once the law making the Reserve Officers’ Training Corps (ROTC) mandatory again is passed by Congress.\n “Every year, if ROTC becomes mandatory, we expect an additional 2 million students from all universities,” Maj. Gen. Joel Alejandro Nacnac, deputy chief of staff for reservists and retiree affairs of the AFP, told a press conference in Camp Aguinaldo on Friday as the AFP observes the National Reservists Week..\n According to Nacnac, the projected servicemen from the ROTC program will be classified as part of the “standby reserve” forces, who may be mobilized or ordered to active duty in times of national emergency or war.\n They are different from the “ready reserve,” who may be called at any time to augment the regular armed force of the military “not only in times of war or national emergency but also to meet local emergencies arising from calamities, disasters and threats to peace, order, security and stability in any locality.”\n As of June, the AFP reserve force was estimated at 1.2 million, composed of 71,000 ready reservists, 1.1 million standby reservists, and more than 15,000 affiliated units from other organizations.\n Maj. Gen. Joseph Ferrous Cuison, head of the Philippine Naval Reserve Command, said they were employing the assistance of Philippine Navy affiliated reserve unit, composed of crews from fishing and shipping companies, in monitoring the West Philippine Sea.\n “Right now, we cannot confront head-on the maritime militia of China, so what we do right now is just monitor. They report to us on what is happening there,” he said.\n ",
    },
    {
        "Title": "Hackers attack PhilHealth’s website, systems",
        "Content": "\n MANILA, Philippines — Computer hackers attacked the website and online application of the Philippine Health Insurance Corp. (PhilHealth) on Friday, taking down the systems and blocking access for more than 24 hours.\n In a statement on Saturday, PhilHealth said it has started containment measures as well as an investigation of the “information security incident,” together with the Department of Information and Communications Technology and other concerned government agencies “to assess its extent.”\n “While investigation is being undertaken, affected systems shall be temporarily shut down to secure our application systems,” PhilHealth president and CEO Emmanuel Ledesma said.\n The head of the state insurer appealed for understanding and assured that “we will get to the bottom of this and will institute stronger systems to prevent this from happening again.”\n The Inquirer reached out to multiple officials and staff of PhilHealth to confirm whether the incident was a Medusa ransomware attack, but they have yet to reply.\n A ransomware is a cyberattack that holds one entity’s data or system hostage until a ransom is paid.\n In a February 2023 report by the US Department of Health and Human Services, MedusaLocker, a ransomware variant, was able to “infect and encrypt systems, primarily targeting the health-care sector, after it was first detected in September 2019.\n MedusaLocker, deemed as “lesser known but potent,” leveraged the disorder and confusion during the COVID-19 pandemic to launch attacks, the report said.\n According to a cybersecurity company last week, a firm usually spends about P55 million or about $1 million to resolve a single data breach and pay off ransom to regain system access.\n ",
    },
]

In [2]:
# Test using 1st sample
title = news_data[0]["Title"]
content = news_data[0]["Content"]

### **II. LLM Models**

**OPTION 1**: Using the GPT-3.5 API of OpenAI
- **LIMITATION**: Neither free nor open-source
    - This is the main reason for exploring other LLM options

In [3]:
# pip install -U langchain

In [4]:
# pip install openai

In [35]:
import os

import openai
from langchain import LLMChain, OpenAI, PromptTemplate

# Enter your API key from https://platform.openai.com/account/api-keys
# (the value below is a placeholder; never commit a real key)
os.environ["OPENAI_API_KEY"] = "sk-###############"
openai.api_key = os.getenv("OPENAI_API_KEY")


template = """
    You are an expert journalist and a reader that can highlight the most important points written in a news article; 
    You need to summarize a news article in just 1 sentence similar to a headline but more descriptive. 
    Add a second sentence only if necessary. Limit the total words between 20 to 30 words and do not use too many commas. Make it sound logical;

    TITLE: {title}
    CONTENT: {content}
"""

prompt = PromptTemplate(template=template, input_variables=["title", "content"])

summarizer_llm = LLMChain(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.5), prompt=prompt, verbose=True
)

summarized_news = summarizer_llm.predict(title=title, content=content)
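
The prompt above asks for a summary of 20 to 30 words. A minimal sketch of how that constraint could be checked on any model's output, assuming words are simply whitespace-delimited tokens. The function names `count_words` and `within_word_limit` are hypothetical helpers, not part of the original project, and the example string is made up rather than real model output.

```python
import re


def count_words(text: str) -> int:
    """Count whitespace-delimited tokens that contain at least one letter or digit."""
    return sum(1 for token in text.split() if re.search(r"\w", token))


def within_word_limit(summary: str, min_words: int = 20, max_words: int = 30) -> bool:
    """Return True when the summary length falls inside the inclusive word range."""
    return min_words <= count_words(summary) <= max_words


# Made-up example summary (not real model output)
example = "Senate panel backs a gradual Pogo phaseout."
print(count_words(example), within_word_limit(example))
```

The same check can be applied to `summarized_news` from every option in this notebook, which gives one simple, comparable pass/fail signal per model.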

**OPTION 2**: Using Llama 2 in LangChain
- Link: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
    - **IMPORTANT NOTE**: You must request access from Meta before using this model on HF. Link: https://ai.meta.com/resources/models-and-libraries/llama-downloads/
- **LIMITATION**: Computationally expensive at inference time (resource-constrained machines will struggle)
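
To make the resource limitation concrete, here is a rough back-of-the-envelope sketch of the memory needed just to hold the model weights at different precisions. It assumes a nominal 7 billion parameters (the "7B" in the model name, not the exact count) and ignores activations, the KV cache, and framework overhead, so real usage is higher. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7e9  # nominal "7B"; an assumption, not the exact parameter count

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(N_PARAMS, bits):5.1f} GiB")
```

For comparison, the `nvidia-smi` output later in this notebook reports 4096MiB of GPU memory, so by this estimate only a 4-bit quantized copy of the weights comes close to fitting on that GPU.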

**NOTE FOR FIRST-TIME USE OF HUGGING FACE IN PYTHON**: 
- The first time you use a model hosted on Hugging Face, you will be asked for a User Access Token
    - Learn more: https://huggingface.co/docs/huggingface_hub/quick-start
- The first run of any code that uses the ```model``` downloads it to your local machine. To see the current file sizes, check the "Files and versions" tab on the model's page.
- Downloaded models are saved in a cache folder (the path shown below is from a Windows machine). Run the code below to locate it
    - Learn more: https://huggingface.co/docs/huggingface_hub/guides/manage-cache

In [None]:
from huggingface_hub import login

login() # Enter token here from https://huggingface.co/settings/tokens

In [None]:
# Locates cache folder of HF where the models were downloaded
from transformers import file_utils

print(file_utils.default_cache_path)

C:\Users\Earl/.cache\huggingface\hub


In [30]:
# pip install git+https://github.com/huggingface/transformers

In [None]:
# pip install --upgrade huggingface_hub

In [None]:
# pip install -q langchain transformers accelerate bitsandbytes

Note: you may need to restart the kernel to use updated packages.


In [31]:
import textwrap

import torch
from langchain import HuggingFacePipeline, LLMChain, PromptTemplate
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
    torch_dtype=torch.float32,
    #  load_in_4bit=True,
    #  bnb_4bit_quant_type="nf4",
    #  bnb_4bit_compute_dtype=torch.float16
)

# The model is already loaded above, so dtype and device placement are set
# there rather than repeated (with a conflicting dtype) here
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    top_k=30,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

In [None]:
# Llama 2 chat prompt markers. These must be defined before get_prompt below,
# because its default argument is evaluated when the function is defined
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
DEFAULT_SYSTEM_PROMPT = """\
You are an expert journalist that can highlight the most important points in a news article.
"""

In [36]:
def get_prompt(instruction, new_system_prompt=DEFAULT_SYSTEM_PROMPT, citation=None):
    # Wrap the system prompt and the instruction in the Llama 2 chat markers
    SYSTEM_PROMPT = B_SYS + new_system_prompt + E_SYS
    prompt_template = B_INST + SYSTEM_PROMPT + instruction + E_INST

    if citation:
        prompt_template += f"\n\nCitation: {citation}"  # Insert citation here

    return prompt_template


def cut_off_text(text, prompt):
    # Keep only the text that precedes the first occurrence of the cutoff phrase
    cutoff_phrase = prompt
    index = text.find(cutoff_phrase)
    if index != -1:
        return text[:index]
    else:
        return text


def remove_substring(string, substring):
    return string.replace(substring, "")


def generate(text, citation=None):
    prompt = get_prompt(text, citation=citation)
    # Move the input tensors to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            # Budget for generated tokens only; max_length would also count the
            # prompt, which a full news article can exceed on its own
            max_new_tokens=512,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )
        final_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
        # Cut at the end-of-sequence marker, if it is still present
        final_outputs = cut_off_text(final_outputs, "</s>")
        final_outputs = remove_substring(final_outputs, prompt)

    return final_outputs


def parse_text(text):
    # Print the text wrapped at 100 characters for readability
    wrapped_text = textwrap.fill(text, width=100)
    print(wrapped_text + "\n\n")

In [34]:
llm = HuggingFacePipeline(
    pipeline=pipe, model_kwargs={"temperature": 0.7, "max_length": 256, "top_k": 50}
)

system_prompt = "You are an expert journalist that can highlight the most important points in a news article."
instruction = """You need to summarize a news article in just 1 sentence similar to a headline but more descriptive. 
Add a second sentence only if necessary. Limit the total words between 20 to 30 words and do not use too many commas. 
Make it sound logical:\n\n CONTENT: {content}"""

template = get_prompt(instruction, system_prompt)

In [42]:
prompt = PromptTemplate(template=template, input_variables=["content"])
summarizer_llm = LLMChain(prompt=prompt, llm=llm, verbose=False)

summarized_news = summarizer_llm.run(content)
print(summarized_news)

**OPTION 2.2**: Using QUANTIZED Llama 2 in LangChain
- Link: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML
- **LIMITATION**: I found no obvious control over the maximum context (token) length, so the model hallucinates heavily.
    - It MIGHT NEED FINE-TUNING, but the quantized model does not seem to expose many parameters for fine-tuning (there may be some, perhaps inherited from the original Llama 2 model; this still needs checking)
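
One workaround worth trying for the context-length problem is to shorten the article before it reaches the prompt. The sketch below truncates by whitespace-delimited words, which is only a rough proxy for tokens; the helper name `truncate_words` and the budget used in the example are assumptions for illustration, not values taken from the model.

```python
def truncate_words(text: str, max_words: int) -> str:
    """Keep at most max_words whitespace-delimited words from the start of text."""
    return " ".join(text.split()[:max_words])


# Illustrative input and a hypothetical budget of 5 words
article = "\n Computer hackers attacked the website and online application on Friday.\n "
print(truncate_words(article, 5))
```

News articles usually put the key facts in the opening paragraphs, so keeping the start of the text is a reasonable first attempt, though it can drop details that appear later.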

In [36]:
# pip install ctransformers

In [49]:
from langchain import LLMChain, PromptTemplate
from langchain.llms import CTransformers

# from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

llm = CTransformers(
    model="TheBloke/Llama-2-7B-Chat-GGML",
    model_file="llama-2-7b-chat.ggmlv3.q8_0.bin",
    # Generation settings are passed through `config`, for example:
    # config={"temperature": 1, "max_new_tokens": 1024, "top_k": 20},
)


In [52]:
template = """
[INST] <<SYS>>
You are an expert journalist. Summarize the given news article using only between 20 to 30 words.
<</SYS>>
CONTENT:{content}
[/INST]
"""

prompt = PromptTemplate(template=template, input_variables=["content"])

summarizer_llm = LLMChain(prompt=prompt, llm=llm)
summarized_news = summarizer_llm.run(content)
print(summarized_news)

In [28]:
# Test an edge case: input shorter than 20 words
# (the template above only uses {content}, so no title is passed)
summarized_news = summarizer_llm.run(
    content="Hello world I am starting to do LLM now and it's not been a great journey so far",
)
summarized_news

'Sure, here\'s a possible summary of the article in 1 sentence:\n\n"Starting an LLM program has been challenging for me so far, with difficulties in adjusting to academic life."'
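
The output above wraps the actual summary in a chatty lead-in line and a pair of quotes. A minimal post-processing sketch, assuming the lead-in is a single first line ending in a colon and the summary is wrapped in straight double quotes. Other output shapes would need different rules, and `strip_preamble` is a hypothetical helper name.

```python
import re


def strip_preamble(text: str) -> str:
    """Drop a leading "Sure, here's ...:" line and one pair of wrapping quotes, if present."""
    cleaned = text.strip()
    # Remove a first line that ends with a colon, plus the blank lines after it
    cleaned = re.sub(r"\A[^\n]*:\s*\n+", "", cleaned)
    # Remove one pair of wrapping double quotes
    if len(cleaned) >= 2 and cleaned[0] == '"' and cleaned[-1] == '"':
        cleaned = cleaned[1:-1]
    return cleaned.strip()


raw = (
    "Sure, here's a possible summary of the article in 1 sentence:\n\n"
    '"Starting an LLM program has been challenging for me so far, '
    'with difficulties in adjusting to academic life."'
)
print(strip_preamble(raw))
```

Text without such a lead-in is returned unchanged apart from surrounding whitespace, so the helper is safe to apply to every model's output.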

**OPTION 3**: Using Gensim
- **LIMITATION**: Very slow, and purely extractive: it just copies the sentences it considers the highlights of the article. Not an advanced model by today's NLP standards
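
Runtime comes up as a limitation for several options in this notebook, so it helps to measure it the same way each time. A minimal timing sketch using only the standard library; `time_summarizer` is a hypothetical helper and `first_sentence` is a stand-in summarizer used purely so the example runs without any model.

```python
import time
from typing import Callable, Dict, Tuple


def time_summarizer(fn: Callable[[str], str], text: str) -> Tuple[str, float]:
    """Run fn(text) once and return (summary, elapsed_seconds)."""
    start = time.perf_counter()
    summary = fn(text)
    return summary, time.perf_counter() - start


def first_sentence(text: str) -> str:
    """Stand-in summarizer (hypothetical): returns the text up to the first period."""
    return text.split(".")[0].strip() + "."


summarizers: Dict[str, Callable[[str], str]] = {"first_sentence": first_sentence}
sample = "Hackers attacked the website. Systems were taken down."
for name, fn in summarizers.items():
    summary, seconds = time_summarizer(fn, sample)
    print(f"{name}: {seconds:.4f}s -> {summary}")
```

Any of the summarizers in this notebook could be added to the dictionary as a function that takes the article text and returns a string.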

In [37]:
# pip install gensim==3.6.0

In [None]:
# gensim.summarization was removed in Gensim 4.0, hence the pinned 3.x install above
from gensim.summarization.summarizer import summarize

summarized_news = summarize(content, word_count=30)
print(summarized_news)

**OPTION 4**: Using Falcon 7B in LangChain
- Link: https://huggingface.co/tiiuae/falcon-7b-instruct
- **LIMITATION**: Could not be loaded on my machine (reason not yet identified). Even if it loads, many users report it being around 10x slower than Llama 2

In [41]:
# pip install einops

In [40]:
# pip install -q langchain transformers accelerate bitsandbytes

In [42]:
# pip install --upgrade torch

In [11]:
!nvidia-smi

Sun Sep 24 12:03:51 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 537.42                 Driver Version: 537.42       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 3050 ...  WDDM  | 00000000:01:00.0  On |                  N/A |
| N/A   51C    P8               4W /  60W |    194MiB /  4096MiB |     20%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [43]:
import torch
import transformers
from langchain import HuggingFacePipeline, LLMChain, PromptTemplate
from transformers import AutoModelForCausalLM, AutoTokenizer

model = "tiiuae/falcon-7b-instruct"  # tiiuae/falcon-40b-instruct

tokenizer = AutoTokenizer.from_pretrained(model)

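# Use transformers.pipeline explicitly, since `pipeline` itself is not imported in this cell
pipeline = transformers.pipeline(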
    "text-generation",  # task
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)


In [2]:
# WARNING - made my pc crash
# from transformers import AutoModelForCausalLM

# model_id="tiiuae/falcon-7b-instruct"
# tokenizer=AutoTokenizer.from_pretrained(model_id)
# model=AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

In [None]:
llm = HuggingFacePipeline(pipeline=pipeline, model_kwargs={"temperature": 0.5})

template = """
    You are an expert journalist and a reader that can highlight the most important points written in a news article; 
    You need to summarize a news article in just 1 sentence similar to a headline but more descriptive. 
    Add a second sentence only if necessary. Limit the total words between 20 to 30 words and do not use too many commas. Make it sound logical;

    TITLE: {title}
    CONTENT: {content}
"""
prompt = PromptTemplate(template=template, input_variables=["title", "content"])

summarizer_llm = LLMChain(prompt=prompt, llm=llm)
summarized_news = summarizer_llm.run(title=title, content=content)
print(summarized_news)

**OPTION 5**: Using BART from Hugging Face
- Link: https://huggingface.co/facebook/bart-large-cnn
- **LIMITATION**: Very long inference time, and the output is always just a snippet of the original text, so it is a poor summary
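
The "just a snippet of the original text" observation can be quantified. The sketch below computes the fraction of a summary's word trigrams that appear verbatim in the source: values near 1.0 mean the summary is mostly copied, values near 0.0 mean it is mostly rephrased. This is a simple heuristic chosen for illustration, not an established metric from the original project, and the function names and example strings are hypothetical.

```python
from typing import List, Tuple


def word_ngrams(text: str, n: int) -> List[Tuple[str, ...]]:
    """Lower-case the text, split on whitespace, and return its word n-grams in order."""
    words = text.lower().split()
    return [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]


def copied_ngram_ratio(source: str, summary: str, n: int = 3) -> float:
    """Fraction of the summary's n-grams that also occur verbatim in the source."""
    source_ngrams = set(word_ngrams(source, n))
    summary_ngrams = word_ngrams(summary, n)
    if not summary_ngrams:
        return 0.0
    copied = sum(1 for gram in summary_ngrams if gram in source_ngrams)
    return copied / len(summary_ngrams)


source = "Hackers attacked the website and took down the systems"
print(copied_ngram_ratio(source, "hackers attacked the website"))
print(copied_ngram_ratio(source, "The site was hit by an attack"))
```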

In [44]:
# pip install transformers

In [45]:
from transformers import pipeline

summarizer_llm = pipeline("summarization", model="facebook/bart-large-cnn")

summary = summarizer_llm(content, max_length=130, min_length=30, do_sample=False)
summarized_news = summary[0]["summary_text"]
print(summarized_news)

**OPTION 6**: Using a fine-tuned LongT5 transformer from Hugging Face
- Link: https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary
- The best-performing small LLM so far! Next steps for improvement:
    - Fine-tune further on the cnn_news dataset and other open-source HF datasets where the input is the entire news article and the target is its summary
    - Experiment with different beam search hyperparameters
        - Reference: https://huggingface.co/blog/how-to-generate
- **LIMITATION**: Cannot yet capture the full context of a news article
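
For the beam search experiments, a small grid of generation settings can be built up front and tried one by one. The sketch below only constructs the grid; the parameter names (`num_beams`, `no_repeat_ngram_size`, `length_penalty`) are standard `transformers` generation arguments, while the specific values and the helper name `build_grid` are assumptions chosen for illustration.

```python
from itertools import product
from typing import Any, Dict, List

# Illustrative search space; the values are assumptions, not tuned settings
search_space: Dict[str, List[Any]] = {
    "num_beams": [2, 4, 8],
    "no_repeat_ngram_size": [2, 3],
    "length_penalty": [0.8, 1.0],
}


def build_grid(space: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return one dict per combination of values in the search space."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


grid = build_grid(search_space)
print(len(grid), grid[0])
```

Each dictionary could then be passed as keyword arguments to the summarizer call (for example `summarizer_llm(content, **params)`), assuming the pipeline forwards them to generation.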

In [47]:
from transformers import pipeline

summarizer_llm = pipeline(
    "summarization",
    "pszemraj/long-t5-tglobal-base-16384-book-summary",
)

summary = summarizer_llm(content)
summarized_news = summary[0]["summary_text"]
print(summarized_news)

Your max_length is set to 512, but your input_length is only 368. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=184)


PhilHealth says it's been attacked by a bunch of ransomware, and they've taken down its entire site and app. Then they shut it down for 24 hours to make sure no one can get into it again.
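
The warning above suggests `max_length=184` for an input of 368 tokens, which is half the input length. A small sketch of that rule, assuming the input token count is already known (for example from the tokenizer). The `ratio` and `minimum` defaults and the helper name `suggest_max_length` are assumptions for illustration.

```python
def suggest_max_length(input_length: int, ratio: float = 0.5, minimum: int = 16) -> int:
    """Pick a summary max_length as a fraction of the input length, with a floor."""
    return max(minimum, int(input_length * ratio))


print(suggest_max_length(368))
```

The result could be passed as `max_length` when calling the summarizer, so that short articles do not trigger the warning.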
