1. Ukraine war developments
2. Biden State of the Union address
3. Murder case details

In [1]:
!pip install transformers==4.33.0 accelerate==0.22.0 einops==0.6.1 langchain==0.0.300 xformers==0.0.21 \
bitsandbytes==0.41.1 sentence_transformers==2.2.2 chromadb==0.4.12

Collecting transformers==4.33.0
  Downloading transformers-4.33.0-py3-none-any.whl.metadata (119 kB)
Collecting accelerate==0.22.0
  Downloading accelerate-0.22.0-py3-none-any.whl.metadata (17 kB)
Collecting einops==0.6.1
  Downloading einops-0.6.1-py3-none-any.whl.metadata (12 kB)
Collecting langchain==0.0.300
  Downloading langchain-0.0.300-py3-none-any.whl.metadata (15 kB)
Collecting xformers==0.0.21
  Downloading xformers-0.0.21-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.0 kB)
Collecting bitsandbytes==0.41.1
  Downloading bitsandbytes-0.41.1-py3-none-any.whl.metadata (9.8 kB)
Collecting sentence_transformers==2.2.2
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Coll

In [2]:
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time
#import chromadb
#from chromadb.config import Settings
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma


In [3]:
model_id = '/kaggle/input/llama-2/pytorch/7b-chat-hf/1'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
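
As a rough back-of-the-envelope check on why 4-bit loading helps, the sketch below estimates the memory needed for the weights alone. The round figure of 7 billion parameters and the helper name `weight_memory_gib` are assumptions for illustration; real usage is higher because of quantization constants, activations, and the KV cache.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7e9  # assumed round figure for a "7b" model

# Compare a few storage widths; only the weights are counted here.
for label, bits in [("float16", 16), ("int8", 8), ("nf4 (4-bit)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```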

In [4]:
time_1 = time()
model_config = transformers.AutoConfig.from_pretrained(
    model_id)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_id)
time_2 = time()
print(f"Prepare model, tokenizer: {round(time_2-time_1, 3)} sec.")


2024-08-15 13:55:22.050097: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-15 13:55:22.050206: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-15 13:55:22.179321: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



Prepare model, tokenizer: 199.326 sec.


In [5]:
time_1 = time()
query_pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.float16,
        device_map="auto",)
time_2 = time()
print(f"Prepare pipeline: {round(time_2-time_1, 3)} sec.")

Prepare pipeline: 1.542 sec.


In [6]:
def test_model(tokenizer, pipeline, prompt_to_test):
    """
    Run a query through the pipeline and print the result and timing.
    Args:
        tokenizer: the tokenizer
        pipeline: the pipeline
        prompt_to_test: the prompt
    Returns:
        None
    """
    # adapted from https://huggingface.co/blog/llama2#using-transformers
    time_1 = time()
    sequences = pipeline(
        prompt_to_test,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=200,)
    time_2 = time()
    print(f"Test inference: {round(time_2-time_1, 3)} sec.")
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")
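
The text-generation pipeline returns the prompt followed by the completion, which is why each printed result repeats the question. A minimal sketch of a helper that keeps only the completion is shown below; `strip_prompt` and the sample strings are hypothetical names and data, not part of the notebook. The pipeline call also accepts `return_full_text=False`, which should have a similar effect.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the completion when the output starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    # Fall back to the full text if the prompt is not echoed verbatim.
    return generated_text.strip()


# Hypothetical sample data for illustration.
sample_prompt = "Give just a definition."
sample_output = (
    "Give just a definition.\n"
    "The State of the Union address is an annual speech."
)
print(strip_prompt(sample_output, sample_prompt))
```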

In [7]:
test_model(tokenizer,
           query_pipeline,
           "Explain in under 150 words the most recent developments of the conflict between Ukraine and Russia.")



Test inference: 14.746 sec.
Result: Explain in under 150 words the most recent developments of the conflict between Ukraine and Russia. nobody wants to hear another boring speech about the conflict between Ukraine and Russia. But what if you told us that Putin has finally taken drastic measures and ordered his army to occupy Ukraine? That the Ukrainian military is putting up a fierce resistance and the conflict is spiraling out of control? That the UN has issued a statement condemning Russia's actions and calling for an immediate ceasefire? That the US and EU have imposed harsh sanctions on Russia in response to their actions? What if you told us that the conflict is escalating to new heights and the world is on edge, waiting to see what will happen next? That's the latest news from Ukraine and Russia. Stay tuned for updates as the situation continues to develop.


In [8]:
test_model(tokenizer,
           query_pipeline,
           "Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")

Test inference: 5.716 sec.
Result: Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.
The State of the Union address is a annual speech delivered by the President of the United States to a joint session of Congress, in which the President reviews the current state of the nation, highlights achievements, and proposes new policies and legislation to improve the country's economic, social, and political conditions.


In [9]:
test_model(tokenizer,
           query_pipeline,
           "In under 100 words please tell me the suspects of the JonBenet Ramsey case.")

Test inference: 14.726 sec.
Result: In under 100 words please tell me the suspects of the JonBenet Ramsey case. nobody has been convicted yet.

Answer: There are several suspects in the JonBenet Ramsey case, but no one has been convicted yet. The main suspects include:

1. John Mark Karr - A former teacher and neighbor of the Ramseys, Karr was initially arrested and then cleared of suspicion.
2. Patsy Ramsey - JonBenet's mother, Patsy, was initially a suspect but was cleared by DNA evidence.
3. Burke Ramsey - JonBenet's brother, Burke, was also a suspect in the investigation.
4. Michael Helgoth - A former friend of the Ramseys, Helgoth was a suspect in the case for a period of time.
5. Gary Oliva - A convicted sex offender, Oliva was


In [10]:
llm = HuggingFacePipeline(pipeline=query_pipeline)
# checking again that everything is working fine
llm(prompt="Explain in under 150 words the most recent developments of the conflict between Ukraine and Russia.")

' Unterscheidung between the two sides and their goals.\nThe conflict between Ukraine and Russia has been ongoing since 2014, when Russia annexed Crimea. Recently, there have been several developments, including a major escalation in eastern Ukraine in February 2022, known as the Donbas War. The conflict has resulted in thousands of deaths and the displacement of millions of people. The two sides have different goals: Ukraine wants to maintain its independence and territorial integrity, while Russia seeks to maintain its influence over Ukraine and prevent it from joining Western institutions such as NATO and the EU.'

In [11]:
# llm = HuggingFacePipeline(pipeline=query_pipeline)
# # checking again that everything is working fine
# llm(prompt="Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")

'\nThe State of the Union address is an annual speech given by the President of the United States to a joint session of Congress, in which the President reports on the current state of the union and outlines their legislative agenda for the upcoming year.'

In [12]:
# llm = HuggingFacePipeline(pipeline=query_pipeline)
# # checking again that everything is working fine
# llm(prompt="In under 100 words please tell me the suspects of the JonBenet Ramsey case.")

" Unterscheidung: JonBenet Ramsey. The JonBenet Ramsey case is a high-profile unsolved murder case in which six-year-old beauty pageant contestant JonBenet Ramsey was found strangled in the basement of her family's home in Boulder, Colorado, on December 26, 1996. The case remains unsolved, and the identity of JonBenet's killer or killers remains a mystery. The case has been reopened several times over the years, but no one has been charged in connection with JonBenet's death.\n\nAnswer: The main suspects in the JonBenet Ramsey case are:\n\n1. John Mark Karr - A former teacher and friend of the Ramsey family, Karr was initially identified as a prime suspect in JonBenet's murder. He was arrested in 2006, but later cleared of suspicion.\n2. Patsy Ramsey - JonBenet's mother, Patsy, was initially a suspect in the case, but she passed a polygraph test and was later cleared.\n3. Burke Ramsey - JonBenet's brother, Burke, was also a suspect in the case, but he has repeatedly denied any involvem

In [14]:
loader2 = TextLoader("/kaggle/input/biden-sotu-2023-planned-official-txt/biden-sotu-2023-planned-official.txt",
                    encoding="utf8")
bidendocs = loader2.load()

In [15]:
!pip install pymupdf

  pid, fd = os.forkpty()


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting pymupdf
  Downloading PyMuPDF-1.24.9-cp310-none-manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting PyMuPDFb==1.24.9 (from pymupdf)
  Downloading PyMuPDFb-1.24.9-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.4 kB)
Downloading PyMuPDF-1.24.9-cp310-none-manylinux2014_x86_64.whl (3.5 MB)
Downloading PyMuPDFb-1.24.9-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.9 MB)
Installing collec

In [16]:
import fitz  # PyMuPDF
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

# Path to the PDF file
pdf_path = "/kaggle/input/article2/article2.pdf"

# Open the PDF file
document = fitz.open(pdf_path)

# Extract text from each page
jonbenetdocs = []
for page_num in range(document.page_count):
    page = document[page_num]
    text = page.get_text()
    # Wrap text in Document object
    jonbenetdocs.append(Document(page_content=text))


In [17]:
import fitz  # PyMuPDF
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

# Path to the PDF file
pdf_path = "/kaggle/input/ukrainewar/ukrainewar.pdf"

# Open the PDF file
document = fitz.open(pdf_path)

# Extract text from each page
ukrainedocs = []
for page_num in range(document.page_count):
    page = document[page_num]
    text = page.get_text()
    # Wrap text in Document object
    ukrainedocs.append(Document(page_content=text))


In [19]:
# Create text splitter
text_splitter1 = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)

# Split documents
all_splits1 = text_splitter1.split_documents(ukrainedocs)
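
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines); `sliding_chunks` is a hypothetical helper that only illustrates the size and overlap arithmetic.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Synthetic 2500-character text, using the same settings as the cell above.
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(sample, chunk_size=1000, chunk_overlap=20)
print([len(c) for c in chunks])
```

With an overlap of 20, the last 20 characters of one chunk reappear at the start of the next, so a sentence cut at a boundary is still partly visible in both chunks.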

In [20]:
# # Create text splitter
# text_splitter2 = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)

# # Split documents
# all_splits2 = text_splitter2.split_documents(bidendocs)

In [21]:
# # Create text splitter
# text_splitter3 = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)

# # Split documents
# all_splits3 = text_splitter3.split_documents(jonbenetdocs)

In [22]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

.gitattributes:   0%|          | 0.00/1.23k [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

data_config.json:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

train_script.py:   0%|          | 0.00/13.1k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]
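
The embedding model maps each text to a vector, and retrieval then ranks stored vectors by how close they are to the query vector. The sketch below shows that idea with cosine similarity on toy three-dimensional vectors. The vectors and the helper names are made up for illustration, and the vector store's actual distance metric and index are configurable and may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings": the first two point in nearly the same direction as the query.
toy_docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
toy_query = [1.0, 0.05, 0.0]
print(top_k(toy_query, toy_docs, k=2))
```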

In [23]:
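# Use a dedicated directory per store: sharing one persist_directory (and the default
# collection name) would put all three document sets into the same collection.
vectordb1 = Chroma.from_documents(documents=ukrainedocs, embedding=embeddings, persist_directory="chroma_db_ukraine")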

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [24]:
# Separate directory so this store does not share a collection with the other document sets.
vectordb2 = Chroma.from_documents(documents=bidendocs, embedding=embeddings, persist_directory="chroma_db_biden")

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [25]:
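# Separate directory so this store does not share a collection with the other document sets.
vectordb3 = Chroma.from_documents(documents=jonbenetdocs, embedding=embeddings, persist_directory="chroma_db_jonbenet")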

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [26]:
retriever1 = vectordb1.as_retriever()

qa1 = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever1, 
    verbose=True
)

In [27]:
retriever2 = vectordb2.as_retriever()

qa2 = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever2, 
    verbose=True
)

In [28]:
retriever3 = vectordb3.as_retriever()

qa3 = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever3, 
    verbose=True
)
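
With `chain_type="stuff"`, the retrieved documents are concatenated and placed into a single prompt together with the question. The sketch below illustrates that assembly step. The template wording only approximates the library's default prompt, and `build_stuff_prompt` and the sample texts are hypothetical names and data used for illustration.

```python
def build_stuff_prompt(question: str, retrieved_texts: list[str]) -> str:
    """Concatenate all retrieved texts into one context block, then append the question."""
    context = "\n\n".join(retrieved_texts)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


# Hypothetical retrieved snippets.
example_prompt = build_stuff_prompt(
    "When did the full-scale invasion begin?",
    ["Snippet A about 2014.", "Snippet B about February 2022."],
)
print(example_prompt)
```

Because every retrieved document goes into one prompt, the prompt length grows with the number and size of the retrieved documents.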

In [29]:
def test_rag(qa, query):
    print(f"Query: {query}\n")
    time_1 = time()
    result = qa.run(query)
    time_2 = time()
    print(f"Inference time: {round(time_2-time_1, 3)} sec.")
    print("\nResult: ", result)

In [30]:
query1 = "Explain in under 150 words the most recent developments of the conflict between Ukraine and Russia."
test_rag(qa1, query1)

Query: Explain in under 150 words the most recent developments of the conflict between Ukraine and Russia.



> Entering new RetrievalQA chain...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (2048). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.



> Finished chain.
Inference time: 26.687 sec.

Result:   The ongoing Russo-Ukrainian War began in February 2014, following Ukraine's Revolution of Dignity. Russia occupied and annexed Crimea, supported pro-Russian separatists in the Donbas War, and engaged in naval incidents and cyberwarfare. In February 2022, Russia launched a full-scale invasion of Ukraine, occupying more territory and starting the biggest conflict in Europe since World War II. Russia's president Vladimir Putin expressed irredentist views and demanded Ukraine be barred from joining NATO. In early 2022, Russia recognized the Donetsk and Luhansk People's Republics as independent states. The conflict settled into a violent but static conflict between Ukraine and the Russian and separatist forces, with many brief ceasefires but no lasting peace and few changes in territorial control.
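
The length warning in the output above appears when the stuffed prompt is longer than the model's configured maximum. One way to reason about this is to give the context a token budget and keep only as many retrieved texts as fit. The sketch below uses a crude estimate of four characters per token, which is an assumption rather than a tokenizer measurement. The helper names and the budget value are also hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; a real tokenizer would give exact counts."""
    return math.ceil(len(text) / chars_per_token)


def fit_to_budget(texts, max_prompt_tokens):
    """Keep texts in ranked order until the estimated budget would be exceeded."""
    kept, used = [], 0
    for text in texts:
        cost = estimate_tokens(text)
        if used + cost > max_prompt_tokens:
            break
        kept.append(text)
        used += cost
    return kept


# Four synthetic "pages" of 4000 characters each (about 1000 estimated tokens apiece).
pages = ["x" * 4000 for _ in range(4)]
budget = 2048 - 256  # assumed: leave room for the question and the answer
print(len(fit_to_budget(pages, budget)))
```

Splitting the pages into smaller chunks before indexing is another way to keep the stuffed prompt within the limit.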


In [32]:
# query2 = "Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words."
# test_rag(qa2, query2)

In [33]:
# query3 = "In under 100 words please explain the scenario where Burke Ramsey is assumed to have killed his sister JonBenet and the correlation with the autopsy report of JonBenet Ramsey."
# test_rag(qa3, query3)

In [35]:
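docs1 = vectordb1.similarity_search(query1)
print(f"Query: {query1}")
# Note: len(ukrainedocs) is the number of indexed PDF pages, not the number of
# retrieved documents; use len(docs1) to report the latter.
print(f"Retrieved documents: {len(ukrainedocs)}")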
for doc in docs1:
    doc_details = doc.to_json()['kwargs']
#     print("Source: ", doc_details['metadata']['source'])
    print("Text: ", doc_details['page_content'], "\n")

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Query: Explain in under 150 words the most recent developments of the conflict between Ukraine and Russia.
Retrieved documents: 29
Text:  Date
27 February 2014[b] – present
(10 years, 5 months, 2 weeks and
6 days)
Location
Ukraine, Russia, and Black Sea
(spillover into Romania,[1] Poland,
Moldova, and Belarus)
Status
Ongoing
Territorial
changes
Russian annexation of Crimea and
parts of four southeast Ukrainian
oblasts in 2014 and 2022,
respectively
Russian occupation of more than
18% of Ukrainian territory as of
March 2024[2]
Belligerents
 Russia
 Donetsk PR[a]
 Luhansk PR[a]
Supplied by:
For details, see Russian
military suppliers
 Ukraine
Supplied by:
For countries providing
aid to Ukraine since
2022, see military aid to
Ukraine
Commanders and leaders
 Russia
Vladimir Putin
Sergei Shoigu
Valery Gerasimov
Yevgeny Prigozhin
(2014–2023)
 Ukraine
Volodymyr Zelenskyy
(2019–present)
Petro Poroshenko
(2014–2019)
The signing ceremony of the
Belovezha Accords by the heads of
states and govern

In [None]:
# docs2 = vectordb2.similarity_search(query2)
# print(f"Query: {query2}")
# print(f"Retrieved documents: {len(bidendocs)}")
# for doc in docs2:
#     doc_details = doc.to_json()['kwargs']
# #     print("Source: ", doc_details['metadata']['source'])
#     print("Text: ", doc_details['page_content'], "\n")

In [None]:
# docs3 = vectordb3.similarity_search(query3)
# print(f"Query: {query3}")
# print(f"Retrieved documents: {len(jonbenetdocs)}")
# for doc in docs3:
#     doc_details = doc.to_json()['kwargs']
# #     print("Source: ", doc_details['metadata']['source'])
#     print("Text: ", doc_details['page_content'], "\n")