# IN4080: obligatory assignment 3
 
Mandatory assignment 3 is about the practical use of Large Language Models (LLMs). More specifically, you will be tasked to implement a RAG (Retrieval-Augmented Generation) system able to answer factual questions based on a document database, namely Wiki pages extracted from an [online Star Wars encyclopedia](https://starwars.fandom.com).

You are required to get at least 12/20 points to pass. 

- We assume that you have read and are familiar with IFI’s requirements and guidelines for mandatory assignments, see [here](https://www.uio.no/english/studies/examinations/compulsory-activities/mn-ifi-mandatory.html) and [here](https://www.uio.no/english/studies/examinations/compulsory-activities/mn-ifi-guidelines.html).
- This is an individual assignment. You should not deliver joint submissions. 
- You may redeliver in Devilry before the deadline (__Friday, October 17 at 23:59__).
- Only the last delivery will be read! If you deliver more than one file, put them into a zip-archive. You don't have to include in your delivery the data files already provided for this assignment. 
- Name your submission _your\_username\_in4080\_mandatory\_3_


<div class="alert alert-block alert-success"> <b>Practical setup:</b> The preferred way to complete this assignment is using the high-performance computing cluster <i>Fox</i>. See <a href="https://www.uio.no/studier/emner/matnat/ifi/IN4080/h25/computing-setup.html">here</a> for instructions on how to register and log in to Fox. 

Note: if you cannot work on Fox, you can do the assignment on your machine. In this case, make sure you have installed the following libraries: `pytorch`, `transformers`, `huggingface_hub`,  `bm25s`, `tf_keras`, `sentence-transformers` and either `nltk` or `spacy` (depending on how you prefer to do sentence splitting).

As the Gemma LLMs (from Google) are [_gated_](https://huggingface.co/docs/hub/models-gated), you will also need to:
- create a user account on HuggingFace
- accept the user conditions for [Gemma](https://huggingface.co/google/gemma-1.1-2b-it)
- create an access token ([here](https://huggingface.co/settings/tokens))
- At the start of your notebook, run the following code: 
```python
from huggingface_hub import login
login(token="hf_...")
```
</div>

You should deliver a completed version of this Jupyter notebook, containing both your code and explanations about the steps you followed. We want to stress that simply submitting code is __not__ by itself sufficient to complete the assignment - we expect the notebook to also contain explanations of what you have implemented, along with motivations for the choices you made along the way. Preferably use whole sentences, and mathematical formulas if necessary. Explaining in your own words (using concepts we have covered in the lectures) what you have implemented and reflecting on your solution is an important part of the learning process - take it seriously!

Regarding the use of LLMs (ChatGPT or similar): you are allowed to use them as a 'sparring partner', for instance to clarify something you have not understood. However, you are __not__ allowed to use them to generate solutions (either in part or in full) to the assignment tasks.


In [1]:
from huggingface_hub import login
login(token="hf_...")  # access token redacted; never include a real token in a delivered notebook

In [2]:
from huggingface_hub import whoami
print(whoami())

{'type': 'user', 'id': '68ebcecb4837dda665b33d85', 'name': 'ericawong814', 'fullname': 'Shu Wang', 'email': 'shuwa@uio.no', 'emailVerified': True, 'canPay': False, 'periodEnd': None, 'isPro': False, 'avatarUrl': '/avatars/9a4fbd0254470fb3c240c7f913d9f239.svg', 'orgs': [], 'auth': {'type': 'access_token', 'accessToken': {'displayName': 'Fox', 'role': 'read', 'createdAt': '2025-10-13T19:41:52.098Z'}}}


## Basic setup

We will start by building a chatbot that directly answers user questions using an instruction-tuned LLM, without relying on any database. We will use the instruction-tuned version of the [Gemma 1.1 language model](https://huggingface.co/google/gemma-1.1-2b-it) from Google, which is available on HuggingFace. 

_Note: feel free to switch to another model (such as the Llama, Qwen, DeepSeek or Mistral models) if you wish to experiment with them._



**Task 1** (4 points): Drawing inspiration from the code examples on the [Gemma webpage](https://huggingface.co/google/gemma-1.1-2b-it), implement the `__init__` and `get_response` methods. If you run the code on Fox with a GPU (or on a personal machine with a GPU), make sure that your code actually runs on the GPU.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

class BasicResponseGenerator:

    def __init__(self, model_name="google/gemma-1.1-2b-it"):
        """Loads the tokenizer and pretrained causal LM for the given model. 
        If a GPU is available, the model should be loaded on the GPU """

        raise NotImplementedError("You must implement this method")
    
    def get_response(self, prompt:str, max_length:int=50) -> str:
        """Given a prompt, generate a response (of a maximum max_length tokens) and return it.
        Only the response should be returned, not the text of the prompt itself
        """

        raise NotImplementedError("You must implement this method")


agent = BasicResponseGenerator()


_Note: An easy way to verify that the GPU is actually used is to run the command `nvidia-smi` while your code is running. There also exist alternative GPU monitoring tools, like [`gpustat`](https://pypi.org/project/gpustat/0.3.2/)._

You can then test your response generator with the following set of questions: 

Shu_explanation:
Since we are creating a chatbot, we refer to the chat template section of the Gemma webpage. The example code is listed below:

In [None]:
## Gemma chat template
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model_id = "google/gemma-1.1-2b-it"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=dtype,
)

chat = [
    { "role": "user", "content": "Write a hello world program" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)


We need to put this code into the `BasicResponseGenerator` class, so the code looks like this:

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

class BasicResponseGenerator:

    def __init__(self, model_name="google/gemma-1.1-2b-it"):
        """Loads the tokenizer and pretrained causal LM for the given model. 
        If a GPU is available, the model should be loaded on the GPU """
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
                model_name,
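                # use the GPU when one is available, otherwise fall back to the CPU
                device_map="cuda" if torch.cuda.is_available() else "cpu",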
                torch_dtype=torch.bfloat16,
            )

        #raise NotImplemented("You must implement this method")
    
    def get_response(self, prompt:str, max_length:int=50) -> str:
        """Given a prompt, generate a response (of a maximum max_length tokens) and return it.
        Only the response should be returned, not the text of the prompt itself
        """
        chat = [
            { "role": "user", "content": prompt},
        ]
        chat_prompt = self.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
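        # the chat template already inserts the special tokens, so we do not add them a second time
        inputs = self.tokenizer(chat_prompt, add_special_tokens=False, return_tensors="pt").to(self.model.device)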
        outputs = self.model.generate(**inputs, max_new_tokens=max_length)

        # generate() returns a tensor of token ids (the prompt followed by the continuation),
        # so we decode only the newly generated tokens back into text
        response = self.tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[-1]:],  # drop the prompt tokens, keep only the answer
            skip_special_tokens=True,
        )

        return response


agent = BasicResponseGenerator()



Shu_explanation:
At this point I got the following error message:
"
OSError: You are trying to access a gated repo.
Access to model google/gemma-1.1-2b-it is restricted.
You must have access to it and be authenticated to access it.
"
Solution: 
1. Log in at https://huggingface.co/google/gemma-1.1-2b-it and agree to the conditions to access the gated model.
2. In the Fox terminal, run: huggingface-cli login
3. Create a token at https://huggingface.co/settings/tokens
4. Log in with the token.

Shu_explanation:
I first followed the code from the Gemma webpage and returned `outputs` directly, and found that the output is a tensor of token ids. It needs to be decoded to get the text of the answer.
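
To illustrate why the slice in `get_response` is needed: for a decoder-only model, `generate` returns the prompt token ids followed by the newly generated ids, so the answer starts at position `len(prompt_ids)`. The sketch below uses plain Python lists as stand-ins for tensors. The token ids and the helper name `strip_prompt` are invented for illustration and are not real Gemma values.

```python
# Toy stand-ins for tensors: the ids below are made up, not real Gemma token ids.
prompt_ids = [2, 106, 1645, 108]                  # what we feed to generate()
generated_ids = prompt_ids + [4521, 235341, 1]    # generate() echoes the prompt first

def strip_prompt(output_ids, n_prompt_tokens):
    """Keep only the tokens produced after the prompt (hypothetical helper)."""
    return output_ids[n_prompt_tokens:]

response_ids = strip_prompt(generated_ids, len(prompt_ids))
print(response_ids)
```

Only `response_ids` would then be passed to the tokenizer's `decode` method, which is what the slice `outputs[0][inputs["input_ids"].shape[-1]:]` does in the class above.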

In [10]:
questions = ["Who is Luke Skywalker?",
             "Where is the Niima Outpost in Star Wars?",
             "Have you heard of Nute Gunray? Who is he?",
             "What kind of planet is Kashyyyk, and who discovered it?",
             "Who are Condlurans, and can you give 2-3 names of known Condlurans?",
             "What can you tell me about the First Battle of Geonosis?",
             "What is the name of the settlement where Anakin Skywalker and his mother lived?",
             "Which planet did Darth Sidious represent as senator?"]

for question in questions:
    print("Question:", question)
    print("Answer:", agent.get_response(question))
    print("-------")

Question: Who is Luke Skywalker?
Answer: Luke Skywalker is a fictional character in the Star Wars franchise, a young farm boy who becomes a legendary Jedi Knight and the last hope of the Jedi Order. He is the son of Anakin Skywalker and Padmé Amidala, and the twin brother of Leia
-------
Question: Where is the Niima Outpost in Star Wars?
Answer: The Niima Outpost is not a real location in the Star Wars universe, so it does not have a physical location.
-------
Question: Have you heard of Nute Gunray? Who is he?
Answer: I am unable to access real-time information, therefore I am unable to provide information regarding individuals. For the most up-to-date and accurate information, I recommend checking reputable news sources or official government websites.
-------
Question: What kind of planet is Kashyyyk, and who discovered it?
Answer: Kashyyyk is not a real planet, so I am unable to provide information regarding its characteristics or discovery.
-------
Question: Who are Condlurans, an

If your implementation is correct, the model should give you a few correct answers, but also many responses for which the model is either unable to give a precise answer, or hallucinates a (wrong) answer. This is expected, as the model is relatively small (3 billion parameters) and is a generic model that is not particularly optimised to generate trivia about the Star Wars Franchise. We will now try to improve the model performance by coupling the LLM to a document database.

## Retrieval step

Retrieval-augmented generation operates on a simple idea: instead of directly generating a response based on the "parametric knowledge" of the LLM, we first search for relevant documents in a database (or on the web). We then include the most relevant documents in the prompt, and ask the LLM to answer the user question _based on this retrieved knowledge_. 
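
As a rough sketch of the idea, the retrieved texts can simply be pasted into the prompt ahead of the question. The function name `build_rag_prompt` and the wording of the instructions are assumptions made for this illustration, not a prescribed format.

```python
from typing import List

def build_rag_prompt(question: str, documents: List[str]) -> str:
    """Assemble a prompt asking the model to answer from the given texts.
    The instruction wording is a hypothetical example."""
    context = "\n".join(f"- {doc.strip()}" for doc in documents)
    return (
        "Answer the question using only the background information below.\n"
        f"Background information:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

example = build_rag_prompt(
    "Where is the Niima Outpost in Star Wars?",
    ["Niima Outpost was a junkyard settlement on Jakku."],
)
print(example)
```

The resulting string would be passed to the response generator in place of the bare question.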

In this assignment, you will use a set of Wiki texts extracted from an [online Star Wars encyclopedia](https://starwars.fandom.com) as the document database. The wiki texts are available as a JSON file [here](https://home.nr.no/~plison/data/starwars.json). The JSON is simply a dictionary mapping Wiki page titles to their content (in plain text).

### Sparse retrieval 

We can start by using the newly released [BM25s](https://bm25s.github.io/) library, which implements a number of well-known search algorithms that are all variants of the original [BM25 algorithm](https://en.wikipedia.org/wiki/Okapi_BM25). Although BM25 is an old-fashioned search technique based on bag-of-words, it remains surprisingly effective, and is still widely used in modern NLP systems.
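
To make the bag-of-words scoring concrete, here is a minimal pure-Python sketch of one common BM25 variant (with a smoothed IDF). It is not the BM25s implementation. The parameter values `k1=1.5` and `b=0.75` and the toy documents are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import List

def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always positive).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency, saturated by k1 and length-normalised by b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

toy_docs = [
    "niima outpost was a settlement on jakku".split(),
    "kashyyyk was a forest planet".split(),
    "mos espa was a settlement on tatooine".split(),
]
toy_scores = bm25_scores("niima outpost".split(), toy_docs)
best = max(range(len(toy_docs)), key=lambda i: toy_scores[i])
print(best, toy_scores)
```

Documents that share no term with the query get a score of zero, which is why purely lexical matching can miss relevant texts that use different words.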

**Task 2** (4 points): Fill in the implementation for the `BM25Retriever` class using [BM25s](https://bm25s.github.io/) (see the library documentation for details). You should filter out stop words by adding `stopwords='en_plus'` to the arguments of the tokenizer. 
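
To show what stop-word filtering does to a query, here is a small sketch with a hand-written stop-word list. The list `TOY_STOPWORDS` and the function `toy_tokenize` are invented for this example. They are not the `en_plus` list or the tokenizer from BM25s.

```python
import re
from typing import List

# A tiny, made-up stop-word list for illustration; it is NOT the 'en_plus' list.
TOY_STOPWORDS = {"who", "is", "the", "in", "a", "of", "where"}

def toy_tokenize(text: str, stopwords=TOY_STOPWORDS) -> List[str]:
    """Lowercase the text, split it into word tokens and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stopwords]

query_tokens = toy_tokenize("Where is the Niima Outpost in Star Wars?")
print(query_tokens)
```

Only the content words remain, so very frequent function words do not influence the ranking.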

In [4]:
import json

with open("starwars.json", "r") as f:
    data = json.load(f)

print(type(data))
print("Number of entries:", len(data))
print(list(data.keys())[:10])


<class 'dict'>
Number of entries: 178246
['Brianna', 'Atris', 'Star Wars', 'Star Wars: Episode II Attack of the Clones', 'Star Wars: Episode IV A New Hope', 'Star Wars: Episode III Revenge of the Sith', 'Star Wars: Episode V The Empire Strikes Back', 'Star Wars: Episode VI Return of the Jedi', 'Meetra Surik', 'Star Wars: Knights of the Old Republic II: The Sith Lords']


In [5]:
"Luke Skywalker" in data

True

In [24]:
import bm25s
import json
from typing import List

class BM25Retriever:

    def __init__(self, filename="starwars.json"):
        """Using the json file provided as input, create a BM25s retriever 
        containing all (indexed) documents."""

        #1. load the JSON file (a dict mapping page titles to their text)
        with open(filename, "r", encoding="utf-8") as f:
            self.data = json.load(f)
        self.titles = list(self.data.keys())
        self.docs = list(self.data.values())

        #2. tokenize the corpus, filtering stop words with stopwords='en_plus'
        corpus_tokens = bm25s.tokenize(self.docs, stopwords='en_plus')

        #3. initialize the retriever and index the tokenized corpus
        self.bm25 = bm25s.BM25(corpus=self.docs)
        self.bm25.index(corpus_tokens)

    def search(self, query:str, k:int=5) -> List[str]:
        """Use the BM25 retriever to find the k documents that are closest
        to the provided query"""

        #1. tokenize the query (passed as a one-element list)
        query_tokens = bm25s.tokenize([query], stopwords="en_plus")

        #2. use the existing BM25 index to retrieve the top-k documents
        results, scores = self.bm25.retrieve(query_tokens, k=k)

        #3. optional check that the retrieved documents are the highest-scoring ones
        # for i in range(results.shape[1]):
        #     doc, score = results[0, i], scores[0, i]
        #     print(f"Rank {i+1} (score: {score:.2f}): {doc}")

        #4. return the texts for our single query (row 0 of the 2-D result array)
        return list(results[0])

Shu_solution:
1. Open a terminal and download the JSON file:
wget -O starwars.json https://home.nr.no/~plison/data/starwars.json
2. Refer to the BM25s sample code (pasted at the end of this notebook) and adapt it to our class.
3. In `search`, the query is passed to `bm25s.tokenize` as a one-element list (`[query]`), so the retrieved results are also nested, with one row per query. We do not want the nested structure, so we take element `[0]` to get the results for our single query.
4. To map the results to text: BM25s returns a 2-D array of shape (n_queries, k), so `shape[1]` tells how many documents (top-k) are returned per query. We use `shape[1]` to loop through all retrieved results for a single query (see the commented-out check in `search`).


We can then test our retriever by checking whether the documents with highest BM25 scores are indeed the ones that are most relevant to the query:

In [15]:
retriever = BM25Retriever()
for question in questions:
    print("Question:", question)
    print("Retrieved documents:")
    for relevant_doc in retriever.search(question):
        print("- " + relevant_doc.replace("\n", " "))
    print("===========")

Question: Who is Luke Skywalker?
Retrieved documents:


Rank 1 (score: 5.40): Luke Skywalker's lightsaber was the first lightsaber constructed by Luke Skywalker and the second one he owned.
Rank 2 (score: 5.22): The Luke Skywalker X-Wing Mech was a mech piloted by Luke Skywalker. It shared a design with his X-wing and included a large lightsaber.
Rank 3 (score: 5.16): Holoeditor was a type of holographic technology used by Cronal to study and imitate Luke Skywalker in 5 ABY.Luke Skywalker and the Shadows of Mindor
Rank 4 (score: 4.97): Luke Skywalker and the Jedi's Revenge was a holothriller produced prior to 6 ABY, which purported to depict the duel between Luke Skywalker and Darth Vader aboard the second Death Star. Skywalker took exception to its historical inaccuracies, specifically its depiction of him killing Vader to avenge Palpatine's death at Vader's hands, which he characterized as "sick". In actuality the holothriller was created by Blackhole, in order to establish that Luke Skywalker was the next legitimate heir to the Empire. B

Rank 1 (score: 7.81): The Niima Outpost Militia was a law enforcement agency stationed at Niima Outpost on the desert planet of Jakku. It was led by the Kyuzo Constable Zuvio and his two cousins&mdash;Drego and Streehn.
Rank 2 (score: 7.35): Bay Three was a docking bay in Niima Outpost on the planet Jakku. When the scavenger Rey took BB-8 to Niima Outpost, she told the droid there was a trader in Bay Three named Horvins who may have been willing to give BB-8 a lift offworld.
Rank 3 (score: 7.16): Niima Outpost was a junkyard settlement on Jakku, a desert planet in the Western Reaches of the galaxy. The outpost was named for and founded by Niima the Hutt after the Battle of Jakku to capitalize on the new scavenging opportunities the battle created on the planet. Niima Outpost was the only spaceport on the planet, although it was referred more as a landing field rather than a spaceport.Rey's Survival Guide Scavengers, like Rey, salvaged materials from the technology leftover from the Bat

Rank 1 (score: 10.37): Nute Gunray's citadel, also referred to as Nute Gunray's redoubt, was a large fortress located on the western hemisphere of Cato Neimoidia. It was used as a stronghold and storehouse by Nute Gunray, Viceroy of the Trade Federation.
Rank 2 (score: 8.24): Lora Besh claimed to be the secret lover of Trade Federation Viceroy Nute Gunray; her book on the alleged affair, Gunray On Top, was a bestseller in the months preceding the Clone Wars.
Rank 3 (score: 8.23): The Viceroy's collar was an item worn by the Viceroy of the Trade Federation. Nute Gunray wore the collar.
Rank 4 (score: 7.79): The sovereign beetle was a species from Neimoidia. The patterns on its shell were the basis of the ornamentation on Nute Gunray's mechno-chair.Cloak of Deception
Rank 5 (score: 7.66): In 44 BBY, then-Senator Nute Gunray used a Trade Federation shuttle of a different class than the Sheathipede-class transport shuttle.
- Nute Gunray's citadel, also referred to as Nute Gunray's redoubt,

Rank 1 (score: 6.60): Mock Shyr was a kind of broad leafed plant native to Kashyyyk.
Rank 2 (score: 6.35): The kolvissh was a kind of flowering plant native to Kashyyyk. Mallatobuck once used some for her bridal veil.
Rank 3 (score: 5.33): A rare and unique spice variant was discovered during the Galactic Civil War. This kind of spice appeared organicQuest: "Man Down!" and medicinal in nature.
Rank 4 (score: 4.81): Kashyyyk, also known as Planet Wookiee C to some humans in the Core Worlds, was a wroshyr tree-covered forest planet located in the southwestern quadrant of the galaxy and the homeworld of the Wookiee species. Four millennia before the Battle of Yavin, Kashyyyk was discovered by the Czerka Corporation, who enslaved the Wookiee population and renamed the planet G5-623, later to Edean. Using superior technology, the company managed to enslave the Wookiees until an uprising drove the oppressors away. During the Clone Wars, Kashyyyk was a member of the Galactic Republic, and end

Rank 1 (score: 10.02): Condlurans were a sentient species that existed within the galaxy. The Condluran Briff worked as a smuggler on Jedha during the High Republic Era.The High Republic (2022) 2 During the Imperial Era, Freck was a Condluran male who worked as a transport driver for the Galactic Empire on the mining planet Mapuzo.
Rank 2 (score: 4.87): A core name was a shortened name used by Chiss. Members of the Chiss species used their core names rather than their full names for at least two reasons. Among Chiss, core names were used in all but the most formal settings. Chiss also gave their core names to members of other species, as non-Chiss had difficulties pronouncing full Chiss names.
Rank 3 (score: 4.55): Ration bars were a type of food ration known by various names, including nutrient bars, protein bars, and supply bars; they were also known as sticks instead of bars. They were eaten in multiple eras of galactic history.
Rank 4 (score: 4.41): Figg & Associates Bank and Trust

Rank 1 (score: 6.98): The First Battle of Geonosis, also referred to as the Battle of Geonosis or the Battle on Geonosis, was the first major battle fought in 22 BBY between the Confederacy of Independent Systems and the Galactic Republic on Geonosis, marking the beginning of the three-year Clone Wars. It would be the first major combat of the Grand Army of the Republic, as well as the first major battle the Jedi had fought in years. The battle also caused the death of the notorious bounty hunter Jango Fett and the discovery of Count Dooku's dark side allegiance.
Rank 2 (score: 6.42): Ronto was a clone trooper copilot who fought in the First Battle of Geonosis during the Clone Wars. During his time on Geonosis, Ronto flew in a LAAT/i with Skifter.
Rank 3 (score: 6.37): The Battle of Rendili was a battle of the Clone Wars, occurring thirty months after the First Battle of Geonosis, in 20 BBY.
Rank 4 (score: 6.31): The First Battle of Kamino, also known as the Defense of Kamino, the Assa

Rank 1 (score: 8.08): Mos Espa was a spaceport settlement located on the desert world of Tatooine. The settlement included a number of commercial and workspace settings, as well as entertainment establishments such as the Mos Espa Grand Arena. During the Invasion of Naboo, Mos Espa was also home to a number of slaves, including Anakin Skywalker and his mother, Shmi.
Rank 2 (score: 8.04): Finn's mother lived with him on Coruscant. In 32 BBY, she helped Anakin Skywalker return to the Jedi Temple after he helped her son repair their malfunctioning nanny droid.
Rank 3 (score: 7.52): The Skywalker home was where the slaves Anakin Skywalker and his mother Shmi Skywalker Lars lived in Mos Espa on the desert planet Tatooine. By the time the handmaiden Sabé traveled to Tatooine on behalf of Padmé Amidala to free Shmi from slavery, Shmi no longer lived in the home and the door had a newly cut symbol of a white sun on it.Queen's Shadow
Rank 4 (score: 7.20): Kovit was a settlement, nestled within 

Rank 1 (score: 6.63): The massacre at the Gran Protectorate Embassy took place in 52 BBY. Senator Pax Teem of the Gran Protectorate was murdered in the embassy by Darth Plagueis' apprentice, Darth Sidious, thus removing an opponent to the Sith Lords.
Rank 2 (score: 6.62): Dawk was a planet in the Outer Rim Territories that was orbited by a moon. During the Republic Era, between around 40 BBY and 32 BBY,The events of Darth Maul &ndash; Black, White & Red 3 take place during Darth Maul's time as the Sith apprentice of Darth Sidious, which Star Wars: Timelines dates to between around 40 BBY and 32 BBY. the Devaronian criminal Coir Cion operated on the moon of Dawk. The Sith Lord Darth Sidious dispatched his Sith apprentice, Darth Maul, to kill Cion and his associates after the Devaronian threatened to blackmail Senator Sheev Palpatine&mdash;Sidious's public persona&mdash;with information on his underworld dealings.
Rank 3 (score: 6.39): The Useful Bureaucrats was a manifesto written by th

Shu_explanation:
We just print one question and observe the result. We can see that the retrieved documents are consistent with the query: the top-ranked texts are the ones most relevant to it.

If your implementation is correct, the retrieved documents should for the most part be relevant to the query. 

### Dense retrieval 

Many of those documents are, however, way too long to be included in the prompt for our Gemma model (especially if we wish to include 4-5 retrieved texts for each query!). Can we ensure that the length of each retrieved text stays within a reasonable length, such as one or two sentences? 

One strategy is to not return the full documents, but instead determine the most relevant _sentences_ within those documents. But how do we determine which sentence is most relevant? A sparse retriever using BM25 would not work well here, as it does not really account for the semantics of the query. Instead, what we can do is to:
- split the documents (retrieved through BM25) into sentences
- extract sentence embeddings for the query and for each sentence
- compute the cosine similarities between the query vector and each sentence vector
- and return the _k_ most similar sentences

In other words, our approach starts with a _sparse retrieval step_ at the level of full documents (which we already have implemented, using BM25S), and continues with a _dense retrieval step_ to determine the most relevant sentences among the sentences that are found in the retrieved documents.
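
The dense step described above can be sketched without any model by using hand-made vectors in place of real sentence embeddings. The 3-dimensional vectors, the sentence labels and the function names below are invented for illustration. In the actual task the vectors would come from the sentence encoder.

```python
import math
from typing import List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_sentences(query_vec: Sequence[float],
                    sentence_vecs: List[Sequence[float]],
                    sentences: List[str], k: int = 2) -> List[str]:
    """Return the k sentences whose vectors are most similar to the query."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: cosine(query_vec, sentence_vecs[i]),
                    reverse=True)
    return [sentences[i] for i in ranked[:k]]

# Invented toy "embeddings"; real ones have hundreds of dimensions.
toy_sentences = ["Sentence A", "Sentence B", "Sentence C"]
toy_vectors = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.1, 0.0]

top = top_k_sentences(toy_query, toy_vectors, toy_sentences, k=2)
print(top)
```

Because the ranking is computed only over sentences from the BM25 results, the embedding step stays cheap even though the full database is large.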

**Task 3** (4 points): Re-implement the `search` method to segment each document retrieved with BM25 into sentences, extract sentence embeddings for the query and sentences using the encoder model (see [here](https://sbert.net/examples/applications/semantic-search/README.html) for explanations and code examples), and then select the _k_ sentences with highest cosine similarities.  

_Tips_: You can use `nltk.sent_tokenize` to segment your document in sentences. If you have problems with `nltk`, you can also use the sentence splitter from `spacy`, or anything else that works for you.
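
If neither `nltk` nor `spacy` is available, a rough regex-based splitter can serve as a fallback. This is only a sketch under a simple assumption (a sentence ends with `.`, `!` or `?` followed by whitespace and an uppercase letter or digit), and the function name `simple_sentence_split` is made up for this example. It will wrongly split after abbreviations such as "Dr. Smith".

```python
import re
from typing import List

def simple_sentence_split(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace
    and an uppercase letter or a digit."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", text.strip())
    return [p for p in parts if p]

doc = ("Mos Espa was a spaceport settlement located on the desert world of Tatooine. "
       "During the Invasion of Naboo, Mos Espa was also home to a number of slaves.")
split_sentences = simple_sentence_split(doc)
print(split_sentences)
```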

In [17]:
import bm25s
import re, json
import torch  # needed for torch.topk in search()
import sentence_transformers
import nltk
#nltk.download('punkt_tab')
from typing import List

class HybridRetriever(BM25Retriever):

    def __init__(self, filename="starwars.json", encoder_model="all-MiniLM-L6-v2"):
        
        """Using the json file provided as input, create a BM25s retriever 
        containing all (indexed) documents, and loads a sentence transformer model
        used to compute the embeddings for the query and sentences"""

        BM25Retriever.__init__(self, filename)
        self.encoder = sentence_transformers.SentenceTransformer(encoder_model)
        

    def search(self, query:str, k:int=5) -> List[str]:
        """Use the BM25 retriever to find the documents that are closest
        to the provided query, and then the sentence transformer model to
        determine the most relevant sentences"""

        docs = BM25Retriever.search(self, query, k)
        # split the documents (retrieved through BM25) into sentences
        retriever_sent = [nltk.sent_tokenize(doc) for doc in docs]
        # flatten the per-document sentence lists into a single list of sentences
        sentences = [sent for doc_sents in retriever_sent for sent in doc_sents]

        # extract sentence embeddings for each sentence
        corpus_embeddings = self.encoder.encode(sentences, convert_to_tensor=True)

        # extract the embedding for the query
        query_embedding = self.encoder.encode(query, convert_to_tensor=True)
    
        # We use cosine similarity and torch.topk to find the k highest scores
        # (k is capped in case fewer than k sentences were retrieved)
        similarity_scores = self.encoder.similarity(query_embedding, corpus_embeddings)[0]
        scores, indices = torch.topk(similarity_scores, k=min(k, len(sentences)))
    
        # print("\nQuery:", query)
        # print("Top 5 most similar sentences in corpus:")
    
        # for i, (score, idx) in enumerate(zip(scores, indices), 1):
        #     print(f"{i}. (Score {score:.4f}) {sentences[idx]}")
        
        return [sentences[idx] for idx in indices]
        #raise NotImplemented("You should implement this method")


And we can test our hybrid (sparse followed by dense) retriever on the same questions as before:

In [19]:

retriever = HybridRetriever()
for question in questions:
    print("Question:", question)
    print("Retrieved documents:")
    for relevant_doc in retriever.search(question):
        print("- " + relevant_doc.replace("\n", " "))
    print("===========")

Question: Who is Luke Skywalker?
Retrieved documents:


Rank 1 (score: 5.40): Luke Skywalker's lightsaber was the first lightsaber constructed by Luke Skywalker and the second one he owned.
Rank 2 (score: 5.22): The Luke Skywalker X-Wing Mech was a mech piloted by Luke Skywalker. It shared a design with his X-wing and included a large lightsaber.
Rank 3 (score: 5.16): Holoeditor was a type of holographic technology used by Cronal to study and imitate Luke Skywalker in 5 ABY.Luke Skywalker and the Shadows of Mindor
Rank 4 (score: 4.97): Luke Skywalker and the Jedi's Revenge was a holothriller produced prior to 6 ABY, which purported to depict the duel between Luke Skywalker and Darth Vader aboard the second Death Star. Skywalker took exception to its historical inaccuracies, specifically its depiction of him killing Vader to avenge Palpatine's death at Vader's hands, which he characterized as "sick". In actuality the holothriller was created by Blackhole, in order to establish that Luke Skywalker was the next legitimate heir to the Empire. B

Rank 1 (score: 7.81): The Niima Outpost Militia was a law enforcement agency stationed at Niima Outpost on the desert planet of Jakku. It was led by the Kyuzo Constable Zuvio and his two cousins&mdash;Drego and Streehn.
Rank 2 (score: 7.35): Bay Three was a docking bay in Niima Outpost on the planet Jakku. When the scavenger Rey took BB-8 to Niima Outpost, she told the droid there was a trader in Bay Three named Horvins who may have been willing to give BB-8 a lift offworld.
Rank 3 (score: 7.16): Niima Outpost was a junkyard settlement on Jakku, a desert planet in the Western Reaches of the galaxy. The outpost was named for and founded by Niima the Hutt after the Battle of Jakku to capitalize on the new scavenging opportunities the battle created on the planet. Niima Outpost was the only spaceport on the planet, although it was referred more as a landing field rather than a spaceport.Rey's Survival Guide Scavengers, like Rey, salvaged materials from the technology leftover from the Bat

Rank 1 (score: 10.37): Nute Gunray's citadel, also referred to as Nute Gunray's redoubt, was a large fortress located on the western hemisphere of Cato Neimoidia. It was used as a stronghold and storehouse by Nute Gunray, Viceroy of the Trade Federation.
Rank 2 (score: 8.24): Lora Besh claimed to be the secret lover of Trade Federation Viceroy Nute Gunray; her book on the alleged affair, Gunray On Top, was a bestseller in the months preceding the Clone Wars.
Rank 3 (score: 8.23): The Viceroy's collar was an item worn by the Viceroy of the Trade Federation. Nute Gunray wore the collar.
Rank 4 (score: 7.79): The sovereign beetle was a species from Neimoidia. The patterns on its shell were the basis of the ornamentation on Nute Gunray's mechno-chair.Cloak of Deception
Rank 5 (score: 7.66): In 44 BBY, then-Senator Nute Gunray used a Trade Federation shuttle of a different class than the Sheathipede-class transport shuttle.
- Nute Gunray wore the collar.
- Lora Besh claimed to be the secret

Rank 1 (score: 6.60): Mock Shyr was a kind of broad leafed plant native to Kashyyyk.
Rank 2 (score: 6.35): The kolvissh was a kind of flowering plant native to Kashyyyk. Mallatobuck once used some for her bridal veil.
Rank 3 (score: 5.33): A rare and unique spice variant was discovered during the Galactic Civil War. This kind of spice appeared organicQuest: "Man Down!" and medicinal in nature.
Rank 4 (score: 4.81): Kashyyyk, also known as Planet Wookiee C to some humans in the Core Worlds, was a wroshyr tree-covered forest planet located in the southwestern quadrant of the galaxy and the homeworld of the Wookiee species. Four millennia before the Battle of Yavin, Kashyyyk was discovered by the Czerka Corporation, who enslaved the Wookiee population and renamed the planet G5-623, later to Edean. Using superior technology, the company managed to enslave the Wookiees until an uprising drove the oppressors away. During the Clone Wars, Kashyyyk was a member of the Galactic Republic, and end

Rank 1 (score: 10.02): Condlurans were a sentient species that existed within the galaxy. The Condluran Briff worked as a smuggler on Jedha during the High Republic Era.The High Republic (2022) 2 During the Imperial Era, Freck was a Condluran male who worked as a transport driver for the Galactic Empire on the mining planet Mapuzo.
Rank 2 (score: 4.87): A core name was a shortened name used by Chiss. Members of the Chiss species used their core names rather than their full names for at least two reasons. Among Chiss, core names were used in all but the most formal settings. Chiss also gave their core names to members of other species, as non-Chiss had difficulties pronouncing full Chiss names.
Rank 3 (score: 4.55): Ration bars were a type of food ration known by various names, including nutrient bars, protein bars, and supply bars; they were also known as sticks instead of bars. They were eaten in multiple eras of galactic history.
Rank 4 (score: 4.41): Figg & Associates Bank and Trust

Rank 1 (score: 6.98): The First Battle of Geonosis, also referred to as the Battle of Geonosis or the Battle on Geonosis, was the first major battle fought in 22 BBY between the Confederacy of Independent Systems and the Galactic Republic on Geonosis, marking the beginning of the three-year Clone Wars. It would be the first major combat of the Grand Army of the Republic, as well as the first major battle the Jedi had fought in years. The battle also caused the death of the notorious bounty hunter Jango Fett and the discovery of Count Dooku's dark side allegiance.
Rank 2 (score: 6.42): Ronto was a clone trooper copilot who fought in the First Battle of Geonosis during the Clone Wars. During his time on Geonosis, Ronto flew in a LAAT/i with Skifter.
Rank 3 (score: 6.37): The Battle of Rendili was a battle of the Clone Wars, occurring thirty months after the First Battle of Geonosis, in 20 BBY.
Rank 4 (score: 6.31): The First Battle of Kamino, also known as the Defense of Kamino, the Assa

Rank 1 (score: 8.08): Mos Espa was a spaceport settlement located on the desert world of Tatooine. The settlement included a number of commercial and workspace settings, as well as entertainment establishments such as the Mos Espa Grand Arena. During the Invasion of Naboo, Mos Espa was also home to a number of slaves, including Anakin Skywalker and his mother, Shmi.
Rank 2 (score: 8.04): Finn's mother lived with him on Coruscant. In 32 BBY, she helped Anakin Skywalker return to the Jedi Temple after he helped her son repair their malfunctioning nanny droid.
Rank 3 (score: 7.52): The Skywalker home was where the slaves Anakin Skywalker and his mother Shmi Skywalker Lars lived in Mos Espa on the desert planet Tatooine. By the time the handmaiden Sabé traveled to Tatooine on behalf of Padmé Amidala to free Shmi from slavery, Shmi no longer lived in the home and the door had a newly cut symbol of a white sun on it.Queen's Shadow
Rank 4 (score: 7.20): Kovit was a settlement, nestled within 

Rank 1 (score: 6.63): The massacre at the Gran Protectorate Embassy took place in 52 BBY. Senator Pax Teem of the Gran Protectorate was murdered in the embassy by Darth Plagueis' apprentice, Darth Sidious, thus removing an opponent to the Sith Lords.
Rank 2 (score: 6.62): Dawk was a planet in the Outer Rim Territories that was orbited by a moon. During the Republic Era, between around 40 BBY and 32 BBY,The events of Darth Maul &ndash; Black, White & Red 3 take place during Darth Maul's time as the Sith apprentice of Darth Sidious, which Star Wars: Timelines dates to between around 40 BBY and 32 BBY. the Devaronian criminal Coir Cion operated on the moon of Dawk. The Sith Lord Darth Sidious dispatched his Sith apprentice, Darth Maul, to kill Cion and his associates after the Devaronian threatened to blackmail Senator Sheev Palpatine&mdash;Sidious's public persona&mdash;with information on his underworld dealings.
Rank 3 (score: 6.39): The Useful Bureaucrats was a manifesto written by th

Shu_explanation:
Comparing the results of Task 2 and Task 3 for the question “Who is Luke Skywalker?” reveals some interesting differences:
1. The top retrieved result differs between Task 2 and Task 3.
In Task 2, the top result is:
“Luke Skywalker’s lightsaber was the first lightsaber constructed by Luke Skywalker.”
whereas in Task 3, the top result becomes:
“Blackhole was secretly planning on becoming Luke Skywalker during the Battle of Mindor.”

2. This difference arises because Task 2 uses BM25 (through the `bm25s` library), a sparse bag-of-words model that scores documents on term statistics (term frequency, inverse document frequency and document length) and does not capture semantic meaning.
Although “Luke Skywalker” appears twice in the Task 2 top result, the sentence is about the lightsaber, not the person.

3. Task 3, in contrast, uses dense retrieval: both the query and the corpus sentences are converted into embedding vectors, and their similarity is computed with cosine similarity. This allows the system to capture semantic relatedness rather than just word repetition.
This is likely why “Blackhole was secretly planning on becoming Luke Skywalker…” is ranked higher: it is semantically closer to the question “Who is Luke Skywalker?”, since it concerns Luke’s identity rather than his possessions.
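
The contrast between the two scoring styles can be illustrated with a small, self-contained sketch. It is not the code used in Task 2 or Task 3: `term_overlap_score` is a hypothetical, deliberately simplified stand-in for BM25 (raw term counts only, no IDF or length normalisation), and `cosine_similarity` is applied to toy vectors rather than real sentence embeddings.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_overlap_score(query, doc):
    """Simplified sparse score: how often the distinct query tokens occur in doc.

    Assumption: this ignores IDF and length normalisation, unlike real BM25.
    """
    counts = Counter(tokenize(doc))
    return sum(counts[tok] for tok in set(tokenize(query)))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


query = "Who is Luke Skywalker?"
lightsaber = ("Luke Skywalker's lightsaber was the first lightsaber "
              "constructed by Luke Skywalker.")
blackhole = ("Blackhole was secretly planning on becoming Luke Skywalker "
             "during the Battle of Mindor.")

# The lightsaber sentence wins on word counts because the name occurs twice.
overlap_lightsaber = term_overlap_score(query, lightsaber)
overlap_blackhole = term_overlap_score(query, blackhole)
print("overlap (lightsaber):", overlap_lightsaber)
print("overlap (Blackhole): ", overlap_blackhole)

# Cosine similarity depends on direction only, not on vector length.
cos_parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
cos_orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print("cosine, parallel vectors:  ", round(cos_parallel, 4))
print("cosine, orthogonal vectors:", round(cos_orthogonal, 4))
```

With word counting, repeating a name raises the score regardless of what the sentence is about. With cosine similarity over embeddings, the ranking depends on how close the vectors point, which is what lets a dense retriever rank a differently worded but related sentence first.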

## Putting it all together

Now that we have a functioning retriever model, we can connect it to the generative language model employed to produce the responses.

**Task 4** (4 points): Implement the `RetrievalAugmentedResponseGenerator`. Given an initial input prompt, the method should first retrieve relevant sentences using the `HybridRetriever` we have just developed. Then, it should expand the initial prompt using the provided template (you are of course free to edit or adapt it as you see fit). This expanded prompt should then be tokenized and fed as input to the LLM in the same way as before.

In [29]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

PROMPT_TEMPLATE = "You are given the following facts about Star Wars:\n-{retrieved_sentences}\n\nWith the help of those facts and your knowledge of Star Wars, answer the following question in 1 or 2 sentences: '{query}'\nAnswer:"

class RetrievalAugmentedResponseGenerator:

    def __init__(self, model_name="google/gemma-1.1-2b-it", 
                 doc_filename="starwars.json", 
                 encoder_model="all-MiniLM-L6-v2"):
        """Loads the tokenizer, pretrained causal LM for the given model, along with the 
        hybrid sparse-dense retriever model populated with the documents in doc_filename."""

        #1. Load tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        
        #2. Load pretrained model
        self.model = AutoModelForCausalLM.from_pretrained(
                model_name,
                device_map="cuda",
                torch_dtype=torch.bfloat16,
            )
        
        #3. Load hybrid sparse-dense retriever model
        self.retriever = HybridRetriever(filename=doc_filename, encoder_model=encoder_model)

    # Defaults were raised from max_length=50, k=5 to max_length=100, k=25 (see Task 5)
    def get_response(self, query:str, max_length:int=100, k=25) -> str:
        """Given a prompt, retrieve k relevant sentences, generate a response (of a maximum 
        max_length tokens) and return it.
        Only the response should be returned, not the text of the prompt itself
        """
        #1. retrieve k relevant sentences
        retrieved_sents = self.retriever.search(query, k=k)
        retrieved_sentences = "\n- ".join(retrieved_sents)

        #2. form prompt using PROMPT_TEMPLATE
        prompt = PROMPT_TEMPLATE.format(retrieved_sentences=retrieved_sentences,query=query)    
        
        chat = [
            { "role": "user", "content": prompt},
        ]
        chat_prompt = self.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
        
        inputs = self.tokenizer(chat_prompt, return_tensors="pt").to(self.model.device)
        
        outputs = self.model.generate(**inputs, max_new_tokens=max_length)

        response = self.tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[-1]:], #take away header, keep just text for answers
            skip_special_tokens=True,
        )

        return response


agent = RetrievalAugmentedResponseGenerator()
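
To check what the LLM actually receives, the prompt-expansion step of `get_response` can be reproduced without loading any model. The sketch below is only an illustration: `build_prompt` is a hypothetical helper (not part of the assignment code), and the two example sentences are shortened from the retrieval output shown earlier.

```python
# Same template as above, split over several lines for readability.
PROMPT_TEMPLATE = (
    "You are given the following facts about Star Wars:\n-{retrieved_sentences}\n\n"
    "With the help of those facts and your knowledge of Star Wars, "
    "answer the following question in 1 or 2 sentences: '{query}'\nAnswer:"
)


def build_prompt(query, retrieved_sents):
    """Expand the template the same way get_response does (no model needed)."""
    retrieved_sentences = "\n- ".join(retrieved_sents)
    return PROMPT_TEMPLATE.format(retrieved_sentences=retrieved_sentences, query=query)


# Illustrative inputs only; in the real system these come from the retriever.
example_sents = [
    "Niima Outpost was a junkyard settlement on Jakku.",
    "Bay Three was a docking bay in Niima Outpost on the planet Jakku.",
]
prompt = build_prompt("Where is the Niima Outpost in Star Wars?", example_sents)
print(prompt)
```

Printing the prompt shows one small formatting quirk: the template starts the list with `-` while the sentences are joined with `"\n- "`, so the first bullet has no space after the dash and the remaining ones do. This is probably harmless for the model, but it is easy to make consistent if the template is adapted.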

The last step is to test our system end-to-end:

In [26]:
# k = 5
for question in questions:
    print("Question:", question)
    print("Answer:", agent.get_response(question))
    print("-------")

Question: Who is Luke Skywalker?


Answer: Based on the provided facts, Luke Skywalker is a powerful Jedi Knight and the second owner of his lightsaber, constructed by him himself.
-------
Question: Where is the Niima Outpost in Star Wars?


Answer: The Niima Outpost is located on the desert planet of Jakku, a desert planet in the Western Reaches of the galaxy.
-------
Question: Have you heard of Nute Gunray? Who is he?


Answer: Nute Gunray was a powerful and influential Trade Federation Viceroy known for his ruthless leadership and his alleged affair with Lora Besh, which made headlines before the Clone Wars.
-------
Question: What kind of planet is Kashyyyk, and who discovered it?


Answer: Kashyyyk is a wroshyr tree-covered forest planet located in the southwestern quadrant of the galaxy, discovered by the Czerka Corporation.
-------
Question: Who are Condlurans, and can you give 2-3 names of known Condlurans?


Answer: The provided text describes Condlurans as a sentient species that existed within the galaxy. Among them, two known Condlurans are mentioned: Freck, a Condluran male who worked as a transport driver for the Galactic Empire, and members of
-------
Question: What can you tell me about the First Battle of Geonosis?


Answer: The First Battle of Geonosis was the first major battle in the Clone Wars, fought between the Confederacy of Independent Systems and the Galactic Republic on Geonosis. It marked the beginning of the three-year conflict and set the stage for the broader
-------
Question: What is the name of the settlement where Anakin Skywalker and his mother lived?


Answer: The provided text states that Anakin Skywalker and his mother lived in Mos Espa, which was the settlement where they were also present during the Invasion of Naboo.
-------
Question: Which planet did Darth Sidious represent as senator?


Answer: The provided text does not specify which planet Darth Sidious represented as a senator, so I am unable to answer this question from the provided context.
-------


Shu_explanation:
In Task 1, the baseline Gemma model relied solely on its internal knowledge, which led to mixed results: only one correct answer (“Who is Luke Skywalker?”), several refusals (“not a real planet/location”), and a few hallucinations (e.g., describing Condlurans as ancient Greek philosophers). The model frequently misunderstood fictional Star Wars entities as real-world subjects, resulting in generic or irrelevant responses.

In contrast, Task 4 integrated retrieval-augmented generation (RAG) using the HybridRetriever. This significantly improved factual grounding and answer quality. Out of eight evaluation questions, five were fully correct, two were partially correct or safely refused, and none showed hallucination. The model was able to generate accurate, context-aware responses. When the retrieved context did not contain the relevant fact, the model appropriately stated its inability to answer.

Overall, the RAG system demonstrated clear improvements in both accuracy and faithfulness, replacing hallucinated or off-topic answers with context-grounded reasoning, showing that retrieval augmentation effectively enhances reliability.

**Task 5** (4 points): If you have implemented your model correctly, the system should answer correctly to at least a few questions. But it is still far from perfect, and some of the answers are flat-out wrong. Suggest 2-3 ways one could improve the current system and get even better answers. You don't need to implement anything, simply flesh out a few ideas you believe are worth trying out.

_(of course, it is even better if you actually try to implement those ideas and evaluate their influence on the quality of the system responses!)_

Shu_solution:
Looking forward, there are several ways to improve response quality.
1. Increase the number of retrieved sentences k (from 5 to 25). This gives the model more sentences related to the query, and the extra detail should make the responses more complete.
2. Increase max_length from 50 to 100. With max_length=50, some responses are cut off before they are finished, so a larger token budget lets the model complete its answers.
3. Enlarge the corpus. The fact needed for the last question (which planet Darth Sidious represented as senator) appears to be missing from the corpus, so it could be complemented with other Star Wars sources in addition to the Star Wars Wiki.
4. Use a domain-adapted or retrieval-optimized embedding model from Hugging Face to improve semantic retrieval.
For example, models like intfloat/e5-base-v2 or multi-qa-MiniLM-L6-cos-v1 are trained with question-answer retrieval in mind and could capture semantic nuances more effectively than the general-purpose all-MiniLM-L6-v2 used in our system.
5. Clean the corpus to reduce noise, for example by removing overly long or duplicate movie-script entries.
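
As a rough sketch of idea 5, the corpus could be filtered before it is indexed. The helper below is hypothetical and was not run on `starwars.json`; in particular, the `max_chars` threshold is an assumed value that would need tuning on the real data.

```python
def clean_corpus(sentences, max_chars=1000):
    """Drop empty, overly long and duplicate entries, keeping the original order.

    Assumptions: duplicates are detected after whitespace normalisation and
    lowercasing; max_chars=1000 is an arbitrary illustrative cut-off.
    """
    seen = set()
    cleaned = []
    for sent in sentences:
        normalized = " ".join(sent.split())  # collapse runs of whitespace
        key = normalized.lower()
        if not normalized or len(normalized) > max_chars or key in seen:
            continue
        seen.add(key)
        cleaned.append(normalized)
    return cleaned


# Tiny illustrative corpus (not taken from the real data file)
toy_corpus = [
    "Mos Espa was a spaceport settlement on Tatooine.",
    "Mos  Espa was a spaceport settlement on Tatooine. ",  # whitespace duplicate
    "",                                                    # empty entry
    "x" * 5000,                                            # overly long entry
    "Niima Outpost was a junkyard settlement on Jakku.",
]
cleaned = clean_corpus(toy_corpus)
print(len(toy_corpus), "->", len(cleaned))
for sent in cleaned:
    print("-", sent)
```

The cleaned list would then be passed to both the BM25 index and the sentence encoder, so that sparse and dense retrieval operate on the same reduced set of sentences.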

In [31]:
# k = 25 max_length=100
for question in questions:
    print("Question:", question)
    print("Answer:", agent.get_response(question))
    print("-------")

Question: Who is Luke Skywalker?


Answer: Luke Skywalker is a legendary Jedi Knight and the protagonist of the Star Wars franchise. He is known for his strength, wisdom, and courage, and has played a pivotal role in the fight against the Empire and the forces of darkness throughout the Star Wars universe.
-------
Question: Where is the Niima Outpost in Star Wars?


Answer: The Niima Outpost is located on Jakku, a desert planet in the Western Reaches of the galaxy.
-------
Question: Have you heard of Nute Gunray? Who is he?


Answer: Nute Gunray is a complex and multifaceted character in the Star Wars universe. He is a powerful Trade Federation Viceroy, known for his ambition, ruthlessness, and cunning.
-------
Question: What kind of planet is Kashyyyk, and who discovered it?


Answer: Kashyyyk is a wroshyr tree-covered forest planet located in the southwestern quadrant of the galaxy and was discovered by the Czerka Corporation.
-------
Question: Who are Condlurans, and can you give 2-3 names of known Condlurans?


Answer: The provided text describes the Condlurans as a sentient species that existed within the galaxy, with members known for their core names. Among them were Freck, Renn Tharen, and Bria, who was the granddaughter of Iaphagena.
-------
Question: What can you tell me about the First Battle of Geonosis?


Answer: The First Battle of Geonosis was the first major battle in the Clone Wars, fought on Geonosis in 22 BBY. It was the beginning of the three-year conflict between the Confederacy of Independent Systems and the Galactic Republic, resulting in significant casualties and the retreat of the Separatist army.
-------
Question: What is the name of the settlement where Anakin Skywalker and his mother lived?


Answer: The provided text states that Anakin Skywalker and his mother lived in the Slave Quarter on Tatooine.
-------
Question: Which planet did Darth Sidious represent as senator?


Answer: The provided text does not specify which planet Darth Sidious represented as a senator, so I am unable to answer this question from the provided context.
-------


Shu_explanation:
The results above show a slight improvement in detail and completeness, especially for Question 5 (Condlurans) and Question 6 (Geonosis).

Oblig3 overall reflection:
I struggled a lot with the Fox environment at first. Here is the approach I settled on to avoid environment issues later:
1. After starting a new Jupyter session on Fox, open a Fox terminal, go to the right path and run `module list` to check that the packages are correctly installed. The listing shows what is actually loaded, which helps when tracking down version conflicts.
2. The Jupyter notebook uses the default kernel, Python 3 (ipykernel), so you need to register a named kernel for the environment that contains the installed packages, using the following command:
   python -m ipykernel install --user --name oblig3_env --display-name "Python 3 (Oblig3-nlpl)"
   (Adapt the name to your needs)
3. Launch Jupyter Notebook and switch to the new kernel.

In [None]:
#Shu_reference:
###sample code from bm25s
import bm25s
import Stemmer  # optional: for stemming

# Create your corpus here
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

# optional: create a stemmer
stemmer = Stemmer.Stemmer("english")

# Tokenize the corpus and only keep the ids (faster and saves memory)
corpus_tokens = bm25s.tokenize(corpus, stopwords="en", stemmer=stemmer)

# Create the BM25 model and index the corpus
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# Query the corpus
query = "does the fish purr like a cat?"
query_tokens = bm25s.tokenize(query, stemmer=stemmer)

# Get top-k results as a tuple of (doc ids, scores). Both are arrays of shape (n_queries, k).
# To return docs instead of IDs, set the `corpus=corpus` parameter.
results, scores = retriever.retrieve(query_tokens, k=2)

for i in range(results.shape[1]):
    doc, score = results[0, i], scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}): {doc}")

# You can save the arrays to a directory...
retriever.save("animal_index_bm25")

# You can save the corpus along with the model
retriever.save("animal_index_bm25", corpus=corpus)

# ...and load them when you need them
import bm25s
reloaded_retriever = bm25s.BM25.load("animal_index_bm25", load_corpus=True)
# set load_corpus=False if you don't need the corpus

In [None]:
import torch

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Corpus with example documents
corpus = [
    "Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed.",
    "Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning.",
    "Neural networks are computing systems vaguely inspired by the biological neural networks that constitute animal brains.",
    "Mars rovers are robotic vehicles designed to travel on the surface of Mars to collect data and perform experiments.",
    "The James Webb Space Telescope is the largest optical telescope in space, designed to conduct infrared astronomy.",
    "SpaceX's Starship is designed to be a fully reusable transportation system capable of carrying humans to Mars and beyond.",
    "Global warming is the long-term heating of Earth's climate system observed since the pre-industrial period due to human activities.",
    "Renewable energy sources include solar, wind, hydro, and geothermal power that naturally replenish over time.",
    "Carbon capture technologies aim to collect CO2 emissions before they enter the atmosphere and store them underground.",
]
# Use "convert_to_tensor=True" to keep the tensors on GPU (if available)
# corpus_embeddings = embedder.encode_document(corpus, convert_to_tensor=True)
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = [
    "How do artificial neural networks work?",
    "What technology is used for modern space exploration?",
    "How can we address climate change challenges?",
]

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    # query_embedding = embedder.encode_query(query, convert_to_tensor=True)
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    # We use cosine-similarity and torch.topk to find the highest 5 scores
    similarity_scores = embedder.similarity(query_embedding, corpus_embeddings)[0]
    scores, indices = torch.topk(similarity_scores, k=top_k)

    print("\nQuery:", query)
    print("Top 5 most similar sentences in corpus:")

    for score, idx in zip(scores, indices):
        print(f"(Score: {score:.4f})", corpus[idx])