# LLMs for NASA PDS!

The goal of this notebook is to introduce you to using LLMs as a search tool for the NASA Planetary Data System (PDS). You will call a custom LLM deployment using your own API key.

In this notebook, we will be working with the old, soon deprecated version of PDS, Atlas III. However, the concepts are still relevant to Atlas IV, which will be our use-case for this project. Follow [this link](https://pds-imaging.jpl.nasa.gov/search/?fq=-ATLAS_THUMBNAIL_URL%3Abrwsnotavail.jpg&q=*%3A*) to navigate to the Atlas III website. It should look like this: 

![Atlas III screenshot](../data/atlas_3_screenshot.png)

Lots of filters on that site, huh? It might be fine for those who know about the facets (and therefore know what they are looking for), but it might not be the most accessible option. What if we were to use the search box at the top? Let's say I was a researcher who wanted to find images of 'Swiss Cheese' and 'Dark Dune' on Mars. Let's also assume in this scenario that I somehow don't know how to use the filters, so I try the search bar instead. I type in "swiss cheese and dark dunes" and get the following as a result: 

![Atlas III Search](../data/atlas_3_search.png)

Not very helpful, is it? Thankfully, for a query like this, we can open 'MRO HiRISE Image Landmarks' in the filters on the left-hand side and select the relevant facets. The desired images do exist on the website; the search bar just doesn't support this kind of query. This brings us to our use case for LLMs: **can they serve as a useful search assistant for the PDS website?** LLMs are a good candidate because they work well with unstructured, natural human language and are better at recognizing nuance. After all, no scientist is literally looking for "Swiss Cheese" on Mars; it is a term for a distinctive terrain on the planet's polar ice cap.

We can demonstrate the potential for LLMs by creating our own chatbot using the OLMo model from the Allen Institute for AI. You don't need to worry too much about calling LLMs through APIs, as we will discuss this later on :) For now, follow these instructions: 

Click on [this link](https://openrouter.ai/allenai/olmo-3.1-32b-think) to the OpenRouter website.

Scroll down to where it says "Create API key" and click on it. It should look like this: 

![Create API key](../data/create_api.png)

Sign in with your GitHub, Gmail, or whatever you choose. Once you've done that, click on "Create" and give your API key a title. 

![Create Key](../data/create_key.png)

Once you've done that, **be sure to copy and paste your key!!** You will NOT be able to access it again. 

![Key Created](../data/new_key.png)

In order to securely use your API key in your workspace, you will need to create a `.env` file at the workspace root. I have already included an `example.env` file there; all you need to do is rename it to `.env` and paste your API key where indicated. I've already added the `.env` file to the `.gitignore`, so it won't be pushed to the repository.

Now that that's all settled, **let's get started with creating our beta chatbot with OLMo!**


In [None]:
import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

if not os.getenv("OPENROUTER_API_KEY"):
    raise ValueError("Missing OPENROUTER_API_KEY. Set it in your .env file.")

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_API_KEY")
)
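Before moving on, it can be handy to confirm that the key was picked up without printing it in full (notebook outputs are easy to commit by accident). The helper below is a minimal sketch; `mask_key` is a hypothetical name, not part of `openai` or `python-dotenv`, and the example value is a made-up placeholder.

```python
def mask_key(key, visible=4):
    """Return a display-safe version of an API key, showing only the last few characters."""
    if not key:
        return "<missing>"
    if len(key) <= visible:
        # Too short to reveal anything safely, so hide all of it.
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]

# Example with a made-up placeholder value, not a real key.
print(mask_key("sk-example-1234"))
```

In the notebook you could call `mask_key(os.getenv("OPENROUTER_API_KEY"))` right after `load_dotenv()`.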

## Part 1: LLM only (no retrieval)

In this first pass, we ask the model to map a natural-language query to likely Atlas III facets.

For a given query, the output will simply be a list of filters. Feel free to swap out the `query` variable and try a few queries like:
- "Find swiss cheese terrain near the south polar region."
- "I want images of dark dunes and impact ejecta."
- "Show me slope streaks on Mars."

In [None]:
MODEL = "allenai/olmo-3.1-32b-think"

# A system prompt is a set of instructions that guides the behavior of the language model. 
# It helps the model understand the context and the expected output format for a given task. 
# In this case, the system prompt instructs the model to act as a PDS search assistant for Atlas III 
# and to return relevant filters based on user queries about Mars images.

BASELINE_SYSTEM_PROMPT = """
You are a PDS search assistant for Atlas III.
Given a user query, return the relevant filter(s) from the following list:

Bright dune, crater, dark dune, other, slope streak, impact ejecta, spider, swiss cheese.
"""

def ask_llm_only(query):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": BASELINE_SYSTEM_PROMPT},
            {"role": "user", "content": query},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()

query = "Find Mars images of swiss cheese terrain and dark dunes."
print(ask_llm_only(query))

swiss cheese, dark dune
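The reply is plain text, so if you wanted to feed it into an actual facet filter you would need to turn it back into a list and drop anything that is not a known facet. Here is a rough sketch, assuming the model answers with comma-separated names as in the output above. `KNOWN_FACETS` and `parse_facets` are hypothetical helpers that simply mirror the list in the system prompt.

```python
import re

# Facet names copied from the baseline system prompt, lowercased.
KNOWN_FACETS = (
    "bright dune", "crater", "dark dune", "other",
    "slope streak", "impact ejecta", "spider", "swiss cheese",
)

def parse_facets(reply):
    """Split a text reply on commas, semicolons, or newlines and keep only known facets."""
    found = []
    for piece in re.split(r"[,;\n]+", reply.lower()):
        name = piece.strip(" .\"'")
        if name in KNOWN_FACETS and name not in found:
            found.append(name)
    return found

print(parse_facets("swiss cheese, dark dune"))
```

Anything the model adds beyond the facet names (an explanation, an invented class) is silently dropped, which is one cheap way to guard against off-list answers.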


## Part 2: Simple RAG over landform classes

In that first example, the model read the query and most likely reasoned that the relevant filters were "swiss cheese" and "dark dune", which is correct! If you swapped in some of the other queries, how did it do? Was it better or worse than the search engine on the PDS site?

We're already doing much better than our original example, but you might be thinking: "This is just doing the same thing a traditional search engine could do. Aren't we cracking a nut with a sledgehammer by using an LLM?"

That would be correct. Most traditional search engines (like Google's) should be able to handle queries where the keywords are already present. Heck, even a simple regex-matching program might do. If we only cared about simple queries that contain the keywords, using the computational power of a large language model would be a waste.
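To make the point concrete, here is what such a regex matcher could look like. This is only an illustrative sketch (the `FACETS` list and `keyword_match` function are hypothetical and not part of Atlas III), but it handles the explicit query from Part 1 and comes back empty for a query that describes a feature without naming it.

```python
import re

# Facet names copied from the baseline system prompt, lowercased.
FACETS = [
    "bright dune", "crater", "dark dune", "other",
    "slope streak", "impact ejecta", "spider", "swiss cheese",
]

def keyword_match(query):
    """Return facets whose name (optionally pluralized with 's') appears in the query."""
    q = query.lower()
    hits = []
    for facet in FACETS:
        pattern = r"\b" + re.escape(facet) + r"s?\b"
        if re.search(pattern, q):
            hits.append(facet)
    return hits

# Keywords present: the matcher finds both facets.
print(keyword_match("Find Mars images of swiss cheese terrain and dark dunes."))
# Keywords absent: nothing matches, even though a human would know what is meant.
print(keyword_match("Images of defrosted dunes on Mars"))
```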

The problem with planetary science archives like PDS is that users may not know the exact keywords they need to search for. What if a scientist wants to find images of a particular feature on Mars but doesn't know the class name the archive uses for it?

Now we ground the model with a tiny retrieval step using a short list of landform classes in `tutorial/landform_classes.txt`.
This keeps the RAG example focused and easy to follow.

Try a few queries like:
- "Find swiss cheese terrain near the south polar region."
- "I want images of dark dunes and impact ejecta."
- "Show me slope streaks on Mars."

In [None]:
import re
from collections import Counter

def normalize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

# This is the context for the LLM to use in the RAG approach. It is built from the landform classes and their descriptions.
# Sort of like a "reference page" so the model can stick to the domain-specific info and not hallucinate based on its training data.

def load_landform_classes(path="data/landform_classes.txt"):
    classes = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if ":" not in line:
                continue
            name, desc = line.split(":", 1)
            classes.append({
                "name": name.strip(),
                "description": desc.strip(),
            })
    return classes

# Basic logic for turning each landform class into a document with a token set.

def build_docs(classes):
    docs = []
    for item in classes:
        text = f"{item['name']} {item['description']}"
        tokens = set(normalize(text))
        docs.append({
            "name": item["name"],
            "description": item["description"],
            "tokens": tokens,
        })
    return docs

# this function retrieves the most relevant landform classes based on token overlap with the query. 
# It's a simple bag-of-words approach, but it helps the LLM focus on the most relevant context when we do RAG.

def retrieve(query, docs, top_k=4):
    q_tokens = Counter(normalize(query))
    scored = []
    for doc in docs:
        score = sum(q_tokens[t] for t in doc["tokens"] if t in q_tokens)
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def format_context(docs):
    parts = []
    for doc in docs:
        parts.append(f"Class: {doc['name']} | Description: {doc['description']}")
    return "\n".join(parts)

# this is the new instruction set, telling the LLM to use the retrieved context to make a more informed suggestion about the landform class.

RAG_SYSTEM_PROMPT = """
You are a PDS search assistant for Atlas III.
Use ONLY the provided context to suggest landform classes.
Return a JSON-like list with objects: {class, reason}.
If the context is insufficient, say you need more info.
"""

classes = load_landform_classes()
docs = build_docs(classes)

def ask_llm_rag(query):
    top_docs = retrieve(query, docs, top_k=4)
    context = format_context(top_docs)
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": RAG_SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuery: {query}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()

query = "Find swiss cheese terrain near the south polar region."
print(ask_llm_rag(query))

{
  "results": [
    {
      "class": "Swiss cheese",
      "reason": "The query explicitly requests 'Swiss cheese terrain', which is a defined class characterized by pits formed by sublimation of ice. The context does not exclude this class from the south polar region."
    }
  ]
}


## Part 2b: Implicit class queries (why RAG helps)

Some queries never mention a class name directly. For example, a user might describe a feature ("defrosted dunes") without saying "dark dune."
The RAG context lets the model map those implicit descriptions to the closest class.
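To see why the retrieval step can pick up an implicit query, it helps to look at the token overlap directly. The sketch below re-implements the same bag-of-words scoring in a self-contained form. The three descriptions in `SAMPLE_CLASSES` are made up for illustration (loosely based on the model replies shown in this notebook) and are an assumption, not the actual contents of the landform classes file.

```python
import re
from collections import Counter

# Hypothetical class descriptions for illustration only; the real text lives in
# the landform classes file loaded earlier.
SAMPLE_CLASSES = {
    "Dark dune": "Dunes that are completely defrosted",
    "Swiss cheese": "Terrain with pits formed by sublimation of ice",
    "Slope streak": "Dark flow-like features on slopes",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_scores(query):
    """Score each class by how many query tokens appear in its name or description."""
    q_tokens = Counter(tokenize(query))
    scores = {}
    for name, desc in SAMPLE_CLASSES.items():
        doc_tokens = set(tokenize(f"{name} {desc}"))
        shared = sorted(t for t in doc_tokens if t in q_tokens)
        scores[name] = (sum(q_tokens[t] for t in shared), shared)
    return scores

for name, (score, shared) in overlap_scores("Images of defrosted dunes on Mars").items():
    print(f"{name}: score={score}, shared={shared}")
```

With these sample descriptions, "Dark dune" wins because it shares "defrosted" and "dunes" with the query. The other two classes still get a nonzero score from filler words like "of" and "on", which is a known weakness of plain token overlap; a stopword list or embedding-based retrieval would be the usual next step.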

In [None]:
implicit_queries = [
    "Images of defrosted dunes on Mars",
    "Pits formed by ice sublimation near the poles",
    "Dark flow-like features on slopes",
]

for q in implicit_queries:
    print("Query:", q)
    print("LLM only:")
    print(ask_llm_only(q))
    print("RAG:")
    print(ask_llm_rag(q))
    print("-" * 40)

Query: Images of defrosted dunes on Mars
LLM only:
dark dune
RAG:
{
  "suggestions": [
    {
      "class": "Dark dune",
      "reason": "The context explicitly defines 'Dark dune' as 'Dunes that are completely defrosted on Mars,' which directly matches the query's focus on 'defrosted dunes.'"
    }
  ]
}
----------------------------------------
Query: Pits formed by ice sublimation near the poles
LLM only:
swiss cheese
RAG:
{
  "suggestions": [
    {
      "class": "Swiss cheese",
      "reason": "Terrain with pits formed by sublimation of ice directly matches the description of pits formed by ice sublimation near the poles."
    }
  ]
}
----------------------------------------
Query: Dark flow-like features on slopes
LLM only:
slope streak
RAG:
{
  "suggestions": [
    {
      "class": "Slope streak",
      "reason": "The context explicitly describes 'Slope streak' as features formed by dark flow-like features on slopes, directly matching the query's description."
    }
  ]
}
-------
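Notice that the RAG replies are JSON, but the top-level key is not stable: one run used `"results"` and the others used `"suggestions"`. If you want to consume these replies in code, a tolerant parser helps. The following is a minimal sketch under the assumption that the reply is a bare JSON object or list. `extract_classes` is a hypothetical helper, and it does not handle replies wrapped in markdown code fences or mixed with extra prose.

```python
import json

def extract_classes(reply):
    """Pull the 'class' values out of a JSON reply, whatever the top-level key is called."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        # Not valid JSON, e.g. the model said it needs more info.
        return []
    if isinstance(data, dict):
        # Take the first list-valued field ("results", "suggestions", ...).
        items = next((v for v in data.values() if isinstance(v, list)), [])
    elif isinstance(data, list):
        items = data
    else:
        items = []
    return [item["class"] for item in items if isinstance(item, dict) and "class" in item]

print(extract_classes('{"suggestions": [{"class": "Dark dune", "reason": "example"}]}'))
print(extract_classes('{"results": [{"class": "Swiss cheese", "reason": "example"}]}'))
```

A stricter alternative would be to pin the exact schema in the system prompt and reject any reply that deviates from it.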