Query Flow
- Determine coordinate & species (seperate problem)
- OSM API of coordinate
- DeepSeek query (prompt specific to coordinate & species)

In [1]:
%load_ext autoreload
%autoreload 2

In [12]:
import pandas as pd
import numpy as np
from generate_geollm_prompts_with_csv import get_prompts


In [24]:
# randomly generate 100 coordinates across STL
np.random.seed(42)

# Define bounding box limits
min_lon, max_lon, min_lat, max_lat = -90.68099, -90.09099, 38.45601, 38.88601
random_lons = np.random.uniform(min_lon, max_lon, 100)
random_lats = np.random.uniform(min_lat, max_lat, 100)

coordinates_df = pd.DataFrame({
    'Longitude': random_lons,
    'Latitude': random_lats
})

# generate some random species among those seen in STL. Sampling will be done on [Eastern Gray Squirrel (Sciurus carolinensis), Northern Cardinal (Cardinalis cardinalis), Monarch Butterfly (Danaus plexippus),  White Oak tree (Quercus alba)] 
# species = ["White Oak Tree", "Northern Cardinal", "Monarch Butterfly", "Eastern Gray Squirrel"]
species = ["Eastern Gray Squirrel"]
assigned_species = np.random.choice(species, size=100, replace=True)
coordinates_df['Species'] = assigned_species

coordinates_df.head()

Unnamed: 0,Longitude,Latitude,Species
0,-90.460011,38.469525,Eastern Gray Squirrel
1,-90.120069,38.729666,Eastern Gray Squirrel
2,-90.249114,38.591183,Eastern Gray Squirrel
3,-90.327781,38.674695,Eastern Gray Squirrel
4,-90.588939,38.846264,Eastern Gray Squirrel


****
Uniform random sampling of species. 1% of the total # of samples.
- Randomly across the space
- BAsed on unoconditional distribution

In [8]:
# prompts = get_prompts(coordinates =  list(zip(coordinates_df['Latitude'], coordinates_df['Longitude'])), 
#             output_file = 'sample_prompts.jsonl')

***

In [1]:
%load_ext autoreload
%autoreload 2

In [None]:
import json
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig

from llm_prompts import squirrel_prompt, squirrel_prompt_nocontext

file_path = 'sample_prompts.jsonl'

model_id = "meta-llama/Meta-Llama-3.1-8B"
# model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
# model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
# model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
# model_id = "deepseek-ai/deepseek-llm-7b-base"

# model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    device_map="cuda:0", 
    # device_map="auto",
    torch_dtype=torch.bfloat16,  
    token = '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')

tokenizer = AutoTokenizer.from_pretrained(model_id, token = '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')


  from .autonotebook import tqdm as notebook_tqdm
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  2.31it/s]


In [2]:
# In-context learning works best using "meta-llama/Meta-Llama-3.1-8B-Instruct"
# Let's do 100, visualize and see if it is reasonable.

In [5]:
responses = {}
with open(file_path, "r", encoding="utf-8") as file:
    for line in file:
        idx = int(line.split('"text": ')[0].split('"index": ')[-1].split(", ")[0])
        species = "Eastern Gray Squirrel"

        # make it species specific
        amended_prompt = line.split('"text": ')[-1].replace("TASK", f"How likely are you to find a {species} here (On a Scale from 0.0 to 9.9): ")

        # add to in-context prompting
        # input_prompt = squirrel_prompt.replace("{NEW_LOCATION}", amended_prompt)
        # input_prompt = squirrel_prompt_nocontext.replace("{NEW_LOCATION}", amended_prompt)

        input_prompt = amended_prompt
        

        input_ids = tokenizer(input_prompt, return_tensors="pt").to("cuda")
        output = model.generate(**input_ids, max_new_tokens=200, do_sample=False)
        response = (tokenizer.decode(output[0], skip_special_tokens=True)).split(input_prompt)[-1]

        with open("./llm-responses/llama8B_instruct.jsonl", "a") as file:
            file.write(json.dumps({"index": idx, "text": response}) + "\n")

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Settin

KeyboardInterrupt: 

In [7]:
import pandas as pd

In [8]:
# get just the responses out
output_file = "/data/cher/geollm-bias/playground/llm-responses/Eastern Gray Squirrel/expert-incontext3a-Meta-Llama-3.1-8B-Instruct.jsonl"
import re

def extract_number(text):
    match = re.search(r'\\boxed\{([\d.]+)\}', text)
    return float(match.group(1)) if match else None

results = {}
with open(output_file, "r", encoding="utf-8") as file:
        for line in file:
            idx = int(line.split('"text": ')[0].split('"index": ')[-1].split(", ")[0])
            text = line.split('"text": ')[-1]

            number = extract_number(text)

            results[idx] = number
              
df = pd.DataFrame(list(results.items()), columns=['index', 'prediction'])



merge outputs & coordinates for mapping

In [28]:
llm_preds = pd.read_csv("/data/cher/geollm-bias/playground/llm-responses/Eastern Gray Squirrel/expert-incontext3a-Meta-Llama-3.1-8B-Instruct-numeric.csv")
llm_sim_deep = pd.read_csv("/data/cher/geollm-bias/playground/llm-sim/Eastern Gray Squirrel/cos-sim-DeepSeek-R1-Distill-Llama-8B.csv", usecols=['similarity', 'normalized_similarity'])
llm_sim_llama = pd.read_csv("/data/cher/geollm-bias/playground/llm-sim/Eastern Gray Squirrel/cos-sim-Meta-Llama-3.1-8B.csv", usecols=['similarity', 'normalized_similarity'])

llm_sim_deep.rename(columns = {'similarity' : 'similarity_deep', 
                               'normalized_similarity' : 'normalized_similarity_deep'},
                               inplace = True)

In [29]:
coordinates_df_w_preds = coordinates_df.merge(llm_preds, left_index = True, right_index = True)
coordinates_df_w_preds = coordinates_df_w_preds.merge(llm_sim_deep, left_index = True, right_index = True)
coordinates_df_w_preds = coordinates_df_w_preds.merge(llm_sim_llama, left_index = True, right_index = True)


In [30]:
llm_sim_llama.columns

Index(['similarity', 'normalized_similarity'], dtype='object')

In [31]:
coordinates_df_w_preds

Unnamed: 0,Longitude,Latitude,Species,index,prediction,similarity_deep,normalized_similarity_deep,similarity,normalized_similarity
0,-90.460011,38.469525,Eastern Gray Squirrel,0,6.2,0.302512,0.762780,0.424993,0.631544
1,-90.120069,38.729666,Eastern Gray Squirrel,1,,0.295081,0.610848,0.409088,0.440569
2,-90.249114,38.591183,Eastern Gray Squirrel,2,,0.272117,0.141323,0.378685,0.075522
3,-90.327781,38.674695,Eastern Gray Squirrel,3,,0.282668,0.357053,0.392791,0.244897
4,-90.588939,38.846264,Eastern Gray Squirrel,4,6.5,0.312551,0.968031,0.455680,1.000000
...,...,...,...,...,...,...,...,...,...
95,-90.389651,38.606170,Eastern Gray Squirrel,95,,0.278095,0.263563,0.391106,0.224660
96,-90.372578,38.768171,Eastern Gray Squirrel,96,,0.279794,0.298281,0.394309,0.263113
97,-90.428741,38.841767,Eastern Gray Squirrel,97,6.2,0.293075,0.569841,0.424850,0.629828
98,-90.665993,38.837457,Eastern Gray Squirrel,98,,0.285357,0.412040,0.417615,0.542957


New idea: Similarity b/w [in-context prompt vector, species vector]
- Pass each species into the LLM, and get the output vector out?
- Pass an in-context prompt that is not specific to species & get vector out?

Not complete idea. Needs to be refined.


#### OH! Do this!!!
Pass the following prompts:
- Think about what species would be found here + COORDINATE
- species

In [None]:
file_path = 'sample_prompts.jsonl'

with open(file_path, "r", encoding="utf-8") as file:
    for line in file:
        idx = int(line.split('"text": ')[0].split('"index": ')[-1].split(", ")[0])
        species = "Eastern Gray Squirrel"

        # make it species specific
        amended_prompt = line.split('"text": ')[-1].replace("TASK", f"How likely are you to find a {species} here (On a Scale from 0.0 to 9.9): ")


In [None]:
coordinate_info = line.split('"text": ')[-1].split('<TASK>')[0]

In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity

from llm_prompts import coordinate_prompt

# Load the model and tokenizer
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    token=XXX
)
tokenizer = AutoTokenizer.from_pretrained(model_id, token=XXX)

  from .autonotebook import tqdm as notebook_tqdm
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.88s/it]
We've detected an older driver with an RTX 4000 series GPU. These drivers have issues with P2P. This can affect the multi-gpu inference when using accelerate device_map.Please make sure to update your driver to the latest version which resolves this.


In [None]:
# import pandas as pd
# import math
# unique_species = pd.read_csv('unique_species.csv')['CommonNames'].unique()
# species_list = [x for x in unique_species if x not in [np.nan, 'nan']]

In [15]:
# Function to get the output vector from the model
def get_output_vector(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
        # Use the hidden states of the last layer as the output vector
        last_hidden_state = outputs.hidden_states[-1]
        # Average over the sequence dimension to get a single vector
        output_vector = last_hidden_state.mean(dim=1).cpu().numpy()
    return output_vector

# Function to get the hidden state vectors from the model
def get_hidden_states(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
        # Use the last hidden states
        last_hidden_state = outputs.hidden_states[-1]  # Shape: [batch_size, seq_length, hidden_dim]
    return last_hidden_state.squeeze(0)  # Remove batch dimension

# Function to compute the dot product matrix and its L2 norm
def compute_dot_product_l2_norm(prompt1, prompt2):
    vec1 = get_hidden_states(prompt1)  # Shape: [seq_length1, hidden_dim]
    vec2 = get_hidden_states(prompt2)  # Shape: [seq_length2, hidden_dim]
    
    # Compute dot product between all token pairs
    dot_product_matrix = torch.matmul(vec1, vec2.T)  # Shape: [seq_length1, seq_length2]
    
    # Compute L2 norm (Frobenius norm) of the dot product matrix
    l2_norm = torch.norm(dot_product_matrix, p='fro')
    
    return dot_product_matrix, l2_norm.item()

species_dict = {"White Oak Tree" : "White Oak Tree",
                "Northern Cardinal" :  "Northern Cardinal",
                "Eastern Gray Squirrel" :"Eastern Gray Squirrel",
                "Cheetah" : "Cheetah",
                'Elephant' : 'Elephant'
                }

in_context_prompt = coordinate_prompt.replace("{NEW_LOCATION}", coordinate_info)

# Get the in-context prompt vector
in_context_vector = get_output_vector(in_context_prompt)

# Process each species and compute similarity
similarities = {}
for species, desc in species_dict.items():

    prompt = f"You are an expert bioligist. Think about where you would find {desc}"

    # Compute cosine similarity between in-context vector and species vector
    species_vector = get_output_vector(prompt) #species
    similarity = cosine_similarity(in_context_vector, species_vector)[0][0]

    # compute l2norm
    l2_norm = compute_dot_product_l2_norm(in_context_prompt, prompt)[-1]

    similarities[species] = [similarity, l2_norm]

# Print the results
print("Similarities between in-context prompt and species:")
for species, similarity in similarities.items():
    print(f"{species}: {similarity:}")

Similarities between in-context prompt and species:
White Oak Tree: [0.27766186, 101283.171875]
Northern Cardinal: [0.26130337, 97332.4453125]
Eastern Gray Squirrel: [0.2741548, 99846.0390625]
Cheetah: [0.23687367, 92885.875]
Elephant: [0.23697752, 92168.2578125]


In [23]:
# Run the cosine similarity for 100 locations & Eastern Gray Squirrel + normalize
import pandas as pd
df = pd.read_json('', lines = True)

What didn't work. Using just the last hidden state
- Adding wiki of animals
- Just passing species


In [65]:
values = np.array(list(similarities.values()))

# Compute softmax
softmax_probs = np.exp(values) / np.sum(np.exp(values))

In [None]:

# Example species and in-context prompts
# species_dict = {"White Oak Tree" : "Quercus alba, the white oak, is one of the preeminent hardwoods of eastern and central North America. It is a long-lived oak, native to eastern and central North America and found from Minnesota, Ontario, Quebec, and southern Maine south as far as northern Florida and eastern Texas.[3] Specimens have been documented to be over 450 years old.[4] Although called a white oak, it is very unusual to find an individual specimen with white bark; the usual colour is a light gray. The name comes from the colour of the finished wood. In the forest it can reach a magnificent height and in the open it develops into a massive broad-topped tree with large branches striking out at wide angles",
#                 "Northern Cardinal" : "The northern cardinal (Cardinalis cardinalis), known colloquially as the common cardinal, red cardinal, or just cardinal, is a bird in the genus Cardinalis. It can be found in southeastern Canada, through the eastern United States from Maine to Minnesota to Texas, New Mexico, southern Arizona, southern California and south through Mexico, Belize, and Guatemala. It is also an introduced species in a few locations such as Bermuda and all major islands of Hawaii since its introduction in 1929. Its habitat includes woodlands, gardens, shrublands, and wetlands. It is the state bird of Illinois, Indiana, Kentucky, North Carolina, Ohio, Virginia, and West Virginia. The northern cardinal is a mid-sized perching songbird with a body length of 21–23 cm (8.3–9.1 in) and a crest on the top of the head. The species expresses sexual dimorphism: Females are a reddish olive color, and have a gray mask around the beak, while males are a vibrant red color, and have a black mask on the face, as well as a larger crest. Juvenile cardinals do not have the distinctive red-orange beak seen in adult birds until they are almost fully mature. On hatching, their beaks are grayish-black and they do not become the trademark orange-red color until they acquire their final adult plumage in the fall.[2] The northern cardinal is mainly granivorous but also feeds on insects and fruit. The male behaves territorially, marking out his territory with song. During courtship, the male feeds seed to the female beak-to-beak. The northern cardinal's clutch typically contains three to four eggs, with two to four clutches produced each year. It was once prized as a pet, but its sale was banned in the United States by the Migratory Bird Treaty Act of 1918. ",
#                 "Eastern Gray Squirrel" : "The eastern gray squirrel (Sciurus carolinensis), also known, particularly outside of the United States, as simply the grey squirrel, is a tree squirrel in the genus Sciurus. It is native to eastern North America, where it is the most prodigious and ecologically essential natural forest regenerator.[6][7] Widely introduced to certain places around the world, the eastern gray squirrel in Europe, in particular, is regarded as an invasive species. In Europe, Sciurus carolinensis is included since 2016 in the list of Invasive Alien Species of Union concern (the Union list).[8] This implies that this species cannot be imported, bred, transported, commercialized, or intentionally released into the environment in the whole of the European Union.[9]",
#                 "Cheetah" : "The cheetah (Acinonyx jubatus) is a large cat and the fastest land animal. It has a tawny to creamy white or pale buff fur that is marked with evenly spaced, solid black spots. The head is small and rounded, with a short snout and black tear-like facial streaks. It reaches 67–94 cm (26–37 in) at the shoulder, and the head-and-body length is between 1.1 and 1.5 m (3 ft 7 in and 4 ft 11 in). Adults weigh between 21 and 65 kg (46 and 143 lb). The cheetah is capable of running at 93 to 105 km/h (58 to 65 mph); it has evolved specialized adaptations for speed, including a light build, long thin legs and a long tail. The cheetah was first described in the late 18th century. Four subspecies are recognised today that are native to Africa and central Iran. An African subspecies was introduced to India in 2022. It is now distributed mainly in small, fragmented populations in northwestern, eastern and southern Africa and central Iran. It lives in a variety of habitats such as savannahs in the Serengeti, arid mountain ranges in the Sahara, and hilly desert terrain. The cheetah lives in three main social groups: females and their cubs, male 'coalitions', and solitary males. While females lead a nomadic life searching for prey in large home ranges, males are more sedentary and instead establish much smaller territories in areas with plentiful prey and access to females. The cheetah is active during the day, with peaks during dawn and dusk. It feeds on small- to medium-sized prey, mostly weighing under 40 kg (88 lb), and prefers medium-sized ungulates such as impala, springbok and Thomson's gazelles. The cheetah typically stalks its prey within 60–100 m (200–330 ft) before charging towards it, trips it during the chase and bites its throat to suffocate it to death. It breeds throughout the year. After a gestation of nearly three months, females give birth to a litter of three or four cubs. Cheetah cubs are highly vulnerable to predation by other large carnivores. They are weaned at around four months and are independent by around 20 months of age. The cheetah is threatened by habitat loss, conflict with humans, poaching and high susceptibility to diseases. The global cheetah population was estimated in 2021 at 6,517; it is listed as Vulnerable on the IUCN Red List. It has been widely depicted in art, literature, advertising, and animation. It was tamed in ancient Egypt and trained for hunting ungulates in the Arabian Peninsula and India. It has been kept in zoos since the early 19th century.",
                # 'Elephant' : "Elephants are the largest living land animals. Three living species are currently recognised: the African bush elephant (Loxodonta africana), the African forest elephant (L. cyclotis), and the Asian elephant (Elephas maximus). They are the only surviving members of the family Elephantidae and the order Proboscidea; extinct relatives include mammoths and mastodons. Distinctive features of elephants include a long proboscis called a trunk, tusks, large ear flaps, pillar-like legs, and tough but sensitive grey skin. The trunk is prehensile, bringing food and water to the mouth and grasping objects. Tusks, which are derived from the incisor teeth, serve both as weapons and as tools for moving objects and digging. The large ear flaps assist in maintaining a constant body temperature as well as in communication. African elephants have larger ears and concave backs, whereas Asian elephants have smaller ears and convex or level backs. Elephants are scattered throughout sub-Saharan Africa, South Asia, and Southeast Asia and are found in different habitats, including savannahs, forests, deserts, and marshes. They are herbivorous, and they stay near water when it is accessible. They are considered to be keystone species, due to their impact on their environments. Elephants have a fission–fusion society, in which multiple family groups come together to socialise. Females (cows) tend to live in family groups, which can consist of one female with her calves or several related females with offspring. The leader of a female group, usually the oldest cow, is known as the matriarch. Males (bulls) leave their family groups when they reach puberty and may live alone or with other males. Adult bulls mostly interact with family groups when looking for a mate. They enter a state of increased testosterone and aggression known as musth, which helps them gain dominance over other males as well as reproductive success. Calves are the centre of attention in their family groups and rely on their mothers for as long as three years. Elephants can live up to 70 years in the wild. They communicate by touch, sight, smell, and sound; elephants use infrasound and seismic communication over long distances. Elephant intelligence has been compared with that of primates and cetaceans. They appear to have self-awareness, and possibly show concern for dying and dead individuals of their kind. African bush elephants and Asian elephants are listed as endangered and African forest elephants as critically endangered on the IUCN Red Lists. One of the biggest threats to elephant populations is the ivory trade, as the animals are poached for their ivory tusks. Other threats to wild elephants include habitat destruction and conflicts with local people. Elephants are used as working animals in Asia. In the past, they were used in war; today, they are often controversially put on display in zoos, or employed for entertainment in circuses. Elephants have an iconic status in human culture and have been widely featured in art, folklore, religion, literature, and popular culture."}