<a href="https://colab.research.google.com/github/dimstavkos/RAGThesis/blob/main/RAG_GeoSpatial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Install Libraries

In [1]:
!pip install datasets pandas pymongo sentence_transformers
!pip install transformers torch
!pip install python-dotenv

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting pymongo
  Downloading pymongo-4.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence_transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64

#Connect to Google Drive

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import os

# Check files in directory RAG
os.listdir('/content/drive/My Drive/RAG')

['AthensPOI.geojson', 'listings.csv.gz', 'listings.csv', 'Secrets']

#Data Sourcing

In [4]:
from datasets import load_dataset
import pandas as pd

dataset_path = '/content/drive/My Drive/RAG/listings.csv'
df = pd.read_csv(dataset_path)


df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,10595,https://www.airbnb.com/rooms/10595,20240626035544,2024-06-26,city scrape,"3 bedrooms, 2 bathrooms, 2nd floor with elevator",The apartment is 3-bedroom apartment with 2-ba...,Ampelokipi district is nice multinational and ...,https://a0.muscache.com/pictures/hosting/Hosti...,37177,...,4.9,4.58,4.75,2433180,t,7,7,0,0,0.33
1,10990,https://www.airbnb.com/rooms/10990,20240626035544,2024-06-26,city scrape,Athens Quality Apartments - Deluxe Apartment,Athens Quality Apartments - Deluxe apartment i...,Ampelokipi district is nice multinational and ...,https://a0.muscache.com/pictures/miso/Hosting-...,37177,...,4.9,4.81,4.75,2433169,t,7,7,0,0,0.54
2,10993,https://www.airbnb.com/rooms/10993,20240626035544,2024-06-26,city scrape,Athens Quality Apartments - Studio,The Studio is an <br />-excellent located <br ...,Ampelokipi district is nice multinational and ...,https://a0.muscache.com/pictures/107309527/848...,37177,...,4.98,4.82,4.78,2433010,t,7,7,0,0,0.67
3,10995,https://www.airbnb.com/rooms/10995,20240626035544,2024-06-26,city scrape,"AQA-No2 1-bedroom, smart tv, fiber connection,","AQA No2 is 1-bedroom apartment (47m2), on the ...",Ampelokipi district is nice multinational and ...,https://a0.muscache.com/pictures/hosting/Hosti...,37177,...,4.87,4.83,4.8,2433153,t,7,7,0,0,0.19
4,27262,https://www.airbnb.com/rooms/27262,20240626035544,2024-06-26,city scrape,Athens Quality Apartments - Ground floor apart...,THE MATTRESS - KING KOIL - Camden Luxury 160x2...,,https://a0.muscache.com/pictures/miso/Hosting-...,37177,...,4.96,4.75,4.71,2433111,t,7,7,0,0,0.17


Preprocess Airbnb Dataset

In [5]:
import re
from sklearn.preprocessing import LabelEncoder

def clean_text(text):
  if isinstance(text, str):
    text = re.sub(r'<br\s*/?>', ' ', text)  # Replace <br> tags with a space
    text = re.sub(r'<.*?>', '', text)  # Remove any remaining HTML tags
    return text.strip()
  return text

#Apply to every string element in the DataFrame
df = df.apply(lambda col: col.apply(lambda x: clean_text(x) if isinstance(x, str) else x) if col.dtype == 'O' else col)


df.head()


Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,10595,https://www.airbnb.com/rooms/10595,20240626035544,2024-06-26,city scrape,"3 bedrooms, 2 bathrooms, 2nd floor with elevator",The apartment is 3-bedroom apartment with 2-ba...,Ampelokipi district is nice multinational and ...,https://a0.muscache.com/pictures/hosting/Hosti...,37177,...,4.9,4.58,4.75,2433180,t,7,7,0,0,0.33
1,10990,https://www.airbnb.com/rooms/10990,20240626035544,2024-06-26,city scrape,Athens Quality Apartments - Deluxe Apartment,Athens Quality Apartments - Deluxe apartment i...,Ampelokipi district is nice multinational and ...,https://a0.muscache.com/pictures/miso/Hosting-...,37177,...,4.9,4.81,4.75,2433169,t,7,7,0,0,0.54
2,10993,https://www.airbnb.com/rooms/10993,20240626035544,2024-06-26,city scrape,Athens Quality Apartments - Studio,The Studio is an -excellent located -close t...,Ampelokipi district is nice multinational and ...,https://a0.muscache.com/pictures/107309527/848...,37177,...,4.98,4.82,4.78,2433010,t,7,7,0,0,0.67
3,10995,https://www.airbnb.com/rooms/10995,20240626035544,2024-06-26,city scrape,"AQA-No2 1-bedroom, smart tv, fiber connection,","AQA No2 is 1-bedroom apartment (47m2), on the ...",Ampelokipi district is nice multinational and ...,https://a0.muscache.com/pictures/hosting/Hosti...,37177,...,4.87,4.83,4.8,2433153,t,7,7,0,0,0.19
4,27262,https://www.airbnb.com/rooms/27262,20240626035544,2024-06-26,city scrape,Athens Quality Apartments - Ground floor apart...,THE MATTRESS - KING KOIL - Camden Luxury 160x2...,,https://a0.muscache.com/pictures/miso/Hosting-...,37177,...,4.96,4.75,4.71,2433111,t,7,7,0,0,0.17
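Replacing `<br />` with a space can leave runs of consecutive spaces where a tag was already surrounded by spaces. The sketch below shows one possible variant that also collapses whitespace. `clean_text_collapsed` is a hypothetical name and is not part of the original notebook; the sample string is made up for illustration.

```python
import re

def clean_text_collapsed(text):
    """Hypothetical variant of clean_text that also collapses repeated whitespace."""
    if not isinstance(text, str):
        return text  # Leave NaN and numeric values untouched
    text = re.sub(r'<br\s*/?>', ' ', text)  # Replace <br> tags with a space
    text = re.sub(r'<.*?>', '', text)       # Remove any remaining HTML tags
    text = re.sub(r'\s+', ' ', text)        # Collapse runs of whitespace
    return text.strip()

# Made-up sample in the style of the listing descriptions
sample = "The Studio is an <br />-excellent located <br />-close to <b>metro</b>"
print(clean_text_collapsed(sample))
```

Non-string values pass through unchanged, so the function could be applied to object columns in the same way as `clean_text`.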


Select Needed Fields

In [6]:
# Important fields
df_selected = df[[
    'id', 'listing_url', 'description', 'name', 'neighborhood_overview',
    'neighbourhood', 'neighbourhood_cleansed', 'latitude', 'longitude',
    'property_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds',
    'amenities', 'price', 'review_scores_rating'
]]

# Drop rows with missing values in important fields
df_selected = df_selected.dropna(subset=['description', 'name', 'neighborhood_overview', 'latitude', 'longitude', 'amenities'])

df_selected.head()

Unnamed: 0,id,listing_url,description,name,neighborhood_overview,neighbourhood,neighbourhood_cleansed,latitude,longitude,property_type,accommodates,bathrooms,bedrooms,beds,amenities,price,review_scores_rating
0,10595,https://www.airbnb.com/rooms/10595,The apartment is 3-bedroom apartment with 2-ba...,"3 bedrooms, 2 bathrooms, 2nd floor with elevator",Ampelokipi district is nice multinational and ...,"Athens, Attica, Greece",ΑΜΠΕΛΟΚΗΠΟΙ,37.98863,23.76527,Entire condo,10,2.0,3.0,7.0,"[""Ethernet connection"", ""Room-darkening shades...",$108.00,4.86
1,10990,https://www.airbnb.com/rooms/10990,Athens Quality Apartments - Deluxe apartment i...,Athens Quality Apartments - Deluxe Apartment,Ampelokipi district is nice multinational and ...,"Athens, Attica, Greece",ΑΜΠΕΛΟΚΗΠΟΙ,37.98903,23.76448,Entire rental unit,4,1.0,1.0,1.0,"[""Ethernet connection"", ""43 inch HDTV"", ""Priva...",$136.00,4.82
2,10993,https://www.airbnb.com/rooms/10993,The Studio is an -excellent located -close t...,Athens Quality Apartments - Studio,Ampelokipi district is nice multinational and ...,"Athens, Attica, Greece",ΑΜΠΕΛΟΚΗΠΟΙ,37.98888,23.76473,Entire rental unit,2,1.0,0.0,2.0,"[""Ethernet connection"", ""43 inch HDTV"", ""Priva...",$67.00,4.83
3,10995,https://www.airbnb.com/rooms/10995,"AQA No2 is 1-bedroom apartment (47m2), on the ...","AQA-No2 1-bedroom, smart tv, fiber connection,",Ampelokipi district is nice multinational and ...,"Athens, Attica, Greece",ΑΜΠΕΛΟΚΗΠΟΙ,37.98903,23.76448,Entire rental unit,4,1.0,1.0,2.0,"[""Ethernet connection"", ""Private patio or balc...",$78.00,4.81
5,33945,https://www.airbnb.com/rooms/33945,Apartment located near metro station. Safe nei...,Spacious Cosy aprtm very close to Metro!,Neighbourhood is alive all day and safe all da...,"Athens, Αττική, Greece",ΑΓΙΟΣ ΝΙΚΟΛΑΟΣ,38.00673,23.72775,Entire rental unit,4,1.0,2.0,2.0,"[""Private patio or balcony"", ""Microwave"", ""Fre...",$30.00,4.71


#Get Embeddings

Sample Airbnb Dataset

In [7]:
subset_percentage = 0.1
df_subset = df_selected.sample(frac=subset_percentage, random_state=42)


print(f"Number of rows in the subset: {len(df_subset)}")

Number of rows in the subset: 758


Create Embeddings

In [8]:
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

# https://huggingface.co/thenlper/gte-large
embedding_model = SentenceTransformer("thenlper/gte-large")

tqdm.pandas() #For progress bar

def get_embedding(text: str) -> list[float]:
  if not text.strip():
      print("Attempted to get embedding for empty text.")
      return []

  embedding = embedding_model.encode(text)

  return embedding.tolist()

def create_embeddings_from_concatenated_columns(df):
  # Concatenate the text columns we need into a single string
  df['concatenated_text'] = df[['description', 'name']].fillna('').agg(' '.join, axis=1)

  # Generate embedding for concat text
  df['combined_embedding'] = df['concatenated_text'].progress_apply(get_embedding)

  return df

create_embeddings_from_concatenated_columns(df_subset)
# df_subset["description_embedding"] = df_subset["description"].progress_apply(get_embedding)
# df_selected["name_embedding"] = df_selected["name"].apply(get_embedding)
# df_selected["neighborhood_overview_embedding"] = df_selected["neighborhood_overview"].apply(get_embedding)
# df_selected["amenities_embedding"] = df_selected["amenities"].apply(get_embedding)




df_selected.head()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


100%|██████████| 758/758 [21:00<00:00,  1.66s/it]


Unnamed: 0,id,listing_url,description,name,neighborhood_overview,neighbourhood,neighbourhood_cleansed,latitude,longitude,property_type,accommodates,bathrooms,bedrooms,beds,amenities,price,review_scores_rating
0,10595,https://www.airbnb.com/rooms/10595,The apartment is 3-bedroom apartment with 2-ba...,"3 bedrooms, 2 bathrooms, 2nd floor with elevator",Ampelokipi district is nice multinational and ...,"Athens, Attica, Greece",ΑΜΠΕΛΟΚΗΠΟΙ,37.98863,23.76527,Entire condo,10,2.0,3.0,7.0,"[""Ethernet connection"", ""Room-darkening shades...",$108.00,4.86
1,10990,https://www.airbnb.com/rooms/10990,Athens Quality Apartments - Deluxe apartment i...,Athens Quality Apartments - Deluxe Apartment,Ampelokipi district is nice multinational and ...,"Athens, Attica, Greece",ΑΜΠΕΛΟΚΗΠΟΙ,37.98903,23.76448,Entire rental unit,4,1.0,1.0,1.0,"[""Ethernet connection"", ""43 inch HDTV"", ""Priva...",$136.00,4.82
2,10993,https://www.airbnb.com/rooms/10993,The Studio is an -excellent located -close t...,Athens Quality Apartments - Studio,Ampelokipi district is nice multinational and ...,"Athens, Attica, Greece",ΑΜΠΕΛΟΚΗΠΟΙ,37.98888,23.76473,Entire rental unit,2,1.0,0.0,2.0,"[""Ethernet connection"", ""43 inch HDTV"", ""Priva...",$67.00,4.83
3,10995,https://www.airbnb.com/rooms/10995,"AQA No2 is 1-bedroom apartment (47m2), on the ...","AQA-No2 1-bedroom, smart tv, fiber connection,",Ampelokipi district is nice multinational and ...,"Athens, Attica, Greece",ΑΜΠΕΛΟΚΗΠΟΙ,37.98903,23.76448,Entire rental unit,4,1.0,1.0,2.0,"[""Ethernet connection"", ""Private patio or balc...",$78.00,4.81
5,33945,https://www.airbnb.com/rooms/33945,Apartment located near metro station. Safe nei...,Spacious Cosy aprtm very close to Metro!,Neighbourhood is alive all day and safe all da...,"Athens, Αττική, Greece",ΑΓΙΟΣ ΝΙΚΟΛΑΟΣ,38.00673,23.72775,Entire rental unit,4,1.0,2.0,2.0,"[""Private patio or balcony"", ""Microwave"", ""Fre...",$30.00,4.71
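The progress bar above reports roughly 1.66 s per row when each text is encoded on its own. `SentenceTransformer.encode` also accepts a list of sentences together with a `batch_size` argument, so one option is to encode all texts in a single call. The sketch below is an assumption about how that could be wired up, not the notebook's method: `embed_texts_batched` and `fake_encode` are hypothetical names, and the stand-in encoder exists only so the example runs without downloading a model.

```python
from typing import Callable, List, Sequence

def embed_texts_batched(
    texts: Sequence[str],
    encode_fn: Callable[[List[str]], Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Embed all non-empty texts in one call and return [] for empty ones."""
    # Remember the positions of the non-empty texts
    positions = [i for i, t in enumerate(texts) if isinstance(t, str) and t.strip()]
    vectors = encode_fn([texts[i] for i in positions]) if positions else []

    result: List[List[float]] = [[] for _ in texts]
    for pos, vec in zip(positions, vectors):
        result[pos] = [float(x) for x in vec]
    return result

# Stand-in encoder for illustration only: one number per text (its length)
def fake_encode(batch: List[str]) -> List[List[float]]:
    return [[float(len(t))] for t in batch]

print(embed_texts_batched(["Athens loft", "", "Studio"], fake_encode))
```

With the real model, the encoder argument could be something like `lambda batch: embedding_model.encode(batch, batch_size=32)`. Whether this is faster in practice depends on the hardware and the text lengths.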


#Connect To MongoDB

Load Credentials

In [9]:
from dotenv import load_dotenv
import os

# Path to .env file in Google Drive
env_path = "/content/drive/My Drive/RAG/Secrets/secrets.env"
load_dotenv(env_path)

# Credentials from .env file
username = os.getenv("MONGO_USERNAME")
password = os.getenv("MONGO_PASSWORD")

# MongoDB URI
mongo_uri = f"mongodb+srv://{username}:{password}@cluster0.hi5uz.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0"
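If the username or password contains reserved characters such as `@`, `/` or `:`, they need to be percent-escaped before being placed in a MongoDB URI. A minimal sketch using the standard library's `quote_plus` follows. `build_mongo_uri` is a hypothetical helper, and the credentials and host in the example are made up.

```python
from urllib.parse import quote_plus

def build_mongo_uri(username: str, password: str, host: str) -> str:
    """Build a mongodb+srv URI with percent-escaped credentials."""
    return (
        f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host}/?retryWrites=true&w=majority"
    )

# Made-up credentials and host, for illustration only
print(build_mongo_uri("thesis_user", "p@ss/word", "cluster0.example.mongodb.net"))
```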


Establish Connection

In [10]:
import pymongo
import os


# Connect to MongoDB
def get_mongo_client(mongo_uri):
  try:
    client = pymongo.MongoClient(mongo_uri)
    # MongoClient connects lazily, so ping the server to verify the connection
    client.admin.command("ping")
    print("Connection to MongoDB successful")
    return client
  except pymongo.errors.PyMongoError as e:
    print(f"Connection failed: {e}")
    return None

# Check that the credentials were loaded before connecting
mongo_client = None
if 'MONGO_USERNAME' not in os.environ or 'MONGO_PASSWORD' not in os.environ:
    print("MONGO_USERNAME or MONGO_PASSWORD not set in environment variables")
else:
    mongo_client = get_mongo_client(mongo_uri)



Connection to MongoDB successful


In [11]:
if mongo_client:
  db = mongo_client['airbnb_db']

  collections = db.list_collection_names()
  print("Collections in database:", collections)

Collections in database: ['listings', 'athenspois']


#Ingest Data to DB

In [12]:
def prepare_json_for_ingestion(df):

    json_data = []

    for _, row in df.iterrows():
        document = {
            "id": row['id'],
            "description": row['description'],
            "name": row['name'],
            "neighborhood_overview": row['neighborhood_overview'],
            "latitude": row['latitude'],
            "longitude": row['longitude'],
            # "description_embedding": row['description_embedding'],
            # "name_embedding" : row["name_embedding"],
            # "neighborhood_overview_embedding" : row["neighborhood_overview_embedding"],
            # "amenities_embedding" : row["amenities_embedding"]
            "combined_embedding": row['combined_embedding'],

        }
        json_data.append(document)

    return json_data

json_data = prepare_json_for_ingestion(df_subset)

In [13]:
print(json_data[0]) #1024 embedding dimensions

{'id': 18145719, 'description': 'The Room at the Top: Your Private Athenian Escape  Discover "The Room at the Top" at Athens Soho Lofts, a cozy sanctuary designed for two, offering a unique blend of intimacy and openness. This charming space is not just a room; it\'s a gateway to a private rooftop experience, where the Athenian skyline stretches out before you.', 'name': 'Athens Soho Lofts - A Room At The Top', 'neighborhood_overview': 'A Rare Luxury: Outdoor Space and Privacy  In a city where outdoor space is a luxury, "The Room at the Top" offers you an entire rooftop to enjoy. It\'s one of the few places in Athens where you can bask in the sun and enjoy the freedom of being outdoors. Whether you\'re sunbathing, enjoying a leisurely breakfast, or simply relaxing with a book, this rooftop is your private slice of paradise.  Embrace the Outdoors in Comfort  "The Room at the Top" isn\'t just a place to stay; it\'s an experience. Enjoy the outdoor space and the privacy it offers, making 

In [14]:
db = mongo_client['airbnb_db']
collection = db['listings']



# Delete any existing records in the collection
collection.delete_many({})

try:
  collection.insert_many(json_data)
  print("Data inserted successfully")
except Exception as e:
  print(f"Error inserting data: {e}")

Data inserted successfully


#Load and Clean POIs Dataset

In [15]:
import json
import random

db = mongo_client['airbnb_db']
collection = db['athenspois']


# Delete any existing records in the collection
collection.delete_many({})

# Load the geojson
geojson_path = "/content/drive/My Drive/RAG/AthensPOI.geojson"
with open(geojson_path, 'r') as f:
    geojson_data = json.load(f)

# If the data is a feature collection, extract features
if 'features' in geojson_data:
    documents = geojson_data['features']  # Each feature is a document
else:
    documents = [geojson_data]  # In case it's a single GeoJSON object

cleaned_documents = []

for document in tqdm(documents, desc="Cleaning POI documents", unit="document"):
    properties = document.get('properties', {})

    # Handle None values
    class_text = properties.get('fclass')
    name_text = properties.get('name')
    coordinates = document.get('geometry', {}).get('coordinates', [])

    # Ensure the values are strings (if they exist), then strip spaces
    class_text = str(class_text).strip() if class_text is not None else None
    name_text = str(name_text).strip() if name_text is not None else None

    # Keep only valid documents
    if class_text not in ["", "0", None] and name_text not in ["", "0", None] and coordinates:
        cleaned_documents.append(document)

print(f"Original documents: {len(documents)}, Cleaned documents: {len(cleaned_documents)}")




Cleaning POI documents: 100%|██████████| 36282/36282 [00:00<00:00, 204985.21document/s]

Original documents: 36282, Cleaned documents: 20507
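The filtering rule used in the loop above can also be written as a small predicate, which makes it easier to check on individual features. This is a sketch under assumptions: `is_valid_poi` is a hypothetical name, the example features are made up, and unlike the loop it also tolerates a feature whose `geometry` is `null`.

```python
def is_valid_poi(feature: dict) -> bool:
    """Return True if a GeoJSON feature has a usable fclass, name, and coordinates."""
    properties = feature.get('properties') or {}
    geometry = feature.get('geometry') or {}  # Also handles "geometry": null

    def usable(value) -> bool:
        # Reject missing values, empty strings, and the "0" placeholder
        return value is not None and str(value).strip() not in ("", "0")

    return (
        usable(properties.get('fclass'))
        and usable(properties.get('name'))
        and bool(geometry.get('coordinates'))
    )

# Made-up features for illustration
named = {'properties': {'fclass': 'fast_food', 'name': 'Pizza Fan'},
         'geometry': {'coordinates': [23.6883828, 38.0120442]}}
unnamed = {'properties': {'fclass': 'bench', 'name': None},
           'geometry': {'coordinates': [23.7, 37.9]}}
print(is_valid_poi(named), is_valid_poi(unnamed))
```

With such a predicate, the cleaning step could be reduced to a list comprehension over `documents`.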





In [16]:
# Randomly sample 10% of the documents
sample_size = int(len(cleaned_documents) * 0.1)
sampled_documents = random.sample(cleaned_documents, sample_size)

# Add embeddings for 'class' and 'name' fields
for document in tqdm(sampled_documents, desc="Processing POI documents", unit="document"):
    properties = document.get('properties', {})
    class_text = properties.get('fclass', "")
    name_text = properties.get('name', "")

    # Generate embeddings for POI Name and Class
    document['name_embedding'] = get_embedding(name_text) if name_text else None
    document['class_embedding'] = get_embedding(class_text) if class_text else None



# Insert documents into MongoDB
try:
    collection.insert_many(sampled_documents)
    print(f"Successfully inserted {len(sampled_documents)} GeoJSON documents into MongoDB")
except Exception as e:
    print(f"Error inserting data: {e}")

Processing POI documents: 100%|██████████| 2050/2050 [22:32<00:00,  1.52document/s]


Successfully inserted 2050 GeoJSON documents into MongoDB


In [17]:
print(f"Original documents: {len(documents)}")
print(f"Cleaned documents: {len(cleaned_documents)}")
print(f"Sampled documents: {len(sampled_documents)}")


Original documents: 36282
Cleaned documents: 20507
Sampled documents: 2050
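The Airbnb sample is drawn with `random_state=42`, but the POI sample uses `random.sample` without a seed, so a different subset is embedded on each run. One way to make the POI sample reproducible is a local seeded generator, as sketched below. `sample_fraction` is a hypothetical helper, and the seed value 42 is an assumption chosen to mirror the listings sample.

```python
import random

def sample_fraction(items, fraction, seed=42):
    """Return a reproducible random sample containing `fraction` of `items`."""
    rng = random.Random(seed)  # Local generator, leaves the global random state untouched
    sample_size = int(len(items) * fraction)
    return rng.sample(items, sample_size)

demo = list(range(100))  # Placeholder data standing in for cleaned_documents
first = sample_fraction(demo, 0.1)
second = sample_fraction(demo, 0.1)
print(len(first), first == second)
```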


#Loading Gemma

Log In to Hugging Face

In [18]:
from huggingface_hub import login

# Token used to log in to Hugging Face
hf_token = os.getenv("HUGGINGFACE_TOKEN")

login(token=hf_token)


Load Gemma

In [19]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", device_map="auto")



#Entity Extraction

In [75]:
# Example query
user_query = "Find Airbnb 3000 meters from Pizza Fan"

In [58]:
import torch
import json
from transformers import AutoTokenizer, AutoModelForCausalLM


import re
def extract_entities(user_query):
    # prompt for the LLM to extract the entities
    prompt = f" As an expert in named entity recognition machine learning models, I will give you a sentence from which I would like you to extract the distance. The distance needs to be a number expressed in meters. I would like the result to be expressed in JSON with the following fields: 'distance_in_meters'. Only return the JSON. Here is the sentence: '{user_query}'."

    # Tokenize the prompt and move the tensors to the model's device
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True).to(model.device)

    # Generate predictions
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=100, num_return_sequences=1)

    # Decode output
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    print("Model response:", response)

    try:
        # We expect the JSON to be between code blocks (```)
        json_part = response.split("```")[1]  # Get the middle part, which contains the JSON
        print("Extracted JSON block:", json_part)
        return json_part  # Return the raw JSON string
    except IndexError:
        print("Error: Could not extract JSON block")
        return None



extracted_entities = extract_entities(user_query)
print("Extracted Entities as JSON:", extracted_entities)




Model response:  As an expert in named entity recognition machine learning models, I will give you a sentence from which I would like you to extract the distance. The distance needs to be a number expressed in meters. I would like the result to be expressed in JSON with the following fields: 'distance_in_meters'. Only return the JSON. Here is the sentence: 'Find Airbnb 3000 meters from a Pizza Fan'.

**JSON Output:**

```json
{
  "distance_in_meters": 3000
}
```
Extracted JSON block: json
{
  "distance_in_meters": 3000
}

Extracted Entities as JSON: json
{
  "distance_in_meters": 3000
}



#Parse Entities

In [59]:
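def extract_json_block(response):
    if not response:
        # extract_entities returns None when no JSON block was found
        print("Error: No response to parse")
        return None
    try:
        # Skip the leading "json" language tag by starting at the first curly brace
        json_start = response.find('{')
        json_part = response[json_start:]  # Extract from the first curly brace to the end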
        print("Extracted JSON block:", json_part)

        # Parse the JSON string into a Python dictionary
        parsed_entities = json.loads(json_part)
        print("Parsed Entities:", parsed_entities)
        return parsed_entities
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON: {e}")
        return None

# Get parsed json
parsed_json = extract_json_block(extracted_entities)

Extracted JSON block: {
  "distance_in_meters": 3000
}

Parsed Entities: {'distance_in_meters': 3000}


In [60]:
distance_to_POI = parsed_json['distance_in_meters']
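The two-step parsing above relies on the model wrapping its answer in a fenced block. A more defensive variant could look for any flat JSON object in the response and validate the value before it is used as a search radius. The sketch below is one possible approach, not the notebook's method: `parse_distance` is a hypothetical helper and the sample response is abbreviated from the model output shown earlier.

```python
import json
import re

def parse_distance(response):
    """Return distance_in_meters from the last JSON object in `response`, or None."""
    # Match flat {...} objects and try the last one first, since the
    # decoded output echoes the prompt before the answer
    candidates = re.findall(r'\{[^{}]*\}', response or "")
    for candidate in reversed(candidates):
        try:
            data = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        value = data.get('distance_in_meters')
        # Accept only positive numbers (bool is excluded explicitly)
        if isinstance(value, (int, float)) and not isinstance(value, bool) and value > 0:
            return float(value)
    return None

fence = "`" * 3  # Built programmatically so the example contains no literal fence
response = (
    "Only return the JSON.\n\n**JSON Output:**\n\n"
    + fence + 'json\n{\n  "distance_in_meters": 3000\n}\n' + fence
)
print(parse_distance(response))
```

Returning `None` for missing or non-positive values lets the caller decide on a fallback radius instead of failing inside the geospatial query.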

#POIs Vector Search

In [24]:
def POIclass_vector_search(user_query, collection):
    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)

    if not query_embedding:
        print("Embedding generation failed.")
        return []  # Callers iterate over the result, so return an empty list

    # Define the vector search pipeline for class vector search
    pipeline = [
        {
            "$vectorSearch": {
                "index": "poi_index",
                "queryVector": query_embedding,
                "path": "class_embedding",  # embeddings for POI class
                "numCandidates": 150,  # candidate matches to consider
                "limit": 4,  # Return top 4 matches
            }
        },
        {
            "$project": {
                "_id": 0,
                "properties.name": 1,
                "geometry.coordinates": 1,
                "properties.fclass": 1 , # Include the class field
                "score": { "$meta": "vectorSearchScore" } # get the similarity score
            }
        },
    ]

    # Execute search
    try:
        results = collection.aggregate(pipeline)
        print(results)
        return list(results)

    except Exception as e:
        print(f"An error occurred during search: {e}")
        return []  # Callers iterate over the result, so return an empty list


In [25]:
def POIname_vector_search(user_query, collection):
    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)


    if not query_embedding:
        print("Invalid query or embedding generation failed.")
        return []  # Callers iterate over the result, so return an empty list

    # Define the Name vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "POI_search",
                "queryVector": query_embedding,
                "path": "name_embedding",  #  embeddings
                "numCandidates": 150,
                "limit": 4,

            }
        },
        {
            "$project": {
                "_id": 0,
                "properties.name": 1,
                "properties.fclass":1,
                "geometry.coordinates": 1,
               "score": { "$meta": "vectorSearchScore" }

            }
        },
    ]

    # Execute the search
    try:

        results = collection.aggregate(pipeline)
        print(results)
        return list(results)

    except Exception as e:
        print(f"An error occurred during search: {e}")
        return []  # Callers iterate over the result, so return an empty list


In [26]:
db = mongo_client['airbnb_db']
collection = db['athenspois']
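The two `$vectorSearch` stages reference Atlas Vector Search indexes named `poi_index` and `POI_search`, whose definitions are not shown in the notebook. The sketch below shows what such definitions might look like. The 1024 dimensions follow the embedding size noted earlier; the `cosine` similarity is an assumption, and `vector_index_definition` is a hypothetical helper.

```python
def vector_index_definition(path: str, dimensions: int = 1024, similarity: str = "cosine") -> dict:
    """Build an Atlas Vector Search index definition for one embedding field."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": dimensions,  # Must match the embedding length
                "similarity": similarity,     # Assumed; the notebook does not state it
            }
        ]
    }

# Index names and paths as referenced by the two search functions
index_definitions = {
    "poi_index": vector_index_definition("class_embedding"),
    "POI_search": vector_index_definition("name_embedding"),
}
print(index_definitions["poi_index"])
```

A definition of this shape can be entered in the Atlas JSON editor when creating a vector search index on the `athenspois` collection.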

In [74]:
def combined_POI_search(user_query, collection):
    # Perform vector search for name
    name_results = POIname_vector_search(user_query, collection)
    name_tagged = [{'poi': poi, 'type': 'name'} for poi in name_results]  # Tag as 'name'

    # Perform vector search for class
    class_results = POIclass_vector_search(user_query, collection)
    class_tagged = [{'poi': poi, 'type': 'class'} for poi in class_results]  # Tag as 'class'

    # Combine both lists
    all_results = name_tagged + class_tagged

    # Dictionary to track unique POIs by coordinates
    unique_pois = {}

    for entry in all_results:
        poi = entry['poi']
        coords = tuple(poi['geometry']['coordinates'])  # Convert to tuple for dictionary key
        score = poi.get('score', 0)

        if coords in unique_pois:
            # If duplicate POI instance exists, keep the one with the highest score
            if score > unique_pois[coords]['poi'].get('score', 0):
                unique_pois[coords] = entry
        else:
            unique_pois[coords] = entry

    # Sort unique POIs by highest similarity score
    sorted_results = sorted(unique_pois.values(), key=lambda x: x['poi'].get('score', 0), reverse=True)

    if not sorted_results:
        return [], None  # Return empty list and None if no results

    # Determine top POI type
    top_poi_type = sorted_results[0]['type']  # Either name or class

    # Extract only POIs for output
    all_pois = [entry['poi'] for entry in sorted_results]

    return all_pois, top_poi_type
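The de-duplication rule in `combined_POI_search` can be checked on toy data. The sketch below isolates that step in a hypothetical helper, `dedupe_by_coordinates`; the coordinates and scores are made up for illustration.

```python
def dedupe_by_coordinates(tagged_results):
    """Keep one entry per coordinate pair, preferring the highest score."""
    unique = {}
    for entry in tagged_results:
        coords = tuple(entry['poi']['geometry']['coordinates'])
        best = unique.get(coords)
        if best is None or entry['poi'].get('score', 0) > best['poi'].get('score', 0):
            unique[coords] = entry
    # Highest similarity score first
    return sorted(unique.values(), key=lambda e: e['poi'].get('score', 0), reverse=True)

# Toy data with made-up scores: the same location is returned by both searches
tagged = [
    {'type': 'name',  'poi': {'geometry': {'coordinates': [23.72, 38.01]}, 'score': 0.94}},
    {'type': 'class', 'poi': {'geometry': {'coordinates': [23.72, 38.01]}, 'score': 0.81}},
    {'type': 'class', 'poi': {'geometry': {'coordinates': [23.73, 37.97]}, 'score': 0.90}},
]
for entry in dedupe_by_coordinates(tagged):
    print(entry['type'], entry['poi']['score'])
```

Note that this compares scores produced by two different indexes (name and class embeddings) directly, which is a design choice of the notebook rather than a guarantee that the scores are on a comparable scale.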


In [76]:
db = mongo_client['airbnb_db']
collection = db['athenspois']


search_results, top_poi_tag = combined_POI_search(user_query, collection)

print("Search Results:", search_results)
print("Top POI Type:", top_poi_tag)



<pymongo.synchronous.command_cursor.CommandCursor object at 0x794b458afbd0>
<pymongo.synchronous.command_cursor.CommandCursor object at 0x794b46930b10>
Search Results: [{'properties': {'fclass': 'fast_food', 'name': 'Pizza Fan'}, 'geometry': {'coordinates': [23.6883828, 38.0120442]}, 'score': 0.9416906237602234}, {'properties': {'fclass': 'restaurant', 'name': 'Pizza Fan'}, 'geometry': {'coordinates': [23.7200224, 38.009503]}, 'score': 0.9416906237602234}, {'properties': {'fclass': 'restaurant', 'name': 'Pizza Fan'}, 'geometry': {'coordinates': [23.7533212, 37.8921193]}, 'score': 0.9416906237602234}, {'properties': {'fclass': 'fast_food', 'name': 'Pizza Fan'}, 'geometry': {'coordinates': [23.7640367, 38.0387453]}, 'score': 0.9416906237602234}, {'properties': {'fclass': 'tourist_info', 'name': 'Αρχαιοι Οδικοι Εξονεϛ - Ι.Ν. Αγιασ Παρασκευηϛ'}, 'geometry': {'coordinates': [23.7283567, 37.9701408]}, 'score': 0.8961631059646606}, {'properties': {'fclass': 'guesthouse', 'name': 'Acro&Polis'}

In [77]:
print("Results:")
for result in search_results:
  print(result)  # Print the entire result document


Results:
{'properties': {'fclass': 'fast_food', 'name': 'Pizza Fan'}, 'geometry': {'coordinates': [23.6883828, 38.0120442]}, 'score': 0.9416906237602234}
{'properties': {'fclass': 'restaurant', 'name': 'Pizza Fan'}, 'geometry': {'coordinates': [23.7200224, 38.009503]}, 'score': 0.9416906237602234}
{'properties': {'fclass': 'restaurant', 'name': 'Pizza Fan'}, 'geometry': {'coordinates': [23.7533212, 37.8921193]}, 'score': 0.9416906237602234}
{'properties': {'fclass': 'fast_food', 'name': 'Pizza Fan'}, 'geometry': {'coordinates': [23.7640367, 38.0387453]}, 'score': 0.9416906237602234}
{'properties': {'fclass': 'tourist_info', 'name': 'Αρχαιοι Οδικοι Εξονεϛ - Ι.Ν. Αγιασ Παρασκευηϛ'}, 'geometry': {'coordinates': [23.7283567, 37.9701408]}, 'score': 0.8961631059646606}
{'properties': {'fclass': 'guesthouse', 'name': 'Acro&Polis'}, 'geometry': {'coordinates': [23.7289656, 37.9729632]}, 'score': 0.8959318399429321}
{'properties': {'fclass': 'hotel', 'name': 'Το Αkrogali'}, 'geometry': {'coordi

#2dsphere Index Creation

In [31]:
db = mongo_client['airbnb_db']
collection = db['listings']
collection.update_many(
    {},  # Empty filter to update all documents
    [
        {
            "$set": {
                "location": {
                    "type": "Point",
                    "coordinates": ["$longitude", "$latitude"]  # Combine longitude and latitude
                }
            }
        }
    ]
)



UpdateResult({'n': 758, 'electionId': ObjectId('7fffffff0000000000000368'), 'opTime': {'ts': Timestamp(1739113505, 772), 't': 872}, 'nModified': 758, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1739113505, 772), 'signature': {'hash': b'\xc4\xc1\t\n =\xba\x7f\xc5fANq\xc9\x9f\x001\x8d\xaf\x96', 'keyId': 7434334956240764942}}, 'operationTime': Timestamp(1739113505, 772), 'updatedExisting': True}, acknowledged=True)

In [32]:
# Create the 2dsphere index on the airbnb 'location' field
collection.create_index([('location', '2dsphere')])

'location_2dsphere'
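GeoJSON points list longitude before latitude, which is why the update above builds `["$longitude", "$latitude"]`. A small client-side helper can make that order explicit and reject values that are out of range. This is a sketch, and `to_geojson_point` is a hypothetical name; a range check of this kind cannot detect swapped values when both happen to be valid, as is the case for coordinates around Athens.

```python
def to_geojson_point(longitude: float, latitude: float) -> dict:
    """Build a GeoJSON Point, which lists longitude before latitude."""
    if not -180 <= longitude <= 180:
        raise ValueError(f"longitude out of range: {longitude}")
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range: {latitude}")
    return {"type": "Point", "coordinates": [longitude, latitude]}

# Coordinates of the first listing shown earlier in the notebook
print(to_geojson_point(23.76527, 37.98863))
```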

#Geospatial Search

In [33]:
def get_top_poi(search_results):
    # Get the top POI and determine whether it has multiple instances.

    if not search_results:
        # Return three values so callers can always unpack the result
        return None, [], False

    # First result is the top one
    top_poi_name = search_results[0]['properties']['name']

    # Gather all instances of the top POI name
    top_poi_instances = [
        poi for poi in search_results if poi['properties']['name'] == top_poi_name
    ]

    # Check if there are multiple instances of the top POI
    has_multiple_instances = len(top_poi_instances) > 1

    return top_poi_name, top_poi_instances, has_multiple_instances

In [73]:
def geospatial_search(collection, poi_instances, radius):

    #Perform geospatial search for Airbnbs near the given POI instances.

    results = []

    for poi in poi_instances:
        poi_name = poi['properties']['name']
        poi_coordinates = poi['geometry']['coordinates']

        # Perform geospatial query
        airbnb_results = collection.aggregate([
            {
                "$geoNear": {
                    "near": {
                        "type": "Point",
                        "coordinates": poi_coordinates
                    },
                    "distanceField": "distance",
                    "maxDistance": radius,
                    "spherical": True
                }
            },
            {
                "$project": {
                    "_id": 0,
                    "name": 1,
                    "description": 1,
                    "distance": 1,
                    "neighborhood_overview": 1
                }
            }
        ])

        # Collect results for this POI
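        # $project omits fields that are missing from a document, so use .get()
        # with an empty-string default instead of direct indexing
        airbnb_list = [
            {
                "airbnb_name": airbnb.get('name', ''),
                "distance_meters": airbnb['distance'],
                "description": airbnb.get('description', ''),
                "neighborhood_overview": airbnb.get('neighborhood_overview', '')
            }
            for airbnb in airbnb_results
        ]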

        results.append({
            "poi_name": poi_name,
            "poi_coordinates": poi_coordinates,
            "airbnbs": airbnb_list
        })

    return results



collection = db['listings']
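
With `"spherical": True` and a GeoJSON `near` point, `$geoNear` reports `distance` in meters. One way to sanity-check those values outside MongoDB is a haversine computation, sketched below. The `haversine_m` helper and the `EARTH_RADIUS_M` constant are assumptions for illustration; MongoDB may use a slightly different radius, so small differences from the `distance` field are to be expected.

```python
import math

# Mean Earth radius in meters (assumed value; MongoDB's internal constant may differ slightly)
EARTH_RADIUS_M = 6371008.8


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two (longitude, latitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # min() guards against rounding pushing the value marginally above 1
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


# One degree of latitude along a meridian, roughly 111 km
print(round(haversine_m(0.0, 0.0, 0.0, 1.0)))
```

Arguments follow the same longitude-first order as the GeoJSON coordinates stored in `location`.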


In [78]:
collection = db['listings']
radius = distance_to_POI



print(f"Searching Airbnbs inside a radius of {distance_to_POI} meters")
if not search_results:
    print("No POIs found.")
else:
    top_poi_type = top_poi_tag # Check if it's name or class-based search

    poi_instances_to_search = []  # Store POIs for geospatial_search

    if top_poi_type == "name":
        # Get the top POI and its instances
        top_poi_name, top_poi_instances, has_multiple_instances = get_top_poi(search_results)

        if has_multiple_instances:
            print(f"Top POI '{top_poi_name}' has multiple instances ({len(top_poi_instances)}).")
            poi_instances_to_search = top_poi_instances
        else:
            print(f"Top POI '{top_poi_name}' is unique.")
            poi_instances_to_search = [top_poi_instances[0]]  # Single instance

    elif top_poi_type == "class":
        # If it's a class search, take the first 3 results that share the top result's class
        top_poi_class = search_results[0]['properties']['fclass']
        class_pois = [poi for poi in search_results if poi['properties']['fclass'] == top_poi_class][:3]

        print(f"Top class-based POI search: Found {len(class_pois)} POIs of class '{top_poi_class}'.")
        poi_instances_to_search = class_pois  # Store top 3 POIs of the class

    # Perform geospatial search with extracted POI instances
    spatial_results = geospatial_search(collection, poi_instances_to_search, radius)

    # Output results
    for result in spatial_results:
        print(f"Airbnbs near POI '{result['poi_name']}' (Coordinates: {result['poi_coordinates']}):")

        top_airbnbs = result['airbnbs'][:3]  # Get top 3 Airbnbs

        if not top_airbnbs:
            print("- No Airbnbs found.")
        else:
            for airbnb in top_airbnbs:
                print(f"- {airbnb['airbnb_name']} (Distance: {airbnb['distance_meters']} meters)")

Searching Airbnbs inside a radius of 3000 meters
Top POI 'Pizza Fan' has multiple instances (4).
Airbnbs near POI 'Pizza Fan' (Coordinates: [23.6883828, 38.0120442]):
- Apartment by metro with WiFi, kitchen and AC (Distance: 2307.040402704296 meters)
- The Athens Heart  Best Destination Apartment ! (Distance: 2694.3752634166995 meters)
- Luxury apartment close to metro. (Distance: 2701.5687271822894 meters)
Airbnbs near POI 'Pizza Fan' (Coordinates: [23.7200224, 38.009503]):
- ATHENIAN HOME -Mini suite - parking (Distance: 188.77952582981956 meters)
- Athenian Bo-Home (Distance: 325.827305293109 meters)
- Vintage 8 (Distance: 610.7469816511572 meters)
Airbnbs near POI 'Pizza Fan' (Coordinates: [23.7533212, 37.8921193]):
- No Airbnbs found.
Airbnbs near POI 'Pizza Fan' (Coordinates: [23.7640367, 38.0387453]):
- Rosemarie's apartment (Distance: 1861.4242569010269 meters)
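
The class-based branch above keeps the first three results of the top class, which can include the same POI name more than once. If distinct POIs are wanted, the selection could be deduplicated by name, as in the sketch below. `top_unique_pois` is a hypothetical helper, and treating the name as the identity of a POI is an assumption, since chains such as 'Pizza Fan' share one name across several locations.

```python
def top_unique_pois(search_results, k=3):
    """Return up to k results of the top result's class, keeping one POI per name."""
    if not search_results:
        return []

    top_class = search_results[0]["properties"]["fclass"]
    seen_names = set()
    unique = []
    for poi in search_results:
        props = poi["properties"]
        if props["fclass"] != top_class or props["name"] in seen_names:
            continue
        seen_names.add(props["name"])
        unique.append(poi)
        if len(unique) == k:
            break
    return unique


# Made-up results shaped like the POI documents used in this notebook
sample = [
    {"properties": {"fclass": "restaurant", "name": "A"}},
    {"properties": {"fclass": "restaurant", "name": "A"}},
    {"properties": {"fclass": "hotel", "name": "B"}},
    {"properties": {"fclass": "restaurant", "name": "C"}},
]
print([poi["properties"]["name"] for poi in top_unique_pois(sample)])  # ['A', 'C']
```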


In [71]:
def LLMprocess_geospatial_results(results, user_query, poi_type):
    """
    Summarizes the geospatial results with the LLM. Handles both name-based and class-based POI searches.

    results: The geospatial results (list of POIs with their nearby Airbnbs).
    user_query: The original user query.
    poi_type: The type of POI search ("name" or "class").
    Returns: A list of dicts, one per POI, each holding the LLM summary of the Airbnbs near that POI.
    """

    llm_responses = []

    # Loop through each POI result
    for result in results:
        poi_name = result['poi_name']
        poi_coordinates = result['poi_coordinates']
        airbnbs = result['airbnbs']

        if airbnbs:
            top_airbnbs = airbnbs[:3]  # Top 3 Airbnbs

            # Create structured input for LLM based on whether it's a name or class-based search
            airbnb_sentences = "\n\n".join([
                f"{airbnb['airbnb_name']} is located {airbnb['distance_meters']:.2f} meters from {poi_name}. "
                f"{' '.join(airbnb['description'].split('.')[:2])}. The neighborhood is known for "
                f"{' '.join(airbnb['neighborhood_overview'].split('.')[:2])}."
                for airbnb in top_airbnbs
            ])

            # Prepare the LLM prompt
            prompt = f"""
            The user is searching for Airbnbs based on this query: "{user_query}".

            Below are details about Airbnbs near '{poi_name}' (Coordinates: {poi_coordinates}).
            Provide a concise, natural summary tailored to the user's query.
            Highlight distance, key amenities, and neighborhood features in a helpful manner.

            {airbnb_sentences}
            """

            # For a class-based search, say so in the prompt. poi_type only holds the
            # string "class", so it cannot be used as the class name itself.
            if poi_type == "class":
                prompt = "Note: This POI was matched by its class rather than by a specific name.\n" + prompt

            # Tokenize and generate response using the LLM model
            inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)
            with torch.no_grad():
                outputs = model.generate(**inputs, max_new_tokens=200, num_return_sequences=1)

            response = tokenizer.decode(outputs[0], skip_special_tokens=True)

            # Store the LLM response
            llm_responses.append({
                "poi_name": poi_name,
                "poi_coordinates": poi_coordinates,
                "summary": response
            })

    # If no POI had any Airbnbs nearby, return a default response with the same keys
    if not llm_responses:
        llm_responses.append({
            "poi_name": "No POIs with Airbnbs found",
            "poi_coordinates": None,
            "summary": f"No Airbnbs were found near any of the POI instances that match your query: '{user_query}'. Consider expanding the search radius."
        })

    return llm_responses
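
The sentence-building step in the function above calls `.split('.')` directly on `description` and `neighborhood_overview`, which fails if a field is missing or was stored as NaN from the CSV. A more defensive variant might look like the sketch below. `first_sentences` is a hypothetical helper, not part of the notebook, and splitting on periods is kept only because the original code does the same.

```python
import math


def first_sentences(text, n=2):
    """Return the first n period-delimited sentences of text, or '' if text is missing or NaN."""
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return ""
    parts = [part.strip() for part in str(text).split(".") if part.strip()]
    if not parts:
        return ""
    return ". ".join(parts[:n]) + "."


print(first_sentences("Cozy flat. Near metro. Great view."))  # Cozy flat. Near metro.
print(repr(first_sentences(float("nan"))))                    # ''
```

An empty result could then be used to skip the "The neighborhood is known for" clause for listings without an overview.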



In [79]:
llm_responses = LLMprocess_geospatial_results(spatial_results, user_query, top_poi_tag)
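
The printed summaries start with the prompt text itself, which suggests a decoder-only model whose `generate` output contains the prompt tokens followed by the new ones. Assuming that is the case, the echo can be dropped by slicing off the prompt length before decoding. The sketch below shows the idea on plain lists; `strip_prompt_tokens` and the toy ids are hypothetical.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Return only the token ids that were generated after the prompt."""
    return output_ids[prompt_length:]


# Toy ids standing in for a real tokenizer's output
prompt_ids = [101, 7592, 2088]
output_ids = prompt_ids + [2023, 2003, 102]

new_ids = strip_prompt_tokens(output_ids, len(prompt_ids))
print(new_ids)  # [2023, 2003, 102]
```

With the notebook's objects, and under the same assumption, the equivalent would be `tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)`.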


In [80]:
# Output LLM responses
for response in llm_responses:
    print(f"Summary for POI '{response['poi_name']}':")
    print(response['summary'])

Summary for POI 'Pizza Fan':

            The user is searching for Airbnbs based on this query: "Find Airbnb 3000 meters from Pizza Fan".

            Below are details about Airbnbs near 'Pizza Fan' (Coordinates: [23.6883828, 38.0120442]). 
            Provide a concise, natural summary tailored to the user's query. 
            Highlight distance, key amenities, and neighborhood features in a helpful manner.

            Apartment by metro with WiFi, kitchen and AC is located 2307.04 meters from Pizza Fan. Studio near metro (500m) Metro Sepolia just 5 minutes from Syntagma Acropolis Refurbished Private bathroom and kitchen 1st floor parking On the first floor of a modern apartment building Independent quiet apartment with parking on the floating 500m from the Sepolia metro on the first floor Late check in fee 10 euro (for check-ins after 9pm). The neighborhood is known for Swimming pool Gas station Supermarket Bakery Coffee shop nearby.

The Athens Heart  Best Destination Apartment 