# Hotel Room Semantic Search with HuggingFace Transformers
# 
# This notebook implements semantic search for hotel room descriptions using a `sentence-transformers` model from HuggingFace.
# 
# **Steps:**
# 1. **Setup:** Install necessary libraries.
# 2. **Configuration:** Define model names and file paths.
# 3. **Load Model:** Load the pre-trained Sentence Transformer model.
# 4. **Helper Functions:** Define functions to create text descriptors and get embeddings.
# 5. **Build Embeddings:**
#    - Load pre-extracted features from `hotel_features.json`.
#    - Generate a text descriptor for each room.
#    - Compute embeddings for each descriptor using the HuggingFace model.
#    - Save these embeddings to `hotel_embeddings_hf.json`.
# 6. **Perform Semantic Search:**
#    - Load the saved embeddings.
#    - Embed a user query.
#    - Calculate cosine similarity between the query embedding and all room embeddings.
#    - Return the top N matching rooms.

In [1]:
# Step 1: Setup - Install necessary libraries
# In a Kaggle/Colab notebook, you might run this cell with a ! prefix:
# !pip install sentence-transformers torch numpy scikit-learn python-dotenv

import os
import json
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer



## Step 2: Configuration

In [3]:
# Input file from the vision model (contains extracted features)
FEATURES_FILE   = "/kaggle/input/hotel-features/hotel_features.json" 
# Output file for HuggingFace embeddings
EMBEDDINGS_FILE_HF = "hotel_embeddings_hf.json" 

# HuggingFace Sentence Transformer model
# Other good options: 'multi-qa-MiniLM-L6-cos-v1', 'all-mpnet-base-v2'
EMBED_MODEL_HF  = "sentence-transformers/all-MiniLM-L6-v2"

## Step 3: Load Sentence Transformer Model
# This will download the model on the first run if it's not already cached.
# With only 25 items, CPU is sufficient. sentence-transformers will still use a GPU automatically if one is available and PyTorch is set up for CUDA.

In [4]:
print(f"Loading HuggingFace model: {EMBED_MODEL_HF}...")
# You can specify device='cuda' if you want to force GPU, or device='cpu'
# By default, it will try to use GPU if available.
try:
    model = SentenceTransformer(EMBED_MODEL_HF) # device='cuda' or 'cpu'
    print("HuggingFace model loaded successfully.")
except Exception as e:
    print(f"Error loading model: {e}")
    print("Please ensure you have an internet connection for the first download,")
    print("and that PyTorch is correctly installed if using GPU.")
    model = None # Set model to None if loading fails

Loading HuggingFace model: sentence-transformers/all-MiniLM-L6-v2...


HuggingFace model loaded successfully.


## Step 4: Helper Functions

In [5]:
def make_descriptor(features: dict) -> str:
    """Build a short English sentence describing the room based on its features."""
    parts = []
    if not isinstance(features, dict):
        return "Generic hotel room description." # Fallback for malformed features

    if features.get("room_type"):
        parts.append(f"{features['room_type']} room")
    if features.get("capacity") is not None:
        parts.append(f"with a capacity for {features['capacity']} people")
    # Only call .lower() on real strings so malformed view_type values cannot raise
    view_type = features.get("view_type")
    if isinstance(view_type, str) and view_type and view_type.lower() not in ["none", "none visible", "not specified"]:
        parts.append(f"offering a {view_type}")
    
    amenities = features.get("amenities", [])
    if isinstance(amenities, list) and amenities:
        # str() guards against non-string entries in the amenities list
        amenities_str = ", ".join(str(a) for a in amenities)
        parts.append(f"equipped with {amenities_str}")
            
    description = ". ".join(parts)
    if not description:
        return "A standard hotel room." # Fallback if no features make it to description
    return description + "."

def get_embedding_hf(text: str, sf_model: SentenceTransformer) -> list:
    """Generate an embedding using the loaded HuggingFace model."""
    if sf_model is None:
        raise ValueError("SentenceTransformer model is not loaded.")
    # The model.encode() method returns a NumPy array directly
    embedding = sf_model.encode(text)
    return embedding.tolist() # Convert to list for JSON serialization
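
The docstring of `make_descriptor` promises a short English sentence, but joining the parts with ". " yields fragments such as "double room. with a capacity for 2 people." and the phrase "1 people" for single rooms. A minimal sketch of an alternative that builds one sentence is shown below. The name `make_descriptor_sentence` is hypothetical, and whether smoother wording changes retrieval quality has not been measured here. Switching descriptors would also require rebuilding the embeddings.

```python
def make_descriptor_sentence(features: dict) -> str:
    """Hypothetical variant of make_descriptor that returns one grammatical sentence."""
    if not isinstance(features, dict):
        return "A standard hotel room."  # Fallback for malformed features

    room_type = features.get("room_type") or "hotel"
    sentence = f"{str(room_type).capitalize()} room"

    capacity = features.get("capacity")
    if capacity is not None:
        noun = "person" if capacity == 1 else "people"
        sentence += f" for {capacity} {noun}"

    view = features.get("view_type")
    if isinstance(view, str) and view and view.lower() not in ("none", "none visible", "not specified"):
        sentence += f" with a {view}"

    amenities = features.get("amenities")
    if isinstance(amenities, list) and amenities:
        sentence += ", equipped with " + ", ".join(str(a) for a in amenities)

    return sentence + "."


# Made-up feature dict in the same shape as the entries in hotel_features.json
example = {"room_type": "single", "capacity": 1, "view_type": "city view",
           "amenities": ["desk", "balcony"]}
print(make_descriptor_sentence(example))
```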

## Step 5: Build and Save Embeddings
# This step reads the `hotel_features.json` (output from your image analysis step),
# creates a textual description for each room, generates an embedding for that description,
# and saves everything to `hotel_embeddings_hf.json`.
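
The step described above is essentially a round trip: one record per room is written out as JSON and read back unchanged at search time. The sketch below illustrates that record layout without loading a model. `fake_embed` is a hypothetical stand-in that returns a deterministic toy vector, and the file name and feature values are made up for the demonstration.

```python
import json
import os
import tempfile


def fake_embed(text: str, dim: int = 4) -> list:
    """Stand-in for a real embedding model (assumption): deterministic toy vector."""
    return [float((len(text) + i) % 7) for i in range(dim)]


# Toy feature database keyed by image name, mirroring the shape of hotel_features.json
features = {
    "room_1.jpg": {"room_type": "double", "capacity": 2},
    "room_2.jpg": {"room_type": "single", "capacity": 1},
}

records = []
for key, feats in features.items():
    descriptor = f"{feats['room_type']} room for {feats['capacity']}"
    records.append({
        "url": feats.get("original_url", key),  # fall back to the dict key
        "descriptor": descriptor,
        "features": feats,
        "embedding": fake_embed(descriptor),     # plain list, so it is JSON-serializable
    })

# Write the records to a temporary file and read them back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings_demo.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded[0]["url"], len(loaded[0]["embedding"]))
```

Storing embeddings as JSON lists is convenient to inspect at this scale. For much larger collections a binary format would likely be more compact, but that is outside the scope of this notebook.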

In [6]:
def build_embeddings_hf(sf_model: SentenceTransformer):
    if sf_model is None:
        print("Cannot build embeddings: SentenceTransformer model not loaded.")
        return

    # First, ensure hotel_features.json exists and is not empty
    if not os.path.exists(FEATURES_FILE):
        print(f"Error: {FEATURES_FILE} not found. Please create it first (output from vision model).")
        # Create a dummy file for demonstration if it doesn't exist
        print("Creating a dummy hotel_features.json for demonstration purposes.")
        dummy_features = {
            "https_s3.eu-central-1.amazonaws.com_static.obilet.com_CaseStudy_HotelImages_1.jpg": {
                "room_type": "double", "capacity": 2, "view_type": "mountain view", 
                "amenities": ["balcony", "air conditioning"],
                "original_url": "https_s3.eu-central-1.amazonaws.com_static.obilet.com_CaseStudy_HotelImages_1.jpg"
            },
            "https_s3.eu-central-1.amazonaws.com_static.obilet.com_CaseStudy_HotelImages_2.jpg": {
                "room_type": "single", "capacity": 1, "view_type": "city view", 
                "amenities": ["desk", "tv"],
                "original_url": "https_s3.eu-central-1.amazonaws.com_static.obilet.com_CaseStudy_HotelImages_2.jpg"
            }
        }
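        try:
            # FEATURES_FILE may point to a read-only location (e.g. /kaggle/input), so writing can fail
            with open(FEATURES_FILE, "w") as f:
                json.dump(dummy_features, f, indent=2)
        except OSError as e:
            print(f"Error: could not write dummy file to {FEATURES_FILE}: {e}")
            return
        print(f"Dummy {FEATURES_FILE} created. Please replace with your actual data.")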
        
    with open(FEATURES_FILE, "r") as f:
        try:
            feature_database = json.load(f)
        except json.JSONDecodeError:
            print(f"Error: Could not decode JSON from {FEATURES_FILE}. Ensure it's valid.")
            return

    if not feature_database:
        print(f"Error: {FEATURES_FILE} is empty. No features to process.")
        return

    embeddings_database = []
    count = 0
    total_items = len(feature_database)
    print(f"Starting embedding generation for {total_items} items...")
    for original_url, features_dict in feature_database.items():
        count += 1
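        # The dict key may be a local filename, so prefer the 'original_url' stored in the features.
        # Fall back to the key if the entry is not a dict (make_descriptor handles that case too).
        actual_url = features_dict.get("original_url", original_url) if isinstance(features_dict, dict) else original_url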

        descriptor = make_descriptor(features_dict)
        print(f"({count}/{total_items}) Embedding: {descriptor[:100]}...") # Print progress
        
        try:
            embedding_vector = get_embedding_hf(descriptor, sf_model)
            embeddings_database.append({
                "url": actual_url,
                "descriptor": descriptor, # Store the descriptor for reference
                "features": features_dict, # Store original features too
                "embedding": embedding_vector
            })
        except Exception as e:
            print(f"Error embedding item {actual_url}: {e}")
            continue # Skip to next item

    with open(EMBEDDINGS_FILE_HF, "w") as f:
        json.dump(embeddings_database, f, indent=2)
    print(f"→ Successfully saved HuggingFace embeddings for {len(embeddings_database)} items to {EMBEDDINGS_FILE_HF}")

# Run the embedding generation process
# This only needs to be run once, or when hotel_features.json changes, or if you change the embedding model.
if model is not None: # Only run if model loaded successfully
    build_embeddings_hf(model)
else:
    print("Skipping embedding generation as the model was not loaded.")

Starting embedding generation for 25 items...
(1/25) Embedding: double room. with a capacity for 2 people. offering a mountain view. equipped with balcony, air cond...
(2/25) Embedding: double room. with a capacity for 2 people. equipped with desk, TV, air conditioning, wardrobe, night...
(3/25) Embedding: double room. with a capacity for 2 people. offering a city view. equipped with balcony....
(4/25) Embedding: double room. with a capacity for 2 people. offering a city view. equipped with desk, TV, chair, tabl...
(5/25) Embedding: double room. with a capacity for 2 people. offering a sea view. equipped with balcony....
(6/25) Embedding: double room. with a capacity for 3 people. offering a sea view. equipped with desk, balcony, mirror....
(7/25) Embedding: suite room. with a capacity for 4 people. offering a sea view. equipped with jacuzzi, balcony, TV, d...
(8/25) Embedding: single room. with a capacity for 1 people. offering a city view. equipped with desk, balcony....
(9/25) Embedding: double room. with a capacity for 2 people. offering a sea view. equipped with balcony, TV is not vis...
(10/25) Embedding: double room. with a capacity for 2 people. offering a city view. equipped with desk, balcony, TV....
(11/25) Embedding: double room. with a capacity for 2 people. equipped with desk, air conditioning, TV, phone, coffee m...
(12/25) Embedding: triple room. with a capacity for 3 people. equipped with desk, TV, towels, kettle....
(13/25) Embedding: double room. with a capacity for 2 people. offering a garden view. equipped with balcony, TV....
(14/25) Embedding: double room. with a capacity for 2 people. offering a sea view. equipped with balcony, TV, desk....
(15/25) Embedding: double room. with a capacity for 4 people. offering a city view. equipped with balcony....
(16/25) Embedding: double room. with a capacity for 2 people. offering a sea view. equipped with desk, balcony, air con...
(17/25) Embedding: double room. with a capacity for 3 people. offering a garden view. equipped with desk, TV, nightstan...
(18/25) Embedding: double room. with a capacity for 2 people. offering a mountain view. equipped with balcony, TV, desk...
(19/25) Embedding: double room. with a capacity for 2 people. offering a city view. equipped with balcony, TV (not visi...
(20/25) Embedding: suite room. with a capacity for 4 people. equipped with desk, balcony, TV, air conditioning, minibar...
(21/25) Embedding: double room. with a capacity for 2 people. equipped with desk, TV, air conditioning, chair, stool, l...
(22/25) Embedding: quadruple room. with a capacity for 4 people. equipped with desk, TV, air conditioning....
(23/25) Embedding: triple room. with a capacity for 3 people. equipped with desk, balcony, TV, air conditioning, lamps ...
(24/25) Embedding: triple room. with a capacity for 3 people. equipped with desk, air conditioning, TV, telephone, ward...
(25/25) Embedding: triple room. with a capacity for 3 people. equipped with bedside lamps, telephones, wardrobe....
→ Successfully saved HuggingFace embeddings for 25 items to hotel_embeddings_hf.json


## Step 6: Perform Semantic Search
# This function loads the saved embeddings and performs a search based on cosine similarity.
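
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it compares direction and ignores magnitude. The sketch below works this out with hand-picked 3-dimensional toy vectors (not real embeddings) and ranks the rows from best to worst match. The helper name `cosine_scores` is an assumption made for illustration. The search function in this notebook uses scikit-learn's `cosine_similarity` instead.

```python
import numpy as np


def cosine_scores(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and every row of matrix."""
    query_norm = np.linalg.norm(query)
    row_norms = np.linalg.norm(matrix, axis=1)
    # Guard against division by zero for all-zero vectors.
    denom = np.maximum(query_norm * row_norms, 1e-12)
    return (matrix @ query) / denom


# Toy vectors, chosen by hand for illustration only.
rooms = np.array([
    [1.0, 0.0, 0.0],   # same direction as the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [3.0, 4.0, 0.0],   # partly aligned with the query
])
query = np.array([2.0, 0.0, 0.0])

scores = cosine_scores(query, rooms)
ranking = np.argsort(scores)[::-1]  # indices from best to worst match
print(scores, ranking)
```

The first row scores 1.0 even though its length differs from the query's, which is why cosine similarity suits embeddings whose magnitudes carry no meaning.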

In [7]:
def semantic_search_hf(query: str, sf_model: SentenceTransformer, top_k: int = 5) -> list:
    if sf_model is None:
        print("Cannot perform search: SentenceTransformer model not loaded.")
        return []
        
    if not os.path.exists(EMBEDDINGS_FILE_HF):
        print(f"Error: Embeddings file {EMBEDDINGS_FILE_HF} not found. Please run 'build_embeddings_hf' first.")
        return []

    with open(EMBEDDINGS_FILE_HF, "r") as f:
        try:
            embeddings_database = json.load(f)
        except json.JSONDecodeError:
            print(f"Error: Could not decode JSON from {EMBEDDINGS_FILE_HF}.")
            return []
    
    if not embeddings_database:
        print("Embeddings database is empty. No items to search.")
        return []

    print(f"\nSearching for: \"{query}\"")
    query_embedding_list = get_embedding_hf(query, sf_model)
    query_embedding = np.array(query_embedding_list).reshape(1, -1) # Reshape for cosine_similarity

    # Ensure all stored embeddings are numpy arrays for cosine_similarity
    # And handle potential errors if 'embedding' key is missing or not a list
    room_embeddings_matrix = []
    valid_rooms_data = []
    for item in embeddings_database:
        if "embedding" in item and isinstance(item["embedding"], list):
            room_embeddings_matrix.append(item["embedding"])
            valid_rooms_data.append(item) # Keep track of items that have valid embeddings
        else:
            print(f"Warning: Skipping item {item.get('url', 'Unknown URL')} due to missing or invalid embedding.")
            
    if not room_embeddings_matrix:
        print("No valid embeddings found in the database to compare against.")
        return []
        
    room_embeddings_matrix = np.array(room_embeddings_matrix)
    
    similarities = cosine_similarity(query_embedding, room_embeddings_matrix)[0]

    # Get top_k indices: clamp top_k to the valid range, then sort descending and slice.
    # (Slicing with [-top_k:] would return every item when top_k is 0.)
    top_k = max(0, min(top_k, len(similarities)))
    top_k_indices = np.argsort(similarities)[::-1][:top_k]

    results = []
    for i in top_k_indices:
        # Ensure index is within bounds of valid_rooms_data
        if i < len(valid_rooms_data):
            results.append({
                "url": valid_rooms_data[i]["url"],
                "descriptor": valid_rooms_data[i]["descriptor"],
                "score": float(similarities[i])
            })
    return results
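
`semantic_search_hf` re-reads and re-parses the JSON file on every call, which is fine for 25 rooms. If many queries are issued in a row, one option is to load the records once and keep a normalized matrix in memory, so each search reduces to a matrix-vector product. The sketch below assumes all stored embeddings have the same length. The class name `RoomIndex` and the toy records are hypothetical.

```python
import numpy as np


class RoomIndex:
    """Hypothetical in-memory index: normalize once, then rank with dot products."""

    def __init__(self, records: list):
        # Keep only records that carry a non-empty embedding list.
        self.records = [
            r for r in records
            if isinstance(r.get("embedding"), list) and r["embedding"]
        ]
        if self.records:
            matrix = np.asarray([r["embedding"] for r in self.records], dtype=np.float64)
            norms = np.linalg.norm(matrix, axis=1, keepdims=True)
            # Unit-length rows make the dot product equal to cosine similarity.
            self.matrix = matrix / np.maximum(norms, 1e-12)
        else:
            self.matrix = np.empty((0, 0))

    def search(self, query_vector, top_k: int = 5) -> list:
        if not self.records or top_k <= 0:
            return []
        q = np.asarray(query_vector, dtype=np.float64)
        q = q / max(np.linalg.norm(q), 1e-12)
        scores = self.matrix @ q
        order = np.argsort(scores)[::-1][:top_k]  # best matches first
        return [{"url": self.records[i]["url"], "score": float(scores[i])} for i in order]


# Toy records with 2-dimensional vectors; the last one has no embedding and is skipped.
toy_records = [
    {"url": "a", "embedding": [1.0, 0.0]},
    {"url": "b", "embedding": [0.0, 2.0]},
    {"url": "c", "embedding": [1.0, 1.0]},
    {"url": "d"},
]
index = RoomIndex(toy_records)
hits = index.search([3.0, 0.0], top_k=2)
print(hits)
```

In the notebook, the query vector would come from `get_embedding_hf(query, model)` and the records from the saved embeddings file.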

In [8]:
if model is not None: # Only run if model loaded successfully
    search_queries = [
        "a quiet double room with a beautiful sea view and a balcony for fresh air",
        "a room suitable for a business traveler, needs a desk and good lighting",
        "cheap room for 4 people",
    ]
    # Run each example query and print its top 3 matches
    for n, search_query in enumerate(search_queries, start=1):
        results = semantic_search_hf(search_query, model, top_k=3)
        print(f"\n--- Search Results {n} ---")
        if results:
            for r in results:
                print(f"URL: {r['url']}\nScore: {r['score']:.4f}\nDescriptor: {r['descriptor']}\n---")
        else:
            print("No matches found.")
else:
    print("Skipping semantic search examples as the model was not loaded.")


Searching for: "a quiet double room with a beautiful sea view and a balcony for fresh air"



--- Search Results 1 ---
URL: https_s3.eu-central-1.amazonaws.com_static.obilet.com_CaseStudy_HotelImages_13.jpg
Score: 0.8024
Descriptor: double room. with a capacity for 2 people. offering a sea view. equipped with balcony.
---
URL: https_s3.eu-central-1.amazonaws.com_static.obilet.com_CaseStudy_HotelImages_21.jpg
Score: 0.7608
Descriptor: double room. with a capacity for 2 people. offering a sea view. equipped with balcony, TV, desk.
---
URL: https_s3.eu-central-1.amazonaws.com_static.obilet.com_CaseStudy_HotelImages_23.jpg
Score: 0.7522
Descriptor: double room. with a capacity for 2 people. offering a sea view. equipped with desk, balcony, air conditioning, TV.
---

Searching for: "a room suitable for a business traveler, needs a desk and good lighting"



--- Search Results 2 ---
URL: https_s3.eu-central-1.amazonaws.com_static.obilet.com_CaseStudy_HotelImages_16.jpg
Score: 0.5704
Descriptor: single room. with a capacity for 1 people. offering a city view. equipped with desk, balcony.
---
URL: https_s3.eu-central-1.amazonaws.com_static.obilet.com_CaseStudy_HotelImages_14.jpg
Score: 0.5515
Descriptor: double room. with a capacity for 3 people. offering a sea view. equipped with desk, balcony, mirror.
---
URL: https_s3.eu-central-1.amazonaws.com_static.obilet.com_CaseStudy_HotelImages_10.jpg
Score: 0.5462
Descriptor: double room. with a capacity for 2 people. equipped with desk, TV, air conditioning, wardrobe, nightstand, lamp.
---

Searching for: "cheap room for 4 people"



--- Search Results 3 ---
URL: https_s3.eu-central-1.amazonaws.com_static.obilet.com_CaseStudy_HotelImages_22.jpg
Score: 0.6211
Descriptor: double room. with a capacity for 4 people. offering a city view. equipped with balcony.
---
URL: https_s3.eu-central-1.amazonaws.com_static.obilet.com_CaseStudy_HotelImages_6.jpg
Score: 0.6118
Descriptor: quadruple room. with a capacity for 4 people. equipped with desk, TV, air conditioning.
---
URL: https_s3.eu-central-1.amazonaws.com_static.obilet.com_CaseStudy_HotelImages_4.jpg
Score: 0.6111
Descriptor: suite room. with a capacity for 4 people. equipped with desk, balcony, TV, air conditioning, minibar, safe.
---
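
In the third query, "cheap" has nothing to match because the descriptors carry no price information, and capacity is matched only as text inside the embedding. Since each saved record also keeps its original `features`, a hard requirement such as capacity could be enforced with a plain filter alongside the similarity ranking. The following is a minimal sketch under that assumption. `filter_by_capacity` and the toy records are hypothetical, not part of the notebook.

```python
def filter_by_capacity(records: list, min_capacity: int) -> list:
    """Keep records whose stored features report at least min_capacity guests."""
    kept = []
    for record in records:
        features = record.get("features") or {}
        capacity = features.get("capacity")
        # Records with a missing or non-numeric capacity are dropped.
        if isinstance(capacity, (int, float)) and capacity >= min_capacity:
            kept.append(record)
    return kept


# Toy records shaped like entries in the embeddings file (embedding omitted for brevity)
toy_rooms = [
    {"url": "room_a.jpg", "features": {"room_type": "double", "capacity": 2}},
    {"url": "room_b.jpg", "features": {"room_type": "quadruple", "capacity": 4}},
    {"url": "room_c.jpg", "features": {"room_type": "suite", "capacity": None}},
    {"url": "room_d.jpg", "features": {"room_type": "suite", "capacity": 4}},
]
candidates = filter_by_capacity(toy_rooms, min_capacity=4)
print([r["url"] for r in candidates])
```

The filtered list could then be ranked by cosine similarity as before, so the embedding only has to capture the soft preferences in the query.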
