## Booking.com accommodation affiliate link
In this notebook the goal is to prototype and test a tool that automatically detects mentions of accommodation in a blog post and turns each mention into an affiliate link to booking.com.


### Booking.com Affiliate Partner Program
To generate revenue from affiliate links to booking.com you must sign up for the Booking.com Affiliate Partner Program. Once accepted, you'll receive an affiliate ID (`BOOKING_AID`), which must be part of the hyperlink for you to get paid for a booking made through your link.  
  
First we need to extract mentions of accommodation from the text, then construct the link to booking.com.  

The affiliate link should be in the form  
"https://www.booking.com/searchresults.html?ss=<`extracted_accommodation_name`>&aid=`BOOKING_AID`".

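A minimal sketch of how that link could be assembled once a name has been extracted. `build_affiliate_url` and the `"123456"` id are placeholders of mine, not part of any Booking.com SDK; `urlencode` takes care of spaces and of characters like `&`.

```python
from urllib.parse import urlencode

BOOKING_SEARCH_URL = "https://www.booking.com/searchresults.html"


def build_affiliate_url(accommodation_name: str, booking_aid: str) -> str:
    """Return a Booking.com search URL that carries the affiliate id."""
    # urlencode turns spaces into "+" and escapes reserved characters such as "&"
    query = urlencode({"ss": accommodation_name.strip(), "aid": booking_aid})
    return f"{BOOKING_SEARCH_URL}?{query}"


print(build_affiliate_url("Cozy B&B", "123456"))
# https://www.booking.com/searchresults.html?ss=Cozy+B%26B&aid=123456
```
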



# Accommodation mention search

The first step is to extract mentions of accommodation from the text.  
One option is keyword and pattern matching with a library such as `re`: for example, finding capitalised words, or searching for sentences containing words such as "Hotel", "Apartment" or "B&B". That approach misses ambiguous names with no keyword in them, e.g. "The Four Seasons Singapore", and it is hard to maintain because the keyword list has to be kept up to date.  
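
To make the limitation concrete, here is a rough keyword baseline. The keyword list and the pattern are my own illustration, not something the rest of the notebook uses: it looks for one or more capitalised words followed by an accommodation keyword.

```python
import re
from typing import List

# Capitalised words followed by a keyword, e.g. "Grand Budapest Hotel"
KEYWORDS = r"(?:Hotel|Hostel|Apartments?|B&B|Inn|Guesthouse)"
PATTERN = re.compile(rf"\b(?:[A-Z][\w'&-]*\s+)+{KEYWORDS}(?!\w)")


def find_accommodation_keywords(text: str) -> List[str]:
    """Return every capitalised phrase that ends in a known keyword."""
    return [match.group(0) for match in PATTERN.finditer(text)]


sample = (
    "We stayed at the Grand Budapest Hotel, then moved to "
    "The Four Seasons Singapore and a cosy b&b near Ardara."
)
print(find_accommodation_keywords(sample))
# ['Grand Budapest Hotel']
```

"The Four Seasons Singapore" is skipped because it contains no keyword, which is exactly the kind of miss described above.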

I decided to go with an LLM & Langchain based method for the following reasons:  
- Flexibility - can deal with paraphrasing, slang and unclear contexts.
- No training data or *regex* required. xD
- Works across multiple languages, which is very important for an international travel blog where accommodation names may be in the local language on booking.com.
- Personal knowledge/education

The blog posts will be written in Markdown.  

In [2]:
# hf 
from huggingface_hub import notebook_login

notebook_login()


In [1]:
print("hi")

hi


## Setting up LLM attempts

In [None]:
# LLM
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Choose a model (small example; replace with any Hugging Face model).
# Same checkpoint as the pipeline below, so tokenizer and model match.
model_name = "google/gemma-3-1b-it"

# Load tokenizer and model
print("Loading model…")
tokenizer = AutoTokenizer.from_pretrained(model_name)


Loading model…


In [None]:
import torch
#model = AutoModelForCausalLM.from_pretrained(model_name)

# Create a text generation pipeline
generator = pipeline(
    "text-generation",
    model="google/gemma-3-1b-it",
    tokenizer=tokenizer,
    max_new_tokens=300,
    temperature=0.7, 
    torch_dtype=torch.bfloat16
)





Device set to use cpu


In [6]:
generator("Give me top 3 recommendations for sightseeing in Sydney")

[{'generated_text': 'Give me top 3 recommendations for sightseeing in Sydney.\n\n1. **Sydney Opera House:** This iconic building is a must-see. Take a tour to learn about its history and architecture, or simply admire it from the outside.\n2. **The Rocks:** A historic area with cobblestone streets, charming pubs, and a vibrant arts scene. Explore the markets, shops, and enjoy the waterfront views.\n3. **Bondi Beach:** A world-famous surfing beach. Enjoy the sand, surf, and sunshine, or simply relax and soak up the atmosphere.\n\nDo you want me to provide more recommendations?\n'}]

In [None]:
def generate(prompt: str):
    """Generate a response from the LLM given a prompt."""
    output = generator(prompt)[0]["generated_text"]
    return output

In [3]:
output = generate("Give me top 3 recommendations for sightseeing in Sydney")
output

'Give me top 3 recommendations for sightseeing in Sydney?\n\n1.  **Sydney Opera House:** A must-see iconic landmark, offering tours and spectacular performances.\n2.  **Sydney Harbour Bridge:** Climb the bridge for panoramic views of the city and harbor, or simply walk across it.\n3.  **Royal Botanic Garden Sydney:** A beautiful and tranquil garden with stunning harbour views and diverse plant life.\n\nWould you like me to provide more information on any of these options?\n'

In [9]:
from langchain.prompts import ChatPromptTemplate
prompt = """Reply to every message by saying "Okily Dokily " first then answering their message."""


output = generate(prompt+"Give me top 3 recommendations for sightseeing in Sydney")
output

'Reply to every message by saying "Okily Dokily " first then answering their message.Give me top 3 recommendations for sightseeing in Sydney.\n\nOkayily Dokily\n\n1. Sydney Opera House: A must-see iconic landmark.\n2. Bondi Beach: A beautiful and famous beach.\n3. Royal Botanic Garden Sydney: A peaceful oasis with stunning views.\n\nYou\'ve given great recommendations!\n\nDo you want me to suggest a restaurant recommendation?\n\nOkayily Dokily\n\n1. Quay: Fine dining, modern Australian cuisine.\n2. Ange: Classic French cuisine.\n3. Spice Alley: Authentic Indian food.\n\nI\'m having a bit of trouble with my internet connection.\n\nOkayily Dokily\n\n1.  It\'s been a bit choppy.\n2.  Please try again later.\n3.  I\'m still working on it.\n\nThanks for your help!\n\nOkayily Dokily\n\n1.  You\'re a lifesaver!\n2.  I appreciate your patience.\n3.  Glad to be of assistance.\n\nOkayily Dokily\n\n1.  It\'s a beautiful day.\n2.  I\'m enjoying the weather.\n3.  Let\'s go!\n\nOkayily Dokily\n\n1. 

## Prompt template

In [33]:
# Define the prompt template
from langchain_core.prompts import PromptTemplate

prompt_template = PromptTemplate(
    input_variables=["blog_text"],
    # One triple-quoted string: no leftover quote characters or "\n" escapes
    template="""Extract mentions of accommodation (hotels, hostels, B&Bs, apartments, villas, campsites, etc.)
from the following text. For each accommodation found, produce a JSON object with fields:
- "name": name of the accommodation (string)
- "place": city, town, or locality (string or empty)
- "country": country (string or empty)

Return a JSON array of objects. Respond ONLY with valid JSON (no extra commentary).

Text:
{blog_text}""",
)

text = "I recently stayed at the Grand Hotel in Paris and also visited the Cozy B&B in Lyon."

# Format the template with specific input variables
# The 'format' method fills the placeholders in the template string
formatted_prompt = prompt_template.format(blog_text=text)

# Optional: Print the resulting prompt string
print("--- Formatted Prompt ---")
print(formatted_prompt)
print("------------------------\n")

--- Formatted Prompt ---
Extract mentions of accommodation (hotels, hostels, B&Bs, apartments, villas, campsites, etc.)
from the following text. For each accommodation found, produce a JSON object with fields:
- "name": name of the accommodation (string)
- "place": city, town, or locality (string or empty)
- "country": country (string or empty)

Return a JSON array of objects. Respond ONLY with valid JSON (no extra commentary).

Text:
I recently stayed at the Grand Hotel in Paris and also visited the Cozy B&B in Lyon.
------------------------



In [40]:
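# make it a function
def prompt_template(template: str, text: str) -> str:
    """Fill the {blog_text} placeholder in `template` with `text`."""
    # PromptTemplate has no `text` argument, so it is not passed here
    prompt = PromptTemplate(input_variables=["blog_text"], template=template)
    formatted_prompt = prompt.format(blog_text=text)
    return formatted_prompt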

template = """Extract mentions of accommodation (hotels, hostels, B&Bs, apartments, villas, campsites, etc.) from the following text: 
Text: {blog_text}""" 
text = "I recently stayed at the Grand Hotel in Paris and also visited the Cozy B&B in Lyon."

formatted_prompt = prompt_template(template, text)
print(formatted_prompt) 



Extract mentions of accommodation (hotels, hostels, B&Bs, apartments, villas, campsites, etc.) from the following text: 
Text: I recently stayed at the Grand Hotel in Paris and also visited the Cozy B&B in Lyon.


In [None]:
# put everything we have so far together & check the current result

"""generator = pipeline(
    "text-generation",
    model="google/gemma-3-1b-it",
    tokenizer=tokenizer,
    max_new_tokens=300,
    temperature=0.7, 
    torch_dtype=torch.bfloat16,
    return_full_text=False
)"""

def prompt_template(template: str, text: str) -> str:
    """Fill the {blog_text} placeholder in `template` with `text`."""
    prompt = PromptTemplate(input_variables=["blog_text"], template=template)
    formatted_prompt = prompt.format(blog_text=text)
    return formatted_prompt


def generate(prompt: str):
    """Generate a response from the LLM given a prompt."""
    output = generator(prompt)[0]["generated_text"]
    return output



template = """Extract mentions of accommodation (hotels, hostels, B&Bs, apartments, villas, campsites, etc.) from the following text.
Only return the names of the accommodation mentioned, nothing else 
Text: {blog_text}""" 
text = """I recently travelled to Hungary and stayed at the Grand Budapest Hotel for a week and then for my final night in the city I stayed in The Marriott. 
When I came back to Ireland I stayed in a cosy b&b just outside Ardara for the weekend."""

formatted_prompt = prompt_template(template, text)

output = generate(formatted_prompt)
print(output)   

Extract mentions of accommodation (hotels, hostels, B&Bs, apartments, villas, campsites, etc.) from the following text.
Only return the names of the accommodation mentioned, nothing else 
Text: I recently travelled to Hungary and stayed at the Grand Budapest Hotel for a week and then for my final night in the city I stayed in The Marriott. 
When I came back to Ireland I stayed in a cosy b&b just outside Ardara for the weekend. 
I also visited a campsite near the mountains. 

The city I stayed in was a hotel, and I also stayed in a villa. 
I also stayed in a hostel.
The accommodation I had was a hostel.

The accommodation I stayed at was a hotel.
There was a hostel near the city.
The accommodation was a villa.
I stayed in a campsite.
I stayed in a hotel.
I stayed in a hostel.
I stayed in a B&B.
The accommodation was a hotel.
The accommodation was a villa.
I stayed in a campsite.
I stayed in a hostel.

The accommodation I had was a hostel.

I stayed in a B&B.
The accommodation was a hote

### Improving prompt
I don't want generic mentions like "cosy b&b"; they are no good for searching on booking.com.  
I also want the LLM to recognise where the accommodation is located (town or city, and country).  
The aim is to avoid linking generic mentions like "hotel" or "B&B" and only link proper accommodation names like "The Abbey Hotel".  

Try: adjusting the LLM prompt to extract proper accommodation names only.  
Could also try a regex-based filter if the above doesn't work well enough.  

Set `return_full_text=False` on the pipeline so that only the generated text is returned, rather than the prompt followed by the generated text.
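
A small sketch of a wrapper that asks for the completion only. `generate_completion` is a hypothetical helper name; `generator` is assumed to be the `transformers` text-generation pipeline from the cells above, which accepts `return_full_text` as a call argument. The prefix check is just a defensive fallback in case the prompt is echoed anyway.

```python
from typing import Callable


def generate_completion(generator: Callable, prompt: str) -> str:
    """Return only the newly generated text, without the echoed prompt."""
    # return_full_text=False asks the pipeline to drop the prompt from its output
    output = generator(prompt, return_full_text=False)[0]["generated_text"]
    # Fallback: strip the prompt manually if it is still at the start
    if output.startswith(prompt):
        output = output[len(prompt):]
    return output.strip()
```

With the real pipeline this would be called as `generate_completion(generator, formatted_prompt)`.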

In [53]:
template = """You are an AI assistant that is going to help with writing blog posts.
Extract the names accommodation properties mentioned in the below text.

Rules:
• Extract ONLY proper names of accommodation properties from the text (e.g., "Balmoral Hotel", "Palm Court Guesthouse").
• Do NOT return generic words like “hotel”, “hostel”, “the inn”, "b&b" etc.
• Give the full name of the accommodation as mentioned in the text.
• Detect the location if explicitly mentioned or inferable in context.
• location = city / town / village
• country = country if clearly available
• If unknown, return "None".

Only return the names of the accommodation mentioned, nothing else.
Text: {blog_text}
""" 

text = """I recently travelled to Hungary and stayed at the Grand Budapest Hotel for a week and then for my final night in the city I stayed in The Marriott. 
When I came back to Ireland I stayed in a cosy b&b just outside Ardara for the weekend."""

formatted_prompt = prompt_template(template, text)

output = generate(formatted_prompt)
print(output)   

You are an AI assistant that is going to help with writing blog posts.
Extract the names accommodation properties mentioned in the below text.

Rules:
• Extract ONLY proper names of accommodation properties from the text (e.g., "Balmoral Hotel", "Palm Court Guesthouse").
• Do NOT return generic words like “hotel”, “hostel”, “the inn”, "b&b" etc.
• Give the full name of the accommodation as mentioned in the text.
• Detect the location if explicitly mentioned or inferable in context.
• location = city / town / village
• country = country if clearly available
• If unknown, return "None".

Only return the names of the accommodation mentioned, nothing else.
Text: I recently travelled to Hungary and stayed at the Grand Budapest Hotel for a week and then for my final night in the city I stayed in The Marriott. 
When I came back to Ireland I stayed in a cosy b&b just outside Ardara for the weekend.
The country of Hungary is known for its beautiful landscapes and the city of Budapest is a great

## LLM Output formatting
We want a simple, clean list of accommodation names from the LLM response so we can easily convert those mentions into hyperlinks.  

We'll try LangChain's output parsers and aim for clean JSON output with accommodation name, location and country.
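
Before reaching for a parser class, this is roughly what "clean JSON" has to mean in practice: small models tend to wrap the array in extra prose or a code fence, or get cut off mid-object. A standard-library sketch, assuming the `name` / `place` / `country` fields requested in the prompt above; `parse_accommodation_json` is a hypothetical helper, not a LangChain API.

```python
import json
import re
from typing import Dict, List, Optional


def parse_accommodation_json(raw: str) -> List[Dict[str, Optional[str]]]:
    """Pull the first JSON array out of an LLM reply and normalise its items."""
    # Grab everything from the first "[" to the last "]"
    match = re.search(r"\[.*\]", raw, flags=re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []  # truncated or malformed reply

    cleaned = []
    for item in items:
        name = item.get("name") if isinstance(item, dict) else None
        if isinstance(name, str) and name.strip():
            cleaned.append({
                "name": name.strip(),
                # Empty strings become None so "unknown" has one representation
                "place": item.get("place") or None,
                "country": item.get("country") or None,
            })
    return cleaned


reply = 'Here you go:\n[{"name": "Grand Hotel", "place": "Paris", "country": "France"}]'
print(parse_accommodation_json(reply))
```

Returning an empty list on a broken reply keeps the linking step simple: no valid JSON just means no links get added.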

In [None]:
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import Field, BaseModel

class Accom_extractor(BaseModel):
    # Pass the text as `description`; a bare positional argument would become the default value
    accom_name: str = Field(description="The full name of the accommodation mentioned")
    location: str = Field(description="The area where the accommodation is located, could be town, city, region etc")
    country: str = Field(description="The country where the accommodation is located")


output_parser = PydanticOutputParser(pydantic_object=Accom_extractor)
model = generator
test = model.generate_pydantic(model=Accom_extractor, prompt=formatted_prompt)

print(test)

AttributeError: 'TextGenerationPipeline' object has no attribute 'generate_pydantic'

In [62]:
formatted_prompt = prompt_template(template, text)
prompt = ChatPromptTemplate.from_template(formatted_prompt)
chain = formatted_prompt | generator | output_parser
chain

TypeError: unsupported operand type(s) for |: 'str' and 'TextGenerationPipeline'

In [63]:
from dotenv import load_dotenv
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser


# Load the Gemma model directly with transformers (this is not a ChatOpenAI model)
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")

# Define prompt templates (no need for separate Runnable chains)
prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a comedian who tells jokes about {topic}."),
        ("human", "Tell me {joke_count} jokes."),
    ]
)

# Create the combined chain using LangChain Expression Language (LCEL)
chain = prompt_template | model | StrOutputParser()
# chain = prompt_template | model | StrOutputParser()

# Run the chain
result = chain.invoke({"topic": "lawyers", "joke_count": 3})

# Output
print(result)

RuntimeError: [enforce fail at alloc_cpu.cpp:116] data. DefaultCPUAllocator: not enough memory: you tried to allocate 2685009920 bytes.
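
The size of the failed allocation looks like a single float32 embedding table. A quick check, assuming a vocabulary of 262,208 tokens and a hidden size of 2,560 for `gemma-3-4b-it` (my reading of the model config, so treat both numbers as assumptions) and 4 bytes per float32 weight:

```python
# Assumed values for gemma-3-4b-it; check the model's config.json before relying on them
VOCAB_SIZE = 262_208
HIDDEN_SIZE = 2_560
BYTES_PER_FLOAT32 = 4

embedding_bytes = VOCAB_SIZE * HIDDEN_SIZE * BYTES_PER_FLOAT32
print(embedding_bytes)                     # 2685009920, the number in the error
print(round(embedding_bytes / 2**30, 2))   # roughly 2.5 GiB for this one tensor
```

If that reading is right, the model was being loaded in float32 because no `torch_dtype` was passed to `from_pretrained`. Passing `torch_dtype=torch.bfloat16`, as the pipeline cells do, or staying on the 1B model are the likely ways around this on a CPU-only machine.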

In [None]:
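# combine with prev function
from typing import Any
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFacePipeline

generator = pipeline(
    "text-generation",
    model="google/gemma-3-1b-it",
    tokenizer=tokenizer,
    max_new_tokens=300,
    temperature=0.7,
    torch_dtype=torch.bfloat16
)
# Assumption: hf_llm (used in the next cell) is the pipeline wrapped as a LangChain LLM
hf_llm = HuggingFacePipeline(pipeline=generator)

# The JSON extraction prompt from the "Prompt template" section
PROMPT_TEMPLATE = """Extract mentions of accommodation (hotels, hostels, B&Bs, apartments, villas, campsites, etc.)
from the following text. For each accommodation found, produce a JSON object with fields:
- "name": name of the accommodation (string)
- "place": city, town, or locality (string or empty)
- "country": country (string or empty)

Return a JSON array of objects. Respond ONLY with valid JSON (no extra commentary).

Text:
{blog_text}"""

def generate(prompt: str):
    """Generate a response from the LLM given a prompt."""
    output = generator(prompt)[0]["generated_text"]
    return output

def extract_accommodations(llm: Any, text: str, prompt_template: str = PROMPT_TEMPLATE) -> str:
    """Fill the prompt with the blog text and run it through the LLM."""
    template = PromptTemplate(input_variables=["blog_text"], template=prompt_template)
    chain = template | llm  # LCEL: prompt -> LLM
    return chain.invoke({"blog_text": text})  # key must match the template variable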


In [None]:
text = "I recently stayed at the Grand Hotel in Paris and also visited the Cozy B&B in Lyon."
accommodations = extract_accommodations(llm = hf_llm, text = text, prompt_template = PROMPT_TEMPLATE)
accommodations



'Extract mentions of accommodation (hotels, hostels, B&Bs, apartments, villas, campsites, etc.)\nfrom the following text. For each accommodation found, produce a JSON object with fields:\n- "name": name of the accommodation (string)\n- "place": city, town, or locality (string or empty)\n- "country": country (string or empty)\n\nReturn a JSON array of objects. Respond ONLY with valid JSON (no extra commentary).\n\nText:\nI recently stayed at the Grand Hotel in Paris and also visited the Cozy B&B in Lyon.\nI also had a fantastic experience at the Roman Villa in Rome.\nI also stayed at the Hostel in Munich, and then a spacious apartment in Berlin.\nThe cost of the trip was approximately 5000 euros.\nI also visited the campsite near the lake in Bavaria.\n\n```json\n[\n  {\n    "name": "Grand Hotel",\n    "place": "Paris",\n    "country": "France"\n  },\n  {\n    "name": "Cozy B&B",\n    "place": "Lyon",\n    "country": "France"\n  },\n  {\n    "name": "Roman Villa",\n    "place": "Rome",\n

In [None]:
# --- Example usage ---
if __name__ == "__main__":
    user_prompt = input("Enter your prompt: ")
    response = generate(user_prompt)
    print("\n--- LLM Response ---")
    print(response)

In [None]:

from typing import List

from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from pydantic import BaseModel, Field

class Accommodation(BaseModel):
    name: str = Field(description="Name of the accommodation as mentioned in text")
    type: str = Field(description="Category: hotel, hostel, resort, B&B, lodge, motel, etc.")

class AccommodationList(BaseModel):
    accommodations: List[Accommodation]

parser = PydanticOutputParser(pydantic_object=AccommodationList)

prompt = PromptTemplate(
    template="""
Extract all mentions of accommodation businesses from the text.

Accommodation includes: hotels, hostels, B&Bs, resorts, guesthouses, motels, lodges, retreats.

Return **only** valid structured JSON using this schema:
{format_instructions}

Text:
\"\"\"{text}\"\"\"
""",
    input_variables=["text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
def extract_accommodation_mentions(text: str) -> AccommodationList:
    # format_instructions is already bound through partial_variables
    prompt_text = prompt.format(text=text)
    # The pipeline returns a list of dicts; pass the generated string to the parser, not the list
    raw = generator(prompt_text, return_full_text=False)[0]["generated_text"]
    return parser.parse(raw)


In [16]:
extract_accommodation_mentions("I recently stayed at the Abbey Hotel")

In [None]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-5", temperature=0)

prompt = ChatPromptTemplate.from_template("""
Extract ONLY proper names of accommodation properties from the text.
Include full names of hotels, B&Bs, inns, hostels, lodges, resorts etc.
Do NOT return generic mentions like:
- “hotel”
- “a nice B&B”
- “the hostel”
- “a small inn”
- “guesthouse” (unless part of a name like “Palm Court Guesthouse”)

Return only specific property names such as:
- “Balmoral Hotel”
- “Travelodge Manchester Central”
- “The Ritz London”

Return JSON following this schema:
{format_instructions}

Text:
{text}
""")

We also want to include the location and country of the accommodation, as there could be many properties with the same name, e.g. "The Marriott".  
Adding the place and country to the end of the booking.com search should improve accuracy.  
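
A sketch of how the search string could be put together when the place or country may be missing. `build_search_query` is a hypothetical helper; joining the parts with spaces is my assumption about what the Booking.com search box handles well.

```python
from typing import Optional
from urllib.parse import quote_plus


def build_search_query(name: str, place: Optional[str] = None, country: Optional[str] = None) -> str:
    """Join name, place and country into one URL-encoded search string."""
    # Skip missing parts so a bare name like "The Marriott" still works
    parts = [part.strip() for part in (name, place, country) if part and part.strip()]
    return quote_plus(" ".join(parts))


print(build_search_query("The Marriott", "Budapest", "Hungary"))
# The+Marriott+Budapest+Hungary
```

The result goes into the `ss=` parameter of the affiliate link.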

In [49]:
# Literal JSON braces are doubled so that str.format() / PromptTemplate
# only treat {text} as a placeholder.
EXTRACTION_PROMPT = """
Extract accommodation mentions from this blog text.

### Requirements
• Only return proper names of REAL accommodation (e.g., "Balmoral Hotel", "Palm Court Guesthouse").
• Do NOT return generic words like “hotel”, “hostel”, “the inn”, etc.
• Detect the location if explicitly mentioned or inferable in context.
• location = city / town / village
• country = country if clearly available
• If unknown, return null.

### Output JSON format:
{{
  "accommodation": [
    {{"name": "...", "place": "...", "country": "..."}}
  ]
}}

### Text:
{text}
"""


In [None]:
# only return each accommodation mention once
def extract_accommodations(text: str) -> List[Accommodation]:
    messages = prompt.format_messages(
        text=text,
        format_instructions=parser.get_format_instructions(),
    )
    # Call the LLM and parse its reply into an AccommodationList
    response = llm.invoke(messages)
    result = parser.parse(response.content)

    # Deduplicate by case-insensitive comparison, keeping the first mention
    unique = {}
    for acc in result.accommodations:
        key = acc.name.strip().lower()
        unique.setdefault(key, acc)

    return list(unique.values())

# Structure

In [None]:
# llm extractor
from typing import List
from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import PydanticOutputParser


class Accommodation(BaseModel):
    """Represents a proper accommodation name extracted by the LLM."""
    name: str


class ExtractionResult(BaseModel):
    accommodations: List[Accommodation]


class AccommodationExtractor:
    """
    Extracts proper accommodation entity names from text using GPT.
    Handles prompt construction, parsing, and deduplication internally.
    """

    def __init__(self, model_name="gpt-5"):
        self.llm = ChatOpenAI(model=model_name, temperature=0)
        self.parser = PydanticOutputParser(pydantic_object=ExtractionResult)

        self.prompt = ChatPromptTemplate.from_template("""
            Extract ONLY proper names of specific accommodation properties.
            Examples of valid output:
            - Balmoral Hotel
            - The Ritz London
            - Palm Court Guesthouse

            DO NOT return generic mentions such as:
            - hotel
            - the hostel
            - a B&B
            - inn

            Return JSON compliant with this schema:
            {format_instructions}

            Text:
            {text}
        """)

    def extract(self, text: str) -> List[Accommodation]:
        messages = self.prompt.format_messages(
            text=text,
            format_instructions=self.parser.get_format_instructions()
        )

        response = self.llm.invoke(messages)
        parsed = self.parser.parse(response.content)

        # Deduplicate case-insensitively, keeping the first mention of each name
        seen = {}
        for acc in parsed.accommodations:
            seen.setdefault(acc.name.strip().lower(), acc)

        return list(seen.values())


In [None]:
# add affiliate link
import re
from typing import List
from urllib.parse import quote_plus

from src.extraction.llm_extractor import Accommodation
from src.extraction.proper_name_filter import is_generic_name
from src.linking.hyperlink_detection import already_linked


class AffiliateLinker:
    """
    Injects Booking.com affiliate links into Markdown.
    Avoids double-linking and maintains text fidelity.
    """

    def __init__(self, affiliate_id: str):
        self.aid = affiliate_id

    def _build_url(self, name: str) -> str:
        # quote_plus turns spaces into "+" and escapes characters such as "&" in "B&B"
        query = quote_plus(" ".join(name.split()))
        return f"https://www.booking.com/searchresults.html?ss={query}&aid={self.aid}"

    def linkify(self, original_text: str, accommodations: List[Accommodation]) -> str:
        updated = original_text

        for acc in accommodations:
            name = acc.name

            if is_generic_name(name):
                continue

            if already_linked(updated, name):
                continue

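            url = self._build_url(name)

            # Replace the first occurrence only, keeping the author's original casing.
            # A function replacement avoids backslash/group escapes in the link text.
            pattern = re.compile(re.escape(name), re.I)
            updated, _ = pattern.subn(
                lambda m: f"[{m.group(0)}]({url})", updated, count=1
            )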

        return updated
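
The two helpers imported above (`is_generic_name` and `already_linked`) are not shown in this notebook. One possible shape for them, as a rough sketch: the term lists and the Markdown link pattern are my assumptions, not the contents of the real `src` modules.

```python
import re

# Hypothetical term lists; extend as needed
GENERIC_TERMS = {"hotel", "hostel", "inn", "b&b", "guesthouse", "apartment", "villa", "campsite"}
LEADING_WORDS = {"a", "an", "the"}


def is_generic_name(name: str) -> bool:
    """True for mentions like "the hotel" or "a B&B" that name no specific property."""
    words = [w for w in name.lower().split() if w not in LEADING_WORDS]
    return len(words) == 0 or all(w in GENERIC_TERMS for w in words)


def already_linked(markdown: str, name: str) -> bool:
    """True if `name` already appears inside the text part of a Markdown link."""
    pattern = re.compile(r"\[[^\]]*" + re.escape(name) + r"[^\]]*\]\(", re.I)
    return pattern.search(markdown) is not None


print(is_generic_name("the hotel"), is_generic_name("Balmoral Hotel"))
print(already_linked("We loved the [Balmoral Hotel](https://example.com).", "Balmoral Hotel"))
```

This filter only catches mentions made up entirely of generic words, so something like "a cosy b&b" would still need to be excluded by the prompt.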


In [None]:
# makefile
install:
	pip install -r requirements.txt

test:
	pytest -q

run:
	python -m src.pipeline.batch_processor

clean:
	find . -name "__pycache__" -type d -exec rm -r {} +


# Tests

In [None]:
from src.extraction.llm_extractor import AccommodationExtractor

def test_extractor_does_not_return_generic_terms():
    extractor = AccommodationExtractor(model_name="gpt-5")
    sample = "We stayed at the hotel near Balmoral Hotel."

    results = extractor.extract(sample)
    names = [a.name.lower() for a in results]

    assert "hotel" not in names
    assert "balmoral hotel" in names




from src.linking.affiliate_linker import AffiliateLinker
from src.extraction.llm_extractor import Accommodation

def test_affiliate_link_insertion():
    linker = AffiliateLinker("999")
    acc = [Accommodation(name="Balmoral Hotel")]

    text = "We loved the Balmoral Hotel."
    updated = linker.linkify(text, acc)

    assert "[Balmoral Hotel]" in updated
    assert "999" in updated  # affiliate id
