# NOTEBOOK 02: MAPPING SUPPLIER ROOMS TO OUR INTERNAL ROOMS
In this notebook, I explore a multistep approach to linking supplier rooms to our internal reference data. The steps go from simpler to more complex, so that obvious non-matches are filtered out early.

In [2]:
EXAMPLE_HOTELS_JSON = "../output/hotels_with_rooms.json"

In [3]:
import pandas as pd
import json
import unicodedata
import re
from rapidfuzz import fuzz

## 1. Load data
In `notebook 01`, I prepared a JSON file with examples from different hotels. This lets us work with a smaller file instead of uploading the entire source file to GitHub, so anyone who downloads the repo can run this notebook without access to the source files.
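
For readers without the file at hand, the record layout can be inferred from the three keys the loading cell reads. The sketch below uses a trimmed in-memory sample built from room names that appear in this notebook; `REQUIRED_KEYS` and `check_records` are hypothetical helpers, not part of the repo.

```python
import json

# Trimmed sample mirroring the keys the notebook reads from each record.
sample = """
[
  {"hotel_id": "lp42bfe",
   "reference_rooms": ["Standard Suite", "Family Suite"],
   "rooms_in_supplier_data": ["Classic Room, 1 King Bed", "Suite, 1 Bedroom"]},
  {"hotel_id": "lp4cd34",
   "reference_rooms": ["Apartment", "Twin Room with Balcony"],
   "rooms_in_supplier_data": ["Room, 2 Queen Beds"]}
]
"""

# Assumption: these are the only keys the rest of the notebook depends on.
REQUIRED_KEYS = {"hotel_id", "reference_rooms", "rooms_in_supplier_data"}

def check_records(records):
    """Return the hotel_id of every record that lacks an expected key."""
    return [r.get("hotel_id") for r in records if not REQUIRED_KEYS.issubset(r)]

records = json.loads(sample)
print(check_records(records))  # [] means every record has the expected keys
```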

In [4]:
with open(EXAMPLE_HOTELS_JSON, "r", encoding="utf-8") as f:
    data = json.load(f)

for record in data:
    print(record["hotel_id"])
    print(record["reference_rooms"])
    print(record["rooms_in_supplier_data"])
    print("-------------------------------------------------------------------------")
    print("-------------------------------------------------------------------------")



lp42bfe
['Junior Suite with City View', 'Junior suite with Plunge Pool & Sea view', 'Junior suite with outdoor Hot tub & Sea view', 'Standard Suite', 'Junior Suite with Private Pool', 'Family Suite', 'Junior Suite']
['Classic Room, 2 Twin Beds', 'Classic Room, 1 King Bed', 'Suite, 1 Twin Bed (Master)', 'Suite, 1 Bedroom', 'Classic Room', 'Club Room, 1 King Bed, Business Lounge Access', 'Premium Room, 2 Double Beds, Tower (Main Tower)', 'Premium Room, 1 King Bed, Tower (Main Tower)', 'Premium Room, 1 King Bed']
-------------------------------------------------------------------------
-------------------------------------------------------------------------
lp4cd34
['Apartment', 'Twin Room with Balcony', 'Twin Room with Balcony', 'One-Bedroom Apartment', 'Deluxe Apartment', 'Deluxe Apartment']
['Room, 2 Queen Beds', 'Suite, 1 Bedroom', 'Suite, 1 Bedroom', 'Room, 1 King Bed (Hearing Accessible)', 'Room, 2 Queen Beds (Hearing Accessible)', 'Room, 1 King Bed (Mobility Accessible, Roll-In Sh

## 2. Text normalization
The rules below are tailored to the examples I handpicked, but it would be easy to extend them after looking through more examples. Normalization steps:
1) Lowercase and remove accents
2) Translate number words to digits
3) Apply some basic replacements (for example, *king beds* to *king bed* and *&* to *and*), then strip punctuation
4) Tokenize, removing some stopwords


In [5]:
NUMBER_WORDS = {
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9"
}

STOPWORDS = {
    "and", "with", "the", "a", "an", "of", "for", "in", "at", "to", "by", "on"
}

REPLACEMENTS = [
    (r"&", " and "),
    (r"\+", " and "),
    (r"\bking bed(s)?\b", "king bed"),
    (r"\btwin bed(s)?\b", "twin bed"),
    (r"\bdouble bed(s)?\b", "double bed"),
    (r"\bqueen bed(s)?\b", "queen bed"),
]

def strip_accents(text):
    return "".join(
        ch for ch in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(ch)
    )

def normalize(text):
    normalized_text = text.strip().lower()
    normalized_text = strip_accents(normalized_text)
    for word, digit in NUMBER_WORDS.items():
        normalized_text = re.sub(rf"\b{word}\b", digit, normalized_text)
    for pat, rep in REPLACEMENTS:
        normalized_text = re.sub(pat, rep, normalized_text)

    normalized_text = re.sub(r"[^\w\s]", " ", normalized_text)

    normalized_text = re.sub(r"\s+", " ", normalized_text).strip()  
    return normalized_text

def tokenize(text):
    tokens = normalize(text).split()
    return [t for t in tokens if t not in STOPWORDS]
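
Two details of the normalization are easy to miss, so here is a small standalone check. It repeats `strip_accents` from the cell above and applies only one number-word rule, so it is a reduced sketch rather than the full `normalize`; the accented room names are made-up inputs.

```python
import re
import unicodedata

def strip_accents(text):
    # NFKD splits an accented letter into base letter + combining mark;
    # dropping the combining marks leaves the base letter.
    return "".join(
        ch for ch in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(ch)
    )

print(strip_accents("Suite Supérieure"))  # Suite Superieure
print(strip_accents("Habitación Doble"))  # Habitacion Doble

# A hyphen is a non-word character, so \bone\b matches inside "one-bedroom".
# The punctuation step then turns the hyphen into a space.
text = "One-Bedroom Apartment".lower()
text = re.sub(r"\bone\b", "1", text)
text = re.sub(r"[^\w\s]", " ", text)
print(text)  # 1 bedroom apartment
```

Letters that have no Unicode decomposition, such as `ø` or `ß`, pass through `strip_accents` unchanged.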

In [6]:
rows = []

for record in data:
    internal_list = record["reference_rooms"]
    supplier_list = record["rooms_in_supplier_data"]
    hotel_id = record["hotel_id"]

    for origin, lst in [("internal", internal_list), ("supplier", supplier_list)]:
        for s in lst:
            rows.append({
                "lp_id":hotel_id,
                "source": origin,
                "original": s,
                "normalized": normalize(s),
                "tokens": tokenize(s)
            })

df = pd.DataFrame(rows)
df

Unnamed: 0,lp_id,source,original,normalized,tokens
0,lp42bfe,internal,Junior Suite with City View,junior suite with city view,"[junior, suite, city, view]"
1,lp42bfe,internal,Junior suite with Plunge Pool & Sea view,junior suite with plunge pool and sea view,"[junior, suite, plunge, pool, sea, view]"
2,lp42bfe,internal,Junior suite with outdoor Hot tub & Sea view,junior suite with outdoor hot tub and sea view,"[junior, suite, outdoor, hot, tub, sea, view]"
3,lp42bfe,internal,Standard Suite,standard suite,"[standard, suite]"
4,lp42bfe,internal,Junior Suite with Private Pool,junior suite with private pool,"[junior, suite, private, pool]"
5,lp42bfe,internal,Family Suite,family suite,"[family, suite]"
6,lp42bfe,internal,Junior Suite,junior suite,"[junior, suite]"
7,lp42bfe,supplier,"Classic Room, 2 Twin Beds",classic room 2 twin bed,"[classic, room, 2, twin, bed]"
8,lp42bfe,supplier,"Classic Room, 1 King Bed",classic room 1 king bed,"[classic, room, 1, king, bed]"
9,lp42bfe,supplier,"Suite, 1 Twin Bed (Master)",suite 1 twin bed master,"[suite, 1, twin, bed, master]"


## 3. Extract features for blocking clear non-matches
This step is useful because the features are cheap to compute and they reduce the number of candidates passed to the similarity algorithm.
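
To give a feel for the reduction, the sketch below counts candidate pairs before and after blocking on room type alone. It is a plain-Python illustration on a handful of names from hotel `lp4cd34`; the helper `room_type` is a stand-in for the pandas extraction used later, and a missing type is treated as "unknown" rather than as a mismatch.

```python
import re

ROOM_TYPE = re.compile(r"\b(suite|apartment|loft|room)\b")

def room_type(name):
    """Return the first room-type keyword in the name, or None."""
    m = ROOM_TYPE.search(name.lower())
    return m.group(1) if m else None

internal = ["Apartment", "Twin Room with Balcony",
            "One-Bedroom Apartment", "Deluxe Apartment"]
supplier = ["Room, 2 Queen Beds", "Suite, 1 Bedroom",
            "Room, 1 King Bed (Hearing Accessible)"]

def same_block(s, i):
    # Keep the pair unless both types are known and differ.
    ts, ti = room_type(s), room_type(i)
    return ts is None or ti is None or ts == ti

all_pairs = [(s, i) for s in supplier for i in internal]
blocked_pairs = [(s, i) for s, i in all_pairs if same_block(s, i)]

print(len(all_pairs), len(blocked_pairs))  # 12 2
```

Note that `\b` keeps "bedroom" from being read as "room", so "One-Bedroom Apartment" is typed as an apartment.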

### 3.1. Define room type

In [7]:
df["room_type"] = df["normalized"].str.extract(r"\b(suite|apartment|loft|room)\b", expand=False)
df


Unnamed: 0,lp_id,source,original,normalized,tokens,room_type
0,lp42bfe,internal,Junior Suite with City View,junior suite with city view,"[junior, suite, city, view]",suite
1,lp42bfe,internal,Junior suite with Plunge Pool & Sea view,junior suite with plunge pool and sea view,"[junior, suite, plunge, pool, sea, view]",suite
2,lp42bfe,internal,Junior suite with outdoor Hot tub & Sea view,junior suite with outdoor hot tub and sea view,"[junior, suite, outdoor, hot, tub, sea, view]",suite
3,lp42bfe,internal,Standard Suite,standard suite,"[standard, suite]",suite
4,lp42bfe,internal,Junior Suite with Private Pool,junior suite with private pool,"[junior, suite, private, pool]",suite
5,lp42bfe,internal,Family Suite,family suite,"[family, suite]",suite
6,lp42bfe,internal,Junior Suite,junior suite,"[junior, suite]",suite
7,lp42bfe,supplier,"Classic Room, 2 Twin Beds",classic room 2 twin bed,"[classic, room, 2, twin, bed]",room
8,lp42bfe,supplier,"Classic Room, 1 King Bed",classic room 1 king bed,"[classic, room, 1, king, bed]",room
9,lp42bfe,supplier,"Suite, 1 Twin Bed (Master)",suite 1 twin bed master,"[suite, 1, twin, bed, master]",suite


### 3.2. Get bed type

In [8]:
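def extract_bed_type(tokens):
    bed_types = {"king", "twin", "double", "queen"}
    for i, tok in enumerate(tokens):
        # The bounds check avoids an IndexError when a bed-type word is the last token.
        if tok in bed_types and i + 1 < len(tokens) and tokens[i + 1] == "bed":
            return tok
    return None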

df["bed_type"] = df["tokens"].apply(extract_bed_type)
df

Unnamed: 0,lp_id,source,original,normalized,tokens,room_type,bed_type
0,lp42bfe,internal,Junior Suite with City View,junior suite with city view,"[junior, suite, city, view]",suite,
1,lp42bfe,internal,Junior suite with Plunge Pool & Sea view,junior suite with plunge pool and sea view,"[junior, suite, plunge, pool, sea, view]",suite,
2,lp42bfe,internal,Junior suite with outdoor Hot tub & Sea view,junior suite with outdoor hot tub and sea view,"[junior, suite, outdoor, hot, tub, sea, view]",suite,
3,lp42bfe,internal,Standard Suite,standard suite,"[standard, suite]",suite,
4,lp42bfe,internal,Junior Suite with Private Pool,junior suite with private pool,"[junior, suite, private, pool]",suite,
5,lp42bfe,internal,Family Suite,family suite,"[family, suite]",suite,
6,lp42bfe,internal,Junior Suite,junior suite,"[junior, suite]",suite,
7,lp42bfe,supplier,"Classic Room, 2 Twin Beds",classic room 2 twin bed,"[classic, room, 2, twin, bed]",room,twin
8,lp42bfe,supplier,"Classic Room, 1 King Bed",classic room 1 king bed,"[classic, room, 1, king, bed]",room,king
9,lp42bfe,supplier,"Suite, 1 Twin Bed (Master)",suite 1 twin bed master,"[suite, 1, twin, bed, master]",suite,twin


### 3.3. Get number of beds

In [9]:
def extract_beds(tokens):
    bed_types = {"king", "twin", "double", "queen"}
    for i, tok in enumerate(tokens):
        if tok in bed_types and i > 0 and tokens[i-1].isdigit():
            return int(tokens[i-1])
    return None

df["num_beds"] = df["tokens"].apply(extract_beds)
df

Unnamed: 0,lp_id,source,original,normalized,tokens,room_type,bed_type,num_beds
0,lp42bfe,internal,Junior Suite with City View,junior suite with city view,"[junior, suite, city, view]",suite,,
1,lp42bfe,internal,Junior suite with Plunge Pool & Sea view,junior suite with plunge pool and sea view,"[junior, suite, plunge, pool, sea, view]",suite,,
2,lp42bfe,internal,Junior suite with outdoor Hot tub & Sea view,junior suite with outdoor hot tub and sea view,"[junior, suite, outdoor, hot, tub, sea, view]",suite,,
3,lp42bfe,internal,Standard Suite,standard suite,"[standard, suite]",suite,,
4,lp42bfe,internal,Junior Suite with Private Pool,junior suite with private pool,"[junior, suite, private, pool]",suite,,
5,lp42bfe,internal,Family Suite,family suite,"[family, suite]",suite,,
6,lp42bfe,internal,Junior Suite,junior suite,"[junior, suite]",suite,,
7,lp42bfe,supplier,"Classic Room, 2 Twin Beds",classic room 2 twin bed,"[classic, room, 2, twin, bed]",room,twin,2.0
8,lp42bfe,supplier,"Classic Room, 1 King Bed",classic room 1 king bed,"[classic, room, 1, king, bed]",room,king,1.0
9,lp42bfe,supplier,"Suite, 1 Twin Bed (Master)",suite 1 twin bed master,"[suite, 1, twin, bed, master]",suite,twin,1.0


### 3.4. Get number of bedrooms

In [10]:
def extract_bedroom_count(tokens):
    for i, tok in enumerate(tokens):
        if tok == "bedroom" and i > 0 and tokens[i-1].isdigit():
            return int(tokens[i-1])
    return None

df["num_bedrooms"] = df["tokens"].apply(extract_bedroom_count)
df

Unnamed: 0,lp_id,source,original,normalized,tokens,room_type,bed_type,num_beds,num_bedrooms
0,lp42bfe,internal,Junior Suite with City View,junior suite with city view,"[junior, suite, city, view]",suite,,,
1,lp42bfe,internal,Junior suite with Plunge Pool & Sea view,junior suite with plunge pool and sea view,"[junior, suite, plunge, pool, sea, view]",suite,,,
2,lp42bfe,internal,Junior suite with outdoor Hot tub & Sea view,junior suite with outdoor hot tub and sea view,"[junior, suite, outdoor, hot, tub, sea, view]",suite,,,
3,lp42bfe,internal,Standard Suite,standard suite,"[standard, suite]",suite,,,
4,lp42bfe,internal,Junior Suite with Private Pool,junior suite with private pool,"[junior, suite, private, pool]",suite,,,
5,lp42bfe,internal,Family Suite,family suite,"[family, suite]",suite,,,
6,lp42bfe,internal,Junior Suite,junior suite,"[junior, suite]",suite,,,
7,lp42bfe,supplier,"Classic Room, 2 Twin Beds",classic room 2 twin bed,"[classic, room, 2, twin, bed]",room,twin,2.0,
8,lp42bfe,supplier,"Classic Room, 1 King Bed",classic room 1 king bed,"[classic, room, 1, king, bed]",room,king,1.0,
9,lp42bfe,supplier,"Suite, 1 Twin Bed (Master)",suite 1 twin bed master,"[suite, 1, twin, bed, master]",suite,twin,1.0,
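
One thing to watch: `extract_bedroom_count` looks for the exact token `bedroom`, so a plural such as "3 Bedrooms" yields the token `bedrooms` and no count is extracted. A possible variant that accepts both forms is sketched below. `extract_bedroom_count_any` is a hypothetical name, not part of the notebook, and using it would change the `num_bedrooms` values shown in the outputs.

```python
BEDROOM_TOKENS = {"bedroom", "bedrooms"}

def extract_bedroom_count_any(tokens):
    """Return the digit that precedes 'bedroom' or 'bedrooms', else None."""
    for i, tok in enumerate(tokens):
        if tok in BEDROOM_TOKENS and i > 0 and tokens[i - 1].isdigit():
            return int(tokens[i - 1])
    return None

print(extract_bedroom_count_any(["design", "loft", "3", "bedrooms", "non", "smoking"]))  # 3
print(extract_bedroom_count_any(["1", "bedroom", "apartment"]))                          # 1
print(extract_bedroom_count_any(["junior", "suite"]))                                    # None
```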


### 3.5. Putting it all together
Now that all the features have been extracted, we can use a couple of examples to understand how the blocking works.

In [11]:
example_room = "Design Loft, 3 Bedrooms, Non Smoking"
target_row = df[(df["original"]==example_room)&(df["source"]=="supplier")].iloc[0]

target_hotel_code = target_row["lp_id"]

potential_matches = df[(df["lp_id"]==target_hotel_code)&(df["source"]=="internal")]
potential_matches

Unnamed: 0,lp_id,source,original,normalized,tokens,room_type,bed_type,num_beds,num_bedrooms
40,lp10037d,internal,"Design Loft, 3 Bedrooms, Non Smoking",design loft 3 bedrooms non smoking,"[design, loft, 3, bedrooms, non, smoking]",loft,,,
41,lp10037d,internal,"Room, Multiple Beds, Non Smoking",room multiple beds non smoking,"[room, multiple, beds, non, smoking]",room,,,


In this example, we can discard the second option because the room we're trying to match is a *loft*, not a *room*.

In [12]:
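def eq_or_unknown(series, value):
    # A missing target value cannot rule anything out, so keep every candidate.
    if pd.isna(value):
        return pd.Series(True, index=series.index)
    # Otherwise keep candidates that agree or whose own value is unknown.
    return (series == value) | series.isna()

def match_candidates(row, df):
    # A candidate survives only if no known feature conflicts with the target.
    mask = pd.Series(True, index=df.index)
    for col in ["room_type", "bed_type", "num_beds", "num_bedrooms"]:
        mask &= eq_or_unknown(df[col], row[col])
    return df[mask]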

matches = match_candidates(target_row, potential_matches)
matches


Unnamed: 0,lp_id,source,original,normalized,tokens,room_type,bed_type,num_beds,num_bedrooms
40,lp10037d,internal,"Design Loft, 3 Bedrooms, Non Smoking",design loft 3 bedrooms non smoking,"[design, loft, 3, bedrooms, non, smoking]",loft,,,
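
The filter treats a missing feature as "unknown" rather than as a mismatch: a candidate is dropped only when a feature is known on both sides and differs. The plain-Python sketch below restates that rule with dictionaries instead of a DataFrame; `compatible` and the sample records are illustrative, with feature values taken from the tables in this notebook.

```python
FEATURES = ["room_type", "bed_type", "num_beds", "num_bedrooms"]

def compatible(target, candidate):
    """True unless some feature is known on both sides and differs."""
    for col in FEATURES:
        t, c = target.get(col), candidate.get(col)
        if t is not None and c is not None and t != c:
            return False
    return True

# Supplier room "Room, 1 King Bed (Hearing Accessible)".
target = {"room_type": "room", "bed_type": "king", "num_beds": 1}

candidates = [
    {"name": "Apartment", "room_type": "apartment"},
    {"name": "Twin Room with Balcony", "room_type": "room"},
    {"name": "One-Bedroom Apartment", "room_type": "apartment", "num_bedrooms": 1},
]

kept = [c["name"] for c in candidates if compatible(target, c)]
print(kept)  # ['Twin Room with Balcony']
```

The design choice is deliberately permissive: with sparse features, a strict equality test would discard most internal rooms, since many of them carry no bed or bedroom information at all.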


#### 3.5.1. Another example

In [13]:
example_room = "Room, 1 King Bed (Hearing Accessible)"
target_row = df[(df["original"]==example_room)&(df["source"]=="supplier")].iloc[0]

target_hotel_code = target_row["lp_id"]

potential_matches = df[(df["lp_id"]==target_hotel_code)&(df["source"]=="internal")]
potential_matches

Unnamed: 0,lp_id,source,original,normalized,tokens,room_type,bed_type,num_beds,num_bedrooms
16,lp4cd34,internal,Apartment,apartment,[apartment],apartment,,,
17,lp4cd34,internal,Twin Room with Balcony,twin room with balcony,"[twin, room, balcony]",room,,,
18,lp4cd34,internal,Twin Room with Balcony,twin room with balcony,"[twin, room, balcony]",room,,,
19,lp4cd34,internal,One-Bedroom Apartment,1 bedroom apartment,"[1, bedroom, apartment]",apartment,,,1.0
20,lp4cd34,internal,Deluxe Apartment,deluxe apartment,"[deluxe, apartment]",apartment,,,
21,lp4cd34,internal,Deluxe Apartment,deluxe apartment,"[deluxe, apartment]",apartment,,,


In this case, the feature filter may already be enough: it leaves the two *Twin Room with Balcony* entries, which are the best options available. I can see nothing in the extracted features that makes them incompatible with the room we're trying to match.

They are different rooms with different ids in our dataset, but they share the same name.

In [14]:
matches = match_candidates(target_row, potential_matches)
matches

Unnamed: 0,lp_id,source,original,normalized,tokens,room_type,bed_type,num_beds,num_bedrooms
17,lp4cd34,internal,Twin Room with Balcony,twin room with balcony,"[twin, room, balcony]",room,,,
18,lp4cd34,internal,Twin Room with Balcony,twin room with balcony,"[twin, room, balcony]",room,,,


## 4. Finding similarity between close candidates
The most accurate option for this task would be a transformer-based model such as all-MiniLM from SentenceTransformers, because it can capture semantic similarity between room names (e.g. “Room with City View” vs “Room with Tower View”, or “Ocean View” vs “Sea View”). However, for the sake of simplicity, and to avoid the heavy installation of `SentenceTransformers` (which requires downloading PyTorch), I will use RapidFuzz instead. RapidFuzz does not capture semantic similarity, but since I am already blocking obvious non-matches, it should work well enough for this exercise.
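
To make the limitation concrete, the sketch below uses a plain word-overlap (Jaccard) measure as a stand-in for any purely lexical scorer. It is not RapidFuzz and not what this notebook uses for scoring; `token_overlap` is a hypothetical helper that only shows what a string-based method can and cannot see.

```python
def token_overlap(a, b):
    """Jaccard overlap of lowercase word sets: shared words / all words."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# "ocean" and "sea" mean the same thing, but only "view" is shared.
print(token_overlap("Ocean View", "Sea View"))                       # 1 shared word out of 3

# "city" and "tower" differ in meaning, yet most words are shared.
print(token_overlap("Room with City View", "Room with Tower View"))  # 3 shared words out of 5
```

A lexical scorer ranks the second pair as more alike than the first, which is the opposite of what a semantic model would be expected to do.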

In [22]:
def score_against_column(target, df, scorer=fuzz.token_set_ratio):
    scores = df["normalized"].apply(lambda x: scorer(target, x))
    out = df.copy()
    out["similarity"] = scores
    return out.sort_values("similarity", ascending=False)

*RapidFuzz* scores the similarity between two strings on a 0-100 scale. In this example, the supplier name and the internal name are worded identically, so the score is 100.
`token_set_ratio` is useful here because it ignores word order and does not heavily penalize differences in length caused by extra modifiers.
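
As far as I understand it, `token_set_ratio` starts by splitting both strings into word sets and separating the shared words from the leftovers on each side. The sketch below reproduces only that decomposition step in plain Python; `token_sets` is a hypothetical helper and it computes no score.

```python
def token_sets(a, b):
    """Return (shared words, words only in a, words only in b), each sorted."""
    ta, tb = set(a.split()), set(b.split())
    common = " ".join(sorted(ta & tb))
    only_a = " ".join(sorted(ta - tb))
    only_b = " ".join(sorted(tb - ta))
    return common, only_a, only_b

# Same words in a different order: nothing is left over on either side.
print(token_sets("room 1 king bed", "king bed room 1"))
# ('1 bed king room', '', '')

# Only "room" is shared, so most of both strings ends up in the leftovers.
print(token_sets("room 1 king bed hearing accessible", "twin room with balcony"))
# ('room', '1 accessible bed hearing king', 'balcony twin with')
```

When one side has no leftovers, the scorer treats one name as contained in the other, which is why extra modifiers on a single side do not drag the score down.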

In [23]:
supplier_room = "Design Loft, 3 Bedrooms, Non Smoking"
target_row = df[(df["original"]==supplier_room)&(df["source"]=="supplier")].iloc[0]
normalized_supplier_room = target_row["normalized"]

target_hotel_code = target_row["lp_id"]

potential_matches = df[(df["lp_id"]==target_hotel_code)&(df["source"]=="internal")]

pre_filtered_matches = match_candidates(target_row, potential_matches)


matches_scored = score_against_column(normalized_supplier_room, pre_filtered_matches)
matches_scored

Unnamed: 0,lp_id,source,original,normalized,tokens,room_type,bed_type,num_beds,num_bedrooms,similarity
40,lp10037d,internal,"Design Loft, 3 Bedrooms, Non Smoking",design loft 3 bedrooms non smoking,"[design, loft, 3, bedrooms, non, smoking]",loft,,,,100.0


### 4.2. Another example
In the second example, the candidates score much lower, since the strings are less similar.

In [24]:
supplier_room = "Room, 1 King Bed (Hearing Accessible)"
target_row = df[(df["original"]==supplier_room)&(df["source"]=="supplier")].iloc[0]
normalized_supplier_room = target_row["normalized"]

target_hotel_code = target_row["lp_id"]

potential_matches = df[(df["lp_id"]==target_hotel_code)&(df["source"]=="internal")]

pre_filtered_matches = match_candidates(target_row, potential_matches)


matches_scored = score_against_column(normalized_supplier_room, pre_filtered_matches)
matches_scored

Unnamed: 0,lp_id,source,original,normalized,tokens,room_type,bed_type,num_beds,num_bedrooms,similarity
17,lp4cd34,internal,Twin Room with Balcony,twin room with balcony,"[twin, room, balcony]",room,,,,42.857143
18,lp4cd34,internal,Twin Room with Balcony,twin room with balcony,"[twin, room, balcony]",room,,,,42.857143
