#### Final Project Option 2:
##### Using text embeddings for matching the airport businesses

#### Part 1: Setup

##### Create the BQ datasets


In [None]:
from google.cloud import bigquery

project_id = "cs378-fa2024"
dataset = "fin_air_travel"
region = "us-central1"

bq_client = bigquery.Client()

dataset_id = bigquery.Dataset(f"{project_id}.{dataset}")
dataset_id.location = region
resp = bq_client.create_dataset(dataset_id, exists_ok=True)
print("Created dataset {}.{}".format(bq_client.project, resp.dataset_id))

Created dataset cs378-fa2024.fin_air_travel


In [None]:
from google.cloud import bigquery

project_id = "cs378-fa2024"
dataset = "ai_models"
region = "us-central1"

bq_client = bigquery.Client()

dataset_id = bigquery.Dataset(f"{project_id}.{dataset}")
dataset_id.location = region
resp = bq_client.create_dataset(dataset_id, exists_ok=True)
print("Created dataset {}.{}".format(bq_client.project, resp.dataset_id))

Created dataset cs378-fa2024.ai_models


##### Register the embeddings model with BQ
##### Before running the next cell, create a remote connection to Vertex AI and then grant the service account associated with the connection the "Vertex AI User" role

In [None]:
%%bigquery
create or replace model ai_models.text_embedding
remote with connection `projects/cs378-fa2024/locations/us-central1/connections/remote-connection`
options (endpoint = 'text-embedding-004');

Query is running:   0%|          |

#### Part 2: Review the input data

##### The objective is to match up the airport_businesses such that any two retail stores in any airport that correspond to the same business or entity get assigned a common key. For example, 'iStore' and 'iStore Boutique' are the same entity, so they would be assigned the same key and similarly for 'CNBC' and 'CNBC News'.

In [3]:
%%bigquery
select business from air_travel_stg.airport_businesses order by business

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,business
0,12th Fairway Bar & Grill
1,13th Street Pub and Grill
2,3 Daughters Brewing at PIE
3,49 Mile Market
4,4th Vine
...,...
1569,iCandy
1570,iStore
1571,iStore
1572,iStore


#### Part 3: Create the embeddings

#### Create the embeddings on the retail store names.
###### More details on the `ml.generate_embedding()`: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-embedding#text-embedding

In [9]:
%%bigquery
create or replace table fin_air_travel.tmp_airport_businesses_embedding as (
select
  null as key,
  business,
  ml_generate_embedding_result as embedding
from
  ml.generate_embedding(
    model ai_models.text_embedding,
    (select business, business as content from air_travel_stg.airport_businesses
    where business is not null),
    struct('CLUSTERING' as task_type)
  )
);

Query is running:   0%|          |

##### Note: the embedding gets created on the `content` field. Having a `content` field is required when calling `ml.generate_embedding()`

In [10]:
%%bigquery
select * from fin_air_travel.tmp_airport_businesses_embedding

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,key,business,embedding
0,,G2,"[-0.039680756628513336, 0.013781944289803505, ..."
1,,AED,"[0.008763805963099003, -0.031321775168180466, ..."
2,,AED,"[0.008763805963099003, -0.031321775168180466, ..."
3,,AED,"[0.008763805963099003, -0.031321775168180466, ..."
4,,AED,"[0.008763805963099003, -0.031321775168180466, ..."
...,...,...,...
1569,,Southwest Airlines Bag Claim Office,"[0.05750828981399536, 0.032303329557180405, 0...."
1570,,Convenience Store Entertainment News,"[-0.025413580238819122, 0.008757878094911575, ..."
1571,,Costa Terraza Restaurant & Tapas Bar,"[-0.011614787392318249, -0.007871636189520359,..."
1572,,American Express - The Centurion Lounge,"[0.028600113466382027, -0.04945085942745209, -..."


#### Part 4: Match similar businesses
###### More details on the `vector_search()` function: https://cloud.google.com/bigquery/docs/reference/standard-sql/search_functions#**vector_search**

In [37]:
%%bigquery
create or replace table fin_air_travel.tmp_airport_businesses_nearest_neighbor as
select query.business as business, base.business as nearest_neighbor, distance
from
  vector_search(
    table fin_air_travel.tmp_airport_businesses_embedding,
    'embedding',
    table fin_air_travel.tmp_airport_businesses_embedding,
    'embedding',
    top_k => 2,
    distance_type => 'COSINE')
where query.business != base.business
order by distance

Query is running:   0%|          |

In [38]:
%%bigquery
select * from fin_air_travel.tmp_airport_businesses_nearest_neighbor
order by distance

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,business,nearest_neighbor,distance
0,Steak 'n' Shake,Steak 'n Shake,0.002979
1,Steak 'n Shake,Steak 'n' Shake,0.002979
2,Chili's Grill and Bar,Chili's Grill & Bar,0.003711
3,CNBC SmartShop,CNBC Smartshop,0.004278
4,CNBC Smartshop,CNBC SmartShop,0.004278
...,...,...,...
877,Amazon One,United,0.563424
878,COVID-19 Testing,Curbside Check-In,0.567132
879,The Raider Image,SRAA Board Room,0.569848
880,Engine Co No. 28,Ford's Filling Station,0.572618


In [26]:
%%bigquery
select * from fin_air_travel.tmp_airport_businesses_nearest_neighbor
where business != nearest_neighbor
and business like 'CNBC%'
order by distance

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,business,nearest_neighbor,distance
0,CNBC SmartShop,CNBC Smartshop,0.004278
1,CNBC Smartshop,CNBC SmartShop,0.004278
2,CNBC Smart Shop,CNBC Smartshop,0.02026
3,CNBC Newstand,CNN Newstand,0.1386
4,CNBC Express,CNBC,0.228352
5,CNBC Kiosk,CNBC Newstand,0.239353
6,CNBC News Quad Cities,CNBC News,0.2874


#### Part 5: Assign a common key to each cluster of nearest neighbors

##### For example, map 'CNBC SmartShop', 'CNBC Smartshop', and 'CNBC Smart Shop' to the same key

##### But first, look at the data to see what is the maximum distance we should accept to consider two business names as the same entity

In [29]:
%%bigquery
select * from fin_air_travel.tmp_airport_businesses_nearest_neighbor
where distance between 0.1 and 0.2
order by distance desc

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,business,nearest_neighbor,distance
0,SRAA Admin Office,SRAA Board Room,0.199755
1,SRAA Board Room,SRAA Admin Office,0.199755
2,SOUTHWEST AIRLINES TIKET,Southwest Airlines Ticketing,0.199230
3,Sky Lounge Steakhouse and Rawbar,SkyDine Lounge,0.199192
4,Event Room,Meeting Room,0.198862
...,...,...,...
119,Espresso Bar,Coffee Bar,0.102414
120,Nursing Mother Room,Nursing Room,0.102148
121,DFS Duty Free Fashion & Watches,DFS Duty Free,0.101770
122,Illy,Illy Coffee,0.101626


##### It appears that 0.18 would make a decent cutoff, so we'll consider all names whose distance is < 0.18 to its nearest neighbor to be same business

In [23]:
import pandas
import pandas_gbq
from google.cloud import bigquery

project_id = "cs378-fa2024"
region = "us-central1"
output_table = "fin_air_travel.tmp_airport_businesses_key"

base_query = """select business, nearest_neighbor
from fin_air_travel.tmp_airport_businesses_nearest_neighbor
where distance < 0.18 order by business
"""

def find_business_key(records, name):
    for tup in records:
        if name in tup[0]:
            return tup[1]

bq_client = bigquery.Client()
rows = bq_client.query_and_wait(base_query)
key = 0
records = []
unique_businesses = set()

for row in rows:
    business = row["business"]
    nearest_neighbor = row["nearest_neighbor"]

    if business in unique_businesses:
        # business has been assigned a key
        # print(f"{business} is in unique_businesses")

        if nearest_neighbor not in unique_businesses:
            # print(f"{nearest_neighbor} is not in unique_businesses")

            # look up the key and assign it to its nearest neighbor
            key = find_business_key(records, business)
            unique_businesses.add(nearest_neighbor)
            records.append((nearest_neighbor, key))

            print(f"assigned nearest neighbor {nearest_neighbor} same key as business {business}")

    elif nearest_neighbor in unique_businesses:
        # nearest_neighbor has been assigned a key
        # print(f"{nearest_neighbor} is in unique_businesses")

        if business not in unique_businesses:
            #print(f"{business} not in unique_businesses")

            # look up the key and assign it to its nearest neighbor
            key = find_business_key(records, nearest_neighbor)
            unique_businesses.add(business)
            records.append((business, key))

            print(f"assigned business {business} same key as nearest neighbor {nearest_neighbor}")
    else:
        # both the business and its nearest neighbor have not been seen before
        key += 1
        unique_businesses.add(business)
        unique_businesses.add(nearest_neighbor)
        records.append((business, key))
        records.append((nearest_neighbor, key))


df = pandas.DataFrame.from_records(records, columns=['business_name', 'business_key'])
print(df)

pandas_gbq.to_gbq(df, output_table, project_id=project_id, if_exists="replace")

assigned business Administration Offices same key as nearest neighbor Administrative Offices
assigned business American Air Lines same key as nearest neighbor American Airline
assigned nearest neighbor American Airlines same key as business American Airline
assigned business American Airlines Customer Service same key as nearest neighbor American Airlines
assigned business American Airlines Inc same key as nearest neighbor American Airlines
assigned business American Airlines Ticketing same key as nearest neighbor Alaska Airlines Ticketing
assigned business Baggage Service same key as nearest neighbor Baggage Service Office
assigned business Baggage Service Offices same key as nearest neighbor Baggage Service Office
assigned business Breeze Airlines same key as nearest neighbor Breeze Airways
assigned business Bussines Center same key as nearest neighbor Business Center
assigned business C Security Check Point same key as nearest neighbor B Security Check Point
assigned business CNBC S

100%|██████████| 1/1 [00:00<00:00, 7269.16it/s]


In [24]:
%%bigquery
select * from fin_air_travel.tmp_airport_businesses_key order by business_key

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,business_name,business_key
0,AA Admirals Club,1
1,American Airlines Admirals Club,1
2,American Air Lines,1
3,American Airlines,1
4,American Airlines Customer Service,1
...,...,...
212,Sun Country Airlines,57
213,Sweet Jill Bakery,58
214,Sweet Jill's Bakery,58
215,Tech On The Go,59


##### Prepare the final output table
##### Note that this table would not be the final table in the intermediate layer. It is still just a temp table because there is more work that would happen after this stage to decompose it into a `Business` table (to represent the set of unique businesses) and an `Airport_Businesses` table to represent the many-to-many relationship between `Airport` and `Business`.

In [25]:
%%bigquery
create or replace table fin_air_travel.tmp_airport_businesses_merged as
   select airport_code, terminal, business as business_name, null as business_key,
      category, location, menu_items, _data_source, _load_time
   from air_travel_stg.airport_businesses

Query is running:   0%|          |

##### Merge `tmp_airport_businesses_key` into `tmp_airport_businesses_merged`

In [26]:
%%bigquery
merge fin_air_travel.tmp_airport_businesses_merged m
using fin_air_travel.tmp_airport_businesses_key k
on m.business_name = k.business_name
when matched and m.business_key is null then
  update set business_key = k.business_key

Query is running:   0%|          |

#### Part 6: Evaluation

In [27]:
%%bigquery
select business_key, count(*) as cluster_size
from fin_air_travel.tmp_airport_businesses_merged
where business_key is not null
group by business_key
order by cluster_size desc

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,business_key,cluster_size
0,1,30
1,11,15
2,16,14
3,17,14
4,18,14
5,52,12
6,15,10
7,29,10
8,32,10
9,6,10


In [28]:
%%bigquery
select *
from fin_air_travel.tmp_airport_businesses_merged
where business_key = 1

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,airport_code,terminal,business_name,business_key,category,location,menu_items,_data_source,_load_time
0,lax,4,AA Admirals Club,1,Lounge,Gate 49A,,airportguide,2024-08-25 15:16:58.467182+00:00
1,lax,5,American Airlines Admirals Club,1,Lounge,Near Gate 50B,,airportguide,2024-08-25 15:16:58.467182+00:00
2,sfo,1,American Airlines Admirals Club,1,Lounge,Near Gates B2 - B4,,airportguide,2024-08-25 15:16:58.467182+00:00
3,oaj,1,American Airlines,1,Airlines,Gate 3,,airportguide,2024-08-25 15:16:58.467182+00:00
4,pwm,1,American Airlines,1,Airlines,Gate 4,,airportguide,2024-08-25 15:16:58.467182+00:00
5,ege,1,American Airlines,1,Airlines,Gate 4,,airportguide,2024-08-25 15:16:58.467182+00:00
6,vps,1,American Airlines,1,Airlines,Gate 4,,airportguide,2024-08-25 15:16:58.467182+00:00
7,avl,1,American Air Lines,1,Airlines,Gate 5,,airportguide,2024-08-25 15:16:58.467182+00:00
8,hdn,1,American Airlines,1,Airlines,Gate 6,,airportguide,2024-08-25 15:16:58.467182+00:00
9,cwa,1,American Airlines,1,Airlines,Gate 7,,airportguide,2024-08-25 15:16:58.467182+00:00


#### Part 7: Performance comparison

##### With the prompting approach, how many clusters of businesses did we find and how many standalone businesses were left over?

##### Note that it's not an apples-to-apples comparison because the baseline prompting approach did not seek to produce common business keys, it instead transformed the names of the businesses to make them more standard. We will use these generated names as a proxy for the common keys (i.e. when we find two or more results that share the same business names, we will treat them as a cluster).

In [47]:
%%bigquery
with clustered_businesses as
    (select sum(count) as prompting_sum_clusters from (
        select business, count(*) as count
        from air_travel_int.Airport_Businesses
        group by business
        having count(*) > 1)),

standalone_businesses as
    (select sum(count) as prompting_sum_standalone from (
        select business, count(*) as count
        from air_travel_int.Airport_Businesses
        group by business
        having count(*) = 1))

select prompting_sum_clusters, prompting_sum_standalone
from clustered_businesses cross join standalone_businesses

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,prompting_sum_clusters,prompting_sum_standalone
0,349,685


In [48]:
%%bigquery
with clustered_businesses as
    (select count(*) as embeddings_sum_clusters
     from fin_air_travel.tmp_airport_businesses_merged
     where business_key is not null),

standalone_businesses as
    (select count(*) as embeddings_sum_standalone
     from fin_air_travel.tmp_airport_businesses_merged
     where business_key is null)

select embeddings_sum_clusters, embeddings_sum_standalone
from clustered_businesses cross join standalone_businesses

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,embeddings_sum_clusters,embeddings_sum_standalone
0,366,1208


##### **Conclusion**: Although we ended up with 17 additional clusters under the embeddings approach, there is a long tail of businesses which were not matched with any others (1208). In contast, the prompting approach produced almost twice as few unmatches (685 versus 1208). That said, there are some significant efficiency gains for the embeddings approach which took a small fraction of the time to run as compared to the prompting one.

##### In the future, it would be interesting to evaluate whether combining the two methods can yield greater accuracy and scalability as compared to each method by on its own. This would mean running the embeddings strategy and then using prompting to cluster the left over records. For this, we may benefit from multiple different prompts that are more targeted to specific scenarios.  