# Approach 2: The Semantic Detective 🕵️‍♀️

# **Section 1: Preprocessing**
+ This section covers the setup required for the rest of this notebook: installing the necessary packages, importing libraries, and initializing the Google BigQuery client.
+ Our main tools for this project are `pandas` and `bigquery` from `google.cloud`.
+ Google Cloud's `bigframes` library is uninstalled because of a version conflict in Kaggle's default environment.

In [1]:
# Install google-cloud-bigquery-storage so that BigQuery SQL runs without errors
# -q suppresses verbose output for the sake of readability
!pip uninstall -q -y bigframes
!pip install -q google-cloud-bigquery-storage

^C
ERROR: Operation cancelled by user
^C
ERROR: Operation cancelled by user

In [4]:
# Import all libraries required for this project
import pandas as pd

from google.cloud import bigquery
from datetime import datetime, timedelta

## **Define project and dataset ids**
+ Creating a BigQuery client requires a project id: `analog-delight-470708-d0`.
+ We also define the ids of the dataset and tables that were imported from Google Cloud Storage buckets into BigQuery. Please refer to our blog for details of the selected datasets.
    + `steam_game_list` contains the inventory of games available on the Steam platform. Each game has textual features such as `short description` and `tags`.
    + `steam_reviews` (referenced in code as `review_data`) contains players' reviews for some games on Steam.
+ The embedding model `llm_steam` is based on `text-embedding-004`. No further fine-tuning has been performed.

In [6]:
# Initialize BigQuery client with Google Cloud's project id
project_id = 'analog-delight-470708-d0'
client = bigquery.Client(project=project_id)

# We also define dataset and table ids
dataset_id = 'steam'
game_list_data = 'steam_game_list'
review_data = 'steam_reviews'

# We also define the name of the text embedding model
embedding_model_name = 'llm_steam'
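
Every query in this notebook repeats the `project.dataset.table` path. As a minimal sketch, a small helper could build the backtick-quoted identifier in one place. `qualified_table` is a hypothetical name; it is not part of the notebook or of the BigQuery client library.

```python
def qualified_table(project_id: str, dataset_id: str, table_id: str) -> str:
    """Return a backtick-quoted, fully qualified BigQuery table path."""
    for part in (project_id, dataset_id, table_id):
        # Reject empty parts and embedded backticks, which would break the quoting
        if not part or "`" in part:
            raise ValueError(f"invalid identifier part: {part!r}")
    return f"`{project_id}.{dataset_id}.{table_id}`"


# Example with the ids defined above
print(qualified_table("analog-delight-470708-d0", "steam", "steam_game_list"))
```

Quoting the whole path is the safe choice here because the project id contains hyphens.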

## **Create primary keys for datasets in BigQuery**


+ This helper function takes a dataset id, a table id, and a column name, and checks whether the column exists in the table's schema.
+ It lets us skip costly BigQuery operations that have already been run.

In [16]:
# Check whether a column exists in the table schema
def check_column_exists(dataset_id, table_id, name):
    # Retrieve the table schema from its fully qualified id
    # (client.dataset() is deprecated in recent google-cloud-bigquery releases)
    table_schema = client.get_table(f"{project_id}.{dataset_id}.{table_id}").schema

    # Return True if any field in the schema has the requested name
    return any(field.name == name for field in table_schema)

+ We convert *App ID* in `steam.steam_game_list` from string to integer and store it in a new column called *app_id*.
+ This step makes it easier to join the table with Steam's review data `steam.steam_reviews` in BigQuery.

In [4]:
# Check whether the primary key for game_list_data exists
game_list_data_pk = 'app_id'
exist_app_id = check_column_exists(dataset_id, game_list_data, game_list_data_pk)
print('Does the primary key exist? ' + str(exist_app_id))

# If it does not exist, generate it
if not exist_app_id:
    # Table paths are backtick-quoted because the project id contains hyphens
    query = f"""
    alter table `{project_id}.{dataset_id}.{game_list_data}`
    add column if not exists {game_list_data_pk} integer;

    update `{project_id}.{dataset_id}.{game_list_data}`
    set {game_list_data_pk} = cast(`App ID` as integer)
    where true;

    alter table `{project_id}.{dataset_id}.{game_list_data}`
    add primary key ({game_list_data_pk}) not enforced;
    """
    result_pk = client.query(query)
    print(result_pk.result())

Does the primary key exist? True


# **Section 2: Generate Embeddings and Create Vector Indices**

## Generate embeddings for vector search
+ Two embedding vectors are created for the table `steam.steam_game_list` to improve search accuracy. They are based on two textual columns: `short description` and `tags`.
    + `short description` contains the description of the game e.g. "Plunger Knight is a platformer game about a lonely knight who lived in the Middle Age and has lost almost everything he had but had to face a new challenge: robots. In the series of his adventures, he travels the world and time to battle against them."
    + `tags` contains the tags assigned to each game on Steam e.g. `Action: 22, Adventure: 20, Casual: 20, Indie: 20, RPG: 19, Platformer: 12, Funny: 11, Puzzle-Platformer: 10`.
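
The `tags` value is a single string of `name: count` pairs, and the notebook embeds it as-is. If the counts were ever needed separately (for example, to inspect the dominant tags of a game), the string could be split as in the sketch below. `parse_tags` is a hypothetical helper, and it assumes the format matches the example above exactly.

```python
def parse_tags(tags: str) -> dict:
    """Split 'Action: 22, Adventure: 20' into {'Action': 22, 'Adventure': 20}."""
    parsed = {}
    for item in tags.split(","):
        if not item.strip():
            continue  # skip empty fragments, e.g. from an empty string
        # rpartition splits on the last colon, so a colon inside a tag name survives
        name, _, count = item.rpartition(":")
        parsed[name.strip()] = int(count)
    return parsed


print(parse_tags("Action: 22, Adventure: 20, Casual: 20, Puzzle-Platformer: 10"))
```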

In [18]:
# Add an embeddings column (if it does not exist yet) and fill it using the embedding model
def create_embeddings(embeddings_name, embeddings_model_name, table_name, column_name):
    query = f"""
    alter table `{project_id}.{dataset_id}.{table_name}`
    add column if not exists {embeddings_name} array<float64>;

    update `{project_id}.{dataset_id}.{table_name}` as t
    set t.{embeddings_name} = e.ml_generate_embedding_result
    from (
        select distinct
            ml_generate_embedding_result,
            content
        from ml.generate_embedding(
            model `{project_id}.{dataset_id}.{embeddings_model_name}`,
            (select ifnull({column_name}, ' ') as content
              from `{project_id}.{dataset_id}.{table_name}`
            )
        )
    ) e
    where ifnull(t.{column_name}, ' ') = e.content
    """
    return client.query(query)

To save time, the embeddings are generated only when the corresponding column does not exist yet.

In [None]:
# Create text embeddings for 'short description' of each game available on steam
exist_desc = check_column_exists(dataset_id, game_list_data, "desc_embeddings")
print('Does the embeddings for short description of game exist? ' + str(exist_desc))
if not exist_desc:
    result_desc = create_embeddings("desc_embeddings", embedding_model_name, game_list_data, "`Short Description`")
    print(result_desc.result())

# Create text embeddings for 'tags' of each game available on steam
exist_tags = check_column_exists(dataset_id, game_list_data, "tags_embeddings")
print('Does the embeddings for tags of game exist? ' + str(exist_tags))
if not exist_tags:
    result_tags = create_embeddings("tags_embeddings", embedding_model_name, game_list_data, "tags")
    print(result_tags.result())

Does the embeddings for short description of game exist? True
Does the embeddings for tags of game exist? True


In [None]:
def create_table_reviews_embeddings(out_table_name, in_table_name):
    # Build a new table that holds one embedding per review, together with its app_id
    query = f"""
    create or replace table `{project_id}.{dataset_id}.{out_table_name}` as
    select *
    from ml.generate_embedding(
        model `{project_id}.{dataset_id}.{embedding_model_name}`,
        (select Review as content, app_id from `{project_id}.{dataset_id}.{in_table_name}`),
        struct(
            true as flatten_json_output,
            'RETRIEVAL_DOCUMENT' as task_type
        )
    );
    """
    # Submit the job; call .result() on the returned job to wait for completion
    return client.query(query)


# **Section 3: Use Cases of Google BigQuery AI in Product Positioning**

## Use case 3.1 - Search for similar Steam games given a user query on game characteristics

The function `get_list_of_games` wraps the SQL query that retrieves games semantically close to the user query. 
For example, a user query can be "`I would like to find a multi-person strategic game on farming in an open-world setting.`". 
To improve accuracy, the query keeps only games whose description and tags are both close to the user query, via an inner join on `app_id`.

*Parameters to tune:*
+ `number_of_games`: This parameter sets `top_k` for the vector search on each of the two embeddings. A game is returned only if it ranks in the top `number_of_games` for both the description and the tags, so the result may contain fewer than `number_of_games` rows.
+ `min_reviews`: This parameter filters games by popularity: a game is kept only if it has more than `min_reviews` positive reviews or more than `min_reviews` negative reviews.

*Expected results:*
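`get_list_of_games` should return a dataframe with the columns selected from `steam.steam_game_list` (`desc_embeddings`, `name`, `app_id`, `short description`, `tags`, `positive reviews`, `negative reviews`) for the games that are semantically close to the user query.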
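
The effect of combining two top-k searches with an inner join can be reproduced in plain Python. The sketch below uses toy 2-D vectors; the vectors, ids, and helper names are made up for illustration, and BigQuery's `VECTOR_SEARCH` is not called.

```python
import math


def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def top_k_ids(query, vectors, k):
    """Return the ids of the k vectors closest to the query."""
    ranked = sorted(vectors, key=lambda app_id: cosine_distance(query, vectors[app_id]))
    return ranked[:k]


# Toy embeddings keyed by app_id (hypothetical values)
desc_vectors = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0]}
tags_vectors = {1: [0.0, 1.0], 2: [1.0, 0.1], 3: [0.8, 0.2]}
query = [1.0, 0.0]

top_desc = top_k_ids(query, desc_vectors, k=2)
top_tags = top_k_ids(query, tags_vectors, k=2)

# Inner join on app_id: keep only games present in both top-k lists
both = [app_id for app_id in top_desc if app_id in top_tags]
print(top_desc, top_tags, both)
```

Here only one game appears in both top-2 lists, so the joined result is shorter than `k`. This is why a fairly large `number_of_games` may be needed to get a useful number of rows back.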

In [6]:
def get_list_of_games(user_input, number_of_games, min_reviews):
    embeddings = ["desc_embeddings", "tags_embeddings"]
    # Pass the user text as a query parameter so that quotes in it cannot break the SQL
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("user_input", "STRING", user_input)]
    )
    query = f"""
    select a.base.*
    from vector_search(
        (select {embeddings[0]}, name, app_id, `short description`, tags, `positive reviews`, `negative reviews` 
        from `{project_id}.{dataset_id}.{game_list_data}`
        where (`positive reviews` > {min_reviews}) or (`negative reviews` > {min_reviews})),
        '{embeddings[0]}',
        (select ml_generate_embedding_result, content as query 
        from ml.generate_embedding(
        model `{project_id}.{dataset_id}.{embedding_model_name}`,
            (select @user_input as content))
        ),
        top_k => {number_of_games},
        distance_type => 'COSINE') as a
    inner join 
    vector_search(
        (select {embeddings[1]}, name, app_id, `short description`, tags, `positive reviews`, `negative reviews` 
        from `{project_id}.{dataset_id}.{game_list_data}`
        where (`positive reviews` > {min_reviews}) or (`negative reviews` > {min_reviews})),
        '{embeddings[1]}',
        (select ml_generate_embedding_result, content as query 
        from ml.generate_embedding(
        model `{project_id}.{dataset_id}.{embedding_model_name}`,
            (select @user_input as content))
        ),
        top_k => {number_of_games},
        distance_type => 'COSINE') as b
        on a.base.app_id = b.base.app_id
    """
    df = client.query(query, job_config=job_config).to_dataframe()
    return df

In [15]:
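def store_selected_games(app_ids: list):
    # Fully qualified name of the scratch table holding reviews of the selected games
    table_name = f'{project_id}.{dataset_id}.temp'
    # app_id values are integers, so convert them to strings before joining
    id_list = ','.join(str(app_id) for app_id in app_ids)
    query = f"""
        create or replace table `{table_name}` as 
        select * 
        from `{project_id}.{dataset_id}.{review_data}`
        where app_id in ({id_list})
    """
    # The DDL statement returns no rows, so the resulting dataframe is empty
    df = client.query(query).to_dataframe()
    return df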

In [None]:
number_of_games = 100
min_reviews = 1000

user_input = "I would like to find a multi-person strategic game on farming in an open-world setting."
df_retrieve = get_list_of_games(user_input, number_of_games, min_reviews)
df_store = store_selected_games(df_retrieve['app_id'].tolist())

In [None]:
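# Result: rank the retrieved games by their ratio of positive to negative reviews
df_retrieve['odd'] = df_retrieve['positive reviews']/df_retrieve['negative reviews']
# Skip the first column (the embedding vector) before sorting
df_sort = df_retrieve.iloc[:, 1:].sort_values('odd', ascending=False)
df_sort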
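
The positive-to-negative ratio becomes infinite in pandas for a game with no negative reviews. The `min_reviews` filter does not rule this out, because it accepts a game when either count is large enough. One common workaround, shown here as a sketch and not as the notebook's method, is additive smoothing. `smoothed_odds` and its default `prior` are assumptions.

```python
def smoothed_odds(positive: int, negative: int, prior: float = 1.0) -> float:
    """Positive-to-negative ratio with additive smoothing to avoid division by zero."""
    return (positive + prior) / (negative + prior)


print(smoothed_odds(1200, 0))   # finite even though there are no negative reviews
print(smoothed_odds(900, 100))
```

A larger `prior` pulls games with few reviews towards a ratio of 1, which may or may not be desirable for ranking.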

## Use case 3.2 - Search for relevant reviews of the selected games given a query on product features

In [27]:
exist_reviews = check_column_exists(dataset_id, "temp", "reviews_embeddings")
print('Does the embeddings for reviews of game exist? ' + str(exist_reviews))
if not exist_reviews:
    result_reviews = create_embeddings("reviews_embeddings", embedding_model_name, "temp", "review")
    print(result_reviews.result())

Does the embeddings for reviews of game exist? False
<google.cloud.bigquery.table._EmptyRowIterator object at 0x7e4e7d2bb0d0>


In [13]:
number_of_reviews = 20
embeddings = ["reviews_embeddings"]
user_input = 'Is this game easy to play for elderly?'

# Pass the user text as a query parameter so that quotes in it cannot break the SQL
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("user_input", "STRING", user_input)]
)
query = f"""
select a.base.*
from vector_search(
    (select {embeddings[0]}, review, app_name, app_id
    from `{project_id}.{dataset_id}.temp`),
    '{embeddings[0]}',
    (select ml_generate_embedding_result, content as query 
    from ml.generate_embedding(
    model `{project_id}.{dataset_id}.{embedding_model_name}`,
        (select @user_input as content))
    ),
    top_k => {number_of_reviews},
    distance_type => 'COSINE') as a
"""
df = client.query(query, job_config=job_config).to_dataframe()



In [14]:
# Strength and weakness analysis
for i in df['review']:
    print(i)

It is a fun game but it may be difficult for beginners.
I love this game. Totally worth the money and its fun. for an old lady. :0)
it's a fun game and easy to play
Great game really easy to play
Ein richtig schönes und entspanntes Spiel für zwischendurch für alle Altersklassen geeignet.
It is a nice and relaxing game. It may appear hard at first but you will get the hang of it pretty quickly.
super toll auch für etwas ältere spieler (45 jahre )
cute little game for young and old

I am not a very good player , so I am looking for games that I can have fun, and this game is one of them!!
Easy to play or Hard if you want
This is a fun game that anyone could like.
It's a fun game to play. It's relaxing and is easy to pick up.
Amazing game. Easy to play
This game has a steep learning curve and is probably not a good choice for younger gamers but for those willing to endure the first few hours it is great fun.
Easy to play, lots of fun
Cute little game, easy to play even with limited mobili