# **Section 1: Preprocessing**
+ This section covers the setup required by the rest of the notebook: installing packages, importing libraries, and initializing the Google BigQuery client.
+ The main tools for this project are `pandas` and the `bigquery` module from `google.cloud`.
+ Google Cloud's `bigframes` library is uninstalled because it causes version conflicts in Kaggle's default environment.

In [2]:
# Install google-cloud-bigquery-storage so that BigQuery SQL runs without errors
# The -q flag suppresses verbose output for readability
!pip uninstall -q -y bigframes
!pip install -q google-cloud-bigquery-storage


In [3]:
# Import all libraries required for this project
import pandas as pd

from google.cloud import bigquery
from datetime import datetime, timedelta

## **Define project and dataset ids**
+ Creating a BigQuery client requires a project id; ours is `analog-delight-470708-d0`.
+ We also define the ids of the dataset and tables that were imported into BigQuery from Google Cloud Storage (GCS) buckets. Please refer to our blog for details of the selected datasets.

In [4]:
# Initialize BigQuery client with Google Cloud's project id
project_id = "analog-delight-470708-d0"
client = bigquery.Client(project=project_id)

# Dataset and table ids
dataset_id = "steam"
game_list_data = "steam_game_list"
review_data = "steam_reviews"

# Name of the text embedding model stored in the same dataset
embedding_model_name = "llm_steam"

## **Create primary keys for datasets in BigQuery**
+ We cast *App ID* in `steam.steam_game_list` from string to integer and store the result in a new column called *app_id*.
+ This makes it straightforward to join the table with Steam's review data, `steam.steam_reviews`, in BigQuery.

In [23]:
# Check whether a column exists in the table schema
def check_column_exists(dataset_id, table_id, name):
    # Fetch the table's schema using its fully qualified id
    # (client.dataset() is deprecated in recent google-cloud-bigquery releases)
    table_schema = client.get_table(f"{project_id}.{dataset_id}.{table_id}").schema

    # True if any field in the schema has the requested name
    return any(field.name == name for field in table_schema)

In [25]:
# Create the primary key column for game_list_data
game_list_data_pk = 'app_id'
exist_app_id = check_column_exists(dataset_id, game_list_data, game_list_data_pk)

if not exist_app_id:
    # Table paths are backtick-quoted, as in the later queries, because the project id contains dashes
    query = f"""
    alter table `{project_id}.{dataset_id}.{game_list_data}`
    add column if not exists {game_list_data_pk} integer;
    
    update `{project_id}.{dataset_id}.{game_list_data}`
    set {game_list_data_pk} = cast(`App ID` as integer)
    where true;
    
    alter table `{project_id}.{dataset_id}.{game_list_data}`
    add primary key ({game_list_data_pk}) not enforced;
    """
    result_pk = client.query(query)
    # result() blocks until the script has finished
    print(result_pk.result())
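
Because the key is declared `not enforced`, BigQuery does not verify that *app_id* values are unique or non-null. The following is a minimal sketch of a client-side sanity check, assuming the raw *App ID* strings have already been pulled into a Python list. The function name `find_key_problems` and the sample values are hypothetical and not part of this notebook, and Python's `int()` only approximates BigQuery's `cast`.

```python
from collections import Counter


def find_key_problems(raw_ids):
    """Return (values that are not valid integers, integer ids seen more than once)."""
    invalid, parsed = [], []
    for value in raw_ids:
        try:
            parsed.append(int(str(value).strip()))
        except ValueError:
            # Covers missing values and non-numeric strings
            invalid.append(value)
    counts = Counter(parsed)
    duplicates = sorted(app_id for app_id, n in counts.items() if n > 1)
    return invalid, duplicates


# Hypothetical sample of raw `App ID` strings
sample_ids = ["673950", "787860", "673950", "n/a", None]
invalid, duplicates = find_key_problems(sample_ids)
print("not castable:", invalid)
print("duplicated:", duplicates)
```

If either list is non-empty, the cast or the primary key declaration deserves a closer look before the column is used in joins.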

# **Section 2: Generate Embeddings and Create Vector Indices**

In [35]:
# Add an embedding column to a table and fill it from a text column using ml.generate_embedding
def create_embeddings(embeddings_name, embeddings_model_name, table_name, column_name):
    query = f"""
    alter table `{project_id}.{dataset_id}.{table_name}`
    add column if not exists {embeddings_name} array<float64>;

    update `{project_id}.{dataset_id}.{table_name}` as t
    set t.{embeddings_name} = e.ml_generate_embedding_result
    from (
        select distinct
            ml_generate_embedding_result,
            content
        from ml.generate_embedding(
            model `{project_id}.{dataset_id}.{embeddings_model_name}`,
            (select ifnull({column_name}, ' ') as content
              from `{project_id}.{dataset_id}.{table_name}`
            )
        )
    ) e
    where ifnull(t.{column_name}, ' ') = e.content
    """
    return client.query(query)

# Create text embeddings for the 'Short Description' of each game available on Steam
exist_desc = check_column_exists(dataset_id, game_list_data, "desc_embeddings")
if not exist_desc:
    result_desc = create_embeddings("desc_embeddings", embedding_model_name, game_list_data, "`Short Description`")
    print(result_desc.result())

# Create text embeddings for the 'tags' of each game available on Steam
exist_tags = check_column_exists(dataset_id, game_list_data, "tags_embeddings")
if not exist_tags:
    result_tags = create_embeddings("tags_embeddings", embedding_model_name, game_list_data, "tags")
    print(result_tags.result())
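
The update in `create_embeddings` matches rows to embeddings by their text content, with nulls replaced by a single space, so rows with identical text receive the same vector. The sketch below restates that matching in plain Python to make it explicit. It is an illustration under assumptions: `fake_embed` is a stand-in for the embedding model rather than a real API, `embed_by_content` is a hypothetical helper, and it deduplicates before calling the model, which is a variation on the SQL above rather than a copy of it.

```python
def embed_by_content(texts, embed_fn):
    """Embed each distinct text once and map the result back to every row."""
    # Mirror ifnull(column, ' '): missing text is embedded as a single space
    contents = [text if text is not None else " " for text in texts]
    # Call the embedding function once per distinct content value
    cache = {content: embed_fn(content) for content in dict.fromkeys(contents)}
    return [cache[content] for content in contents]


calls = []


def fake_embed(text):
    """Stand-in embedder: records the call and returns a toy two-number vector."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


rows = ["Farming sim", None, "Farming sim", "Open world"]
vectors = embed_by_content(rows, fake_embed)
print(len(vectors), "rows embedded with", len(calls), "model calls")
```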

# **Section 3: Use Cases of Google BigQuery AI in Product Positioning**

## Use case 3.1 - Search for similar Steam games given a user query on game characteristics

In [33]:
user_input = "I would like to find a multi-person strategic game on farming in an open-world setting."
number_of_games = 100
min_reviews = 1000
embeddings = ["desc_embeddings", "tags_embeddings"]

# Run one vector search per embedding column and keep only the games returned by both.
# The user text is passed as a query parameter so that quotes in it cannot break the SQL.
query = f"""
select a.base.*
from vector_search(
    (select {embeddings[0]}, name, app_id, `short description`, tags, `positive reviews`, `negative reviews` 
    from `{project_id}.{dataset_id}.{game_list_data}`
    where (`positive reviews` > {min_reviews}) or (`negative reviews` > {min_reviews})),
    '{embeddings[0]}',
    (select ml_generate_embedding_result, content as query 
    from ml.generate_embedding(
    model `{project_id}.{dataset_id}.{embedding_model_name}`,
        (select @user_input as content))
    ),
    top_k => {number_of_games},
    distance_type => 'COSINE') as a
inner join 
vector_search(
    (select {embeddings[1]}, name, app_id, `short description`, tags, `positive reviews`, `negative reviews` 
    from `{project_id}.{dataset_id}.{game_list_data}`
    where (`positive reviews` > {min_reviews}) or (`negative reviews` > {min_reviews})),
    '{embeddings[1]}',
    (select ml_generate_embedding_result, content as query 
    from ml.generate_embedding(
    model `{project_id}.{dataset_id}.{embedding_model_name}`,
        (select @user_input as content))
    ),
    top_k => {number_of_games},
    distance_type => 'COSINE') as b
    on a.base.app_id = b.base.app_id
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("user_input", "STRING", user_input)]
)
df = client.query(query, job_config=job_config).to_dataframe()
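
The query above runs two cosine-distance searches, one per embedding column, and the inner join keeps only the games that rank in the top `top_k` of both. The toy sketch below reproduces that logic with hand-written two-dimensional vectors. All names, ids, and numbers in it are illustrative assumptions; real embeddings have many more dimensions.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def top_k_ids(query_vec, vectors_by_id, k):
    """Return the ids of the k vectors closest to query_vec."""
    ranked = sorted(
        vectors_by_id,
        key=lambda i: cosine_distance(query_vec, vectors_by_id[i]),
    )
    return ranked[:k]


# Toy embeddings keyed by a made-up app id
desc_vectors = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0], 4: [0.7, 0.7]}
tags_vectors = {1: [0.8, 0.2], 2: [0.0, 1.0], 3: [0.1, 0.9], 4: [1.0, 0.1]}
query_vec = [1.0, 0.0]

top_desc = top_k_ids(query_vec, desc_vectors, k=2)
top_tags = top_k_ids(query_vec, tags_vectors, k=2)

# Inner join on app id: keep ids present in both result lists
tags_set = set(top_tags)
in_both = [app_id for app_id in top_desc if app_id in tags_set]
print(in_both)
```

Requiring a match on both columns trades recall for precision: a game that is close on tags but not on description is dropped, which is why the result can contain far fewer rows than `top_k`.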

In [40]:
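# Rank the matches by the ratio of positive to negative reviews
df['odd'] = df['positive reviews']/df['negative reviews']
# Drop the first column (the embedding vector) before displaying the result
df_sort = df.iloc[:, 1:].sort_values('odd', ascending=False)
df_sort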

| | name | app_id | short description | tags | positive reviews | negative reviews | odd |
|---|---|---|---|---|---|---|---|
| 1 | Farm Together | 673950 | Grow your own farm all by yourself, or coopera... | Agriculture: 362, Multiplayer: 338, Simulation... | 17160 | 1070 | 16.037383 |
| 8 | Farming Simulator 19 | 787860 | The best-selling franchise takes a giant leap ... | Simulation: 834, Farming Sim: 738, Multiplayer... | 60886 | 3847 | 15.826878 |
| 0 | Farming Simulator 2013 Titanium Edition | 220260 | Animal husbandry, crops, sales… It's up to you... | Simulation: 513, Farming Sim: 509, Multiplayer... | 4529 | 396 | 11.436869 |
| 11 | Garden Paws | 840010 | You have inherited your grandparents farm as t... | Exploration: 184, Sandbox: 178, Agriculture: 1... | 1970 | 184 | 10.706522 |
| 4 | Sun Haven | 1432860 | Build your farm and relationships with townsfo... | Early Access: 362, Farming Sim: 329, Pixel Gra... | 2672 | 251 | 10.645418 |
| 3 | Farming Simulator 22 | 1248130 | Create your farm and let the good times grow! ... | Simulation: 644, Co-op: 628, Farming Sim: 625,... | 33215 | 3797 | 8.747696 |
| 2 | Staxel | 405710 | Grow your farm, meet the villagers, and join y... | Farming Sim: 264, Cute: 248, Character Customi... | 3451 | 860 | 4.012791 |
| 9 | Farm Manager 2021 | 1123830 | Get ready for a logistic challenge in the new ... | Simulation: 79, Strategy: 66, Indie: 55, Agric... | 1257 | 335 | 3.752239 |
| 6 | Farmer's Dynasty | 678900 | Live – Build – Farm: Enjoy a unique mix of far... | Simulation: 200, Farming Sim: 194, Life Sim: 1... | 2424 | 796 | 3.045226 |
| 5 | Pure Farming 2018 | 534370 | Use the latest technology and state-of-the-art... | Farming Sim: 164, Simulation: 155, Open World:... | 1007 | 382 | 2.636126 |
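
The `odd` column is a plain ratio, so a game with no negative reviews does not get a finite value, and the ratio can swing widely when review counts are small. One possible alternative, not used in this notebook, is the share of positive reviews with add-one smoothing. The sketch below needs only the two review counts; both function names are hypothetical.

```python
def review_ratio(positive, negative):
    """Positive-to-negative ratio as in the `odd` column; infinite when negative is 0."""
    return positive / negative if negative else float("inf")


def smoothed_positive_share(positive, negative):
    """Share of positive reviews with one pseudo-review added to each side."""
    return (positive + 1) / (positive + negative + 2)


# Review counts for Farm Together, taken from the table above
print(round(review_ratio(17160, 1070), 6))
print(round(smoothed_positive_share(17160, 1070), 4))
```

The smoothed share always lies strictly between 0 and 1, which makes games with very different review volumes easier to compare on one scale.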


In [65]:
# Get the app ids of the games with the highest and lowest positive-to-negative review ratio
app_id_max = df.loc[df['odd'].idxmax(), 'app_id']
app_id_min = df.loc[df['odd'].idxmin(), 'app_id']

'Grow your own farm all by yourself, or cooperate with your friends in this unique, relaxing farming experience!'

## Use case 3.2 - Search for relevant reviews for the list of games given a user query

## Use case 3.3 - Search