# **Section 1: Preprocessing**
+ This section covers the preprocessing steps required for the rest of this notebook: installing the necessary packages, importing libraries, and initializing the Google BigQuery client.
+ Our main tools for this project are `pandas` and `bigquery` from `google.cloud`.
+ Google Cloud's `bigframes` library is uninstalled because it causes a version conflict in Kaggle's default environment.

In [1]:
# Uninstall bigframes, then install google-cloud-bigquery-storage so BigQuery SQL runs without errors
# -q suppresses verbose output for readability
!pip uninstall -q -y bigframes
!pip install -q google-cloud-bigquery-storage


In [2]:
# Import all libraries required for this project
import pandas as pd

from google.cloud import bigquery
from datetime import datetime, timedelta

## **Define project and dataset ids**
+ Creating a BigQuery client requires a project id; ours is `analog-delight-470708-d0`.
+ We also define the ids of the dataset and tables that were imported from Google Cloud Storage (GCS) buckets into BigQuery. Please refer to our blog for details of the selected datasets.

In [8]:
# Initialize BigQuery client with Google Cloud's project id
project_id = "analog-delight-470708-d0"
client = bigquery.Client(project=project_id)

# We also define dataset and table ids
dataset_id = "steam"
game_list_data = "steam_game_list"
review_data = "steam_reviews"
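
The fully qualified table path recurs in every query below. A minimal sketch of a helper that builds it, assuming the hypothetical name `qualified_table` (it is not part of the notebook or of the BigQuery client). Wrapping the whole path in backticks keeps the hyphenated project id safe to use in SQL.

```python
# Hypothetical helper (not in the original notebook): build a backtick-quoted
# `project.dataset.table` path for use inside BigQuery SQL strings.
def qualified_table(project_id: str, dataset_id: str, table_id: str) -> str:
    for part in (project_id, dataset_id, table_id):
        # A backtick inside a part would break out of the quoted identifier
        if not part or "`" in part:
            raise ValueError(f"invalid identifier part: {part!r}")
    return f"`{project_id}.{dataset_id}.{table_id}`"


game_list_table = qualified_table("analog-delight-470708-d0", "steam", "steam_game_list")
print(game_list_table)
```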

## **Create primary keys for datasets in BigQuery**
+ We cast *App ID* in `steam.steam_game_list` from string to integer and store it in a new column called *app_id*, which is then declared as the table's (not enforced) primary key.
+ This makes it straightforward to join the table with the Steam review data `steam.steam_reviews` in BigQuery.

In [23]:
# Check whether a column exists in the table schema
def check_column_exists(dataset_id, table_id, name):
    # Look up the table by its fully qualified id (client.dataset() is deprecated)
    table_schema = client.get_table(f"{project_id}.{dataset_id}.{table_id}").schema
    for field in table_schema:
        if field.name == name:
            return True
    return False

In [25]:
# Add an integer primary key (app_id) to game_list_data
game_list_data_pk = 'app_id'
exist_app_id = check_column_exists(dataset_id, game_list_data, game_list_data_pk)

if not exist_app_id:
    # Table paths are quoted with backticks because the project id contains hyphens
    query = f"""
    alter table `{project_id}.{dataset_id}.{game_list_data}`
    add column if not exists {game_list_data_pk} integer;
    
    update `{project_id}.{dataset_id}.{game_list_data}`
    set {game_list_data_pk} = cast(`App ID` as integer)
    where true;
    
    alter table `{project_id}.{dataset_id}.{game_list_data}`
    add primary key ({game_list_data_pk}) not enforced;
    """
    result_pk = client.query(query)
    # result() blocks until the whole script has finished
    print(result_pk.result())
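
Before running the update, it can help to confirm that every `App ID` value is castable, because a plain `cast` fails the whole statement on the first non-numeric string. The sketch below only builds the check query. `build_cast_check_sql` is a hypothetical name, and the query relies on BigQuery's `SAFE_CAST`, which returns NULL instead of raising an error.

```python
# Hypothetical helper: count rows whose source column cannot be cast to INT64.
def build_cast_check_sql(table_path: str, source_column: str = "App ID") -> str:
    """Return SQL counting non-null values that SAFE_CAST turns into NULL."""
    return (
        "select count(*) as bad_rows\n"
        f"from `{table_path}`\n"
        f"where `{source_column}` is not null\n"
        f"  and safe_cast(`{source_column}` as int64) is null"
    )


check_sql = build_cast_check_sql("analog-delight-470708-d0.steam.steam_game_list")
print(check_sql)
```

With the notebook's client, `client.query(check_sql).to_dataframe()` would be expected to report zero `bad_rows` before the cast is applied.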

# **Section 2: Generate Embeddings and Create Vector Indices**

In [35]:
# Name of the embedding model stored in the dataset
embedding_steam = "llm_steam"

# Embed a text column and store the vectors in a new array<float64> column
def create_embeddings(embeddings_name, column_name):
    query = f"""
    alter table `{project_id}.{dataset_id}.{game_list_data}`
    add column if not exists {embeddings_name} array<float64>;

    update `{project_id}.{dataset_id}.{game_list_data}` as t
    set t.{embeddings_name} = e.ml_generate_embedding_result
    from (
        select
            ml_generate_embedding_result,
            content
        from ml.generate_embedding(
            model `{project_id}.{dataset_id}.{embedding_steam}`,
            -- Deduplicate the input text so each distinct value is embedded once
            -- and every row matches at most one embedding in the update
            (select distinct ifnull({column_name}, ' ') as content
              from `{project_id}.{dataset_id}.{game_list_data}`
            )
        )
    ) e
    where ifnull(t.{column_name}, ' ') = e.content
    """
    return client.query(query)

exist_desc = check_column_exists(dataset_id, game_list_data, "desc_embeddings")
if not exist_desc:
    result_desc = create_embeddings("desc_embeddings", "`Short Description`")
    print(result_desc.result())
    
exist_tags = check_column_exists(dataset_id, game_list_data, "tags_embeddings")
if not exist_tags:
    result_tags = create_embeddings("tags_embeddings", "tags")
    print(result_tags.result())
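
The update above joins each row to its embedding through the text itself, so identical texts share one vector and NULLs are treated as a single space. A plain-Python sketch of that matching logic, using a stand-in `fake_embed` function (an assumption for illustration; the real vectors come from `ML.GENERATE_EMBEDDING`):

```python
from typing import Callable, Optional


def embed_by_content(
    texts: list[Optional[str]],
    embed: Callable[[str], list[float]],
) -> list[list[float]]:
    """Embed each distinct text once, then map the vectors back onto every row."""
    # Mirror ifnull(column, ' '): missing text becomes a single space
    contents = [t if t is not None else " " for t in texts]
    # One embedding call per distinct content value
    vectors = {content: embed(content) for content in set(contents)}
    # Join rows back to their vector by content, like the WHERE clause does
    return [vectors[content] for content in contents]


calls = []


def fake_embed(text: str) -> list[float]:
    # Stand-in for the embedding model: records the call and returns a toy vector
    calls.append(text)
    return [float(len(text))]


rows = embed_by_content(["horror", None, "horror", "puzzle"], fake_embed)
print(rows, len(calls))
```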

In [None]:
-- Generic SQL template for a vector index (placeholder dataset, table, and index names)
CREATE TABLE my_dataset.my_table(id INT64, embedding ARRAY<FLOAT64>);

CREATE VECTOR INDEX my_index ON my_dataset.my_table(embedding)
OPTIONS (index_type = 'IVF');
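
Applied to this notebook's table, the template above might look like the sketch below. The index names and the `build_vector_index_sql` helper are hypothetical. The statement follows BigQuery's `CREATE VECTOR INDEX` syntax with an `IVF` index and cosine distance, matching the distance type used for search in Section 3.

```python
# Hypothetical helper: build one CREATE VECTOR INDEX statement per embedding column.
def build_vector_index_sql(table_path: str, column: str, index_name: str) -> str:
    """Return a CREATE VECTOR INDEX statement for one embedding column."""
    return (
        f"create vector index if not exists {index_name}\n"
        f"on `{table_path}`({column})\n"
        "options (index_type = 'IVF', distance_type = 'COSINE');"
    )


table_path = "analog-delight-470708-d0.steam.steam_game_list"
index_statements = [
    # Index names such as desc_embeddings_idx are illustrative, not from the notebook
    build_vector_index_sql(table_path, column, f"{column}_idx")
    for column in ("desc_embeddings", "tags_embeddings")
]
print("\n\n".join(index_statements))
```

Each statement could then be submitted with `client.query(statement).result()`.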

# **Section 3: Use Cases of Google BigQuery AI in Product Positioning**

## **Use case 1 - Search for similar Steam games given a user query on game characteristics**

In [33]:
user_input = "What are first-person horror games without zombies?"
number_of_games = 10

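# Vector search over the embeddings created in Section 2
# (searching desc_embeddings rather than tags_embeddings is an assumption)
query = f"""
SELECT *
FROM VECTOR_SEARCH(
   (SELECT * FROM `{project_id}.{dataset_id}.{game_list_data}`
   -- You can pre-filter your query here, e.g. for rows of specific users
   -- WHERE some-clause
   ),
   'desc_embeddings',
   (SELECT ml_generate_embedding_result, content AS query
     FROM ML.GENERATE_EMBEDDING(
         MODEL `{project_id}.{dataset_id}.{embedding_steam}`,
         (SELECT '{user_input}' AS content))
   ),
   query_column_to_search => 'ml_generate_embedding_result',
   top_k => {number_of_games},
   distance_type => 'COSINE')
"""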
# Sanity check: length of the embedding generated for the user query
query2 = f"""
SELECT array_length(ml_generate_embedding_result), content AS query
    FROM ML.GENERATE_EMBEDDING(
         MODEL `{project_id}.{dataset_id}.{embedding_steam}`,
         (SELECT '{user_input}' AS content)
    )
"""
df = client.query(query2).to_dataframe()
print(df)

   f0_                                              query
0  768  What are first-person horror games without zom...
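
Interpolating `user_input` directly into the SQL string breaks as soon as the question contains a single quote. A hedged sketch of the same search with a named query parameter instead: `build_search_sql` and `run_search` are hypothetical helpers, the choice of `desc_embeddings` as the search column is an assumption, and `bigquery.QueryJobConfig` with `bigquery.ScalarQueryParameter` is the client library's mechanism for passing parameters.

```python
# Hypothetical helpers (not in the original notebook).
def build_search_sql(table_path: str, model_path: str, column: str, top_k: int) -> str:
    """Build a VECTOR_SEARCH query that reads the question from @user_input."""
    if not isinstance(top_k, int) or top_k < 1:
        raise ValueError("top_k must be a positive integer")
    return f"""
SELECT *
FROM VECTOR_SEARCH(
   (SELECT * FROM `{table_path}`),
   '{column}',
   (SELECT ml_generate_embedding_result, content AS query
     FROM ML.GENERATE_EMBEDDING(
         MODEL `{model_path}`,
         (SELECT @user_input AS content))
   ),
   query_column_to_search => 'ml_generate_embedding_result',
   top_k => {top_k},
   distance_type => 'COSINE')
"""


def run_search(client, sql: str, user_input: str):
    """Run the search with user_input bound as a STRING query parameter."""
    from google.cloud import bigquery  # imported here so the builder works offline

    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("user_input", "STRING", user_input)
        ]
    )
    return client.query(sql, job_config=job_config).to_dataframe()


search_sql = build_search_sql(
    "analog-delight-470708-d0.steam.steam_game_list",
    "analog-delight-470708-d0.steam.llm_steam",
    "desc_embeddings",
    10,
)
```

With the notebook's client, `run_search(client, search_sql, user_input)` would return the matching games as a DataFrame; the question text never becomes part of the SQL string.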
