# Amazon Product Reviews Semantic EDA

CS329E Data Project
Vincent Nguyen Vnn332

## Descriptive Data

This data was drawn from multiple sources, ranging from research labs to Kaggle. I cleaned and transformed it into logical entities and stored the result in Google BigQuery. Throughout this notebook, SQL queries pull the data needed to run the ML models.

See the repository for the ERD and the source data.

The tables I use most are Product_Reviews, Users, User_Info, Membership_Info, and Product_Image_data. There are 3,777 users, more than half of whom are Prime members. Each user has roughly 1 to 5 reviews.

## Task/Objective

I will use machine learning techniques to answer business-case questions.

Research questions:
1) How do Prime memberships affect review sentiment?
2) How do user demographics affect review sentiment?


Use Case:
    Stakeholder: Customer Retention Team
    Why it matters: If Amazon can identify behavior or review patterns linked to longer memberships, it can better predict churn, tailor loyalty strategies, or offer targeted perks to high-retention users.


In [None]:
%load_ext google.cloud.bigquery

project_id = "amazon-product-reviews-452322"
dataset = "product_data_int"
region = "us-central1"
connection_id = "vertex-ai-connection"
embedding_model_endpoint = "text-embedding-005"

from google.cloud import bigquery

bq_client = bigquery.Client(project=project_id)

dataset_ref = bigquery.Dataset(f"{project_id}.{dataset}")
dataset_ref.location = region

try:
    created_dataset = bq_client.create_dataset(dataset_ref, exists_ok=True)
    print(f"Dataset {created_dataset.project}.{created_dataset.dataset_id} ensured to exist in {created_dataset.location}")
except Exception as e:
    print(f"Error ensuring dataset exists: {e}")

!bq show --connection --project_id=amazon-product-reviews-452322 --location=us-central1 vertex-ai-connection

!bq mk --connection --location=us-central1 --project_id=amazon-product-reviews-452322 \
    --connection_type=CLOUD_RESOURCE vertex-ai-connection

!gcloud projects add-iam-policy-binding amazon-product-reviews-452322 \
  --member='serviceAccount:bqcx-668151151224-smf0@gcp-sa-bigquery-condel.iam.gserviceaccount.com' \
  --role='roles/aiplatform.user' \
  --no-user-output-enabled




Dataset amazon-product-reviews-452322.product_data_int ensured to exist in us-central1
Connection amazon-product-reviews-452322.us-central1.vertex-ai-connection

                      name                        friendlyName   description    Last modified         type        hasCredential                                            properties                                            
 ----------------------------------------------- -------------- ------------- ----------------- ---------------- --------------- ----------------------------------------------------------------------------------------------- 
  668151151224.us-central1.vertex-ai-connection                                21 Apr 22:01:23   CLOUD_RESOURCE   False           {"serviceAccountId": "bqcx-668151151224-smf0@gcp-sa-bigquery-condel.iam.gserviceaccount.com"}  

BigQuery error in mk operation: Already Exists: Connection projects/668151151224/locations/us-
central1/connections/vertex-ai-connection


## Copy tables from product_data_int to fin_product_data ##

In [None]:
%%bigquery --project amazon-product-reviews-452322

CREATE SCHEMA IF NOT EXISTS `amazon-product-reviews-452322.fin_product_data`
OPTIONS(
  location="us-central1"
);

CREATE OR REPLACE TABLE `amazon-product-reviews-452322.fin_product_data.Product_Reviews` AS
SELECT * FROM `amazon-product-reviews-452322.product_data_int.Product_Reviews`;

CREATE OR REPLACE MODEL `amazon-product-reviews-452322.fin_product_data.embedding_model_ref`
  REMOTE WITH CONNECTION `projects/amazon-product-reviews-452322/locations/us-central1/connections/vertex-ai-connection`
  OPTIONS (endpoint = 'text-embedding-005');

CREATE OR REPLACE TABLE `amazon-product-reviews-452322.fin_product_data.User_Reviews` AS
SELECT * FROM `amazon-product-reviews-452322.product_data_int.User_Reviews`;

CREATE OR REPLACE TABLE `amazon-product-reviews-452322.fin_product_data.User` AS
SELECT * FROM `amazon-product-reviews-452322.product_data_int.User`;

CREATE OR REPLACE TABLE `amazon-product-reviews-452322.fin_product_data.Membership_Info` AS
SELECT * FROM `amazon-product-reviews-452322.product_data_int.Membership_Info`;

CREATE OR REPLACE TABLE `amazon-product-reviews-452322.fin_product_data.User_Info` AS
SELECT * FROM `amazon-product-reviews-452322.product_data_int.User_Info`;

### Generate and Store Embeddings in the 'final' dataset ###
Reads from Product_Reviews in the *final* dataset.

Writes results to review_embeddings in the *final* dataset.

In [None]:
%%bigquery --project amazon-product-reviews-452322

CREATE OR REPLACE TABLE `amazon-product-reviews-452322.fin_product_data.review_embeddings` AS
SELECT
  review_id,
  ml_generate_embedding_result AS embedding
FROM
  ML.GENERATE_EMBEDDING(
    MODEL `amazon-product-reviews-452322.fin_product_data.embedding_model_ref`,
    (
      SELECT
        review_id,
        text AS content
      FROM
        `amazon-product-reviews-452322.fin_product_data.Product_Reviews`
      WHERE
        text IS NOT NULL AND text != ''
    ),
    STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_DOCUMENT' AS task_type)
  );

ALTER TABLE `amazon-product-reviews-452322.fin_product_data.review_embeddings`
ADD PRIMARY KEY (review_id) NOT ENFORCED;



## Create the NL Connection ##

In [None]:
!bq mk --connection --location=US --project_id=amazon-product-reviews-452322 \
--connection_type=CLOUD_RESOURCE cloud-nl-connection


!bq show --connection --project_id=amazon-product-reviews-452322 --location=US cloud-nl-connection

!gcloud projects add-iam-policy-binding amazon-product-reviews-452322 \
--member='serviceAccount:bqcx-668151151224-qb9f@gcp-sa-bigquery-condel.iam.gserviceaccount.com' \
--role='roles/serviceusage.serviceUsageConsumer' \
--no-user-output-enabled


BigQuery error in mk operation: Already Exists: Connection
projects/668151151224/locations/us/connections/cloud-nl-connection
Connection amazon-product-reviews-452322.US.cloud-nl-connection

                 name                   friendlyName   description    Last modified         type        hasCredential                                            properties                                            
 ------------------------------------- -------------- ------------- ----------------- ---------------- --------------- ----------------------------------------------------------------------------------------------- 
  668151151224.us.cloud-nl-connection                                23 Apr 23:53:47   CLOUD_RESOURCE   False           {"serviceAccountId": "bqcx-668151151224-qb9f@gcp-sa-bigquery-condel.iam.gserviceaccount.com"}  



## Sentiment Analysis with a Hugging Face DistilBERT Model ##


In [None]:
!pip install google-cloud-bigquery pandas pandas-gbq db-dtypes tqdm transformers torch sentencepiece -q

In [None]:
import os
import time
from google.cloud import bigquery
import pandas as pd
import pandas_gbq
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm.notebook import tqdm
import math

try:
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch
    print("Transformers and PyTorch loaded.")
except ImportError:
    print("Please install transformers and torch: pip install transformers torch sentencepiece")
    raise

PROJECT_ID = "amazon-product-reviews-452322"
BQ_LOCATION = "us-central1"
BQ_DATASET_ID = "fin_product_data"
SOURCE_TABLE_ID = f"{PROJECT_ID}.{BQ_DATASET_ID}.Product_Reviews"
DESTINATION_TABLE_ID = f"{BQ_DATASET_ID}.review_sentiments_distilbert"

CHUNK_SIZE_BQ = 5000
HF_BATCH_SIZE = 100
MAX_WORKERS = 8

SENTIMENT_MODEL_NAME = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"

# Client init for final
try:
    bq_client = bigquery.Client(project=PROJECT_ID)

    # Force CPU usage
    device = torch.device("cpu")
    print(f"Using device: {device}")

    sentiment_tokenizer = AutoTokenizer.from_pretrained(SENTIMENT_MODEL_NAME)
    sentiment_model = AutoModelForSequenceClassification.from_pretrained(SENTIMENT_MODEL_NAME)
    sentiment_model.to(device)
    sentiment_model.eval()
    print(f"Sentiment model '{SENTIMENT_MODEL_NAME}' loaded onto {device}.")

except Exception as e:
    print(f"Error initializing clients or loading model: {e}")
    raise

def analyze_sentiment_hf_batch(texts, review_ids):
    results = []
    label_map = {0: 'NEGATIVE', 1: 'POSITIVE'}

    valid_texts = []
    valid_review_ids = []
    invalid_results = []

    # Split inputs into usable texts and rows that cannot be scored
    for text, review_id in zip(texts, review_ids):
        if isinstance(text, str) and text.strip():
            valid_texts.append(text)
            valid_review_ids.append(review_id)
        else:
            invalid_results.append({
                 'review_id': review_id,
                 'hf_sentiment_label': 'INVALID_INPUT',
                 'hf_sentiment_score': None
            })

    if not valid_texts:
        return invalid_results

    # Run the DistilBERT model on the whole batch of valid texts
    try:
        inputs = sentiment_tokenizer(
            valid_texts,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        ).to(device)

        with torch.no_grad():
            outputs = sentiment_model(**inputs)
            logits = outputs.logits
            probabilities = torch.softmax(logits, dim=-1)
            predictions = torch.argmax(probabilities, dim=-1)

        # Process/format result
        for i in range(len(valid_texts)):
            predicted_label_id = predictions[i].item()
            score = probabilities[i][predicted_label_id].item()
            label = label_map.get(predicted_label_id, 'UNKNOWN')

            results.append({
                'review_id': valid_review_ids[i],
                'hf_sentiment_label': label,
                'hf_sentiment_score': score
            })

    except Exception as e:
        print(f"Error during batch sentiment analysis: {e}")
        for review_id in valid_review_ids:
             results.append({
                 'review_id': review_id,
                 'hf_sentiment_label': 'ANALYSIS_FAILED',
                 'hf_sentiment_score': None
             })

    results.extend(invalid_results)
    return results

# Chunk and calculate remaining rows
count_query = f"SELECT COUNT(review_id) as total_rows FROM `{SOURCE_TABLE_ID}` WHERE text IS NOT NULL AND text != ''"
try:
    total_rows = bq_client.query(count_query, location=BQ_LOCATION).to_dataframe()['total_rows'][0]
    print(f"Total valid reviews to process: {total_rows}")
except Exception as e:
    print(f"Error getting row count: {e}")
    raise

num_chunks = math.ceil(total_rows / CHUNK_SIZE_BQ)
print(f"Processing in {num_chunks} chunks of up to {CHUNK_SIZE_BQ} rows each.")

table_schema = [
    {'name': 'review_id', 'type': 'STRING'},
    {'name': 'hf_sentiment_label', 'type': 'STRING'},
    {'name': 'hf_sentiment_score', 'type': 'FLOAT'},
]

processed_count = 0
start_offset = 0

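# Resume support: if the destination table already exists, skip the rows it holds.
# Caveat: rows dropped as ANALYSIS_FAILED/INVALID_INPUT are never stored, so
# num_rows can be smaller than the number of source rows already read.
try:
    dest_table_ref = f"{PROJECT_ID}.{DESTINATION_TABLE_ID}"
    table = bq_client.get_table(dest_table_ref)
    start_offset = table.num_rows
    print(f"Resuming: Found {start_offset} rows already processed in destination table.")
    num_chunks = math.ceil(max(total_rows - start_offset, 0) / CHUNK_SIZE_BQ)
    print(f"Adjusted: Processing remaining {total_rows - start_offset} rows in {num_chunks} chunks.")
except Exception:
    # Destination table missing or unreadable: process everything.
    start_offset = 0
    print("Starting from the beginning.")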

# Loop through the calculated number of chunks
for i in range(num_chunks):
    current_offset = start_offset + (i * CHUNK_SIZE_BQ)
    print(f"\n--- Processing Chunk {i+1}/{num_chunks} (Offset: {current_offset}, Limit: {CHUNK_SIZE_BQ}) ---")

    query = f"""
        SELECT
            review_id,
            text
        FROM `{SOURCE_TABLE_ID}`
        WHERE text IS NOT NULL AND text != ''
        ORDER BY review_id
        LIMIT {CHUNK_SIZE_BQ} OFFSET {current_offset}
    """
    try:
        print("Reading chunk from BigQuery...")
        chunk_df = bq_client.query(query, location=BQ_LOCATION).to_dataframe()
        print(f"Read {len(chunk_df)} reviews for this chunk.")
        if chunk_df.empty:
            print("No more rows to process in this chunk.")
            continue
    except Exception as e:
        print(f"Error reading chunk {i+1} from BigQuery: {e}")
        continue

    # Process the batches
    all_chunk_results = []
    hf_batches = [chunk_df[j:j + HF_BATCH_SIZE] for j in range(0, chunk_df.shape[0], HF_BATCH_SIZE)]
    print(f"Analyzing chunk in {len(hf_batches)} HF batches of size up to {HF_BATCH_SIZE}...")

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        futures = [executor.submit(analyze_sentiment_hf_batch, batch['text'].tolist(), batch['review_id'].tolist()) for batch in hf_batches]
        for future in tqdm(as_completed(futures), total=len(futures), desc=f"Analyzing Chunk {i+1}"):
            try:
                batch_results = future.result()
                all_chunk_results.extend(batch_results)
            except Exception as e:
                print(f"Error processing an HF batch result within chunk {i+1}: {e}")

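    # Turn into df (explicit columns so an empty result list still has the expected schema)
    chunk_results_df = pd.DataFrame(
        all_chunk_results,
        columns=['review_id', 'hf_sentiment_label', 'hf_sentiment_score']
    )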
    chunk_results_df = chunk_results_df[~chunk_results_df['hf_sentiment_label'].isin(['ANALYSIS_FAILED', 'INVALID_INPUT'])]
    chunk_results_df['hf_sentiment_score'] = pd.to_numeric(chunk_results_df['hf_sentiment_score'], errors='coerce')
    chunk_results_df = chunk_results_df.dropna(subset=['hf_sentiment_score'])

    # Load to BQ
    if not chunk_results_df.empty:
        print(f"Loading {len(chunk_results_df)} results for chunk {i+1} to BigQuery table: {PROJECT_ID}.{DESTINATION_TABLE_ID}...")
        try:
            pandas_gbq.to_gbq(
                chunk_results_df,
                destination_table=DESTINATION_TABLE_ID,
                project_id=PROJECT_ID,
                location=BQ_LOCATION,
                if_exists='append',
                table_schema=table_schema
            )
            processed_count += len(chunk_results_df)
            print(f"Successfully loaded chunk {i+1}. Total processed so far: {processed_count}")
        except Exception as e:
            print(f"Error loading chunk {i+1} data to BigQuery: {e}")
    else:
        print(f"No valid results generated for chunk {i+1}.")

    time.sleep(2)

Transformers and PyTorch loaded.
Using Sentiment Model: distilbert/distilbert-base-uncased-finetuned-sst-2-english
BigQuery Client Initialized.
Using device: cpu
Sentiment model 'distilbert/distilbert-base-uncased-finetuned-sst-2-english' loaded onto cpu.
Total valid reviews to process: 382538
Processing in 77 chunks of up to 5000 rows each.
Starting from the beginning.

--- Processing Chunk 1/77 (Offset: 0, Limit: 5000) ---
Reading chunk from BigQuery...
Read 5000 reviews for this chunk.
Analyzing chunk in 50 HF batches of size up to 100...


Loading 4995 results for chunk 1 to BigQuery table: amazon-product-reviews-452322.fin_product_data.review_sentiments_distilbert...


Successfully loaded chunk 1. Total processed so far: 4995

--- Processing Chunk 2/77 (Offset: 5000, Limit: 5000) ---
Reading chunk from BigQuery...
Read 5000 reviews for this chunk.
Analyzing chunk in 50 HF batches of size up to 100...


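
The run log above reports 382,538 valid reviews split into 77 chunks, and chunk 1 stored 4,995 of the 5,000 rows it read because failed or invalid rows are dropped before upload. The sketch below reproduces that chunk arithmetic and shows why resuming from the destination table's row count can start a few rows early. `plan_chunks` is a hypothetical helper written for illustration, not part of the notebook.

```python
import math

def plan_chunks(total_rows, chunk_size, start_offset=0):
    """Return the (offset, limit) pairs a LIMIT/OFFSET loop would issue."""
    remaining = max(total_rows - start_offset, 0)
    num_chunks = math.ceil(remaining / chunk_size)
    return [(start_offset + i * chunk_size, chunk_size) for i in range(num_chunks)]

# Figures taken from the run log above.
chunks = plan_chunks(total_rows=382_538, chunk_size=5_000)
print(len(chunks), chunks[0], chunks[-1])

# Resume caveat: chunk 1 read 5000 source rows but stored only 4995 results.
rows_read, rows_stored = 5_000, 4_995
resumed = plan_chunks(382_538, 5_000, start_offset=rows_stored)
print(resumed[0][0], "vs", rows_read)  # resume offset vs. rows actually consumed
```

One way to avoid the drift, under the assumption that `review_id` is unique, would be to track the offset by rows read rather than rows stored, or to exclude `review_id` values already present in the destination table.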

## Create Final Enriched View in the 'final' dataset ##


Joins the tables *within* the 'fin_product_data' dataset.

In [None]:
%%bigquery --project amazon-product-reviews-452322
CREATE OR REPLACE TABLE `amazon-product-reviews-452322.fin_product_data.reviews_enriched_bqml` AS
SELECT
  r.user_id,
  r.review_id,
  r.rating,
  r.title,
  r.text,
  r.images,
  r.asin,
  r.parent_asin,
  r.review_date,
  r.helpful_vote,
  r.verified_purchase,
  r.details,
  r.videos,
  r._data_source,
  r._load_time,
  r.row_num,
  e.embedding,
  s.hf_sentiment_label,
  s.hf_sentiment_score
FROM
  `amazon-product-reviews-452322.fin_product_data.Product_Reviews` AS r
LEFT JOIN
  `amazon-product-reviews-452322.fin_product_data.review_embeddings` AS e ON r.review_id = e.review_id
LEFT JOIN
  `amazon-product-reviews-452322.fin_product_data.review_sentiments_distilbert` AS s ON r.review_id = s.review_id
WHERE
  r.text IS NOT NULL AND r.text != '';

## Model 1: Effect of Membership (Prime) on Sentiment Score ##
-- Goal: Quantify the impact of membership on hf_sentiment_score, controlling for embedding.
-- Target: hf_sentiment_score (numerical)
-- Covariates: embedding, is_member
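
As computed in `analyze_sentiment_hf_batch`, `hf_sentiment_score` is the softmax probability of the predicted class. With two classes it therefore lies between 0.5 and 1 whether the label is POSITIVE or NEGATIVE, so it measures confidence rather than polarity. The pure-Python sketch below illustrates this with made-up logits; `signed_score` is a hypothetical alternative target that the notebook does not use.

```python
import math

LABELS = {0: "NEGATIVE", 1: "POSITIVE"}

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_and_score(logits):
    """Mirror the notebook: argmax label plus the probability of that label."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[idx], probs[idx]

def signed_score(logits):
    """Hypothetical alternative: P(POSITIVE) - P(NEGATIVE), in [-1, 1]."""
    probs = softmax(logits)
    return probs[1] - probs[0]

# Made-up logits: strongly negative, strongly positive, and nearly undecided.
for logits in ([3.0, -2.0], [-2.0, 3.0], [0.1, 0.0]):
    label, score = label_and_score(logits)
    print(label, round(score, 3), round(signed_score(logits), 3))
```

A strongly negative and a strongly positive review receive the same `hf_sentiment_score` here, which is worth keeping in mind when reading the regressions on this target.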




In [None]:
%%bigquery --project amazon-product-reviews-452322

CREATE OR REPLACE MODEL `amazon-product-reviews-452322.fin_product_data.sentiment_score_vs_membership_reg`
OPTIONS(
  model_type='LINEAR_REG',
  input_label_cols=['hf_sentiment_score']
) AS
SELECT
  r.hf_sentiment_score,
  r.embedding,
  IF(m.user_id IS NOT NULL, 1, 0) AS is_member
FROM
  `amazon-product-reviews-452322.fin_product_data.reviews_enriched_bqml` AS r
LEFT JOIN
  `amazon-product-reviews-452322.fin_product_data.Membership_Info` AS m ON r.user_id = CAST(m.user_id AS STRING)
WHERE
  r.hf_sentiment_score IS NOT NULL
  AND r.embedding IS NOT NULL;


In [23]:
%%bigquery --project amazon-product-reviews-452322

SELECT * FROM ML.EVALUATE(MODEL `amazon-product-reviews-452322.fin_product_data.sentiment_score_vs_membership_reg`);

| mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance |
|---|---|---|---|---|---|
| 0.028656 | 0.004 | 0.001218 | 0.014802 | 0.084959 | 0.085155 |


In [None]:
%%bigquery --project amazon-product-reviews-452322
SELECT * FROM ML.WEIGHTS(MODEL `amazon-product-reviews-452322.fin_product_data.sentiment_score_vs_membership_reg`) ORDER BY weight DESC;


| processed_input | weight | category_weights |
|---|---|---|
| `__INTERCEPT__` | 0.97491 | [] |
| is_member | -0.000148 | [] |
| embedding | | [{'category': '272', 'weight': 0.0607023300041... |
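
The weights above can be read as a linear formula: the prediction is the intercept, plus the `is_member` weight times the 0/1 flag, plus a contribution from the embedding. The sketch below uses the intercept and `is_member` weight from the table. It assumes that each `category` entry under `embedding` is the weight for one embedding dimension; the embedding values and weights in the example are placeholders, not values from the model.

```python
INTERCEPT = 0.97491       # from the ML.WEIGHTS output above
W_IS_MEMBER = -0.000148   # from the ML.WEIGHTS output above

def predict(is_member, embedding, embedding_weights):
    """Linear prediction: intercept + membership term + embedding term."""
    emb_term = sum(w * x for w, x in zip(embedding_weights, embedding))
    return INTERCEPT + W_IS_MEMBER * is_member + emb_term

# Placeholder embedding and weights (assumption, for illustration only).
emb = [0.01, -0.02, 0.03]
emb_w = [0.06, 0.0, -0.01]

member = predict(1, emb, emb_w)
non_member = predict(0, emb, emb_w)
print(round(member - non_member, 6))  # difference attributable to is_member
```

Holding the review text fixed, the fitted model predicts a score about 0.000148 lower for members, which is negligible next to the intercept of roughly 0.975.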


## Model 2: Effect of Demographics on Sentiment Score ##
-- Goal: Quantify the impact of user age on hf_sentiment_score, controlling for embedding.
-- Target: hf_sentiment_score (numerical)
-- Covariates: embedding, user_age

In [None]:
%%bigquery --project amazon-product-reviews-452322

CREATE OR REPLACE MODEL `amazon-product-reviews-452322.fin_product_data.sentiment_score_vs_age_reg`
OPTIONS(
  model_type='LINEAR_REG',
  input_label_cols=['hf_sentiment_score']
) AS
WITH ParsedDates AS (
  SELECT
    user_id,
    SAFE.PARSE_DATE('%m/%d/%y', date_of_birth) AS dob_parsed
  FROM
    `amazon-product-reviews-452322.fin_product_data.User_Info`
)
SELECT
  r.hf_sentiment_score,
  r.embedding,
  DATE_DIFF(CURRENT_DATE(), pd.dob_parsed, YEAR) AS user_age
FROM
  `amazon-product-reviews-452322.fin_product_data.reviews_enriched_bqml` AS r
INNER JOIN
  ParsedDates AS pd ON r.user_id = CAST(pd.user_id AS STRING)
WHERE
  r.hf_sentiment_score IS NOT NULL
  AND r.embedding IS NOT NULL
  AND pd.dob_parsed IS NOT NULL;


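
The age feature depends on two details of the query: the `%m/%d/%y` format uses a two-digit year, and `DATE_DIFF(..., YEAR)` counts calendar-year boundaries rather than completed years. Python's `strptime` follows the same two-digit-year convention that BigQuery documents for `%y` (00 to 68 map to the 2000s, 69 to 99 to the 1900s), so it can illustrate both effects. The dates below are made up; whether User_Info contains such birth dates is not known from this notebook.

```python
from datetime import date, datetime

def parse_dob(text):
    """Parse an MM/DD/YY string using the two-digit-year pivot."""
    return datetime.strptime(text, "%m/%d/%y").date()

def year_boundary_diff(today, dob):
    """Count calendar-year boundaries, as DATE_DIFF(today, dob, YEAR) does."""
    return today.year - dob.year

def exact_age(today, dob):
    """Completed years, accounting for whether the birthday has passed."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)

today = date(2025, 4, 24)  # fixed reference date so the example is reproducible
for raw in ("12/31/85", "06/15/55"):
    dob = parse_dob(raw)
    print(raw, dob, year_boundary_diff(today, dob), exact_age(today, dob))
```

A birth year written as `55` parses as 2055 and produces a negative age, so a range check on `user_age` before training may be worthwhile.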
In [24]:
%%bigquery --project amazon-product-reviews-452322

SELECT * FROM ML.EVALUATE(MODEL `amazon-product-reviews-452322.fin_product_data.sentiment_score_vs_age_reg`);


| mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance |
|---|---|---|---|---|---|
| 7.181317 | 51.997176 | 2.345867 | 7.213747 | -11806.562487 | -95.705591 |
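
An `r2_score` far below zero means the model does much worse than always predicting the mean of the target. Since `hf_sentiment_score` is a probability and cannot exceed 1, a mean absolute error above 7 implies predictions far outside the target's range. The sketch below computes R² from its definition on made-up numbers to show how this happens; it does not diagnose the cause in this model.

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed from the definition."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up targets clustered near 1, like a confidence score.
y_true = [0.95, 0.99, 0.97, 0.98]

mean_baseline = [sum(y_true) / len(y_true)] * len(y_true)
far_off = [t + 7.0 for t in y_true]  # predictions shifted by about 7

print(r2_score(y_true, mean_baseline))  # baseline: R^2 of zero
print(r2_score(y_true, far_off))        # large negative R^2
```

Because the target has very little variance, even a modest absolute error turns into a very large negative R².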


In [None]:
%%bigquery --project amazon-product-reviews-452322

SELECT * FROM ML.WEIGHTS(MODEL `amazon-product-reviews-452322.fin_product_data.sentiment_score_vs_age_reg`) ORDER BY weight DESC;


| processed_input | weight | category_weights |
|---|---|---|
| `__INTERCEPT__` | 0.02452795 | [] |
| user_age | 5.427754e-08 | [] |
| embedding | | [{'category': '40', 'weight': 0.47726998715254... |







## Model 3: Combined Effect of Membership and User Info on Sentiment Score ##


-- Goal: Quantify the impact of multiple factors simultaneously.
-- Target: hf_sentiment_score (numerical)
-- Covariates: embedding, is_member, usage_frequency, engagement_metrics, feedback_ratings, devices_used, customer_support_interactions, subscription_plan, payment_information, renewal_status


In [None]:
%%bigquery --project amazon-product-reviews-452322

CREATE OR REPLACE MODEL `amazon-product-reviews-452322.fin_product_data.sentiment_score_vs_member_userinfo_reg`
OPTIONS(
  model_type='LINEAR_REG',
  input_label_cols=['hf_sentiment_score']
) AS
SELECT
  r.hf_sentiment_score,
  r.embedding,
  IF(m.user_id IS NOT NULL, 1, 0) AS is_member,
  ui.usage_frequency,
  ui.engagement_metrics,
  ui.feedback_ratings,
  ui.devices_used,
  ui.customer_support_interactions,
  m.subscription_plan,
  m.payment_information,
  m.renewal_status
FROM
  `amazon-product-reviews-452322.fin_product_data.reviews_enriched_bqml` AS r
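-- NOTE: this inner JOIN keeps only reviews whose user has a Membership_Info row,
-- so is_member is 1 for every training row; a LEFT JOIN would retain non-members.
JOIN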
  `amazon-product-reviews-452322.fin_product_data.Membership_Info` AS m ON r.user_id = CAST(m.user_id AS STRING)
JOIN
  `amazon-product-reviews-452322.fin_product_data.User_Info` AS ui ON r.user_id = CAST(ui.user_id AS STRING)
WHERE
  r.hf_sentiment_score IS NOT NULL
  AND r.embedding IS NOT NULL
  AND ui.usage_frequency IS NOT NULL
  AND ui.engagement_metrics IS NOT NULL
  AND ui.feedback_ratings IS NOT NULL;


In [25]:
%%bigquery --project amazon-product-reviews-452322

SELECT * FROM ML.EVALUATE(MODEL `amazon-product-reviews-452322.fin_product_data.sentiment_score_vs_member_userinfo_reg`);


| mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance |
|---|---|---|---|---|---|
| 7.371221 | 54.77471 | 2.411182 | 7.422369 | -11922.532323 | -94.739835 |


In [None]:
%%bigquery --project amazon-product-reviews-452322

SELECT * FROM ML.WEIGHTS(MODEL `amazon-product-reviews-452322.fin_product_data.sentiment_score_vs_member_userinfo_reg`) ORDER BY ABS(weight) DESC; -- Order by absolute weight to see strongest effects


| processed_input | weight | category_weights |
|---|---|---|
| `__INTERCEPT__` | 0.024538 | [] |
| is_member | 0.0 | [] |
| embedding | | [{'category': '3', 'weight': -0.52269007658332... |
| usage_frequency | | [{'category': 'Occasional', 'weight': 0.024555... |
| engagement_metrics | | [{'category': 'Low', 'weight': 0.0245424256417... |
| feedback_ratings | | [{'category': '4.4', 'weight': 0.0244765418876... |
| devices_used | | [{'category': 'Tablet', 'weight': 0.0245218449... |
| customer_support_interactions | | [{'category': '10', 'weight': 0.02449665663724... |
| subscription_plan | | [{'category': 'Monthly', 'weight': 0.024513653... |
| payment_information | | [{'category': 'Amex', 'weight': 0.024485663565... |
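
The `is_member` weight of 0.0 is consistent with how this model's training query joins its tables: an inner join to Membership_Info drops every review without a membership row, so the flag is 1 on all remaining rows and carries no information. The pure-Python sketch below mimics the two join types on made-up records to show the difference.

```python
# Made-up records for illustration only.
reviews = [
    {"review_id": "r1", "user_id": "u1"},
    {"review_id": "r2", "user_id": "u2"},
    {"review_id": "r3", "user_id": "u3"},
]
members = {"u1": "Monthly", "u3": "Annual"}  # user_id -> subscription_plan

def join_reviews(how):
    """Join reviews to members; 'inner' drops unmatched rows, 'left' keeps them."""
    rows = []
    for review in reviews:
        plan = members.get(review["user_id"])
        if plan is None and how == "inner":
            continue
        rows.append({
            **review,
            "is_member": 1 if plan is not None else 0,
            "subscription_plan": plan,
        })
    return rows

inner = join_reviews("inner")
left = join_reviews("left")
print({row["is_member"] for row in inner})  # only one distinct value
print({row["is_member"] for row in left})   # both 0 and 1 appear
```

With a left join, non-members keep `is_member = 0` but have no value for the membership columns, so the model would need to handle those missing values.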


In [None]:
%%bigquery --project amazon-product-reviews-452322

CREATE OR REPLACE MODEL `amazon-product-reviews-452322.fin_product_data.sentiment_score_vs_member_userinfo_boosted_tree`
OPTIONS(
  model_type='BOOSTED_TREE_REGRESSOR',
  input_label_cols=['hf_sentiment_score']
) AS
SELECT
  r.hf_sentiment_score,
  r.embedding,
  IF(m.user_id IS NOT NULL, 1, 0) AS is_member,
  ui.usage_frequency,
  ui.engagement_metrics,
  ui.feedback_ratings,
  ui.devices_used,
  ui.customer_support_interactions,
  m.subscription_plan,
  m.payment_information,
  m.renewal_status

FROM
  `amazon-product-reviews-452322.fin_product_data.reviews_enriched_bqml` AS r
LEFT JOIN
  `amazon-product-reviews-452322.fin_product_data.Membership_Info` AS m ON r.user_id = CAST(m.user_id AS STRING)
INNER JOIN
  `amazon-product-reviews-452322.fin_product_data.User_Info` AS ui ON r.user_id = CAST(ui.user_id AS STRING)
WHERE
  r.hf_sentiment_score IS NOT NULL
  AND r.embedding IS NOT NULL
  AND ui.usage_frequency IS NOT NULL
  AND ui.engagement_metrics IS NOT NULL
  AND ui.feedback_ratings IS NOT NULL;

In [None]:
%%bigquery --project amazon-product-reviews-452322


SELECT * FROM ML.EVALUATE(MODEL `amazon-product-reviews-452322.fin_product_data.sentiment_score_vs_member_userinfo_boosted_tree`);


| mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance |
|---|---|---|---|---|---|
| 0.036957 | 0.00467 | 0.001391 | 0.019827 | -0.016635 | -0.00043 |


In [None]:
%%bigquery --project amazon-product-reviews-452322

SELECT * FROM ML.FEATURE_IMPORTANCE(MODEL `amazon-product-reviews-452322.fin_product_data.sentiment_score_vs_member_userinfo_boosted_tree`);


| feature | importance_weight | importance_gain | importance_cover |
|---|---|---|---|
| embedding | 30 | 0.020412 | 4.866667 |
| is_member | 10 | 0.013461 | 6.4 |
| usage_frequency | 1 | 0.077556 | 4.0 |
| engagement_metrics | 8 | 0.018337 | 6.75 |
| feedback_ratings | 3 | 0.014558 | 8.666667 |
| devices_used | 3 | 0.095461 | 292.333333 |
| customer_support_interactions | 1 | 0.00144 | 7.0 |
| subscription_plan | 2 | 0.015349 | 6.0 |
| payment_information | 5 | 0.065816 | 6.4 |
| renewal_status | 0 | 0.0 | 0.0 |
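
The three importance columns can rank features differently. As I understand them (following XGBoost naming, which is an assumption about this output), `importance_weight` counts how often a feature is used in a split, `importance_gain` is the average improvement those splits bring, and `importance_cover` is the average number of rows they affect. The sketch below sorts the values from the table above by two of these columns.

```python
# (feature, importance_weight, importance_gain, importance_cover) from the table above
rows = [
    ("embedding", 30, 0.020412, 4.866667),
    ("is_member", 10, 0.013461, 6.4),
    ("usage_frequency", 1, 0.077556, 4.0),
    ("engagement_metrics", 8, 0.018337, 6.75),
    ("feedback_ratings", 3, 0.014558, 8.666667),
    ("devices_used", 3, 0.095461, 292.333333),
    ("customer_support_interactions", 1, 0.00144, 7.0),
    ("subscription_plan", 2, 0.015349, 6.0),
    ("payment_information", 5, 0.065816, 6.4),
    ("renewal_status", 0, 0.0, 0.0),
]

by_weight = sorted(rows, key=lambda r: r[1], reverse=True)
by_gain = sorted(rows, key=lambda r: r[2], reverse=True)

print("Most splits: ", [name for name, *_ in by_weight[:3]])
print("Highest gain:", [name for name, *_ in by_gain[:3]])
```

The embedding is split on most often, while `devices_used` has the highest average gain. Since this model's `r2_score` is slightly negative, these rankings describe a model that does not beat a mean baseline and should be read with caution.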


In [10]:
%%bigquery --project amazon-product-reviews-452322

CREATE OR REPLACE MODEL `amazon-product-reviews-452322.fin_product_data.rating_vs_membership_reg`
OPTIONS(
  model_type='LINEAR_REG',
  input_label_cols=['rating']
) AS
SELECT
  r.rating,
  r.embedding,
  IF(m.user_id IS NOT NULL, 1, 0) AS is_member

FROM
  `amazon-product-reviews-452322.fin_product_data.reviews_enriched_bqml` AS r
LEFT JOIN
  `amazon-product-reviews-452322.fin_product_data.Membership_Info` AS m ON r.user_id = CAST(m.user_id AS STRING)
WHERE
  r.rating IS NOT NULL
  AND r.embedding IS NOT NULL;

In [9]:
%%bigquery --project amazon-product-reviews-452322

CREATE OR REPLACE MODEL `amazon-product-reviews-452322.fin_product_data.rating_vs_age_reg`
OPTIONS(
  model_type='LINEAR_REG',
  input_label_cols=['rating']
) AS
WITH ParsedDates AS (
  SELECT
    user_id,
    SAFE.PARSE_DATE('%m/%d/%y', date_of_birth) AS dob_parsed
  FROM
    `amazon-product-reviews-452322.fin_product_data.User_Info`
)
SELECT
  r.rating,
  r.embedding,
  DATE_DIFF(CURRENT_DATE(), pd.dob_parsed, YEAR) AS user_age

FROM
  `amazon-product-reviews-452322.fin_product_data.reviews_enriched_bqml` AS r
INNER JOIN
  ParsedDates AS pd ON r.user_id = CAST(pd.user_id AS STRING)
WHERE
  r.rating IS NOT NULL
  AND r.embedding IS NOT NULL
  AND pd.dob_parsed IS NOT NULL;

In [8]:
%%bigquery --project amazon-product-reviews-452322

CREATE OR REPLACE MODEL `amazon-product-reviews-452322.fin_product_data.rating_vs_member_userinfo_reg`
OPTIONS(
  model_type='LINEAR_REG',
  input_label_cols=['rating']
) AS
SELECT
  r.rating,
  r.embedding,
  IF(m.user_id IS NOT NULL, 1, 0) AS is_member,
  ui.usage_frequency,
  ui.engagement_metrics,
  ui.feedback_ratings,
  ui.devices_used,
  ui.customer_support_interactions,
  m.subscription_plan,
  m.payment_information,
  m.renewal_status
FROM
  `amazon-product-reviews-452322.fin_product_data.reviews_enriched_bqml` AS r
LEFT JOIN
  `amazon-product-reviews-452322.fin_product_data.Membership_Info` AS m ON r.user_id = CAST(m.user_id AS STRING)
LEFT JOIN
  `amazon-product-reviews-452322.fin_product_data.User_Info` AS ui ON r.user_id = CAST(ui.user_id AS STRING)
WHERE
  r.rating IS NOT NULL
  AND r.embedding IS NOT NULL
  AND ui.usage_frequency IS NOT NULL
  AND ui.engagement_metrics IS NOT NULL
  AND ui.feedback_ratings IS NOT NULL
  AND ui.devices_used IS NOT NULL
  AND ui.customer_support_interactions IS NOT NULL
  AND m.subscription_plan IS NOT NULL
  AND m.payment_information IS NOT NULL
  AND m.renewal_status IS NOT NULL;

In [22]:
%%bigquery --project amazon-product-reviews-452322

CREATE OR REPLACE MODEL `amazon-product-reviews-452322.fin_product_data.rating_vs_member_userinfo_boosted_tree`
OPTIONS(
  model_type='BOOSTED_TREE_REGRESSOR',
  input_label_cols=['rating']
) AS
SELECT
  r.rating,
  r.embedding,
  IF(m.user_id IS NOT NULL, 1, 0) AS is_member,
  ui.usage_frequency,
  ui.engagement_metrics,
  ui.feedback_ratings,
  ui.devices_used,
  ui.customer_support_interactions,
  m.subscription_plan,
  m.payment_information,
  m.renewal_status
FROM
  `amazon-product-reviews-452322.fin_product_data.reviews_enriched_bqml` AS r
LEFT JOIN
  `amazon-product-reviews-452322.fin_product_data.Membership_Info` AS m ON r.user_id = CAST(m.user_id AS STRING)
LEFT JOIN
  `amazon-product-reviews-452322.fin_product_data.User_Info` AS ui ON r.user_id = CAST(ui.user_id AS STRING)
WHERE
  r.rating IS NOT NULL
  AND r.embedding IS NOT NULL
  AND ui.usage_frequency IS NOT NULL
  AND ui.engagement_metrics IS NOT NULL
  AND ui.feedback_ratings IS NOT NULL
  AND ui.devices_used IS NOT NULL
  AND ui.customer_support_interactions IS NOT NULL
  AND m.subscription_plan IS NOT NULL
  AND m.payment_information IS NOT NULL
  AND m.renewal_status IS NOT NULL;

In [27]:
%%bigquery --project amazon-product-reviews-452322

SELECT *
FROM ML.EVALUATE(MODEL `amazon-product-reviews-452322.fin_product_data.rating_vs_membership_reg`);

| mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance |
|---|---|---|---|---|---|
| 0.40938 | 0.399887 | 0.029128 | 0.237299 | 0.751461 | 0.751462 |
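
The mean squared error is expressed in squared rating units, so its square root (RMSE) is easier to read as a typical error in rating points. The sketch below converts the value from the table above; it assumes ratings are on a 1 to 5 star scale, which the notebook does not state explicitly.

```python
import math

# Values copied from the ML.EVALUATE output above.
mean_squared_error = 0.399887
mean_absolute_error = 0.40938

def rmse_from_mse(mse):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse)

rmse = rmse_from_mse(mean_squared_error)
print(f"RMSE = {rmse:.3f} rating points, MAE = {mean_absolute_error:.3f}")

# Assumption: ratings span 1 to 5, so the full range is 4 points.
RATING_RANGE = 5 - 1
print(f"RMSE as a share of the rating range: {rmse / RATING_RANGE:.1%}")
```

RMSE is larger than MAE here, which suggests that a minority of reviews have comparatively large errors.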


In [28]:
%%bigquery --project amazon-product-reviews-452322

SELECT *
FROM ML.EVALUATE(MODEL `amazon-product-reviews-452322.fin_product_data.rating_vs_age_reg`);

| mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance |
|---|---|---|---|---|---|
| 32.059058 | 1037.161271 | 4.049814 | 32.365476 | -634.333648 | -4.744712 |


In [29]:
%%bigquery --project amazon-product-reviews-452322

SELECT *
FROM ML.EVALUATE(MODEL `amazon-product-reviews-452322.fin_product_data.rating_vs_member_userinfo_reg`);

| mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance |
|---|---|---|---|---|---|
| 33.144485 | 1108.056998 | 4.233611 | 33.378272 | -626.611017 | -4.380913 |


In [30]:
%%bigquery --project amazon-product-reviews-452322

SELECT *
FROM ML.EVALUATE(MODEL `amazon-product-reviews-452322.fin_product_data.rating_vs_member_userinfo_boosted_tree`);

| mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance |
|---|---|---|---|---|---|
| 0.513288 | 0.659826 | 0.048264 | 0.241641 | 0.62627 | 0.627541 |


In [21]:
%%bigquery --project amazon-product-reviews-452322

SELECT *
FROM ML.FEATURE_IMPORTANCE(MODEL `amazon-product-reviews-452322.fin_product_data.rating_vs_member_userinfo_boosted_tree`);

| feature | importance_weight | importance_gain | importance_cover |
|---|---|---|---|
| embedding | 29 | 0.554196 | 4.965517 |
| is_member | 8 | 1.305033 | 24.0 |
| usage_frequency | 7 | 1.968111 | 116.714286 |
| engagement_metrics | 8 | 4.084991 | 9.5 |
| feedback_ratings | 5 | 1.320305 | 9.6 |
| devices_used | 6 | 2.280839 | 11.333333 |
| customer_support_interactions | 5 | 2.929701 | 6.4 |
| subscription_plan | 2 | 2.043001 | 31.5 |
| payment_information | 1 | 0.637268 | 5.0 |
| renewal_status | 3 | 1.019781 | 6.333333 |
