# Using the Vertex AI PaLM API to explain BQML Clustering - Our Very Own Persona Builder

## Here's what we're going to build:

<p align="center">
  <img alt="Conceptual Flow" src="slides/process0.png" width="100%">
</p>

# Core code
Let's define some variables that will be used throughout this notebook.

These are the GCP project ID `PROJECT_ID`, the region `REGION`, the dataset name `DATASET`, and the model name `BQML_MODEL`, which can be any name you prefer.
The dataset needs to exist in the same project as `PROJECT_ID`, and you'll need appropriate access to create and delete resources in it.

## Setup

In [22]:
from typing import Union
import sys

import os
import io
import json
import base64
import requests
import concurrent.futures
import time

import numpy as np
import pandas as pd

import vertexai.preview.language_models
from google.cloud import aiplatform
from google.cloud import documentai
from google.cloud.documentai_v1 import Document
from google.cloud import storage
from google.cloud import bigquery

from IPython.display import display, Markdown, Latex


In [23]:
PROJECT_ID = 'felipe-sandbox-354619'
REGION = 'us-central1'
DATASET = "bqml_demos" 
BQML_MODEL = "ecommerce_customer_segment_cluster5" 
EVAL = BQML_MODEL + "_eval"
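
The SQL cells later in this notebook spell the dataset and model names out by hand. As a minimal sketch, the settings above could instead be combined into a fully qualified model path and substituted into a query string. The helper name `qualified_model` is hypothetical and not part of any Google library; the example values are copied from the settings cell above.

```python
def qualified_model(project_id: str, dataset: str, model: str) -> str:
    """Return a backtick-quoted project.dataset.model path for use in BigQuery SQL."""
    for part in (project_id, dataset, model):
        # Reject empty parts and stray backticks so the quoted path stays well-formed.
        if not part or "`" in part:
            raise ValueError(f"invalid identifier part: {part!r}")
    return f"`{project_id}.{dataset}.{model}`"


# Example values copied from the settings cell above.
model_path = qualified_model(
    "felipe-sandbox-354619", "bqml_demos", "ecommerce_customer_segment_cluster5"
)
evaluate_sql = f"SELECT * FROM ML.EVALUATE(MODEL {model_path})"
print(evaluate_sql)
```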


In [24]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)
aiplatform.init(project = PROJECT_ID, location = REGION)

# bigquery client
bq = bigquery.Client(project = PROJECT_ID)

## Create a K-means model to cluster ecommerce data

### Let's look at the data first

<p align="center">
  <img alt="Conceptual Flow" src="slides/process1.png" width="100%">
</p>


#@bigquery
SELECT
  user_id,
  order_id,
  sale_price,
  created_at as order_created_date
FROM `bigquery-public-data.thelook_ecommerce.order_items`
WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)


### `CREATE MODEL` using `KMEANS`

Create a query, then start the model creation job, using a Python loop to wait for the job to complete.

<p align="center">
  <img alt="Conceptual Flow" src="slides/process2.png" width="100%">
</p>

#@bigquery

CREATE OR REPLACE MODEL `bqml_demos.ecommerce_customer_segment_cluster`
OPTIONS (
  MODEL_TYPE = "KMEANS",
  NUM_CLUSTERS = 5,
  KMEANS_INIT_METHOD = "KMEANS++",
  STANDARDIZE_FEATURES = TRUE )
AS (
SELECT * EXCEPT (user_id)
FROM (
  SELECT user_id,
    DATE_DIFF(CURRENT_DATE(), CAST(MAX(order_created_date) as DATE), day) AS days_since_order, -- RECENCY
    COUNT(order_id) AS count_orders, -- FREQUENCY
    AVG(sale_price) AS avg_spend -- MONETARY
  FROM (
    SELECT user_id,
      order_id,
      sale_price,
      created_at as order_created_date
    FROM `bigquery-public-data.thelook_ecommerce.order_items`
    WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)
  )
  GROUP BY user_id, order_id
 )
)
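
The waiting loop mentioned above is not shown in the notebook. A minimal sketch of one follows, assuming a job object that exposes a `done()` method, as the `QueryJob` returned by `bigquery.Client.query()` does. The helper `wait_for_job` and the stand-in `FakeJob` class are hypothetical and only illustrate the polling pattern without touching BigQuery.

```python
import time


def wait_for_job(job, poll_seconds: float = 5.0, timeout_seconds: float = 3600.0) -> int:
    """Poll job.done() until it returns True; return the number of polls made."""
    deadline = time.monotonic() + timeout_seconds
    polls = 0
    while True:
        polls += 1
        if job.done():
            return polls
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still running after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


class FakeJob:
    """Stand-in for a query job that finishes after a fixed number of checks."""

    def __init__(self, checks_until_done: int):
        self.remaining = checks_until_done

    def done(self) -> bool:
        self.remaining -= 1
        return self.remaining <= 0


polls_needed = wait_for_job(FakeJob(checks_until_done=3), poll_seconds=0.01)
print(f"job finished after {polls_needed} polls")
```

With the real client, the job would come from `bq.query(sql)`. Calling `job.result()` is a simpler alternative that blocks until the job completes.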


Let's take a look at the model's clustering performance using two metrics: the Davies-Bouldin index and the mean squared distance.

#@bigquery
SELECT *
FROM ML.EVALUATE(MODEL `bqml_demos.ecommerce_customer_segment_cluster`)
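
For intuition about the first metric: the Davies-Bouldin index averages, over all clusters, the worst-case ratio of within-cluster scatter to between-centroid distance, so lower values indicate tighter, better separated clusters. The sketch below computes it in plain Python on a made-up toy dataset. It illustrates the standard definition and makes no claim about how BigQuery ML computes the value internally.

```python
import math


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters, each a list of points (tuples)."""
    centroids = [
        tuple(sum(coord) / len(points) for coord in zip(*points)) for points in clusters
    ]
    # Scatter: mean distance from each point to its own centroid.
    scatter = [
        sum(math.dist(p, c) for p in points) / len(points)
        for points, c in zip(clusters, centroids)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Largest similarity ratio between cluster i and any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Toy data (made up): two tight clusters that sit far apart.
toy_clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 0.0), (10.0, 2.0)],
]
db_index = davies_bouldin(toy_clusters)
print(round(db_index, 3))
```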

### Now let's get the cluster (centroid) information

<p align="center">
  <img alt="Conceptual Flow" src="slides/process3.png" width="100%">
</p>

In [26]:
%%bigquery centroids
SELECT
  CONCAT('cluster ', CAST(centroid_id as STRING)) as centroid,
  avg_spend as average_spend,
  count_orders as count_of_orders,
  days_since_order
FROM (
  SELECT centroid_id, feature, ROUND(numerical_value, 2) as value
  FROM ML.CENTROIDS(MODEL `bqml_demos.ecommerce_customer_segment_cluster`)
)
PIVOT (
  SUM(value)
  FOR feature IN ('avg_spend',  'count_orders', 'days_since_order')
)
ORDER BY centroid_id
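
`ML.CENTROIDS` returns one row per centroid and feature, and the `PIVOT` clause turns those rows into one column per feature. As a rough illustration of that reshaping, here is the same idea in plain Python. The sample rows reuse two centroids from this notebook's results; the helper name `pivot_centroids` is hypothetical.

```python
from collections import defaultdict


def pivot_centroids(rows):
    """Reshape (centroid_id, feature, value) rows into {centroid_id: {feature: value}}."""
    wide = defaultdict(dict)
    for centroid_id, feature, value in rows:
        wide[centroid_id][feature] = value
    return dict(wide)


# Long-format rows, shaped like the inner query's output (centroid_id, feature, value).
long_rows = [
    (1, "avg_spend", 43.19),
    (1, "count_orders", 1.23),
    (1, "days_since_order", 97.68),
    (2, "avg_spend", 58.8),
    (2, "count_orders", 3.51),
    (2, "days_since_order", 387.13),
]

wide_rows = pivot_centroids(long_rows)
for centroid_id in sorted(wide_rows):
    print(f"cluster {centroid_id}", wide_rows[centroid_id])
```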

In [27]:
centroids

| | centroid | average_spend | count_of_orders | days_since_order |
|---|---|---|---|---|
| 0 | cluster 1 | 43.19 | 1.23 | 97.68 |
| 1 | cluster 2 | 58.8 | 3.51 | 387.13 |
| 2 | cluster 3 | 58.68 | 3.5 | 94.02 |
| 3 | cluster 4 | 46.86 | 1.22 | 395.26 |
| 4 | cluster 5 | 199.6 | 1.18 | 181.5 |


Whew! That's a lot of metrics and cluster info. How about we explain this to our colleagues using the magic of LLMs?

In [28]:
# Turn each centroid row into one plain-text line for the prompt
cluster_info = []
for _, row in centroids.iterrows():
    cluster_info.append(
        "{0}, average spend ${1}, count of orders per person {2}, days since last order {3}".format(
            row["centroid"], row["average_spend"], row["count_of_orders"], row["days_since_order"]))

print("\n".join(cluster_info))

cluster 1, average spend $43.19, count of orders per person 1.23, days since last order 97.68
cluster 2, average spend $58.8, count of orders per person 3.51, days since last order 387.13
cluster 3, average spend $58.68, count of orders per person 3.5, days since last order 94.02
cluster 4, average spend $46.86, count of orders per person 1.22, days since last order 395.26
cluster 5, average spend $199.6, count of orders per person 1.18, days since last order 181.5


## Explain with Vertex AI PaLM API

### First, we want to instantiate the large language model and create the prompt

<p align="center">
  <img alt="Conceptual Flow" src="slides/process4.png" width="100%">
</p>

In [29]:
textgen_model = vertexai.preview.language_models.TextGenerationModel.from_pretrained('text-bison@001')

In [30]:
preamble = "Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, and explain step by step; what would be the next marketing step for these clusters:"
display(Markdown('## Prompt:'))
print(preamble)
print('\n')
print(str.join("\n", cluster_info))

## Prompt:

Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, and explain step by step; what would be the next marketing step for these clusters:


cluster 1, average spend $43.19, count of orders per person 1.23, days since last order 97.68
cluster 2, average spend $58.8, count of orders per person 3.51, days since last order 387.13
cluster 3, average spend $58.68, count of orders per person 3.5, days since last order 94.02
cluster 4, average spend $46.86, count of orders per person 1.22, days since last order 395.26
cluster 5, average spend $199.6, count of orders per person 1.18, days since last order 181.5


### Now, we send our prompt to Google GenAI API for some LLM magic

<p align="center">
  <img alt="Conceptual Flow" src="slides/process5.png" width="100%">
</p>

In [31]:
display(Markdown('## Response:'))

# Send prompt to LLM
response = textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
)

display(Markdown(str(response)))

## Response:

**Cluster 1: The Occasional Shopper**

This cluster is made up of people who make small, infrequent purchases. They are typically looking for something specific and are not interested in browsing. The average spend in this cluster is $43.19, the count of orders per person is 1.23, and the days since last order is 97.68.

**Brand persona:** The Occasional Shopper is a busy person who is always on the go. They are looking for a quick and easy way to get the things they need without having to spend a lot of time or money.

**Title label:** The Occasional Shopper

**Next marketing step:** The next marketing step for this cluster is to focus on convenience and speed. This could be done by offering free shipping, same-day delivery, or a wider selection of products that can be purchased online.

**Cluster 2: The Loyal Customer**

This cluster is made up of people who make regular, large purchases. They are typically loyal to a particular brand or store and are not interested in trying new things. The average spend in this cluster is $58.80, the count of orders per person is 3.51, and the days since last order is 387.13.

**Brand persona:** The Loyal Customer is a brand enthusiast who is always looking for new ways to show their loyalty. They are typically well-informed about the products they buy and are not afraid to spend money on them.

**Title label:** The Loyal Customer

**Next marketing step:** The next marketing step for this cluster is to focus on building relationships and loyalty. This could be done by offering exclusive rewards, personalized service, or a loyalty program.

**Cluster 3: The Frequent Shopper**

This cluster is made up of people who make multiple, small purchases on a regular basis. They are typically looking for deals and discounts and are not afraid to shop around. The average spend in this cluster is $58.68, the count of orders per person is 3.50, and the days since last order is 94.02.

**Brand persona:** The Frequent Shopper is a bargain hunter who is always looking for the best deal. They are typically well-informed about the products they buy and are not afraid to switch brands if they find a better price.

**Title label:** The Frequent Shopper

**Next marketing step:** The next marketing step for this cluster is to focus on value and savings. This could be done by offering coupons, discounts, or sales.

**Cluster 4: The Disengaged Shopper**

This cluster is made up of people who have not made a purchase in a long time. They are typically not interested in shopping and are not likely to return to the store. The average spend in this cluster is $46.86, the count of orders per person is 1.22, and the days since last order is 395.26.

**Brand persona:** The Disengaged Shopper is a person who has lost interest in the brand. They may have had a bad experience in the past or they may simply be looking for something different.

**Title label:** The Disengaged Shopper

**Next marketing step:** The next marketing step for this cluster is to try to win them back. This could be done by offering a special deal, sending them a personalized email, or reaching out to them on social media.

**Cluster 5: The High-Value Customer**

This cluster is made up of people who make large, infrequent purchases. They are typically looking for something special and are not afraid to spend money. The average spend in this cluster is $199.60, the count of orders per person is 1.18, and the days since last order is 181.58.

**Brand persona:** The High-Value Customer is a luxury shopper who is always looking for the best possible experience. They are typically well-informed about the products they buy and are not afraid to pay a premium for quality.

**Title label:** The High-Value Customer

**Next marketing step:** The next marketing step for this cluster is to focus on exclusivity and personalization. This could be done by offering private shopping events, personalized service, or a concierge program.

Voila! We've now used k-means clustering to create groups of spenders and explain their profiles.

Sometimes, though, you want a little bit [extra](https://cloud.google.com/blog/transform/prompt-debunking-five-generative-ai-misconceptions).
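
The follow-up cells below repeat the same generation settings and prepend a new instruction to the text of the first response. One way to cut that repetition, sketched here, is to keep the settings in a single dictionary and wrap the call in a small function. `GENERATION_PARAMS`, `follow_up`, and the `EchoModel` stub are hypothetical names; the stub only mimics the `predict(...)` call shape so the sketch runs without Vertex AI.

```python
from types import SimpleNamespace

# Same settings as the predict() calls in this notebook.
GENERATION_PARAMS = {
    "max_output_tokens": 1024,
    "temperature": 0.35,
    "top_p": 0.8,
    "top_k": 40,
}


def follow_up(model, instruction: str, context: str) -> str:
    """Send `instruction` followed by `context` to the model and return the reply text."""
    reply = model.predict(instruction + "\n" + context, **GENERATION_PARAMS)
    return reply.text


class EchoModel:
    """Stub with the same call shape as the text model; it records the keyword arguments."""

    def __init__(self):
        self.last_kwargs = None

    def predict(self, prompt: str, **kwargs):
        self.last_kwargs = kwargs
        return SimpleNamespace(text=f"[{len(prompt.splitlines())} prompt lines received]")


stub = EchoModel()
answer = follow_up(stub, "For each cluster below, suggest a tagline:", "Cluster 1\nCluster 2")
print(answer)
```

In the notebook, `textgen_model` would take the place of the stub and `response.text` would be passed as the context.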

In [20]:
preamble = "Pretend you're a creative strategist, for each of the below clusters specify their favorite movie and a and a witty e-mail headline for marketing campaign targeted to their group: "

display(Markdown('## Prompt:'))
print(preamble)
print('\n')
display(Markdown('## Response:'))
display(Markdown(str(textgen_model.predict(
   preamble + "\n" + response.text,
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))))

## Prompt:

Pretend you're a creative strategist, for each of the below clusters specify their favorite movie and a and a witty e-mail headline for marketing campaign targeted to their group: 




## Response:

**Cluster 1: The Occasional Shopper**

* Favorite movie: The Princess Bride
* Witty email headline: "Fancy a cup of coffee? We'll sweeten the deal with a 20% discount!"

**Cluster 2: The Loyal Customer**

* Favorite movie: The Godfather
* Witty email headline: "We're your family. Treat yourself to a 10% discount on your next purchase."

**Cluster 3: The Impulsive Shopper**

* Favorite movie: The Hangover
* Witty email headline: "You're in for a wild ride! Get 25% off our latest collection of crazy clothes."

**Cluster 4: The Price-Sensitive Shopper**

* Favorite movie: The Shawshank Redemption
* Witty email headline: "You can't buy freedom, but you can buy this TV for 50% off!"

**Cluster 5: The Luxury Shopper**

* Favorite movie: The Great Gatsby
* Witty email headline: "Live like Gatsby with our exclusive collection of luxury goods."

In [21]:
preamble = "Pretend you're a creative strategist, for each the clusters below includes the detail of their favorite movie,and a description of an image that describes scenery related to their favorite movie:"

display(Markdown('## Prompt:'))
print(preamble)
print('\n')
print(str.join("\n", cluster_info))
display(Markdown('## Response:'))
display(Markdown(str(textgen_model.predict(
   preamble + "\n" + response.text,
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))))

## Prompt:

Pretend you're a creative strategist, for each the clusters below includes the detail of their favorite movie,and a description of an image that describes scenery related to their favorite movie:


cluster 1, average spend $43.19, count of orders per person 1.23, days since last order 97.68
cluster 2, average spend $58.8, count of orders per person 3.51, days since last order 387.13
cluster 3, average spend $58.68, count of orders per person 3.5, days since last order 94.02
cluster 4, average spend $46.86, count of orders per person 1.22, days since last order 395.26
cluster 5, average spend $199.6, count of orders per person 1.18, days since last order 181.5


## Response:

**Cluster 1: The Occasional Shopper**

* Favorite Movie: The Wizard of Oz
* Scenery Image: A field of poppies, with the Emerald City in the distance.

**Cluster 2: The Loyal Customer**

* Favorite Movie: The Lord of the Rings
* Scenery Image: The Shire, with the mountains of Mordor in the distance.

**Cluster 3: The Impulsive Shopper**

* Favorite Movie: The Fast and the Furious
* Scenery Image: A busy street in Tokyo, with the neon lights of the cityscape in the background.

**Cluster 4: The Price-Sensitive Shopper**

* Favorite Movie: The Hangover
* Scenery Image: A Las Vegas strip club, with the flashing lights and the sounds of music in the background.

**Cluster 5: The Luxury Shopper**

* Favorite Movie: The Great Gatsby
* Scenery Image: A mansion in the Hamptons, with the ocean and the beach in the background.

In [32]:
preamble = "Pretend you're a creative strategist, for each of the following clusters below, come up their favorite food and a detailed description of an image containing their favorite food:"

display(Markdown('## Prompt:'))
print(preamble)
print('\n')
display(Markdown('## Response:'))
display(Markdown(str(textgen_model.predict(
   preamble + "\n" + response.text,
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))))

## Prompt:

Pretend you're a creative strategist, for each of the following clusters below, come up their favorite food and a detailed description of an image containing their favorite food:




## Response:

**Cluster 1: The Occasional Shopper**

**Favorite food:** Pizza

**Image description:** A photo of a pizza with pepperoni, mushrooms, and olives. The pizza is on a plate with a knife and fork next to it. The background is a dark blue color.

**Cluster 2: The Loyal Customer**

**Favorite food:** Steak

**Image description:** A photo of a steak with a baked potato and vegetables. The steak is on a plate with a knife and fork next to it. The background is a white color.

**Cluster 3: The Frequent Shopper**

**Favorite food:** Sushi

**Image description:** A photo of a sushi roll with salmon, avocado, and cucumber. The sushi roll is on a plate with chopsticks next to it. The background is a light green color.

**Cluster 4: The Disengaged Shopper**

**Favorite food:** Ice cream

**Image description:** A photo of a bowl of ice cream with chocolate sauce, whipped cream, and a cherry on top. The ice cream is on a plate with a spoon next to it. The background is a light brown color.

**Cluster 5: The High-Value Customer**

**Favorite food:** Lobster

**Image description:** A photo of a lobster tail with melted butter and lemon wedges. The lobster tail is on a plate with a fork next to it. The background is a dark red color.

In [13]:
def persona_builder(preamble):
    """Build the full prompt for a task and return (prompt, response text)."""
    pre_text = "Pretend you're a creative strategist, given the following clusters "
    prompt = pre_text + preamble + "\n" + "\n".join(cluster_info)
    result = textgen_model.predict(
        prompt,
        max_output_tokens=1024,
        temperature=0.35,
        top_p=0.8,
        top_k=40,
    )

    # Return plain strings so both output textboxes can display them
    return prompt, result.text

#### Notes

A beautiful mountain range with snow-capped peaks and a river running through it. The sun is shining and the sky is blue.  Photorealistic

come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group
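
The second note above reads like a task that could be passed to `persona_builder`. To preview the prompt such a task would produce without spending an API call, the string assembly can be mirrored in a pure function. This is only a sketch: `compose_prompt` and `PRE_TEXT` are hypothetical names, the task text is a shortened made-up example, and the two cluster lines are copied from the output earlier in this notebook.

```python
PRE_TEXT = "Pretend you're a creative strategist, given the following clusters "


def compose_prompt(task: str, cluster_lines: list) -> str:
    """Mirror the string assembly in persona_builder without calling the model."""
    return PRE_TEXT + task + "\n" + "\n".join(cluster_lines)


# Two cluster lines copied from the output earlier in this notebook.
sample_clusters = [
    "cluster 1, average spend $43.19, count of orders per person 1.23, days since last order 97.68",
    "cluster 5, average spend $199.6, count of orders per person 1.18, days since last order 181.5",
]
preview = compose_prompt("come up with a witty e-mail headline for each", sample_clusters)
print(preview)
```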

## Now we can stick it behind a UI

<p align="center">
  <img alt="Conceptual Flow" src="slides/process6.png" width="100%">
</p>

In [14]:
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown(
    """
    ## Persona Builder and Marketing Bot

    ### Enter a task...

    """)
    with gr.Row():
      with gr.Column():
        input_text = gr.Textbox(label="Task", placeholder="Pretend you're a creative strategist, given the following clusters: ")
        
        
    with gr.Row():
      generate = gr.Button("Generate Response")

    with gr.Row():
      label2 = gr.Textbox(label="Prompt")
    with gr.Row():
      label3 = gr.Textbox(label="Response generated by LLM")

    generate.click(persona_builder,input_text, [label2, label3])
demo.launch(share=False, debug=False)

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


