# Demo 3: Gemini Exercise

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ciri/persona-workshop/blob/main/demo-3/Silicon-Gemini-Exercise.ipynb)

We will start by setting up the notebook. If you haven't already, first create a Gemini API key [here](https://www.google.com/url?q=https%3A%2F%2Faistudio.google.com%2Fapp%2Fapikey) (free). The free tier is somewhat limited (see quotas [here](https://cloud.google.com/gemini/docs/quotas#daily)), but if you add your card information you get $300 in free credit for the next 90 days (you don't need to do this for the workshop). You can then add the key below.

In [6]:
# You don't need this code, just make sure you have your API key stored
# in a variable called api_secret
from dotenv import load_dotenv 
import os

load_dotenv()
api_secret = os.getenv("API_SECRET")

In [7]:
# Libraries that we will use. If you are missing a library,
# create a new cell with e.g.:
#   !pip install NAME_OF_MISSING
# where NAME_OF_MISSING is the library that you are missing.

import google.generativeai as genai
from tqdm import tqdm 
import numpy as np
import pandas as pd
import typing_extensions as typing
import json
import random
import seaborn as sns
import matplotlib.pyplot as plt

genai.configure(api_key=api_secret)

## Exercise 1: Initialize the Generative Model

Let's start by verifying that we can initialize and call a model. Ask it to write a poem about your country of origin.

In [8]:
model    = genai.GenerativeModel('gemini-pro')

# change me
response = model.generate_content('Write me a poem about ...')

print(response.text)


We will now be using the [Gemini API](https://ai.google.dev/docs/gemini_api_overview) to generate silicon samples.

## Building blocks

There are two main things we need to understand to do silicon sampling:

1. You can create string templates in which you create variations of your question.
2. You can return structured output.

Let's explore both of these.

### Exercise 2: Structured output
You can ask a model to return structured output, which makes it easier to post-process the responses into statistics.

1. Add top speed and max pax to the data definition
2. Complete the prompt below to ask it about airplanes of a region that you are interested in.
3. EXTRA: change the object of interest (e.g., ask it about cars instead).

In [None]:
# Specify the structure as a python class
class AirplaneSpecification(typing.TypedDict):
    airplane_model: str
    builder: str
    carriers: list[str]
    #TODO: add top speed in kilometers per hour
    #TODO: add maximum number of passengers

# Let's query the model again.
model  = genai.GenerativeModel("gemini-1.5-pro-latest")

output = model.generate_content(
    # TODO: edit the prompt so we ask about your region of interest
    "List a few popular airplane models ...",
    
    # This line is crucial! It specifies that we want json output 
    # according to the specification above.
    generation_config = genai.GenerationConfig(
        response_mime_type="application/json", response_schema=list[AirplaneSpecification]
    ),
)

# The response can be transformed into a python dictionary
# using the json library
result = json.loads(output.text)
result
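
Since the parsed result is plain Python data, it can be worth checking that every record carries the fields declared in the specification before computing statistics. The following is a minimal offline sketch: the helper `missing_fields` and the sample reply text are hypothetical (they are not part of the workshop code or of the Gemini API), and the sample stands in for `output.text`.

```python
import json
import typing


class AirplaneSpecification(typing.TypedDict):
    airplane_model: str
    builder: str
    carriers: list[str]


def missing_fields(record: dict, spec: type) -> list[str]:
    """Return the field names declared in `spec` that are absent from `record`."""
    expected = typing.get_type_hints(spec)
    return [name for name in expected if name not in record]


# Hypothetical reply text with made-up values, standing in for output.text
sample_text = (
    '[{"airplane_model": "Model X1", "builder": "Example Aero", '
    '"carriers": ["Example Air"]}, {"airplane_model": "Model Y2"}]'
)

records = json.loads(sample_text)

# Map each record index to the list of fields it is missing
problems = {i: missing_fields(r, AirplaneSpecification) for i, r in enumerate(records)}
print(problems)
```

An empty list for a record means all declared fields are present; the check only looks at field names, not at value types.
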

A string template allows us to ask a question repeatedly. Let's use this capability to set the persona of the LLM:

In [None]:
# Step 1: Specify the output format
class MovieSpecification(typing.TypedDict):
    age: int
    location: str
    food: str

# Step 2: Specify the input prompt template
template = "You are a {age}-year old {gender} from {location}."

# Step 3: Specify a distribution - for now just a list.
population = [
    {"age":35, "gender":"female","location":"China"},
    
    # TODO: add a couple more profiles here, by copy-pasting
    #       and then modifying the line
    # ...
]

# Step 4: run the survey
for person in population:
  system_prompt = template.format(**person)

  model = genai.GenerativeModel('gemini-1.5-pro-latest', system_instruction=system_prompt)
  response = model.generate_content(     
      """
      What's your single most favorite food? Instead of giving a stereotypical 
      response name a dish that reflects the diversity of the local cuisine or 
      a dish that you personally might enjoy growing up there.
      """,
      generation_config = genai.GenerationConfig(
          response_mime_type="application/json", response_schema=list[MovieSpecification]
      ),
  )

  print(system_prompt)
  print(json.loads(response.text))

You are a 35-year old female from China.
[{'age': 35, 'food': 'Suancai Yu', 'location': 'China'}]
You are a 42-year old male from Nigeria.
[{'age': 42, 'food': 'Ayamase with assorted meat and Pounded Yam', 'location': 'Nigeria'}]
You are a 32-year old male from Belgium.
[{'age': 32, 'food': 'Stoatjespap', 'location': 'Belgium'}]
You are a 42-year old male from Saudi-Arabia.
[{'age': 42, 'food': 'Saleeg', 'location': 'Saudi Arabia'}]
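
The templating step can be tried without calling the API. The sketch below renders one system prompt per profile and fails early when a profile lacks a placeholder that the template uses. The helpers `template_fields` and `render_prompts` are hypothetical names introduced for illustration; the template and the profiles are taken from the example above.

```python
import string

template = "You are a {age}-year old {gender} from {location}."


def template_fields(tmpl: str) -> set[str]:
    """Collect the placeholder names used in a str.format template."""
    return {field for _, field, _, _ in string.Formatter().parse(tmpl) if field}


def render_prompts(tmpl: str, population: list[dict]) -> list[str]:
    """Format one prompt per profile, raising KeyError on incomplete profiles."""
    needed = template_fields(tmpl)
    prompts = []
    for person in population:
        missing = needed - set(person)
        if missing:
            raise KeyError(f"profile {person} is missing {sorted(missing)}")
        prompts.append(tmpl.format(**person))
    return prompts


population = [
    {"age": 35, "gender": "female", "location": "China"},
    {"age": 42, "gender": "male", "location": "Nigeria"},
]

prompts = render_prompts(template, population)
for prompt in prompts:
    print(prompt)
```
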


## Privacy Scales

Let us now try to replicate some of the results from the privacy calculus scale (Dinev & Hart, 2006). We'll be focusing on the questions related to Internet Privacy Concern (PC) and willingness to provide personal information to transact on the Internet (PPIT).


| **Concern/Activity** | **Description** |
|----------------------|-----------------|
| **Indicate the extent to which you are concerned about the following:** |  |
| **PC1** | I am concerned that the information I submit on the Internet could be misused. |
| **PC2** | I am concerned that a person can find private information about me on the Internet. |
| **PC3** | I am concerned about submitting information on the Internet, because of what others might do with it. |
| **PC4** | I am concerned about submitting information on the Internet, because it could be used in a way I did not foresee. |
| **Scale** | Not at all concerned–Very concerned |
| **Willingness to provide personal information to transact on the Internet (PPIT)** |  |
| **To what extent are you willing to use the Internet to do the following activities?** |  |
| **PPIT 1** | Purchase goods (e.g., books or CDs) or services (e.g., airline tickets or hotel reservations) from websites that require me to submit accurate and identifiable information (i.e., credit card information) |
| **PPIT 2** | Retrieve information from websites that require me to submit accurate and identifiable registration information, possibly including credit card information (e.g., using sites that provide personalized stock quotes, insurance rates, or loan rates; or using sexual or gambling websites) |
| **PPIT 3** | Conduct sales transactions at e-commerce sites that require me to provide credit card information (e.g., using sites for purchasing goods or software) |
| **PPIT 4** | Retrieve highly personal and password-protected financial information (e.g., using websites that allow me to access my bank account or my credit card account) |
| **Scale** | Not at all–Very much |


Dinev, T., & Hart, P. (2006). An extended privacy calculus model for e-commerce transactions. Information Systems Research, 17(1), 61-80.

In the paper, the authors hypothesize, and find, that PC and PPIT are negatively correlated:

 ![Original Hypothesis](img/DinevHart2006-a.png)

**Step 1**: define the survey question prompt and the response data structure

In [9]:
# Survey questions
survey_questions = """
You will now answer questions about your privacy concerns. Rate your agreement with each statement on a scale from 1 (Strongly Disagree) to 7 (Strongly Agree).

1. I am concerned that the information I submit on the Internet could be misused.
2. I am concerned that a person can find private information about me on the Internet.
3. I am concerned about submitting information on the Internet, because of what others might do with it.
4. I am concerned about submitting information on the Internet, because it could be used in a way I did not foresee.

Now, please answer four additional questions. To what extent are you willing to use the Internet to do the following activities? Rate your willingness with each statement on a scale from 1 (Not at all) to 7 (Very much).

5. Purchase goods (e.g., books or CDs) or services (e.g., airline tickets or hotel reservations) from websites that require me to submit accurate and identifiable information (i.e., credit card information)
6. Retrieve information from websites that require me to submit accurate and identifiable registration information, possibly including credit card information (e.g., using sites that provide personalized stock quotes, insurance rates, or loan rates; or using sexual or gambling websites)
7. Conduct sales transactions at e-commerce sites that require me to provide credit card information (e.g., using sites for purchasing goods or software)
8. Retrieve highly personal and password-protected financial information (e.g., using websites that allow me to access my bank account or my credit card account)

Always make the choice that the person would, never answer with a number lower than 1 or larger than 7.
"""


# Define the structure of survey answers with Likert scale responses
class SurveyAnswers(typing.TypedDict):
    privacy_misuse_concern: int
    finding_private_info_concern: int
    misuse_by_others_concern: int
    unforeseen_use_concern: int

    purchase_intention: int
    information_intention: int
    ecommerce_intention: int
    personal_intention: int

# Mapping from numeric string keys to descriptive field names.
# We will need this later to map the answers back; it is especially
# useful when dealing with very large surveys.
response_key_mapping = {
    '1': 'privacy_misuse_concern',
    '2': 'finding_private_info_concern',
    '3': 'misuse_by_others_concern',
    '4': 'unforeseen_use_concern',
    '5': 'purchase_intention',
    '6': 'information_intention',
    '7': 'ecommerce_intention',
    '8': 'personal_intention'
}

**Step 2**: define the population sample system prompt

In [None]:
# We set the random seeds for replicability
np.random.seed(42)
random.seed(42)

# Template for the persona prompt
persona_template = """
You are a virtual person simulator that creates individual synthetic personas, one at a time, that I can specify and then ask them any questions I like. This means that you answer the way the persona would – no matter the implications. Be brief. Do not write any additional explanations unless I ask you to.

You are a {age}-year-old {gender} person.
"""

# Population simulator, creates a random socio-demographic profile
def generate_population(n):
    population = []
    for _ in range(n):
        age = int(np.random.uniform(18, 65))
        gender = random.choice(["female", "male"])
        population.append({"age": age, "gender": gender})
    return population

#population = generate_population(50) # Only use 50 if you have $300 free credits
population = generate_population(5)

population[:5]

[{'age': 35, 'gender': 'female'},
 {'age': 62, 'gender': 'female'},
 {'age': 52, 'gender': 'male'},
 {'age': 46, 'gender': 'female'},
 {'age': 25, 'gender': 'female'}]
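
The generator above draws ages from NumPy's global state and genders from Python's `random` module, so it needs two seeds. As a variant sketch, assuming you would rather control everything with a single seed, both draws can come from one `np.random.default_rng` generator. The function name `generate_population_rng` is hypothetical, and the profiles it produces will differ from the output shown above.

```python
import numpy as np


def generate_population_rng(n: int, seed: int = 42) -> list[dict]:
    """Draw n age/gender profiles from one seeded NumPy generator."""
    rng = np.random.default_rng(seed)
    population = []
    for _ in range(n):
        # integers(18, 65) excludes the upper bound, so ages fall in 18..64,
        # the same range as int(np.random.uniform(18, 65))
        age = int(rng.integers(18, 65))
        gender = str(rng.choice(["female", "male"]))
        population.append({"age": age, "gender": gender})
    return population


sample = generate_population_rng(5)
print(sample)
```
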

**Step 3**: do the sampling

In [None]:
# Run the survey with the LLM (simulation)
responses = []
for person in tqdm(population):
    system_prompt = persona_template.format(**person)    

    # Set up the model with the correct persona system prompt
    model = genai.GenerativeModel(
        'gemini-1.5-pro-latest',
        system_instruction=system_prompt,
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json",
            temperature= 1.0
        ),
    )
    # Retry logic for invalid responses, this becomes important
    # at higher temperature settings.
    max_retries = 3
    retries = 0
    while retries < max_retries:
        response = model.generate_content(survey_questions)
        try:
            # Convert response to a dictionary
            result = json.loads(response.text)
            
            # Convert numeric keys to descriptive keys
            mapped_result = {response_key_mapping[key]: value for key, value in result.items() if key in response_key_mapping}

            # Ensure the mapped result has all required fields and values are of correct type
            if all(key in mapped_result for key in SurveyAnswers.__annotations__) and all(isinstance(mapped_result[key], int) for key in SurveyAnswers.__annotations__):
                mapped_result.update(person)
                responses.append(mapped_result)
                break  # Exit retry loop if successful
            else:
                print(f"Invalid response format after mapping: {mapped_result}")
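        except (json.JSONDecodeError, AttributeError):
            # AttributeError covers valid JSON that is not an object (e.g., a list),
            # for which result.items() does not exist
            print(f"Unable to parse response as a JSON object: {response.text}")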

        retries += 1
        if retries == max_retries:
            print(f"Max retries reached for person: {person}")
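
The loop above checks that every field is present and is an integer, but not that the value lies on the 1 to 7 scale that the prompt asks for. A minimal sketch of a stricter check follows, assuming the model replies with the numeric string keys `'1'` to `'8'`. The helper `map_and_validate` and the two sample replies are hypothetical and only serve as illustration.

```python
response_key_mapping = {
    '1': 'privacy_misuse_concern',
    '2': 'finding_private_info_concern',
    '3': 'misuse_by_others_concern',
    '4': 'unforeseen_use_concern',
    '5': 'purchase_intention',
    '6': 'information_intention',
    '7': 'ecommerce_intention',
    '8': 'personal_intention',
}


def map_and_validate(raw: dict, low: int = 1, high: int = 7):
    """Map numeric keys to field names; return None unless all answers are valid."""
    mapped = {response_key_mapping[k]: v for k, v in raw.items() if k in response_key_mapping}
    complete = set(mapped) == set(response_key_mapping.values())
    in_range = all(
        isinstance(v, int) and not isinstance(v, bool) and low <= v <= high
        for v in mapped.values()
    )
    return mapped if complete and in_range else None


# Hypothetical replies: one valid, one with an out-of-range answer
good_reply = {str(i): 4 for i in range(1, 9)}
bad_reply = {**good_reply, '3': 9}

print(map_and_validate(good_reply))
print(map_and_validate(bad_reply))
```

A `None` result could be treated like the invalid-format branch of the retry loop, so that out-of-range answers trigger another attempt.
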

Now that we have collected the sample, we can calculate statistics on the responses. Here's some code to get you started.

In [None]:
df = pd.DataFrame(responses)

df['PC']   = df[['privacy_misuse_concern', 'finding_private_info_concern','misuse_by_others_concern', 'unforeseen_use_concern']].mean(axis=1)
df['PPIT'] = df[['purchase_intention', 'information_intention', 'ecommerce_intention','personal_intention']].mean(axis=1)

df.to_csv('./silicon_sample.csv')
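
To compare the sample against the hypothesis from the paper, one option is the Pearson correlation between the two scale means. The sketch below uses a small hand-written table, `df_demo`, whose values are hypothetical and are not survey results; with the real data, the same call would be applied to the `PC` and `PPIT` columns of `df`.

```python
import pandas as pd

# Hypothetical scale means for five respondents, for illustration only
df_demo = pd.DataFrame({
    "PC":   [6.5, 5.0, 3.25, 2.0, 4.5],
    "PPIT": [2.0, 3.5, 5.0, 6.25, 4.0],
})

# Series.corr computes the Pearson correlation by default
r = df_demo["PC"].corr(df_demo["PPIT"])
print(f"Pearson r between PC and PPIT: {r:.2f}")
```

A negative value of `r` would be in line with the direction hypothesized by Dinev and Hart; with only a handful of simulated respondents the estimate is very noisy.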