# Day 4b
In this notebook we will evaluate the ability of a large language model (LLM) to model demographic differences. We will use a dataset on [vaccine hesitancy](https://www.kaggle.com/datasets/christianhritter/vaccine-hesitancy-canada-cosmo-survey?select=COSMO_in_Canada_Waves_1-8_FINAL.csv) to evaluate this.

By the end of this notebook, you will have learned to:
- Compare the distribution of LLM-generated responses to a survey question with the actual data
- Evaluate the effect of demographic information on the generated responses
 


## Environment Setup
Make sure to set your runtime to use a GPU by going to `Runtime` -> `Change runtime type` -> `Hardware accelerator` -> `T4 GPU`

In [ ]:
import sys
if 'google.colab' in sys.modules:  # If in Google Colab environment
    # Mount google drive to enable access to data files
    from google.colab import drive
    drive.mount('/content/drive')
    
    # Installing requisite packages
    !pip install transformers accelerate &> /dev/null
    
    # Change working directory to day_4
    %cd /content/drive/MyDrive/LLM4BeSci_GSERM2024/day_4

In [None]:
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm_notebook as tqdm

## Loading the data
The code begins by loading the dataset as a `pandas.DataFrame`. The dataset contains the following columns:
1. `'age'`
2. `'gender'`
3. `'education'`
4. `'take_vaccine'`: Survey question on willingness to take a COVID-19 vaccine (1 (Strongly Disagree) to 7 (Strongly Agree))
5. `'mandatory_vaccine'`

It also contains three `persona` columns, which hold the demographic information in a prompt format:

1. `'persona_a'`: Gender, Age, Education
2. `'persona_b'`: Gender, Age
3. `'persona_c'`: Age, Education

In [None]:
vaccine = pd.read_csv('vaccine_hesitancy.csv')
vaccine

You can use the `value_counts()` method to get an idea of the distribution of the data. For instance, we can look at the distribution of the `'take_vaccine'` column:

In [None]:
vaccine['take_vaccine'].value_counts()
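
Raw counts are sorted by frequency, which can make a 1-7 scale awkward to read. As a minimal sketch (using a small made-up series, not the survey data), `value_counts(normalize=True)` followed by `sort_index()` gives proportions ordered by response option:

```python
import pandas as pd

# Toy responses on the 1-7 scale (made up for illustration; not the survey data)
toy = pd.Series([7, 7, 6, 4, 7, 1, 6, 7], name="take_vaccine")

# normalize=True returns proportions instead of counts;
# sort_index() orders the rows by response option rather than by frequency
proportions = toy.value_counts(normalize=True).sort_index()
print(proportions)
```

The same two calls should work on `vaccine['take_vaccine']` or any of the demographic columns.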

**TASK 1**: Replace `'take_vaccine'` above with the demographic column names (`'age'`, `'gender'`, `'education'`) to get an idea of the distribution of the data.

**TASK 2**: Replace `'take_vaccine'` above with the other survey question (`'mandatory_vaccine'`) to get an idea of the distribution of the data.

In the next section, we will generate responses to the `'take_vaccine'` survey question using the LLM. Participants were asked the following question:

In [None]:
survey_q = """
    Please give your opinion on the following statement:
    
    'If an effective COVID-19 vaccine becomes available and is recommended for me, I would get it.'
    
    Choice:
    
    1 = Strongly Disagree
    2
    3
    4
    5
    6
    7 = Strongly Agree 
    
    Strictly only respond with the number corresponding to your choice and nothing else.
"""

## Comparing Phi-3's default distribution with the data
This section generates responses to the survey question using the LLM and compares the distribution of the generated responses with the actual data. We again use Microsoft's Phi-3 model. We begin by loading the model and tokenizer:

In [ ]:
torch.random.manual_seed(42) # Set seed for reproducibility

# Load Phi-3
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    device_map="cuda", # Use GPU
    torch_dtype=torch.float16,  # Use float16 for faster inference
    trust_remote_code=True, 
    attn_implementation='eager' # Eager attention; FlashAttention 2 is not supported on T4 GPUs
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

The code then uses the `pipeline` function to set up text generation. The `generation_args` dictionary defines how responses are generated:

In [ ]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 10,  # Maximum number of tokens to generate
    "return_full_text": False, # Return only the generated text
    "do_sample": True, # Allow sampling from the softmax distribution
    "temperature": 1.0  # Temperature parameter 
}
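
To build intuition for the `temperature` argument, here is a minimal sketch of temperature-scaled softmax in plain Python. The function name `softmax_with_temperature` and the scores in `toy_logits` are made up for illustration and are not taken from Phi-3; the point is only that dividing the scores by a larger temperature flattens the resulting probabilities:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for the answers '1' to '7' (made up; not Phi-3 output)
toy_logits = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 6.0]

# Higher temperatures spread probability mass more evenly across options
for t in (1.0, 3.0, 10.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 2) for p in probs])
```

With a low temperature, most of the probability sits on the highest-scoring option, so repeated sampling tends to return the same answer.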

The code first generates 100 responses to the survey question. It then detects which integer (`1`-`7`) the model responded with and appends this to the `take_vaccine_preds` list. Responses in which it cannot detect exactly one such integer are skipped.

In [None]:
# Generate responses
take_vaccine_preds = [] # List to store generated responses
n_samples = 100 # Number of samples to generate
for i in tqdm(range(n_samples)):
    
    # Define prompt with JSON structure
    prompt = [{"role": "user", "content": survey_q}]  
    
    # Generate response
    response = pipe(prompt, **generation_args)[0]['generated_text']
    
    # Checks which integer corresponds to the response and appends to list
    possibles = ['1', '2', '3', '4', '5', '6', '7']
    pred = [int(x) for x in possibles if x in response]
    if len(pred) == 1:
        pred = pred[0]
        take_vaccine_preds.append(pred)
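
Note that the substring check above would read a response such as `'10'` as `1`, because `'1'` is contained in `'10'`. One possible alternative, sketched here with a hypothetical helper `parse_likert` that is not part of the notebook, is to extract whole numbers with a regular expression and accept the response only if it contains a single distinct number within the scale:

```python
import re

def parse_likert(response, low=1, high=7):
    """Return the single integer in [low, high] found in `response`, else None."""
    # \d+ captures whole numbers, so '10' is read as ten rather than as '1'
    numbers = {int(n) for n in re.findall(r"\d+", response)}
    # Accept only responses containing exactly one distinct number
    if len(numbers) == 1:
        (n,) = numbers
        if low <= n <= high:
            return n
    return None

print(parse_likert(" 7 = Strongly Agree"))
print(parse_likert("10"))
print(parse_likert("5 or 6"))
```

This is stricter than the original check, so it may discard a few more responses, particularly at high temperatures.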

To compare the distribution of the generated responses (`take_vaccine_preds`) with the actual data (`vaccine['take_vaccine']`), the code uses a histogram:

In [None]:
# Comparing distribution of generated responses with actual data
figs, axs = plt.subplots(1, 2, figsize=(12, 6))

# Plotting actual data
sns.histplot(vaccine['take_vaccine'], stat='percent', bins=7, ax=axs[0])
axs[0].set_title('Actual Data')

# Plotting generated data
sns.histplot(take_vaccine_preds, stat='percent', bins=7, ax=axs[1])
axs[1].set_title('Generated Data')
axs[1].set_xlabel('take_vaccine')

As you can see, with a temperature of `1.0`, the model simply generates the same response for all samples. We therefore need to increase the `temperature` parameter to allow for more realistic individual variability in the generated responses.

**TASK 1**: Try playing around with larger values of `"temperature"` (e.g. `3.0`, `5.0`, `10.0`) to see how it affects the generated responses. **Don't forget to re-run the cell that defines `generation_args` after changing the temperature.**
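
When comparing several temperatures, a single number can complement the histograms. The sketch below computes the total variation distance between two response distributions (0 means identical, 1 means no overlap). The helper names and the toy lists are assumptions for illustration; in the notebook you might pass `vaccine['take_vaccine']` and `take_vaccine_preds` instead:

```python
from collections import Counter

def response_proportions(responses, options=range(1, 8)):
    """Proportion of responses falling on each option of the 1-7 scale."""
    counts = Counter(responses)
    n = sum(counts[o] for o in options)
    if n == 0:
        raise ValueError("no valid responses to compare")
    return {o: counts[o] / n for o in options}

def total_variation_distance(p, q):
    """Half the sum of absolute differences between two distributions."""
    return 0.5 * sum(abs(p[o] - q[o]) for o in p)

# Made-up examples: varied 'actual' answers vs. a model that always answers 7
toy_actual = [7, 7, 6, 5, 7, 1, 4, 7]
toy_generated = [7, 7, 7, 7, 7, 7, 7, 7]

tvd = total_variation_distance(
    response_proportions(toy_actual),
    response_proportions(toy_generated),
)
print(f"Total variation distance: {tvd:.3f}")
```

A lower distance at one temperature than another would suggest the generated distribution is closer to the survey data, although it says nothing about individual-level accuracy.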

## Demographic Steering: Gender, Age, Education
As mentioned, our dataset also has three "personas", which contain different kinds of demographic information in a prompt format:

1. `'persona_a'`: Gender, Age, Education
2. `'persona_b'`: Gender, Age
3. `'persona_c'`: Age, Education

We will now evaluate each persona's effect on the generated responses. For each "participant", we will prepend the demographic information to `survey_q` and generate a response. For instance, the prompt for the first "participant" would look like this:

In [None]:
prompt = f"You are {vaccine['persona_a'][0]}.\n-------------------------------\n{survey_q}"
print(prompt)

The code next generates responses for each persona and compares the generated responses with the actual data. It starts with `'persona_a'` and a temperature of `3.0`, which we found to be the best temperature for this task out of [`1.0`, `3.0`, `5.0`, `10.0`] using the previous task:

In [ ]:
generation_args = {
    "max_new_tokens": 10,  # Maximum number of tokens to generate
    "return_full_text": False, # Return only the generated text
    "do_sample": True, # Sample from the softmax distribution (not greedy decoding)
    "temperature": 3.0  # Temperature parameter 
}


demog_col = 'persona_a'  # Replace this for TASK 1

# Generate responses
take_vaccine_preds = [] # List to store generated responses
for demog_prompt in tqdm(vaccine[demog_col]):
    prompt = f"You are {demog_prompt}.\n-------------------------------\n{survey_q}"
    
    # Define prompt with JSON structure
    prompt = [{"role": "user", "content": prompt}]
    
    # Generate response
    response = pipe(prompt, **generation_args)[0]['generated_text']
    
    # Checks which integer corresponds to the response
    possibles = ['1', '2', '3', '4', '5', '6', '7']
    pred = [int(x) for x in possibles if x in response]
    if len(pred) == 1:
        pred = pred[0]
        take_vaccine_preds.append(pred)
    else: # If no integer found, or too many found, append None
        take_vaccine_preds.append(None)
    
# Append generated responses to dataframe
vaccine[f'{demog_col}_preds'] = take_vaccine_preds
vaccine[f'{demog_col}_preds'].value_counts()

The generated responses can be compared with the actual data using a regression plot:

In [ ]:
# Comparing generated and actual data with a regression plot
ax = sns.regplot(
    x=f'{demog_col}_preds', y='take_vaccine', x_jitter=.1, y_jitter=.1, data=vaccine,
    scatter_kws={'alpha': 0.5}
)
x_y_lim = (.5, 7.5)
ax.set(xlim=x_y_lim, ylim=x_y_lim)
print(f"Correlation: {vaccine['take_vaccine'].corr(vaccine[f'{demog_col}_preds'])}")

**TASK 1**: Unfortunately, `'persona_a'` did not perform well. Try using other demographic columns (`'persona_b'`, `'persona_c'`) by replacing `'persona_a'` above to see how the generated responses change.

As you have now discovered, it can be challenging to generate responses to survey questions that reflect individual-level differences, especially when the LLM only has access to group-level (demographic) information. Were we to evaluate performance at the group level, we would likely find that the model is able to capture some of the differences between groups.
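
A group-level evaluation could be sketched as follows. The toy dataframe below is made up, and the `'gender'` values are assumptions about how the column is coded; the idea is simply to drop unparseable responses and compare the mean actual and generated response within each demographic group:

```python
import pandas as pd

# Toy data standing in for the notebook's dataframe (all values are made up)
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "male", "female"],
    "take_vaccine": [7, 5, 3, 4, 2, 6],
    "persona_a_preds": [6, 6, None, 5, 4, 7],
})

# Drop rows where no single integer could be parsed from the model output
valid = toy.dropna(subset=["persona_a_preds"])

# Mean actual vs. generated response per demographic group
group_means = valid.groupby("gender")[["take_vaccine", "persona_a_preds"]].mean()
group_means["difference"] = group_means["persona_a_preds"] - group_means["take_vaccine"]
print(group_means)
```

Applied to `vaccine`, the same pattern would show whether the model at least reproduces the direction of differences between groups, even when individual-level correlations are weak.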