<a href="https://colab.research.google.com/github/danadria/Skills-Lab-Introduction-to-Transformers-BERT-and-Explainable-NLP/blob/main/notebooks/llm_data_collection_py.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Large Language Models for Data Collection in Social Sciences


In [1]:
# Install required packages
%pip install -qU "langchain[openai]" tqdm pydantic krippendorff


In [2]:
import getpass
import os
import pandas as pd
import numpy as np
from tqdm import tqdm
from langchain.chat_models import init_chat_model
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field
import krippendorff

## Gabrielle Martins van Jaarsveld's SoDa fellowship dataset

Feel free to use your own data. By default, we will use a toy dataset from Gabrielle Martins van Jaarsveld's SoDa fellowship project on annotating markers of self-regulated learning from student conversation data.

This dataset contains the following columns:

- `id`: The id of the row/conversation/student.

- `conversation`: The conversation text from which the specificity scores are derived (by humans or LLMs).

- `score_specificity_llm`: The specificity score of a conversation, based on carefully prompted responses from an LLM. It takes the value 0, 1 or 2.

- `score_specificity_human`: The specificity score of a conversation, assigned by human expert annotators. It is treated as the gold standard (i.e., free from measurement error). It takes the value 0, 1 or 2.

- `performance`: The academic performance of a student, varying from 1 to 10.

## Data loading

In [3]:
# url where you can download our example data
data_url = "https://sodascience.github.io/workshop_llm_data_collection/data/srl_data_example.csv"

# Read CSV into dataframe
df = pd.read_csv(data_url)

Display the first 10 rows of the dataset.

Note that only the first 10 rows contain the text of the conversations. We will use these texts for the prompting experiments to come.

In [4]:
df.head(10)

Unnamed: 0,id,conversation,score_specificity_llm,score_specificity_human,performance
0,request_1,PROMPT: Set an academic goal for the upcoming ...,1,1.0,3.5
1,request_2,PROMPT: Set an academic goal for the upcoming ...,2,2.0,6.8
2,request_3,PROMPT: Set an academic goal for the upcoming ...,0,0.0,6.7
3,request_4,PROMPT: Set an academic goal for the upcoming ...,1,1.0,7.3
4,request_5,PROMPT: Set an academic goal for the upcoming ...,1,1.0,5.8
5,request_6,PROMPT: Set an academic goal for the upcoming ...,1,1.0,6.4
6,request_7,PROMPT: Set an academic goal for the upcoming ...,2,1.0,5.9
7,request_8,PROMPT: Set an academic goal for the upcoming ...,0,0.0,7.6
8,request_9,PROMPT: Set an academic goal for the upcoming ...,1,1.0,7.6
9,request_10,PROMPT: Set an academic goal for the upcoming ...,2,2.0,7.0


## Using langchain to call OpenAI's API

We will be using the Python package `langchain` to perform our prompting experiments. One great advantage of `langchain` is that you do not have to learn a separate API for each LLM provider. Instead, the same code can call different LLMs (both commercial and open-source) with only small modifications!

We will be calling OpenAI's LLM in this notebook. Feel free to experiment with other APIs and models! To do so, check out https://python.langchain.com/docs/tutorials/chatbot.

Initialise an LLM. Enter your OpenAI API key when prompted. Don't have one? Ask the workshop instructors!

In [12]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")
model = init_chat_model("gpt-4o-mini",
                        model_provider="openai",
                        temperature=0,
                        max_tokens=1000)

Enter API key for OpenAI: ··········


## Working with a single prompt

Let's start with the system prompt (i.e., high-level instruction to the model).

In [50]:
system_prompt = """
"You are an expert in educational assessment and goal evaluation, with specialized expertise in applying deductive coding schemes to score the quality and content of student goals.
You have a deep understanding of scoring rubrics and are highly skilled at analysing goals for specific characteristics according to well-defined criteria.

##TASK##
A university student was given a series of prompts, guiding them through the
process of setting and elaborating on an academic goal for the coming week. You
will be provided with the entire conversation including the prompts, and the
student answers. Your objective is to assess the specificity of of the student’s
goal on a scale of 0 to 2 based on the entire conversation.

##EVALUATION CRITERIA##
Assign the scores based on the following criteria:
SPECIFICITY: ASSESS the extent to which the goal is specific rather than general.
DETERMINE if goal is measurable, assessable, documentable, or observable. Is the outcome measurable, and is it possible to track progress while working on the goal?
PERSONAL IMPORTANCE: DETERMINE if there is an explicit reason for the goal which outlines why this goal is important to achieve on the basis of previous experience or in the context of future goals.
MULTI-SOURCE PLANNING: EXAMINE whether there are specific activities mentioned, and whether these activities directly relate to the goal. Is there a schedule included mentioning days or times of day for working on these activities and accomplishing the goal?

##SCORING INSTRUCTIONS##
For each category, ASSIGN a score of 0 (lowest), 1, or 2 (highest).
IDENTIFY the key elements that distinguish a low score (0) from a high score (2)
in each category and provides reasoning for each criteria.
Base on all the scores received, produce an average, overall score that is most representative of the goal quality of the student.

##EDGE CASE HANDLING##
If a goal is ambiguous or unclear, SCORE it on the lower end.
If a goal appears to partially meet the criteria for two different scores, SELECT
the score that best reflects the majority of the goals characteristics for that category.

##WHAT NOT TO DO##
Never apply personal opinion or assumptions outside the rubric criteria.
never give a score without a detailed explanation, even if the scoring seems obvious.
never modify or assume student intent score the goal exactly as written.
never ignore the rubric or provided examples when scoring
"""

Use the prompt template module from langchain to create a prompt request with both **system** and **user** prompts.

In [51]:
prompt_template = ChatPromptTemplate([
    ("system", system_prompt),
    ("user", "{conversation}")
])

# Create a prompt request with the system prompt and the user prompt based on
# the first conversation from the dataset
prompt_request = prompt_template.invoke({"conversation": df["conversation"].iloc[0]})

Check the prompt:

In [52]:
prompt_request.to_messages()

[SystemMessage(content='\n"You are an expert in educational assessment and goal evaluation, with specialized expertise in applying deductive coding schemes to score the quality and content of student goals. \nYou have a deep understanding of scoring rubrics and are highly skilled at analysing goals for specific characteristics according to well-defined criteria.\n\n##TASK##\nA university student was given a series of prompts, guiding them through the\nprocess of setting and elaborating on an academic goal for the coming week. You\nwill be provided with the entire conversation including the prompts, and the\nstudent answers. Your objective is to assess the specificity of of the student’s\ngoal on a scale of 0 to 2 based on the entire conversation.\n\n##EVALUATION CRITERIA##\nAssign the scores based on the following criteria:\nSPECIFICITY: ASSESS the extent to which the goal is specific rather than general. \nDETERMINE if goal is measurable, assessable, documentable, or observable. Is th

Prompt the model and inspect the response!

In [53]:
response = model.invoke(prompt_request)
print(response.content)

### Evaluation of the Student's Goal

**1. Specificity:**
- **Score: 1**
- **Reasoning:** The goal is somewhat specific as it mentions catching up on geography reading and provides options (reading the book or friends' notes). However, it lacks precise details about which specific chapters or pages to read, and it does not specify a clear outcome or target for completion. A score of 2 would require a more detailed plan with specific chapters or a clear target for completion.

**2. Measurable:**
- **Score: 1**
- **Reasoning:** The student mentions measuring progress by the number of pages written per day, which is a measurable aspect. However, it does not specify how many pages are to be read or written in total, nor does it outline a clear method for tracking this progress over the week. A score of 2 would require a more comprehensive measurement plan.

**3. Personal Importance:**
- **Score: 2**
- **Reasoning:** The student clearly articulates the importance of the goal by stating that

Voilà! You have completed your first successful prompting interaction with the API of a large language model!

## Working with multiple prompts
Next, we go beyond a single prompt and send **multiple prompts** in a loop, one request per conversation.

Define a list of request IDs and another list of conversations (needed to form the user prompts).

In [54]:
ids = df.id[:10].tolist()
conversations = df.conversation[:10].tolist()

In [55]:
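responses = {}
for request_id, conversation in tqdm(zip(ids, conversations),
                                     total=len(ids),
                                     desc="Processing Requests"):
    # Build the prompt for this conversation and send it to the model
    prompt_request = prompt_template.invoke({"conversation": conversation})
    response = model.invoke(prompt_request)
    # Store the raw text response under its request ID
    responses[request_id] = response.content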

Processing Requests: 100%|██████████| 10/10 [01:33<00:00,  9.39s/it]


Inspect the responses!

In [56]:
print(responses['request_3'])

### Evaluation of the Student's Goal

**1. Specificity:**
- **Score: 1**
- **Reasoning:** The goal "to not procrastinate" is quite general and lacks specificity. The student attempts to add details by mentioning reducing phone time and not leaving reading until the last minute, which provides some context. However, the goal remains vague as it does not specify what "not procrastinating" entails in measurable terms. The details provided do not clearly define what success looks like or how it can be tracked.

**2. Measurable Outcome:**
- **Score: 1**
- **Reasoning:** The student mentions measuring progress by "finishing work at a certain time of the day." While this indicates a measurable aspect, it lacks clarity on what that specific time is or how it will be tracked. The goal does not provide a clear metric for success, making it difficult to assess progress effectively.

**3. Personal Importance:**
- **Score: 1**
- **Reasoning:** The student states that the goal is important "to help 
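The free-text responses above bury the scores inside Markdown, so turning them into numbers means writing parsing code by hand. A minimal sketch, assuming every score follows the `**Score: N**` pattern seen in the outputs above (the helper name `extract_scores` is hypothetical and not part of the workshop code):

```python
import re

def extract_scores(response_text: str) -> list[int]:
    """Return every integer that appears as '**Score: N**' in a free-text response."""
    return [int(s) for s in re.findall(r"\*\*Score:\s*(\d+)\*\*", response_text)]

# Short excerpt in the style of the responses printed above
example = """**1. Specificity:**
- **Score: 1**
**2. Measurable Outcome:**
- **Score: 1**
"""
print(extract_scores(example))
```

This kind of parsing breaks as soon as the model changes its wording, and it yields one score per criterion rather than a single overall score. That fragility is the motivation for structured output.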

## Using structured output with a single prompt

To make the LLM return output in a format that you specify, define a schema with the `BaseModel` and `Field` classes from the `pydantic` package.

Below, we define our desired output format as:
- "specificity_score": an integer (either 0, 1 or 2) reflecting the specificity of a conversation.
- "reasoning": a string that provides the model's reasoning.

In [57]:
class SpecificityFormat(BaseModel):
    """Always use this tool to structure your response to the user."""
    specificity_score: int = Field(description="The specificity score of the entire conversation on a scale of 0, 1 and 2.")
    reasoning: str = Field(description="Your reasoning process.")
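The `int` annotation alone accepts any integer, so a score of 3 would pass unnoticed. A possible variant, sketched here under the assumption that `pydantic`'s `ge`/`le` constraints are used (the class name `BoundedSpecificityFormat` is hypothetical), adds a range check on the Python side. Whether the LLM provider itself enforces the bounds is not guaranteed.

```python
from pydantic import BaseModel, Field, ValidationError

class BoundedSpecificityFormat(BaseModel):
    """Hypothetical variant of SpecificityFormat with a range check on the score."""
    specificity_score: int = Field(ge=0, le=2, description="Specificity score: 0, 1 or 2.")
    reasoning: str = Field(description="Your reasoning process.")

# A value inside the range is accepted
ok = BoundedSpecificityFormat(specificity_score=2, reasoning="Clear, measurable goal.")
print(ok.specificity_score)

# A value outside the range raises a ValidationError instead of slipping through
try:
    BoundedSpecificityFormat(specificity_score=3, reasoning="Out of range.")
except ValidationError as err:
    print("Rejected:", type(err).__name__)
```

With a schema like this, an out-of-range score should surface as an error when the response is parsed, rather than silently entering the agreement analysis.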

Try with a single prompt request.

In [58]:
# Bind the SpecificityFormat schema to the model
model_structured = model.with_structured_output(SpecificityFormat)

# Try running a request through this new model
prompt_request = prompt_template.invoke({"conversation": df["conversation"].iloc[0]})
structured_response = model_structured.invoke(prompt_request)

In [59]:
dict(structured_response)

{'specificity_score': 1,
 'reasoning': "1. SPECIFICITY: The goal is somewhat specific as it mentions catching up on geography reading, but it lacks precise details about the exact pages or chapters to be read. The mention of reading friends' notes adds some clarity, but it is still vague regarding the exact content. Score: 1.\n\n2. MEASURABILITY: The goal includes a measurable aspect by stating progress will be tracked by the number of pages written per day. However, it does not specify how many pages are expected to be read or written, which limits the ability to fully assess progress. Score: 1.\n\n3. PERSONAL IMPORTANCE: The student provides a reason for the goal, indicating that falling behind could affect their exam readiness. This shows some personal importance, but it could be more detailed regarding how this goal aligns with their overall academic aspirations. Score: 1.\n\n4. MULTI-SOURCE PLANNING: The step-by-step plan includes evaluating the workload, seeking help from friends

## Using structured output with multiple prompts

Combining the loop over multiple prompts with structured output will save you a substantial amount of time in research projects!

In [60]:

structured_responses = {}
for request_id, conversation in tqdm(zip(ids, conversations), total=len(ids), desc="Processing Messages"):
    prompt_request = prompt_template.invoke({"conversation": conversation})
    structured_response = model_structured.invoke(prompt_request)
    # Keep only the specificity score; the reasoning is discarded
    structured_responses[request_id] = structured_response.specificity_score

Processing Messages: 100%|██████████| 10/10 [00:52<00:00,  5.27s/it]


Display all the structured responses (only the specificity scores).

In [61]:
structured_responses

{'request_1': 1,
 'request_2': 2,
 'request_3': 1,
 'request_4': 2,
 'request_5': 1,
 'request_6': 2,
 'request_7': 2,
 'request_8': 1,
 'request_9': 1,
 'request_10': 2}

## Check annotation quality

Implement a handy function to calculate Krippendorff's Alpha (i.e., agreement) between two lists of specificity scores.

In [62]:
def compute_krippendorff_alpha(x: list[int], y: list[int]) -> float:
    """Return Krippendorff's Alpha between two raters' scores for the same items."""
    # Format data into a reliability matrix (rows=raters, cols=items)
    data_krippendorff = np.array([x, y])
    # Compute Krippendorff's Alpha, treating the scores as interval data
    kripp_alpha = krippendorff.alpha(reliability_data=data_krippendorff, level_of_measurement='interval')
    return kripp_alpha

Let's check the agreement between the specificity scores we got from the LLM above and the human expert-coded specificity scores!

In [63]:
score_specificity_human = df.score_specificity_human[:10].tolist()
structured_response_values = list(structured_responses.values())
print("Krippendorff's Alpha:", compute_krippendorff_alpha(structured_response_values, score_specificity_human))

Krippendorff's Alpha: 0.3870967741935484


Not a great agreement score! As a common rule of thumb, values of at least 0.8 indicate reliable agreement.
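A single agreement number does not show where the two raters differ. A small sketch that cross-tabulates the scores, using only the standard library and the values copied from the outputs shown earlier in this notebook (the variable names are illustrative):

```python
from collections import Counter

# Scores copied from the outputs shown earlier in this notebook (first 10 rows)
llm_scores = [1, 2, 1, 2, 1, 2, 2, 1, 1, 2]
human_scores = [1, 2, 0, 1, 1, 1, 1, 0, 1, 2]

# Count each (LLM score, human score) combination
pair_counts = Counter(zip(llm_scores, human_scores))
for (llm, human), count in sorted(pair_counts.items()):
    print(f"LLM={llm} human={human}: {count}")

# Share of conversations where both raters give the same score
exact_agreement = sum(llm == human for llm, human in zip(llm_scores, human_scores)) / len(llm_scores)
# Positive values mean the LLM scored higher than the human annotator
differences = [llm - human for llm, human in zip(llm_scores, human_scores)]
print("Exact agreement:", exact_agreement)
print("Differences (LLM - human):", differences)
```

In this sample the two raters agree on half of the conversations, and every disagreement is the LLM scoring one point higher than the human annotator. That suggests the prompt is too lenient rather than randomly noisy.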

How about the agreement between the LLM specificity scores that already came with the dataset (i.e., column `score_specificity_llm`) and the human expert-coded scores?

Note that `score_specificity_llm` is based on prompts that were carefully engineered by Gabrielle.

In [64]:
score_specificity_llm = df.score_specificity_llm[:10].tolist()
print("Krippendorff's Alpha:", compute_krippendorff_alpha(score_specificity_llm, score_specificity_human))

Krippendorff's Alpha: 0.8938547486033519


Wow! Much better result!
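To make the statistic less of a black box, here is a sketch of what it computes in the simplest case: two raters, no missing values, and the interval metric. Alpha is one minus the ratio of observed to expected disagreement. The function name `interval_alpha_two_raters` is hypothetical, and the scores are copied from the outputs shown earlier in this notebook. This is not a replacement for the `krippendorff` package, which also handles missing values and other measurement levels.

```python
from itertools import permutations

def interval_alpha_two_raters(x: list[float], y: list[float]) -> float:
    """Krippendorff's alpha (interval metric) for two raters and no missing values."""
    if len(x) != len(y):
        raise ValueError("Both raters must score the same items.")
    # Observed disagreement: mean squared difference between the two scores of each item
    d_o = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    # Expected disagreement: mean squared difference over all ordered pairs of pooled
    # scores, i.e. what we would see if scores were assigned to items at random
    pooled = list(x) + list(y)
    n = len(pooled)
    d_e = sum((a - b) ** 2 for a, b in permutations(pooled, 2)) / (n * (n - 1))
    return 1 - d_o / d_e

# Scores copied from the outputs shown earlier in this notebook (first 10 rows)
human = [1, 2, 0, 1, 1, 1, 1, 0, 1, 2]
llm_ours = [1, 2, 1, 2, 1, 2, 2, 1, 1, 2]
llm_dataset = [1, 2, 0, 1, 1, 1, 2, 0, 1, 2]

print("Our prompt:    ", interval_alpha_two_raters(llm_ours, human))
print("Dataset scores:", interval_alpha_two_raters(llm_dataset, human))
```

Both results should match the values printed above up to floating-point rounding. The gap comes mainly from the observed disagreement: 5 of the 10 items differ under our prompt, against 1 of 10 for the scores that came with the dataset.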

## Exercise: Try different prompting techniques to get better results!

For example:

1. Improve clarity & specificity
2. Role-based prompting
3. Step-by-step reasoning (Chain-of-Thought Prompting)
4. Few-shot prompting
5. Output structuring
6. Self-consistency prompting

Use the previous `compute_krippendorff_alpha` function to check the LLM's annotation quality.
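As a starting point for self-consistency prompting, one option is to score each conversation several times and keep the most frequent score. The sketch below covers only the aggregation step. The helper name `majority_vote` and the sample scores are hypothetical, and breaking ties towards the lower score is an assumption that mirrors the prompt's instruction to score ambiguous goals on the lower end.

```python
from collections import Counter

def majority_vote(scores: list[int]) -> int:
    """Return the most frequent score; ties go to the lower score."""
    counts = Counter(scores)
    top = max(counts.values())
    return min(score for score, count in counts.items() if count == top)

# Hypothetical scores from five repeated runs on the same conversation
sampled_scores = [1, 2, 1, 1, 2]
print(majority_vote(sampled_scores))
```

Note that the model above was initialised with `temperature=0`, so repeated runs would mostly return the same answer. Self-consistency only helps when the model is sampled with a temperature above 0.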

In [None]:
# Let's write some code!