# Using Large Language Models for Data Collection in Social Sciences


In [2]:
# Install required packages
%pip install -qU langchain[openai] tqdm pydantic krippendorff

In [3]:
import getpass  # Provides a secure way to handle user passwords or other sensitive input without echoing them on the screen.
import os  # Gives access to operating system functionalities like file paths, environment variables, and directory operations.
import pandas as pd  # Imports the pandas library (aliased as pd) for handling and analyzing tabular data in DataFrames.
import numpy as np  # Imports NumPy (aliased as np), a library for fast numerical computations and array manipulations.
from tqdm import tqdm  # Imports tqdm, a progress bar utility that provides visual feedback for loops and long-running processes.
from langchain.chat_models import init_chat_model  # Imports a function from LangChain to initialize a chat-based large language model (LLM) interface.
from langchain_core.prompts import ChatPromptTemplate  # Imports a template class for creating structured prompts used in LLM interactions.
from pydantic import BaseModel, Field  # Imports BaseModel (for defining structured data models) and Field (for specifying metadata and validation rules).
import krippendorff  # Imports the krippendorff library, used to compute Krippendorff's alpha, a measure of agreement between annotators.

## Gabrielle Martins van Jaarsveld's SoDa fellowship dataset

Feel free to use your own data. By default, we will use a toy dataset from Gabrielle Martins van Jaarsveld's SoDa fellowship project on annotating markers of self-regulated learning from student conversation data.

This dataset contains the following columns:

- `id`: The id of the row/conversation/student.

- `conversation`: The text of the conversations based on which specificity scores are derived (by humans or LLMs).

- `score_specificity_llm`: The specificity score of a conversation, obtained from carefully prompted LLM responses. It takes the value 0, 1 or 2.

- `score_specificity_human`: The specificity score of a conversation, assigned by human expert annotators. It is treated as the gold standard (i.e., free from measurement error). It takes the value 0, 1 or 2.

- `performance`: The academic performance of a student, varying from 1 to 10.

## Data loading

In [None]:
# url where you can download our example data
data_url = "https://sodascience.github.io/workshop_llm_data_collection/data/srl_data_example.csv"

# Read CSV into dataframe
df = pd.read_csv(data_url)

Display the first 10 rows of the dataset.

Note that only the first 10 rows contain the text of the conversations. We will use these texts for the prompting experiments to come.

In [None]:
df.head(10)
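If you bring your own data, it can help to check which rows actually contain text before prompting. The sketch below uses a small made-up DataFrame (`toy`, not the workshop data) and assumes that missing conversations are stored as empty cells (`NaN`/`None`); adjust the check if your file marks missing text differently.

```python
import pandas as pd

# Hypothetical stand-in for the workshop data: two rows with text, one without.
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "conversation": ["I will study chapter 3.", "Read more.", None],
})

# Boolean mask that is True where a conversation text is present
has_text = toy["conversation"].notna()

# Keep only the rows that can be sent to the model
annotatable = toy[has_text]
print(f"{int(has_text.sum())} of {len(toy)} rows contain conversation text")
```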

## Using langchain to call OpenAI's API

We will use the Python package `langchain` for our prompting experiments. A major advantage of `langchain` is that you do not have to learn a separate API for each LLM provider: you can switch between providers and models (both commercial and open-source) with only small changes to your `langchain` code.

We will be calling OpenAI's LLM in this notebook. Feel free to experiment with other APIs and models! To do so, check out https://python.langchain.com/docs/tutorials/chatbot.

Initialise an LLM. Enter your OpenAI API key when prompted. Don't have one? Ask the workshop instructors!

In [None]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")
model = init_chat_model("gpt-4o-mini",
                        model_provider="openai",
                        temperature=0,
                        max_tokens=1000)

## Working with a single prompt

Let's start with the system prompt (i.e., high-level instruction to the model).

In [None]:
system_prompt = """
You are an expert in educational assessment and goal evaluation, with
specialized expertise in applying deductive coding schemes to score the quality
and content of student goals.

##TASK##
A university student was given a series of prompts, guiding them through the
process of setting and elaborating on an academic goal for the coming week. You
will be provided with the entire conversation including the prompts, and the
student answers. Your objective is to assess the specificity of the student’s
goal on a scale of 0 to 2 based on the entire conversation.
"""

Use the prompt template module from langchain to create a prompt request with both **system** and **user** prompts.

In [None]:
prompt_template = ChatPromptTemplate([
    ("system", system_prompt),
    ("user", "{conversation}")
])

# Create a prompt request with the system prompt and the user prompt based on
# the first conversation from the dataset
prompt_request = prompt_template.invoke({"conversation": df.iloc[0,1]})

Check the prompt:

In [None]:
prompt_request.to_messages()
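To make the role and placeholder mechanics concrete, here is a rough stdlib-only sketch of what filling a chat template amounts to. It is an illustration of the idea, not LangChain's implementation: `fill_template` is a hypothetical helper, and the system text and conversation are made-up examples.

```python
# Hypothetical helper illustrating the idea behind a chat prompt template.
def fill_template(template, **variables):
    """Return a list of {"role", "content"} dicts with placeholders filled in."""
    return [
        {"role": role, "content": text.format(**variables)}
        for role, text in template
    ]

# Each entry pairs a role with a text that may contain {placeholders}
template = [
    ("system", "You score the specificity of student goals from 0 to 2."),
    ("user", "{conversation}"),
]

messages = fill_template(template, conversation="I will read chapter 3 by Friday.")
for message in messages:
    print(f"{message['role']}: {message['content']}")
```

The system message stays fixed across requests, while the user message changes with each conversation.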

Prompt the model and inspect the response!

In [None]:
response = model.invoke(prompt_request)
print(response.content)

Voila! You have your first successful prompting interaction with the API of a large language model!

## Working with multiple prompts
Next, we go beyond a single prompt and send **multiple prompts** in one run, one request per conversation.

Define a list of request IDs and a list of conversations (needed to form the user prompts).

In [None]:
ids = df.id[:10].tolist()
conversations = df.conversation[:10].tolist()

In [None]:
responses = {}
# Use `request_id` rather than `id` to avoid shadowing Python's built-in id()
for request_id, conversation in tqdm(zip(ids, conversations),
                                     total=len(ids),
                                     desc="Processing Requests"):
    prompt_request = prompt_template.invoke({"conversation": conversation})
    response = model.invoke(prompt_request)
    responses[request_id] = response.content

Inspect the responses!

In [None]:
print(responses['request_3'])
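With many requests, a single failed API call (for example a timeout) would stop the loop above and lose the progress made so far. One possible way to guard against this is sketched below with plain Python. `annotate_all` and `fake_annotate` are hypothetical names; in the notebook, the function passed in would wrap `prompt_template.invoke` and `model.invoke`.

```python
def annotate_all(annotate, ids, conversations):
    """Apply `annotate` to each conversation, collecting failures separately."""
    results, failures = {}, {}
    for request_id, conversation in zip(ids, conversations):
        try:
            results[request_id] = annotate(conversation)
        except Exception as error:  # e.g. a timeout or rate-limit error
            # Record the problem and carry on with the next request
            failures[request_id] = repr(error)
    return results, failures

# Hypothetical stand-in for a real model call; it fails on empty input.
def fake_annotate(conversation):
    if not conversation:
        raise ValueError("empty conversation")
    return f"scored: {conversation}"

results, failures = annotate_all(fake_annotate, ["r1", "r2"], ["Read chapter 3.", ""])
print(results)
print(failures)
```

The requests listed in `failures` can then be retried on their own, without repeating the calls that already succeeded.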

## Using structured output with a single prompt

To make the LLM return output in a format you specify, you can define a schema with `BaseModel` and `Field` from the `pydantic` package and pass it to the model.

Below, we define our desired output format as:
- "specificity_score": an integer (either 0, 1 or 2) reflecting the specificity of a conversation.
- "reasoning": a string that provides the model's reasoning.

In [None]:
class SpecificityFormat(BaseModel):
    """Always use this tool to structure your response to the user."""
    specificity_score: int = Field(description="The specificity score of the entire conversation on a scale of 0, 1 and 2.")
    reasoning: str = Field(description="Your reasoning process.")
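Note that `int` alone does not restrict the score to 0, 1 or 2. A possible stricter variant is sketched below: `SpecificityFormatStrict` is a hypothetical name, and the `ge`/`le` bounds are an addition that is not part of the workshop schema. The sketch only demonstrates local validation with `pydantic`; whether a given LLM provider accepts these extra constraints in the schema it receives is not checked here.

```python
from pydantic import BaseModel, Field, ValidationError

class SpecificityFormatStrict(BaseModel):
    """Hypothetical stricter variant of SpecificityFormat."""
    # ge/le make pydantic reject any score outside the range 0..2
    specificity_score: int = Field(ge=0, le=2, description="Specificity score: 0, 1 or 2.")
    reasoning: str = Field(description="Your reasoning process.")

# A valid response passes validation
ok = SpecificityFormatStrict(specificity_score=2, reasoning="Names a task and a deadline.")
print(ok.specificity_score)

# An out-of-range score raises a ValidationError
try:
    SpecificityFormatStrict(specificity_score=5, reasoning="Out of range.")
except ValidationError as error:
    print("Rejected:", len(error.errors()), "validation error(s)")
```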

Try with a single prompt request.

In [None]:
# Bind the SpecificityFormat schema to the model
model_structured = model.with_structured_output(SpecificityFormat)

# Try to run a request through this new model
prompt_request = prompt_template.invoke({"conversation": df.iloc[0,1]})
structured_response = model_structured.invoke(prompt_request)

In [None]:
dict(structured_response)

## Using structured output with multiple prompts

Looping over multiple prompts and collecting structured output for each will save you a substantial amount of time in research projects!

In [None]:
structured_responses = {}
for request_id, conversation in tqdm(zip(ids, conversations), total=len(ids), desc="Processing Messages"):
    prompt_request = prompt_template.invoke({"conversation": conversation})
    structured_response = model_structured.invoke(prompt_request)
    # Below we save only the specificity scores
    structured_responses[request_id] = structured_response.specificity_score

Display all the structured responses (only the specificity scores).

In [None]:
structured_responses

## Check annotation quality

Implement a handy function to calculate Krippendorff's Alpha (i.e., agreement) between two lists of specificity scores.

In [None]:
def compute_krippendorff_alpha(x: list[int], y: list[int]) -> float:
    # Format data into a reliability matrix (rows=raters, cols=items)
    data_krippendorff = np.array([x, y])
    # Compute Krippendorff’s Alpha (interval metric)
    kripp_alpha = krippendorff.alpha(reliability_data=data_krippendorff, level_of_measurement='interval')
    return kripp_alpha
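To see what the number means, recall that alpha is defined as 1 - D_o / D_e, where D_o is the observed disagreement and D_e is the disagreement expected by chance. The sketch below computes this by hand for the special case of two raters, no missing values and the interval metric (squared differences). `alpha_interval_two_raters` is a hypothetical helper and the scores are made-up toy values; the `krippendorff` package handles the general case and should be preferred in practice.

```python
def alpha_interval_two_raters(x, y):
    """Krippendorff's alpha (interval metric) for two raters, no missing values."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n_units = len(x)
    pooled = list(x) + list(y)
    n_values = len(pooled)
    # Observed disagreement: mean squared difference between the two raters per item
    d_observed = sum((a - b) ** 2 for a, b in zip(x, y)) / n_units
    # Expected disagreement: squared differences over all ordered pairs of pooled
    # values, divided by the number of pairs of distinct values, n * (n - 1)
    d_expected = sum((v - w) ** 2 for v in pooled for w in pooled) / (n_values * (n_values - 1))
    if d_expected == 0:
        return float("nan")  # undefined when every score is identical
    return 1 - d_observed / d_expected

# Made-up toy scores: the raters disagree on the last item only
llm_scores = [0, 1, 2, 1]
human_scores = [0, 1, 2, 2]
print(alpha_interval_two_raters(llm_scores, human_scores))
```

An alpha of 1 indicates perfect agreement, while values near 0 indicate agreement no better than chance.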

Let's check the agreement between the specificity scores we got from the LLM above and the human expert-coded specificity scores!

In [None]:
score_specificity_human = df.score_specificity_human[:10].tolist()
structured_response_values = list(structured_responses.values())
print("Krippendorff's Alpha:", compute_krippendorff_alpha(structured_response_values, score_specificity_human))

Not a great agreement score!

How about the agreement between the LLM specificity scores that already came with the dataset (i.e., column `score_specificity_llm`) and the human expert-coded scores?

Note that `score_specificity_llm` is based on prompts that were carefully engineered by Gabrielle.

In [None]:
score_specificity_llm = df.score_specificity_llm[:10].tolist()
print("Krippendorff's Alpha:", compute_krippendorff_alpha(score_specificity_llm, score_specificity_human))

Wow! A much better result!

## Exercise: Try different prompting techniques to get better results!

For example:

1. Improve clarity & specificity
2. Role-based prompting
3. Step-by-step reasoning (Chain-of-Thought Prompting)
4. Few-shot prompting
5. Output structuring
6. Self-consistency prompting

Use the previous `compute_krippendorff_alpha` function to check the LLM's annotation quality.
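As a starting point for self-consistency prompting, the idea is to sample several answers for the same conversation and keep the most frequent score. A minimal sketch of the voting step follows. `self_consistent_score` and `fake_sampler` are hypothetical names; in practice the sampler would call the structured model, and the model would need a temperature above 0 so that repeated calls can differ.

```python
from collections import Counter

def self_consistent_score(sample_score, conversation, n_samples=5):
    """Call `sample_score` several times and return the most frequent score."""
    votes = [sample_score(conversation) for _ in range(n_samples)]
    # most_common(1) returns [(score, count)]; ties go to the score seen first
    return Counter(votes).most_common(1)[0][0]

# Stand-in for repeated model calls (hypothetical): a fixed sequence of scores.
fake_outputs = iter([1, 2, 1, 0, 1])

def fake_sampler(conversation):
    return next(fake_outputs)

final_score = self_consistent_score(fake_sampler, "I will read chapter 3 by Friday.", n_samples=5)
print(final_score)
```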

In [None]:
# Let's write some code!