<a href="https://colab.research.google.com/github/fqixiang/workshop_llm_data_collection/blob/main/notebooks/llm_data_collection_R.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Large Language Models for Data Collection in Social Sciences

In [None]:
# Install and load required packages
install.packages(c("ellmer", "patchwork", "irr"))
library(ellmer) # for LLM API interfacing in R
library(patchwork) # for combining multiple ggplot visualizations
library(irr) # for computing inter-rater reliability metrics such as Cohen’s Kappa or Krippendorff’s Alpha

# Loads the 'tidyverse' collection of packages for data manipulation and visualization.
library(tidyverse)

## Gabrielle Martins van Jaarsveld's SoDa fellowship dataset

Feel free to use your own data. By default, we will use a toy dataset from Gabrielle Martins van Jaarsveld's SoDa fellowship project on annotating markers of self-regulated learning from student conversation data.

This dataset contains the following columns:

- `id`: The id of the row/conversation/student.

- `conversation`: The conversation text from which the specificity scores are derived (by humans or LLMs).

- `score_specificity_llm`: The specificity score of a conversation, obtained from carefully prompted LLM responses. It takes the value 0, 1, or 2.

- `score_specificity_human`: The specificity score of a conversation assigned by human expert annotators. It is treated as the gold standard (i.e., free from measurement error). It takes the value 0, 1, or 2.

- `performance`: The academic performance of a student, varying from 1 to 10.

Load the data into a dataframe.

In [None]:
# Download our example data
data_url <- "https://sodascience.github.io/workshop_llm_data_collection/data/srl_data_example.csv"

# Read CSV into dataframe
df <- read_csv(data_url)

Note that only the first 10 rows contain the text of the conversations. We will use these texts for the prompting experiments to come.

Display the first 10 rows of the dataset.

In [None]:
head(df, 10)

## Using ellmer to call OpenAI's API

We will use the R package `ellmer` for our prompting experiments. A major advantage of `ellmer` is that you do not have to learn a different API for each LLM provider: switching between providers (commercial or open-source) requires only small changes to your `ellmer` code.

We will be calling OpenAI's LLM in this notebook. Feel free to experiment with other APIs and models! To do so, check out https://ellmer.tidyverse.org/.

Let's first configure your OpenAI API key. Enter it when prompted. Don't have one? Ask the workshop instructors!

In [None]:
# Prompt user for API key
openai_api_key <- readline(prompt = "Enter API key for OpenAI: ")
Sys.setenv(OPENAI_API_KEY = openai_api_key)

Let's now define a function that makes a call to the OpenAI API for gpt-4o-mini.

In [None]:
call_openai_api <- function(system_prompt, user_prompt) {
  chat <- chat_openai(
    model = "gpt-4o-mini",
    system_prompt = system_prompt,
    api_args = list(temperature = 0,
                    max_tokens = 1000,
                    seed = 42),
    echo = "none" # suppress printing of the output
  )
  return(chat$chat(user_prompt))
}

## Working with a single prompt

Let's start with the system prompt (i.e., high-level instruction to the model).

In [None]:
system_prompt <- "
You are an expert in educational assessment and goal evaluation, with
specialized expertise in applying deductive coding schemes to score the quality
and content of student goals.

##TASK##
A university student was given a series of prompts, guiding them through the
process of setting and elaborating on an academic goal for the coming week. You
will be provided with the entire conversation including the prompts, and the
student answers. Your objective is to assess the specificity of of the student’s
goal on a scale of 0 to 2 based on the entire conversation.
"

Send a request that combines the system prompt with a user prompt, here the first conversation in the dataset.

In [None]:
response <- call_openai_api(
  system_prompt = system_prompt,
  user_prompt   = df[["conversation"]][1]
)
cat(response)

Voila! You have your first successful prompting interaction with the API of a large language model!

## Working with multiple prompts
Next, we are going beyond a single prompt. Instead, we will work with **multiple prompts** at the same time!

Loop over all 10 conversations.
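Each iteration makes a network request, so a single failed call (a timeout or a rate limit, for example) would stop the whole loop. One possible guard is sketched below. The names `safe_call` and `fake_api` are hypothetical and not part of `ellmer`; `fake_api` only stands in for a real API call so that the example runs without a key.

```r
# Hypothetical wrapper: run f(...) and return NA (with a warning) if it fails
safe_call <- function(f, ...) {
  tryCatch(
    f(...),
    error = function(e) {
      warning("Call failed: ", conditionMessage(e), call. = FALSE)
      NA_character_
    }
  )
}

# Stand-in for an API call that fails on empty input (for illustration only)
fake_api <- function(system_prompt, user_prompt) {
  if (!nzchar(user_prompt)) stop("empty prompt")
  paste("response to:", user_prompt)
}

# A successful call returns the response as usual
safe_call(fake_api, system_prompt = "sys", user_prompt = "hello")

# A failing call returns NA and raises a warning instead of an error
safe_call(fake_api, system_prompt = "sys", user_prompt = "")
```

Inside a loop, a call such as `call_openai_api(...)` could then be written as `safe_call(call_openai_api, system_prompt = ..., user_prompt = ...)`, and any `NA` entries retried afterwards.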

In [None]:
# initialize output list
responses <- list()

for (i in 1:10) {
  # extract current conversation and id
  current_convo <- df |> slice(i) |> pull("conversation")
  current_id    <- df |> slice(i) |> pull("id")

  # get response from llm
  response <- call_openai_api(
    system_prompt = system_prompt,
    user_prompt   = current_convo
  )

  # assign to output list, named by id so that non-sequential ids leave no gaps
  responses[[as.character(current_id)]] <- response

  # report progress (does not work well in colab)
  message(sprintf("%d/10 completed", i))
}

Let's inspect the responses!

In [None]:
cat(responses[[2]])

## Using structured output with a single prompt

To force the LLM to produce output in a format that you specify, use the `$chat_structured()` method instead of the `$chat()` method.

Below, we define our desired output format as:
- "reasoning": a string that provides the model's reasoning.
- "specificity_score": an integer (either 0, 1 or 2) reflecting the specificity of a conversation.

In [None]:
output_structure <- type_object(
  specificity_score = type_integer("The specificity score of the entire conversation on a scale of 0, 1 and 2."),
  reasoning = type_string("Your reasoning process.")
)

call_openai_api_structured <- function(system_prompt, user_prompt) {
  chat <- chat_openai(
    model = "gpt-4o-mini",
    system_prompt = system_prompt,
    api_args = list(temperature = 0,
                    max_tokens = 1000,
                    seed = 42),
    echo = "none" # suppress printing of the output
  )
  response <- chat$chat_structured(user_prompt, type = output_structure)
  return(response)
}

Try with a single prompt request.

In [None]:
structured_response <- call_openai_api_structured(system_prompt, df[["conversation"]][1])
print(structured_response)

## Using structured output with multiple prompts

Being able to work with multiple prompts at the same time and obtain structured output will save you a substantial amount of time in research projects!

In [None]:
structured_responses <- list()
for (i in 1:10) {
  # extract current conversation and id
  current_convo <- df |> slice(i) |> pull("conversation")
  current_id    <- df |> slice(i) |> pull("id")

  # get response from llm
  response <- call_openai_api_structured(
    system_prompt = system_prompt,
    user_prompt   = current_convo
  )

  # assign to output list, named by id so that non-sequential ids leave no gaps
  structured_responses[[as.character(current_id)]] <- response

  # report progress (does not work well in colab)
  message(sprintf("%d/10 completed", i))
}

Turn the structured responses into a data frame and show it.
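As a sketch of what this conversion does, the example below uses made-up responses in place of real API output. The ids, scores, and reasoning strings are invented for illustration, and `bind_rows()` from `dplyr` is used here as one way to stack a named list of lists into rows.

```r
library(dplyr)

# Made-up structured responses, keyed by hypothetical conversation ids
toy_responses <- list(
  "101" = list(specificity_score = 2L, reasoning = "Names a task, a time, and a place."),
  "102" = list(specificity_score = 0L, reasoning = "States only a general intention.")
)

# Each inner list becomes one row; the list names go into the `id` column
toy_df <- bind_rows(toy_responses, .id = "id")
toy_df

# Sanity check: every score should be one of the allowed values
all(toy_df$specificity_score %in% 0:2)
```

A check like the last line is cheap to run before computing agreement, since a score outside 0 to 2 would silently distort the result.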

In [None]:
# use map_dfr from purrr package to turn list into dataframe
response_df <- map_dfr(structured_responses, I, .id = "id")
response_df

## Check annotation quality

The `kripp.alpha()` function from the `irr` package can be used to calculate agreement (i.e., Krippendorff's Alpha) of specificity scores between two raters (e.g., LLMs and human experts).
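As a quick illustration of the input it expects, here is a minimal sketch on made-up ratings. The vectors `human` and `llm` are invented for this example and are not taken from the dataset. An alpha of 1 means perfect agreement and 0 means agreement no better than chance.

```r
library(irr)

# Made-up scores for six items from two hypothetical raters
human <- c(0, 1, 2, 2, 1, 0)
llm   <- c(0, 1, 2, 1, 1, 0)

# kripp.alpha() expects rows = raters and columns = items
toy_matrix <- rbind(human, llm)

# Identical ratings give an alpha of 1
perfect <- kripp.alpha(rbind(human, human), method = "interval")
perfect$value

# A single disagreement (item 4) pulls alpha below 1
partial <- kripp.alpha(toy_matrix, method = "interval")
partial$value
```

The `method` argument sets how distances between scores are weighted; `"interval"` treats the gap between 0 and 2 as larger than the gap between 0 and 1.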

Let's check the agreement between the specificity scores we got from the LLM above and the human expert-coded specificity scores!

In [None]:
# create rating matrix (rows = raters, cols = items)
rating_matrix <- rbind(
  df |> slice(1:10) |> pull(score_specificity_human),
  response_df |> pull(specificity_score)
)

# compute agreement (1 = perfect, 0 = chance level; negative values are possible)
kripp.alpha(rating_matrix, method = "interval")

Not a great agreement score!

How about the agreement between the LLM specificity scores that already came with the dataset (i.e., column `score_specificity_llm`) and the human expert-coded scores?

Note that `score_specificity_llm` is based on prompts that were carefully engineered by Gabrielle.

In [None]:
rating_matrix <- rbind(
  df |> slice(1:10) |> pull(score_specificity_human),
  df |> slice(1:10) |> pull(score_specificity_llm)
)

kripp.alpha(rating_matrix, method = "interval")

Wow! Much better after some careful prompt engineering!

## Exercise: Try different prompting techniques to get better results!

For example:

1. Improve clarity & specificity
2. Role-based prompting
3. Step-by-step reasoning (Chain-of-Thought Prompting)
4. Few-shot prompting
5. Output structuring
6. Self-consistency prompting

Use the `kripp.alpha` function to check the LLM's annotation quality.
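For self-consistency prompting, one possible building block is a majority vote over the scores from several runs on the same conversation. The helper `majority_vote` below is hypothetical, and the scores are made up. Note that repeated runs will only differ if the temperature is above 0, which is an assumption about how you would configure the calls.

```r
# Hypothetical helper: return the most frequent score among repeated runs
majority_vote <- function(scores) {
  counts <- table(scores)
  winners <- names(counts)[counts == max(counts)]
  # ties are broken in favour of the lowest score (an arbitrary choice)
  as.integer(winners[1])
}

# Made-up scores from five hypothetical runs on the same conversation
majority_vote(c(2, 1, 2, 2, 0))
```

The voted score for each conversation can then be compared with the human scores in the same way as a single-run score.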

In [None]:
# Let's write some code!