This notebook is inspired by Chapter 1 of the book **"Natural Language Processing with Transformers: Building Language Applications with Hugging Face"** by Tunstall, von Werra, and Wolf.

You will gain first experience with pretrained models for specific tasks through the *pipeline* API of the Hugging Face Transformers library, which lets you run inference at a very high level of abstraction.

In each exercise, you will first define a *pipeline* for a specific task, using a pretrained model from the Hugging Face models library, and then apply the pipeline to text snippets from the course webpage.

## Resources:

- [pipeline documentation](https://huggingface.co/docs/transformers/main_classes/pipelines)
- [Hugging Face models library](https://huggingface.co/models)

## The input text

In [None]:
import pandas as pd
from transformers import pipeline

pd.set_option("display.max_rows", 100)

text = """Last year has shown multiple breakthroughs in deep learning, bringing large language
models to the mainstream. OpenAI's ChatGPT, Microsoft's new Bing Search and GitHub
Copilot, and Deep Mind's AlphaCode are the most prominent. While they still have many
flaws, they show a potential to transform many sectors of the economy, replace some
workers and make other vastly more productive.

NLP also has an immense potential to change research in economics. Most economists use
small and expensive structured datasets. NLP offers a way to work with novel data sources
that often can be scraped for free from the web. Examples are classifying speeches along
the political spectrum, classifying tweets to measure opinions, extracting concepts
mentioned in free-form survey replies, or translating questionnaires or datasets into
different languages.

This class is an introduction to deep learning and NLP for economists. Starting from
zero, the first half of the course focuses on learning the practical skills needed to
incorporate NLP into empirical workflows. We will use Huggingface's transformers library
and only work with pre-trained models for this. The second half of the class zooms in
and focuses on understanding what language models are, how they differ, and how they are
trained. We will write some purely didactical code in numpy and implement a few simple
models in PyTorch. The main focus of the second half is to build enough understanding to
work effectively with pre-trained models. It is beyond our scope and computational
resources to actually train large models."""

paragraphs = text.split("\n\n")

## Text classification

1. Define a classifier pipeline using the `"distilbert-base-uncased-finetuned-sst-2-english"` model.

2. Apply the pipeline to the entire text.

3. Apply the pipeline separately to each paragraph of the input text to extract sentiments.

4. Convert the output of the previous task to a pandas DataFrame.

In [None]:
classifier = pipeline(
    "text-classification", model="distilbert-base-uncased-finetuned-sst-2-english"
)

In [None]:
classifier(text)

In [None]:
sentiments = classifier(paragraphs)
sentiments

In [None]:
classifier("awful terrible bad horrible")

In [None]:
pd.DataFrame(sentiments)

In [None]:
classifier_emotion = pipeline(
    "text-classification",
    model="bhadresh-savani/distilbert-base-uncased-emotion",
    top_k=None,  # Return the scores for all labels (replaces the deprecated return_all_scores=True)
)

In [None]:
classifier_emotion(text)

In [None]:
emotions = classifier_emotion(paragraphs)
emotions

In [None]:
classifier_emotion("neutral")

## Named entity recognition
1. Define a named entity recognition (NER) pipeline using the `"dslim/bert-base-NER-uncased"` model.

2. Apply the pipeline to `text` and convert the result to a DataFrame.

In [None]:
ner_tagger = pipeline(
    "ner",
    model="dslim/bert-base-NER-uncased",
)
pd.DataFrame(ner_tagger(text))

## Aggregation strategies

Use the same model as before, but try out different aggregation strategies. Which one gives you the best results?

**Note**: Below we show a solution where you apply all strategies in a loop and combine the result in one DataFrame. Other solutions are completely ok. 

In [None]:
ner_results = []
for strategy in ["none", "simple", "first", "average", "max"]:
    tagger = pipeline(
        "ner",
        model="dslim/bert-base-NER-uncased",
        aggregation_strategy=strategy,
    )
    df = pd.DataFrame(tagger(text))
    df["strategy"] = strategy
    ner_results.append(df)

pd.concat(ner_results).set_index("strategy")

## Question answering

1. Define a question answering pipeline using the `"deepset/roberta-base-squad2"` model.

2. Come up with a few questions one might ask about the course logistics. 

3. Apply the pipeline to get answers to your questions.

**Do not trust any answer without double-checking it!**

In [None]:
logistics_text = """Text Data in Economics
 
The goal of this course is to equip you with modern text data methods and integrate these techniques into your research. You should know what type of text methods are used in economic research and have some hands-on experience with text data methods. By the end of the course, you should have an actionable research plan.
The course begins with introductory lectures on text data methods and showcases examples of how these methods have been used in economics research. The topics covered include tokenisation, distance in text, vectorisation, and the use of large language models.
Following the introductory lectures, we start working on our own text data project and developing our own research ideas using text data. We work individually or in groups depending on the number of students. First, we develop research questions and consider potential data sources that would allow us to answer the question. We explore available data sources and design the analysis we would run on our data. We present the ideas to the group, write a research proposal, and give feedback to our peers. In the end, we write a term paper detailing the research question, contributions to the literature, data sources, analysis, and an empirical part using text data. If data is too difficult to acquire during the course, students can do a separate text data exercise. Ideally, the plan leads to a proper research project.
The course is for economics students in all fields interested in text methods and you will be encouraged to work on a project in your field of interest. However, most examples in the lectures will focus on my fields of expertise like political economy and economic development.
Prerequisites:
Some Python programming experience is required (or excellent skills in another programming language and a willingness to acquire the necessary skills very fast).
There are also prerequisite courses set by the administration: Mathematics for Economists and Basic module Econometrics. I cannot give you credit if you have not done the prerequisite courses!

Requirements:
You develop and present a research idea to the group.
You read your peers’ proposals and provide short, written feedback with suggestions on how to improve them.
You engage in the group discussion after each presentation.
At the end of the course, you hand in a term paper on the final version of your research proposal.
 
Grading: The final grade is a weighted average of your presentation and participation in the group discussions (40%) and your term paper (60%).  
 
 
 
 
Syllabus

Lecture 1: Introduction to Text Data, Scraping and Tokenization
Reading: 
Ash and Hansen, “Text Algorithms in Economics”
Gentzkow, Kelly, and Taddy, “Text as Data”
 
Lecture 2: Tokenization and Dictionaries
          	Reading: 
 
Lecture 3: Vectorization and Document Distance
            Reading: 
Autor et al. “New Frontiers: The Origins and Content of New Work, 1940–2018”

 
Lecture 4: Large Language Models
 	Reading: 
		

 

 
Schedule
Presentations: Text Data: Presentations 

Date
Content
08.10.
Lecture 1: Overview
15.10.
Lecture 2: Tokenization Dictionaries
22.10.
Lecture 3: Vectorization and Document Distance
29.10.
Lecture 4: Language Models
05.11.
Paper Presentations
12.11.
Paper Presentations
19.11.
Paper Presentations
26.11.
No Lecture (option for Q&A)
03.12.
Paper Presentations
10.12.
Paper Presentations
17.12.
Q&A for Research Proposals (On Zoom)
07.01.
Research Proposal Presentations
14.01.
Research Proposal Presentations
21.01.
Research Proposal Presentations
28.02.
Term Paper Deadline

 
Notebooks
Data for the Notebooks: Link (Dropbox)
Notebook1: Corpora Matching Link (Dropbox)
Notebook2: Tokenization Link (Dropbox
Notebook3: Word Embeddings and Document Distance Link (Dropbox)



Instructions for Paper Presentations:
Presentations should be 30 minutes with a focus on the text methods in the paper. The presentation should cover the following content:
- What is the authors’ motivation and research question? Which gap in the literature do they
address (short)?
- Which data do they use, how do they preprocess data? How did the research access data, Is it available for anyone?
- What text methods did they use?
- Are these methods complemented with other empirical methods or a research design to allow studying causal questions?
- What are the most important results?
- What are the limits of the study? Where do you see potential for future research?
- 2-3 questions or thoughts for discussion

Instructions for Research Proposal Presentations:
Presentations should be 20 minutes with a focus on the text methods in the paper. The presentation should cover the following content:
Research Question
Existing Literature
What data is/will be used
What text methods are being used?
Is there a theoretical framework, other empirical methods, research design?
Some preliminary results?
Instructions for the Term Paper:
The proposal has to include empirical research using text data.

The following content should be included:
- Introduction, including the motivation and research question
- Overview of the related literature
- Suitable data source, how is it accessed?
- Description of text data method that will be used
- Some analysis on the text*

- Optional: discussion of other methods used in the paper
Length: maximum 10 pages. Also code and results should be submitted.

 *I want every student to get some hands-on experience on analyzing text. However, if the data is not accessible fast enough for this course, you should not let that restrict you writing the Term Paper on that idea. So, in this case you can do some analysis on different text data, write a short report and submit that analysis as part of your Term Paper.

"""

In [None]:
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")

questions = [
    "What is the deadline for the term paper?",
    "How long is the research proposal presentation?",
    "How is the grade determined?",
    "What is the topic of lecture 3?",
    "What is the maximum length of the term paper?",
]
answers = pd.DataFrame(reader(question=questions, context=logistics_text))
answers["question"] = questions

answers

## Summarization
1. Define a text summarizing pipeline using the `"sshleifer/distilbart-cnn-6-6"` model. 
2. Apply the pipeline to `text`
3. Play around with the keyword arguments `min_length` and `max_length` until you get something you like.

In [None]:
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-6-6")

summarizer(
    text,
    min_length=100,
    max_length=200,
    clean_up_tokenization_spaces=True,
)

## Summarization by paragraph

1. Apply the pipeline from the previous task to each paragraph of `text`.
2. Play around with `min_length` and `max_length` until you are satisfied with the result.
3. Combine the results back into one string

In [None]:
summaries = summarizer(
    paragraphs, min_length=40, max_length=60, clean_up_tokenization_spaces=True
)
texts = [entry["summary_text"] for entry in summaries]
print("\n\n".join(texts))

## Translation

1. Go to the [Hugging Face models library](https://huggingface.co/models) and search for a model that can translate text from English to your favorite language.
2. Define a pipeline to do the translation.
3. Apply the pipeline to the example input text to translate its content (the solution below uses German).

In [None]:
# translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
# See the how-to guide for further instructions

# outputs = translator(text, clean_up_tokenization_spaces=True)
# outputs[0]["translation_text"]

In [None]:
import openai
from openai import OpenAI

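# Set the API key. Here we read it from an environment variable (an assumption;
# adapt this to however you store your key).
import os

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")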

In [None]:
# We define which model to use throughout
MODEL = "gpt-5-nano"
MAX_TOKENS = 8000
WAIT_TIME = 0.8  # Wait time between each request. This depends on the rate limit of the model used: GPT-4 needs longer wait time than GPT-3.5.

client = OpenAI(api_key=OPENAI_API_KEY)

In [None]:
response = client.chat.completions.create(
    model="gpt-4.1-nano",  # Which model to use
    temperature=0.2,  # How random is the answer
    max_tokens=120,  # How long can the reply be
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
)
result = ""
for choice in response.choices:
    result += choice.message.content
print(f"Model answer: '{result}'")

## 3. Loading and preparing your data

The next step is to load and prepare the data that we want to analyze. We will load the data into a Pandas dataframe to allow easy processing.

The details of how to open your particular data depends on the structure and format of the data. Pandas offers ways of opening a range of file formats, including CSV and Excel files. You may wish to refer to the Pandas documentation for more details.

In our example, we will use the data from the Global Populism Dataset (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/LFTQEZ). This data offers a number of texts from politicians, and can be used for validating our method. The texts are provided as .txt files in a folder. We will load all these files into a single dataframe.

In [None]:
# Loading data from textfiles
import glob
import os

import pandas as pd

# Define the folder path where the text files are located
folder_path = "global-populism-dataset/speeches_20220427/"

# Use glob to get a list of all *.txt files in the folder
txt_files = glob.glob(os.path.join(folder_path, "*.txt"))

# Create an empty list to store the data
data = []

# Loop through each text file
for file_path in txt_files:
    with open(file_path, "r", encoding="utf-8", errors="ignore") as file:
        # Read the text from the file
        text = file.read()

        # Get the filename without the directory path
        filename = os.path.basename(file_path)

        # Append the text and filename to the data list
        data.append({"filename": filename, "text": text})

# Create a dataframe with the data
df = pd.DataFrame(data)

### Filter the data 

You will likely need to filter out and select the data you wish to include. 

In our case, we will filter out texts written in non-Latin alphabets. While ChatGPT can handle languages with non-Latin characters, doing so is currently more expensive, and there are issues with managing text length. For simplicity, we therefore remove these texts.


In [None]:
def is_latin_alphabet(text, threshold=0.9):
    # Empty texts cannot be classified; treat them as non-Latin so they are dropped.
    if not text:
        return False

    # Count characters in the basic ASCII range (U+0000 to U+007F) as a simple
    # proxy for Latin script. Accented letters such as "é" are not counted.
    latin_characters = sum(1 for char in text if ord(char) <= 0x007F)

    # Check if the vast majority of characters are Latin alphabet characters
    return latin_characters / len(text) >= threshold


df = df[df["text"].apply(is_latin_alphabet)]

### Chunking the texts 

Unlike other NLP methods, not much preprocessing is needed. However, LLMs are only able to process texts that are smaller than their "context window". If our texts are longer than the context window of our model, we have to either split the texts into several smaller chunks and analyze them part by part, or simply truncate the text (not recommended).

The details depend on the model you use and the amount of data. For our example, with the GPT-4-32k model, our speeches all fit in the model window, and we do not need to split the texts.

However, for pedagogical reasons, we will use the standard 8K GPT-4 model and chunk the text into smaller pieces. If your text is short, such as a tweet, this function will do nothing.
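To make the idea concrete, here is a minimal sketch of sentence-preserving chunking. It is a simplification under stated assumptions: whitespace word counts stand in for model tokens, a naive regular expression replaces nltk's sentence tokenizer, and chunks are filled greedily rather than balanced in size as in the function further below. The helper name `chunk_by_sentences` is hypothetical.

```python
import re


def chunk_by_sentences(text, max_words):
    """Greedily pack whole sentences into chunks of at most max_words words.

    Word counts are only a rough proxy for model tokens (assumption).
    A single sentence longer than max_words becomes its own oversized chunk.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words

    # Add the remaining sentences as the last chunk.
    if current:
        chunks.append(" ".join(current))
    return chunks


example = "One two three. Four five six seven. Eight nine. Ten."
for piece in chunk_by_sentences(example, max_words=5):
    print(piece)
```

A short text, such as a tweet, comes back as a single chunk, so the function can be applied to every row regardless of length.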

In [None]:
# Example of how to chunk the text into pieces, separated on sentence level.
# To do so, we use the nltk library

In [None]:
!pip install tiktoken
!pip install nltk

In [None]:
import nltk
import nltk.data
import numpy as np
import tiktoken
from nltk.tokenize import sent_tokenize, word_tokenize

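nltk.download("punkt")
# Newer NLTK releases load the sentence tokenizer from "punkt_tab" instead.
nltk.download("punkt_tab")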

In [None]:
# This code chunks the text into processable pieces of similar size.
# If the text is longer than allowed in terms of model tokens, we want to split the text into equally sized parts, without splitting any text mid-sentence.


def split_text_into_chunks(text, max_tokens):

    # Code the text in gpt coding and calculate the number of tokens
    encoding = tiktoken.encoding_for_model(MODEL)
    nrtokens = len(encoding.encode(text))

    if nrtokens < max_tokens:
        return [text]

    # how many chunks to split it into?
    num_chunks = np.ceil(nrtokens / max_tokens)

    # Tokenize the text into sentences
    sentences = sent_tokenize(text)

    # Calculate the number of words per chunk
    words_per_chunk = len(text.split()) // num_chunks

    # Initialize variables
    chunks = []
    current_chunk = []

    word_counter = 0
    # Iterate through each sentence
    for sentence in sentences:
        # Add the sentence to the current chunk
        current_chunk.append(sentence)
        word_counter += len(sentence.split())

        # Check if the current chunk has reached the desired number of words
        if word_counter >= words_per_chunk:
            # Add the current chunk to the list of chunks
            chunks.append(" ".join(current_chunk))
            word_counter = 0
            # Reset the current chunk
            current_chunk = []

    # Add the remaining sentences as the last chunk
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

In [None]:
# Maximum number of tokens per chunk, this depends on the model context window.
# We set it to a bit lower than the max tokens, to leave space for our instruction and the response.
max_tokens = MAX_TOKENS - 2000
df["text_chunks"] = df["text"].apply(lambda x: split_text_into_chunks(x, max_tokens))

# 4. Prompt engineering

The next step is to formulate a first set of instructions for analyzing the text. The prompts will be the result of an iterative process through which you develop a formulation of the concept that you wish to capture.

We here start by drawing on the instructions for human coders from a previous study.

See the how-to guide for details on this process. 

In [None]:
instruction = """Your task is to evaluate the level of populism in a political text. Populism is defined as "an ideology that considers society to be ultimately separated into two homogeneous and antagonistic groups, 'the pure people' versus 'the corrupt elite', and which argues that politics should be an expression of the volonté générale (general will) of the people."
A populist text is characterized by BOTH of the following elements:
- People-centrism: how much does the text focus on "the people" or "ordinary people" as an indivisible or homogeneous community? Does the text promote a politics as the popular will of "the people"?
Appeals to specific subgroups of the population (such as ethnicities, regional groups, classes) are inherently antithetical to populism.
- Anti-elitism: how much does the text focus on "the elite", and to what extent are elites in general described in negative terms? In populist texts, the elite is often described as corrupt, and the juxtaposition between the ordinary people and the elite is cast as a moral struggle between good and bad. 
Criticism of specific elements within an elite is not populist: a populist appeal must regard the elite in its entirety as anathema. 

You should give the text a numeric grade between 0 and 2.
2. The text is very populist and comes very close to the ideal populist discourse.
1. A speech in this category includes strong expressions of all of the populist elements,  but either does not use them consistently or tempers them by including non-populist elements. The text may have a romanticized notion of the people and the idea of a unified popular will, but it avoids bellicose language or any particular enemy.
0. A speech in this category uses few if any populist elements. 
[Answer with a number in the 0-2 range, followed by a semi-colon, and then a brief motivation. For instance: "1.23; The text shows many elements of a populist text." Do not use quotation marks.]
"""

# 5. Calling the LLM and analyzing the results

### 5.1 Call the LLM

We will now write simple functions for calling the API and carry out our analysis request. We will also need to handle possible errors returned from the API.
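The retry logic can be sketched independently of the OpenAI client. The following is a minimal illustration rather than the function used in this notebook: `call_with_retries` and `flaky` are hypothetical names, and the wait times are arbitrary choices that you would tune to the rate limits of your model.

```python
import time


def call_with_retries(func, max_tries=5, base_wait=0.01, retry_on=(Exception,)):
    """Call func() until it succeeds or max_tries attempts have failed.

    The wait doubles after every failure (exponential backoff).
    """
    for attempt in range(1, max_tries + 1):
        try:
            return func()
        except retry_on as error:
            if attempt == max_tries:
                raise  # Give up and surface the last error to the caller.
            wait = base_wait * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error}); retrying in {wait:.2f}s")
            time.sleep(wait)


# Toy stand-in for an API call: fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(call_with_retries(flaky))
```

Bounding the number of attempts matters: a loop that retries forever on every error can silently hang the whole analysis.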

In [None]:
import time


def analyze_message(text, instruction, model="gpt-4", temperature=0.2, max_tries=10):
    print("Analyzing message...")

    response = None
    tries = 0

    while response is None:
        try:
            response = client.chat.completions.create(
                model=model,
                temperature=temperature,
                messages=[
                    # The system instruction tells the bot how it is supposed to behave.
                    {"role": "system", "content": f"'{instruction}'"},
                    # This provides the text to be analyzed.
                    {"role": "user", "content": f"'{text}'"},
                ],
            )

        # If the text is too long, we truncate it and try again. Note that if you get this error, you probably want to chunk your texts.
        # This handler must come before openai.APIError, because BadRequestError is a subclass of it.
        except openai.BadRequestError as e:
            tries += 1
            if tries >= max_tries:
                print(f"Caught a BadRequestError: {e}. Too many exceptions. Giving up.")
                raise
            print(
                f"Received a BadRequestError. Request likely too long; cutting 10% of the text and trying again. {e}"
            )
            time.sleep(5)
            words = text.split()
            num_words_to_remove = max(1, round(len(words) * 0.1))
            text = " ".join(words[:-num_words_to_remove])

        # Handle all other API errors (connection problems, rate limits, server errors).
        # If the API gets an error, perhaps because it is overwhelmed, we wait 10 seconds and then we try again.
        # We do this 10 times, and then we give up.
        except openai.APIError as e:
            tries += 1
            if tries >= max_tries:
                print(f"Caught an APIError: {e}. Too many exceptions. Giving up.")
                raise
            print(f"Caught an APIError: {e}. Waiting 10 seconds and then trying again...")
            time.sleep(10)

    return "".join((c.message.content or "") for c in response.choices)

### 5.2 Parse response 

The LLM will return a text message. We need to parse this response so that we can use it for further analysis. The details of this function will depend on how you asked the API to respond in your instruction (see above). In our case, we asked the LLM to return a number, followed by a semicolon and a brief motivation.
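As an illustration of the expected format, here is a hedged sketch that splits one response and checks that the grade is a number in the 0-2 range. `parse_graded_response` is a hypothetical helper, and the sample strings are made up for the example rather than real model output.

```python
def parse_graded_response(response, low=0.0, high=2.0):
    """Split '<number>; <motivation>' and validate the number.

    Returns (grade, motivation); grade is None if it cannot be parsed
    or lies outside [low, high].
    """
    cleaned = response.strip().strip("'\"")
    # Split only on the first semicolon: the motivation may contain more.
    head, sep, tail = cleaned.partition(";")
    if not sep:
        return None, cleaned
    try:
        grade = float(head.strip())
    except ValueError:
        return None, tail.strip()
    if not low <= grade <= high:
        grade = None
    return grade, tail.strip()


print(parse_graded_response('"1.5; Strong people-centrism; little anti-elitism."'))
print(parse_graded_response("I cannot rate this text."))
```

Returning `None` for the grade, instead of raising, keeps a single malformed reply from stopping a long run; such rows can be inspected afterwards.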

In [None]:
def parse_result(result):
    # The LLMs at times surround their answers with quotation marks, even if you explicitly tell them not to. If so, we remove them here.
    result = result.strip().strip("'\"")

    # We asked the LLM to start with a number, followed by a semi-colon, followed by the motivation. We assume this format in the response here.
    # We split on the first ';' only, since the motivation itself may contain semi-colons.
    parts = result.split(";", 1)
    if len(parts) != 2:
        # If the format is unexpected, we print the string that failed, to allow debugging,
        # and return no answer so that the calling code can still unpack two values.
        print(f"Could not parse response: {result}")
        return None, result

    answer, motivation = parts
    return answer.strip(), motivation.strip()

### 5.3 Run the analysis

This is the main loop of the code, where we call the LLM for each line in our data, and give it the instructions.

In [None]:
# First, we need to prepare the data and store it in a file for persistency
filename = "data.pkl"

In [None]:
# These are the columns where we will store the analyzed data
df["answers"] = [[] for _ in range(len(df))]
df["motivations"] = [[] for _ in range(len(df))]

df.to_pickle(filename)

In [None]:
# Main loop
df = pd.read_pickle(filename)

# If you want to limit the number of lines to analyze
maximum_lines_to_analyze = 10
i = 0

while True:

    # Find all unprocessed lines
    # left = df.loc[df['result'].isna()]
    left = df.loc[df["answers"].map(len) == 0]

    # No lines left? Then we're done
    if len(left) == 0 or i >= maximum_lines_to_analyze:
        print("All done!")
        break

    # Take a random line
    line = left.sample()
    index = line.index.values[0]

    print(f"There are {len(left)} left to process. Processing: {index}")

    # Wait for a bit, to not overload the API
    time.sleep(WAIT_TIME)

    # Analyze the specific line, chunk by chunk
    for chunk in line["text_chunks"].values[0]:
        result = analyze_message(chunk, instruction, model=MODEL)

        # Parse the results, and put into dataframe
        answer, motivation = parse_result(result)

        df.loc[index, "answers"].append(answer)
        df.loc[index, "motivations"].append(motivation)

    i += 1

    # Save the result to persistent file
    df.to_pickle(filename)

### Post-analysis calculations

Following the LLM analysis, we may need to do some minor calculations or modifications of the results. 

For instance, we need to combine the values returned for the different chunks into a single value for the full text. This can be done in several ways, but the most straightforward is to take the average of the values for the individual chunks. If the text only has one chunk, its value is used without change. We will here leave the motivations as a list.
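The plain average treats every chunk equally. One alternative, sketched below under the assumption that longer chunks should count for more, weights each chunk's score by its word count. The helper `weighted_score` and the sample values are hypothetical and are not used elsewhere in this notebook.

```python
def weighted_score(scores, chunks):
    """Average per-chunk scores, weighting each by its chunk's word count.

    Scores that cannot be converted to a number are skipped.
    Returns None if no usable score remains.
    """
    total, weight_sum = 0.0, 0
    for score, chunk in zip(scores, chunks):
        try:
            value = float(score)
        except (TypeError, ValueError):
            continue  # Skip unparsable answers such as None or "n/a".
        weight = len(chunk.split())
        total += value * weight
        weight_sum += weight
    return total / weight_sum if weight_sum else None


# Two chunks: a long one (6 words) scored 2 and a short one (2 words) scored 0.
print(weighted_score(["2", "0"], ["a b c d e f", "g h"]))  # 1.5
```

Because the chunking function aims for chunks of similar size, the weighted and unweighted averages will usually be close; the difference matters mainly when the last chunk is much shorter than the others.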

In [None]:
df = pd.read_pickle(filename)

# Take the mean level of populism in the text


def safe_mean(values):
    nums = pd.to_numeric(pd.Series(values), errors="coerce").dropna()
    return float(nums.mean()) if not nums.empty else None


df["answer"] = df["answers"].apply(safe_mean)

# Sort so that the most populist speeches are at the top
df = df.sort_values(by="answer", ascending=False)
df

### Example result

We can now look at some examples of the result from the analysis and the associated motivation. 

For instance, we here look at the rating of the speech stored in `Slovakia_Fico_International_2.txt`:

In [None]:
fico = df.loc[df.filename == "Slovakia_Fico_International_2.txt"]
print(
    f"""Rating: {fico.answer.values[0]}. Motivation: '{fico.motivations.values[0][0]}'"""
)

At face value, this motivation seems both reasonable and plausible. We will now turn to carry out a more in-depth validation.

# 6. Validation

Finally, we need to validate our results. Careful validation is essential to make sure that the models are measuring what we intend -- and that they do so without problematic biases. To do so, we can compare the outputs with established benchmarks, ground truth data, or expert evaluations. Validation can furthermore help us fine-tune the model prompt to improve the results.

A simple way of validating is to write a random sample to an Excel file and have human coders manually classify the data, so that the two sets of results can be compared. To do so, the code below can be used:

### 6.1 Acquire validation data

In [None]:
# # Code to extract the data as excel for manual checking. This is included for illustration, however, we won't use this here.
# sample_size = 100
# sample = df.sample(sample_size).reset_index()
# sample['manual_classification'] = None
# sample[['index','text','manual_classification']].to_excel('manual_validation.xlsx')

# # Now open the resulting file in Excel. Carry out manual classification and put result in the final column

# manual_result = pd.read_excel('manual_validation_finished.xlsx')

In the case of the populism example, however, the Global Populism Database already offers a large sample of manually classified datapoints that we can use for validation. 

We first need to make sure the data is in the right format for running simpledorff. Each line should be one coder response.

In [None]:
# Load and clean the validation data.
val = pd.read_csv("./global-populism-dataset/gpd_v2_20220427.csv")
val = val[val["merging_variable"].notna()]
val = val[
    val["rubricgrade"].notna()
]  # The database contains some NaN values for the index; we remove these lines
val = val[["merging_variable", "codernum", "rubricgrade", "averagerubric"]]

In [None]:
# We include only the lines that we've coded with the LLM
included = set(df.loc[~df["answer"].isna()].filename.values)
val = val.loc[(val["merging_variable"].isin(included)) & (val["codernum"] <= 2)]

# We compare our result with that of the average coder result
val = val.drop_duplicates(subset=["merging_variable"], keep="first")[
    ["merging_variable", "averagerubric"]
].rename(columns={"averagerubric": "answer"})
val["codernum"] = "human"

# Fit our coded data into the same format to allow processing
df2 = (
    df[["filename", "answer", "motivations"]]
    .dropna(subset=["answer"])
    .rename(columns={"filename": "merging_variable"})
)
df2["codernum"] = "llm"

# Combine the two datasets
validation_data = pd.concat([val, df2])

### 6.2 Measure Krippendorff's Alpha

To compare our data against the validation data, we can use Krippendorff's Alpha (see the how-to guide for details). We here use the simpledorff library to do so.
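For intuition about what the measure captures, here is a minimal pure-Python sketch of Krippendorff's Alpha with the interval metric, `alpha = 1 - D_o / D_e`, where `D_o` is the observed disagreement within each text and `D_e` is the disagreement expected if ratings were paired at random. This is a didactic sketch, not simpledorff's implementation; the function name `interval_alpha` and the toy ratings are assumptions.

```python
from itertools import permutations


def interval_alpha(units):
    """Krippendorff's Alpha (interval metric) for a list of rated units.

    units: one list of numeric ratings per text, e.g. [[human, llm], ...].
    Units with fewer than two ratings are ignored, since they cannot be paired.
    The result is undefined (division by zero) if all ratings are identical.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)

    # Observed disagreement: squared differences between ratings of the same unit.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: squared differences across all pooled ratings.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))

    return 1 - d_o / d_e


ratings = [[0, 1], [1, 1], [2, 2]]  # toy data: [human, llm] per text
print(round(interval_alpha(ratings), 3))  # 0.706
```

A value of 1 means perfect agreement, while values near 0 mean the coders agree no more than chance would predict.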

In [None]:
!pip install simpledorff

In [None]:
import simpledorff

In [None]:
# Calculate inter-coder reliability
# Note that this uses the interval metric. If your variable is categorical, you need to remove the metric_fn parameter.
KA = simpledorff.calculate_krippendorffs_alpha_for_df(
    validation_data,
    metric_fn=simpledorff.metrics.interval_metric,
    experiment_col="merging_variable",
    annotator_col="codernum",
    class_col="answer",
)

print(f"The resulting Krippendorff's Alpha is {KA}.")

This is a relatively high value for a first iteration of prompt development for a challenging concept.


### 6.3 Carry out an iterative process of concept and prompt development

Having measured the disagreement between the human coders and the LLM, we can now try to understand its sources. This is best thought of as a process of mutual learning, through which we develop and operationalize a rigorous social scientific concept in the form of a prompt.

Here we can work with the coders and compare their notes to the motivations given by the LLM, focusing on the examples where the LLM and the human coders disagree most. We may find that the prompt can be improved, or that our human coders were mistaken or biased.

In our case, we do not have access to the coders, so we will simply illustrate how this kind of work can be done.

In [None]:
# We create a dataframe that lists the level of disagreement between coders and LLM
wrong = df2.merge(val, on="merging_variable")
wrong["diff"] = abs(wrong["answer_x"] - wrong["answer_y"])

In [None]:
# We can save as CSV file to analyze results in Excel, or examine the results here.
# display(wrong.sort_values(['diff']))
wrong.sort_values(["diff"]).to_csv("disagreements.csv")
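
As a small variation on the cells above, the sketch below uses explicit merge suffixes so that it is obvious which score came from which coder, and lists the largest disagreements first. The data, file names and the `compared` and `top` variables are hypothetical; in the notebook the two inputs would be `df2` and `val`.

```python
import pandas as pd

# Hypothetical toy scores standing in for the LLM and human results.
llm = pd.DataFrame(
    {"merging_variable": ["a.txt", "b.txt", "c.txt"], "answer": [0.0, 1.5, 2.0]}
)
human = pd.DataFrame(
    {"merging_variable": ["a.txt", "b.txt", "c.txt"], "answer": [2.0, 1.5, 0.5]}
)

# Explicit suffixes replace the default answer_x / answer_y column names.
compared = llm.merge(human, on="merging_variable", suffixes=("_llm", "_human"))
compared["diff"] = (compared["answer_llm"] - compared["answer_human"]).abs()

# Largest disagreements first; these are the cases worth reading closely.
top = compared.sort_values("diff", ascending=False).head(2)
print(top)
```

Sorting in descending order puts the speeches that most need manual review at the top of the table.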

In [None]:
# One of the cases where the LLM and human coders disagree the most is a speech by Berlusconi. The LLM does not think it is populist, but the human coders do.

# The motivation given by the LLM is:
wrong.loc[
    wrong["merging_variable"] == "Italy_Berlusconi_Ribbon_2.txt"
].motivations.values[0][0]

Here follows the text, translated to English. 

Do you agree with the LLM or the human coders? If the latter, how do you think the prompt should be modified to improve the results?

In [None]:
print(
    """Dear friends,\n\nIt is not easy to find the words to describe my, our state of mind at this moment. We are gathered here in Onna to celebrate the Liberation Day, a celebration that is both an honor and a commitment.\n\nAn honor: to commemorate a terrible massacre that took place right here in June 1944 when the Nazis, in retaliation, killed 17 citizens of Onna and then blew up the house where the bodies of those innocent victims were found.\n\nA commitment: what should inspire us is not to forget what happened here and to remember the horrors of totalitarianism and the suppression of 'freedom'.\n\nRight here, in Abruzzo, the legendary Maiella Brigade was born and operated, decorated with the Gold Medal for Military Valor. In December '43, 15 young people founded what would become the Maiella Brigade, which grew to 1,500 strong.\n\nIt is no coincidence that on this special day, the soldiers of the Honor Guard standing before us belong to the 33rd Artillery Regiment, the Abruzzesi unit that in 1943 on Cephalonia had the courage to resist the Nazis and sacrifice themselves – fighting – for the honor of our country.\n\nTo those patriots who fought for the redemption and rebirth of Italy, our admiration, gratitude, and recognition must always go.\n\nMost Italians today have not experienced what it means to be deprived of freedom. Only the elderly have a direct memory of totalitarianism, foreign occupation, and the war for the liberation of our homeland.\n\nFor many of us, it is a memory tied to our families, our parents, our grandparents, many of whom were protagonists or victims of those dramatic days. For me, it is the memory of years of separation from my father, forced to emigrate to avoid arrest, the memory of my mother's sacrifices, who alone had to support a large family during those difficult years. 
It is the memory of her courage, of her, like many others, traveling by train every day from a small town in the province of Como to work in Milan, and on one of those trains, risking her life but managing to save a Jewish woman from the clutches of a Nazi soldier destined for the extermination camps.\n\nThese are the memories, the examples with which we grew up – the memories of a generation of Italians who did not hesitate to choose freedom, even at the risk of their own safety and lives.\n\nOur country owes an inexhaustible debt to those many young people who sacrificed their lives during their most beautiful years to redeem the honor of the nation, out of fidelity to an oath, but above all for that great, splendid, and essential value which is freedom.\n\nWe owe the same debt of gratitude to all those other boys, Americans, English, French, Polish, from the many allied countries, who shed their blood in the Italian campaign. Without them, the sacrifice of our partisans would have risked being in vain.\n\nAnd with respect, we must remember today all the fallen, even those who fought on the wrong side, sincerely sacrificing their lives for their ideals and a lost cause.\n\nThis does not mean, of course, neutrality or indifference. We are – all free Italians are – on the side of those who fought for our freedom, for our dignity, and for the honor of our homeland.\n\nIn recent years, the history of the Resistance has been deepened and discussed. It is a good thing that it happened. The Resistance, along with the Risorgimento, is one of the founding values of our nation, a return to the tradition of freedom. And freedom is a right that comes before laws and the state because it is a natural right that belongs to us as human beings.\n\nHowever, a free nation does not need myths. 
As with the Risorgimento, we must also remember the dark pages of the civil war, even those in which those who fought on the right side made mistakes and took on blame.\n\nIt is an exercise in truth, in honesty, an exercise that makes the history of those who fought on the right side with selflessness and courage even more glorious.\n\nIt is the history of the many who fought in the Southern army, who, from Cephalonia onwards, redeemed the honor of the uniform with their blood.\n\nIt is the history of martyrs like Salvo D’Acquisto, who did not hesitate to sacrifice his life in exchange for other innocent lives.\n\nIt is the history of our soldiers interned in Germany who chose concentration camps rather than collaborating with the Nazis.\n\nIt is the history of the many who hid their fellow Jewish citizens, saving them from deportation.\n\nAbove all, it is the history of the many, countless unknown heroes who, with small or great acts of daily courage, contributed to the cause of freedom.\n\nEven the Church, I want to remember, played its part with true courage, to prevent odious concepts like race or religious differences from becoming reasons for persecution and death.\n\nSimilarly, we must remember the young Jews of the Jewish Brigade, who came from ghettos all over Europe, took up arms, and fought for freedom.\n\nAt that moment, many Italians of different faiths, cultures, and backgrounds came together to pursue the same great dream – the dream of freedom.\n\nAmong them were very different individuals and groups. Some thought only of freedom, some dreamed of establishing a different social and political order, some considered themselves bound by an oath of loyalty to the monarchy.\n\nBut they all managed to set aside their differences, even the most profound ones, to fight together. The communists and the Catholics, the socialists and the liberals, the actionists and the monarchists, faced with a common tragedy, each wrote a great page of our history. 
A page on which our Constitution is based, a page on which our freedom is based.\n\nIn the drafting of the Constitution, the wisdom of the political leaders of that time – De Gasperi and Togliatti, Ruini and Terracini, Nenni, Pacciardi, and Parri – managed to channel deep initial divisions towards a single objective.\n\nAlthough clearly the result of compromises, the republican Constitution achieved two noble and fundamental objectives: guaranteeing freedom and creating the conditions for democratic development in the country. It was not a small feat; in fact, it was the best compromise possible at the time.\n\nHowever, the goal of creating a "common" moral conscience for the nation was missed, perhaps premature for those times, so much so that the predominant value for everyone was anti-fascism, but not necessarily anti-totalitarianism. It was a product of history, a compromise useful to avoid the Cold War that vertically divided Italy from degenerating into a civil war with unpredictable outcomes. But the assumption of responsibility and the sense of the State that animated all the political leaders of that time remain a great lesson that would be unforgivable to forget.\n\nToday, 64 years after April 25, 1945, and twenty years after the fall of the Berlin Wall, our task, the task of all, is to finally build a unified national sentiment.\n\nWe must do it together, together, regardless of political affiliation, together, for a new beginning of our republican democracy, where all political parties recognize the greatest value, freedom, and debate in its name for the good and the interest of all.\n\nThe anniversary of the regained freedom is, therefore, an opportunity to reflect on the past, but also to reflect on the present and the future of Italy. If we can do it together from today onwards, we will have rendered a great service not to one\n\nWe have always rejected the idea that our adversary was our enemy. 
Our religion of freedom demanded it from us and still does. With the same spirit, I am convinced that the time has come for the Liberation Day to become the Day of Freedom, and for this commemoration to shed the character of opposition that revolutionary culture gave it, a character that still 'divides' rather than 'unites'.\n\nI say this with great serenity, without any intention of creating controversy. April 25 was the origin of a new season of democracy, and in democracy, the people's vote deserves absolute respect from everyone.\n\nAfter April 25, the people peacefully voted for the Republic, and the monarchy accepted the people's judgment.\n\nShortly after, on April 18, 1948, the people's choice was once again decisive for our country: with De Gasperi's victory, the Italian people recognized themselves in the Christian and liberal tradition of their history. The 1950s, always with the support of the popular vote, shaped Italy into a democratic, economic, and social reality. Italy became part of Europe and the West, played a role in promoting Atlantic unity and European unity, transforming from a rejected nation to a respected one.\n\nToday, our young people face other challenges: to defend the freedom conquered by their fathers and expand it even further, aware that without freedom, there can be no peace, justice, or well-being.\n\nSome of these challenges are global and see us engaged alongside free nations: the fight against terrorism, the fight against fanatic and repressive fundamentalism, the fight against racism because freedom, dignity, and peace are rights of every human being, 'everywhere' in the world.\n\nThat's why I want to remember the Italian soldiers engaged in peace missions abroad, especially those who have fallen in carrying out this noble mission. 
There is an ideal continuity between them and all the heroes, Italian and allied, who sacrificed their lives over 60 years ago to give us back freedom, security, and peace.\n\nToday, the teachings of our fathers take on a special value: this April 25 comes just after the great tragedy that struck this land of Abruzzo. Once again, facing the emergency and tragedy, Italians have shown their ability to unite, to overcome differences, demonstrating that they are a great and cohesive people, full of generosity, solidarity, and courage.\n\nLooking at the many Italians who have been engaged here in rescue and reconstruction efforts, I feel proud, once again, even more so, to be Italian and to lead this wonderful country.\n\nToday, Onna is the symbol of our Italy. The earthquake that destroyed it reminds us of the days when invaders destroyed it. Rebuilding it will mean repeating the gesture of its rebirth after Nazi violence.\n\nAnd it is precisely concerning the heroes of then and today that we all have a great responsibility: to set aside any controversy, to look at the interest of the nation, to safeguard the great heritage of freedom that we inherited from our fathers.\n\nTogether, we all have the responsibility and duty to build a future of prosperity, security, peace, and freedom for all.\n\nLong live Italy! Long live the Republic!\n\nLong live April 25, the celebration of all Italians who love freedom and want to remain free!\n\nLong live April 25, the celebration of regained freedom!"""
)