# LLMs in Political Science Research via API

Welcome to a tutorial notebook on the practical application of Large Language Models (LLMs) in political science research. The integration of AI technologies into political analysis represents a fundamental shift in how researchers approach policy analysis, public opinion research, and institutional studies.

### Core Applications

Political science research now leverages LLMs for:

- **Policy document analysis** - automated extraction of policy positions across legislative corpora
- **Sentiment analysis** - measuring public opinion dynamics in social media and news coverage  
- **Content classification** - categorizing political texts by ideology, topic, or rhetorical strategy
- **Cross-national research** - enabling comparative analysis across multiple languages and contexts

### Technical Implementation Overview

This workshop provides hands-on experience with:

1. **API fundamentals** - programmatic access to various LLM services
2. **Structured data extraction** - entity recognition for political actors, organizations, and events
3. **Scalable analysis workflows** - batch processing of large document collections
4. **Methodological considerations** - addressing bias, validity, and reproducibility in computational approaches

### Strategic Advantages

API-based LLM integration enables:
- **Reproducible research** through documented, version-controlled analysis pipelines
- **Cost-effective scaling** for processing extensive legislative archives or media databases
- **Customizable workflows** tailored to specific research questions
- **Real-time analysis** of evolving political discourse and policy developments

The following sessions will establish practical competency in leveraging these technologies for robust political science research.

### What Makes This Approach Special

Rather than relying on simple chat interfaces, you'll learn to harness the full power of LLMs through API access, enabling:
- **Batch processing** of large document collections
- **Customizable workflows** tailored to your specific research needs
- **Reproducible research** methods with documented processes
- **Integration** with existing computational social science tools and methodologies


## Section 1 - Introduction to LLMs and APIs

In the first section, we will explore the basics of Large Language Models (LLMs) and how to interact with them using APIs. We will cover the following topics:
- **What are LLMs?**: An introduction to Large Language Models, their capabilities, and how they can be applied in social science research.
- **Setting Up Your Environment**: Instructions on how to set up your programming environment to interact with LLM APIs.
- **Understanding APIs**: A brief overview of what APIs are, how they work, and why they are essential for accessing LLMs.
- **OpenRouter API**: Introduction to the OpenRouter API, which provides access to various LLMs.
- **Understanding JSON**: An introduction to JSON (JavaScript Object Notation), the data format commonly used for API responses, and how to work with it in Python.

## What are LLMs?

**Large Language Models (LLMs)** are sophisticated artificial intelligence systems trained on vast collections of text data to understand, generate, and manipulate human language. Think of them as extremely well-read digital assistants that have absorbed millions of books, articles, websites, and documents, enabling them to engage with text in remarkably human-like ways.

### How LLMs Work: The Basics

LLMs use a technology called the **transformer architecture**, which was introduced in the landmark 2017 paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762). This architecture allows them to:

1. **Predict the next word** in a sequence based on context
2. **Understand relationships** between words, sentences, and concepts
3. **Generate coherent text** that follows patterns learned from training data
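
To make the idea of next-word prediction concrete, here is a toy sketch that predicts the next word from simple word-pair counts. This is not how a transformer works internally (real LLMs use learned neural network weights over tokens, not count tables); the tiny corpus and the function name `predict_next` are made up for illustration.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, for illustration only
corpus = (
    "the parliament passed the bill . "
    "the parliament rejected the amendment . "
    "the parliament passed the budget ."
).split()

# Count which word follows each word in the corpus
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if the word is unseen."""
    if word not in following:
        return None
    return following[word].most_common(1)[0][0]

print(predict_next("parliament"))
```

An LLM does something loosely comparable at a vastly larger scale: it assigns a probability to every possible next token given the whole preceding context, not just the previous word.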

Watch Javier de la Rosa explain LLMs in this video:
https://youtu.be/VtkPhFwF-2Q?si=6CrZDx5jiN-fcoeZ&t=127

### Key Terms

#### **Training Data**
The massive collection of texts used to teach the LLM. This typically includes:
- Web content and reference materials
- News articles and magazines
- Academic papers and journals
- Books and literature from various periods and cultures
- **Important**: The quality and diversity of training data affects what the model "knows"

#### **Tokens**
The basic units of text that LLMs process. A token can be:
- A whole word ("democracy")
- Part of a word ("demo" + "cratic")
- Punctuation marks
- **Why it matters**: API costs are often calculated per token
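
Because costs scale with tokens, it helps to estimate token counts before sending a large corpus. The sketch below uses a common rule of thumb of roughly four characters per token for English text. That ratio is an approximation, the price is a made-up number, and exact counts require the tokenizer of the specific model you use.

```python
import math

CHARS_PER_TOKEN = 4  # rough rule of thumb for English text; an assumption, not exact

def estimate_tokens(text):
    """Very rough token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def estimate_cost(text, usd_per_million_tokens):
    """Estimated input cost in USD for a hypothetical per-token price."""
    return estimate_tokens(text) * usd_per_million_tokens / 1_000_000

speech = "Democracy is government of the people, by the people, for the people."
print(estimate_tokens(speech))
print(estimate_cost(speech, usd_per_million_tokens=0.50))  # 0.50 is a made-up price
```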

#### **Context Window**
The amount of text an LLM can "remember" at once, measured in tokens. Common sizes:
- **GPT-3.5**: ~4,000 tokens (≈3,000 words)
- **GPT-4**: ~8,000-32,000 tokens
- **Claude**: ~100,000+ tokens
- **Gemini**: ~1,000,000 tokens
- **Why it matters**: Determines how much text you can analyze at once
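
When a document is longer than the context window, one common workaround is to split it into smaller pieces and process each piece separately. The sketch below splits by word count as a simple stand-in for tokens; the function name `chunk_words` is hypothetical, and a real pipeline would budget with the model's own tokenizer and might split on paragraph boundaries instead.

```python
def chunk_words(text, max_words):
    """Split text into consecutive chunks of at most `max_words` words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]

document = "one two three four five six seven"
for chunk in chunk_words(document, max_words=3):
    print(chunk)
```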

#### **Prompt**
The input text you give to an LLM to get a response. Effective prompting is crucial for good results.

#### **Fine-tuning**
The process of further training a model on specific data to improve performance for particular tasks.

### Applications in Social Science

#### **1. Text Analysis**
- **Sentiment analysis** of historical documents
- **Thematic analysis** across large corpora
- **Stylometric analysis** for authorship attribution
- **Content classification** and categorization

#### **2. Language Processing**
- **Translation** of texts
- **Transcription** assistance for handwritten documents
- **OCR error correction** in digitized materials

#### **3. Research Assistance**
- **Citation analysis** and bibliography generation
- **Concept mapping** and knowledge extraction
- **Hypothesis generation** from patterns in data

#### **4. Content Generation**
- **Metadata generation** for digital collections
- **Summary creation** for large document sets
- **Educational material** development

### Limitations and Considerations

#### **Accuracy Concerns**
- LLMs can generate plausible but incorrect information (**hallucinations**)
- Always verify important claims against primary sources
- Use multiple models and cross-check results

#### **Bias and Representation**
- Training data reflects societal biases
- May underrepresent certain cultures, languages, or perspectives
- Critical evaluation is essential, especially for sensitive topics

#### **Temporal Knowledge**
- Models have knowledge cutoff dates
- May not know about recent events or publications
- Historical accuracy varies by period and region

#### **Language Coverage**
- Performance varies significantly across languages
- Better results for well-represented languages (English, major European languages)
- Limited effectiveness for minority or historical languages

#### **Legal Considerations**
- Legality of data acquisition for training models
- Protection of user input; consider how your data is handled when using online AI applications

### Popular LLM Models for Research

#### **OpenAI's GPT Series**
- **GPT-3.5**: Fast, cost-effective for many tasks
- **GPT-4**: More capable, better reasoning, higher cost
- **Strengths**: General knowledge, writing quality
- **Best for**: Text generation, analysis, general research tasks

#### **Anthropic's Claude**
- **Claude-3**: Various sizes (Haiku, Sonnet, Opus)
- **Strengths**: Large context windows, careful reasoning
- **Best for**: Long document analysis, ethical considerations

#### **Google's Gemini**
- **Gemini Pro**: Competitive with GPT-4
- **Strengths**: Multimodal capabilities, integration with Google services
- **Best for**: Research integration, document processing

#### **Open Source Models**
- **Llama 2/3**: Meta's open-source models
- **Mistral**: European open-source alternative
- **Benefits**: Transparency, customization, data privacy, reproducibility

### Getting Started: Questions to Ask

Before using LLMs in your research, consider:

1. **What specific task** do you want to accomplish?
2. **How much text** will you be processing?
3. **What level of accuracy** do you need?
4. **Are there privacy concerns** with your data?
5. **What's your budget** for API usage?
6. **Do you need real-time results** or can processing take time?


## LLM through UI versus LLM through API

Using a **Large Language Model (LLM) through an API** means that instead of downloading or running the model on your own computer, you connect to a powerful AI system over the internet and *ask it to do tasks for you*—like analyzing text, summarizing articles, extracting names and places, or generating new content.

### Big Picture Overview

Think of it like this:

* **You** are the researcher with a question.
* **The LLM** is a very smart assistant that understands and works with language.
* **The API** is the bridge that lets you talk to this assistant in a structured, predictable way and get answers back in a structured, predictable way.

You send your questions or text to the LLM through this bridge, and it sends back responses. This might involve tasks like:

* Translating documents
* Summarizing long texts
* Identifying recurring themes in political speeches
* Extracting dates, names, and places from archival material
* Creating timelines or glossaries based on your sources


![Simple Client Server Architecture](https://raw.githubusercontent.com/ValRCS/BSSDH_2025_workshop_LLM_API/refs/heads/main/img/client_server.png)

Generic client-server architecture - applicable to using LLMs as well
[Src](https://medium.com/@tolanisilas3606/getting-started-with-llms-how-to-serve-llm-applications-as-api-endpoints-with-fastapi-in-python-af015399ef3e)

### Slightly Technical Overview: Using LLMs Through APIs

Using **Large Language Models (LLMs)** through an **API (Application Programming Interface)** means that your computer communicates with a powerful language model over the internet using a common format and protocol—usually **HTTP requests** and **JSON data**.

Here’s what that typically involves:

* 🔐 **API Key**: You first obtain an API key—a kind of password that identifies you to the LLM service (e.g., OpenAI, Cohere, Google, Anthropic). This ensures secure access and tracks your usage.

* 📦 **JSON**: You send your input (like a text prompt or document) in a format called **JSON (JavaScript Object Notation)**, which is a simple way to structure data—kind of like filling out a digital form.

* 🌐 **HTTP Request**: You send this JSON to the LLM using an **HTTP request**, which is the same basic method your browser uses to visit websites—but in this case, it's your script or notebook making the request.

* 🧠 **LLM Response**: The LLM processes your input and returns a **response** (also in JSON), containing the generated text, analysis, or extracted information.

* 🧪 **Python & Jupyter Notebooks**: Most Digital Humanities researchers use a **scripting interface** like **Python**, often working in **Jupyter Notebooks**. These notebooks let you write and run code step by step, making it easy to send queries to the API, process responses, and analyze results alongside your research notes.

In summary, using LLMs through an API gives you **programmatic, on-demand access to AI**, using simple web requests, structured data formats like JSON, and scripting environments like Python notebooks—ideal for batch processing or large-scale analysis in Digital Humanities projects.


### Comparing LLM Usage: API vs. UI (e.g., ChatGPT)

| **Feature / Aspect**            | **LLM via API**                                                                 | **LLM via UI (e.g., ChatGPT website)**                                  |
|----------------------------------|----------------------------------------------------------------------------------|-------------------------------------------------------------------------|
| **Interface**                   | Code-based (e.g., Python, Jupyter Notebooks)                                     | Web-based (graphical user interface)                                   |
| **Ease of Use**                 | Requires some technical knowledge (coding, HTTP, JSON)                           | Very easy to use — no coding required                                  |
| **Customization**              | Highly customizable (prompts, formatting, logic, parameters)                     | Limited to what the UI allows                                          |
| **System Prompt Control**      | You define your own system prompt (full control over model behavior)             | Often hidden or pre-set by provider — not user-visible or editable     |
| **Automation**                 | Supports batch processing, loops, and integration into workflows                 | Manual, one prompt at a time                                           |
| **Scalability**                | Good for large-scale or repetitive tasks (e.g., thousands of documents)          | Not ideal for repetitive or high-volume tasks                          |
| **Output Control**             | Easy to parse and post-process structured results (e.g., JSON, CSV)              | Output is plain text, harder to reuse automatically                    |
| **Learning Curve**             | Higher — needs understanding of programming, HTTP, JSON                          | Low — intuitive for most users                                         |
| **Cost Management**            | Detailed usage tracking, adjustable per-token budgets                            | Usage-based, but detailed per-task tracking may be limited             |
| **Use in Research Pipelines**  | Can be integrated into DH workflows and pipelines                                | Mostly standalone usage                                                |
| **Examples of Use**            | - Annotating texts in bulk  <br> - Named entity recognition  <br> - Concept tagging | - Asking one-off questions <br> - Drafting summaries or brainstorming  |
| **Reproducibility**            | Easy to document, version, and rerun scripts                                     | Harder to reproduce exact interactions                                 |
| **Collaboration & Sharing**    | Code and notebooks can be shared, versioned, and reused                          | Limited to screenshots or copy-pasting conversation                    |


### Dangerous instructions from UI

With a chat-based interface, the LLM provider supplies extra instructions (a system prompt) that the user does not see, and these can lead to less-than-ideal results:

April 2025 - [Update that made ChatGPT 'dangerously' sycophantic pulled](https://www.bbc.com/news/articles/cn4jnwdvg9qo)

## OpenRouter API

[OpenRouter](https://openrouter.ai/) is a unified API provider that provides access to multiple LLM providers through a single interface. This makes it convenient to experiment with different models and compare their performance for humanities research tasks.

### What is OpenRouter?

**OpenRouter** acts as a gateway to dozens of different LLM providers, allowing you to:
- **Access multiple models** through a single API interface
- **Compare performance** across different LLMs for the same task
- **Switch between models** without changing your code structure
- **Manage costs** by choosing models based on budget and performance needs

Note: OpenRouter is NOT related to OpenAI. They are separate companies that both provide AI services, although OpenRouter does offer a way to access OpenAI's models.

### Key Advantages for Digital Humanities Research

#### **1. Model Diversity**
- **OpenAI models**: GPT-3.5, GPT-4, GPT-4 Turbo
- **Anthropic models**: Claude-3 Haiku, Sonnet, Opus
- **Google models**: Gemini Pro, Gemini Flash
- **Open source models**: Llama, Mistral, and many others
- **Specialized models**: Fine-tuned for specific tasks

#### **2. Cost Optimization**
- **Transparent pricing**: See exact costs per model
- **Choose by budget**: Use cheaper models for initial testing
- **Scale appropriately**: Use powerful models only when needed
- **Usage tracking**: Monitor your spending in real-time

#### **3. Unified Interface**
- **Consistent API**: Same request format for all models
- **Easy switching**: Change models by modifying one parameter
- **Standard responses**: Uniform JSON response structure
- **Simplified authentication**: One API key for all providers
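
As a rough illustration of "easy switching", the sketch below builds the same chat request for several models, changing only the `model` field. The helper name `payloads_for_models` is hypothetical, and the model identifiers are examples; check the OpenRouter models page for names that are currently available.

```python
def payloads_for_models(models, prompt):
    """Build one chat-completion payload per model; only `model` differs."""
    return [
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
        for model in models
    ]

# Example identifiers in OpenRouter's "provider/model" naming style
candidates = ["openai/gpt-3.5-turbo", "anthropic/claude-3-sonnet"]
for payload in payloads_for_models(candidates, "Summarize this speech: [text]"):
    print(payload["model"])
```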

### How to use after the workshop?

Register for an OpenRouter account at [OpenRouter](https://openrouter.ai/). You should fund the account with a small amount of money to be able to request API keys and use the models.

There are other aggregators of LLMs, such as [Hugging Face](https://huggingface.co/), which also provides other services, but OpenRouter is the most convenient for our purposes. It provides a unified interface to many models, including those from OpenAI, Anthropic (Claude), Google (Gemini), and many others.

You can use OpenRouter to sample different models (https://openrouter.ai/models) and compare their performance for your specific tasks. For truly large tasks, you could then use the model provider's own API directly, as OpenRouter is a wrapper around the original APIs. This way, you can take advantage of the best features of each model while maintaining a consistent interface.

New models are being released almost daily; there were over 400 to choose from as of mid-2025.

## Minimal example of LLM API usage

Let's use an OpenAI-style API to interact with a model such as GPT-3.5.

Note: A similar example can be found at https://openrouter.ai/openai/gpt-3.5-turbo/api


In [8]:
from openai import OpenAI
import getpass # we do not want to show the API key in the code ANYWHERE!!

# we create a local variable to store the OpenRouter API key
open_router_api_key = getpass.getpass("Enter your OpenRouter API key: ") # getpass will hide the input

client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=open_router_api_key,
)

completion = client.chat.completions.create(
  model="openai/gpt-3.5-turbo",  # this can be changed by the user to any other model available on OpenRouter
  messages=[
    {
      "role": "user",  # this is the role of the user in the conversation - similar to how you would use it in a chat application
      "content": "Where is National Library of Latvia located?" # this is so called user prompt, the query we ask the model
    }
  ]
)
print(completion.choices[0].message.content)

Enter your OpenRouter API key: ··········
The National Library of Latvia is located in Riga, the capital city of Latvia.


In [None]:
# to create a new query we do not need to get the key or client again as long as we have run the above cell in our current session
completion = client.chat.completions.create(
  model="openai/gpt-3.5-turbo",  # this can be changed by the user to any other model available on OpenRouter
  messages=[
    {
      "role": "user",  # this is the role of the user in the conversation - similar to how you would use it in a chat application
      "content": "What is BSSDH in Riga?" # this is so called user prompt, the query we ask the model
    }
  ]
)
print(completion.choices[0].message.content)

### Limitations of minimal example

The code above works, but this approach has several limitations:

* We always have to enter the API key manually
* We have hardcoded the model name
* We have hardcoded the user prompt
* We have no system prompt - meaning we cannot control the behavior of the model
* We have no other parameters for the model that might be useful
* We are trusting OpenAI not to break the library with an update
* We have no way to use the model in a batch mode
* How would we save the results?
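
A few of these limitations (no system prompt, no batch mode, no saved results) can be addressed with a small amount of structure. The following is a minimal sketch, not a definitive design: the function names `build_messages` and `run_batch`, the JSON Lines output format, and the `fake_ask` stand-in are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

def build_messages(user_prompt, system_prompt=None):
    """Return a chat message list, with an optional system prompt first."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages

def run_batch(ask, prompts, out_path, system_prompt=None):
    """Call `ask(messages)` for each prompt and save results as JSON Lines.

    `ask` is any function that takes a message list and returns reply text.
    """
    results = []
    with Path(out_path).open("w", encoding="utf-8") as f:
        for prompt in prompts:
            reply = ask(build_messages(prompt, system_prompt))
            record = {"prompt": prompt, "reply": reply}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            results.append(record)
    return results

# Stand-in for a real API call so the sketch runs offline
def fake_ask(messages):
    return f"(reply to: {messages[-1]['content']})"

# Written to the temp directory here; use a project path in real work
out_file = Path(tempfile.gettempdir()) / "results.jsonl"
records = run_batch(fake_ask, ["Question 1", "Question 2"], out_file,
                    system_prompt="You are a careful research assistant.")
print(len(records), "records saved to", out_file)
```

To use this with the client from the minimal example, `ask` could be a small wrapper that calls `client.chat.completions.create(model=..., messages=messages)` and returns `completion.choices[0].message.content`.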

## Setting Up Your Environment

To interact with LLM APIs effectively, we need to set up our programming environment with the necessary libraries and configurations. This includes installing required packages and setting up API credentials.

In [9]:
# Let's print some basic information about this interactive notebook
print("This is an interactive notebook for this workshop on LLMs and APIs.")
# first let's see what Python version we are using
import sys
print(f"Python version: {sys.version}")
# now today's date and time
from datetime import datetime
print(f"Today's date and time: {datetime.now()}")
# we will need to work with JSON data, so let's import the json module
import json
print("JSON module imported successfully.")
# we will need to read and write files so let's import pathlib
from pathlib import Path
print("Path from pathlib imported successfully.")
# TODO for those with some experience it can be useful to print more information about the environment, free memory, drives, etc.
print("Will import external libraries if available.")
# Let's also check if we have the requests library installed, which is commonly used for making API calls
try:
    import requests
    print(f"Requests library version: {requests.__version__}")
except ImportError:
    print("Requests library is not installed. You can install it using 'pip install requests'.")

# sys, datetime, json and pathlib above are part of the Python standard library
# requests (checked above) and the libraries below are external packages, not part of the standard library
# however, they come preinstalled in Google Colab so *should* be available in the Colab environment
# if you are running this notebook locally, you may need to install them using pip

# let's check whether tqdm (for progress bars) is installed
try:
    from tqdm import tqdm
    # import version
    from tqdm import __version__ as tqdm_version
    print(f"TQDM library version: {tqdm_version}")
except ImportError:
    print("TQDM library is not installed. You can install it using 'pip install tqdm'.")

# now let's try importing OpenAI's library if available
try:
    import openai # we actually already imported a class from this library above but let's import the whole library
    print(f"OpenAI library version: {openai.__version__}")
except ImportError:
    print("OpenAI library is not installed. You can install it using 'pip install openai'.")



This is an interactive notebook for this workshop on LLMs and APIs.
Python version: 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]
Today's date and time: 2025-09-05 14:36:09.892017
JSON module imported successfully.
Path from pathlib imported successfully.
Will import external libraries if available.
Requests library version: 2.32.4
TQDM library version: 4.67.1
OpenAI library version: 1.104.2


### Why Check System Information and Library Versions?

**Environment Documentation** is crucial for reproducible research and troubleshooting. Here's why we print this information:

#### **1. Reproducibility**
- **Version consistency**: Different library versions can produce different results
- **Environment documentation**: Future researchers (including yourself) can recreate the exact same setup
- **Research integrity**: Ensures your findings can be validated by others

#### **2. Troubleshooting**
- **Debugging assistance**: When code doesn't work, version information helps identify compatibility issues
- **Support requests**: Technical support often requires knowing your exact environment setup
- **Error diagnosis**: Many errors are version-specific and can be quickly resolved with this information

#### **3. Best Practices**
- **Methodological transparency**: Document all tools and versions used in your research
- **Collaboration**: Team members can ensure they're using compatible environments
- **Publication standards**: Many journals now require detailed technical specifications

#### **4. API Compatibility**
- **Service requirements**: Different LLM APIs may require specific library versions
- **Feature availability**: Newer features might only be available in recent library versions
- **Security updates**: Ensures you're using libraries with the latest security patches

**💡 Pro Tip**: Always run this environment check at the beginning of your research sessions to catch any changes that might affect your results!
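
Printed version information is easy to lose, so it can be worth capturing it in a structured form that can be stored next to your results. The sketch below is one possible approach using only the standard library; the function name `environment_snapshot` and the choice of fields are assumptions.

```python
import json
import platform
import sys
from importlib import metadata

def environment_snapshot(packages):
    """Collect Python, platform, and package versions into a dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": versions,
    }

snapshot = environment_snapshot(["requests", "tqdm", "openai"])
print(json.dumps(snapshot, indent=2))
```

Saving the output of `json.dumps(snapshot, indent=2)` to a file alongside your analysis outputs gives collaborators a record of the setup that produced them.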

## Understanding APIs

**API (Application Programming Interface)** is a set of rules and protocols that allows different software applications to communicate with each other. Think of an API as a digital messenger that takes your request, tells a system what you want, and then brings the response back to you in a structured format.

### The Restaurant Analogy

Imagine you're at a restaurant:
- **You** (the client) want to order food
- **The kitchen** (the server) prepares the food
- **The waiter** (the API) takes your order to the kitchen and brings your food back

In the digital world:
- **Your Python script** (the client) wants data or a service
- **The LLM service** (the server) processes your request
- **The API** takes your request and returns the results

### Key API Concepts

#### **HTTP Methods**
APIs use standard web protocols:
- **GET**: Retrieve information (like downloading a file)
- **POST**: Send data for processing (like submitting a form)
- **PUT**: Update existing data
- **DELETE**: Remove data

For LLM APIs, we primarily use **POST** to send text for analysis.

#### **Request and Response**
Every API interaction involves:
1. **Request**: What you send to the API
   - URL (endpoint)
   - Headers (metadata like authorization)
   - Body (your actual data/text)
2. **Response**: What the API sends back
   - Status code (200 = success, 404 = not found, etc.)
   - Data (usually in JSON format)

#### **Authentication**
Most APIs require proof of identity:
- **API Keys**: Secret strings that identify you
- **Tokens**: Temporary credentials with specific permissions
- **Rate Limits**: Restrictions on how many requests you can make
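
When you hit a rate limit, a common response is to wait and retry, doubling the wait each time (exponential backoff). The sketch below shows the pattern in isolation. `RateLimitError` here is a hypothetical stand-in defined in the sketch itself; with a real client you would catch whatever exception that library raises for rate limits.

```python
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for the rate-limit error your client raises."""

def call_with_retry(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `func()`; on RateLimitError wait base_delay * 1, 2, 4, ... and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)

attempts = {"count": 0}

def flaky_call():
    """Fails twice with a rate-limit error, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("slow down")
    return "ok"

print(call_with_retry(flaky_call, base_delay=0.01))
```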

### API Anatomy for LLM Services

#### **Base URL**
The main address of the API service:
```
https://openrouter.ai/api/v1/
```

#### **Endpoints**
Specific functions within the API:
```
/chat/completions  # For sending messages to LLMs
/models           # List available models
/usage            # Check your usage statistics
```

#### **Complete URL**
```
https://openrouter.ai/api/v1/chat/completions
```

### Headers: The API's Metadata

Headers provide essential information about your request:

```python
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
    "HTTP-Referer": "https://your-research-project.edu",
    "X-Title": "Text Analysis"
}
```

#### **Common Headers Explained**
- **Authorization**: Proves you're allowed to use the service
- **Content-Type**: Tells the API what format your data is in
- **HTTP-Referer**: (Optional) Identifies your project for tracking
- **X-Title**: (Optional) Describes your application

### Request Body: Your Actual Data

The request body contains your instructions and text:

```python
request_body = {
    "model": "openai/gpt-3.5-turbo",
    "messages": [
        {
            "role": "user",
            "content": "Analyze the sentiment of this historical document: [your text here]"
        }
    ],
    "max_tokens": 1000,
    "temperature": 0.1
}
```

#### **Key Parameters**
- **model**: Which LLM to use
- **messages**: Your conversation with the AI
- **max_tokens**: Maximum length of response
- **temperature**: Randomness of the output (0 = most predictable, higher values = more varied and "creative")
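
Putting the URL, headers, and body together, a complete request can be assembled with Python's standard library alone. This is a sketch under assumptions: the helper name `build_request` is hypothetical, the optional `HTTP-Referer` and `X-Title` headers are left out, and the request is only built here, not sent.

```python
import json
import urllib.request

API_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(api_key, user_prompt, model="openai/gpt-3.5-turbo",
                  max_tokens=1000, temperature=0.1):
    """Assemble (but do not send) a POST request for a chat completion."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),  # the body travels as JSON bytes
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request("YOUR_API_KEY", "Analyze the sentiment of this text: [your text here]")
print(req.get_method(), req.full_url)

# Sending requires a real key and network access:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
```

Libraries such as `openai` or `requests` do the same assembly for you; seeing it spelled out can make their error messages easier to interpret.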

### Common API Response Formats

#### **Successful Response (Status 200)**
```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "openai/gpt-3.5-turbo",
  "usage": {
    "prompt_tokens": 56,
    "completion_tokens": 31,
    "total_tokens": 87
  },
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "This historical document expresses predominantly negative sentiment regarding the economic policies..."
      },
      "finish_reason": "stop"
    }
  ]
}
```

#### **Error Response (Status 400+)**
```json
{
  "error": {
    "message": "You exceeded your rate limit",
    "type": "rate_limit_exceeded",
    "code": "rate_limit_exceeded"
  }
}
```
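
Once a response has been parsed into a Python dictionary, the generated text and token usage can be pulled out by following the structure shown above. The helper below is a minimal sketch (the name `extract_reply` is hypothetical) that assumes the response has either the successful shape or the error shape from these examples.

```python
def extract_reply(response):
    """Return (text, total_tokens) from a parsed chat-completion response.

    Raises RuntimeError if the response carries an "error" object instead.
    """
    if "error" in response:
        raise RuntimeError(response["error"].get("message", "unknown API error"))
    text = response["choices"][0]["message"]["content"]
    total_tokens = response.get("usage", {}).get("total_tokens")
    return text, total_tokens

# Abbreviated sample response, modeled on the successful example above
ok = {
    "usage": {"prompt_tokens": 56, "completion_tokens": 31, "total_tokens": 87},
    "choices": [{"message": {"role": "assistant", "content": "Mostly negative."},
                 "finish_reason": "stop"}],
}
print(extract_reply(ok))
```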

### Political Science Use Cases

#### **1. Political Speech Analysis**
```python
# Analyze political speech rhetoric
request = {
    "model": "openai/gpt-4",
    "messages": [
        {"role": "user", "content": "Identify the rhetorical frames used for immigration in this political speech: [speech text]"}
    ]
}
```

#### **2. Event Processing**
```python
# Extract entities from political events
request = {
    "model": "anthropic/claude-3-sonnet",
    "messages": [
        {"role": "user", "content": "Extract all person names, places, and dates from this news story on a battle event: [news story text]"}
    ]
}
```

#### **3. Text Comparison**
```python
# Compare texts
request = {
    "model": "meta-llama/llama-2-70b-chat",
    "messages": [
        {"role": "user", "content": "Compare the themes used in these two political ads: [ad 1] vs [ad 2]"}
    ]
}
```



In [10]:
# let's prompt user for OpenRouter API key
import getpass

# if open_router_api_key variable does not exist or is empty, we will prompt the user for it
if 'open_router_api_key' not in locals() or not open_router_api_key:
    open_router_api_key = getpass.getpass("Please enter your OpenRouter API key: ")
    # save it to .env file for future use
    # note Google Colab will destroy .env file after session ends, so you will need to enter it again next time
    # this can be useful if you re-run the notebook and want to avoid entering the key again
    print("Saving Open Router API key to .env file...")
    with open('.env', 'a') as f:
        f.write(f'OPENROUTER_API_KEY={open_router_api_key}\n')
    print("Open Router API key saved to .env file.")

# we now should have the OpenRouter API key available
if open_router_api_key:
    print("OpenRouter API key loaded successfully.")
else:
    print("OpenRouter API key not found. Please make sure you have it set in your environment variables or .env file.")
    print("You can also enter it manually when prompted during API calls.")

# key point: we never print the key; it is stored in the variable open_router_api_key (you can rename it to something more descriptive)
# do not print it to the console or logs, as it is sensitive information

OpenRouter API key loaded successfully.
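
The cell above writes the key to a `.env` file but does not read it back. A minimal sketch for loading it in a later session might look like the following; the function name `load_api_key` is hypothetical, and the parsing assumes the simple `NAME=value` lines written above.

```python
import os
from pathlib import Path

def load_api_key(name="OPENROUTER_API_KEY", env_file=".env"):
    """Return the key from the environment, else from a NAME=value .env file."""
    value = os.environ.get(name)
    if value:
        return value
    path = Path(env_file)
    if path.exists():
        # Later lines win, matching the append mode used when saving the key
        for line in reversed(path.read_text(encoding="utf-8").splitlines()):
            key, sep, val = line.partition("=")
            if sep and key.strip() == name and val.strip():
                return val.strip()
    return None

key = load_api_key()
print("Key found." if key else "No key found.")  # never print the key itself
```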


### After loading the API key



Now that we have loaded the API key, let's learn a little bit about JSON and how to work with it in Python, as it is the format we will be using to communicate with the API.

## Understanding JSON

JSON (JavaScript Object Notation) is a lightweight data format commonly used for API responses. It's human-readable and easy to work with in Python, making it ideal for handling structured data from LLM APIs.

Official JSON website: [json.org](https://www.json.org/)


### JSON Syntax: Complete Guide

#### **Basic Structure Rules**
1. **Data is in name/value pairs**
2. **Data is separated by commas**
3. **Curly braces hold objects**
4. **Square brackets hold arrays**
5. **Strings must use double quotes**

#### **Data Types**

##### **1. Strings**
- Must be enclosed in **double quotes** (not single quotes)
- Can contain Unicode characters
- Escape sequences supported

```json
{
  "simple_string": "Hello World",
  "unicode_string": "Latvian flag! 🇱🇻",
  "escaped_string": "Quote: \"Hello\" and newline: \n",
  "empty_string": ""
}
```

##### **2. Numbers**
- Integer or floating point
- No leading zeros (a single `0` before the decimal point, as in `0.5`, is fine)
- Scientific notation supported

```json
{
  "integer": 42,
  "negative": -17,
  "float": 3.14159,
  "scientific": 1.23e-10,
  "zero": 0
}
```

##### **3. Booleans**
- Only `true` or `false` (lowercase)
- No other boolean representations

```json
{
  "is_published": true,
  "is_draft": false
}
```

##### **4. Null**
- Represents empty value
- Written as `null` (lowercase)

```json
{
  "optional_field": null,
  "missing_data": null
}
```

##### **5. Objects**
- Collections of key/value pairs
- Keys must be strings in double quotes
- Values can be any JSON data type

```json
{
  "researcher": {
    "name": "Dr. Colin Henry",
    "institution": "University of Zurich",
    "specialization": "Sitting at the Computer",
    "contact": {
      "email": "colin.henry@ipz.uzh.ch",
      "phone": null
    }
  }
}
```

##### **6. Arrays**
- Ordered lists of values
- Values can be any JSON data type (mixed types allowed)
- Zero-indexed

```json
{
  "research_topics": [
    "Text Analysis",
    "Data Visualization",
    "Machine Learning"
  ],
  "mixed_array": [
    "string",
    42,
    true,
    null,
    {"nested": "object"},
    [1, 2, 3]
  ],
  "empty_array": []
}
```
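
To connect the six data types to Python, here is a minimal sketch of how the standard-library `json` module converts each one. The sample document and its key names are made up for illustration.

```python
import json

# A made-up JSON document that uses every JSON data type once
raw = """
{
  "text": "hello",
  "count": 3,
  "score": 0.5,
  "flag": true,
  "missing": null,
  "items": [1, 2],
  "nested": {"a": 1}
}
"""

parsed = json.loads(raw)

# Record the name of the Python type each JSON value was converted to
python_types = {key: type(value).__name__ for key, value in parsed.items()}

for key, type_name in python_types.items():
    print(f"{key:8} -> {type_name}")
```

JSON objects become `dict`, arrays become `list`, `true`/`false` become `True`/`False`, and `null` becomes `None`.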

#### **Nesting and Complex Structures**

JSON allows objects and arrays to be nested inside one another to any depth:

```json
{
  "computational_social_science_project": {
    "title": "Extremist Rhetoric Analysis",
    "metadata": {
      "created": "2025-01-15",
      "version": "1.2",
      "authors": [
        {
          "name": "Colin Henry",
          "role": "Lead Developer",
          "skills": ["Python", "Machine Learning", "APIs"]
        }
      ]
    },
    "datasets": [
      {
        "name": "Extremist Texts",
        "size": 1200,
        "languages": ["English", "German"],
        "analysis_results": {
          "sentiment_scores": [0.65, 0.72, 0.58],
          "themes": {
            "exile": 0.34,
            "identity": 0.78,
            "nationalism": 0.45
          }
        }
      }
    ]
  }
}
```
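
Once parsed, a nested structure like this is navigated in Python by chaining dictionary keys and list indexes. The sketch below uses a trimmed copy of the example; the `funding` key is hypothetical and only shows how `.get()` handles a missing field.

```python
# Trimmed copy of the nested example, written as a Python dict
project = {
    "computational_social_science_project": {
        "title": "Extremist Rhetoric Analysis",
        "datasets": [
            {
                "name": "Extremist Texts",
                "analysis_results": {
                    "themes": {"exile": 0.34, "identity": 0.78, "nationalism": 0.45}
                },
            }
        ],
    }
}

root = project["computational_social_science_project"]

# Dict keys and list indexes can be chained to reach deep values
themes = root["datasets"][0]["analysis_results"]["themes"]

# Find the theme with the highest score
top_theme = max(themes, key=themes.get)
print(top_theme, themes[top_theme])

# .get() returns a default instead of raising KeyError when a key is absent
funding = root.get("funding", "not recorded")
print(funding)
```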

#### **LLM API Request Example**
```json
{
  "model": "openai/gpt-4",
  "messages": [
    {
      "role": "system",
      "content": "You are a political science expert specializing in political extremism."
    },
    {
      "role": "user",
      "content": "Analyze the following post for themes of political violence and identity: [text content here]"
    }
  ],
  "max_tokens": 1000,
  "temperature": 0.1,
  "metadata": {
    "research_project": "Extremist Rhetoric",
    "researcher": "UNM Workshop Participant",
    "date": "2025-09-03"
  }
}
```

### Common JSON Errors and How to Avoid Them

#### **1. Syntax Errors**
```json
// ❌ WRONG - Single quotes
{ 'author': 'Jane Smith' }

// ✅ CORRECT - Double quotes
{ "author": "Jane Smith" }

// ❌ WRONG - Trailing comma
{
  "title": "Book",
  "year": 2024,
}

// ✅ CORRECT - No trailing comma
{
  "title": "Book",
  "year": 2024
}

// ❌ WRONG - Comments (not allowed in strict JSON)
{
  "title": "Book", // This is a comment
  "year": 2024
}

// ✅ CORRECT - No comments
{
  "title": "Book",
  "year": 2024
}
```

#### **2. Data Type Errors**
```json
// ❌ WRONG - Undefined values
{
  "value": undefined
}

// ✅ CORRECT - Use null for missing values
{
  "value": null
}

// ❌ WRONG - Functions (not valid JSON)
{
  "calculate": function() { return 42; }
}

// ✅ CORRECT - Only data, no functions
{
  "result": 42
}
```
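
In Python, these mistakes surface as a `json.JSONDecodeError` when the text is parsed. The helper below is a small sketch for checking a string before using it; the name `check_json` is our own, not part of any library.

```python
import json

def check_json(text):
    """Return (True, parsed_value) for valid JSON, else (False, error_description)."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno point at the position where parsing failed
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"

samples = {
    "single quotes": "{ 'author': 'Jane Smith' }",
    "trailing comma": '{ "title": "Book", "year": 2024, }',
    "undefined value": '{ "value": undefined }',
    "valid": '{ "title": "Book", "year": 2024 }',
}

for label, text in samples.items():
    ok, detail = check_json(text)
    print(f"{label:16} {'OK' if ok else 'INVALID'}  {detail}")
```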

### Working with JSON in Python

#### **Basic Operations**
```python
import json

# Creating JSON from Python data
data = {
    "title": "Research",
    "authors": ["Henry", "Colin"],
    "published": True,
    "year": 2025
}

# Convert to JSON string
json_string = json.dumps(data)
print(json_string)

# Convert back to Python object
parsed_data = json.loads(json_string)
print(parsed_data["title"])
```

#### **Pretty Printing**
```python
# Format JSON nicely
pretty_json = json.dumps(data, indent=2, ensure_ascii=False)
print(pretty_json)
```

#### **Reading/Writing JSON Files**
```python
# Write to file
with open('research_data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

# Read from file
with open('research_data.json', 'r', encoding='utf-8') as f:
    loaded_data = json.load(f)
```

#### **Handling API Responses**
```python
import requests

# api_url, headers, and request_data are assumed to be defined already
response = requests.post(api_url, headers=headers, json=request_data, timeout=30)
if response.status_code == 200:
    result = response.json()  # Automatically parses JSON
    content = result['choices'][0]['message']['content']
    print(content)
```





### Example: LLM API Response

A typical LLM API response in JSON:


In [11]:
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The main subject of this post is violence and identity."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 50,
    "completion_tokens": 20
  }
}

{'choices': [{'message': {'role': 'assistant',
    'content': 'The main subject of this post is violence and identity.'}}],
 'usage': {'prompt_tokens': 50, 'completion_tokens': 20}}
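
Given a response shaped like the one above, a small helper can pull out the reply text and the token count without raising an error when a field is missing. This is a sketch: `extract_reply` is a hypothetical name, and real responses may carry more fields than this sample.

```python
def extract_reply(result):
    """Return (content, total_tokens) from a chat-completion style response dict."""
    choices = result.get("choices") or []
    if not choices:
        return None, 0

    content = choices[0].get("message", {}).get("content")

    usage = result.get("usage", {})
    total = usage.get("total_tokens")
    if total is None:
        # Fall back to summing the two counts when no total is reported
        total = usage.get("prompt_tokens", 0) + usage.get("completion_tokens", 0)
    return content, total

sample_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "The main subject of this post is violence and identity.",
            }
        }
    ],
    "usage": {"prompt_tokens": 50, "completion_tokens": 20},
}

content, total_tokens = extract_reply(sample_response)
print(content)
print(total_tokens)
```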

In [12]:
import json
import requests
from datetime import datetime

def analyze_extremist_text_with_openrouter():
    """
    Demonstrate OpenRouter API usage for political extremism analysis
    """

    # # Step 1: Get API key from environment - we already loaded this globally earlier
    # api_key = os.getenv("OPENROUTER_API_KEY")

    if not open_router_api_key:
        print("❌ Error: OPENROUTER_API_KEY environment variable not found")
        print("\nTo set up your API key:")
        print("1. Get an API key from https://openrouter.ai/")
        print("2. Set environment variable: OPENROUTER_API_KEY=your_key_here")
        return None

    print("✅ API key loaded successfully")

    # Step 2: Set up the API endpoint and headers
    url = "https://openrouter.ai/api/v1/chat/completions"

    headers = {
        "Authorization": f"Bearer {open_router_api_key}",
        "Content-Type": "application/json",
        "HTTP-Referer": "https://www.henryhenryhenry.com",  # Your project URL
        "X-Title": "UNM 2025 LLM Workshop - Extremist Rhetoric Analysis"
    }

    # Step 3: Sample text for analysis
    post_text = """
    I am thrilled to report that the U.S. Space Command headquarters will
    move to the beautiful locale of a place called Huntsville, Alabama — forever
    to be known, from this point forward, as ROCKET CITY. WE LOVE ALABAMA!
    """

    # Step 4: Create the request payload
    request_data = {
        "model": "openai/gpt-3.5-turbo",  # Using GPT-3.5 for cost-effectiveness
        "messages": [
            {
                "role": "system",
                "content": """You are a political scientist specializing in extremism
                and political violence. Analyze texts for extremism, emotional content,
                references to violence, rhetorical devices. Provide detailed, scholarly analysis."""
            },
            {
                "role": "user",
                "content": f"""Please analyze this text excerpt for the following elements:

1. Main themes
2. Emotional tone and sentiment
3. References to violence
4. Rhetorical devices used
5. Historical or social context suggested

Text to analyze:
{post_text}

Please provide your analysis in English, with specific references to the text."""
            }
        ],
        "max_tokens": 1000,
        "temperature": 0.1,  # Low temperature for consistent, analytical responses
        "top_p": 0.9
    }

    # Step 5: Make the API request
    try:
        print("🔄 Sending request to OpenRouter API...")
        print(f"📝 Model: {request_data['model']}")
        print(f"📊 Max tokens: {request_data['max_tokens']}")
        print(f"🌡️ Temperature: {request_data['temperature']}")
        print("-" * 50)

        response = requests.post(url, headers=headers, json=request_data, timeout=30)

        # Check if request was successful
        response.raise_for_status()

        # Step 6: Parse the JSON response
        result = response.json()

        # Step 7: Extract and display the analysis
        if 'choices' in result and len(result['choices']) > 0:
            analysis = result['choices'][0]['message']['content']

            print("✅ Analysis completed successfully!")
            print("=" * 60)
            print("📖 POST ANALYSIS RESULTS")
            print("=" * 60)
            print(analysis)
            print("=" * 60)

            # Display usage statistics
            if 'usage' in result:
                usage = result['usage']
                print(f"\n📊 API Usage Statistics:")
                print(f"   • Prompt tokens: {usage.get('prompt_tokens', 'N/A')}")
                print(f"   • Completion tokens: {usage.get('completion_tokens', 'N/A')}")
                print(f"   • Total tokens: {usage.get('total_tokens', 'N/A')}")

            # Return the full response for further processing
            return {
                'text_analyzed': post_text,
                'analysis': analysis,
                'model_used': request_data['model'],
                'timestamp': datetime.now().isoformat(),
                'usage_stats': result.get('usage', {}),
                'full_response': result
            }

        else:
            print("❌ Error: No analysis returned from the API")
            return None

    except requests.exceptions.Timeout:
        print("❌ Error: Request timed out. Please try again.")
        return None
    except requests.exceptions.HTTPError as e:
        print(f"❌ HTTP Error: {e}")
        if response.status_code == 401:
            print("   This usually means your API key is invalid or expired.")
        elif response.status_code == 429:
            print("   Rate limit exceeded. Please wait before making another request.")
        return None
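    except json.JSONDecodeError:
        # Checked before RequestException: in recent versions of requests, the error raised
        # by response.json() is a subclass of both, so the broader handler would catch it first
        print("❌ Error: Invalid JSON response from API")
        return None
    except requests.exceptions.RequestException as e:
        print(f"❌ Request Error: {e}")
        return None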

# Run the analysis
print("POLITICAL ANALYSIS WITH OPENROUTER API")
print("=" * 60)
result = analyze_extremist_text_with_openrouter()

if result:
    print(f"\n💾 Analysis completed at: {result['timestamp']}")
    print("You can now save this analysis to a file or database for your research.")

POLITICAL ANALYSIS WITH OPENROUTER API
✅ API key loaded successfully
🔄 Sending request to OpenRouter API...
📝 Model: openai/gpt-3.5-turbo
📊 Max tokens: 1000
🌡️ Temperature: 0.1
--------------------------------------------------
✅ Analysis completed successfully!
📖 POST ANALYSIS RESULTS
1. Main themes:
The main themes in this text excerpt are patriotism, pride in the United States, and excitement about the relocation of the U.S. Space Command headquarters to Huntsville, Alabama. The text also emphasizes the significance of this move by renaming Huntsville as "ROCKET CITY."

2. Emotional tone and sentiment:
The emotional tone of the text is highly positive and enthusiastic. The use of words like "thrilled," "beautiful locale," and "WE LOVE ALABAMA!" conveys a sense of excitement and pride. The writer's sentiment is one of celebration and admiration for the decision to relocate the Space Command headquarters to Huntsville.

3. References to violence:
There are no explicit references to vi

### Understanding the Code Structure

#### **1. Environment Variable Setup**
```python
open_router_api_key = os.getenv("OPENROUTER_API_KEY")
```
- **Secure access**: API key stored in environment variable
- **Error handling**: Graceful failure if key not found
- **Best practice**: Never hardcode sensitive credentials

#### **2. Request Headers**
```python
headers = {
    "Authorization": f"Bearer {open_router_api_key}",
    "Content-Type": "application/json",
    "HTTP-Referer": "https://www.henryhenryhenry.com",
    "X-Title": "UNM 2025 LLM Workshop - Extremist Rhetoric Analysis"
}
```
- **Authorization**: Bearer token authentication
- **Content-Type**: Tells API we're sending JSON data
- **HTTP-Referer**: Identifies your project (optional but recommended)
- **X-Title**: Descriptive title for usage tracking

#### **3. JSON Request Structure**
```python
request_data = {
    "model": "openai/gpt-3.5-turbo",
    "messages": [
        {"role": "system", "content": "System instructions..."},
        {"role": "user", "content": "User query..."}
    ],
    "max_tokens": 1000,
    "temperature": 0.1
}
```
- **model**: Specifies which LLM to use
- **messages**: Conversation format with system and user roles
- **max_tokens**: Limits response length (controls cost)
- **temperature**: Controls sampling randomness (values near 0 give the most consistent output; higher values give more varied output)
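
The payload fields above can be assembled by a small helper so that every request is built the same way. This is a sketch under stated assumptions: `build_payload` is a hypothetical name, and the accepted temperature range of 0 to 2 is an assumption based on common chat-completion APIs, so check your provider's documentation.

```python
def build_payload(system_prompt, user_prompt, model="openai/gpt-3.5-turbo",
                  max_tokens=1000, temperature=0.1):
    """Build a chat-completion request body as a plain Python dict."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be a positive integer")
    # Assumed valid range; providers may differ
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")

    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_payload(
    "You are a political science expert.",
    "Summarize this post in one sentence: [text content here]",
    max_tokens=300,
)
print(payload["model"], payload["max_tokens"], payload["temperature"])
```

The resulting dict can be passed to `requests.post(..., json=payload)` in the same way as `request_data`.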

#### **4. Error Handling**
The code includes comprehensive error handling for:
- **Authentication errors** (401): Invalid API key
- **Rate limiting** (429): Too many requests
- **Network timeouts**: Connection issues
- **JSON parsing errors**: Malformed responses
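
The function above reports a 429 and stops. One common extension, not part of the workshop code, is to retry with exponential backoff. The sketch below assumes a hypothetical zero-argument callable `send` that returns an object with a `status_code` attribute, for example a lambda wrapping `requests.post`.

```python
import time

def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until the status is not 429 or the attempts run out.

    send: hypothetical zero-argument callable returning a response-like object.
    sleep: injectable so the waiting can be replaced in tests.
    """
    response = None
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            # Wait base_delay, then 2x, then 4x, ... before the next attempt
            sleep(base_delay * 2 ** attempt)
    return response

# Example usage (not executed here; url, headers, and request_data as defined earlier):
# response = call_with_retry(
#     lambda: requests.post(url, headers=headers, json=request_data, timeout=30)
# )
```

Passing the request as a callable keeps the retry logic separate from the request details, so the same helper works for any endpoint.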

### Popular Models for Political Science Research

#### **For Analysis Tasks**
```python
# Cost-effective for bulk analysis
"openai/gpt-3.5-turbo"

# More sophisticated analysis
"openai/gpt-4-turbo"

# Large context for long documents
"anthropic/claude-3-sonnet"

# Fast and economical
"google/gemini-2.5-flash"
```

#### **For Multilingual Tasks**
```python
# Strong multilingual capabilities
"openai/gpt-4"

# Good for European languages
"anthropic/claude-3-opus"

# Open source alternative
"meta-llama/llama-3-70b-instruct"
```

### Customizing for Your Research

#### **System Prompts for Different Tasks**
```python
# For sentiment analysis
system_prompt = """You are an expert in sentiment analysis of historical texts.
Analyze the emotional content and provide numerical scores for different emotions."""

# For named entity recognition
system_prompt = """You are a specialist in extracting names, places, and dates
from historical documents. Focus on accurate identification and categorization."""

# For thematic analysis
system_prompt = """You are a literary scholar specializing in thematic analysis.
Identify recurring themes, motifs, and symbolic elements in the text."""
```

#### **Adjusting Parameters for Different Goals**
```python
# For creative interpretation (higher temperature)
request_data["temperature"] = 0.7

# For factual analysis (lower temperature)
request_data["temperature"] = 0.1

# For longer analysis (more tokens)
request_data["max_tokens"] = 2000

# For concise summaries (fewer tokens)
request_data["max_tokens"] = 300
```

### Saving and Managing Results

#### **Save Analysis to File**
```python
def save_analysis_to_file(result, filename):
    """Save analysis results to JSON file"""
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(result, f, indent=2, ensure_ascii=False)
    print(f"✅ Analysis saved to {filename}")
```



### Troubleshooting Common Issues

#### **Authentication Problems**
- Verify that the API key is correct and active. If the key is yours alone, you can print it in a private session and compare it with the key you were issued
- Check environment variable name matches exactly
- Ensure no extra spaces in the API key
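
These checks can be done without exposing the full key. The helpers below are a sketch with hypothetical names (`mask_key`, `has_stray_whitespace`); they show only the last few characters and flag leading or trailing whitespace.

```python
def mask_key(key, visible=4):
    """Return the key with all but the last `visible` characters replaced by '*'."""
    key = key or ""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]

def has_stray_whitespace(key):
    """True if the key has leading or trailing spaces, tabs, or newlines."""
    return key != key.strip()

# A made-up placeholder, not a real key
example_key = "abcdefgh "
print(mask_key(example_key.strip()))
print(has_stray_whitespace(example_key))
```

Comparing the masked tail with the key shown in your provider dashboard is usually enough to confirm that the right key was loaded.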



### Creating generic function for OpenRouter API requests

Now that we have a function with a fixed system prompt, user query, and model, let's create a more generic function that can be used for any OpenRouter API request. It lets you specify the system prompt, user query, model, and API key as arguments.


In [14]:
# Let's define a generic function for OpenRouter API requests.
# The new function get_openrouter_response takes the parameters system_prompt, user_prompt,
# model (defaulting to GPT-3.5 Turbo), and api_key (defaulting to open_router_api_key).
# It works just like analyze_extremist_text_with_openrouter, except that the prompts and model are parameters.

def get_openrouter_response(system_prompt, user_prompt, model="openai/gpt-3.5-turbo", api_key=open_router_api_key):
    """
    Generic function to make requests to OpenRouter API with specified parameters.

    :param system_prompt: The system prompt to guide the model's behavior.
    :param user_prompt: The user query or text to analyze.
    :param model: The model to use for the request (default is GPT-3.5).
    :param api_key: The OpenRouter API key (default is loaded from environment).
    :return: The response from the OpenRouter API.
    """

    # Set up the API endpoint and headers
    url = "https://openrouter.ai/api/v1/chat/completions"

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "HTTP-Referer": "https://www.digitalhumanities.lv/bssdh/2025/",  # Your project URL
        "X-Title": "BSSDH 2025 LLM Workshop - Generic OpenRouter Request"
    }

    # Create the request payload
    request_data = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        "max_tokens": 1000,
        "temperature": 0.5,
        "top_p": 0.9
    }

    # Make the API request
    try:
        response = requests.post(url, headers=headers, json=request_data, timeout=30)
        response.raise_for_status()

        result = response.json()

        if 'choices' in result and len(result['choices']) > 0:
            return result['choices'][0]['message']['content']

        else:
            print("❌ Error: No response returned from the API")
            return None

    except requests.exceptions.RequestException as e:
        print(f"❌ Request Error: {e}")
        return None

# let's test it on a simple political science question

system_prompt = "You are a helpful research assistant that provides concise answers to political science questions. Your answers should be humorous."
user_prompt = "Are Congressional term limits a good idea?"

response = get_openrouter_response(system_prompt, user_prompt) # note we did not pass model or api_key, so it will use defaults of "openai/gpt-3.5-turbo" and open_router_api_key

if response:
    print("Response from OpenRouter API:")
    print(response)
else:
    print("Failed to get a response from OpenRouter API.")

Response from OpenRouter API:
Well, that depends on whether you enjoy watching the same politicians on C-SPAN for decades or if you prefer a revolving door of new faces and ideas. It's like deciding between rewatching your favorite movie a hundred times or taking a chance on a new blockbuster - variety is the spice of life, right?


## Class Exercise - Create your own LLM API request

You have all been shown an API key for the OpenRouter API. Your task is to create a new query that analyzes a document of your choice using the OpenRouter API.

For this exercise, pass your document as a string variable to the `get_openrouter_response` function, and specify the system prompt and user query that you want to use for the analysis.

If you have more experience, try changing the model, or adjust the `temperature` or `max_tokens` parameters to see how they affect the response. Note that changing `temperature` and `max_tokens` requires rewriting `get_openrouter_response` so that it accepts these parameters.

In [15]:
# Adjust both system and content prompts for a class exercise to your liking!
my_system_prompt = "Identify the original author and also the modifications made to the text and who made them."
my_content_prompt = "It was the best of times, it was the blurst of times"
# calling the function with these prompts
response = get_openrouter_response(my_system_prompt, my_content_prompt)
# print the response if we got one
if response:
    print("Response from OpenRouter API:")
    print(response)

Response from OpenRouter API:
The original author of the text is Charles Dickens, from the novel "A Tale of Two Cities." 
The modification made to the text is changing "best" to "blurst." 
The modification was made by the user.


## Best Practices for API usage

#### **1. Documentation**
- **Log all API calls**: Keep records of what models and parameters you used
- **Version control**: Track changes to your analysis methods
- **Reproducible scripts**: Write code that others can run and verify
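
One lightweight way to log API calls is to append one JSON record per call to a JSON Lines file. The sketch below is illustrative: `log_api_call` and the file name are hypothetical, and you may want to record more fields, such as a hash of the prompt.

```python
import json
from datetime import datetime, timezone

def log_api_call(path, model, params, usage):
    """Append one JSON record describing an API call to a .jsonl file."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "params": params,
        "usage": usage,
    }
    # Mode "a" appends, so earlier records are preserved
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record

# Example usage (not executed here):
# log_api_call("api_calls.jsonl", "openai/gpt-3.5-turbo",
#              {"temperature": 0.1, "max_tokens": 1000},
#              {"prompt_tokens": 50, "completion_tokens": 20})
```

Because each line is a complete JSON document, the log can be read back line by line with `json.loads`.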

#### **2. Error Handling**
```python
import requests

try:
    # url, headers, and data are assumed to be defined already
    response = requests.post(url, headers=headers, json=data, timeout=30)
    response.raise_for_status()  # Raises an exception for bad status codes
    result = response.json()
except requests.exceptions.RequestException as e:
    print(f"API request failed: {e}")
```

#### **3. Rate Limiting and Costs**
- **Respect rate limits**: Don't overwhelm the service
- **Monitor usage**: Track your API costs
- **Batch efficiently**: Group similar requests when possible
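
To monitor usage, the token counts in each response can be turned into a rough cost estimate. This is a sketch: `estimate_cost` is a hypothetical helper, and the prices below are placeholders, not real rates, so look up current per-token pricing for the model you use.

```python
def estimate_cost(usage, prompt_price_per_million, completion_price_per_million):
    """Estimate the cost of one call from its usage dict and per-million-token prices."""
    prompt_cost = usage.get("prompt_tokens", 0) * prompt_price_per_million / 1_000_000
    completion_cost = (
        usage.get("completion_tokens", 0) * completion_price_per_million / 1_000_000
    )
    return prompt_cost + completion_cost

usage = {"prompt_tokens": 50, "completion_tokens": 20}

# Placeholder prices, chosen only to make the arithmetic easy to follow
cost = estimate_cost(usage, prompt_price_per_million=1.0, completion_price_per_million=2.0)
print(f"Estimated cost: {cost:.6f}")
```

Summing these estimates over a batch gives a running total that can be compared with the spending limit on your key.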

#### **4. Data Privacy**
- **Sensitive data**: Be cautious with personal or confidential materials
- **Institutional policies**: Check your institution's data use guidelines
- **Terms of service**: Understand how API providers handle your data

### Popular APIs for Political Science Research

#### **LLM APIs**
- **OpenRouter**: Access to multiple models through one interface - what we use in this workshop
- **OpenAI API**: Direct access to GPT models
- **Anthropic API**: Claude models with large context windows
- **Hugging Face API**: Open-source models

### Security Considerations

#### **API Key Management**
- **Never commit keys to version control**
- **Use environment variables** to store sensitive information
- **Rotate keys regularly**
- **Limit key permissions**: where possible, set budgets and access levels

#### **Example: Secure Key Storage**
```python
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Safely access your API key
api_key = os.getenv("OPENROUTER_API_KEY")
if not api_key:
    raise ValueError("API key not found in environment variables")
```

### Testing and Development

#### **Start Small**
1. **Test with short texts** before processing large corpora
2. **Use cheaper models** for initial experiments
3. **Validate outputs** with known examples
4. **Compare multiple models** for the same task
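
For step 3, a simple starting point is the share of texts where the model's label matches a hand-coded label. The sketch below uses made-up labels; `agreement_rate` is a hypothetical helper, and a chance-corrected measure may be more appropriate for a real validation.

```python
def agreement_rate(model_labels, human_labels):
    """Return the fraction of items where the model and human labels match."""
    if len(model_labels) != len(human_labels):
        raise ValueError("Both label lists must have the same length")
    if not human_labels:
        raise ValueError("Label lists must not be empty")
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return matches / len(human_labels)

# Made-up labels for four short test texts
model_labels = ["positive", "negative", "neutral", "positive"]
human_labels = ["positive", "negative", "positive", "positive"]

rate = agreement_rate(model_labels, human_labels)
print(f"Agreement: {rate:.0%}")
```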

#### **API Testing Tools**
- **Postman**: Visual interface for testing API calls
- **curl**: Command-line tool for simple tests
- **Python requests library**: For programmatic testing





## Securely loading API keys in your environment

When working with APIs, especially those that require authentication, it's essential to handle API keys securely. Exposing your API keys can lead to unauthorized access and potential misuse of your account. Here are some best practices for securely loading API keys in your environment:

### 1. Use Environment Variables
Store your API keys in system environment variables instead of hardcoding them in your scripts. This keeps sensitive information out of your codebase and version control.
```python
import os
# Load API key from environment variable
api_key = os.getenv("OPENROUTER_API_KEY")
if not api_key:
    raise ValueError("API key not found in environment variables")
```

Above would require that you set up the environment variable `OPENROUTER_API_KEY` in your operating system or development environment.

Since most of us in this workshop are using Google Colab, this environment variable is not set up for us. Instead, we will use a `.env` file to store our API key and load it using the `python-dotenv` library.

### 2. Use a `.env` File
For local development, you can use a `.env` file to store your environment variables. This file should not be committed to version control (add it to your `.gitignore` file).
```plaintext
# .env file
OPENROUTER_API_KEY=your_api_key_here
```

### 3. Use a Library to Load Environment Variables
Use a library like `python-dotenv` to load environment variables from a `.env` file.
```python
import os
from dotenv import load_dotenv
# Load environment variables from .env file
load_dotenv()
# Now you can access your API key
api_key = os.getenv("OPENROUTER_API_KEY")
if not api_key:
    raise ValueError("API key not found in environment variables")
```

### 4. Keep Your `.env` File Secure
Ensure that your `.env` file is not accessible to unauthorized users. Set appropriate file permissions and avoid sharing it publicly.

### 5. Rotate Your API Keys Regularly
Regularly rotate your API keys to minimize the risk of unauthorized access. Most API providers allow you to generate new keys and revoke old ones.

### 6. Set limits on API Key Usage
If your API provider allows it, set usage limits on your API keys to prevent abuse. This can include rate limiting or restricting access to specific IP addresses or applications. In our workshop, each individual API key has a limit of 1 euro, which is sufficient for the workshop tasks.

### 7. Alternative: Use `getpass`
If you prefer not to use a `.env` file, you can use the `getpass` module to securely prompt for your API key at runtime. This way, the key is not stored in your code or a file.
```python
import getpass
# Prompt for API key securely without showing it on the screen
api_key = getpass.getpass("Enter your OpenRouter API key: ")
if not api_key:
    raise ValueError("API key cannot be empty")
```
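
The environment-variable and `getpass` approaches can be combined: try the environment first and prompt only if nothing is found. This is a sketch; `load_api_key` is a hypothetical helper, and the `prompt` parameter exists only so the interactive step can be swapped out.

```python
import getpass
import os

def load_api_key(env_name="OPENROUTER_API_KEY", prompt=getpass.getpass):
    """Return the API key from the environment, or ask for it without echoing."""
    key = os.getenv(env_name)
    if not key:
        # Nothing in the environment: fall back to an interactive prompt
        key = prompt(f"Enter your {env_name}: ")
    key = key.strip()  # Remove accidental spaces or newlines from copy and paste
    if not key:
        raise ValueError("API key cannot be empty")
    return key

# Example usage (not executed here, because it may prompt for input):
# open_router_api_key = load_api_key()
```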


