# Using LLMs in Humanities Research via API

## Session 2 14.00-15.30 - Working with LLMs via API

Through practical examples, we will explore prompt engineering techniques for tasks such as concept mining and named entity recognition in textual data.

## Session Outline

- **Prompt Engineering**: Techniques for crafting effective prompts to guide LLMs in generating relevant and accurate responses.
- **Concept Mining**: Using LLMs to extract key concepts from text, enabling researchers to identify important themes and ideas.
- **Named Entity Recognition (NER)**: Implementing NER to identify and classify entities in text, such as people, organizations, and locations.



## BSSDH 2025 Workshop Data

Before we start exploring the API, let's take a look at the corpus of documents we will be working with.
The data was prepared for the workshops of the [Baltic Summer School of Digital Humanities 2025](https://www.digitalhumanities.lv/bssdh/2025/about/).

**Repository:** https://github.com/LNB-DH/BSSDH_2025_workshop_data




## CORPUS OVERVIEW


1. SOURCE MATERIAL
------------------

| Periodical | Details |
|------------|---------|
| "Rigasche Zeitung" (RZei) (1918–1919) | - **Data file:** `Rigasche_Zeitung_1918_1919.zip`<br>- **Download Rigasche Zeitung:** https://github.com/LNB-DH/BSSDH_2025_workshop_data/raw/main/data/Rigasche_Zeitung_1918_1919.zip<br>- Morning newspaper, intermittently published from 1778 to 1919 in Riga.<br>- Language: German (Fraktur script)<br>- Once the most popular morning paper in the Baltic provinces of the Russian Empire.<br>- Covered general political and economic news in Riga, the Baltics, the Russian Empire, and internationally.<br>- Historical context: World War I, Latvian War of Independence.<br>- Link: https://periodika.lv/#periodicalMeta:234;-1<br>- More info: https://enciklopedija.lv/skirklis/163962 |
| "Latvian Economic Review" (LERQ) (1936–1940) | - **Data file:** `Latvian_Economic_Review_1936_1940.zip`<br>- **Download Latvian Economic Review:** https://github.com/LNB-DH/BSSDH_2025_workshop_data/raw/main/data/Latvian_Economic_Review_1936_1940.zip<br>- Full title: "Latvian Economic Review: A quarterly review of trade, industry and agriculture".<br>- Language: English (modern)<br>- Published by the Latvian Chamber of Commerce and Industry (established 1934).<br>- Focused on cross-border representation of Latvian economy during the Great Depression, increasing state control, push for autarky, and start of WWII.<br>- Link: https://periodika.lv/#periodicalItem:620 |

2. CORPUS INFORMATION
----------------------

| Metric | RZei | LERQ |
|--------|------|------|
| Token Count (words) | 5.37 million | 0.5 million |
| Issue Count | 359 issues | 18 issues |
| Segment (Article <=> File) Count | 4,597 | 419 |
| Language | German | English |
| Script | Fraktur | Modern |

Filename Structure:
-------------------
Format: `[periodical][year][volume#*][issue#]_[page#]_[[plaintext]]_[segment#]`

Example: `lerq1936s01n02_031_plaintext_s17.txt`
         → 17th segment from LERQ, Issue 2, 1936, page 31.

*The volume value is 1 for every file in the corpus.
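
The naming scheme above lends itself to a regular expression. The following is a minimal sketch, not part of the workshop repository: the function name `parse_segment_filename` and the field names are illustrative, and the pattern assumes that `s01` encodes the volume and `n02` the issue, as the single LERQ example suggests.

```python
import re

# Pattern inferred from the one example above; reading "s" as volume and "n" as issue is an assumption.
FILENAME_PATTERN = re.compile(
    r"^(?P<periodical>[a-z]+)"
    r"(?P<year>\d{4})"
    r"s(?P<volume>\d+)"
    r"n(?P<issue>\d+)"
    r"_(?P<page>\d+)"
    r"_plaintext"
    r"_s(?P<segment>\d+)"
    r"\.txt$"
)

def parse_segment_filename(filename):
    """Return the filename components as a dict, or None if the name does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    parts = match.groupdict()
    # Convert the zero-padded numeric fields from strings to integers.
    for key in ("year", "volume", "issue", "page", "segment"):
        parts[key] = int(parts[key])
    return parts

print(parse_segment_filename("lerq1936s01n02_031_plaintext_s17.txt"))
```

For the example file this yields year 1936, issue 2, page 31 and segment 17, matching the description above. Returning `None` for non-matching names makes it easy to spot files that do not follow the scheme.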

3. METHODOLOGY
---------------

| Step | Description |
|------|-------------|
| 3.1. Source Access | Digitised issues obtained from the National Library of Latvia (https://periodika.lv/) |
| 3.2. Processing & OCR | CCS docWORKS & ABBYY FineReader 9.0<br>- LERQ has better OCR quality than RZei<br>- No further data cleaning/normalization |
| 3.3. Metadata Added | Fields: title, author, uri<br>- Author info available in:<br>&nbsp;&nbsp;&nbsp;&nbsp;LERQ: 4 cases (0.95%)<br>&nbsp;&nbsp;&nbsp;&nbsp;RZei: 325 cases (7.05%)<br>- Title availability:<br>&nbsp;&nbsp;&nbsp;&nbsp;LERQ: 95.7%, RZei: 99.15%<br>- URI coverage: 100% for both<br>- URIs point to LNB DOM system |


## Extracting documents

We could extract the documents manually: download the appropriate zip file and unpack it by *hand*, either with the extraction tools built into the operating system (Windows, for example, has a built-in extractor) or with an external program such as 7-Zip or WinRAR. However, a script makes the process more reproducible and convenient. We supply a URL, and the script downloads the file, extracts it to the appropriate location, and returns a list of the files that were extracted.

### Additional considerations when extracting documents

* Where will the extracted files be stored? Ideally, we keep the same relative directory structure whether we extract locally or on a remote server such as Google Colab.
* How will we handle file names? Usually we keep the original file names, but we might want to add information such as the source or the date of extraction.
* How will we handle errors? We should decide what happens if the file cannot be downloaded or extracted: skip it, or raise an error?
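
One way to make these three decisions explicit is sketched below, using only the standard library. The function name `extract_archive` is hypothetical, and raising a `ValueError` on a broken archive is just one possible policy. To avoid any network access, the demonstration builds a tiny archive in memory instead of downloading one.

```python
from io import BytesIO
from pathlib import Path
from tempfile import TemporaryDirectory
from zipfile import BadZipFile, ZipFile

def extract_archive(zip_bytes, output_dir="data"):
    """Extract an in-memory zip archive and return the paths of the extracted files."""
    target = Path(output_dir)  # a relative path gives the same layout locally and on Colab
    target.mkdir(parents=True, exist_ok=True)
    try:
        with ZipFile(BytesIO(zip_bytes)) as zf:
            zf.extractall(target)  # the original file names are kept unchanged
            names = zf.namelist()
    except BadZipFile as error:
        # Policy chosen for this sketch: fail loudly instead of skipping silently.
        raise ValueError(f"Not a valid zip archive: {error}") from error
    # Directory entries end with "/", so we leave them out of the returned list.
    return [target / name for name in names if not name.endswith("/")]

# Demonstration with a tiny archive built in memory (made-up content).
buffer = BytesIO()
with ZipFile(buffer, "w") as demo_zip:
    demo_zip.writestr("demo/example_s01.txt", "title: Example\n")

with TemporaryDirectory() as tmp:
    extracted = extract_archive(buffer.getvalue(), output_dir=tmp)
    print([path.name for path in extracted])
```

Whether to raise or to skip depends on the workflow: raising is safer for a single required corpus, while skipping may be preferable when looping over many optional archives.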

### Extracting Latvian Economic Review

For this session we will extract the Latvian Economic Review (LERQ) corpus.

We will write a Python function for this. It takes a URL as its argument, downloads the zip file, extracts it, and returns the list of extracted file names. Files go to a default location, but we can specify a different one if needed.

```python
url = "https://github.com/LNB-DH/BSSDH_2025_workshop_data/raw/main/data/Latvian_Economic_Review_1936_1940.zip"
print("Will extract data from", url)

# Next we define a function that downloads and extracts the zip file, so we can reuse it later.
# Optional arguments get default values, so we do not have to specify them on every call.
# In Python, arguments with default values always come after the mandatory ones.
def extract_zip(url, output_dir="data", verbose=False):
    """Download the zip archive at `url`, extract it into `output_dir`, and return the member names."""
    # We could have imported these at the top, but we want to keep the function self-contained.
    # Python caches imported modules, so repeating an import is cheap.
    import requests
    from datetime import datetime
    from io import BytesIO
    from zipfile import ZipFile

    # In verbose mode we print extra information about the download and extraction process.
    # This is useful for debugging and for understanding the flow of the script.
    if verbose:
        download_start = datetime.now()
        # we print the start time including microseconds
        print(f"Starting download at {download_start.strftime('%Y-%m-%d %H:%M:%S.%f')}")
        print("Starting download from", url)
    response = requests.get(url, timeout=60)  # the timeout keeps the notebook from hanging on a stalled connection
    if verbose:
        download_finish = datetime.now()
        print(f"Download finished at {download_finish.strftime('%Y-%m-%d %H:%M:%S.%f')} taking {download_finish - download_start} seconds")
    if response.status_code == 200:
        if verbose:
            extract_start = datetime.now()
            print(f"Starting extraction to {output_dir} at {extract_start.strftime('%Y-%m-%d %H:%M:%S.%f')}")
        # the archive is unpacked straight from memory, without saving the zip file to disk
        with ZipFile(BytesIO(response.content)) as zf:
            zf.extractall(output_dir)
            extracted_names = zf.namelist()
        if verbose:
            extract_end = datetime.now()
            print(f"Extraction finished at {extract_end.strftime('%Y-%m-%d %H:%M:%S.%f')} taking {extract_end - extract_start} seconds")
            print(f"Total time taken: {extract_end - download_start} seconds")
        return extracted_names
    else:
        # a request can fail, e.g. if the URL is incorrect
        print("Failed to download data:", response.status_code)
        return []

# Now that the function is defined, we can call it immediately.
# Only the first argument is mandatory; the rest are optional,
# so we skip output_dir and pass verbose by name.
extracted_names = extract_zip(url, verbose=True)
```

```text
Will extract data from https://github.com/LNB-DH/BSSDH_2025_workshop_data/raw/main/data/Latvian_Economic_Review_1936_1940.zip
Starting download at 2025-08-04 21:45:44.098537
Starting download from https://github.com/LNB-DH/BSSDH_2025_workshop_data/raw/main/data/Latvian_Economic_Review_1936_1940.zip
Download finished at 2025-08-04 21:45:44.895403 taking 0:00:00.796866 seconds
Starting extraction to data at 2025-08-04 21:45:44.895403
Extraction finished at 2025-08-04 21:45:45.059081 taking 0:00:00.163678 seconds
Total time taken: 0:00:00.960544 seconds
```


### Getting information about extracted files

It is good practice to double-check which files were extracted and where they are located. We can do this by listing the contents of the extraction directory with Python's `pathlib` module.
The goal is to confirm that what we extracted matches what we expected, including the file names and their structure.

```python
from pathlib import Path

extract_dir = Path("data")  # a relative path, resolved against the notebook's current working directory
# let's check whether the directory exists and how many entries it contains
if extract_dir.exists():
    files = list(extract_dir.glob("*"))  # top-level entries only: both files and directories
    print(f"Extracted {len(files)} files to {extract_dir}")
    for file in files:
        print(file.name)  # print the name of each entry
else:
    print(f"Directory {extract_dir} does not exist. Please check the extraction process.")
```

```text
Extracted 1 files to data
Latvian_Economic_Review
```


### Getting information about subfolders in the extracted directory

At first glance we have a single file, but it is actually a directory: the zip archive stores all of its files under one top-level folder.

Next we want to count the total number of files and the number of files with the `.txt` extension. This tells us how many files we can work with and whether there are any files that we might want to exclude from our analysis.


```python
# Let's count all files and all *.txt files, including those in subfolders.
# This means we perform a recursive search of the directory.

def analyze_directory_contents(directory_path, verbose=True):
    """
    Analyze the contents of a directory recursively and provide detailed information.

    Args:
        directory_path: Path object or string path to the directory
        verbose: If True, print detailed information about file types and structure

    Returns:
        dict: Dictionary containing analysis results, or None if the directory does not exist
    """
    from pathlib import Path

    directory = Path(directory_path)

    if not directory.exists():
        print(f"Directory {directory} does not exist.")
        return None

    # Get all files and directories recursively
    all_items = list(directory.rglob("*"))

    # Separate files from directories
    files_only = [f for f in all_items if f.is_file()]
    directories_only = [f for f in all_items if f.is_dir()]

    # Count files by extension
    file_extensions = {}
    for file in files_only:
        ext = file.suffix.lower() or '(no extension)'
        file_extensions[ext] = file_extensions.get(ext, 0) + 1

    # Collect .txt files, sorted so that their order is the same on every run
    txt_files = sorted(f for f in files_only if f.suffix.lower() == '.txt')

    # Analysis results
    results = {
        'total_items': len(all_items),
        'total_files': len(files_only),
        'total_directories': len(directories_only),
        'txt_files': len(txt_files),
        'file_extensions': file_extensions,
        'txt_file_paths': txt_files
    }

    if verbose:
        print(f"📁 Directory Analysis: {directory}")
        print("=" * 50)
        print(f"Total items found (files + directories): {results['total_items']}")
        print(f"Total files: {results['total_files']}")
        print(f"Total directories: {results['total_directories']}")
        print(f"Text files (.txt): {results['txt_files']}")
        print()

        print("📊 File Extensions Summary:")
        print("-" * 30)
        for ext, count in sorted(file_extensions.items(), key=lambda x: x[1], reverse=True):
            print(f"  {ext}: {count} files")
        print()

        if directories_only:
            print("📂 Directory Structure:")
            print("-" * 30)
            for subdirectory in sorted(directories_only):
                # Show the path relative to the base directory
                relative_path = subdirectory.relative_to(directory)
                print(f"  📁 {relative_path}")
            print()

        if txt_files:
            # Show at most 10 files, with a heading that says whether the list is complete
            if len(txt_files) > 10:
                print(f"📄 First 10 .txt files (out of {len(txt_files)} total):")
            else:
                print("📄 Sample .txt files:")
            print("-" * 30)
            for txt_file in txt_files[:10]:
                relative_path = txt_file.relative_to(directory)
                file_size = txt_file.stat().st_size
                print(f"  📄 {relative_path} ({file_size:,} bytes)")
            if len(txt_files) > 10:
                print(f"  ... and {len(txt_files) - 10} more .txt files")

    return results

# Now let's analyze our extracted directory
print("Analyzing extracted data directory...")
analysis_results = analyze_directory_contents(extract_dir, verbose=True)
```

```text
Analyzing extracted data directory...
📁 Directory Analysis: data
==================================================
Total items found (files + directories): 420
Total files: 419
Total directories: 1
Text files (.txt): 419

📊 File Extensions Summary:
------------------------------
  .txt: 419 files

📂 Directory Structure:
------------------------------
  📁 Latvian_Economic_Review

📄 First 10 .txt files (out of 419 total):
------------------------------
  📄 Latvian_Economic_Review\lerq1936s01n01_003_plaintext_s01.txt (8,662 bytes)
  📄 Latvian_Economic_Review\lerq1936s01n01_006_plaintext_s02.txt (5,190 bytes)
  📄 Latvian_Economic_Review\lerq1936s01n01_008_plaintext_s03.txt (6,066 bytes)
  📄 Latvian_Economic_Review\lerq1936s01n01_009_plaintext_s04.txt (5,523 bytes)
  📄 Latvian_Economic_Review\lerq1936s01n01_013_plaintext_s05.txt (3,866 bytes)
  📄 Latvian_Economic_Review\lerq1936s01n01_014_plaintext_s06.txt (1,144 bytes)
  📄 Latvian_Economic_Review\lerq1936s01n01_014_plaintext_s07.txt (629 bytes)
  📄 Latvian_Economic_Review\lerq1936s01n01_0
```

```python
# let's get the 6th text file from the 'txt_file_paths' key of the analysis_results dictionary
# why the 6th? because it is one of the smaller files in the listing above
text_files_list = sorted(analysis_results['txt_file_paths'])  # sorted, so the 6th file is the same on every run
if len(text_files_list) >= 6:
    sixth_text_file = text_files_list[5]  # 6th file, because indexing starts at 0
    print(f"6th text file: {sixth_text_file.name} ({sixth_text_file.stat().st_size:,} bytes)")
else:
    print("Fewer than 6 text files found. Please check the extraction process.")
```

```text
6th text file: lerq1936s01n01_014_plaintext_s06.txt (1,144 bytes)
```


```python
# let's print out its contents
with sixth_text_file.open('r', encoding='utf-8') as f:
    content = f.read()  # read the whole file into memory
# the file is closed automatically here, thanks to the with statement
print(f"Contents of {sixth_text_file.name}:\n")
print(content[:500])  # print only the first 500 characters to keep the output short
print("\n... (truncated output)")
```

```text
Contents of lerq1936s01n01_014_plaintext_s06.txt:

title: Gypsum
author: 
uri: http://dom.lndb.lv/data/obj/159411



There are numerous extensive layers of gypseous
stone in Latvia, but only a few of them are being
exploited, viz., the quarries at Kalnciems, Sloka
(about 33 km. from Riga), Salaspils (about 20 km.
from Riga) and Naves sala (about 25 km. from Riga).
The gypseous stone is exported both in raw condition
(gypsum), principally for the manufacture of
cement, and in the form of Plaster of Paris. The export
of gypsum totalled 69,000 tons

... (truncated output)
```
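
The file printed above starts with the `title`, `author` and `uri` metadata lines, followed by the article text. It can be useful to separate the two before sending text to a model. The sketch below assumes that every file begins with `key: value` lines for exactly these three fields, which is inferred from this one example and the methodology table; the function name `split_metadata` is hypothetical.

```python
def split_metadata(raw_text, fields=("title", "author", "uri")):
    """Split a corpus file into a metadata dict and the remaining body text."""
    metadata = {}
    lines = raw_text.splitlines()
    body_start = 0
    for index, line in enumerate(lines):
        # partition splits at the first colon only, so the colon inside a URI stays in the value
        key, separator, value = line.partition(":")
        if separator and key.strip() in fields:
            metadata[key.strip()] = value.strip()
            body_start = index + 1
        else:
            break  # stop at the first line that is not a known metadata field
    body = "\n".join(lines[body_start:]).strip()
    return metadata, body

# Sample built from the beginning of the file shown above.
sample = (
    "title: Gypsum\n"
    "author: \n"
    "uri: http://dom.lndb.lv/data/obj/159411\n"
    "\n\n\n"
    "There are numerous extensive layers of gypseous\n"
    "stone in Latvia, but only a few of them are being\n"
)
metadata, body = split_metadata(sample)
print(metadata)
print(body)
```

Passing only `body` to the model would keep header values such as the title out of the analysis, while `metadata` stays available for labeling the results.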


## Setting up our LLM API functions

```python
# Let's try loading the OpenRouter API key from the system environment variables first,
# then from a .env file,
# and finally we prompt the user to enter it manually if it is still not found.

import os
from getpass import getpass  # works like input(), but does not echo what is typed

print("Checking for OPENROUTER_API_KEY in environment variables...")
# let's try loading the key from the system environment variables
open_router_api_key = os.getenv('OPENROUTER_API_KEY')
# if not found, try loading from a .env file
if not open_router_api_key:
    print("OPENROUTER_API_KEY not found in environment variables. Trying to load from .env file...")
    try:
        from dotenv import load_dotenv
        # Load environment variables from the .env file
        load_dotenv()
        print("Environment variables loaded from .env file.")
    except ImportError:
        print("dotenv module is not installed. You can install it using 'pip install python-dotenv'.")

    open_router_api_key = os.getenv('OPENROUTER_API_KEY')  # we try again after loading the .env file

# if still not found, prompt the user to enter it manually
if not open_router_api_key:
    # getpass keeps the key out of the notebook output, unlike input()
    open_router_api_key = getpass("Please enter your OpenRouter API key: ")
    # Save it to the .env file so that re-running the notebook does not ask for the key again.
    # Note that Google Colab deletes the .env file when the session ends,
    # so you will need to enter the key again next time.
    print("Saving Open Router API key to .env file...")
    with open('.env', 'a') as f:
        f.write(f'OPENROUTER_API_KEY={open_router_api_key}\n')
    print("Open Router API key saved to .env file.")

# we should now have the OpenRouter API key available
if open_router_api_key:
    print("OpenRouter API key loaded successfully.")
else:
    print("OpenRouter API key not found. Please make sure you have it set in your environment variables or .env file.")
    print("You can also enter it manually when prompted during API calls.")

# Key point: the key is stored in the variable open_router_api_key (you can choose a more descriptive name),
# but we never print it to the console or logs, because it is sensitive information.
```

```text
Checking for OPENROUTER_API_KEY in environment variables...
OpenRouter API key loaded successfully.
```


```python
# Let's define a generic function for OpenRouter API requests.
# get_openrouter_response takes the parameters system_prompt and user_prompt,
# model (defaulting to "google/gemini-2.5-flash-lite"), api_key (defaulting to open_router_api_key),
# and the generation settings max_tokens and temperature.
import requests  # we need requests to make API calls

def get_openrouter_response(system_prompt, user_prompt,
                            model="google/gemini-2.5-flash-lite",
                            api_key=open_router_api_key,
                            max_tokens=1000,
                            temperature=0.5):
    """
    Generic function to make requests to OpenRouter API with specified parameters.

    :param system_prompt: The system prompt to guide the model's behavior.
    :param user_prompt: The user query or text to analyze.
    :param model: The model to use for the request (default is "google/gemini-2.5-flash-lite").
    :param api_key: The OpenRouter API key (default is the key loaded above, fixed when the function is defined).
    :param max_tokens: The maximum number of tokens to generate (default is 1000).
    :param temperature: The sampling temperature (default is 0.5).
    :return: The text of the model's reply, or None if the request failed.
    """

    # Set up the API endpoint and headers
    url = "https://openrouter.ai/api/v1/chat/completions"

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "HTTP-Referer": "https://www.digitalhumanities.lv/bssdh/2025/",  # Your project URL
        "X-Title": "BSSDH 2025 LLM Workshop - Generic OpenRouter Request"
    }

    # Create the request payload
    request_data = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": 0.9
    }

    # Make the API request
    try:
        response = requests.post(url, headers=headers, json=request_data, timeout=30)
        response.raise_for_status()  # raise an exception for HTTP error codes (4xx, 5xx)

        result = response.json()

        if result.get('choices'):
            # the reply text is in the first choice
            return result['choices'][0]['message']['content']
        else:
            print("❌ Error: No response returned from the API")
            return None

    except requests.exceptions.RequestException as e:
        print(f"❌ Request Error: {e}")
        return None

# let's test it on a named entity recognition task, using the article we loaded above

system_prompt = "You are a digital humanities researcher specializing in named entity recognition and text analysis. You will analyze the text and provide all named entities as a list."
user_prompt = content

# note we did not pass model or api_key, so the defaults "google/gemini-2.5-flash-lite" and open_router_api_key are used
response = get_openrouter_response(system_prompt, user_prompt)

if response:
    print("Response from OpenRouter API:")
    print(response)
else:
    print("Failed to get a response from OpenRouter API.")
```

```text
Response from OpenRouter API:
Here are the named entities found in the text:

*   Kalnciems
*   Sloka
*   Riga
*   Salaspils
*   Naves sala
*   Plaster of Paris
*   Norway
*   Sweden
*   Denmark
*   Finland
*   England
```

## Adjusting system prompts 

The model returned the named entities, but only as a flat list.
Let's ask for categories as well by extending the system prompt.

```python
# let's adjust our system prompt by adding an extra instruction that asks for entity categories
extra_instruction = "Please categorize the named entities into PERSON, ORGANIZATION, LOCATION, and MISC."
system_prompt = f"{system_prompt} {extra_instruction}"
print(f"Adjusted system prompt: {system_prompt}")
```

```text
Adjusted system prompt: You are a digital humanities researcher specializing in named entity recognition and text analysis. You will analyze the text and provide all named entities as a list. Please categorize the named entities into PERSON, ORGANIZATION, LOCATION, and MISC.
```


```python
# let's see the response with the adjusted system prompt
response = get_openrouter_response(system_prompt, user_prompt)
if response:
    print("Response with adjusted system prompt:")
    print(response)
```

```text
Response with adjusted system prompt:
Here are the named entities from the text, categorized as requested:

**PERSON:**
* None

**ORGANIZATION:**
* None

**LOCATION:**
* Latvia
* Kalnciems
* Sloka
* Riga
* Naves sala
* Norway
* Sweden
* Denmark
* Finland
* England

**MISC:**
* Gypsum
* Plaster of Paris
```
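
The categorized reply above is easy to read but awkward to process further. A common alternative is to ask the model for JSON and parse the reply. The sketch below is illustrative only: the wording of `json_instruction` and the function name `parse_entity_json` are assumptions, and models do not always comply, so the parser strips an optional markdown code fence and returns `None` when the reply is not valid JSON. The example reply is written by hand, not produced by a model.

```python
import json

FENCE = "`" * 3  # markdown code fence marker, built this way to keep the example readable

# Illustrative instruction that could be appended to the system prompt.
json_instruction = (
    "Return only a JSON object with the keys PERSON, ORGANIZATION, LOCATION "
    "and MISC, each holding a list of strings."
)

def parse_entity_json(reply):
    """Parse a JSON reply into a dict; return None if it is not a valid JSON object."""
    text = reply.strip()
    # Some models wrap JSON in a markdown code fence; remove it if present.
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

# Hand-written example reply (not actual model output).
example_reply = (
    FENCE + "json\n"
    '{"PERSON": [], "ORGANIZATION": [], "LOCATION": ["Riga", "Sloka"], "MISC": ["Gypsum"]}\n'
    + FENCE
)
print(parse_entity_json(example_reply))
```

In the notebook, `json_instruction` could take the place of `extra_instruction` in the system prompt, with the result of `get_openrouter_response` passed to `parse_entity_json`.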


## Getting a summary of the text

One of the most basic tasks we can give an LLM is summarizing a text. We reuse `get_openrouter_response` with a new system prompt and the same article.



```python
summary_system_prompt = """You are a digital humanities researcher specializing in text summarization. 
You will summarize the text and provide a concise summary in structured bullet point format."""
response = get_openrouter_response(summary_system_prompt, content)
if response:
    print("Summary of the text:")
    print(response)
```

```text
Summary of the text:
Here's a summary of the provided text about gypsum:

*   **Gypsum Deposits:** Latvia possesses extensive layers of gypseous stone, with active exploitation occurring in quarries at Kalnciems, Sloka, Salaspils, and Naves sala.
*   **Exports:** The gypseous stone is exported both raw (as gypsum) and processed into Plaster of Paris.
*   **Raw Gypsum Exports (1934):**
    *   Totaled 69,000 tons.
    *   Generated Ls 395,000.
    *   Destinations included Norway, Sweden, Denmark, Finland, and England.
    *   The export volume was even higher in the preceding year.
*   **Gypsum Quality:** Latvian gypsum averages 93% purity, with some layers reaching up to 99%.
*   **Reasons for Popularity:** Its firm structure is highly valued abroad as it prevents machinery clogging during milling.
*   **Plaster of Paris Exports (1934):**
    *   Amounted to approximately 6,500 tons.
    *   Generated Ls 85,000.
    *   A 100% increase in exports was anticipated for the following year
```