# LLMs in Political Science Research via API

## Session 2 - Working with LLMs via API

Through practical examples, we will explore prompt engineering techniques for tasks such as concept mining and named entity recognition in textual data.

## Session Outline

- **Prompt Engineering**: Techniques for crafting effective prompts to guide LLMs in generating relevant and accurate responses.
- **Concept Mining**: Using LLMs to extract key concepts from text, enabling researchers to identify important themes and ideas.
- **Named Entity Recognition (NER)**: Implementing NER to identify and classify entities in text, such as people, organizations, and locations.



## UNM 2025 Workshop Data

Before we start exploring the API, let's take a look at the corpus of documents we will be working with.

**Repository:** https://github.com/LNB-DH/BSSDH_2025_workshop_data




## CORPUS OVERVIEW


1. SOURCE MATERIAL
------------------

| Periodical | Details |
|------------|---------|
| "Rigasche Zeitung" (RZei) (1918–1919) | - **Data file:** `Rigasche_Zeitung_1918_1919.zip`<br>- **Download Rigasche Zeitung:** https://github.com/LNB-DH/BSSDH_2025_workshop_data/raw/main/data/Rigasche_Zeitung_1918_1919.zip<br>- Morning newspaper, intermittently published from 1778 to 1919 in Riga.<br>- Language: German (Fraktur script)<br>- Once the most popular morning paper in the Baltic provinces of the Russian Empire.<br>- Covered general political and economic news in Riga, the Baltics, the Russian Empire, and internationally.<br>- Historical context: World War I, Latvian War of Independence.<br>- Link: https://periodika.lv/#periodicalMeta:234;-1<br>- More info: https://enciklopedija.lv/skirklis/163962 |
| "Latvian Economic Review" (LERQ) (1936–1940) | - **Data file:** `Latvian_Economic_Review_1936_1940.zip`<br>- **Download Latvian Economic Review:** https://github.com/LNB-DH/BSSDH_2025_workshop_data/raw/main/data/Latvian_Economic_Review_1936_1940.zip<br>- Full title: "Latvian Economic Review: A quarterly review of trade, industry and agriculture".<br>- Language: English (modern)<br>- Published by the Latvian Chamber of Commerce and Industry (established 1934).<br>- Focused on cross-border representation of Latvian economy during the Great Depression, increasing state control, push for autarky, and start of WWII.<br>- Link: https://periodika.lv/#periodicalItem:620 |

2. CORPUS INFORMATION
----------------------

| Metric | RZei | LERQ |
|--------|------|------|
| Token Count (words) | 5.37 million | 0.5 million |
| Issue Count | 359 issues | 18 issues |
| Segment (Article <=> File) Count | 4,597 | 419 |
| Language | German | English |
| Script | Fraktur | Modern |

Filename Structure:
-------------------
Format: [periodical][year][volume#*][issue#]_[page#]_[[plaintext]]_[segment#]

Example: `lerq1936s01n02_031_plaintext_s17.txt`
         → 17th segment from LERQ, Issue 2, 1936, page 31.

*Volume value in corpus is one in all cases.
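
The filename structure can be decoded with a regular expression. The sketch below is inferred from the single documented example, so the `s` (volume) and `n` (issue) prefixes are assumptions rather than a published specification, and `parse_segment_filename` is a hypothetical helper name. RZei filenames may need a different pattern.

```python
import re

# Pattern inferred from the documented example only (an assumption).
SEGMENT_NAME = re.compile(
    r"^(?P<periodical>[a-z]+)"
    r"(?P<year>\d{4})"
    r"s(?P<volume>\d+)"
    r"n(?P<issue>\d+)"
    r"_(?P<page>\d+)"
    r"_plaintext"
    r"_s(?P<segment>\d+)\.txt$"
)

def parse_segment_filename(name):
    """Return the filename fields as a dict, or None if the name does not match."""
    match = SEGMENT_NAME.match(name)
    if match is None:
        return None
    # Convert zero-padded numeric fields to integers; keep the periodical code as text.
    return {key: (value if key == "periodical" else int(value))
            for key, value in match.groupdict().items()}

parsed = parse_segment_filename("lerq1936s01n02_031_plaintext_s17.txt")
print(parsed)
# {'periodical': 'lerq', 'year': 1936, 'volume': 1, 'issue': 2, 'page': 31, 'segment': 17}
```

Returning `None` for non-matching names makes it easy to spot files that do not follow the expected convention.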

3. METHODOLOGY
---------------

| Step | Description |
|------|-------------|
| 3.1. Source Access | Digitised issues obtained from the National Library of Latvia (https://periodika.lv/) |
| 3.2. Processing & OCR | CCS docWORKS & ABBYY FineReader 9.0<br>- LERQ has better OCR quality than RZei<br>- No further data cleaning/normalization |
| 3.3. Metadata Added | Fields: title, author, uri<br>- Author info available in:<br>&nbsp;&nbsp;&nbsp;&nbsp;LERQ: 4 cases (0.95%)<br>&nbsp;&nbsp;&nbsp;&nbsp;RZei: 325 cases (7.05%)<br>- Title availability:<br>&nbsp;&nbsp;&nbsp;&nbsp;LERQ: 95.7%, RZei: 99.15%<br>- URI coverage: 100% for both<br>- URIs point to LNB DOM system |


## Extracting documents

We could extract the documents manually by downloading the appropriate zip file and unpacking it by hand, using either the extractor built into the operating system (Windows has one) or an external program such as 7-Zip or WinRAR. However, a script is more reproducible and more convenient: we supply a URL, and the script downloads the archive and extracts it to the appropriate location. Afterwards we can list the extracted files to check the result.

### Additional considerations when extracting documents

* Where will the extracted files be stored? - Ideally we keep the same relative directory structure whether we extract locally or on a remote server such as Google Colab.
* How will we handle file names? - Usually we would like to keep the original file names, but we might want to add some additional information such as source or date of extraction.
* How will we handle errors? - We should consider what to do if the file cannot be downloaded or extracted. Should we skip it or raise an error?
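
One possible way to address these three points is sketched below. It is an illustration under stated assumptions, not the workshop's own function: `extract_zip_bytes` and its `on_error` parameter are hypothetical names, the archive is passed in as bytes (so the download step is left out), original file names are kept, and the caller chooses whether a broken archive raises an error or is skipped.

```python
import tempfile
from io import BytesIO
from pathlib import Path
from zipfile import BadZipFile, ZipFile

def extract_zip_bytes(data, output_dir="data", on_error="raise"):
    """Extract an in-memory zip archive and return the paths of the extracted files.

    Hypothetical helper: on_error="raise" re-raises a bad archive,
    on_error="skip" returns an empty list instead.
    """
    target = Path(output_dir)
    try:
        with ZipFile(BytesIO(data)) as zf:
            zf.extractall(target)
            # Original file names are kept; directory entries end with "/".
            return [target / name for name in zf.namelist() if not name.endswith("/")]
    except BadZipFile:
        if on_error == "skip":
            return []
        raise

# Build a tiny archive in memory so the sketch runs without any download.
buffer = BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("demo/a.txt", "hello")

with tempfile.TemporaryDirectory() as tmp:
    paths = extract_zip_bytes(buffer.getvalue(), output_dir=tmp)
    names = [path.name for path in paths]
    first_text = paths[0].read_text(encoding="utf-8")

print(names, first_text)
```

Using a relative `output_dir` keeps the layout identical on a laptop and on Colab, since both resolve it against the notebook's working directory.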

### Extracting Latvian Economic Review

For this session we will extract the Latvian Economic Review (LERQ) corpus.

We will write a Python function that takes the URL of the zip file, downloads it, and extracts it. Files go to a default location (`data`) unless we specify a different one.

```python
url = "https://github.com/LNB-DH/BSSDH_2025_workshop_data/raw/main/data/Latvian_Economic_Review_1936_1940.zip"
print("Will extract data from", url)
# next we define a function that will download and extract the zip file, so we can reuse it later if needed
# we can also set default values for some arguments, so we do not have to specify them on every call
# arguments with default values always come after the mandatory arguments in Python functions
def extract_zip(url, output_dir="data", verbose=False):
    # we could have imported these at the top, but we want to keep the script self-contained
    import requests  # this should be cached by notebooks, so it **should** not require importing it every time
    from zipfile import ZipFile
    from io import BytesIO

    # In verbose mode we print some extra information about the download and extraction process
    # This is useful for debugging and understanding the flow of the script
    from datetime import datetime
    if verbose:
        download_start = datetime.now()
        # we print start time including milliseconds
        print(f"Starting download at {download_start.strftime('%Y-%m-%d %H:%M:%S.%f')}")
        print("Starting download from", url)
    response = requests.get(url)
    if verbose:
        download_finish = datetime.now()
        print(f"Download finished at {download_finish.strftime('%Y-%m-%d %H:%M:%S.%f')} taking {download_finish - download_start} seconds")
    if response.status_code == 200: # it is possible a request fails, e.g. if the URL is incorrect
        if verbose:
            extract_start = datetime.now()
            print(f"Starting extraction to {output_dir} at {extract_start.strftime('%Y-%m-%d %H:%M:%S.%f')}")
        with ZipFile(BytesIO(response.content)) as zf:
            zf.extractall(output_dir)
        if verbose:
            extract_end = datetime.now()
            print(f"Extraction finished at {extract_end.strftime('%Y-%m-%d %H:%M:%S.%f')} taking {extract_end - extract_start} seconds")
            print(f"Total time taken: {extract_end - download_start} seconds")
    else:
        print("Failed to download data:", response.status_code)

# now that we have our function defined, we can call it immediately
# note we do not supply all arguments, first one is mandatory, the rest are optional
# so we skip over output_dir in this case
extract_zip(url, verbose=True)
```

```text
Will extract data from https://github.com/LNB-DH/BSSDH_2025_workshop_data/raw/main/data/Latvian_Economic_Review_1936_1940.zip
Starting download at 2025-09-05 15:08:49.630621
Starting download from https://github.com/LNB-DH/BSSDH_2025_workshop_data/raw/main/data/Latvian_Economic_Review_1936_1940.zip
Download finished at 2025-09-05 15:08:50.091615 taking 0:00:00.460994 seconds
Starting extraction to data at 2025-09-05 15:08:50.092022
Extraction finished at 2025-09-05 15:08:50.181857 taking 0:00:00.089835 seconds
Total time taken: 0:00:00.551236 seconds
```


### Getting information about extracted files

It is good practice to double-check which files were extracted and where they are located. We can do this by listing the contents of the directory we extracted to, using Python's `pathlib`.
The goal is to confirm that what we extracted matches what we expected, including the file names and their structure.

```python
from pathlib import Path
extract_dir = Path("data") # note this is a relative path, relative to the current working directory of the notebook
# let's check if the directory exists and how many items it contains
if extract_dir.exists():
    files = list(extract_dir.glob("*"))  # top-level entries only: this includes directories as well as files
    print(f"Extracted {len(files)} files to {extract_dir}")
    for file in files:
        print(file.name)  # print the name of each file
else:
    print(f"Directory {extract_dir} does not exist. Please check the extraction process.")
```

```text
Extracted 1 files to data
Latvian_Economic_Review
```


### Getting information about subfolders in the extracted directory
It looks like we extracted a single file, but that entry is actually a directory: the zip archive stores all of its files under one top-level folder.

Next we want to count the total number of files and the number of files with the `.txt` extension. This tells us how many files we can work with and whether there are any we might want to exclude from our analysis.


```python
# let's check how many files we have in total and how many have the .txt extension, counting all subfolders
# this means we will perform a recursive search for all files in the directory

def analyze_directory_contents(directory_path, verbose=True):
    """
    Analyze the contents of a directory recursively and provide detailed information.

    Args:
        directory_path: Path object or string path to the directory
        verbose: If True, print detailed information about file types and structure

    Returns:
        dict: Dictionary containing analysis results
    """
    from pathlib import Path

    directory = Path(directory_path)

    if not directory.exists():
        print(f"Directory {directory} does not exist.")
        return None

    # Get all files and directories recursively
    all_files = list(directory.rglob("*"))

    # Separate files from directories
    files_only = [f for f in all_files if f.is_file()]
    directories_only = [f for f in all_files if f.is_dir()]

    # Count files by extension
    file_extensions = {}
    for file in files_only:
        ext = file.suffix.lower()
        if ext == '':
            ext = '(no extension)'
        file_extensions[ext] = file_extensions.get(ext, 0) + 1

    # Count .txt files specifically
    txt_files = [f for f in files_only if f.suffix.lower() == '.txt']

    # Analysis results
    results = {
        'total_items': len(all_files),
        'total_files': len(files_only),
        'total_directories': len(directories_only),
        'txt_files': len(txt_files),
        'file_extensions': file_extensions,
        'txt_file_paths': txt_files
    }

    if verbose:
        print(f"📁 Directory Analysis: {directory}")
        print("=" * 50)
        print(f"Total items found (files + directories): {results['total_items']}")
        print(f"Total files: {results['total_files']}")
        print(f"Total directories: {results['total_directories']}")
        print(f"Text files (.txt): {results['txt_files']}")
        print()

        print("📊 File Extensions Summary:")
        print("-" * 30)
        for ext, count in sorted(file_extensions.items(), key=lambda x: x[1], reverse=True):
            print(f"  {ext}: {count} files")
        print()

        if len(directories_only) > 0:
            print("📂 Directory Structure:")
            print("-" * 30)
            for subdirectory in sorted(directories_only):
                # Show relative path from the base directory
                relative_path = subdirectory.relative_to(directory_path)
                print(f"  📁 {relative_path}")
            print()

        if len(txt_files) > 0:
            # Show at most the first 10 .txt files, with a header that says so
            if len(txt_files) <= 10:
                print("📄 Sample .txt files:")
            else:
                print(f"📄 First 10 .txt files (out of {len(txt_files)} total):")
            print("-" * 30)
            for txt_file in sorted(txt_files)[:10]:
                relative_path = txt_file.relative_to(directory_path)
                file_size = txt_file.stat().st_size
                print(f"  📄 {relative_path} ({file_size:,} bytes)")
            if len(txt_files) > 10:
                print(f"  ... and {len(txt_files) - 10} more .txt files")

    return results

# Now let's analyze our extracted directory
print("Analyzing extracted data directory...")
analysis_results = analyze_directory_contents(extract_dir, verbose=True)
```

```text
Analyzing extracted data directory...
📁 Directory Analysis: data
Total items found (files + directories): 420
Total files: 419
Total directories: 1
Text files (.txt): 419

📊 File Extensions Summary:
------------------------------
  .txt: 419 files

📂 Directory Structure:
------------------------------
  📁 Latvian_Economic_Review

📄 First 10 .txt files (out of 419 total):
------------------------------
  📄 Latvian_Economic_Review/lerq1936s01n01_003_plaintext_s01.txt (8,662 bytes)
  📄 Latvian_Economic_Review/lerq1936s01n01_006_plaintext_s02.txt (5,190 bytes)
  📄 Latvian_Economic_Review/lerq1936s01n01_008_plaintext_s03.txt (6,066 bytes)
  📄 Latvian_Economic_Review/lerq1936s01n01_009_plaintext_s04.txt (5,523 bytes)
  📄 Latvian_Economic_Review/lerq1936s01n01_013_plaintext_s05.txt (3,866 bytes)
  📄 Latvian_Economic_Review/lerq1936s01n01_014_plaintext_s06.txt (1,144 bytes)
  📄 Latvian_Economic_Review/lerq1936s01n01_014_plaintext_s07.txt (629 bytes)
  📄 Latvian_Economic_Review/lerq1936s01n01_0
```

```python
# let's get the 8th text file from the 'txt_file_paths' key of the analysis_results dictionary
text_files_list = sorted(analysis_results['txt_file_paths'])
if len(text_files_list) >= 8:
    eighth_text_file = text_files_list[7]  # 8th file, index starts from 0
    print(f"8th text file: {eighth_text_file.name} ({eighth_text_file.stat().st_size:,} bytes)")
```

```text
8th text file: lerq1936s01n01_015_plaintext_s08.txt (7,520 bytes)
```


```python
# let's print out its contents
with eighth_text_file.open('r', encoding='utf-8') as f:
    content = f.read() # read whole file content into memory
# file is closed here automatically due to the with statement
print(f"Contents of {eighth_text_file.name}:\n")
print(content[:500])  # print first 500 characters to avoid too much output
print("\n... (truncated output)")
```

```text
Contents of lerq1936s01n01_015_plaintext_s08.txt:

title: Central Union of Latvian Cooperative Dairy Societies
author: 
uri: http://dom.lndb.lv/data/obj/159411



The central union of Latvian Cooperative dairy
societies, established in August 1921, is the only
central organisation for promoting the dairy industry
in this country. It was founded by 13 of the 18
dairy societies then in existence. The number of
members has increased considerably in the meantime,
totalling at present 208 of the 289 dairy societies in
operation, or 72% of the coopera

... (truncated output)
```
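
The sample above suggests that each segment file begins with `title:`, `author:`, and `uri:` lines, followed by blank lines and then the article text. Assuming that layout holds for the other files (which has not been checked here), a small helper can separate the metadata from the body so that only the article text is sent to the model. `split_metadata` is a hypothetical name used for illustration.

```python
def split_metadata(text):
    """Split a segment file into (metadata dict, body text).

    Assumes the file starts with "key: value" lines for title, author and uri;
    other layouts are not handled.
    """
    metadata = {}
    lines = text.splitlines()
    index = 0
    for index, line in enumerate(lines):
        key, separator, value = line.partition(":")
        if separator and key.strip() in {"title", "author", "uri"}:
            metadata[key.strip()] = value.strip()
        else:
            break  # the first line that is not a known header ends the header block
    else:
        index = len(lines)  # every line was a header (or the text was empty)
    body = "\n".join(lines[index:]).strip()
    return metadata, body

sample = (
    "title: Central Union of Latvian Cooperative Dairy Societies\n"
    "author: \n"
    "uri: http://dom.lndb.lv/data/obj/159411\n"
    "\n\n\n"
    "The central union of Latvian Cooperative dairy\n"
    "societies, established in August 1921, is the only"
)
meta, body = split_metadata(sample)
print(meta["title"])
print(repr(meta["author"]))  # empty string when no author is recorded
```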


## Setting up our LLM API functions

```python
# we look for the key in the environment first and prompt the user to enter it manually if it is not found
import getpass
import os

# reuse a key already defined in this session; otherwise check the OPENROUTER_API_KEY environment variable
if 'open_router_api_key' not in locals() or not open_router_api_key:
    open_router_api_key = os.environ.get("OPENROUTER_API_KEY", "")

# if still not found, prompt user to enter it manually
if not open_router_api_key:
    open_router_api_key = getpass.getpass("Please enter your OpenRouter API key: ")
    # save it to .env file for future use
    # note Google Colab will destroy .env file after session ends, so you will need to enter it again next time
    # this can be useful if you re-run the notebook and want to avoid entering the key again
    print("Saving Open Router API key to .env file...")
    with open('.env', 'a') as f:
        f.write(f'OPENROUTER_API_KEY={open_router_api_key}\n')
    print("Open Router API key saved to .env file.")

# we now should have the OpenRouter API key available
if open_router_api_key:
    print("OpenRouter API key loaded successfully.")
else:
    print("OpenRouter API key not found. Please make sure you have it set in your environment variables or .env file.")
    print("You can also enter it manually when prompted during API calls.")

# key point: we never print the key; it is stored in the variable open_router_api_key (you can choose a more descriptive name)
# do not print it to the console or logs, as it is sensitive information
```

```text
Please enter your OpenRouter API key: ··········
Saving Open Router API key to .env file...
Open Router API key saved to .env file.
OpenRouter API key loaded successfully.
```


### About the model - Google: Gemini 2.5 Flash Lite

Unlike the previous session, where we used the OpenAI API, this session uses the [Google Gemini 2.5 Flash Lite model](https://openrouter.ai/google/gemini-2.5-flash-lite) via the OpenRouter API. We could have chosen many other models, but this one is lightweight, fast, and **inexpensive**, which is ideal for our use case.

**Model ID:** `google/gemini-2.5-flash-lite`  
**Created:** July 22, 2025  
**Context Window:** 1,048,576 tokens  
**Pricing:**  
- **Input tokens:** $0.10 per million  
- **Output tokens:** $0.40 per million  

**Tags:**  
- Legal (#4)  
- Marketing/SEO (#4)  
- Translation (#9)  

**Description:**  
Gemini 2.5 Flash Lite is a lightweight reasoning model in the Gemini 2.5 family, optimized for **ultra-low latency** and **cost efficiency**. It offers improved throughput, faster token generation, and better performance across common benchmarks compared to earlier Flash models.

By default, "thinking" (i.e., multi-pass reasoning) is disabled to prioritize speed, but developers can enable it via the **Reasoning API** parameter to selectively trade off cost for intelligence.
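
To get a feel for what these prices mean in practice, here is a rough cost estimate for one pass over the LERQ corpus. The prices and corpus figures come from this document; the ratio of about 1.3 tokens per word and the 500 output tokens per segment are assumptions made only for illustration, and the system prompt tokens sent with every request are ignored.

```python
# Prices quoted above for google/gemini-2.5-flash-lite (USD per million tokens).
INPUT_PRICE_PER_M = 0.10
OUTPUT_PRICE_PER_M = 0.40

def estimate_cost(input_tokens, output_tokens):
    """Return the estimated cost in USD for the given token counts."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Corpus figures from the overview table: about 0.5 million words in 419 segments.
WORDS = 500_000
SEGMENTS = 419

# Assumptions (not from the source): ~1.3 tokens per word, ~500 output tokens per segment.
input_tokens = WORDS * 13 // 10
output_tokens = SEGMENTS * 500

cost = estimate_cost(input_tokens, output_tokens)
print(f"Estimated cost for one pass over LERQ: ${cost:.2f}")
# Estimated cost for one pass over LERQ: $0.15
```

Actual token counts depend on the model's tokenizer, so the figure should be read as an order-of-magnitude estimate only.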


```python
# let's define a generic function for OpenRouter API requests
# get_openrouter_response takes a system prompt and a user prompt,
# plus a model (defaulting to google/gemini-2.5-flash-lite) and an api_key (defaulting to open_router_api_key)
import requests  # we need requests to make API calls

def get_openrouter_response(system_prompt, user_prompt,
                            model="google/gemini-2.5-flash-lite",
                            api_key=open_router_api_key,
                            max_tokens=4096,
                            temperature=0.5):
    """
    Generic function to make requests to OpenRouter API with specified parameters.

    :param system_prompt: The system prompt to guide the model's behavior.
    :param user_prompt: The user query or text to analyze.
    :param model: The model to use for the request (default is google/gemini-2.5-flash-lite).
    :param api_key: The OpenRouter API key (default is the open_router_api_key variable defined earlier).
    :param max_tokens: The maximum number of tokens the model may generate.
    :param temperature: The sampling temperature; lower values give less varied output.
    :return: The text of the model's reply, or None if the request fails.
    """

    # Set up the API endpoint and headers
    url = "https://openrouter.ai/api/v1/chat/completions"

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "HTTP-Referer": "https://github.com/cmhenry/unm_workshop_2025",  # Your project URL
        "X-Title": "UNM 2025 LLM Workshop - Generic OpenRouter Request"
    }

    # Create the request payload
    request_data = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": 0.9
    }

    # Make the API request
    try:
        response = requests.post(url, headers=headers, json=request_data, timeout=30)
        response.raise_for_status()

        result = response.json()

        if 'choices' in result and len(result['choices']) > 0:
            return result['choices'][0]['message']['content']

        else:
            print("❌ Error: No response returned from the API")
            return None

    except requests.exceptions.RequestException as e:
        print(f"❌ Request Error: {e}")
        return None

system_prompt = """You are a political science researcher specializing in named entity recognition and text analysis.
Named entities are cities, persons, organizations, etc. that are mentioned in the text.
You will analyze the text and provide all named entities as a list."""
user_prompt = content

response = get_openrouter_response(system_prompt, user_prompt) # note we did not pass model or api_key, so it will use the defaults: "google/gemini-2.5-flash-lite" and open_router_api_key

if response:
    print("Response from OpenRouter API:")
    print(response)
else:
    print("Failed to get a response from OpenRouter API.")
```

```text
Response from OpenRouter API:
* Central Union of Latvian Cooperative Dairy Societies
* Latvia
* P/S „Latvijas centralais sviesta eksports" (United Butter Export of Latvia, Ltd.)
* Ministry of Agriculture
* Riga
* Latvian University
* Riga Strand
```


## Adjusting system prompts

We got our named entities but let's also have categories for them.
We can do that by adjusting our system prompt to include categories for named entities.

```python
# let's adjust our system prompt by adding extra instruction to add categories for named entities.
extra_instruction = "Please categorize the named entities into PERSON, ORGANIZATION, LOCATION, and MISC."
system_prompt = f"{system_prompt} {extra_instruction}"
print(f"Adjusted system prompt: {system_prompt}")
```

```text
Adjusted system prompt: You are a political science researcher specializing in named entity recognition and text analysis.
Named entities are cities, persons, organizations, etc. that are mentioned in the text.
You will analyze the text and provide all named entities as a list. Please categorize the named entities into PERSON, ORGANIZATION, LOCATION, and MISC.
```


```python
# let's see our response with adjusted system prompt
response = get_openrouter_response(system_prompt, user_prompt)
if response:
    print("Response with adjusted system prompt:")
    print(response)
```

```text
Response with adjusted system prompt:
Here's a breakdown of the named entities in the text:

**ORGANIZATION**

*   Central Union of Latvian Cooperative Dairy Societies
*   P/S „Latvijas centralais sviesta eksports" (United Butter Export of Latvia, Ltd.)
*   Ministry of Agriculture
*   microbiological institute of the Latvian University

**LOCATION**

*   Latvia
*   Riga

**MISC**

*   Ls (Latvian currency unit)
```
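
The categorized reply is Markdown text, so it needs a little parsing before it can be counted or tabulated. The sketch below assumes the model keeps using bold category headings followed by bullet items, as in the response above; nothing guarantees that format, so treat this as an illustration. `parse_categorized_entities` is a hypothetical helper name.

```python
import re
from collections import defaultdict

HEADING = re.compile(r"\*\*([^*]+)\*\*")   # a line such as **ORGANIZATION**
BULLET = re.compile(r"[*-]\s+(.*)")        # a line such as "*   Riga"

def parse_categorized_entities(response_text):
    """Group bullet items under the most recent bold category heading."""
    grouped = defaultdict(list)
    current = None
    for raw_line in response_text.splitlines():
        line = raw_line.strip()
        heading = HEADING.fullmatch(line)
        if heading:
            current = heading.group(1).strip()
            continue
        bullet = BULLET.match(line)
        if bullet and current is not None:
            grouped[current].append(bullet.group(1).strip())
    return dict(grouped)

sample = """Here's a breakdown of the named entities in the text:

**ORGANIZATION**

*   Ministry of Agriculture

**LOCATION**

*   Latvia
*   Riga
"""
entities = parse_categorized_entities(sample)
print(entities)
# {'ORGANIZATION': ['Ministry of Agriculture'], 'LOCATION': ['Latvia', 'Riga']}
```

Asking the model for JSON output is usually a more robust alternative to parsing free-form Markdown.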


## Getting a summary of the text

One of the basic tasks we can do with LLMs is to get a summary of the text.



```python
summary_system_prompt = """You are a political science researcher specializing in text summarization.
You will summarize the text and provide a concise summary in structured bullet point format."""
response = get_openrouter_response(summary_system_prompt, content)
if response:
    print("Summary of the text:")
    print(response)
```

```text
Summary of the text:
Here's a summary of the provided text about the Central Union of Latvian Cooperative Dairy Societies:

*   **Establishment and Membership:** Founded in August 1921 by 13 dairy societies, it is the sole central organization for Latvia's dairy industry. It currently represents 72% of the nation's dairy societies (208 out of 289). Membership has decreased due to the world depression and low dairy prices, leading smaller dairies to merge.

*   **Butter Export and Marketing:** The Union has been successful in promoting butter exports and local consumption. Up to the end of 1934, it facilitated the sale of nearly 95 million kg of butter. In late 1934, a government-backed share company, P/S „Latvijas centralais sviesta eksports" (United Butter Export of Latvia, Ltd.), was formed for dairy export marketing. The Central Union owns 67% of this company, which handles the sole export and local marketing rights for Latvian dairy products.

*   **Supply and Development Activitie
```

## Concept mining

For those unfamiliar with the term, concept mining is the process of extracting key concepts from a text.

Concept mining involves automatically identifying and extracting abstract ideas or key concepts from large amounts of unstructured text data. It goes beyond surface-level keyword extraction by attempting to detect the underlying themes, topics, or conceptual entities present in the text -- including those that might not be explicitly named but are implied or paraphrased.

For example, we might extract "the rise of the internet" rather than just "internet", "agricultural products" or "food production" rather than "potatoes", and "food anxiety" rather than just "food".



```python
# let's make a system prompt for concept mining
system_prompt = """You are a political science researcher specializing in concept mining.
You will analyze the text and extract key concepts, themes, and patterns as a structured list.
Please provide the concepts in a clear and concise manner, categorizing them into relevant themes."""
response = get_openrouter_response(system_prompt, content)
if response:
    print("Concepts extracted from the text:")
    print(response)
```

```text
Concepts extracted from the text:
Here are the key concepts, themes, and patterns extracted from the provided text, categorized for clarity:

**I. Organizational Structure and History**

*   **Establishment:** Founded in August 1921.
*   **Founding Members:** 13 out of 18 existing dairy societies.
*   **Current Membership:** 208 out of 289 dairy societies (72% of the cooperative dairy industry).
*   **Purpose:** Sole central organization for promoting the dairy industry in Latvia.
*   **Administrative Organs:**
    *   Assembly of authorized representatives (supreme administrative organ).
    *   Council (15 members, elected by the assembly).
    *   Board (Chairman and four members, elected from the council, administers property and conducts business).
*   **Role as a Link:** Connects the countryside (farmers) and the metropolis (consumers).

**II. Economic Context and Challenges**

*   **World Depression:** Impacted membership due to decreased dairy societies.
*   **Low Butter Prices
```

## 🧠 Concept Categories for 1930s Latvian Economic Reports

The LLM output above gave us a structured but still flexible list of concepts. Usually we want to focus on a specific set of concepts relevant to our research. Below is a list of categories we can use to organize the concepts extracted from the 1930s Latvian Economic Reports.

### 🏦 Finance & Banking
- Financial stability  
- Monetary policy  
- Currency regulation  
- Exchange rates  
- Foreign investment  
- Credit availability  
- Central banking  
- Gold reserves  

### 📈 Trade & Commerce
- Trade liberalization  
- Export promotion  
- Import substitution  
- Customs tariffs  
- Balance of trade  
- Foreign trade agreements  
- Trade protectionism  
- Trade surplus/deficit  
- Transit trade  
- Port activity (e.g., Riga, Liepāja)  

### 🚜 Agriculture & Rural Economy
- Agricultural modernization  
- Land reform  
- Crop yields  
- Agricultural exports  
- Collective farming (if present)  
- Peasant cooperatives  
- Grain storage and reserves  
- Rural credit  
- Mechanization of agriculture  
- Dairy and livestock production  

### 🏭 Industry & Infrastructure
- Industrialization  
- Manufacturing output  
- Raw material imports  
- Industrial policy  
- State-owned enterprises  
- Energy production (electricity, fuel)  
- Infrastructure investment  
- Transportation development  
- Railway modernization  
- Telecommunications expansion  

### 👷 Labor & Employment
- Unemployment  
- Labor migration  
- Wages and cost of living  
- Labor policy  
- Social insurance  
- Vocational training  
- Workforce productivity  
- State employment programs  

### 💰 Prices & Markets
- Price stability  
- Inflation control  
- Consumer goods availability  
- Market regulation  
- Food pricing  
- Speculation control  

### 🏛️ Economic Policy & Governance
- Five-year plans (if applicable)  
- Corporatism  
- Authoritarian economic planning  
- State intervention  
- Public-private partnerships  
- Fiscal policy  
- Budget deficit  
- Taxation policy  
- Economic nationalism  

### 🌍 International Relations & Geopolitics
- Trade with Germany, USSR, UK, Sweden, etc.  
- Regional trade blocs (e.g., Baltic Entente)  
- Neutrality in foreign policy  
- Geopolitical pressures on trade  
- Sanctions or economic treaties  

### 🧑‍🏫 Education & Human Capital
- Economic education  
- Technical schools  
- Business training  
- Scientific research for industry  
- Demographic skills gap  

### 🧱 Development & Urbanization
- Urban planning  
- Rural-urban migration  
- Public works  
- Housing policy  
- Regional economic disparity  
- Municipal finance  

---

### 🎯 High-Level & Cross-Cutting Concepts
- Economic self-sufficiency  
- National economic strategy  
- Resilience to global crisis  
- Preparation for war economy  
- Currency sovereignty  
- Export dependency  
- Shadow economy (if mentioned)  
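
To use these categories in prompts, it can help to keep them in a data structure rather than retyping them. The sketch below stores two of the categories from the list above in a dictionary and renders one of them as a bullet list for a system prompt. `CONCEPT_CATEGORIES` and `subcategory_bullets` are hypothetical names, and the remaining categories would be added in the same way.

```python
# Two categories copied from the list above; add the others in the same way.
CONCEPT_CATEGORIES = {
    "Prices & Markets": [
        "Price stability",
        "Inflation control",
        "Consumer goods availability",
        "Market regulation",
        "Food pricing",
        "Speculation control",
    ],
    "Development & Urbanization": [
        "Urban planning",
        "Rural-urban migration",
        "Public works",
        "Housing policy",
        "Regional economic disparity",
        "Municipal finance",
    ],
}

def subcategory_bullets(category, categories=CONCEPT_CATEGORIES):
    """Return the subcategories of one category as a Markdown bullet list."""
    return "\n".join(f"- {name}" for name in categories[category])

prompt_fragment = (
    "Focus only on concepts connected to the following subcategories:\n"
    + subcategory_bullets("Prices & Markets")
)
print(prompt_fragment)
```

Keeping the list in one place means the same subcategory names can be reused later to validate what the model returns.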

## Focusing on specific concepts - agriculture

For this workshop let's focus on the agriculture category. We will extract concepts related to agriculture from the 1930s Latvian Economic Reports and analyze them in more detail.

### Making a larger system prompt

Our system prompt will be a bit more complex because we name the specific category we are interested in. We also give an example of the output format we want the LLM to return, so that the results are more structured and easier to work with.
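
A structured format pays off once the response is parsed in code. The sketch below is an illustration rather than part of the workshop code. It assumes the model returns a JSON list of objects with `phrase`, `explanation`, and `subcategory` keys, possibly wrapped in a Markdown code fence, and it returns an empty list when the text is not valid JSON (for example, when a response is cut off by the token limit). `parse_concepts` and `REQUIRED_KEYS` are hypothetical names.

```python
import json
import re

FENCE = "`" * 3  # three backticks, the Markdown code-fence marker
REQUIRED_KEYS = {"phrase", "explanation", "subcategory"}
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)

def parse_concepts(response_text):
    """Return the well-formed concept dicts found in an LLM response."""
    text = response_text.strip()
    fenced = FENCED_BLOCK.search(text)
    if fenced:
        text = fenced.group(1)  # keep only what is inside the code fence
    try:
        items = json.loads(text)
    except json.JSONDecodeError:
        return []  # not valid JSON, e.g. a truncated response
    if not isinstance(items, list):
        return []
    # Keep only dicts that carry all three expected keys.
    return [item for item in items
            if isinstance(item, dict) and REQUIRED_KEYS.issubset(item)]

sample = (
    FENCE + "json\n"
    '[{"phrase": "promoting the dairy industry", '
    '"explanation": "Development of dairy farming.", '
    '"subcategory": "Dairy and livestock production"}]\n'
    + FENCE
)
concepts = parse_concepts(sample)
print(concepts[0]["subcategory"])
```

Returning an empty list on malformed input keeps a batch run going, although logging the failed file for a later retry would be a sensible addition.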


```python
agriculture_system_prompt = """You are an expert assistant trained to analyze historical economic texts from 1930s Latvia.
Your task is to read the input document or paragraph and extract any **concepts** that relate specifically to the domain of **Agriculture & Rural Economy**.
Focus only on concepts connected to the following subcategories:
- Agricultural modernization
- Land reform
- Crop yields
- Agricultural exports
- Collective farming (if present)
- Peasant cooperatives
- Grain storage and reserves
- Rural credit
- Mechanization of agriculture
- Dairy and livestock production
- Other agricultural practices

For each concept you extract, return:
- The **exact phrase** or **paraphrased expression** found in the text
- A **brief explanation** (1–2 sentences) summarizing the concept’s meaning or context
- The most relevant **subcategory** from the list above that it matches

Only return relevant agricultural or rural economy concepts. If no relevant concept is found, return an empty list.

Use the following output format (in JSON):

[
  {
    "phrase": "introduction of American tractors",
    "explanation": "Refers to adoption of mechanized equipment to improve farming productivity.",
    "subcategory": "Mechanization of agriculture"
  },
  {
    "phrase": "state-supported grain warehouses",
    "explanation": "Refers to government involvement in securing grain storage for future needs or trade.",
    "subcategory": "Grain storage and reserves"
  }
]
"""

response = get_openrouter_response(agriculture_system_prompt, content)
if response:
    print("Agriculture-related concepts extracted from the text:")
    print(response)
```

Agriculture-related concepts extracted from the text:
```json
[
  {
    "phrase": "promoting the dairy industry",
    "explanation": "Focuses on the development and advancement of dairy farming and related businesses.",
    "subcategory": "Dairy and livestock production"
  },
  {
    "phrase": "low prices fetched by dairy products in the world market",
    "explanation": "Indicates the impact of global market conditions on the profitability of dairy farming.",
    "subcategory": "Dairy and livestock production"
  },
  {
    "phrase": "facilitate and augment the export of butter",
    "explanation": "Refers to efforts to increase the volume and ease of selling butter to foreign markets.",
    "subcategory": "Agricultural exports"
  },
  {
    "phrase": "promoting the consumption of this commodity in the local market",
    "explanation": "Describes activities aimed at increasing the sales of dairy products within the country.",
    "subcategory": "Dairy and livestock production"
  },
  {
```

## Fine-tuning the system prompt and adjusting model

The LLM output is well structured and clear concepts are extracted -- but what if we wanted to adjust our system prompt to be more specific about the concepts we want to extract?

We can also change the model or its parameters to suit the task better. For example, we can raise the maximum number of tokens to allow longer, more detailed responses, or lower the temperature to make the model more conservative in its responses.
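
As a rough sketch of what these parameters look like in practice, the request body below follows the OpenAI-style chat completion schema, in which `temperature` and `max_tokens` are standard fields. The helper name `build_chat_payload` and its default values are assumptions made for illustration; whether `get_openrouter_response` exposes these parameters depends on how it was defined earlier in the notebook.

```python
import json

def build_chat_payload(system_prompt, user_content,
                       model="google/gemini-2.5-flash",
                       temperature=0.0, max_tokens=2000):
    """Build an OpenAI-style chat completion request body (hypothetical helper)."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
        "temperature": temperature,  # lower values give more conservative, repeatable output
        "max_tokens": max_tokens,    # upper bound on the length of the reply
    }

# Illustrative inputs only; in the notebook these would be the system prompt and document text
payload = build_chat_payload("Extract agricultural concepts as JSON.",
                             "Butter exports rose in 1936.")
print(json.dumps(payload, indent=2))
```

Keeping the payload construction in one small function makes it easy to compare runs that differ only in temperature or token limit.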

In [15]:
### let's try a different model first - let's try the best (and most expensive) current Gemini model as of mid 2025 - 2.5 Pro

response = get_openrouter_response(agriculture_system_prompt, content, model="google/gemini-2.5-pro")
if response:
    print("Agriculture-related concepts extracted from the text using Gemini 2.5 Pro:")
    print(response)

Agriculture-related concepts extracted from the text using Gemini 2.5 Pro:
```json
[
  {
    "phrase": "central union of Latvian Cooperative dairy societies",
    "explanation": "This refers to the central organization representing and coordinating the activities of numerous local cooperative dairy societies, which were owned by farmers.",
    "subcategory": "Peasant cooperatives"
  },
  {
    "phrase": "smaller dairies to join together and form larger units",
    "explanation": "This describes a trend of consolidation among dairy cooperatives to improve product quality and reduce production costs in response to economic pressures.",
    "subcategory": "Agricultural modernization"
  },
  {
    "phrase": "facilitate and augment the export of butter",
    "explanation": "A key activity of the Central Union was to increase the sale of Latvian butter in foreign markets, highlighting the importance of agricultural exports to the economy.",
    "subcategory": "Agricultural exports"
  },
  {


### Trying other models - finding middle ground between speed and accuracy

The Pro model provided an accurate answer, namely that there are no agricultural products in the text. However, it is neither fast nor cheap. The above query cost us about 0.01 USD. One cent might not seem like much, but if we want to analyze a large corpus of documents, the costs add up quickly.
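
To make the scaling concrete, here is a back-of-the-envelope estimate. It assumes every document costs roughly the same as the single query above, which is a simplification because the price depends on document length and on the length of the reply. The corpus sizes are hypothetical.

```python
def estimate_cost(n_documents, cost_per_document_usd):
    """Linear cost estimate: assumes every document costs about the same."""
    return n_documents * cost_per_document_usd

pro_cost_per_doc = 0.01  # USD, the approximate cost of the single query above

# Hypothetical corpus sizes, chosen only to show how the total grows
for n_docs in (30, 500, 10_000):
    total = estimate_cost(n_docs, pro_cost_per_doc)
    print(f"{n_docs:>6} documents: about {total:,.2f} USD")
```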

So let's see if we can find a model that is faster and cheaper but still provides accurate results. We will try different models and see how they perform on our task.

Let's try the current `golden mean`: [Google Gemini 2.5 Flash](https://openrouter.ai/google/gemini-2.5-flash).

To do so, we simply pass the model name to the `get_openrouter_response` function.



In [None]:
# let's try with google/gemini-2.5-flash model
response = get_openrouter_response(agriculture_system_prompt, content, model="google/gemini-2.5-flash")
if response:
    print("Agriculture-related concepts extracted from the text using Gemini 2.5 Flash:")
    print(response)

Agriculture-related concepts extracted from the text using Gemini 2.5 Flash:
[]


## Running model across multiple documents - our corpus

The response looks good, and it is about 10x cheaper than Gemini 2.5 Pro, mostly because we limit the reasoning output, which we do not need for a concept mining task. However, the question remains: what will happen when we run the model across multiple documents? Will we get many false positives, or will the model be able to filter out the noise and provide us with relevant concepts?

Now that we have results from a single document, we can run the model across multiple documents in our corpus. We will iterate over the files in the directory where we extracted the LERQ corpus and apply the same system prompt to each file.

I usually like to run my prompts on a small sample before running them on the full corpus. So let's pick 30 documents at random from the corpus and see how the model performs.

In [16]:
# let's pick random 30 files from the corpus and run the model across them
def run_model_across_corpus(directory_path, system_prompt,
                            model="google/gemini-2.5-flash",
                            sample_size=30,
                            seed=2025,
                            verbose=True,
                            delay=0.1):
    """
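    Run the model across a random sample of files in the specified directory.

    :param directory_path: Path to the directory containing text files.
    :param system_prompt: The system prompt to guide the model's behavior.
    :param model: The model to use for the request (default is Gemini 2.5 Flash).
    :param sample_size: Number of files to sample from the directory.
    :param seed: Seed for the random sample (None gives a different sample on each run).
    :param verbose: If True, print progress information for each file.
    :param delay: Delay in seconds between API calls.
    :return: Dictionary mapping file names to the model's responses.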
    """
    from pathlib import Path
    from datetime import datetime
    import time
    from tqdm import tqdm  # for progress bar
    import random  # for random sampling

    directory = Path(directory_path)
    txt_files = list(directory.glob("**/*.txt"))  # Get all .txt files recursively

    if verbose:
        print(f"Found {len(txt_files)} text files in the directory {directory}.")

    if len(txt_files) < sample_size:
        print(f"Not enough files in the directory. Found {len(txt_files)} files, but requested {sample_size}.")
        return {}  # empty dictionary, so the return type matches the normal case

    if seed is not None:
        random.seed(seed) # this ensures reproducibility of the random sample
        # most random operations in computer science are not truly random, but rather pseudo-random

    sampled_files = random.sample(txt_files, sample_size)  # Randomly sample files
    responses = {} # we will use a dictionary to store responses with file names as keys

    for file in tqdm(sampled_files):
        with file.open('r', encoding='utf-8') as f:
            content = f.read()
        if verbose:
            now = datetime.now()
            # let's print the time in a human-readable format
            print(f"{now.strftime('%Y-%m-%d %H:%M:%S')}  Processing file: {file.name} ({file.stat().st_size:,} bytes)")

        response = get_openrouter_response(system_prompt, content, model=model)
        if response:
            responses[file.name] = response
        else:
            print(f"Failed to get a response for {file.name}")

        time.sleep(delay)  # Delay to avoid hitting API rate limits if necessary

    return responses

In [17]:
# what was our extract_dir again?
# we actually want the subdirectory with Latvian_Economic_Review folder
extract_dir = Path("data/Latvian_Economic_Review")  # Adjusted to point to the correct subdirectory
responses = run_model_across_corpus(extract_dir, agriculture_system_prompt,
                                    model="google/gemini-2.5-flash", # actually the default but let's be explicit
                                    sample_size=30, # again default is 30, but we can change it
                                    seed=2025, # specific seed for reproducibility
                                    verbose=False # set to True to see progress and processing information
                                    )


100%|██████████| 30/30 [01:08<00:00,  2.28s/it]


## Analyzing responses

Now that we have 30 responses, let's see whether the model was able to filter out the noise and provide us with relevant concepts.

The main thing to remember is the format of our responses. Each response is a string, ideally in JSON format, that contains the concepts extracted from the text. We have stored all responses in a dictionary with the file name as the key and the response as the value.

In [18]:
# how many responses do we have?
print(f"Got {len(responses)} responses from the model across the sampled files.")

Got 30 responses from the model across the sampled files.


In [19]:
# let's print the first 3 responses from our responses dictionary
for i, (file_name, response) in enumerate(responses.items()):
    if i < 3:  # Print only the first 3 responses
        print(f"File: {file_name}")
        print(f"Response: {response}\n")
    else:
        break


File: lerq1936s01n01_008_plaintext_s03.txt
Response: [
  {
    "phrase": "flax monopoly",
    "explanation": "A state-controlled system for the production, collection, and sale of flax, indicating government intervention in a specific agricultural commodity market.",
    "subcategory": "Agricultural exports"
  },
  {
    "phrase": "cultivation of flax",
    "explanation": "Refers to the agricultural practice of growing flax, highlighting its suitability to the local soil and climate.",
    "subcategory": "Other agricultural practices"
  },
  {
    "phrase": "growers' organisations",
    "explanation": "Indicates the involvement and cooperation of farmer groups in the flax collection process.",
    "subcategory": "Peasant cooperatives"
  },
  {
    "phrase": "licensed flax collecting stations",
    "explanation": "Designated points where flax is purchased from growers, part of the centralized collection system.",
    "subcategory": "Other agricultural practices"
  },
  {
    "phrase": "

In [20]:
# let's see if we got any responses at all meaning responses that are not [] or None
for file_name, response in responses.items():
    if response and response != "[]":
        print(f"File: {file_name} has a valid response.")
    else:
        print(f"File: {file_name} has no valid response.")

File: lerq1936s01n01_008_plaintext_s03.txt has a valid response.
File: lerq1937s01n06_041_plaintext_s29.txt has no valid response.
File: lerq1936s01n03_023_plaintext_s13.txt has no valid response.
File: lerq1938s01n01_021_plaintext_s08.txt has a valid response.
File: lerq1940s01n02_014_plaintext_s05.txt has a valid response.
File: lerq1936s01n03_007_plaintext_s02.txt has a valid response.
File: lerq1938s01n01_038_plaintext_s29.txt has no valid response.
File: lerq1937s01n05_039_plaintext_s19.txt has a valid response.
File: lerq1940s01n01_024_plaintext_s09.txt has a valid response.
File: lerq1937s01n07_004_plaintext_s02.txt has a valid response.
File: lerq1936s01n01_009_plaintext_s04.txt has no valid response.
File: lerq1940s01n02_023_plaintext_s11.txt has a valid response.
File: lerq1937s01n06_014_plaintext_s14.txt has a valid response.
File: lerq1937s01n08_011_plaintext_s03.txt has a valid response.
File: lerq1936s01n03_008_plaintext_s03.txt has no valid response.
File: lerq1936s01n03

In [21]:
# Let's get valid responses only and print them
valid_response = {}
for file_name, response in responses.items():
    if response and response != "[]":
        valid_response[file_name] = response
# how many valid responses do we have?
print(f"Got {len(valid_response)} valid responses from the model across the sampled files.")

Got 18 valid responses from the model across the sampled files.


In [22]:
# let's print first 3 valid responses
for i, (file_name, response) in enumerate(valid_response.items()):
    if i < 3:  # Print only the first 3 valid responses
        print(f"File: {file_name}")
        print(f"Response: {response}\n")
    else:
        break

File: lerq1936s01n01_008_plaintext_s03.txt
Response: [
  {
    "phrase": "flax monopoly",
    "explanation": "A state-controlled system for the production, collection, and sale of flax, indicating government intervention in a specific agricultural commodity market.",
    "subcategory": "Agricultural exports"
  },
  {
    "phrase": "cultivation of flax",
    "explanation": "Refers to the agricultural practice of growing flax, highlighting its suitability to the local soil and climate.",
    "subcategory": "Other agricultural practices"
  },
  {
    "phrase": "growers' organisations",
    "explanation": "Indicates the involvement and cooperation of farmer groups in the flax collection process.",
    "subcategory": "Peasant cooperatives"
  },
  {
    "phrase": "licensed flax collecting stations",
    "explanation": "Designated points where flax is purchased from growers, part of the centralized collection system.",
    "subcategory": "Other agricultural practices"
  },
  {
    "phrase": "

## Converting responses from JSON to Python dictionary

Our LLM responses are formatted as JSON strings, so we need to convert them to Python dictionaries and/or lists for easier manipulation and analysis. We can use the `json` module in Python to do this.



In [23]:
# let's go through all valid responses and convert them from JSON strings to Python data structures
# this will also be a good test of LLM output stability
import json
converted_responses = {}
bad_responses = {}
for file_name, response in valid_response.items():
    try:
        converted_responses[file_name] = json.loads(response)  # Convert JSON string to Python data structure
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON for file {file_name}: {e}")
        bad_responses[file_name] = response

# let's print the first 3 converted responses
for i, (file_name, response) in enumerate(converted_responses.items()):
    if i < 3:  # Print only the first 3 converted responses
        print(f"File: {file_name}")
        print(f"Response: {response}\n")
    else:
        break

Error decoding JSON for file lerq1938s01n01_021_plaintext_s08.txt: Expecting value: line 1 column 1 (char 0)
Error decoding JSON for file lerq1940s01n02_014_plaintext_s05.txt: Expecting value: line 1 column 1 (char 0)
Error decoding JSON for file lerq1936s01n03_007_plaintext_s02.txt: Expecting value: line 1 column 1 (char 0)
Error decoding JSON for file lerq1937s01n05_039_plaintext_s19.txt: Expecting value: line 1 column 1 (char 0)
Error decoding JSON for file lerq1940s01n01_024_plaintext_s09.txt: Expecting value: line 1 column 1 (char 0)
Error decoding JSON for file lerq1937s01n07_004_plaintext_s02.txt: Expecting value: line 1 column 1 (char 0)
Error decoding JSON for file lerq1940s01n02_023_plaintext_s11.txt: Expecting value: line 1 column 1 (char 0)
Error decoding JSON for file lerq1937s01n06_014_plaintext_s14.txt: Expecting value: line 1 column 1 (char 0)
Error decoding JSON for file lerq1937s01n08_011_plaintext_s03.txt: Expecting value: line 1 column 1 (char 0)
Error decoding JSON

In [24]:
# let's print the bad responses
for file_name, response in bad_responses.items():
    print(f"File: {file_name} has a bad response: {response}")

File: lerq1938s01n01_021_plaintext_s08.txt has a bad response: ```json
[
  {
    "phrase": "first agricultural society in 1886",
    "explanation": "Refers to the establishment of an organization focused on agricultural development and cooperation.",
    "subcategory": "Peasant cooperatives"
  },
  {
    "phrase": "first threshing machine society in 1888",
    "explanation": "Indicates the formation of a cooperative for the shared use of agricultural machinery, specifically threshing machines.",
    "subcategory": "Mechanization of agriculture"
  },
  {
    "phrase": "first cattle control society in 1901",
    "explanation": "Suggests the creation of a cooperative focused on managing or improving livestock, likely related to breeding, health, or quality control.",
    "subcategory": "Dairy and livestock production"
  },
  {
    "phrase": "first co-operative dairy in 1909",
    "explanation": "Marks the beginning of collective efforts in dairy production and processing.",
    "subcatego

### Fixing bad responses

We see that these bad responses are not truly bad: they just start with \```json and end with \```. We can fix this by removing the \```json and \``` from the start and end of the response string.

In [25]:
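# let's go through all responses and convert them from JSON strings to Python data structures
# we will also strip ```json and ``` from the start and end of the response string
good_responses = {}
bad_responses = {}
for file_name, response in responses.items():
    # note: str.lstrip("```json") strips any of the characters `, j, s, o, n rather than
    # the exact prefix, so we use removeprefix/removesuffix instead (Python 3.9+)
    response = response.strip()  # Remove surrounding whitespace first
    response = response.removeprefix("```json")  # Remove ```json from the start
    response = response.removesuffix("```")  # Remove ``` from the end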
    try:
        good_responses[file_name] = json.loads(response)  # Convert JSON string to Python data structure
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON for file {file_name}: {e}")
        bad_responses[file_name] = response

Error decoding JSON for file lerq1936s01n03_007_plaintext_s02.txt: Expecting value: line 1 column 1 (char 0)
Error decoding JSON for file lerq1937s01n06_014_plaintext_s14.txt: Expecting value: line 1 column 1 (char 0)
Error decoding JSON for file lerq1939s01n02_035_plaintext_s15.txt: Expecting value: line 1 column 1 (char 0)


In [26]:
# let's print the remaining bad responses
for file_name, response in bad_responses.items():
    print(f"File: {file_name} has a bad response: {response}")

File: lerq1936s01n03_007_plaintext_s02.txt has a bad response: I am sorry, but the provided text does not contain any concepts related to Agriculture & Rural Economy. The text focuses exclusively on the shipping industry, freight rates, and the "Baltwhite Timber Scheme."
File: lerq1937s01n06_014_plaintext_s14.txt has a bad response: I'm sorry, but I could not find any concepts related to "Agriculture & Rural Economy" in the provided text. The document focuses exclusively on the development of international and internal air traffic in Latvia during the 1930s.
File: lerq1939s01n02_035_plaintext_s15.txt has a bad response: I'm sorry, but I cannot find any concepts related to Agriculture & Rural Economy in the provided text. The text primarily discusses the geological exploration and exploitation of natural resources in Latvia, such as minerals, peat, clay, and building stones, for industrial purposes. It does not mention agricultural practices, land reform, crop yields, or other related t

### Fixing bad JSON

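We have fixed the easy cases. The remaining bad responses shown above are not JSON at all: instead of the empty list we asked for, the model replied with a prose apology. Another common failure is genuinely malformed JSON, for example when double quotes inside a string are not escaped properly. In that case we can repair the string by replacing the offending double quotes with escaped double quotes.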

Alternatively we could apply stricter instructions to the model to return valid JSON. However, this would require us to adjust our system prompt and possibly the model parameters as well.
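
A minimal sketch of such stricter instructions is shown below. The wording of `STRICT_JSON_RULES` and the helper `make_strict_prompt` are hypothetical, and there is no guarantee that a given model will obey them, so it is worth testing on a small sample first.

```python
# Hypothetical wording; adjust it to your task and verify on a small sample
STRICT_JSON_RULES = """
Output rules:
- Return ONLY a JSON array, with no Markdown code fences and no text before or after it.
- If no relevant concept is found, return exactly [] and nothing else (no apology or explanation).
- Escape any double quotes inside string values with a backslash.
"""

def make_strict_prompt(base_prompt):
    """Append the strict output rules to an existing system prompt."""
    return base_prompt.rstrip() + "\n" + STRICT_JSON_RULES

# Stand-in for the real system prompt used in this notebook
base_prompt = "You are an expert assistant trained to analyze historical economic texts.\n\n"
strict_prompt = make_strict_prompt(base_prompt)
print(strict_prompt)
```

Keeping the rules in a separate constant lets us reuse them with any task-specific prompt.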

We have a third option, which is to send the response back to the LLM with a different prompt and have the LLM fix the JSON.

In [27]:
# let's fix the malformed JSON responses
fix_json_prompt = """Please fix the following JSON response which should be an array of objects where double quotes are not escaped properly.
The response should be proper JSON string without ```json start and without ``` end with all other content intact."""
fixed_responses = {}
for file_name, response in bad_responses.items():
    # we will use the same get_openrouter_response function to fix the JSON
    fixed_response = get_openrouter_response(fix_json_prompt, response, model="google/gemini-2.5-flash")
    print(f"Fixed response for {file_name}: {fixed_response}")
    # try parsing the response to ensure it is valid JSON
    try:
        json.loads(fixed_response)  # Validate the fixed response
    except (json.JSONDecodeError, TypeError) as e:  # TypeError covers a None response
        print(f"Error decoding fixed JSON for file {file_name}: {e}")
        fixed_response = None
    if fixed_response:
        fixed_responses[file_name] = fixed_response
    else:
        print(f"Failed to fix JSON for {file_name}")

Fixed response for lerq1936s01n03_007_plaintext_s02.txt: ```json
[
  {
    "concept": "Shipping Industry",
    "description": "The overarching industry discussed, encompassing all aspects of maritime transport."
  },
  {
    "concept": "Freight Rates",
    "description": "The cost charged by a carrier for transporting goods, a key economic indicator within the shipping industry."
  },
  {
    "concept": "Baltwhite Timber Scheme",
    "description": "A specific, named scheme related to the timber trade within the Baltic region, likely impacting shipping and freight."
  }
]
```
Error decoding fixed JSON for file lerq1936s01n03_007_plaintext_s02.txt: Expecting value: line 1 column 1 (char 0)
Failed to fix JSON for lerq1936s01n03_007_plaintext_s02.txt
Fixed response for lerq1937s01n06_014_plaintext_s14.txt: ```json
[
  {
    "concept": "International Air Traffic",
    "description": "Development of international air traffic in Latvia during the 1930s.",
    "keywords": ["international", "a

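In the output above, the repair prompt received a prose apology rather than broken JSON, and the model answered with a newly invented list of concepts, again wrapped in a code fence. A cheaper and more predictable alternative is to handle such replies locally. The sketch below is one possible approach, not the only one: the helper name `parse_concepts` is hypothetical, and treating a reply without any JSON array as "no concepts found" is an assumption that should be spot-checked by hand.

```python
import json

def parse_concepts(raw):
    """Return a list of concepts, or None if the reply needs manual review.

    Assumption: a reply that contains no JSON array at all (for example a
    prose apology) means that the model found no concepts.
    """
    if not raw:
        return None  # empty or missing reply
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end < start:
        return []  # no array in the reply: treat as "nothing found"
    try:
        data = json.loads(raw[start:end + 1])  # ignores code fences around the array
    except json.JSONDecodeError:
        return None  # looks like JSON but does not parse
    return data if isinstance(data, list) else None

# Made-up replies that mimic the three situations seen above
fenced = "```json\n[{\"phrase\": \"flax monopoly\"}]\n```"
apology = "I'm sorry, but I could not find any relevant concepts in the provided text."
broken = '[{"phrase": "bad "quote""}]'

for reply in (fenced, apology, broken):
    print(parse_concepts(reply))
```

Replies that come back as `None` can then be collected for manual inspection or passed to a repair step.
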
## Running custom system prompt on specific folder

Now that we have experimented with a small sample, let's create a new function that takes the following parameters:


- system_prompt - our custom instructions
- file_folder - the folder containing the files to be analyzed
- file_extension - the extension of the files to process, with .txt as the default
- max_files - the maximum number of files to process, with None as the default, meaning unlimited files
- seed - the random seed, with None as the default; it is only used when max_files is not None
- model - the model to use, with "google/gemini-2.5-flash" as the default

We could add a few more parameters, such as temperature and max tokens, but this is enough to get started.


In [None]:
# we want a function that takes our system prompt and a file folder and saves all responses in save_folder,
# using the original file name plus a custom suffix
# - system_prompt - our custom instructions
# - file_folder - the folder containing the files to be analyzed
# - file_extension - .txt by default
# - max_files - None by default, meaning unlimited files
# - seed - None by default; used for random sampling when max_files is not None
# - model - "google/gemini-2.5-flash" by default
# - save_folder - "responses" by default
# - we could specify a save suffix, but instead let's use a filename-friendly version of the model name: we replace / and - with _
import json
import os
import random
import time
from pathlib import Path
from datetime import datetime
from tqdm import tqdm

def save_responses_from_folder(system_prompt,
                               file_folder,
                               file_extension=".txt",
                               max_files=None,
                               seed=None,
                               model="google/gemini-2.5-flash",
                               save_folder="responses",
                               delay=0.1,
                               verbose=True):
    """
    Processes files in a folder using an LLM and saves the responses.

    :param system_prompt: The system prompt to guide the model's behavior.
    :param file_folder: Path to the directory containing files to process.
    :param file_extension: The extension of the files to process (default is .txt).
    :param max_files: Maximum number of files to process (default is None for all files).
    :param seed: Seed for random sampling if max_files is not None.
    :param model: The model to use for the request (default is google/gemini-2.5-flash).
    :param save_folder: The folder to save the responses (default is "responses").
    :param delay: Delay in seconds between API calls to avoid rate limits.
    :param verbose: If True, print detailed information during processing.
    """
    file_folder_path = Path(file_folder)
    save_folder_path = Path(save_folder)

    if not file_folder_path.exists():
        print(f"Error: File folder '{file_folder}' does not exist.")
        return

    # Create save folder if it doesn't exist
    save_folder_path.mkdir(parents=True, exist_ok=True)

    # Create a filename-friendly version of the model name
    model_suffix = model.replace("/", "_").replace("-", "_")

    # Find files with the specified extension
    all_files = sorted(file_folder_path.rglob(f"*{file_extension}"))

    if verbose:
        print(f"Found {len(all_files)} '{file_extension}' files in '{file_folder}'.")

    files_to_process = all_files
    if seed is not None and max_files is not None and max_files < len(all_files):
        random.seed(seed)
        files_to_process = random.sample(all_files, max_files)
        if verbose:
            print(f"Obtaining a random sample of {len(files_to_process)} files.")
    if seed is None and max_files is not None:
        files_to_process = files_to_process[:max_files]


    if verbose:
        print(f"Processing a  {len(files_to_process)} files.")

    if not files_to_process:
        print("No files to process.")
        return

    for file in tqdm(files_to_process, desc="Processing files"):
        if verbose:
            now = datetime.now()
            print(f"\n{now.strftime('%Y-%m-%d %H:%M:%S')}  Processing file: {file.relative_to(file_folder_path)} ({file.stat().st_size:,} bytes)")

        # Construct the output filename
        output_filename = save_folder_path / f"{file.stem}_{model_suffix}{file.suffix}"


        # Create parent directories for the output file if they don't exist
        output_filename.parent.mkdir(parents=True, exist_ok=True)

        # Check if response already exists
        if output_filename.exists():
            if verbose:
                print(f"  Response already exists for {file.name}, skipping.")
            continue


        with file.open('r', encoding='utf-8') as f:
            content = f.read()

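        response = get_openrouter_response(system_prompt, content, model=model)

        # The helper may return None or an empty string when the request fails;
        # skip saving so that the file is retried on the next run
        if not response:
            print(f"  Failed to get a response for {file.name}, skipping.")
            continue

        # Save the response to a file
        with open(output_filename, 'w', encoding='utf-8') as outfile:
            outfile.write(response)
        if verbose:
            print(f"  Response saved to {output_filename}")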

        time.sleep(delay)  # Pause between requests

    if verbose:
        print("\nFinished processing files.")

In [None]:
# let's get a folder that we want to process
# it would be data/Latvian_Economic_Review folder here
# for now let's only get 10 responses
source_folder = "data/Latvian_Economic_Review"
# now we can call the function using same old agriculture prompt
save_responses_from_folder(agriculture_system_prompt,
                           source_folder,
                           max_files = 10)

Found 419 '.txt' files in 'data/Latvian_Economic_Review'.
Processing a  10 files.


Processing files:   0%|          | 0/10 [00:00<?, ?it/s]


2025-08-07 13:36:17  Processing file: lerq1936s01n01_003_plaintext_s01.txt (8,662 bytes)


Processing files:  10%|█         | 1/10 [00:05<00:47,  5.24s/it]

  Response saved to responses/lerq1936s01n01_003_plaintext_s01_google_gemini_2.5_flash.txt

2025-08-07 13:36:22  Processing file: lerq1936s01n01_006_plaintext_s02.txt (5,190 bytes)


Processing files:  20%|██        | 2/10 [00:07<00:25,  3.24s/it]

  Response saved to responses/lerq1936s01n01_006_plaintext_s02_google_gemini_2.5_flash.txt

2025-08-07 13:36:24  Processing file: lerq1936s01n01_008_plaintext_s03.txt (6,066 bytes)


Processing files:  30%|███       | 3/10 [00:11<00:27,  3.98s/it]

  Response saved to responses/lerq1936s01n01_008_plaintext_s03_google_gemini_2.5_flash.txt

2025-08-07 13:36:29  Processing file: lerq1936s01n01_009_plaintext_s04.txt (5,523 bytes)


Processing files:  40%|████      | 4/10 [00:13<00:16,  2.83s/it]

  Response saved to responses/lerq1936s01n01_009_plaintext_s04_google_gemini_2.5_flash.txt

2025-08-07 13:36:30  Processing file: lerq1936s01n01_013_plaintext_s05.txt (3,866 bytes)


Processing files:  50%|█████     | 5/10 [00:13<00:10,  2.16s/it]

  Response saved to responses/lerq1936s01n01_013_plaintext_s05_google_gemini_2.5_flash.txt

2025-08-07 13:36:31  Processing file: lerq1936s01n01_014_plaintext_s06.txt (1,144 bytes)


Processing files:  60%|██████    | 6/10 [00:14<00:06,  1.70s/it]

  Response saved to responses/lerq1936s01n01_014_plaintext_s06_google_gemini_2.5_flash.txt

2025-08-07 13:36:32  Processing file: lerq1936s01n01_014_plaintext_s07.txt (629 bytes)


Processing files:  70%|███████   | 7/10 [00:15<00:04,  1.41s/it]

  Response saved to responses/lerq1936s01n01_014_plaintext_s07_google_gemini_2.5_flash.txt

2025-08-07 13:36:33  Processing file: lerq1936s01n01_015_plaintext_s08.txt (7,520 bytes)


Processing files:  80%|████████  | 8/10 [00:22<00:06,  3.24s/it]

  Response saved to responses/lerq1936s01n01_015_plaintext_s08_google_gemini_2.5_flash.txt

2025-08-07 13:36:40  Processing file: lerq1936s01n01_017_plaintext_s09.txt (951 bytes)


Processing files:  90%|█████████ | 9/10 [00:23<00:02,  2.46s/it]

  Response saved to responses/lerq1936s01n01_017_plaintext_s09_google_gemini_2.5_flash.txt

2025-08-07 13:36:41  Processing file: lerq1936s01n01_018_plaintext_s10.txt (7,625 bytes)


Processing files: 100%|██████████| 10/10 [00:30<00:00,  3.09s/it]

  Response saved to responses/lerq1936s01n01_018_plaintext_s10_google_gemini_2.5_flash.txt

Finished processing files.
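
Because the responses now live on disk rather than in a dictionary, we need a way to read them back for analysis. The following is a minimal sketch; the helper name `load_saved_responses` is hypothetical, and it assumes that all responses sit directly in the save folder with the same extension as the source files. The demo uses a temporary folder with a made-up file so that it runs on its own.

```python
import tempfile
from pathlib import Path

def load_saved_responses(save_folder="responses", file_extension=".txt"):
    """Read every saved response file into a dict keyed by file name."""
    folder = Path(save_folder)
    if not folder.is_dir():
        return {}  # nothing has been saved yet
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(folder.glob(f"*{file_extension}"))
    }

# Self-contained demo with a temporary folder and a made-up response file
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "doc1_google_gemini_2.5_flash.txt"
    demo_file.write_text("[]", encoding="utf-8")
    saved = load_saved_responses(tmp)
    print(saved)
```

In the notebook, calling `load_saved_responses("responses")` would return a dictionary with the same shape as the `responses` dictionary used earlier.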





## Adjusting system prompt for simpler output

Now we see that sometimes we get non-standard output instead of JSON.

This presents a problem for automatic processing, so again we would need to write more code to catch these non-standard cases, make our prompt even stricter, or change our model.

For now, we will simplify the prompt so that it returns a plain list of quoted keywords, one per line, which is easy to parse.

In [None]:
agriculture_system_prompt_list = """You are an expert assistant trained to analyze historical economic texts from 1930s Latvia.
Your task is to read the input document or paragraph and extract specific concepts related to **Agriculture & Rural Economy**.
Return the concepts as a list of single words enclosed in double quotes, each on a new line.
- Agricultural modernization return "modernization"
- Land reform return "reform"
- Crop yields return "yields"
- Agricultural exports return "exports"
- Collective farming return "collective"
- Peasant cooperatives return "cooperatives"
- Grain storage and reserves return "storage"
- Rural credit return "credit"
- Mechanization of agriculture return "mechanization"
- Dairy and livestock production return "dairy"

Example output:

"exports"
"modernization"
"reform"

If no relevant concept is found, return a single word "None" with quotes around it, like this: "None".


"""

In [None]:
my_prompt = """

"""
save_responses_from_folder(my_prompt,
                            source_folder,
                            max_files=30,  # let's limit to 30 files for now
                            save_folder="my_assignment",  # we will save to the my_assignment folder
                            model="google/gemini-2.5-flash", # using the same model as before
                            verbose=False)

In [None]:
# now let's run this function again with the new system prompt and save to the responses_list folder
save_responses_from_folder(agriculture_system_prompt_list,
                            source_folder,
                            max_files=10,  # let's limit to 10 files for now
                            save_folder="responses_list",  # we will save to responses_list folder
                            model="google/gemini-2.5-flash", # using the same model as before
                            verbose=False)

Processing files: 100%|██████████| 10/10 [00:13<00:00,  1.32s/it]
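
The keyword-list format is much easier to process than JSON. As a sketch of how the saved replies could be tallied, the code below parses one quoted keyword per line and counts how often each keyword occurs. The helper name `parse_keyword_lines` is hypothetical, and the sample replies are made up for illustration; they are not real model output.

```python
from collections import Counter

def parse_keyword_lines(raw):
    """Extract the quoted keywords from a reply, one keyword per line."""
    keywords = []
    for line in raw.splitlines():
        word = line.strip().strip('"').lower()
        if word and word != "none":  # skip blank lines and the "None" marker
            keywords.append(word)
    return keywords

# Made-up replies in the format requested by the prompt
sample_replies = {
    "doc_a.txt": '"exports"\n"dairy"',
    "doc_b.txt": '"None"',
    "doc_c.txt": '"dairy"\n"cooperatives"',
}

counts = Counter()
for reply in sample_replies.values():
    counts.update(parse_keyword_lines(reply))
print(counts.most_common())
```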


Try it yourself by coming up with your own prompt!

In [None]:
# we can come up with our own prompt
my_custom_prompt = """
Change me!
"""
save_responses_from_folder(my_custom_prompt,
                            source_folder,
                            max_files=4,  # let's limit to 4 files for now
                            save_folder="workshop_exercise",  # we will save to the workshop_exercise folder
                            model="google/gemini-2.5-flash", # maybe change model?
                            verbose=False)