<a target="_blank" href="https://colab.research.google.com/github/UCSB-Library-Research-Data-Services/intro2APIs/blob/main/chapters/coding/exercise-DLPA.ipynb" style='text-decoration: none;'>
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a> 

[![](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/UCSB-Library-Research-Data-Services/intro2APIs/main?urlpath=%2Fdoc%2Ftree%2Fchapters%2Fcoding%2Fexercise-DLPA.ipynb)

# DPLA API Exercise: Exploring Artificial Intelligence in Cultural Heritage

This notebook demonstrates how to interact with the Digital Public Library of America (DPLA) API to query, analyze, and visualize cultural heritage data about artificial intelligence across different time periods.

## Learning Objectives

- Set up and configure a Python environment for API interaction
- Make authenticated requests to the DPLA API
- Handle pagination to collect large datasets
- Use facets to analyze temporal distributions
- Extract and visualize keywords from metadata

### Install Required Packages

We'll use `httpx` for HTTP requests, `yake` for keyword extraction, `tqdm` for progress bars, and `seaborn` and `matplotlib` for visualization.

In [None]:
%pip install httpx yake tqdm seaborn matplotlib

### Import Libraries

Import all necessary libraries for API interaction, data processing, and visualization.

In [None]:
import os
import time
import json
from getpass import getpass

import httpx
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit
import yake
from tqdm import tqdm
import seaborn as sns
import matplotlib.pyplot as plt

## Configuration

### Define Constants

Set up base URLs and configuration variables that will be used throughout the notebook.

In [None]:
API_BASE_URL = "https://api.dp.la/v2/" 
ENV_VAR_NAME = "DPLA_API_KEY" 
FALLBACK_DATA_URL = "https://raw.githubusercontent.com/UCSB-Library-Research-Data-Services/intro2APIs/main/data/" 

### Secure API Key Setup

Store your API key securely using environment variables. This approach avoids hardcoding sensitive credentials in the notebook.

In [None]:
key = os.getenv(ENV_VAR_NAME)

# Prompting with getpass avoids hardcoding the API key in the notebook.
# The environment variable persists only for the current session.
if not key:
    key = getpass("Enter your DPLA API key: ").strip()
    if not key:
        raise ValueError("No API key provided.")
    os.environ[ENV_VAR_NAME] = key

print(f"API key set in environment variable {ENV_VAR_NAME}.")

## Helper Functions

These utility functions help keep our code clean and reusable.

In [None]:
def _join_list(x, sep="; ", keep_first_only=False):
    """
    Helper function to join a list of values into a single string. If the input is not a list, it will return the string representation of the input. If the input is None, it will return an empty string.
    """
    if isinstance(x, list):
        if keep_first_only and len(x) > 0:
            return str(x[0])
        return sep.join(str(v) for v in x if v is not None)
    return "" if x is None else str(x)

def _top_n(d, n=10):
    """Helper function to return the top n items from a dictionary, sorted by value in descending order."""
    return dict(sorted(d.items(), key=lambda x: x[1], reverse=True)[:n])

def _redact_request_url(url):
    """Remove the api_key parameter from the URL for display purposes."""
    parsed_url = urlsplit(str(url))  # Convert httpx.URL to string
    query_params = parse_qsl(parsed_url.query)
    filtered_params = [(name, value) for name, value in query_params if name != "api_key"]
    redacted_query = urlencode(filtered_params)
    redacted_url = parsed_url._replace(query=redacted_query)
    return urlunsplit(redacted_url)

### URL Redaction Helper

`_redact_request_url`, defined in the cell above, removes the `api_key` parameter from a URL before it is printed, so the key never appears in notebook output.
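
As a quick check of the redaction idea, the following sketch strips a secret query parameter from a made-up URL using only the standard library. The function name `redact_api_key`, the example URL, and the placeholder key value are illustrative assumptions, not part of the notebook's helpers.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def redact_api_key(url, secret_param="api_key"):
    """Return the URL with the secret query parameter removed (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep every query parameter except the secret one
    kept = [(name, value) for name, value in parse_qsl(parts.query) if name != secret_param]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Made-up request URL with a placeholder key
example_url = (
    "https://api.dp.la/v2/items"
    "?q=artificial+AND+intelligence&api_key=NOT_A_REAL_KEY&page_size=5"
)
safe_url = redact_api_key(example_url)
print(safe_url)
```

The other query parameters survive unchanged, so the printed URL is still useful for debugging a request.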

In [None]:
def search_items(query, resource_type='items', verbose=False, timeout=30.0, **parameters):
    """
    Search DPLA items with given query and parameters.
    
    Args:
        query (str): The search query string. It's possible to use logical operators (AND, OR, NOT). Additionally, you can use wildcards (*) for partial matches.
        resource_type (str): The type of resource to search for. Default is 'items'.
        verbose (bool): If True, prints the request URL. Default is False.
        timeout (float): The timeout for the HTTP request in seconds. Default is 30.0.
        **parameters: Facets and filter parameters from the DPLA API documentation: https://pro.dp.la/developers/requests
                    Dotted keywords and values can be passed using dictionary unpacking. For example, to filter by sourceResource.title, you can pass:
                    **{"sourceResource.title": "example title"}
    Returns:
        dict: The JSON response from the DPLA API as a Python dictionary.
    """
    
    # Build the request URL and minimal parameters
    base_url = f"{API_BASE_URL}{resource_type}"
    params = {
        "q": query,
        "api_key": os.getenv(ENV_VAR_NAME),
    }
    
    # Add additional parameters if any
    for key, value in parameters.items():
        params[key] = value
        
    # Make the request
    with httpx.Client(timeout=timeout) as client:
        response = client.get(base_url, params=params)
    
    if verbose:
        print(f"Request URL [redacted]: {_redact_request_url(response.url)}")
    
    response.raise_for_status() 
    return response.json()

## Main API Functions

### Search Items Function

`search_items`, defined in the cell above, queries the DPLA API and returns a single page of results. It handles query parameters, facets, and filters.
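
To make the parameter handling concrete, here is a minimal offline sketch of how a query plus keyword arguments (including dotted field names passed by dictionary unpacking) could end up in a query string. `build_query_params` is a hypothetical helper written for this illustration, and the encoding shown comes from `urllib.parse.urlencode`; the exact percent-encoding that `httpx` produces may differ.

```python
from urllib.parse import urlencode

def build_query_params(query, **parameters):
    """Merge the search query with extra filters into one dict (hypothetical helper)."""
    params = {"q": query}
    params.update(parameters)  # dotted names arrive here via ** unpacking
    return params

params = build_query_params(
    "artificial AND intelligence",
    page_size=5,
    **{"sourceResource.subject.name": "artificial intelligence"},
)
query_string = urlencode(params)
print(query_string)
```

Dotted names such as `sourceResource.subject.name` are not valid Python identifiers, which is why they have to be passed through `**{...}` rather than as ordinary keyword arguments.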

In [None]:
def search_all_items(query, resource_type='items', max_items=100, sleep=0.5, verbose=False, timeout=30.0, **parameters):
    """
    Collect up to max_items across pages.
    
    Args:
        query (str): The search query string. It's possible to use logical operators 
            (AND, OR, NOT). Additionally, you can use wildcards (*) for partial matches.
        max_items (int): Maximum number of items to retrieve. For number of elements per page, 
            use the page_size parameter in **parameters.
        sleep (float): Time to wait between requests to avoid hitting rate limits.
        **parameters: Facets and filter parameters from the DPLA API documentation: https://pro.dp.la/developers/requests
    """
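    all_docs = []
    page = 1
    page_size = int(parameters.get("page_size", 100))
    if page_size > 500:
        page_size = 500
        print("page_size cannot exceed 500. Setting to 500.")
    # Send the (possibly clamped) value to the API instead of the original one
    parameters["page_size"] = page_size
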
    while len(all_docs) < max_items:
        parameters['page'] = page
        data = search_items(
            query,
            resource_type=resource_type,
            verbose=verbose,
            timeout=timeout,
            **parameters
        )
        docs = data.get('docs', [])
        if not docs:
            break  # No more results
        all_docs.extend(docs)
        
        # stop if we've reached max_items
        if len(all_docs) >= max_items:
            break
        
        page += 1
        time.sleep(sleep)
        
    return all_docs[:max_items]

    

### Search All Items Function

`search_all_items`, defined in the cell above, handles pagination and collects multiple pages of results up to a specified maximum.
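
The pagination logic can be tried without any network access. The sketch below replaces the API call with a stand-in `fetch_page` function over a made-up result set of 23 items; both function names and the data are assumptions for illustration, and only the 500-item page ceiling is taken from the notebook.

```python
def fetch_page(page, page_size, total=23):
    """Stand-in for one API call: items on a 1-indexed page of a 23-item result set."""
    start = (page - 1) * page_size
    return list(range(start, min(start + page_size, total)))

def collect(max_items, page_size, max_page_size=500):
    """Gather up to max_items by requesting consecutive pages until one is empty."""
    page_size = min(page_size, max_page_size)  # clamp to the per-page ceiling
    collected, page = [], 1
    while len(collected) < max_items:
        docs = fetch_page(page, page_size)
        if not docs:
            break  # ran past the last page
        collected.extend(docs)
        page += 1
    return collected[:max_items]  # trim any overshoot from the final page

first_fifteen = collect(max_items=15, page_size=10)
everything = collect(max_items=100, page_size=10)
print(len(first_fifteen), len(everything))
```

The loop has two exits: reaching `max_items`, or receiving an empty page. The final slice matters because the last page fetched can push the total past the requested maximum.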

🎯 **Challenge 1: Experimenting with Pagination**

Let's explore how pagination works! APIs often return results in "pages" to avoid overwhelming the server and your computer with too much data at once.

In the cell below, try changing `page_size=5` to `page_size=10` and `page=1` to `page=2`. Before running it, think about: which items will you see now? Will they be the same as before, or different ones?

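**Bonus:** What happens if you try `page_size=1000`? (Hint: `search_all_items` caps `page_size` at 500 and prints a message; `search_items` sends the value to the API unchanged, so check what comes back.)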

In [None]:
# Defining the fields, facets, and filters outside the function call for better readability
fields = [
    "sourceResource.title",
    "sourceResource.description",
    "sourceResource.date.begin",
    "sourceResource.date.end"
]

dotted_fields = {
    "sourceResource.subject.name": "artificial intelligence"
}

try:
    ai_search = search_items(
        "artificial AND intelligence", # search query
        fields=_join_list(fields, sep=","), # fields to include in the response
        facets="sourceResource.date.begin", # facets to include in the response
        **dotted_fields, # additional parameters (e.g. filters),
        page_size=5, # items per page
        sort_by="sourceResource.date.begin", # sort by date
        sort_order="asc", # oldest to newest
        page=1, # page number to retrieve
        verbose=True # print the request URL for debugging purposes
        )
    
except httpx.HTTPStatusError as e:
    print(f"HTTP error occurred: {e}. Using preloaded data instead.")
    r = httpx.get(f"{FALLBACK_DATA_URL}dpla_search_results.json")
    r.raise_for_status()
    ai_search = r.json()


# Report how many results the query matched
if isinstance(ai_search, dict):
    print(f"{ai_search.get('count')} results found.")
else:
    print(f"{len(ai_search)} results found.")

## Exploring the Data

### Initial Search Query

The request above is our first query for items about artificial intelligence. It asks for facets on `sourceResource.date.begin`, which the next cell reads to understand the temporal distribution of items.

In [None]:
facets_entries = ai_search.get("facets", {}).get("sourceResource.date.begin", {}).get("entries", [])

# Print a sample (we use 'time' because that's the label for date facets)
for entry in facets_entries[:5]:
    print(f"Year: {entry.get('time')}, Count: {entry.get('count')}")

### Visualizing Temporal Distribution

Facets provide aggregated counts by specific fields. Let's visualize how AI-related items are distributed across time.
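
Before plotting, it can help to see the facet structure in isolation. The sketch below uses a hand-written sample shaped like the facet block this notebook reads (an `entries` list whose items carry `time` and `count`); the years and counts are made up, and `facet_counts` is a hypothetical helper.

```python
# Hand-written sample; the values are invented for illustration only.
sample_response = {
    "facets": {
        "sourceResource.date.begin": {
            "entries": [
                {"time": "2021", "count": 12},
                {"time": "2020", "count": 7},
                {"time": "2019", "count": 3},
            ]
        }
    }
}

def facet_counts(response, field):
    """Return {facet value: count}, tolerating missing keys at any level."""
    entries = response.get("facets", {}).get(field, {}).get("entries", [])
    return {entry.get("time"): entry.get("count") for entry in entries}

counts_by_year = facet_counts(sample_response, "sourceResource.date.begin")
total_items = sum(counts_by_year.values())
print(counts_by_year, total_items)
```

Chaining `.get()` with empty defaults means a response without facets yields an empty dictionary instead of raising a `KeyError`.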

In [None]:
# Extract the year and count information from the facet entries
years = [entry.get("time") for entry in facets_entries][::-1] # We use [::-1] to reverse the order
counts = [entry.get("count") for entry in facets_entries][::-1]

# Create a bar chart using seaborn
sns.barplot(x=years, y=counts)
plt.xlabel('Year')
plt.ylabel('Number of Items')
plt.title('Items about Artificial Intelligence by Year')
plt.xticks(range(0, len(years), 5), [years[i] for i in range(0, len(years), 5)], rotation=45)
plt.tight_layout()
plt.show()

In [None]:
periods = [
    ("preCovid", 1844, 2018),
    ("Covid", 2019, 2021),
    ("postCovid", 2022, 2026),
]

## Comparative Analysis Across Time Periods

### Define Time Periods

Based on the temporal distribution, the `periods` list above defines three periods of interest: before (`preCovid`), during (`Covid`), and after (`postCovid`) the COVID-19 pandemic.
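
As a small sanity check on the period boundaries, this sketch maps a year to its period and verifies that consecutive periods leave no gaps. The `periods` values are copied from the notebook; `period_for` is a hypothetical helper, and treating both end years as inclusive is an assumption.

```python
# Same (name, first year, last year) layout as the notebook's `periods` list
periods = [
    ("preCovid", 1844, 2018),
    ("Covid", 2019, 2021),
    ("postCovid", 2022, 2026),
]

def period_for(year, periods):
    """Return the name of the period whose inclusive range contains year, or None."""
    for name, start, end in periods:
        if start <= year <= end:
            return name
    return None

# Each period should start the year after the previous one ends
contiguous = all(
    periods[i][2] + 1 == periods[i + 1][1] for i in range(len(periods) - 1)
)

labels = [period_for(year, periods) for year in (1950, 2020, 2024, 1800)]
print(labels, contiguous)
```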

🎯 **Challenge 2: Testing Date Filters**

Before we collect data for all three periods, let's practice using date filters with a single year! Date filters help you narrow down API results to specific time ranges.

Try creating a new cell below this one and write a `search_items()` call that retrieves items from **2020 only**. 

**Hint:** Use these parameters:
```python
**{"sourceResource.date.after": "2020", "sourceResource.date.before": "2020"}
```

Before running your query, predict: How many results do you expect compared to the full Covid period (2019-2021)? Will it be more, less, or about the same?

In [None]:
# create a pool of results for each period
ai_results = {}

fields = [
    "sourceResource.title",
    "sourceResource.description",
    "sourceResource.date.begin",
    "sourceResource.date.end"
]

dotted_fields = {
    "sourceResource.subject.name": "artificial intelligence",
}

try:
    for period_name, start_date, end_date in tqdm(periods):
        ai_results[period_name] = search_all_items(
            "artificial AND intelligence", # search query
            max_items=400, # maximum number of items to retrieve for each period
            fields=_join_list(fields, sep=","), # fields to include in the response
            facets="sourceResource.date.begin", # Retrieve facets for date ranges
            page_size=100, # items per page
            **dotted_fields, # filter to ensure results are about AI, not just using AI in metadata
            **{"sourceResource.date.after": str(start_date)}, # Between year
            **{"sourceResource.date.before": str(end_date)}, # and Year
            sort_by="sourceResource.date.begin", # sort by date
            sort_order="asc", # oldest to newest
            verbose=False 
        )
        
except httpx.HTTPStatusError as e:
    print(f"HTTP error occurred: {e}. Using preloaded data instead.")
    r = httpx.get(f"{FALLBACK_DATA_URL}ai_results_by_wave.json")
    r.raise_for_status()
    ai_results = r.json()

### Collect Items for Each Period

The cell above fetches up to 400 items for each time period (`max_items=400`) using date filters and pagination.

In [None]:
ai_results_summary = {period: len(items) for period, items in ai_results.items()}
print("AI Results Summary by Period:")
for period, count in ai_results_summary.items():
    print(f"{period}: {count} items")

### Summary of Results by Period

The cell above reports how many items were retrieved for each time period.

In [None]:
ai_results.get("postCovid")[:5]

### Exploring Item Structure

The cell above shows the first five `postCovid` items, so we can see which metadata fields are available.

In [None]:
def extract_keywords(items_by_period, skip=None, ngram=2, max_keywords=5, language="en"):
    """Count YAKE keywords per period from item titles and descriptions."""

    ai_keywords = {}
    kw_extractor = yake.KeywordExtractor(lan=language, n=ngram, top=max_keywords)
    skip_keywords = set(skip or [])

    for period, items in tqdm(items_by_period.items(), desc="Extracting keywords"):
        period_keywords = {}
        for item in items:
            title = _join_list(item.get("sourceResource.title", ""))
            description = _join_list(item.get("sourceResource.description", ""))
            text = f"{title} {description}".lower()

            # YAKE returns (keyword, score) pairs; only occurrences are counted here
            for kw, _score in kw_extractor.extract_keywords(text):
                if kw in skip_keywords:
                    continue
                period_keywords[kw] = period_keywords.get(kw, 0) + 1

        ai_keywords[period] = period_keywords

    return ai_keywords

## Keyword Extraction

### Extract Keywords Using YAKE

Use YAKE (Yet Another Keyword Extractor) to identify important terms in titles and descriptions for each period.

In [None]:
# Define a list of common words to skip (optional)
skip_words = ["artificial intelligence", "ai", "intelligence", "artificial"]

top = 10

ai_keywords = extract_keywords(ai_results, skip=skip_words, ngram=2, max_keywords=top)

for period, keywords in ai_keywords.items():
    sorted_keywords = sorted(keywords.items(), key=lambda x: x[1], reverse=True)[:top]
    print(f"Top {top} keywords for {period}:")
    for kw, count in sorted_keywords:
        print(f"  {kw}: {count}")
    print()

### Display Top Keywords

The cell above prints the most frequent keywords for each time period; the next cell plots them side by side.
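
The count-then-rank step can also be sketched with `collections.Counter` from the standard library, as an alternative to sorting a plain dictionary. The keyword lists below are stand-ins invented for this example; in the notebook they would come from YAKE.

```python
from collections import Counter

# Stand-in keyword lists, one per item (made-up values)
keywords_per_item = [
    ["machine learning", "neural networks"],
    ["machine learning", "expert systems"],
    ["machine learning", "neural networks", "robotics"],
]
skip = {"robotics"}  # terms to leave out of the tally

keyword_counts = Counter(
    kw
    for item_keywords in keywords_per_item
    for kw in item_keywords
    if kw not in skip
)
top_two = keyword_counts.most_common(2)
print(top_two)
```

`most_common(n)` returns `(keyword, count)` pairs in descending order of count, which is the same shape the printing loop in this notebook iterates over.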

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(14, 5), sharex=False)

for ax, (period_name, start_year, end_year) in zip(axes, periods):
    data = _top_n(ai_keywords.get(period_name, {}), 10)
    terms = list(data.keys())
    counts = list(data.values())

    sns.barplot(x=counts, y=terms, ax=ax, palette="viridis", hue=counts)
    ax.set_title(f"{period_name} ({start_year}–{end_year})")
    ax.set_xlabel("Frequency")

plt.suptitle("How 'Artificial Intelligence' appears across time in DPLA", fontsize=14)
plt.tight_layout()
plt.show()

## Conclusion

You've successfully:
- ✅ Configured secure API authentication
- ✅ Queried the DPLA API with complex filters and facets
- ✅ Handled pagination to collect large datasets
- ✅ Analyzed temporal patterns in cultural heritage data
- ✅ Extracted and visualized keywords across different time periods

These techniques are transferable to many other APIs. The patterns you've learned (building requests programmatically, handling responses, and combining API data with text analysis) can be applied to museum collections, scientific datasets, social media archives, and more.