# 🚀 Data Engineer's Guide to API Data Retrieval & Error Handling

## 📌 Overview
As a data engineer, your mission is clear: **extract, transform, and load (ETL) data from APIs** efficiently while ensuring robust error handling. APIs are powerful, but they come with challenges: timeouts, rate limits, unexpected responses, and more. This notebook equips you with the tools to **fetch data reliably, handle errors gracefully, and log issues effectively**.

## 🔍 What You'll Learn
- How to **fetch data from APIs** using Python's `requests` library.
- How to implement **try-except blocks** to catch and manage errors.
- How to set up **logging** to track API failures and debug issues.

## 🛠️ Why This Matters
Data pipelines depend on **consistent and reliable data ingestion**. Without proper error handling, a single failed request can disrupt workflows, leading to incomplete datasets or broken processes. By mastering API error handling, you ensure **data integrity, reliability, and efficiency** in your engineering tasks.

---

We are using an Application Programming Interface (API) from the United States Library of Congress (https://github.com/LibraryOfCongress): the Chronicling America newspaper search endpoint that the request cell below queries.
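
The request cell in this notebook passes `proxtext`, `state`, `format`, and `page` as query parameters to the search endpoint. As a minimal sketch, the query string can be assembled with the standard library so that values containing spaces are encoded correctly. The helper name `build_search_url` is hypothetical and not part of the notebook.

```python
from urllib.parse import urlencode

# Same host and path as the request cell in this notebook
BASE_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"


def build_search_url(subject, state, page=1):
    """Return the search URL for one page of results (hypothetical helper)."""
    params = {"proxtext": subject, "state": state, "format": "json", "page": page}
    # urlencode escapes spaces and other reserved characters in the values
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("Tacoma", "Washington"))
print(build_search_url("Tacoma", "New York", page=2))
```

Passing the same dictionary as `params=` to `requests.get` applies equivalent encoding without building the string by hand.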



In [0]:
import requests  # Library for making HTTP requests
import json  # Library for parsing JSON data
import logging  # Library for logging information
import time  # Library for time-related functions
from pyspark.sql import Row  # Class for creating Row objects
from pyspark.sql.functions import to_date, col
from pyspark.sql.types import StructType, StructField, StringType  # Classes for defining DataFrame schema

# Set up logging
logger = logging.getLogger('databricks_api_logging')
logger.setLevel(logging.DEBUG)

console_handler = logging.StreamHandler()
console_handler.setLevel(logging.DEBUG)

# Define a formatter
formatter = logging.Formatter('%(name)s - %(levelname)s - %(message)s')
console_handler.setFormatter(formatter)

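# Attach the handler only once so re-running this cell does not duplicate log lines
if not logger.handlers:
    logger.addHandler(console_handler)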

# Define parameters for the API request and table
state = "Washington"
subject = "Tacoma"
catalog = "generaldata"
schema = "dataanalysis"
table_name = "tacoma_articles"

In [0]:
base_url = "https://chroniclingamerica.loc.gov/search/pages/results/"
df_schema = None  # Built once, from the first page that returns items

for page in range(1, 11):  # Fetch result pages 1 through 10
    if page > 1:
        time.sleep(20)  # Pause between requests to avoid hitting rate limits
    try:
        # Request one page of results; requests URL-encodes the parameters
        response = requests.get(
            base_url,
            params={"proxtext": subject, "state": state, "format": "json", "page": page},
            timeout=30,  # Fail instead of hanging on an unresponsive server
        )
        response.raise_for_status()  # Raise an error for bad status codes
        items = response.json().get("items", [])  # Parse the JSON response
    except requests.exceptions.HTTPError as err:
        logger.error(f"HTTP error on page {page}: {err}")  # Log HTTP errors
        continue  # Skip this page but keep processing the rest
    except requests.exceptions.RequestException as err:
        logger.error(f"Request error on page {page}: {err}")  # Log other request errors
        continue
    except ValueError as err:
        logger.error(f"Invalid JSON on page {page}: {err}")  # Log unparseable responses
        continue

    if not items:
        logger.warning(f"No items returned for page {page}; stopping")
        break

    if df_schema is None:
        json_keys = items[0].keys()  # Extract keys for schema
        df_schema = StructType([StructField(key, StringType(), True) for key in json_keys])  # Define schema

    rows = [Row(**article) for article in items]  # Convert each article to a Row
    df = spark.createDataFrame(rows, schema=df_schema)  # Create DataFrame with schema
    df.write.mode("append").saveAsTable(f"{catalog}.{schema}.{table_name}")  # Save DataFrame as a table
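
The error handling above logs a failed request and moves on. For transient failures such as timeouts, one common complement is to retry with exponential backoff before giving up. The following is a minimal sketch under that assumption. `retry_with_backoff` and its parameters are hypothetical names, and the helper takes any callable so it can be tried without network access.

```python
import logging
import time

logger = logging.getLogger("databricks_api_logging")


def retry_with_backoff(func, retries=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or the attempts run out (hypothetical helper).

    Waits base_delay, then 2 * base_delay, then 4 * base_delay, ... between attempts.
    """
    for attempt in range(1, retries + 1):
        try:
            return func()
        except exceptions as err:
            if attempt == retries:
                logger.error(f"Giving up after {attempt} attempts: {err}")
                raise  # Re-raise the last error so the caller can handle it
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning(f"Attempt {attempt} failed ({err}); retrying in {delay}s")
            sleep(delay)


# Possible usage with requests (not executed here):
# response = retry_with_backoff(
#     lambda: requests.get(url, timeout=30),
#     exceptions=(requests.exceptions.ConnectionError, requests.exceptions.Timeout),
# )
```

Limiting `exceptions` to connection and timeout errors keeps the helper from retrying failures that will not fix themselves, such as a 404 response.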

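The load step above declares every column as `StringType` and takes the column names from the first item only. If a later item is missing a key, or carries a list or a number, the record may not line up cleanly with that schema. One hedged way to guard against this is to flatten each record into a tuple of strings in schema order before creating the DataFrame. `normalize_record` is a hypothetical helper, and the sample record is made up for illustration.

```python
import json


def normalize_record(record, keys):
    """Return the record's values as a tuple of strings in the order of keys.

    Missing keys and None become None; strings pass through unchanged;
    lists, dicts, and numbers are serialized as JSON text.
    """
    values = []
    for key in keys:
        value = record.get(key)
        if value is None or isinstance(value, str):
            values.append(value)
        else:
            values.append(json.dumps(value))
    return tuple(values)


# Made-up sample data, not real API output
keys = ["id", "title", "subject", "page"]
sample = {"title": "Example Gazette", "id": "/example/1/", "subject": ["Tacoma", "Ports"]}
print(normalize_record(sample, keys))
```

The resulting tuples can be passed to `spark.createDataFrame(rows, schema=df_schema)` in place of the `Row` objects.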
In [0]:
# Read the table into a DataFrame
df = spark.read.table(f"{catalog}.{schema}.{table_name}")

# Select specific columns and generate summaries using AI query
df_out = df.selectExpr(
  "date",  # Select the date column
  "id",  # Select the id column
  "subject",  # Select the subject column
  "title",  # Select the title column
  "ocr_eng",  # Select the OCR text column
  # Generate summaries using AI model; model settings go in the named modelParameters argument
  "ai_query('databricks-meta-llama-3-3-70b-instruct', CONCAT('Please provide a summary of the following text: ', ocr_eng), modelParameters => named_struct('max_tokens', 100, 'temperature', 0.7)) as summary"
)
df_out = df_out.withColumn("date", to_date(col("date"), "yyyyMMdd"))  

In [0]:
display(df_out.limit(5))  # Show five rows; limit() keeps the result a DataFrame