# 🚀 Data Engineer's Guide to API Data Retrieval & Error Handling

## 📌 Overview
As a data engineer, your mission is clear: **extract, transform, and load (ETL) data from APIs** efficiently while ensuring robust error handling. APIs are powerful, but they come with challenges: timeouts, rate limits, unexpected responses, and more. This notebook equips you with the tools to **fetch data reliably, handle errors gracefully, and log issues effectively**.

## 🔍 What You'll Learn
- How to **fetch data from APIs** using Python's `requests` library.
- How to implement **try-except blocks** to catch and manage errors.
- How to set up **logging** to track API failures and debug issues.

## 🛠️ Why This Matters
Data pipelines depend on **consistent and reliable data ingestion**. Without proper error handling, a single failed request can disrupt workflows, leading to incomplete datasets or broken processes. By mastering API error handling, you ensure **data integrity, reliability, and efficiency** in your engineering tasks.
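One common way to keep a single transient failure (a timeout, a dropped connection) from breaking a pipeline is to retry the call with an increasing delay. The sketch below is a minimal illustration of that idea, not part of the notebook's own code: the helper name `call_with_retries`, its parameters, and the logger name are all hypothetical, and it uses only the standard library so the retry logic can be read on its own.

```python
import logging
import time

logger = logging.getLogger("retry_sketch")  # hypothetical logger name


def call_with_retries(fetch, attempts=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    The wait doubles after each failure: base_delay, 2*base_delay, ...
    The last failure is re-raised so the caller can still handle it.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on as err:
            if attempt == attempts:
                logger.error("Giving up after %d attempts: %s", attempts, err)
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning("Attempt %d failed (%s); retrying in %.1fs",
                           attempt, err, delay)
            sleep(delay)
```

In this notebook, `fetch` could be something like `lambda: requests.get(url, timeout=30)`, with `retry_on=(requests.exceptions.Timeout, requests.exceptions.ConnectionError)` so that only transient network problems are retried. The attempt count and delays here are placeholder values, not recommendations from the API provider.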

---

This notebook uses the Chronicling America search API (Application Programming Interface) from the United States Library of Congress: https://github.com/LibraryOfCongress



In [0]:
import requests
import json
import logging
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType

# Set up logging
logger = logging.getLogger('databricks_api_logging')
logger.setLevel(logging.DEBUG)

console_handler = logging.StreamHandler()
console_handler.setLevel(logging.DEBUG)

# Define a formatter
formatter = logging.Formatter('%(name)s - %(levelname)s - %(message)s')
console_handler.setFormatter(formatter)

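# Attach the handler only once so re-running this cell does not duplicate log lines
if not logger.handlers:
    logger.addHandler(console_handler)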

# Define parameters for the API request and table
state = "Washington"
subject = "Tacoma"
catalog = "generaldata"
schema = "dataanalysis"
table_name = "tacoma_articles"
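The `subject` and `state` values are placed directly into the request URL, which works for single words such as "Tacoma" but can produce a malformed URL once a value contains spaces or special characters. A minimal sketch of encoding the query string with the standard library follows. The helper name `build_search_url` and the constant `BASE_URL` are hypothetical; the endpoint and the parameter names (`proxtext`, `state`, `format`, `page`) are the ones this notebook queries.

```python
from urllib.parse import urlencode

# Same search endpoint the notebook queries (constant name is hypothetical)
BASE_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"


def build_search_url(subject, state, page=None):
    """Return the search URL with safely encoded query parameters."""
    params = {"proxtext": subject, "state": state, "format": "json"}
    if page is not None:
        params["page"] = page
    # urlencode escapes spaces and reserved characters in each value
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("Tacoma", "Washington"))
print(build_search_url("Puget Sound", "New York", page=2))
```

An alternative that avoids building the string by hand is to pass the same dictionary to `requests.get(BASE_URL, params=params)`, which encodes the values in the same way.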

In [0]:
try:
    rows = []
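    # Initial request, used to discover the field names for the schema.
    # The timeout keeps a stalled connection from hanging the job.
    response = requests.get(f"https://chroniclingamerica.loc.gov/search/pages/results/?proxtext={subject}&state={state}&format=json", timeout=30)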
    response.raise_for_status()  # Raise an error for bad status codes
    # Parse the JSON response from the initial API request
    state_json = response.json()
    # Define the number of pages to fetch
    numPages = 50
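    # Stop early with a clear message if the search returned no items
    if not state_json.get('items'):
        raise ValueError(f"No items returned for subject={subject!r}, state={state!r}")
    json_keys = state_json['items'][0].keys()  # Extract keys for schema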

    df_schema = StructType([StructField(key, StringType(), True) for key in json_keys])  # Define schema

    for p in range(0, numPages):
        response = requests.get(f"https://chroniclingamerica.loc.gov/search/pages/results/?proxtext={subject}&state={state}&format=json&page={p+1}", timeout=30)
        response.raise_for_status()  # Raise an error for bad status codes
        article_data = response.json()  # Already a dict, so no JSON round-trip is needed
        for article in article_data["items"]:
            # Build each Row in schema order; keys missing from an article become None
            rows.append(Row(**{key: article.get(key) for key in json_keys}))

    df = spark.createDataFrame(rows, schema=df_schema)  # Create DataFrame with schema
    df.write.mode("overwrite").saveAsTable(f"{catalog}.{schema}.{table_name}")  # Save DataFrame as a table
except requests.exceptions.HTTPError as err:
    logger.error(f"HTTP error occurred: {err}")  # Log HTTP errors
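except requests.exceptions.RequestException as err:
    logger.error(f"Error occurred: {err}")  # Log other request errors (timeouts, connection failures)
except (KeyError, IndexError, ValueError) as err:
    logger.error(f"Unexpected response structure: {err}")  # Log malformed or empty payloads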
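Because the whole loop sits inside one `try` block, a failure on any page abandons every page fetched so far and nothing is written. One possible alternative, sketched here under assumptions, is to handle errors per page and record which pages failed. The names `collect_pages` and `fetch_page` are hypothetical; `fetch_page` stands in for whatever function performs the request and returns the parsed JSON for one page.

```python
import logging

logger = logging.getLogger("page_fetch_sketch")  # hypothetical logger name


def collect_pages(fetch_page, num_pages):
    """Fetch pages 1..num_pages and skip any page that fails.

    Returns a tuple (items, failed_pages) so the caller can decide
    whether to retry, alert, or accept a partial result.
    """
    items, failed_pages = [], []
    for page in range(1, num_pages + 1):
        try:
            payload = fetch_page(page)
            items.extend(payload["items"])
        except Exception as err:  # assumption: narrow this to the errors you expect
            logger.error("Page %d failed: %s", page, err)
            failed_pages.append(page)
    return items, failed_pages
```

With `requests`, the `except` clause could be narrowed to `(requests.exceptions.RequestException, KeyError)`. Whether a partial result should still be written to the table is a pipeline decision; returning `failed_pages` keeps that choice with the caller.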