### **Archive Metadata Review Tool**

Pulls Internet Archive metadata and audits it for missing required fields, malformed dates, duplicate identifiers, stray whitespace, and fields that mix list and string values.


**Outputs:**
- `ia_metadata_snapshot.csv` - Complete metadata export
- `ia_metadata_issues_report.csv` - Issues with severity ratings

**Known Limitations:**
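- Dates are lightly validated: only `YYYY`, `YYYY-MM`, `YYYY-MM-DD` (hyphen or slash separated) and ISO timestamps are recognized, so some historical/non-standard date strings may be flagged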
- Advanced search returns a subset of fields; `/metadata/{identifier}` provides richer per-item data

---

**Disclaimer:** I am not affiliated with the Internet Archive or its other projects, nor do I work for them. This is a personal project completed on my own time.
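
One way to see the advanced-search limitation in practice is to compare the fields that were requested with the columns that came back. The sketch below is not part of the notebook's cells; `missing_columns` is a hypothetical helper, and the two lists are copied from the configuration cell and from the sample fetch output shown later in this notebook.

```python
def missing_columns(expected, actual):
    """Return expected field names that are absent from the actual columns, in order."""
    present = set(actual)
    return [name for name in expected if name not in present]

# Field list from the configuration cell
expected_fields = [
    "identifier", "title", "creator", "date", "collection",
    "language", "publisher", "subject", "description",
    "mediatype", "licenseurl",
]

# Column names copied from the sample output of the fetch cell
returned_columns = [
    "collection", "creator", "date", "description", "identifier",
    "language", "mediatype", "publisher", "subject", "title",
]

print(missing_columns(expected_fields, returned_columns))  # → ['licenseurl']
```

Inside the notebook the same call would be `missing_columns(EXPECTED_FIELDS, df.columns)`. A missing column may simply mean that none of the fetched items carried that field; that is an assumption, not something this notebook verifies.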

## How to use this notebook

**To run the notebook:**
1. Enter an Internet Archive collection ID or item identifiers in the configuration cell below
2. Run all cells in order (Runtime → Run all)
3. Review the generated reports
4. Download the CSV files at the end


In [None]:
# CONFIGURATION - Edit these values

# Choose analysis mode: "collection" or "identifiers"
MODE = "collection"

# For collection mode: specify collection ID and max items
COLLECTION_ID = "internetarchivebooks"
MAX_ITEMS = 500

# For identifiers mode: list specific item IDs
IDENTIFIERS = [
    # Example: "seeingwhatissacr0000gire",
    # Add your identifiers here
]

# Fields to check in the audit
EXPECTED_FIELDS = [
    "identifier", "title", "creator", "date", "collection",
    "language", "publisher", "subject", "description",
    "mediatype", "licenseurl"
]

# Fields that must not be empty
REQUIRED_IF_PRESENT = ["identifier", "title"]

In [None]:
# FETCH METADATA FROM INTERNET ARCHIVE

import requests
import pandas as pd

def ia_advancedsearch(collection_id: str, max_items: int = 500):
    """
    Uses Internet Archive advancedsearch to fetch item metadata in JSON.
    """
    url = "https://archive.org/advancedsearch.php"
    params = {
        "q": f'collection:({collection_id})',
        "fl[]": EXPECTED_FIELDS,     # fields list
        "rows": max_items,
        "page": 1,
        "output": "json",
    }
    r = requests.get(url, params=params, timeout=60)
    r.raise_for_status()
    data = r.json()
    docs = data.get("response", {}).get("docs", [])
    return pd.DataFrame(docs)

def ia_metadata_for_identifier(identifier: str):
    """
    Fetches item metadata JSON for a single identifier.
    """
    url = f"https://archive.org/metadata/{identifier}"
    r = requests.get(url, timeout=60)
    r.raise_for_status()
    data = r.json()

    # The "metadata" dict is the most useful core
    md = data.get("metadata", {}) or {}
    # Ensure identifier is present
    md["identifier"] = md.get("identifier", identifier)
    return md

def fetch_ia_dataframe(mode: str):
    mode = mode.lower().strip()
    if mode == "collection":
        df = ia_advancedsearch(COLLECTION_ID, MAX_ITEMS)
        return df
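    elif mode == "identifiers":
        # Fail early instead of silently returning an empty DataFrame
        if not IDENTIFIERS:
            raise ValueError("MODE is 'identifiers' but IDENTIFIERS is empty.")
        rows = []
        for ident in IDENTIFIERS:
            rows.append(ia_metadata_for_identifier(ident))
        return pd.DataFrame(rows)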
    else:
        raise ValueError("MODE must be 'collection' or 'identifiers'")

df = fetch_ia_dataframe(MODE)

print("Fetched rows:", len(df))
print("Columns:", len(df.columns))
display(df.head(10))


Fetched rows: 500
Columns: 10


Unnamed: 0,collection,creator,date,description,identifier,language,mediatype,publisher,subject,title
0,"[internetarchivebooks, printdisabled, fav-star...","Gire, Ken",2006-01-01T00:00:00Z,"[200 p. ; 22 cm, Includes bibliographical refe...",seeingwhatissacr0000gire,eng,texts,"Nashville, Tenn. : W Publishing Group",Spirituality,Seeing what is sacred : becoming more spiritua...
1,"[internetarchivebooks, inlibrary, printdisabled]",,1975-01-01T00:00:00Z,,walkingtourswash0000unse,eng,texts,,,walking tours: washington d.c.
2,"[internetarchivebooks, inlibrary, printdisabled]",,1965-01-01T00:00:00Z,,underpressure0000unse,eng,texts,,,under pressure
3,"[trent_university, internetarchivebooks, print...","Loomis, John B",1993-01-01T00:00:00Z,"[xxi, 474 p. : 26 cm, Includes bibliographical...",integratedpublic0000loom,eng,texts,New York : Columbia University Press,[United States. Bureau of Land Management -- P...,Integrated public lands management : principle...
4,"[internetarchivebooks, inlibrary, printdisabled]","Kim, Daniel, author",2013-01-01T00:00:00Z,239 pages : 21 cm,ironmankoreanedi0000dani,kor,texts,Sŏul : Kyujang,"[Christianity -- Korea (South), Christian life]",Ch'ŏrin : Sesang i kamdangch'i mothal midŭm ŭi...
5,"[internetarchivebooks, inlibrary, printdisabled]",,1992-01-01T00:00:00Z,"[x, 305 p. : 24 cm, ""Co-sponsored by European ...",isbn_3764327227,eng,texts,Basel ; Boston : Birkhäuser Verlag,"[AIDS (Disease) -- Prevention -- Congresses, A...",Assessing AIDS prevention : selected papers pr...
6,"[internetarchivebooks, inlibrary, printdisabled]",,2011-01-01T00:00:00Z,,isbn_9780205036042,eng,texts,,,Exam Copy for Cultural Anthropology
7,"[internetarchivebooks, printdisabled]","[Tim Britt, Michael Sullivan, Michael Sullivan...",2015-01-01T00:00:00Z,,isbn_9780321926241_3,English,texts,Pearson,,Student's Solutions Manual
8,"[internetarchivebooks, printdisabled]",,2002-07-06T00:00:00Z,,isbn_9780324297836,eng,texts,International Thomson Publishing,,ACP KENNESAW ST U ECON 1100
9,"[internetarchivebooks, inlibrary, printdisabled]",,2013-01-01T00:00:00Z,,isbn_9780325035727,eng,texts,,,The story of naismith's game
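
The preview above shows fields such as `collection`, `creator`, and `subject` holding a plain string in some rows and a list in others. If a flat, spreadsheet-friendly export is wanted, one option is to join list values into a single string before writing the CSV. This is only a sketch: `flatten_multivalue` and the `"; "` separator are assumptions, not something the notebook defines.

```python
def flatten_multivalue(value, sep="; "):
    """Join list-like values into one string; pass scalars through as stripped strings."""
    if value is None:
        return ""
    # NaN is the only float that is not equal to itself
    if isinstance(value, float) and value != value:
        return ""
    if isinstance(value, (list, tuple)):
        return sep.join(str(item).strip() for item in value)
    return str(value).strip()

print(flatten_multivalue(["internetarchivebooks", "printdisabled"]))  # → internetarchivebooks; printdisabled
print(flatten_multivalue("Gire, Ken"))                                # → Gire, Ken
```

On a DataFrame this could be applied per column, for example `df["creator"].map(flatten_multivalue)`, ideally on a copy so the audit still sees the original values. Values that already contain the separator would become ambiguous after joining.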


In [None]:
# ============================================
# SAVE METADATA SNAPSHOT
# ============================================

snapshot_name = "ia_metadata_snapshot.csv"
df.to_csv(snapshot_name, index=False)
print(f"✓ Saved metadata snapshot: {snapshot_name}")
print(f"  Total items: {len(df)}")
print(f"  Columns: {', '.join(df.columns.tolist()[:5])}...")



In [None]:
# VALIDATE METADATA & IDENTIFY ISSUES

import re
from collections import Counter
import numpy as np  # is_blank uses this to detect array results from pd.isna

DATE_PATTERNS = [
    re.compile(r"^\d{4}$"),                 # YYYY
    re.compile(r"^\d{4}-\d{2}$"),            # YYYY-MM
    re.compile(r"^\d{4}-\d{2}-\d{2}$"),       # YYYY-MM-DD
]

def normalize_date(s: str):
    """
    Accepts:
      - YYYY
      - YYYY-MM
      - YYYY-MM-DD
      - YYYY/MM[/DD]
      - ISO timestamps like YYYY-MM-DDT00:00:00Z

    Returns:
      (normalized_date, status, notes)

    status:
      "ok"            → clean, accepted
      "normalized"    → valid but transformed
      "invalid"       → not parseable
    """
    if s is None:
        return ("", "invalid", "Empty date")

    raw = str(s).strip()
    if raw == "" or raw.lower() in {"nan", "none"}:
        return ("", "invalid", "Empty date")

    # Handle full ISO timestamps by stripping time (common in IA exports)
    if "T" in raw:
        date_part = raw.split("T")[0]
        # quick sanity: must look like YYYY-MM-DD after stripping
        if re.match(r"^\d{4}-\d{2}-\d{2}$", date_part):
            return (date_part, "normalized", "ISO timestamp normalized to date")
        # if it has T but doesn't split cleanly, treat as invalid
        return (raw, "invalid", "Timestamp-like value but could not extract YYYY-MM-DD")

    # Normalize slashes
    candidate = raw.replace("/", "-")

    # Pattern check
    if not any(p.match(candidate) for p in DATE_PATTERNS):
        return (raw, "invalid", "Unrecognized date format")

    # Light range validation
    parts = candidate.split("-")
    year = int(parts[0])

    if len(parts) >= 2:
        month = int(parts[1])
        if month < 1 or month > 12:
            return (candidate, "invalid", "Month out of range 1–12")

    if len(parts) == 3:
        day = int(parts[2])
        if day < 1 or day > 31:
            return (candidate, "invalid", "Day out of range 1–31")

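    # Slash-separated input is valid but was rewritten, so report it as normalized
    if candidate != raw:
        return (candidate, "normalized", "Slashes replaced with hyphens")

    return (candidate, "ok", "")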


def is_blank(x):
    """Return True for None/NaN, all-NA lists, empty strings, and 'nan'/'none' text."""
    isna_result = pd.isna(x)

    # pd.isna returns an array for list-like input; treat it as blank only if every element is NA
    if isinstance(isna_result, np.ndarray):
        return bool(isna_result.all())
    if isna_result:
        return True

    # Otherwise fall back to the string form
    s = str(x).strip()
    return s == "" or s.lower() in {"nan", "none"}


def add_issue(issues, row_idx, col, issue_type, value, notes=""):
    """Append one issue record to the issues list."""
    issues.append({
        "row_number": int(row_idx) + 2,  # +2 approximates spreadsheet row numbers (header + 1-index)
        "column": col,
        "issue_type": issue_type,
        "value": "" if is_blank(value) else str(value),
        "notes": notes
    })

issues = []

# 1) Missing required-ish fields (only if column exists)
for col in REQUIRED_IF_PRESENT:
    if col in df.columns:
        for i, v in df[col].items():
            if is_blank(v):
                add_issue(issues, i, col, "missing_required", v, notes=f"{col} is empty")

# 2) Date checks (timestamp-aware)
if "date" in df.columns:
    for i, v in df["date"].items():
        if is_blank(v):
            continue

        normalized, status, note = normalize_date(v)

        if status == "invalid":
            add_issue(issues, i, "date", "bad_date_format", v, notes=note)
        elif status == "normalized":
            add_issue(
                issues, i, "date", "date_normalized", v,
                notes=f"{note}. Suggested normalized form: {normalized}"
            )

        # Year sanity check (uses the normalized value when available)
        m = re.match(r"^(\d{4})", normalized)
        if m:
            year = int(m.group(1))
            if year < 1000 or year > 2100:
                add_issue(
                    issues, i, "date", "suspicious_year", v,
                    notes="Year outside 1000–2100"
                )

# 3) Whitespace checks (text-y columns)
for col in df.columns:
    if df[col].dtype == "object":
        for i, v in df[col].items():
            if is_blank(v):
                continue
            s = str(v)
            if s != s.strip():
                add_issue(issues, i, col, "whitespace", v, notes="Leading/trailing whitespace")

# 4) Duplicate identifiers (if present)
if "identifier" in df.columns:
    ids = df["identifier"].astype(str).str.strip()
    # treat blanks and 'nan' as non-identifiers
    valid = (ids != "") & (ids.str.lower() != "nan") & (ids.str.lower() != "none")
    dupes = ids.duplicated(keep=False) & valid
    for i, is_dupe in dupes.items():
        if is_dupe:
            add_issue(issues, i, "identifier", "duplicate_identifier", df.loc[i, "identifier"], notes="Identifier appears multiple times")

# 5) Multi-valued field consistency hints (IA often stores lists)
# We flag when a field mixes list-like and string-like values across rows.
multi_candidates = ["subject", "collection", "creator", "language"]
for col in multi_candidates:
    if col in df.columns:
        types_seen = set()
        for v in df[col].dropna().head(200):  # sample a bit
            types_seen.add(type(v).__name__)
        if len(types_seen) > 1:
            add_issue(issues, 0, col, "mixed_value_types", "", notes=f"Field has mixed types in sample: {sorted(types_seen)}")

issues_df = pd.DataFrame(issues)

severity_map = {
    "missing_required": "HIGH",
    "duplicate_identifier": "HIGH",
    "bad_date_format": "HIGH",
    "suspicious_year": "MEDIUM",
    "mixed_value_types": "MEDIUM",
    "whitespace": "LOW",
    "date_normalized": "LOW",  # valid date that only needed reformatting
}
if len(issues_df) > 0:
    issues_df["severity"] = issues_df["issue_type"].map(severity_map).fillna("MEDIUM")
    issues_df = issues_df[["severity","row_number","column","issue_type","value","notes"]]

print("Total issues found:", len(issues_df))
if len(issues_df) == 0:
    print("No issues found. Your IA metadata is vibing. ✅")
else:
    counts = Counter(issues_df["issue_type"])
    print("\nSummary by issue_type:")
    for k, v in counts.most_common():
        print(f"- {k}: {v}")

display(issues_df.head(30))

Total issues found: 500

Summary by issue_type:
- date_normalized: 497
- mixed_value_types: 3


Unnamed: 0,severity,row_number,column,issue_type,value,notes
0,LOW,2,date,date_normalized,2006-01-01T00:00:00Z,ISO timestamp normalized to date. Suggested no...
1,LOW,3,date,date_normalized,1975-01-01T00:00:00Z,ISO timestamp normalized to date. Suggested no...
2,LOW,4,date,date_normalized,1965-01-01T00:00:00Z,ISO timestamp normalized to date. Suggested no...
3,LOW,5,date,date_normalized,1993-01-01T00:00:00Z,ISO timestamp normalized to date. Suggested no...
4,LOW,6,date,date_normalized,2013-01-01T00:00:00Z,ISO timestamp normalized to date. Suggested no...
5,LOW,7,date,date_normalized,1992-01-01T00:00:00Z,ISO timestamp normalized to date. Suggested no...
6,LOW,8,date,date_normalized,2011-01-01T00:00:00Z,ISO timestamp normalized to date. Suggested no...
7,LOW,9,date,date_normalized,2015-01-01T00:00:00Z,ISO timestamp normalized to date. Suggested no...
8,LOW,10,date,date_normalized,2002-07-06T00:00:00Z,ISO timestamp normalized to date. Suggested no...
9,LOW,11,date,date_normalized,2013-01-01T00:00:00Z,ISO timestamp normalized to date. Suggested no...
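
The range check in `normalize_date` only confirms that the month is 1 to 12 and the day is 1 to 31, and the ISO-timestamp branch returns before any range check runs, so an impossible date such as `2023-02-31` cannot be rejected on those grounds alone. A stricter check could lean on the standard library's `datetime.date`, which raises `ValueError` for days that do not exist. The helper name `is_real_calendar_date` is hypothetical, and this is a sketch rather than a drop-in replacement.

```python
from datetime import date

def is_real_calendar_date(text: str) -> bool:
    """Return True if text splits into year-month-day and names a day that exists."""
    parts = text.split("-")
    if len(parts) != 3:
        return False
    try:
        year, month, day = (int(p) for p in parts)
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True

print(is_real_calendar_date("2024-02-29"))  # → True (leap day)
print(is_real_calendar_date("2023-02-31"))  # → False (February has no 31st)
```

Such a check would apply only to full `YYYY-MM-DD` values; year-only and year-month strings would still need the existing pattern check.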


In [None]:
# DOWNLOAD REPORTS


from google.colab import files

report_name = "ia_metadata_issues_report.csv"
issues_df.to_csv(report_name, index=False)
print(f"✓ Saved issues report: {report_name}")

print("Downloading files...")
files.download(report_name)
files.download(snapshot_name)
print("Downloads complete!")




## Summary

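This tool audits large amounts of Internet Archive metadata for quality issues.

**Key Findings:**
- In the 500-item sample run, 497 of the 500 reported issues are `date_normalized` and 3 are `mixed_value_types`
- ISO 8601 timestamps (e.g., `2006-01-01T00:00:00Z`) are flagged as `date_normalized` with LOW severity, since they are valid and only need the time portion stripped
- HIGH severity is reserved for missing required fields, duplicate identifiers, and unparseable dates
- Use the CSV reports to prioritize and track metadata corrections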
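
As a rough illustration of prioritizing from the report, the sketch below reads rows in the report's column layout and orders them HIGH, then MEDIUM, then LOW. It uses only the standard library; `triage` and `SEVERITY_RANK` are hypothetical names, and the sample rows are made up for the example (the HIGH row in particular does not come from the run shown in this notebook).

```python
import csv
import io

SEVERITY_RANK = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}

def triage(report_file):
    """Read an issues report and return its rows ordered by severity, then row number."""
    rows = list(csv.DictReader(report_file))
    return sorted(
        rows,
        # Unknown severities are ranked like MEDIUM, mirroring the notebook's default
        key=lambda r: (SEVERITY_RANK.get(r["severity"], 1), int(r["row_number"])),
    )

# Made-up sample in the report's column layout; the HIGH row is hypothetical
sample = io.StringIO(
    "severity,row_number,column,issue_type,value,notes\n"
    "LOW,2,date,date_normalized,2006-01-01T00:00:00Z,ISO timestamp normalized to date\n"
    "HIGH,7,title,missing_required,,title is empty\n"
    "MEDIUM,2,subject,mixed_value_types,,Field has mixed types in sample\n"
)

for row in triage(sample):
    print(row["severity"], row["row_number"], row["column"], row["issue_type"])
```

With the real file, the same function could be called as `triage(open("ia_metadata_issues_report.csv", newline="", encoding="utf-8"))`, preferably inside a `with` block.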
