# 01 — Data Collection

Fetch every non-redirect page in the main (0) and category (14) namespaces that transcludes `{{Playlist}}` from utaite.wiki via pywikibot.

**Output:** Raw wikitext files in `data/raw/` + `data/raw/manifest.json`

## Setup

In [4]:
import os
import sys
import json
import re
from pathlib import Path
from datetime import datetime

# Project root: assumes the notebook is run from a direct subdirectory of the repo
PROJECT_ROOT = Path.cwd().parent
DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_RAW.mkdir(parents=True, exist_ok=True)

# Configure pywikibot BEFORE importing it
os.environ["PYWIKIBOT_DIR"] = str(PROJECT_ROOT / "src" / "pwb_config")

# Add the pwb config directory to sys.path so pywikibot can find the family file
sys.path.insert(0, str(PROJECT_ROOT / "src" / "pwb_config"))

print(f"Project root: {PROJECT_ROOT}")
print(f"Data output:  {DATA_RAW}")
print(f"PWB config:   {os.environ['PYWIKIBOT_DIR']}")

Project root: /home/akkun/utaitewiki-songlist-migration
Data output:  /home/akkun/utaitewiki-songlist-migration/data/raw
PWB config:   /home/akkun/utaitewiki-songlist-migration/src/pwb_config
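
The ordering in the cell above matters because pywikibot resolves its configuration directory when it is first imported. A minimal guard, assuming a hypothetical helper named `configure_pwb_dir` (not part of the notebook), turns that requirement into an explicit check instead of a comment:

```python
import os
import sys
from pathlib import Path


def configure_pwb_dir(config_dir: Path) -> str:
    """Point pywikibot at config_dir, refusing if pywikibot is already loaded.

    Hypothetical helper for illustration only.
    """
    if "pywikibot" in sys.modules:
        # Setting the variable now would be too late to affect the loaded config.
        raise RuntimeError(
            "pywikibot is already imported; restart the kernel and set "
            "PYWIKIBOT_DIR before importing it"
        )
    os.environ["PYWIKIBOT_DIR"] = str(config_dir)
    return os.environ["PYWIKIBOT_DIR"]
```

Calling it before `import pywikibot` is equivalent to the assignment in the cell above; calling it afterwards fails loudly rather than silently using the wrong configuration.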


In [5]:
import pywikibot
from pywikibot import pagegenerators

# Connect to utaite.wiki
site = pywikibot.Site("en", "utaitewiki")

print(f"Connected to: {site.hostname()}")
print("\nNote: We're running in read-only mode (no login required).")
print("The wiki's public API allows fetching page content without authentication.")

Connected to: utaite.wiki

Note: We're running in read-only mode (no login required).
The wiki's public API allows fetching page content without authentication.


## Step 1: Get All Pages Transcluding `{{Playlist}}`

In [6]:
# Fetch all pages that transclude Template:Playlist
# Namespaces: 0 (main) and 14 (category)
template_page = pywikibot.Page(site, "Template:Playlist")

print("Fetching pages transcluding Template:Playlist...")
print("Namespaces: 0 (Main), 14 (Category)")
print("Filtering out redirects...")
print()

all_pages = []
for page in template_page.embeddedin(namespaces=[0, 14], filter_redirects=False):
    if page.isRedirectPage():
        continue
    all_pages.append(page)

print(f"Found {len(all_pages)} non-redirect pages transcluding Template:Playlist")
print(f"  - ns=0 (Main): {sum(1 for p in all_pages if p.namespace() == 0)}")
print(f"  - ns=14 (Category): {sum(1 for p in all_pages if p.namespace() == 14)}")

Fetching pages transcluding Template:Playlist...
Namespaces: 0 (Main), 14 (Category)
Filtering out redirects...

Found 925 non-redirect pages transcluding Template:Playlist
  - ns=0 (Main): 891
  - ns=14 (Category): 34
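
For readers unfamiliar with what `embeddedin` does, the query corresponds to the MediaWiki Action API module `list=embeddedin`. The sketch below only builds the request parameters for such a query; the function name `embeddedin_params` is hypothetical, the wiki's API endpoint is deliberately left out because the notebook does not state it, and the exact request pywikibot sends may differ in detail.

```python
def embeddedin_params(template_title, namespaces, limit="max"):
    """Build Action API parameters for listing pages that transclude a template.

    Illustrative only: shows the documented `ei*` parameters, not
    pywikibot's internal request.
    """
    return {
        "action": "query",
        "format": "json",
        "list": "embeddedin",
        "eititle": template_title,
        # Multiple namespaces are joined with a pipe character.
        "einamespace": "|".join(str(ns) for ns in namespaces),
        # Ask the server to skip redirect pages.
        "eifilterredir": "nonredirects",
        "eilimit": str(limit),
    }


params = embeddedin_params("Template:Playlist", [0, 14])
print(params["einamespace"])  # 0|14
```

Results larger than one batch are paged with a continuation token, which pywikibot handles transparently in the loop above.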


## Step 2: Build Exclusion Set (Song Article Categories)

In [7]:
# Exclude pages in "Utattemita Songs" and "Famous Utattemita Songs" categories
# These are individual song articles, not artist playlists

exclude_categories = [
    "Utattemita Songs",
    "Famous Utattemita Songs",
]

excluded_pages = set()
for cat_name in exclude_categories:
    cat = pywikibot.Category(site, f"Category:{cat_name}")
    members = list(cat.members(namespaces=[0]))
    excluded_pages.update(p.title() for p in members)
    print(f"Category '{cat_name}': {len(members)} pages to exclude")

print(f"\nTotal exclusion set: {len(excluded_pages)} unique pages")

Category 'Utattemita Songs': 35 pages to exclude
Category 'Famous Utattemita Songs': 667 pages to exclude

Total exclusion set: 667 unique pages


In [8]:
# Apply exclusion filter
filtered_pages = [p for p in all_pages if p.title() not in excluded_pages]
excluded_count = len(all_pages) - len(filtered_pages)

print(f"Before filter: {len(all_pages)} pages")
print(f"Excluded:      {excluded_count} song article pages")
print(f"After filter:  {len(filtered_pages)} pages to collect")

Before filter: 925 pages
Excluded:      0 song article pages
After filter:  925 pages to collect
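
An exclusion count of zero means that none of the titles in the exclusion set appears among the transcluding pages. A small set-based report makes that overlap explicit, which helps tell "nothing to exclude" apart from "the filter is not matching". This is a sketch with a hypothetical helper name and made-up titles:

```python
def exclusion_report(candidate_titles, excluded_titles):
    """Summarize how an exclusion set interacts with a list of candidates."""
    candidates = set(candidate_titles)
    excluded = set(excluded_titles)
    overlap = candidates & excluded
    return {
        "candidates": len(candidates),
        "excluded_set": len(excluded),
        "overlap": sorted(overlap),        # titles that would be dropped
        "kept": len(candidates - excluded),
    }


# Made-up titles for illustration.
report = exclusion_report(
    ["Artist A", "Artist B/Songs", "Song X"],
    ["Song X", "Song Y"],
)
print(report["overlap"])  # ['Song X']
print(report["kept"])     # 2
```

In the notebook, passing `[p.title() for p in all_pages]` and `excluded_pages` would be expected to yield an empty `overlap` list, matching the output above.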


## Step 3: Associate Subpages to Root Artists

In [9]:
def get_root_artist(page_title: str) -> str:
    """Extract root artist name from a page title.
    
    Examples:
        'Ameno Kosame' -> 'Ameno Kosame'
        'Mafumafu/Songs/Cover' -> 'Mafumafu'
        'Mafumafu/Songs/Original' -> 'Mafumafu'
        'Mafumafu/Discography/2021-2025' -> 'Mafumafu'
        'Category:Some Category' -> 'Category:Some Category'
    """
    # Don't process category pages
    if page_title.startswith("Category:"):
        return page_title
    
    # Split on '/' and take the first part
    return page_title.split("/")[0]


# Test the function
test_cases = [
    "Ameno Kosame",
    "Mafumafu/Songs/Cover",
    "Mafumafu/Songs/Original",
    "Mafumafu/Songs/Privated",
    "Mafumafu/Discography/2021-2025",
    "Category:Some Category",
]
for tc in test_cases:
    print(f"  '{tc}' -> '{get_root_artist(tc)}'")

  'Ameno Kosame' -> 'Ameno Kosame'
  'Mafumafu/Songs/Cover' -> 'Mafumafu'
  'Mafumafu/Songs/Original' -> 'Mafumafu'
  'Mafumafu/Songs/Privated' -> 'Mafumafu'
  'Mafumafu/Discography/2021-2025' -> 'Mafumafu'
  'Category:Some Category' -> 'Category:Some Category'
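
To show how the root artist is meant to tie subpages together, the following sketch groups a list of titles under their root. `get_root_artist` is repeated so the block runs on its own; `group_by_root` is a hypothetical helper, not something defined in the notebook.

```python
from collections import defaultdict


def get_root_artist(page_title: str) -> str:
    """Same logic as the notebook cell above."""
    if page_title.startswith("Category:"):
        return page_title
    return page_title.split("/")[0]


def group_by_root(titles):
    """Map each root artist to the list of page titles that belong to it."""
    groups = defaultdict(list)
    for title in titles:
        groups[get_root_artist(title)].append(title)
    return dict(groups)


groups = group_by_root([
    "Ameno Kosame",
    "Mafumafu/Songs/Cover",
    "Mafumafu/Songs/Original",
    "Category:Some Category",
])
print(groups["Mafumafu"])  # ['Mafumafu/Songs/Cover', 'Mafumafu/Songs/Original']
```

One caveat of splitting on `/`: a main-namespace title that legitimately contains a slash would be treated as a subpage of whatever precedes the slash.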


## Step 4: Fetch Wikitext and Save to Disk

In [10]:
def sanitize_filename(title: str) -> str:
    """Convert a page title to a safe filename.
    
    Replaces characters that are invalid in filenames.
    """
    # Replace / with __
    safe = title.replace("/", "__")
    # Replace other problematic characters
    safe = re.sub(r'[<>:"\\|?*]', '_', safe)
    return safe


# Test
print(sanitize_filename("Mafumafu/Songs/Cover"))  # Mafumafu__Songs__Cover
print(sanitize_filename("Category:SIXFONIA"))       # Category_SIXFONIA

Mafumafu__Songs__Cover
Category_SIXFONIA
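
Because several characters collapse to `_` and `/` becomes `__`, two different titles can in principle map to the same filename, and the later write would silently overwrite the earlier one. A minimal collision check, using a hypothetical helper `find_collisions` and made-up titles, could look like this:

```python
import re
from collections import defaultdict


def sanitize_filename(title: str) -> str:
    """Same logic as the notebook cell above."""
    safe = title.replace("/", "__")
    return re.sub(r'[<>:"\\|?*]', "_", safe)


def find_collisions(titles):
    """Return {filename: [titles]} for filenames produced by more than one title."""
    by_name = defaultdict(list)
    for title in titles:
        by_name[sanitize_filename(title)].append(title)
    return {name: ts for name, ts in by_name.items() if len(ts) > 1}


# Made-up titles chosen to collide.
print(find_collisions(["A/B", "A__B", "What?", "What*", "Solo"]))
# {'A__B': ['A/B', 'A__B'], 'What_': ['What?', 'What*']}
```

Running the check on the real title list before fetching is cheap; an empty result means every page gets its own file.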


In [11]:
# Fetch wikitext for all filtered pages and save to data/raw/
manifest = {
    "collection_date": datetime.now().isoformat(),
    "total_pages_found": len(all_pages),
    "excluded_count": excluded_count,
    "collected_count": 0,
    "errors": [],
    "pages": [],
}

print(f"Fetching wikitext for {len(filtered_pages)} pages...")
print("="*60)

for i, page in enumerate(filtered_pages, 1):
    title = page.title()
    root_artist = get_root_artist(title)
    filename = sanitize_filename(title) + ".txt"
    filepath = DATA_RAW / filename
    
    try:
        # Fetch raw wikitext (latest revision)
        wikitext = page.text
        
        # Save to disk
        filepath.write_text(wikitext, encoding="utf-8")
        
        # Add to manifest
        manifest["pages"].append({
            "title": title,
            "root_artist": root_artist,
            "namespace": page.namespace().id,
            "filename": filename,
            "size_bytes": len(wikitext.encode("utf-8")),
            "is_subpage": "/" in title and not title.startswith("Category:"),
        })
        
        if i % 50 == 0 or i == len(filtered_pages):
            print(f"  [{i}/{len(filtered_pages)}] Fetched: {title}")
            
    except Exception as e:
        manifest["errors"].append({"title": title, "error": str(e)})
        print(f"  [ERROR] {title}: {e}")

manifest["collected_count"] = len(manifest["pages"])

print("="*60)
print(f"Done! Collected {manifest['collected_count']} pages.")
if manifest["errors"]:
    print(f"Errors: {len(manifest['errors'])}")

Fetching wikitext for 925 pages...
  [50/925] Fetched: Hachi
  [100/925] Fetched: 000
  [150/925] Fetched: Enn
  [200/925] Fetched: Jess Nuno
  [250/925] Fetched: Rena
  [300/925] Fetched: Nanamori
  [350/925] Fetched: Rimokon (NND)
  [400/925] Fetched: SymaG
  [450/925] Fetched: Kurohina
  [500/925] Fetched: HAKURO
  [550/925] Fetched: Kurousagi Uru
  [600/925] Fetched: Rin Rin
  [650/925] Fetched: Rere
  [700/925] Fetched: Root/Songs
  [750/925] Fetched: Meloa
  [800/925] Fetched: Pokota/Songs
  [850/925] Fetched: For The More
  [900/925] Fetched: Category:Circle of Friends
  [925/925] Fetched: Category:Zessei Bijin!
Done! Collected 925 pages.
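
The loop above refetches every page on each run. If a run is interrupted, one option is to skip titles whose output file already exists. The sketch below is an assumption about how that could be done, with a hypothetical helper `pending_titles` and a simplified filename function; it does not touch the network.

```python
import tempfile
from pathlib import Path


def pending_titles(titles, out_dir, to_filename):
    """Return the titles whose output file is not yet present in out_dir."""
    out_dir = Path(out_dir)
    return [t for t in titles if not (out_dir / to_filename(t)).exists()]


def simple_filename(title):
    # Simplified stand-in for sanitize_filename(title) + ".txt".
    return title.replace("/", "__") + ".txt"


with tempfile.TemporaryDirectory() as tmp:
    # Pretend one page was already saved by an earlier run.
    (Path(tmp) / "Hachi.txt").write_text("...", encoding="utf-8")
    todo = pending_titles(["Hachi", "Root/Songs"], tmp, simple_filename)
    print(todo)  # ['Root/Songs']
```

Skipping existing files trades freshness for speed: a page edited on the wiki since the last run would not be refreshed, so this only suits resuming a single collection pass.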


In [12]:
# Save manifest
manifest_path = DATA_RAW / "manifest.json"
with open(manifest_path, "w", encoding="utf-8") as f:
    json.dump(manifest, f, indent=2, ensure_ascii=False)

print(f"Manifest saved to: {manifest_path}")
print(f"Total files in data/raw/: {len(list(DATA_RAW.glob('*.txt')))}")

Manifest saved to: /home/akkun/utaitewiki-songlist-migration/data/raw/manifest.json
Total files in data/raw/: 925
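
Matching file counts are a useful first check, but they do not confirm that each manifest entry points at the right file. The following sketch rereads `manifest.json` and compares every entry against the file on disk, using the `pages`, `filename`, `title`, and `size_bytes` fields written above. The helper name `verify_manifest` is hypothetical, and the size comparison assumes no newline translation on write (true on Linux, not necessarily on Windows).

```python
import json
from pathlib import Path


def verify_manifest(raw_dir):
    """Return a list of (title, problem) tuples; empty means all entries check out."""
    raw_dir = Path(raw_dir)
    manifest = json.loads((raw_dir / "manifest.json").read_text(encoding="utf-8"))
    problems = []
    for entry in manifest["pages"]:
        path = raw_dir / entry["filename"]
        if not path.exists():
            problems.append((entry["title"], "missing file"))
        elif path.stat().st_size != entry["size_bytes"]:
            # Assumes bytes on disk equal the UTF-8 length recorded at fetch time.
            problems.append((entry["title"], "size mismatch"))
    return problems
```

In the notebook this would be called as `verify_manifest(DATA_RAW)` after the manifest is saved.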


## Step 5: Quick Summary Statistics

In [13]:
import pandas as pd

# Load manifest into a DataFrame for quick analysis
df = pd.DataFrame(manifest["pages"])

print("=== Collection Summary ===")
print(f"Total pages collected: {len(df)}")
print(f"Unique root artists:  {df['root_artist'].nunique()}")
print(f"Subpages:             {df['is_subpage'].sum()}")
print(f"Main pages:           {(~df['is_subpage']).sum()}")
print()
print("Namespace breakdown:")
for ns, count in df['namespace'].value_counts().items():
    ns_name = {0: 'Main', 14: 'Category'}.get(ns, f'ns={ns}')
    print(f"  {ns_name}: {count}")
print()
print(f"Total wikitext size: {df['size_bytes'].sum() / 1024 / 1024:.1f} MB")
print(f"Average page size:   {df['size_bytes'].mean() / 1024:.1f} KB")
print(f"Largest page:        {df.loc[df['size_bytes'].idxmax(), 'title']} ({df['size_bytes'].max() / 1024:.1f} KB)")

=== Collection Summary ===
Total pages collected: 925
Unique root artists:  911
Subpages:             128
Main pages:           797

Namespace breakdown:
  Main: 891
  Category: 34

Total wikitext size: 11.2 MB
Average page size:   12.4 KB
Largest page:        Hanatan/Songs (91.7 KB)


In [14]:
# Show artists with the most subpages (Type 2 pattern: song lists split across subpages)
subpage_counts = df[df["is_subpage"]].groupby("root_artist").size().sort_values(ascending=False)

if len(subpage_counts) > 0:
    print("=== Top Artists by Subpage Count ===")
    print(subpage_counts.head(20).to_string())
else:
    print("No subpages found (all pages are Type 1 monolithic).")

=== Top Artists by Subpage Count ===
root_artist
Mafumafu          5
Soraru            4
Hanatan           2
Xea               2
After the Rain    1
Ado               1
3bu               1
Alfakyun.         1
Amatsuki          1
Amu               1
Araki             1
Ayaponzu*         1
Buzz Panda        1
Chogakusei        1
Chomaiyo          1
Choumiryou        1
Chrono Reverse    1
Clear             1
CleeNoah          1
Colon             1


In [15]:
# Quick check: count lines starting with '#' (song entries) per file
entry_counts = []
for _, row in df.iterrows():
    filepath = DATA_RAW / row["filename"]
    text = filepath.read_text(encoding="utf-8")
    # Count lines that start with # followed by a quoted title (entries inside {{Playlist}}).
    # Simple heuristic; proper extraction happens in notebook 02.
    count = len(re.findall(r'^\s*#\s*"', text, re.MULTILINE))
    entry_counts.append(count)

df["approx_entry_count"] = entry_counts

print("=== Song Entry Distribution (approximate) ===")
print(df["approx_entry_count"].describe())
print(f"\nTotal estimated song entries: {df['approx_entry_count'].sum():,}")
print(f"\nTop 10 pages by entry count:")
top10 = df.nlargest(10, "approx_entry_count")[["title", "root_artist", "approx_entry_count"]]
print(top10.to_string(index=False))

=== Song Entry Distribution (approximate) ===
count    925.000000
mean      91.043243
std       88.301458
min        0.000000
25%       25.000000
50%       68.000000
75%      129.000000
max      745.000000
Name: approx_entry_count, dtype: float64

Total estimated song entries: 84,215

Top 10 pages by entry count:
                                                            title root_artist  approx_entry_count
                                                    Hanatan/Songs     Hanatan                 745
Hanatan/Songs/Covers on YouTube Livestreams and TwitCasting Lives     Hanatan                 638
                                                   Kanipan./Songs    Kanipan.                 558
                                               Soraru/Songs/Cover      Soraru                 516
                                                  Uratanuki/Songs   Uratanuki                 489
                                                  Meramipop/Songs   Meramipop                 485

## Done!

Raw data is saved in `data/raw/`. Next step: `02_exploratory_analysis.ipynb` for detailed statistics.