# 01 - Data Collection

This notebook collects metadata from the Universitas Widyatama repository using the OAI-PMH protocol.

## Overview
- Configure the OAI-PMH harvester
- Collect metadata from the repository
- Save the raw data to a CSV file for further analysis

In [1]:
import sys
import warnings
from pathlib import Path

project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

from src.config import get_settings, ensure_directories
from src.harvester import OAIPMHHarvester

warnings.filterwarnings('ignore')

In [2]:
settings = get_settings()
ensure_directories(settings)

print(f"OAI-PMH Endpoint : {settings.oaipmh_endpoint}")
print(f"Raw Data Directory : {settings.raw_data_dir}")

OAI-PMH Endpoint : https://repository.widyatama.ac.id/oai/request
Raw Data Directory : c:\Users\alifn\Code\topic-modeling-utama\data\raw


## 1. Repository Identification

First, let's identify the repository and see what information is available.

In [3]:
harvester = OAIPMHHarvester(settings)
repo_info = harvester.identify()

print("Repository Information:")
print("-" * 50)
for key, value in repo_info.items():
    print(f"{key}: {value}")

Repository Information:
--------------------------------------------------
repositoryName: Widyatama University
baseURL: https://repository.widyatama.ac.id/server/oai/request
protocolVersion: 2.0
adminEmail: dspace@widyatama.ac.id
earliestDatestamp: 2007-11-19T01:40:55Z
deletedRecord: transient
granularity: YYYY-MM-DDThh:mm:ssZ
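
For reference, the fields above come from an OAI-PMH `Identify` response, which is a small XML document. The sketch below parses a hand-written, abbreviated sample built from the values printed above (it is not fetched from the server) using only the standard library. `parse_identify` is a hypothetical helper for illustration and is not necessarily how `OAIPMHHarvester.identify()` works internally.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Hand-written sample response, abbreviated to four fields.
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <repositoryName>Widyatama University</repositoryName>
    <protocolVersion>2.0</protocolVersion>
    <deletedRecord>transient</deletedRecord>
    <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
  </Identify>
</OAI-PMH>"""


def parse_identify(xml_text):
    """Return the children of <Identify> as a flat {tag: text} dict."""
    root = ET.fromstring(xml_text)
    identify = root.find(f"{OAI_NS}Identify")
    if identify is None:
        raise ValueError("response has no <Identify> element")
    info = {}
    for child in identify:
        # Strip the "{namespace}" part that ElementTree prepends to each tag.
        tag = child.tag.split("}", 1)[-1]
        info[tag] = (child.text or "").strip()
    return info


info = parse_identify(SAMPLE_IDENTIFY)
for key, value in info.items():
    print(f"{key}: {value}")
```

In the OAI-PMH specification, `deletedRecord: transient` means the repository may report deleted records but does not guarantee that it keeps that information persistently, so a later re-harvest cannot rely on deletion notices alone.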


## 2. List Available Sets

Check what collections/sets are available in the repository.

In [4]:
# List available sets
sets = harvester.list_sets()

print(f"Found {len(sets)} sets:")
print("-" * 50)
for s in sets[:20]:  # Show first 20
    print(f"  {s['setSpec']}: {s['setName']}")

if len(sets) > 20:
    print(f"  ... and {len(sets) - 20} more")

Found 41 sets:
--------------------------------------------------
  com_123456789_3: ACADEMIC PUBLICATIONS
  com_123456789_15: FINAL ASSIGNMENT (Bachelor and Vocational Degree)
  com_123456789_9: LECTURER JOURNAL
  com_123456789_1: MASTER THESIS
  com_123456789_27: STUDENT JOURNAL
  com_123456789_15196: STUDENT WORKING PAPER
  col_123456789_2: Accounting
  col_123456789_10: Accredited National Journal
  col_123456789_15241: Civil Engineering - Bachelor
  col_123456789_17: Design And Visual Communication
  col_123456789_10722: Economics - Bachelor
  col_123456789_23: Economics - Vocational
  col_123456789_15262: Electrical Engineering
  col_123456789_22: English - Bachelor
  col_123456789_108720: Film And Television Production
  col_123456789_108925: INAUGURAL SPEECH SCRIPT
  col_123456789_19: Industrial Engineering - Bachelor
  col_123456789_15399: Informatics - Bachelor
  col_123456789_21: Informatics - Bachelor
  col_123456789_15263: Information System - Bachelor
  ... and 21 more
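
The set specs in this listing start with either `com_` or `col_`. Assuming the usual DSpace convention (communities versus collections), a minimal sketch for grouping them by that prefix could look like this. `group_sets_by_kind` is a hypothetical helper, and `sample_sets` is a three-item excerpt of the listing above rather than the full result of `harvester.list_sets()`.

```python
from collections import defaultdict


def group_sets_by_kind(sets):
    """Group set dicts by the prefix of their setSpec (assumed DSpace naming)."""
    kinds = {"com": "community", "col": "collection"}
    grouped = defaultdict(list)
    for s in sets:
        prefix = s["setSpec"].split("_", 1)[0]
        grouped[kinds.get(prefix, "other")].append(s)
    return dict(grouped)


# Excerpt of the listing above, used as sample input.
sample_sets = [
    {"setSpec": "com_123456789_15",
     "setName": "FINAL ASSIGNMENT (Bachelor and Vocational Degree)"},
    {"setSpec": "col_123456789_2", "setName": "Accounting"},
    {"setSpec": "col_123456789_22", "setName": "English - Bachelor"},
]

grouped = group_sets_by_kind(sample_sets)
for kind, items in grouped.items():
    print(f"{kind}: {len(items)}")
```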


## 3. Harvest Metadata

Now let's harvest the metadata records. This may take some time depending on the repository size.

**Note**: The harvester uses cloudscraper to bypass any anti-bot protections.

In [5]:
# Configuration for harvesting
# MAX_RECORDS: None means a full harvest; set a number for a quick test run
MAX_RECORDS = settings.oaipmh_max_records
# SET_SPEC: None means all sets; otherwise a set spec such as "com_123456789_1"
SET_SPEC = settings.oaipmh_set_spec

print("Harvesting configuration:")
print(f"  Max records: {MAX_RECORDS or 'All'}")
print(f"  Set filter: {SET_SPEC or 'None (all sets)'}")

Harvesting configuration:
  Max records: All
  Set filter: com_123456789_15


In [6]:
# Harvest and save
output_path = settings.raw_data_dir / settings.raw_metadata_file

print("\nStarting harvest...")
print(f"Output will be saved to: {output_path}")
print("-" * 50)

df = harvester.harvest_and_save(
    output_path=output_path,
    set_spec=SET_SPEC,
    max_records=MAX_RECORDS,
    show_progress=True,
)

print("\nHarvest complete!")
print(f"Total records: {len(df)}")


Starting harvest...
Output will be saved to: c:\Users\alifn\Code\topic-modeling-utama\data\raw\raw_metadata.csv
--------------------------------------------------


Harvesting records: 0records [00:00, ?records/s]
INFO:src.harvester:Starting harvest from https://repository.widyatama.ac.id/oai/request
INFO:src.harvester:Parameters: {'metadataPrefix': 'oai_dc', 'set': 'com_123456789_15'}
Harvesting records: 12647records [00:10, 1203.98records/s]
INFO:src.harvester:Harvested 12647 records
INFO:src.harvester:Saved 12647 records to c:\Users\alifn\Code\topic-modeling-utama\data\raw\raw_metadata.csv



Harvest complete!
Total records: 12647
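
OAI-PMH returns large `ListRecords` results in pages: each incomplete page carries a `resumptionToken`, the client sends that token back to get the next page, and a missing or empty token marks the end of the list. The sketch below shows that loop against an in-memory stand-in for the server. `iter_records`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names for illustration; the actual paging inside `harvest_and_save` may differ.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# One page of results: (records, resumption token or None when finished).
Page = Tuple[List[str], Optional[str]]


def iter_records(fetch_page: Callable[[Optional[str]], Page],
                 max_records: Optional[int] = None) -> Iterator[str]:
    """Yield records page by page until the token runs out or the cap is hit."""
    token = None
    yielded = 0
    while True:
        records, token = fetch_page(token)
        for record in records:
            if max_records is not None and yielded >= max_records:
                return
            yield record
            yielded += 1
        if not token:
            return


# In-memory stand-in for the server: three pages keyed by the incoming token.
FAKE_PAGES = {
    None: (["rec-1", "rec-2"], "page-2"),
    "page-2": (["rec-3", "rec-4"], "page-3"),
    "page-3": (["rec-5"], None),
}


def fake_fetch(token):
    return FAKE_PAGES[token]


all_records = list(iter_records(fake_fetch))
first_three = list(iter_records(fake_fetch, max_records=3))
print(len(all_records), first_three)
```

Writing the loop as a generator means a record cap such as `MAX_RECORDS` stops the harvest without requesting pages that are not needed.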


## 4. Quick Data Overview

Let's take a quick look at the harvested data.

In [7]:
# Display basic info
print("Dataset Shape:", df.shape)
print("\nColumns:")
for col in df.columns:
    print(f"  - {col}")

Dataset Shape: (12647, 10)

Columns:
  - identifier
  - title
  - abstract
  - authors
  - date
  - subjects
  - publisher
  - types
  - language
  - source
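
The harvest log shows `metadataPrefix: 'oai_dc'`, so these columns are presumably derived from Dublin Core elements. The following is a minimal sketch of how one `oai_dc` record could be flattened into a row. The element-to-column mapping in `FIELD_MAP`, the `"; "` separator for repeated elements, and the sample record are all assumptions for illustration; `identifier`, `date`, and `source` are left out, and the real mapping lives in `src.harvester`.

```python
import xml.etree.ElementTree as ET

# Dublin Core element namespace.
DC = "{http://purl.org/dc/elements/1.1/}"

# Assumed mapping from Dublin Core elements to column names.
FIELD_MAP = {
    "title": "title",
    "description": "abstract",
    "creator": "authors",
    "subject": "subjects",
    "publisher": "publisher",
    "type": "types",
    "language": "language",
}

# Made-up sample record, not taken from the repository.
SAMPLE_DC = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example thesis title</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:subject>first keyword</dc:subject>
  <dc:subject>second keyword</dc:subject>
  <dc:type>Thesis</dc:type>
</oai_dc:dc>"""


def dc_to_row(xml_text):
    """Flatten one oai_dc record; repeated elements are joined with '; '."""
    root = ET.fromstring(xml_text)
    row = {}
    for element, column in FIELD_MAP.items():
        values = [(node.text or "").strip() for node in root.iter(f"{DC}{element}")]
        row[column] = "; ".join(v for v in values if v)
    return row


row = dc_to_row(SAMPLE_DC)
for column, value in row.items():
    print(f"{column}: {value!r}")
```

In this sketch, elements that are absent from the record end up as empty strings rather than `NaN`.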


In [8]:
# Display first few records
df.head()

| | identifier | title | abstract | authors | date | subjects | publisher | types | language | source |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | oai:repository.widyatama.ac.id:123456789/14397 | PENGARUH BUDAYA KESELAMATAN DAN KESEHATAN KERJ... | Tujuan penelitian ini adalah untuk mengetahui ... | Falyana, Diki Hendra | 2022-01-05T05:10:37Z | budaya keselamatan dan kesehatan (K3); prosedu... | Program Studi Manajemen S1 Universitas Widyatama | Thesis | other | |
| 1 | oai:repository.widyatama.ac.id:123456789/859 | Pengaruh Kompensasi terhadap Motivasi Kerja Ka... | Skripsi ini disusun oleh Andri Tanjung, NRP 02... | Tanjung, Andri | 2009-03-11T02:35:44Z | Pengaruh Kompensasi terhadap Motivasi Kerja Ka... | Universitas Widyatama | Thesis | other | |
| 2 | oai:repository.widyatama.ac.id:123456789/5337 | PERANAN SISTEM INFORMASI AKUNTANSI DALAM MENUN... | Setiap organisasi didirikan untuk mencapai tuj... | Setiawan, David | 2015-06-17T06:18:20Z | Sistem Informasi Akuntansi; Pengendalian Inter... | Universitas Widyatama | Thesis | other | |
| 3 | oai:repository.widyatama.ac.id:123456789/107890 | PENGARUH USIA DAN MASA KERJA TERHADAP PRODUKTI... | Penelitian ini bertujuan untuk mengetahui peng... | Fauzia, Galih Eza | 2024-04-25T03:35:03Z | | | Thesis | other | |
| 4 | oai:repository.widyatama.ac.id:123456789/8665 | PENGARUH SISTEM PENGENDALIAN INTERNAL PEMERINT... | Penelitian ini bertujuan untuk mengetahui peng... | Aruan, Hicca Maria Gandi Putri | 2017-10-18T23:53:16Z | Sistem Pengendalian Internal Pemerintah; Kuali... | Universitas Widyatama | Thesis | other | |


In [9]:
# Check for missing values
print("Missing values per column:")
print("-" * 50)
missing = df.isnull().sum()
for col in df.columns:
    count = missing[col]
    pct = (count / len(df)) * 100
    print(f"{col}: {count} ({pct:.1f}%)")

Missing values per column:
--------------------------------------------------
identifier: 0 (0.0%)
title: 0 (0.0%)
abstract: 0 (0.0%)
authors: 0 (0.0%)
date: 0 (0.0%)
subjects: 0 (0.0%)
publisher: 0 (0.0%)
types: 0 (0.0%)
language: 0 (0.0%)
source: 0 (0.0%)
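
Note that `isnull()` only counts `NaN`/`None`. The `df.head()` preview shows blank `source`, `subjects`, and `publisher` cells even though every column reports 0 missing here, which suggests that absent fields are stored as empty strings. A minimal sketch, on a toy frame rather than the harvested data, that also counts blank strings as missing:

```python
import pandas as pd

# Toy frame standing in for the harvested data.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "publisher": ["Universitas Widyatama", "", "   "],
    "source": ["", "", ""],
})

# isnull() finds nothing because empty strings are not NaN.
nan_counts = toy.isnull().sum()

# Count empty or whitespace-only strings per column.
blank_counts = toy.apply(lambda col: col.astype(str).str.strip().eq("").sum())

effective_missing = nan_counts + blank_counts
print(effective_missing)
```

The same two-step count could be applied to `df` to get a more realistic picture of how complete each column is.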


In [10]:
# Sample abstract
sample = df[df['abstract'].notna()].sample(1).iloc[0]

print("Sample Record:")
print("=" * 50)
print(f"Title: {sample['title']}")
print(f"\nAuthors: {sample['authors']}")
print(f"\nDate: {sample['date']}")
print(f"\nAbstract:\n{sample['abstract'][:500]}...")

Sample Record:
==================================================
Title: PENGARUH MANAJEMEN LABA TERHADAP TINGKAT PENGUNGKAPAN LAPORAN KEUANGAN PADA PERUSAHAAN MANUFAKTUR YANG TERDAFTAR DI BURSA EFEK INDONESIA

Authors: Utami, Citra Kharisma

Date: 2015-11-13T03:30:14Z

Abstract:
Perusahaan selalui ingin dipandang memiliki kinerja yang baik. Namun
apabila perusahaan tidak dapat mencapai kinerja yang ditentukan, maka
manajemen akan memanfaatkan fleksibilitas yang diperbolehkan oleh standar
akuntansi dalam menyusun laporan keuangan untuk memodifikasi laba yang
dilaporkan. Manajemen termotivasi untuk memperlihatkan kinerja yang baik
dalam menghasilkan nilai atau keuntungan maksimal bagi perusahaan dengan cara
memberikan informasi laba lebih baik, praktik ini dikenal dengan ...


## Summary

Data collection is complete. The raw metadata has been saved to:
- `data/raw/raw_metadata.csv`

**Next Steps:**
1. Run `01b_eda_raw_data.ipynb` for exploratory data analysis
2. Run `02_data_cleaning.ipynb` to clean the data based on EDA findings
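
One thing to keep in mind when the next notebooks reload the CSV: `pandas.read_csv` converts blank fields to `NaN` by default, so missing-value counts computed after reloading can differ from those computed on the in-memory frame. The sketch below demonstrates this on a throwaway file with toy data (not the harvested metadata).

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with one empty-string cell.
toy = pd.DataFrame({"title": ["A", "B"], "source": ["", "x"]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.csv"
    toy.to_csv(path, index=False)

    # Default behaviour: blank fields are read back as NaN.
    default_read = pd.read_csv(path)
    # keep_default_na=False: blank fields stay as empty strings.
    literal_read = pd.read_csv(path, keep_default_na=False)

nan_after_default = int(default_read["source"].isna().sum())
nan_after_literal = int(literal_read["source"].isna().sum())
print(nan_after_default, nan_after_literal)
```

Either behaviour can be appropriate; what matters is choosing one deliberately so the EDA and cleaning steps agree on what counts as missing.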

In [11]:
print(f"\n✅ Data saved to: {output_path}")
print(f"📊 Total records: {len(df)}")
print("\n👉 Next: Run 01b_eda_raw_data.ipynb for exploratory data analysis")


✅ Data saved to: c:\Users\alifn\Code\topic-modeling-utama\data\raw\raw_metadata.csv
📊 Total records: 12647

👉 Next: Run 01b_eda_raw_data.ipynb for exploratory data analysis
