A robust data pipeline to scrape, process, and analyze data from the Yearbook of International Organizations (YBIO). This tool extracts detailed information about international organizations, handles authentication via cookies, and provides analytical insights.
Diwas Puri
Duke University
📧 diwas.puri@duke.edu
- Automated Scraping: Iterates through thousands of pages on YBIO to extract organization data.
- Robust Error Handling: Retries failed pages and saves data in chunks to prevent loss.
- Data Cleaning: Deduplicates entries, removes artifacts, and standardizes formats.
- Analysis: Jupyter notebooks for visualizing geographic distribution, founding timelines, and organization types.
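The retry and chunked-save behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the code in scrape_html_table.py: the function names, the attempt count, the backoff delays, and the chunk file naming are all assumptions.

```python
import csv
import time
from pathlib import Path


def fetch_with_retry(fetch, page, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(page), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(page)
        except Exception:
            if attempt == max_attempts:
                raise  # give up so the caller can record the failed page
            sleep(base_delay * 2 ** (attempt - 1))


def save_chunk(rows, chunk_index, out_dir="data/raw_chunks"):
    """Write one batch of row dicts to its own CSV file and return its path."""
    if not rows:
        return None  # nothing to write for an empty batch
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: chunk_0001.csv, chunk_0002.csv, ...
    chunk_file = out_path / f"chunk_{chunk_index:04d}.csv"
    with chunk_file.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return chunk_file
```

Writing each batch to a separate file means an interrupted run loses at most the batch in progress.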
├── analysis/
│ ├── analysis.ipynb # Visualizations and insights
│ └── data_cleanup.ipynb # Data cleaning logic
├── data/
│ ├── organizations_clean.csv # Final cleaned dataset
│ └── raw_chunks/ # Raw scraped data chunks
├── utils/
│ ├── merge_csv.py # Script to merge raw chunks
│ └── analyze_html_coverage.py # Checks for missing pages
├── scrape_html_table.py # Main scraper script
├── requirements.txt # Python dependencies
└── README.md
- Clone the repository:
  git clone https://github.com/androidilicious/ybio-scraper.git
  cd ybio-scraper
- Install dependencies:
  pip install -r requirements.txt
Access to YBIO requires institutional login (e.g., via Duke University). This scraper uses browser cookies for authentication.
- Log in to YBIO in your web browser.
- Export your cookies to a file named cookies.pkl in the root directory.
  - Note: Cookie extraction scripts are provided in utils/ but are excluded from the repo for security.
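As a rough illustration of how a script might consume cookies.pkl, the sketch below turns the file into a Cookie header value. It assumes the pickle holds a list of dicts with "name" and "value" keys (the shape Selenium's get_cookies() returns); the actual file layout used by this project is not documented here, and load_cookie_header is a hypothetical helper.

```python
import pickle
from pathlib import Path


def load_cookie_header(path="cookies.pkl"):
    """Build a Cookie header value from a pickled list of cookie dicts."""
    with Path(path).open("rb") as handle:
        # Only unpickle files you created yourself: pickle can execute code.
        cookies = pickle.load(handle)
    # Assumed shape: [{"name": ..., "value": ...}, ...]
    return "; ".join(f"{cookie['name']}={cookie['value']}" for cookie in cookies)
```

The returned string can then be sent as the Cookie header with whichever HTTP client the scraper uses.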
Run the main scraper to fetch data from the website.
python scrape_html_table.py --workers 5
Data is saved to data/raw_chunks/.
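One plausible way a --workers flag could drive parallel scraping is a thread pool, sketched below. The default of 5, the parse_args and scrape_pages names, and the use of ThreadPoolExecutor are assumptions for illustration; the real script may be organized differently.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


def parse_args(argv=None):
    """Parse the command line; --workers sets the number of concurrent threads."""
    parser = argparse.ArgumentParser(description="Scrape pages in parallel.")
    parser.add_argument("--workers", type=int, default=5,
                        help="number of concurrent scraping threads")
    return parser.parse_args(argv)


def scrape_pages(scrape_page, pages, workers):
    """Run scrape_page over pages with a thread pool; results keep page order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scrape_page, pages))
```

More workers means more simultaneous requests to the site, so a small value is the safer choice when respecting crawl rate limits.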
Combine all raw chunks into a single CSV file.
python utils/merge_csv.py
Run the cleanup notebook or script to remove duplicates and fix formatting.
- Open analysis/data_cleanup.ipynb and run all cells.
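The merge and deduplication steps above amount to something like the following sketch. It uses only the standard library and assumes every chunk shares the same header row; merge_chunks is a hypothetical name, and the project's merge_csv.py and cleanup notebook may apply additional rules.

```python
import csv
from pathlib import Path


def merge_chunks(chunk_dir, merged_path):
    """Concatenate every CSV in chunk_dir into one file, dropping duplicate rows."""
    seen = set()
    fieldnames = None
    rows = []
    for chunk_file in sorted(Path(chunk_dir).glob("*.csv")):
        with chunk_file.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            if fieldnames is None:
                fieldnames = reader.fieldnames  # header of the first chunk
            for row in reader:
                # Strip whitespace so "UN " and "UN" count as the same value.
                cleaned = {name: (row.get(name) or "").strip() for name in fieldnames}
                key = tuple(cleaned[name] for name in fieldnames)
                if key not in seen:
                    seen.add(key)
                    rows.append(cleaned)
    if fieldnames is None:
        return 0  # no readable chunks found
    with Path(merged_path).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```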
Explore the dataset using the analysis notebook.
- Open analysis/analysis.ipynb to see charts and statistics.
The final dataset includes:
- Name: Organization name
- Acronym: Abbreviation
- Founded: Year of establishment
- Location: City and Country
- Type: Classification (Type I/II)
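As a small example of working with these fields, the sketch below counts organizations per founding decade. It assumes the CSV header uses the column name Founded exactly as listed above and that the year is stored as a four-digit string; founding_decades is a hypothetical helper, not part of the repository.

```python
import csv
from collections import Counter


def founding_decades(csv_path):
    """Count organizations per founding decade, skipping rows without a 4-digit year."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            year = (row.get("Founded") or "").strip()
            if len(year) == 4 and year.isdecimal():
                counts[int(year) // 10 * 10] += 1  # e.g. 1945 -> 1940
    return dict(sorted(counts.items()))
```

Calling founding_decades("data/organizations_clean.csv") would return a dict mapping each decade to its count, ready for plotting in the analysis notebook.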
Explore the global distribution of international organizations:
- Organization Map - Interactive map with country-level markers showing organization counts (189 countries, 47K+ organizations)
- Heatmap - Density heatmap of organization concentration worldwide
Click the links above to interact with the maps - zoom, pan, and click markers for details!
Disclaimer: This tool is for educational and research purposes only. Please respect the website's terms of service and crawl rate limits.