Integrate arXiv Processing and Reporting Scripts by Goziee-git · Pull Request #228 · creativecommons/quantifying

Goziee-git · 2025-11-03T03:53:56Z

Fixes

Fixes Integrate arXiv processing script with standardized architecture #188 by @Goziee-git

Description

This pull request implements comprehensive arXiv data source integration for the Quantifying the Commons project. The implementation includes both data processing and report generation capabilities to analyze open access academic publications from the arXiv repository, contributing to our understanding of the academic commons landscape. The arXiv integration represents a significant expansion of our commons analysis capabilities, bringing academic publication data into our comprehensive view of the global commons. The implementation prioritizes data quality, processing efficiency, and meaningful insights generation.

The changes introduce two new scripts that follow the established project architecture: arxiv_process.py for data extraction and processing, and arxiv_report.py for generating meaningful insights and visualizations from the processed arXiv metadata.

Technical details

Scripts Modified

scripts/2-process/arxiv_process.py ArXiv processing script
scripts/3-report/arxiv_report.py ArXiv reporting script

Processing Changes (arxiv_process.py)

License Data Processing: Reads and aggregates arXiv license count data from arxiv_1_count.csv, groups by license identifier and generates totals by license type
Category Classification: Processes arXiv subject category data with comprehensive mapping of 100+ arXiv category codes (cs.AI, math.NT, physics.gen-ph, etc.) to human-readable names
across Computer Science, Mathematics, Physics, Statistics, Biology, and Finance domain
Temporal Analysis: Aggregates publication counts by year from submission data, sorting chronologically for time-series analysis
Author Collaboration Analysis: Processes author bucket data (1, 2, 3, 4, 5+ authors) with proper categorical sorting and handles missing/empty author data
Data Validation and Cleaning: Implements null value filtering, empty string removal, and proper DataFrame structure maintenance for all processed datasets
CSV Output Generation: Standardizes all processed data into CSV format with consistent column naming and proper quoting for downstream analysis

Reporting Changes (arxiv_report.py)

Temporal Analysis: Generated time-series visualizations showing the growth of open access publications on arXiv over time, highlighting trends in academic commons adoption
Subject Category Analysis: Created comprehensive breakdowns of open access content by arXiv subject categories (physics, mathematics, computer science, etc.) to understand domain-specific patterns.
License Classification: Developed reporting on different types of open licenses used in arXiv submissions, contributing to our understanding of commons licensing preferences
Data Visualizations: Created matplotlib-based charts and graphs using the existing plot.py infrastructure to visualize arXiv commons trends

Tests

Sample Data Testing: Thoroughly tested with representative samples of arXiv metadata to ensure accurate parsing and processing
Output Format Validation: Confirmed that processed data follows established project schemas and integrates seamlessly with existing reporting infrastructure
Report Generation Verification: Validated that all generated reports are properly formatted and contain meaningful insights about the academic commons
Static Analysis Compliance: Ensured all code passes project static analysis requirements using ./dev/check.sh

Data Impact

New Data Source Addition: Introduces arXiv as a significant new data source for academic commons analysis, expanding our coverage of scholarly open access content
Schema Extension: Adds new data fields specific to academic publications while maintaining compatibility with existing data structures
Backward Compatibility: All changes maintain full backward compatibility with existing data processing and reporting workflows
Enhanced Analytics: Provides new dimensions for commons analysis including academic subject areas and scholarly publication patterns

Checklist

I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
My pull request doesn't include code or content generated with AI.
My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main or master).
My commit messages follow best practices.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no
visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

scripts/2-process/arxiv_process.py

TimidRobot · 2025-11-03T08:07:50Z

The scripts are not executable.

Please don't submit work that you have not tested according to the documentation (Running the scripts).

Goziee-git · 2025-11-05T08:49:20Z

The scripts are not executable.

Please don't submit work that you have not tested according to the documentation (Running the scripts).

@TimidRobot, both file have now been made executable, apologies for the omission. will take notes on that

TimidRobot · 2025-11-05T14:37:52Z

scripts/3-report/arxiv_report.py

Please remember to perform Static analysis before committing changes.

TimidRobot · 2025-11-05T14:52:02Z

scripts/2-process/arxiv_process.py

+    # Filter out rows with empty/null author buckets
+    author_data = author_data.dropna(subset=["AUTHOR_BUCKET"])
+    author_data = author_data[author_data["AUTHOR_BUCKET"].str.strip() != ""]


Filtering out rows with empty/null author buckets should be done in processing, not reporting.

TimidRobot · 2025-11-05T14:53:53Z

scripts/2-process/arxiv_process.py

+        # Define bucket order for proper sorting
+        bucket_order = ["1", "2", "3", "4", "5+", "Unknown"]


Why is sorting being done manually?

Why is there an "Unknown" entry?

TimidRobot · 2025-11-05T14:56:14Z

scripts/2-process/arxiv_process.py

+        data.reset_index(drop=True, inplace=True)
+        data.rename(
+            columns={
+                "AUTHOR_BUCKET": "Author_Bucket",


Labels should use sentence case should not replace spaces with underscores

TimidRobot · 2025-11-05T14:56:54Z

scripts/2-process/arxiv_process.py

+            "CATEGORY_CODE": "Category_Code",
+            "CATEGORY_LABEL": "Category_Name",


Labels should use sentence case should not replace spaces with underscores

TimidRobot · 2025-11-05T14:58:21Z

scripts/2-process/arxiv_process.py

+def process_license_totals(args, count_data):
+    """
+    Processing count data: totals by license


Please use "legal tool" instead of "license".

CC0 is a public domain dedication with a fallback license.

UNKNOWN CC legal tool may include PDM, which is not a license and does not have a fallback license.

TimidRobot · 2025-11-05T16:14:54Z

@Goziee-git I recommend you pause development of this pull request, until further notice:

arXiv fetch is unreliable #236

Goziee-git requested review from a team as code owners November 3, 2025 03:53

Goziee-git requested review from TimidRobot and possumbilities and removed request for a team November 3, 2025 03:53

Goziee-git changed the title ~~Feature/arxiv~~ Integrate arXiv Processing and Reporting Scripts Nov 3, 2025

cc-open-source-bot added this to TimidRobot Nov 3, 2025

cc-open-source-bot moved this to In review in TimidRobot Nov 3, 2025

TimidRobot reviewed Nov 3, 2025

View reviewed changes

scripts/2-process/arxiv_process.py Outdated Show resolved Hide resolved

TimidRobot reviewed Nov 3, 2025

View reviewed changes

scripts/2-process/arxiv_process.py Outdated Show resolved Hide resolved

TimidRobot self-assigned this Nov 3, 2025

Goziee-git force-pushed the feature/arxiv branch from 3261981 to 1eb0b12 Compare November 3, 2025 09:06

TimidRobot reviewed Nov 5, 2025

View reviewed changes

scripts/3-report/arxiv_report.py Outdated

Copy link
Copy Markdown

Member

TimidRobot Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remember to perform Static analysis before committing changes.

TimidRobot reviewed Nov 5, 2025

View reviewed changes

Goziee-git closed this Nov 6, 2025

Goziee-git force-pushed the feature/arxiv branch from 77b20a3 to 6482d63 Compare November 6, 2025 13:37

github-project-automation bot moved this from In review to Done in TimidRobot Nov 6, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Integrate arXiv Processing and Reporting Scripts#228

Integrate arXiv Processing and Reporting Scripts#228
Goziee-git wants to merge 0 commit intocreativecommons:mainfrom
Goziee-git:feature/arxiv

Goziee-git commented Nov 3, 2025

Uh oh!

Uh oh!

Uh oh!

TimidRobot commented Nov 3, 2025

Uh oh!

Goziee-git commented Nov 5, 2025

Uh oh!

TimidRobot Nov 5, 2025

Uh oh!

TimidRobot Nov 5, 2025

Uh oh!

TimidRobot Nov 5, 2025

Uh oh!

TimidRobot Nov 5, 2025

Uh oh!

TimidRobot Nov 5, 2025

Uh oh!

TimidRobot Nov 5, 2025

Uh oh!

TimidRobot commented Nov 5, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		# Define bucket order for proper sorting
		bucket_order = ["1", "2", "3", "4", "5+", "Unknown"]

		"CATEGORY_CODE": "Category_Code",
		"CATEGORY_LABEL": "Category_Name",

Uh oh!

Conversation

Goziee-git commented Nov 3, 2025

Fixes

Description

Technical details

Scripts Modified

Processing Changes (arxiv_process.py)

Reporting Changes (arxiv_report.py)

Tests

Data Impact

Checklist

Developer Certificate of Origin

Uh oh!

Uh oh!

Uh oh!

TimidRobot commented Nov 3, 2025

Uh oh!

Goziee-git commented Nov 5, 2025

Uh oh!

TimidRobot Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

TimidRobot Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

TimidRobot Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

TimidRobot Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

TimidRobot Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

TimidRobot Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

TimidRobot commented Nov 5, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants