Skip to content

Integrate arXiv Processing and Reporting Scripts#228

Closed
Goziee-git wants to merge 0 commit intocreativecommons:mainfrom
Goziee-git:feature/arxiv
Closed

Integrate arXiv Processing and Reporting Scripts#228
Goziee-git wants to merge 0 commit intocreativecommons:mainfrom
Goziee-git:feature/arxiv

Conversation

@Goziee-git
Copy link
Copy Markdown
Contributor

Fixes

Description

This pull request implements comprehensive arXiv data source integration for the Quantifying the Commons project. The implementation includes both data processing and report generation capabilities to analyze open access academic publications from the arXiv repository, contributing to our understanding of the academic commons landscape. The arXiv integration represents a significant expansion of our commons analysis capabilities, bringing academic publication data into our comprehensive view of the global commons. The implementation prioritizes data quality, processing efficiency, and meaningful insights generation.

The changes introduce two new scripts that follow the established project architecture: arxiv_process.py for data extraction and processing, and arxiv_report.py for generating meaningful insights and visualizations from the processed arXiv metadata.

Technical details

Scripts Modified

  • scripts/2-process/arxiv_process.py ArXiv processing script
  • scripts/3-report/arxiv_report.py ArXiv reporting script

Processing Changes (arxiv_process.py)

License Data Processing: Reads and aggregates arXiv license count data from arxiv_1_count.csv, groups by license identifier and generates totals by license type
Category Classification: Processes arXiv subject category data with comprehensive mapping of 100+ arXiv category codes (cs.AI, math.NT, physics.gen-ph, etc.) to human-readable names
across Computer Science, Mathematics, Physics, Statistics, Biology, and Finance domain
Temporal Analysis: Aggregates publication counts by year from submission data, sorting chronologically for time-series analysis
Author Collaboration Analysis: Processes author bucket data (1, 2, 3, 4, 5+ authors) with proper categorical sorting and handles missing/empty author data
Data Validation and Cleaning: Implements null value filtering, empty string removal, and proper DataFrame structure maintenance for all processed datasets
CSV Output Generation: Standardizes all processed data into CSV format with consistent column naming and proper quoting for downstream analysis

Reporting Changes (arxiv_report.py)

Temporal Analysis: Generated time-series visualizations showing the growth of open access publications on arXiv over time, highlighting trends in academic commons adoption
Subject Category Analysis: Created comprehensive breakdowns of open access content by arXiv subject categories (physics, mathematics, computer science, etc.) to understand domain-specific patterns.
License Classification: Developed reporting on different types of open licenses used in arXiv submissions, contributing to our understanding of commons licensing preferences
Data Visualizations: Created matplotlib-based charts and graphs using the existing plot.py infrastructure to visualize arXiv commons trends

Tests

Sample Data Testing: Thoroughly tested with representative samples of arXiv metadata to ensure accurate parsing and processing
Output Format Validation: Confirmed that processed data follows established project schemas and integrates seamlessly with existing reporting infrastructure
Report Generation Verification: Validated that all generated reports are properly formatted and contain meaningful insights about the academic commons
Static Analysis Compliance: Ensured all code passes project static analysis requirements using ./dev/check.sh

Data Impact

New Data Source Addition: Introduces arXiv as a significant new data source for academic commons analysis, expanding our coverage of scholarly open access content
Schema Extension: Adds new data fields specific to academic publications while maintaining compatibility with existing data structures
Backward Compatibility: All changes maintain full backward compatibility with existing data processing and reporting workflows
Enhanced Analytics: Provides new dimensions for commons analysis including academic subject areas and scholarly publication patterns

Checklist

  • I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
  • My pull request doesn't include code or content generated with AI.
  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main or master).
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@Goziee-git Goziee-git requested review from a team as code owners November 3, 2025 03:53
@Goziee-git Goziee-git requested review from TimidRobot and possumbilities and removed request for a team November 3, 2025 03:53
@Goziee-git Goziee-git changed the title Feature/arxiv Integrate arXiv Processing and Reporting Scripts Nov 3, 2025
@cc-open-source-bot cc-open-source-bot moved this to In review in TimidRobot Nov 3, 2025
@TimidRobot
Copy link
Copy Markdown
Member

The scripts are not executable.

Please don't submit work that you have not tested according to the documentation (Running the scripts).

@TimidRobot TimidRobot self-assigned this Nov 3, 2025
@Goziee-git
Copy link
Copy Markdown
Contributor Author

The scripts are not executable.

Please don't submit work that you have not tested according to the documentation (Running the scripts).

@TimidRobot, both file have now been made executable, apologies for the omission. will take notes on that

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remember to perform Static analysis before committing changes.

Comment on lines +145 to +147
# Filter out rows with empty/null author buckets
author_data = author_data.dropna(subset=["AUTHOR_BUCKET"])
author_data = author_data[author_data["AUTHOR_BUCKET"].str.strip() != ""]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filtering out rows with empty/null author buckets should be done in processing, not reporting.

Comment on lines +157 to +158
# Define bucket order for proper sorting
bucket_order = ["1", "2", "3", "4", "5+", "Unknown"]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is sorting being done manually?

Why is there an "Unknown" entry?

data.reset_index(drop=True, inplace=True)
data.rename(
columns={
"AUTHOR_BUCKET": "Author_Bucket",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Labels should use sentence case should not replace spaces with underscores

Comment on lines +104 to +105
"CATEGORY_CODE": "Category_Code",
"CATEGORY_LABEL": "Category_Name",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Labels should use sentence case should not replace spaces with underscores

Comment on lines +68 to +70
def process_license_totals(args, count_data):
"""
Processing count data: totals by license
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use "legal tool" instead of "license".

CC0 is a public domain dedication with a fallback license.

UNKNOWN CC legal tool may include PDM, which is not a license and does not have a fallback license.

@TimidRobot
Copy link
Copy Markdown
Member

@Goziee-git I recommend you pause development of this pull request, until further notice:

@Goziee-git Goziee-git closed this Nov 6, 2025
@github-project-automation github-project-automation bot moved this from In review to Done in TimidRobot Nov 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Integrate arXiv processing script with standardized architecture

3 participants