Integrate arXiv Processing and Reporting Scripts#228
Integrate arXiv Processing and Reporting Scripts#228Goziee-git wants to merge 0 commit intocreativecommons:mainfrom
Conversation
|
The scripts are not executable. Please don't submit work that you have not tested according to the documentation (Running the scripts). |
3261981 to
1eb0b12
Compare
@TimidRobot, both file have now been made executable, apologies for the omission. will take notes on that |
scripts/3-report/arxiv_report.py
Outdated
There was a problem hiding this comment.
Please remember to perform Static analysis before committing changes.
scripts/2-process/arxiv_process.py
Outdated
| # Filter out rows with empty/null author buckets | ||
| author_data = author_data.dropna(subset=["AUTHOR_BUCKET"]) | ||
| author_data = author_data[author_data["AUTHOR_BUCKET"].str.strip() != ""] |
There was a problem hiding this comment.
Filtering out rows with empty/null author buckets should be done in processing, not reporting.
scripts/2-process/arxiv_process.py
Outdated
| # Define bucket order for proper sorting | ||
| bucket_order = ["1", "2", "3", "4", "5+", "Unknown"] |
There was a problem hiding this comment.
Why is sorting being done manually?
Why is there an "Unknown" entry?
scripts/2-process/arxiv_process.py
Outdated
| data.reset_index(drop=True, inplace=True) | ||
| data.rename( | ||
| columns={ | ||
| "AUTHOR_BUCKET": "Author_Bucket", |
There was a problem hiding this comment.
Labels should use sentence case should not replace spaces with underscores
scripts/2-process/arxiv_process.py
Outdated
| "CATEGORY_CODE": "Category_Code", | ||
| "CATEGORY_LABEL": "Category_Name", |
There was a problem hiding this comment.
Labels should use sentence case should not replace spaces with underscores
scripts/2-process/arxiv_process.py
Outdated
| def process_license_totals(args, count_data): | ||
| """ | ||
| Processing count data: totals by license |
There was a problem hiding this comment.
Please use "legal tool" instead of "license".
CC0 is a public domain dedication with a fallback license.
UNKNOWN CC legal tool may include PDM, which is not a license and does not have a fallback license.
|
@Goziee-git I recommend you pause development of this pull request, until further notice: |
77b20a3 to
6482d63
Compare
Fixes
Description
This pull request implements comprehensive arXiv data source integration for the Quantifying the Commons project. The implementation includes both data processing and report generation capabilities to analyze open access academic publications from the arXiv repository, contributing to our understanding of the academic commons landscape. The arXiv integration represents a significant expansion of our commons analysis capabilities, bringing academic publication data into our comprehensive view of the global commons. The implementation prioritizes data quality, processing efficiency, and meaningful insights generation.
The changes introduce two new scripts that follow the established project architecture:
arxiv_process.pyfor data extraction and processing, andarxiv_report.pyfor generating meaningful insights and visualizations from the processed arXiv metadata.Technical details
Scripts Modified
scripts/2-process/arxiv_process.pyArXiv processing scriptscripts/3-report/arxiv_report.pyArXiv reporting scriptProcessing Changes (arxiv_process.py)
License Data Processing: Reads and aggregates arXiv license count data from arxiv_1_count.csv, groups by license identifier and generates totals by license type
Category Classification: Processes arXiv subject category data with comprehensive mapping of 100+ arXiv category codes (cs.AI, math.NT, physics.gen-ph, etc.) to human-readable names
across Computer Science, Mathematics, Physics, Statistics, Biology, and Finance domain
Temporal Analysis: Aggregates publication counts by year from submission data, sorting chronologically for time-series analysis
Author Collaboration Analysis: Processes author bucket data (1, 2, 3, 4, 5+ authors) with proper categorical sorting and handles missing/empty author data
Data Validation and Cleaning: Implements null value filtering, empty string removal, and proper DataFrame structure maintenance for all processed datasets
CSV Output Generation: Standardizes all processed data into CSV format with consistent column naming and proper quoting for downstream analysis
Reporting Changes (arxiv_report.py)
Temporal Analysis: Generated time-series visualizations showing the growth of open access publications on arXiv over time, highlighting trends in academic commons adoption
Subject Category Analysis: Created comprehensive breakdowns of open access content by arXiv subject categories (physics, mathematics, computer science, etc.) to understand domain-specific patterns.
License Classification: Developed reporting on different types of open licenses used in arXiv submissions, contributing to our understanding of commons licensing preferences
Data Visualizations: Created matplotlib-based charts and graphs using the existing
plot.pyinfrastructure to visualize arXiv commons trendsTests
Sample Data Testing: Thoroughly tested with representative samples of arXiv metadata to ensure accurate parsing and processing
Output Format Validation: Confirmed that processed data follows established project schemas and integrates seamlessly with existing reporting infrastructure
Report Generation Verification: Validated that all generated reports are properly formatted and contain meaningful insights about the academic commons
Static Analysis Compliance: Ensured all code passes project static analysis requirements using
./dev/check.shData Impact
New Data Source Addition: Introduces arXiv as a significant new data source for academic commons analysis, expanding our coverage of scholarly open access content
Schema Extension: Adds new data fields specific to academic publications while maintaining compatibility with existing data structures
Backward Compatibility: All changes maintain full backward compatibility with existing data processing and reporting workflows
Enhanced Analytics: Provides new dimensions for commons analysis including academic subject areas and scholarly publication patterns
Checklist
Update index.md).mainormaster).visible errors.
Developer Certificate of Origin
For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."
Developer Certificate of Origin