CSO Classifier v4.0.0
We are excited to announce the release of CSO Classifier v4.0.0, a major update that improves semantic understanding, adds support for modern data standards, and extends compatibility to recent Python versions. At the core of this release is a retrained and substantially larger Word2Vec model. It was trained on scholarly metadata from the Semantic Scholar dataset, covering nearly 100 million papers up to 2025 and over 264 million lines of text, which gives the model better coverage of recent research topics and emerging technologies.
To better support the machine learning community and data reproducibility, the classifier can now export Croissant metadata. The newly introduced methods generate a JSON-LD file that describes the dataset produced by your classification pipeline and conforms to the Croissant format for ML-ready datasets. The codebase has also been updated and refactored to officially support Python 3.11 and 3.12, with general under-the-hood improvements to performance, stability, and long-term maintainability.
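For readers unfamiliar with Croissant, the sketch below shows roughly what such a JSON-LD description contains. It is built by hand with the standard library and does not use the classifier's own export methods, whose names are not given here. The helper `build_croissant_descriptor`, the file name `results.json`, the sample result, and the trimmed-down `@context` are all illustrative assumptions. A file produced by the classifier will carry the full Croissant context and more fields.

```python
import hashlib
import json


def build_croissant_descriptor(name, description, file_name, file_bytes):
    """Return a minimal Croissant-style JSON-LD description of one output file.

    Hypothetical helper for illustration only; not part of the CSO Classifier.
    """
    return {
        # Simplified context: the official Croissant context defines many more terms.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
            "dct": "http://purl.org/dc/terms/",
            "conformsTo": "dct:conformsTo",
        },
        "@type": "sc:Dataset",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "name": name,
        "description": description,
        # Each physical file of the dataset is listed as a FileObject.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": file_name,
                "name": file_name,
                "contentUrl": file_name,
                "encodingFormat": "application/json",
                "sha256": hashlib.sha256(file_bytes).hexdigest(),
            }
        ],
    }


# Made-up classification output, serialised to bytes so it can be hashed.
results = json.dumps({"paper_1": {"union": ["machine learning"]}}).encode("utf-8")

descriptor = build_croissant_descriptor(
    name="cso-classification-results",
    description="Topics assigned to papers by a classification run.",
    file_name="results.json",
    file_bytes=results,
)
croissant_json = json.dumps(descriptor, indent=2)
print(croissant_json)
```

The checksum ties the description to one specific output file, which is what makes the metadata useful for reproducibility.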
We have also significantly expanded our documentation. It now includes comprehensive guides on running the classifier in its various modes, and it explains how to adapt and extend the CSO Classifier to other domains of science by swapping the underlying ontology and word embeddings.
We would also like to thank Faisal Ramzan, PhD Student at the University of Cagliari, for his valuable support in the development of this new version and the deployment of the updated Word2Vec model.
To upgrade, install the latest version via pip, then run the setup or force update method to download the latest model and ontology assets.
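The post-upgrade step can be sketched as a small helper. This assumes that v4.0.0 keeps the entry points documented for earlier releases, namely `CSOClassifier.setup()` and `CSOClassifier.update(force=True)`, and that the package is installed from PyPI as `cso-classifier`. The wrapper `refresh_cso_assets` is a hypothetical name introduced here; check the current documentation if the calls differ.

```python
import importlib


def refresh_cso_assets(force: bool = False) -> str:
    """Download the model and ontology assets after upgrading the package.

    Run `pip install -U cso-classifier` first. The method names below are
    assumed to be unchanged from earlier releases of the classifier.
    """
    # Imported lazily so this helper can be defined before the package is installed.
    module = importlib.import_module("cso_classifier")
    classifier_cls = module.CSOClassifier

    if force:
        # Re-download the assets even if a local copy already exists.
        classifier_cls.update(force=True)
        return "update"

    # First-time setup: fetch the ontology and the Word2Vec model.
    classifier_cls.setup()
    return "setup"


# Typical use after upgrading:
#   refresh_cso_assets()            # fresh installation
#   refresh_cso_assets(force=True)  # existing installation with older assets
```

On an existing installation the forced update is the safer choice, since it replaces assets downloaded for a previous version.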