b-vitamins/ml-services

ml-services

This project is in its infancy, and I haven't scoped it well enough to write anything here that is both meaningful and precise. Thank you for your interest; check back in some time to (hopefully) see some progress. In the meantime, here is a brief description of the repository as it stands. Broadly, this project aims to use concepts from Information Science to extract, organize, manage, extend, and synthesize knowledge within the scope of Machine Learning research.

Current Structure

Scripts

Scripts for wrangling data:

Data

Harvests generated by the wranglers:

  • data/mloss/bibliography/mloss.bib: generated using scripts/mloss.pl. A single mloss.bib holds the BibTeX entries for JMLR Machine Learning Open Source Software (MLOSS).
  • data/tmlr/bibliography/tmlr.bib: generated using scripts/tmlr.pl. A single tmlr.bib holds about 850 entries from the recently introduced Transactions on Machine Learning Research (TMLR).
  • data/jmlr/bibliography: generated using scripts/jmlr.pl. Contains files v6.bib, v7.bib, ..., v24.bib, v25.bib, each holding the BibTeX entries for all papers from that year. Files v1.bib through v5.bib are missing because the .bib files of the individual papers are not available; these will be compiled via other means soon.
  • data/neurips/bibliography: generated using scripts/neurips.pl. Contains files 1987.bib, 1988.bib, ..., 2022.bib, 2023.bib, each holding the BibTeX entries for all papers from that year.
  • data/pmlr/bibliography: generated using scripts/pmlr.pl. Contains files v1.bib, v2.bib, ..., v237.bib, v238.bib, each holding the BibTeX entries for all papers from that volume.
  • data/pmlr/citeproc: same data as data/pmlr/bibliography but in YAML format.
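
As a quick way to sanity-check a harvest, the sketch below counts entries by type and collects citation keys from BibTeX text. It is not part of the repository's Perl scripts; the function name `summarize_bibtex`, the regular expression, and the sample entries are all assumptions made for illustration, and the regex only handles conventionally formatted entries.

```python
import re
from collections import Counter

# Matches the start of a BibTeX entry such as "@article{key," (case-insensitive type).
ENTRY_START = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")

def summarize_bibtex(text):
    """Return (entry-type counts, citation keys) for a string of BibTeX source."""
    counts = Counter()
    keys = []
    for match in ENTRY_START.finditer(text):
        entry_type = match.group(1).lower()
        # BibTeX also allows these non-entry blocks; ignore them if they match.
        if entry_type in {"comment", "string", "preamble"}:
            continue
        counts[entry_type] += 1
        keys.append(match.group(2))
    return counts, keys

# Hypothetical sample entries, not taken from the harvested files.
sample = """
@article{smith23a,
  title = {An Example},
  year = {2023}
}
@InProceedings{doe22b,
  title = {Another Example}
}
"""

counts, keys = summarize_bibtex(sample)
print(dict(counts), keys)
```

To run this against a harvest, read the file first, for example `Path("data/tmlr/bibliography/tmlr.bib").read_text(encoding="utf-8")` with `pathlib.Path`, and pass the result to `summarize_bibtex`.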

Resources

Working list of tools on which future development will most likely depend:

  1. The Open Cognition Project: OpenCog is a unique and ambitious open-source software project making a serious effort to build a thinking machine. It aims to create an open-source framework for Artificial General Intelligence, intended to one day express general intelligence at the human level and beyond.
  2. OpenAlex: OpenAlex is a free and open catalog of the global research system. It's named after the ancient Library of Alexandria and made by the nonprofit OurResearch. OpenAlex is a replacement for the retired Microsoft Academic Graph project.
  3. GROBID: GROBID is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured XML/TEI encoded documents with a particular focus on technical and scientific publications.
  4. Apache Solr: Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.
  5. Crossref: Crossref is a not-for-profit membership organization that exists to make scholarly communications better. It makes research objects easy to find, cite, link, assess, and reuse.
  6. Semantic Scholar: A free, AI-powered research tool for scientific literature.

Resources related to the tools above:

  1. Atomspace: The OpenCog AtomSpace is a knowledge representation (KR) database and the associated query/reasoning engine to fetch and manipulate that data, and perform reasoning on it.
  2. Graphs, Metagraphs, RAM, CPU: A technical exposition on the Atomspace.
  3. OpenAlex API Documentation: documentation for the OpenAlex API.
  4. ICML Conference Analytics: Microsoft Academic Graph demonstration.
  5. Semantic Scholar API Overview: overview of the Semantic Scholar API, in particular the Academic Graph API and the Recommendations API. Both pages have OpenAPI specifications (OAS) which can be used as inputs to a tool like Swagger Codegen for automatic client SDK code generation.
  6. GROBID Documentation: Documentation for GROBID. Particularly relevant and of immediate interest are the sections GROBID Service API, How GROBID works, TEI encoding of results, and Coordinates of structures in PDF.
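
To give a feel for the two search APIs listed above, here is a minimal sketch that only builds query URLs for OpenAlex works search and the Semantic Scholar Academic Graph paper search; it sends no requests. The helper names, default values, and the example query are assumptions for illustration, and the endpoint paths and parameter names should be checked against the linked API documentation before use.

```python
from urllib.parse import urlencode

# Base endpoints as described in the respective API documentation.
OPENALEX_WORKS = "https://api.openalex.org/works"
S2_PAPER_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"

def openalex_search_url(query, per_page=5):
    """Build an OpenAlex works search URL (hypothetical helper)."""
    return OPENALEX_WORKS + "?" + urlencode({"search": query, "per-page": per_page})

def s2_search_url(query, fields=("title", "year"), limit=5):
    """Build a Semantic Scholar paper search URL (hypothetical helper)."""
    params = {"query": query, "fields": ",".join(fields), "limit": limit}
    return S2_PAPER_SEARCH + "?" + urlencode(params)

print(openalex_search_url("topic model"))
print(s2_search_url("topic model"))
```

Either URL can then be fetched with any HTTP client; both services return JSON, and both document rate limits that a harvesting script would need to respect.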

Working list of resources that serve as foundational reading relevant to this project (for each page, scroll down and explore the See also section and the Navigation templates):

  1. OpenCog AtomSpace: The README.md is an excellent starting point. So is Luke’s Atomspace Bootstrap Guide. The examples are a good next target.
  2. Information science: A field focused on the systematic study and management of data and information across systems and processes.
  3. Library and Information Science: An interdisciplinary field that applies the practices, perspectives, and tools of management, information technology, education, and other areas to libraries.
  4. Natural Language Processing: Techniques that help computers process and understand human (natural) language.
  5. Text Mining: The process of deriving high-quality information from text through computational means.
  6. Knowledge Management: Strategies and processes to manage organizational knowledge effectively.
  7. Bibliometrics: The quantitative analysis of scientific and technological literature.
  8. Information Extraction: Techniques for automatically extracting structured information from unstructured/semi-structured data sources.
  9. Information Retrieval: The science of searching for information in documents, searching for documents themselves, and also searching for metadata which describes documents.
  10. Automatic Summarization: The process of reducing a text document with a computer program to create a summary that retains the most important points of the original document.
  11. Distributional Semantics: Approaches to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data.
  12. Topic Model: Statistical models to discover abstract topics that occur in a collection of documents.
  13. Computer-assisted Reviewing: The use of automated processes to aid in the review of documents and data.
  14. Automated Reasoning: The use of computers to emulate human reasoning, typically within the realm of mathematics or logic.
  15. Language Resource: Operational datasets in various forms used to build, improve, or evaluate natural language processing applications. In particular, see Text Encoding Initiative: a comprehensive encoding standard for machine-readable texts. A Gentle Introduction to XML is also a useful read.
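
Since TEI is the format GROBID emits, a small sketch of reading it may help when working through the material above. It pulls the title and author names out of a TEI header using only the standard library. The function name and the sample document are assumptions for illustration; real GROBID output is considerably richer, and the exact element layout should be checked against the GROBID and TEI documentation.

```python
import xml.etree.ElementTree as ET

# TEI documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def tei_title_and_authors(tei_xml):
    """Return (title, [author names]) from a TEI document string."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    authors = []
    for pers in root.iterfind(".//tei:author/tei:persName", TEI_NS):
        # Join forename and surname child elements in document order.
        parts = [child.text.strip() for child in pers if child.text]
        authors.append(" ".join(parts))
    return title.strip(), authors

# Hypothetical, heavily trimmed TEI sample.
sample_tei = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">A Sample Paper</title></titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author>
              <persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName>
            </author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>
""".strip()

title, authors = tei_title_and_authors(sample_tei)
print(title, authors)
```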

Working list of relevant books:

  1. Designing Data-Intensive Applications.
  2. Taming Text.
  3. Solr in Action.

About

Resources for machine learning related publications
