arosbio/cpsign

CPSign

Conformal Prediction with the signatures molecular descriptor and SVM.
(C) Copyright 2022, Aros Bio AB, arosbio.com

Introduction

CPSign is a machine learning and cheminformatics software package written purely in Java, leveraging the popular LIBSVM and LIBLINEAR packages for machine learning and the Chemistry Development Kit (CDK) for handling chemistry. CPSign can directly read molecular data in CSV format (requiring SMILES for molecular structure) or SDF (v2000 and v3000), compute descriptors and build machine learning models. The generated model files contain all information needed to later predict new compounds, without having to manually compute descriptors or apply data transformations. CPSign implements the inductive Conformal Prediction algorithms ICP and ACP (more recently known as Split Conformal Predictors) for both classification and regression problems, as well as transductive conformal prediction (TCP) and Cross Venn-ABERS probabilistic prediction for binary classification problems. Being written in Java makes it platform-independent, and the LIBLINEAR/LIBSVM methods run on the CPU, so no special hardware is required.

Further reading

Cite

If you use CPSign in a scientific publication, we appreciate citations made to our pre-print paper:

Arvidsson McShane, S., Norinder, U., Alvarsson, J., Ahlberg, E., Carlsson, L., & Spjuth, O. (2023). CPSign-Conformal Prediction for Cheminformatics Modeling. bioRxiv, 2023-11.

BibTeX entry:

@article{mcshane2023cpsign,
  title={CPSign-Conformal Prediction for Cheminformatics Modeling},
  author={Arvidsson McShane, Staffan and Norinder, Ulf and Alvarsson, Jonathan and Ahlberg, Ernst and Carlsson, Lars and Spjuth, Ola},
  journal={bioRxiv},
  pages={2023--11},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}

License

CPSign is dual licensed, where the user can choose between the GNU General Public License with additional terms (which can be found at the Aros Bio website) or a commercial license. See further details at the Aros Bio website.

How to run CPSign

CPSign can be run in two main ways: (1) as a Command Line Interface (CLI) tool and (2) using the Java API. The following sections give more details on how to run it in each of these ways.

Running from CLI

To run CPSign from the CLI you need both the CPSign code and all the required dependencies. For user convenience a "fat jar" (i.e., a JAR including all dependencies) can be downloaded from the GitHub release page for each released version of CPSign. The file is currently named using a cpsign-{version}-fatjar.jar naming scheme. This JAR file and Java version 11 or higher are all you need to run the application from the CLI. Once downloaded, you can run CPSign in the standard way of running Java in a terminal environment:

java -jar <path-to-jar> <arguments>

Any relevant JVM parameters can be added to this command. On Unix-based platforms the JAR can also be run as an executable file, but GitHub strips execute privileges from the file when it is downloaded. If you add the execute privilege (e.g., "chmod +x <file>"), CPSign can be run using:

./<path-to-jar> <arguments>
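
As a minimal sketch of the steps above, the following POSIX shell helpers check that the installed Java is version 11 or higher before launching the fat jar. The function names java_major_version and run_cpsign are hypothetical and not part of CPSign. The parsing assumes the usual output format of java -version (e.g. openjdk version "11.0.2", or the legacy 1.8.0_292 scheme for Java 8).

```shell
# Hypothetical helper: extract the major Java version from a version string
# such as "11.0.2", "17" or the legacy "1.8.0_292" (which means Java 8).
java_major_version() {
  v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;   # legacy scheme: drop the leading "1."
  esac
  # Keep only the digits before the first non-digit character
  printf '%s\n' "${v%%[!0-9]*}"
}

# Hypothetical wrapper: refuse to start on Java < 11, otherwise run the jar.
# Usage: run_cpsign <path-to-jar> <arguments>
run_cpsign() {
  jar="$1"
  shift
  # `java -version` prints to stderr, e.g.: openjdk version "11.0.2" 2019-01-15
  version=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  major=$(java_major_version "$version")
  if [ -z "$major" ] || [ "$major" -lt 11 ]; then
    echo "CPSign needs Java 11 or higher, found: '$version'" >&2
    return 1
  fi
  java -jar "$jar" "$@"
}
```

With these definitions loaded, run_cpsign <path-to-jar> <arguments> forwards the arguments unchanged to java -jar, and stops with an error message if the detected Java version is too old.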

When working with CPSign from the CLI, the general workflow is to run the following programs in order: precompute, train, test/validate, with the optional step of hyper-parameter tuning using either tune-scorer or tune. More information is found on the CPSign readthedocs.

Running from Java API

To run CPSign from the Java API we recommend that you use Maven or a similar build system to handle package dependencies. All released versions from version 2.0.0-rc1 onwards are found in the Maven Central repository and can be included in your own code base by adding the dependency:

<dependency>
    <groupId>com.arosbio</groupId>
    <artifactId>cpsign</artifactId>
    <version>2.0.0-rc1</version>
</dependency>

or by replacing the artifactId with the sub-module you require (confai, cpsign-api, etc). Programming examples can be found in the separate CPSign examples repository to get you started.

Sub-projects / maven child modules

In order to minimize the memory footprint, the project has been split up into several sub-modules, making it possible to use only as much of the code base as is needed. For example, ConfAI includes all data processing and modeling, but without the large CDK package, which is only needed when handling chemistry. The CPSign-API is for users that deal with chemical data but wish to use the Java API instead of the terminal CLI.

The available child-projects:

  • depict - An extension of the CDK depictions code which allows for generating 'blooming' molecules - i.e. visually appealing depictions with highlights that fade off and mix. Intended mainly for displaying which atoms contributed the most in a given prediction.
  • fast-ivap - An implementation of the algorithms detailed in Vovk et al. 2015 that pre-computes the isotonic regression used in the Venn-ABERS algorithm.
  • encrypt-api - Moved to a separate project, not a child module any more. A single interface for including encryption of saved prediction models or precomputed data sets. Contact Aros Bio in case you wish to purchase such an extension that secures models by only allowing predictions when also having the encryption key.
  • test-utils - Project that includes test resources such as SVMLight files and some QSAR datasets used in the tests.
  • confai - Conformal and probabilistic predictors, data, processing etc, excluding CDK and chemistry specific code. Thus provides a software package that allows training CP and Venn-ABERS models for non-chemical data, without the overhead of including the complete CDK package.
  • cpsign-api - Java API of CPSign, including chemistry and the CDK library.
  • cpsign - Final CLI version of CPSign.

Note: The documentation of CPSign is located in a separate CPSign docs repo.

Java version

CPSign is written and developed on Java 11, but can with a few changes compile and run on Java 8. Note: the tests for the cpsign project rely on the System Rules library, which in turn uses the SecurityManager interface (deprecated from Java 17). Depending on which Java version (17 or later) you use, this causes either an excessive amount of error logs or outright test failures. However, cpsign can still be built and run using the latest versions of Java.

Changelog

A changelog can now be found in changelog.

Extension packages

Plot_utils

In order to visualize the predictions and predictive performance of the conformal predictors we have created a Python library for this task, located at GitHub Plot_utils, building on top of the Matplotlib and numpy libraries. This way visualizations can be customized easily and at a high level of abstraction. We have also created functions for loading results from CPSign to make it easy for users to evaluate their predictive models. Example figures generated using Plot_utils show the prediction intervals at a single confidence level:

[Figure: Regression prediction intervals]

and the distribution of prediction sets depending on the confidence level, which allows the user to pick an appropriate confidence level as well as to assess the predictive efficiency of a given model:

[Figure: Classification label distribution]

More examples of figures and metrics that can be calculated are shown in the Jupyter notebooks, which also detail how to work with results from CPSign.

Micro services

CPSign predictive models can be deployed as micro services using the Predict services extension, which exposes a REST interface for each deployed model. The servers can optionally be extended with a drawing interface:

[Figure: Micro server Draw GUI]

Here molecules can be inserted and altered in the JSME editor in the upper left corner, and the atom contributions are rendered in the bottom figure. Further information can be found in the dedicated repository (https://github.com/arosbio/cpsign_predict_services).

DeepLearning4J extension

CPSign can be extended with custom extensions for, e.g., chemical descriptors and machine learning models. We have created such an extension for using deep neural networks as the underlying machine learning model: CPSign-DL4J. This extension must be built for your intended platform/hardware and is more dependent on hyper-parameter tuning in order to achieve good predictive performance. The DeepLearning4J package requires running native code and is considerably larger than the other packages included in CPSign, and is thus not included in the standard version of CPSign.

Building

This section is for advanced users who wish to change and update the code themselves. For standard usage we refer to downloading the released versions from either the GitHub release page or the Maven Central repository. The project is built and managed using Maven. The build has three profiles:

  1. default (id = thinjar), which builds and packages only the application code of each maven module.
  2. id = fatjar, which builds an über/fat jar with all third-party dependencies bundled in a single JAR file (only applicable to the cpsign module).
  3. id = deploy, which assembles all sources, javadoc and thin jars, signs all artifacts and deploys everything to the maven central repository. Note: this should only be run once there is a new release to be made. It also assumes configurations that are set in the .github/workflows and requires GitHub secrets to run successfully.

Building thin jars

This is the default build, which only assembles the application code and the POMs of each sub-project. This works well if you wish to use cpsign programmatically and use maven for resolving dependencies. This build is triggered by running:

mvn package 

or, alternatively (if you do not wish to run the tests):

mvn package -DskipTests=true

from either the parent (root) project, or from one of the sub projects.

Building a fat/über jar

This build is triggered by adding the profile fatjar by running;

mvn package -P fatjar

from either the parent or the cpsign project. Note that this only applies to the cpsign project - this profile has no effect in any of the other projects. This build also makes the jar a "really executable jar" - i.e. on Linux systems you can run the application by calling ./cpsign-{version}-fatjar.jar directly.

Note: As there are dependencies between the projects, you may need to use mvn install instead of the package goal for the "earlier" modules to be available to later ones in the hierarchy.

Generating Javadoc

To build the Javadoc of each project as a separate JAR file, run:

mvn javadoc:jar

This can be run either from the parent (root) project, to assemble Javadoc for all sub-projects, or from the directory of the project you wish to generate Javadoc for. If you instead want all documentation in a single place, e.g. because you put the fat jar on your classpath instead of using maven, run:

mvn javadoc:aggregate
# or to assemble everything into a JAR
mvn javadoc:aggregate-jar 

You can also alter the doc title by adding the argument -Ddoctitle='CPSign bundle javadoc' (or change the name to your liking) to avoid the default "Parent {version} API" title. Note that this command must be executed from the parent "aggregator project".

Developer info

Managing releases

This section is intended solely for the developers, to aid in deployment. Once it is time for a new release, the artifact versions must be updated to be non-SNAPSHOT. This can for instance be done by running the command

mvn versions:set -DnewVersion=2.0.0-rc1

from the root project. Once a new version is set, run the verify_deployment.sh script, which runs the build including Javadoc generation to mimic the build on the GitHub servers. Then create a new git tag that starts with v, e.g., by running git tag -a v2.0.0-rc1 -m 'release candidate 1 for 2.0.0 version', and push it to GitHub by running git push origin --tags. This step will trigger the build and deployment to the maven central repo described in the following section.
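
The release steps in this section can be guarded with a small shell sketch. The helper release_tag is a hypothetical name, not part of the CPSign repository; it only rejects SNAPSHOT versions and derives the v-prefixed tag name. The commands in the trailing comments are the ones quoted in this section, and assume that verify_deployment.sh is located in, and run from, the root project.

```shell
# Hypothetical helper: derive the git tag for a release version.
# Fails for empty or SNAPSHOT versions, which must not be released.
release_tag() {
  version="$1"
  case "$version" in
    ""|*-SNAPSHOT)
      echo "not a release version: '$version'" >&2
      return 1
      ;;
  esac
  # Tags must start with "v" to trigger the deployment workflow
  printf 'v%s\n' "$version"
}

# Intended use, run from the root project (not executed here):
#   version=2.0.0-rc1
#   tag=$(release_tag "$version") || exit 1
#   mvn versions:set -DnewVersion="$version"
#   ./verify_deployment.sh          # assumed location of the script
#   git tag -a "$tag" -m 'release candidate 1 for 2.0.0 version'
#   git push origin --tags
```

Deriving the tag from the version in one place keeps the two from drifting apart, since it is the pushed tag that starts the deployment.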

Deploying to Maven central

This is an automated step that is triggered by pushing a new git tag that starts with v to GitHub, as described above. The push triggers the GitHub Action defined in create_release, which performs all necessary steps for assembling and signing the artifacts, and uploads them to the staging area for maven central through nexus, as described e.g. in the maven central guide.

GitHub release

This is currently a manual task that consists of running the following steps (from the root project):

# Building the fat jar
mvn clean package -DskipTests=true -P fatjar
# Assembly of all java doc
mvn javadoc:aggregate-jar -Ddoctitle='CPSign bundle javadoc'

The output of the second command will be named something starting with "parent", which is not what we want; rename it to e.g. cpsign-{version}-fatjar-javadoc.jar.
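
The rename can be scripted along the following lines. This is a sketch under assumptions: the function name rename_javadoc_jar is hypothetical, and the aggregated Javadoc JAR is assumed to be written to the target directory of the root project with a file name matching parent*javadoc.jar. Check the actual output of the build before relying on it.

```shell
# Hypothetical helper: rename the aggregated javadoc jar in a directory.
# Usage: rename_javadoc_jar <directory> <version>
rename_javadoc_jar() {
  dir="$1"
  version="$2"
  for f in "$dir"/parent*javadoc.jar; do
    # If the glob matched nothing, $f is the literal pattern and does not exist
    if [ ! -e "$f" ]; then
      echo "no aggregated javadoc jar found in $dir" >&2
      return 1
    fi
    mv "$f" "$dir/cpsign-$version-fatjar-javadoc.jar"
    return 0   # only the first match is renamed
  done
}

# Intended use after `mvn javadoc:aggregate-jar` (not executed here):
#   rename_javadoc_jar target 2.0.0-rc1
```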

Once these steps have been performed, manually create the release on GitHub, link it to the tag that you created above, and include the two generated 'fat jar' artifacts.

Future work

  • Implement isotonic regression in a Java project, to replace the pairAdjacentViolators dependency, which also requires bundling the Kotlin standard library and thereby increases the memory footprint of the jars.