
Scorepyo



Scorepyo is a Python package for binarizing features and building risk-score models for binary classification from data. The resulting models can be used like other ML models, with fit and predict methods.

Example on the scikit-learn breast cancer dataset


import pandas as pd
from sklearn.datasets import load_breast_cancer
from scorepyo.models import EBMRiskScore

# Getting data
data = load_breast_cancer()
data_X, data_y = data.data, data.target

X = pd.DataFrame(data=data_X, columns=data.feature_names)
y = pd.Series(data_y)

# Fit the risk-score model on the raw features
# (binarization is handled internally)
scorepyo_model = EBMRiskScore()
scorepyo_model.fit(X, y)

# Display the feature-point card and the score card
scorepyo_model.summary()

Feature-point card

Feature               Description                     Point(s)
worst concave points  worst concave points >= 0.14      -2        ...
worst radius          worst radius >= 16.8              -2      + ...
mean texture          mean texture >= 20.72             -1      + ...
worst area            worst area < 553.3                 1      + ...
                                              SCORE =            ...

Score card

SCORE   -5.0    -4.0    -3.0    -2.0     -1.0     0.0      1.0
RISK    0.00%   1.03%   1.13%   45.95%   84.09%   98.10%   100.00%
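
Once fitted, the model can be used like other scikit-learn style estimators. The snippet below is a hedged continuation of the example above: fit and predict are documented entry points, while predict_proba and its column layout are assumed here to follow the scikit-learn convention.

from sklearn.metrics import average_precision_score

# Hard class predictions (fit and predict are the documented entry points)
predictions = scorepyo_model.predict(X)

# Assumption: predict_proba follows the scikit-learn convention of one
# column per class, so column 1 holds the estimated risk of the positive class
risk = scorepyo_model.predict_proba(X)[:, 1]
print(average_precision_score(y, risk))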


Installation

Python 3.8, 3.9

pip install scorepyo

Documentation

Want to know more?

Check the documentation!


Risk-Score model

Risk-score models are mostly used in medicine, justice, psychology and credit applications. They are widely appreciated because the computation of the risk is fully explained by two simple tables:

  • a point card, with points to sum depending on feature values;
  • a score card, which associates each possible score with a risk.

The final score is computed by summing the points given by the point card; the score card then converts this score into a risk. The points should be small integers, so that they can be easily manipulated and remembered by the people using them.
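
To make the computation concrete, here is a minimal sketch of how such a model is applied by hand, reusing the point card and score card from the breast cancer example above (the patient values are made up for illustration):

# Point card: binary condition -> points (small integers)
point_card = {
    "worst concave points >= 0.14": -2,
    "worst radius >= 16.8": -2,
    "mean texture >= 20.72": -1,
    "worst area < 553.3": 1,
}

# Score card: total score -> risk (values from the score card above)
score_card = {-5: 0.0000, -4: 0.0103, -3: 0.0113, -2: 0.4595, -1: 0.8409, 0: 0.9810, 1: 1.0000}

# A made-up patient, described by which conditions hold
patient = {
    "worst concave points >= 0.14": True,
    "worst radius >= 16.8": False,
    "mean texture >= 20.72": True,
    "worst area < 553.3": True,
}

score = sum(points for condition, points in point_card.items() if patient[condition])
print(score, score_card[score])  # -2 -> 0.4595, i.e. a 45.95% risk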

You can find hereafter another example of such a risk-score model, used for assessing stroke risk:

(Figure: example risk-score model for stroke risk. Source.)

The extreme interpretability of such models is especially valuable: it helps users understand and trust the model's decisions, makes it easier to investigate fairness issues and satisfy legal requirements, and even allows the model to be memorized.

The computation is simple enough to be written down on a piece of paper and carried out by hand.


Components of Scorepyo

The Scorepyo package provides two components that can be used independently:

  • Automatic feature binarizer
  • Risk-score model

Automatic feature binarizer

Datasets usually come with features of various types. Continuous features must be binarized in order to be used in a risk-score model. Scorepyo leverages the awesome interpretML package and its EBM model to automatically extract binary features.
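
For illustration only (this is not Scorepyo's own API, which derives the thresholds automatically from the fitted EBM), here is a minimal sketch of what binarization produces, with a hypothetical helper and thresholds borrowed from the point card above:

import pandas as pd

def binarize(X: pd.DataFrame, thresholds: dict) -> pd.DataFrame:
    """Turn continuous columns into 0/1 features, one per (column, threshold)."""
    binary = pd.DataFrame(index=X.index)
    for column, cuts in thresholds.items():
        for cut in cuts:
            binary[f"{column} >= {cut}"] = (X[column] >= cut).astype(int)
    return binary

# X is the breast cancer DataFrame from the example above
X_bin = binarize(X, {"worst radius": [16.8], "mean texture": [20.72]})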

Risk score model

The risk-score model can be formulated as an optimization problem with 3 sets of decision variables:

  • Subset of binary features to use
  • Points associated with each selected binary feature
  • Probabilities associated with each possible score

The objective is a binary classification metric (e.g. log-loss, ROC AUC, average precision) computed from the risks assigned to the training samples.
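
As a sketch of what this objective looks like, assume a 0/1 matrix of the selected binary features, a vector of integer points, and a table mapping each possible total score to a probability (all names below are illustrative, not Scorepyo's internals):

import numpy as np
from sklearn.metrics import log_loss

def risk_score_logloss(X_bin, y, points, score_to_proba):
    """Log-loss of one candidate risk-score model on training samples.

    X_bin: (n_samples, n_features) 0/1 array of the selected binary features
    points: integer points associated with each selected feature
    score_to_proba: probability associated with each possible total score
    """
    scores = X_bin @ points                               # sum of points per sample
    probas = np.array([score_to_proba[s] for s in scores])
    return log_loss(y, probas)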

This formulation is already used in other packages such as risk-slim or FasterRisk.


The novelty in Scorepyo is that it decomposes the model search into simple and easily customizable components:

  • Ranking of binary features
  • Metric to maximize during the enumeration of candidate models
  • Probability calibration

It also drops the link between the score and the sigmoid function when defining the probability of each score, which widens the search space of risk-score models.
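
For example, rather than forcing the probability of a score s to be sigmoid(a*s + b), the probability of each score can be calibrated freely, e.g. from empirical frequencies on the training set (an illustrative sketch, not Scorepyo's exact calibration):

import pandas as pd

def empirical_score_card(scores, y):
    """Map each observed total score to the empirical positive rate,
    instead of constraining probabilities to a sigmoid of the score."""
    df = pd.DataFrame({"score": scores, "y": y})
    return df.groupby("score")["y"].mean().to_dict()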



Acknowledgements

Bernard de Chartres


This package is built on top of great packages:

  • interpretML for the binarizer
  • Dask to easily scale the costly enumeration step

More context

To better understand why risk-score models should be learned automatically from data, rather than obtained by simply rounding the coefficients of a logistic regression, I refer to the great introduction of the NeurIPS 2022 paper associated with FasterRisk:

https://arxiv.org/pdf/2210.05846.pdf.

risk-slim has an elegant approach mixing machine learning and Integer Linear Programming (ILP), which makes it possible to integrate preferences and constraints on the subset of features and their associated points. It is unfortunately based on CPLEX, a commercial ILP solver, which limits its use, and it also has trouble converging in high dimensions.

FasterRisk is a recent package that makes the computation much faster by dropping the ILP approach and providing another way to explore this large solution space, generating a diverse list of interesting risk-score models. This approach does not integrate constraints as risk-slim does, but it does a great job at quickly computing risk-score models. It only provides a quantile-based binarizer.