An sklearn-compatible classifier for benchmarking NLP classification problems. The model used is the NBSVM described in section 2.3 of the paper "Baselines and Bigrams: Simple, Good Sentiment and Topic Classification". The authors provide their own MATLAB implementation.
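For orientation, the core quantity in that section of the paper is the naive Bayes log-count ratio r, which rescales each feature before a linear SVM is trained. The following is a minimal pure-Python sketch of the binary case, not this package's code: the function names log_count_ratio and scale_features are hypothetical, the smoothing value alpha=1.0 is an assumed default, and the paper's interpolation step between the SVM weights and their mean is left out.

```python
import math


def log_count_ratio(X, y, alpha=1.0):
    """Return r = log((p / ||p||_1) / (q / ||q||_1)) for binary labels.

    X is a list of feature-count rows, y a list of labels in {1, -1},
    and alpha the smoothing constant added to every feature count.
    """
    n_features = len(X[0])
    p = [alpha] * n_features  # smoothed feature counts, positive class
    q = [alpha] * n_features  # smoothed feature counts, negative class
    for row, label in zip(X, y):
        target = p if label == 1 else q
        for j, value in enumerate(row):
            target[j] += value
    # All entries are positive, so the L1 norm is just the sum.
    p_norm, q_norm = sum(p), sum(q)
    return [math.log((pj / p_norm) / (qj / q_norm)) for pj, qj in zip(p, q)]


def scale_features(X, r):
    """Multiply each row element-wise by r; a linear SVM is then fit on the result."""
    return [[rj * value for rj, value in zip(r, row)] for row in X]


# Toy data: feature 0 occurs only in positive documents, feature 1 only in
# negative ones, and feature 2 occurs equally often in both classes.
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
y = [1, 1, -1, -1]
r = log_count_ratio(X, y)
X_scaled = scale_features(X, r)
```

In this toy data, r is positive for the feature seen only in positive documents, negative for the one seen only in negative documents, and zero for the feature that is equally common in both.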
Clone the repo, cd into the project root directory, and install into a Python environment with

    pip install .
For example, in a fresh virtual environment:

    python3 -m venv venv
    source venv/bin/activate
    git clone email@example.com:fastforwardlabs/nbsvm.git
    cd nbsvm
    pip install -r requirements.txt
    pip install .
The NBSVM classifier is intended to be used on features transformed by either
Example usage looks like this:
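    from nbsvm import NBSVM

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer

    # Load the training split of the 20 newsgroups dataset.
    news = fetch_20newsgroups()

    # Binarized counts: each feature records presence or absence of a token.
    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(news.data)
    y = news.target

    model = NBSVM()
    model.fit(X, y)
    model.predict(X)  # predictions on the training data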
There are a handful of unit tests for the public interface of the NBSVM class.
To run these locally, install the dependencies in requirements.txt into a clean environment and simply call pytest in the root directory of the project.
The first time the tests run, they will fetch a subset of the 20newsgroups dataset, which may take a few moments.
Tests should run in seconds after the initial download.
By default, the data downloads to ~/scikit_learn_data (in your home directory); this location can be changed by modifying the source.
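As an alternative to editing the source, scikit-learn reads the SCIKIT_LEARN_DATA environment variable whenever a dataset fetcher is called without an explicit data_home argument. The sketch below assumes the tests fetch the data that way, which this README does not confirm; the helper name use_data_home is hypothetical and is not part of this project.

```python
import os
from pathlib import Path


def use_data_home(path):
    """Point scikit-learn's dataset cache at `path` for the current process.

    Must be called before any dataset is fetched, for example near the top
    of a conftest.py (an assumed location, not part of this repository).
    """
    data_home = Path(path).expanduser().resolve()
    data_home.mkdir(parents=True, exist_ok=True)
    # scikit-learn consults this variable when data_home is not given.
    os.environ["SCIKIT_LEARN_DATA"] = str(data_home)
    return data_home


# Example call (the path is illustrative only):
# use_data_home("~/datasets/scikit_learn_data")
```

Setting the same variable in the shell before calling pytest has the same effect and needs no code changes.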