# Scikit-learn Pipeline Example

TextWiser is designed with rapid prototyping and scikit-learn interoperability in mind. As such, it supports integration with the [Scikit-learn Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class.

In [1]:
import os
os.chdir('..')

As an example, we use the 20 Newsgroups dataset from Scikit-learn. This dataset contains documents from 20 newsgroups, and the task is to classify each text document into one of them. Here, we use only a subset of the newsgroups for demonstration purposes.

In [2]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
print("Train data size: {}".format(len(newsgroups_train.data)))
print("Test data size: {}".format(len(newsgroups_test.data)))

Train data size: 2034
Test data size: 1353


## Basic Vectorization

We use Tf-Idf as the embedding type and reduce dimensionality with Non-negative Matrix Factorization (NMF). This is a rather common use case that is trivial to solve with nothing other than Scikit-learn.

The only difference here is that we use the relevant TextWiser, Embedding, and Transformation objects to set up the text vectorizer.

In [3]:
import numpy as np
from textwiser import TextWiser, Embedding, Transformation

emb = TextWiser(Embedding.TfIdf(min_df=5), Transformation.NMF(n_components=30))
vecs = emb.fit_transform(newsgroups_train.data)
vecs.shape

(2034, 30)
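For comparison, the same Tf-Idf followed by NMF idea can be sketched with Scikit-learn alone. This is a minimal illustration on a tiny made-up corpus, not a description of how TextWiser implements these steps internally; the documents, `n_components=2`, and the `init`, `random_state`, and `max_iter` settings are assumptions chosen only to keep the example small and repeatable.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus (an assumption for illustration, not the newsgroups data).
docs = [
    "the rocket launch was delayed by weather",
    "the shuttle reached orbit after launch",
    "the graphics card renders the image quickly",
    "rendering an image needs a fast graphics card",
]

# Step 1: sparse Tf-Idf document-term matrix.
tfidf = TfidfVectorizer(min_df=1)
doc_term = tfidf.fit_transform(docs)

# Step 2: factorize it into a dense, non-negative, low-dimensional representation.
nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
doc_vecs = nmf.fit_transform(doc_term)

print(doc_vecs.shape)  # one row per document, one column per component
```

The result has one row per document and one column per NMF component, which mirrors the `(2034, 30)` shape above.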

Once the documents are vectorized, we can train a multi-class Logistic Regression model to perform classification.

In [4]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(multi_class='auto', solver='lbfgs')
clf.fit(vecs, newsgroups_train.target)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

To evaluate on the test set, we compute the macro-averaged F1 score.

In [5]:
from sklearn import metrics

vecs_test = emb.transform(newsgroups_test.data)
pred = clf.predict(vecs_test)
metrics.f1_score(newsgroups_test.target, pred, average='macro')

0.6787583044194434
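To make the metric concrete: the macro-averaged F1 score is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of how many documents it has. The sketch below checks this on a small set of made-up labels (the `y_true` and `y_pred` values are assumptions for illustration only).

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels for three classes (illustration only).
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

# One F1 score per class, in label order.
per_class = f1_score(y_true, y_pred, average=None)

# Macro averaging: the plain mean of the per-class scores.
macro = f1_score(y_true, y_pred, average="macro")

print(per_class)
print(macro, np.mean(per_class))
```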

## Using the pipeline

Since this use case can generally be defined as text featurization followed by classification, we can fit that idea into a `Pipeline` object. Here, we use pretrained FastText word embeddings by Facebook (selected in the code below through the `WordOptions.word2vec` option with `pretrained='en'`) to first convert individual words into word vectors, and then max-pool them to get a single vector per document. This would normally require the user to download and set up the word embeddings from a different library, but here it is just another option to use in TextWiser.

We again use a Logistic Regression classifier, train the whole pipeline on the training data, and compute the F1 score on the test data.

In [6]:
from sklearn.pipeline import Pipeline
from textwiser import TextWiser, Embedding, PoolOptions, Transformation, WordOptions

clf = Pipeline([('featurizer', TextWiser(Embedding.Word(word_option=WordOptions.word2vec, pretrained='en'), Transformation.Pool(pool_option=PoolOptions.max))),
                ('classifier', LogisticRegression(multi_class='auto', solver='lbfgs', max_iter=400))])
# fit() returns the fitted pipeline itself, not document vectors, so the result is not stored
clf.fit(newsgroups_train.data, newsgroups_train.target)
pred = clf.predict(newsgroups_test.data)
metrics.f1_score(newsgroups_test.target, pred, average='macro')

0.7795074956936598
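The pooling step used in the pipeline above can be illustrated in isolation. The following is a minimal NumPy sketch of max pooling over word vectors; the three-dimensional `word_vectors` table, the whitespace tokenization, and the zero vector returned for documents with no known words are all assumptions made for this example and do not describe TextWiser's internals.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, invented for this example.
word_vectors = {
    "space": np.array([0.9, 0.1, -0.2]),
    "shuttle": np.array([0.7, -0.3, 0.4]),
    "launch": np.array([0.2, 0.8, 0.1]),
}


def max_pool(tokens, table, dim=3):
    """Collapse per-word vectors into one document vector by element-wise max."""
    vecs = [table[t] for t in tokens if t in table]
    if not vecs:
        # Assumption: fall back to a zero vector when no token is known.
        return np.zeros(dim)
    return np.max(np.stack(vecs), axis=0)


doc_vec = max_pool("space shuttle launch".split(), word_vectors)
print(doc_vec)  # element-wise maximum across the three word vectors
```

Whatever the number of words in a document, the pooled vector has the same dimensionality as a single word vector, which is what lets a fixed-size classifier such as Logistic Regression consume it.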