# Scikit-learn Model Selection Example

TextWiser is designed with rapid prototyping in mind, so it works with the Scikit-learn model selection classes such as [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html). Random search and grid search are two popular techniques for finding the best hyperparameters for a task. They are often applied to the parameters of one specific model, such as the minimum document frequency in Tf-Idf. However, we can go one step further and treat the text featurization techniques themselves as hyperparameters.
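
As a minimal sketch of the difference between the two techniques, the snippet below uses Scikit-learn's `ParameterGrid` and `ParameterSampler` helpers. The parameter names (`min_df`, `n_components`, `C`) are only illustrative here and are not tied to any estimator.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid search enumerates every combination of the listed values.
grid = list(ParameterGrid({"min_df": [5, 10], "n_components": [20, 25, 30]}))
print(len(grid))  # 2 * 3 = 6 candidates

# Random search draws a fixed number of candidates. Continuous
# distributions such as uniform() are sampled rather than enumerated.
space = {"min_df": [5, 10], "n_components": [20, 25, 30], "C": uniform()}
samples = list(ParameterSampler(space, n_iter=4, random_state=0))
for candidate in samples:
    print(candidate)
```

Random search keeps the budget fixed (`n_iter`) no matter how many parameters are added, which is why it is the technique used in the rest of this example.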

In [1]:
import os
os.chdir('..')

Again, we use the 20 newsgroups dataset from Scikit-learn. This dataset contains 20 news groups, and the task is to classify a text document into one of them. Here, we use only a subset of the news groups for demonstration purposes.

In [2]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
print("Train data size: {}".format(len(newsgroups_train.data)))
print("Test data size: {}".format(len(newsgroups_test.data)))

Train data size: 2034
Test data size: 1353


The cell below shows how to set up a regular hyperparameter search over the text featurization and the classifier. TextWiser holds the specified embedding and transformations internally in the `embedding` and `transformations` variables, respectively. Here we set the choices for the `min_df` parameter of the TfIdf embedding. To set a parameter of a transformation, index into the `transformations` variable, as in `featurizer__transformations__0__n_components`. The object hierarchy is specified with the double underscore (`__`) separator, as in Scikit-learn.

In [3]:
import numpy as np
from scipy.stats import uniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from textwiser import TextWiser, Embedding, PoolOptions, Transformation, WordOptions

clf = Pipeline([('featurizer', TextWiser(Embedding.TfIdf(min_df=5), Transformation.NMF(n_components=30), lazy_load=True)),
                ('classifier', LogisticRegression(multi_class='auto', solver='lbfgs', max_iter=400))])

param_dist = {'featurizer__embedding__min_df': [5, 10],
              'featurizer__transformations__0__n_components': [20, 25, 30],
              'classifier__C': uniform()}

# run randomized search
n_iter_search = 2
random_search = RandomizedSearchCV(clf, param_distributions=param_dist, scoring='f1_macro',
                                   n_iter=n_iter_search, cv=5)

random_search.fit(newsgroups_train.data, newsgroups_train.target)
print("Best score: {}".format(random_search.best_score_))
print("Best model:")
random_search.best_estimator_

Best score: 0.7174741300461003
Best model:


Pipeline(memory=None,
         steps=[('featurizer',
                 TextWiser(
  (_imp): _Sequential(
    (0): _TfIdfEmbeddings()
    (1): _NMFTransformation()
  )
)),
                ('classifier',
                 LogisticRegression(C=0.8888647751187911, class_weight=None,
                                    dual=False, fit_intercept=True,
                                    intercept_scaling=1, l1_ratio=None,
                                    max_iter=400, multi_class='auto',
                                    n_jobs=None, penalty='l2',
                                    random_state=None, solver='lbfgs',
                                    tol=0.0001, verbose=0, warm_start=False))],
         verbose=False)
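
The double underscore addressing used in the search above is the same mechanism that `get_params` and `set_params` expose on any Scikit-learn pipeline. The sketch below illustrates it with a plain `TfidfVectorizer` standing in for TextWiser (an assumption made to keep the example small), so TextWiser-specific keys such as `featurizer__transformations__0__n_components` are not shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# TfidfVectorizer is a stand-in for the TextWiser featurizer.
pipe = Pipeline([("featurizer", TfidfVectorizer(min_df=5)),
                 ("classifier", LogisticRegression(max_iter=400))])

# "<step name>__<parameter name>" addresses a parameter of a pipeline step.
pipe.set_params(featurizer__min_df=10, classifier__C=0.5)

params = pipe.get_params()
print(params["featurizer__min_df"])  # 10
print(params["classifier__C"])       # 0.5
```

Any key returned by `get_params()` can be used as a key of `param_distributions`.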

We can also treat the feature extractors and classification models themselves as hyperparameters and run a random search over them. Note that we also set the `lazy_load` parameter to `True`, meaning the model objects are only instantiated when either `fit` or `forward` is called. This saves a lot of memory at the model selection stage, since initialized word embeddings and other models can be large.

In [6]:
from sklearn.tree import DecisionTreeClassifier

clf = Pipeline([('featurizer', None),
                ('classifier', None)])

param_dist = {'featurizer': [TextWiser(Embedding.TfIdf(min_df=5), Transformation.NMF(n_components=30), lazy_load=True),
                             TextWiser(Embedding.Word(word_option=WordOptions.word2vec, pretrained='en'),
                                       Transformation.Pool(pool_option=PoolOptions.max), lazy_load=True)],
              'classifier': [LogisticRegression(multi_class='auto', solver='lbfgs', max_iter=400),
                             DecisionTreeClassifier()]}

# run randomized search
n_iter_search = 2
random_search = RandomizedSearchCV(clf, param_distributions=param_dist, scoring='f1_macro',
                                   n_iter=n_iter_search, cv=5)

random_search.fit(newsgroups_train.data, newsgroups_train.target)
print("Best score: {}".format(random_search.best_score_))
print("Best model:")
random_search.best_estimator_

Best score: 0.8392193843432569
Best model:


Pipeline(memory=None,
         steps=[('featurizer',
                 TextWiser(
  (model): _Sequential(
    (0): _WordEmbeddings(
      (model): Embedding(1000001, 300, sparse=True)
    )
    (1): _PoolTransformation()
  )
)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=400,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)
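
To illustrate why lazy loading helps when many candidate featurizers sit in a search space, here is a minimal pure-Python sketch of the general pattern. `LazyFeaturizer` is a hypothetical class written for this example. It is not TextWiser's implementation, and its embedding table is a toy list of zeros.

```python
class LazyFeaturizer:
    """Hypothetical stand-in that defers its expensive allocation."""

    def __init__(self, vocab_size, dim):
        # Only the cheap configuration is stored at construction time.
        self.vocab_size = vocab_size
        self.dim = dim
        self._table = None  # the expensive object does not exist yet

    @property
    def is_loaded(self):
        return self._table is not None

    def _load(self):
        # Allocate the large embedding table on first use only.
        if self._table is None:
            self._table = [[0.0] * self.dim for _ in range(self.vocab_size)]
        return self._table

    def fit(self, docs):
        self._load()
        return self


# Listing several candidates costs almost nothing until one is fitted.
candidates = [LazyFeaturizer(vocab_size=1000, dim=8) for _ in range(3)]
print([c.is_loaded for c in candidates])  # [False, False, False]

candidates[0].fit(["some text"])
print([c.is_loaded for c in candidates])  # [True, False, False]
```

Only the candidate that is actually fitted pays the memory cost, which is the behavior the `lazy_load=True` setting is described as providing.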

In a more advanced example, we can define a schema as described in the README and modify the parameters inside it. This matters because, in principle, this method lets us swap entire embeddings and transformations.

We first create the schema to base our search on.

In [7]:
schema = {
    "transform": [
        {
            "concat": [
                {
                    "transform": [
                        ["word2vec", {"pretrained": "en"}],
                        "pool"
                    ]
                },
                {
                    "transform": [
                        "tfidf",
                        ["nmf", { "n_components": 30 }]
                    ]
                }
            ]
        },
        "svd"
    ]
}

We can then pick any single component of the schema and give it the usual hyperparameter options. You can go down the schema with the usual double underscore (`__`) separator, using integer indices for positions in a list.
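
To make such a path concrete, the sketch below walks the schema defined above with a small helper. `resolve` is a hypothetical function written for this illustration and is not part of TextWiser. In particular, the way it steps into a `[name, params]` pair when it meets a non-numeric key is an assumption about how the final `n_components` segment is reached. The path shown is the part that follows `featurizer__embedding__schema__` in the full parameter name.

```python
def resolve(schema, path):
    """Hypothetical helper: follow a `__`-separated path through a schema."""
    node = schema
    for key in path.split("__"):
        if isinstance(node, list) and key.isdigit():
            # Numeric segments index into a list.
            node = node[int(key)]
        elif isinstance(node, list):
            # Assumed handling of a [name, params] pair such as
            # ["nmf", {...}]: look the key up in the params dict.
            node = node[1][key]
        else:
            # Other segments are dictionary keys.
            node = node[key]
    return node


schema = {
    "transform": [
        {
            "concat": [
                {"transform": [["word2vec", {"pretrained": "en"}], "pool"]},
                {"transform": ["tfidf", ["nmf", {"n_components": 30}]]},
            ]
        },
        "svd",
    ]
}

print(resolve(schema, "transform__0__concat__1__transform__1"))
# ['nmf', {'n_components': 30}]
print(resolve(schema, "transform__0__concat__1__transform__1__n_components"))
# 30
```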

In [8]:
clf = Pipeline([('featurizer', TextWiser(Embedding.Compound(schema=schema), dtype=np.float32, lazy_load=True)),
                ('classifier', LogisticRegression(multi_class='auto', solver='lbfgs', max_iter=400))])

param_dist = {'featurizer__embedding__schema__transform__0__concat__1__transform__1__n_components': [5, 10],
              'classifier__C': uniform()}

# run randomized search
n_iter_search = 2
random_search = RandomizedSearchCV(clf, param_distributions=param_dist, scoring='f1_macro',
                                   n_iter=n_iter_search, cv=2)

random_search.fit(newsgroups_train.data, newsgroups_train.target)
print("Best score: {}".format(random_search.best_score_))
print("Best model:")
random_search.best_estimator_

Best score: 0.8301270978191986
Best model:


Pipeline(memory=None,
         steps=[('featurizer',
                 TextWiser(
  (model): _Sequential(
    (0): _CompoundEmbeddings(
      (model): _Sequential(
        (0): _Concat(
          (embeddings): ModuleList(
            (0): _Sequential(
              (0): _WordEmbeddings(
                (model): Embedding(1000001, 300, sparse=True)
              )
              (1): _PoolTransformation()
            )
            (1): _Sequential(
              (0): _TfIdfEmbeddings()
              (1): _NMFTransformation()
            )
          )
        )
        (1): _SVDTransformation()
      )
    )
  )
)),
                ('classifier',
                 LogisticRegression(C=0.7286697905299401, class_weight=None,
                                    dual=False, fit_intercept=True,
                                    intercept_scaling=1, l1_ratio=None,
                                    max_iter=400, multi_class='auto',
                                    n_jobs=None, penalty='l2',