## Using Pipelines
Because the evaluation function accepts scikit-learn compatible estimators, scikit-learn <a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">pipelines</a> offer a concise way to build models. A pipeline chains feature transformers and ends with an estimator. In the following, we evaluate a linear support vector machine (`LinearSVC`) whose preprocessing steps are a custom column selector, a custom spaCy-based `WordVectorTransformer` and a `MaxAbsScaler`, all combined in one such pipeline model.
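
As a warm-up, here is a minimal sketch of how the steps of a pipeline chain together on toy data. It uses a `CountVectorizer` in place of the `WordVectorTransformer` from the cell below so that it runs without a spaCy language model. The toy array, the `select_text` helper and the hyperparameters are illustrative assumptions and not part of this project.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler
from sklearn.svm import LinearSVC


def select_text(X):
    """Return the third column (index 2), assumed here to hold the raw text."""
    return X[:, 2]


# Toy data (illustrative only): columns are keyword, location, text
X_toy = np.array([
    ["fire", "Berlin", "forest fire near the village"],
    ["", "", "what a lovely sunny day"],
    ["flood", "Paris", "flood warning issued for the river"],
    ["", "London", "enjoying coffee with friends"],
], dtype=object)
y_toy = np.array([1, 0, 1, 0])

toy_model = Pipeline([
    # Each transformer's output becomes the next step's input
    ("col-selector", FunctionTransformer(func=select_text)),
    ("vectorizer", CountVectorizer()),
    ("scaler", MaxAbsScaler()),
    # The final step is the estimator
    ("clf", LinearSVC(C=1.0, random_state=42)),
])

# fit() runs fit_transform on every transformer, then fits the classifier
toy_model.fit(X_toy, y_toy)
# predict() runs transform on every transformer, then predicts
predictions = toy_model.predict(X_toy)
print(predictions)
```

Because the whole chain behaves like a single estimator, it can be handed to any function that expects one, which is what the evaluation call below relies on.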

In [1]:
import numpy as np
import spacy
from sklearn import base, preprocessing, svm
from sklearn.pipeline import Pipeline

import evaluation


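class WordVectorTransformer(base.TransformerMixin, base.BaseEstimator):
    """Map each document to its spaCy document vector."""

    def __init__(self, model="en_core_web_lg"):
        self.model = model

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the training data
        return self

    def transform(self, X):
        # Note: the spaCy model is loaded on every call to transform
        nlp = spacy.load(self.model)
        # One row per document, shape (n_documents, vector_width)
        return np.concatenate([nlp(doc).vector.reshape(1, -1) for doc in X])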
    
# Set up the model as a transformer pipeline with a linear SVM
model = Pipeline([
    # Extract the `text` feature
    ('col-selector', preprocessing.FunctionTransformer(func=lambda X: X[:, 2])),
    # Embed each text as a spaCy document vector
    ('WordVectorTransformer', WordVectorTransformer()),
    # Scale each feature to a maximum absolute value of 1
    ('scaler', preprocessing.MaxAbsScaler()),
    # Classify data with a linear SVM
    ('clf', svm.LinearSVC(C=1e-2, class_weight='balanced', random_state=42))
])

# Evaluate model pipeline
evaluation.evaluate(model, store_model=True, store_submission=True)

INFO:root:Loading training data from ../data/external/kaggle/train.csv...
INFO:root:-> Number of samples: 7613
INFO:root:-> Number of features: 3
INFO:root:Evaluating model with 1 experiment(s) of 10-fold Cross Validation...
INFO:root:Run 1/10 finished
INFO:root:Run 2/10 finished
INFO:root:Run 3/10 finished
INFO:root:Run 4/10 finished
INFO:root:Run 5/10 finished
INFO:root:Run 6/10 finished
INFO:root:Run 7/10 finished
INFO:root:Run 8/10 finished
INFO:root:Run 9/10 finished
INFO:root:Run 10/10 finished
INFO:root:---
INFO:root:Expected submission results (F1-Score): around 0.77
INFO:root:F1-Score: 0.78 (training); 0.77 (test)
INFO:root:Accuracy: 81.38% (training); 80.01% (test)
INFO:root:Recall: 77.31% (training); 75.85% (test)
INFO:root:Precision: 78.92% (training); 77.22% (test)
INFO:root:---
INFO:root:Retraining model on the complete data set...
INFO:root:-> F1-Score on complete training set: 0.78
INFO:root:-> Stored model to ../models/model_2021-01-12_181935_Pipeline_1x10cv_0.77.pck

The actual submission result is `...`.

With the model `en_core_web_sm`: approx. 64%.

With the model `en_core_web_lg` (`WordVectorTransformer()`, `preprocessing.MaxAbsScaler()` as 'scaler' and `svm.LinearSVC` as 'clf'): F1-Score 0.78 (training); 0.77 (test), as in the log output above.
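
The internals of `evaluation.evaluate` are not shown in this notebook. As a rough sketch of how training and test scores like those in the log could be obtained, the following uses scikit-learn's `cross_validate` with 10 stratified folds. The synthetic data, the shortened pipeline and the choice of `StratifiedKFold` are assumptions for illustration, not the project's actual evaluation code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): 200 samples, 20 dense features
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)

demo_model = Pipeline([
    ("scaler", MaxAbsScaler()),
    ("clf", LinearSVC(C=1e-2, class_weight="balanced", random_state=42)),
])

# 10-fold cross validation; the pipeline is refit from scratch on each fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
metrics = ["f1", "accuracy", "recall", "precision"]
scores = cross_validate(
    demo_model, X_demo, y_demo, cv=cv,
    scoring=metrics,
    return_train_score=True,
)

# Report the mean over all folds for the training and the held-out part
for metric in metrics:
    train_mean = scores[f"train_{metric}"].mean()
    test_mean = scores[f"test_{metric}"].mean()
    print(f"{metric}: {train_mean:.2f} (training); {test_mean:.2f} (test)")
```

Passing the whole pipeline to the cross validation, rather than preprocessing the data once up front, ensures that the scaler is fit only on the training part of each fold.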
