# Using Pipelines
Because the evaluation function accepts scikit-learn compatible estimators, scikit-learn <a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">pipelines</a> offer a concise way to build models. A pipeline chains feature transformers and ends with an estimator. In the following, we evaluate pipelines that vectorize the text with a TfidfVectorizer and classify it with either a Naive Bayes classifier or an SVM classifier.

## TfidfVectorizer with MultinomialNB
The TfidfVectorizer combines a CountVectorizer and a TfidfTransformer in a single step.
For the classification we use MultinomialNB, a Naive Bayes classifier.
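
The relationship between the three classes can be checked directly. The following is a minimal sketch on made-up example sentences (not the Kaggle data), assuming default parameters for all three classes:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up example documents, for illustration only.
docs = [
    "forest fire near the town",
    "residents asked to shelter in place",
    "what a lovely sunny day",
]

# One step: tokenize, count terms, and apply TF-IDF weighting.
one_step = TfidfVectorizer().fit_transform(docs)

# Two steps: raw term counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# With default parameters both routes yield the same matrix.
same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)
```

Using the combined class keeps the pipeline one step shorter; the two-step form is mainly useful when the raw counts are needed as well.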


In [None]:
from sklearn.pipeline import Pipeline
from sklearn import preprocessing, base
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
import evaluation

# Set up the model as a transformer pipeline with a Naive Bayes classifier
model = Pipeline([
    # Extract the `text` feature
    ('col-selector', preprocessing.FunctionTransformer(func=lambda X: X[:, 2])),
    # TF-IDF vectorizer
    ('tfidf', TfidfVectorizer(analyzer='word', stop_words='english')),
    # Naive Bayes classifier
    ('clf', MultinomialNB()),
])

# Evaluate model pipeline
evaluation.evaluate(model, store_model=False, store_submission=False)

INFO:root:Expected submission results (F1-Score): around 0.73
INFO:root:F1-Score: 0.85 (training); 0.73 (test)
INFO:root:Accuracy: 88.78% (training); 79.77% (test)
INFO:root:Recall: 76.96% (training); 62.18% (test)
INFO:root:Precision: 96.17% (training); 87.03% (test)
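
The `col-selector` step in the pipeline above reduces the feature array to its third column before the vectorizer sees it. The sketch below shows the same idea in isolation. The array contents and the helper name `select_text` are hypothetical; it only assumes that the text is stored in the column with index 2, as in the pipeline.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical feature array: three columns, text in the last one.
X = np.array(
    [
        ["fire", "Berlin", "forest fire near the town"],
        ["", "", "what a lovely sunny day"],
    ],
    dtype=object,
)


def select_text(X):
    """Return the third column as a 1-D array of strings."""
    return X[:, 2]


selector = FunctionTransformer(func=select_text)
texts = selector.fit_transform(X)

# The vectorizer expects a 1-D sequence of strings, which is what we get.
print(texts.shape)
```

A named function behaves like the lambda here. One practical difference is that a module-level function can be serialized with the standard `pickle` module, while a lambda cannot, which may matter if the fitted model is stored.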

## TfidfVectorizer with Naive Bayes Classification and GridSearchCV for Optimization

In [1]:
from sklearn.pipeline import Pipeline
from sklearn import preprocessing, base
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

import evaluation
  
# Set up the parameters for the grid search
hyper_param = {'alpha': (1e-2, 1e-3)}

# Set up the model as a transformer pipeline with a grid-searched Naive Bayes classifier
model = Pipeline([
    # Extract the `text` feature
    ('col-selector', preprocessing.FunctionTransformer(func=lambda X: X[:, 2])),
    # TF-IDF vectorizer
    ('tfidf', TfidfVectorizer()),
    # Naive Bayes classifier, tuned with GridSearchCV
    ('clf', GridSearchCV(MultinomialNB(), hyper_param, scoring='f1')),
])

# Evaluate model pipeline
evaluation.evaluate(model, store_model=False, store_submission=False)

INFO:root:Loading training data from ../data/external/kaggle/train.csv...
INFO:root:-> Number of samples: 7613
INFO:root:-> Number of features: 3
INFO:root:Evaluating model with 1 experiment(s) of 10-fold Cross Validation...
INFO:root:Run 1/10 finished
INFO:root:Run 2/10 finished
INFO:root:Run 3/10 finished
INFO:root:Run 4/10 finished
INFO:root:Run 5/10 finished
INFO:root:Run 6/10 finished
INFO:root:Run 7/10 finished
INFO:root:Run 8/10 finished
INFO:root:Run 9/10 finished
INFO:root:Run 10/10 finished
INFO:root:---
INFO:root:Expected submission results (F1-Score): around 0.73
INFO:root:F1-Score: 0.97 (training); 0.73 (test)
INFO:root:Accuracy: 97.62% (training); 78.43% (test)
INFO:root:Recall: 95.69% (training); 68.82% (test)
INFO:root:Precision: 98.72% (training); 78.35% (test)
INFO:root:Evaluation finished.
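
Because the grid search is nested inside the pipeline as its last step, the selected value of `alpha` does not show up in the log output. The following minimal sketch shows how it could be read from the fitted pipeline. It uses made-up toy sentences instead of the Kaggle data, omits the column selector, and sets `cv=3` because the toy set is tiny; all three are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data (made up): 1 = disaster, 0 = no disaster.
texts = [
    "forest fire near the town", "flood warning for the valley",
    "earthquake damages buildings", "storm destroys homes",
    "wildfire forces evacuation", "bridge collapses after flood",
    "what a lovely sunny day", "great concert last night",
    "i love this new song", "coffee with friends today",
    "my cat sleeps all day", "new phone arrived today",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", GridSearchCV(MultinomialNB(), {"alpha": [1e-2, 1e-3]},
                         scoring="f1", cv=3)),
])
model.fit(texts, labels)

# The fitted GridSearchCV object is available under its step name.
search = model.named_steps["clf"]
print(search.best_params_)
print(search.best_score_)
```

Note that in this arrangement only the classifier is tuned; the vectorizer is fitted once on the training data with fixed settings.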



## TfidfVectorizer with Linear SVM Classifier
Again using the TfidfVectorizer, but now with a linear SVM classifier for the classification.
The regularization parameter C of the SVM classifier is set to 1e-2.

In [7]:
from sklearn.pipeline import Pipeline
from sklearn import preprocessing, base, svm, linear_model
from sklearn.feature_extraction.text import TfidfVectorizer

import evaluation

# Set up the model as a transformer pipeline with a linear SVM
model = Pipeline([
    # Extract the `text` feature
    ('col-selector', preprocessing.FunctionTransformer(func=lambda X: X[:, 2])),
    # TF-IDF vectorizer
    ('tfidf', TfidfVectorizer()),
    # Classify data with a linear SVM
    ('clf', svm.LinearSVC(C=1e-2, class_weight='balanced', random_state=42))
])

# Evaluate model pipeline
evaluation.evaluate(model, store_model=False, store_submission=False)

INFO:root:Loading training data from ../data/external/kaggle/train.csv...
INFO:root:-> Number of samples: 7613
INFO:root:-> Number of features: 3
INFO:root:Evaluating model with 1 experiment(s) of 10-fold Cross Validation...
INFO:root:Run 1/10 finished
INFO:root:Run 2/10 finished
INFO:root:Run 3/10 finished
INFO:root:Run 4/10 finished
INFO:root:Run 5/10 finished
INFO:root:Run 6/10 finished
INFO:root:Run 7/10 finished
INFO:root:Run 8/10 finished
INFO:root:Run 9/10 finished
INFO:root:Run 10/10 finished
INFO:root:---
INFO:root:Expected submission results (F1-Score): around 0.71
INFO:root:F1-Score: 0.75 (training); 0.71 (test)
INFO:root:Accuracy: 78.44% (training); 74.45% (test)
INFO:root:Recall: 75.38% (training); 73.43% (test)
INFO:root:Precision: 74.68% (training); 69.06% (test)
INFO:root:Evaluation finished.




## TfidfVectorizer with Linear SVM Classifier
Again using the TfidfVectorizer with a linear SVM classifier for the classification.
This time, the regularization parameter C of the SVM classifier is set to 0.5.

In [8]:
from sklearn.pipeline import Pipeline
from sklearn import preprocessing, base, svm, linear_model
from sklearn.feature_extraction.text import TfidfVectorizer

import evaluation

# Set up the model as a transformer pipeline with a linear SVM
model = Pipeline([
    # Extract the `text` feature
    ('col-selector', preprocessing.FunctionTransformer(func=lambda X: X[:, 2])),
    # TF-IDF vectorizer
    ('tfidf', TfidfVectorizer()),
    # Classify data with a linear SVM
    ('clf', svm.LinearSVC(C=0.5, class_weight='balanced', random_state=42))
])

# Evaluate model pipeline
evaluation.evaluate(model, store_model=False, store_submission=False)

INFO:root:Loading training data from ../data/external/kaggle/train.csv...
INFO:root:-> Number of samples: 7613
INFO:root:-> Number of features: 3
INFO:root:Evaluating model with 1 experiment(s) of 10-fold Cross Validation...
INFO:root:Run 1/10 finished
INFO:root:Run 2/10 finished
INFO:root:Run 3/10 finished
INFO:root:Run 4/10 finished
INFO:root:Run 5/10 finished
INFO:root:Run 6/10 finished
INFO:root:Run 7/10 finished
INFO:root:Run 8/10 finished
INFO:root:Run 9/10 finished
INFO:root:Run 10/10 finished
INFO:root:---
INFO:root:Expected submission results (F1-Score): around 0.76
INFO:root:F1-Score: 0.96 (training); 0.76 (test)
INFO:root:Accuracy: 96.91% (training); 79.06% (test)
INFO:root:Recall: 95.69% (training); 75.11% (test)
INFO:root:Precision: 97.09% (training); 75.90% (test)
INFO:root:Evaluation finished.
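
The two runs differ only in C. In the logs, the gap between training and test F1 is much larger for C=0.5 (0.96 vs. 0.76) than for C=1e-2 (0.75 vs. 0.71). A sketch of how such a comparison could be made in one loop with scikit-learn's `cross_validate` is shown below. It runs on made-up toy sentences with `cv=3` rather than through the project's `evaluation` module, so its numbers are not comparable to the logs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data (made up): 1 = disaster, 0 = no disaster.
texts = [
    "forest fire near the town", "flood warning for the valley",
    "earthquake damages buildings", "storm destroys homes",
    "wildfire forces evacuation", "bridge collapses after flood",
    "what a lovely sunny day", "great concert last night",
    "i love this new song", "coffee with friends today",
    "my cat sleeps all day", "new phone arrived today",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

gaps = {}
for C in (1e-2, 0.5):
    model = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LinearSVC(C=C, class_weight="balanced", random_state=42)),
    ])
    scores = cross_validate(model, texts, labels, cv=3, scoring="f1",
                            return_train_score=True)
    train_f1 = scores["train_score"].mean()
    test_f1 = scores["test_score"].mean()
    # A large train/test gap hints at overfitting.
    gaps[C] = train_f1 - test_f1
    print(f"C={C}: train F1 {train_f1:.2f}, test F1 {test_f1:.2f}")
```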



## TfidfVectorizer with SVM Classifier and GridSearchCV
Again using the TfidfVectorizer, this time with an SVM classifier (svm.SVC with an RBF kernel, as configured in the parameter grid) instead of the LinearSVC.
GridSearchCV is additionally used to optimize the classifier's parameter C.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn import preprocessing, base, svm, linear_model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

import evaluation

hyper_param = [{
    'kernel': ['rbf'],
    'C': [1, 10, 100],
    'gamma': ['scale']
}]

# Set up the model as a transformer pipeline with a grid-searched SVM
model = Pipeline([
    # Extract the `text` feature
    ('col-selector', preprocessing.FunctionTransformer(func=lambda X: X[:, 2])),
    # TF-IDF vectorizer
    ('tfidf', TfidfVectorizer()),
    # Classify data with an RBF-kernel SVM, tuned with GridSearchCV
    ('clf', GridSearchCV(svm.SVC(), hyper_param, scoring='f1'))
])

# Evaluate model pipeline
evaluation.evaluate(model, store_model=False, store_submission=False)

INFO:root:Expected submission results (F1-Score): around 0.76
INFO:root:F1-Score: 1.00 (training); 0.76 (test)
INFO:root:Accuracy: 99.67% (training); 80.47% (test)
INFO:root:Recall: 99.49% (training); 71.94% (test)
INFO:root:Precision: 99.73% (training); 80.53% (test)
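
An alternative arrangement, sketched here as an assumption and not the setup used above, wraps the whole pipeline in GridSearchCV instead of only the classifier. Vectorizer and classifier parameters can then be tuned together, with each parameter addressed as `<step>__<parameter>`. The toy sentences, the `ngram_range` candidates, and `cv=3` are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data (made up): 1 = disaster, 0 = no disaster.
texts = [
    "forest fire near the town", "flood warning for the valley",
    "earthquake damages buildings", "storm destroys homes",
    "wildfire forces evacuation", "bridge collapses after flood",
    "what a lovely sunny day", "great concert last night",
    "i love this new song", "coffee with friends today",
    "my cat sleeps all day", "new phone arrived today",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SVC(kernel="rbf", gamma="scale")),
])

# Step name and parameter name are joined by a double underscore.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [1, 10, 100],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(texts, labels)
print(search.best_params_)
```

With this layout the vectorizer is refitted inside every cross-validation split, so the inner validation folds do not influence the vocabulary or the IDF weights.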