# Using Pipelines
Because the evaluation function accepts any scikit-learn compatible estimator, scikit-learn <a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">pipelines</a> can be used to build models concisely. A pipeline chains feature transformers and ends with an estimator. In the following, we evaluate pipelines built around a CountVectorizer. For the classification, Naive Bayes and SVM classifiers are used.
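
As a minimal sketch of the idea, the following chains a column selector, a CountVectorizer and a MultinomialNB on a tiny hand-made array. The toy data, the helper `select_text` and the column layout (text in column index 2) are assumptions made for illustration only; in the cells below the real data is loaded by `evaluation.evaluate`.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy feature matrix (hypothetical): columns are keyword, location, text
X = np.array([
    ["fire", "US", "forest fire near the town"],
    ["flood", "UK", "flood warning issued for the river"],
    ["", "", "i love this sunny day"],
    ["", "", "what a great movie tonight"],
], dtype=object)
y = np.array([1, 1, 0, 0])


def select_text(X):
    """Return the text column (index 2) as a 1-D array of strings."""
    return X[:, 2]


toy_model = Pipeline([
    # Transformer 1: pick the text column
    ("col-selector", FunctionTransformer(func=select_text)),
    # Transformer 2: turn each text into a vector of term counts
    ("tf", CountVectorizer()),
    # Final estimator: Naive Bayes classifier
    ("clf", MultinomialNB()),
])

# fit() runs every transformer in order and then fits the classifier
toy_model.fit(X, y)
predictions = toy_model.predict(X)
print(predictions)
```

The whole chain behaves like a single estimator, which is why it can be handed to the evaluation function unchanged.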

### CountVectorizer with MultinomialNB
This pipeline uses a CountVectorizer for feature extraction.
For the classification we use MultinomialNB, a Naive Bayes classifier.


In [41]:
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import evaluation

# Set up the model as a transformer pipeline with a Naive Bayes classifier
model = Pipeline([
    # Extract the `text` feature
    ('col-selector', preprocessing.FunctionTransformer(func=lambda X: X[:, 2])),
    # Count term occurrences
    ('tf', CountVectorizer()),
    # Scale each feature to a maximum absolute value of 1
    ('scaler', preprocessing.MaxAbsScaler()),
    # Naive Bayes classifier
    ('clf', MultinomialNB()),
])

# Evaluate model pipeline
_,_,_ = evaluation.evaluate(model, store_model=False, store_submission=True)

INFO:root:Loading training data from ../data/external/kaggle/train.csv...
INFO:root:-> Number of samples: 7613
INFO:root:-> Number of features: 3
INFO:root:Evaluating model with 1 experiment(s) of 10-fold Cross Validation...
INFO:root:Run 1/10 finished
INFO:root:Run 2/10 finished
INFO:root:Run 3/10 finished
INFO:root:Run 4/10 finished
INFO:root:Run 5/10 finished
INFO:root:Run 6/10 finished
INFO:root:Run 7/10 finished
INFO:root:Run 8/10 finished
INFO:root:Run 9/10 finished
INFO:root:Run 10/10 finished
INFO:root:---
INFO:root:Expected submission results (F1-Score): around 0.75
INFO:root:F1-Score: 0.91 (training); 0.75 (test)
INFO:root:Accuracy: 92.67% (training); 80.13% (test)
INFO:root:Recall: 86.88% (training); 69.31% (test)
INFO:root:Precision: 95.66% (training); 81.66% (test)
INFO:root:---
INFO:root:Retraining model on the complete data set...
INFO:root:-> F1-Score on complete training set: 0.91
INFO:root:-> Stored submission file to ../models/submission_2021-01-24_135440_Pipeline_1x
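
The F1-score reported in the log is the harmonic mean of precision and recall. The following sketch recomputes it from the test-set precision and recall printed above. Note the assumption: the log values are presumably averaged over the ten folds, so the harmonic mean of the averaged precision and recall only approximates the averaged F1-score.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Test-set values from the log above, converted from percent to fractions
precision = 0.8166
recall = 0.6931

f1 = f1_from_precision_recall(precision, recall)
print(f"F1 = {f1:.2f}")
```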

without Scaler:
INFO:root:Expected submission results (F1-Score): around 0.76
INFO:root:F1-Score: 0.89 (training); 0.76 (test)
INFO:root:Accuracy: 90.71% (training); 80.61% (test)
INFO:root:Recall: 83.76% (training); 70.47% (test)
INFO:root:Precision: 93.96% (training); 81.88% (test)

with MaxAbsScaler:
INFO:root:Expected submission results (F1-Score): around 0.75
INFO:root:F1-Score: 0.91 (training); 0.75 (test)
INFO:root:Accuracy: 92.67% (training); 80.13% (test)
INFO:root:Recall: 86.88% (training); 69.31% (test)
INFO:root:Precision: 95.66% (training); 81.66% (test)

Actual submission result: 0.79344
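
To make the with/without scaler comparison above more concrete, here is a small sketch of what MaxAbsScaler does to a count matrix: every feature (column) is divided by its maximum absolute value, and zeros stay zeros, so the sparse structure is preserved. The count matrix is made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical term counts: 3 documents x 3 terms
counts = csr_matrix(np.array([
    [2, 0, 1],
    [4, 1, 0],
    [0, 3, 0],
]))

scaler = MaxAbsScaler()
scaled = scaler.fit_transform(counts)

print(scaler.max_abs_)            # per-column maxima used as divisors
print(scaled.toarray())           # every column now has a maximum of 1
print(scaled.nnz == counts.nnz)   # zeros stay zeros, so sparsity is kept
```

Since MultinomialNB models term counts, rescaling them changes the relative weight of frequent and rare terms, which may be one reason why the scores with and without the scaler differ slightly.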

### CountVectorizer with Naive Bayes Classification and GridSearchCV for optimization

In [4]:
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

import evaluation

# Set up the parameters for the grid search (two candidate values for alpha)
hyper_param = {'alpha': (1e-2, 2)}

# Set up the model as a transformer pipeline with a grid-searched Naive Bayes classifier
model = Pipeline([
    # Extract the `text` feature
    ('col-selector', preprocessing.FunctionTransformer(func=lambda X: X[:, 2])),
    # Count term occurrences
    ('tf', CountVectorizer()),
    # Naive Bayes classifier, alpha tuned by GridSearchCV on the F1-score
    ('clf', GridSearchCV(MultinomialNB(), hyper_param, scoring='f1')),
])

# Evaluate model pipeline
evaluation.evaluate(model, store_model=False, store_submission=False)

INFO:root:Loading training data from ../data/external/kaggle/train.csv...
INFO:root:-> Number of samples: 7613
INFO:root:-> Number of features: 3
INFO:root:Evaluating model with 1 experiment(s) of 10-fold Cross Validation...
INFO:root:Run 1/10 finished
INFO:root:Run 2/10 finished
INFO:root:Run 3/10 finished
INFO:root:Run 4/10 finished
INFO:root:Run 5/10 finished
INFO:root:Run 6/10 finished
INFO:root:Run 7/10 finished
INFO:root:Run 8/10 finished
INFO:root:Run 9/10 finished
INFO:root:Run 10/10 finished
INFO:root:---
INFO:root:Expected submission results (F1-Score): around 0.75
INFO:root:F1-Score: 0.86 (training); 0.75 (test)
INFO:root:Accuracy: 88.73% (training); 80.39% (test)
INFO:root:Recall: 79.97% (training); 68.79% (test)
INFO:root:Precision: 92.80% (training); 82.66% (test)
INFO:root:Evaluation finished.
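
One detail of the grid used above is worth spelling out: a tuple such as `(1e-2, 2)` is read as two discrete candidate values, not as a range. The sketch below lists the candidates with `ParameterGrid` and shows a denser, purely hypothetical alternative built with `np.logspace`; the denser grid was not evaluated in this notebook.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# The grid used above: a tuple is read as two discrete candidates
grid = ParameterGrid({"alpha": (1e-2, 2)})
candidates = [params["alpha"] for params in grid]
print(candidates)

# A denser, hypothetical alternative: 5 log-spaced values between 0.01 and 10
dense_grid = ParameterGrid({"alpha": np.logspace(-2, 1, num=5)})
dense_candidates = [float(params["alpha"]) for params in dense_grid]
print(dense_candidates)
```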



### CountVectorizer with linear SVM Classifier
Again using the CountVectorizer, but now with a linear SVM classifier for the classification.
The regularization parameter C of the SVM classifier is set to 1e-2.

In [7]:
from sklearn.pipeline import Pipeline
from sklearn import preprocessing, svm
from sklearn.feature_extraction.text import CountVectorizer

import evaluation

# Set up the model as a transformer pipeline with a linear SVM classifier
model = Pipeline([
    # Extract the `text` feature
    ('col-selector', preprocessing.FunctionTransformer(func=lambda X: X[:, 2])),
    # Count term occurrences
    ('tf', CountVectorizer()),
    # Classify data with a linear SVM
    ('clf', svm.LinearSVC(C=1e-2, class_weight='balanced', random_state=42))
])

# Evaluate model pipeline
evaluation.evaluate(model, store_model=False, store_submission=False)

INFO:root:Loading training data from ../data/external/kaggle/train.csv...
INFO:root:-> Number of samples: 7613
INFO:root:-> Number of features: 3
INFO:root:Evaluating model with 1 experiment(s) of 10-fold Cross Validation...
INFO:root:Run 1/10 finished
INFO:root:Run 2/10 finished
INFO:root:Run 3/10 finished
INFO:root:Run 4/10 finished
INFO:root:Run 5/10 finished
INFO:root:Run 6/10 finished
INFO:root:Run 7/10 finished
INFO:root:Run 8/10 finished
INFO:root:Run 9/10 finished
INFO:root:Run 10/10 finished
INFO:root:---
INFO:root:Expected submission results (F1-Score): around 0.76
INFO:root:F1-Score: 0.86 (training); 0.76 (test)
INFO:root:Accuracy: 88.47% (training); 79.78% (test)
INFO:root:Recall: 82.96% (training); 72.76% (test)
INFO:root:Precision: 89.44% (training); 78.60% (test)
INFO:root:Evaluation finished.


C=1e-2:
INFO:root:Expected submission results (F1-Score): around 0.76
INFO:root:F1-Score: 0.86 (training); 0.76 (test)
INFO:root:Accuracy: 88.47% (training); 79.78% (test)
INFO:root:Recall: 82.96% (training); 72.76% (test)
INFO:root:Precision: 89.44% (training); 78.60% (test)
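
The LinearSVC above is created with `class_weight='balanced'`. In scikit-learn this weights each class by `n_samples / (n_classes * class_count)`, so the rarer class receives the larger weight. The sketch below checks the formula on a made-up label vector; the label counts are an assumption and not the distribution of the training data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label vector with 6 negatives and 4 positives
y = np.array([0] * 6 + [1] * 4)
classes = np.unique(y)

# Formula: n_samples / (n_classes * count of each class)
manual = len(y) / (len(classes) * np.bincount(y))

# scikit-learn's helper should give the same values
from_sklearn = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print(manual)
print(from_sklearn)
```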

### CountVectorizer with linear SVM Classifier
Again using the CountVectorizer with a linear SVM classifier for the classification.
The regularization parameter C of the SVM classifier is set to 0.5, and a MaxAbsScaler is added.

In [10]:
from sklearn.pipeline import Pipeline
from sklearn import preprocessing, svm
from sklearn.feature_extraction.text import CountVectorizer

import evaluation

# Set up the model as a transformer pipeline with a linear SVM classifier
model = Pipeline([
    # Extract the `text` feature
    ('col-selector', preprocessing.FunctionTransformer(func=lambda X: X[:, 2])),
    # Count term occurrences
    ('tf', CountVectorizer()),
    # Scale each feature to a maximum absolute value of 1
    ('scaler', preprocessing.MaxAbsScaler()),
    # Classify data with a linear SVM
    ('clf', svm.LinearSVC(C=0.5, class_weight='balanced', random_state=42))
])

# Evaluate model pipeline
evaluation.evaluate(model, store_model=False, store_submission=False)

INFO:root:Loading training data from ../data/external/kaggle/train.csv...
INFO:root:-> Number of samples: 7613
INFO:root:-> Number of features: 3
INFO:root:Evaluating model with 1 experiment(s) of 10-fold Cross Validation...
INFO:root:Run 1/10 finished
INFO:root:Run 2/10 finished
INFO:root:Run 3/10 finished
INFO:root:Run 4/10 finished
INFO:root:Run 5/10 finished
INFO:root:Run 6/10 finished
INFO:root:Run 7/10 finished
INFO:root:Run 8/10 finished
INFO:root:Run 9/10 finished
INFO:root:Run 10/10 finished
INFO:root:---
INFO:root:Expected submission results (F1-Score): around 0.74
INFO:root:F1-Score: 1.00 (training); 0.74 (test)
INFO:root:Accuracy: 99.57% (training); 78.71% (test)
INFO:root:Recall: 99.53% (training); 71.72% (test)
INFO:root:Precision: 99.47% (training); 77.12% (test)
INFO:root:Evaluation finished.


C=0.5, without Scaler:
INFO:root:Expected submission results (F1-Score): around 0.75
INFO:root:F1-Score: 1.00 (training); 0.75 (test)
INFO:root:Accuracy: 99.62% (training); 78.92% (test)
INFO:root:Recall: 99.62% (training); 72.33% (test)
INFO:root:Precision: 99.49% (training); 77.17% (test)
            
with Scaler:
INFO:root:Expected submission results (F1-Score): around 0.74
INFO:root:F1-Score: 1.00 (training); 0.74 (test)
INFO:root:Accuracy: 99.57% (training); 78.71% (test)
INFO:root:Recall: 99.53% (training); 71.72% (test)
INFO:root:Precision: 99.47% (training); 77.12% (test)
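
A larger C means weaker regularization for the linear SVM. The sketch below collects the training and test F1-scores reported in the logs of this notebook and computes the gap between them. A large gap with a near-perfect training score suggests overfitting; this is an interpretation of the numbers, not an additional measurement.

```python
# (training F1, test F1) as reported in the logs of this notebook
reported = {
    "LinearSVC C=1e-2, no scaler": (0.86, 0.76),
    "LinearSVC C=0.5, no scaler": (1.00, 0.75),
    "LinearSVC C=0.5, MaxAbsScaler": (1.00, 0.74),
}

# Gap between training and test score, rounded to two decimals
gaps = {name: round(train - test, 2) for name, (train, test) in reported.items()}

# Print the configurations ordered from smallest to largest gap
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: gap = {gap:.2f}")
```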

### CountVectorizer with linear SVM Classifier
Again using the CountVectorizer with a linear SVM classifier for the classification.
The regularization parameter C of the SVM classifier is set to 0.5.
Additionally, stop words are removed.

In [19]:
from sklearn.pipeline import Pipeline
from sklearn import preprocessing, svm
from sklearn.feature_extraction.text import CountVectorizer

import evaluation

# Set up the model as a transformer pipeline with a linear SVM classifier
model = Pipeline([
    # Extract the `text` feature
    ('col-selector', preprocessing.FunctionTransformer(func=lambda X: X[:, 2])),
    # Count term occurrences, ignoring English stop words
    ('tf', CountVectorizer(stop_words='english')),
    # Scale each feature to a maximum absolute value of 1
    ('scaler', preprocessing.MaxAbsScaler()),
    # Classify data with a linear SVM
    ('clf', svm.LinearSVC(C=0.5, class_weight='balanced', random_state=42))
])

# Evaluate model pipeline
a,b,c = evaluation.evaluate(model, store_model=True, store_submission=True)

INFO:root:Loading training data from ../data/external/kaggle/train.csv...
INFO:root:-> Number of samples: 7613
INFO:root:-> Number of features: 3
INFO:root:Evaluating model with 1 experiment(s) of 10-fold Cross Validation...
INFO:root:Run 1/10 finished
INFO:root:Run 2/10 finished
INFO:root:Run 3/10 finished
INFO:root:Run 4/10 finished
INFO:root:Run 5/10 finished
INFO:root:Run 6/10 finished
INFO:root:Run 7/10 finished
INFO:root:Run 8/10 finished
INFO:root:Run 9/10 finished
INFO:root:Run 10/10 finished
INFO:root:---
INFO:root:Expected submission results (F1-Score): around 0.75
INFO:root:F1-Score: 0.97 (training); 0.75 (test)
INFO:root:Accuracy: 97.34% (training); 79.34% (test)
INFO:root:Recall: 95.52% (training); 71.42% (test)
INFO:root:Precision: 98.25% (training); 78.55% (test)
INFO:root:---
INFO:root:Retraining model on the complete data set...
INFO:root:-> F1-Score on complete training set: 0.97
INFO:root:-> Stored model to ../models/model_2021-01-24_104713_Pipeline_1x10cv_0.75.pck
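
To illustrate what `stop_words='english'` changes, the following sketch fits a CountVectorizer with and without the option on two made-up sentences and lists the terms that disappear from the vocabulary. The sentences are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tweets for illustration
docs = [
    "there is smoke in the forest",
    "the river is flooding the town",
]

plain = CountVectorizer().fit(docs)
filtered = CountVectorizer(stop_words="english").fit(docs)

# Terms that are dropped by the built-in English stop-word list
removed = sorted(set(plain.vocabulary_) - set(filtered.vocabulary_))

print(sorted(filtered.vocabulary_))
print(removed)
```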

### CountVectorizer with linear SVM Classifier
Again using the CountVectorizer with a linear SVM classifier for the classification.
The regularization parameter C of the SVM classifier is set to 0.5.
Additionally, stop words are removed and the vocabulary is limited with min_df and max_df.

In [30]:
from sklearn.pipeline import Pipeline
from sklearn import preprocessing, svm
from sklearn.feature_extraction.text import CountVectorizer

import evaluation

# Set up the model as a transformer pipeline with a linear SVM classifier
model = Pipeline([
    # Extract the `text` feature
    ('col-selector', preprocessing.FunctionTransformer(func=lambda X: X[:, 2])),
    # Count term occurrences; keep only terms found in 30% to 90% of the documents
    ('tf', CountVectorizer(stop_words='english', min_df=0.3, max_df=0.9)),
    # Scale each feature to a maximum absolute value of 1
    ('scaler', preprocessing.MaxAbsScaler()),
    # Classify data with a linear SVM
    ('clf', svm.LinearSVC(C=0.5, class_weight='balanced', random_state=42))
])

# Evaluate model pipeline
a,b,c = evaluation.evaluate(model, store_model=True, store_submission=True)

INFO:root:Loading training data from ../data/external/kaggle/train.csv...
INFO:root:-> Number of samples: 7613
INFO:root:-> Number of features: 3
INFO:root:Evaluating model with 1 experiment(s) of 10-fold Cross Validation...
INFO:root:Run 1/10 finished
INFO:root:Run 2/10 finished
INFO:root:Run 3/10 finished
INFO:root:Run 4/10 finished
INFO:root:Run 5/10 finished
INFO:root:Run 6/10 finished
INFO:root:Run 7/10 finished
INFO:root:Run 8/10 finished
INFO:root:Run 9/10 finished
INFO:root:Run 10/10 finished
INFO:root:---
INFO:root:Expected submission results (F1-Score): around 0.60
INFO:root:F1-Score: 0.60 (training); 0.60 (test)
INFO:root:Accuracy: 63.69% (training); 63.69% (test)
INFO:root:Recall: 62.82% (training); 62.82% (test)
INFO:root:Precision: 57.04% (training); 57.04% (test)
INFO:root:---
INFO:root:Retraining model on the complete data set...
INFO:root:-> F1-Score on complete training set: 0.60
INFO:root:-> Stored model to ../models/model_2021-01-24_105018_Pipeline_1x10cv_0.6.pck
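
A float value for `min_df` is interpreted as a fraction of documents, so `min_df=0.3` keeps only terms that occur in at least 30% of all documents. The sketch below shows the effect on a made-up corpus of ten short texts. In a tweet corpus with stop words removed, probably very few terms reach such a threshold, which would be consistent with the lower scores and the identical training and test values in the log above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus of 10 short documents
docs = [
    "storm hits coast", "storm damages roof", "storm warning issued",
    "storm closes school", "quiet evening walk", "new cafe opened",
    "concert was great", "traffic jam downtown", "lovely garden party",
    "train arrives late",
]

full = CountVectorizer().fit(docs)
# A float min_df is a fraction of documents: keep terms in >= 30% of them
frequent = CountVectorizer(min_df=0.3).fit(docs)

print(len(full.vocabulary_))
print(sorted(frequent.vocabulary_))
```

An integer `min_df` (for example `min_df=2`) is an absolute document count instead and is usually far less aggressive.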

In [None]:
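from sklearn.pipeline import Pipeline
from sklearn import preprocessing, svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

import evaluation

# Grid over the CountVectorizer step. Parameters of a pipeline step are
# addressed as `<step name>__<parameter>` (double underscore).
# Combinations where min_df exceeds max_df fail to fit and are scored as NaN.
hyper_par = [{
    'tf__min_df': [0.1, 0.2, 0.5, 2],
    'tf__max_df': [0.4, 0.5, 0.7, 0.9],
    'tf__max_features': [100, 300, 500, 700, 1000],
}]

# Set up the model as a transformer pipeline with a linear SVM classifier
pipeline = Pipeline([
    # Extract the `text` feature
    ('col-selector', preprocessing.FunctionTransformer(func=lambda X: X[:, 2])),
    # Count term occurrences; min_df, max_df and max_features are tuned by the grid search
    ('tf', CountVectorizer(stop_words='english')),
    # Scale each feature to a maximum absolute value of 1
    ('scaler', preprocessing.MaxAbsScaler()),
    # Classify data with a linear SVM
    ('clf', svm.LinearSVC(C=1e-1, class_weight='balanced', random_state=42))
])

# GridSearchCV needs a score to compare candidates, so it wraps the whole
# pipeline instead of the vectorizer alone
model = GridSearchCV(pipeline, hyper_par, scoring='f1')

# Evaluate model pipeline
a, b, c = evaluation.evaluate(model, store_model=True, store_submission=True)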
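
When tuning a step inside a pipeline, the parameter names follow the pattern `<step name>__<parameter>` with a double underscore; a key with a single underscore, such as `tf_min_df`, is not a valid parameter name. A quick way to check the valid names is `get_params()`. The sketch below uses a shortened pipeline without the column selector and scaler, which is a simplification for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

sketch = Pipeline([
    ("tf", CountVectorizer(stop_words="english")),
    ("clf", LinearSVC(C=1e-1)),
])

# Every tunable parameter is listed under `<step>__<parameter>`
tf_params = sorted(name for name in sketch.get_params() if name.startswith("tf__"))
print(tf_params)

# set_params accepts the same names, which is what GridSearchCV relies on
sketch.set_params(tf__min_df=2, tf__max_features=500)
print(sketch.named_steps["tf"].min_df, sketch.named_steps["tf"].max_features)
```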

### CountVectorizer with SVM Classifier and GridSearchCV
Again using the CountVectorizer, but now with an SVM classifier (RBF kernel) for the classification.
GridSearchCV is additionally used to optimize the classifier.
Stop words are removed as well.

In [4]:
from sklearn.pipeline import Pipeline
from sklearn import preprocessing, svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

import evaluation

# Candidate parameters for the SVM classifier
hyper_param = [{
    'kernel': ['rbf'],
    'C': [1, 10, 100],
    'gamma': ['scale']
}]

# Set up the model as a transformer pipeline with a grid-searched SVM classifier
model = Pipeline([
    # Extract the `text` feature
    ('col-selector', preprocessing.FunctionTransformer(func=lambda X: X[:, 2])),
    # Count term occurrences, ignoring English stop words
    ('tf', CountVectorizer(stop_words='english')),
    # Classify data with an RBF-kernel SVM, tuned by GridSearchCV on the F1-score
    ('clf', GridSearchCV(svm.SVC(), hyper_param, scoring='f1'))
])
# Alternative classifier: ('clf', svm.LinearSVC(C=1e-2, class_weight='balanced', random_state=42))

# Evaluate model pipeline
evaluation.evaluate(model, store_model=True, store_submission=True)

INFO:root:Loading training data from ../data/external/kaggle/train.csv...
INFO:root:-> Number of samples: 7613
INFO:root:-> Number of features: 3
INFO:root:Evaluating model with 1 experiment(s) of 10-fold Cross Validation...


KeyboardInterrupt: 

INFO:root:Expected submission results (F1-Score): around 0.75
INFO:root:F1-Score: 1.00 (training); 0.75 (test)
INFO:root:Accuracy: 99.64% (training); 79.98% (test)
INFO:root:Recall: 99.42% (training); 69.15% (test)
INFO:root:Precision: 99.73% (training); 81.45% (test)
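
Placing GridSearchCV inside a pipeline that is itself cross-validated multiplies the number of model fits, which may explain why the run above took long enough to be interrupted. The sketch below counts the fits under two assumptions: GridSearchCV uses its default of 5 inner folds, and the evaluation runs one 10-fold cross-validation as stated in the log.

```python
from sklearn.model_selection import ParameterGrid

hyper_param = [{"kernel": ["rbf"], "C": [1, 10, 100], "gamma": ["scale"]}]

n_candidates = len(ParameterGrid(hyper_param))
inner_folds = 5      # GridSearchCV default (cv=None) in scikit-learn >= 0.22
outer_folds = 10     # from the log: 10-fold cross-validation
refit = 1            # GridSearchCV refits the best candidate once

fits_per_outer_fold = n_candidates * inner_folds + refit
total_fits = fits_per_outer_fold * outer_folds
print(n_candidates, fits_per_outer_fold, total_fits)
```

Retraining on the complete data set afterwards adds one more grid search on top of this count.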

### CountVectorizer with linear SVM, Stop-Word Removal and Lemmatization
Again using the CountVectorizer and a linear SVM classifier for the classification.
Additionally, stop words are removed and the remaining words are lemmatized.

In [39]:
from sklearn.pipeline import Pipeline
from sklearn import preprocessing, feature_extraction
from sklearn import svm
import evaluation

import spacy

nlp = spacy.load("en_core_web_sm")


def prepros(X):
    # Lemmatize the tweet text (element 2 of the row) and drop stop words
    tweet = X[2]
    lemmat_words = " ".join(token.lemma_ for token in nlp(tweet) if not token.is_stop)
    return lemmat_words


# Set up the model as a transformer pipeline with a linear SVM classifier
model = Pipeline([
    # The column selector is not needed here because `prepros` already returns the text
    #('col-selector', preprocessing.FunctionTransformer(func=lambda X: X[:, 2])),
    # Vectorize the text
    ('vectorizer', feature_extraction.text.CountVectorizer()),
    # Scale data to maximum absolute value of 1 and keep sparsity properties
    ('scaler', preprocessing.MaxAbsScaler()),
    # Classify data with a linear SVM
    ('clf', svm.LinearSVC(C=1e-2, class_weight='balanced', random_state=42))
])

# Evaluate model pipeline
_, _, _ = evaluation.evaluate(model, preprocessing_func = prepros, store_model=True, store_submission=True)

INFO:root:Loading training data from ../data/external/kaggle/train.csv...
INFO:root:-> Number of samples: 7613
INFO:root:-> Number of features: 3
INFO:root:Applying pre-processing function...
INFO:root:-> Feature matrix after preprocessing: (7613,)
INFO:root:Evaluating model with 1 experiment(s) of 10-fold Cross Validation...
INFO:root:Run 1/10 finished
INFO:root:Run 2/10 finished
INFO:root:Run 3/10 finished
INFO:root:Run 4/10 finished
INFO:root:Run 5/10 finished
INFO:root:Run 6/10 finished
INFO:root:Run 7/10 finished
INFO:root:Run 8/10 finished
INFO:root:Run 9/10 finished
INFO:root:Run 10/10 finished
INFO:root:---
INFO:root:Expected submission results (F1-Score): around 0.74
INFO:root:F1-Score: 0.85 (training); 0.74 (test)
INFO:root:Accuracy: 87.42% (training); 78.83% (test)
INFO:root:Recall: 80.71% (training); 69.86% (test)
INFO:root:Precision: 88.99% (training); 78.50% (test)
INFO:root:---
INFO:root:Retraining model on the complete data set...
INFO:root:-> F1-Score on complete train


Actual result: 0.78761
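
The log line "Feature matrix after preprocessing: (7613,)" suggests that the pre-processing function is applied to each row and that its return values are collected into a one-dimensional array of strings. The sketch below mimics this under that assumption. `simple_prepros` and its tiny stop-word set are stand-ins that avoid the spaCy dependency; they are not the `prepros` function used above, and how `evaluation.evaluate` applies the function internally is not shown in this notebook.

```python
import numpy as np

STOP_WORDS = {"the", "is", "a", "in"}  # tiny stand-in list, not spaCy's


def simple_prepros(row):
    """Stand-in for `prepros`: drop stop words from the text column (index 2)."""
    tweet = row[2]
    return " ".join(word for word in tweet.lower().split() if word not in STOP_WORDS)


# Hypothetical feature matrix: columns are keyword, location, text
X = np.array([
    ["fire", "US", "The forest is burning"],
    ["", "", "A quiet day in the park"],
], dtype=object)

# Row-wise application turns the (n_samples, 3) matrix into (n_samples,) strings
X_pre = np.array([simple_prepros(row) for row in X], dtype=object)
print(X_pre.shape)
print(list(X_pre))
```

Because the result already contains only the text, a column selector in the pipeline is no longer needed.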

In [11]:
from sklearn.pipeline import Pipeline
from sklearn import preprocessing, svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

import evaluation

import spacy

def ownPrePro(X):
    # NOTE: CountVectorizer calls the tokenizer with ONE document string and
    # expects a flat list of string tokens. This version parses the whole
    # string once per character and returns a list of lists of spaCy tokens,
    # which leads to the TypeError shown below.
    nlp = spacy.load('en_core_web_sm')
    a = []
    for tweet in X:
        doc = nlp(X)
        words = [word for word in doc if not word.is_punct]
        a.append(words)
    return a

# Candidate parameters for the SVM classifier
hyper_param = [{
    'kernel': ['rbf'],
    'C': [1, 10, 100],
    'gamma': ['scale']
}]

# Set up the model as a transformer pipeline with a grid-searched SVM classifier
model = Pipeline([
    # Extract the `text` feature
    ('col-selector', preprocessing.FunctionTransformer(func=lambda X: X[:, 2])),
    # Count term occurrences with a custom tokenizer, ignoring English stop words
    ('tf', CountVectorizer(stop_words='english', tokenizer=ownPrePro)),
    # Classify data with an RBF-kernel SVM, tuned by GridSearchCV on the F1-score
    ('clf', GridSearchCV(svm.SVC(), hyper_param, scoring='f1'))
])
# Alternative classifier: ('clf', svm.LinearSVC(C=1e-2, class_weight='balanced', random_state=42))

# Evaluate model pipeline
evaluation.evaluate(model, store_model=True, store_submission=True)

INFO:root:Loading training data from ../data/external/kaggle/train.csv...
INFO:root:-> Number of samples: 7613
INFO:root:-> Number of features: 3
INFO:root:Evaluating model with 1 experiment(s) of 10-fold Cross Validation...


TypeError: unhashable type: 'list'
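
The TypeError is raised because a tokenizer passed to CountVectorizer is called with a single document string and has to return a flat list of hashable tokens (strings), whereas `ownPrePro` returns a list of lists. The sketch below shows a tokenizer with the expected shape. It uses a regular expression as a stand-in for spaCy so that it runs without a language model; a spaCy-based version would presumably return something like `[token.text for token in nlp(text) if not token.is_punct]`, with `nlp` loaded once outside the function.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer


def simple_tokenizer(text):
    """Return a flat list of word strings for ONE document (stand-in for spaCy)."""
    return re.findall(r"\w+", text)


# Hypothetical tweets for illustration
docs = ["Forest fire, near the town!", "Flood warning: river rising."]

# token_pattern=None silences the warning about the unused default pattern
vectorizer = CountVectorizer(tokenizer=simple_tokenizer, token_pattern=None)
counts = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))
print(counts.shape)
```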