## How to run the experiments

Run the code blocks below in sequence. Read the description above each block to understand what it does.


The dependencies can be found in https://github.com/eduardogc8/simple-qc

Before running the experiments, set the variable ``path_wordembedding`` in the code block below to the directory that holds the word embeddings. Make sure that the embedding files in that directory follow the naming template `wiki.multi.*.vec`.
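
A quick way to confirm the directory is ready is to look for one file per language before launching the long runs. The sketch below is not part of the repository: `missing_embeddings` is a hypothetical helper, and the language codes `en`, `es` and `pt` are taken from the loops used later in this notebook.

```python
from pathlib import Path


def missing_embeddings(directory, languages=("en", "es", "pt")):
    """Return the languages whose wiki.multi.<lang>.vec file is absent from directory."""
    base = Path(directory)
    return [lang for lang in languages
            if not (base / f"wiki.multi.{lang}.vec").is_file()]
```

Calling `missing_embeddings(path_wordembedding)` should return an empty list when every expected file is in place.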

In [1]:
import nltk
import numpy as np
import pandas as pd
from keras.preprocessing.sequence import pad_sequences
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import normalize

from benchmarking_methods import run_benchmark
from building_classifiers import lstm_default, svm_linear, random_forest, cnn
from download_word_embeddings import muse_embeddings_path, download_if_not_existing
from loading_data import load_embedding, load_uiuc

path_wordembedding = '/home/eduardo/word_embedding/'
download_if_not_existing()
from benchmarking_methods import run_benchmark_cv
from feature_creation import create_feature
from loading_data import load_disequa

Using TensorFlow backend.


### Extract features

The function *create_feature* transforms the questions into numerical vectors that a classifier model can consume.<br>It stores the output in the dataframe passed as *df_2*, in a column named after the *feature_type* (for example *df_2.tfidf*).<br><br>
**feature_type:** type of feature (bow, tfidf, embedding, embedding_sum, vocab_index, pos_index, pos_hotencode, ner_index, ner_hotencode).<br> 
**df:** the dataframe used to fit the transformer models (df.questions).<br>
**df_2:** the dataframe whose data will be transformed (df_2.questions).<br>
**embedding:** embedding model for the word-embedding feature types.<br>
**max_features:** used by the bag-of-words and TF-IDF features.
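
The *df* / *df_2* pair follows the usual fit/transform pattern: the transformer is fitted on one dataframe and applied to another. The sketch below illustrates that pattern with scikit-learn's `TfidfVectorizer`. It is an assumption about how such a feature could be built, not the repository's `create_feature`; the helper name `tfidf_feature` is hypothetical, and the `question` column name is taken from the BERT cells later in this notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_feature(df, df_2, max_features=2000):
    """Fit TF-IDF on df['question'] and store one vector per row in df_2['tfidf']."""
    vectorizer = TfidfVectorizer(max_features=max_features)
    vectorizer.fit(df["question"])
    matrix = vectorizer.transform(df_2["question"]).toarray()
    df_2["tfidf"] = list(matrix)  # one dense vector per question
    return vectorizer


train = pd.DataFrame({"question": ["Who wrote Hamlet ?", "Where is Lisbon ?"]})
test = pd.DataFrame({"question": ["Who painted Guernica ?"]})

tfidf_feature(train, train)  # training features
tfidf_feature(train, test)   # test features, using the vocabulary learned on train
```

Fitting on the training dataframe in both calls keeps the test questions out of the vocabulary, which is the same pattern the benchmark cells use when they call `create_feature` twice.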


### Create classifier models

The models are created through functions that return them. These functions will be used to create a new model in each experiment. Therefore, an instance of a model is created by the benchmark function and not explicitly in a code block.
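
As a minimal sketch of this factory pattern, the dictionary holds the function itself and the benchmark calls it whenever a new model is needed. `svm_linear_sketch` is a hypothetical stand-in; the real `svm_linear` in `building_classifiers` may configure the classifier differently.

```python
from sklearn.svm import LinearSVC


def svm_linear_sketch():
    """Return a new, untrained linear SVM each time it is called."""
    return LinearSVC()


# The dictionary stores the function, not an instance.
model = {"name": "svm", "model": svm_linear_sketch}

first = model["model"]()
second = model["model"]()  # a separate object, so no state leaks between experiments
```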


### UTILS

In [2]:
import warnings
warnings.filterwarnings("ignore")



#### Load UIUC dataset

#### Load DISEQuA dataset

## Benchmark UIUC - Normal

**Normal:** it uses the default fixed UIUC split into a training set (at most 5500 instances) and a test set (500 instances), so it does not use cross-validation.

When the *run_benchmark* function is executed, it will save each result in the *save* path.

**model:** a dictionary with the classifier name and the function to create and return the model (not an instance of the model). <br> Example: *model = {'name': 'SVM', 'model': svm_linear}*<br>
**X:** the full training set.<br>
**y:** the labels of the training set.<br>
**x_test:** test set.<br>
**y_test:** labels of the test set.<br>
**sizes_train:** sizes of the training set. One experiment is executed for each size.<br>
**runs:** number of times that each experiment is executed (useful for models that have randomly initialized parameters, like the weights of an ANN).<br>
**save:** csv path where the results will be saved.<br>
**metric_average:** averaging mode used for the f1, recall and precision metrics.<br>
**onehot:** one-hot encoder used to transform the labels.<br>
**out_dim:** the number of classes, for ANN models.<br>
**epochs:** epochs for ANN models.<br>
**batch_size:** batch_size for ANN models.<br>
**vocabulary_size:** vocabulary size (used in the CNN model).
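
The parameters above suggest a loop over training sizes and runs. The sketch below shows one way such a loop could look. It is an assumption, not the code in `benchmarking_methods`: `run_benchmark_sketch` is a hypothetical name, it takes the first *n* training rows for each size (as the BERT cell later in this notebook does), and it returns the rows instead of writing them to *save*.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.svm import LinearSVC


def run_benchmark_sketch(model, X, y, x_test, y_test, sizes_train,
                         runs=1, metric_average="macro"):
    """Train a new model on the first n training rows for each n and score it on the test set."""
    rows = []
    for size in sizes_train:
        for run in range(1, runs + 1):
            clf = model["model"]()  # new instance per experiment
            clf.fit(X[:size], y[:size])
            pred = clf.predict(x_test)
            rows.append({"model": model["name"], "train_size": size, "run": run,
                         "accuracy": accuracy_score(y_test, pred),
                         "f1": f1_score(y_test, pred, average=metric_average)})
    return rows


# Toy data: two alternating classes, separated on the first feature.
rng = np.random.RandomState(0)
y = np.tile([0, 1], 30)
X = np.column_stack([y + 0.1 * rng.rand(60), rng.rand(60)])

rows = run_benchmark_sketch({"name": "svm", "model": LinearSVC},
                            X[:40], y[:40], X[40:], y[40:], sizes_train=[20, 40])
```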



## Benchmark UIUC and DISEQuA - Cross-validation

**Cross-validation:** instead of the default fixed split, it uses the whole dataset with 10-fold cross-validation.

When the *run_benchmark_cv* function is executed, it will save each result in the *save* path.

**model:** a dictionary with the classifier name and the function to create and return the model (not an instance of the model). <br> Example: *model = {'name': 'SVM', 'model': svm_linear}*<br>
**X:** input features.<br>
**y:** input labels.<br>
**sizes_train:** sizes of the training set. One experiment is executed for each size.<br>
**folds:** number of folds for cross-validation.<br>
**save:** csv path where the results will be saved.<br>
**metric_average:** averaging mode used for the f1, recall and precision metrics.<br>
**onehot:** one-hot encoder used to transform the labels.<br>
**epochs:** epochs for ANN models.<br>
**batch_size:** batch_size for ANN models.<br>
**vocabulary_size:** vocabulary size (used in the CNN model).
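
One way to draw folds for a fixed training size is scikit-learn's `StratifiedShuffleSplit`, which the BERT cross-validation cell later in this notebook uses directly. Whether `run_benchmark_cv` splits the data the same way is an assumption; `cv_splits` is a hypothetical helper written only for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit


def cv_splits(y, size_train, folds=10, seed=1):
    """Return a list of (train_idx, test_idx) pairs with size_train stratified training rows each."""
    splitter = StratifiedShuffleSplit(n_splits=folds, train_size=size_train,
                                      test_size=len(y) - size_train, random_state=seed)
    # Only the labels matter for stratification, so a placeholder X is enough.
    return list(splitter.split(np.zeros(len(y)), y))


y = np.array(["LOC", "NUM", "HUM"] * 20)
splits = cv_splits(y, size_train=30, folds=10)
```

Each fold trains on `size_train` rows and tests on all remaining rows, so smaller training sizes are evaluated on larger test sets.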



## Run UIUC Benchmark - Normal

Different classifier models are tested with different levels of dependency on external linguistic resources (Low, Medium and High).

#### SVM + TF-IDF

In [2]:
for language in ['en', 'es', 'pt']:
    print('\n\nLanguage: ', language)
    dataset_train, dataset_test = load_uiuc(language)
    create_feature('tfidf', dataset_train, dataset_train, max_features=2000)
    create_feature('tfidf', dataset_train, dataset_test, max_features=2000)
    
    model = {'name': 'svm', 'model': svm_linear}
    
    # NOTE: the max-normalized arrays below are computed but never used;
    # X_train and X_test are built from the raw TF-IDF vectors.
    tfidf_train = np.array([list(r) for r in dataset_train['tfidf'].values])
    tfidf_test = np.array([list(r) for r in dataset_test['tfidf'].values])
    tfidf_train = normalize(tfidf_train, norm='max')
    tfidf_test = normalize(tfidf_test, norm='max')
    
    X_train = np.array([list(x) for x in dataset_train['tfidf'].values])
    X_test = np.array([list(x) for x in dataset_test['tfidf'].values])
    y_train = dataset_train['class'].values
    y_test = dataset_test['class'].values
    
    run_benchmark(model, X_train, y_train, X_test, y_test, sizes_train=[1000, 2000, 3000, 4000, 5500],
                  save='results/UIUC_svm_tfidf_' + language + '.csv', runs=1)



Language:  en

1000|.
2000|.
3000|.
4000|.
5500|.Run time benchmark: 0.5764660835266113


Language:  es

1000|.
2000|.
3000|.
4000|.
5500|.Run time benchmark: 0.6209449768066406


Language:  pt

1000|.
2000|.
3000|.
4000|.
5500|.Run time benchmark: 0.5239431858062744


#### SVM + TF-IDF + WB
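
The cell below scales each feature block row by row with `normalize(..., norm='max')` and then joins the blocks into one vector per question. The toy example here shows the same two steps on small non-negative arrays; the values are made up for illustration, and `np.hstack` is used in place of the list concatenation in the cell, which gives the same result.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Two questions, with a 3-dimensional TF-IDF block and a 2-dimensional embedding block.
tfidf = np.array([[0.0, 2.0, 4.0], [1.0, 0.0, 0.5]])
embedding_sum = np.array([[3.0, 1.5], [0.2, 0.4]])

# Row-wise max scaling: the largest entry of each row becomes 1.
tfidf_scaled = normalize(tfidf, norm='max')
embedding_scaled = normalize(embedding_sum, norm='max')

# One combined feature vector per question.
X = np.hstack([tfidf_scaled, embedding_scaled])
```

Scaling each block separately keeps one block from dominating the other just because its raw values are larger.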

In [5]:
for language in ['en', 'es', 'pt']:
    print('\n\nLanguage: ', language)
    embedding = load_embedding(path_wordembedding + 'wiki.multi.' + language + '.vec')
    dataset_train, dataset_test = load_uiuc(language)
    create_feature('tfidf', dataset_train, dataset_train, max_features=2000)
    create_feature('tfidf', dataset_train, dataset_test, max_features=2000)
    create_feature('embedding_sum', None, dataset_train, embedding)
    create_feature('embedding_sum', None, dataset_test, embedding)
    
    model = {'name': 'svm', 'model': svm_linear}
    
    tfidf_train = np.array([list(r) for r in dataset_train['tfidf'].values])
    tfidf_test = np.array([list(r) for r in dataset_test['tfidf'].values])
    tfidf_train = normalize(tfidf_train, norm='max')
    tfidf_test = normalize(tfidf_test, norm='max')
    
    embedding_train = np.array([list(r) for r in dataset_train['embedding_sum'].values])
    embedding_test = np.array([list(r) for r in dataset_test['embedding_sum'].values])
    embedding_train = normalize(embedding_train, norm='max')
    embedding_test = normalize(embedding_test, norm='max')
    
    X_train = np.array([list(x) + list(xx) for x, xx in zip(tfidf_train, embedding_train)])
    X_test = np.array([list(x) + list(xx) for x, xx in zip(tfidf_test, embedding_test)])
    y_train = dataset_train['class'].values
    y_test = dataset_test['class'].values
    
    run_benchmark(model, X_train, y_train, X_test, y_test, sizes_train=[1000, 2000, 3000, 4000, 5500], 
                  runs=1, save='results/UIUC_svm_cortes_' + language + '.csv')



Language:  en

1000|.
2000|.
3000|.
4000|.
5500|.Run time benchmark: 11.371490478515625


Language:  es

1000|.
2000|.
3000|.
4000|.
5500|.Run time benchmark: 14.12940001487732


Language:  pt

1000|.
2000|.
3000|.
4000|.
5500|.Run time benchmark: 14.28162956237793




#### SVM + TF-IDF + WB + POS + NER

In [6]:
for language in ['en', 'es', 'pt']:
    print('\n\nLanguage: ', language)
    embedding = load_embedding(path_wordembedding + 'wiki.multi.' + language + '.vec')
    dataset_train, dataset_test = load_uiuc(language)
    create_feature('tfidf', dataset_train, dataset_train, max_features=2000)
    create_feature('tfidf', dataset_train, dataset_test, max_features=2000)
    create_feature('embedding_sum', dataset_train, dataset_train, embedding)
    create_feature('embedding_sum', dataset_train, dataset_test, embedding)
    create_feature('pos_hotencode', dataset_train, dataset_train)
    create_feature('pos_hotencode', dataset_train, dataset_test)
    create_feature('ner_hotencode', dataset_train, dataset_train)
    create_feature('ner_hotencode', dataset_train, dataset_test)
    model = {'name': 'svm', 'model': svm_linear}
    
    tfidf_train = np.array([list(r) for r in dataset_train['tfidf'].values])
    tfidf_test = np.array([list(r) for r in dataset_test['tfidf'].values])
    tfidf_train = normalize(tfidf_train, norm='max')
    tfidf_test = normalize(tfidf_test, norm='max')
    
    embedding_train = np.array([list(r) for r in dataset_train['embedding_sum'].values])
    embedding_test = np.array([list(r) for r in dataset_test['embedding_sum'].values])
    embedding_train = normalize(embedding_train, norm='max')
    embedding_test = normalize(embedding_test, norm='max')
    
    pos_train = np.array([list(r) for r in dataset_train['pos_hotencode'].values])
    pos_test = np.array([list(r) for r in dataset_test['pos_hotencode'].values])
    
    ner_train = np.array([list(r) for r in dataset_train['ner_hotencode'].values])
    ner_test = np.array([list(r) for r in dataset_test['ner_hotencode'].values])
    
    X_train = np.array([list(x) + list(xx) + list(xxx) + list(xxxx) for x, xx, xxx, xxxx in zip(tfidf_train, embedding_train, pos_train, ner_train)])
    X_test = np.array([list(x) + list(xx) + list(xxx) + list(xxxx) for x, xx, xxx, xxxx in zip(tfidf_test, embedding_test, pos_test, ner_test)])
    
    y_train = dataset_train['class'].values
    y_test = dataset_test['class'].values
    
    classes = list(dataset_train['class'].unique())
    y_train_ = [classes.index(c) for c in y_train]
    
    run_benchmark(model, X_train, y_train, X_test, y_test, sizes_train=[1000, 2000, 3000, 4000, 5500],
                  runs=1, save='results/UIUC_svm_high_' + language + '.csv')



Language:  en

1000|.
2000|.
3000|.
4000|.
5500|.Run time benchmark: 12.750246524810791


Language:  es

1000|.
2000|.
3000|.
4000|.
5500|.Run time benchmark: 15.330715417861938


Language:  pt

1000|.
2000|.
3000|.
4000|.
5500|.Run time benchmark: 13.996777296066284




#### BERT + CNN
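
The helper in the cell below computes its metrics with scikit-learn. These functions expect the gold labels first and the predictions second, and the order matters for precision and recall: swapping the arguments swaps the two metrics. The labels in this small example are made up for illustration.

```python
from sklearn.metrics import precision_score, recall_score

y_true = ["LOC", "LOC", "NUM", "NUM"]
y_pred = ["LOC", "NUM", "NUM", "NUM"]

# Correct order: (y_true, y_pred).
precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")

# Swapped order: this value is the recall, not the precision.
swapped_precision = precision_score(y_pred, y_true, average="macro")
```

Accuracy and the Matthews correlation coefficient are symmetric in their arguments, so only precision, recall and the orientation of the confusion matrix are affected.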

In [3]:
from typing import List
from flair_cnn_doc_embedding import DocumentCNNEmbeddings
from torch.utils.data import Dataset
import torch
from flair.data import Sentence, Corpus
from flair.embeddings import DocumentRNNEmbeddings, BertEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
import time
import datetime
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, matthews_corrcoef, confusion_matrix


def build_flair_sentences(text_label_tuples):
    sentences = [Sentence(text, labels=[label], use_tokenizer=True) for text,label in text_label_tuples]
    return [s for s in sentences if len(s.tokens) > 0]

def get_labels(sentences:List[Sentence]):
    return [[l.value for l in s.labels] for s in sentences]


def calc_metrics_with_sklearn(clf:TextClassifier,sentences:List[Sentence],train_size=0,
                              run=0,train_time=0,metric_average='macro',
                              classes=['ABBR', 'DESC', 'ENTY', 'HUM', 'LOC', 'NUM']):
    # Read the gold labels first: this code relies on clf.predict() replacing
    # the labels stored in each Sentence with the predicted ones.
    targets = get_labels(sentences)
    start_time = time.time()
    clf.predict(sentences)
    test_time = time.time() - start_time
    prediction = get_labels(sentences)
    # scikit-learn metrics take (y_true, y_pred), in that order.
    data = {'datetime': datetime.datetime.now(),
            'model': 'cnn_bert',
            'accuracy': accuracy_score(targets, prediction),
            'precision': precision_score(targets, prediction, average=metric_average),
            'recall': recall_score(targets, prediction, average=metric_average),
            'f1': f1_score(targets, prediction, average=metric_average),
            'mcc': matthews_corrcoef(targets, prediction),
            'confusion': confusion_matrix(targets, prediction, labels=classes),
            'run': run,
            'train_size': train_size,  # use the argument, not the global loop variable
            'execution_time': train_time,
            'test_time': test_time}
          
    #report = metrics.classification_report(y_true=targets, y_pred=prediction, digits=3, output_dict=True)
    return data


for language in ['en']: # , 'es', 'pt'
    results = pd.DataFrame()
    
    save = 'results/UIUC_cnn_bert_'+language+'.csv'
    for size_train in [5500]: # 1000, 2000, 3000, 4000, 
        for run in range(1,6):
            dataset_train, dataset_test = load_uiuc(language)
            if size_train < 5500:
                dataset_train = dataset_train[:size_train]

            sentences_train:Dataset = build_flair_sentences([(text, label) for text, label in zip(dataset_train['question'], dataset_train['class'])])
            sentences_dev:Dataset = sentences_train
            sentences_test:Dataset = build_flair_sentences([(text, label) for text, label in zip(dataset_test['question'], dataset_test['class'])])

            corpus:Corpus = Corpus(sentences_train, sentences_dev, sentences_test)
            label_dict = corpus.make_label_dictionary()
            word_embeddings = [
                # WordEmbeddings('glove'),
                BertEmbeddings('bert-base-multilingual-cased', layers='-1')
            ]
            document_embeddings = DocumentCNNEmbeddings(word_embeddings,
                                                        dropout=0.0,
                                                        hidden_size=64,
                                                        )

            clf = TextClassifier(document_embeddings, label_dictionary=label_dict, multi_label=False)
            trainer = ModelTrainer(clf, corpus,torch.optim.RMSprop)
            base_path = 'flair_resources/qc_en_uiuc'
            start_time = time.time()
            trainer.train(base_path,
                          learning_rate=0.001,
                          mini_batch_size=32,
                          anneal_factor=0.5,
                          patience=2,
                          max_epochs=4)
            train_time = time.time() - start_time
            data = calc_metrics_with_sklearn(clf, sentences_test, train_size=size_train, train_time=train_time, run=run)
            results = results.append([data])
            results.to_csv(save)

2019-08-29 10:29:23,810 {'ENTY', 'NUM', 'ABBR', 'HUM', 'DESC', 'LOC'}
2019-08-29 10:29:23,812 The pre-trained model you are loading is a cased model but you have not set `do_lower_case` to False. We are setting `do_lower_case=False` for you but you may want to check this behavior.
2019-08-29 10:29:39,108 ----------------------------------------------------------------------------------------------------
2019-08-29 10:29:39,109 Evaluation method: MICRO_F1_SCORE
2019-08-29 10:29:39,558 ----------------------------------------------------------------------------------------------------
2019-08-29 10:29:40,523 epoch 1 - iter 0/171 - loss 2.02791667
2019-08-29 10:29:52,500 epoch 1 - iter 17/171 - loss 5.07643974
2019-08-29 10:30:03,991 epoch 1 - iter 34/171 - loss 3.36154547
2019-08-29 10:30:17,008 epoch 1 - iter 51/171 - loss 2.54324335
2019-08-29 10:30:29,965 epoch 1 - iter 68/171 - loss 2.10228707
2019-08-29 10:30:43,098 epoch 1 - iter 85/171 - loss 1.82153485
2019-08-29 10:30:58,962 epo

2019-08-29 10:53:19,511 epoch 2 - iter 17/171 - loss 0.36208694
2019-08-29 10:53:34,000 epoch 2 - iter 34/171 - loss 0.32396458
2019-08-29 10:53:47,311 epoch 2 - iter 51/171 - loss 0.32892783
2019-08-29 10:54:00,003 epoch 2 - iter 68/171 - loss 0.33267455
2019-08-29 10:54:14,356 epoch 2 - iter 85/171 - loss 0.33102816
2019-08-29 10:54:28,643 epoch 2 - iter 102/171 - loss 0.32793201
2019-08-29 10:54:42,628 epoch 2 - iter 119/171 - loss 0.32320372
2019-08-29 10:54:57,346 epoch 2 - iter 136/171 - loss 0.32026019
2019-08-29 10:55:11,541 epoch 2 - iter 153/171 - loss 0.31835228
2019-08-29 10:55:24,458 epoch 2 - iter 170/171 - loss 0.31882253
2019-08-29 10:55:24,679 ----------------------------------------------------------------------------------------------------
2019-08-29 10:55:24,680 EPOCH 2 done: loss 0.3188 - lr 0.0010 - bad epochs 0
2019-08-29 10:57:36,554 DEV : loss 0.46687325835227966 - score 0.832
2019-08-29 10:57:45,016 TEST : loss 0.46059271693229675 - score 0.83
2019-08-29 10:5

2019-08-29 11:18:15,672 ----------------------------------------------------------------------------------------------------
2019-08-29 11:18:15,673 EPOCH 3 done: loss 0.2384 - lr 0.0010 - bad epochs 0
2019-08-29 11:20:18,942 DEV : loss 0.2959625720977783 - score 0.8881
2019-08-29 11:20:27,369 TEST : loss 0.24141234159469604 - score 0.908
2019-08-29 11:20:27,370 ----------------------------------------------------------------------------------------------------
2019-08-29 11:20:28,899 epoch 4 - iter 0/171 - loss 0.23360655
2019-08-29 11:20:42,250 epoch 4 - iter 17/171 - loss 0.21072820
2019-08-29 11:20:54,938 epoch 4 - iter 34/171 - loss 0.18797867
2019-08-29 11:21:08,743 epoch 4 - iter 51/171 - loss 0.18857907
2019-08-29 11:21:21,634 epoch 4 - iter 68/171 - loss 0.18263207
2019-08-29 11:21:33,883 epoch 4 - iter 85/171 - loss 0.17533590
2019-08-29 11:21:46,904 epoch 4 - iter 102/171 - loss 0.17171580
2019-08-29 11:21:59,484 epoch 4 - iter 119/171 - loss 0.17915344
2019-08-29 11:22:12,4

2019-08-29 11:43:50,507 
MICRO_AVG: acc 0.8975 - f1-score 0.946
MACRO_AVG: acc 0.9002 - f1-score 0.9470166666666667
ABBR       tp: 9 - fp: 1 - fn: 0 - tn: 490 - precision: 0.9000 - recall: 1.0000 - accuracy: 0.9000 - f1-score: 0.9474
DESC       tp: 136 - fp: 14 - fn: 2 - tn: 348 - precision: 0.9067 - recall: 0.9855 - accuracy: 0.8947 - f1-score: 0.9445
ENTY       tp: 81 - fp: 4 - fn: 13 - tn: 402 - precision: 0.9529 - recall: 0.8617 - accuracy: 0.8265 - f1-score: 0.9050
HUM        tp: 64 - fp: 2 - fn: 1 - tn: 433 - precision: 0.9697 - recall: 0.9846 - accuracy: 0.9552 - f1-score: 0.9771
LOC        tp: 76 - fp: 4 - fn: 5 - tn: 415 - precision: 0.9500 - recall: 0.9383 - accuracy: 0.8941 - f1-score: 0.9441
NUM        tp: 107 - fp: 2 - fn: 6 - tn: 385 - precision: 0.9817 - recall: 0.9469 - accuracy: 0.9304 - f1-score: 0.9640
2019-08-29 11:43:50,507 ----------------------------------------------------------------------------------------------------
2019-08-29 11:43:59,075 {'ENTY', 'NUM', 'A

## Run UIUC Benchmark - Cross-validation

Different classifier models are tested with different levels of dependency on external linguistic resources (Low, Medium and High).

#### SVM + TF-IDF

In [3]:
for language in ['en', 'es', 'pt']:
    print('\n\nLanguage: ', language)
    dataset_train, dataset_test = load_uiuc(language)
    dataset = pd.concat([dataset_train, dataset_test])
    create_feature('tfidf', dataset, dataset, max_features=2000)
    
    model = {'name': 'svm', 'model': svm_linear}
    
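    # NOTE: the max-normalized array below is computed but never used;
    # X is built from the raw TF-IDF vectors.
    tfidf = np.array([list(r) for r in dataset['tfidf'].values])
    tfidf = normalize(tfidf, norm='max')
    
    X = np.array([list(x) for x in dataset['tfidf'].values])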
    y = dataset['class'].values
    
    
    # run_benchmark_cv(model, X, y, [50, 100] + list(range(500, 5501, 500)),
    run_benchmark_cv(model, X, y, [1000, 2000, 3000, 4000, 5500],
                     save='results/UIUC_cv_svm_tfidf_' + language + '.csv')



Language:  en

1000|..........
2000|..........
3000|..........
4000|..........
5500|..........
Run time benchmark: 8.106821775436401


Language:  es

1000|..........
2000|..........
3000|..........
4000|..........
5500|..........
Run time benchmark: 9.061235904693604


Language:  pt

1000|..........
2000|..........
3000|..........
4000|..........
5500|..........
Run time benchmark: 8.001790523529053


#### SVM + TF-IDF + WB

In [4]:
for language in ['en', 'es', 'pt']:
    print('\n\nLanguage: ', language)
    embedding = load_embedding(path_wordembedding + 'wiki.multi.' + language + '.vec')
    dataset_train, dataset_test = load_uiuc(language)
    dataset = pd.concat([dataset_train, dataset_test])
    create_feature('tfidf', dataset, dataset, max_features=2000)
    create_feature('embedding_sum', None, dataset, embedding)
    
    model = {'name': 'svm', 'model': svm_linear}
    
    tfidf = np.array([list(r) for r in dataset['tfidf'].values])
    tfidf = normalize(tfidf, norm='max')
    
    embedding = np.array([list(r) for r in dataset['embedding_sum'].values])
    embedding = normalize(embedding, norm='max')
    
    X = np.array([list(x) + list(xx) for x, xx in zip(tfidf, embedding)])
    y = dataset['class'].values
    
    # run_benchmark_cv(model, X, y, [50, 100] + list(range(500, 5501, 500)),
    run_benchmark_cv(model, X, y, [1000, 2000, 3000, 4000, 5500],
                     save='results/UIUC_cv_svm_cortes_' + language + '.csv')



Language:  en

1000|..........
2000|..........
3000|..........
4000|..........
5500|..........
Run time benchmark: 125.29054236412048


Language:  es

1000|..........
2000|..........
3000|..........
4000|..........
5500|..........
Run time benchmark: 151.8110692501068


Language:  pt

1000|..........
2000|..........
3000|..........
4000|..........
5500|..........
Run time benchmark: 143.85443115234375




#### SVM + TF-IDF + WB + POS + NER

In [5]:
for language in ['en', 'es', 'pt']:
    print('\n\nLanguage: ', language)
    embedding = load_embedding(path_wordembedding + 'wiki.multi.' + language + '.vec')
    dataset_train, dataset_test = load_uiuc(language)
    dataset = pd.concat([dataset_train, dataset_test])
    create_feature('tfidf', dataset, dataset, max_features=2000)
    create_feature('embedding_sum', dataset, dataset, embedding)
    create_feature('pos_hotencode', dataset, dataset)
    create_feature('ner_hotencode', dataset, dataset)
    model = {'name': 'svm', 'model': svm_linear}
    
    tfidf = np.array([list(r) for r in dataset['tfidf'].values])
    tfidf = normalize(tfidf, norm='max')
    
    embedding = np.array([list(r) for r in dataset['embedding_sum'].values])
    embedding = normalize(embedding, norm='max')
    
    pos = np.array([list(r) for r in dataset['pos_hotencode'].values])
    
    ner = np.array([list(r) for r in dataset['ner_hotencode'].values])
    
    X = np.array([list(x) + list(xx) + list(xxx) + list(xxxx) for x, xx, xxx, xxxx in zip(tfidf, embedding, pos, ner)])
    
    y = dataset['class'].values
    
    # run_benchmark_cv(model, X, y, [50, 100] + list(range(500, 5501, 500)),
    run_benchmark_cv(model, X, y, [1000, 2000, 3000, 4000, 5500],
                     save='results/UIUC_cv_svm_high_' + language + '.csv')



Language:  en

1000|..........
2000|..........
3000|..........
4000|..........
5500|..........
Run time benchmark: 129.40373587608337


Language:  es

1000|..........
2000|..........
3000|..........
4000|..........
5500|..........
Run time benchmark: 163.61614727973938


Language:  pt

1000|..........
2000|..........
3000|..........
4000|..........
5500|..........
Run time benchmark: 144.99252271652222
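
Once the cells above have written their CSV files, the results can be compared per training size. The sketch below assumes each file has a `train_size` column and one column per metric (for example `f1`), which are the names used in the result dictionary of the BERT cells; the columns written by `run_benchmark_cv` may differ. `summarize_results` is a hypothetical helper, not part of the repository.

```python
import pandas as pd


def summarize_results(csv_paths, metric="f1"):
    """Average one metric per training-set size for each results file."""
    frames = []
    for path in csv_paths:
        frame = pd.read_csv(path)
        frame["source"] = str(path)  # remember which file each row came from
        frames.append(frame)
    results = pd.concat(frames, ignore_index=True)
    # Rows: one per file. Columns: one per training size.
    return results.groupby(["source", "train_size"])[metric].mean().unstack("train_size")
```

For example, `summarize_results(['results/UIUC_cv_svm_tfidf_en.csv', 'results/UIUC_cv_svm_high_en.csv'])` would give one row per file and one column per training size, averaged over the folds.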




#### BERT + CNN - Cross validation

In [6]:
from typing import List
from flair_cnn_doc_embedding import DocumentCNNEmbeddings
from torch.utils.data import Dataset
import torch
from flair.data import Sentence, Corpus
from flair.embeddings import DocumentRNNEmbeddings, BertEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
import time
import datetime
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, matthews_corrcoef, confusion_matrix


def build_flair_sentences(text_label_tuples):
    sentences = [Sentence(text, labels=[label], use_tokenizer=True) for text,label in text_label_tuples]
    return [s for s in sentences if len(s.tokens) > 0]

def get_labels(sentences:List[Sentence]):
    return [[l.value for l in s.labels] for s in sentences]

def calc_metrics_with_sklearn(clf:TextClassifier,sentences:List[Sentence],train_size=0,
                              fold=0,train_time=0,metric_average='macro',
                              classes=['ABBR', 'DESC', 'ENTY', 'HUM', 'LOC', 'NUM']):
    # Read the gold labels first: this code relies on clf.predict() replacing
    # the labels stored in each Sentence with the predicted ones.
    targets = get_labels(sentences)
    start_time = time.time()
    clf.predict(sentences)
    test_time = time.time() - start_time
    prediction = get_labels(sentences)
    # scikit-learn metrics take (y_true, y_pred), in that order.
    data = {'datetime': datetime.datetime.now(),
            'model': 'cnn_bert',
            'accuracy': accuracy_score(targets, prediction),
            'precision': precision_score(targets, prediction, average=metric_average),
            'recall': recall_score(targets, prediction, average=metric_average),
            'f1': f1_score(targets, prediction, average=metric_average),
            'mcc': matthews_corrcoef(targets, prediction),
            'confusion': confusion_matrix(targets, prediction, labels=classes),
            'fold': fold,
            'train_size': train_size,  # use the argument, not the global loop variable
            'execution_time': train_time,
            'test_time': test_time}

    #report = metrics.classification_report(y_true=targets, y_pred=prediction, digits=3, output_dict=True)
    return data




for language in ['es',]: # ,  'es', 'pt'
    print(f"########## {language} ##########")
    results = pd.DataFrame()
    dataset_train, dataset_test = load_uiuc(language)
    dataset = pd.concat([dataset_train, dataset_test])
    save = 'results/UIUC_cv_cnn_bert_'+language+'_1000_2000.csv'
    for size_train in [1000, 2000]: # 
        print(f"##### {size_train} #####")
        
        word_embeddings = [BertEmbeddings('bert-base-multilingual-cased', layers='-1')]
        document_embeddings = DocumentCNNEmbeddings(word_embeddings, dropout=0.0, hidden_size=64)
        
        size_test = len(dataset) - size_train
        rs = StratifiedShuffleSplit(n_splits=10, train_size=size_train, test_size=size_test, random_state=1)
        fold = 0
        for train_indexs, test_indexs in rs.split(dataset, dataset['class']):
            fold += 1
            print(f"## {fold} ##")
            df_train = dataset.iloc[train_indexs]
            df_test = dataset.iloc[test_indexs]
            
            x_train:Dataset = build_flair_sentences([(text, label) for text, label in zip(df_train['question'], df_train['class'])])
            x_dev:Dataset = x_train
            x_test:Dataset = build_flair_sentences([(text, label) for text, label in zip(df_test['question'], df_test['class'])])
            
            corpus = Corpus(x_train, x_dev, x_test)
            label_dict = corpus.make_label_dictionary()

            clf = TextClassifier(document_embeddings, label_dictionary=label_dict, multi_label=False)
            trainer = ModelTrainer(clf, corpus,torch.optim.RMSprop)
            base_path = 'flair_resources/qc_'+language+'_uiuc'
            start_time = time.time()
            trainer.train(base_path,
                          learning_rate=0.001,
                          mini_batch_size=32,
                          anneal_factor=0.5,
                          max_epochs=4,
                          patience=2,
                          )
            train_time = time.time() - start_time
            data = calc_metrics_with_sklearn(clf, x_test, train_size=size_train, train_time=train_time, fold=fold)
            results = results.append([data])
            results.to_csv(save)

########## es ##########
##### 1000 #####
2019-08-30 00:15:56,544 The pre-trained model you are loading is a cased model but you have not set `do_lower_case` to False. We are setting `do_lower_case=False` for you but you may want to check this behavior.
## 1 ##
2019-08-30 00:16:44,314 {'ENTY', 'NUM', 'ABBR', 'HUM', 'DESC', 'LOC'}
2019-08-30 00:16:44,526 ----------------------------------------------------------------------------------------------------
2019-08-30 00:16:44,527 Evaluation method: MICRO_F1_SCORE
2019-08-30 00:16:45,244 ----------------------------------------------------------------------------------------------------
2019-08-30 00:16:47,386 epoch 1 - iter 0/32 - loss 1.77986097
2019-08-30 00:16:50,576 epoch 1 - iter 3/32 - loss 8.04128671
2019-08-30 00:16:53,477 epoch 1 - iter 6/32 - loss 6.44657620
2019-08-30 00:16:56,890 epoch 1 - iter 9/32 - loss 5.41995763
2019-08-30 00:16:59,525 epoch 1 - iter 12/32 - loss 4.59803794
2019-08-30 00:17:02,758 epoch 1 - iter 15/32 - lo

2019-08-30 00:41:46,920 epoch 2 - iter 12/32 - loss 1.03961400
2019-08-30 00:41:49,447 epoch 2 - iter 15/32 - loss 1.00217039
2019-08-30 00:41:52,242 epoch 2 - iter 18/32 - loss 0.99930125
2019-08-30 00:41:55,333 epoch 2 - iter 21/32 - loss 1.03038987
2019-08-30 00:41:58,089 epoch 2 - iter 24/32 - loss 1.02364323
2019-08-30 00:42:01,085 epoch 2 - iter 27/32 - loss 0.98668079
2019-08-30 00:42:04,376 epoch 2 - iter 30/32 - loss 0.98622822
2019-08-30 00:42:06,090 ----------------------------------------------------------------------------------------------------
2019-08-30 00:42:06,092 EPOCH 2 done: loss 0.9702 - lr 0.0010 - bad epochs 0
2019-08-30 00:42:34,942 DEV : loss 0.6005484461784363 - score 0.776
2019-08-30 00:45:11,661 TEST : loss 0.9332205653190613 - score 0.672
2019-08-30 00:45:18,071 ----------------------------------------------------------------------------------------------------
2019-08-30 00:45:19,903 epoch 3 - iter 0/32 - loss 0.68040985
2019-08-30 00:45:22,861 epoch 3 -

2019-08-30 01:14:09,416 epoch 4 - iter 0/32 - loss 0.29436827
2019-08-30 01:14:12,317 epoch 4 - iter 3/32 - loss 0.36847810
2019-08-30 01:14:16,698 epoch 4 - iter 6/32 - loss 0.38930078
2019-08-30 01:14:19,076 epoch 4 - iter 9/32 - loss 0.39506946
2019-08-30 01:14:22,146 epoch 4 - iter 12/32 - loss 0.42571741
2019-08-30 01:14:24,620 epoch 4 - iter 15/32 - loss 0.43147716
2019-08-30 01:14:26,987 epoch 4 - iter 18/32 - loss 0.45169947
2019-08-30 01:14:29,940 epoch 4 - iter 21/32 - loss 0.46752783
2019-08-30 01:14:32,542 epoch 4 - iter 24/32 - loss 0.48703005
2019-08-30 01:14:35,146 epoch 4 - iter 27/32 - loss 0.48262915
2019-08-30 01:14:37,769 epoch 4 - iter 30/32 - loss 0.48836879
2019-08-30 01:14:51,805 ----------------------------------------------------------------------------------------------------
2019-08-30 01:14:51,992 EPOCH 4 done: loss 0.4909 - lr 0.0010 - bad epochs 0
2019-08-30 01:15:19,717 DEV : loss 0.3683728873729706 - score 0.866
2019-08-30 01:17:58,523 TEST : loss 0.722

2019-08-30 01:42:18,577 ----------------------------------------------------------------------------------------------------
## 5 ##
2019-08-30 01:44:56,002 {'ENTY', 'NUM', 'ABBR', 'HUM', 'DESC', 'LOC'}
2019-08-30 01:44:56,019 ----------------------------------------------------------------------------------------------------
2019-08-30 01:44:56,020 Evaluation method: MICRO_F1_SCORE
2019-08-30 01:44:56,674 ----------------------------------------------------------------------------------------------------
2019-08-30 01:44:59,045 epoch 1 - iter 0/32 - loss 8.46961308
2019-08-30 01:45:01,958 epoch 1 - iter 3/32 - loss 4.47838008
2019-08-30 01:45:05,070 epoch 1 - iter 6/32 - loss 3.46463159
2019-08-30 01:45:08,097 epoch 1 - iter 9/32 - loss 2.94123491
2019-08-30 01:45:10,924 epoch 1 - iter 12/32 - loss 2.61017027
2019-08-30 01:45:13,875 epoch 1 - iter 15/32 - loss 2.43157306
2019-08-30 01:45:17,212 epoch 1 - iter 18/32 - loss 2.27864429
2019-08-30 01:45:20,545 epoch 1 - iter 21/32 - loss 

2019-08-30 02:13:38,011 epoch 2 - iter 18/32 - loss 0.73612964
2019-08-30 02:13:40,875 epoch 2 - iter 21/32 - loss 0.74855922
2019-08-30 02:13:43,803 epoch 2 - iter 24/32 - loss 0.74200044
2019-08-30 02:13:46,585 epoch 2 - iter 27/32 - loss 0.74170662
2019-08-30 02:13:49,176 epoch 2 - iter 30/32 - loss 0.76656455
2019-08-30 02:13:50,114 ----------------------------------------------------------------------------------------------------
2019-08-30 02:13:50,115 EPOCH 2 done: loss 0.7595 - lr 0.0010 - bad epochs 0
2019-08-30 02:14:17,572 DEV : loss 0.642583966255188 - score 0.772
2019-08-30 02:16:55,673 TEST : loss 0.7775426506996155 - score 0.7306
2019-08-30 02:17:03,503 ----------------------------------------------------------------------------------------------------
2019-08-30 02:17:05,275 epoch 3 - iter 0/32 - loss 0.71338159
2019-08-30 02:17:08,624 epoch 3 - iter 3/32 - loss 0.63500477
2019-08-30 02:17:11,665 epoch 3 - iter 6/32 - loss 0.59191082
2019-08-30 02:17:14,069 epoch 3 - i

2019-08-30 02:44:18,593 epoch 4 - iter 6/32 - loss 0.36877178
2019-08-30 02:44:21,369 epoch 4 - iter 9/32 - loss 0.38325475
2019-08-30 02:44:24,362 epoch 4 - iter 12/32 - loss 0.42184563
2019-08-30 02:44:27,023 epoch 4 - iter 15/32 - loss 0.42488693
2019-08-30 02:44:29,895 epoch 4 - iter 18/32 - loss 0.44909219
2019-08-30 02:44:32,654 epoch 4 - iter 21/32 - loss 0.46328877
2019-08-30 02:44:35,970 epoch 4 - iter 24/32 - loss 0.45741438
2019-08-30 02:44:39,549 epoch 4 - iter 27/32 - loss 0.44357761
2019-08-30 02:44:42,548 epoch 4 - iter 30/32 - loss 0.44174075
2019-08-30 02:44:43,958 ----------------------------------------------------------------------------------------------------
2019-08-30 02:44:43,960 EPOCH 4 done: loss 0.4610 - lr 0.0010 - bad epochs 0
2019-08-30 02:45:12,691 DEV : loss 0.3352341651916504 - score 0.88
2019-08-30 02:47:49,526 TEST : loss 0.6712322235107422 - score 0.7893
2019-08-30 02:48:02,298 ------------------------------------------------------------------------

2019-08-30 03:23:38,720 ----------------------------------------------------------------------------------------------------
## 9 ##
2019-08-30 03:26:25,477 {'ENTY', 'NUM', 'ABBR', 'HUM', 'DESC', 'LOC'}
2019-08-30 03:26:25,492 ----------------------------------------------------------------------------------------------------
2019-08-30 03:26:25,494 Evaluation method: MICRO_F1_SCORE
2019-08-30 03:26:26,288 ----------------------------------------------------------------------------------------------------
2019-08-30 03:26:28,571 epoch 1 - iter 0/32 - loss 5.76605940
2019-08-30 03:26:32,091 epoch 1 - iter 3/32 - loss 3.12136924
2019-08-30 03:26:35,175 epoch 1 - iter 6/32 - loss 2.58436818
2019-08-30 03:26:38,126 epoch 1 - iter 9/32 - loss 2.35934826
2019-08-30 03:26:42,458 epoch 1 - iter 12/32 - loss 2.22664917
2019-08-30 03:26:45,263 epoch 1 - iter 15/32 - loss 2.14590960
2019-08-30 03:27:14,209 epoch 1 - iter 18/32 - loss 2.09085246
2019-08-30 03:27:20,172 epoch 1 - iter 21/32 - loss 

2019-08-30 04:00:25,107 epoch 2 - iter 18/32 - loss 1.44678915
2019-08-30 04:00:42,565 epoch 2 - iter 21/32 - loss 1.45005693
2019-08-30 04:00:47,657 epoch 2 - iter 24/32 - loss 1.42944946
2019-08-30 04:00:52,362 epoch 2 - iter 27/32 - loss 1.38160244
2019-08-30 04:00:55,963 epoch 2 - iter 30/32 - loss 1.36093566
2019-08-30 04:01:01,184 ----------------------------------------------------------------------------------------------------
2019-08-30 04:01:01,186 EPOCH 2 done: loss 1.3346 - lr 0.0010 - bad epochs 0
2019-08-30 04:01:44,546 DEV : loss 1.4872339963912964 - score 0.528
2019-08-30 04:04:50,460 TEST : loss 1.5606820583343506 - score 0.5157
2019-08-30 04:04:57,868 ----------------------------------------------------------------------------------------------------
2019-08-30 04:05:00,295 epoch 3 - iter 0/32 - loss 1.12190866
2019-08-30 04:05:03,444 epoch 3 - iter 3/32 - loss 1.11403586
2019-08-30 04:05:07,061 epoch 3 - iter 6/32 - loss 1.10233332
2019-08-30 04:05:09,986 epoch 3 - 

2019-08-30 04:42:27,893 ----------------------------------------------------------------------------------------------------
2019-08-30 04:42:30,005 epoch 4 - iter 0/63 - loss 0.96202040
2019-08-30 04:42:35,842 epoch 4 - iter 6/63 - loss 0.52275287
2019-08-30 04:42:41,817 epoch 4 - iter 12/63 - loss 0.56042740
2019-08-30 04:42:48,032 epoch 4 - iter 18/63 - loss 0.60783323
2019-08-30 04:42:53,410 epoch 4 - iter 24/63 - loss 0.59525855
2019-08-30 04:43:00,064 epoch 4 - iter 30/63 - loss 0.57500186
2019-08-30 04:43:05,737 epoch 4 - iter 36/63 - loss 0.57795122
2019-08-30 04:43:11,854 epoch 4 - iter 42/63 - loss 0.57601038
2019-08-30 04:43:21,095 epoch 4 - iter 48/63 - loss 0.57952554
2019-08-30 04:43:26,750 epoch 4 - iter 54/63 - loss 0.59178255
2019-08-30 04:43:41,509 epoch 4 - iter 60/63 - loss 0.59554219
2019-08-30 04:44:19,448 ----------------------------------------------------------------------------------------------------
2019-08-30 04:44:20,344 EPOCH 4 done: loss 0.5898 - lr 0.00

2019-08-30 05:25:01,503 ----------------------------------------------------------------------------------------------------
## 3 ##
2019-08-30 05:28:10,471 {'ABBR', 'NUM', 'ENTY', 'HUM', 'DESC', 'LOC'}
2019-08-30 05:28:10,506 ----------------------------------------------------------------------------------------------------
2019-08-30 05:28:10,507 Evaluation method: MICRO_F1_SCORE
2019-08-30 05:28:11,458 ----------------------------------------------------------------------------------------------------
2019-08-30 05:28:13,657 epoch 1 - iter 0/63 - loss 3.89572191
2019-08-30 05:28:23,189 epoch 1 - iter 6/63 - loss 5.07361865
2019-08-30 05:28:30,605 epoch 1 - iter 12/63 - loss 3.64696637
2019-08-30 05:28:37,403 epoch 1 - iter 18/63 - loss 2.91157686
2019-08-30 05:28:44,985 epoch 1 - iter 24/63 - loss 2.54759516
2019-08-30 05:28:53,251 epoch 1 - iter 30/63 - loss 2.23323882
2019-08-30 05:29:01,479 epoch 1 - iter 36/63 - loss 2.03466799
2019-08-30 05:29:47,726 epoch 1 - iter 42/63 - los

2019-08-30 06:12:41,644 epoch 2 - iter 36/63 - loss 0.70999384
2019-08-30 06:12:47,471 epoch 2 - iter 42/63 - loss 0.72349023
2019-08-30 06:13:41,835 epoch 2 - iter 48/63 - loss 0.71207783
2019-08-30 06:13:49,874 epoch 2 - iter 54/63 - loss 0.70311885
2019-08-30 06:13:56,021 epoch 2 - iter 60/63 - loss 0.70469071
2019-08-30 06:14:17,549 ----------------------------------------------------------------------------------------------------
2019-08-30 06:14:17,704 EPOCH 2 done: loss 0.7060 - lr 0.0010 - bad epochs 0
2019-08-30 06:15:56,674 DEV : loss 0.6544597148895264 - score 0.7745
2019-08-30 06:18:12,285 TEST : loss 0.7671982049942017 - score 0.7447
2019-08-30 06:18:19,895 ----------------------------------------------------------------------------------------------------
2019-08-30 06:18:22,107 epoch 3 - iter 0/63 - loss 0.56100959
2019-08-30 06:18:27,724 epoch 3 - iter 6/63 - loss 0.56337269
2019-08-30 06:18:33,993 epoch 3 - iter 12/63 - loss 0.54354249
2019-08-30 06:18:39,201 epoch 3 

2019-08-30 06:59:13,901 epoch 4 - iter 12/63 - loss 0.41851586
2019-08-30 06:59:20,802 epoch 4 - iter 18/63 - loss 0.45900025
2019-08-30 06:59:26,353 epoch 4 - iter 24/63 - loss 0.43404794
2019-08-30 06:59:31,665 epoch 4 - iter 30/63 - loss 0.43845882
2019-08-30 06:59:37,607 epoch 4 - iter 36/63 - loss 0.43892490
2019-08-30 06:59:44,319 epoch 4 - iter 42/63 - loss 0.43396895
2019-08-30 07:01:28,265 epoch 4 - iter 48/63 - loss 0.43914601
2019-08-30 07:01:42,805 epoch 4 - iter 54/63 - loss 0.42909163
2019-08-30 07:02:40,623 epoch 4 - iter 60/63 - loss 0.44225843
2019-08-30 07:03:22,029 ----------------------------------------------------------------------------------------------------
2019-08-30 07:03:22,573 EPOCH 4 done: loss 0.4414 - lr 0.0010 - bad epochs 0
2019-08-30 07:04:34,845 DEV : loss 0.3516794741153717 - score 0.888
2019-08-30 07:07:01,847 TEST : loss 0.6665229201316833 - score 0.7995
2019-08-30 07:07:16,327 ---------------------------------------------------------------------

2019-08-30 07:45:45,857 ----------------------------------------------------------------------------------------------------
## 7 ##
2019-08-30 07:48:54,339 {'ENTY', 'NUM', 'ABBR', 'HUM', 'DESC', 'LOC'}
2019-08-30 07:48:54,388 ----------------------------------------------------------------------------------------------------
2019-08-30 07:48:54,390 Evaluation method: MICRO_F1_SCORE
2019-08-30 07:48:55,311 ----------------------------------------------------------------------------------------------------
2019-08-30 07:48:57,344 epoch 1 - iter 0/63 - loss 5.33133459
2019-08-30 07:49:03,858 epoch 1 - iter 6/63 - loss 3.62429563
2019-08-30 07:49:52,111 epoch 1 - iter 12/63 - loss 2.56970778
2019-08-30 07:50:14,790 epoch 1 - iter 18/63 - loss 2.23636854
2019-08-30 07:52:26,468 epoch 1 - iter 24/63 - loss 2.02165040
2019-08-30 07:52:54,131 epoch 1 - iter 30/63 - loss 1.85336732
2019-08-30 07:53:17,209 epoch 1 - iter 36/63 - loss 1.75582739
2019-08-30 07:53:38,659 epoch 1 - iter 42/63 - los

2019-08-30 08:45:25,182 epoch 2 - iter 36/63 - loss 1.32795631
2019-08-30 08:45:30,876 epoch 2 - iter 42/63 - loss 1.31881424
2019-08-30 08:45:57,924 epoch 2 - iter 48/63 - loss 1.30883798
2019-08-30 08:48:18,536 epoch 2 - iter 54/63 - loss 1.29632945
2019-08-30 08:48:28,273 epoch 2 - iter 60/63 - loss 1.29257091
2019-08-30 08:48:55,249 ----------------------------------------------------------------------------------------------------
2019-08-30 08:48:55,425 EPOCH 2 done: loss 1.2938 - lr 0.0010 - bad epochs 0
2019-08-30 08:50:17,722 DEV : loss 1.2849305868148804 - score 0.416
2019-08-30 08:53:50,785 TEST : loss 1.346233606338501 - score 0.4097
2019-08-30 08:54:02,082 ----------------------------------------------------------------------------------------------------
2019-08-30 08:54:04,375 epoch 3 - iter 0/63 - loss 1.16440558
2019-08-30 08:54:11,128 epoch 3 - iter 6/63 - loss 1.36838572
2019-08-30 08:54:17,907 epoch 3 - iter 12/63 - loss 1.32065621
2019-08-30 08:54:24,089 epoch 3 - 

2019-08-30 09:30:12,055 epoch 4 - iter 12/63 - loss 0.49906795
2019-08-30 09:30:17,569 epoch 4 - iter 18/63 - loss 0.50968976
2019-08-30 09:30:23,347 epoch 4 - iter 24/63 - loss 0.47567989
2019-08-30 09:30:29,608 epoch 4 - iter 30/63 - loss 0.49497867
2019-08-30 09:30:35,706 epoch 4 - iter 36/63 - loss 0.47769011
2019-08-30 09:30:40,469 epoch 4 - iter 42/63 - loss 0.48527218
2019-08-30 09:30:45,918 epoch 4 - iter 48/63 - loss 0.47183558
2019-08-30 09:30:51,205 epoch 4 - iter 54/63 - loss 0.47279594
2019-08-30 09:30:56,249 epoch 4 - iter 60/63 - loss 0.47369679
2019-08-30 09:30:58,388 ----------------------------------------------------------------------------------------------------
2019-08-30 09:30:58,389 EPOCH 4 done: loss 0.4740 - lr 0.0010 - bad epochs 0
2019-08-30 09:31:51,645 DEV : loss 0.5004297494888306 - score 0.8225
2019-08-30 09:34:03,928 TEST : loss 0.6718747615814209 - score 0.7799
2019-08-30 09:34:17,664 --------------------------------------------------------------------

2019-08-30 10:07:47,468 ----------------------------------------------------------------------------------------------------


## Run DISEQuA Benchmark - Cross-validation

Different classifier models are tested with three levels of dependency on external linguistic resources (Low, Medium and High).
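As a rough map of the three configurations benchmarked in this section, the sketch below lists which `create_feature` types each level combines. The `low` and `medium` labels are an assumption inferred from the order of the cells (only the `svm_high` file name confirms the last one), and `FEATURE_LEVELS` and `features_for` are hypothetical names that are not part of the repository.

```python
# Hypothetical summary of the feature sets used by the three SVM cells
# in this section. The 'low' and 'medium' labels are inferred, not taken
# from the repository code.
FEATURE_LEVELS = {
    'low': ['tfidf'],
    'medium': ['tfidf', 'embedding_sum'],
    'high': ['tfidf', 'embedding_sum', 'pos_hotencode', 'ner_hotencode'],
}


def features_for(level):
    """Return the list of feature types combined at a given dependency level."""
    return list(FEATURE_LEVELS[level])


for level in ('low', 'medium', 'high'):
    print(level, '->', ' + '.join(features_for(level)))
```

Each level adds features that need more external resources: word embeddings for the medium level, then POS and NER taggers for the high level.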

#### SVM + <font color=#007700>TF-IDF</font>

In [152]:
for language in ['en', 'es', 'it', 'nl']:
    print('\n\nLanguage: ', language)
    dataset = load_disequa(language)
    create_feature('tfidf', dataset, dataset, max_features=2000)
    
    model = {'name': 'svm', 'model': svm_linear}
    
    tfidf = np.array([list(r) for r in dataset['tfidf'].values])
    tfidf = normalize(tfidf, norm='max')
    
    X = tfidf  # use the max-normalized vectors instead of discarding them
    y = dataset['class'].values
    
    run_benchmark_cv(model, X, y, sizes_train=[100,200,300,400],
                     save='results/DISEQuA_svm_tfidf_' + language + '.csv')



Language:  en

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 1.027012586593628


Language:  es

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 1.0114972591400146


Language:  it

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 1.1434721946716309


Language:  nl

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 1.1250619888305664
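The implementation of `run_benchmark_cv` is not shown in this notebook. The ten dots printed per training size suggest ten rounds per size, but that is an assumption. The sketch below shows one way such a cross-validated learning curve could be computed with the standard library only. `cv_learning_curve` and `majority_class` are hypothetical names, and the trivial classifier is only a stand-in for `svm_linear`.

```python
import random


def cv_learning_curve(X, y, fit_predict, sizes_train, folds=10, seed=0):
    """For each training size, run `folds` rounds: hold out one fold for
    testing and train on `size` examples sampled from the remaining data."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    rng.shuffle(indices)
    results = {}
    for size in sizes_train:
        accuracies = []
        for k in range(folds):
            test_idx = indices[k::folds]  # every `folds`-th shuffled example
            test_set = set(test_idx)
            pool = [i for i in indices if i not in test_set]
            train_idx = rng.sample(pool, min(size, len(pool)))
            predictions = fit_predict([X[i] for i in train_idx],
                                      [y[i] for i in train_idx],
                                      [X[i] for i in test_idx])
            hits = sum(p == y[i] for p, i in zip(predictions, test_idx))
            accuracies.append(hits / len(test_idx))
        results[size] = sum(accuracies) / folds  # mean accuracy over the folds
    return results


def majority_class(X_train, y_train, X_test):
    """Trivial stand-in classifier: always predict the most frequent label."""
    most_common = max(set(y_train), key=y_train.count)
    return [most_common] * len(X_test)


# Toy data: 50 examples, 40 labelled 'LOC' and 10 labelled 'NUM'.
X_toy = [[i] for i in range(50)]
y_toy = ['LOC'] * 40 + ['NUM'] * 10
scores = cv_learning_curve(X_toy, y_toy, majority_class, sizes_train=[10, 20])
print(scores)
```

Sampling the training subset from outside the held-out fold keeps test examples out of training at every size.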


#### SVM + <font color=#007700>TF-IDF</font> + <font color=#0055CC>WB</font>

In [163]:
for language in ['en', 'es', 'it', 'nl']:
    print('\n\nLanguage: ', language)
    embedding = load_embedding(path_wordembedding + 'wiki.multi.' + language + '.vec')
    dataset = load_disequa(language)
    create_feature('tfidf', dataset, dataset, max_features=2000)
    create_feature('embedding_sum', None, dataset, embedding)
    
    model = {'name': 'svm', 'model': svm_linear}
    
    tfidf = np.array([list(r) for r in dataset['tfidf'].values])
    tfidf = normalize(tfidf, norm='max')
    
    embedding = np.array([list(r) for r in dataset['embedding_sum'].values])
    embedding = normalize(embedding, norm='max')
    
    X = np.array([list(x) + list(xx) for x, xx in zip(tfidf, embedding)])
    y = dataset['class'].values
    
    run_benchmark_cv(model, X, y, sizes_train=[100,200,300,400],
                     save='results/DISEQuA_svm_cortes_' + language + '.csv')



Language:  en

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 6.358882427215576


Language:  es

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 7.197380065917969


Language:  it

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 5.5334153175354


Language:  nl

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 6.624628782272339
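The cell above scales each feature block with `normalize(..., norm='max')` before concatenating the blocks per example. As a dependency-free illustration, the sketch below mimics that step under the assumption that max-normalization divides each row by its largest absolute value and leaves all-zero rows unchanged. `max_normalize` and `concat_features` are hypothetical helpers, not functions from the repository.

```python
def max_normalize(rows):
    """Scale each row by its largest absolute value; all-zero rows stay unchanged."""
    normalized = []
    for row in rows:
        scale = max(abs(v) for v in row) or 1.0  # avoid dividing by zero
        normalized.append([v / scale for v in row])
    return normalized


def concat_features(*blocks):
    """Concatenate the per-example vectors of several feature blocks."""
    return [sum((list(vec) for vec in vecs), []) for vecs in zip(*blocks)]


# Two toy examples with a 3-dimensional TF-IDF block and a 2-dimensional
# embedding block.
tfidf_toy = [[0.0, 0.5, 0.25], [0.0, 0.0, 0.0]]
embedding_toy = [[2.0, -4.0], [1.0, 3.0]]

X_toy = concat_features(max_normalize(tfidf_toy), max_normalize(embedding_toy))
print(X_toy)
```

Normalizing each block separately keeps the summed embeddings, which can have large magnitudes, from dominating the TF-IDF weights in the linear SVM.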


#### SVM + <font color=#007700>TF-IDF</font> + <font color=#0055CC>WB</font> + <font color=#CC6600>POS</font> + <font color=#CC6600>NER</font>

In [164]:


for language in ['en', 'es', 'it', 'nl']:
    print('\n\nLanguage: ', language)
    embedding = load_embedding(path_wordembedding + 'wiki.multi.' + language + '.vec')
    dataset = load_disequa(language)
    create_feature('tfidf', dataset, dataset, max_features=2000)
    create_feature('embedding_sum', dataset, dataset, embedding)
    create_feature('pos_hotencode', dataset, dataset)
    create_feature('ner_hotencode', dataset, dataset)
    model = {'name': 'svm', 'model': svm_linear}
    
    tfidf = np.array([list(r) for r in dataset['tfidf'].values])
    tfidf = normalize(tfidf, norm='max')
    
    embedding = np.array([list(r) for r in dataset['embedding_sum'].values])
    embedding = normalize(embedding, norm='max')
    
    pos = np.array([list(r) for r in dataset['pos_hotencode'].values])
    
    ner = np.array([list(r) for r in dataset['ner_hotencode'].values])
    
    X = np.array([list(x) + list(xx) + list(xxx) + list(xxxx) for x, xx, xxx, xxxx in zip(tfidf, embedding, pos, ner)])
    
    y = dataset['class'].values
    
    run_benchmark_cv(model, X, y, sizes_train=[100,200,300,400],
                     save='results/DISEQuA_svm_high_' + language + '.csv')



Language:  en

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 6.811999559402466


Language:  es

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 8.384974479675293


Language:  it

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 6.426969528198242


Language:  nl

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 6.852076053619385


## Old stuff below

#### CNN

In [None]:
# 'en', 'es'
for language in ['es']:
    print('\n\nLanguage: ', language)
    #embedding = load_embedding(path_wordembedding + 'wiki.multi.' + language + '.vec')
    dataset_train, dataset_test = load_uiuc(language)
    text_representation = 'vocab_index'
    vocabulary_inv = create_feature(text_representation, dataset_train, dataset_train)
    create_feature(text_representation, dataset_train, dataset_test)
    model = {'name': 'cnn', 'model': cnn}
    X_train = np.array([list(x) for x in dataset_train[text_representation].values])
    X_test = np.array([list(x) for x in dataset_test[text_representation].values])
    #X_train = pad_sequences(X_train, maxlen=12, dtype='float', padding='post', truncating='post', value=0.0)
    #X_test = pad_sequences(X_test, maxlen=12, dtype='float', padding='post', truncating='post', value=0.0)
    y_train = dataset_train['class'].values
    y_test = dataset_test['class'].values
    ohe = OneHotEncoder()
    y_train = ohe.fit_transform([[y_] for y_ in y_train]).toarray()
    y_test = ohe.transform([[y_] for y_ in y_test]).toarray()
    run_benchmark(model, X_train, y_train, X_test, y_test, sizes_train=[1000, 2000, 3000, 4000, 5500],
                  runs=30, save='results/UIUC_cnn_' + language + '.csv', epochs=100, onehot=ohe,
                  vocabulary_size=len(vocabulary_inv))
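The neural models receive one-hot encoded labels, and the fitted encoder is passed to `run_benchmark` as `onehot`, presumably so predictions can be mapped back to class names. The sketch below illustrates that round trip in plain Python. `fit_label_index`, `one_hot` and `decode` are hypothetical helpers and do not reproduce the `OneHotEncoder` API.

```python
def fit_label_index(labels):
    """Map each distinct label to a column index (sorted for a stable order)."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def one_hot(labels, index):
    """Encode each label as a vector with a single 1.0 in its column."""
    rows = []
    for label in labels:
        row = [0.0] * len(index)
        row[index[label]] = 1.0
        rows.append(row)
    return rows


def decode(rows, index):
    """Map each row back to the label of its largest column."""
    names = sorted(index, key=index.get)
    return [names[row.index(max(row))] for row in rows]


y_train_toy = ['LOC', 'NUM', 'HUM', 'LOC']
y_test_toy = ['HUM', 'NUM']

index = fit_label_index(y_train_toy)  # fitted on the training labels only
encoded = one_hot(y_test_toy, index)
print(encoded)
print(decode(encoded, index))
```

As in the cell above, the mapping is fitted on the training labels and then reused for the test labels, so both sets share the same column order.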

#### LSTM + WordEmbedding

In [73]:
for language in ['es']:
    print('\n\nLanguage: ', language)
    embedding = load_embedding(path_wordembedding + 'wiki.multi.' + language + '.vec')
    dataset_train, dataset_test = load_uiuc(language)
    dataset_train = dataset_train[:100]
    #dataset_test = dataset_test[:10]
    create_feature('embedding', dataset_train, dataset_train, embedding)
    create_feature('embedding', dataset_train, dataset_test, embedding)
    model = {'name': 'lstm', 'model': lstm_default}
    #print(dataset_train['embedding'].values.shape)
    #print(dataset_train['embedding'].values.dtype)
    #print(dataset_test['embedding'].values.shape)
    X_train = np.array([list(x) for x in dataset_train['embedding'].values])
    X_test = np.array([list(x) for x in dataset_test['embedding'].values])
    X_train = pad_sequences(X_train, maxlen=12, dtype='float', padding='post', truncating='post', value=0.0)
    X_test = pad_sequences(X_test, maxlen=12, dtype='float', padding='post', truncating='post', value=0.0)
    y_train = dataset_train['class'].values
    y_test = dataset_test['class'].values
#     y_train_sub = dataset_train['sub_class'].values
#     sub_classes = set()
#     for sc in y_train_sub:
#         sub_classes.add(sc)
#     y_test_sub = dataset_test['sub_class'].values
#     X_test_sub_ = []
#     y_test_sub_ = []
#     for i in range(len(X_test)):
#         if y_train_sub[i] in sub_classes:
#             X_test_sub_.append(X_test[i])
#             y_test_sub_.append(y_train_sub[i])
#     X_test_sub_ = np.array(X_test_sub_)
#     y_test_sub_ = np.array(y_test_sub_)
    ohe = OneHotEncoder()
    y_train = ohe.fit_transform([[y_] for y_ in y_train]).toarray()
    y_test = ohe.transform([[y_] for y_ in y_test]).toarray() 
    run_benchmark(model, X_train, y_train, X_test, y_test, sizes_train=[1000, 2000, 3000, 4000, 5500],
                  runs=30, save='results/UIUC_lstm_embedding_' + language + '_2.csv', epochs=100, onehot=ohe)
    #run_benchmark(model, X_train, y_train_sub, X_test_sub_, y_test_sub_, sizes_train=[1000, 2000, 3000, 4000, 5500],
    #              save='results/UIUCsub_svm_tfidf_' + language + '.csv')



Language:  es
(100,)
object
(1349,)

1000|...
2000|...
3000|...
4000|...
5500|...
Run time benchmark: 228.79835891723633
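The `pad_sequences` calls above give every question the same length (`maxlen=12`) with `padding='post'` and `truncating='post'`. The sketch below is a simplified plain-Python illustration of that behaviour for sequences of embedding vectors. `pad_post` is a hypothetical helper, not the Keras function.

```python
def pad_post(sequences, maxlen, dim, value=0.0):
    """Truncate each sequence to `maxlen` steps and pad shorter ones at the end."""
    padded = []
    for seq in sequences:
        steps = [list(vec) for vec in seq[:maxlen]]  # drop steps past maxlen
        steps += [[value] * dim for _ in range(maxlen - len(steps))]
        padded.append(steps)
    return padded


# Two toy "questions" with 2-dimensional word vectors: one short, one long.
short_question = [[0.1, 0.2]]
long_question = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]

batch = pad_post([short_question, long_question], maxlen=3, dim=2)
print(batch)
```

With post-padding and post-truncation, the first words of each question are always kept and the filler vectors appear only at the end.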


#### LSTM + BERT

In [None]:
for language in ['en']:
    print('\n\nLanguage: ', language)
    #embedding = load_embedding(path_wordembedding + 'wiki.multi.' + language + '.vec')
    dataset_train, dataset_test = load_uiuc(language)
    # debug: uncomment both lines to train on a subset of the training data
    #print('WARNING: use subset (first 5500 entries) of training data')
    #dataset_train = dataset_train[:5500].copy()
    
    create_feature('bert', dataset_train, dataset_train)
    create_feature('bert', dataset_train, dataset_test)
    model = {'name': 'lstm', 'model': lstm_default}
    X_train = dataset_train['bert'].values
    X_test = dataset_test['bert'].values
    
    X_train = np.array([x for x in X_train])
    X_test = np.array([x for x in X_test])
    
    #X_train = pad_sequences(X_train, maxlen=12, dtype='float', padding='post', truncating='post', value=0.0)
    #X_test = pad_sequences(X_test, maxlen=12, dtype='float', padding='post', truncating='post', value=0.0)
    y_train = dataset_train['class'].values
    y_test = dataset_test['class'].values
    ohe = OneHotEncoder()
    y_train = ohe.fit_transform([[y_] for y_ in y_train]).toarray()
    y_test = ohe.transform([[y_] for y_ in y_test]).toarray() 
    run_benchmark(model, X_train, y_train, X_test, y_test, sizes_train=[5500], # 1000, 2000, 3000, 4000, 5500
                  runs=1, save='results/UIUC_lstm_bert_' + language + '.csv', 
                  epochs=100, onehot=ohe, in_dim=768)
    #run_benchmark(model, X_train, y_train_sub, X_test_sub_, y_test_sub_, sizes_train=[1000, 2000, 3000, 4000, 5500],
    #              save='results/UIUCsub_svm_tfidf_' + language + '.csv')

## DISEQuA Benchmark

### Run DISEQuA Benchmark

##### SVM + TFIDF

In [None]:
for language in ['DUT', 'ENG', 'ITA', 'SPA']:
    print('\n\nLanguage: ', language)
    dataset = load_disequa(language)
    create_feature('tfidf', dataset, dataset)  # TF-IDF does not need an embedding model
    model = {'name': 'svm', 'model': svm_linear}
    X = np.array([list(x) for x in dataset['tfidf'].values])
    y = dataset['class'].values
    run_benchmark(model, X, y, sizes_train=[100,200,300,400,405],
                  save='results/DISEQuA_svm_tfidf_' + language + '.csv')

##### RFC + TFIDF

In [None]:
for language in ['DUT', 'ENG', 'ITA', 'SPA']:
    print('\n\nLanguage: ', language)
    dataset = load_disequa(language)
    create_feature('tfidf', dataset, dataset)  # TF-IDF does not need an embedding model
    model = {'name': 'rfc', 'model': random_forest}
    X = np.array([list(x) for x in dataset['tfidf'].values])
    y = dataset['class'].values
    run_benchmark(model, X, y, sizes_train=[100,200,300,400],
                  save='results/DISEQuA_rfc_tfidf_' + language + '.csv')

##### SVM + TFIDF_3gram + SKB

In [None]:
for language in ['DUT', 'ENG', 'ITA', 'SPA']:
    print('\n\nLanguage: ', language)
    dataset = load_disequa(language)
    create_feature('tfidf_3gram', dataset, dataset)
    model = {'name': 'svm', 'model': svm_linear}
    X = np.array([list(x) for x in dataset['tfidf'].values])
    y = dataset['class'].values
    skb = SelectKBest(chi2, k=2000).fit(X, y)
    X = skb.transform(X)
    run_benchmark(model, X, y, sizes_train=[100,200,300,400],
                  save='results/DISEQuA_svm_tfidf_3gram_' + language + '.csv')

##### LSTM + Embedding

In [None]:
for language, embd_l in zip(['SPA'], ['es']):
    print('\n\nLanguage: ', language)
    embedding = load_embedding(path_wordembedding + 'wiki.multi.' + embd_l + '.vec')
    dataset = load_disequa(language)
    create_feature('embedding', dataset, dataset, embedding)
    model = {'name': 'lstm', 'model': lstm_default}
    X = np.array([list(x) for x in dataset['embedding'].values])
    y = dataset['class'].values
    X = pad_sequences(X, maxlen=12, dtype='float', padding='post', truncating='post', value=0.0)
    ohe = OneHotEncoder()
    y = ohe.fit_transform([[y_] for y_ in y]).toarray()
    run_benchmark(model, X, y, sizes_train=[100,200,300,400,405], onehot=ohe,
                  save='results/DISEQuA_lstm_embedding_' + language + '.csv')

##### CNN

In [None]:
for language, embd_l in zip(['DUT', 'ENG', 'ITA', 'SPA'], ['nl', 'en', 'it', 'es']):
    print('\n\nLanguage: ', language)
    #embedding = load_embedding(path_wordembedding + 'wiki.multi.' + embd_l + '.vec')
    dataset = load_disequa(language)
    text_representation = 'vocab_index'
    vocabulary_inv = create_feature(text_representation, dataset, dataset)
    model = {'name': 'cnn', 'model': cnn}
    X = np.array([list(x) for x in dataset[text_representation].values])
    y = dataset['class'].values
    #X = pad_sequences(X, maxlen=12, dtype='float', padding='post', truncating='post', value=0.0)
    ohe = OneHotEncoder()
    y = ohe.fit_transform([[y_] for y_ in y]).toarray()
    run_benchmark(model, X, y, sizes_train=[100,200,300,400], onehot=ohe, vocabulary_size=len(vocabulary_inv),
                  save='results/DISEQuA_cnn_' + language + '.csv', epochs=100)