## How to run the experiments

Run the code blocks below in sequence. The description above each block explains what it does.


The dependencies can be found in https://github.com/eduardogc8/simple-qc

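Before running the experiments, set the variable ``path_wordembedding`` in the code block below to the directory that contains the word embeddings. The cells build file names by string concatenation (``path_wordembedding + 'wiki.multi.' + language + '.vec'``), so the path must end with a path separator. Make sure the embedding files in that directory follow the template `wiki.multi.*.vec`.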
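
Before loading anything, it can help to check that the expected files are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_embeddings` is hypothetical, and it only assumes the `wiki.multi.<language>.vec` naming template described above.

```python
from pathlib import Path


def missing_embeddings(embedding_dir, languages):
    """Return the language codes whose wiki.multi.<lang>.vec file is absent."""
    directory = Path(embedding_dir)
    return [lang for lang in languages
            if not (directory / f"wiki.multi.{lang}.vec").is_file()]
```

For example, `missing_embeddings(path_wordembedding, ['en', 'es', 'pt'])` would return an empty list when all three files used by the UIUC experiments are present.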

In [1]:
import nltk
import numpy as np
import pandas as pd
from keras.preprocessing.sequence import pad_sequences
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import normalize

from benchmarking_methods import run_benchmark, run_benchmark_cv
from building_classifiers import lstm_default, svm_linear, random_forest, cnn
from download_word_embeddings import muse_embeddings_path, download_if_not_existing
from feature_creation import create_feature
from loading_data import load_embedding, load_uiuc, load_disequa

# Directory that holds the MUSE word embeddings
path_wordembedding = muse_embeddings_path
download_if_not_existing()

Using TensorFlow backend.
[nltk_data] Downloading package punkt to /home/eduardo/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Extract features

The function *create_feature* transforms the questions into numerical vectors that a classifier model can consume.<br>It stores the result in the *df_2* dataframe passed as a parameter, in the column named after *feature_type* (for example, *df_2.tfidf*).<br><br>
**feature_type:** type of feature. (bow, tfidf, embedding, embedding_sum, vocab_index, pos_index, pos_hotencode, ner_index, ner_hotencode)<br> 
**df:** the dataframe used to fit the transformer models (df.questions).<br>
**df_2:** the dataframe whose data will be transformed (df_2.questions).<br>
**embedding:** embedding model for the word-embedding feature types.<br>
**max_features:** used by the bag-of-words and TF-IDF features.
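
The split between *df* (used for fitting) and *df_2* (the data that gets transformed) is the usual fit/transform pattern. The sketch below illustrates it with a plain bag-of-words in pure Python. It is a stand-in under stated assumptions, not the code of *create_feature*: the helper names `fit_vocabulary` and `bow_vector`, the whitespace tokenization and the lowercasing are all hypothetical.

```python
from collections import Counter


def fit_vocabulary(questions, max_features):
    """Pick the max_features most frequent tokens from the fitting questions."""
    counts = Counter(token for q in questions for token in q.lower().split())
    return [token for token, _ in counts.most_common(max_features)]


def bow_vector(question, vocabulary):
    """Count how often each vocabulary token occurs in one question."""
    counts = Counter(question.lower().split())
    return [counts[token] for token in vocabulary]


train_questions = ["What is the capital of Peru ?", "Who wrote Hamlet ?"]
test_questions = ["Who is the president of Peru ?"]

# The vocabulary is fitted on the training questions only (the role of df) ...
vocabulary = fit_vocabulary(train_questions, max_features=5)
# ... and then applied to the questions to transform (the role of df_2).
test_vectors = [bow_vector(q, vocabulary) for q in test_questions]
```

Fitting on the training data and reusing the fitted vocabulary for the test data keeps test-only words such as "president" out of the feature space.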


### Create classifier models

The models are created through functions that return them. These functions will be used to create a new model in each experiment. Therefore, an instance of a model is created by the benchmark function and not explicitly in a code block.
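
The idea of passing a function instead of a model instance can be sketched as follows. `ToyModel` and `toy_model` are hypothetical stand-ins, and the real factories such as *svm_linear* may accept parameters that are not shown here.

```python
import random


class ToyModel:
    """Stand-in for a classifier with randomly initialised parameters."""

    def __init__(self):
        self.weight = random.random()


def toy_model():
    """Factory: returns a new, untrained model on every call."""
    return ToyModel()


# The benchmark receives the factory, not an instance.
model = {'name': 'toy', 'model': toy_model}

first = model['model']()
second = model['model']()
print(first is second)  # False: each experiment gets its own instance
```

Because every experiment calls the factory again, no trained state leaks from one training size or run into the next.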


### UTILS



#### Load UIUC dataset

#### Load DISEQuA dataset

## Benchmark UIUC - Normal

**Normal:** it uses the default fixed split of UIUC into a training set (at most 5500 instances) and a test set (500 instances). Therefore, it does not use cross-validation.

When the *run_benchmark* function is executed, it will save each result in the *save* path.

**model:** a dictionary with the classifier name and the function to create and return the model (not an instance of the model). <br> Example: *model = {'name': 'SVM', 'model': svm_linear}*<br>
**X:** the full training set.<br>
**y:** the labels of the full training set.<br>
**x_test:** test set.<br>
**y_test:** labels of the test set.<br>
**sizes_train:** training-set sizes. One experiment is executed for each size.<br>
**runs:** number of times each experiment is executed (useful for models with randomly initialized parameters, such as the weights of an ANN).<br>
**save:** csv path where the results will be saved.<br>
**metric_average:** averaging mode used for the F1, recall and precision metrics.<br>
**onehot:** one-hot model to transform labels.<br>
**out_dim:** the total of classes for ANN models.<br>
**epochs:** epochs for ANN models.<br>
**batch_size:** batch_size for ANN models.<br>
**vocabulary_size:** vocabulary size (used in CNN model).
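
How *sizes_train* and *runs* interact can be sketched with a small learning-curve loop. This is an illustration under assumptions, not the implementation of *run_benchmark*: the function `learning_curve`, the toy `MajorityClassifier`, the accuracy metric and the choice of taking the first `size` rows are all hypothetical.

```python
class MajorityClassifier:
    """Toy stand-in for a real model: predicts the most frequent training label."""

    def fit(self, X, y):
        labels = list(y)
        self.label_ = max(sorted(set(labels)), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def learning_curve(model, X, y, x_test, y_test, sizes_train, runs=1):
    """Train on growing prefixes of the training set and score on a fixed test set."""
    results = []
    for size in sizes_train:
        for run in range(runs):
            clf = model['model']()        # fresh instance for every experiment
            clf.fit(X[:size], y[:size])   # assumption: take the first `size` rows
            predictions = clf.predict(x_test)
            correct = sum(p == t for p, t in zip(predictions, y_test))
            results.append({'name': model['name'], 'size': size, 'run': run,
                            'accuracy': correct / len(y_test)})
    return results


X = [[0], [1], [2], [3]]
y = ['LOC', 'LOC', 'NUM', 'NUM']
rows = learning_curve({'name': 'majority', 'model': MajorityClassifier},
                      X, y, x_test=[[4], [5]], y_test=['LOC', 'NUM'],
                      sizes_train=[2, 4], runs=1)
```

Each entry in `rows` corresponds to one (size, run) pair, which is the kind of record that would end up as one line in the *save* file.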



## Benchmark UIUC and DISEQuA - Cross-validation

**Cross-validation:** instead of using the default fixed split, it uses the whole dataset with 10-fold cross-validation.

When the *run_benchmark_cv* function is executed, it will save each result in the *save* path.

**model:** a dictionary with the classifier name and the function to create and return the model (not an instance of the model). <br> Example: *model = {'name': 'SVM', 'model': svm_linear}*<br>
**X:** Input features.<br>
**y:** Input labels.<br>
**sizes_train:** sizes of training set. For each size, an experiment is executed.<br>
**folds:** number of folds for cross-validation.<br>
**save:** csv path where the results will be saved.<br>
**metric_average:** averaging mode used for the F1, recall and precision metrics.<br>
**onehot:** one-hot model to transform labels.<br>
**epochs:** epochs for ANN models.<br>
**batch_size:** batch_size for ANN models.<br>
**vocabulary_size:** vocabulary size (used in CNN model).
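
The *folds* parameter splits the data into equally sized parts, each of which serves once as the test set. A minimal pure-Python sketch of such a split is shown below. The helper `k_fold_indices` is hypothetical, and whether *run_benchmark_cv* shuffles or stratifies the data is not stated in this notebook, so the contiguous split is an assumption.

```python
def k_fold_indices(n_samples, folds=10):
    """Yield (train_indices, test_indices) pairs for contiguous k-fold splits."""
    # Distribute the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // folds + (1 if i < n_samples % folds else 0)
                  for i in range(folds)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


splits = list(k_fold_indices(25, folds=10))
```

With *sizes_train*, one plausible reading is that the training part of each fold is capped at the requested size before fitting, so that every size is evaluated on all folds.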



## Run UIUC Benchmark - Normal

Different classifier models are tested with different levels of dependency on external linguistic resources (Low, Medium and High).

#### SVM + TF-IDF

In [110]:
for language in ['en', 'es', 'pt']:
    print('\n\nLanguage: ', language)
    dataset_train, dataset_test = load_uiuc(language)
    create_feature('tfidf', dataset_train, dataset_train, max_features=2000)
    create_feature('tfidf', dataset_train, dataset_test, max_features=2000)
    
    model = {'name': 'svm', 'model': svm_linear}
    
    tfidf_train = np.array([list(r) for r in dataset_train['tfidf'].values])
    tfidf_test = np.array([list(r) for r in dataset_test['tfidf'].values])
    tfidf_train = normalize(tfidf_train, norm='max')
    tfidf_test = normalize(tfidf_test, norm='max')
    
    # Note: X is built from the raw 'tfidf' column; the max-normalized arrays above are not used.
    X_train = np.array([list(x) for x in dataset_train['tfidf'].values])
    X_test = np.array([list(x) for x in dataset_test['tfidf'].values])
    y_train = dataset_train['class'].values
    y_test = dataset_test['class'].values
    
    run_benchmark(model, X_train, y_train, X_test, y_test, sizes_train=[1000, 2000, 3000, 4000, 5500],
                  save='results/UIUC_svm_tfidf_' + language + '.csv', runs=1)



Language:  en

1000|.
2000|.
3000|.
4000|.
5500|.
Run time benchmark: 0.41161417961120605


Language:  es

1000|.
2000|.
3000|.
4000|.
5500|.
Run time benchmark: 0.5091326236724854


Language:  pt

1000|.
2000|.
3000|.
4000|.
5500|.
Run time benchmark: 0.4507761001586914
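
The cell above calls `normalize(..., norm='max')`. A pure-Python sketch of what row-wise max normalization does is shown below, assuming each row is divided by its largest absolute value and all-zero rows are left unchanged. The helper name `normalize_max` is hypothetical and the snippet is not scikit-learn's implementation.

```python
def normalize_max(rows):
    """Divide each row by its largest absolute value; all-zero rows stay unchanged."""
    scaled = []
    for row in rows:
        peak = max(abs(v) for v in row)
        scaled.append([v / peak for v in row] if peak else list(row))
    return scaled


print(normalize_max([[1.0, 2.0, 4.0], [0.0, 0.0, 0.0]]))
# [[0.25, 0.5, 1.0], [0.0, 0.0, 0.0]]
```

Scaling each feature block this way keeps the largest magnitude in every row at 1, which makes blocks with different value ranges easier to combine.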


#### SVM + TF-IDF + WB

In [37]:
for language in ['en', 'es', 'pt']:
    print('\n\nLanguage: ', language)
    embedding = load_embedding(path_wordembedding + 'wiki.multi.' + language + '.vec')
    dataset_train, dataset_test = load_uiuc(language)
    create_feature('tfidf', dataset_train, dataset_train, max_features=2000)
    create_feature('tfidf', dataset_train, dataset_test, max_features=2000)
    create_feature('embedding_sum', None, dataset_train, embedding)
    create_feature('embedding_sum', None, dataset_test, embedding)
    
    model = {'name': 'svm', 'model': svm_linear}
    
    tfidf_train = np.array([list(r) for r in dataset_train['tfidf'].values])
    tfidf_test = np.array([list(r) for r in dataset_test['tfidf'].values])
    tfidf_train = normalize(tfidf_train, norm='max')
    tfidf_test = normalize(tfidf_test, norm='max')
    
    embedding_train = np.array([list(r) for r in dataset_train['embedding_sum'].values])
    embedding_test = np.array([list(r) for r in dataset_test['embedding_sum'].values])
    embedding_train = normalize(embedding_train, norm='max')
    embedding_test = normalize(embedding_test, norm='max')
    
    X_train = np.array([list(x) + list(xx) for x, xx in zip(tfidf_train, embedding_train)])
    X_test = np.array([list(x) + list(xx) for x, xx in zip(tfidf_test, embedding_test)])
    y_train = dataset_train['class'].values
    y_test = dataset_test['class'].values
    
    run_benchmark(model, X_train, y_train, X_test, y_test, sizes_train=[1000, 2000, 3000, 4000, 5500], 
                  runs=1, save='results/UIUC_svm_cortes_' + language + '.csv')



Language:  en

1000|.
2000|.
3000|.
4000|.
5500|.
Run time benchmark: 12.742478370666504


Language:  es

1000|.
2000|.
3000|.
4000|.
5500|.
Run time benchmark: 14.3670494556427


Language:  pt

1000|.
2000|.
3000|.
4000|.
5500|.
Run time benchmark: 13.426743984222412
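
The *embedding_sum* feature used above turns a whole question into one fixed-length vector. Judging by its name, it adds up the word vectors of the question. The sketch below shows that idea under assumptions: the function, the whitespace tokenization, the lowercasing and the skipping of unknown words are hypothetical and may differ from *create_feature*.

```python
def embedding_sum(question, embedding, dim):
    """Sum the vectors of all known tokens in the question."""
    total = [0.0] * dim
    for token in question.lower().split():
        vector = embedding.get(token)
        if vector is None:  # assumption: out-of-vocabulary tokens are skipped
            continue
        total = [a + b for a, b in zip(total, vector)]
    return total


toy_embedding = {'who': [1.0, 0.0], 'wrote': [0.5, 2.0]}
print(embedding_sum("Who wrote Hamlet", toy_embedding, dim=2))
# [1.5, 2.0]
```

The result has the same length for every question, so it can be concatenated with the TF-IDF vector as done in the cell above.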


#### SVM + TF-IDF + WB + POS + NER

In [101]:
for language in ['en', 'es', 'pt']:
    print('\n\nLanguage: ', language)
    embedding = load_embedding(path_wordembedding + 'wiki.multi.' + language + '.vec')
    dataset_train, dataset_test = load_uiuc(language)
    create_feature('tfidf', dataset_train, dataset_train, max_features=2000)
    create_feature('tfidf', dataset_train, dataset_test, max_features=2000)
    create_feature('embedding_sum', dataset_train, dataset_train, embedding)
    create_feature('embedding_sum', dataset_train, dataset_test, embedding)
    create_feature('pos_hotencode', dataset_train, dataset_train)
    create_feature('pos_hotencode', dataset_train, dataset_test)
    create_feature('ner_hotencode', dataset_train, dataset_train)
    create_feature('ner_hotencode', dataset_train, dataset_test)
    model = {'name': 'svm', 'model': svm_linear}
    
    tfidf_train = np.array([list(r) for r in dataset_train['tfidf'].values])
    tfidf_test = np.array([list(r) for r in dataset_test['tfidf'].values])
    tfidf_train = normalize(tfidf_train, norm='max')
    tfidf_test = normalize(tfidf_test, norm='max')
    
    embedding_train = np.array([list(r) for r in dataset_train['embedding_sum'].values])
    embedding_test = np.array([list(r) for r in dataset_test['embedding_sum'].values])
    embedding_train = normalize(embedding_train, norm='max')
    embedding_test = normalize(embedding_test, norm='max')
    
    pos_train = np.array([list(r) for r in dataset_train['pos_hotencode'].values])
    pos_test = np.array([list(r) for r in dataset_test['pos_hotencode'].values])
    
    ner_train = np.array([list(r) for r in dataset_train['ner_hotencode'].values])
    ner_test = np.array([list(r) for r in dataset_test['ner_hotencode'].values])
    
    X_train = np.array([list(x) + list(xx) + list(xxx) + list(xxxx) for x, xx, xxx, xxxx in zip(tfidf_train, embedding_train, pos_train, ner_train)])
    X_test = np.array([list(x) + list(xx) + list(xxx) + list(xxxx) for x, xx, xxx, xxxx in zip(tfidf_test, embedding_test, pos_test, ner_test)])
    
    y_train = dataset_train['class'].values
    y_test = dataset_test['class'].values
    
    classes = list(dataset_train['class'].unique())
    y_train_ = [classes.index(c) for c in y_train]
    
    run_benchmark(model, X_train, y_train, X_test, y_test, sizes_train=[1000, 2000, 3000, 4000, 5500],
                  runs=1, save='results/UIUC_svm_high_' + language + '.csv')



Language:  en

1000|.
2000|.
3000|.
4000|.
5500|.
Run time benchmark: 12.932790994644165


Language:  es

1000|.
2000|.
3000|.
4000|.
5500|.
Run time benchmark: 15.48978304862976


Language:  pt

1000|.
2000|.
3000|.
4000|.
5500|.
Run time benchmark: 14.322027683258057
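
The cell above builds `X_train` by joining the TF-IDF, embedding, POS and NER blocks question by question. The same operation can be written as a small helper. The name `concatenate_features` is hypothetical and the values below are made up for illustration.

```python
from itertools import chain


def concatenate_features(*blocks):
    """Join several per-question feature blocks into one vector per question."""
    return [list(chain.from_iterable(parts)) for parts in zip(*blocks)]


tfidf = [[0.1, 0.2], [0.3, 0.4]]
pos = [[1, 0], [0, 1]]
ner = [[1], [0]]

X = concatenate_features(tfidf, pos, ner)
print(X)
# [[0.1, 0.2, 1, 0, 1], [0.3, 0.4, 0, 1, 0]]
```

All blocks must list the questions in the same order, since `zip` pairs the rows by position.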


## Run UIUC Benchmark - Cross-validation

Different classifier models are tested with different levels of dependency on external linguistic resources (Low, Medium and High).

#### SVM + TF-IDF

In [175]:
for language in ['en', 'es', 'pt']:
    print('\n\nLanguage: ', language)
    dataset_train, dataset_test = load_uiuc(language)
    dataset = pd.concat([dataset_train, dataset_test])
    create_feature('tfidf', dataset, dataset, max_features=2000)
    
    model = {'name': 'svm', 'model': svm_linear}
    
    tfidf = np.array([list(r) for r in dataset['tfidf'].values])
    tfidf = normalize(tfidf, norm='max')
    
    # Note: X is built from the raw 'tfidf' column; the max-normalized array above is not used.
    X = np.array([list(x) for x in dataset['tfidf'].values])
    y = dataset['class'].values
    
    run_benchmark_cv(model, X, y, [50, 100] + list(range(500, 5501, 500)),
                     save='results/UIUC_cv_svm_tfidf_' + language + '.csv')



Language:  en

50|..........
100|..........
500|..........
1000|..........
1500|..........
2000|..........
2500|..........
3000|..........
3500|..........
4000|..........
4500|..........
5000|..........
5500|..........
Run time benchmark: 22.216983318328857


Language:  es

50|..........
100|..........
500|..........
1000|..........
1500|..........
2000|..........
2500|..........
3000|..........
3500|..........
4000|..........
4500|..........
5000|..........
5500|..........
Run time benchmark: 24.743942499160767


Language:  pt

50|..........
100|..........
500|..........
1000|..........
1500|..........
2000|..........
2500|..........
3000|..........
3500|..........
4000|..........
4500|..........
5000|..........
5500|..........
Run time benchmark: 22.218426942825317
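
The list of training sizes passed to *run_benchmark_cv* in the cell above expands as follows, which explains the 13 progress lines printed per language.

```python
sizes_train = [50, 100] + list(range(500, 5501, 500))
print(sizes_train)
# [50, 100, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500]
print(len(sizes_train))
# 13
```

Each progress line shows ten dots, which presumably correspond to the ten cross-validation folds run for that size.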


#### SVM + TF-IDF + WB

In [176]:
for language in ['en', 'es', 'pt']:
    print('\n\nLanguage: ', language)
    embedding = load_embedding(path_wordembedding + 'wiki.multi.' + language + '.vec')
    dataset_train, dataset_test = load_uiuc(language)
    dataset = pd.concat([dataset_train, dataset_test])
    create_feature('tfidf', dataset, dataset, max_features=2000)
    create_feature('embedding_sum', None, dataset, embedding)
    
    model = {'name': 'svm', 'model': svm_linear}
    
    tfidf = np.array([list(r) for r in dataset['tfidf'].values])
    tfidf = normalize(tfidf, norm='max')
    
    embedding = np.array([list(r) for r in dataset['embedding_sum'].values])
    embedding = normalize(embedding, norm='max')
    
    X = np.array([list(x) + list(xx) for x, xx in zip(tfidf, embedding)])
    y = dataset['class'].values
    
    run_benchmark_cv(model, X, y, [50, 100] + list(range(500, 5501, 500)),
                     save='results/UIUC_cv_svm_cortes_' + language + '.csv')



Language:  en

50|..........
100|..........
500|..........
1000|..........
1500|..........
2000|..........
2500|..........
3000|..........
3500|..........
4000|..........
4500|..........
5000|..........
5500|..........
Run time benchmark: 270.81999158859253


Language:  es

50|..........
100|..........
500|..........
1000|..........
1500|..........
2000|..........
2500|..........
3000|..........
3500|..........
4000|..........
4500|..........
5000|..........
5500|..........
Run time benchmark: 329.85390615463257


Language:  pt

50|..........
100|..........
500|..........
1000|..........
1500|..........
2000|..........
2500|..........
3000|..........
3500|..........
4000|..........
4500|..........
5000|..........
5500|..........
Run time benchmark: 327.71513843536377
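
The `wiki.multi.*.vec` files read by *load_embedding* are plain text. The sketch below parses such a file, assuming the common word2vec text layout: a header line with the vocabulary size and the dimension, followed by one word and its values per line. The function `load_vec` and its `limit` parameter are hypothetical and not the repository's loader.

```python
def load_vec(path, limit=None):
    """Read a word2vec-style text file into a {word: [floats]} dictionary."""
    vectors = {}
    with open(path, encoding='utf-8') as handle:
        # Header line (assumed): "<vocabulary size> <dimension>"
        _count, dim = map(int, handle.readline().split())
        for i, line in enumerate(handle):
            if limit is not None and i >= limit:
                break
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors, dim
```

Passing a `limit` keeps only the first entries of the file, which can shorten loading times while experimenting.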


#### SVM + TF-IDF + WB + POS + NER

In [177]:
for language in ['en', 'es', 'pt']:
    print('\n\nLanguage: ', language)
    embedding = load_embedding(path_wordembedding + 'wiki.multi.' + language + '.vec')
    dataset_train, dataset_test = load_uiuc(language)
    dataset = pd.concat([dataset_train, dataset_test])
    create_feature('tfidf', dataset, dataset, max_features=2000)
    create_feature('embedding_sum', dataset, dataset, embedding)
    create_feature('pos_hotencode', dataset, dataset)
    create_feature('ner_hotencode', dataset, dataset)
    model = {'name': 'svm', 'model': svm_linear}
    
    tfidf = np.array([list(r) for r in dataset['tfidf'].values])
    tfidf = normalize(tfidf, norm='max')
    
    embedding = np.array([list(r) for r in dataset['embedding_sum'].values])
    embedding = normalize(embedding, norm='max')
    
    pos = np.array([list(r) for r in dataset['pos_hotencode'].values])
    
    ner = np.array([list(r) for r in dataset['ner_hotencode'].values])
    
    X = np.array([list(x) + list(xx) + list(xxx) + list(xxxx) for x, xx, xxx, xxxx in zip(tfidf, embedding, pos, ner)])
    
    y = dataset['class'].values
    
    run_benchmark_cv(model, X, y, [50, 100] + list(range(500, 5501, 500)),
                     save='results/UIUC_cv_svm_high_' + language + '.csv')



Language:  en

50|..........
100|..........
500|..........
1000|..........
1500|..........
2000|..........
2500|..........
3000|..........
3500|..........
4000|..........
4500|..........
5000|..........
5500|..........
Run time benchmark: 300.1998710632324


Language:  es

50|..........
100|..........
500|..........
1000|..........
1500|..........
2000|..........
2500|..........
3000|..........
3500|..........
4000|..........
4500|..........
5000|..........
5500|..........
Run time benchmark: 361.68490052223206


Language:  pt

50|..........
100|..........
500|..........
1000|..........
1500|..........
2000|..........
2500|..........
3000|..........
3500|..........
4000|..........
4500|..........
5000|..........
5500|..........
Run time benchmark: 308.6705446243286


## Run DISEQuA Benchmark - Cross-validation

Different classifier models are tested with different levels of dependency on external linguistic resources (Low, Medium and High).

#### SVM + <font color=#007700>TF-IDF</font>

In [152]:
for language in ['en', 'es', 'it', 'nl']:
    print('\n\nLanguage: ', language)
    dataset = load_disequa(language)
    create_feature('tfidf', dataset, dataset, max_features=2000)
    
    model = {'name': 'svm', 'model': svm_linear}
    
    tfidf = np.array([list(r) for r in dataset['tfidf'].values])
    tfidf = normalize(tfidf, norm='max')
    
    # Note: X is built from the raw 'tfidf' column; the max-normalized array above is not used.
    X = np.array([list(x) for x in dataset['tfidf'].values])
    y = dataset['class'].values
    
    run_benchmark_cv(model, X, y, sizes_train=[100,200,300,400],
                     save='results/DISEQuA_svm_tfidf_' + language + '.csv')



Language:  en

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 1.027012586593628


Language:  es

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 1.0114972591400146


Language:  it

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 1.1434721946716309


Language:  nl

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 1.1250619888305664


#### SVM + <font color=#007700>TF-IDF</font> + <font color=#0055CC>WB</font>

In [163]:
for language in ['en', 'es', 'it', 'nl']:
    print('\n\nLanguage: ', language)
    embedding = load_embedding(path_wordembedding + 'wiki.multi.' + language + '.vec')
    dataset = load_disequa(language)
    create_feature('tfidf', dataset, dataset, max_features=2000)
    create_feature('embedding_sum', None, dataset, embedding)
    
    model = {'name': 'svm', 'model': svm_linear}
    
    tfidf = np.array([list(r) for r in dataset['tfidf'].values])
    tfidf = normalize(tfidf, norm='max')
    
    embedding = np.array([list(r) for r in dataset['embedding_sum'].values])
    embedding = normalize(embedding, norm='max')
    
    X = np.array([list(x) + list(xx) for x, xx in zip(tfidf, embedding)])
    y = dataset['class'].values
    
    run_benchmark_cv(model, X, y, sizes_train=[100,200,300,400],
                     save='results/DISEQuA_svm_cortes_' + language + '.csv')



Language:  en

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 6.358882427215576


Language:  es

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 7.197380065917969


Language:  it

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 5.5334153175354


Language:  nl

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 6.624628782272339


#### SVM + <font color=#007700>TF-IDF</font> + <font color=#0055CC>WB</font> + <font color=#CC6600>POS</font> + <font color=#CC6600>NER</font>

In [164]:
for language in ['en', 'es', 'it', 'nl']:
    print('\n\nLanguage: ', language)
    embedding = load_embedding(path_wordembedding + 'wiki.multi.' + language + '.vec')
    dataset = load_disequa(language)
    create_feature('tfidf', dataset, dataset, max_features=2000)
    create_feature('embedding_sum', dataset, dataset, embedding)
    create_feature('pos_hotencode', dataset, dataset)
    create_feature('ner_hotencode', dataset, dataset)
    model = {'name': 'svm', 'model': svm_linear}
    
    tfidf = np.array([list(r) for r in dataset['tfidf'].values])
    tfidf = normalize(tfidf, norm='max')
    
    embedding = np.array([list(r) for r in dataset['embedding_sum'].values])
    embedding = normalize(embedding, norm='max')
    
    pos = np.array([list(r) for r in dataset['pos_hotencode'].values])
    
    ner = np.array([list(r) for r in dataset['ner_hotencode'].values])
    
    X = np.array([list(x) + list(xx) + list(xxx) + list(xxxx) for x, xx, xxx, xxxx in zip(tfidf, embedding, pos, ner)])
    
    y = dataset['class'].values
    
    run_benchmark_cv(model, X, y, sizes_train=[100,200,300,400],
                     save='results/DISEQuA_svm_high_' + language + '.csv')



Language:  en

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 6.811999559402466


Language:  es

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 8.384974479675293


Language:  it

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 6.426969528198242


Language:  nl

100|..........
200|..........
300|..........
400|..........
Run time benchmark: 6.852076053619385


## Old stuff below

#### CNN

In [None]:
# 'en', 'es'
for language in ['es']:
    print('\n\nLanguage: ', language)
    #embedding = load_embedding(path_wordembedding + 'wiki.multi.' + language + '.vec')
    dataset_train, dataset_test = load_uiuc(language)
    text_representation = 'vocab_index'
    vocabulary_inv = create_feature(text_representation, dataset_train, dataset_train)
    create_feature(text_representation, dataset_train, dataset_test)
    model = {'name': 'cnn', 'model': cnn}
    X_train = np.array([list(x) for x in dataset_train[text_representation].values])
    X_test = np.array([list(x) for x in dataset_test[text_representation].values])
    #X_train = pad_sequences(X_train, maxlen=12, dtype='float', padding='post', truncating='post', value=0.0)
    #X_test = pad_sequences(X_test, maxlen=12, dtype='float', padding='post', truncating='post', value=0.0)
    y_train = dataset_train['class'].values
    y_test = dataset_test['class'].values
    ohe = OneHotEncoder()
    y_train = ohe.fit_transform([[y_] for y_ in y_train]).toarray()
    y_test = ohe.transform([[y_] for y_ in y_test]).toarray()
    # , 
    run_benchmark(model, X_train, y_train, X_test, y_test, sizes_train=[1000, 2000, 3000, 4000, 5500],
                  runs=30, save='results/UIUC_cnn_' + language + '.csv', epochs=100, onehot=ohe,
                  vocabulary_size=len(vocabulary_inv))
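
The ANN cells convert the class labels to one-hot vectors before training. A pure-Python sketch of that encoding follows, assuming the classes are ordered alphabetically. The helper `one_hot_labels` is hypothetical and is not the `OneHotEncoder` used in the cell above.

```python
def one_hot_labels(labels):
    """Encode each label as a vector with a single 1.0 at its class position."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    encoded = [[1.0 if index[label] == i else 0.0 for i in range(len(classes))]
               for label in labels]
    return encoded, classes


encoded, classes = one_hot_labels(['LOC', 'NUM', 'LOC', 'HUM'])
print(classes)
# ['HUM', 'LOC', 'NUM']
print(encoded[0])
# [0.0, 1.0, 0.0]
```

Keeping the fitted class order around (the role of the *onehot* parameter) makes it possible to map predicted vectors back to class names.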

#### LSTM + WordEmbedding

In [73]:
for language in ['es']:
    print('\n\nLanguage: ', language)
    embedding = load_embedding(path_wordembedding + 'wiki.multi.' + language + '.vec')
    dataset_train, dataset_test = load_uiuc(language)
    dataset_train = dataset_train[:100]
    #dataset_test = dataset_test[:10]
    create_feature('embedding', dataset_train, dataset_train, embedding)
    create_feature('embedding', dataset_train, dataset_test, embedding)
    model = {'name': 'lstm', 'model': lstm_default}
    #print(dataset_train['embedding'].values.shape)
    #print(dataset_train['embedding'].values.dtype)
    #print(dataset_test['embedding'].values.shape)
    X_train = np.array([list(x) for x in dataset_train['embedding'].values])
    X_test = np.array([list(x) for x in dataset_test['embedding'].values])
    X_train = pad_sequences(X_train, maxlen=12, dtype='float', padding='post', truncating='post', value=0.0)
    X_test = pad_sequences(X_test, maxlen=12, dtype='float', padding='post', truncating='post', value=0.0)
    y_train = dataset_train['class'].values
    y_test = dataset_test['class'].values
#     y_train_sub = dataset_train['sub_class'].values
#     sub_classes = set()
#     for sc in y_train_sub:
#         sub_classes.add(sc)
#     y_test_sub = dataset_test['sub_class'].values
#     X_test_sub_ = []
#     y_test_sub_ = []
#     for i in range(len(X_test)):
#         if y_train_sub[i] in sub_classes:
#             X_test_sub_.append(X_test[i])
#             y_test_sub_.append(y_train_sub[i])
#     X_test_sub_ = np.array(X_test_sub_)
#     y_test_sub_ = np.array(y_test_sub_)
    ohe = OneHotEncoder()
    y_train = ohe.fit_transform([[y_] for y_ in y_train]).toarray()
    y_test = ohe.transform([[y_] for y_ in y_test]).toarray() 
    run_benchmark(model, X_train, y_train, X_test, y_test, sizes_train=[1000, 2000, 3000, 4000, 5500],
                  runs=30, save='results/UIUC_lstm_embedding_' + language + '_2.csv', epochs=100, onehot=ohe)
    #run_benchmark(model, X_train, y_train_sub, X_test_sub_, y_test_sub_, sizes_train=[1000, 2000, 3000, 4000, 5500],
    #              save='results/UIUCsub_svm_tfidf_' + language + '.csv')



Language:  es
(100,)
object
(1349,)

1000|...
2000|...
3000|...
4000|...
5500|...
Run time benchmark: 228.79835891723633
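
The LSTM cell above pads or truncates every question to `maxlen=12` with `padding='post'` and `truncating='post'`. The sketch below shows the effect of those two settings on flat sequences. The helper `pad_post` is hypothetical; in the cell each sequence element is a word vector rather than a single number.

```python
def pad_post(sequences, maxlen, value=0.0):
    """Cut sequences at maxlen and fill shorter ones at the end with value."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                            # truncating='post': drop the tail
        padded.append(seq + [value] * (maxlen - len(seq)))  # padding='post': fill the end
    return padded


print(pad_post([[1, 2, 3], [1, 2, 3, 4, 5]], maxlen=4))
# [[1, 2, 3, 0.0], [1, 2, 3, 4]]
```

After this step all questions have the same length, which is what a batched recurrent model expects as input.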


#### LSTM + BERT

In [None]:
for language in ['en']:
    print('\n\nLanguage: ', language)
    #embedding = load_embedding(path_wordembedding + 'wiki.multi.' + language + '.vec')
    dataset_train, dataset_test = load_uiuc(language)
    # debug: uncomment both lines below to train on a subset of the training data
    #print('WARNING: use subset (first 5500 entries) of training data')
    #dataset_train = dataset_train[:5500].copy()
    
    create_feature('bert', dataset_train, dataset_train)
    create_feature('bert', dataset_train, dataset_test)
    model = {'name': 'lstm', 'model': lstm_default}
    X_train = dataset_train['bert'].values
    X_test = dataset_test['bert'].values
    
    X_train = np.array([x for x in X_train])
    X_test = np.array([x for x in X_test])
    
    #X_train = pad_sequences(X_train, maxlen=12, dtype='float', padding='post', truncating='post', value=0.0)
    #X_test = pad_sequences(X_test, maxlen=12, dtype='float', padding='post', truncating='post', value=0.0)
    y_train = dataset_train['class'].values
    y_test = dataset_test['class'].values
    ohe = OneHotEncoder()
    y_train = ohe.fit_transform([[y_] for y_ in y_train]).toarray()
    y_test = ohe.transform([[y_] for y_ in y_test]).toarray() 
    run_benchmark(model, X_train, y_train, X_test, y_test, sizes_train=[5500], # 1000, 2000, 3000, 4000, 5500
                  runs=1, save='results/UIUC_lstm_bert_' + language + '.csv', 
                  epochs=100, onehot=ohe, in_dim=768)
    #run_benchmark(model, X_train, y_train_sub, X_test_sub_, y_test_sub_, sizes_train=[1000, 2000, 3000, 4000, 5500],
    #              save='results/UIUCsub_svm_tfidf_' + language + '.csv')

## DISEQuA Benchmark

### RUN DISEQuA Benchmark

##### SVM + TFIDF

In [None]:
for language in ['DUT', 'ENG', 'ITA', 'SPA']:
    print('\n\nLanguage: ', language)
    dataset = load_disequa(language)
    create_feature('tfidf', dataset, dataset, embedding)
    model = {'name': 'svm', 'model': svm_linear}
    X = np.array([list(x) for x in dataset['tfidf'].values])
    y = dataset['class'].values
    run_benchmark(model, X, y, sizes_train=[100,200,300,400,405],
                  save='results/DISEQuA_svm_tfidf_' + language + '.csv')

##### RFC + TFIDF

In [None]:
for language in ['DUT', 'ENG', 'ITA', 'SPA']:
    print('\n\nLanguage: ', language)
    dataset = load_disequa(language)
    create_feature('tfidf', dataset, dataset, embedding)
    model = {'name': 'rfc', 'model': random_forest}
    X = np.array([list(x) for x in dataset['tfidf'].values])
    y = dataset['class'].values
    run_benchmark(model, X, y, sizes_train=[100,200,300,400],
                  save='results/DISEQuA_rfc_tfidf_' + language + '.csv')

##### SVM + TFIDF_3gram + SKB

In [None]:
for language in ['DUT', 'ENG', 'ITA', 'SPA']:
    print('\n\nLanguage: ', language)
    dataset = load_disequa(language)
    create_feature('tfidf_3gram', dataset, dataset)
    model = {'name': 'svm', 'model': svm_linear}
    X = np.array([list(x) for x in dataset['tfidf'].values])
    y = dataset['class'].values
    skb = SelectKBest(chi2, k=2000).fit(X, y)
    X = skb.transform(X)
    run_benchmark(model, X, y, sizes_train=[100,200,300,400],
                  save='results/DISEQuA_svm_tfidf_3gram_' + language + '.csv')

##### LSTM + Embedding

In [None]:
for language, embd_l in zip(['SPA'], ['es']):
    print('\n\nLanguage: ', language)
    embedding = load_embedding(path_wordembedding + 'wiki.multi.' + embd_l + '.vec')
    dataset = load_disequa(language)
    create_feature('embedding', dataset, dataset, embedding)
    model = {'name': 'lstm', 'model': lstm_default}
    X = np.array([list(x) for x in dataset['embedding'].values])
    y = dataset['class'].values
    X = pad_sequences(X, maxlen=12, dtype='float', padding='post', truncating='post', value=0.0)
    ohe = OneHotEncoder()
    y = ohe.fit_transform([[y_] for y_ in y]).toarray()
    run_benchmark(model, X, y, sizes_train=[100,200,300,400,405], onehot=ohe,
                  save='results/DISEQuA_lstm_embedding_' + language + '.csv')

##### CNN

In [None]:
for language, embd_l in zip(['DUT', 'ENG', 'ITA', 'SPA'], ['nl', 'en', 'it', 'es']):
    print('\n\nLanguage: ', language)
    #embedding = load_embedding(path_wordembedding + 'wiki.multi.' + embd_l + '.vec')
    dataset = load_disequa(language)
    text_representation = 'vocab_index'
    vocabulary_inv = create_feature(text_representation, dataset, dataset)
    model = {'name': 'cnn', 'model': cnn}
    X = np.array([list(x) for x in dataset[text_representation].values])
    y = dataset['class'].values
    #X = pad_sequences(X, maxlen=12, dtype='float', padding='post', truncating='post', value=0.0)
    ohe = OneHotEncoder()
    y = ohe.fit_transform([[y_] for y_ in y]).toarray()
    run_benchmark(model, X, y, sizes_train=[100,200,300,400], onehot=ohe, vocabulary_size=len(vocabulary_inv),
                  save='results/DISEQuA_cnn_' + language + '.csv', epochs=100)