
# Facts extraction with AILA data and TF-IDF features

This notebook experiments with TF-IDF features in order to find the best hyperparameters.

TF-IDF weights are computed at the document level. The TF-IDF vector of a sentence is built in two steps:

- The TF-IDF model is applied to the whole document that contains the sentence.
- The sentence vector takes, for each term that occurs in the sentence, the weight that term received in the document; all other entries are zero.
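The two steps above can be sketched in plain Python. This is a simplified illustration, not the notebook's implementation: the helper names (`fit_idf`, `document_weights`, `sentence_vector`) and the toy corpus are assumptions, tokenization is reduced to pre-split word lists, and the weighting follows the smoothed IDF with L2 normalization that `TfidfVectorizer` applies by default.

```python
import math
from collections import Counter

def fit_idf(documents):
    """Smoothed inverse document frequency for each term of a tokenized corpus."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def document_weights(doc_tokens, idf):
    """L2-normalised TF-IDF weights of one document (unknown terms are ignored)."""
    counts = Counter(t for t in doc_tokens if t in idf)
    raw = {t: tf * idf[t] for t, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
    return {t: w / norm for t, w in raw.items()}

def sentence_vector(sentence_tokens, doc_weights, vocabulary):
    """Sentence vector whose non-zero entries are copied from the document weights."""
    present = set(sentence_tokens)
    return [doc_weights.get(t, 0.0) if t in present else 0.0 for t in vocabulary]

# Toy corpus (hypothetical): each document is a list of tokenized sentences
doc_a = [["the", "court", "heard", "the", "appeal"], ["the", "appeal", "failed"]]
doc_b = [["the", "petitioner", "filed", "a", "suit"]]
flat = [[t for s in d for t in s] for d in (doc_a, doc_b)]

idf = fit_idf(flat)                              # learned from whole documents
vocabulary = sorted(idf)
weights_a = document_weights(flat[0], idf)       # step 1: weights of the document
vec = sentence_vector(doc_a[1], weights_a, vocabulary)  # step 2: sentence vector
```

Note that the weight of "the" in `vec` reflects its two occurrences in the whole document, not its single occurrence in the sentence.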

Data used in this notebook:

- For cross-validation: the train dataset from AILA 2020, available at https://github.com/Law-AI/semantic-segmentation.

### Loading dataset

In [None]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
g_drive_dir = "/content/gdrive/MyDrive/"

Mounted at /content/gdrive


In [None]:
!rm -r data
!mkdir data
!mkdir data/train
!tar -xf {g_drive_dir}fact_extraction_AILA/train.tar.xz -C data/train

train_dir = 'data/train/'

rm: cannot remove 'data': No such file or directory


In [None]:
import pandas as pd
from os import listdir

def read_docs(dir_name):
  """
  Read the docs in a directory.
  Params:
    dir_name : the directory that contains the documents.
  Returns:
    A dictionary whose keys are the names of the read files and the values are 
    pandas dataframes. Each dataframe has the columns sentence and label.
  """
  docs = {} # key: file name, value: dataframe with sentences and labels
  for f in listdir(dir_name):
    df = pd.read_csv(
        dir_name + f, 
        sep='\t', 
        names=['sentence', 'label'])
    docs[f] = df
  return docs

docs_train = read_docs(train_dir)

print(f'TRAIN: {len(docs_train)} documents read.')

TRAIN: 50 documents read.


### Splitting document files according to the folds

In [None]:
# Reading the file that lists the train documents and the test documents of each fold
train_files_by_fold = []  # Each index in the list represents a fold and stores a list of file names
test_files_by_fold = []   # Each index in the list represents a fold and stores a list of file names

df_folds = pd.read_csv(
  g_drive_dir + 'fact_extraction_AILA/train_docs_by_fold.csv', 
  sep=';', 
  names=['train', 'test'])
for line in df_folds['train'].tolist():
  train_files_by_fold.append(line.split(','))
for line in df_folds['test'].tolist():
  test_files_by_fold.append(line.split(','))

for i in range(len(test_files_by_fold)):
  print(f'Fold {i}: \n\tTrain files: {train_files_by_fold[i]} \n\tTest files: {test_files_by_fold[i]}')

Fold 0: 
	Train files: ['d_44.txt', 'd_39.txt', 'd_12.txt', 'd_2.txt', 'd_7.txt', 'd_33.txt', 'd_16.txt', 'd_8.txt', 'd_42.txt', 'd_34.txt', 'd_40.txt', 'd_24.txt', 'd_36.txt', 'd_11.txt', 'd_13.txt', 'd_19.txt', 'd_18.txt', 'd_4.txt', 'd_1.txt', 'd_21.txt', 'd_15.txt', 'd_23.txt', 'd_32.txt', 'd_9.txt', 'd_5.txt', 'd_3.txt', 'd_26.txt', 'd_20.txt', 'd_30.txt', 'd_41.txt', 'd_46.txt', 'd_43.txt', 'd_50.txt', 'd_27.txt', 'd_25.txt', 'd_35.txt', 'd_45.txt', 'd_17.txt', 'd_48.txt', 'd_6.txt'] 
	Test files: ['d_22.txt', 'd_31.txt', 'd_49.txt', 'd_14.txt', 'd_29.txt', 'd_47.txt', 'd_10.txt', 'd_38.txt', 'd_28.txt', 'd_37.txt']
Fold 1: 
	Train files: ['d_22.txt', 'd_31.txt', 'd_49.txt', 'd_14.txt', 'd_29.txt', 'd_47.txt', 'd_10.txt', 'd_38.txt', 'd_28.txt', 'd_37.txt', 'd_40.txt', 'd_24.txt', 'd_36.txt', 'd_11.txt', 'd_13.txt', 'd_19.txt', 'd_18.txt', 'd_4.txt', 'd_1.txt', 'd_21.txt', 'd_15.txt', 'd_23.txt', 'd_32.txt', 'd_9.txt', 'd_5.txt', 'd_3.txt', 'd_26.txt', 'd_20.txt', 'd_30.txt', 'd_
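The fold file is read with `;` separating the train list from the test list and `,` separating file names. A minimal sketch of that parsing, using only the standard library, is shown here; the sample content and the `read_folds` helper are hypothetical, and the disjointness check is an addition that the notebook itself does not perform.

```python
import csv
import io

# Hypothetical file content: one fold per line, "train files;test files"
sample = "d_1.txt,d_2.txt,d_3.txt;d_4.txt\nd_1.txt,d_2.txt,d_4.txt;d_3.txt\n"

def read_folds(handle):
    """Return (train_by_fold, test_by_fold), each a list of file-name lists."""
    train_by_fold, test_by_fold = [], []
    for train_field, test_field in csv.reader(handle, delimiter=';'):
        train_by_fold.append(train_field.split(','))
        test_by_fold.append(test_field.split(','))
    return train_by_fold, test_by_fold

train_by_fold, test_by_fold = read_folds(io.StringIO(sample))

# A fold is only valid if no document appears in both of its subsets
for fold, (train, test) in enumerate(zip(train_by_fold, test_by_fold)):
    assert not set(train) & set(test), f'fold {fold} leaks test documents'
```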

### Functions to get data by fold

In [None]:
def docs_to_sentences(file_names, docs_dic):
  """
  Extracts the sentences and the labels from a subset of documents.
  Params:
    file_names    : List with the names of the documents in the desired subset.
    docs_dic      : Dictionary of documents as returned by the read_docs function.
  Returns:
    - A list of sentences (strings).
    - A list of labels (strings), aligned by index with the returned 
    sentence list.
  """
  sentences_ = []
  targets_ = []
  for fname in file_names:
    sentences_.extend(docs_dic[fname]['sentence'].tolist())
    targets_.extend(docs_dic[fname]['label'].tolist())
  
  return sentences_, targets_


In [None]:
def docs_as_strings(file_names, docs_dic):
  """
  Returns a subset of documents as a list of strings.
  Params:
    file_names    : List with the names of the documents in the desired subset.
    docs_dic      : Dictionary of documents as returned by the read_docs function.
  Returns:
    A list of strings, one per document.
  """
  docs = []
  for fname in file_names:
    docs.append(" ".join(docs_dic[fname]['sentence'].tolist()))
  return docs


In [None]:
import numpy as np
import scipy.sparse as sparse

def docs_to_features(file_names, docs_dic, tfidf_model, to_dense=False):
  """
  Converts the sentences from a set of documents to features.
  Params:
    file_names    : List with the names of the documents in the desired subset.
    docs_dic      : Dictionary of documents as returned by the read_docs function.
    tfidf_model   : A trained model used to compute the TF-IDF weights.
    to_dense      : Whether the feature matrix must be returned as a dense numpy array.
  Returns:
    - A scipy sparse matrix if to_dense=False and a numpy array otherwise. Its 
    shape is (len(sentences), vocabulary size).
    - A list of labels (strings), aligned by index with the rows of the 
    feature matrix.
  """
  n_vocab = len(tfidf_model.vocabulary_)
  features = None
  targets = []
  for fname in file_names:
    targets.extend(docs_dic[fname]['label'].tolist())
    sentences = docs_dic[fname]['sentence'].tolist()
    doc_str = " ".join(sentences)
    doc_tfidf = tfidf_model.transform([doc_str])[0]
    # using the model to generate a sparse matrix with the correct structure;
    # the TF-IDF weights themselves are then taken from doc_tfidf
    sentences_tfidf = tfidf_model.transform(sentences)
    non_zeros_idx = sentences_tfidf.nonzero()
    for sent_idx, term_idx in zip(non_zeros_idx[0], non_zeros_idx[1]):
      # overriding with the weights from doc_tfidf
      sentences_tfidf[sent_idx, term_idx] = doc_tfidf[0, term_idx]
    if features is None:
      features = sentences_tfidf
    else:
      features = sparse.vstack([features, sentences_tfidf])
  if to_dense:
    features = features.toarray()
  
  return features, targets
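The element-by-element override inside `docs_to_features` can also be written as one sparse product, which may be faster for long documents. The sketch below is an alternative formulation, not the code that produced the results in this notebook; `override_with_doc_weights` is a hypothetical helper and the tiny matrices stand in for the output of `tfidf_model.transform`.

```python
import numpy as np
import scipy.sparse as sparse

def override_with_doc_weights(sentences_tfidf, doc_tfidf):
    """Replace each non-zero sentence weight with the document weight of the same term."""
    # Presence matrix: 1 where a sentence contains a term, 0 elsewhere
    presence = sparse.csr_matrix(sentences_tfidf, copy=True)
    presence.data = np.ones_like(presence.data)
    # Scaling column j by the document weight of term j copies that weight
    # into every sentence that contains the term
    weights = doc_tfidf.toarray().ravel()
    return (presence @ sparse.diags(weights)).tocsr()

# Toy input (hypothetical): 2 sentences, vocabulary of 3 terms
sent = sparse.csr_matrix(np.array([[0.5, 0.0, 0.2], [0.0, 0.9, 0.0]]))
doc = sparse.csr_matrix(np.array([[0.1, 0.3, 0.7]]))
features = override_with_doc_weights(sent, doc)
```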

### Evaluation functions

In [None]:
import sklearn
from sklearn.metrics import precision_recall_fscore_support
from IPython.display import display, update_display, HTML

def metrics_report(title, metrics):
  report_df = pd.DataFrame(columns=['Precision', 'P std', 'Recall', 'R std', 'F1', 'F1 std'])
  for (model, p, p_std, r, r_std, f1, f1_std) in metrics:
    report_df.loc[model] = [f'{p:.4f}', f'{p_std:.4f}', f'{r:.4f}', f'{r_std:.4f}', f'{f1:.4f}', f'{f1_std:.4f}']
  # displaying the report once, after all the models were added
  display(HTML(f'<br><span style="font-weight: bold">{title}: cross-validation averages</span>'))
  display(report_df)

def update_report(display_id, report_df, metrics):
  model, p, p_std, r, r_std, f1, f1_std = metrics
  report_df.loc[model] = [f'{p:.4f}', f'{p_std:.4f}', f'{r:.4f}', f'{r_std:.4f}', f'{f1:.4f}', f'{f1_std:.4f}']
  update_display(report_df, display_id=display_id)

test_metrics = {}  

def cross_validation(model_tuples, tfidf_builder, set_description, verbose_vocab=False):
  """
  Params:
    model_tuples  : A list of tuples. For each tuple the first element is a function 
                    returning an unfitted machine learning model and the second one 
                    is a flag indicating whether to use dense numpy arrays.
    tfidf_builder : A function returning an unfitted TF-IDF model.
    set_description : Text description of the feature set.
    verbose_vocab   : Whether to print the size of the vocabulary.
  """
  train_metrics_cross = {}
  test_metrics_cross = {}
  tfidf_model = tfidf_builder()
  for i_fold in range(len(train_files_by_fold)):
    print(f'Starting fold {i_fold}')
    # training TF-IDF model
    tfidf_model.fit(docs_as_strings(train_files_by_fold[i_fold], docs_train))
    if verbose_vocab:
      print(f'   Learned {len(tfidf_model.vocabulary_)} terms.')
    # running classifiers
    last_to_dense = None
    for (model_builder, to_dense) in model_tuples:
      model = model_builder()
      model_name = model.__class__.__name__
      print(f'   Processing model: {model_name}')
      if last_to_dense != to_dense:
        train_features, train_targets = docs_to_features(train_files_by_fold[i_fold], docs_train, tfidf_model, to_dense=to_dense)
        test_features, test_targets = docs_to_features(test_files_by_fold[i_fold], docs_train, tfidf_model, to_dense=to_dense)
      last_to_dense = to_dense
      model.fit(train_features, train_targets)
      # test metrics
      predictions = model.predict(test_features)
      p_test, r_test, f1_test, _ = precision_recall_fscore_support(
          test_targets, 
          predictions, 
          average='binary', 
          pos_label='Facts', 
          zero_division=0)
      model_metrics = test_metrics_cross.get(model_name, [])
      model_metrics.append([p_test, r_test, f1_test])
      test_metrics_cross[model_name] = model_metrics
      # train metrics
      predictions = model.predict(train_features)
      p_train, r_train, f1_train, _ = precision_recall_fscore_support(
          train_targets, 
          predictions, 
          average='binary', 
          pos_label='Facts', 
          zero_division=0)
      model_metrics = train_metrics_cross.get(model_name, [])
      model_metrics.append([p_train, r_train, f1_train])
      train_metrics_cross[model_name] = model_metrics

  # averaging and reporting the metrics achieved in each fold
  # train metrics
  report_df_train = pd.DataFrame(columns=['Precision', 'P std', 'Recall', 'R std', 'F1', 'F1 std'])
  for model_name, metrics in train_metrics_cross.items():
    model_metrics = np.array(metrics)
    mean = np.mean(model_metrics, axis=0)
    std = np.std(model_metrics, axis=0)
    report_df_train.loc[model_name] = [
        f'{mean[0]:.4f}', f'{std[0]:.4f}',  # precision
        f'{mean[1]:.4f}', f'{std[1]:.4f}',  # recall
        f'{mean[2]:.4f}', f'{std[2]:.4f}']  # f1
  display(HTML(f'<br><span style="font-weight: bold">TRAIN: cross-validation averages</span>'))
  display(report_df_train)
  # test metrics
  report_df_test = pd.DataFrame(columns=['Precision', 'P std', 'Recall', 'R std', 'F1', 'F1 std'])
  for model_name, metrics in test_metrics_cross.items():
    model_metrics = np.array(metrics)
    mean = np.mean(model_metrics, axis=0)
    std = np.std(model_metrics, axis=0)
    report_df_test.loc[model_name] = [
        f'{mean[0]:.4f}', f'{std[0]:.4f}',  # precision
        f'{mean[1]:.4f}', f'{std[1]:.4f}',  # recall
        f'{mean[2]:.4f}', f'{std[2]:.4f}']  # f1
    # metrics for the summary
    summary_model_metrics = test_metrics.get(model_name, [])
    summary_model_metrics.append((set_description, mean, std))
    test_metrics[model_name] = summary_model_metrics
  display(HTML(f'<br><span style="font-weight: bold">TEST: cross-validation averages</span>'))
  display(report_df_test)
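The reports above summarize each model by the mean and standard deviation of its per-fold metrics. `np.std` uses the population formula by default, which corresponds to `statistics.pstdev` in the standard library. A minimal sketch of that aggregation, with hypothetical fold values and a hypothetical `summarize` helper:

```python
from statistics import mean, pstdev

# Hypothetical per-fold (precision, recall, F1) triples for one model
fold_metrics = [(0.50, 0.30, 0.375), (0.40, 0.20, 0.267), (0.60, 0.40, 0.480)]

def summarize(fold_metrics):
    """Column-wise mean and population standard deviation over the folds."""
    columns = list(zip(*fold_metrics))
    return [(mean(c), pstdev(c)) for c in columns]

for name, (avg, std) in zip(('Precision', 'Recall', 'F1'), summarize(fold_metrics)):
    print(f'{name}: mean={avg:.4f} std={std:.4f}')
```

Because F1 is averaged over folds, the reported mean F1 is generally not the harmonic mean of the reported mean precision and mean recall.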


### Pre-processing function

In [None]:
import re

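def preprocess(text):
  # NOTE: the text is not lowercased here, and a custom preprocessor replaces 
  # TfidfVectorizer's own lowercasing step, so the second substitution also 
  # removes uppercase letters.
  pstr = text
  pstr = re.sub(r'[/(){}\[\]\|@,;]', ' ', pstr) # replaces symbols with spaces
  pstr = re.sub(r'[^0-9a-z #+_]', '', pstr)     # removes every character outside 0-9, a-z, space, #, + and _
  pstr = re.sub(r'\d+', '', pstr)               # removes numbers
  return pstr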
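A short check of what the pre-processing function does to a sentence may help when reading the results. The sample sentence is hypothetical and the function body repeats the three substitutions of the cell above. Since the function does not lowercase its input, uppercase letters fall outside the `a-z` range and are removed; lowercasing first keeps them.

```python
import re

def preprocess(text):
    text = re.sub(r'[/(){}\[\]\|@,;]', ' ', text)  # symbols become spaces
    text = re.sub(r'[^0-9a-z #+_]', '', text)      # anything else outside the whitelist is dropped
    text = re.sub(r'\d+', '', text)                # digits are dropped
    return text

sample = 'The appellant (A-1) filed Suit No. 42/1998.'

print(preprocess(sample).split())
# ['he', 'appellant', 'filed', 'uit', 'o']

print(preprocess(sample.lower()).split())
# ['the', 'appellant', 'a', 'filed', 'suit', 'no']
```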

### Models

#### MLP

In [None]:
from sklearn.neural_network import MLPClassifier

def mlp():
  # scikit-learn MLP with default hyperparameters, plus early stopping
  return MLPClassifier(early_stopping=True, random_state=1)

#### Linear SVM

In [None]:
from sklearn.svm import LinearSVC

def linear_svm():
  return LinearSVC(random_state=1)

#### RBF SVM

In [None]:
from sklearn.svm import SVC

def rbf_svm():
  return SVC(kernel='rbf', random_state=1)

#### Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression

def logistic_regression():
  return LogisticRegression(solver='sag', max_iter=200, random_state=1)

#### KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

def knn():
  return KNeighborsClassifier(5)

#### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

def decision_tree():
  return DecisionTreeClassifier(random_state=1)

#### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

def random_forest():
  return RandomForestClassifier(random_state=1)

#### AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostClassifier

def adaboost():
  return AdaBoostClassifier(random_state=1)

#### Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

def naive_bayes():
  return GaussianNB()

#### XGBoost

In [None]:
from xgboost.sklearn import XGBClassifier

def xgboost():
  return XGBClassifier(objective="binary:logistic", tree_method='hist')

### Set 1

- N-grams: 1 to 3
- Stop words removal: No
- Vocabulary's size: no limits

Notes:
- Naive Bayes and XGBoost are not applied to this set because there is not enough RAM to run them.
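To see why the n-gram range drives the vocabulary size, the following sketch enumerates word n-grams for a short token list. The `word_ngrams` helper and the example sentence are hypothetical; the vectorizer's own tokenization is not reproduced.

```python
def word_ngrams(tokens, min_n, max_n):
    """All contiguous word n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

tokens = 'the appeal was dismissed'.split()
for upper in (1, 2, 3):
    grams = word_ngrams(tokens, 1, upper)
    print(f'n-grams 1 to {upper}: {len(grams)} features')
```

Even for four tokens the count grows from 4 to 7 to 9 as the upper bound goes from 1 to 3; over a whole corpus most bigrams and trigrams are rare, so the unlimited vocabulary becomes very large.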

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set1():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=None)


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    get_tf_idf_set1, 
    'SET 1', 
    verbose_vocab=True)

Starting fold 0
   Learned 221360 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
Starting fold 1
   Learned 221933 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
Starting fold 2
   Learned 200352 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing 

**TRAIN: cross-validation averages**

| Model | Precision | P std | Recall | R std | F1 | F1 std |
|---|---|---|---|---|---|---|
| MLPClassifier | 0.964 | 0.0058 | 0.787 | 0.0307 | 0.8663 | 0.0192 |
| LinearSVC | 0.825 | 0.0215 | 0.2012 | 0.0342 | 0.3217 | 0.0438 |
| SVC | 0.8329 | 0.0209 | 0.1293 | 0.0311 | 0.2222 | 0.0471 |
| LogisticRegression | 0.8077 | 0.0489 | 0.058 | 0.024 | 0.1068 | 0.0431 |
| KNeighborsClassifier | 0.7582 | 0.0128 | 0.4284 | 0.0332 | 0.5464 | 0.0258 |
| DecisionTreeClassifier | 0.9977 | 0.0004 | 0.9816 | 0.0015 | 0.9896 | 0.0009 |
| RandomForestClassifier | 0.999 | 0.0002 | 0.9802 | 0.0017 | 0.9895 | 0.0009 |
| AdaBoostClassifier | 0.6883 | 0.0205 | 0.3681 | 0.041 | 0.4778 | 0.0336 |


**TEST: cross-validation averages**

| Model | Precision | P std | Recall | R std | F1 | F1 std |
|---|---|---|---|---|---|---|
| MLPClassifier | 0.4779 | 0.0711 | 0.3486 | 0.0645 | 0.3976 | 0.0505 |
| LinearSVC | 0.5428 | 0.1237 | 0.1047 | 0.0581 | 0.1693 | 0.078 |
| SVC | 0.2636 | 0.3036 | 0.0039 | 0.0053 | 0.0077 | 0.0104 |
| LogisticRegression | 0.6305 | 0.2274 | 0.0086 | 0.0039 | 0.0169 | 0.0076 |
| KNeighborsClassifier | 0.3868 | 0.1288 | 0.1598 | 0.0387 | 0.2232 | 0.0576 |
| DecisionTreeClassifier | 0.3984 | 0.0958 | 0.2524 | 0.0305 | 0.3076 | 0.0502 |
| RandomForestClassifier | 0.6194 | 0.1464 | 0.0637 | 0.0144 | 0.1151 | 0.0253 |
| AdaBoostClassifier | 0.4942 | 0.1036 | 0.2265 | 0.0229 | 0.3081 | 0.0374 |


CPU times: user 1h 42min 8s, sys: 13min 59s, total: 1h 56min 7s
Wall time: 1h 39min 47s


### Set 2

- N-grams: 1 to 3
- Stop words removal: No
- Maximum vocabulary's size: 20,000
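When `max_features` is set, `TfidfVectorizer` keeps only the most frequent terms of the corpus. A simplified sketch of that selection follows; the `limit_vocabulary` helper and the toy documents are hypothetical, and ties between equally frequent terms may be resolved differently by the real vectorizer.

```python
from collections import Counter

def limit_vocabulary(tokenized_docs, max_features=None):
    """Keep the max_features most frequent terms of the corpus (all terms if None)."""
    counts = Counter(term for doc in tokenized_docs for term in doc)
    kept = counts.most_common(max_features)  # None keeps every term
    return sorted(term for term, _ in kept)

docs = [['court', 'held', 'that', 'court'], ['appeal', 'court', 'held']]
print(limit_vocabulary(docs, max_features=2))
# ['court', 'held']
print(len(limit_vocabulary(docs)))
# 4
```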


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set2():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=20000)


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    get_tf_idf_set2, 
    'SET 2', 
    verbose_vocab=True)

Starting fold 0
   Learned 20000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 1
   Learned 20000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 2
   Learned 20000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing

**TRAIN: cross-validation averages**

| Model | Precision | P std | Recall | R std | F1 | F1 std |
|---|---|---|---|---|---|---|
| MLPClassifier | 0.9151 | 0.019 | 0.7088 | 0.0726 | 0.7975 | 0.0533 |
| LinearSVC | 0.822 | 0.0143 | 0.2106 | 0.0332 | 0.3337 | 0.0417 |
| SVC | 0.8288 | 0.0203 | 0.1278 | 0.0318 | 0.2196 | 0.0482 |
| LogisticRegression | 0.7893 | 0.0158 | 0.0684 | 0.0257 | 0.1245 | 0.0451 |
| KNeighborsClassifier | 0.7622 | 0.0141 | 0.4306 | 0.0325 | 0.5492 | 0.0244 |
| DecisionTreeClassifier | 0.9962 | 0.0005 | 0.9766 | 0.0021 | 0.9863 | 0.0012 |
| RandomForestClassifier | 0.9979 | 0.0006 | 0.9748 | 0.0025 | 0.9862 | 0.0012 |
| AdaBoostClassifier | 0.6835 | 0.0183 | 0.3586 | 0.0271 | 0.4695 | 0.0205 |
| XGBClassifier | 0.8953 | 0.0085 | 0.267 | 0.0267 | 0.4105 | 0.0316 |
| GaussianNB | 0.702 | 0.0127 | 1.0 | 0.0 | 0.8248 | 0.0088 |


**TEST: cross-validation averages**

| Model | Precision | P std | Recall | R std | F1 | F1 std |
|---|---|---|---|---|---|---|
| MLPClassifier | 0.4686 | 0.0849 | 0.3555 | 0.0512 | 0.4005 | 0.0498 |
| LinearSVC | 0.5295 | 0.1191 | 0.1047 | 0.064 | 0.1677 | 0.085 |
| SVC | 0.2692 | 0.3474 | 0.0045 | 0.0067 | 0.0089 | 0.0132 |
| LogisticRegression | 0.6267 | 0.2332 | 0.0089 | 0.0036 | 0.0174 | 0.0069 |
| KNeighborsClassifier | 0.3797 | 0.1244 | 0.1607 | 0.0328 | 0.2241 | 0.0532 |
| DecisionTreeClassifier | 0.3952 | 0.0995 | 0.3137 | 0.0199 | 0.3449 | 0.0414 |
| RandomForestClassifier | 0.6159 | 0.1119 | 0.1263 | 0.023 | 0.2094 | 0.0375 |
| AdaBoostClassifier | 0.4837 | 0.1123 | 0.2912 | 0.0527 | 0.3552 | 0.049 |
| XGBClassifier | 0.6532 | 0.0754 | 0.1133 | 0.0088 | 0.1925 | 0.0121 |
| GaussianNB | 0.3906 | 0.0981 | 0.4346 | 0.0778 | 0.3981 | 0.0475 |


CPU times: user 19min 34s, sys: 8min 43s, total: 28min 17s
Wall time: 19min 33s


### Set 3

- N-grams: 1 to 3
- Stop words removal: No
- Maximum vocabulary's size: 2,000

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set3():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=2000)


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    get_tf_idf_set3, 
    'SET 3', 
    verbose_vocab=True)

Starting fold 0
   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 1
   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 2
   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing mo

**TRAIN: cross-validation averages**

| Model | Precision | P std | Recall | R std | F1 | F1 std |
|---|---|---|---|---|---|---|
| MLPClassifier | 0.7854 | 0.0174 | 0.5161 | 0.0824 | 0.6197 | 0.0644 |
| LinearSVC | 0.7872 | 0.0219 | 0.1943 | 0.0365 | 0.3096 | 0.0471 |
| SVC | 0.8167 | 0.0308 | 0.1125 | 0.0332 | 0.1957 | 0.0518 |
| LogisticRegression | 0.7765 | 0.0244 | 0.0766 | 0.0267 | 0.1378 | 0.0457 |
| KNeighborsClassifier | 0.6866 | 0.0621 | 0.4754 | 0.0385 | 0.5592 | 0.0264 |
| DecisionTreeClassifier | 0.9949 | 0.0001 | 0.9625 | 0.0037 | 0.9784 | 0.0019 |
| RandomForestClassifier | 0.997 | 0.0005 | 0.9605 | 0.0041 | 0.9784 | 0.0019 |
| AdaBoostClassifier | 0.6859 | 0.0135 | 0.3593 | 0.028 | 0.4706 | 0.0218 |
| XGBClassifier | 0.8809 | 0.0148 | 0.2596 | 0.0256 | 0.4001 | 0.0298 |
| GaussianNB | 0.4293 | 0.0245 | 0.934 | 0.0049 | 0.5878 | 0.0227 |


**TEST: cross-validation averages**

| Model | Precision | P std | Recall | R std | F1 | F1 std |
|---|---|---|---|---|---|---|
| MLPClassifier | 0.5032 | 0.0655 | 0.3556 | 0.0563 | 0.4117 | 0.041 |
| LinearSVC | 0.5139 | 0.1093 | 0.1176 | 0.0709 | 0.1828 | 0.0898 |
| SVC | 0.2167 | 0.3232 | 0.0053 | 0.0092 | 0.0102 | 0.018 |
| LogisticRegression | 0.541 | 0.1524 | 0.0141 | 0.0052 | 0.0272 | 0.01 |
| KNeighborsClassifier | 0.3603 | 0.1044 | 0.2187 | 0.0461 | 0.2685 | 0.0613 |
| DecisionTreeClassifier | 0.3562 | 0.0954 | 0.3507 | 0.0233 | 0.3471 | 0.0506 |
| RandomForestClassifier | 0.5668 | 0.0721 | 0.2086 | 0.037 | 0.3036 | 0.0469 |
| AdaBoostClassifier | 0.5076 | 0.1011 | 0.2683 | 0.0409 | 0.3484 | 0.0524 |
| XGBClassifier | 0.6492 | 0.1071 | 0.1117 | 0.0164 | 0.1896 | 0.0261 |
| GaussianNB | 0.3433 | 0.0806 | 0.7274 | 0.0496 | 0.4609 | 0.0755 |


CPU times: user 7min, sys: 58.6 s, total: 7min 59s
Wall time: 7min


### Set 4

- N-grams: 1 to 2
- Stop words removal: No
- Vocabulary's size: No limits

Notes:
- Naive Bayes and XGBoost are not applied to this set because there is not enough RAM to run them.
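A rough estimate shows how an unlimited vocabulary exhausts memory for models that need dense input, such as Naive Bayes in this notebook (it is run with `to_dense=True`). The sketch below is back-of-the-envelope arithmetic only: the sentence count and vocabulary sizes are hypothetical round numbers, and 8 bytes per value assumes 64-bit floats.

```python
def dense_size_gib(n_rows, n_features, bytes_per_value=8):
    """Size in GiB of a dense float matrix with the given shape."""
    return n_rows * n_features * bytes_per_value / 2**30

# Hypothetical number of sentences in the training folds
n_sentences = 8_000
for n_terms in (200_000, 20_000, 2_000):
    size = dense_size_gib(n_sentences, n_terms)
    print(f'{n_terms:>7} terms: {size:.2f} GiB')
```

The dense size grows linearly with the vocabulary, which is why limiting `max_features` makes these models feasible.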

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set4():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=None)


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    get_tf_idf_set4, 
    'SET 4', 
    verbose_vocab=True)

Starting fold 0
   Learned 83166 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
Starting fold 1
   Learned 83169 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
Starting fold 2
   Learned 75807 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing mod

**TRAIN: cross-validation averages**

| Model | Precision | P std | Recall | R std | F1 | F1 std |
|---|---|---|---|---|---|---|
| MLPClassifier | 0.9304 | 0.0253 | 0.7652 | 0.0591 | 0.8389 | 0.0427 |
| LinearSVC | 0.817 | 0.0169 | 0.2064 | 0.0331 | 0.3279 | 0.0416 |
| SVC | 0.8199 | 0.0219 | 0.1252 | 0.0304 | 0.2155 | 0.0462 |
| LogisticRegression | 0.7874 | 0.025 | 0.0671 | 0.0257 | 0.1221 | 0.0451 |
| KNeighborsClassifier | 0.7702 | 0.0135 | 0.427 | 0.0292 | 0.5486 | 0.0229 |
| DecisionTreeClassifier | 0.9977 | 0.0004 | 0.9816 | 0.0015 | 0.9896 | 0.0009 |
| RandomForestClassifier | 0.9987 | 0.0004 | 0.9805 | 0.0015 | 0.9896 | 0.0009 |
| AdaBoostClassifier | 0.6944 | 0.0153 | 0.364 | 0.0301 | 0.4766 | 0.0237 |


**TEST: cross-validation averages**

| Model | Precision | P std | Recall | R std | F1 | F1 std |
|---|---|---|---|---|---|---|
| MLPClassifier | 0.4684 | 0.0686 | 0.3539 | 0.0542 | 0.3977 | 0.039 |
| LinearSVC | 0.5288 | 0.1263 | 0.0992 | 0.0643 | 0.1595 | 0.0866 |
| SVC | 0.2667 | 0.3432 | 0.0042 | 0.0061 | 0.0083 | 0.012 |
| LogisticRegression | 0.5662 | 0.2492 | 0.0075 | 0.0039 | 0.0148 | 0.0075 |
| KNeighborsClassifier | 0.3836 | 0.1122 | 0.1706 | 0.0342 | 0.2346 | 0.0513 |
| DecisionTreeClassifier | 0.4186 | 0.0846 | 0.3027 | 0.0384 | 0.3455 | 0.0309 |
| RandomForestClassifier | 0.613 | 0.1356 | 0.0868 | 0.015 | 0.1517 | 0.0262 |
| AdaBoostClassifier | 0.5007 | 0.1033 | 0.2911 | 0.0489 | 0.3626 | 0.0507 |


CPU times: user 49min 19s, sys: 17min 35s, total: 1h 6min 55s
Wall time: 49min 48s


### Set 5

- N-grams: 1 to 2
- Stop words removal: No
- Maximum vocabulary's size: 20,000

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set5():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=20000)


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    get_tf_idf_set5, 
    'SET 5', 
    verbose_vocab=True)

Starting fold 0
   Learned 20000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 1
   Learned 20000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 2
   Learned 20000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing

**TRAIN: cross-validation averages**

| Model | Precision | P std | Recall | R std | F1 | F1 std |
|---|---|---|---|---|---|---|
| MLPClassifier | 0.9252 | 0.0184 | 0.7334 | 0.065 | 0.8171 | 0.0462 |
| LinearSVC | 0.8174 | 0.0159 | 0.2094 | 0.0334 | 0.3317 | 0.0419 |
| SVC | 0.8171 | 0.0222 | 0.1246 | 0.0308 | 0.2145 | 0.0467 |
| LogisticRegression | 0.7808 | 0.0203 | 0.0698 | 0.0267 | 0.1266 | 0.0467 |
| KNeighborsClassifier | 0.7539 | 0.0372 | 0.4395 | 0.0384 | 0.5532 | 0.0266 |
| DecisionTreeClassifier | 0.9964 | 0.0003 | 0.9782 | 0.002 | 0.9872 | 0.0011 |
| RandomForestClassifier | 0.9978 | 0.0004 | 0.9767 | 0.0024 | 0.9872 | 0.0011 |
| AdaBoostClassifier | 0.6966 | 0.0127 | 0.364 | 0.0307 | 0.4771 | 0.0245 |
| XGBClassifier | 0.8856 | 0.0196 | 0.2666 | 0.0252 | 0.409 | 0.0287 |
| GaussianNB | 0.7293 | 0.0138 | 1.0 | 0.0 | 0.8434 | 0.0092 |


**TEST: cross-validation averages**

| Model | Precision | P std | Recall | R std | F1 | F1 std |
|---|---|---|---|---|---|---|
| MLPClassifier | 0.4592 | 0.08 | 0.3567 | 0.0645 | 0.3976 | 0.056 |
| LinearSVC | 0.5342 | 0.1255 | 0.1044 | 0.0651 | 0.1667 | 0.0861 |
| SVC | 0.25 | 0.3162 | 0.0048 | 0.0073 | 0.0095 | 0.0143 |
| LogisticRegression | 0.5829 | 0.2519 | 0.0083 | 0.0034 | 0.0163 | 0.0065 |
| KNeighborsClassifier | 0.3792 | 0.1238 | 0.1877 | 0.0574 | 0.2451 | 0.0617 |
| DecisionTreeClassifier | 0.3995 | 0.0743 | 0.3111 | 0.0286 | 0.3443 | 0.0167 |
| RandomForestClassifier | 0.6158 | 0.0958 | 0.1243 | 0.0145 | 0.2064 | 0.0233 |
| AdaBoostClassifier | 0.5022 | 0.116 | 0.2807 | 0.0601 | 0.3567 | 0.0733 |
| XGBClassifier | 0.6269 | 0.0835 | 0.1203 | 0.0143 | 0.2001 | 0.0149 |
| GaussianNB | 0.3653 | 0.0922 | 0.4114 | 0.0743 | 0.3732 | 0.04 |


CPU times: user 20min 34s, sys: 10min 30s, total: 31min 4s
Wall time: 20min 38s


### Set 6

- N-grams: 1 to 2
- Stop words removal: No
- Maximum vocabulary's size: 2,000

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set6():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=2000)


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    get_tf_idf_set6, 
    'SET 6', 
    verbose_vocab=True)

Starting fold 0
   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 1
   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 2
   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing mo

**TRAIN: cross-validation averages**

| Model | Precision | P std | Recall | R std | F1 | F1 std |
|---|---|---|---|---|---|---|
| MLPClassifier | 0.8041 | 0.0178 | 0.5101 | 0.0689 | 0.6217 | 0.0523 |
| LinearSVC | 0.7882 | 0.0188 | 0.2007 | 0.0351 | 0.3182 | 0.0448 |
| SVC | 0.8029 | 0.0256 | 0.1192 | 0.0307 | 0.2058 | 0.0469 |
| LogisticRegression | 0.7744 | 0.0191 | 0.077 | 0.028 | 0.1383 | 0.048 |
| KNeighborsClassifier | 0.7001 | 0.0608 | 0.4712 | 0.0476 | 0.5588 | 0.0201 |
| DecisionTreeClassifier | 0.9949 | 0.0001 | 0.9641 | 0.0032 | 0.9792 | 0.0017 |
| RandomForestClassifier | 0.9966 | 0.0005 | 0.9624 | 0.0033 | 0.9792 | 0.0017 |
| AdaBoostClassifier | 0.7012 | 0.0187 | 0.35 | 0.0288 | 0.466 | 0.0253 |
| XGBClassifier | 0.888 | 0.0146 | 0.2588 | 0.0272 | 0.3998 | 0.0319 |
| GaussianNB | 0.4294 | 0.024 | 0.9411 | 0.0049 | 0.5892 | 0.0224 |


**TEST: cross-validation averages**

| Model | Precision | P std | Recall | R std | F1 | F1 std |
|---|---|---|---|---|---|---|
| MLPClassifier | 0.4983 | 0.052 | 0.3455 | 0.0609 | 0.4034 | 0.0437 |
| LinearSVC | 0.5275 | 0.1054 | 0.1143 | 0.0714 | 0.1787 | 0.0904 |
| SVC | 0.2167 | 0.3232 | 0.0053 | 0.0092 | 0.0102 | 0.018 |
| LogisticRegression | 0.541 | 0.1524 | 0.0141 | 0.0052 | 0.0272 | 0.01 |
| KNeighborsClassifier | 0.35 | 0.102 | 0.2104 | 0.0277 | 0.26 | 0.0488 |
| DecisionTreeClassifier | 0.3662 | 0.1011 | 0.3356 | 0.0169 | 0.3427 | 0.0401 |
| RandomForestClassifier | 0.5761 | 0.0915 | 0.196 | 0.0306 | 0.2913 | 0.0424 |
| AdaBoostClassifier | 0.5126 | 0.0971 | 0.2759 | 0.0335 | 0.3537 | 0.0373 |
| XGBClassifier | 0.6577 | 0.0828 | 0.1268 | 0.0213 | 0.2111 | 0.0307 |
| GaussianNB | 0.3387 | 0.0763 | 0.7268 | 0.055 | 0.4566 | 0.0699 |


CPU times: user 6min 48s, sys: 1min 1s, total: 7min 50s
Wall time: 6min 49s


### Set 7

- N-grams: 1
- Stop words removal: No
- Vocabulary's size: No limits

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set7():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=None)


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    get_tf_idf_set7, 
    'SET 7', 
    verbose_vocab=True)

Starting fold 0
   Learned 10394 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
Starting fold 1
   Learned 10208 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
Starting fold 2
   Learned 9612 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing mode

Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.8394,0.0243,0.5921,0.0368,0.6937,0.0271
LinearSVC,0.7886,0.0243,0.2014,0.0298,0.3193,0.0369
SVC,0.8044,0.0308,0.1046,0.0301,0.1834,0.0481
LogisticRegression,0.7825,0.0246,0.0736,0.0278,0.1328,0.0482
KNeighborsClassifier,0.7505,0.0104,0.4463,0.0358,0.5588,0.0292
DecisionTreeClassifier,0.9968,0.0005,0.9816,0.0015,0.9891,0.0009
RandomForestClassifier,0.9979,0.0004,0.9804,0.0017,0.9891,0.0009
AdaBoostClassifier,0.7014,0.021,0.3563,0.0208,0.472,0.0178


Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.4601,0.0686,0.308,0.066,0.3626,0.0529
LinearSVC,0.5328,0.0844,0.0953,0.0669,0.1533,0.0885
SVC,0.2167,0.2963,0.0015,0.0019,0.0029,0.0038
LogisticRegression,0.6397,0.2114,0.0127,0.0048,0.0246,0.009
KNeighborsClassifier,0.4062,0.1048,0.2006,0.0436,0.2669,0.0575
DecisionTreeClassifier,0.3804,0.1164,0.3344,0.0297,0.3505,0.059
RandomForestClassifier,0.6443,0.0923,0.1373,0.0241,0.2251,0.0354
AdaBoostClassifier,0.4824,0.0916,0.2545,0.0652,0.3245,0.0549


CPU times: user 7min 56s, sys: 4min 25s, total: 12min 21s
Wall time: 8min 7s
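
Each results table reports a mean and a standard deviation per metric, presumably aggregated over the cross-validation folds. The sketch below shows one way such a row could be computed. The per-fold values are made up for illustration and are not taken from this notebook, and the use of the population standard deviation (`statistics.pstdev`) is an assumption. It also shows why the F1 column need not equal the harmonic mean of the Precision and Recall columns: averaging per-fold F1 scores is not the same as computing F1 from averaged precision and recall.

```python
from statistics import mean, pstdev

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical per-fold (precision, recall) pairs, NOT taken from the notebook.
folds = [(0.50, 0.30), (0.40, 0.35), (0.45, 0.25)]

precisions = [p for p, _ in folds]
recalls = [r for _, r in folds]
f1_per_fold = [f1_score(p, r) for p, r in folds]

# Mean and spread of the per-fold F1 scores (what a "F1" / "F1 std" pair may hold).
mean_f1 = mean(f1_per_fold)
std_f1 = pstdev(f1_per_fold)

# F1 computed from the averaged precision and recall, for comparison.
f1_of_means = f1_score(mean(precisions), mean(recalls))

print(round(mean_f1, 4), round(std_f1, 4), round(f1_of_means, 4))
```
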


### Set 8

- N-grams: 1
- Stop words removal: No
- Maximum vocabulary size: 20,000
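
A minimal sketch of what a vocabulary cap does, assuming the documented behavior of `max_features` in `TfidfVectorizer` (keep only the most frequent terms across the corpus). The helper `cap_vocabulary` and the toy corpus are hypothetical, and this is plain Python rather than scikit-learn's implementation. When the cap exceeds the number of distinct terms, nothing is dropped, which is consistent with the log of this set reporting the same vocabulary sizes as Set 7.

```python
from collections import Counter

def cap_vocabulary(docs, max_features=None):
    """Return the sorted vocabulary, keeping at most max_features most frequent terms."""
    counts = Counter(token for doc in docs for token in doc.lower().split())
    if max_features is None or max_features >= len(counts):
        # The cap is absent or larger than the vocabulary: keep every term.
        return sorted(counts)
    # Rank by descending corpus frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return sorted(term for term, _ in ranked[:max_features])

# Toy corpus, for illustration only.
toy_docs = [
    "the court held the appeal",
    "the appeal was dismissed",
    "the court dismissed the petition",
]

full = cap_vocabulary(toy_docs)
capped_high = cap_vocabulary(toy_docs, max_features=20000)
capped_low = cap_vocabulary(toy_docs, max_features=3)
print(len(full), len(capped_high), capped_low)
```
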

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set8():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=20000)


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    get_tf_idf_set8, 
    'SET 8', 
    verbose_vocab=True)

Starting fold 0
   Learned 10394 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 1
   Learned 10208 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 2
   Learned 9612 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing 

Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.8394,0.0243,0.5921,0.0368,0.6937,0.0271
LinearSVC,0.7886,0.0243,0.2014,0.0298,0.3193,0.0369
SVC,0.8044,0.0308,0.1046,0.0301,0.1834,0.0481
LogisticRegression,0.7825,0.0246,0.0736,0.0278,0.1328,0.0482
KNeighborsClassifier,0.7505,0.0104,0.4463,0.0358,0.5588,0.0292
DecisionTreeClassifier,0.9968,0.0005,0.9816,0.0015,0.9891,0.0009
RandomForestClassifier,0.9979,0.0004,0.9804,0.0017,0.9891,0.0009
AdaBoostClassifier,0.7014,0.021,0.3563,0.0208,0.472,0.0178
XGBClassifier,0.8847,0.0167,0.266,0.0277,0.408,0.0321
GaussianNB,0.5714,0.0175,1.0,0.0,0.7271,0.0143


Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.4601,0.0686,0.308,0.066,0.3626,0.0529
LinearSVC,0.5328,0.0844,0.0953,0.0669,0.1533,0.0885
SVC,0.2167,0.2963,0.0015,0.0019,0.0029,0.0038
LogisticRegression,0.6397,0.2114,0.0127,0.0048,0.0246,0.009
KNeighborsClassifier,0.4062,0.1048,0.2006,0.0436,0.2669,0.0575
DecisionTreeClassifier,0.3804,0.1164,0.3344,0.0297,0.3505,0.059
RandomForestClassifier,0.6443,0.0923,0.1373,0.0241,0.2251,0.0354
AdaBoostClassifier,0.4824,0.0916,0.2545,0.0652,0.3245,0.0549
XGBClassifier,0.6259,0.0986,0.1327,0.0248,0.2164,0.0339
GaussianNB,0.3167,0.0814,0.5097,0.0434,0.3811,0.0517


CPU times: user 9min 58s, sys: 4min 13s, total: 14min 12s
Wall time: 10min


### Set 9

- N-grams: 1
- Stop words removal: No
- Maximum vocabulary size: 2,000

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set9():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=2000)


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    get_tf_idf_set9, 
    'SET 9', 
    verbose_vocab=True)

Starting fold 0
   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 1
   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 2
   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing mo

Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.8209,0.0176,0.5536,0.0261,0.6606,0.0149
LinearSVC,0.7872,0.0228,0.2028,0.0302,0.3209,0.0375
SVC,0.8032,0.0302,0.1042,0.03,0.1826,0.0479
LogisticRegression,0.7884,0.0287,0.0757,0.0275,0.1365,0.0473
KNeighborsClassifier,0.7283,0.0447,0.4559,0.0366,0.5586,0.0237
DecisionTreeClassifier,0.9947,0.0005,0.9711,0.0025,0.9828,0.0015
RandomForestClassifier,0.997,0.0008,0.9687,0.0031,0.9827,0.0016
AdaBoostClassifier,0.6991,0.0215,0.3663,0.0247,0.48,0.0204
XGBClassifier,0.8868,0.0198,0.2644,0.026,0.4063,0.0293
GaussianNB,0.4295,0.0242,0.9742,0.0061,0.5957,0.0225


Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.4878,0.0684,0.3433,0.0898,0.391,0.0591
LinearSVC,0.528,0.0821,0.104,0.068,0.1657,0.0873
SVC,0.1529,0.3059,0.0041,0.0082,0.008,0.016
LogisticRegression,0.6119,0.144,0.0176,0.0047,0.034,0.0088
KNeighborsClassifier,0.3649,0.0946,0.2003,0.0365,0.2577,0.0514
DecisionTreeClassifier,0.3597,0.101,0.3446,0.0369,0.3451,0.051
RandomForestClassifier,0.6156,0.077,0.188,0.0364,0.2863,0.0473
AdaBoostClassifier,0.4992,0.1019,0.2718,0.0782,0.3409,0.062
XGBClassifier,0.6258,0.1014,0.1269,0.0286,0.2067,0.0322
GaussianNB,0.3227,0.0828,0.6816,0.0503,0.4299,0.0685


CPU times: user 5min 41s, sys: 1min 2s, total: 6min 43s
Wall time: 5min 39s


### Set 10

- N-grams: 1 to 3
- Stop words removal: Yes
- Vocabulary size: unlimited
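
A minimal sketch of how an n-gram range of 1 to 3 expands a token sequence, which is why the vocabulary learned in this set is far larger than in the unigram sets. The helper `word_ngrams` and the example sentence are hypothetical; this is plain Python written for illustration, not the code `TfidfVectorizer` runs internally.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """Return all contiguous word n-grams for every n in the inclusive range."""
    low, high = ngram_range
    ngrams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens over the sequence.
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

# Example sentence, for illustration only.
tokens = "appellant filed writ petition".split()
unigrams = word_ngrams(tokens, (1, 1))
up_to_trigrams = word_ngrams(tokens, (1, 3))
print(len(unigrams), len(up_to_trigrams))
```
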

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set10():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=None, 
      stop_words='english')


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    get_tf_idf_set10, 
    'SET 10', 
    verbose_vocab=True)

Starting fold 0
   Learned 161130 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
Starting fold 1
   Learned 161870 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
Starting fold 2
   Learned 146655 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing 

Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.9657,0.0197,0.8006,0.0725,0.8743,0.0528
LinearSVC,0.8143,0.0129,0.1901,0.0299,0.3069,0.0395
SVC,0.9038,0.017,0.3561,0.0347,0.5097,0.034
LogisticRegression,0.7958,0.035,0.0612,0.0216,0.1125,0.0383
KNeighborsClassifier,0.7728,0.0041,0.4743,0.0407,0.5867,0.0315
DecisionTreeClassifier,0.9977,0.0003,0.9748,0.0015,0.9861,0.0008
RandomForestClassifier,0.999,0.0002,0.9734,0.0017,0.986,0.0009
AdaBoostClassifier,0.6949,0.0135,0.2935,0.0219,0.4122,0.022


Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.3841,0.1215,0.2999,0.0721,0.3267,0.0699
LinearSVC,0.3497,0.1092,0.0834,0.0806,0.1243,0.1043
SVC,0.4473,0.3273,0.0109,0.0091,0.0203,0.0171
LogisticRegression,0.0727,0.1455,0.0013,0.0025,0.0025,0.005
KNeighborsClassifier,0.3127,0.1287,0.169,0.0402,0.2128,0.0564
DecisionTreeClassifier,0.3525,0.0722,0.2921,0.0399,0.3141,0.037
RandomForestClassifier,0.4946,0.0969,0.1769,0.0259,0.2568,0.0266
AdaBoostClassifier,0.4926,0.1081,0.1522,0.0527,0.2283,0.0629


CPU times: user 1h 23min 38s, sys: 18min 49s, total: 1h 42min 27s
Wall time: 1h 25min 58s


### Set 11

- N-grams: 1 to 3
- Stop words removal: Yes
- Maximum vocabulary size: 20,000

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set11():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=20000, 
      stop_words='english')


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    get_tf_idf_set11, 
    'SET 11', 
    verbose_vocab=True)

Starting fold 0
   Learned 20000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 1
   Learned 20000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 2
   Learned 20000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing

Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.9237,0.0053,0.7021,0.0596,0.7965,0.0404
LinearSVC,0.8153,0.0089,0.2101,0.0296,0.3329,0.0374
SVC,0.8953,0.0172,0.3555,0.0333,0.5078,0.0325
LogisticRegression,0.7973,0.0274,0.0699,0.0239,0.1272,0.0416
KNeighborsClassifier,0.7485,0.0433,0.5176,0.0333,0.6098,0.0087
DecisionTreeClassifier,0.9969,0.0003,0.9702,0.0011,0.9834,0.0007
RandomForestClassifier,0.9976,0.0003,0.9695,0.0013,0.9833,0.0007
AdaBoostClassifier,0.6928,0.0119,0.295,0.0203,0.4133,0.0194
XGBClassifier,0.9227,0.0203,0.1634,0.021,0.2769,0.0298
GaussianNB,0.6743,0.0142,1.0,0.0,0.8054,0.0102


Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.3679,0.1071,0.2916,0.0954,0.322,0.095
LinearSVC,0.3644,0.1057,0.0858,0.0821,0.1273,0.1052
SVC,0.467,0.3342,0.0137,0.0105,0.0251,0.0191
LogisticRegression,0.304,0.4021,0.005,0.0079,0.0097,0.0153
KNeighborsClassifier,0.3173,0.1225,0.2135,0.0764,0.2405,0.0737
DecisionTreeClassifier,0.3397,0.0738,0.3209,0.058,0.3261,0.0563
RandomForestClassifier,0.464,0.0861,0.2799,0.0198,0.345,0.0238
AdaBoostClassifier,0.47,0.124,0.1592,0.0519,0.2345,0.0645
XGBClassifier,0.6334,0.0646,0.0546,0.0149,0.1,0.0247
GaussianNB,0.3444,0.0838,0.5031,0.0737,0.3966,0.0469


CPU times: user 15min 13s, sys: 8min 31s, total: 23min 44s
Wall time: 15min 22s


### Set 12

- N-grams: 1 to 3
- Stop words removal: Yes
- Maximum vocabulary size: 2,000

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set12():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=2000, 
      stop_words='english')


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    get_tf_idf_set12, 
    'SET 12', 
    verbose_vocab=True)

Starting fold 0
   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 1
   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 2
   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing mo

Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.8222,0.0167,0.5465,0.0401,0.6557,0.0299
LinearSVC,0.7938,0.0146,0.2156,0.0324,0.3376,0.0401
SVC,0.8716,0.012,0.3167,0.0335,0.4633,0.0346
LogisticRegression,0.7958,0.0177,0.0852,0.0271,0.1525,0.0457
KNeighborsClassifier,0.7686,0.0097,0.4903,0.024,0.5982,0.016
DecisionTreeClassifier,0.9925,0.0011,0.9492,0.0023,0.9703,0.0012
RandomForestClassifier,0.9964,0.0007,0.9455,0.0025,0.9702,0.0012
AdaBoostClassifier,0.6866,0.0133,0.3022,0.0229,0.4193,0.0238
XGBClassifier,0.9213,0.016,0.1619,0.0176,0.2749,0.0251
GaussianNB,0.4096,0.0226,0.9855,0.0039,0.5782,0.0221


Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.3733,0.0859,0.3123,0.0682,0.3374,0.0713
LinearSVC,0.3485,0.1195,0.101,0.0762,0.1501,0.0936
SVC,0.3585,0.2468,0.0211,0.0181,0.038,0.0325
LogisticRegression,0.55,0.1354,0.0223,0.0278,0.0406,0.0485
KNeighborsClassifier,0.3194,0.1212,0.1764,0.0417,0.2226,0.0611
DecisionTreeClassifier,0.3255,0.1025,0.3488,0.0523,0.3295,0.0674
RandomForestClassifier,0.3992,0.0974,0.3707,0.0194,0.3787,0.0466
AdaBoostClassifier,0.4854,0.1212,0.1666,0.0531,0.2446,0.0636
XGBClassifier,0.6607,0.0652,0.0519,0.0179,0.0952,0.0294
GaussianNB,0.3238,0.0832,0.6728,0.0429,0.4298,0.0703


CPU times: user 4min 5s, sys: 53.1 s, total: 4min 58s
Wall time: 4min 4s


### Set 13

- N-grams: 1 to 2
- Stop words removal: Yes
- Vocabulary size: unlimited

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set13():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=None, 
      stop_words='english')


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    get_tf_idf_set13, 
    'SET 13', 
    verbose_vocab=True)

Starting fold 0
   Learned 76363 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
Starting fold 1
   Learned 76670 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
Starting fold 2
   Learned 69441 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing mod

Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.9595,0.015,0.7941,0.0777,0.8674,0.0524
LinearSVC,0.8146,0.012,0.2029,0.0299,0.3235,0.0383
SVC,0.8955,0.0163,0.3532,0.0339,0.5054,0.0334
LogisticRegression,0.8006,0.0263,0.0695,0.0236,0.1267,0.0412
KNeighborsClassifier,0.7466,0.0674,0.5065,0.0422,0.6,0.028
DecisionTreeClassifier,0.9977,0.0003,0.9748,0.0015,0.9861,0.0008
RandomForestClassifier,0.9987,0.0004,0.9737,0.0016,0.986,0.0009
AdaBoostClassifier,0.6893,0.0109,0.2921,0.0263,0.4095,0.0255


Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.3787,0.1053,0.3123,0.0503,0.3327,0.0477
LinearSVC,0.3579,0.0999,0.0823,0.0817,0.1223,0.1049
SVC,0.4371,0.3325,0.0107,0.009,0.0197,0.0167
LogisticRegression,0.2933,0.3969,0.0027,0.0043,0.0052,0.0083
KNeighborsClassifier,0.2901,0.1107,0.1901,0.0276,0.2263,0.0531
DecisionTreeClassifier,0.3556,0.0762,0.3158,0.032,0.3305,0.0414
RandomForestClassifier,0.4962,0.1194,0.2033,0.0228,0.2843,0.0315
AdaBoostClassifier,0.4757,0.1046,0.1632,0.0481,0.2402,0.0583


CPU times: user 44min 54s, sys: 15min 44s, total: 1h 38s
Wall time: 44min 12s


### Set 14

- N-grams: 1 to 2
- Stop words removal: Yes
- Maximum vocabulary size: 20,000

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set14():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=20000, 
      stop_words='english')


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    get_tf_idf_set14, 
    'SET 14', 
    verbose_vocab=True)

Starting fold 0
   Learned 20000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 1
   Learned 20000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 2
   Learned 20000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing

Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.9033,0.0271,0.6436,0.0937,0.749,0.0748
LinearSVC,0.8153,0.0084,0.2122,0.0303,0.3355,0.0383
SVC,0.8928,0.0157,0.354,0.0327,0.5059,0.032
LogisticRegression,0.7982,0.0188,0.0739,0.0243,0.134,0.0419
KNeighborsClassifier,0.7447,0.0529,0.5021,0.046,0.5962,0.0197
DecisionTreeClassifier,0.9971,0.0005,0.9712,0.0018,0.984,0.0011
RandomForestClassifier,0.9978,0.0005,0.9704,0.0019,0.9839,0.0011
AdaBoostClassifier,0.6903,0.0093,0.2933,0.0251,0.411,0.0252
XGBClassifier,0.9262,0.0149,0.162,0.0219,0.275,0.0313
GaussianNB,0.7039,0.0135,1.0,0.0,0.8262,0.0093


Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.3576,0.1121,0.2598,0.1021,0.2969,0.1009
LinearSVC,0.3463,0.1067,0.0842,0.0832,0.1245,0.1064
SVC,0.4396,0.3274,0.0127,0.0112,0.0232,0.0203
LogisticRegression,0.3067,0.4035,0.0064,0.0098,0.0123,0.0186
KNeighborsClassifier,0.2909,0.1235,0.1916,0.0522,0.2199,0.0556
DecisionTreeClassifier,0.3387,0.0795,0.3034,0.061,0.3156,0.0611
RandomForestClassifier,0.4578,0.0905,0.268,0.0266,0.3335,0.0306
AdaBoostClassifier,0.4785,0.1178,0.1588,0.0589,0.2345,0.0718
XGBClassifier,0.5931,0.0991,0.047,0.0202,0.0863,0.0341
GaussianNB,0.3396,0.0776,0.4857,0.0703,0.3885,0.0417


CPU times: user 14min 2s, sys: 7min 20s, total: 21min 23s
Wall time: 14min 12s


### Set 15

- N-grams: 1 to 2
- Stop words removal: Yes
- Maximum vocabulary size: 2,000

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set15():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=2000, 
      stop_words='english')


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    get_tf_idf_set15, 
    'SET 15', 
    verbose_vocab=True)

Starting fold 0
   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 1
   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 2
   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing mo

Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.8234,0.0174,0.5235,0.0629,0.6383,0.0515
LinearSVC,0.797,0.0152,0.2163,0.0315,0.3388,0.0388
SVC,0.8735,0.0111,0.3166,0.0346,0.4634,0.036
LogisticRegression,0.8045,0.0256,0.0868,0.0275,0.1551,0.0462
KNeighborsClassifier,0.7618,0.0145,0.4933,0.0366,0.5977,0.0257
DecisionTreeClassifier,0.9927,0.0004,0.9514,0.0033,0.9716,0.0016
RandomForestClassifier,0.9967,0.0004,0.9475,0.0032,0.9715,0.0016
AdaBoostClassifier,0.6886,0.0126,0.2951,0.0261,0.4123,0.0246
XGBClassifier,0.9173,0.0225,0.1636,0.0201,0.2769,0.0282
GaussianNB,0.4133,0.0221,0.9865,0.0043,0.5822,0.0214


Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.3619,0.094,0.3048,0.0633,0.3267,0.0679
LinearSVC,0.3371,0.1207,0.0985,0.0775,0.146,0.0958
SVC,0.3523,0.2553,0.0206,0.0186,0.0369,0.0334
LogisticRegression,0.542,0.1343,0.0221,0.0285,0.0403,0.0497
KNeighborsClassifier,0.3015,0.1324,0.1692,0.0442,0.2125,0.0676
DecisionTreeClassifier,0.3319,0.1056,0.3654,0.0578,0.3402,0.0716
RandomForestClassifier,0.4048,0.102,0.3718,0.0203,0.3796,0.0384
AdaBoostClassifier,0.4613,0.1169,0.1647,0.0555,0.2391,0.0668
XGBClassifier,0.6199,0.0744,0.0494,0.0134,0.0909,0.0222
GaussianNB,0.3239,0.0853,0.6693,0.0396,0.4294,0.0731


CPU times: user 4min, sys: 51.8 s, total: 4min 51s
Wall time: 4min 1s


### Set 16

- N-grams: 1
- Stop words removal: Yes
- Vocabulary size: unlimited
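
A minimal sketch of stop-word removal on a tokenized sentence. The stop list `TOY_STOP_WORDS` and the helper `remove_stop_words` are hypothetical and deliberately tiny; the built-in `'english'` list used by `TfidfVectorizer(stop_words='english')` is much longer. Removing such words shrinks the unigram vocabulary only slightly, as the vocabulary sizes logged for this set suggest when compared with Set 7.

```python
# Hypothetical stop list, for illustration only.
TOY_STOP_WORDS = frozenset({"the", "was", "by", "of"})

def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Drop every token that appears in the stop list, preserving order."""
    return [token for token in tokens if token not in stop_words]

tokens = "the appeal was dismissed by the court".split()
kept = remove_stop_words(tokens)
print(kept)
```
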

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set16():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=None, 
      stop_words='english')


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    get_tf_idf_set16, 
    'SET 16', 
    verbose_vocab=True)

Starting fold 0
   Learned 10108 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
Starting fold 1
   Learned 9927 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
Starting fold 2
   Learned 9334 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model

Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.914,0.0261,0.7027,0.0687,0.793,0.0497
LinearSVC,0.8018,0.0148,0.2208,0.0317,0.3449,0.0389
SVC,0.8868,0.0099,0.3423,0.034,0.4928,0.0343
LogisticRegression,0.7987,0.0367,0.0874,0.0263,0.156,0.0439
KNeighborsClassifier,0.7806,0.0078,0.4878,0.0271,0.5999,0.0198
DecisionTreeClassifier,0.9968,0.0003,0.9743,0.0014,0.9854,0.0008
RandomForestClassifier,0.9985,0.0006,0.9726,0.001,0.9854,0.0008
AdaBoostClassifier,0.6933,0.0057,0.2912,0.0251,0.4095,0.0254


Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.3574,0.1001,0.3019,0.0559,0.3167,0.0573
LinearSVC,0.3376,0.1177,0.08,0.0846,0.1193,0.1083
SVC,0.4374,0.2768,0.0155,0.0115,0.0283,0.0208
LogisticRegression,0.493,0.4474,0.0086,0.0119,0.0163,0.0223
KNeighborsClassifier,0.3116,0.1362,0.1646,0.0443,0.2082,0.0634
DecisionTreeClassifier,0.3243,0.0611,0.3174,0.0434,0.3161,0.0353
RandomForestClassifier,0.449,0.1014,0.3008,0.021,0.3541,0.0257
AdaBoostClassifier,0.4774,0.131,0.1589,0.0369,0.2358,0.0485


CPU times: user 8min 21s, sys: 9min 26s, total: 17min 47s
Wall time: 10min 32s


### Set 17

- N-grams: 1
- Stop words removal: Yes
- Maximum vocabulary size: 20,000

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set17():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=20000, 
      stop_words='english')


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    get_tf_idf_set17, 
    'SET 17', 
    verbose_vocab=True)

Starting fold 0
   Learned 10108 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 1
   Learned 9927 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 2
   Learned 9334 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing m

Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.914,0.0261,0.7027,0.0687,0.793,0.0497
LinearSVC,0.8018,0.0148,0.2208,0.0317,0.3449,0.0389
SVC,0.8868,0.0099,0.3423,0.034,0.4928,0.0343
LogisticRegression,0.7987,0.0367,0.0874,0.0263,0.156,0.0439
KNeighborsClassifier,0.7806,0.0078,0.4878,0.0271,0.5999,0.0198
DecisionTreeClassifier,0.9968,0.0003,0.9743,0.0014,0.9854,0.0008
RandomForestClassifier,0.9985,0.0006,0.9726,0.001,0.9854,0.0008
AdaBoostClassifier,0.6933,0.0057,0.2912,0.0251,0.4095,0.0254
XGBClassifier,0.9164,0.0228,0.1591,0.0192,0.2705,0.0274
GaussianNB,0.5606,0.0178,1.0,0.0,0.7183,0.0147


Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.3574,0.1001,0.3019,0.0559,0.3167,0.0573
LinearSVC,0.3376,0.1177,0.08,0.0846,0.1193,0.1083
SVC,0.4374,0.2768,0.0155,0.0115,0.0283,0.0208
LogisticRegression,0.493,0.4474,0.0086,0.0119,0.0163,0.0223
KNeighborsClassifier,0.3116,0.1362,0.1646,0.0443,0.2082,0.0634
DecisionTreeClassifier,0.3243,0.0611,0.3174,0.0434,0.3161,0.0353
RandomForestClassifier,0.449,0.1014,0.3008,0.021,0.3541,0.0257
AdaBoostClassifier,0.4774,0.131,0.1589,0.0369,0.2358,0.0485
XGBClassifier,0.6691,0.0716,0.0575,0.0132,0.1054,0.0223
GaussianNB,0.3157,0.081,0.5048,0.0354,0.3799,0.0529


CPU times: user 9min 3s, sys: 5min 41s, total: 14min 44s
Wall time: 9min 33s


### Set 18

- N-grams: 1
- Stop words removal: Yes
- Maximum vocabulary size: 2,000

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set18():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=2000, 
      stop_words='english')


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    get_tf_idf_set18, 
    'SET 18', 
    verbose_vocab=True)

Starting fold 0
   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 1
   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB
Starting fold 2
   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing mo

Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.836,0.0206,0.5179,0.0797,0.6369,0.0686
LinearSVC,0.7931,0.016,0.2216,0.0296,0.3452,0.0358
SVC,0.8809,0.0117,0.3309,0.0349,0.4798,0.0358
LogisticRegression,0.7888,0.0285,0.0908,0.0263,0.1613,0.0436
KNeighborsClassifier,0.7081,0.0787,0.5161,0.0562,0.5906,0.0248
DecisionTreeClassifier,0.9938,0.0002,0.9558,0.0027,0.9744,0.0014
RandomForestClassifier,0.9967,0.0007,0.9528,0.0021,0.9743,0.0013
AdaBoostClassifier,0.6928,0.0086,0.2943,0.0227,0.4125,0.0223
XGBClassifier,0.9249,0.019,0.1572,0.0192,0.2681,0.0276
GaussianNB,0.4168,0.0231,0.9903,0.0036,0.5862,0.0224


Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.3579,0.1083,0.2834,0.0519,0.3116,0.0676
LinearSVC,0.3594,0.1088,0.0948,0.0822,0.1417,0.1017
SVC,0.3553,0.2552,0.0171,0.015,0.0308,0.0269
LogisticRegression,0.481,0.269,0.022,0.0283,0.0398,0.0492
KNeighborsClassifier,0.3059,0.1301,0.224,0.0418,0.2486,0.0483
DecisionTreeClassifier,0.3217,0.0865,0.3439,0.0619,0.3241,0.0596
RandomForestClassifier,0.4134,0.0912,0.3693,0.038,0.3825,0.037
AdaBoostClassifier,0.4745,0.1169,0.1576,0.0467,0.2336,0.0568
XGBClassifier,0.6284,0.0696,0.0537,0.016,0.0984,0.0275
GaussianNB,0.3222,0.0838,0.6628,0.0372,0.4257,0.0694


CPU times: user 3min 48s, sys: 49.8 s, total: 4min 37s
Wall time: 3min 48s
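
A note on reading the tables: the F1 column is not the harmonic mean of the Precision and Recall columns in the same row. For `MLPClassifier` in the second table above, 2PR/(P+R) gives about 0.3163, while the table reports 0.3116. This would be consistent with F1 being computed per fold and then averaged, which is an assumption here since `cross_validation` is defined elsewhere in the notebook. The sketch below shows the difference with hypothetical per-fold values.

```python
import numpy as np

def f1(p, r):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Hypothetical per-fold scores, not taken from the notebook.
fold_precision = np.array([0.50, 0.30, 0.25])
fold_recall = np.array([0.20, 0.35, 0.30])

# Option A: compute F1 in each fold, then average the three values.
mean_of_f1 = np.mean([f1(p, r) for p, r in zip(fold_precision, fold_recall)])

# Option B: average precision and recall first, then compute a single F1.
f1_of_means = f1(fold_precision.mean(), fold_recall.mean())

print(f"mean of per-fold F1: {mean_of_f1:.4f}")
print(f"F1 of mean P and R:  {f1_of_means:.4f}")

# The MLPClassifier row above: F1 recomputed from the averaged P and R.
print(f"{f1(0.3579, 0.2834):.4f}")  # 0.3163
```

In this toy case option A gives the smaller value, which matches the direction of the gap seen in the table.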


### Set 19

- N-grams: 1 to 3
- Stop words removal: No
- Vocabulary size: No limit
- Maximum DF: 0.85

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set19():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=None, 
      max_df=0.85)


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    get_tf_idf_set19, 
    'SET 19', 
    verbose_vocab=True)

Starting fold 0
   Learned 221209 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
Starting fold 1
   Learned 221783 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
Starting fold 2
   Learned 200212 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing 

Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.9771,0.0026,0.8345,0.0332,0.8998,0.019
LinearSVC,0.8392,0.0133,0.1805,0.0303,0.2957,0.0412
SVC,0.9319,0.0149,0.3759,0.0272,0.5349,0.0261
LogisticRegression,0.8002,0.0428,0.054,0.0209,0.1001,0.0378
KNeighborsClassifier,0.7645,0.013,0.4773,0.0276,0.5872,0.022
DecisionTreeClassifier,0.9982,0.0003,0.9783,0.0017,0.9881,0.0009
RandomForestClassifier,0.9992,0.0003,0.9772,0.0022,0.9881,0.001
AdaBoostClassifier,0.7091,0.0214,0.2862,0.0307,0.4066,0.0291


Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.4209,0.1086,0.32,0.0487,0.3581,0.0581
LinearSVC,0.3949,0.1417,0.0708,0.072,0.1118,0.0992
SVC,0.4779,0.2281,0.0161,0.0084,0.0304,0.0159
LogisticRegression,0.0,0.0,0.0,0.0,0.0,0.0
KNeighborsClassifier,0.304,0.1346,0.164,0.0198,0.2056,0.0387
DecisionTreeClassifier,0.3595,0.0944,0.3378,0.0472,0.3363,0.024
RandomForestClassifier,0.5118,0.0982,0.1915,0.0264,0.2764,0.0369
AdaBoostClassifier,0.4542,0.0826,0.1512,0.012,0.2253,0.0183


CPU times: user 1h 53min 39s, sys: 17min 17s, total: 2h 10min 57s
Wall time: 1h 54min 49s


### Set 20

- N-grams: 1 to 2
- Stop words removal: No
- Vocabulary size: No limit
- Maximum DF: 0.85

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set20():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=None, 
      max_df=0.85)


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    get_tf_idf_set20, 
    'SET 20', 
    verbose_vocab=True)

Starting fold 0
   Learned 83016 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
Starting fold 1
   Learned 83020 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
Starting fold 2
   Learned 75668 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing mod

Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.9686,0.005,0.8369,0.023,0.8978,0.0138
LinearSVC,0.8365,0.0129,0.1967,0.0307,0.317,0.0402
SVC,0.9166,0.0143,0.3691,0.0279,0.5255,0.0272
LogisticRegression,0.7764,0.0391,0.0651,0.0227,0.1188,0.0398
KNeighborsClassifier,0.7776,0.0124,0.4783,0.0256,0.592,0.0224
DecisionTreeClassifier,0.9982,0.0003,0.9783,0.0017,0.9881,0.0009
RandomForestClassifier,0.9992,0.0003,0.9772,0.0019,0.9881,0.001
AdaBoostClassifier,0.7055,0.0239,0.2875,0.0319,0.4073,0.0306


Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.3999,0.1161,0.3285,0.0437,0.3545,0.0615
LinearSVC,0.392,0.1389,0.0757,0.0846,0.1143,0.1101
SVC,0.4606,0.224,0.0178,0.0074,0.0335,0.0144
LogisticRegression,0.0,0.0,0.0,0.0,0.0,0.0
KNeighborsClassifier,0.312,0.1315,0.1615,0.0233,0.2067,0.0397
DecisionTreeClassifier,0.3656,0.1023,0.3458,0.043,0.3432,0.027
RandomForestClassifier,0.4833,0.0971,0.2386,0.0135,0.317,0.0303
AdaBoostClassifier,0.4652,0.0994,0.1552,0.0089,0.23,0.0107


CPU times: user 47min 37s, sys: 15min 26s, total: 1h 3min 3s
Wall time: 46min 18s


### Set 21

- N-grams: 1
- Stop words removal: No
- Vocabulary size: No limit
- Maximum DF: 0.85

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf_set21():
  return TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=None, 
      max_df=0.85)


In [None]:
%%time

cross_validation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    get_tf_idf_set21, 
    'SET 21', 
    verbose_vocab=True)

Starting fold 0
   Learned 10290 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
Starting fold 1
   Learned 10102 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
Starting fold 2
   Learned 9513 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing mode

Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.8963,0.0354,0.6741,0.0602,0.7689,0.0507
LinearSVC,0.7996,0.0155,0.2125,0.0319,0.3343,0.0398
SVC,0.8782,0.0128,0.349,0.0274,0.4987,0.0268
LogisticRegression,0.7828,0.0307,0.0844,0.0252,0.151,0.0422
KNeighborsClassifier,0.736,0.0592,0.5179,0.0361,0.6053,0.0237
DecisionTreeClassifier,0.9959,0.0005,0.9733,0.0015,0.9844,0.0007
RandomForestClassifier,0.9977,0.0011,0.9713,0.0022,0.9844,0.0007
AdaBoostClassifier,0.6936,0.0089,0.2932,0.0255,0.4115,0.0244


Model,Precision,P std,Recall,R std,F1,F1 std
MLPClassifier,0.3562,0.1047,0.2914,0.0581,0.3145,0.0675
LinearSVC,0.384,0.109,0.0843,0.0835,0.1282,0.108
SVC,0.4376,0.2131,0.0227,0.0059,0.042,0.0111
LogisticRegression,0.2857,0.3938,0.0018,0.0023,0.0037,0.0045
KNeighborsClassifier,0.3084,0.1173,0.2255,0.0474,0.2585,0.0707
DecisionTreeClassifier,0.3451,0.1197,0.3487,0.0144,0.3358,0.051
RandomForestClassifier,0.4236,0.0956,0.321,0.0121,0.3609,0.0368
AdaBoostClassifier,0.428,0.0958,0.1413,0.0296,0.2116,0.0434


CPU times: user 6min 56s, sys: 4min 21s, total: 11min 18s
Wall time: 7min 10s
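
Sets 19 to 21 differ only in `ngram_range`, and the fold 0 logs report 221,209, 83,016 and 10,290 learned terms respectively. The sketch below reproduces that growth on a toy corpus and shows how `max_df=0.85` drops terms that occur in more than 85% of the documents. The sentences are hypothetical and the notebook's `preprocess` function is left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents; "the" occurs in all four, "court" in three.
toy_docs = [
    "the court heard the appeal",
    "the court allowed the appeal",
    "the court dismissed the petition",
    "the tribunal rejected the claim",
]

def fit_vectorizer(max_n):
    """Fit a vectorizer using n-grams of length 1 to max_n and max_df=0.85."""
    return TfidfVectorizer(ngram_range=(1, max_n), max_df=0.85).fit(toy_docs)

unigrams = fit_vectorizer(1)
bigrams = fit_vectorizer(2)
trigrams = fit_vectorizer(3)

# "the" has a document frequency of 4/4 = 1.0 > 0.85, so it is discarded.
# "court" has 3/4 = 0.75 <= 0.85, so it is kept.
print("the" in unigrams.vocabulary_, "court" in unigrams.vocabulary_)

# The vocabulary grows with the maximum n-gram length.
for name, vec in [("1", unigrams), ("1 to 2", bigrams), ("1 to 3", trigrams)]:
    print(f"n-grams {name}: {len(vec.vocabulary_)} terms")
```

The document-frequency filter is applied to each n-gram separately, so a bigram such as "the court" can stay in the vocabulary even though the unigram "the" is removed.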


### Summary

In [None]:
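from IPython.display import display, update_display

pd.set_option("display.max_rows", None)

# Build one row per (model, TF-IDF set) pair.
# Each entry m is indexed as: m[0] = TF-IDF set name,
# m[1] = (precision, recall, F1), m[2] = the matching standard deviations.
columns = ['Model', 'TF-IDF set', 'Precision', 'P STD', 'Recall', 'R STD', 'F1', 'F1 STD']
rows = []
for model_name, metrics in test_metrics.items():
  for m in metrics:
    rows.append([
        model_name, m[0], 
        f'{m[1][0]:.4f}', f'{m[2][0]:.4f}',   # precision and its std
        f'{m[1][1]:.4f}', f'{m[2][1]:.4f}',   # recall and its std
        f'{m[1][2]:.4f}', f'{m[2][2]:.4f}'])  # F1 and its std
metrics_df = pd.DataFrame(rows, columns=columns)
metrics_display = display(metrics_df, display_id='metrics_table')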

Row,Model,TF-IDF set,Precision,P STD,Recall,R STD,F1,F1 STD
0,MLPClassifier,SET 1,0.4779,0.0711,0.3486,0.0645,0.3976,0.0505
1,MLPClassifier,SET 2,0.4686,0.0849,0.3555,0.0512,0.4005,0.0498
2,MLPClassifier,SET 3,0.5032,0.0655,0.3556,0.0563,0.4117,0.041
3,MLPClassifier,SET 4,0.4684,0.0686,0.3539,0.0542,0.3977,0.039
4,MLPClassifier,SET 5,0.4592,0.08,0.3567,0.0645,0.3976,0.056
5,MLPClassifier,SET 6,0.4983,0.052,0.3455,0.0609,0.4034,0.0437
6,MLPClassifier,SET 7,0.4601,0.0686,0.308,0.066,0.3626,0.0529
7,MLPClassifier,SET 8,0.4601,0.0686,0.308,0.066,0.3626,0.0529
8,MLPClassifier,SET 9,0.4878,0.0684,0.3433,0.0898,0.391,0.0591
9,MLPClassifier,SET 10,0.3841,0.1215,0.2999,0.0721,0.3267,0.0699
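
To pick the best TF-IDF set per model from a table like this, one option is to group by model and take the row with the highest F1. The sketch below is a minimal example that uses only a few of the `MLPClassifier` rows shown above; the `summary` dataframe is a hypothetical stand-in for `metrics_df`. Because the summary cell stores the metrics as formatted strings, the F1 column is converted to `float` before comparing.

```python
import pandas as pd

# A few rows copied from the summary table above (F1 kept as strings, as in metrics_df).
summary = pd.DataFrame(
    [
        ("MLPClassifier", "SET 1", "0.3976"),
        ("MLPClassifier", "SET 2", "0.4005"),
        ("MLPClassifier", "SET 3", "0.4117"),
        ("MLPClassifier", "SET 6", "0.4034"),
        ("MLPClassifier", "SET 10", "0.3267"),
    ],
    columns=["Model", "TF-IDF set", "F1"],
)

# Convert the formatted strings back to numbers before ranking.
summary["F1"] = summary["F1"].astype(float)

# For each model, keep the row whose F1 is highest.
best = summary.loc[summary.groupby("Model")["F1"].idxmax()]
print(best)
```

Among the rows listed here, SET 3 has the highest F1 for `MLPClassifier` (0.4117).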


### Reference paper

> Paheli Bhattacharya, Shounak Paul, Kripabandhu Ghosh, Saptarshi Ghosh, and Adam Wyner. 2019. **Identification of Rhetorical Roles of Sentences in Indian Legal Judgments**. In Proc. International Conference on Legal Knowledge and Information Systems (JURIX).

