<a href="https://colab.research.google.com/github/alexlimatds/fact_extraction/blob/main/AILA2020/FACTS_AILA_TF_IDF_approach_1_test_set.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Facts extraction with AILA data and TF-IDF features

This notebook experiments with TF-IDF features to find the best hyperparameters (n-gram range, stop-word removal, and maximum vocabulary size).

The TF-IDF weights are computed over sentences instead of the traditional document-based approach:

- Sentences are used to train the TF-IDF model.
- To compute the TF-IDF vector of a sentence, the sentence is fed into the trained TF-IDF model.
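
The sentence-level setup described above can be sketched as follows. This is a minimal illustration on a made-up corpus, not the notebook's data: `toy_sentences` is hypothetical and the vectorizer uses scikit-learn's default settings for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: each item is one sentence, not a whole document.
toy_sentences = [
    "the appellant filed a suit",
    "the court dismissed the suit",
    "the appeal was allowed",
]

# Fit on sentences, so document frequencies are counted per sentence.
vectorizer = TfidfVectorizer()
vectorizer.fit(toy_sentences)

# Each sentence is then transformed into its own TF-IDF vector.
vector = vectorizer.transform(["the court allowed the appeal"])
print(vector.shape)  # (1, vocabulary size)
```

The result is a sparse row with one column per vocabulary term, which is the kind of feature vector the classifiers below receive.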

Data used in this notebook:

- for training: the training dataset from AILA 2020, available at https://github.com/Law-AI/semantic-segmentation;
- for testing: additional training documents from AILA 2021.

### Loading dataset

In [1]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
g_drive_dir = '/content/gdrive/MyDrive/'
dataset_dir = 'fact_extraction_AILA/'

Mounted at /content/gdrive


In [2]:
!rm -r data
!mkdir data
!mkdir data/train
!tar -xf {g_drive_dir}{dataset_dir}/train.tar.xz -C data/train
!mkdir data/test
!tar -xf {g_drive_dir}{dataset_dir}/test.tar.xz -C data/test

train_dir = 'data/train/'
test_dir = 'data/test/'

rm: cannot remove 'data': No such file or directory


In [3]:
import pandas as pd
from os import listdir
import csv

def read_docs(dir_name):
  """
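  Read the docs in a directory. Each file is expected to be tab-separated, 
  without a header, with one sentence and its label per line.
  Params:
    dir_name : the directory that contains the documents (with a trailing slash).
  Returns:
    A dictionary with the keys 'sentences' and 'labels', each one holding 
    a list with the values gathered from all the files.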
  """
  sentences = []
  labels = []
  for f in listdir(dir_name):
    df = pd.read_csv(
        dir_name + f, 
        sep='\t', 
        quoting=csv.QUOTE_NONE, 
        names=['sentence', 'label'])
    sentences.extend(df['sentence'].to_list())
    labels.extend(df['label'].to_list())
  return {'sentences': sentences, 'labels': labels}

dic_train = read_docs(train_dir)
dic_test = read_docs(test_dir)

print('Number of train sentences: ', len(dic_train['sentences']))
print('Number of test sentences: ', len(dic_test['sentences']))

Number of train sentences:  9380
Number of test sentences:  1905
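
The reading step above relies on `quoting=csv.QUOTE_NONE`, which makes pandas treat quote characters as ordinary text. The sketch below shows that behaviour on a hypothetical two-line file held in memory; the sentences are invented for illustration and are not taken from the AILA data.

```python
import csv
import io

import pandas as pd

# Hypothetical file content: tab-separated, no header, one sentence per line.
# The first sentence starts with an unbalanced double quote.
raw = '"Held, the appeal fails.\tOther\nThe suit was filed in 1990.\tFacts\n'

df = pd.read_csv(
    io.StringIO(raw),
    sep='\t',
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    names=['sentence', 'label'])

print(df['label'].to_list())
```

With this option the leading quote stays part of the sentence and both rows keep their own label.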


### Counting sentences by label

In [4]:
def target_stats(set_name, targets):
  stats = {}
  for t in targets:
    stats[t] = stats.get(t, 0) + 1
  print(f'Statistics of the {set_name} set:')
  print(f'   Total number of sentences: {len(targets)}')
  for t, n in stats.items():
    print(f'   Number of {t} labels: {n}')

target_stats('TRAIN', dic_train['labels'])
target_stats('TEST', dic_test['labels'])

Statistics of the TRAIN set:
   Total number of sentences: 9380
   Number of Facts labels: 2219
   Number of Other labels: 7161
Statistics of the TEST set:
   Total number of sentences: 1905
   Number of Other labels: 1502
   Number of Facts labels: 403
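
The counts above show that Facts is the minority class in both sets. As a small worked check, the share of Facts sentences can be computed from the printed numbers; the `counts` dictionary below simply copies them by hand.

```python
# Label counts copied from the statistics printed above.
counts = {
    'TRAIN': {'Facts': 2219, 'Other': 7161},
    'TEST': {'Facts': 403, 'Other': 1502},
}

facts_share = {}
for set_name, by_label in counts.items():
    total = sum(by_label.values())
    facts_share[set_name] = by_label['Facts'] / total
    print(f'{set_name}: {facts_share[set_name]:.2%} of sentences are Facts')
```

Roughly one sentence in four or five is a Facts sentence, which is worth keeping in mind when reading precision and recall for the Facts class.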


### Evaluation functions

In [36]:
import sklearn
from sklearn.model_selection import KFold
from sklearn.metrics import precision_recall_fscore_support
from IPython.display import display, HTML
import numpy as np

test_metrics = {}

def get_features(train_sentences, test_sentences, tfidf_model, to_dense):
  if to_dense:
    train_features = tfidf_model.transform(train_sentences).toarray()
    test_features = tfidf_model.transform(test_sentences).toarray()
  else:
    train_features = tfidf_model.transform(train_sentences)
    test_features = tfidf_model.transform(test_sentences)
  return train_features, test_features

def report_set(metrics_dic, dataset_description):
  report_df = pd.DataFrame(columns=['Precision', 'Recall', 'F1'])
  for model_name, model_metrics in metrics_dic.items():
    report_df.loc[model_name] = [
        f'{model_metrics[0]:.4f}',  # precision
        f'{model_metrics[1]:.4f}',  # recall
        f'{model_metrics[2]:.4f}']  # f1
  display(HTML(f'<br><span style="font-weight: bold">{dataset_description} scores</span>'))
  display(report_df)

def evaluation(model_tuples, tfidf_model, set_description, verbose_vocab=False):
  """
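  Fits the TF-IDF model on the train sentences, then trains each model and 
  reports its scores on the train and test sets.
  Params:
    model_tuples    : A list of tuples. In each tuple, the first element is a function 
                      returning an unfitted machine learning model and the second one 
                      is a flag telling whether the features must be converted to a 
                      dense numpy array (True) or kept as a sparse matrix (False).
    tfidf_model     : An unfitted TF-IDF model.
    set_description : Text description of the feature set.
    verbose_vocab   : Whether the size of the vocabulary must be printed.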
  """
  train_metrics_set = {}
  test_metrics_set = {}

  train_sentences, train_targets = dic_train['sentences'], dic_train['labels']
  test_sentences, test_targets = dic_test['sentences'], dic_test['labels']
  tfidf_model.fit(train_sentences)
  if verbose_vocab:
    print(f'   Learned {len(tfidf_model.vocabulary_)} terms.')
  
  last_to_dense = None
  for (model_builder, to_dense) in model_tuples:
    model = model_builder()
    model_name = model.__class__.__name__
    print(f'   Processing model: {model_name}')
    if last_to_dense != to_dense:
      train_features, test_features = get_features(train_sentences, test_sentences, tfidf_model, to_dense)
    last_to_dense = to_dense
    model.fit(train_features, train_targets)
    # test metrics
    predictions = model.predict(test_features)
    p_test, r_test, f1_test, _ = precision_recall_fscore_support(
        test_targets, 
        predictions, 
        average='binary', 
        pos_label='Facts', 
        zero_division=0)
    test_metrics_set[model_name] = (p_test, r_test, f1_test)
    # train metrics
    predictions = model.predict(train_features)
    p_train, r_train, f1_train, _ = precision_recall_fscore_support(
        train_targets, 
        predictions, 
        average='binary', 
        pos_label='Facts', 
        zero_division=0)
    train_metrics_set[model_name] = (p_train, r_train, f1_train)

    # metrics for the summary
    summary_model_metrics = test_metrics.get(model_name, [])
    summary_model_metrics.append((set_description, p_test, r_test, f1_test))
    test_metrics[model_name] = summary_model_metrics

  # reporting the achieved metrics
  report_set(train_metrics_set, 'TRAIN SET')
  report_set(test_metrics_set, 'TEST SET')
  """
  # train metrics
  report_df_train = pd.DataFrame(columns=['Precision', 'Recall', 'F1'])
  for model_name, metrics in train_metrics_set.items():
    model_metrics = np.array(metrics)
    mean = np.mean(model_metrics, axis=0)
    std = np.std(model_metrics, axis=0)
    report_df_train.loc[model_name] = [
        f'{mean[0]:.4f}', f'{std[0]:.4f}',  # precision
        f'{mean[1]:.4f}', f'{std[1]:.4f}',  # recall
        f'{mean[2]:.4f}', f'{std[2]:.4f}']  # f1
  display(HTML(f'<br><span style="font-weight: bold">TRAIN: cross-validation averages</span>'))
  display(report_df_train)
  # test metrics
  report_df_test = pd.DataFrame(columns=['Precision', 'P std', 'Recall', 'R std', 'F1', 'F1 std'])
  for model_name, metrics in test_metrics_set.items():
    model_metrics = np.array(metrics)
    mean = np.mean(model_metrics, axis=0)
    std = np.std(model_metrics, axis=0)
    report_df_test.loc[model_name] = [
        f'{mean[0]:.4f}', f'{std[0]:.4f}',  # precision
        f'{mean[1]:.4f}', f'{std[1]:.4f}',  # recall
        f'{mean[2]:.4f}', f'{std[2]:.4f}']  # f1
    # metrics for the summary
    summary_model_metrics = test_metrics.get(model_name, [])
    summary_model_metrics.append((set_description, mean, std))
    test_metrics[model_name] = summary_model_metrics
  display(HTML(f'<br><span style="font-weight: bold">TEST: cross-validation averages</span>'))
  display(report_df_test)
  """

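The metric call used in `evaluation` scores only the Facts class (`average='binary'`, `pos_label='Facts'`). A minimal sketch on hypothetical labels, with counts small enough to verify by hand:

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical gold labels and predictions for six sentences.
y_true = ['Facts', 'Facts', 'Facts', 'Other', 'Other', 'Other']
y_pred = ['Facts', 'Facts', 'Other', 'Facts', 'Other', 'Other']

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true,
    y_pred,
    average='binary',
    pos_label='Facts',
    zero_division=0)

# 2 true positives, 1 false positive, 1 false negative.
print(f'{precision:.4f} {recall:.4f} {f1:.4f}')
```

Correct predictions of the Other class do not enter any of the three scores, so a model that rarely predicts Facts gets a low recall even if most of its outputs are right.
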
### Pre-processing function

In [6]:
import re

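def preprocess(text):
  # 'text' replaces the original parameter name, which shadowed the built-in str
  pstr = text
  pstr = re.sub(r'[/(){}\[\]\|@,;]', ' ', pstr) # replaces symbols with spaces
  pstr = re.sub(r'[^0-9a-z #+_]', '', pstr)     # keeps only digits, lowercase letters, spaces, #, + and _
  pstr = re.sub(r'\d+', '', pstr)               # removes numbers
  return pstr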
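
The effect of the three substitutions is easier to see on an example. The sketch below uses a standalone copy of the function (named `preprocess_copy` here only for the demo) and an invented sentence. Because the second pattern keeps lowercase letters only, and scikit-learn skips its own lowercasing when a custom `preprocessor` is supplied, capital letters are deleted rather than folded to lowercase. Lowercasing the input first is shown purely as a hypothetical alternative; it is not what the notebook does and would change the scores reported below.

```python
import re

def preprocess_copy(text):
    # Same three substitutions as preprocess above, copied for a standalone demo.
    text = re.sub(r'[/(){}\[\]\|@,;]', ' ', text)
    text = re.sub(r'[^0-9a-z #+_]', '', text)
    text = re.sub(r'\d+', '', text)
    return text

sample = 'The appellant filed (on 12/03/1998) an appeal.'

# Behaviour as written: the capital 'T' is removed, leaving 'he'.
as_written = preprocess_copy(sample).split()

# Hypothetical variant: lowercase before cleaning, so 'the' survives.
lowercased_first = preprocess_copy(sample.lower()).split()

print(as_written)
print(lowercased_first)
```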

### Models

#### MLP

In [7]:
from sklearn.neural_network import MLPClassifier

def mlp():
  # MLP from scikit-learn with default settings, except for early stopping
  return MLPClassifier(early_stopping=True, random_state=1)

#### Linear SVM

In [8]:
from sklearn.svm import LinearSVC

def linear_svm():
  return LinearSVC(random_state=1)

#### RBF SVM

In [9]:
from sklearn.svm import SVC

def rbf_svm():
  return SVC(kernel='rbf', random_state=1)

#### Logistic regression

In [10]:
from sklearn.linear_model import LogisticRegression

def logistic_regression():
  return LogisticRegression(solver='sag', max_iter=200, random_state=1)

#### KNN

In [11]:
from sklearn.neighbors import KNeighborsClassifier

def knn():
  return KNeighborsClassifier(n_neighbors=5)

#### Decision Tree

In [12]:
from sklearn.tree import DecisionTreeClassifier

def decision_tree():
  return DecisionTreeClassifier(random_state=1)

#### Random Forest

In [13]:
from sklearn.ensemble import RandomForestClassifier

def random_forest():
  return RandomForestClassifier(random_state=1)

#### AdaBoost

In [14]:
from sklearn.ensemble import AdaBoostClassifier

def adaboost():
  return AdaBoostClassifier(random_state=1)

#### Naive Bayes

In [15]:
from sklearn.naive_bayes import GaussianNB

def naive_bayes():
  return GaussianNB()

#### XGBoost

In [16]:
from xgboost.sklearn import XGBClassifier

def xgboost():
  return XGBClassifier(objective="binary:logistic", tree_method='hist')

### Set 1

- N-grams: 1 to 3
- Stop words removal: No
- Vocabulary's size: No limits

Notes:
- Naive Bayes and XGBoost are not applied because there is not enough RAM to run them.
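
A back-of-the-envelope estimate illustrates the memory note for Naive Bayes, which is run on dense arrays (`to_dense=True`) in the other sets. The sketch assumes 8-byte `float64` values, the default dtype of `TfidfVectorizer`, and uses the vocabulary size printed in this set's output. It covers only the dense feature matrices and says nothing about XGBoost's own memory use.

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed by a dense matrix of the given shape (float64 by default)."""
    return n_rows * n_cols * itemsize

n_train, n_test = 9380, 1905   # sentence counts reported earlier
vocab_unlimited = 237147       # terms learned for Set 1 (1- to 3-grams, no cap)
vocab_capped = 20000           # cap used by the capped sets

for vocab in (vocab_unlimited, vocab_capped):
    total = dense_bytes(n_train, vocab) + dense_bytes(n_test, vocab)
    print(f'{vocab} terms: {total / 1e9:.1f} GB')
```

Under these assumptions the uncapped vocabulary needs on the order of 20 GB for the dense train and test matrices, against under 2 GB with a 20,000-term cap.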

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set1 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=None)


In [40]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    model_set1, 
    'SET 1', 
    verbose_vocab=True)

   Learned 237147 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier


TRAIN SET scores

Model,Precision,Recall,F1
MLPClassifier,0.9793,0.9171,0.9472
LinearSVC,0.9986,0.9703,0.9842
SVC,0.9995,0.941,0.9694
LogisticRegression,0.983,0.3641,0.5314
KNeighborsClassifier,0.5633,0.2307,0.3274
DecisionTreeClassifier,0.9945,0.9757,0.985
RandomForestClassifier,0.9981,0.9721,0.9849
AdaBoostClassifier,0.6705,0.3384,0.4498


TEST SET scores

Model,Precision,Recall,F1
MLPClassifier,0.6939,0.4218,0.5247
LinearSVC,0.7078,0.3846,0.4984
SVC,0.6941,0.1464,0.2418
LogisticRegression,0.6988,0.1439,0.2387
KNeighborsClassifier,0.2343,0.1017,0.1419
DecisionTreeClassifier,0.4062,0.3598,0.3816
RandomForestClassifier,0.6364,0.1042,0.1791
AdaBoostClassifier,0.522,0.2357,0.3248


CPU times: user 12min 8s, sys: 1min 22s, total: 13min 31s
Wall time: 12min 47s
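
The scores in these tables follow directly from confusion counts. As a worked check, the MLPClassifier row of the second table above (test set) is consistent with 170 true positives, 75 false positives and 233 false negatives among the 403 test Facts sentences. These counts are inferred from the rounded scores, not printed by the notebook, so treat them as an assumption.

```python
n_facts_test = 403  # Facts sentences in the test set

# Counts inferred from the rounded MLPClassifier test scores (an assumption).
tp, fp = 170, 75
fn = n_facts_test - tp

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f'{precision:.4f} {recall:.4f} {f1:.4f}')  # 0.6939 0.4218 0.5247
```

Read this way, the model flags 245 sentences as Facts and recovers fewer than half of the true ones.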


### Set 2

- N-grams: 1 to 3
- Stop words removal: No
- Maximum vocabulary's size: 20,000
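
The two settings that vary across sets, `ngram_range` and `max_features`, can be illustrated on a tiny corpus. The sketch below is hypothetical: it uses two invented sentences and omits the notebook's `preprocess` function for brevity. With `max_features` set, scikit-learn keeps the terms with the highest total frequency in the training corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-sentence corpus.
toy_sentences = ['the court held', 'the court dismissed the appeal']

uncapped = TfidfVectorizer(ngram_range=(1, 3), max_features=None)
uncapped.fit(toy_sentences)

capped = TfidfVectorizer(ngram_range=(1, 3), max_features=3)
capped.fit(toy_sentences)

print(len(uncapped.vocabulary_))   # all distinct 1-, 2- and 3-grams
print(sorted(capped.vocabulary_))  # the 3 most frequent terms in the corpus
```

Even on two short sentences, the 1- to 3-gram vocabulary is several times larger than the number of distinct words, which is why the uncapped sets learn so many terms.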


In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set2 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=20000)


In [43]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set2, 
    'SET 2', 
    verbose_vocab=True)

   Learned 20000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


TRAIN SET scores

Model,Precision,Recall,F1
MLPClassifier,0.9097,0.6399,0.7513
LinearSVC,0.9848,0.9338,0.9586
SVC,0.9954,0.8833,0.936
LogisticRegression,0.9244,0.4957,0.6454
KNeighborsClassifier,0.6268,0.2596,0.3671
DecisionTreeClassifier,0.9935,0.9703,0.9818
RandomForestClassifier,0.9977,0.9662,0.9817
AdaBoostClassifier,0.6872,0.3515,0.4651
XGBClassifier,0.898,0.2262,0.3614
GaussianNB,0.6702,1.0,0.8025


TEST SET scores

Model,Precision,Recall,F1
MLPClassifier,0.7277,0.3449,0.468
LinearSVC,0.6789,0.4144,0.5146
SVC,0.7206,0.2432,0.3636
LogisticRegression,0.7246,0.2481,0.3697
KNeighborsClassifier,0.2238,0.1166,0.1533
DecisionTreeClassifier,0.4529,0.3821,0.4145
RandomForestClassifier,0.6771,0.1613,0.2605
AdaBoostClassifier,0.5294,0.201,0.2914
XGBClassifier,0.5538,0.0893,0.1538
GaussianNB,0.491,0.5434,0.5159


CPU times: user 2min 22s, sys: 1min 39s, total: 4min 1s
Wall time: 2min 54s


### Set 3

- N-grams: 1 to 3
- Stop words removal: No
- Maximum vocabulary's size: 2,000

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set3 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=2000)


In [37]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set3, 
    'SET 3', 
    verbose_vocab=True)

   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


TRAIN SET scores

Model,Precision,Recall,F1
MLPClassifier,0.7814,0.5687,0.6583
LinearSVC,0.831,0.6494,0.7291
SVC,0.9687,0.7801,0.8642
LogisticRegression,0.811,0.4642,0.5904
KNeighborsClassifier,0.5409,0.393,0.4552
DecisionTreeClassifier,0.992,0.9545,0.9729
RandomForestClassifier,0.9967,0.95,0.9728
AdaBoostClassifier,0.6716,0.3502,0.4603
XGBClassifier,0.8932,0.2298,0.3656
GaussianNB,0.4343,0.9256,0.5912


TEST SET scores

Model,Precision,Recall,F1
MLPClassifier,0.6442,0.4268,0.5134
LinearSVC,0.5867,0.4367,0.5007
SVC,0.6774,0.3127,0.4278
LogisticRegression,0.6859,0.3251,0.4411
KNeighborsClassifier,0.3203,0.2233,0.2632
DecisionTreeClassifier,0.4037,0.4318,0.4173
RandomForestClassifier,0.6619,0.2283,0.3395
AdaBoostClassifier,0.5309,0.2134,0.3044
XGBClassifier,0.5571,0.0968,0.1649
GaussianNB,0.3803,0.8511,0.5257


CPU times: user 1min 1s, sys: 4.88 s, total: 1min 6s
Wall time: 1min 2s


### Set 4

- N-grams: 1 to 2
- Stop words removal: No
- Vocabulary's size: No limits

Notes:
- Naive Bayes and XGBoost are not applied because there is not enough RAM to run them.

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set4 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=None)


In [45]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    model_set4, 
    'SET 4', 
    verbose_vocab=True)

   Learned 89419 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier


TRAIN SET scores

Model,Precision,Recall,F1
MLPClassifier,0.98,0.9261,0.9523
LinearSVC,0.9986,0.9676,0.9828
SVC,0.9985,0.9252,0.9605
LogisticRegression,0.9686,0.4588,0.6226
KNeighborsClassifier,0.6076,0.2443,0.3484
DecisionTreeClassifier,0.9945,0.9757,0.985
RandomForestClassifier,0.9991,0.9712,0.9849
AdaBoostClassifier,0.6698,0.3502,0.4599


TEST SET scores

Model,Precision,Recall,F1
MLPClassifier,0.7093,0.4541,0.5537
LinearSVC,0.6903,0.3871,0.496
SVC,0.6893,0.1762,0.2806
LogisticRegression,0.703,0.1762,0.2817
KNeighborsClassifier,0.1774,0.0819,0.1121
DecisionTreeClassifier,0.4172,0.3499,0.3806
RandomForestClassifier,0.6774,0.1042,0.1806
AdaBoostClassifier,0.5089,0.2134,0.3007


CPU times: user 5min 44s, sys: 1min 39s, total: 7min 23s
Wall time: 5min 36s


### Set 5

- N-grams: 1 to 2
- Stop words removal: No
- Maximum vocabulary's size: 20,000

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set5 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=20000)


In [47]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set5, 
    'SET 5', 
    verbose_vocab=True)

   Learned 20000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


TRAIN SET scores

Model,Precision,Recall,F1
MLPClassifier,0.9333,0.763,0.8396
LinearSVC,0.9871,0.9342,0.9599
SVC,0.9945,0.8896,0.9391
LogisticRegression,0.9269,0.5029,0.6521
KNeighborsClassifier,0.6239,0.2587,0.3657
DecisionTreeClassifier,0.9931,0.9712,0.982
RandomForestClassifier,0.9977,0.9667,0.9819
AdaBoostClassifier,0.6738,0.3425,0.4541
XGBClassifier,0.9004,0.2199,0.3535
GaussianNB,0.6963,1.0,0.8209


TEST SET scores

Model,Precision,Recall,F1
MLPClassifier,0.7626,0.4144,0.537
LinearSVC,0.66,0.4094,0.5054
SVC,0.7368,0.2432,0.3657
LogisticRegression,0.7154,0.2308,0.349
KNeighborsClassifier,0.2,0.1067,0.1392
DecisionTreeClassifier,0.4262,0.3797,0.4016
RandomForestClassifier,0.6477,0.1414,0.2322
AdaBoostClassifier,0.5576,0.2283,0.3239
XGBClassifier,0.5484,0.0844,0.1462
GaussianNB,0.4974,0.4764,0.4867


CPU times: user 2min 15s, sys: 41.7 s, total: 2min 56s
Wall time: 2min 15s


### Set 6

- N-grams: 1 to 2
- Stop words removal: No
- Maximum vocabulary's size: 2,000

In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set6 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=2000)


In [49]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set6, 
    'SET 6', 
    verbose_vocab=True)

   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


TRAIN SET scores

Model,Precision,Recall,F1
MLPClassifier,0.787,0.5777,0.6663
LinearSVC,0.8353,0.6607,0.7378
SVC,0.9705,0.7864,0.8688
LogisticRegression,0.8145,0.4709,0.5968
KNeighborsClassifier,0.6508,0.3646,0.4674
DecisionTreeClassifier,0.9916,0.9549,0.9729
RandomForestClassifier,0.9953,0.9513,0.9728
AdaBoostClassifier,0.6789,0.3574,0.4683
XGBClassifier,0.8966,0.2267,0.3619
GaussianNB,0.4179,0.941,0.5788


TEST SET scores

Model,Precision,Recall,F1
MLPClassifier,0.6729,0.4442,0.5351
LinearSVC,0.6194,0.4442,0.5173
SVC,0.6828,0.3151,0.4312
LogisticRegression,0.6927,0.33,0.4471
KNeighborsClassifier,0.3004,0.196,0.2372
DecisionTreeClassifier,0.4016,0.3747,0.3877
RandomForestClassifier,0.6786,0.2357,0.3499
AdaBoostClassifier,0.5215,0.2109,0.3004
XGBClassifier,0.5758,0.0943,0.162
GaussianNB,0.379,0.871,0.5282


CPU times: user 1min 1s, sys: 5.19 s, total: 1min 6s
Wall time: 1min 1s


### Set 7

- N-grams: 1
- Stop words removal: No
- Vocabulary's size: No limits

In [50]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set7 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=None)


In [51]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    model_set7, 
    'SET 7', 
    verbose_vocab=True)

   Learned 11376 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier


TRAIN SET scores

Model,Precision,Recall,F1
MLPClassifier,0.8765,0.7643,0.8166
LinearSVC,0.951,0.8571,0.9016
SVC,0.9877,0.8675,0.9237
LogisticRegression,0.8892,0.5061,0.645
KNeighborsClassifier,0.6646,0.2857,0.3996
DecisionTreeClassifier,0.994,0.9757,0.9848
RandomForestClassifier,0.9977,0.9721,0.9847
AdaBoostClassifier,0.6731,0.3452,0.4564


TEST SET scores

Model,Precision,Recall,F1
MLPClassifier,0.6614,0.4169,0.5114
LinearSVC,0.5992,0.3747,0.4611
SVC,0.6556,0.2457,0.3574
LogisticRegression,0.6986,0.2531,0.3716
KNeighborsClassifier,0.2529,0.1092,0.1525
DecisionTreeClassifier,0.3983,0.34,0.3668
RandomForestClassifier,0.7105,0.134,0.2255
AdaBoostClassifier,0.5063,0.1985,0.2852


CPU times: user 1min 16s, sys: 21 s, total: 1min 37s
Wall time: 1min 14s


### Set 8

- N-grams: 1
- Stop words removal: No
- Maximum vocabulary's size: 20,000 (not reached: only 11,376 unigrams are learned, so the features are the same as in Set 7)

In [52]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set8 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=20000)


In [53]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set8, 
    'SET 8', 
    verbose_vocab=True)

   Learned 11376 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


TRAIN SET scores

Model,Precision,Recall,F1
MLPClassifier,0.8765,0.7643,0.8166
LinearSVC,0.951,0.8571,0.9016
SVC,0.9877,0.8675,0.9237
LogisticRegression,0.8892,0.5061,0.645
KNeighborsClassifier,0.6646,0.2857,0.3996
DecisionTreeClassifier,0.994,0.9757,0.9848
RandomForestClassifier,0.9977,0.9721,0.9847
AdaBoostClassifier,0.6731,0.3452,0.4564
XGBClassifier,0.8965,0.2186,0.3514
GaussianNB,0.5176,1.0,0.6821


TEST SET scores

Model,Precision,Recall,F1
MLPClassifier,0.6614,0.4169,0.5114
LinearSVC,0.5992,0.3747,0.4611
SVC,0.6556,0.2457,0.3574
LogisticRegression,0.6986,0.2531,0.3716
KNeighborsClassifier,0.2529,0.1092,0.1525
DecisionTreeClassifier,0.3983,0.34,0.3668
RandomForestClassifier,0.7105,0.134,0.2255
AdaBoostClassifier,0.5063,0.1985,0.2852
XGBClassifier,0.5283,0.0695,0.1228
GaussianNB,0.2855,0.4789,0.3577


CPU times: user 1min 26s, sys: 29.8 s, total: 1min 55s
Wall time: 1min 28s


### Set 9

- N-grams: 1
- Stop words removal: No
- Maximum vocabulary's size: 2,000

In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set9 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=2000)


In [55]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set9, 
    'SET 9', 
    verbose_vocab=True)

   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


TRAIN SET scores

Model,Precision,Recall,F1
MLPClassifier,0.8214,0.653,0.7276
LinearSVC,0.849,0.6715,0.7499
SVC,0.9783,0.7927,0.8758
LogisticRegression,0.8328,0.4894,0.6165
KNeighborsClassifier,0.6526,0.3335,0.4414
DecisionTreeClassifier,0.993,0.9599,0.9762
RandomForestClassifier,0.9962,0.9567,0.9761
AdaBoostClassifier,0.6875,0.3321,0.4479
XGBClassifier,0.8982,0.2226,0.3568
GaussianNB,0.3777,0.9779,0.545


TEST SET scores

Model,Precision,Recall,F1
MLPClassifier,0.6641,0.4318,0.5233
LinearSVC,0.6142,0.3871,0.4749
SVC,0.6782,0.2928,0.409
LogisticRegression,0.7143,0.3226,0.4444
KNeighborsClassifier,0.2953,0.1414,0.1913
DecisionTreeClassifier,0.39,0.33,0.3575
RandomForestClassifier,0.6942,0.2084,0.3206
AdaBoostClassifier,0.5294,0.201,0.2914
XGBClassifier,0.614,0.0868,0.1522
GaussianNB,0.3326,0.799,0.4697


CPU times: user 52.5 s, sys: 5.36 s, total: 57.8 s
Wall time: 52.2 s


### Set 10

- N-grams: 1 to 3
- Stop words removal: Yes
- Vocabulary's size: No limits
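
With `stop_words='english'`, scikit-learn removes stop words from the token list before it assembles n-grams, so an n-gram can join words that were not adjacent in the original sentence. A minimal sketch on an invented phrase, using the vectorizer's own analyzer and omitting the notebook's `preprocess` function for brevity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
analyze = vectorizer.build_analyzer()

# Stop words are dropped before n-grams are assembled.
terms = analyze('the court of appeal')
print(terms)
```

Here "the" and "of" disappear and the bigram "court appeal" is produced, which also explains why the stop-word sets learn fewer terms than their counterparts without stop-word removal.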

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set10 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=None, 
      stop_words='english')


In [57]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    model_set10, 
    'SET 10', 
    verbose_vocab=True)

   Learned 168158 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier


TRAIN SET scores

Model,Precision,Recall,F1
MLPClassifier,0.977,0.9013,0.9376
LinearSVC,0.9954,0.9653,0.9801
SVC,0.9961,0.932,0.963
LogisticRegression,0.9738,0.2677,0.4199
KNeighborsClassifier,0.5487,0.2564,0.3495
DecisionTreeClassifier,0.9935,0.9689,0.9811
RandomForestClassifier,0.9977,0.9648,0.981
AdaBoostClassifier,0.6895,0.2681,0.3861


TEST SET scores

Model,Precision,Recall,F1
MLPClassifier,0.6453,0.3251,0.4323
LinearSVC,0.6235,0.3821,0.4738
SVC,0.5714,0.1092,0.1833
LogisticRegression,0.6341,0.129,0.2144
KNeighborsClassifier,0.2358,0.1241,0.1626
DecisionTreeClassifier,0.3889,0.469,0.4252
RandomForestClassifier,0.5986,0.2184,0.32
AdaBoostClassifier,0.4636,0.1737,0.2527


CPU times: user 7min 39s, sys: 1min 58s, total: 9min 37s
Wall time: 8min 7s


### Set 11

- N-grams: 1 to 3
- Stop words removal: Yes
- Maximum vocabulary's size: 20,000

In [58]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set11 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=20000, 
      stop_words='english')


In [59]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set11, 
    'SET 11', 
    verbose_vocab=True)

   Learned 20000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


TRAIN SET scores

Model,Precision,Recall,F1
MLPClassifier,0.9384,0.8026,0.8652
LinearSVC,0.9788,0.9148,0.9457
SVC,0.9911,0.854,0.9175
LogisticRegression,0.9351,0.4286,0.5878
KNeighborsClassifier,0.6303,0.315,0.4201
DecisionTreeClassifier,0.9917,0.9644,0.9778
RandomForestClassifier,0.9963,0.9599,0.9777
AdaBoostClassifier,0.6783,0.2803,0.3967
XGBClassifier,0.9358,0.1577,0.27
GaussianNB,0.6389,1.0,0.7797


TEST SET scores

Model,Precision,Recall,F1
MLPClassifier,0.671,0.3846,0.489
LinearSVC,0.5869,0.3772,0.4592
SVC,0.6545,0.1787,0.2807
LogisticRegression,0.6579,0.1861,0.2901
KNeighborsClassifier,0.2161,0.1067,0.1429
DecisionTreeClassifier,0.4205,0.3672,0.3921
RandomForestClassifier,0.5669,0.2208,0.3179
AdaBoostClassifier,0.4647,0.196,0.2757
XGBClassifier,0.6154,0.0596,0.1086
GaussianNB,0.3588,0.5707,0.4406


CPU times: user 1min 53s, sys: 45.5 s, total: 2min 39s
Wall time: 1min 56s


### Set 12

- N-grams: 1 to 3
- Stop words removal: Yes
- Maximum vocabulary's size: 2,000

In [60]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set12 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=2000, 
      stop_words='english')


In [61]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set12, 
    'SET 12', 
    verbose_vocab=True)

   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


TRAIN SET scores

Model,Precision,Recall,F1
MLPClassifier,0.797,0.5521,0.6523
LinearSVC,0.8331,0.6323,0.7189
SVC,0.9742,0.7328,0.8364
LogisticRegression,0.8333,0.4214,0.5597
KNeighborsClassifier,0.6556,0.4367,0.5242
DecisionTreeClassifier,0.9872,0.9374,0.9616
RandomForestClassifier,0.9933,0.9315,0.9614
AdaBoostClassifier,0.6734,0.2844,0.3999
XGBClassifier,0.9417,0.1528,0.2629
GaussianNB,0.3783,0.9806,0.546


TEST SET scores

Model,Precision,Recall,F1
MLPClassifier,0.6323,0.3499,0.4505
LinearSVC,0.5615,0.34,0.4235
SVC,0.6319,0.2258,0.3327
LogisticRegression,0.6447,0.2432,0.3532
KNeighborsClassifier,0.2966,0.2134,0.2482
DecisionTreeClassifier,0.4258,0.3846,0.4042
RandomForestClassifier,0.5905,0.3077,0.4046
AdaBoostClassifier,0.497,0.2084,0.2937
XGBClassifier,0.5814,0.062,0.1121
GaussianNB,0.3381,0.8139,0.4778


CPU times: user 40.8 s, sys: 4.69 s, total: 45.5 s
Wall time: 40.7 s


### Set 13

- N-grams: 1 to 2
- Stop words removal: Yes
- Vocabulary's size: No limits

In [62]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set13 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=None, 
      stop_words='english')


In [63]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    model_set13, 
    'SET 13', 
    verbose_vocab=True)

   Learned 82606 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier


TRAIN SET scores

Model,Precision,Recall,F1
MLPClassifier,0.9739,0.909,0.9403
LinearSVC,0.9949,0.9639,0.9792
SVC,0.9951,0.9198,0.956
LogisticRegression,0.9632,0.342,0.5048
KNeighborsClassifier,0.5595,0.265,0.3596
DecisionTreeClassifier,0.9935,0.9689,0.9811
RandomForestClassifier,0.9977,0.9648,0.981
AdaBoostClassifier,0.6895,0.2542,0.3714


Model,Precision,Recall,F1
MLPClassifier,0.6164,0.3548,0.4504
LinearSVC,0.6147,0.3524,0.4479
SVC,0.6076,0.1191,0.1992
LogisticRegression,0.6471,0.1365,0.2254
KNeighborsClassifier,0.2389,0.134,0.1717
DecisionTreeClassifier,0.4252,0.4442,0.4345
RandomForestClassifier,0.527,0.1935,0.2831
AdaBoostClassifier,0.4861,0.1737,0.2559


CPU times: user 4min 54s, sys: 1min 41s, total: 6min 35s
Wall time: 4min 49s
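
The F1 column in these tables can be cross-checked against the Precision and Recall columns, since F1 is the harmonic mean of the two. The sketch below is a minimal check, not the notebook's own metric code: the helper name `f1_from` is hypothetical, and the inputs are the rounded MLPClassifier values from the second SET 13 table, so agreement is only expected up to rounding.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# MLPClassifier, second SET 13 table: Precision 0.6164, Recall 0.3548.
mlp_f1 = f1_from(0.6164, 0.3548)
print(round(mlp_f1, 4))  # reported F1 in the table: 0.4504
```

Because F1 is a harmonic mean, it stays close to the smaller of the two values, which is why classifiers with high precision but very low recall end up with a low F1.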


### Set 14

- N-grams: 1 to 2
- Stop words removal: Yes
- Maximum vocabulary size: 20,000

In [64]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set14 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=20000, 
      stop_words='english')


In [65]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set14, 
    'SET 14', 
    verbose_vocab=True)

   Learned 20000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


Model,Precision,Recall,F1
MLPClassifier,0.9449,0.8116,0.8732
LinearSVC,0.9808,0.9229,0.951
SVC,0.9906,0.858,0.9196
LogisticRegression,0.9306,0.4227,0.5813
KNeighborsClassifier,0.5969,0.329,0.4242
DecisionTreeClassifier,0.9926,0.9648,0.9785
RandomForestClassifier,0.9967,0.9608,0.9784
AdaBoostClassifier,0.6826,0.283,0.4001
XGBClassifier,0.9292,0.1478,0.2551
GaussianNB,0.6741,1.0,0.8053


Model,Precision,Recall,F1
MLPClassifier,0.6754,0.3821,0.4881
LinearSVC,0.6024,0.3797,0.4658
SVC,0.6602,0.1687,0.2688
LogisticRegression,0.6667,0.1737,0.2756
KNeighborsClassifier,0.247,0.1514,0.1877
DecisionTreeClassifier,0.4393,0.3499,0.3895
RandomForestClassifier,0.6291,0.2357,0.343
AdaBoostClassifier,0.4837,0.1836,0.2662
XGBClassifier,0.6222,0.0695,0.125
GaussianNB,0.369,0.5558,0.4436


CPU times: user 1min 53s, sys: 41.9 s, total: 2min 35s
Wall time: 1min 55s


### Set 15

- N-grams: 1 to 2
- Stop words removal: Yes
- Maximum vocabulary size: 2,000

In [66]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set15 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=2000, 
      stop_words='english')


In [67]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set15, 
    'SET 15', 
    verbose_vocab=True)

   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


Model,Precision,Recall,F1
MLPClassifier,0.8042,0.557,0.6581
LinearSVC,0.8377,0.6327,0.7209
SVC,0.9749,0.7364,0.839
LogisticRegression,0.8369,0.4232,0.5621
KNeighborsClassifier,0.6553,0.4259,0.5163
DecisionTreeClassifier,0.9872,0.9383,0.9621
RandomForestClassifier,0.9938,0.932,0.9619
AdaBoostClassifier,0.6787,0.2884,0.4048
XGBClassifier,0.9357,0.1573,0.2693
GaussianNB,0.3791,0.9815,0.547


Model,Precision,Recall,F1
MLPClassifier,0.6295,0.3499,0.4498
LinearSVC,0.5647,0.3573,0.4377
SVC,0.6408,0.2258,0.3339
LogisticRegression,0.6351,0.2333,0.3412
KNeighborsClassifier,0.2886,0.2134,0.2454
DecisionTreeClassifier,0.3973,0.3648,0.3803
RandomForestClassifier,0.5714,0.3077,0.4
AdaBoostClassifier,0.4591,0.1811,0.2598
XGBClassifier,0.5909,0.0645,0.1163
GaussianNB,0.332,0.8065,0.4703


CPU times: user 40.1 s, sys: 4.49 s, total: 44.6 s
Wall time: 40.2 s


### Set 16

- N-grams: 1
- Stop words removal: Yes
- Vocabulary size: No limit

In [68]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set16 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=None, 
      stop_words='english')


In [69]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    model_set16, 
    'SET 16', 
    verbose_vocab=True)

   Learned 11089 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier


Model,Precision,Recall,F1
MLPClassifier,0.8796,0.6553,0.751
LinearSVC,0.9531,0.8522,0.8998
SVC,0.9884,0.845,0.9111
LogisticRegression,0.8955,0.4286,0.5797
KNeighborsClassifier,0.6266,0.3078,0.4128
DecisionTreeClassifier,0.9931,0.9685,0.9806
RandomForestClassifier,0.9977,0.9639,0.9805
AdaBoostClassifier,0.6831,0.269,0.386


Model,Precision,Recall,F1
MLPClassifier,0.659,0.2829,0.3958
LinearSVC,0.5738,0.3375,0.425
SVC,0.5941,0.1489,0.2381
LogisticRegression,0.6476,0.1687,0.2677
KNeighborsClassifier,0.1927,0.1042,0.1353
DecisionTreeClassifier,0.3743,0.3474,0.3604
RandomForestClassifier,0.5705,0.2109,0.308
AdaBoostClassifier,0.4898,0.1787,0.2618


CPU times: user 1min 5s, sys: 21.7 s, total: 1min 27s
Wall time: 1min 4s
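
Set 13 (1- to 2-grams) learned 82606 terms, while Set 16 (unigrams only) learned 11089 with the same stop-word setting. The sketch below illustrates why widening the n-gram range inflates the vocabulary. It is a plain-Python approximation, not `TfidfVectorizer` itself: the helper `ngram_vocabulary`, the whitespace tokenizer, and the toy sentences are all assumptions made for illustration.

```python
def ngram_vocabulary(sentences, ngram_range):
    """Collect every distinct word n-gram whose length lies in ngram_range."""
    low, high = ngram_range
    vocab = set()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive tokenizer (assumption)
        for n in range(low, high + 1):
            for start in range(len(tokens) - n + 1):
                vocab.add(" ".join(tokens[start:start + n]))
    return vocab


toy_sentences = [
    "the appellant filed a suit",
    "the court dismissed the suit",
    "the appellant filed an appeal",
]

unigrams = ngram_vocabulary(toy_sentences, (1, 1))
uni_and_bigrams = ngram_vocabulary(toy_sentences, (1, 2))
print(len(unigrams), len(uni_and_bigrams))
```

Every unigram is still present in the larger vocabulary; the extra entries are the word pairs, most of which occur in only one or two sentences.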


### Set 17

- N-grams: 1
- Stop words removal: Yes
- Maximum vocabulary size: 20,000

In [70]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set17 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=20000, 
      stop_words='english')


In [71]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set17, 
    'SET 17', 
    verbose_vocab=True)

   Learned 11089 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


Model,Precision,Recall,F1
MLPClassifier,0.8796,0.6553,0.751
LinearSVC,0.9531,0.8522,0.8998
SVC,0.9884,0.845,0.9111
LogisticRegression,0.8955,0.4286,0.5797
KNeighborsClassifier,0.6266,0.3078,0.4128
DecisionTreeClassifier,0.9931,0.9685,0.9806
RandomForestClassifier,0.9977,0.9639,0.9805
AdaBoostClassifier,0.6831,0.269,0.386
XGBClassifier,0.9304,0.1505,0.2591
GaussianNB,0.5147,1.0,0.6796


Model,Precision,Recall,F1
MLPClassifier,0.659,0.2829,0.3958
LinearSVC,0.5738,0.3375,0.425
SVC,0.5941,0.1489,0.2381
LogisticRegression,0.6476,0.1687,0.2677
KNeighborsClassifier,0.1927,0.1042,0.1353
DecisionTreeClassifier,0.3743,0.3474,0.3604
RandomForestClassifier,0.5705,0.2109,0.308
AdaBoostClassifier,0.4898,0.1787,0.2618
XGBClassifier,0.6222,0.0695,0.125
GaussianNB,0.2785,0.4789,0.3522


CPU times: user 1min 12s, sys: 22.4 s, total: 1min 35s
Wall time: 1min 12s
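
Sets 16 and 17 both learned 11089 terms, and their scores match for every classifier they share: the unigram vocabulary is already smaller than the 20,000 cap, so `max_features` removes nothing. The sketch below mimics the documented rule (keep the most frequent terms across the corpus) in plain Python. The helper `limit_vocabulary`, the whitespace tokenizer, and the toy sentences are assumptions for illustration, not the notebook's code.

```python
from collections import Counter


def limit_vocabulary(sentences, max_features=None):
    """Keep the max_features most frequent terms; keep all terms when None."""
    counts = Counter(
        term for sentence in sentences for term in sentence.lower().split()
    )
    # most_common(None) returns every term, so None means "no cap".
    kept = counts.most_common(max_features)
    return sorted(term for term, _ in kept)


toy_sentences = [
    "the appellant filed a suit",
    "the court dismissed the suit",
]

unlimited = limit_vocabulary(toy_sentences)
capped_high = limit_vocabulary(toy_sentences, max_features=20000)
capped_low = limit_vocabulary(toy_sentences, max_features=2)
print(unlimited == capped_high, capped_low)
```

A cap only changes the features once it drops below the natural vocabulary size, which is what happens with the 2,000-term limit in Set 18.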


### Set 18

- N-grams: 1
- Stop words removal: Yes
- Maximum vocabulary size: 2,000

In [72]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set18 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=2000, 
      stop_words='english')


In [73]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set18, 
    'SET 18', 
    verbose_vocab=True)

   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


Model,Precision,Recall,F1
MLPClassifier,0.815,0.4745,0.5998
LinearSVC,0.8383,0.6494,0.7318
SVC,0.9749,0.753,0.8497
LogisticRegression,0.842,0.4322,0.5712
KNeighborsClassifier,0.6392,0.4128,0.5016
DecisionTreeClassifier,0.9882,0.9455,0.9664
RandomForestClassifier,0.9952,0.9387,0.9661
AdaBoostClassifier,0.6961,0.2704,0.3895
XGBClassifier,0.9563,0.1478,0.256
GaussianNB,0.3693,0.9883,0.5376


Model,Precision,Recall,F1
MLPClassifier,0.6591,0.2878,0.4007
LinearSVC,0.562,0.3375,0.4217
SVC,0.6471,0.2184,0.3265
LogisticRegression,0.6711,0.2481,0.3623
KNeighborsClassifier,0.3043,0.2084,0.2474
DecisionTreeClassifier,0.3931,0.3697,0.3811
RandomForestClassifier,0.5769,0.2978,0.3928
AdaBoostClassifier,0.5036,0.1712,0.2556
XGBClassifier,0.6136,0.067,0.1208
GaussianNB,0.311,0.7841,0.4454


CPU times: user 38.7 s, sys: 4.37 s, total: 43.1 s
Wall time: 38.8 s


### Set 19

- N-grams: 1 to 3
- Stop words removal: No
- Vocabulary size: No limit
- Maximum DF: 0.85

In [74]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set19 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=None, 
      max_df=0.85)


In [75]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    model_set19, 
    'SET 19', 
    verbose_vocab=True)

   Learned 237147 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier


Model,Precision,Recall,F1
MLPClassifier,0.9793,0.9171,0.9472
LinearSVC,0.9986,0.9703,0.9842
SVC,0.9995,0.941,0.9694
LogisticRegression,0.983,0.3641,0.5314
KNeighborsClassifier,0.5633,0.2307,0.3274
DecisionTreeClassifier,0.9945,0.9757,0.985
RandomForestClassifier,0.9981,0.9721,0.9849
AdaBoostClassifier,0.6705,0.3384,0.4498


Model,Precision,Recall,F1
MLPClassifier,0.6939,0.4218,0.5247
LinearSVC,0.7078,0.3846,0.4984
SVC,0.6941,0.1464,0.2418
LogisticRegression,0.6988,0.1439,0.2387
KNeighborsClassifier,0.2343,0.1017,0.1419
DecisionTreeClassifier,0.4062,0.3598,0.3816
RandomForestClassifier,0.6364,0.1042,0.1791
AdaBoostClassifier,0.522,0.2357,0.3248


CPU times: user 12min 4s, sys: 2min 13s, total: 14min 17s
Wall time: 12min 34s
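
Set 19 uses `max_df=0.85` in place of a stop-word list. With a float value, scikit-learn ignores terms whose document frequency is strictly higher than that fraction of the training documents, which here are sentences. The sketch below reproduces that rule in plain Python; the helper `filter_by_max_df`, the whitespace tokenizer, and the toy sentences are assumptions for illustration.

```python
from collections import Counter


def filter_by_max_df(sentences, max_df):
    """Drop terms that occur in more than max_df (a fraction) of the sentences."""
    tokenized = [set(sentence.lower().split()) for sentence in sentences]
    doc_freq = Counter(term for tokens in tokenized for term in tokens)
    max_doc_count = max_df * len(tokenized)
    return sorted(t for t, df in doc_freq.items() if df <= max_doc_count)


toy_sentences = [
    "the appellant filed the suit",
    "the court heard the suit",
    "the court dismissed the suit",
    "the appellant appealed",
]

kept_terms = filter_by_max_df(toy_sentences, max_df=0.85)
print(kept_terms)
```

In this toy corpus "the" appears in all four sentences and is dropped, while "suit" appears in three of four (0.75) and is kept. The threshold acts as a corpus-specific stop-word filter.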


### Set 20

- N-grams: 1 to 2
- Stop words removal: No
- Vocabulary size: No limit
- Maximum DF: 0.85

In [76]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set20 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=None, 
      max_df=0.85)


In [77]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    model_set20, 
    'SET 20', 
    verbose_vocab=True)

   Learned 89419 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier


Model,Precision,Recall,F1
MLPClassifier,0.98,0.9261,0.9523
LinearSVC,0.9986,0.9676,0.9828
SVC,0.9985,0.9252,0.9605
LogisticRegression,0.9686,0.4588,0.6226
KNeighborsClassifier,0.6076,0.2443,0.3484
DecisionTreeClassifier,0.9945,0.9757,0.985
RandomForestClassifier,0.9991,0.9712,0.9849
AdaBoostClassifier,0.6698,0.3502,0.4599


Model,Precision,Recall,F1
MLPClassifier,0.7093,0.4541,0.5537
LinearSVC,0.6903,0.3871,0.496
SVC,0.6893,0.1762,0.2806
LogisticRegression,0.703,0.1762,0.2817
KNeighborsClassifier,0.1774,0.0819,0.1121
DecisionTreeClassifier,0.4172,0.3499,0.3806
RandomForestClassifier,0.6774,0.1042,0.1806
AdaBoostClassifier,0.5089,0.2134,0.3007


CPU times: user 5min 44s, sys: 1min 37s, total: 7min 22s
Wall time: 5min 37s


### Set 21

- N-grams: 1
- Stop words removal: No
- Vocabulary size: No limit
- Maximum DF: 0.85

In [78]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set21 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=None, 
      max_df=0.85)


In [79]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    model_set21, 
    'SET 21', 
    verbose_vocab=True)

   Learned 11376 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier


Model,Precision,Recall,F1
MLPClassifier,0.8765,0.7643,0.8166
LinearSVC,0.951,0.8571,0.9016
SVC,0.9877,0.8675,0.9237
LogisticRegression,0.8892,0.5061,0.645
KNeighborsClassifier,0.6646,0.2857,0.3996
DecisionTreeClassifier,0.994,0.9757,0.9848
RandomForestClassifier,0.9977,0.9721,0.9847
AdaBoostClassifier,0.6731,0.3452,0.4564


Model,Precision,Recall,F1
MLPClassifier,0.6614,0.4169,0.5114
LinearSVC,0.5992,0.3747,0.4611
SVC,0.6556,0.2457,0.3574
LogisticRegression,0.6986,0.2531,0.3716
KNeighborsClassifier,0.2529,0.1092,0.1525
DecisionTreeClassifier,0.3983,0.34,0.3668
RandomForestClassifier,0.7105,0.134,0.2255
AdaBoostClassifier,0.5063,0.1985,0.2852


CPU times: user 1min 18s, sys: 24.9 s, total: 1min 43s
Wall time: 1min 17s


### Summary

In [80]:
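from IPython.display import display, update_display

# Show every row of the summary table instead of pandas' truncated view.
pd.set_option("display.max_rows", None)

# test_metrics maps each model name to a list of
# (TF-IDF set name, precision, recall, F1) tuples.
metrics_df = pd.DataFrame(columns=['Model', 'TF-IDF set', 'Precision', 'Recall', 'F1'])
i = 0
for model_name, metrics in test_metrics.items():
  for m in metrics:
    # One row per (model, TF-IDF set) pair, with scores formatted to 4 decimals.
    metrics_df.loc[i] = [model_name, m[0], f'{m[1]:.4f}', f'{m[2]:.4f}', f'{m[3]:.4f}']
    i += 1
# display_id lets a later update_display call refresh this table in place.
metrics_display = display(metrics_df, display_id='metrics_table')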

Index,Model,TF-IDF set,Precision,Recall,F1
0,MLPClassifier,SET 3,0.6442,0.4268,0.5134
1,MLPClassifier,SET 1,0.6939,0.4218,0.5247
2,MLPClassifier,SET 2,0.7277,0.3449,0.468
3,MLPClassifier,SET 4,0.7093,0.4541,0.5537
4,MLPClassifier,SET 5,0.7626,0.4144,0.537
5,MLPClassifier,SET 6,0.6729,0.4442,0.5351
6,MLPClassifier,SET 7,0.6614,0.4169,0.5114
7,MLPClassifier,SET 8,0.6614,0.4169,0.5114
8,MLPClassifier,SET 9,0.6641,0.4318,0.5233
9,MLPClassifier,SET 10,0.6453,0.3251,0.4323
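
Since the goal is to find the best configuration, a natural follow-up is to pick the TF-IDF set with the highest F1 for each model. The sketch below is one way to do that, under stated assumptions: `toy_test_metrics` is a hypothetical stand-in shaped like `test_metrics` as the loop above reads it (model name mapped to `(set name, precision, recall, F1)` tuples), filled with three MLPClassifier rows copied from the table, and `best_set_per_model` is an illustrative helper rather than part of the notebook.

```python
# Hypothetical stand-in for test_metrics, using rows from the summary table.
toy_test_metrics = {
    "MLPClassifier": [
        ("SET 1", 0.6939, 0.4218, 0.5247),
        ("SET 2", 0.7277, 0.3449, 0.4680),
        ("SET 4", 0.7093, 0.4541, 0.5537),
    ],
}


def best_set_per_model(metrics_by_model):
    """Return, for each model, the tuple with the highest F1 (index 3)."""
    return {
        model: max(rows, key=lambda row: row[3])
        for model, rows in metrics_by_model.items()
    }


best = best_set_per_model(toy_test_metrics)
for model, (set_name, precision, recall, f1) in best.items():
    print(f"{model}: {set_name} (P={precision:.4f}, R={recall:.4f}, F1={f1:.4f})")
```

Working from the numeric tuples avoids a pitfall of `metrics_df`, whose score columns hold formatted strings and would need converting to floats before sorting.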


### Reference paper

> Paheli Bhattacharya, Shounak Paul, Kripabandhu Ghosh, Saptarshi Ghosh, and Adam Wyner. 2019. **Identification of Rhetorical Roles of Sentences in Indian Legal Judgments**. In Proc. International Conference on Legal Knowledge and Information Systems (JURIX).

