<a href="https://colab.research.google.com/github/alexlimatds/fact_extraction/blob/main/AILA2020/FACTS_AILA_TF_IDF_approach_2_test_set.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Facts extraction with AILA data and TF-IDF features

This notebook experiments with TF-IDF features in order to find the best hyperparameters.

The TF-IDF weights are computed at the document level. These are the steps to compute the TF-IDF vector of a sentence:

- The TF-IDF model is applied to the whole document that contains the sentence.
- The sentence vector keeps only the terms that occur in the sentence, each with the weight computed for that document.
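
The steps above can be sketched with a toy example. This is only an illustration under assumed data: the two small documents and the variable names are hypothetical, and the masking approach is a compact equivalent of the idea, not the code used later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents; each document is a list of sentences (hypothetical data).
docs = [
    ["the court heard the appeal", "the appeal was dismissed"],
    ["the petitioner filed a suit", "the suit was decreed"],
]

# Fit TF-IDF on whole documents, not on individual sentences.
vectorizer = TfidfVectorizer()
vectorizer.fit([" ".join(sentences) for sentences in docs])

# Document-level weights for the first document.
doc_vector = vectorizer.transform([" ".join(docs[0])]).toarray()[0]

# Sentence vector: a mask of the terms present in the sentence,
# multiplied by the document-level weights of those terms.
sentence = docs[0][1]
present = vectorizer.transform([sentence]).toarray()[0] != 0
sentence_vector = doc_vector * present

for term, idx in sorted(vectorizer.vocabulary_.items()):
    if sentence_vector[idx] != 0:
        print(f"{term}: {sentence_vector[idx]:.4f}")
```

Terms of the document that do not occur in the sentence (such as "court" here) get a zero weight in the sentence vector.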

Data used in this notebook:

- for training: the train dataset from AILA 2020. This can be obtained at https://github.com/Law-AI/semantic-segmentation;
- for testing: additional train documents from AILA 2021.

### Loading dataset

In [1]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
g_drive_dir = '/content/gdrive/MyDrive/'
dataset_dir = 'fact_extraction_AILA/'

Mounted at /content/gdrive


In [2]:
!rm -r data
!mkdir data
!mkdir data/train
!tar -xf {g_drive_dir}{dataset_dir}/train.tar.xz -C data/train
!mkdir data/test
!tar -xf {g_drive_dir}{dataset_dir}/test.tar.xz -C data/test

train_dir = 'data/train/'
test_dir = 'data/test/'

rm: cannot remove 'data': No such file or directory


In [3]:
import pandas as pd
from os import listdir
import csv

def read_docs(dir_name):
  """
  Read the docs in a directory.
  Params:
    dir_name : the directory that contains the documents.
  Returns:
    A dictionary whose keys are the names of the read files and the values are 
    pandas dataframes. Each dataframe has the columns sentence and label.
  """
  docs = {} # key: file name, value: dataframe with sentences and labels
  for f in listdir(dir_name):
    df = pd.read_csv(
        dir_name + f, 
        sep='\t', 
        quoting=csv.QUOTE_NONE, 
        names=['sentence', 'label'])
    docs[f] = df
  return docs

docs_train = read_docs(train_dir)
docs_test = read_docs(test_dir)

print(f'TRAIN: {len(docs_train)} documents read.')
print(f'TEST: {len(docs_test)} documents read.')

TRAIN: 50 documents read.
TEST: 10 documents read.
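
A small sketch of what `read_docs` expects from each file, assuming one sentence per line followed by a tab and the label. The sample content is hypothetical. It shows why `csv.QUOTE_NONE` matters: quotation marks inside a sentence are kept as ordinary text instead of being parsed as field delimiters.

```python
import csv
import io

import pandas as pd

# Hypothetical file content: one sentence per line, a tab, then the label.
raw = 'He said "no objection was raised.\tFacts\nThe appeal is dismissed.\tOther\n'

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    names=["sentence", "label"],
)
print(df)
```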


### Counting sentences by label

In [4]:
def target_stats(set_name, dic_docs):
  stats = {}
  total = 0
  for doc_id, df in dic_docs.items():
    targets = df['label'].tolist()
    total += len(targets)
    for t in targets:
      stats[t] = stats.get(t, 0) + 1
  print(f'Statistics of the {set_name} set:')
  print(f'   Total number of sentences: {total}')
  for t, n in stats.items():
    print(f'   Number of {t} labels: {n}')

target_stats('TRAIN', docs_train)
target_stats('TEST', docs_test)

Statistics of the TRAIN set:
   Total number of sentences: 9380
   Number of Facts labels: 2219
   Number of Other labels: 7161
Statistics of the TEST set:
   Total number of sentences: 1905
   Number of Facts labels: 403
   Number of Other labels: 1502
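
The counts above show that the classes are imbalanced. The following sketch only redoes the arithmetic with the numbers copied from that output; the variable names are illustrative.

```python
# Sentence counts copied from the statistics printed above.
counts = {
    "TRAIN": {"Facts": 2219, "Other": 7161},
    "TEST": {"Facts": 403, "Other": 1502},
}

facts_share = {}
for set_name, labels in counts.items():
    total = sum(labels.values())
    facts_share[set_name] = labels["Facts"] / total
    print(f"{set_name}: {facts_share[set_name]:.1%} of {total} sentences are Facts")
```

Roughly one sentence in four or five is labeled Facts, which is why the evaluation reports precision, recall and F1 for the Facts class instead of plain accuracy.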


### Evaluation functions

In [5]:
import sklearn
from sklearn.model_selection import KFold
from sklearn.metrics import precision_recall_fscore_support
from IPython.display import display, HTML
import numpy as np
import scipy.sparse as sparse

test_metrics = {}

def get_features(docs_dic, tfidf_model, to_dense):
  """
  Builds the sentence features using document-based TF-IDF weights.
  Params:
    docs_dic    : Dictionary of documents as returned by the read_docs function.
    tfidf_model : A fitted TF-IDF model.
    to_dense    : If True, the features are returned as a dense numpy array 
                  instead of a sparse matrix.
  Returns:
    - The feature matrix with one row per sentence.
    - A list of labels (strings) aligned with the rows of the feature matrix.
  """
  features = None
  targets = []
  for _, df in docs_dic.items():
    targets.extend(df['label'].tolist())
    sentences = df['sentence'].tolist()
    doc_str = " ".join(sentences)
    doc_tfidf = tfidf_model.transform([doc_str])[0]
    # using the model to generate a sparse matrix with the correct structure,
    # although the TF-IDF weights are taken from doc_tfidf
    sentences_tfidf = tfidf_model.transform(sentences)
    non_zeros_idx = sentences_tfidf.nonzero()
    for sent_idx, term_idx in zip(non_zeros_idx[0], non_zeros_idx[1]):
      # overriding with the weights from doc_tfidf
      sentences_tfidf[sent_idx, term_idx] = doc_tfidf[0, term_idx]
    if features is None:
      features = sentences_tfidf
    else:
      features = sparse.vstack([features, sentences_tfidf])
  if to_dense:
    features = features.toarray()
  
  return features, targets

def docs_as_strings(docs_dic):
  """
  Returns documents as strings.
  Params:
    docs_dic      : Dictionary of documents as returned by the read_docs function.
  Returns:
    A list of strings whose each element is a document.
  """
  docs = []
  for _, df in docs_dic.items():
    docs.append(" ".join(df['sentence'].tolist()))
  return docs

def docs_to_sentences_and_labels(docs_dic):
  """
  Extracts the sentences and the labels from a set of documents.
  Params:
    docs_dic      : Dictionary of documents as returned by the read_docs function.
  Returns:
    - A list of sentences (strings).
    - A list of labels (strings). The indexes of this list are 
    respective to the indexes in the returned sentence list.
  """
  sentences = []
  targets = []
  for _, df in docs_dic.items():
    sentences.extend(df['sentence'].tolist())
    targets.extend(df['label'].tolist())
  
  return sentences, targets

def report_set(metrics_dic, dataset_description):
  report_df = pd.DataFrame(columns=['Precision', 'Recall', 'F1'])
  for model_name, model_metrics in metrics_dic.items():
    report_df.loc[model_name] = [
        f'{model_metrics[0]:.4f}',  # precision
        f'{model_metrics[1]:.4f}',  # recall
        f'{model_metrics[2]:.4f}']  # f1
  display(HTML(f'<br><span style="font-weight: bold">{dataset_description} scores</span>'))
  display(report_df)

def evaluation(model_tuples, tfidf_model, set_description, verbose_vocab=False):
  """
  Trains each model on the train set and reports its metrics on the train and test sets.
  Params:
    model_tuples    : A list of tuples. For each tuple, the first element is a function 
                      returning an unfitted machine learning model and the second one 
                      is a flag indicating whether the features must be converted to 
                      dense numpy arrays.
    tfidf_model     : An unfitted TF-IDF model.
    set_description : Text description of the feature set.
    verbose_vocab   : Whether the size of the vocabulary must be printed.
  """
  train_metrics_set = {}
  test_metrics_set = {}

  docs_train_str = docs_as_strings(docs_train)
  tfidf_model.fit(docs_train_str)
  if verbose_vocab:
    print(f'   Learned {len(tfidf_model.vocabulary_)} terms.')

  last_to_dense = None
  for (model_builder, to_dense) in model_tuples:
    model = model_builder()
    model_name = model.__class__.__name__
    print(f'   Processing model: {model_name}')
    if last_to_dense != to_dense:
      train_features, train_targets = get_features(docs_train, tfidf_model, to_dense)
      test_features, test_targets = get_features(docs_test, tfidf_model, to_dense)
    last_to_dense = to_dense
    model.fit(train_features, train_targets)
    # test metrics
    predictions = model.predict(test_features)
    p_test, r_test, f1_test, _ = precision_recall_fscore_support(
        test_targets, 
        predictions, 
        average='binary', 
        pos_label='Facts', 
        zero_division=0)
    test_metrics_set[model_name] = (p_test, r_test, f1_test)
    # train metrics
    predictions = model.predict(train_features)
    p_train, r_train, f1_train, _ = precision_recall_fscore_support(
        train_targets, 
        predictions, 
        average='binary', 
        pos_label='Facts', 
        zero_division=0)
    train_metrics_set[model_name] = (p_train, r_train, f1_train)

    # metrics for the summary
    summary_model_metrics = test_metrics.get(model_name, [])
    summary_model_metrics.append((set_description, p_test, r_test, f1_test))
    test_metrics[model_name] = summary_model_metrics

  # reporting the achieved metrics
  report_set(train_metrics_set, 'TRAIN SET')
  report_set(test_metrics_set, 'TEST SET')
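
As a minimal sketch of how the metrics in `evaluation` are computed, the call below uses the same arguments on five hypothetical labels. With `average='binary'` and `pos_label='Facts'`, only the Facts class is scored.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical gold labels and predictions for five sentences.
y_true = ["Facts", "Other", "Facts", "Other", "Facts"]
y_pred = ["Facts", "Other", "Other", "Facts", "Facts"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true,
    y_pred,
    average="binary",
    pos_label="Facts",
    zero_division=0,  # report 0 instead of warning when nothing is predicted as Facts
)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Here two of the three predicted Facts are correct and two of the three true Facts are found, so all three scores equal 2/3. A model that never predicts Facts gets 0.0 for all three scores because of `zero_division=0`.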
  

### Pre-processing function

In [6]:
import re

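def preprocess(text):
  # Note: a custom preprocessor replaces TfidfVectorizer's built-in lowercasing, 
  # so uppercase letters are removed by the second substitution below.
  pstr = text
  pstr = re.sub(r'[/(){}\[\]\|@,;]', ' ', pstr) # replaces symbols with spaces
  pstr = re.sub(r'[^0-9a-z #+_]', '', pstr)     # removes bad symbols
  pstr = re.sub(r'\d+', '', pstr)               # removes numbers
  return pstr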
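
A short usage sketch of the pre-processing, with the three substitutions repeated so the block runs on its own. The sample sentence is hypothetical. It shows that the function does not lowercase its input, so uppercase letters are dropped together with punctuation; lowercasing first, which is only an assumption about what may have been intended, keeps them.

```python
import re

def preprocess(text):
    # Same three substitutions as the function above.
    text = re.sub(r'[/(){}\[\]\|@,;]', ' ', text)  # symbols become spaces
    text = re.sub(r'[^0-9a-z #+_]', '', text)      # anything else unexpected is removed
    text = re.sub(r'\d+', '', text)                # numbers are removed
    return text

sample = "The appeal (No. 123/2010) was dismissed."

tokens = preprocess(sample).split()
tokens_lowercased = preprocess(sample.lower()).split()

print(tokens)             # ['he', 'appeal', 'o', 'was', 'dismissed']
print(tokens_lowercased)  # ['the', 'appeal', 'no', 'was', 'dismissed']
```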

### Models

#### MLP

In [7]:
from sklearn.neural_network import MLPClassifier

def mlp():
  # scikit-learn MLP with default hyperparameters, except for early stopping
  return MLPClassifier(early_stopping=True, random_state=1)

#### Linear SVM

In [8]:
from sklearn.svm import LinearSVC

def linear_svm():
  return LinearSVC(random_state=1)

#### RBF SVM

In [9]:
from sklearn.svm import SVC

def rbf_svm():
  return SVC(kernel='rbf', random_state=1)

#### Logistic regression

In [10]:
from sklearn.linear_model import LogisticRegression

def logistic_regression():
  return LogisticRegression(solver='sag', max_iter=200, random_state=1)

#### KNN

In [11]:
from sklearn.neighbors import KNeighborsClassifier

def knn():
  return KNeighborsClassifier(5)

#### Decision Tree

In [12]:
from sklearn.tree import DecisionTreeClassifier

def decision_tree():
  return DecisionTreeClassifier(random_state=1)

#### Random Forest

In [13]:
from sklearn.ensemble import RandomForestClassifier

def random_forest():
  return RandomForestClassifier(random_state=1)

#### AdaBoost

In [14]:
from sklearn.ensemble import AdaBoostClassifier

def adaboost():
  return AdaBoostClassifier(random_state=1)

#### Naive Bayes

In [15]:
from sklearn.naive_bayes import GaussianNB

def naive_bayes():
  return GaussianNB()
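
In the evaluation calls, `naive_bayes` is the only model paired with `to_dense=True`. The sketch below, on a tiny hypothetical matrix, illustrates the reason: scikit-learn's `GaussianNB` requires dense input, so the sparse TF-IDF features must be converted with `toarray()` first.

```python
from scipy import sparse
from sklearn.naive_bayes import GaussianNB

# Tiny hypothetical feature matrix in sparse form, as TfidfVectorizer returns.
X_sparse = sparse.csr_matrix([[0.0, 0.7], [0.5, 0.0], [0.0, 0.9], [0.6, 0.0]])
y = ["Facts", "Other", "Facts", "Other"]

try:
    GaussianNB().fit(X_sparse, y)
    sparse_accepted = True
except (TypeError, ValueError):
    sparse_accepted = False  # dense data is required

# Converting to a dense array is what the to_dense flag triggers.
model = GaussianNB().fit(X_sparse.toarray(), y)
predictions = model.predict([[0.0, 0.8]]).tolist()
print(sparse_accepted, predictions)
```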

#### XGBoost

In [16]:
from xgboost.sklearn import XGBClassifier

def xgboost():
  return XGBClassifier(objective="binary:logistic", tree_method='hist')

### Set 1

- N-grams: 1 to 3
- Stop words removal: No
- Vocabulary's size: No limits

Notes:
- Naive Bayes and XGBoost are not applied because there is not enough RAM to run them.
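
A rough back-of-the-envelope sketch of the memory issue, assuming 8-byte floats and the dense conversion that Naive Bayes needs. It uses the 9,380 training sentences and the vocabulary sizes reported in this notebook (259,573 terms for this set, 20,000 and 2,000 for the limited sets). The notebook does not state which step exhausts the RAM, and XGBoost receives sparse features, so this estimate only covers the dense case.

```python
def dense_size_gb(n_rows, n_cols, bytes_per_value=8):
    """Memory of a dense float matrix in gigabytes (10**9 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1e9

n_train_sentences = 9380  # from the label statistics
for n_terms in (259573, 20000, 2000):
    size = dense_size_gb(n_train_sentences, n_terms)
    print(f"{n_terms:>7} terms -> {size:6.2f} GB")
```

With the unlimited vocabulary, the dense training matrix alone would need about 19.5 GB, against about 1.5 GB with 20,000 terms.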

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set1 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=None)


In [18]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    model_set1, 
    'SET 1', 
    verbose_vocab=True)

   Learned 259573 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier


TRAIN SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.9764,0.8761,0.9235
LinearSVC,0.8228,0.205,0.3283
SVC,0.8281,0.1302,0.2251
LogisticRegression,0.788,0.0653,0.1207
KNeighborsClassifier,0.759,0.4358,0.5537
DecisionTreeClassifier,0.9977,0.9815,0.9896
RandomForestClassifier,0.9991,0.9802,0.9895
AdaBoostClassifier,0.6872,0.3267,0.4429


TEST SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.5726,0.3325,0.4207
LinearSVC,0.52,0.0323,0.0607
SVC,0.0,0.0,0.0
LogisticRegression,0.3333,0.0025,0.0049
KNeighborsClassifier,0.3623,0.1241,0.1848
DecisionTreeClassifier,0.4542,0.2829,0.3486
RandomForestClassifier,0.7209,0.0769,0.139
AdaBoostClassifier,0.5785,0.1737,0.2672


CPU times: user 47min 44s, sys: 6min 7s, total: 53min 52s
Wall time: 46min 12s


### Set 2

- N-grams: 1 to 3
- Stop words removal: No
- Maximum vocabulary's size: 20,000
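
To illustrate how the two hyperparameters varied across the sets interact, here is a minimal sketch on a hypothetical two-sentence corpus. `vocab_size` is a helper introduced only for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the court dismissed the appeal", "the appeal was allowed"]

def vocab_size(ngram_range, max_features=None):
    """Number of terms learned from the toy corpus."""
    vectorizer = TfidfVectorizer(ngram_range=ngram_range, max_features=max_features)
    vectorizer.fit(corpus)
    return len(vectorizer.vocabulary_)

print(vocab_size((1, 1)))                   # unigrams only
print(vocab_size((1, 2)))                   # unigrams and bigrams
print(vocab_size((1, 3)))                   # unigrams, bigrams and trigrams
print(vocab_size((1, 3), max_features=5))   # vocabulary capped at 5 terms
print(vocab_size((1, 1), max_features=20))  # cap above the actual size: no effect
```

Longer n-gram ranges enlarge the vocabulary quickly, while `max_features` keeps only the most frequent terms and changes nothing when the vocabulary is already smaller than the limit.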


In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set2 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=20000)


In [20]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set2, 
    'SET 2', 
    verbose_vocab=True)

   Learned 20000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


TRAIN SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.9009,0.5696,0.698
LinearSVC,0.8187,0.2136,0.3388
SVC,0.8234,0.1302,0.2249
LogisticRegression,0.7905,0.0748,0.1367
KNeighborsClassifier,0.6645,0.466,0.5478
DecisionTreeClassifier,0.9963,0.977,0.9866
RandomForestClassifier,0.9986,0.9748,0.9865
AdaBoostClassifier,0.6813,0.3285,0.4433
XGBClassifier,0.8828,0.2411,0.3788
GaussianNB,0.6726,1.0,0.8043


TEST SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.6738,0.2357,0.3493
LinearSVC,0.5667,0.0422,0.0785
SVC,0.0,0.0,0.0
LogisticRegression,0.5,0.0074,0.0147
KNeighborsClassifier,0.4138,0.1787,0.2496
DecisionTreeClassifier,0.4517,0.2903,0.3535
RandomForestClassifier,0.6462,0.1042,0.1795
AdaBoostClassifier,0.5308,0.1712,0.2589
XGBClassifier,0.5283,0.0695,0.1228
GaussianNB,0.5047,0.531,0.5175


CPU times: user 4min 16s, sys: 1min 32s, total: 5min 48s
Wall time: 4min 10s


### Set 3

- N-grams: 1 to 3
- Stop words removal: No
- Maximum vocabulary's size: 2,000

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set3 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=2000)


In [22]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set3, 
    'SET 3', 
    verbose_vocab=True)

   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


TRAIN SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.7544,0.4624,0.5733
LinearSVC,0.7731,0.1888,0.3035
SVC,0.8106,0.11,0.1937
LogisticRegression,0.7583,0.082,0.148
KNeighborsClassifier,0.7394,0.4642,0.5703
DecisionTreeClassifier,0.9949,0.9626,0.9785
RandomForestClassifier,0.9977,0.9599,0.9784
AdaBoostClassifier,0.6641,0.3438,0.4531
XGBClassifier,0.8719,0.2393,0.3755
GaussianNB,0.4167,0.9193,0.5734


TEST SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.601,0.2953,0.396
LinearSVC,0.6383,0.0744,0.1333
SVC,0.0,0.0,0.0
LogisticRegression,0.5714,0.0099,0.0195
KNeighborsClassifier,0.335,0.1687,0.2244
DecisionTreeClassifier,0.3448,0.3226,0.3333
RandomForestClassifier,0.6349,0.1985,0.3025
AdaBoostClassifier,0.5704,0.201,0.2972
XGBClassifier,0.5,0.0596,0.1064
GaussianNB,0.3543,0.7395,0.4791


CPU times: user 1min 41s, sys: 11.9 s, total: 1min 53s
Wall time: 1min 42s


### Set 4

- N-grams: 1 to 2
- Stop words removal: No
- Vocabulary's size: No limits

Notes:
- Naive Bayes and XGBoost are not applied because there is not enough RAM to run them.

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set4 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=None)


In [24]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    model_set4, 
    'SET 4', 
    verbose_vocab=True)

   Learned 95661 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier


TRAIN SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.9137,0.6823,0.7812
LinearSVC,0.8102,0.2078,0.3307
SVC,0.8121,0.1266,0.2191
LogisticRegression,0.7805,0.0721,0.132
KNeighborsClassifier,0.7747,0.434,0.5563
DecisionTreeClassifier,0.9977,0.9815,0.9896
RandomForestClassifier,0.9991,0.9802,0.9895
AdaBoostClassifier,0.6864,0.3393,0.4542


TEST SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.6471,0.2457,0.3561
LinearSVC,0.5556,0.0372,0.0698
SVC,0.0,0.0,0.0
LogisticRegression,0.3333,0.0025,0.0049
KNeighborsClassifier,0.3182,0.1042,0.157
DecisionTreeClassifier,0.4279,0.2357,0.304
RandomForestClassifier,0.7708,0.0918,0.1641
AdaBoostClassifier,0.5956,0.201,0.3006


CPU times: user 9min 48s, sys: 2min 47s, total: 12min 35s
Wall time: 9min 35s


### Set 5

- N-grams: 1 to 2
- Stop words removal: No
- Maximum vocabulary's size: 20,000

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set5 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=20000)


In [26]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set5, 
    'SET 5', 
    verbose_vocab=True)

   Learned 20000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


TRAIN SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.9481,0.6589,0.7775
LinearSVC,0.8125,0.2109,0.3349
SVC,0.8092,0.1262,0.2183
LogisticRegression,0.7824,0.0762,0.1388
KNeighborsClassifier,0.7741,0.4308,0.5536
DecisionTreeClassifier,0.9963,0.9779,0.987
RandomForestClassifier,0.9986,0.9757,0.987
AdaBoostClassifier,0.6854,0.3457,0.4596
XGBClassifier,0.8859,0.2379,0.3751
GaussianNB,0.6998,1.0,0.8234


TEST SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.6928,0.263,0.3813
LinearSVC,0.5806,0.0447,0.0829
SVC,0.0,0.0,0.0
LogisticRegression,0.5,0.0074,0.0147
KNeighborsClassifier,0.2993,0.1017,0.1519
DecisionTreeClassifier,0.4752,0.3325,0.3912
RandomForestClassifier,0.7042,0.1241,0.211
AdaBoostClassifier,0.5512,0.1737,0.2642
XGBClassifier,0.5,0.067,0.1182
GaussianNB,0.4986,0.4566,0.4767


CPU times: user 4min 24s, sys: 1min 54s, total: 6min 19s
Wall time: 4min 19s


### Set 6

- N-grams: 1 to 2
- Stop words removal: No
- Maximum vocabulary's size: 2,000

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set6 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=2000)


In [28]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set6, 
    'SET 6', 
    verbose_vocab=True)

   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


TRAIN SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.7493,0.3664,0.4921
LinearSVC,0.7807,0.1893,0.3047
SVC,0.7987,0.1091,0.1919
LogisticRegression,0.7603,0.0829,0.1495
KNeighborsClassifier,0.7354,0.4723,0.5752
DecisionTreeClassifier,0.9949,0.9635,0.9789
RandomForestClassifier,0.9981,0.9603,0.9789
AdaBoostClassifier,0.6675,0.3511,0.4601
XGBClassifier,0.8846,0.2384,0.3756
GaussianNB,0.4196,0.9261,0.5776


TEST SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.6692,0.2159,0.3265
LinearSVC,0.6818,0.0744,0.1342
SVC,0.0,0.0,0.0
LogisticRegression,0.5714,0.0099,0.0195
KNeighborsClassifier,0.3608,0.1737,0.2345
DecisionTreeClassifier,0.3391,0.2903,0.3128
RandomForestClassifier,0.6889,0.2308,0.3457
AdaBoostClassifier,0.5494,0.2208,0.315
XGBClassifier,0.56,0.0695,0.1236
GaussianNB,0.3675,0.7295,0.4888


CPU times: user 1min 31s, sys: 8.13 s, total: 1min 40s
Wall time: 1min 31s


### Set 7

- N-grams: 1
- Stop words removal: No
- Vocabulary's size: No limits

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set7 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=None)


In [30]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    model_set7, 
    'SET 7', 
    verbose_vocab=True)

   Learned 11376 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier


TRAIN SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.8189,0.4971,0.6186
LinearSVC,0.7879,0.2059,0.3265
SVC,0.7946,0.1064,0.1876
LogisticRegression,0.7556,0.0766,0.1391
KNeighborsClassifier,0.6709,0.4732,0.555
DecisionTreeClassifier,0.9968,0.9815,0.9891
RandomForestClassifier,0.9977,0.9802,0.9889
AdaBoostClassifier,0.6953,0.3249,0.4429


TEST SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.6364,0.1737,0.2729
LinearSVC,0.5926,0.0397,0.0744
SVC,0.0,0.0,0.0
LogisticRegression,0.5714,0.0099,0.0195
KNeighborsClassifier,0.4627,0.2308,0.3079
DecisionTreeClassifier,0.4103,0.2953,0.3434
RandomForestClassifier,0.7108,0.1464,0.2428
AdaBoostClassifier,0.5344,0.1737,0.2622


CPU times: user 1min 50s, sys: 43.2 s, total: 2min 34s
Wall time: 1min 47s


### Set 8

- N-grams: 1
- Stop words removal: No
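- Maximum vocabulary's size: 20,000

Notes:
- The unigram vocabulary has only 11,376 terms, so the limit has no effect and the models shared with Set 7 get the same scores.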

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set8 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=20000)


In [32]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set8, 
    'SET 8', 
    verbose_vocab=True)

   Learned 11376 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


TRAIN SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.8189,0.4971,0.6186
LinearSVC,0.7879,0.2059,0.3265
SVC,0.7946,0.1064,0.1876
LogisticRegression,0.7556,0.0766,0.1391
KNeighborsClassifier,0.6709,0.4732,0.555
DecisionTreeClassifier,0.9968,0.9815,0.9891
RandomForestClassifier,0.9977,0.9802,0.9889
AdaBoostClassifier,0.6953,0.3249,0.4429
XGBClassifier,0.87,0.2474,0.3853
GaussianNB,0.5456,1.0,0.706


TEST SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.6364,0.1737,0.2729
LinearSVC,0.5926,0.0397,0.0744
SVC,0.0,0.0,0.0
LogisticRegression,0.5714,0.0099,0.0195
KNeighborsClassifier,0.4627,0.2308,0.3079
DecisionTreeClassifier,0.4103,0.2953,0.3434
RandomForestClassifier,0.7108,0.1464,0.2428
AdaBoostClassifier,0.5344,0.1737,0.2622
XGBClassifier,0.4889,0.0546,0.0982
GaussianNB,0.2966,0.4541,0.3588


CPU times: user 2min 16s, sys: 45.6 s, total: 3min 2s
Wall time: 2min 14s


### Set 9

- N-grams: 1
- Stop words removal: No
- Maximum vocabulary's size: 2,000

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set9 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=2000)


In [34]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set9, 
    'SET 9', 
    verbose_vocab=True)

   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


TRAIN SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.7932,0.5065,0.6183
LinearSVC,0.7778,0.205,0.3245
SVC,0.7946,0.1064,0.1876
LogisticRegression,0.7586,0.0793,0.1436
KNeighborsClassifier,0.7476,0.4552,0.5658
DecisionTreeClassifier,0.9949,0.9698,0.9822
RandomForestClassifier,0.9977,0.9671,0.9822
AdaBoostClassifier,0.6925,0.338,0.4543
XGBClassifier,0.8707,0.2519,0.3908
GaussianNB,0.4137,0.954,0.5772


TEST SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.6358,0.2556,0.3646
LinearSVC,0.6216,0.0571,0.1045
SVC,0.0,0.0,0.0
LogisticRegression,0.5714,0.0099,0.0195
KNeighborsClassifier,0.3856,0.1464,0.2122
DecisionTreeClassifier,0.4156,0.33,0.3679
RandomForestClassifier,0.6721,0.2035,0.3124
AdaBoostClassifier,0.527,0.1935,0.2831
XGBClassifier,0.46,0.0571,0.1015
GaussianNB,0.3719,0.67,0.4783


CPU times: user 1min 19s, sys: 10.6 s, total: 1min 30s
Wall time: 1min 18s


### Set 10

- N-grams: 1 to 3
- Stop words removal: Yes
- Vocabulary's size: No limits
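
A small sketch of the effect of `stop_words='english'`, on a hypothetical two-sentence corpus. In scikit-learn the stop words are dropped before the n-grams are generated, so bigrams can join words that were not adjacent in the original text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the court dismissed the appeal", "the appeal was allowed"]

with_stop_words = TfidfVectorizer(ngram_range=(1, 2)).fit(corpus)
without_stop_words = TfidfVectorizer(ngram_range=(1, 2), stop_words="english").fit(corpus)

print(sorted(with_stop_words.vocabulary_))
print(sorted(without_stop_words.vocabulary_))
```

In the second vocabulary, "dismissed appeal" appears as a bigram although "the" stood between the two words. The vocabulary also shrinks, which is consistent with the smaller number of terms learned in this set compared with Set 1.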

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set10 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=None, 
      stop_words='english')


In [36]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    model_set10, 
    'SET 10', 
    verbose_vocab=True)

   Learned 192965 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier


TRAIN SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.9694,0.8423,0.9014
LinearSVC,0.8057,0.1906,0.3083
SVC,0.9055,0.3542,0.5092
LogisticRegression,0.7746,0.0604,0.112
KNeighborsClassifier,0.6932,0.5183,0.5931
DecisionTreeClassifier,0.9977,0.9748,0.9861
RandomForestClassifier,0.9991,0.9734,0.9861
AdaBoostClassifier,0.6734,0.2713,0.3868


TEST SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.4455,0.2333,0.3062
LinearSVC,0.6739,0.0769,0.1381
SVC,0.5263,0.0248,0.0474
LogisticRegression,0.0,0.0,0.0
KNeighborsClassifier,0.3966,0.2854,0.3319
DecisionTreeClassifier,0.3786,0.3251,0.3498
RandomForestClassifier,0.5935,0.1811,0.2776
AdaBoostClassifier,0.5143,0.1787,0.2652


CPU times: user 21min 35s, sys: 4min 21s, total: 25min 57s
Wall time: 22min 2s


### Set 11

- N-grams: 1 to 3
- Stop words removal: Yes
- Maximum vocabulary's size: 20,000

In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set11 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=20000, 
      stop_words='english')


In [38]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set11, 
    'SET 11', 
    verbose_vocab=True)

   Learned 20000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


TRAIN SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.9257,0.7413,0.8233
LinearSVC,0.8115,0.2096,0.3331
SVC,0.8937,0.3524,0.5055
LogisticRegression,0.7767,0.0721,0.132
KNeighborsClassifier,0.7908,0.4975,0.6108
DecisionTreeClassifier,0.9972,0.9703,0.9836
RandomForestClassifier,0.9986,0.9689,0.9835
AdaBoostClassifier,0.655,0.2866,0.3987
XGBClassifier,0.9178,0.146,0.2519
GaussianNB,0.6406,1.0,0.7809


TEST SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.5498,0.3697,0.4421
LinearSVC,0.7121,0.1166,0.2004
SVC,0.64,0.0397,0.0748
LogisticRegression,0.0,0.0,0.0
KNeighborsClassifier,0.4294,0.1737,0.2473
DecisionTreeClassifier,0.3717,0.4169,0.393
RandomForestClassifier,0.4903,0.3127,0.3818
AdaBoostClassifier,0.4497,0.1886,0.2657
XGBClassifier,0.5556,0.0496,0.0911
GaussianNB,0.3963,0.536,0.4557


CPU times: user 3min 33s, sys: 2min, total: 5min 33s
Wall time: 3min 35s


### Set 12

- N-grams: 1 to 3
- Stop words removal: Yes
- Maximum vocabulary's size: 2,000

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set12 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=2000, 
      stop_words='english')


In [40]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set12, 
    'SET 12', 
    verbose_vocab=True)

   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


TRAIN SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.8196,0.5264,0.6411
LinearSVC,0.7943,0.2123,0.335
SVC,0.8701,0.311,0.4582
LogisticRegression,0.7968,0.0901,0.1619
KNeighborsClassifier,0.6284,0.5602,0.5923
DecisionTreeClassifier,0.9915,0.9491,0.9698
RandomForestClassifier,0.9976,0.9428,0.9694
AdaBoostClassifier,0.6714,0.2763,0.3914
XGBClassifier,0.9211,0.142,0.246
GaussianNB,0.3972,0.9721,0.5639


TEST SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.5522,0.3151,0.4013
LinearSVC,0.7027,0.129,0.218
SVC,0.5789,0.0546,0.0998
LogisticRegression,1.0,0.0099,0.0197
KNeighborsClassifier,0.3946,0.2556,0.3102
DecisionTreeClassifier,0.3388,0.4119,0.3718
RandomForestClassifier,0.4494,0.397,0.4216
AdaBoostClassifier,0.4036,0.1663,0.2355
XGBClassifier,0.6765,0.0571,0.1053
GaussianNB,0.3713,0.6799,0.4803


CPU times: user 59.8 s, sys: 12.1 s, total: 1min 11s
Wall time: 1min


### Set 13

- N-grams: 1 to 2
- Stop words removal: Yes
- Vocabulary's size: No limits

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set13 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=None, 
      stop_words='english')


In [42]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    model_set13, 
    'SET 13', 
    verbose_vocab=True)

   Learned 90289 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier


TRAIN SET scores:
Model,Precision,Recall,F1
MLPClassifier,0.9519,0.7048,0.8099
LinearSVC,0.8109,0.201,0.3221
SVC,0.8922,0.3506,0.5034
LogisticRegression,0.791,0.0717,0.1314
KNeighborsClassifier,0.7827,0.4691,0.5866
DecisionTreeClassifier,0.9977,0.9748,0.9861
RandomForestClassifier,0.9991,0.9734,0.9861
AdaBoostClassifier,0.6649,0.2871,0.401


Model,Precision,Recall,F1
MLPClassifier,0.4821,0.201,0.2837
LinearSVC,0.7442,0.0794,0.1435
SVC,0.5263,0.0248,0.0474
LogisticRegression,0.0,0.0,0.0
KNeighborsClassifier,0.4356,0.2184,0.2909
DecisionTreeClassifier,0.416,0.3747,0.3943
RandomForestClassifier,0.5786,0.2283,0.3274
AdaBoostClassifier,0.5,0.201,0.2867


CPU times: user 9min 7s, sys: 2min 37s, total: 11min 44s
Wall time: 8min 55s
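
The F1 column in these tables is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The helper below is a small sketch (the function name is made up for this example) that recomputes two F1 values from the second SET 13 table.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the second SET 13 table.
mlp_f1 = f1_from(0.4821, 0.201)   # MLPClassifier row
logreg_f1 = f1_from(0.0, 0.0)     # LogisticRegression row
print(round(mlp_f1, 4), logreg_f1)
```

The recomputed MLPClassifier value agrees with the reported 0.2837. A row of zeros, as for LogisticRegression, is what one would see if the model predicted no positive sentences at all; that reading is an assumption, since the predictions themselves are not shown.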


### Set 14

- N-grams: 1 to 2
- Stop-word removal: Yes
- Maximum vocabulary size: 20,000
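
When `max_features` is set, scikit-learn keeps only the terms with the highest total count across the fitted corpus. The sketch below shows the effect on a made-up three-document corpus; the cap of 2 is chosen only to keep the output small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical). Total counts: appeal=5, court=2, order=1, petition=1.
toy_docs = [
    "appeal appeal appeal court",
    "appeal court order",
    "appeal petition",
]

full = TfidfVectorizer().fit(toy_docs)
capped = TfidfVectorizer(max_features=2).fit(toy_docs)

full_vocab = sorted(full.vocabulary_)
capped_vocab = sorted(capped.vocabulary_)
print(len(full_vocab), full_vocab)
print(len(capped_vocab), capped_vocab)
```

Only the two most frequent terms survive the cap; rarer terms are dropped regardless of how informative they might be for a classifier.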

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set14 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=20000, 
      stop_words='english')


In [44]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set14, 
    'SET 14', 
    verbose_vocab=True)

   Learned 20000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


Model,Precision,Recall,F1
MLPClassifier,0.9501,0.7729,0.8524
LinearSVC,0.8188,0.2118,0.3366
SVC,0.8916,0.352,0.5047
LogisticRegression,0.786,0.0762,0.1389
KNeighborsClassifier,0.784,0.4808,0.5961
DecisionTreeClassifier,0.9972,0.9712,0.984
RandomForestClassifier,0.9986,0.9698,0.984
AdaBoostClassifier,0.661,0.2803,0.3937
XGBClassifier,0.9282,0.1456,0.2517
GaussianNB,0.6774,1.0,0.8076


Model,Precision,Recall,F1
MLPClassifier,0.5229,0.34,0.412
LinearSVC,0.7333,0.1092,0.1901
SVC,0.64,0.0397,0.0748
LogisticRegression,0.0,0.0,0.0
KNeighborsClassifier,0.4345,0.1811,0.2557
DecisionTreeClassifier,0.381,0.3573,0.3688
RandomForestClassifier,0.5224,0.3176,0.3951
AdaBoostClassifier,0.4777,0.1861,0.2679
XGBClassifier,0.75,0.0596,0.1103
GaussianNB,0.3992,0.5161,0.4502


CPU times: user 3min 59s, sys: 2min 29s, total: 6min 29s
Wall time: 4min 4s
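
`GaussianNB` is the only model passed with `True` in the tuples above. What that flag does is defined in `evaluation`, outside this section; one plausible reading (an assumption here) is that it requests a dense copy of the TF-IDF matrix, because scikit-learn's `GaussianNB` rejects sparse input. The sketch below, on made-up sentences and labels, shows that behaviour.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Made-up sentences and labels (1 = fact, 0 = other); not AILA data.
toy_sentences = [
    "the appellant was arrested on the night of the incident",
    "we find no merit in this appeal",
    "the complainant lodged a report at the police station",
    "the appeal is accordingly dismissed",
]
toy_labels = [1, 0, 1, 0]

# TfidfVectorizer returns a sparse matrix.
X_sparse = TfidfVectorizer().fit_transform(toy_sentences)

# GaussianNB validates its input as dense, so the sparse matrix is rejected.
try:
    GaussianNB().fit(X_sparse, toy_labels)
    sparse_accepted = True
except (TypeError, ValueError):
    sparse_accepted = False

# Converting to a dense array makes the same data usable.
X_dense = X_sparse.toarray()
dense_model = GaussianNB().fit(X_dense, toy_labels)
toy_predictions = dense_model.predict(X_dense)
print(sparse_accepted, toy_predictions)
```

A dense array stores every zero explicitly, so the memory cost of this conversion grows with the vocabulary size.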


### Set 15

- N-grams: 1 to 2
- Stop-word removal: Yes
- Maximum vocabulary size: 2,000

In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set15 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=2000, 
      stop_words='english')


In [46]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set15, 
    'SET 15', 
    verbose_vocab=True)

   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


Model,Precision,Recall,F1
MLPClassifier,0.7927,0.5651,0.6598
LinearSVC,0.798,0.2118,0.3348
SVC,0.8747,0.3082,0.4558
LogisticRegression,0.7961,0.0915,0.1641
KNeighborsClassifier,0.7706,0.4845,0.5949
DecisionTreeClassifier,0.9915,0.95,0.9703
RandomForestClassifier,0.9976,0.9441,0.9701
AdaBoostClassifier,0.6719,0.2713,0.3865
XGBClassifier,0.9212,0.1528,0.2621
GaussianNB,0.3997,0.9734,0.5667


Model,Precision,Recall,F1
MLPClassifier,0.5407,0.33,0.4099
LinearSVC,0.7162,0.1315,0.2222
SVC,0.5946,0.0546,0.1
LogisticRegression,1.0,0.0099,0.0197
KNeighborsClassifier,0.38,0.1414,0.2061
DecisionTreeClassifier,0.3149,0.3945,0.3502
RandomForestClassifier,0.4254,0.3747,0.3984
AdaBoostClassifier,0.5071,0.1762,0.2615
XGBClassifier,0.6571,0.0571,0.105
GaussianNB,0.373,0.6774,0.4811


CPU times: user 58.5 s, sys: 10.7 s, total: 1min 9s
Wall time: 57.9 s
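
One way to read the two tables together is to compute the drop in F1 from the first to the second. The sketch below does this for four SET 15 rows, assuming (the output does not label them) that the first table holds training-set metrics and the second holds test-set metrics. The column names are chosen for this example.

```python
import pandas as pd

# F1 values copied from the two SET 15 tables (first table, then second table).
f1_first = {
    "MLPClassifier": 0.6598,
    "DecisionTreeClassifier": 0.9703,
    "RandomForestClassifier": 0.9701,
    "GaussianNB": 0.5667,
}
f1_second = {
    "MLPClassifier": 0.4099,
    "DecisionTreeClassifier": 0.3502,
    "RandomForestClassifier": 0.3984,
    "GaussianNB": 0.4811,
}

gap_df = pd.DataFrame({"F1 first": f1_first, "F1 second": f1_second})
gap_df["Gap"] = (gap_df["F1 first"] - gap_df["F1 second"]).round(4)

# Largest gap first: a large drop suggests the model fits the first set far better.
gap_df = gap_df.sort_values("Gap", ascending=False)
print(gap_df)
```

Under that assumption, the tree-based models show the largest drop, which is the usual signature of overfitting to the training sentences.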


### Set 16

- N-grams: 1
- Stop-word removal: Yes
- Vocabulary size: unlimited

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set16 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=None, 
      stop_words='english')


In [48]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    model_set16, 
    'SET 16', 
    verbose_vocab=True)

   Learned 11089 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier


Model,Precision,Recall,F1
MLPClassifier,0.8633,0.5548,0.6754
LinearSVC,0.8049,0.2213,0.3471
SVC,0.8867,0.3384,0.4899
LogisticRegression,0.7826,0.0892,0.1602
KNeighborsClassifier,0.7865,0.4849,0.5999
DecisionTreeClassifier,0.9968,0.9743,0.9854
RandomForestClassifier,0.9986,0.9721,0.9852
AdaBoostClassifier,0.6644,0.265,0.3789


Model,Precision,Recall,F1
MLPClassifier,0.5029,0.2159,0.3021
LinearSVC,0.775,0.0769,0.14
SVC,0.5667,0.0422,0.0785
LogisticRegression,0.0,0.0,0.0
KNeighborsClassifier,0.4,0.1687,0.2373
DecisionTreeClassifier,0.374,0.3648,0.3693
RandomForestClassifier,0.5019,0.3275,0.3964
AdaBoostClassifier,0.4832,0.1787,0.2609


CPU times: user 1min 38s, sys: 43.2 s, total: 2min 21s
Wall time: 1min 34s


### Set 17

- N-grams: 1
- Stop-word removal: Yes
- Maximum vocabulary size: 20,000 (not binding here: only 11,089 unigrams are learned, so the features match Set 16)

In [49]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set17 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=20000, 
      stop_words='english')


In [50]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set17, 
    'SET 17', 
    verbose_vocab=True)

   Learned 11089 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


Model,Precision,Recall,F1
MLPClassifier,0.8633,0.5548,0.6754
LinearSVC,0.8049,0.2213,0.3471
SVC,0.8867,0.3384,0.4899
LogisticRegression,0.7826,0.0892,0.1602
KNeighborsClassifier,0.7865,0.4849,0.5999
DecisionTreeClassifier,0.9968,0.9743,0.9854
RandomForestClassifier,0.9986,0.9721,0.9852
AdaBoostClassifier,0.6644,0.265,0.3789
XGBClassifier,0.9056,0.1469,0.2528
GaussianNB,0.5356,1.0,0.6976


Model,Precision,Recall,F1
MLPClassifier,0.5029,0.2159,0.3021
LinearSVC,0.775,0.0769,0.14
SVC,0.5667,0.0422,0.0785
LogisticRegression,0.0,0.0,0.0
KNeighborsClassifier,0.4,0.1687,0.2373
DecisionTreeClassifier,0.374,0.3648,0.3693
RandomForestClassifier,0.5019,0.3275,0.3964
AdaBoostClassifier,0.4832,0.1787,0.2609
XGBClassifier,0.6129,0.0471,0.0876
GaussianNB,0.2864,0.4442,0.3482


CPU times: user 1min 53s, sys: 49.4 s, total: 2min 43s
Wall time: 1min 53s
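
The rows shared by SET 16 and SET 17 are identical because both vectorizers learn the same 11,089 terms: a `max_features` cap larger than the vocabulary removes nothing. The sketch below checks this on a made-up corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical); its vocabulary is far smaller than the cap.
toy_docs = [
    "appeal court order",
    "appeal petition",
    "court order petition dismissed",
]

unlimited = TfidfVectorizer(max_features=None)
loose_cap = TfidfVectorizer(max_features=20000)

X_unlimited = unlimited.fit_transform(toy_docs)
X_capped = loose_cap.fit_transform(toy_docs)

# Both the term-to-column mapping and the TF-IDF weights should coincide.
same_vocab = unlimited.vocabulary_ == loose_cap.vocabulary_
same_matrix = np.allclose(X_unlimited.toarray(), X_capped.toarray())
print(same_vocab, same_matrix)
```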


### Set 18

- N-grams: 1
- Stop-word removal: Yes
- Maximum vocabulary size: 2,000

In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set18 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=2000, 
      stop_words='english')


In [52]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False), 
    (xgboost, False), (naive_bayes, True)], 
    model_set18, 
    'SET 18', 
    verbose_vocab=True)

   Learned 2000 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier
   Processing model: XGBClassifier
   Processing model: GaussianNB


Model,Precision,Recall,F1
MLPClassifier,0.7954,0.4502,0.575
LinearSVC,0.7845,0.215,0.3375
SVC,0.8762,0.3159,0.4644
LogisticRegression,0.7799,0.0942,0.1681
KNeighborsClassifier,0.7668,0.4948,0.6015
DecisionTreeClassifier,0.9939,0.9554,0.9743
RandomForestClassifier,0.9976,0.9518,0.9742
AdaBoostClassifier,0.6682,0.2686,0.3832
XGBClassifier,0.9157,0.1469,0.2532
GaussianNB,0.3997,0.9766,0.5672


Model,Precision,Recall,F1
MLPClassifier,0.6039,0.2308,0.3339
LinearSVC,0.7681,0.1315,0.2246
SVC,0.68,0.0844,0.1501
LogisticRegression,0.8,0.0099,0.0196
KNeighborsClassifier,0.4124,0.1985,0.268
DecisionTreeClassifier,0.3045,0.3325,0.3179
RandomForestClassifier,0.4476,0.3921,0.418
AdaBoostClassifier,0.473,0.1737,0.2541
XGBClassifier,0.6875,0.0546,0.1011
GaussianNB,0.3636,0.6551,0.4677


CPU times: user 53.3 s, sys: 7.58 s, total: 1min
Wall time: 52.9 s


### Set 19

- N-grams: 1 to 3
- Stop-word removal: No
- Vocabulary size: unlimited
- Maximum document frequency (`max_df`): 0.85
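
With a float `max_df`, scikit-learn drops every term whose document frequency exceeds that proportion of the fitted documents, which acts as a corpus-specific alternative to a fixed stop-word list. The sketch below applies `max_df=0.85` to a made-up four-document corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "court" occurs in 4 of 4 documents, "appeal" in 3 of 4.
toy_docs = [
    "court appeal allowed",
    "court appeal dismissed",
    "court appeal pending",
    "court petition filed",
]

unfiltered = TfidfVectorizer().fit(toy_docs)
filtered = TfidfVectorizer(max_df=0.85).fit(toy_docs)

kept_terms = sorted(filtered.vocabulary_)
# Terms present without the filter but missing with it.
dropped_terms = sorted(set(unfiltered.vocabulary_) - set(filtered.vocabulary_))
print(kept_terms)
print(dropped_terms)
```

Here `court` has a document frequency of 1.0 and is removed, while `appeal` at 0.75 stays in the vocabulary.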

In [53]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set19 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 3), 
      max_features=None, 
      max_df=0.85)


In [54]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    model_set19, 
    'SET 19', 
    verbose_vocab=True)

   Learned 259420 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier


Model,Precision,Recall,F1
MLPClassifier,0.9755,0.9166,0.9452
LinearSVC,0.8337,0.1807,0.297
SVC,0.932,0.3767,0.5366
LogisticRegression,0.8041,0.0536,0.1005
KNeighborsClassifier,0.7609,0.4876,0.5943
DecisionTreeClassifier,0.9982,0.9784,0.9882
RandomForestClassifier,0.9995,0.9766,0.9879
AdaBoostClassifier,0.6585,0.2799,0.3928


Model,Precision,Recall,F1
MLPClassifier,0.4828,0.3821,0.4266
LinearSVC,0.8182,0.0447,0.0847
SVC,0.7143,0.0248,0.048
LogisticRegression,0.0,0.0,0.0
KNeighborsClassifier,0.4241,0.1663,0.2389
DecisionTreeClassifier,0.4259,0.3424,0.3796
RandomForestClassifier,0.5802,0.1886,0.2846
AdaBoostClassifier,0.451,0.1141,0.1822


CPU times: user 49min 19s, sys: 8min 11s, total: 57min 31s
Wall time: 49min 37s


### Set 20

- N-grams: 1 to 2
- Stop-word removal: No
- Vocabulary size: unlimited
- Maximum document frequency (`max_df`): 0.85

In [55]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set20 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 2), 
      max_features=None, 
      max_df=0.85)


In [56]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    model_set20, 
    'SET 20', 
    verbose_vocab=True)

   Learned 95509 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier


Model,Precision,Recall,F1
MLPClassifier,0.9678,0.7981,0.8748
LinearSVC,0.8321,0.1987,0.3208
SVC,0.9154,0.3704,0.5274
LogisticRegression,0.7513,0.0653,0.1202
KNeighborsClassifier,0.7713,0.4727,0.5862
DecisionTreeClassifier,0.9982,0.9784,0.9882
RandomForestClassifier,0.9995,0.977,0.9881
AdaBoostClassifier,0.6577,0.2744,0.3873


Model,Precision,Recall,F1
MLPClassifier,0.533,0.2605,0.35
LinearSVC,0.8261,0.0471,0.0892
SVC,0.7059,0.0298,0.0571
LogisticRegression,0.0,0.0,0.0
KNeighborsClassifier,0.4788,0.196,0.2782
DecisionTreeClassifier,0.4429,0.3945,0.4173
RandomForestClassifier,0.5858,0.2457,0.3462
AdaBoostClassifier,0.4352,0.1166,0.184


CPU times: user 10min 15s, sys: 2min 50s, total: 13min 6s
Wall time: 10min 2s


### Set 21

- N-grams: 1
- Stop-word removal: No
- Vocabulary size: unlimited
- Maximum document frequency (`max_df`): 0.85

In [57]:
from sklearn.feature_extraction.text import TfidfVectorizer

model_set21 = TfidfVectorizer(
      preprocessor=preprocess, 
      ngram_range=(1, 1), 
      max_features=None, 
      max_df=0.85)


In [58]:
%%time

evaluation(
    [(mlp, False), (linear_svm, False), (rbf_svm, False), (logistic_regression, False), 
    (knn, False), (decision_tree, False), (random_forest, False), (adaboost, False)], 
    model_set21, 
    'SET 21', 
    verbose_vocab=True)

   Learned 11268 terms.
   Processing model: MLPClassifier
   Processing model: LinearSVC
   Processing model: SVC
   Processing model: LogisticRegression
   Processing model: KNeighborsClassifier
   Processing model: DecisionTreeClassifier
   Processing model: RandomForestClassifier
   Processing model: AdaBoostClassifier


Model,Precision,Recall,F1
MLPClassifier,0.9268,0.8103,0.8646
LinearSVC,0.8,0.2127,0.3361
SVC,0.8737,0.3493,0.499
LogisticRegression,0.7686,0.0838,0.1512
KNeighborsClassifier,0.7735,0.5079,0.6132
DecisionTreeClassifier,0.9958,0.973,0.9843
RandomForestClassifier,0.9981,0.9698,0.9838
AdaBoostClassifier,0.6897,0.2605,0.3781


Model,Precision,Recall,F1
MLPClassifier,0.43,0.3201,0.367
LinearSVC,0.8,0.0496,0.0935
SVC,0.5926,0.0397,0.0744
LogisticRegression,1.0,0.0025,0.005
KNeighborsClassifier,0.3716,0.1687,0.2321
DecisionTreeClassifier,0.3576,0.2804,0.3143
RandomForestClassifier,0.4696,0.2878,0.3569
AdaBoostClassifier,0.5584,0.1067,0.1792


CPU times: user 2min 31s, sys: 1min 26s, total: 3min 57s
Wall time: 2min 25s


### Summary

In [59]:
from IPython.display import display, update_display

pd.set_option("display.max_rows", None)
metrics_df = pd.DataFrame(columns=['Model', 'TF-IDF set', 'Precision', 'Recall', 'F1'])
i = 0
for model_name, metrics in test_metrics.items():
  for m in metrics:
    metrics_df.loc[i] = [model_name, m[0], f'{m[1]:.4f}', f'{m[2]:.4f}', f'{m[3]:.4f}']
    i += 1
metrics_display = display(metrics_df, display_id='metrics_table')

Index,Model,TF-IDF set,Precision,Recall,F1
0,MLPClassifier,SET 1,0.5726,0.3325,0.4207
1,MLPClassifier,SET 2,0.6738,0.2357,0.3493
2,MLPClassifier,SET 3,0.601,0.2953,0.396
3,MLPClassifier,SET 4,0.6471,0.2457,0.3561
4,MLPClassifier,SET 5,0.6928,0.263,0.3813
5,MLPClassifier,SET 6,0.6692,0.2159,0.3265
6,MLPClassifier,SET 7,0.6364,0.1737,0.2729
7,MLPClassifier,SET 8,0.6364,0.1737,0.2729
8,MLPClassifier,SET 9,0.6358,0.2556,0.3646
9,MLPClassifier,SET 10,0.4455,0.2333,0.3062


### Reference paper

> Paheli Bhattacharya, Shounak Paul, Kripabandhu Ghosh, Saptarshi Ghosh, and Adam Wyner. 2019. **Identification of Rhetorical Roles of Sentences in Indian Legal Judgments**. In Proc. International Conference on Legal Knowledge and Information Systems (JURIX).

