## Preparing the Data

First, we import the dataset into the notebook.

In [1]:
import pandas as pd

# Importing the dataset
train = pd.read_excel('OpArticles_ADUs.xlsx')
test = pd.read_excel('OpArticles.xlsx')

In [10]:
train.head()

Unnamed: 0,article_id,annotator,node,ranges,tokens,label
0,5d04a31b896a7fea069ef06f,A,0,"[[2516, 2556]]",O facto não é apenas fruto da ignorância,Value
1,5d04a31b896a7fea069ef06f,A,1,"[[2568, 2806]]",havia no seu humor mais jornalismo (mais inves...,Value
2,5d04a31b896a7fea069ef06f,A,3,"[[3169, 3190]]",É tudo cómico na FIFA,Value
3,5d04a31b896a7fea069ef06f,A,4,"[[3198, 3285]]",o que todos nós permitimos que esta organizaçã...,Value
4,5d04a31b896a7fea069ef06f,A,6,"[[4257, 4296]]",não nos fazem rir à custa dos poderosos,Value


In [11]:
test.head()

Unnamed: 0,article_id,title,authors,body,meta_description,topics,keywords,publish_date,url_canonical
0,5d04a31b896a7fea069ef06f,"Pouco pão e muito circo, morte e bocejo",['José Vítor Malheiros'],"O poeta espanhol António Machado escrevia, uns...","É tudo cómico na FIFA, porque todos os dias a ...",Sports,"['Brasil', 'Campeonato do Mundo', 'Desporto', ...",2014-06-17 00:16:00,https://www.publico.pt/2014/06/17/desporto/opi...
1,5d04a3fc896a7fea069f0717,Portugal nos Mundiais de Futebol de 2010 e 2014,['Rui J. Baptista'],“O mais excelente quadro posto a uma luz logo ...,Deve ser evidenciado o clima favorável criado ...,Sports,"['Brasil', 'Campeonato do Mundo', 'Coreia do N...",2014-07-05 02:46:00,https://www.publico.pt/2014/07/05/desporto/opi...
2,5d04a455896a7fea069f07ab,"Futebol, guerra, religião",['Fernando Belo'],1. As sociedades humanas parecem ser regidas p...,O futebol parece ser um sucedâneo quer da lei ...,Sports,"['A guerra na Síria', 'Desporto', 'Futebol', '...",2014-07-12 16:05:33,https://www.publico.pt/2014/07/12/desporto/opi...
3,5d04a52f896a7fea069f0921,As razões do Qatar para acolher o Mundial em 2022,['Hamad bin Khalifa bin Ahmad Al Thani'],Este foi um Mundial incrível. Vimos actuações ...,Queremos cooperar plenamente com a investigaçã...,Sports,"['Desporto', 'FIFA', 'Futebol', 'Mundial de fu...",2014-07-27 02:00:00,https://www.publico.pt/2014/07/27/desporto/opi...
4,5d04a8d7896a7fea069f6997,A política no campo de futebol,['Carlos Nolasco'],O futebol sempre foi um jogo aparentemente sim...,Retirar a expressão política do futebol é reti...,Sports,"['Albânia', 'Campeonato da Europa', 'Desporto'...",2014-10-23 00:16:00,https://www.publico.pt/2014/10/23/desporto/opi...


## Cleanup and normalization

The next step is to clean up our dataset and normalize some of the data.

#### Removing non-alphabetic chars

Let's start by removing any non-alpha chars, using a regular expression. We'll create a separate corpus (a list of tokens), so that we leave the original dataset untouched.

#### Lowercasing

We can then apply lowercasing, so that words such as *Amazing*, *AMAZING* and *amazing* all have the same representation.

#### Removing stop words

Another common step is to remove stop words (words that carry no domain semantics). We can use the Portuguese stop-word list provided in NLTK. We remove *não* from that list so that negation is preserved in the corpus:

#### Stemming

Finally, we can apply stemming to further reduce the size of the vocabulary through normalization.

In [2]:
import re
from nltk.stem import RSLPStemmer
from nltk.corpus import stopwords

stopwords_list = stopwords.words('portuguese')
stopwords_list.remove('não')

corpus = []
stemmer = RSLPStemmer()
stopwords_set = set(stopwords_list)  # build the set once instead of once per token
for text in train['tokens']:
    # replace non-alphabetic characters with spaces (accented Latin letters are kept)
    text = re.sub('[^a-zA-Z\u00C0-\u00ff]', ' ', text)
    # convert to lower case
    text = text.lower()
    # split into tokens, remove stop words and apply stemming
    text = ' '.join(stemmer.stem(w) for w in text.split() if w not in stopwords_set)
    corpus.append(text)

print(corpus[:5])

['fact não apen frut ignor', 'hav hum jorn investig preocup aprofund contextual histór isenç relat preocup soc urg denunci muit peç real jorn', 'tud cómic fif', 'tod permit organiz faç total absurd sent', 'não faz rir cust poder']
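To see what the first cleanup steps do in isolation, here is a minimal sketch that applies the same regular expression and lowercasing to one span from the dataset. The stop-word set is a tiny hypothetical stand-in for the NLTK list, and stemming is left out so the example runs without NLTK.

```python
import re

# Hypothetical mini stop-word set; the notebook uses NLTK's Portuguese list instead.
toy_stopwords = {"é", "na"}

text = "É tudo cómico na FIFA!"

# Replace every character that is not an ASCII letter or in the U+00C0..U+00FF range with a space
cleaned = re.sub('[^a-zA-Z\u00C0-\u00ff]', ' ', text)
# Lowercase, split on whitespace and drop stop words
tokens = [w for w in cleaned.lower().split() if w not in toy_stopwords]

print(tokens)  # ['tudo', 'cómico', 'fifa']
```

Note that the U+00C0..U+00FF range also contains the symbols × and ÷, so those two characters would survive this filter.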


## Obtaining Features and Classes

The next step is to obtain the features we will use to train our model.

For this, we will use TF-IDF over unigrams and bigrams.

TODO: explore [TfidfVectorizer params](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
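As a small illustration of what `ngram_range=(1, 2)` produces, the sketch below vectorizes a toy corpus loosely based on the stemmed spans printed earlier. The names `toy_corpus`, `toy_vectorizer` and `toy_X` are made up for this example, and `min_df=1` is used only because the corpus is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of already stemmed spans (illustrative only)
toy_corpus = ["tud cómic fif", "não faz rir", "tud faz rir"]

# ngram_range=(1, 2) extracts unigrams and bigrams; min_df=1 keeps every term
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
toy_X = toy_vectorizer.fit_transform(toy_corpus)  # SciPy sparse matrix

vocab = toy_vectorizer.vocabulary_  # maps each n-gram to its column index
print(sorted(vocab))
print(toy_X.shape)  # (3, 11): 6 unigrams + 5 bigrams
```

With the notebook's `min_df=3`, any n-gram that appears in fewer than three spans would be dropped, which is one of the parameters worth exploring.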

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
    
vectorizer = TfidfVectorizer(ngram_range=(1,2), min_df=3, max_df=0.8)
X = vectorizer.fit_transform(corpus).toarray()

print("(Number of samples, Number of features):", X.shape)

(Number of samples, Number of features): (16743, 19262)
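`fit_transform` returns a SciPy sparse matrix, and `.toarray()` converts it into a dense array. A back-of-the-envelope estimate of the dense size, assuming 64-bit floats (the `TfidfVectorizer` default dtype), suggests why this conversion can be costly for the shape printed above:

```python
n_samples, n_features = 16743, 19262  # shape printed above
bytes_per_value = 8                   # assumption: float64 values

dense_bytes = n_samples * n_features * bytes_per_value
print(f"{dense_bytes / 1e9:.2f} GB")  # 2.58 GB
```

Several of the classifiers used later, such as the naive Bayes and linear models, accept sparse input directly, so keeping the matrix sparse may be worth testing.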


In [4]:
y = train['label']

print(y.shape)

(16743,)


## Training classifiers

We consider the following classifiers:

- *Naive Bayes*, through scikit-learn's [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) and [ComplementNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.ComplementNB.html) classes, two variants commonly used for text classification.
- *Logistic Regression*, through scikit-learn's [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) class.
- *Decision Tree*, through scikit-learn's [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) class. When the tree is grown until its leaves are pure (the default), this model assigns a probability of 1 to one of the classes.
- *Random Forest*, through scikit-learn's [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) class.
- *Support Vector Machines (SVM)*, through scikit-learn's [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) class. The SVM model also allows you to get probabilities, but for that you need to use the *probability=True* parameter setting in its constructor.
- *Perceptron*, through scikit-learn's [Perceptron](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html) class. This model does not allow you to get probabilities.
- *eXtreme Gradient Boosting*, through [XGBoost](https://xgboost.readthedocs.io/en/stable/).
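
The remarks about probabilities in the list above can be checked without training anything. The following sketch simply asks each unfitted estimator whether it exposes `predict_proba`; the dictionary names are made up for this example, and the result may vary with the installed scikit-learn version.

```python
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

candidates = {
    "LogisticRegression": LogisticRegression(),
    "DecisionTree": DecisionTreeClassifier(),
    "SVC (default)": SVC(),
    "SVC (probability=True)": SVC(probability=True),
    "Perceptron": Perceptron(),
}

# hasattr is False when the estimator does not offer probability estimates
supports_proba = {name: hasattr(clf, "predict_proba") for name, clf in candidates.items()}

for name, ok in supports_proba.items():
    print(f"{name:24s} predict_proba available: {ok}")
```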

TODO:
- Tune parameters
- [explore more CVs](https://scikit-learn.org/stable/modules/classes.html?highlight=model_selection#splitter-classes)

In [5]:
# Metrics
import sklearn.metrics as metrics
import time

# Cross Validation and Hyper Tuning
from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold, GridSearchCV

# Classifiers
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import Perceptron
import xgboost as xgb

To train the classifiers, we first split the data into training and test sets.
We use 80% of the data for the training set and the remaining 20% for the test set.
We pass the _stratify_ parameter so that both sets keep the same label proportions as the full dataset.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0, stratify=y)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

print("\nLabel distribution in the training set:")
print(y_train.value_counts())

print("\nLabel distribution in the test set:")
print(y_test.value_counts())

(13394, 19262) (13394,)
(3349, 19262) (3349,)

Label distribution in the training set:
Value       6481
Fact        2930
Value(-)    2320
Value(+)    1129
Policy       534
Name: label, dtype: int64

Label distribution in the test set:
Value       1621
Fact         733
Value(-)     580
Value(+)     282
Policy       133
Name: label, dtype: int64
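
As a quick sanity check of the stratified split, the sketch below recomputes the label proportions from the counts printed above. The variable names are made up for this example.

```python
# Label counts copied from the output above
train_counts = {"Value": 6481, "Fact": 2930, "Value(-)": 2320, "Value(+)": 1129, "Policy": 534}
test_counts = {"Value": 1621, "Fact": 733, "Value(-)": 580, "Value(+)": 282, "Policy": 133}

train_total = sum(train_counts.values())  # 13394
test_total = sum(test_counts.values())    # 3349

for label in train_counts:
    p_train = train_counts[label] / train_total
    p_test = test_counts[label] / test_total
    print(f"{label:9s} train={p_train:.3f} test={p_test:.3f}")

# Largest difference in proportion between the two sets
max_gap = max(abs(train_counts[l] / train_total - test_counts[l] / test_total)
              for l in train_counts)
print(f"largest gap: {max_gap:.4f}")
```

The proportions agree to within a tenth of a percentage point, which is what stratification is meant to achieve.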


### Baseline Predictions

In [7]:
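def predict(clf):
    """Fit clf on the training split, print test metrics and return the fitted model."""
    start = time.time()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    stop = time.time()

    # Metrics
    print("Elapsed time: %0.2fs" % (stop - start))
    print("\nConfusion matrix:\n", metrics.confusion_matrix(y_test, y_pred))
    print("Classification report:\n", metrics.classification_report(y_test, y_pred))

    # Return the fitted model so that assignments such as `mnb = predict(...)` keep it
    return clf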

#### Naive Bayes

In [10]:
mnb = predict(MultinomialNB())

Elapsed time: 1.02s

Confusion matrix:
 [[ 116    0  598    1   18]
 [   0    0  133    0    0]
 [  74    0 1517    0   30]
 [  15    0  260    5    2]
 [  14    0  496    0   70]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.53      0.16      0.24       733
      Policy       0.00      0.00      0.00       133
       Value       0.50      0.94      0.66      1621
    Value(+)       0.83      0.02      0.03       282
    Value(-)       0.58      0.12      0.20       580

    accuracy                           0.51      3349
   macro avg       0.49      0.25      0.23      3349
weighted avg       0.53      0.51      0.41      3349





In [11]:
cnb = predict(ComplementNB())

Elapsed time: 0.89s

Confusion matrix:
 [[301  16 307  40  69]
 [  4  58  58   8   5]
 [286  54 977  95 209]
 [ 39  13  96 115  19]
 [ 75  10 201  13 281]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.43      0.41      0.42       733
      Policy       0.38      0.44      0.41       133
       Value       0.60      0.60      0.60      1621
    Value(+)       0.42      0.41      0.42       282
    Value(-)       0.48      0.48      0.48       580

    accuracy                           0.52      3349
   macro avg       0.46      0.47      0.47      3349
weighted avg       0.52      0.52      0.52      3349



#### SGD

In [12]:
sgd = predict(SGDClassifier(random_state=0, n_jobs=-1))

Elapsed time: 27.14s

Confusion matrix:
 [[ 230    3  418   28   54]
 [   7   38   72   14    2]
 [ 182   18 1249   48  124]
 [  32    8  143   84   15]
 [  47    2  310    6  215]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.46      0.31      0.37       733
      Policy       0.55      0.29      0.38       133
       Value       0.57      0.77      0.66      1621
    Value(+)       0.47      0.30      0.36       282
    Value(-)       0.52      0.37      0.43       580

    accuracy                           0.54      3349
   macro avg       0.51      0.41      0.44      3349
weighted avg       0.53      0.54      0.52      3349



#### Logistic Regression

In [13]:
lg = predict(LogisticRegression(random_state=0, n_jobs=-1))

Elapsed time: 198.74s

Confusion matrix:
 [[ 217    1  472    7   36]
 [   4   14  110    4    1]
 [ 158    4 1372    5   82]
 [  31    2  205   34   10]
 [  47    1  389    0  143]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.47      0.30      0.36       733
      Policy       0.64      0.11      0.18       133
       Value       0.54      0.85      0.66      1621
    Value(+)       0.68      0.12      0.20       282
    Value(-)       0.53      0.25      0.34       580

    accuracy                           0.53      3349
   macro avg       0.57      0.32      0.35      3349
weighted avg       0.54      0.53      0.48      3349



#### SVC

In [14]:
svc = predict(SVC(random_state=0, max_iter=100))



Elapsed time: 225.82s

Confusion matrix:
 [[ 189   19  518    6    1]
 [  13   41   78    1    0]
 [ 402   66 1141   10    2]
 [  82   12  184    4    0]
 [ 134   14  424    6    2]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.23      0.26      0.24       733
      Policy       0.27      0.31      0.29       133
       Value       0.49      0.70      0.58      1621
    Value(+)       0.15      0.01      0.03       282
    Value(-)       0.40      0.00      0.01       580

    accuracy                           0.41      3349
   macro avg       0.31      0.26      0.23      3349
weighted avg       0.38      0.41      0.35      3349



#### Decision Tree

In [15]:
dt = predict(DecisionTreeClassifier(random_state=0, max_depth=5))

Elapsed time: 21.25s

Confusion matrix:
 [[  22    3  704    1    3]
 [   1   17  113    1    1]
 [  13    6 1597    0    5]
 [   0    0  280    1    1]
 [   9    0  568    0    3]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.49      0.03      0.06       733
      Policy       0.65      0.13      0.21       133
       Value       0.49      0.99      0.65      1621
    Value(+)       0.33      0.00      0.01       282
    Value(-)       0.23      0.01      0.01       580

    accuracy                           0.49      3349
   macro avg       0.44      0.23      0.19      3349
weighted avg       0.44      0.49      0.34      3349



#### Random Forest

In [16]:
rf = predict(RandomForestClassifier(random_state=0, max_depth=5, n_jobs=-1))

Elapsed time: 5.00s

Confusion matrix:
 [[   0    0  733    0    0]
 [   0    0  133    0    0]
 [   0    0 1621    0    0]
 [   0    0  282    0    0]
 [   0    0  580    0    0]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.00      0.00      0.00       733
      Policy       0.00      0.00      0.00       133
       Value       0.48      1.00      0.65      1621
    Value(+)       0.00      0.00      0.00       282
    Value(-)       0.00      0.00      0.00       580

    accuracy                           0.48      3349
   macro avg       0.10      0.20      0.13      3349
weighted avg       0.23      0.48      0.32      3349





#### K Nearest Neighbors

In [17]:
knn = predict(KNeighborsClassifier(n_neighbors=10, n_jobs=-1))

Elapsed time: 14.77s

Confusion matrix:
 [[ 199    0  451   33   50]
 [  36    1   76    8   12]
 [ 344    0 1120   66   91]
 [  60    0  169   36   17]
 [ 110    0  352   26   92]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.27      0.27      0.27       733
      Policy       1.00      0.01      0.01       133
       Value       0.52      0.69      0.59      1621
    Value(+)       0.21      0.13      0.16       282
    Value(-)       0.35      0.16      0.22       580

    accuracy                           0.43      3349
   macro avg       0.47      0.25      0.25      3349
weighted avg       0.43      0.43      0.40      3349



#### Perceptron

In [18]:
per = predict(Perceptron(random_state=0))

Elapsed time: 48.11s

Confusion matrix:
 [[284  12 307  37  93]
 [  4  67  45   9   8]
 [333  40 931  66 251]
 [ 46  13 114  86  23]
 [ 89   4 197   4 286]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.38      0.39      0.38       733
      Policy       0.49      0.50      0.50       133
       Value       0.58      0.57      0.58      1621
    Value(+)       0.43      0.30      0.36       282
    Value(-)       0.43      0.49      0.46       580

    accuracy                           0.49      3349
   macro avg       0.46      0.45      0.46      3349
weighted avg       0.50      0.49      0.49      3349



#### XGBoost

In [19]:
# Use a separate name for the model so that the imported `xgb` module is not overwritten
xgb_clf = predict(xgb.XGBClassifier(random_state=0))



Elapsed time: 589.44s

Confusion matrix:
 [[ 119    2  576   15   21]
 [   2   35   89    4    3]
 [  71   18 1463   14   55]
 [  13    2  238   25    4]
 [  21    3  465    2   89]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.53      0.16      0.25       733
      Policy       0.58      0.26      0.36       133
       Value       0.52      0.90      0.66      1621
    Value(+)       0.42      0.09      0.15       282
    Value(-)       0.52      0.15      0.24       580

    accuracy                           0.52      3349
   macro avg       0.51      0.31      0.33      3349
weighted avg       0.51      0.52      0.44      3349



### Parameter tuning

To improve on the baselines, we explore different hyperparameters for each classifier using a grid search with stratified 5-fold cross-validation.

TODO [explore more scoring methods](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter)
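
One reason to look beyond accuracy is the label imbalance. The sketch below recomputes accuracy and macro-averaged recall from the confusion matrix printed for the Random Forest baseline, where every span was predicted as *Value*. It is plain arithmetic on numbers from this notebook, not a new experiment.

```python
# Confusion matrix printed for the Random Forest baseline (rows = true labels)
labels = ["Fact", "Policy", "Value", "Value(+)", "Value(-)"]
cm = [
    [0, 0, 733, 0, 0],
    [0, 0, 133, 0, 0],
    [0, 0, 1621, 0, 0],
    [0, 0, 282, 0, 0],
    [0, 0, 580, 0, 0],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

# Recall per class = correct predictions / number of true samples of that class
recalls = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
macro_recall = sum(recalls) / len(recalls)

print(f"accuracy={accuracy:.2f} macro recall={macro_recall:.2f}")
# accuracy=0.48 macro recall=0.20
```

Accuracy rewards always predicting the majority class, while macro-averaged scores do not. scikit-learn's `GridSearchCV` accepts scoring strings such as `'f1_macro'` or `'balanced_accuracy'`, which could be tried in place of `'accuracy'`.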

In [20]:
def grid_search(clf, parameter_grid):
    """Run a stratified 5-fold grid search, report test metrics and return the best model."""
    cross_validation = StratifiedKFold(n_splits=5)

    # Named `search` so that it does not shadow the enclosing function
    search = GridSearchCV(clf,
                          param_grid=parameter_grid,
                          scoring='accuracy',
                          cv=cross_validation,
                          verbose=4,
                          n_jobs=2,
                          refit=True)

    start = time.time()
    search.fit(X_train, y_train)
    stop = time.time()
    print(f"Fit time: {stop - start}s")

    print("\nBest score:", search.best_score_)
    print("Best parameters:", search.best_params_)
    print("Best estimator:", search.best_estimator_)

    best_model = search.best_estimator_
    best_model_pred = best_model.predict(X_test)

    # Metrics
    print("\nConfusion matrix:\n", metrics.confusion_matrix(y_test, best_model_pred))
    print("Classification report:\n", metrics.classification_report(y_test, best_model_pred))

    return best_model

#### SGD

In [22]:
clf = SGDClassifier(random_state=0, n_jobs=-1, early_stopping=True)

# The 'log' loss gives logistic regression; 'perceptron' is the linear loss used by the perceptron algorithm.
# Note: scikit-learn 1.1 renamed 'log' to 'log_loss', and later releases no longer accept 'log'.
parameter_grid = {'loss': ['log', 'hinge', 'perceptron'], 'penalty': ['elasticnet', 'l1', 'l2'], 'class_weight': [None, 'balanced']}

sgd = grid_search(clf, parameter_grid)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
Fit time: 2206.660979270935s

Best score: 0.5224735802291549
Best parameters: {'class_weight': None, 'loss': 'hinge', 'penalty': 'l2'}
Best estimator: SGDClassifier(early_stopping=True, n_jobs=-1, random_state=0)

Confusion matrix:
 [[ 146    2  515   11   59]
 [   1   33   90    5    4]
 [  95   11 1373   13  129]
 [  18    2  192   41   29]
 [  20    1  363    0  196]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.52      0.20      0.29       733
      Policy       0.67      0.25      0.36       133
       Value       0.54      0.85      0.66      1621
    Value(+)       0.59      0.15      0.23       282
    Value(-)       0.47      0.34      0.39       580

    accuracy                           0.53      3349
   macro avg       0.56      0.36      0.39      3349
weighted avg       0.53      0.53      0.49      3349



#### Logistic Regression

This grid search takes too long to run, so the cell below was left unexecuted.

In [None]:
clf = LogisticRegression(random_state=0)

parameter_grid= {'solver': ['saga'], 'penalty': ['elasticnet', 'l1', 'l2'], 'l1_ratio': [0.5], 'class_weight': [None, 'balanced']}

lg = grid_search(clf, parameter_grid)

## Training models per annotator and using ensemble

Training a separate model on each annotator's labels should make training easier and faster, since each model sees less data. After training, we combine the models with an ensemble method such as voting, so that the predictions of the individual models are used to reach a consensus on a label.
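
To make the voting idea concrete, here is a minimal sketch of soft voting for a single span. The probabilities are hypothetical, and `soft_vote` is a helper written for this illustration, not part of any library.

```python
def soft_vote(probabilities):
    """Average per-class probabilities across models and return (label, averages)."""
    classes = sorted(probabilities[0])
    averages = {c: sum(p[c] for p in probabilities) / len(probabilities) for c in classes}
    return max(averages, key=averages.get), averages

# Hypothetical class probabilities for one span, one dict per annotator model
per_model = [
    {"Fact": 0.6, "Policy": 0.1, "Value": 0.3},  # model A
    {"Fact": 0.2, "Policy": 0.1, "Value": 0.7},  # model B
    {"Fact": 0.3, "Policy": 0.1, "Value": 0.6},  # model C
    {"Fact": 0.5, "Policy": 0.1, "Value": 0.4},  # model D
]

label, averages = soft_vote(per_model)
print(label)  # Value
```

A hard (majority) vote would be tied here, with two models choosing *Fact* and two choosing *Value*; averaging the probabilities breaks the tie. Soft voting assumes every model reports probabilities for the same set of classes, which is worth verifying when an annotator never used one of the labels.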

In [18]:
X_annotated = pd.concat([train['annotator'], pd.DataFrame(X)], axis=1)

print(X_annotated)

      annotator    0    1    2    3    4    5    6    7    8  ...  19252  \
0             A  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   
1             A  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   
2             A  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   
3             A  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   
4             A  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   
...         ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...    ...   
16738         D  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   
16739         D  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   
16740         D  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   
16741         D  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   
16742         D  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   

       19253  19254  19255  19256  19257  19258  19259  19260  19261  
0        0.0    

In [19]:
# `axis=1` is passed by keyword: recent pandas versions no longer accept it positionally
X_A = X_annotated[X_annotated['annotator'] == 'A']
X_A = X_A.drop('annotator', axis=1)
y_A = train.loc[train['annotator'] == 'A', 'label']

X_B = X_annotated[X_annotated['annotator'] == 'B']
X_B = X_B.drop('annotator', axis=1)
y_B = train.loc[train['annotator'] == 'B', 'label']

X_C = X_annotated[X_annotated['annotator'] == 'C']
X_C = X_C.drop('annotator', axis=1)
y_C = train.loc[train['annotator'] == 'C', 'label']

X_D = X_annotated[X_annotated['annotator'] == 'D']
X_D = X_D.drop('annotator', axis=1)
y_D = train.loc[train['annotator'] == 'D', 'label']

print(X_A.shape)
print(y_A.shape)

(3335, 19262)
(3335,)


In [20]:
X_A_tr, X_A_te, y_A_tr, y_A_te = train_test_split(X_A, y_A, test_size=0.20, random_state=0, stratify=y_A)

X_B_tr, X_B_te, y_B_tr, y_B_te = train_test_split(X_B, y_B, test_size=0.20, random_state=0, stratify=y_B)

X_C_tr, X_C_te, y_C_tr, y_C_te = train_test_split(X_C, y_C, test_size=0.20, random_state=0, stratify=y_C)

X_D_tr, X_D_te, y_D_tr, y_D_te = train_test_split(X_D, y_D, test_size=0.20, random_state=0, stratify=y_D)

Now we can proceed to use voting in order to obtain a better prediction:

In [23]:
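from mlxtend.classifier import EnsembleVoteClassifier

# One logistic-regression-style SGD model per annotator.
# Note: scikit-learn 1.1 renamed loss="log" to "log_loss"; later releases no longer accept "log".
clf_A = SGDClassifier(random_state=0, n_jobs=-1, loss="log")
clf_A.fit(X_A_tr, y_A_tr)

clf_B = SGDClassifier(random_state=0, n_jobs=-1, loss="log")
clf_B.fit(X_B_tr, y_B_tr)

clf_C = SGDClassifier(random_state=0, n_jobs=-1, loss="log")
clf_C.fit(X_C_tr, y_C_tr)

clf_D = SGDClassifier(random_state=0, n_jobs=-1, loss="log")
clf_D.fit(X_D_tr, y_D_tr)

clf_list = [clf_A, clf_B, clf_C, clf_D]

# fit_base_estimators=False keeps the already fitted per-annotator models
eclf = EnsembleVoteClassifier(clfs=clf_list, fit_base_estimators=False, voting='soft')
eclf.fit(X,y)
# Caveat: X contains the rows the base models were trained on, so the report below is optimistic
y_pred_vote = eclf.predict(X)
# y_test comes from the earlier global split; it is printed for reference only
print(y_test)
print("Classification report:\n", metrics.classification_report(y, y_pred_vote))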



2674     Value(-)
7609     Value(-)
10747    Value(-)
12931        Fact
4107     Value(+)
           ...   
8773        Value
13538    Value(-)
1782        Value
9772        Value
11212       Value
Name: label, Length: 3349, dtype: object
Classification report:
               precision    recall  f1-score   support

        Fact       0.80      0.21      0.34      3663
      Policy       0.87      0.04      0.08       667
       Value       0.52      0.99      0.68      8102
    Value(+)       0.93      0.04      0.07      1411
    Value(-)       0.89      0.13      0.22      2900

    accuracy                           0.55     16743
   macro avg       0.80      0.28      0.28     16743
weighted avg       0.69      0.55      0.45     16743



## Removing minorities

Next, we remove the entries in which an annotator is in the minority regarding the label of a span. The idea is to keep a single label per span, ideally the one that most annotators agree on. This also reduces the amount of data we work with. To do so, we group the data by tokens and label, count the annotators for each token-label pair, and then keep one row per token.
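
The majority rule described above can be sketched on a toy frame as follows. The data and the names `toy`, `counts` and `majority` are hypothetical; this is one possible way to pick the most-voted label, not necessarily what the cell below does.

```python
import pandas as pd

# Hypothetical annotations: three annotators label "span a", two label "span b"
toy = pd.DataFrame({
    "tokens": ["span a", "span a", "span a", "span b", "span b"],
    "annotator": ["A", "B", "C", "A", "B"],
    "label": ["Value", "Value", "Fact", "Policy", "Policy"],
})

# Number of annotators behind each (tokens, label) pair
counts = toy.groupby(["tokens", "label"]).size().reset_index(name="votes")

# Put the most-voted label first within each span, then keep one row per span
majority = (counts.sort_values(["tokens", "votes"], ascending=[True, False])
                  .drop_duplicates("tokens")
                  .reset_index(drop=True))

print(majority)
result = dict(zip(majority["tokens"], majority["label"]))
```

Here "span a" keeps *Value* (two votes against one for *Fact*). When two labels receive the same number of votes, this sketch keeps whichever comes first after sorting, so ties would need an explicit rule.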

In [8]:
df_tmp = train.groupby(['tokens', 'label']).agg({
    'annotator': 'count'
}).reset_index()

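# NOTE: 'max' and 'first' are aggregated independently, so 'label' holds the first label
# in sort order for each token, which is not necessarily the one with the most annotators.
train_no_duplicates = df_tmp.groupby(['tokens'], as_index=False).agg({'annotator': 'max', 'label': 'first'})
train_no_duplicates = train_no_duplicates.drop('annotator', axis=1)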
print(train_no_duplicates)

corpus = []
stopwords_set = set(stopwords_list)  # build the set once instead of once per token
for text in train_no_duplicates['tokens']:
    # replace non-alphabetic characters with spaces (accented Latin letters are kept)
    text = re.sub('[^a-zA-Z\u00C0-\u00ff]', ' ', text)
    # convert to lower case
    text = text.lower()
    # split into tokens, remove stop words and apply stemming
    text = ' '.join(stemmer.stem(w) for w in text.split() if w not in stopwords_set)
    corpus.append(text)
    
X = vectorizer.fit_transform(corpus).toarray()
y = train_no_duplicates['label']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0, stratify=y, shuffle=True)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

print("\nLabel distribution in the training set:")
print(y_train.value_counts())

print("\nLabel distribution in the test set:")
print(y_test.value_counts())

                                                  tokens  label
0      "porque é mais prático e, o que é pior, para t...   Fact
1      "revolução" é um termo frequente na literatura...   Fact
2      (In)felizmente, essas relações acabam, mais ce...   Fact
3                    (em 2017 foram quatro vezes maiores   Fact
4      (pelo que julgo poder deduzir) não está obriga...   Fact
...                                                  ...    ...
12003  “tratando-se de pares cujos elementos pertence...   Fact
12004  “um instituto universitário especializado com ...   Fact
12005  “um militar na polícia, nem consegue ser bom p...  Value
12006                           “Água pura, cristalina.”   Fact
12007      “área verde de enquadramento de espaço canal”  Value

[12008 rows x 2 columns]
(9606, 8054) (9606,)
(2402, 8054) (2402,)

Label distribution in the training set:
Value       4537
Fact        2549
Value(-)    1412
Value(+)     678
Policy       430
Name: label, dtype: int64

Label distr