## Preparing the Data

First, we import the dataset into the notebook.

In [1]:
import pandas as pd

# Importing the dataset
train = pd.read_excel('OpArticles_ADUs.xlsx')
test = pd.read_excel('OpArticles.xlsx')

In [2]:
train.head()

Unnamed: 0,article_id,annotator,node,ranges,tokens,label
0,5d04a31b896a7fea069ef06f,A,0,"[[2516, 2556]]",O facto não é apenas fruto da ignorância,Value
1,5d04a31b896a7fea069ef06f,A,1,"[[2568, 2806]]",havia no seu humor mais jornalismo (mais inves...,Value
2,5d04a31b896a7fea069ef06f,A,3,"[[3169, 3190]]",É tudo cómico na FIFA,Value
3,5d04a31b896a7fea069ef06f,A,4,"[[3198, 3285]]",o que todos nós permitimos que esta organizaçã...,Value
4,5d04a31b896a7fea069ef06f,A,6,"[[4257, 4296]]",não nos fazem rir à custa dos poderosos,Value


In [3]:
test.head()

Unnamed: 0,article_id,title,authors,body,meta_description,topics,keywords,publish_date,url_canonical
0,5d04a31b896a7fea069ef06f,"Pouco pão e muito circo, morte e bocejo",['José Vítor Malheiros'],"O poeta espanhol António Machado escrevia, uns...","É tudo cómico na FIFA, porque todos os dias a ...",Sports,"['Brasil', 'Campeonato do Mundo', 'Desporto', ...",2014-06-17 00:16:00,https://www.publico.pt/2014/06/17/desporto/opi...
1,5d04a3fc896a7fea069f0717,Portugal nos Mundiais de Futebol de 2010 e 2014,['Rui J. Baptista'],“O mais excelente quadro posto a uma luz logo ...,Deve ser evidenciado o clima favorável criado ...,Sports,"['Brasil', 'Campeonato do Mundo', 'Coreia do N...",2014-07-05 02:46:00,https://www.publico.pt/2014/07/05/desporto/opi...
2,5d04a455896a7fea069f07ab,"Futebol, guerra, religião",['Fernando Belo'],1. As sociedades humanas parecem ser regidas p...,O futebol parece ser um sucedâneo quer da lei ...,Sports,"['A guerra na Síria', 'Desporto', 'Futebol', '...",2014-07-12 16:05:33,https://www.publico.pt/2014/07/12/desporto/opi...
3,5d04a52f896a7fea069f0921,As razões do Qatar para acolher o Mundial em 2022,['Hamad bin Khalifa bin Ahmad Al Thani'],Este foi um Mundial incrível. Vimos actuações ...,Queremos cooperar plenamente com a investigaçã...,Sports,"['Desporto', 'FIFA', 'Futebol', 'Mundial de fu...",2014-07-27 02:00:00,https://www.publico.pt/2014/07/27/desporto/opi...
4,5d04a8d7896a7fea069f6997,A política no campo de futebol,['Carlos Nolasco'],O futebol sempre foi um jogo aparentemente sim...,Retirar a expressão política do futebol é reti...,Sports,"['Albânia', 'Campeonato da Europa', 'Desporto'...",2014-10-23 00:16:00,https://www.publico.pt/2014/10/23/desporto/opi...


## Cleanup and normalization

The next step is to clean up our dataset and normalize some of the data.

#### Removing non-alphabetic chars

Let's start by removing any non-alphabetic characters using a regular expression. We'll build a separate corpus (a list of cleaned strings, one per annotated span) so that the original dataset stays untouched.

#### Lowercasing

We can then apply lowercasing, so that words such as *Amazing*, *AMAZING* and *amazing* all have the same representation.

#### Removing stop words

Another common step is to remove stop words (words that carry little domain-specific meaning). We use the Portuguese stop word list provided by NLTK, but keep *não* ("not") so that negation is preserved.

#### Stemming

Finally, we apply stemming with NLTK's RSLP stemmer for Portuguese to further reduce the size of the vocabulary through normalization.

In [4]:
import re
from nltk.stem import RSLPStemmer
from nltk.corpus import stopwords

stopwords_list = stopwords.words('portuguese')
stopwords_list.remove('não')

corpus = []
stemmer = RSLPStemmer()
stopwords_set = set(stopwords_list)  # build the set once for fast membership tests
for text in train['tokens']:
    # replace every character that is not a basic or accented Latin letter with a space
    text = re.sub('[^a-zA-Z\u00C0-\u00ff]', ' ', text)
    # lowercase
    text = text.lower()
    # split into tokens, drop stop words and stem what remains
    text = ' '.join(stemmer.stem(w) for w in text.split() if w not in stopwords_set)
    corpus.append(text)

print(corpus[:5])

['fact não apen frut ignor', 'hav hum jorn investig preocup aprofund contextual histór isenç relat preocup soc urg denunci muit peç real jorn', 'tud cómic fif', 'tod permit organiz faç total absurd sent', 'não faz rir cust poder']
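To see what the character filter does in isolation, here is a minimal sketch that applies only the regular expression and the lowercasing step to one of the spans shown earlier. The helper name `clean` and the extra digits and punctuation in the sample string are illustrative assumptions; stop word removal and stemming are left out so that the snippet runs without NLTK data.

```python
import re

# Same character class as in the cell above: basic Latin letters plus the
# Latin-1 range U+00C0 to U+00FF, which covers Portuguese accented letters.
NON_ALPHA = re.compile('[^a-zA-Z\u00C0-\u00ff]')

def clean(text):
    """Hypothetical helper: blank out non-letters, lowercase, collapse spaces."""
    text = NON_ALPHA.sub(' ', text)
    return ' '.join(text.lower().split())

print(clean('É tudo cómico na FIFA, 2014!'))
```

Note that the range U+00C0 to U+00FF also contains the symbols × and ÷, so those two characters survive the filter.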


## Obtaining Features and Classes

The next step is to obtain the features we will use to train our model.

For this, we will use TF-IDF over unigrams and bigrams (`ngram_range=(1, 2)`).

TODO: explore [TfidfVectorizer params](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
    
vectorizer = TfidfVectorizer(ngram_range=(1,2))
X = vectorizer.fit_transform(corpus).toarray()

print("(Number of samples, Number of features):", X.shape)

(Number of samples, Number of features): (16743, 69814)
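As a small illustration of what `ngram_range=(1, 2)` produces, the sketch below vectorizes two of the stemmed strings printed earlier and lists the resulting vocabulary. The variable names are illustrative, and the matrix is kept sparse here purely because that is convenient for a toy example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two stemmed strings taken from the corpus preview above
toy_corpus = ['não faz rir cust poder', 'tud cómic fif']

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_toy = toy_vectorizer.fit_transform(toy_corpus)  # SciPy sparse matrix

# vocabulary_ maps each unigram or bigram to its column index
for term in sorted(toy_vectorizer.vocabulary_):
    print(term)

print(X_toy.shape)
```

A string with n tokens contributes up to n unigrams and n - 1 bigrams, which helps explain why the feature count above is so much larger than the number of samples.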


In [6]:
y = train['label']

print(y.shape)

(16743,)


## Training classifiers

We will compare the following classifiers:

- *Naive Bayes*, through scikit-learn's [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) and [ComplementNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.ComplementNB.html) classes, the two variants most commonly used for text classification.
- *Logistic Regression*, through scikit-learn's [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) class.
- *Decision Tree*, through scikit-learn's [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) class. When the tree is grown until its leaves are pure (the default), this model typically assigns a probability of 1 to one of the classes.
- *Random Forest*, through scikit-learn's [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) class.
- *Support Vector Machines (SVM)*, through scikit-learn's [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) class. The SVM model can also output probabilities, but only if you pass *probability=True* to its constructor.
- *Perceptron*, through scikit-learn's [Perceptron](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html) class. This model does not allow you to get probabilities.
- *Stochastic Gradient Descent (SGD)*, through scikit-learn's [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) class.
- *K Nearest Neighbors*, through scikit-learn's [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) class.
- *eXtreme Gradient Boosting*, through [XGBoost](https://xgboost.readthedocs.io/en/stable/).
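
The remarks about probabilities in the list above can be checked directly. The following sketch tests which estimators expose `predict_proba` and shows the 0/1 probabilities of a fully grown decision tree. The one-dimensional toy arrays are made up for illustration only.

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Perceptron has no predict_proba; SVC exposes it only with probability=True
print(hasattr(Perceptron(), 'predict_proba'))
print(hasattr(SVC(), 'predict_proba'))
print(hasattr(SVC(probability=True), 'predict_proba'))

# Made-up, perfectly separable toy data
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array(['Fact', 'Fact', 'Value', 'Value'])

# With default settings the tree grows until every leaf is pure,
# so each row of predict_proba contains a single 1.0
tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
tree_proba = tree.predict_proba(X_toy)
print(tree_proba)
```

Limiting the depth, as the baseline below does with `max_depth=5`, can leave impure leaves, in which case the probabilities are class fractions rather than 0 or 1.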

TODO:
- Tune parameters
- [explore more CVs](https://scikit-learn.org/stable/modules/classes.html?highlight=model_selection#splitter-classes)

In [7]:
# Metrics
import sklearn.metrics as metrics
import time

# Cross Validation and Hyper Tuning
from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold, GridSearchCV

# Classifiers
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import Perceptron
import xgboost as xgb



To train the classifiers, we first split the data into training and test sets: 80% of the samples go to the training set and the remaining 20% to the test set.
We pass the _stratify_ parameter so that both sets keep the same label proportions as the full dataset.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0, stratify=y)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

print("\nLabel distribution in the training set:")
print(y_train.value_counts())

print("\nLabel distribution in the test set:")
print(y_test.value_counts())

(13394, 69814) (13394,)
(3349, 69814) (3349,)

Label distribution in the training set:
Value       6481
Fact        2930
Value(-)    2320
Value(+)    1129
Policy       534
Name: label, dtype: int64

Label distribution in the test set:
Value       1621
Fact         733
Value(-)     580
Value(+)     282
Policy       133
Name: label, dtype: int64
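
The effect of stratification can be verified from the counts printed above: each label should make up nearly the same fraction of both sets. The dictionaries in this short check simply copy those counts.

```python
# Label counts copied from the output above
train_counts = {'Value': 6481, 'Fact': 2930, 'Value(-)': 2320, 'Value(+)': 1129, 'Policy': 534}
test_counts = {'Value': 1621, 'Fact': 733, 'Value(-)': 580, 'Value(+)': 282, 'Policy': 133}

n_train = sum(train_counts.values())
n_test = sum(test_counts.values())

for label in train_counts:
    train_share = train_counts[label] / n_train
    test_share = test_counts[label] / n_test
    print(f'{label:<9} train={train_share:.3f} test={test_share:.3f}')

# Largest difference in label share between the two sets
max_gap = max(abs(train_counts[l] / n_train - test_counts[l] / n_test)
              for l in train_counts)
print(max_gap)
```

The shares also show how imbalanced the data is: *Value* accounts for roughly 48% of the samples and *Policy* for about 4%.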


### Baseline Predictions

In [9]:
def predict(clf):
    # fit on the training split and predict the test split, timing both together
    start = time.time()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    stop = time.time()

    # Metrics
    print("Elapsed time: %0.2fs" % (stop - start))
    print("\nConfusion matrix:\n", metrics.confusion_matrix(y_test, y_pred))
    print("Classification report:\n", metrics.classification_report(y_test, y_pred))

    # return the fitted classifier so that callers can keep using it
    return clf

#### Naive Bayes

In [10]:
mnb = predict(MultinomialNB())

Elapsed time: 10.39s

Confusion matrix:
 [[  77    0  652    0    4]
 [   0    0  133    0    0]
 [  25    0 1590    0    6]
 [   5    0  277    0    0]
 [   7    0  555    0   18]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.68      0.11      0.18       733
      Policy       0.00      0.00      0.00       133
       Value       0.50      0.98      0.66      1621
    Value(+)       0.00      0.00      0.00       282
    Value(-)       0.64      0.03      0.06       580

    accuracy                           0.50      3349
   macro avg       0.36      0.22      0.18      3349
weighted avg       0.50      0.50      0.37      3349
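
The report above can be reproduced by hand from the confusion matrix, which helps when reading the results that follow. In scikit-learn's confusion matrix, rows are true labels and columns are predicted labels, so recall divides the diagonal by the row sums and precision divides it by the column sums. The sketch below copies the MultinomialNB matrix; treating an undefined precision as 0 mirrors what the report prints.

```python
import numpy as np

labels = ['Fact', 'Policy', 'Value', 'Value(+)', 'Value(-)']
# MultinomialNB confusion matrix copied from the output above
cm = np.array([[77, 0, 652, 0, 4],
               [0, 0, 133, 0, 0],
               [25, 0, 1590, 0, 6],
               [5, 0, 277, 0, 0],
               [7, 0, 555, 0, 18]])

true_positives = np.diag(cm)
support = cm.sum(axis=1)    # samples per true label
predicted = cm.sum(axis=0)  # samples per predicted label

recall = true_positives / support
# No sample was predicted as Policy or Value(+), so precision is undefined
# there; report it as 0.0, as the classification report does
precision = np.divide(true_positives, predicted,
                      out=np.zeros(len(labels)), where=predicted > 0)
accuracy = true_positives.sum() / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print(f'{name:<9} precision={p:.2f} recall={r:.2f}')
print(f'accuracy={accuracy:.2f}')
```

The empty *Policy* and *Value(+)* columns show that this model never predicts those two labels.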





In [11]:
cnb = predict(ComplementNB())

Elapsed time: 4.13s

Confusion matrix:
 [[ 292   26  287   55   73]
 [   3   63   54   10    3]
 [ 272   59 1003  101  186]
 [  36   14  101  110   21]
 [  64   21  196   14  285]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.44      0.40      0.42       733
      Policy       0.34      0.47      0.40       133
       Value       0.61      0.62      0.61      1621
    Value(+)       0.38      0.39      0.38       282
    Value(-)       0.50      0.49      0.50       580

    accuracy                           0.52      3349
   macro avg       0.45      0.47      0.46      3349
weighted avg       0.52      0.52      0.52      3349



#### SGD

In [12]:
sgd = predict(SGDClassifier(random_state=0, n_jobs=-1))

Elapsed time: 151.62s

Confusion matrix:
 [[ 225    1  432   25   50]
 [   1   47   77    6    2]
 [ 178   12 1279   40  112]
 [  36    4  156   72   14]
 [  43    1  314    0  222]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.47      0.31      0.37       733
      Policy       0.72      0.35      0.47       133
       Value       0.57      0.79      0.66      1621
    Value(+)       0.50      0.26      0.34       282
    Value(-)       0.56      0.38      0.45       580

    accuracy                           0.55      3349
   macro avg       0.56      0.42      0.46      3349
weighted avg       0.54      0.55      0.53      3349



#### Logistic Regression

In [13]:
lg = predict(LogisticRegression(random_state=0, n_jobs=-1))

Elapsed time: 3282.09s

Confusion matrix:
 [[ 185    0  512    7   29]
 [   0   11  121    0    1]
 [ 124    3 1436    5   53]
 [  23    1  231   25    2]
 [  33    1  416    0  130]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.51      0.25      0.34       733
      Policy       0.69      0.08      0.15       133
       Value       0.53      0.89      0.66      1621
    Value(+)       0.68      0.09      0.16       282
    Value(-)       0.60      0.22      0.33       580

    accuracy                           0.53      3349
   macro avg       0.60      0.31      0.33      3349
weighted avg       0.56      0.53      0.47      3349



#### SVC

In [15]:
svc = predict(SVC(random_state=0, max_iter=100))



Elapsed time: 927.96s

Confusion matrix:
 [[ 522    7  161   24   19]
 [  60   34   33    5    1]
 [1077   42  419   44   39]
 [ 193    6   66   12    5]
 [ 414   13  127    7   19]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.23      0.71      0.35       733
      Policy       0.33      0.26      0.29       133
       Value       0.52      0.26      0.35      1621
    Value(+)       0.13      0.04      0.06       282
    Value(-)       0.23      0.03      0.06       580

    accuracy                           0.30      3349
   macro avg       0.29      0.26      0.22      3349
weighted avg       0.37      0.30      0.27      3349



#### Decision Tree

In [16]:
dt = predict(DecisionTreeClassifier(random_state=0, max_depth=5))

Elapsed time: 116.62s

Confusion matrix:
 [[  12    1  718    0    2]
 [   0    5  128    0    0]
 [   5    1 1608    0    7]
 [   0    0  282    0    0]
 [   2    0  577    0    1]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.63      0.02      0.03       733
      Policy       0.71      0.04      0.07       133
       Value       0.49      0.99      0.65      1621
    Value(+)       0.00      0.00      0.00       282
    Value(-)       0.10      0.00      0.00       580

    accuracy                           0.49      3349
   macro avg       0.39      0.21      0.15      3349
weighted avg       0.42      0.49      0.33      3349





#### Random Forest

In [17]:
rf = predict(RandomForestClassifier(random_state=0, max_depth=5, n_jobs=2))

Elapsed time: 22.92s

Confusion matrix:
 [[   0    0  733    0    0]
 [   0    0  133    0    0]
 [   0    0 1621    0    0]
 [   0    0  282    0    0]
 [   0    0  580    0    0]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.00      0.00      0.00       733
      Policy       0.00      0.00      0.00       133
       Value       0.48      1.00      0.65      1621
    Value(+)       0.00      0.00      0.00       282
    Value(-)       0.00      0.00      0.00       580

    accuracy                           0.48      3349
   macro avg       0.10      0.20      0.13      3349
weighted avg       0.23      0.48      0.32      3349





#### K Nearest Neighbors

In [18]:
knn = predict(KNeighborsClassifier(n_neighbors=10, n_jobs=2))

Elapsed time: 51.21s

Confusion matrix:
 [[  78    0  654    0    1]
 [   1    1  131    0    0]
 [  67    0 1550    0    4]
 [  18    0  259    4    1]
 [  36    0  538    0    6]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.39      0.11      0.17       733
      Policy       1.00      0.01      0.01       133
       Value       0.49      0.96      0.65      1621
    Value(+)       1.00      0.01      0.03       282
    Value(-)       0.50      0.01      0.02       580

    accuracy                           0.49      3349
   macro avg       0.68      0.22      0.18      3349
weighted avg       0.54      0.49      0.36      3349



#### Perceptron

In [19]:
per = predict(Perceptron(random_state=0))

Elapsed time: 128.68s

Confusion matrix:
 [[ 192    2  461   28   50]
 [   2   52   72    5    2]
 [ 198   18 1241   45  119]
 [  27    4  174   70    7]
 [  47    1  330    1  201]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.41      0.26      0.32       733
      Policy       0.68      0.39      0.50       133
       Value       0.54      0.77      0.64      1621
    Value(+)       0.47      0.25      0.32       282
    Value(-)       0.53      0.35      0.42       580

    accuracy                           0.52      3349
   macro avg       0.53      0.40      0.44      3349
weighted avg       0.51      0.52      0.50      3349



#### XGBoost

In [20]:
# use a separate name so that the `xgb` module is not overwritten
xgb_clf = predict(xgb.XGBClassifier(random_state=0))



Elapsed time: 3015.26s

Confusion matrix:
 [[ 116    4  576   17   20]
 [   2   36   91    2    2]
 [  67   18 1474   17   45]
 [  14    2  228   34    4]
 [  22    2  478    3   75]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.52      0.16      0.24       733
      Policy       0.58      0.27      0.37       133
       Value       0.52      0.91      0.66      1621
    Value(+)       0.47      0.12      0.19       282
    Value(-)       0.51      0.13      0.21       580

    accuracy                           0.52      3349
   macro avg       0.52      0.32      0.33      3349
weighted avg       0.52      0.52      0.44      3349



### Parameter tuning

TODO [explore more scoring methods](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter)
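
One reason to look beyond accuracy is the label imbalance. As a sketch, the snippet below rebuilds the situation from the Random Forest baseline above, where every test sample is predicted as *Value*, and compares three of scikit-learn's metrics. The corresponding `scoring` strings for `GridSearchCV` would be `'accuracy'`, `'f1_macro'` and `'balanced_accuracy'`; which of them suits this task best is left open here.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Test-set supports copied from the classification reports above
support = {'Fact': 733, 'Policy': 133, 'Value': 1621, 'Value(+)': 282, 'Value(-)': 580}

y_true_toy = [label for label, n in support.items() for _ in range(n)]
y_pred_toy = ['Value'] * len(y_true_toy)  # majority-class prediction

acc = accuracy_score(y_true_toy, y_pred_toy)
bal_acc = balanced_accuracy_score(y_true_toy, y_pred_toy)
macro_f1 = f1_score(y_true_toy, y_pred_toy, average='macro', zero_division=0)

print(f'accuracy          = {acc:.2f}')
print(f'balanced accuracy = {bal_acc:.2f}')
print(f'macro F1          = {macro_f1:.2f}')
```

Accuracy rewards the majority-class prediction with a score close to 0.5, while the two class-averaged metrics make it clear that four of the five labels are never recovered.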

In [21]:
def grid_search(clf, parameter_grid):
    cross_validation = StratifiedKFold(n_splits=5)

    # named `search` so that it does not shadow the enclosing function
    search = GridSearchCV(clf,
                          param_grid=parameter_grid,
                          scoring='accuracy',
                          cv=cross_validation,
                          verbose=4,
                          n_jobs=2,
                          refit=True)

    start = time.time()
    search.fit(X_train, y_train)
    stop = time.time()
    print(f"Fit time: {stop - start}s")

    print("\nBest score:", search.best_score_)
    print("Best parameters:", search.best_params_)
    print("Best estimator:", search.best_estimator_)

    # refit=True means best_estimator_ is already retrained on the full training split
    best_model = search.best_estimator_
    best_model_pred = best_model.predict(X_test)

    # Metrics
    print("\nConfusion matrix:\n", metrics.confusion_matrix(y_test, best_model_pred))
    print("Classification report:\n", metrics.classification_report(y_test, best_model_pred))

    return best_model

#### SGD

In [22]:
clf = SGDClassifier(random_state=0)

# The 'log' loss gives logistic regression (it is named 'log_loss' from scikit-learn 1.1 onwards);
# 'perceptron' is the linear loss used by the perceptron algorithm
parameter_grid= {'loss': ['log', 'hinge', 'perceptron'], 'penalty': ['elasticnet', 'l1', 'l2'], 'class_weight': [None, 'balanced']}

sgd = grid_search(clf, parameter_grid)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
Fit time: 12814.862543821335s

Best score: 0.5384496628411
Best parameters: {'class_weight': 'balanced', 'loss': 'log', 'penalty': 'l2'}
Best estimator: SGDClassifier(class_weight='balanced', loss='log', random_state=0)

Confusion matrix:
 [[ 184    3  444   63   39]
 [   0   70   48   14    1]
 [ 113   51 1261  118   78]
 [  12    8  121  137    4]
 [  27    6  350   26  171]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.55      0.25      0.34       733
      Policy       0.51      0.53      0.52       133
       Value       0.57      0.78      0.66      1621
    Value(+)       0.38      0.49      0.43       282
    Value(-)       0.58      0.29      0.39       580

    accuracy                           0.54      3349
   macro avg       0.52      0.47      0.47      3349
weighted avg       0.55      0.54      0.52      3349



#### Logistic Regression

This search takes too long to run, so the cell below was not executed.

In [None]:
clf = LogisticRegression(random_state=0)

parameter_grid= {'solver': ['saga'], 'penalty': ['elasticnet', 'l1', 'l2'], 'l1_ratio': [0.5], 'class_weight': [None, 'balanced']}

lg = grid_search(clf, parameter_grid)

## Training models per annotator and using ensemble

Separating each annotator's judgment will make the task of training the model significantly easier and faster, since we are dealing with less data. At the end of the training, we will use ensemble methods, such as voting, to make a more accurate prediction, by using the predictions of each trained model to reach a consensus on a label.
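
As a rough sketch of what soft voting does, the snippet below averages the class probabilities of several models and picks the label with the highest mean. The probability values are made up for illustration, and the sketch assumes that every per-annotator model was trained on all five labels so that the probability columns line up.

```python
import numpy as np

classes = np.array(['Fact', 'Policy', 'Value', 'Value(+)', 'Value(-)'])

# Hypothetical predict_proba outputs for a single sample,
# one row per annotator model (A, B, C, D)
probas = np.array([
    [0.10, 0.05, 0.60, 0.05, 0.20],
    [0.40, 0.05, 0.35, 0.05, 0.15],
    [0.45, 0.05, 0.30, 0.05, 0.15],
    [0.15, 0.05, 0.55, 0.05, 0.20],
])

# Hard voting: each model casts one vote for its most likely label
hard_votes = classes[probas.argmax(axis=1)]

# Soft voting: average the probabilities, then take the most likely label
mean_proba = probas.mean(axis=0)
soft_label = classes[mean_proba.argmax()]

print(hard_votes)
print(mean_proba)
print(soft_label)
```

In this made-up case the hard votes are tied two against two, while the averaged probabilities favour *Value*, because soft voting takes each model's confidence into account.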

In [None]:
X_annotated = pd.concat([train['annotator'], pd.DataFrame(X)], axis=1)

print(X_annotated)

In [None]:
# `drop(columns=...)` replaces the positional axis argument, which pandas 2.0 no longer accepts
X_A = X_annotated[X_annotated['annotator'] == 'A']
X_A = X_A.drop(columns='annotator')
y_A = train.loc[train['annotator'] == 'A', 'label']

X_B = X_annotated[X_annotated['annotator'] == 'B']
X_B = X_B.drop(columns='annotator')
y_B = train.loc[train['annotator'] == 'B', 'label']

X_C = X_annotated[X_annotated['annotator'] == 'C']
X_C = X_C.drop(columns='annotator')
y_C = train.loc[train['annotator'] == 'C', 'label']

X_D = X_annotated[X_annotated['annotator'] == 'D']
X_D = X_D.drop(columns='annotator')
y_D = train.loc[train['annotator'] == 'D', 'label']

print(X_A.shape)
print(y_A.shape)

In [None]:
X_A_tr, X_A_te, y_A_tr, y_A_te = train_test_split(X_A, y_A, test_size = 0.20, random_state=0, stratify=y_A, shuffle=True)

X_B_tr, X_B_te, y_B_tr, y_B_te = train_test_split(X_B, y_B, test_size = 0.20, random_state=0, stratify=y_B, shuffle=True)

X_C_tr, X_C_te, y_C_tr, y_C_te = train_test_split(X_C, y_C, test_size = 0.20, random_state=0, stratify=y_C, shuffle=True)

X_D_tr, X_D_te, y_D_tr, y_D_te = train_test_split(X_D, y_D, test_size = 0.20, random_state=0, stratify=y_D, shuffle=True)

Now we can proceed to use voting in order to obtain a better prediction:

In [None]:
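from mlxtend.classifier import EnsembleVoteClassifier

# one classifier per annotator, each fitted on that annotator's training split
clf_A = SGDClassifier(random_state=0, n_jobs=-1, loss='log')
clf_A.fit(X_A_tr, y_A_tr)

clf_B = SGDClassifier(random_state=0, n_jobs=-1, loss='log')
clf_B.fit(X_B_tr, y_B_tr)

clf_C = SGDClassifier(random_state=0, n_jobs=-1, loss='log')
clf_C.fit(X_C_tr, y_C_tr)

clf_D = SGDClassifier(random_state=0, n_jobs=-1, loss='log')
clf_D.fit(X_D_tr, y_D_tr)

clf_list = [clf_A, clf_B, clf_C, clf_D]

# the base classifiers are already fitted, so fit() here only registers the label set
eclf = EnsembleVoteClassifier(clfs=clf_list, fit_base_estimators=False, voting='soft')
eclf.fit(X, y)

# evaluate on the held-out rows of every annotator instead of the full data,
# which would include rows the base classifiers were trained on
X_held_out = pd.concat([X_A_te, X_B_te, X_C_te, X_D_te])
y_held_out = pd.concat([y_A_te, y_B_te, y_C_te, y_D_te])
y_pred_vote = eclf.predict(X_held_out)
print("Classification report:\n", metrics.classification_report(y_held_out, y_pred_vote))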

## Removing minorities

To limit the amount of data we work with, we group the data by tokens and label and count how many annotators assigned each label to each span. For every span we then keep only the label chosen by the most annotators and drop the minority annotations.
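
The intended majority rule can be sketched in plain Python on a few made-up annotations. The rows and the helper name `majority_label` are hypothetical; ties are resolved in favour of the label seen first, which is an assumption of this sketch.

```python
from collections import Counter, defaultdict

# Hypothetical (tokens, annotator, label) rows
annotations = [
    ('É tudo cómico na FIFA', 'A', 'Value'),
    ('É tudo cómico na FIFA', 'B', 'Value'),
    ('É tudo cómico na FIFA', 'C', 'Fact'),
    ('não nos fazem rir', 'A', 'Value(-)'),
]

# Count how often each label was assigned to each span
label_counts = defaultdict(Counter)
for tokens, _annotator, label in annotations:
    label_counts[tokens][label] += 1

def majority_label(counter):
    """Return the most frequent label (first seen wins on ties)."""
    return counter.most_common(1)[0][0]

majority = {tokens: majority_label(c) for tokens, c in label_counts.items()}
print(majority)
```

Each span ends up with exactly one label, so the resulting dataset has one row per distinct span.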

In [None]:
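# count how many annotators assigned each label to each span
df_tmp = train.groupby(['tokens', 'label']).agg({
    'annotator': 'count'
}).reset_index()

# for each span, keep the label with the highest annotator count
# (ties go to the alphabetically first label)
train_no_duplicates = (df_tmp.sort_values(['tokens', 'annotator', 'label'],
                                          ascending=[True, False, True])
                             .drop_duplicates('tokens')
                             .drop(columns='annotator')
                             .reset_index(drop=True))
print(train_no_duplicates)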

corpus = []
stopwords_set = set(stopwords_list)  # build the set once for fast membership tests
for text in train_no_duplicates['tokens']:
    # replace every character that is not a basic or accented Latin letter with a space
    text = re.sub('[^a-zA-Z\u00C0-\u00ff]', ' ', text)
    # lowercase
    text = text.lower()
    # split into tokens, drop stop words and stem what remains
    text = ' '.join(stemmer.stem(w) for w in text.split() if w not in stopwords_set)
    corpus.append(text)
    
X = vectorizer.fit_transform(corpus).toarray()
y = train_no_duplicates['label']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=0, stratify=y, shuffle=True)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

print("\nLabel distribution in the training set:")
print(y_train.value_counts())

print("\nLabel distribution in the test set:")
print(y_test.value_counts())