# NLP Project - Political line classification

This project classifies the political line of a short text, typically a tweet. The database used is the one used for [INSERT LINK]. It consists exclusively of tweets posted during the 2017 French presidential election. 

The objective of this project is to classify the tweets into one of the five following classes: far-right, right, center, left, far-left. 

We use several types of models, ranging from the simplest (such as logistic regression) to more sophisticated ones (fine-tuning CamemBERT).

# Set up on Google Colab

In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [0]:
import os 
os.chdir('drive/My Drive/ENSAE/NLP')

In [0]:
%load_ext autoreload
%autoreload 2

# 1. Data importation and description

In [0]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import random 

np.random.seed(0)  # NumPy's seeding function lives in np.random
random.seed(0)

In [0]:
df = pd.read_csv('data/base_tweets_propre_sans_emoji.csv')

In [0]:
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(14,5))
ax1.hist(df['couleur_politique'])
ax1.set_title('Distribution of political line across the dataset')
ax2.hist(df['mois'])
ax2.set_title('Time distribution')
plt.show()

Our dataset is approximately balanced, even if we have a slightly dominant class (Right). 

In [0]:
pd.crosstab(df.mois, df.couleur_politique)

# 2. Classification with random splitting

In this first approach, we split the data without any time consideration (i.e. evenly across March, April and May).

In [0]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from utils.evaluation import evaluate,plot_confusion_matrix
from utils.utils import random_split_dataset
import time

In [0]:
Xtrain,ytrain,Xval,yval,Xtest,ytest,label_map = random_split_dataset(df,validation=True)

## 2.1 Classification without deep learning

We will evaluate a selection of models in order to have a benchmark for the deep learning methods we will use afterwards.

In [0]:
classifiers = {'Logit':LogisticRegression(max_iter = 150, C = 1),
               'SVM': SVC(C=10),
               'RF': RandomForestClassifier(n_estimators=300)}

We use the following code to evaluate the models on the validation set. 

In [0]:
for name,clf in classifiers.items():
    clf.fit(Xtrain,ytrain)
    ypred = clf.predict(Xval)
    print('------'+name+'------')
    print(evaluate(yval,ypred))

For clarity's sake, we only show here the results obtained with the best hyperparameters.

In [0]:
for name,clf in classifiers.items():
    print('------'+name+'------')
    start = time.time()
    clf.fit(Xtrain,ytrain)
    end = time.time()
    print(f'Training time: {end - start:.1f}s')
    ypred = clf.predict(Xtest)
    print(classification_report(ytest,ypred,target_names = sorted(label_map)))
    plot_confusion_matrix(ytest,ypred,label_map=label_map)

## 2.2 Classification using custom neural networks 

In [0]:
!pip install transformers

In [0]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset
from transformers import BertTokenizer, AutoModel, AutoTokenizer, CamembertModel
import pandas as pd
from torch.utils.data import DataLoader
import pdb
from sklearn.metrics import accuracy_score
from utils.tweet_dataset import TweetDatasetBERT,TweetDataset,create_word_ix
from utils.evaluation import get_predictions
from classifiers import CamembertClassifier, RNN, train

In [0]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

In [0]:
dftrain,ytrain,dftest,ytest,label_map = random_split_dataset(df,validation=False,deep=True)

### 2.2.1 Custom LSTM

We implement a relatively shallow LSTM (2 layers), which uses no pretrained embedding. 

In [0]:
word_ix = create_word_ix(dftrain)
train_set = TweetDataset(dftrain,word_ix)
test_set = TweetDataset(dftest,word_ix)

train_loader = DataLoader(train_set,batch_size = 32,num_workers = 5)
test_loader = DataLoader(test_set,batch_size = 32,num_workers = 5)

In [0]:
vocab_size = len(word_ix) + 1
output_size = 5
embedding_dim = 128
hidden_dim = 100
n_layers = 2
model = RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)
model.to(device)

In [0]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

In [0]:
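scheduler = None  # no learning-rate scheduler; val_loader does not exist yet, so test_loader is used for monitoring
train(model, 'rnn', criterion, optimizer, scheduler, train_loader, test_loader, n_epochs=10, print_every=100, gpu=True)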

### 2.2.2 Custom Convolutional Network

Convolutional networks are commonly used in the NLP literature, so we implement a custom convolutional network. 

## 2.3 Classification using CamemBERT

In [0]:
dftrain,ytrain,dftest,ytest,label_map = random_split_dataset(df,validation=False,deep=True)

### 2.3.1 Training

In [0]:
#Creating instances of training and validation set
train_set = TweetDatasetBERT(df = dftrain, maxlen = 50, model_name='camembert-base')
val_set = TweetDatasetBERT(df = dftest, maxlen = 50, model_name='camembert-base')

#Creating instances of training and validation dataloaders
train_loader = DataLoader(train_set, batch_size = 32, num_workers = 5)
val_loader = DataLoader(val_set, batch_size = 32, num_workers = 5)

In [0]:
model = CamembertClassifier('camembert-base')
model = model.to(device)  # reuse the device selected above instead of hard-coding "cuda"

In [0]:
criterion = nn.CrossEntropyLoss(reduction='mean')
optimizer = optim.Adam(model.parameters(), lr = 1e-4)
scheduler = None

In [0]:
train(model, 'bert',criterion, optimizer,scheduler, train_loader, val_loader, n_epochs=10, print_every=100, gpu=True)

### 2.3.2 Evaluation

For the evaluation part, since we are interested in recall, precision and f1-score, we cannot simply compute an average over batches. 
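The toy example below illustrates why averaging per-batch scores is misleading. It is a minimal sketch in plain Python: the labels are made up, and `f1_for_class` is a hypothetical helper, not the project's `evaluate` or `get_predictions`. The F1-score of a class depends on counts accumulated over the whole set, so the mean of per-batch F1-scores generally differs from the F1-score computed once on all predictions.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1-score of one class, computed from raw counts (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels split into two batches of (y_true, y_pred).
batches = [
    ([0, 0, 0, 1], [0, 0, 0, 0]),
    ([1, 1, 1, 0], [1, 1, 0, 1]),
]

# Misleading: average the per-batch F1 of class 1.
per_batch = [f1_for_class(t, p, cls=1) for t, p in batches]
batch_average = sum(per_batch) / len(per_batch)

# Correct: gather every prediction first, then compute F1 once.
all_true = [y for t, _ in batches for y in t]
all_pred = [y for _, p in batches for y in p]
global_f1 = f1_for_class(all_true, all_pred, cls=1)

print(f"batch average: {batch_average:.3f}, global: {global_f1:.3f}")
```

Here the two numbers disagree (about 0.333 against 0.571), which is why the predictions are first collected over the whole loader and the report is computed afterwards.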

In [0]:
model.eval()
yval,ypred = get_predictions(model,val_loader)
print(classification_report(yval,ypred,target_names = sorted(label_map)))
plot_confusion_matrix(yval,ypred,label_map=label_map)  # use the labels returned alongside ypred

### 2.3.3 Interpretation

In this part, we visualize which parts of the text caught the attention of the classifier. This gives us some insight into how the model works.

In [0]:
from utils.interpretation import get_last_layer_attention,display_attention,visualize_layer_attention

In [0]:
sample = train_set[1]
mean_attn,tokens,attn_data = get_last_layer_attention(model,train_set.tokenizer,sample)

We visualize the attention of the 12 attention heads of the encoder's last layer (which is the one fed as input to the classifier part of our model).

In [0]:
visualize_layer_attention(attn_data['layer11'].cpu().detach().numpy())

Below is a more compact and more explicit visualization of the same attention weights. It is obtained by averaging over attention heads and rows, discarding uninformative tokens (start, end and padding tokens) and summing the attention over tokens that are parts of the same word. Note, however, that this procedure does not show the ties between the words of the sentence, only which words are important and have strong ties with a number of other words.
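The aggregation described above can be sketched as follows. This is an illustrative reimplementation in plain Python, not the code of `get_last_layer_attention` or `display_attention`. It assumes the attention is given as a nested list indexed `[head][row][column]`, that special tokens are `<s>`, `</s>` and `<pad>`, and that a subword starting with `▁` opens a new word (the SentencePiece convention). The function name `word_attention` and the toy values are hypothetical.

```python
SPECIAL_TOKENS = {"<s>", "</s>", "<pad>"}  # assumed special tokens

def word_attention(attn, tokens):
    """Aggregate attn[head][row][column] into one score per word."""
    n_heads, n_rows = len(attn), len(attn[0])
    # Average over heads and rows: one score per attended-to token (column).
    scores = [
        sum(attn[h][r][c] for h in range(n_heads) for r in range(n_rows))
        / (n_heads * n_rows)
        for c in range(len(tokens))
    ]
    words, word_scores = [], []
    for token, score in zip(tokens, scores):
        if token in SPECIAL_TOKENS:
            continue  # drop start, end and padding tokens
        if token.startswith("▁") or not words:
            # A new word starts here.
            words.append(token.lstrip("▁"))
            word_scores.append(score)
        else:
            # Continuation of the previous word: sum the attention.
            words[-1] += token
            word_scores[-1] += score
    return words, word_scores

# Toy input: 2 heads, 5 tokens, every row of a head identical.
tokens = ["<s>", "▁solid", "arité", "▁nationale", "</s>"]
head0 = [[0.1, 0.3, 0.2, 0.3, 0.1] for _ in range(5)]
head1 = [[0.1, 0.1, 0.4, 0.3, 0.1] for _ in range(5)]
words, word_scores = word_attention([head0, head1], tokens)
print(list(zip(words, word_scores)))
```

In this toy case the two subwords of "solidarité" are merged and their scores added, so the word ends up with more weight than "nationale".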

In [0]:
display_attention(tokens,mean_attn,train_set.tokenizer)

In most cases this points to significant words in the sentences, such as "progress", "solidarity", "deregulation", which may mean that some political parties use them more than others (which is intuitive). Also, when the name of the candidate is present in the sentence, it attracts the attention of the model.

# 3. Classification with time-wise splitting

In this second approach, we split the data time-wise: we train on March and test on April and May. We expect the performance of our models to drop, especially for those based on the count vectorizer, which will face many out-of-vocabulary words in the test set. The reason is that our tweets are highly contextual: they respond to events that happened shortly before the tweet was posted. 
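The out-of-vocabulary effect can be measured with a short sketch like the one below. It is only an illustration under stated assumptions: the tweets are invented, months are encoded as the integers 3, 4 and 5 (the actual encoding of the `mois` column may differ), and whitespace splitting stands in for the tokenization done by `CountVectorizer`. The helper `oov_rate` is hypothetical.

```python
def oov_rate(train_texts, test_texts):
    """Share of test tokens that never appear in the training texts."""
    vocabulary = {tok for text in train_texts for tok in text.lower().split()}
    test_tokens = [tok for text in test_texts for tok in text.lower().split()]
    unseen = sum(1 for tok in test_tokens if tok not in vocabulary)
    return unseen / len(test_tokens)

# Invented rows: (month, text). Month encoding is an assumption.
tweets = [
    (3, "meeting ce soir à lyon"),
    (3, "notre programme pour l'emploi"),
    (4, "débat ce soir sur l'emploi"),
    (5, "merci pour ce second tour"),
]

# Time-wise split: train on March, test on April and May.
train_texts = [text for month, text in tweets if month == 3]
test_texts = [text for month, text in tweets if month in (4, 5)]

rate = oov_rate(train_texts, test_texts)
print(f"Out-of-vocabulary rate on the test set: {rate:.0%}")
```

Comparing this rate between the random split and the time-wise split would give a rough idea of how much vocabulary drift the bag-of-words models have to cope with.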