# Political `'tweet type'` Classification

Using the Hugging Face `transformers` library for text classification.

Only a portion of the data sets provided by Piper has been classified by political typology. Although Challenge #2 is primarily focused on summarization, these groups need to be assigned before further analysis.

Here, we fine-tune a pre-trained language model for the classification task, then predict tweet type using the fine-tuned model.

For future work, a more robust classification should be used.



# Load Data

First, confirm that a GPU and a high-RAM runtime are available.

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('To enable a high-RAM runtime, select the Runtime > "Change runtime type"')
  print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
  print('re-execute this cell.')
else:
  print('You are using a high-RAM runtime!')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd

# offline_tweets_df = pd.read_pickle('/content/drive/MyDrive/Colab Notebooks/Not-So-Twitterpated/cleaned_offline_tweets_df.pickle')
offline_tweets_df = pd.read_pickle('/content/drive/MyDrive/Piper Gradient/Not-So-Twitterpated/cleaned_offline_tweets_df_large.pickle')

display(offline_tweets_df[['id','created_at','user_id','text3','tweet category','is_retweet','is_quote_status','user_descr']])

In [None]:
offline_tweets_df.columns

In [None]:
!pip install transformers[sentencepiece]

In [None]:
import transformers

We may need to clean or remove retweet tags, most likely links, and possibly hashtags. Pre-processing for [`cardiffnlp/twitter-roberta-base`](https://huggingface.co/cardiffnlp/twitter-roberta-base) requires replacing user mentions and links with placeholder tokens (`@user` and `http`).

In [None]:
# Functions to identify retweets, mentions, hashtags, and links
import re  # needed here: the later substitution cell is the first to import it

def find_retweeted(tweet):
  '''Extract the Twitter handles of retweeted users (handles preceded by "RT ")'''
  return re.findall(r'(?<=RT\s)(@[A-Za-z]+[A-Za-z0-9_-]+)', tweet)

def find_mentioned(tweet):
  '''Extract the Twitter handles of users mentioned in the tweet (excluding retweeted handles)'''
  return re.findall(r'(?<!RT\s)(@[A-Za-z]+[A-Za-z0-9_-]+)', tweet)

def find_hashtags(tweet):
  '''Extract hashtags'''
  return re.findall(r'(#[A-Za-z]+[A-Za-z0-9_-]+)', tweet)

def find_links(tweet):
  '''Extract URL links (http/https links and bit.ly short links)'''
  http_pattern = r'https?://[A-Za-z0-9./]+'
  bitly_pattern = r'bit\.ly/\S+'  # escape the dot so it matches a literal "."
  pattern = r'|'.join((http_pattern, bitly_pattern))
  return re.findall(pattern, tweet)
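As a quick illustration of what these finder patterns pick up, the sketch below applies the same regular expressions to a made-up tweet. The sample text, the handles, and the compiled-pattern names are hypothetical and exist only for this example; the `bit.ly` dot is escaped here so that it matches a literal period.

```python
import re

# Same patterns as the finder functions, compiled once for reuse.
HANDLE = r'(@[A-Za-z]+[A-Za-z0-9_-]+)'
RETWEETED_RE = re.compile(r'(?<=RT\s)' + HANDLE)   # handle directly after "RT "
MENTIONED_RE = re.compile(r'(?<!RT\s)' + HANDLE)   # handle NOT directly after "RT "
HASHTAG_RE = re.compile(r'(#[A-Za-z]+[A-Za-z0-9_-]+)')
LINK_RE = re.compile(r'https?://[A-Za-z0-9./]+|bit\.ly/\S+')

# Hypothetical tweet text, for illustration only.
sample_tweet = "RT @some_user: Vote today! cc @another_user #Election2020 https://example.com/abc"

retweeted = RETWEETED_RE.findall(sample_tweet)
mentioned = MENTIONED_RE.findall(sample_tweet)
hashtags = HASHTAG_RE.findall(sample_tweet)
links = LINK_RE.findall(sample_tweet)

print(retweeted, mentioned, hashtags, links)
```

The lookbehind assertions are what separate a retweeted handle from an ordinary mention: both use the same handle pattern and differ only in whether `RT ` immediately precedes the `@`.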


# Substitutions
Substitute `http` for links and `@user` for mentions in the tweet text

(Required pre-processing for [`cardiffnlp/twitter-roberta-base`](https://huggingface.co/cardiffnlp/twitter-roberta-base))

In [None]:
import re
pat1_user = r'@[A-Za-z0-9_:]+'
pat2_http = r'https?://[A-Za-z0-9./]+'

def preprocess(text, pat1repl='@user', pat2repl='http'):
    subbed = re.sub(pat1_user, pat1repl, text)
    subbed = re.sub(pat2_http, pat2repl, subbed)
    return subbed

offline_tweets_df['text4'] = offline_tweets_df['text3'].map(lambda x: preprocess(x))
offline_tweets_df['user_descr'] = offline_tweets_df['user_descr'].map(lambda x: preprocess(x))
display(offline_tweets_df[['text3','text4']])
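To make the effect of the substitution concrete, here is a minimal standalone sketch that applies the same two patterns to made-up strings. The sample texts are hypothetical. The second call shows a limitation that follows from the URL character class: characters such as `-` are not part of the pattern, so the tail of such a link is left behind.

```python
import re

PAT_USER = r'@[A-Za-z0-9_:]+'          # mentions, including a trailing ":" as in "RT @name:"
PAT_HTTP = r'https?://[A-Za-z0-9./]+'  # links made of letters, digits, ".", and "/"

def preprocess(text, user_repl='@user', http_repl='http'):
    """Replace mentions with '@user' and links with 'http'."""
    subbed = re.sub(PAT_USER, user_repl, text)
    return re.sub(PAT_HTTP, http_repl, subbed)

# Hypothetical inputs, for illustration only.
clean = preprocess("RT @some_user: Read this https://example.com/a1b2 via @another_user")
residue = preprocess("see https://t.co/Ab-Cd")

print(clean)    # RT @user Read this http via @user
print(residue)  # see http-Cd
```

If such leftovers matter for the classifier, widening the URL character class would be one option; that change is not part of the original notebook.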

# Separate classified and unclassified records

Combine user description and tweet text

In [None]:
offline_tweets_df['ud_plus_text'] = offline_tweets_df['user_descr'].str.cat(offline_tweets_df['text4'], sep=' ')
print(offline_tweets_df['ud_plus_text'][0])
offline_tweets_df[['user_descr','text4','ud_plus_text']]

In [None]:
# Rows with a 'tweet category' are labelled (training data); rows where it is NaN still need a prediction.
# To exclude retweets from training, additionally filter on offline_tweets_df['is_retweet']==0.
labelled = offline_tweets_df['tweet category'].notna()
# .copy() avoids pandas' SettingWithCopyWarning when columns are added to these subsets later
df_train = offline_tweets_df[labelled].copy()
df_pred = offline_tweets_df[~labelled].copy()
print('tr',df_train.shape, '  unclassified', df_pred.shape)
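The split between labelled and unlabelled rows hinges on missing `tweet category` values being NaN. NaN is the only floating-point value that is not equal to itself, so comparing a column with itself (or calling `notna()`) selects exactly the labelled rows. A plain-Python sketch with made-up category values:

```python
import math

# Hypothetical 'tweet category' values; NaN marks an unlabelled tweet.
categories = [-3.0, float('nan'), 1.0, float('nan'), 2.5]

# x == x is False only for NaN, so this keeps the labelled values.
labelled = [c for c in categories if c == c]
# x != x is True only for NaN, so this counts the unlabelled values.
unlabelled_count = sum(1 for c in categories if c != c)

# math.isnan expresses the same test more explicitly.
assert unlabelled_count == sum(math.isnan(c) for c in categories)

print(labelled, unlabelled_count)
```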

# Split Train and validation sets
Begin by splitting data for training and validation:

In [None]:
from sklearn.model_selection import train_test_split
lab_swap = {-3.0:0, -2.0:1, -1.0:2, 0.0:3, 1.0:4, 2.0:5, 2.5:6, 3.0:7}

df_train['labels'] = [lab_swap[x] for x in df_train['tweet category']]
df_train.columns
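The `lab_swap` dictionary exists because a sequence-classification head expects class ids `0 .. num_labels-1`, while the typology categories run from -3.0 to 3.0 and include 2.5. The sketch below shows the forward mapping and its inverse (the notebook later builds the same inverse as `categ`); the sample category list is made up.

```python
lab_swap = {-3.0: 0, -2.0: 1, -1.0: 2, 0.0: 3, 1.0: 4, 2.0: 5, 2.5: 6, 3.0: 7}

# Inverse mapping: class id -> original category value.
categ = {class_id: category for category, class_id in lab_swap.items()}

sample_categories = [-3.0, 0.0, 2.5, 3.0]                  # hypothetical labels
class_ids = [lab_swap[c] for c in sample_categories]       # what the model trains on
restored = [categ[i] for i in class_ids]                   # back to typology values

print(class_ids, restored)
```

Because the mapping is one-to-one, predictions can be converted back to typology values without loss.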

Define "predicting_text" from the features in the dataset

In [None]:
df_train['predicting_text']=df_train['text4']
#df_train['predicting_text']=df_train['ud_plus_text']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_train[['predicting_text','tweet category']], df_train['labels'], test_size=1/3, stratify=df_train['labels'])
print(y_train)
X_train.head()
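Passing `stratify=` asks for a split in which each class keeps roughly the same share in the training and validation parts. The following is a plain-Python illustration of that idea only, not scikit-learn's implementation; the function name `stratified_split`, the seed, and the toy label list are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=1/3, seed=0):
    """Return (train_idx, test_idx), sampling test_size of every class separately."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)          # fixed seed so the split is repeatable
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

toy_labels = [0] * 6 + [3] * 9 + [7] * 3     # made-up class ids
train_idx, test_idx = stratified_split(toy_labels)
test_counts = Counter(toy_labels[i] for i in test_idx)

print(len(train_idx), len(test_idx), dict(test_counts))
```

Each class contributes one third of its rows to the validation part, which matters here because some typology categories may be much rarer than others.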

Recombine X and y into a single DataFrame for each of the training and validation sets

In [None]:
df_tr = X_train.join(y_train)
df_val = X_test.join(y_test)
df_tr

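For efficient batching (less padding), sort the training DataFrame by the number of words in each tweet. Note that this ordering only reduces padding when batches are drawn in order; a `DataLoader` created with `shuffle=True` discards it.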

In [None]:
df_tr['num_words'] = df_tr['predicting_text'].str.split().str.len()

df_tr.sort_values(['num_words'],axis=0, inplace=True)
print(df_tr.predicting_text[df_tr.index[0]])
df_tr
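The reason sorting by length can reduce padding is that each batch is padded to its own longest sequence. The sketch below counts the pad positions for a made-up list of lengths, first in arbitrary order and then sorted. The helper `padding_tokens` and the numbers are hypothetical, and word counts stand in for token counts.

```python
def padding_tokens(lengths, batch_size):
    """Count pad positions when every batch is padded to its longest member."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch) - sum(batch)
    return total

word_counts = [3, 12, 4, 11, 5, 10, 2, 13]   # made-up tweet lengths

unsorted_padding = padding_tokens(word_counts, batch_size=2)
sorted_padding = padding_tokens(sorted(word_counts), batch_size=2)

print(unsorted_padding, sorted_padding)   # 32 4
```

The saving only materializes if batches are taken in sorted order, so it depends on how the data loader samples the rows.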


Examine a specified tweet:

In [None]:
i=df_tr.index[0]   # or i=3552 or other specified value
print(df_tr['predicting_text'][i])
print(df_tr['num_words'][i], 'words')
print('category:', df_tr['tweet category'][i])

# Convert to Datasets and tokenize

In [None]:
!pip install datasets

Use pyarrow to convert to Dataset structure

In [None]:
import pyarrow as pa
from datasets import Dataset

ds_tr = Dataset(pa.Table.from_pandas(df_tr))
ds_val = Dataset(pa.Table.from_pandas(df_val))

ds_tr = ds_tr.remove_columns(['num_words'])

ds_tr

Combine the training and validation Datasets into a DatasetDict

In [None]:
from datasets import DatasetDict
dsd = DatasetDict({'train':ds_tr, 'val':ds_val}).remove_columns(['tweet category','__index_level_0__'])
dsd

Tokenize the datasets 

In [None]:
from transformers import DataCollatorWithPadding
from transformers import AutoTokenizer

# Pre-trained checkpoint used for both the tokenizer and the model
checkpoint = "cardiffnlp/twitter-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["predicting_text"], truncation=True)

In [None]:
tok_ds = dsd.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

print(tok_ds["train"][0])
print([k for k in tok_ds['train'][0]])
{k: len(v) for k, v in tok_ds.items()}

Remove unnecessary columns

In [None]:
tok_ds = tok_ds.remove_columns(['predicting_text'])
print(tok_ds)
tok_ds.set_format("torch")

# Full fine-tuning
Fine-tune over several training epochs, as per [Fine-tuning a model with the Trainer API](https://huggingface.co/course/chapter3/3?fw=pt)

Create the DataLoaders

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tok_ds["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tok_ds["val"], batch_size=8, collate_fn=data_collator
)

for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

In [None]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=8)

outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

import torch 
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

In [None]:
# transformers.AdamW is deprecated in favor of the PyTorch implementation
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
print(num_training_steps)
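With `"linear"` and zero warmup steps, the learning rate starts at the optimizer's value and decays in a straight line to zero at the last training step. The function below is a plain-Python sketch of that schedule for intuition, not the library code; `linear_lr` is a hypothetical name, and the figure of 100 batches per epoch is made up.

```python
import math

def linear_lr(step, num_training_steps, base_lr=5e-5, num_warmup_steps=0):
    """Learning rate at a given step under linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)

num_epochs = 3
batches_per_epoch = 100                       # hypothetical len(train_dataloader)
total_steps = num_epochs * batches_per_epoch  # 300

lr_start = linear_lr(0, total_steps)
lr_middle = linear_lr(total_steps // 2, total_steps)
lr_end = linear_lr(total_steps, total_steps)

print(lr_start, lr_middle, lr_end)
```

Since the number of training steps is the number of epochs times the number of batches, changing the batch size or the epoch count also changes how quickly the rate decays.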


In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

In [None]:
# `datasets.load_metric` is deprecated in favor of the `evaluate` package
# (install it first if needed: !pip install evaluate)
import evaluate

# The GLUE 'mnli' configuration reports plain accuracy
metric = evaluate.load("glue", "mnli")
model.eval()
pred=[]
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    pred.append([predictions, batch["labels"]])
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()
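The metric computed in this evaluation loop is plain accuracy: the predicted class is the index of the largest logit, and accuracy is the fraction of predictions equal to the true label. A dependency-free sketch with made-up logits (three classes instead of eight, for brevity; the helper names are hypothetical):

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(logits_batch, labels):
    """Fraction of rows whose arg-max class equals the true label."""
    predictions = [argmax(row) for row in logits_batch]
    correct = sum(int(p == t) for p, t in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical logits for four tweets and their true class ids.
toy_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 3.2], [0.9, 0.8, 0.7]]
toy_labels = [1, 0, 2, 1]

toy_accuracy = accuracy(toy_logits, toy_labels)
print(toy_accuracy)   # 0.75
```

Accuracy treats every miss equally, which is why the notebook also looks at the size of the error in typology units.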

In [None]:
# p=prediction, t=truth
p2=[];t2=[]
for p,t in pred:
  p2.extend(p.cpu().numpy())
  t2.extend(t.cpu().numpy())
assert len(p2) == len(t2)  # one prediction per true label
categ = {lab_swap[k]:k for k in lab_swap}
p3 = [categ[p] for p in p2]
t3 = [categ[t] for t in t2]

Next, examine the validation error (predicted minus actual category) before predicting the `tweet type` political classification of the unlabelled data.

In [None]:
import matplotlib
import matplotlib.pyplot as plt

err = [(p-t) for p,t in zip(p3,t3)]
#tuple(zip(p2,t2,abserr))

num_bins = 12
n, bins, patches = plt.hist(err, num_bins,
                            density = 0, 
                            alpha = 0.7)
MAE = sum([abs(e) for e in err])/len(err)
MSE = sum([e**2 for e in err])/len(err)
print('Piper Typology Prediction Error')
print('MAE:', MAE)
print('MSE:', MSE)
print('RMSE:', MSE**0.5)
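Because the typology categories are numeric, the error can be measured in category units rather than as simply right or wrong. The worked example below computes the same three summaries (MAE, MSE, RMSE) on a handful of made-up prediction/truth pairs.

```python
import math

# Hypothetical predicted and true typology values.
predicted = [-3.0, 0.0, 2.5, 1.0]
actual = [-2.0, 0.0, 3.0, -1.0]

errors = [p - t for p, t in zip(predicted, actual)]   # [-1.0, 0.0, -0.5, 2.0]

mae = sum(abs(e) for e in errors) / len(errors)       # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)       # mean squared error
rmse = math.sqrt(mse)                                 # root mean squared error

print(mae, mse, rmse)
```

MAE weights all misses linearly, while MSE and RMSE penalize the single two-unit miss more heavily than the smaller ones.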

In [None]:
pd.crosstab(pd.Series(t3), pd.Series(p3), colnames=['predicted'], rownames=['actual'])
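The cross-tabulation is a confusion matrix: each cell counts how often a given actual category received a given predicted category. A small standard-library sketch of the same counting, using made-up values:

```python
from collections import Counter

# Hypothetical actual and predicted typology values.
actual = [-3.0, -3.0, 0.0, 0.0, 2.5, 3.0]
predicted = [-3.0, -2.0, 0.0, 0.0, 3.0, 3.0]

# Count every (actual, predicted) pair.
confusion = Counter(zip(actual, predicted))
all_labels = sorted(set(actual) | set(predicted))

# One row per actual category, one column per predicted category.
for a in all_labels:
    print(a, [confusion[(a, p)] for p in all_labels])
```

Counts on the diagonal are correct predictions; counts just off the diagonal are near misses between neighbouring categories.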

In [None]:
model.save_pretrained('Piper-typology')

# Prediction

Tokenize data for prediction

In [None]:
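# Also prepare the dataset that still requires classification/prediction.
# NOTE: the model above was fine-tuned on 'text4' (see 'predicting_text'); use the same column
# here (and in tokenize_function below) if the prediction inputs should match the training text.
ds_pr = Dataset(pa.Table.from_pandas(df_pred[['ud_plus_text']]))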

ds_pr = ds_pr.remove_columns('__index_level_0__')

def tokenize_function(example):
    return tokenizer(example["ud_plus_text"], truncation=True)

tok_pr = ds_pr.map(tokenize_function, batched=True)
tok_pr = tok_pr.remove_columns(['ud_plus_text'])
tok_pr.set_format("torch")

Run predictions by batch

In [None]:
pred_dataloader = DataLoader(
    tok_pr, shuffle=False, batch_size=8, collate_fn=data_collator
)

progress_bar = tqdm(range(len(pred_dataloader)))

model.eval()
prednew=[]
for batch in pred_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    prednew.append([predictions, batch])
    progress_bar.update(1)
#    metric.add_batch(predictions=predictions, references=batch["labels"])


Collect results into list of predictions (p4) with batch input_ids (b4)

In [None]:
p4=[]
b4=[]
for p, b in prednew:
  p4.extend(p.cpu().numpy())
  b4.extend(b['input_ids'])
len(p4)
categ = {lab_swap[k]:k for k in lab_swap}
p4 = [categ[p] for p in p4]
print(p4)
len(p4)

Check first ten 

In [None]:
# Decode the token ids back to text, dropping special tokens such as <s>, </s>, and <pad>
for t in b4[0:10]:
  print(tokenizer.decode(t, skip_special_tokens=True))

Verify tokenized batches align with df_pred dataframe (print any exceptions)

In [None]:
from bs4 import BeautifulSoup

for i in range(len(df_pred)):   # check every row rather than a hard-coded count
  a1 = df_pred['ud_plus_text'].iloc[i]
  a2 = BeautifulSoup(tokenizer.decode(b4[i]),'lxml').get_text()
  if a1.replace(' ','')!=a2.replace(' ',''):
    print(i)
    print(a1)
    print(a2)
#df_pred['ud_plus_text'][0:40]

If aligned (yes!) then add predictions to df_pred

In [None]:
df_pred['Piper_typ']=p4

Summarize

In [None]:
df_pred['Piper_typ'].value_counts().sort_index()

In [None]:
df = df_pred[['ud_plus_text']]
df.reset_index(inplace=True)
filt = df['ud_plus_text'].str.startswith('Conservative PAC based in Kitsap County')
df[filt].index

In [None]:
x = 7956
print(tokenizer.decode(b4[x]))
print(p4[x])
print(df_pred['ud_plus_text'].iloc[x,])
print(df_pred['Piper_typ'].iloc[x,])

Review sample of predictions/classifications:

In [None]:
df = df_pred.sample(15, random_state=3)
for i in range(15):
    print('UD:', df[['user_description']].iloc[i,0])
    print('Tweet:', df[['text3']].iloc[i,0])
    print(df[['Piper_typ']].iloc[i,0])

Plot distribution

In [None]:
num_bins = 12
n, bins, patches = plt.hist(p4, num_bins, density = 0, alpha = 0.7)
print('Piper Typology Prediction Distribution')


In [None]:

n, bins, patches = plt.hist(t3, num_bins, density = 0, alpha = 0.7)
print('Piper Typology Labelled Distribution')

For training data df_train, set 'Piper_typ' = 'tweet category'

In [None]:
df_train['Piper_typ']=df_train['tweet category']

Align df_train and df_pred column structure, and verify

In [None]:
df_train.drop('labels', axis='columns', inplace=True)
df_train.columns
#df_pred.columns == df_train.columns

#### Combine and Save
Combine the training and prediction sets back into the complete "large" dataset, now with the new 'Piper_typ' column, and save it to a pickle file

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
new_df = pd.concat([df_pred, df_train], sort=False).sort_index()


In [None]:
new_df.to_pickle('/content/drive/MyDrive/Piper Gradient/Not-So-Twitterpated/cleaned_tweets_large_Piper_typology.pickle')