# Dive into Abusive Language with Snorkel

Author: BingYune Chen 
<br>
Updated: 2021-08-02

----------

### Time to Predict New Labels

We have completed the following steps with our BERT model:

1. Fine-tuned the BERT model on Sentiment140 so that it generalizes to Twitter data
2. Trained the BERT model on **Snorkel labels** for X_train to predict abusive language 

**We will now apply the fine-tuned and trained BERT model to predict labels for our unlabeled data.**

In [None]:
# Imports and setup for Google Colab

# Mount Google Drive
from google.colab import drive ## module to use Google Drive with Python
drive.mount('/content/drive') ## mount to access contents

# Install python libraries
! pip install --upgrade tensorflow --quiet
! pip install snorkel --quiet
! pip install tensorboard==1.15.0 --quiet
! pip install transformers --quiet

In [None]:
# Imports for data and plotting
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline 
import seaborn as sns

import pickle
import os
import re
import csv
from tqdm import tqdm

# Imports for snorkel analysis and multi-task learning
from snorkel.labeling.model import LabelModel
from snorkel.labeling import filter_unlabeled_dataframe

# Imports for bert language model
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn import metrics

import transformers

import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data import RandomSampler, SequentialSampler

import time
import datetime
import random

In [None]:
# Access notebook directory

# Define paths
LOAD_MODEL = '../models/'
LOAD_DATA = '../data/processed/'

SAVE_MODEL = '../models/'
SAVE_DATA = '../data/published/'

# Define files for training
INPUT_FILE = 'clean_20201103.txt' ## update
COUNT_FILE = 'abusivelanguage2020_vf_counts.csv'

# Define current version of BERT model to load
BERT_PRE = 'model_bert_df_train_dict_vf.pt' ## update

# Save final labels
FINAL_FILE = 'abusivelanguage2020_vf.txt'

In [None]:
# Create BERT tokenizer (original BERT with 110M parameters)
# BERT tokenizer can handle punctuation, smileys, etc.
# Mentions and URLs were previously replaced with special tokens (#has_url, #has_mention)

bert_token = transformers.BertTokenizerFast.from_pretrained(
    'bert-base-uncased', 
    do_lower_case=True) 

# Create helper function for text parsing
def bert_encode(tweet_df, tokenizer):
    ## add '[CLS]' token as prefix to flag start of text
    ## append '[SEP]' token to flag end of text
    ## append '[PAD]' token to fill uneven text
    bert_tokens = tokenizer.batch_encode_plus(
        tweet_df['tweet'].to_list(),
        padding='max_length', 
        truncation=True,
        max_length=30
        )
    
    ## convert list to tensors
    input_word_ids = torch.tensor(bert_tokens['input_ids'])
    input_masks = torch.tensor(bert_tokens['attention_mask'])
    input_type_ids = torch.tensor(bert_tokens['token_type_ids'])

    inputs = {
        'input_word_ids': input_word_ids,
        'input_masks': input_masks,
        'input_type_ids': input_type_ids
        }

    return inputs
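
To make the fixed-length layout concrete, here is a minimal pure-Python sketch of what `padding='max_length'` and `truncation=True` produce for a single tweet. The helper `pad_to_max_length` and the example word-piece ids are hypothetical and only mimic the output shape; the real work is done by the tokenizer. The special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are those of the `bert-base-uncased` vocabulary.

```python
# Illustrative only: mimics the fixed-length layout of the encoded output;
# this is not how the Hugging Face tokenizer is implemented.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased


def pad_to_max_length(token_ids, max_length=30):
    """Wrap token ids in [CLS]/[SEP], then truncate or pad to max_length."""
    body = token_ids[:max_length - 2]       # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)   # 1 marks a real token
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad           # 0 marks padding to be ignored
    return input_ids, attention_mask


# Two hypothetical word-piece ids, padded to a short length for readability
ids, mask = pad_to_max_length([7592, 2088], max_length=6)
print(ids)   # [101, 7592, 2088, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

With `max_length=30`, as used above, every tweet becomes a row of exactly 30 ids, which is what allows the whole chunk to be stacked into one tensor.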

In [None]:
# Redefine BERT model for additional fine-tuning 
nlp_bert = transformers.BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', ## use 12-layer BERT model, uncased vocab
    num_labels=2, ## binary classification
    output_attentions=False, ## do not return attention weights
    output_hidden_states=False, ## do not return all hidden states
    )

# Move model to the GPU if available, otherwise keep it on the CPU
nlp_bert.to(device)

# Load saved BERT model
nlp_bert.load_state_dict(
    torch.load(os.path.join(LOAD_MODEL, BERT_PRE), map_location=device))

# Put model in evaluation mode
nlp_bert.eval() ## IMPORTANT STEP: disables dropout for prediction

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [None]:
# Predict labels
batch_size = 32
chunksize = 100_000
n = 1

for chunk in tqdm(pd.read_csv(LOAD_DATA + INPUT_FILE, chunksize=chunksize)):
    
    print('')
    print('Encoding Chunk...{}'.format(n))

    ## predict on test
    X_test = bert_encode(chunk, bert_token)
    TOKEN_FILE = 'bert_tokens_{}_c{}.pkl'.format(INPUT_FILE[6:-4], n)
    
    with open(os.path.join(SAVE_MODEL, TOKEN_FILE), 'wb') as file:
        pickle.dump(X_test, file)

    print('')
    print('Predicting labels for {:,} tweets...'.format(
        len( X_test['input_word_ids'])))

    ## wrap tensors for test
    X_test_data = TensorDataset(
        X_test['input_word_ids'], 
        X_test['input_masks']
        )

    ## make sampler for test
    X_test_sampler = SequentialSampler(X_test_data)

    ## make dataLoader for test
    X_test_dataloader = DataLoader(
        X_test_data, 
        sampler=X_test_sampler, 
        batch_size=batch_size
        )

    ## track predictions (the unlabeled data has no true labels)
    predictions = []

    ## predict 
    for batch in X_test_dataloader:
        ## move batch to the same device as the model
        batch = tuple(t.to(device) for t in batch)
        
        ## unpack the inputs from our dataloader
        b_input_ids, b_input_mask = batch
        
        ## tell the model not to compute or store gradients, 
        ## saving memory and speeding up prediction
        with torch.no_grad():
            ## start forward pass, calculate logit predictions
            outputs = nlp_bert(b_input_ids, token_type_ids=None, 
                            attention_mask=b_input_mask)

        logits = outputs[0]

        ## move logits to CPU
        logits = logits.detach().cpu().numpy()
        
        ## store predictions
        predictions.append(logits)

    print('')
    print('Saving new labels...')

    LABEL_FILE = 'label_{}_c{}.txt'.format(INPUT_FILE[6:-4], n)

    ## flatten arrays into single list
    chunk['label'] = [np.argmax(item) for sublist in 
                            predictions for item in sublist]

    chunk.to_csv(os.path.join(SAVE_DATA, LABEL_FILE), 
                columns=['label', 'tweet'],
                index=False,
                sep = ' ')
    
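    with open(os.path.join(SAVE_DATA, COUNT_FILE), 'a', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=",")
        ## write counts in a fixed label order (0, then 1);
        ## value_counts() alone orders by frequency, which can differ by chunk
        label_counts = chunk['label'].value_counts().reindex([0, 1], fill_value=0)
        writer.writerow([LABEL_FILE[:-4]] + label_counts.tolist())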

    n += 1
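
The step from logits to labels can be sketched without the model. The logits below are hypothetical values, not outputs from this notebook; each row is `[score for label 0, score for label 1]`, where 1 means abusive. The sketch flattens the per-batch lists the same way the loop does, takes the argmax per row, and also shows how a softmax would turn the same logits into a probability if a confidence score were wanted.

```python
import math


def softmax(row):
    """Convert a row of logits into probabilities that sum to 1."""
    shifted = [x - max(row) for x in row]  # subtract the max for stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits: two batches, each row is [not abusive, abusive]
batches = [
    [[2.0, -1.0], [0.5, 1.5]],
    [[-0.3, 0.2]],
]

# Flatten batches into one list of rows, as the prediction loop does
rows = [row for batch in batches for row in batch]

# argmax per row gives the predicted label
labels = [max(range(len(row)), key=row.__getitem__) for row in rows]

# probability of the abusive class for each row
prob_abusive = [softmax(row)[1] for row in rows]

print(labels)  # [0, 1, 1]
```

Because softmax preserves the ordering of the logits, taking the argmax of the raw logits gives the same label as thresholding the probability at 0.5 in the two-class case.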

0it [00:00, ?it/s]


Encoding Chunk...1

Predicting labels for 100,000 tweets...

Saving new labels...


1it [01:07, 67.22s/it]


Encoding Chunk...2

Predicting labels for 100,000 tweets...

Saving new labels...


2it [02:12, 66.75s/it]


Encoding Chunk...3

Predicting labels for 100,000 tweets...

Saving new labels...


3it [03:18, 66.47s/it]


Encoding Chunk...4

Predicting labels for 100,000 tweets...

Saving new labels...


4it [04:24, 66.21s/it]


Encoding Chunk...5

Predicting labels for 100,000 tweets...

Saving new labels...


5it [05:30, 66.08s/it]


Encoding Chunk...6

Predicting labels for 100,000 tweets...

Saving new labels...


6it [06:36, 66.06s/it]


Encoding Chunk...7

Predicting labels for 100,000 tweets...

Saving new labels...


7it [07:41, 65.98s/it]


Encoding Chunk...8

Predicting labels for 100,000 tweets...

Saving new labels...


8it [08:47, 65.83s/it]


Encoding Chunk...9

Predicting labels for 100,000 tweets...

Saving new labels...


9it [09:52, 65.76s/it]


Encoding Chunk...10

Predicting labels for 100,000 tweets...

Saving new labels...


10it [10:58, 65.65s/it]


Encoding Chunk...11

Predicting labels for 100,000 tweets...

Saving new labels...


11it [12:03, 65.54s/it]


Encoding Chunk...12

Predicting labels for 100,000 tweets...

Saving new labels...


12it [13:09, 65.53s/it]


Encoding Chunk...13

Predicting labels for 100,000 tweets...

Saving new labels...


13it [14:14, 65.34s/it]


Encoding Chunk...14

Predicting labels for 100,000 tweets...

Saving new labels...


14it [15:18, 65.20s/it]


Encoding Chunk...15

Predicting labels for 100,000 tweets...

Saving new labels...


15it [16:23, 65.03s/it]


Encoding Chunk...16

Predicting labels for 58,637 tweets...

Saving new labels...


16it [17:01, 63.86s/it]


In [None]:
# View some new labels
pd.set_option('display.max_colwidth', 500)
chunk.head(50)

Unnamed: 0,tweet,label
1500000,#has_retweet #has_mention #has_url John James has won the Senate Seat in Michigan and flipped a Seat Red!!,1
1500001,#has_retweet #has_mention Very cool.. #Teaserต้องไป,1
1500002,#has_url VIX but for PredictIt,0
1500003,"#has_retweet #has_mention In Malaysia, Biden ni macam PKR-DAP. Trump ni macam UMNO-BERSATU-PAS.",1
1500004,#has_mention #has_mention What was the gist of it?,0
1500005,#has_retweet #has_mention so i heard some tr*mp supporters might hack some accounts so if i seem to say anything in support of him or anything rac #has_truncate,0
1500006,#has_retweet #has_mention Follow everyone who retweets and likes this 💐,1
1500007,"#has_mention #has_mention This is democracy. that is how a democratic system works, I'm sorry if that bothers you.",0
1500008,#has_retweet #has_mention EVERYONE RETWEET DJT TWEET THAT WAS CENSORED! #has_url,1
1500009,"#has_retweet #has_mention He who finds a wife finds a good thing I cannot wait to congratulate you properly Mercy Show us the way, Shake our legs wi #has_truncate",1


In [None]:
# Combine all chunks into a single file
os.chdir(SAVE_DATA)
## skip FINAL_FILE in case it already exists from a previous run
files = [f for f in os.listdir() if f != FINAL_FILE]
sorted_files = sorted(files)
  
# Open new abusivelanguage file in write mode
with open(FINAL_FILE, 'w') as outfile:
    ## add header
    outfile.write("label tweet\n")
    ## iterate through list
    for fname in sorted_files:
        ## check for txt file
        if fname.endswith('.txt'):
            ## open each file in read mode
            with open(fname, 'r') as infile:
                lines = infile.readlines()[1:] ## remove header from each file
            
            for line in lines:
                outfile.write(line) ## read then write to file
            ## add '\n' to separate data from the next file
            outfile.write("\n")
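
One detail worth knowing about the merge: `sorted()` compares file names as strings, so chunk 10 sorts before chunk 2 and the rows of the combined file are not in the original chunk order. This does not affect the label counts. If the original order matters, the files could be sorted by chunk number instead. The sketch below assumes the `label_..._cN.txt` naming used above; `chunk_number` is a hypothetical helper.

```python
import re


def chunk_number(filename):
    """Return the integer N from a name ending in '_cN.txt', or -1 if absent."""
    match = re.search(r'_c(\d+)\.txt$', filename)
    return int(match.group(1)) if match else -1


names = ['label_20201103_c{}.txt'.format(i) for i in (1, 2, 10, 11)]

# Lexicographic order: c1, c10, c11, c2
print(sorted(names))

# Numeric order: c1, c2, c10, c11
print(sorted(names, key=chunk_number))
```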

In [None]:
# Check label counts
abuse_df = pd.read_csv(FINAL_FILE, sep=' ')
abuse_df.label.value_counts()

0    1150423
1     408214
Name: label, dtype: int64
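
As a quick sanity check, the counts above can be turned into class shares and compared against the number of tweets processed (15 chunks of 100,000 plus a final chunk of 58,637). This is plain arithmetic on the numbers shown in this notebook.

```python
# Label counts copied from the output above
counts = {0: 1_150_423, 1: 408_214}

total = sum(counts.values())
shares = {label: count / total for label, count in counts.items()}

# Tweets processed according to the progress log
expected_total = 15 * 100_000 + 58_637

print(total == expected_total)         # True
print('{:.1%}'.format(shares[1]))      # 26.2%
```

Roughly one in four tweets is predicted as abusive, and the total matches the number of tweets sent through the model, so no rows were lost when the chunks were merged.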

In [None]:
# Explore abusive language labels
abuse_df.loc[abuse_df['label'] == 1, :].sample(n=50)

Unnamed: 0,label,tweet
239697,1,#has_retweet #has_mention Beyoncé — Pretty Hurts #has_url
32172,1,"#has_retweet #has_mention 'Wheelbarrow washing' by Pamela Grace, a contemporary artist and printmaker living and working in Galloway, southwestern Sc #has_truncate"
355661,1,#has_retweet #has_mention if i find out any of my family members voted for Trump.. bitch.. the way i will be waiting for Thanksgiving dinner so i c #has_truncate
1347951,1,you are delusional if you think Emily in Paris had “style” 😹
1366886,1,#has_retweet #has_mention Rich people getting free shit so they can model it to poor people hoping we BUY it has to be the most bullshittiest th #has_truncate
959433,1,"#has_mention #has_mention #has_mention Oh, it is C and not even close, IMO. Gase is hopefully a 2-year mistake, 2009 was pain #has_truncate #has_url"
840847,1,class president x secretary..👀👀 CATCH ME ATTORNEY
908199,1,Even cough syrup drinking sound cloud rappers know that 💤 joe will be taking their money
1447528,1,#has_retweet #has_mention “dear cis men with fearful hearts &amp; fearless penises who only see us when it is dark do something for us outside of bed #has_truncate
1392829,1,hi i'm lexi and i never pay attention in class and bullshit everything but somehow it works


In [None]:
# Explore not abusive language labels
abuse_df.loc[abuse_df['label'] == 0, :].sample(n=50)

Unnamed: 0,label,tweet
1410180,0,New Free Ebooks | Brooks Law Group #has_url
1456400,0,#has_mention #has_mention #has_mention do not be shocked!! it is Philly!!
1252952,0,#has_retweet #has_mention 🔝 🔟 ALBUMS🌎Apple 1⃣Positions #ArianaGrande 2⃣LoveGoes #SamSmith 3⃣ #has_url #BringMeTheHorizon 4⃣TheAlbum #BLACKPINK #has_truncate
545512,0,#has_mention #has_mention Does that piss people off as well or is that ok?
394522,0,#has_retweet #has_mention Pay more attention to your creator than your critics.
685871,0,"#has_retweet #has_mention 🦊: honestly i like cute things like this. pastel, pink 🐉: ah really? i do not know 🦊: *jokingly hit yongha* 🐉: since you l #has_truncate"
526809,0,#has_retweet #has_mention [Log] '20 01 Nov. #1 [Situation] カゲロウスターズ2 DD Shiranui DD Nowaki #has_mention [Photo] #has_mention [Note] Editing myself #has_truncate
1071766,0,#has_mention #has_mention Love this! Best ever
469833,0,#has_retweet #has_mention NEVADA STAY IN LINE! VOTE FOR TRUMP! FINISH THE FIGHT!
435573,0,"#has_retweet #has_mention Virginia has about half of its precincts reporting and Trump is up 12 points. I get the North goes heavily D, but that is q #has_truncate"


In [None]:
# Create specific csv file format for repo
abuse_df.to_csv('abusivelanguage2020.csv', index=False)