<a class="anchor" id="0"></a>
# [Tweet Sentiment Extraction](https://www.kaggle.com/c/tweet-sentiment-extraction)

# Overview

In this notebook, I analyze and visualize the outliers of the NLP solution from the very good notebook [[TSE2020] RoBERTa (CNN) & Random Seed Distribution](https://www.kaggle.com/khoongweihao/tse2020-roberta-cnn-random-seed-distribution), using the functions from my notebook [NLP - EDA, Bag of Words, TF IDF, GloVe, BERT](https://www.kaggle.com/vbmokin/nlp-eda-bag-of-words-tf-idf-glove-bert), including PCA processing, KMeans clustering, WordCloud and others. Moreover, I try to improve the original solution.

The sections "**Subtext analysis**" and "**Metric analysis**" were added in commit 10.

# Results of analysis
1. Outlier analysis of the best solutions on basic roBERTa: see https://www.kaggle.com/c/tweet-sentiment-extraction/discussion/155419
2. Analysis of the predictions with the worst score (0) from roBERTa: see https://www.kaggle.com/c/tweet-sentiment-extraction/discussion/155616
3. New (commit 22): **analysis of 3 or more repetitions of characters in words**

## Acknowledgements
* [NLP - EDA, Bag of Words, TF IDF, GloVe, BERT](https://www.kaggle.com/vbmokin/nlp-eda-bag-of-words-tf-idf-glove-bert)
* [COVID-19 (Week5) Global Forecasting - EDA&ExtraTR](https://www.kaggle.com/vbmokin/covid-19-week5-global-forecasting-eda-extratr)
* [[TSE2020] RoBERTa (CNN) & Random Seed Distribution](https://www.kaggle.com/khoongweihao/tse2020-roberta-cnn-random-seed-distribution)
* Chris Deotte's post: https://www.kaggle.com/c/tweet-sentiment-extraction/discussion/142404#809872
* [Faster (2x) TF roBERTa](https://www.kaggle.com/seesee/faster-2x-tf-roberta)
* Many thanks to Chris Deotte for his TF roBERTa dataset at https://www.kaggle.com/cdeotte/tf-roberta
* https://www.kaggle.com/abhishek/roberta-inference-5-folds

<a class="anchor" id="0.1"></a>
## Table of Contents

1. [Import libraries](#1)
1. [Data fetching](#2)
1. [Model (TF: RoBERTa, DNN: CNN)](#3)
    - [Tunable parameters](#3.1)
    - [Jaccard score function](#3.2)
    - [Build RoBERTa](#3.3)
1. [Submission](#4)
1. [Outlier analysis](#5)
    - [Training prediction result visualization](#5.1)
    - [WordCloud](#5.2)
    - [Subtext analysis](#5.3)
    - [Metric analysis](#5.4)
    - [PCA visualization](#5.5)
    - [Clustering](#5.6)

## 1. Import libraries <a class="anchor" id="1"></a>

[Back to Table of Contents](#0.1)

In [1]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import seaborn as sns; sns.set(style='white')
from mpl_toolkits.mplot3d import Axes3D

from wordcloud import WordCloud
from sklearn.decomposition import PCA, TruncatedSVD
import math
import pickle

from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

import tensorflow as tf
import tensorflow.keras.backend as K
from transformers import RobertaConfig, TFRobertaModel
import tokenizers
from sklearn.model_selection import StratifiedKFold

pd.set_option('max_colwidth', 40)

## 2. Data fetching <a class="anchor" id="2"></a>

[Back to Table of Contents](#0.1)

Code from notebook https://www.kaggle.com/khoongweihao/tse2020-roberta-cnn-random-seed-distribution?scriptVersionId=34448972

In [2]:
# global variables

MAX_LEN = 96
EPOCHS = 3 # originally 3
BATCH_SIZE = 64 # originally 32
PAD_ID = 1
SEED = 88888
LABEL_SMOOTHING = 0.15

In [3]:
# get train data 

train = pd.read_csv('../input/tweet-sentiment-extraction/train.csv').fillna('')

# initialize the TensorFlow and NumPy random seeds
tf.random.set_seed(SEED)
np.random.seed(SEED)


In [4]:
# remove noisy rows from the train data

def find_failed_start(text, selected_text):
    """Return True if the label starts mid-word (the preceding character is a letter)."""
    begin = text.find(selected_text)
    if begin == 0:
        return False
    return text[begin - 1].isalpha()

def find_failed_end(text, selected_text):
    """Return True if the label ends mid-word (the following character is a letter)."""
    begin = text.find(selected_text)
    end = begin + len(selected_text)
    if end == len(text):
        return False
    return text[end].isalpha()

failed_start = train[train.apply(lambda row: find_failed_start(row.text, row.selected_text),axis=1)]
failed_end = train[train.apply(lambda row: find_failed_end(row.text, row.selected_text),axis=1)]
failed_ids = set(failed_start.textID.tolist() + failed_end.textID.tolist())

print(len(train))
print(len(failed_ids))

# drop all flagged rows in one step
train = train[~train['textID'].isin(failed_ids)]

27481
1699
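
To see what the two checks flag, here is a minimal, self-contained sketch on made-up tweets (the example strings and the helper names `cuts_word_at_start` / `cuts_word_at_end` are hypothetical, not from the dataset or the original notebook). A label counts as noisy when the character just before or just after the selected span is a letter, i.e. the span cuts a word in half. Unlike the cell above, the sketch also treats a span that is not found as clean, which is my own assumption.

```python
def cuts_word_at_start(text, selected_text):
    """True if the character right before the span is a letter."""
    begin = text.find(selected_text)
    if begin <= 0:  # span opens the text (0) or was not found (-1)
        return False
    return text[begin - 1].isalpha()


def cuts_word_at_end(text, selected_text):
    """True if the character right after the span is a letter."""
    begin = text.find(selected_text)
    if begin == -1:  # span not found: nothing to check
        return False
    end = begin + len(selected_text)
    if end == len(text):
        return False
    return text[end].isalpha()


examples = [
    ("i am so happy today", "so happy"),  # clean label
    ("i am so happy today", "o happy"),   # starts in the middle of "so"
    ("i am so happy today", "so hap"),    # ends in the middle of "happy"
]
flags = [(cuts_word_at_start(t, s), cuts_word_at_end(t, s)) for t, s in examples]
print(flags)  # [(False, False), (True, False), (False, True)]
```

Rows where either flag is `True` are the ones removed from `train`.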


In [5]:
# display train data after removing the noisy rows

train.reset_index(drop=True, inplace=True)
print(len(train))
train.tail()

25782


| | textID | text | selected_text | sentiment |
|---|---|---|---|---|
| 25777 | a208770a32 | in spoke to you yesterday and u did... | in spoke to you yesterday and u didn... | neutral |
| 25778 | b78ec00df5 | enjoy ur night | enjoy | positive |
| 25779 | f67aae2310 | Yay good for both of you. Enjoy the... | Yay good for both of you. | positive |
| 25780 | ed167662a5 | But it was worth it \*\*\*\*. | But it was worth it \*\*\*\*. | positive |
| 25781 | 6f7127d9d7 | All this flirting going on - The ... | All this flirting going on - The ATG... | neutral |


In [6]:
# get test data

test = pd.read_csv('../input/tweet-sentiment-extraction/test.csv').fillna('')

## 3. Model (TF: RoBERTa, DNN: CNN) <a class="anchor" id="3"></a>

[Back to Table of Contents](#0.1)

## 3.1. Tunable parameters <a class="anchor" id="3.1"></a>

[Back to Table of Contents](#0.1)

In [7]:
dropout_new = 0.15     # originally 0.1
n_split = 5            # originally 5
lr = 3e-5              # (lr: learning rate) originally 3e-5

## 3.2. Jaccard score function <a class="anchor" id="3.2"></a>

[Back to Table of Contents](#0.1)

In [8]:
def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    if (len(a)==0) and (len(b)==0): return 0.5
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
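
The function above computes the word-level Jaccard score: the number of shared lowercase words divided by the number of distinct words in both strings. A small worked example may help; the input strings below are made up for illustration, and the function is repeated so that the snippet runs on its own.

```python
def jaccard(str1, str2):
    """Word-level Jaccard similarity, case-insensitive."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        return 0.5  # same convention as the notebook for two empty strings
    c = a & b
    return len(c) / (len(a) + len(b) - len(c))


score_exact = jaccard("so happy", "So Happy")           # same words, case ignored
score_partial = jaccard("so happy today", "happy")      # 1 shared word out of 3 distinct
score_miss = jaccard("sad", "happy")                    # nothing in common

print(score_exact, score_partial, score_miss)
```

Here `score_exact` is 1.0, `score_partial` is 1/3 and `score_miss` is 0.0, so predicting the whole tweet when only one word was selected is penalized heavily.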

## 3.3. Build RoBERTa <a class="anchor" id="3.3"></a>

[Back to Table of Contents](#0.1)

Code from notebook https://www.kaggle.com/khoongweihao/tse2020-roberta-cnn-random-seed-distribution?scriptVersionId=34448972

**Upgrade:** added prediction on the training data for outlier analysis and parameter tuning

In [9]:
PATH = '../input/tf-roberta/'

# get RoBerta tokenizer

tokenizer = tokenizers.ByteLevelBPETokenizer(
    vocab_file = PATH +'vocab-roberta-base.json', 
    merges_file = PATH +'merges-roberta-base.txt', 
    lowercase = True,
    add_prefix_space = True
)

Tokenization of the training data

In [10]:
# tokenization of the train data

sentiment_id = {'positive': 1313, 'negative': 2430, 'neutral': 7974}

ct = train.shape[0]
input_ids_train = np.ones((ct,MAX_LEN),dtype='int32')
attention_mask_train = np.zeros((ct,MAX_LEN),dtype='int32')
token_type_ids_train = np.zeros((ct,MAX_LEN),dtype='int32')
start_tokens_train = np.zeros((ct,MAX_LEN),dtype='int32')
end_tokens_train = np.zeros((ct,MAX_LEN),dtype='int32')

for k in range(train.shape[0]):
    # find overlap
    text1 = " "+" ".join(train.loc[k,'text'].split())
    text2 = " ".join(train.loc[k,'selected_text'].split())
    idx = text1.find(text2)
    chars = np.zeros((len(text1)))
    chars[idx:idx+len(text2)]=1
    if text1[idx-1]==' ': chars[idx-1] = 1 
    enc = tokenizer.encode(text1) 
        
    # id offsets
    offsets = []; idx=0
    for t in enc.ids:
        w = tokenizer.decode([t])
        offsets.append((idx,idx+len(w)))
        idx += len(w)
    
    # start and end tokens
    toks = []
    for i,(a,b) in enumerate(offsets):
        sm = np.sum(chars[a:b])
        if sm>0: toks.append(i) 
        
    s_tok = sentiment_id[train.loc[k,'sentiment']]
    input_ids_train[k,:len(enc.ids)+3] = [0, s_tok] + enc.ids + [2]
    attention_mask_train[k,:len(enc.ids)+3] = 1
    if len(toks)>0:
        start_tokens_train[k,toks[0]+2] = 1
        end_tokens_train[k,toks[-1]+2] = 1
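
The loop above marks the characters of `selected_text` inside the tweet, computes character offsets for every token, and takes the first and last token that overlap the marked characters. The sketch below shows the same idea without the real tokenizer: the token list is a hand-made stand-in for the BPE output (an assumption), and `span_to_token_indices` is a hypothetical helper name.

```python
def span_to_token_indices(text, selected, tokens):
    """Return (first, last) indices of the tokens overlapping `selected`.

    `tokens` must be strings whose concatenation equals `text`.
    """
    start = text.find(selected)
    if start == -1:
        raise ValueError("selected text not found in text")

    # character mask: 1 for characters that belong to the selected span
    chars = [0] * len(text)
    for i in range(start, start + len(selected)):
        chars[i] = 1

    # character offsets (begin, end) of every token
    offsets, pos = [], 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok)

    # tokens that contain at least one marked character
    hit = [i for i, (a, b) in enumerate(offsets) if sum(chars[a:b]) > 0]
    return hit[0], hit[-1]


text = " my day was great fun"
tokens = [" my", " day", " was", " great", " fun"]  # toy tokens, not real BPE
first, last = span_to_token_indices(text, "great fun", tokens)
print(first, last)  # 3 4
```

In the notebook the resulting indices are shifted by 2 because every input row starts with the `<s>` token and the sentiment token.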

Tokenization of the test data

In [11]:
# tokenization of the test data

ct = test.shape[0]
input_ids_test = np.ones((ct,MAX_LEN),dtype='int32')
attention_mask_test = np.zeros((ct,MAX_LEN),dtype='int32')
token_type_ids_test = np.zeros((ct,MAX_LEN),dtype='int32')

# store input_ids and attention_mask

for k in range(test.shape[0]):
    text1 = " "+" ".join(test.loc[k,'text'].split())
    enc = tokenizer.encode(text1)                
    s_tok = sentiment_id[test.loc[k,'sentiment']]
    input_ids_test[k,:len(enc.ids)+3] = [0, s_tok] + enc.ids + [2]
    attention_mask_test[k,:len(enc.ids)+3] = 1
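
Each input row has the layout `[0, sentiment_token] + token_ids + [2]`, padded with `PAD_ID = 1` up to `MAX_LEN`, and the attention mask is 1 over the real tokens and 0 over the padding. A minimal sketch of one row, assuming a toy `MAX_LEN` of 8 and made-up token IDs (`build_row` is a hypothetical helper; only the sentiment ID 1313 for "positive" comes from the notebook):

```python
MAX_LEN = 8  # toy value for the example; the notebook uses 96
PAD_ID = 1


def build_row(token_ids, sentiment_token, max_len=MAX_LEN, pad_id=PAD_ID):
    """Build one padded input row and its attention mask."""
    body = [0, sentiment_token] + list(token_ids) + [2]
    if len(body) > max_len:
        raise ValueError("sequence longer than max_len")
    n_pad = max_len - len(body)
    input_ids = body + [pad_id] * n_pad
    attention_mask = [1] * len(body) + [0] * n_pad
    return input_ids, attention_mask


ids, mask = build_row([101, 102, 103], sentiment_token=1313)
print(ids)   # [0, 1313, 101, 102, 103, 2, 1, 1]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

The three extra positions (`<s>`, the sentiment token and `</s>`) are why the notebook slices with `len(enc.ids)+3`.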

In [12]:
def save_weights(model, dst_fn):
    weights = model.get_weights()
    with open(dst_fn, 'wb') as f:
        pickle.dump(weights, f)


def load_weights(model, weight_fn):
    with open(weight_fn, 'rb') as f:
        weights = pickle.load(f)
    model.set_weights(weights)
    return model

def loss_fn(y_true, y_pred):
    # adjust the targets for sequence bucketing
    ll = tf.shape(y_pred)[1]
    y_true = y_true[:, :ll]
    loss = tf.keras.losses.categorical_crossentropy(y_true, y_pred,
        from_logits=False, label_smoothing=LABEL_SMOOTHING)
    loss = tf.reduce_mean(loss)
    return loss


def build_model():
    ids = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    att = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    tok = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    padding = tf.cast(tf.equal(ids, PAD_ID), tf.int32)

    lens = MAX_LEN - tf.reduce_sum(padding, -1)
    max_len = tf.reduce_max(lens)
    ids_ = ids[:, :max_len]
    att_ = att[:, :max_len]
    tok_ = tok[:, :max_len]

    config = RobertaConfig.from_pretrained(PATH+'config-roberta-base.json')
    bert_model = TFRobertaModel.from_pretrained(PATH+'pretrained-roberta-base.h5',config=config)
    x = bert_model(ids_,attention_mask=att_,token_type_ids=tok_)
    
    x1 = tf.keras.layers.Dropout(dropout_new)(x[0])
    x1 = tf.keras.layers.Conv1D(768, 2,padding='same')(x1)
    x1 = tf.keras.layers.LeakyReLU()(x1)
    x1 = tf.keras.layers.Conv1D(64, 2,padding='same')(x1)
    x1 = tf.keras.layers.LeakyReLU()(x1)
    x1 = tf.keras.layers.Conv1D(64, 2,padding='same')(x1)
    x1 = tf.keras.layers.Dense(1)(x1)
    x1 = tf.keras.layers.Flatten()(x1)
    x1 = tf.keras.layers.Activation('softmax')(x1)
    
    x2 = tf.keras.layers.Dropout(dropout_new)(x[0]) 
    x2 = tf.keras.layers.Conv1D(768, 2,padding='same')(x2)
    x2 = tf.keras.layers.LeakyReLU()(x2)
    x2 = tf.keras.layers.Conv1D(64, 2, padding='same')(x2)
    x2 = tf.keras.layers.LeakyReLU()(x2)
    x2 = tf.keras.layers.Conv1D(64, 2, padding='same')(x2)
    x2 = tf.keras.layers.Dense(1)(x2)
    x2 = tf.keras.layers.Flatten()(x2)
    x2 = tf.keras.layers.Activation('softmax')(x2)

    model = tf.keras.models.Model(inputs=[ids, att, tok], outputs=[x1,x2])
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr) 
    model.compile(loss=loss_fn, optimizer=optimizer)
    
    # this is required as `model.predict` needs a fixed size!
    x1_padded = tf.pad(x1, [[0, 0], [0, MAX_LEN - max_len]], constant_values=0.)
    x2_padded = tf.pad(x2, [[0, 0], [0, MAX_LEN - max_len]], constant_values=0.)
    
    padded_model = tf.keras.models.Model(inputs=[ids, att, tok], outputs=[x1_padded,x2_padded])
    return model, padded_model
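
`loss_fn` passes `label_smoothing=LABEL_SMOOTHING` to the Keras categorical cross-entropy. As far as I understand the Keras behaviour, this replaces each one-hot target `y` with `y * (1 - s) + s / n`, where `s` is the smoothing value and `n` is the number of positions. The sketch below reproduces only that arithmetic in plain Python; `smooth_labels` is a hypothetical helper and the 4-position target is a toy example.

```python
def smooth_labels(one_hot, smoothing):
    """Apply label smoothing to a single one-hot target vector."""
    n = len(one_hot)
    return [y * (1.0 - smoothing) + smoothing / n for y in one_hot]


# start token at position 2 of a 4-token sequence, smoothing as in the notebook
smoothed = smooth_labels([0, 0, 1, 0], smoothing=0.15)
print(smoothed)
```

The true position keeps most of the probability mass (about 0.8875 here) and the rest is spread evenly, which makes the model less over-confident about noisy span boundaries.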

Train RoBERTa
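
The training loop in the following cell uses sequence bucketing: samples are sorted by their padding count (with a little random noise), cut into batches, and then the batches themselves are shuffled. Each batch therefore holds tweets of similar length, and the model can trim the padding per batch. Below is a minimal sketch of that ordering step in plain Python; `bucketed_order` and the toy padding counts are my own illustration, not code from the original notebook.

```python
import math
import random


def bucketed_order(pad_counts, batch_size, rng):
    """Return sample indices grouped into batches of similar length."""
    # sort by padding count plus small noise so the order differs per epoch
    order = sorted(
        range(len(pad_counts)),
        key=lambda k: pad_counts[k] + rng.randint(-3, 2),
        reverse=True,
    )
    # shuffle whole batches, so short batches do not always come first
    num_batches = math.ceil(len(order) / batch_size)
    batch_ids = list(range(num_batches))
    rng.shuffle(batch_ids)
    shuffled = []
    for b in batch_ids:
        shuffled.extend(order[b * batch_size:(b + 1) * batch_size])
    return shuffled


pad_counts = [90, 10, 88, 12, 50, 52, 91, 11]  # toy padding counts per sample
order = bucketed_order(pad_counts, batch_size=2, rng=random.Random(0))
print(order)
```

Every index appears exactly once in `order`; only the grouping changes, which is why `fit` is then called with `shuffle=False`.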

In [13]:
%%time

jac = []
VER='v0'
DISPLAY=1 # use display=1 for interactive

oof_start = np.zeros((input_ids_train.shape[0],MAX_LEN))
oof_end = np.zeros((input_ids_train.shape[0],MAX_LEN))

preds_start_train = np.zeros((input_ids_train.shape[0],MAX_LEN))
preds_end_train = np.zeros((input_ids_train.shape[0],MAX_LEN))

preds_start = np.zeros((input_ids_test.shape[0],MAX_LEN))
preds_end = np.zeros((input_ids_test.shape[0],MAX_LEN))

skf = StratifiedKFold(n_splits=n_split,shuffle=True,random_state=SEED)

for fold,(idxT,idxV) in enumerate(skf.split(input_ids_train,train.sentiment.values)):
    print('#'*25)
    print('### FOLD %i'%(fold+1))
    print('#'*25)
    
    K.clear_session()
    model, padded_model = build_model()
        
    #sv = tf.keras.callbacks.ModelCheckpoint(
    #    '%s-roberta-%i.h5'%(VER,fold), monitor='val_loss', verbose=1, save_best_only=True,
    #    save_weights_only=True, mode='auto', save_freq='epoch')
    inpT = [input_ids_train[idxT,], attention_mask_train[idxT,], token_type_ids_train[idxT,]]
    targetT = [start_tokens_train[idxT,], end_tokens_train[idxT,]]
    inpV = [input_ids_train[idxV,],attention_mask_train[idxV,],token_type_ids_train[idxV,]]
    targetV = [start_tokens_train[idxV,], end_tokens_train[idxV,]]
    
    # sort the validation data
    shuffleV = np.int32(sorted(range(len(inpV[0])), key=lambda k: (inpV[0][k] == PAD_ID).sum(), reverse=True))
    inpV = [arr[shuffleV] for arr in inpV]
    targetV = [arr[shuffleV] for arr in targetV]
    weight_fn = '%s-roberta-%i.h5'%(VER,fold)
    for epoch in range(1, EPOCHS + 1):
        # sort and shuffle: We add random numbers to not have the same order in each epoch
        shuffleT = np.int32(sorted(range(len(inpT[0])), key=lambda k: (inpT[0][k] == PAD_ID).sum() + np.random.randint(-3, 3), reverse=True))
        # shuffle in batches, otherwise short batches will always come in the beginning of each epoch
        num_batches = math.ceil(len(shuffleT) / BATCH_SIZE)
        batch_inds = np.random.permutation(num_batches)
        shuffleT_ = []
        for batch_ind in batch_inds:
            shuffleT_.append(shuffleT[batch_ind * BATCH_SIZE: (batch_ind + 1) * BATCH_SIZE])
        shuffleT = np.concatenate(shuffleT_)
        # reorder the input data
        inpT = [arr[shuffleT] for arr in inpT]
        targetT = [arr[shuffleT] for arr in targetT]
        model.fit(inpT, targetT, 
            epochs=epoch, initial_epoch=epoch - 1, batch_size=BATCH_SIZE, verbose=DISPLAY, callbacks=[],
            validation_data=(inpV, targetV), shuffle=False)  # don't shuffle in `fit`
        save_weights(model, weight_fn)

    print('Loading model...')
    # model.load_weights('%s-roberta-%i.h5'%(VER,fold))
    load_weights(model, weight_fn)

    print('Predicting OOF...')
    oof_start[idxV,],oof_end[idxV,] = padded_model.predict([input_ids_train[idxV,],attention_mask_train[idxV,],token_type_ids_train[idxV,]],verbose=DISPLAY)
    
    print('Predicting all TRAIN for outlier analysis...')
    preds_train = padded_model.predict([input_ids_train,attention_mask_train,token_type_ids_train],verbose=DISPLAY)
    preds_start_train += preds_train[0]/skf.n_splits
    preds_end_train += preds_train[1]/skf.n_splits

    print('Predicting TEST...')
    preds = padded_model.predict([input_ids_test,attention_mask_test,token_type_ids_test],verbose=DISPLAY)
    preds_start += preds[0]/skf.n_splits
    preds_end += preds[1]/skf.n_splits
    
    # display fold jaccard
    all = []
    for k in idxV:
        a = np.argmax(oof_start[k,])
        b = np.argmax(oof_end[k,])
        if a>b: 
            st = train.loc[k,'text'] # improve CV/LB with better choice here
        else:
            text1 = " "+" ".join(train.loc[k,'text'].split())
            enc = tokenizer.encode(text1)
            st = tokenizer.decode(enc.ids[a-2:b-1])
        all.append(jaccard(st,train.loc[k,'selected_text']))
    jac.append(np.mean(all))
    print('>>>> FOLD %i Jaccard ='%(fold+1),np.mean(all))
    print()

#########################
### FOLD 1
#########################
Train on 20625 samples, validate on 5157 samples
Train on 20625 samples, validate on 5157 samples
Epoch 2/2
Train on 20625 samples, validate on 5157 samples
Epoch 3/3
Loading model...
Predicting OOF...
Predicting all TRAIN for outlier analysis...
Predicting TEST...
>>>> FOLD 1 Jaccard = 0.7423914403296816

#########################
### FOLD 2
#########################
Train on 20625 samples, validate on 5157 samples
Train on 20625 samples, validate on 5157 samples
Epoch 2/2
Train on 20625 samples, validate on 5157 samples
Epoch 3/3
Loading model...
Predicting OOF...
Predicting all TRAIN for outlier analysis...
Predicting TEST...
>>>> FOLD 2 Jaccard = 0.736493141561854

#########################
### FOLD 3
#########################
Train on 20626 samples, validate on 5156 samples
Train on 20626 samples, validate on 5156 samples
Epoch 2/2
Train on 20626 samples, validate on 5156 samples
Epoch 3/3
Loading model...
Predicting O

In [14]:
print('OVERALL 5Fold CV Jaccard =',np.mean(jac))
print(jac) # Jaccard CVs

OVERALL 5Fold CV Jaccard = 0.7320280546064353
[0.7423914403296816, 0.736493141561854, 0.7332598735422187, 0.729638400196351, 0.7183574174020708]


## 4. Submission <a class="anchor" id="4"></a>

[Back to Table of Contents](#0.1)

Code from notebook https://www.kaggle.com/khoongweihao/tse2020-roberta-cnn-random-seed-distribution?scriptVersionId=34448972

In [15]:
all = []
for k in range(input_ids_test.shape[0]):
    a = np.argmax(preds_start[k,])
    b = np.argmax(preds_end[k,])
    if a>b: 
        st = test.loc[k,'text']
    else:
        text1 = " "+" ".join(test.loc[k,'text'].split())
        enc = tokenizer.encode(text1)
        st = tokenizer.decode(enc.ids[a-2:b-1])
    all.append(st)

In [16]:
test['selected_text'] = all
test[['textID','selected_text']].to_csv('submission.csv',index=False)
test.sample(10)

| | textID | text | sentiment | selected_text |
|---|---|---|---|---|
| 2319 | ced0bb0d25 | I meant looking like a tiger - stupi... | negative | stupid |
| 940 | 261a5bbd14 | tell me what you think of Pride Pre... | neutral | tell me what you think of pride pre... |
| 1271 | 66762b14cd | I need one of these! Was just thin... | neutral | i need one of these! was just think... |
| 1284 | a22d319e5f | - That Jasper clip is my first 'fav... | neutral | - that jasper clip is my first 'fav... |
| 1493 | b14b4d5209 | haha i agree ! i am her test dummy.... | positive | love it. she is magic! |
| 836 | 105b761a2c | i dont even know now lenaaa when y... | neutral | i dont even know now lenaaa when yo... |
| 1505 | 762a44afb6 | Mrs.Bates left | neutral | mrs.bates left |
| 1974 | b4031d0753 | Yay me! But, what in particular? | neutral | yay me! but, what in particular? |
| 1219 | cea5b3f191 | Best friend is leaving to go back to... | negative | sad |
| 1423 | a9e176a76e | hey I didn\`t get the email yet | neutral | hey i didn\`t get the email yet |
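
The submission step picks the most probable start and end positions, falls back to the whole tweet when the start comes after the end, and otherwise decodes the tokens in between. The sketch below mirrors that logic on toy data; `decode_span`, the token list and the probability vectors are hypothetical, and joining token strings stands in for `tokenizer.decode`.

```python
def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def decode_span(start_probs, end_probs, tokens, text, prefix_len=2):
    """Pick the predicted span, or the full text if start > end."""
    a = argmax(start_probs)
    b = argmax(end_probs)
    if a > b:
        return text  # inconsistent prediction: fall back to the whole tweet
    # positions are shifted by the two prefix tokens (<s> and sentiment)
    return "".join(tokens[a - prefix_len:b - prefix_len + 1])


text = " my day was great fun"
tokens = [" my", " day", " was", " great", " fun"]  # toy tokens, not real BPE

start_probs = [0.0, 0.0, 0.05, 0.05, 0.1, 0.7, 0.1, 0.0]
end_probs = [0.0, 0.0, 0.0, 0.05, 0.05, 0.1, 0.8, 0.0]
picked = decode_span(start_probs, end_probs, tokens, text)
fallback = decode_span(end_probs, start_probs, tokens, text)

print(repr(picked))    # ' great fun'
print(repr(fallback))  # ' my day was great fun'
```

The leading space in the decoded span does not affect the Jaccard score, because the metric splits on whitespace.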


## 5. Outlier analysis <a class="anchor" id="5"></a>

[Back to Table of Contents](#0.1)

Code from notebook https://www.kaggle.com/khoongweihao/tse2020-roberta-cnn-random-seed-distribution?scriptVersionId=34448972

[Go to Top](#0)