# Experimenting with BERT - tweet sentiment

> Our goal is to create a model that takes a tweet (just like the ones in our dataset) and produces either 1 (indicating the tweet carries a positive sentiment) or 0 (indicating the tweet carries a negative sentiment).

> * **DistilBERT** processes the sentence and passes the information it extracted on to the next model. DistilBERT is a smaller version of BERT developed and open-sourced by the team at HuggingFace. It’s a lighter and faster version of BERT that roughly matches its performance.
> * The next model, a basic **Logistic Regression** model from scikit-learn, takes in the result of DistilBERT’s processing and classifies the tweet as either positive or negative (1 or 0, respectively).

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import os
import pickle
import config
import re
import nltk
import gc
from nltk.corpus import stopwords
from  nltk.stem import SnowballStemmer

#Modeling
import torch
import transformers as ppb # pytorch transformers
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.model_selection import train_test_split

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


I1203 10:39:17.335577 140735979778944 file_utils.py:32] TensorFlow version 2.0.0 available.
I1203 10:39:17.336699 140735979778944 file_utils.py:39] PyTorch version 1.1.0 available.


# Loading Dataset

> The dataset is a sample from a Kaggle dataset. You can download the entire dataset from this link: https://www.kaggle.com/kazanova/sentiment140

In [2]:
TRAIN_SIZE = 0.8

DATASET_COLUMNS = ["target", "ids", "date", "flag", "user", "text"]
DATASET_ENCODING = "ISO-8859-1"
DATASET_PATH = config.DATA_DIR+'data_sample.csv'

In [3]:
decode_map = {0: "NEGATIVE", 4: "POSITIVE"}
def decode_sentiment(label):
    return decode_map[int(label)]

In [4]:
df = pd.read_csv(DATASET_PATH, encoding =DATASET_ENCODING)
df.shape

(6000, 6)

In [5]:
df.head()

Unnamed: 0,target,ids,date,flag,user,text
0,0,1752724555,Sat May 09 22:46:45 PDT 2009,NO_QUERY,milanq,I'm WET!!!! ....spilled my tea all over my dr...
1,0,1971714242,Sat May 30 07:44:05 PDT 2009,NO_QUERY,samarudge,Finished photoshopping and now trying to figur...
2,0,1976810217,Sat May 30 19:08:35 PDT 2009,NO_QUERY,digmod,is twitter is down again..........
3,0,1573747027,Tue Apr 21 02:39:58 PDT 2009,NO_QUERY,juergenfenn,!Identica currently has rather severe problems...
4,0,2188274875,Mon Jun 15 21:36:10 PDT 2009,NO_QUERY,LoisLane210,Only 3 eps of one of my fave TV shows left to ...


In [6]:
df['target'].value_counts()

4    3000
0    3000
Name: target, dtype: int64
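
The introduction describes the output as 1 or 0, while the raw `target` column uses 4 for positive and 0 for negative. scikit-learn accepts either encoding, so the notebook leaves the labels as they are. If 0/1 labels are preferred, a mapping along these lines would work; `BINARY_MAP` and `to_binary` are hypothetical names, not part of the notebook.

```python
# Hypothetical helper (not in the original notebook): map Sentiment140
# targets (0 = negative, 4 = positive) to binary labels 0/1.
BINARY_MAP = {0: 0, 4: 1}

def to_binary(label):
    """Return 0 for a negative tweet and 1 for a positive one."""
    return BINARY_MAP[int(label)]

raw_targets = [0, 4, 4, 0]
binary_targets = [to_binary(t) for t in raw_targets]
print(binary_targets)  # [0, 1, 1, 0]
```

On the DataFrame this would presumably be applied as `df['target'].map(BINARY_MAP)`.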

# Preprocessing the tweets

> Basic preprocessing of tweets: remove user mentions, links, and punctuation.

In [7]:
# TEXT CLEANING
# Raw string avoids invalid-escape warnings; the redundant "http?:\S" alternative is dropped.
TEXT_CLEANING_RE = r"@\S+|https?:\S+|[^A-Za-z0-9]+"
stop_words = stopwords.words("english")
stemmer = SnowballStemmer("english")

def preprocess_2(text, stem=False):
    # Remove link,user and special characters
    text = re.sub(TEXT_CLEANING_RE, ' ', str(text).lower()).strip()
    tokens = []
    for token in text.split():
        if stem:
            tokens.append(stemmer.stem(token))
        else:
            tokens.append(token)
    return " ".join(tokens)

In [8]:
df['text'] = df['text'].apply(preprocess_2)
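
To make the effect of the cleaning step concrete, here is a small standalone sketch applied to a made-up tweet. It uses the same three ideas as `TEXT_CLEANING_RE` (mentions, links, runs of non-alphanumeric characters); `CLEANING_RE`, `clean_tweet`, and the sample text are illustrative names and data, not part of the notebook.

```python
import re

# Same idea as TEXT_CLEANING_RE: drop @mentions, links, and any run of
# non-alphanumeric characters.
CLEANING_RE = r"@\S+|https?:\S+|[^A-Za-z0-9]+"

def clean_tweet(text):
    """Lowercase the tweet, replace matches with spaces, collapse whitespace."""
    text = re.sub(CLEANING_RE, " ", str(text).lower()).strip()
    return " ".join(text.split())

sample = "@bob I'm WET!!! see http://t.co/abc :("
cleaned = clean_tweet(sample)
print(cleaned)  # i m wet see
```

Note that the apostrophe is treated as punctuation, so contractions such as "I'm" are split into "i m". Emoticons are removed as well, which may discard some sentiment signal.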

# Loading the Pre-trained DistilBERT model

> After the cell below runs, the variable `model` holds a pretrained DistilBERT model, a version of BERT that is smaller, much faster, and needs far less memory.

In [9]:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## To use BERT instead of DistilBERT, uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

I1203 10:39:17.870800 140735979778944 tokenization_utils.py:375] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /Users/araghavan/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
I1203 10:39:18.053743 140735979778944 configuration_utils.py:152] loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-config.json from cache at /Users/araghavan/.cache/torch/transformers/a41e817d5c0743e29e86ff85edc8c257e61bc8d88e4271bb1b243b6e7614c633.1ccd1a11c9ff276830e114ea477ea2407100f4a3be7bdc45d37be9e37fa71c7e
I1203 10:39:18.055613 140735979778944 configuration_utils.py:169] Model config {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": null,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "is_decoder": false,
  "max_position_embed

# Preprocessing for BERT

> Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the format it requires.
* **Tokenizing**: The first step is to tokenize the sentences, breaking them up into words and subwords in the format BERT is comfortable with.
* **Padding**: After tokenization, `tokenized` is a list of sentences, each represented as a list of token ids. We want BERT to process all our examples at once (as one batch), which is faster. For that reason, we pad all lists to the same length so the input can be represented as one 2-D array rather than a list of lists of different lengths.
* **Masking**: If we sent `padded` to BERT directly, the padding would slightly confuse it. We create another variable, `attention_mask`, that tells it to ignore (mask) the padding we added when it processes the input.

In [10]:
tokenized = df['text'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

In [11]:
# Length of the longest tokenized tweet
max_len = max(len(i) for i in tokenized.values)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

In [12]:
padded.shape

(6000, 56)

In [13]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(6000, 56)
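
The padding and masking steps are easier to follow on a tiny example. The sketch below is a dependency-free toy version using plain Python lists instead of NumPy arrays. The ids 101 and 102 are the `[CLS]` and `[SEP]` ids in the `bert-base-uncased` vocabulary and 0 is `[PAD]`; the remaining ids are arbitrary placeholders.

```python
# Toy token-id lists of different lengths (101 = [CLS], 102 = [SEP];
# the ids in between are arbitrary placeholders).
tokenized = [
    [101, 2023, 102],
    [101, 2023, 2003, 2307, 102],
]

# Pad every list with 0 (the [PAD] id) up to the longest length.
max_len = max(len(ids) for ids in tokenized)
padded = [ids + [0] * (max_len - len(ids)) for ids in tokenized]

# 1 marks a real token, 0 marks padding that the model should ignore.
attention_mask = [[1 if tok != 0 else 0 for tok in ids] for ids in padded]

print(padded)          # [[101, 2023, 102, 0, 0], [101, 2023, 2003, 2307, 102]]
print(attention_mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Deriving the mask from `!= 0` only works because no real token shares the padding id.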

# Running DistilBERT (forward pass)

In [14]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

# Sentence Embedding

> The output corresponding to the first token of each sentence is what we need. For sentence classification, BERT adds a special token called [CLS] (for classification) at the beginning of every sentence. The output corresponding to that token can be thought of as an embedding for the entire sentence. We'll save those in the `features` variable, as they'll serve as the features for our logistic regression model.

In [15]:
# Keep only the [CLS] vector (token position 0) of every sentence
features = last_hidden_states[0][:,0,:].numpy()
del last_hidden_states
gc.collect()

91
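
The slice `[:, 0, :]` can be checked on a small stand-in array. The sketch below uses random numbers in place of the real model output, so only the shapes are meaningful; the sizes (4, 6, 8) are arbitrary, whereas the notebook's tensor would be (6000, 56, 768) given the padded shape and the `dim` value in the model config.

```python
import numpy as np

# Stand-in for the model output: (batch, tokens, hidden) filled with
# random numbers. Sizes are arbitrary and chosen for illustration.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 6, 8))

# Position 0 along the token axis is the [CLS] token of each sentence.
cls_features = hidden[:, 0, :]

print(cls_features.shape)  # (4, 8)
```

Each row of `cls_features` is one sentence-level vector, which is the 2-D layout scikit-learn expects for its feature matrix.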

# Prepare Train-Test for Classification

In [16]:
labels = df['target']
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

In [17]:
# Free the memory held by the full feature matrix
del features, labels
gc.collect()

50

# Grid Search for Parameters

In [18]:
#parameters = {'C': np.linspace(0.0001, 100, 20)}
#grid_search = GridSearchCV(LogisticRegression(), parameters)
#grid_search.fit(train_features, train_labels)

#print('best parameters: ', grid_search.best_params_)
#print('best scores: ', grid_search.best_score_)

# Logistic Regression

In [19]:
# This C matches the second point of the commented-out grid above, np.linspace(0.0001, 100, 20)
lr_clf = LogisticRegression(C=5.263252631578947)
lr_clf.fit(train_features, train_labels)



LogisticRegression(C=5.263252631578947, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

# Model Evaluation

In [20]:
lr_clf.score(test_features, test_labels)

0.7513333333333333

# Comparison to baseline

In [21]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Dummy classifier score: 0.494 (+/- 0.03)




# Saving the model

In [23]:
# Use a context manager so the file is closed once the model is written
with open(config.PROCESS_DIR+'LogisticRegressionModel.p', 'wb') as f:
    pickle.dump(lr_clf, f)