# BERT PRESIDENTIAL TWEET SENTIMENT ANALYSIS

* [Tutorial](https://towardsdatascience.com/sentiment-analysis-in-10-minutes-with-bert-and-hugging-face-294e8a04b671)

In [4]:
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures
import pandas as pd
import os
import shutil

model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertForSequenceClassification: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['dropout_37', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Setup Before Fine-Tuning

In [5]:
# Step 1: Check PyTorch
import torch
print("Cuda available: ", torch.cuda.is_available())
# get_device_name() raises an error when no GPU is present, so guard the call
if torch.cuda.is_available():
    print("Device name:", torch.cuda.get_device_name())
# Step 2: Check TensorFlow
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

Cuda available:  True
Device name: Tesla T4
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 2078096680240664600
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 14770136704
locality {
  bus_id: 1
  links {
  }
}
incarnation: 6852159092692336544
physical_device_desc: "device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5"
]


In [7]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________
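
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard `bert-base` hidden size of 768 and two output labels; neither value is printed in the summary itself, so treat both as assumptions.

```python
# Sanity-check the parameter counts reported by model.summary().
# Assumptions: bert-base hidden size is 768 and the classifier has 2 output labels.
hidden_size = 768
num_labels = 2

# A Dense layer has one weight per (input, output) pair plus one bias per output.
classifier_params = hidden_size * num_labels + num_labels

bert_params = 109_482_240  # "bert (TFBertMainLayer)" row of the summary
total_params = bert_params + classifier_params

print(classifier_params)  # 1538
print(total_params)       # 109483778
```

Only the 1,538 classifier parameters start from random values; the rest are loaded from the `bert-base-uncased` checkpoint, which is why the loading message recommends training on a downstream task.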


In [9]:
URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file(fname="aclImdb_v1.tar.gz", 
                                  origin=URL,
                                  untar=True,
                                  cache_dir='.',
                                  cache_subdir='')

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [10]:
# Create main directory path ("/aclImdb")
main_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
# Create sub directory path ("/aclImdb/train")
train_dir = os.path.join(main_dir, 'train')
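# Remove the unsup folder since this is a supervised learning task
# (ignore_errors=True lets the cell be re-run after the folder is already gone)
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir, ignore_errors=True)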
# View the final train folder
print(os.listdir(train_dir))

['labeledBow.feat', 'neg', 'pos', 'unsupBow.feat', 'urls_neg.txt', 'urls_pos.txt', 'urls_unsup.txt']


# Separating Train and Test Sets

In [11]:
# Create a training dataset and a validation dataset
# from the "aclImdb/train" directory with an 80/20 split.
train = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=30000, validation_split=0.2, 
    subset='training', seed=123)
test = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=30000, validation_split=0.2, 
    subset='validation', seed=123)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.


In [12]:
for i in train.take(1):
  train_feat = i[0].numpy()
  train_lab = i[1].numpy()

train = pd.DataFrame([train_feat, train_lab]).T
train.columns = ['DATA_COLUMN', 'LABEL_COLUMN']
train['DATA_COLUMN'] = train['DATA_COLUMN'].str.decode("utf-8")
train.head()

| | DATA_COLUMN | LABEL_COLUMN |
|---|---|---|
| 0 | Canadian director Vincenzo Natali took the art... | 1 |
| 1 | I gave this film 10 not because it is a superb... | 1 |
| 2 | I admit to being somewhat jaded about the movi... | 1 |
| 3 | For a long time, 'The Menagerie' was my favori... | 1 |
| 4 | A truly frightening film. Feels as if it were ... | 0 |


In [13]:
for j in test.take(1):
  test_feat = j[0].numpy()
  test_lab = j[1].numpy()

test = pd.DataFrame([test_feat, test_lab]).T
test.columns = ['DATA_COLUMN', 'LABEL_COLUMN']
test['DATA_COLUMN'] = test['DATA_COLUMN'].str.decode("utf-8")
test.head()

| | DATA_COLUMN | LABEL_COLUMN |
|---|---|---|
| 0 | I can't believe that so much talent can be was... | 0 |
| 1 | This movie blows - let's get that straight rig... | 0 |
| 2 | The saddest thing about this "tribute" is that... | 0 |
| 3 | I'm only rating this film as a 3 out of pity b... | 0 |
| 4 | Something surprised me about this movie - it w... | 1 |


In [15]:
InputExample(guid=None,
             text_a = "Hello, world",
             text_b = None,
             label = 1)

InputExample(guid=None, text_a='Hello, world', text_b=None, label=1)

# Setting Up Tensors

In [16]:
def convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN): 
  train_InputExamples = train.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[DATA_COLUMN], 
                                                          text_b = None,
                                                          label = x[LABEL_COLUMN]), axis = 1)

  validation_InputExamples = test.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[DATA_COLUMN], 
                                                          text_b = None,
                                                          label = x[LABEL_COLUMN]), axis = 1)
  
  return train_InputExamples, validation_InputExamples

train_InputExamples, validation_InputExamples = convert_data_to_examples(train, 
                                                                         test, 
                                                                         'DATA_COLUMN', 
                                                                         'LABEL_COLUMN')

In [17]:
def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = [] # -> will hold InputFeatures to be converted later

    for e in examples:
        # The tokenizer documentation describes every encode_plus argument in detail
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,
            max_length=max_length, # truncates if the token count exceeds max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            padding='max_length', # replaces the deprecated pad_to_max_length=True; pads on the right by default
            truncation=True
        )

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict['attention_mask'])

        features.append(
            InputFeatures(
                input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label
            )
        )

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )


DATA_COLUMN = 'DATA_COLUMN'
LABEL_COLUMN = 'LABEL_COLUMN'
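
To make the padding and truncation step concrete, here is a toy sketch in plain Python of what fixing every sequence to `max_length` produces. The helper `pad_or_truncate` is hypothetical and is not part of `transformers`; the IDs 101, 102 and 0 mirror the `[CLS]`, `[SEP]` and `[PAD]` IDs of `bert-base-uncased`, and the other token IDs are only illustrative. Unlike the real tokenizer, this simplified version does not re-append the closing special token after truncating.

```python
# Toy illustration of padding/truncating token IDs to a fixed length.
# pad_or_truncate is a hypothetical helper, not a transformers function.
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]       # truncate on the right
    mask = [1] * len(ids)              # 1 marks a real token
    padding = max_length - len(ids)
    ids = ids + [pad_id] * padding     # pad on the right
    mask = mask + [0] * padding        # 0 marks padding the model should ignore
    return ids, mask

short = [101, 7592, 2088, 102]         # illustrative IDs for a two-word input
ids, mask = pad_or_truncate(short, max_length=6)
print(ids)   # [101, 7592, 2088, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The attention mask is what lets the model tell real tokens from padding once every example has the same length.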

In [18]:
train_InputExamples, validation_InputExamples = convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN)

train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)
train_data = train_data.shuffle(100).batch(32).repeat(2)

validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)
validation_data = validation_data.batch(32)
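
One consequence of `.repeat(2)` is worth spelling out: combined with `epochs=2` in the `model.fit` call, the model passes over the training data four times, not two. A small arithmetic sketch, using the 20,000 training files reported earlier:

```python
import math

train_examples = 20_000   # "Using 20000 files for training."
batch_size = 32           # from .batch(32)
repeat_count = 2          # from .repeat(2)
epochs = 2                # from model.fit(..., epochs=2)

batches_per_pass = math.ceil(train_examples / batch_size)
steps_per_epoch = batches_per_pass * repeat_count   # what Keras counts as one epoch
total_steps = steps_per_epoch * epochs
passes_over_data = repeat_count * epochs

print(batches_per_pass, steps_per_epoch, total_steps, passes_over_data)
# 625 1250 2500 4
```

Dropping `.repeat(2)` or lowering `epochs` would halve the training time, at the cost of fewer passes over the data.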



In [19]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), 
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

model.fit(train_data, epochs=2, validation_data=validation_data)

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x1982dab0370>

In [20]:
pred_sentences = ['This was an awesome movie. I watch it twice my time watching this beautiful movie if I have known it was this good',
                  'One of the worst movies of all time. I cannot believe I wasted two hours of my life for this movie']

In [21]:
tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
labels = ['Negative','Positive']
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
for i in range(len(pred_sentences)):
  print(pred_sentences[i], ": \n", labels[label[i]])

This was an awesome movie. I watch it twice my time watching this beautiful movie if I have known it was this good : 
 Positive
One of the worst movies of all time. I cannot believe I wasted two hours of my life for this movie : 
 Negative
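
The prediction cell turns raw logits into probabilities with softmax, then picks the most likely class with argmax. The same two steps in plain Python look roughly like this; the logit values are hypothetical and are not outputs of the fine-tuned model.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

labels = ['Negative', 'Positive']
example_logits = [[-1.5, 2.0], [1.2, -0.7]]  # hypothetical logits for two sentences

predicted_labels = []
for logits in example_logits:
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    predicted_labels.append(labels[best])

print(predicted_labels)  # ['Positive', 'Negative']
```

Because softmax preserves the ordering of the logits, taking argmax directly on the logits would yield the same labels; the softmax step matters only when the probabilities themselves are needed.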


# Analyzing Trump's Tweets

In [4]:
%cd C:\Users\tenis\Desktop\Data_Projects\presidential_tweets_sentiment_analysis

df = pd.read_csv('./data/raw/update_trumps_tweets.csv')
df['created_at'] = pd.to_datetime(df['created_at'])
df.head()

len(df['text'])


C:\Users\tenis\Desktop\Data_Projects\presidential_tweets_sentiment_analysis


36573

In [40]:
test = df.head(1000)
trumps_tweets = test['text'].tolist()
type(trumps_tweets)
trumps_tweets

['VOTE! VOTE! VOTE!https://t.co/85ySh1KYkh',
 'RT @PastorDScott: We need to set all time records in voter turnout tomorrow for President @realDonaldTrump ! VOTE Donald Trump for Presiden…',
 'RT @PastorDScott: VOTE TRUMP!!!!!!',
 'Thank you Matt! https://t.co/hWiyWpvf8o',
 'RT @GOP: “Let’s Make America Great Again and re-elect our fantastic president!” -@GOPChairwoman https://t.co/nfrSa5b44g',
 'Thank you Paris. Keep up the GREAT work! https://t.co/jPT046qOTU',
 'RT @camakridis: My new analysis in @TheHillOpinion with @jonjakubowski "Don\'t believe the polls — Trump is winning." #TrumpIsLosing #Trump…',
 'To all of our supporters: thank you from the bottom of my heart. You have been there from the beginning and I will never let you down. Your hopes are my hopes your dreams are my dreams and your future is what I am fighting for every single day! https://t.co/gsFSghkmdM https://t.co/fLek4keQ1t',
 'Thank you Brad! https://t.co/Rdcp9D76Ol',
 'https://t.co/hLuyRy0WNU',
 'https://t.co/tyZW5

In [41]:
tf_batch = tokenizer(trumps_tweets, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
labels = ['Negative','Positive']
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
label

array([1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0,
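
A raw array of zeros and ones is hard to read at a glance. One way to summarize it is to count each label; the sketch below does so for only the first row of the array printed above (22 predictions), so the resulting share describes that row alone, not the full 1,000 tweets.

```python
from collections import Counter

labels = ['Negative', 'Positive']
# First row of the printed prediction array
first_row = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]

counts = Counter(labels[i] for i in first_row)
positive_share = counts['Positive'] / len(first_row)

print(counts['Positive'], counts['Negative'])  # 20 2
print(f"{positive_share:.1%}")                 # 90.9%
```

In the notebook itself, the same count could be taken over the full `label` array.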

### Immigration/Hispanic Specific 

* [Polls on immigration attitudes](https://news.gallup.com/poll/1660/immigration.aspx)

In [46]:
# Filter tweets that mention immigration, Hispanics, or Latinos
hispanic = df.copy()  # copy so that lowercasing does not modify df itself
hispanic['text'] = hispanic['text'].str.lower()
selected_words = ['mexico', 'mexican','mexicans', 'immigrants', 'immigration', 'deportation',
                  'deport', 'latino', 'puerto rico', 'puerto rican', 'puerto ricans', 'cuba', 'cuban',
                  'cubans', 'guatemala','guatemalan', 'guatemalans', 'el salvador', 'salvadoran', 'salvadorans',
                  'honduras', 'honduran', 'hondurans','hispanics', 'hispanic']
# na=False treats missing text as "no match" instead of producing NaN in the mask
hispanic = hispanic[hispanic.text.str.contains('|'.join(selected_words), na=False)]

# Add the BERT sentiment predictions
hispanic_tweets = hispanic['text'].tolist()
tf_batch = tokenizer(hispanic_tweets, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
labels = ['Negative','Positive']
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
label_df=pd.DataFrame(label, columns=['Sentiment']) 
hispanic_df = pd.concat([hispanic.reset_index(drop=True), label_df], axis=1)
hispanic_df

| | source | text | created_at | retweet_count | favorite_count | is_retweet | id_str | Sentiment |
|---|---|---|---|---|---|---|---|---|
| 0 | Twitter for iPhone | my #americandreamplan is a promise to hispanic... | 2020-11-02 22:23:00 | 14009.0 | 72299.0 | False | 1.323390e+18 | 1 |
| 1 | Twitter for iPhone | ...as i said at the debate – “will you remembe... | 2020-11-02 06:51:00 | 17152.0 | 96571.0 | False | 1.323160e+18 | 1 |
| 2 | Twitter for iPhone | for 47 years sleepy joe biden betrayed hispani... | 2020-11-02 06:03:00 | 13669.0 | 64904.0 | False | 1.323140e+18 | 1 |
| 3 | Twitter for iPhone | rt @cortessteve: hispanics rally to pres trump... | 2020-11-01 18:27:00 | 9174.0 | 0.0 | True | 1.322970e+18 | 0 |
| 4 | Twitter for iPhone | when i originally became your all time favorit... | 2020-11-01 11:49:00 | 23079.0 | 126097.0 | False | 1.322870e+18 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 896 | Twitter Web Client | the oscars were a great night for mexico &amp;... | 2015-02-24 14:53:00 | 1184.0 | 637.0 | False | 5.702350e+17 | 1 |
| 897 | Twitter Web Client | via @foxnewslatino by @geraldorivera: “@appren... | 2015-02-10 19:33:00 | 12.0 | 27.0 | False | 5.652320e+17 | 1 |
| 898 | Twitter Web Client | via @latinovoices by @caritojuliette: “meet th... | 2015-01-21 18:26:00 | 16.0 | 34.0 | False | 5.579680e+17 | 1 |
| 899 | Twitter for Android | @aquila7: @realdonaldtrump geraldo rivera the ... | 2015-01-06 02:05:00 | 31.0 | 65.0 | False | 5.522850e+17 | 1 |
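
The keyword filter joins the words with `|` and matches them as substrings, so a word such as `cuba` also matches inside unrelated words. The sketch below compares substring matching with whole-word matching using Python's `re` module. The keyword list is shortened and the sample texts are hypothetical, not taken from the tweet data.

```python
import re

# Shortened keyword list for illustration; the sample texts are hypothetical.
selected_words = ['mexico', 'cuba', 'deport', 'puerto rico']
alternatives = '|'.join(map(re.escape, selected_words))

substring_pattern = re.compile(alternatives)
# \b anchors each keyword at word boundaries; (?:...) is a non-capturing group
whole_word_pattern = re.compile(r'\b(?:' + alternatives + r')\b')

samples = [
    'we will work with mexico on trade',
    'the new business incubator opens today',  # "cuba" hides inside "incubator"
    'great visit to puerto rico',
]

substring_hits = [bool(substring_pattern.search(t)) for t in samples]
whole_word_hits = [bool(whole_word_pattern.search(t)) for t in samples]

print(substring_hits)   # [True, True, True]
print(whole_word_hits)  # [True, False, True]
```

Whole-word matching trades false positives for misses: `deport` would no longer match "deported", so inflected forms would need to be listed explicitly. The same pattern string can be passed to `Series.str.contains` in pandas.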


In [50]:
hispanic_df.to_csv('hispanic_sentiment.csv', index = False)

### China Specific 

In [47]:
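china = df.copy()  # copy so that lowercasing does not modify df itself
china['text'] = china['text'].str.lower()
# na=False treats missing text as "no match" instead of producing NaN in the mask
china = china[china.text.str.contains('china', na=False)]


# Add the BERT sentiment predictions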
china_tweets = china['text'].tolist()
tf_batch = tokenizer(china_tweets, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
labels = ['Negative','Positive']
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
label_df=pd.DataFrame(label, columns=['Sentiment']) 
china_df = pd.concat([china.reset_index(drop=True), label_df], axis=1)
china_df

| | source | text | created_at | retweet_count | favorite_count | is_retweet | id_str | Sentiment |
|---|---|---|---|---|---|---|---|---|
| 0 | Twitter for iPhone | a vote for me and the republican party is a vo... | 2020-11-03 03:37:00 | 18798.0 | 97639.0 | False | 1.323470e+18 | 1 |
| 1 | Twitter for iPhone | a vote for joe biden is a vote to extinguish a... | 2020-11-03 00:40:00 | 23044.0 | 112932.0 | False | 1.323420e+18 | 1 |
| 2 | Twitter for iPhone | ...he was a cheerleader for nafta and china’s ... | 2020-11-02 21:24:00 | 10675.0 | 51902.0 | False | 1.323380e+18 | 1 |
| 3 | Twitter for iPhone | i gave maine everything that obama/biden took ... | 2020-11-02 19:29:00 | 19401.0 | 103644.0 | False | 1.323350e+18 | 1 |
| 4 | Twitter for iPhone | biden can never negotiate with china. they wou... | 2020-11-02 18:26:00 | 26984.0 | 167455.0 | False | 1.323330e+18 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 635 | Twitter Web Client | china has a backdoor into the trans-pacific pa... | 2015-04-22 21:01:00 | 121.0 | 163.0 | False | 5.909840e+17 | 0 |
| 636 | Twitter for Android | @stephenfhayes: trump: i have made a fortune a... | 2015-04-19 02:29:00 | 24.0 | 49.0 | False | 5.896170e+17 | 0 |
| 637 | Twitter for Android | @josh_millard16: i'm loving everything donaldt... | NaT | | | | | 1 |
| 638 | Twitter Web Client | my @wmur9 commitment 2016 conversation with @j... | 2015-03-30 20:24:00 | 15.0 | 37.0 | False | 5.826400e+17 | 1 |
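
The cells above pass every filtered tweet to the model in a single call, which can exhaust GPU memory on larger selections. A minimal sketch of running predictions in fixed-size chunks is shown below. `chunked`, `predict_in_batches` and `dummy_predict` are hypothetical names; `dummy_predict` merely stands in for the tokenizer, model and argmax steps so that the example runs without a GPU.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def predict_in_batches(texts, predict_fn, batch_size=32):
    """Apply predict_fn to each chunk and collect the labels in order."""
    predictions = []
    for batch in chunked(texts, batch_size):
        predictions.extend(predict_fn(batch))
    return predictions

# Hypothetical stand-in for "tokenize -> model -> softmax -> argmax"
def dummy_predict(batch):
    return [1 if 'great' in text else 0 for text in batch]

texts = ['china is great', 'bad deal with china', 'great trade talks']
result = predict_in_batches(texts, dummy_predict, batch_size=2)
print(result)  # [1, 0, 1]
```

In the notebook, `predict_fn` would wrap the existing tokenizer and model calls, and the returned list could be placed in the `Sentiment` column as before.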


In [51]:
china_df.to_csv('china_sentiment.csv', index = False)