<a href="https://colab.research.google.com/github/caseyhyoon/W266-Final-Project/blob/casey/bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [110]:
import os
import re
import string

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

In [81]:
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


In [82]:
os.listdir()

['.config',
 'twitter_parsed_dataset.csv',
 'twitter_sexism_parsed_dataset.csv',
 'twitter_racism_parsed_dataset.csv',
 'sample_data']

In [83]:
parsed = pd.read_csv('twitter_parsed_dataset.csv')
racism = pd.read_csv('twitter_racism_parsed_dataset.csv')
sexism = pd.read_csv('twitter_sexism_parsed_dataset.csv')

twitter_data = pd.concat([parsed, racism, sexism]).dropna()
twitter_data.head()

Unnamed: 0,index,id,Text,Annotation,oh_label
0,5.74948705591165e+17,5.74948705591165e+17,@halalflaws @biebervalue @greenlinerzjm I read...,none,0.0
1,5.71917888690393e+17,5.71917888690393e+17,@ShreyaBafna3 Now you idiots claim that people...,none,0.0
2,3.90255841338601e+17,3.90255841338601e+17,"RT @Mooseoftorment Call me sexist, but when I ...",sexism,1.0
3,5.68208850655916e+17,5.68208850655916e+17,"@g0ssipsquirrelx Wrong, ISIS follows the examp...",racism,1.0
4,5.75596338802373e+17,5.75596338802373e+17,#mkr No No No No No No,none,0.0


In [84]:
### Cleaning tweets

def cleaning_tweets(tweet):
    """Strip handles, URLs, punctuation and digits from a tweet, then lowercase it."""
    # 1. Remove Twitter handles (@user) in a single pass
    tweet = re.sub(r'@\w*', '', tweet)

    # 2. Remove URLs
    tweet = re.sub(r'http\S+', '', tweet)

    # 3. Remove punctuation, numbers, and special characters.
    #    Hashtag words are kept, but the '#' symbol itself is stripped
    #    because it is part of string.punctuation.
    tweet = tweet.replace(".", " ").replace(",", " ").replace("?", " ").replace("!", " ")
    tweet = "".join([char for char in tweet if char not in string.punctuation])
    tweet = re.sub(r'[0-9]+', '', tweet)

    # 4. Lowercase everything
    return tweet.lower()

twitter_data['cleaned_tweets'] = twitter_data['Text'].apply(cleaning_tweets)
twitter_data.head()

Unnamed: 0,index,id,Text,Annotation,oh_label,cleaned_tweets
0,5.74948705591165e+17,5.74948705591165e+17,@halalflaws @biebervalue @greenlinerzjm I read...,none,0.0,i read them in context no change in meaning...
1,5.71917888690393e+17,5.71917888690393e+17,@ShreyaBafna3 Now you idiots claim that people...,none,0.0,now you idiots claim that people who tried to...
2,3.90255841338601e+17,3.90255841338601e+17,"RT @Mooseoftorment Call me sexist, but when I ...",sexism,1.0,rt call me sexist but when i go to an auto p...
3,5.68208850655916e+17,5.68208850655916e+17,"@g0ssipsquirrelx Wrong, ISIS follows the examp...",racism,1.0,wrong isis follows the example of mohammed a...
4,5.75596338802373e+17,5.75596338802373e+17,#mkr No No No No No No,none,0.0,mkr no no no no no no
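
As a quick check of what the cleaning steps do to a single tweet, here is a minimal, self-contained sketch. The name `clean_tweet` and the second sample string are illustrative assumptions; the logic is meant to mirror the four steps above, not replace the notebook's function.

```python
import re
import string

def clean_tweet(tweet):
    """Condensed restatement of the four cleaning steps (illustrative name)."""
    tweet = re.sub(r'@\w*', '', tweet)        # 1. drop @handles
    tweet = re.sub(r'http\S+', '', tweet)     # 2. drop URLs
    for ch in '.,?!':                         # 3a. sentence punctuation becomes a space
        tweet = tweet.replace(ch, ' ')
    # 3b. remaining punctuation (including '#') is deleted outright
    tweet = ''.join(c for c in tweet if c not in string.punctuation)
    tweet = re.sub(r'[0-9]+', '', tweet)      # 3c. drop digits
    return tweet.lower()                      # 4. lowercase

print(clean_tweet("#mkr No No No No No No"))  # mkr no no no no no no

# Made-up sample tweet; split() hides the leftover runs of spaces
print(clean_tweet("@user Hello, world! http://t.co/x 123").split())  # ['hello', 'world']
```

Note that the hashtag word survives but the `#` character does not, so a hashtag and the same plain word become indistinguishable after cleaning.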


In [109]:
X = twitter_data['cleaned_tweets']
y = np.array(twitter_data['oh_label'])

sum(y)/len(y)

0.23660862446622563

The dataset is imbalanced: only about 24% of the tweets are labeled racist or sexist (oh_label = 1). We under-sample the majority class so that training sees both classes equally often.

In [122]:
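good = twitter_data[twitter_data['oh_label'] == 0]
racist_sexist = twitter_data[twitter_data['oh_label'] == 1]

# Keep as many negative tweets as there are positive ones (fixed seed for reproducibility)
good_undersample = good.sample(n=len(racist_sexist), random_state=1)
balanced_data = pd.concat([good_undersample, racist_sexist], axis=0)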

In [159]:
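# Full, imbalanced data. Execution count 159 > 123: this cell ran after the balanced
# split below and overwrote X_train, X_test, y_train and y_test.
X_train, X_test, y_train, y_test = train_test_split(twitter_data['cleaned_tweets'], twitter_data['oh_label'], test_size = 0.2, random_state=1)


In [123]:
# Under-sampled, balanced data
X_train, X_test, y_train, y_test = train_test_split(balanced_data['cleaned_tweets'], balanced_data['oh_label'], test_size = 0.2, random_state=1)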

X_train

4711     as expected  when the terrorist group hamas wo...
10126                                                here 
4797      in all seriousness  ive been dying my hair si...
13621    rt   just so i’m clear you have dogs named leo...
8004       kat you are the biggest mole   i hope you ch...
                               ...                        
805                       kat is the daughter of satan mkr
8717      wrong  apostacy is the equivalent of leaving ...
10392     its obvious why the former president of the n...
4754      can you explain the wage gap   what does the ...
8440                kat and andre are the fuckin devil mkr
Name: cleaned_tweets, Length: 17110, dtype: object

In [160]:
%pip install transformers



In [161]:
from transformers import BertTokenizer, TFBertModel

In [162]:
bert_layer = TFBertModel.from_pretrained('bert-base-uncased')

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [163]:
# Each input is a sequence of 49 token positions (the padded length from the tokenizer)
input_ids = tf.keras.layers.Input(shape=(49,), dtype='int32', name='input_ids')
masks = tf.keras.layers.Input(shape=(49,), dtype='int32', name='mask')
token_type_ids = tf.keras.layers.Input(shape=(49,), dtype='int32', name='token_types')

bert_output = bert_layer([input_ids, masks, token_type_ids])

# Hidden state of the first token ([CLS]) serves as the sentence representation
cls = bert_output[0][:, 0, :]

hidden = tf.keras.layers.Dense(200, activation='relu')(cls)

classification = tf.keras.layers.Dense(1, activation='sigmoid')(hidden)

model = tf.keras.Model(inputs = [input_ids, masks, token_type_ids], outputs = classification)
# Note: 0.01 is far above the 2e-5 to 5e-5 range commonly used to fine-tune BERT
model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              metrics=['acc'])
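
The `[:, 0, :]` slice in the cell above can be easier to follow on a plain array. The sketch below uses a random NumPy array as a stand-in for `bert_output[0]`; the batch size of 2 is an arbitrary assumption, and 768 is the hidden size of `bert-base-uncased`.

```python
import numpy as np

batch_size, seq_len, hidden_size = 2, 49, 768

# Stand-in for bert_output[0]: one hidden vector per token position
sequence_output = np.random.rand(batch_size, seq_len, hidden_size)

# Keep every example, take position 0 (the [CLS] token), keep all hidden units
cls_vectors = sequence_output[:, 0, :]
print(cls_vectors.shape)  # (2, 768)
```

The dense layers then operate on one 768-dimensional vector per tweet instead of on all 49 token positions.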



In [164]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

bert_inputs = tokenizer(list(X_train), padding=True, return_tensors='tf')

In [165]:
bert_inputs

{'input_ids': <tf.Tensor: shape=(36157, 49), dtype=int32, numpy=
array([[  101,  2059,  3844, ...,     0,     0,     0],
       [  101,  1045,  2113, ...,     0,     0,     0],
       [  101, 19387,  2023, ...,     0,     0,     0],
       ...,
       [  101,  4921,  2063, ...,     0,     0,     0],
       [  101,  3507, 28407, ...,     0,     0,     0],
       [  101, 10645,  9326, ...,     0,     0,     0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(36157, 49), dtype=int32, numpy=
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(36157, 49), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]], dtype=in

In [149]:
model.fit(x = [np.array(bert_inputs['input_ids']), np.array(bert_inputs['attention_mask']), np.array(bert_inputs['token_type_ids'])],
          y = y_train,
          epochs = 2,
          batch_size = 128)


Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7fa0fa43cd10>

In [166]:
bert_test_inputs = tokenizer(list(X_test), padding=True, return_tensors='tf')
bert_test_inputs

{'input_ids': <tf.Tensor: shape=(9040, 49), dtype=int32, numpy=
array([[  101,   100,   102, ...,     0,     0,     0],
       [  101, 14145,  2080, ...,     0,     0,     0],
       [  101, 20228,  2480, ...,     0,     0,     0],
       ...,
       [  101,  2053,  2655, ...,     0,     0,     0],
       [  101,  2175, 29247, ...,     0,     0,     0],
       [  101,  2129,  2116, ...,     0,     0,     0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(9040, 49), dtype=int32, numpy=
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(9040, 49), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]], dtype=int32
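
The row counts printed in this notebook can be used to check which split actually reached the tokenizer. The sketch below is an arithmetic check only. It assumes that `train_test_split` rounds the test size up and the train size down, and it infers the total row count as the sum of the two tokenized shapes. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size=0.2):
    """Train/test sizes, assuming the test size is rounded up and the train size down."""
    n_test = math.ceil(test_size * n_samples)
    n_train = math.floor((1 - test_size) * n_samples)
    return n_train, n_test

# Numbers reported in the notebook outputs
tokenized_train_rows = 36157
tokenized_test_rows = 9040
positive_rate = 0.23660862446622563

total_rows = tokenized_train_rows + tokenized_test_rows  # inferred, not printed anywhere
n_positive = round(positive_rate * total_rows)
balanced_rows = 2 * n_positive  # under-sampling keeps one negative per positive

print(split_sizes(total_rows))     # (36157, 9040)
print(split_sizes(balanced_rows))  # (17110, 4278)
```

Under these assumptions, the tokenized shapes (36157 and 9040) match a split of the full data, while the `Length: 17110` shown for `X_train` matches the balanced split. That is consistent with the tokenizer having received the imbalanced split.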

In [133]:
y_pred = model.predict([np.array(bert_test_inputs['input_ids']), np.array(bert_test_inputs['attention_mask']), np.array(bert_test_inputs['token_type_ids'])])
y_pred

ValueError: ignored

In [71]:
np.mean(y_test)

0.2390486725663717

In [72]:
np.unique(y_pred)

array([0.22709993], dtype=float32)

3044     0.0
898      0.0
11195    0.0
6434     1.0
15173    1.0
        ... 
313      0.0
12508    0.0
13983    0.0
5964     0.0
1363     1.0
Name: oh_label, Length: 9040, dtype: float64

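BERT appears to be predicting a single class: every test prediction is the same value, 0.227, which falls below a 0.5 threshold, so every tweet is labeled 0. Why is that? About 24% of the labels are positive, and 0.227 is close to that rate, which suggests the network has collapsed to outputting the label prior. Two likely contributors: the tokenized training set has 36,157 rows, which matches the split of the full, imbalanced data rather than the balanced split (17,110 rows), and the Adam learning rate of 0.01 is far above the 2e-5 to 5e-5 range commonly used to fine-tune BERT.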
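
One way to read a single repeated prediction near 0.23: if a model ignores its input and emits the same probability for every tweet, binary cross-entropy is minimized when that probability equals the fraction of positive labels. The sketch below checks this numerically with a grid search. It illustrates the loss itself, under the assumption of a constant predictor, and is not a diagnosis of this particular training run. `constant_bce` is a hypothetical helper.

```python
import math

def constant_bce(q, p):
    """Average binary cross-entropy when every example gets the same prediction q
    and a fraction p of the labels is 1."""
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

p = 0.2366  # positive rate reported earlier in the notebook, rounded

# Try constant predictions 0.001, 0.002, ..., 0.999 and keep the one with the lowest loss
candidates = [i / 1000 for i in range(1, 1000)]
best_q = min(candidates, key=lambda q: constant_bce(q, p))
print(best_q)

# Accuracy obtained by always predicting class 0 (any constant below 0.5)
print(1 - p)
```

The best constant lands next to the positive rate, and such a predictor still reaches roughly 76% accuracy on data with this class mix, so accuracy alone would not reveal the collapse.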