# Exploring Bias in Tweets by Members of Congress

---

### Contributors: Alex Shropshire, Mando Iwanaga

### Goal: Use transfer learning (a pre-trained BERT model) to classify tweet text from politicians as having a Democratic bias, a Republican bias, or neutrality.

### Process

**1. Business Understanding  
2. Understand the data  
3. Prepare the data for analysis and modeling  
4. Model  
5. Evaluate Results  
6. Deploy**

---

**Import necessary libraries**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

#Text Cleaning
import re
import string

#Transfer Learning Model (BERT)
from keras import Model
from keras.layers import Lambda, Dense
from keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from batch_generator.batch_generator import BatchGenerator
from load_pretrained_bert import load_google_bert


#Model Evaluation
from sklearn.metrics import accuracy_score,f1_score, precision_score, recall_score



Using TensorFlow backend.


2.2.4


---

## Data Preparation

**We'll begin by loading our dataset, retrieved from [figure-eight](https://www.figure-eight.com/data-for-everyone/), an open-source data platform.**

In [2]:
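#Load the dataset and keep only the necessary columns
#(.copy() creates an independent DataFrame, so later column assignments do not trigger SettingWithCopyWarning)
raw_df = pd.read_csv('Political-media-DFE.csv',encoding='latin')
df = raw_df[['bias','message','embed','label','source','text']].copy()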
df.head()

| | bias | message | embed | label | source | text |
|---|---|---|---|---|---|---|
| 0 | partisan | policy | `<blockquote class="twitter-tweet" width="450">...` | From: Trey Radel (Representative from Florida) | twitter | RT @nowthisnews: Rep. Trey Radel (R- #FL) slam... |
| 1 | partisan | attack | `<blockquote class="twitter-tweet" width="450">...` | From: Mitch McConnell (Senator from Kentucky) | twitter | VIDEO - #Obamacare: Full of Higher Costs and ... |
| 2 | neutral | support | `<blockquote class="twitter-tweet" width="450">...` | From: Kurt Schrader (Representative from Oregon) | twitter | Please join me today in remembering our fallen... |
| 3 | neutral | policy | `<blockquote class="twitter-tweet" width="450">...` | From: Michael Crapo (Senator from Idaho) | twitter | RT @SenatorLeahy: 1st step toward Senate debat... |
| 4 | partisan | policy | `<blockquote class="twitter-tweet" width="450">...` | From: Mark Udall (Senator from Colorado) | twitter | .@amazon delivery #drones show need to update ... |


In [3]:
df['bias'].value_counts()

neutral     3689
partisan    1311
Name: bias, dtype: int64

**The dataset does not specify political affiliation, so we will need to add each politician's affiliation.**

In [4]:
#Make a function to strip punctuation from text
def remove_punctuations(text):
    '''Removes punctuation from strings'''
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text
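The loop above makes one pass over the string per punctuation character. A minimal alternative sketch, assuming the same goal of dropping every character in `string.punctuation`, builds a translation table once with `str.maketrans` and deletes them all in a single pass. The name `remove_punctuations_fast` is hypothetical and not part of the original notebook.

```python
import string

# Translation table that maps every ASCII punctuation character to None (delete it).
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuations_fast(text):
    """Remove ASCII punctuation in a single pass (hypothetical helper)."""
    return text.translate(_PUNCT_TABLE)

sample = "RT @nowthisnews: Rep. Trey Radel (R- #FL)"
cleaned = remove_punctuations_fast(sample)
print(cleaned)
```

Both versions only touch ASCII punctuation, so typographic quotes or dashes in tweets would survive either one.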

In [5]:
#Apply our remove_punctuations function
df['text'] = df.loc[:,'text'].apply(remove_punctuations)
df['label'] = df['label'].str.replace('From: ','')
df['text'] = df['text'].str.lower()



In [6]:
#Create a new column with the tweet purpose and bias
df['purpose_and_bias'] = df['message'] + '_' + df['bias']




In [7]:
df.head()

| | bias | message | embed | label | source | text | purpose_and_bias |
|---|---|---|---|---|---|---|---|
| 0 | partisan | policy | `<blockquote class="twitter-tweet" width="450">...` | Trey Radel (Representative from Florida) | twitter | rt nowthisnews rep trey radel r fl slams obama... | policy_partisan |
| 1 | partisan | attack | `<blockquote class="twitter-tweet" width="450">...` | Mitch McConnell (Senator from Kentucky) | twitter | video obamacare full of higher costs and bro... | attack_partisan |
| 2 | neutral | support | `<blockquote class="twitter-tweet" width="450">...` | Kurt Schrader (Representative from Oregon) | twitter | please join me today in remembering our fallen... | support_neutral |
| 3 | neutral | policy | `<blockquote class="twitter-tweet" width="450">...` | Michael Crapo (Senator from Idaho) | twitter | rt senatorleahy 1st step toward senate debate ... | policy_neutral |
| 4 | partisan | policy | `<blockquote class="twitter-tweet" width="450">...` | Mark Udall (Senator from Colorado) | twitter | amazon delivery drones show need to update law... | policy_partisan |


In [8]:
#Load dataset of politicians and their affiliations
congressmen_df = pd.read_csv('congressmen_2015.csv')
congressmen_df.head()

| | First | Last | congressman | affiliation |
|---|---|---|---|---|
| 0 | Gregorio | Sablan | Gregorio Sablan (Representative from NA) | d |
| 1 | Robert | Aderholt | Robert Aderholt (Representative from Alabama) | r |
| 2 | Lamar | Alexander | Lamar Alexander (Senator from Tennessee) | r |
| 3 | Justin | Amash | Justin Amash (Representative from Michigan) | r |
| 4 | Mark | Amodei | Mark Amodei (Representative from Nevada) | r |


**Join the two datasets to add each politician's affiliation and create our target variable**

In [9]:
#Partisan tweets are labeled with the politician's affiliation in the target column
#Neutral tweets are labeled as neutral in the target column
df = df.merge(congressmen_df, how='left',left_on='label',right_on='congressman')
df.loc[df.bias == 'partisan', 'target'] = df['affiliation']
df.loc[df.bias == 'neutral', 'target'] = df['bias']
df.dropna(axis=0,inplace=True)
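Because `dropna` discards rows silently, it can help to see how many tweets failed to match a politician before dropping them. The sketch below uses two small made-up frames (names and values are illustrative only) and the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column marking each row as `both` or `left_only`.

```python
import pandas as pd

# Toy stand-ins for the tweet table and the affiliation table (illustrative values only).
tweets = pd.DataFrame({
    'label': ['Ann Lee (Senator from Ohio)',
              'Bob Ray (Representative from Texas)',
              'Cy Fox (Senator from Utah)'],
    'bias': ['partisan', 'neutral', 'partisan'],
})
members = pd.DataFrame({
    'congressman': ['Ann Lee (Senator from Ohio)',
                    'Bob Ray (Representative from Texas)'],
    'affiliation': ['d', 'r'],
})

merged = tweets.merge(members, how='left', left_on='label',
                      right_on='congressman', indicator=True)

# Rows marked 'left_only' had no matching politician and would be removed by dropna.
unmatched = merged[merged['_merge'] == 'left_only']
print(len(unmatched), 'unmatched row(s):', unmatched['label'].tolist())
```

Inspecting the unmatched labels first makes it easier to tell genuine gaps in the affiliation file from small spelling differences between the two sources.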

In [10]:
#Drop rows whose target label is "i" (presumably independents), leaving three classes
df = df[df['target'] != 'i']

In [11]:
df.head()

| | bias | message | embed | label | source | text | purpose_and_bias | First | Last | congressman | affiliation | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | partisan | policy | `<blockquote class="twitter-tweet" width="450">...` | Trey Radel (Representative from Florida) | twitter | rt nowthisnews rep trey radel r fl slams obama... | policy_partisan | Trey | Radel | Trey Radel (Representative from Florida) | r | r |
| 1 | partisan | attack | `<blockquote class="twitter-tweet" width="450">...` | Mitch McConnell (Senator from Kentucky) | twitter | video obamacare full of higher costs and bro... | attack_partisan | Mitch | McConnell | Mitch McConnell (Senator from Kentucky) | r | r |
| 2 | neutral | support | `<blockquote class="twitter-tweet" width="450">...` | Kurt Schrader (Representative from Oregon) | twitter | please join me today in remembering our fallen... | support_neutral | Kurt | Schrader | Kurt Schrader (Representative from Oregon) | d | neutral |
| 3 | neutral | policy | `<blockquote class="twitter-tweet" width="450">...` | Michael Crapo (Senator from Idaho) | twitter | rt senatorleahy 1st step toward senate debate ... | policy_neutral | Michael | Crapo | Michael Crapo (Senator from Idaho) | r | neutral |
| 4 | partisan | policy | `<blockquote class="twitter-tweet" width="450">...` | Mark Udall (Senator from Colorado) | twitter | amazon delivery drones show need to update law... | policy_partisan | Mark | Udall | Mark Udall (Senator from Colorado) | d | d |


In [12]:
#We have 3 target classes
df['target'].value_counts()

neutral    3631
r           791
d           490
Name: target, dtype: int64
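These counts show a strong class imbalance, which matters when reading accuracy figures. As a rough reference point (computed on the full dataset rather than on a test split, so only approximate), the snippet below derives each class's share and the accuracy of a trivial classifier that always predicts the most common class.

```python
# Class counts copied from the value_counts() output above.
counts = {'neutral': 3631, 'r': 791, 'd': 490}
total = sum(counts.values())

# Share of each class in the dataset.
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print('{:8s} {:.3f}'.format(label, share))

# Accuracy of a trivial model that always predicts the most frequent class.
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total
print('majority baseline:', round(majority_baseline, 3))
```

Any model accuracy should be judged against this baseline rather than against zero.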

---

## Text Cleaning

**We'll create functions to clean our text**

In [13]:
def replace_contraction(text):
    """Expand common contractions in text"""
    #Specific patterns come first so the generic ones do not pre-empt them;
    #replacements are raw strings so that \g<1> is not treated as an invalid escape
    contraction_patterns = [ (r'won\'t', r'will not'), (r'can\'t', r'can not'), (r'i\'m', r'i am'), (r'ain\'t', r'is not'), (r'(\w+)\'ll', r'\g<1> will'), (r'(\w+)n\'t', r'\g<1> not'),
                         (r'(\w+)\'ve', r'\g<1> have'), (r'(\w+)\'s', r'\g<1> is'), (r'(\w+)\'re', r'\g<1> are'), (r'(\w+)\'d', r'\g<1> would'), (r'&', r'and'), (r'dammit', r'damn it'), (r'dont', r'do not'), (r'wont', r'will not') ]
    patterns = [(re.compile(regex), repl) for (regex, repl) in contraction_patterns]
    for (pattern, repl) in patterns:
        text = pattern.sub(repl, text)
    return text
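The order of the patterns matters: specific entries such as `won't` have to run before the generic `n't` rule, otherwise the generic rule fires first and leaves a fragment behind. A small sketch of that effect, using a hypothetical helper `expand` and only two of the patterns:

```python
import re

def expand(text, patterns):
    """Apply (regex, replacement) pairs in order (hypothetical helper)."""
    for regex, repl in patterns:
        text = re.sub(regex, repl, text)
    return text

specific = (r"won't", "will not")
generic = (r"(\w+)n't", r"\g<1> not")

# Specific rule first: "won't" is handled before the generic rule sees it.
good = expand("we won't stop", [specific, generic])
# Generic rule first: "wo" + "n't" is rewritten to "wo not".
bad = expand("we won't stop", [generic, specific])
print(good)
print(bad)
```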

In [14]:
def replace_links(text, filler=' '):
    """Replace url links included in text"""
    text = re.sub(r'((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*',
                      filler, text).strip()
    return text
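A quick check of this pattern on made-up tweets shows two things: a URL that still has its punctuation is replaced, while a URL whose dots and slashes have already been stripped no longer matches at all. The sample strings are invented for illustration; the regular expression is the one from the cell above.

```python
import re

URL_PATTERN = re.compile(
    r'((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*'
)

def replace_links(text, filler=' '):
    """Replace URL-like tokens with a filler string."""
    return URL_PATTERN.sub(filler, text).strip()

# URL with punctuation intact: replaced by the filler.
with_punctuation = replace_links('read this https://t.co/abc123 now', 'link')
# URL after punctuation removal: contains no dot, so the pattern cannot match.
without_punctuation = replace_links('read this httpstcoabc123 now', 'link')
print(with_punctuation)
print(without_punctuation)
```

This suggests that link replacement is only effective when it runs before any punctuation stripping.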

In [15]:
def remove_numbers(text):
    """Remove numbers from text"""
    text = ''.join([i for i in text if not i.isdigit()])
    return text

In [16]:
#Create a function to incorporate three functions above in one
def cleanText(text):
    """Incorporate three created functions above into one"""
    text = text.strip().replace("\n", " ").replace("\r", " ")
    text = replace_contraction(text)
    text = replace_links(text, "link")
    text = remove_numbers(text)
    text = re.sub(r'[,!@#$%^&*)(|/><";:.?\'\\}{]',"",text)
    text = text.lower()
    return text

**Encode the target classes as integers in a single column**

In [17]:
df.loc[df['target'] == 'neutral', 'target'] = 0
df.loc[df['target'] == 'r', 'target'] = 1
df.loc[df['target'] == 'd', 'target'] = 2
df.head()

| | bias | message | embed | label | source | text | purpose_and_bias | First | Last | congressman | affiliation | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | partisan | policy | `<blockquote class="twitter-tweet" width="450">...` | Trey Radel (Representative from Florida) | twitter | rt nowthisnews rep trey radel r fl slams obama... | policy_partisan | Trey | Radel | Trey Radel (Representative from Florida) | r | 1 |
| 1 | partisan | attack | `<blockquote class="twitter-tweet" width="450">...` | Mitch McConnell (Senator from Kentucky) | twitter | video obamacare full of higher costs and bro... | attack_partisan | Mitch | McConnell | Mitch McConnell (Senator from Kentucky) | r | 1 |
| 2 | neutral | support | `<blockquote class="twitter-tweet" width="450">...` | Kurt Schrader (Representative from Oregon) | twitter | please join me today in remembering our fallen... | support_neutral | Kurt | Schrader | Kurt Schrader (Representative from Oregon) | d | 0 |
| 3 | neutral | policy | `<blockquote class="twitter-tweet" width="450">...` | Michael Crapo (Senator from Idaho) | twitter | rt senatorleahy 1st step toward senate debate ... | policy_neutral | Michael | Crapo | Michael Crapo (Senator from Idaho) | r | 0 |
| 4 | partisan | policy | `<blockquote class="twitter-tweet" width="450">...` | Mark Udall (Senator from Colorado) | twitter | amazon delivery drones show need to update law... | policy_partisan | Mark | Udall | Mark Udall (Senator from Colorado) | d | 2 |
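The same encoding can be expressed as a single dictionary lookup with `Series.map`, which also yields an integer column directly. A minimal sketch on a toy Series (the mapping mirrors the cell above; the sample data is illustrative):

```python
import pandas as pd

# Same class-to-integer mapping as in the cell above.
TARGET_CODES = {'neutral': 0, 'r': 1, 'd': 2}

targets = pd.Series(['r', 'neutral', 'd', 'neutral'])
encoded = targets.map(TARGET_CODES)
print(encoded.tolist())
print(encoded.dtype)
```

Any value missing from the dictionary would become `NaN`, which makes unexpected labels easy to spot.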


---

## Incorporating BERT 

In [18]:
#!wget https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip

In [19]:
#!unzip multi_cased_L-12_H-768_A-12.zip

In [20]:
#Location of the pretrained BERT model and training hyperparameters
BERT_PRETRAINED_DIR = 'multi_cased_L-12_H-768_A-12/'
SEQ_LEN = 70
BATCH_SIZE = 12
LR = 1e-5

In [21]:
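# Define our X and Y variables and apply cleanText function
# Note: df['text'] was already stripped of punctuation and lowercased in cell 5, so URLs no longer
# match replace_links (see 'httpstcozvywmgyih' in the output below)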
X = df['text'].apply(cleanText).values
Y = df['target'].values
print(X[0])  
print(Y[0])  

rt nowthisnews rep trey radel r fl slams obamacare politics httpstcozvywmgyih
1


In [22]:
#Split our data into training and test sets using default parameters (75% train, 25% test)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
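With classes this unbalanced, a purely random split can leave the test set with a slightly different class mix than the training set. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps class proportions equal across both parts. A sketch on toy labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 80 samples of class 0 and 20 of class 1 (illustrative only).
X_toy = np.arange(100)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# Both parts keep the original 80/20 class ratio.
print(np.bincount(y_tr), np.bincount(y_te))
```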

In [23]:
#Sanity check
print(len(X_train))
print(len(X_test))
print(len(Y_train))
print(len(Y_test))

3684
1228
3684
1228


In [24]:
#Define batch generators
train_gen = BatchGenerator(X_train,
                           vocab_file=os.path.join(BERT_PRETRAINED_DIR, 'vocab.txt'),
                           seq_len=SEQ_LEN,
                           labels=Y_train,
                           do_lower_case=False,
                           batch_size=BATCH_SIZE)
valid_gen = BatchGenerator(X_test,
                           vocab_file=os.path.join(BERT_PRETRAINED_DIR, 'vocab.txt'),
                           seq_len=SEQ_LEN,
                           labels=Y_test,
                           do_lower_case=False,
                           batch_size=BATCH_SIZE)



In [25]:
#Load BERT pretrained model and print summary
g_bert = load_google_bert(base_location=BERT_PRETRAINED_DIR, use_attn_mask=False, max_len=SEQ_LEN)
g_bert.summary()


bert/embeddings/LayerNorm/beta  ->  layer_normalization_1/beta:0
bert/embeddings/LayerNorm/gamma  ->  layer_normalization_1/gamma:0
bert/embeddings/position_embeddings  ->  PositionEmbedding/embeddings:0
bert/embeddings/token_type_embeddings  ->  SegmentEmbedding/embeddings:0
bert/embeddings/word_embeddings  ->  TokenEmbedding/embeddings:0
bert/encoder/layer_0/attention/output/LayerNorm/beta  ->  layer_0/ln_1/beta:0
bert/encoder/layer_0/attention/output/LayerNorm/gamma  ->  layer_0/ln_1/gamma:0
bert/encoder/layer_0/attention/output/dense/bias  ->  layer_0/c_attn_proj/bias:0
bert/encoder/layer_0/attention/output/dense/kernel  ->  layer_0/c_attn_proj/kernel:0
bert/encoder/layer_0/attention/self/key/bias  ->  layer_0/c_attn/bias:0
bert/encoder/layer_0/attention/self/key/kernel  ->  layer_0/c_attn/kernel:0
bert/encoder/layer_0/attention/self/query/bias  ->  layer_0/c_attn/bias:0
bert/encoder/layer_0/attention/self/query/kernel  ->  layer_0/c_attn/kernel:0
bert/encoder/layer_0/attention/sel

bert/encoder/layer_5/attention/self/value/bias  ->  layer_5/c_attn/bias:0
bert/encoder/layer_5/attention/self/value/kernel  ->  layer_5/c_attn/kernel:0
bert/encoder/layer_5/intermediate/dense/bias  ->  layer_5/c_fc/bias:0
bert/encoder/layer_5/intermediate/dense/kernel  ->  layer_5/c_fc/kernel:0
bert/encoder/layer_5/output/LayerNorm/beta  ->  layer_5/ln_2/beta:0
bert/encoder/layer_5/output/LayerNorm/gamma  ->  layer_5/ln_2/gamma:0
bert/encoder/layer_5/output/dense/bias  ->  layer_5/c_ffn_proj/bias:0
bert/encoder/layer_5/output/dense/kernel  ->  layer_5/c_ffn_proj/kernel:0
bert/encoder/layer_6/attention/output/LayerNorm/beta  ->  layer_6/ln_1/beta:0
bert/encoder/layer_6/attention/output/LayerNorm/gamma  ->  layer_6/ln_1/gamma:0
bert/encoder/layer_6/attention/output/dense/bias  ->  layer_6/c_attn_proj/bias:0
bert/encoder/layer_6/attention/output/dense/kernel  ->  layer_6/c_attn_proj/kernel:0
bert/encoder/layer_6/attention/self/key/bias  ->  layer_6/c_attn/bias:0
bert/encoder/layer_6/atten

In [26]:
# Use the output at sequence position 0 (the [CLS] token) as the features for classification; see the BERT paper
# for further explanation of this choice.
classification_features = Lambda(lambda x: x[:, 0, :])(g_bert.output)
out = Dense(3, activation='softmax')(classification_features)
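To make the slicing concrete: the encoder output has shape `(batch, sequence length, hidden size)`, and `x[:, 0, :]` keeps only the vector at the first token position of every sequence. A NumPy sketch with small made-up dimensions:

```python
import numpy as np

# Small illustrative sizes, not the real sequence length and hidden size of the model.
batch, seq_len, hidden = 2, 5, 4
encoder_output = np.arange(batch * seq_len * hidden, dtype=float).reshape(batch, seq_len, hidden)

# Keep the first position of each sequence: (batch, seq_len, hidden) -> (batch, hidden)
first_token = encoder_output[:, 0, :]
print(first_token.shape)
```

The dense softmax layer then only sees one fixed-size vector per tweet, regardless of sequence length.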

In [27]:
#Define model, compile, and define parameters
model = Model(g_bert.inputs, out)
model.compile(optimizer=Adam(LR), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()




__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
segment_input (InputLayer)      (None, 70)           0                                            
__________________________________________________________________________________________________
position_input (InputLayer)     (None, 70)           0                                            
__________________________________________________________________________________________________
token_input (InputLayer)        (None, 70)           0                                            
__________________________________________________________________________________________________
SegmentEmbedding (Embedding)    (None, 70, 768)      1536        segment_input[0][0]              
__________________________________________________________________________________________________
PositionEm

**We will now fine-tune the pre-trained BERT model**

In [104]:
history_log = model.fit_generator(train_gen,
                    epochs=1,
                    verbose=1,
                    validation_data=valid_gen,
                    shuffle=True)

Epoch 1/1


In [105]:
#Generate class probability predictions 
Y_test_predictions = model.predict_generator(valid_gen, verbose=1)



In [106]:
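#Check whether truth and prediction lengths match. Y_test originally had 1228 rows but only 1224 predictions
#are produced; both print as 1224 here, apparently because this cell was re-run after the truncation below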
print(len(Y_test))
print(len(Y_test_predictions))

1224
1224


In [107]:
#Make truth and pred length the same
Y_test = Y_test[:len(Y_test_predictions)]

In [108]:
#Sanity check
print(len(Y_test))
print(len(Y_test_predictions))

1224
1224
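The gap between 1228 test rows and 1224 predictions is consistent with the generator yielding only full batches. This is an assumption about `BatchGenerator`, whose source is not shown here, but the arithmetic fits:

```python
n_test = 1228      # size of the test split reported earlier
batch_size = 12    # BATCH_SIZE from the configuration cell

full_batches = n_test // batch_size          # batches that are completely filled
n_predicted = full_batches * batch_size      # rows covered by full batches
n_dropped = n_test - n_predicted             # rows left over in the partial batch
print(full_batches, n_predicted, n_dropped)
```

Truncating `Y_test` to the prediction length is only valid if the generator returns rows in their original order at prediction time, which is worth verifying.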


In [109]:
#Let's check some of the predictions
Y_test_predictions[0:5]

array([[0.9921198 , 0.00270515, 0.00517503],
       [0.2083101 , 0.61396605, 0.17772388],
       [0.07927147, 0.91105837, 0.00967021],
       [0.9444568 , 0.0369842 , 0.01855901],
       [0.41268933, 0.04572394, 0.54158676]], dtype=float32)

In [110]:
Y_test[0:5]

array([0, 0, 1, 0, 0])

---

### Prediction Probability Thresholds

In [111]:
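#Convert class probabilities to labels with custom thresholds:
#class 2 (d) if its probability exceeds 0.4, else class 1 (r) if its probability exceeds 0.5, else class 0 (neutral)
new_preds = []
for pred in Y_test_predictions:
    if pred[2] > 0.4:
        new_preds.append(2)
    elif pred[1] > 0.5:
        new_preds.append(1)
    else:
        new_preds.append(0)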
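The same rule can be written without a Python loop using nested `np.where`. The sketch below applies it to the five probability rows printed earlier plus one made-up row, and compares the result with a plain `argmax` to show where the lower threshold for class 2 changes the outcome.

```python
import numpy as np

probs = np.array([
    [0.9921198, 0.00270515, 0.00517503],
    [0.2083101, 0.61396605, 0.17772388],
    [0.07927147, 0.91105837, 0.00967021],
    [0.9444568, 0.0369842, 0.01855901],
    [0.41268933, 0.04572394, 0.54158676],
    [0.50, 0.08, 0.42],   # made-up row: class 0 is most likely, but class 2 exceeds 0.4
])

# Class 2 if its probability exceeds 0.4; otherwise class 1 if it exceeds 0.5; otherwise class 0.
thresholded = np.where(probs[:, 2] > 0.4, 2, np.where(probs[:, 1] > 0.5, 1, 0))
argmax = probs.argmax(axis=1)
print(thresholded)
print(argmax)
```

The two approaches agree on the five real rows and differ only on the made-up one, where the thresholds favor class 2 over the most probable class.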

In [112]:
Y_test_predictions = np.array(new_preds)


In [113]:
Y_test = Y_test.astype(int)

---

In [114]:
unique, counts = np.unique(Y_test_predictions, return_counts=True)

print (np.asarray((unique, counts)).T)

[[  0 920]
 [  1 194]
 [  2 110]]


## Model Evaluation

In [115]:
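#Note: average='micro' makes precision, recall, and F1 equal to accuracy for single-label multiclass data
accuracy = accuracy_score(Y_test, Y_test_predictions)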
recall = recall_score(Y_test, Y_test_predictions, average='micro')
precision = precision_score(Y_test, Y_test_predictions, average='micro')
f1 = f1_score(Y_test, Y_test_predictions, average='micro')


print ("Accuracy: {}".format(accuracy))
print ("Recall: {}".format(recall))
print ("Precision: {}".format(precision))
print ("F-1 Score: {}".format(f1))

Accuracy: 0.7344771241830066
Recall: 0.7344771241830066
Precision: 0.7344771241830066
F-1 Score: 0.7344771241830066
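Micro-averaging pools true positives, false positives, and false negatives over all classes. In a single-label problem every wrong prediction counts once as a false positive and once as a false negative, so the pooled precision and recall coincide with accuracy. A pure-Python sketch on made-up labels shows this alongside macro-averaged recall, which weights each class equally and is therefore more sensitive to the minority classes:

```python
def class_counts(y_true, y_pred, label):
    """Return (true positives, false positives, false negatives) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn

# Made-up labels: class 0 dominates, as in the tweet data.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 0, 2]
labels = [0, 1, 2]

counts = [class_counts(y_true, y_pred, c) for c in labels]
tp_sum = sum(tp for tp, _, _ in counts)
fp_sum = sum(fp for _, fp, _ in counts)
fn_sum = sum(fn for _, _, fn in counts)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
micro_precision = tp_sum / (tp_sum + fp_sum)
micro_recall = tp_sum / (tp_sum + fn_sum)
# Macro recall: average the per-class recalls, giving each class equal weight.
macro_recall = sum(tp / (tp + fn) for tp, _, fn in counts) / len(labels)

print(accuracy, micro_precision, micro_recall, round(macro_recall, 3))
```

In scikit-learn, passing `average='macro'` or `average=None` to the same scoring functions gives the macro or per-class view.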


In [134]:
#Show training and validation loss and accuracy
#With only 1 epoch, a loss curve plot is not warranted
history_log.history

{'acc': [0.795874049395614],
 'loss': [0.5186771043009012],
 'val_acc': [0.727124183493502],
 'val_loss': [0.6952079083843559]}

## Future Work and Potential Application

- Test model on new tweets
- Flask Deployment 
- Ranking: Most Neutral & Most Biased Congressmen
- Strongest buzzwords for each affiliation
- Comprehensive evaluation metrics with cross-validation