In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%%bash
# Set up the project folder on Google Drive and (optionally) download the GloVe vectors
# pip -q install transformers

# remove the directory if necessary
# rm -rf "/content/gdrive/MyDrive/6864_hw4/"

# -p: do not fail if the folder already exists
mkdir -p "/content/drive/My Drive/NLP/"
cd "/content/drive/My Drive/NLP/"

# wget http://nlp.stanford.edu/data/glove.6B.zip
# unzip glove.6B.zip
# ls
# pwd


In [3]:
import numpy as np
import pandas as pd
import torch

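if torch.cuda.is_available():
  device = torch.device('cuda')
  print('There are %d GPU(s) available.' % torch.cuda.device_count())
  print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
  # Fall back to the CPU so that `device` is always defined
  device = torch.device('cpu')
  print('No GPU available, using the CPU instead.')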

There are 1 GPU(s) available.
We will use the GPU: Tesla V100-SXM2-16GB


In [4]:
project_path = '/content/drive/My Drive/NLP/'
dataset = pd.read_csv(project_path + 'fake_and_real_news/combined.csv')

In [48]:
dataset

Unnamed: 0,text,label
0,President Trump made a joke while speaking to ...,0.0
1,"Tuesday night, retired neurosurgeon and former...",0.0
2,This is a very big development. We all knew th...,0.0
3,President Barack Obama will ask the U.S. Congr...,1.0
4,Beyonce made an attempt to glorify the violent...,0.0
...,...,...
44893,"Francois Compaore, the younger brother of form...",1.0
44894,Two women accused of murdering the estranged h...,1.0
44895,I will close my business before I will make a...,0.0
44896,We re only two months into Trump s failing pre...,0.0


In [6]:
# Very important: strip the "(Reuters) - " dateline prefix so the model can't simply key on it
dataset['text'] = dataset['text'].str.replace(r"^.*\(Reuters\) - ", "", regex=True)
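To make the effect of the prefix removal concrete, here is a minimal sketch using the same pattern with the standard `re` module. The sample strings and the helper name `strip_reuters_prefix` are hypothetical and are not taken from the dataset. Because `.*` is greedy and does not cross newlines, everything up to the last "(Reuters) - " on the first line is removed.

```python
import re

# Same pattern as the notebook cell, written as a raw string.
REUTERS_PREFIX = re.compile(r"^.*\(Reuters\) - ")

def strip_reuters_prefix(text):
    """Drop everything up to and including '(Reuters) - ' at the start of the text."""
    return REUTERS_PREFIX.sub("", text)

# Hypothetical sample strings, not rows from the dataset.
with_prefix = "WASHINGTON (Reuters) - The Senate voted on Tuesday."
without_prefix = "The Senate voted on Tuesday."

print(strip_reuters_prefix(with_prefix))
print(strip_reuters_prefix(without_prefix))
```

Texts without the marker pass through unchanged, so the substitution is safe to apply to the whole column.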

In [49]:
dataset

Unnamed: 0,text,label
0,President Trump made a joke while speaking to ...,0.0
1,"Tuesday night, retired neurosurgeon and former...",0.0
2,This is a very big development. We all knew th...,0.0
3,President Barack Obama will ask the U.S. Congr...,1.0
4,Beyonce made an attempt to glorify the violent...,0.0
...,...,...
44893,"Francois Compaore, the younger brother of form...",1.0
44894,Two women accused of murdering the estranged h...,1.0
44895,I will close my business before I will make a...,0.0
44896,We re only two months into Trump s failing pre...,0.0


In [10]:
print(dataset['text'][1])

Tuesday night, retired neurosurgeon and former GOP presidential candidate Ben Carson delivered an  unusual  speech at the Republican National Convention in Cleveland.Carson, one of Trump s biggest (and worst) supporters, arrived on stage with far more energy than he usually has (he actually looked awake) and launched into a predictable attack on Democratic presidential candidate Hillary Clinton. He encouraged Republicans to dispel  the notion that a Hillary Clinton administration wouldn t be that bad,  and said: It won t be four or eight years because she will be appointing people who will have an effect on generations, and America may never recover from that. But then things got really weird. Carson accused Clinton of idolizing liberal radical Saul Alinsky, who according to Carson is  somebody who acknowledges Lucifer  in his book Rules For Radicals. Of course, no Republican s speech is ever complete without some bizarre, over-the-top religious reference, so it probably wasn t too sho

In [12]:
# Imports for the Bi-LSTM model, preprocessing and cross-validation
import gc
import pickle

import keras.backend as K
from keras import callbacks
from keras.layers import Activation, Flatten, Dense, Dropout, Embedding, Bidirectional, LSTM
from keras.models import Sequential, load_model
from keras.optimizers import Adam, SGD
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import LabelEncoder

In [13]:
def create_model(vocabulary_size,embedding_size,embedding_matrix):
    model_glove = Sequential()
    model_glove.add(Embedding(vocabulary_size, embedding_size, weights=[embedding_matrix], trainable=False))
    model_glove.add(Bidirectional(LSTM(100)))
    model_glove.add(Dense(1, activation='sigmoid'))
    model_glove.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model_glove.summary()
    return model_glove

def callback(model_name, tf_log_dir_name='./tf-log/', patience_lr=10):
    """Build the list of Keras callbacks used for each fold."""
    cb = []

    # TensorBoard logging
    tb = callbacks.TensorBoard(log_dir=tf_log_dir_name, histogram_freq=0)
    cb.append(tb)

    # Checkpoint the model with the best validation loss
    m = callbacks.ModelCheckpoint(filepath=model_name, monitor='val_loss', mode='auto', save_best_only=True)
    cb.append(m)

    # Reduce the learning rate when the training loss plateaus
    # (`min_delta` replaces the deprecated `epsilon` argument)
    reduce_lr_loss = callbacks.ReduceLROnPlateau(monitor='loss', factor=0.1, patience=patience_lr, verbose=1, min_delta=1e-4, mode='min')
    cb.append(reduce_lr_loss)

    # Stop early when validation accuracy stops improving.
    # The model is compiled with metrics=['accuracy'], so the logged key is 'val_accuracy'.
    early_stop = callbacks.EarlyStopping(monitor='val_accuracy', min_delta=0, patience=5, verbose=1, mode='auto')
    cb.append(early_stop)

    return cb

######### Show Train Val History Graph ###############
def plot_loss_accu(history, lossLoc='Train_Val_Loss', accLoc='Train_Val_acc'):
    """Save the train/validation loss and accuracy curves to lossLoc and accLoc."""
    import matplotlib.pyplot as plt

    plt.clf()

    loss = history.history['loss']
    val_loss = history.history['val_loss']
    epochs = range(len(loss))
    plt.plot(epochs, loss, 'r')
    plt.plot(epochs, val_loss, 'b')
    plt.title('Training and validation loss')
    plt.legend(['train', 'val'], loc='upper right')
    #plt.show()
    plt.savefig(lossLoc)

    plt.clf()

    # The model is compiled with metrics=['accuracy'], so the history keys are
    # 'accuracy' and 'val_accuracy'.
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    epochs = range(len(acc))
    plt.plot(epochs, acc, 'r')
    plt.plot(epochs, val_acc, 'b')
    plt.title('Training and validation accuracy')
    plt.legend(['train', 'val'], loc='lower right')
    #plt.show()
    plt.savefig(accLoc)

In [14]:
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# Download the stopword list once, not on every call to clean_text.
nltk.download('stopwords')

# Built once so that clean_text stays cheap per document.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)
STOPS = set(stopwords.words("english"))
STEMMER = SnowballStemmer('english')
RE_URL = re.compile(r"((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+"
                    r"\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*",
                    re.MULTILINE | re.UNICODE)
RE_IP = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

def clean_text(text):
    """Strip punctuation and stopwords, normalize a few tokens, then stem."""
    # NOTE: punctuation is removed first, so the URL, IP and contraction rules
    # below rarely (if ever) fire. The order is kept as in the original run.
    text = text.translate(PUNCT_TABLE)

    text = RE_URL.sub("URL", text)
    text = RE_IP.sub("IPADDRESS", text)

    # Lowercase, then drop stopwords and tokens shorter than 3 characters
    text = text.lower().split()
    text = [w for w in text if w not in STOPS and len(w) >= 3]
    text = " ".join(text)

    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r" u s ", " american ", text)
    text = re.sub(r"\0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e - mail", "email", text)
    text = re.sub(r"j k", "jk", text)
    text = re.sub(r"\s{2,}", " ", text)

    # Stem every remaining token
    stemmed_words = [STEMMER.stem(word) for word in text.split()]
    return " ".join(stemmed_words)
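One detail of `clean_text` is worth illustrating: punctuation is stripped before the URL, IP and contraction rules run, so those rules have little left to match. The sketch below uses only the standard library and a hypothetical sample sentence to compare the two orderings for the IP rule. It is an illustration of the ordering effect, not a replacement for the function above.

```python
import re
import string

PUNCT_TABLE = str.maketrans("", "", string.punctuation)
RE_IP = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

# Hypothetical sample sentence, not taken from the dataset.
sample = "Server 192.168.0.1 didn't respond."

# Order used in clean_text: strip punctuation first, then look for IP addresses.
stripped_first = RE_IP.sub("IPADDRESS", sample.translate(PUNCT_TABLE))

# Alternative order: replace IP addresses first, then strip punctuation.
replaced_first = RE_IP.sub("IPADDRESS", sample).translate(PUNCT_TABLE)

print(stripped_first)   # the dots are gone, so the IP pattern finds nothing
print(replaced_first)   # the IP address is replaced before the dots are removed
```

Whether the alternative order would help the classifier is an open question; switching it would change the vocabulary and therefore the numbers reported further down.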

In [15]:
vocabulary_size = 400000
time_step = 300

texts = dataset['text']
label = dataset['label']

X = texts.map(clean_text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

In [16]:
# Encode the labels as integers and reshape them into a column vector
labelEncoder = LabelEncoder()
encoded_label = labelEncoder.fit_transform(label)
y = np.reshape(encoded_label, (-1, 1))

# Sequential 80/20 train/test split (no shuffling)
training_size = int(0.8 * X.shape[0])
X_train = X[:training_size]
y_train = y[:training_size]
X_test = X[training_size:]
y_test = y[training_size:]

# Tokenize: fit on the training texts only, then pad/truncate to time_step tokens
tokenizer = Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(X_train)
sequences_train = tokenizer.texts_to_sequences(X_train)
X_train = sequence.pad_sequences(sequences_train, maxlen=time_step, padding='post')

print(len(tokenizer.word_index))


173886
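The call to `pad_sequences` sets `padding='post'` but leaves `truncating` at its Keras default of `'pre'`. Short articles are therefore zero-padded at the end, while long articles keep their last `time_step` tokens rather than their first. The pure-Python helper below (the name `pad_post_truncate_pre` is hypothetical) is a rough sketch of that behavior for a single sequence.

```python
def pad_post_truncate_pre(seq, maxlen, value=0):
    """Mimic pad_sequences(..., padding='post') with the default truncating='pre'."""
    seq = list(seq)[-maxlen:]                      # long input: keep the LAST maxlen tokens
    return seq + [value] * (maxlen - len(seq))     # short input: pad zeros at the end

short = pad_post_truncate_pre([1, 2, 3], maxlen=5)
long_ = pad_post_truncate_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)
print(long_)
```

If the opening of an article is considered more informative than its ending, passing `truncating='post'` to `pad_sequences` would keep the first 300 tokens instead.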


In [17]:
X_train.shape

(35918, 300)

In [18]:
vocab_size=len(tokenizer.word_index)+1

# Read the GloVe vectors into a word -> vector dict
embeddings = {}
with open(project_path + 'glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings[word] = coefs

print('Total %s word vectors.' % len(embeddings))

Total 400000 word vectors.
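Each line of a GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below parses two made-up lines from an in-memory file to show the resulting structure. It uses plain Python lists instead of NumPy arrays, and the helper name `load_glove` and the toy values are assumptions for illustration only.

```python
import io

def load_glove(handle):
    """Parse 'word v1 v2 ...' lines into a dict mapping word -> list of floats."""
    embeddings = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

# Toy stand-in for glove.6B.100d.txt with 3-dimensional vectors.
toy_file = io.StringIO("the 0.1 -0.2 0.3\nnews 0.4 0.5 -0.6\n")
vectors = load_glove(toy_file)

print(len(vectors), vectors["news"])
```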


In [19]:
embedding_size=100

# Create a weight matrix for the words in the training docs.
# Rows for words without a GloVe vector stay all-zero.
embedding_matrix = np.zeros((vocab_size, embedding_size))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

print(embedding_matrix.shape)

(173887, 100)
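Because `clean_text` stems every token, some vocabulary entries are stems rather than dictionary words and may have no GloVe vector, in which case their embedding row stays at zero. A quick way to gauge this is to count how many vocabulary words are covered. The sketch below does so on toy dictionaries. The helper name `embedding_coverage` and the toy entries are hypothetical, and the real coverage is not reported in this notebook.

```python
def embedding_coverage(word_index, embeddings):
    """Return (number of vocabulary words with a pretrained vector, vocabulary size)."""
    known = sum(1 for word in word_index if word in embeddings)
    return known, len(word_index)

# Toy stand-ins for tokenizer.word_index and the GloVe dict.
toy_word_index = {"trump": 1, "presid": 2, "said": 3}
toy_embeddings = {"trump": [0.1, 0.2], "said": [0.3, 0.4]}

known, total = embedding_coverage(toy_word_index, toy_embeddings)
print("%d of %d vocabulary words have a pretrained vector" % (known, total))
```

In the notebook this could be called as `embedding_coverage(tokenizer.word_index, embeddings)` after the embedding matrix is built.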


In [20]:
sequences_test= tokenizer.texts_to_sequences(X_test)
X_test = sequence.pad_sequences(sequences_test, maxlen=time_step,padding='post')
vocab_size = embedding_matrix.shape[0]
vocab_size

173887
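With this vocabulary size, the parameter counts that `model.summary()` prints for the `create_model` architecture can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper names are hypothetical. An LSTM has four gates, each with input weights, recurrent weights and a bias, and the bidirectional wrapper doubles that.

```python
def embedding_params(vocab_size, embedding_size):
    return vocab_size * embedding_size

def lstm_params(input_size, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_size + units) + units)

def bilstm_params(input_size, units):
    # one forward and one backward LSTM
    return 2 * lstm_params(input_size, units)

def dense_params(input_size, units):
    return input_size * units + units

vocab_size = 173887      # rows of the embedding matrix
embedding_size = 100     # GloVe dimension
lstm_units = 100         # LSTM(100) in create_model

emb = embedding_params(vocab_size, embedding_size)
bi = bilstm_params(embedding_size, lstm_units)
out = dense_params(2 * lstm_units, 1)

print(emb, bi, out, emb + bi + out)
```

Only the Bi-LSTM and dense parameters are trainable, since the embedding layer is frozen with `trainable=False`.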

In [22]:
from sklearn.metrics import classification_report

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
cvscores_FR = []
classification_reports = []

last_preds = None

for Fold, (train, val) in enumerate(kfold.split(X_train, y_train), start=1):
    gc.collect()
    K.clear_session()
    print('Fold: ', Fold)

    X_train_train = X_train[train]
    X_train_val = X_train[val]

    y_train_train = y_train[train]
    y_train_val = y_train[val]

    print("Initializing Callback :/...")
    model_name = 'Models/Bi_LSTM/Cross_Validation/Callbacks/FR/Model_cv_bi_lstm_FR_1_Callbacks_kfold_'+str(Fold)+'.h5'
    cb = callback(model_name=model_name)

    # Create and fit the model
    print("Creating and Fitting Model...")
    model = create_model(vocabulary_size=vocab_size, embedding_size=embedding_size, embedding_matrix=embedding_matrix)

    history = model.fit(X_train_train, y_train_train, validation_data=(X_train_val, y_train_val),
                        epochs=10, batch_size=128, shuffle=True, callbacks=cb)

    # Save each fold's final model (the best-val_loss checkpoint is written by the callback)
    print("Saving Model...")
    model_name = 'Models/Bi_LSTM/Cross_Validation/FR/Model_cv_bi_lstm_FR_1_kfold_' + str(Fold) + '.h5'
    model.save(model_name)
    # To reload a saved fold instead of retraining:
    # model = load_model('Models/Bi_LSTM/Cross_Validation/FR/Model_cv_bi_lstm_FR_1_kfold_' + str(Fold) + '.h5')
    # model.name = 'Model_bi_lstm_FR_1.h5'

    # Evaluate on the held-out test set.
    # NOTE: scores[1] is a fraction in [0, 1]; the '%%' in the format string does not rescale it.
    print("Evaluating Model...")
    scores = model.evaluate(X_test, y_test, verbose=0)
    print("Eval with Fake or Real %s: %.2f%%" % (model.metrics_names[1], scores[1]))
    cvscores_FR.append(scores[1])

    # predict_classes was removed from recent Keras releases;
    # thresholding the sigmoid output at 0.5 is equivalent for a single-unit output.
    y_pred = (model.predict(X_test) > 0.5).astype('int32')
    last_preds = y_pred
    classification_reports.append(classification_report(y_test, y_pred))

    # Optional: save the loss/accuracy curves for this fold
    # plot_loss_accu(history, 'Graphs/Train_Val_Loss_Fold_'+str(Fold)+'.png', 'Graphs/Train_Val_Acc_Fold_'+str(Fold)+'.png')

print("Accuracy list of Fake or Real: ", cvscores_FR)
print("%s: %.2f%%" % ("Mean Accuracy of Fake or Real: ", np.mean(cvscores_FR)))
print("%s: %.2f%%" % ("Standard Deviation of Fake or Real: +/-", np.std(cvscores_FR)))


print('Classification Report:')
for cr in classification_reports:
    print(cr)
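Keras reports accuracy as a fraction between 0 and 1, so a value printed with `%.2f%%` and no rescaling reads as "0.98%" when it means roughly 98%. The sketch below shows one way to format fold scores as real percentages. The helper name `format_accuracy` and the fold scores are hypothetical and are not results from this notebook; `statistics.pstdev` is used because it matches NumPy's default population standard deviation.

```python
from statistics import mean, pstdev

def format_accuracy(acc):
    """Format a fractional accuracy (0..1) as a percentage string."""
    return "%.2f%%" % (acc * 100)

# Hypothetical per-fold accuracies, for illustration only.
fold_scores = [0.981, 0.979, 0.984]

summary = "%s +/- %s" % (format_accuracy(mean(fold_scores)),
                         format_accuracy(pstdev(fold_scores)))
print(summary)
```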

Fold:  1
Initializing Callback :/...
Creating and Fitting Model...
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 100)         17388700  
_________________________________________________________________
bidirectional (Bidirectional (None, 200)               160800    
_________________________________________________________________
dense (Dense)                (None, 1)                 201       
Total params: 17,549,701
Trainable params: 161,001
Non-trainable params: 17,388,700
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Saving Model...
Evaluating Model...
Eval with Fake or Real accuracy: 0.98%




Fold:  2
Initializing Callback :/...
Creating and Fitting Model...
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 100)         17388700  
_________________________________________________________________
bidirectional (Bidirectional (None, 200)               160800    
_________________________________________________________________
dense (Dense)                (None, 1)                 201       
Total params: 17,549,701
Trainable params: 161,001
Non-trainable params: 17,388,700
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Saving Model...
Evaluating Model...
Eval with Fake or Real accuracy: 0.98%
Fold:  3
Initializing Callback :/...
Creating and Fitting Model...
Model: "sequential"
____________________________

In [57]:
og_X_test=dataset['text'][training_size:].to_numpy()

In [58]:
og_X_test.shape

(8980,)

In [59]:
last_preds.shape

(8980, 1)

In [60]:
fp = og_X_test[(y_test[:, 0] == 0) & (last_preds[:, 0] == 1)]

In [61]:
fp

array(['Obama s apology tour for the greatest country in the world continues The United States heard widespread concern Monday over excessive use of force by law-enforcement officials against minorities as it faced the U.N. s main human rights body for a review of its record.Washington also faced calls to work toward abolishing the death penalty, push ahead with closing the Guantanamo Bay detention center and ensure effective safeguards against abuses of Internet surveillance. Its appearance before the U.N. Human Rights Council in Geneva is the second review of the U.S. rights record, following the first in 2010.A string of countries ranging from Malaysia to Mexico pressed the U.S. to redouble efforts to prevent police using excessive force against minorities. Welcome to Mexico. It to likely you ll find any human rights violations here. Oh and here s a prison in Malaysia. Nothing to see here The U.N.Human Rights Council has more pressing issues to deal with like hmmm .maybe their conce

In [62]:
fn = og_X_test[(y_test[:, 0] == 1) & (last_preds[:, 0] == 0)]
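The two boolean masks above pick out the misclassified articles. Judging from the sample rows, label 1 appears to mark the Reuters-style (real) articles, so a false positive would be a fake article predicted as real and a false negative a real article predicted as fake. That reading is an inference, not something the notebook states. The same selection can be sketched in plain Python on toy data; the helper name `split_errors` and the sample values are hypothetical.

```python
def split_errors(texts, y_true, y_pred):
    """Return (false positives, false negatives) as lists of texts."""
    rows = list(zip(texts, y_true, y_pred))
    fp = [t for t, yt, yp in rows if yt == 0 and yp == 1]
    fn = [t for t, yt, yp in rows if yt == 1 and yp == 0]
    return fp, fn

# Toy data, for illustration only.
toy_texts = ["a", "b", "c", "d"]
toy_true = [0, 1, 0, 1]
toy_pred = [1, 1, 0, 0]

toy_fp, toy_fn = split_errors(toy_texts, toy_true, toy_pred)
print(toy_fp, toy_fn)
```

Reading a handful of texts from each list is a cheap way to spot systematic errors, such as fake articles that quote wire copy at length.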

In [63]:
fn

array(['As valedictorian of his high school class, Merrick Garland let his audience know precisely how he felt when parents unplugged the sound system that day in protest at a classmate’s speech against the Vietnam War. He may not necessarily have agreed with the topic or tone but, stirred by the sight of a student’s voice being silenced, Garland abandoned his prepared remarks to deliver instead an impassioned defense of free speech. U.S. President Barack Obama told that story on Wednesday when he nominated Garland, now a 63-year-old judge, on what is often called the second highest court in the land, to a seat on the Supreme Court, the country’s highest court. Obama praised what he called Garland’s “track record of building consensus as a thoughtful, fair-minded judge who follows the law.” Although Garland faces an uphill fight from a Republican-led U.S Senate opposed to anyone the Democratic president nominates, the judge is praised by politicians left and right, even after 19 years 