<a href="https://colab.research.google.com/github/nglglhtr/tapiz/blob/master/ReliabilityAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reliability Analysis

## Techniques:
1. MultinomialNB - Bayesian
  1. CountVectorizer - Bag of Words
  2. Tf-idf
2. PassiveAggressiveClassifier
3. RNN

## Imports

In [1]:
import pandas as pd 
import numpy as np
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import PassiveAggressiveClassifier
import sklearn.metrics as metrics


## Dataset

In [8]:
df = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/fake_or_real_news.csv")
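# set_index returns a new DataFrame; the result is not assigned here,
# so df keeps its default integer index (as the output below shows).
df.set_index("Unnamed: 0")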
df.head()
# print (df.shape)

|   | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [12]:
y = df.label
# drop returns a new DataFrame, so assign the result back
df = df.drop("label", axis=1)


# @todo add analysis for test/train split
X_train, X_test, y_train, y_test = train_test_split (
    df['text'], y, test_size=0.33, random_state=53
)
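
One possible starting point for the split analysis noted in the todo above is to compare class proportions on both sides of the split. The sketch below is illustrative only: `toy` is a hypothetical twelve-row frame, and `stratify` is an extra argument the cell above does not use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data, for illustration only.
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(12)],
    "label": ["FAKE"] * 6 + ["REAL"] * 6,
})

# stratify keeps the FAKE/REAL ratio the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["text"], toy["label"],
    test_size=0.33, random_state=53, stratify=toy["label"],
)

# Class proportions per partition; large differences would hint at a skewed split.
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

On the real data the same `value_counts(normalize=True)` check could be run on `y_train` and `y_test`.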

## Tokenization

### CountVectorizer

In [17]:
count_vectorizer = CountVectorizer(stop_words = 'english')
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)

### TFIDFVectorizer

In [18]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)

👉🏼 It is more important NOT to label real news articles as fake than to catch every fake. The hard constraint is that no REAL article is labelled FAKE. Ideally every fake is labelled FAKE too, but a fake that slips through as REAL is the more tolerable error, since humans will be reading it and can judge for themselves in the worst case 😄
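
Accuracy alone does not show which kind of error a classifier makes, so the priority stated above is easier to check with a confusion matrix and the precision of the FAKE class. The following is a sketch on hypothetical labels (`toy_true` and `toy_pred` are invented for illustration and are not results from this notebook).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, for illustration only (not results from this notebook).
toy_true = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL", "REAL", "REAL"]
toy_pred = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "REAL", "FAKE", "REAL"]

labels = ["FAKE", "REAL"]
cm = confusion_matrix(toy_true, toy_pred, labels=labels)

# Rows are true labels, columns are predictions:
# cm[1, 0] counts REAL articles wrongly flagged as FAKE (the costly error),
# cm[0, 1] counts FAKE articles that slipped through as REAL.
real_flagged_as_fake = cm[1, 0]
fake_missed = cm[0, 1]

# High FAKE precision means few REAL articles are flagged as FAKE.
fake_precision = precision_score(toy_true, toy_pred, pos_label="FAKE")
fake_recall = recall_score(toy_true, toy_pred, pos_label="FAKE")

print(cm)
print(real_flagged_as_fake, fake_missed, fake_precision, fake_recall)
```

In the notebook itself, the same calls could be applied to `y_test` and any of the prediction arrays (for example `tfidf_pred`) once the classifiers below have been fitted.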

## MultinomialNB

In [39]:
# @todo: parameter tuning 
tfidf_clf = MultinomialNB()
count_clf = MultinomialNB()
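
For the parameter-tuning todo, one option is a grid search over the smoothing parameter `alpha` of `MultinomialNB`. This is only a sketch under assumptions: the corpus `toy_texts`, the candidate `alpha` values, and `cv=2` are all hypothetical choices made so the example runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, for illustration only.
toy_texts = [
    "shocking secret they hide", "you will not believe this trick",
    "share this shocking article", "secret trick doctors hate",
    "senate passes budget bill", "minister said talks will resume",
    "court said the ruling stands", "officials said the vote is monday",
]
toy_labels = ["FAKE"] * 4 + ["REAL"] * 4

# Vectorizer and classifier in one pipeline, so each CV fold refits the vocabulary.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Candidate smoothing strengths (assumed values, not tuned recommendations).
search = GridSearchCV(
    pipe,
    param_grid={"nb__alpha": [0.01, 0.1, 0.5, 1.0]},
    cv=2,
    scoring="accuracy",
)
search.fit(toy_texts, toy_labels)

print(search.best_params_, search.best_score_)
```

On the real data, `X_train` and `y_train` would replace the toy inputs, and a larger `cv` would be more appropriate.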

## Performance of tf-idf

In [40]:
tfidf_clf.fit(tfidf_train, y_train)
tfidf_pred = tfidf_clf.predict(tfidf_test)
tfidf_score = metrics.accuracy_score(y_test, tfidf_pred)

print ("accuracy: {:.3%}".format(tfidf_score))

accuracy: 85.653%


## Performance of Bag of Words

In [41]:
count_clf.fit(count_train, y_train)
count_pred = count_clf.predict(count_test)
count_score = metrics.accuracy_score(y_test, count_pred)

print ("accuracy: {:.3%}".format(count_score))

accuracy: 89.335%


## Linear Model: PassiveAggressiveClassifier

In [42]:
# @todo: parameter tuning
linear_clf = PassiveAggressiveClassifier(
    n_iter_no_change=50)


In [43]:
linear_clf.fit(tfidf_train, y_train)
linear_pred = linear_clf.predict(tfidf_test)
linear_score = metrics.accuracy_score(
    y_test, 
    linear_pred)
print ("accuracy: {:.3%}".format(linear_score))

accuracy: 93.257%


## Introspection: PassiveAggressiveClassifier

In [44]:
def most_informative_feature_for_binary_classification(vectorizer, classifier, n=100):
    """
    See: https://stackoverflow.com/a/26980472
    
    Identify the most informative features given a fitted vectorizer and a fitted binary
    linear classifier. Set n to the number of weighted features to show per class.
    (Note: the current implementation only prints the features and does not return them.)
    """

    class_labels = classifier.classes_
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()
    # Sort once: the most negative coefficients point to classes_[0],
    # the most positive ones to classes_[1]
    ranked = sorted(zip(classifier.coef_[0], feature_names))
    topn_class1 = ranked[:n]
    topn_class2 = ranked[-n:]

    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)

    print()

    for coef, feat in reversed(topn_class2):
        print(class_labels[1], coef, feat)


most_informative_feature_for_binary_classification(tfidf_vectorizer, linear_clf, n=30)

FAKE -5.1748572882063 2016
FAKE -4.093550827989906 october
FAKE -4.05741402641117 hillary
FAKE -3.2332544719805285 share
FAKE -3.1026958776016826 november
FAKE -2.967454807819075 article
FAKE -2.7234502071115014 print
FAKE -2.4765486402667736 email
FAKE -2.3179091112046377 oct
FAKE -2.235979597157709 advertisement
FAKE -2.218269469991737 establishment
FAKE -2.150016476161974 podesta
FAKE -2.1424355430688338 war
FAKE -2.1316322525883615 election
FAKE -2.1110116224519206 mosul
FAKE -2.0615362341856573 nov
FAKE -1.9708149671630848 source
FAKE -1.9574568601958193 com
FAKE -1.785596945327603 snip
FAKE -1.7308454796611574 wikileaks
FAKE -1.7240507940466634 donald
FAKE -1.7120643174627335 26
FAKE -1.706007055911879 photo
FAKE -1.6934164597597179 ayotte
FAKE -1.6926926036936873 jewish
FAKE -1.6688470559765691 dr
FAKE -1.6450803478749765 brexit
FAKE -1.6162305554399246 pipeline
FAKE -1.6048305536045644 corporate
FAKE -1.564076997981873 reuters

REAL 4.791838181903359 said
REAL 2.679584939433817

### 📝Scratch Space

In [49]:
news = "U.S. Secretary of State John F. Kerry said Monday that he will stop in Paris later this week, amid criticism that no top American officials attended Sunday’s unity march against terrorism.\n\nKerry said he expects to arrive in Paris Thursday evening, as he heads home after a week abroad. He said he will fly to France at the conclusion of a series of meetings scheduled for Thursday in Sofia, Bulgaria. He plans to meet the next day with Foreign Minister Laurent Fabius and President Francois Hollande, then return to Washington.\n\nThe visit by Kerry, who has family and childhood ties to the country and speaks fluent French, could address some of the criticism that the United States snubbed France in its darkest hour in many years.\n\nThe French press on Monday was filled with questions about why neither President Obama nor Kerry attended Sunday’s march, as about 40 leaders of other nations did. Obama was said to have stayed away because his own security needs can be taxing on a country, and Kerry had prior commitments.\n\nAmong roughly 40 leaders who did attend was Israeli Prime Minister Benjamin Netanyahu, no stranger to intense security, who marched beside Hollande through the city streets. The highest ranking U.S. officials attending the march were Jane Hartley, the ambassador to France, and Victoria Nuland, the assistant secretary of state for European affairs. Attorney General Eric H. Holder Jr. was in Paris for meetings with law enforcement officials but did not participate in the march.\n\nKerry spent Sunday at a business summit hosted by India’s prime minister, Narendra Modi. 
The United States is eager for India to relax stringent laws that function as barriers to foreign investment and hopes Modi’s government will act to open the huge Indian market for more American businesses.\n\nIn a news conference, Kerry brushed aside criticism that the United States had not sent a more senior official to Paris as “quibbling a little bit.” He noted that many staffers of the American Embassy in Paris attended the march, including the ambassador. He said he had wanted to be present at the march himself but could not because of his prior commitments in India.\n\n“But that is why I am going there on the way home, to make it crystal clear how passionately we feel about the events that have taken place there,” he said.\n\n“And I don’t think the people of France have any doubts about America’s understanding of what happened, of our personal sense of loss and our deep commitment to the people of France in this moment of trauma."
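# Each classifier must see features from the vectorizer it was trained with:
# linear_clf and tfidf_clf were fitted on tf-idf features, count_clf on raw counts.
my_news_count = count_vectorizer.transform([news])
my_news_tfidf = tfidf_vectorizer.transform([news])
print('from linear clf:', linear_clf.predict(my_news_tfidf))
print('from bayesian + count:', count_clf.predict(my_news_count))
print('from bayesian + tfidf:', tfidf_clf.predict(my_news_tfidf))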

from linear clf: ['REAL']
from bayesian + count: ['REAL']
from bayesian + tfidf: ['REAL']


## RNN

In [63]:
import tensorflow as tf 
from tensorflow import keras

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [64]:
vocab_size = 10000
embedding_dim = 16
max_length = 100
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"
training_size = 20000

In [104]:
training_sentences = X_train.tolist()
testing_sentences = X_test.tolist()
training_labels = y_train.tolist()
testing_labels = y_test.tolist()

# Encode labels as integers: FAKE -> 0, REAL -> 1
training_labels = [0 if label == 'FAKE' else 1 for label in training_labels]
testing_labels = [0 if label == 'FAKE' else 1 for label in testing_labels]


In [105]:
tokenizer = Tokenizer(
    num_words=vocab_size, 
    oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

training_sequences = tokenizer.texts_to_sequences(
    training_sentences)

training_padded = pad_sequences(
    training_sequences, 
    maxlen=max_length, 
    padding=padding_type, 
    truncating = trunc_type)

testing_sequences = tokenizer.texts_to_sequences(
    testing_sentences)
testing_padded = pad_sequences(
    testing_sequences, 
    maxlen=max_length, 
    padding=padding_type, 
    truncating=trunc_type)


In [106]:
import numpy as np
training_padded = np.array(training_padded)
training_labels = np.array(training_labels)
testing_padded = np.array(testing_padded)
testing_labels = np.array(testing_labels)

In [107]:
rnn_model = keras.Sequential([
    keras.layers.Embedding(vocab_size, 64),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1)
])

In [108]:
rnn_model.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(1e-4),
    metrics=['accuracy']
)

In [109]:
rnn_model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, None, 64)          640000    
_________________________________________________________________
bidirectional_4 (Bidirection (None, 128)               66048     
_________________________________________________________________
dense_8 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_9 (Dense)              (None, 1)                 65        
Total params: 714,369
Trainable params: 714,369
Non-trainable params: 0
_________________________________________________________________


In [111]:
history = rnn_model.fit(
    training_padded, 
    training_labels,
    epochs=30,
    validation_data=(
        testing_padded, testing_labels
    ))

# training_labels

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


### 📝Scratch Space

In [116]:
sentences = [
  "U.S. Secretary of State John F. Kerry said Monday that he will stop in Paris later this week, amid criticism that no top American officials attended Sunday’s unity march against terrorism.\n\nKerry said he expects to arrive in Paris Thursday evening, as he heads home after a week abroad. He said he will fly to France at the conclusion of a series of meetings scheduled for Thursday in Sofia, Bulgaria. He plans to meet the next day with Foreign Minister Laurent Fabius and President Francois Hollande, then return to Washington.\n\nThe visit by Kerry, who has family and childhood ties to the country and speaks fluent French, could address some of the criticism that the United States snubbed France in its darkest hour in many years.\n\nThe French press on Monday was filled with questions about why neither President Obama nor Kerry attended Sunday’s march, as about 40 leaders of other nations did. Obama was said to have stayed away because his own security needs can be taxing on a country, and Kerry had prior commitments.\n\nAmong roughly 40 leaders who did attend was Israeli Prime Minister Benjamin Netanyahu, no stranger to intense security, who marched beside Hollande through the city streets. The highest ranking U.S. officials attending the march were Jane Hartley, the ambassador to France, and Victoria Nuland, the assistant secretary of state for European affairs. Attorney General Eric H. Holder Jr. was in Paris for meetings with law enforcement officials but did not participate in the march.\n\nKerry spent Sunday at a business summit hosted by India’s prime minister, Narendra Modi. 
The United States is eager for India to relax stringent laws that function as barriers to foreign investment and hopes Modi’s government will act to open the huge Indian market for more American businesses.\n\nIn a news conference, Kerry brushed aside criticism that the United States had not sent a more senior official to Paris as “quibbling a little bit.” He noted that many staffers of the American Embassy in Paris attended the march, including the ambassador. He said he had wanted to be present at the march himself but could not because of his prior commitments in India.\n\n“But that is why I am going there on the way home, to make it crystal clear how passionately we feel about the events that have taken place there,” he said.\n\n“And I don’t think the people of France have any doubts about America’s understanding of what happened, of our personal sense of loss and our deep commitment to the people of France in this moment of trauma."    
]

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(
    sequences, 
    maxlen = max_length,
    padding=padding_type,
    truncating=trunc_type)
predictions=rnn_model.predict(padded)
predictions

array([[16.308409]], dtype=float32)
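
The value above is a raw logit, not a probability: the final `Dense(1)` layer has no activation and the loss was built with `from_logits=True`. A minimal sketch of turning it into a probability and a label follows, assuming the encoding used earlier (FAKE = 0, REAL = 1) and a 0.5 decision threshold, which is an assumption rather than something the notebook fixes.

```python
import numpy as np

def sigmoid(logit):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-logit))

# The logit printed by rnn_model.predict(padded) above.
logits = np.array([[16.308409]], dtype=np.float32)

# Probability of class 1 (REAL under the encoding FAKE=0, REAL=1).
probabilities = sigmoid(logits)

# Assumed threshold of 0.5 on the probability (equivalently, 0 on the logit).
predicted = np.where(probabilities >= 0.5, "REAL", "FAKE")

print(probabilities, predicted)
```

A large positive logit like this one maps to a probability very close to 1, so the RNN agrees with the three classifiers above that the article is REAL.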