<a href="https://colab.research.google.com/github/nglglhtr/tapiz/blob/master/ReliabilityAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reliability Analysis

## Techniques:
1. MultinomialNB - Bayesian
   1. CountVectorizer - Bag of Words
   2. Tf-idf
2. PassiveAggressiveClassifier
3. RNN

## Imports

In [1]:
import pandas as pd 
import numpy as np
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import PassiveAggressiveClassifier
import sklearn.metrics as metrics
import matplotlib.pyplot as plt

## Dataset

In [8]:
df = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/fake_or_real_news.csv")
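# Note: set_index returns a new DataFrame. Without assignment (or inplace=True)
# df is unchanged, so "Unnamed: 0" stays a regular column in the output.
df.set_index("Unnamed: 0")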
df.head()
# print (df.shape)

| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [12]:
y = df.label
# drop returns a new DataFrame; assign it so the label column is actually removed
df = df.drop("label", axis=1)


# @todo add analysis for test/train split
X_train, X_test, y_train, y_test = train_test_split (
    df['text'], y, test_size=0.33, random_state=53
)
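
The `@todo` above asks for an analysis of the split. One possible check, sketched here on a hypothetical toy frame (`toy` and its contents are invented for illustration and are not the news dataset), is to compare the class proportions in the two parts. The `stratify` argument is an addition that the original split does not use; it keeps the FAKE/REAL ratio the same on both sides.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the news DataFrame.
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(12)],
    "label": ["FAKE", "REAL"] * 6,
})

# stratify (an assumption, not in the original cell) preserves the class ratio.
toy_train, toy_test, lab_train, lab_test = train_test_split(
    toy["text"], toy["label"],
    test_size=0.33, random_state=53, stratify=toy["label"],
)

# Share of each class in the training and test parts.
train_balance = lab_train.value_counts(normalize=True)
test_balance = lab_test.value_counts(normalize=True)
print(train_balance)
print(test_balance)
```

If the proportions in the real data differ noticeably between train and test, accuracy figures on the test set become harder to interpret.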

## Tokenization

### CountVectorizer

In [17]:
count_vectorizer = CountVectorizer(stop_words = 'english')
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)
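
To make the bag-of-words step concrete, here is a small sketch on two made-up sentences (`toy_docs` is hypothetical). It shows what `fit_transform` learns and why the test set only goes through `transform`: words not seen during fitting are silently ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus; "on" is removed by the English stop-word list.
toy_docs = ["senate votes on budget", "budget budget senate"]

toy_vectorizer = CountVectorizer(stop_words="english")
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# Columns follow the alphabetically sorted vocabulary.
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(toy_counts.toarray())

# "deficit" was never seen during fit, so only "budget" is counted.
unseen = toy_vectorizer.transform(["budget deficit"]).toarray()
print(unseen)
```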

### TfidfVectorizer

In [18]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)
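
As a rough illustration of how tf-idf differs from raw counts, the sketch below uses two made-up documents (`toy_docs` is hypothetical). A word that appears in every document ("senate") ends up with a lower weight than a word that appears in only one ("budget"), and each row is scaled to unit length under the default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: "senate" occurs in both documents.
toy_docs = ["senate budget", "senate vote"]

toy_tfidf = TfidfVectorizer(stop_words="english")
weights = toy_tfidf.fit_transform(toy_docs).toarray()

# Map each word to its column index.
col = toy_tfidf.vocabulary_
first = weights[0]

# The rarer word carries more weight within the first document.
print("senate:", first[col["senate"]])
print("budget:", first[col["budget"]])
```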

👉🏼 It is more important NOT to label real news articles as fake. Every article labelled FAKE must actually be fake: no real article may be labelled FAKE. A fake article may or may not be labelled FAKE, since humans will be reading it and can judge for themselves in the worst case 😄
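
The requirement above is about one specific kind of error (real articles flagged as FAKE), which plain accuracy does not isolate. A minimal sketch of how it could be measured, using made-up labels (`y_true` and `y_pred` are hypothetical, not results from this notebook): count the REAL-to-FAKE mistakes in a confusion matrix and report precision for the FAKE class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical ground truth and predictions, for illustration only.
y_true = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL"]
y_pred = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "FAKE"]

# Rows are true labels, columns are predicted labels, in the order given.
cm = confusion_matrix(y_true, y_pred, labels=["FAKE", "REAL"])
real_flagged_as_fake = cm[1, 0]

# Precision for FAKE: of everything labelled FAKE, how much really is fake.
fake_precision = precision_score(y_true, y_pred, pos_label="FAKE")
# Recall for FAKE: how many fakes were caught (less critical per the note).
fake_recall = recall_score(y_true, y_pred, pos_label="FAKE")

print(cm)
print(real_flagged_as_fake, fake_precision, fake_recall)
```

Under this reading, the goal is to push `fake_precision` towards 1 (and `real_flagged_as_fake` towards 0), accepting a lower `fake_recall` if necessary.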

## MultinomialNB

In [39]:
# @todo: parameter tuning 
tfidf_clf = MultinomialNB()
count_clf = MultinomialNB()
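
For the parameter-tuning `@todo`, one option is a cross-validated grid search over the smoothing parameter `alpha`. The sketch below is only an outline under stated assumptions: the toy texts, the grid values, `cv=2` and the accuracy scoring are all hypothetical choices, and the vectorizer is placed inside a `Pipeline` so that it is refit on each training fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, four documents per class.
toy_texts = [
    "shocking secret exposed share now", "you will not believe this secret",
    "share this shocking story", "secret plot exposed",
    "senate passes budget", "minister said talks continue",
    "court said ruling stands", "officials said budget approved",
]
toy_labels = ["FAKE"] * 4 + ["REAL"] * 4

# Vectorizer and classifier are tuned together as one estimator.
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])

# Candidate smoothing values (assumed grid, adjust as needed).
alpha_grid = {"nb__alpha": [0.01, 0.1, 0.5, 1.0]}

search = GridSearchCV(pipeline, alpha_grid, cv=2, scoring="accuracy")
search.fit(toy_texts, toy_labels)
print(search.best_params_, search.best_score_)
```

On the real data, `X_train` and `y_train` would take the place of the toy lists, and a larger `cv` would be more appropriate.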

## Performance of tf-idf

In [40]:
tfidf_clf.fit(tfidf_train, y_train)
tfidf_pred = tfidf_clf.predict(tfidf_test)
tfidf_score = metrics.accuracy_score(y_test, tfidf_pred)

print ("accuracy: {:.3%}".format(tfidf_score))

accuracy: 85.653%


## Performance of Bag of Words

In [41]:
count_clf.fit(count_train, y_train)
count_pred = count_clf.predict(count_test)
count_score = metrics.accuracy_score(y_test, count_pred)

print ("accuracy: {:.3%}".format(count_score))

accuracy: 89.335%


## Linear Model: PassiveAggressiveClassifier

In [42]:
# @todo: parameter tuning
linear_clf = PassiveAggressiveClassifier(
    n_iter_no_change=50)


In [43]:
linear_clf.fit(tfidf_train, y_train)
linear_pred = linear_clf.predict(tfidf_test)
linear_score = metrics.accuracy_score(
    y_test, 
    linear_pred)
print ("accuracy: {:.3%}".format(linear_score))

accuracy: 93.257%


## Introspection: PassiveAggressiveClassifier

In [44]:
def most_informative_feature_for_binary_classification(vectorizer, classifier, n=100):
    """
    See: https://stackoverflow.com/a/26980472

    Identify the most important features given a vectorizer and a binary classifier. Set n to the
    number of weighted features you would like to show. (Note: the current implementation merely
    prints and does not return the top features.)
    """

    class_labels = classifier.classes_
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
    feature_names = vectorizer.get_feature_names_out()
    # Sort once: the most negative coefficients point to class_labels[0],
    # the most positive ones to class_labels[1].
    ranked = sorted(zip(classifier.coef_[0], feature_names))
    topn_class1 = ranked[:n]
    topn_class2 = ranked[-n:]

    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)

    print()

    for coef, feat in reversed(topn_class2):
        print(class_labels[1], coef, feat)


most_informative_feature_for_binary_classification(tfidf_vectorizer, linear_clf, n=30)

FAKE -5.1748572882063 2016
FAKE -4.093550827989906 october
FAKE -4.05741402641117 hillary
FAKE -3.2332544719805285 share
FAKE -3.1026958776016826 november
FAKE -2.967454807819075 article
FAKE -2.7234502071115014 print
FAKE -2.4765486402667736 email
FAKE -2.3179091112046377 oct
FAKE -2.235979597157709 advertisement
FAKE -2.218269469991737 establishment
FAKE -2.150016476161974 podesta
FAKE -2.1424355430688338 war
FAKE -2.1316322525883615 election
FAKE -2.1110116224519206 mosul
FAKE -2.0615362341856573 nov
FAKE -1.9708149671630848 source
FAKE -1.9574568601958193 com
FAKE -1.785596945327603 snip
FAKE -1.7308454796611574 wikileaks
FAKE -1.7240507940466634 donald
FAKE -1.7120643174627335 26
FAKE -1.706007055911879 photo
FAKE -1.6934164597597179 ayotte
FAKE -1.6926926036936873 jewish
FAKE -1.6688470559765691 dr
FAKE -1.6450803478749765 brexit
FAKE -1.6162305554399246 pipeline
FAKE -1.6048305536045644 corporate
FAKE -1.564076997981873 reuters

REAL 4.791838181903359 said
REAL 2.679584939433817

## RNN