# NLP: "Fake News Detection"

Objective: Build a classifier that can distinguish real news from fake news. A training set for this is available.

# Action items for this exercise:

* Use the vector model and text preprocessing techniques from class to construct a training data set.
* Determine the dimensions of your vector model and print out the first 10 dimensions.
* Use that training data set to construct a Naive Bayes classifier.
* Compute the accuracy and 95% CI for the classifier.
* Run your analysis with and without data preprocessing. Is there a difference in the accuracies of the models?
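
One way to organize the last action item is to score the same Naive Bayes model with two different vectorizers on a held-out split. The sketch below is only an illustration: the toy corpus is made up, the helper name `holdout_accuracy` is hypothetical, and "preprocessing" is assumed here to mean lowercasing, alphabetic-only tokens, and binary counts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def holdout_accuracy(vectorizer, texts, labels, seed=0):
    """Fit Naive Bayes on a training split; return accuracy on the held-out split."""
    train_x, test_x, train_y, test_y = train_test_split(
        texts, labels, test_size=0.25, random_state=seed, stratify=labels)
    model = MultinomialNB()
    # the vocabulary is learned from the training split only
    model.fit(vectorizer.fit_transform(train_x), train_y)
    return model.score(vectorizer.transform(test_x), test_y)


# made-up toy corpus standing in for the 'text' and 'label' columns
toy_texts = [
    "SHOCKING secret the elites hide from you",
    "You will not BELIEVE this miracle cure",
    "Secret plot EXPOSED by anonymous insider",
    "Miracle diet the doctors hate",
    "Senate passes budget bill after long debate",
    "Central bank holds interest rates steady",
    "City council approves new transit budget",
    "Court hears arguments in election case",
]
toy_labels = ["FAKE"] * 4 + ["REAL"] * 4

# without preprocessing: whitespace-separated tokens, case preserved
raw = CountVectorizer(lowercase=False, token_pattern=r"\S+")
# with (assumed) preprocessing: lowercase, alphabetic tokens, binary counts
pre = CountVectorizer(lowercase=True, token_pattern=r"[a-zA-Z]+", binary=True)

acc_raw = holdout_accuracy(raw, toy_texts, toy_labels)
acc_pre = holdout_accuracy(pre, toy_texts, toy_labels)
print("without preprocessing: {:3.2f}".format(acc_raw))
print("with preprocessing:    {:3.2f}".format(acc_pre))
```

On eight toy documents the two numbers carry no meaning. With the real articles, the same helper would be called with the `text` and `label` columns.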


Note: The data set contains a large number of articles, so training takes a long time. You can downsample it to roughly 1,000 articles to speed up training and evaluation.

The fields of interest are ‘text’ (the article body) and ‘label’ (FAKE or REAL).
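
The downsampling mentioned in the note could look roughly like the sketch below. It uses a small made-up frame in place of the real CSV, so the sizes are placeholders, and the names `n_keep` and `sample` are chosen here for illustration.

```python
import pandas as pd

# made-up frame standing in for the real data; only the two columns of interest
df = pd.DataFrame({
    "text": ["article {}".format(i) for i in range(50)],
    "label": ["FAKE", "REAL"] * 25,
})

n_keep = 10  # would be about 1000 for the real data set
# random sample without replacement; a fixed seed keeps the sample reproducible
sample = df.sample(n=n_keep, random_state=42).reset_index(drop=True)

texts = sample["text"]
labels = sample["label"]
print(sample.shape)
print(labels.value_counts())
```

A plain random sample does not guarantee equal class counts, so it is worth checking `value_counts()` before training.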




In [None]:
import os
os.getcwd()

'/content'

In [None]:
url = "https://raw.githubusercontent.com/lutzhamel/fake-news/master/data/fake_or_real_news.csv"

In [None]:
import pandas as pd

# load the fake-news data set directly from the URL
df = pd.read_csv(url)
df

Unnamed: 0,id,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
...,...,...,...,...
6330,4490,State Department says it can't find emails fro...,The State Department told the Republican Natio...,REAL
6331,8062,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,FAKE
6332,8622,Anti-Trump Protesters Are Tools of the Oligarc...,Anti-Trump Protesters Are Tools of the Oligar...,FAKE
6333,4021,"In Ethiopia, Obama seeks progress on peace, se...","ADDIS ABABA, Ethiopia —President Obama convene...",REAL


In [None]:
# setup
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from assets.confint import classification_confint
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer
from sklearn.datasets import fetch_20newsgroups

In [None]:

print("******** data **********")

# get the newsgroup database
cats = ['talk.politics.misc', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', 
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=cats)

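# extract into dataframes
texts = pd.DataFrame(newsgroups_train.data, columns=['text'])
# NOTE: target indices follow newsgroups_train.target_names (sorted category names),
# not the order of cats, so cats[x] swaps the two class names here;
# newsgroups_train.target_names[x] would give the true names
labels = pd.DataFrame(newsgroups_train.target, columns=['label'])['label'].apply(lambda x: cats[x])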
texts.head()


******** data **********


Unnamed: 0,text
0,\nIn billions of dollars (%GNP):\nyear GNP ...
1,ajteel@dendrite.cs.Colorado.EDU (A.J. Teel) w...
2,\nMy opinion is this: In a society whose econ...
3,"Ahhh, remember the days of Yesterday? When we..."
4,"\n""...a la Chrysler""?? Okay kids, to the near..."


In [None]:
texts.iloc[3,0]

'Ahhh, remember the days of Yesterday?  When we were only \n\tgoing to pay $17 / month?\n\n\tWhen only 1.2% of the population would pay extra taxes?\n\n\tRemember when a few of us predicted that it wasn\'t true?  :)\n\tRemember the Inaugural?   Dancing and Singing!  Liberation\n\tat last!  \n\n\tWell, figure *this* out:\n\n\t5% VAT, estimated to raise $60-100 Billion per year ( on CNN )\n\tWork it out, chum...\n\n\t     $60,000,000,000  /  125,000,000 taxpayers = $480 / year\n\n        But, you exclaim, " I\'ll get FREE HEALTH CARE! "\n\tBut, I exclaim, " No, you won\'t! "\n\n\tThis is only for that poor 37 million who have none.  Not for\n\tYOU, chum. :)  That comes LATER.\n\n\tAdd in the estimates of the energy tax costs - $300-500 / year\n\n\tPlus, all that extra "corporate and rich" taxes that will \n\ttrickle down, and what do you have?\n\n\t$1,000 / year, just like I said two months ago.\n\n\tAnd, the best part?   You don\'t GET ANYTHING for it.\n\n\tDeficit is STILL projected to

In [None]:
labels.head()

0    sci.space
1    sci.space
2    sci.space
3    sci.space
4    sci.space
Name: label, dtype: object

In [None]:
print("******** docarray **********")

# build the stemmer object
stemmer = PorterStemmer()
# build a word-level analyzer from CountVectorizer that keeps alphabetic tokens only
analyzer = CountVectorizer(analyzer = "word", token_pattern = "[a-zA-Z]+").build_analyzer()

# build a new analyzer that stems using the default analyzer to create the words to be stemmed
def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

# build docarray
vectorizer = CountVectorizer(analyzer=stemmed_words,
                                 binary=True,
                                 min_df=2)
docarray = vectorizer.fit_transform(texts.loc[:,'text']).toarray()
docarray.shape

******** docarray **********


(1058, 6267)
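
The shape above gives the number of documents and the number of dimensions (vocabulary terms). To also print the first 10 dimensions, the fitted vectorizer's `vocabulary_` mapping can be sorted by column index. The sketch below shows this on three made-up sentences; the variable names are chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# made-up documents for illustration
docs = [
    "the rocket reached orbit",
    "congress passed the budget",
    "the orbit of the moon",
]

vec = CountVectorizer(binary=True, token_pattern=r"[a-zA-Z]+")
X = vec.fit_transform(docs)

# rows = documents, columns = vocabulary terms
print(X.shape)

# vocabulary_ maps each term to its column index; sort terms by that index
features = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(features[:10])
```

Sorting `vocabulary_` works across scikit-learn versions; recent versions also provide `get_feature_names_out()` for the same purpose.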

In [None]:
print("******** model **********")


# KNN
model = KNeighborsClassifier()

# grid search
param_grid = {'n_neighbors': list(range(1,15,3))}
grid = GridSearchCV(model, param_grid, cv=2, verbose=10, n_jobs=-1)
grid.fit(docarray, labels)
print("Grid Search: best parameters: {}".format(grid.best_params_))

******** model **********
Fitting 2 folds for each of 5 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    5.8s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:   11.6s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:   28.9s finished


Grid Search: best parameters: {'n_neighbors': 7}
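
Besides `best_params_`, a fitted `GridSearchCV` object also exposes the cross-validated score of each candidate. The sketch below runs the same parameter grid on synthetic data from `make_classification`, which is an assumption made only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for docarray and labels
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

param_grid = {"n_neighbors": list(range(1, 15, 3))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=2)
grid.fit(X, y)

print("best parameters:", grid.best_params_)
# mean accuracy over the validation folds for the best candidate
print("best cross-validated accuracy: {:3.2f}".format(grid.best_score_))

# per-candidate validation accuracy
for k, score in zip(grid.cv_results_["param_n_neighbors"],
                    grid.cv_results_["mean_test_score"]):
    print(k, round(float(score), 3))
```

`best_score_` is computed on folds the model was not fit on, so it is usually a less optimistic estimate than accuracy measured on the training data.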


In [None]:
print("******** Accuracy **********")

# accuracy of the best model on the training data, with a 95% confidence interval
best_model = grid.best_estimator_
predict_y = best_model.predict(docarray)
acc = accuracy_score(labels, predict_y)
lb,ub = classification_confint(acc,docarray.shape[0])
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc,lb,ub))

******** Accuracy **********
Accuracy: 0.86 (0.84,0.88)
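
`classification_confint` comes from a course-local module whose source is not shown here. A common choice for such a helper is the normal-approximation interval for a proportion, and the sketch below assumes that formula. The function name `normal_approx_confint` is hypothetical.

```python
import math


def normal_approx_confint(acc, n, z=1.96):
    """Normal-approximation interval for an accuracy measured on n observations.

    z = 1.96 corresponds to a 95% confidence level.
    """
    half_width = z * math.sqrt(acc * (1.0 - acc) / n)
    return acc - half_width, acc + half_width


acc = 0.86   # rounded accuracy from the output above
n = 1058     # number of documents
lb, ub = normal_approx_confint(acc, n)
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc, lb, ub))
```

With these inputs the sketch gives the same rounded bounds as the output above, which is consistent with (but does not prove) the assumption about the helper.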


In [None]:

print("******** confusion matrix **********")

# build the confusion matrix
cm = confusion_matrix(labels, predict_y, labels=cats)
cm_df = pd.DataFrame(cm, index=cats, columns=cats)
print("Confusion Matrix:\n{}".format(cm_df))

******** confusion matrix **********
Confusion Matrix:
                    talk.politics.misc  sci.space
talk.politics.misc                 562         31
sci.space                           19        446


# Naive Bayes (NB)

In [None]:
## Naive Bayes

# setup
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from assets.confint import classification_confint
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer
from sklearn.datasets import fetch_20newsgroups

print("******** data **********")

# get the newsgroup database
cats = ['talk.politics.misc', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', 
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=cats)

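# extract into dataframes
texts = pd.DataFrame(newsgroups_train.data, columns=['text'])
# NOTE: target indices follow newsgroups_train.target_names (sorted category names),
# not the order of cats, so cats[x] swaps the two class names here;
# newsgroups_train.target_names[x] would give the true names
labels = pd.DataFrame(newsgroups_train.target, columns=['label'])['label'].apply(lambda x: cats[x])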

print("******** docarray **********")

# build the stemmer object
stemmer = PorterStemmer()
# build a word-level analyzer from CountVectorizer that keeps alphabetic tokens only
analyzer = CountVectorizer(analyzer = "word", token_pattern = "[a-zA-Z]+").build_analyzer()

# build a new analyzer that stems using the default analyzer to create the words to be stemmed
def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

# build docarray
vectorizer = CountVectorizer(analyzer=stemmed_words,
                                 binary=True,
                                 min_df=2)
docarray = vectorizer.fit_transform(texts.loc[:,'text']).toarray()
docarray.shape

print("******** model **********")


# Naive Bayes
model = MultinomialNB()
# NOTE: MultinomialNB is used with its default settings (including the smoothing
# parameter alpha), so no grid search over a parameter space is run here
model.fit(docarray, labels)


print("******** Accuracy **********")

# accuracy of the model on the training data, with a 95% confidence interval
best_model = model
predict_y = best_model.predict(docarray)
acc = accuracy_score(labels, predict_y)
lb,ub = classification_confint(acc,docarray.shape[0])
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc,lb,ub))

print("******** confusion matrix **********")

# build the confusion matrix
cm = confusion_matrix(labels, predict_y, labels=cats)
cm_df = pd.DataFrame(cm, index=cats, columns=cats)
print("Confusion Matrix:\n{}".format(cm_df))


******** data **********
******** docarray **********
******** model **********
******** Accuracy **********
Accuracy: 0.95 (0.94,0.97)
******** confusion matrix **********
Confusion Matrix:
                    talk.politics.misc  sci.space
talk.politics.misc                 562         31
sci.space                           19        446
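
As a consistency check, the accuracy can be recovered from a confusion matrix by dividing the diagonal (correct predictions) by the total count. The sketch below uses the four counts printed above.

```python
# confusion matrix counts from the output above (rows = true class, columns = predicted)
cm = [[562, 31],
      [19, 446]]

correct = sum(cm[i][i] for i in range(len(cm)))   # diagonal entries
total = sum(sum(row) for row in cm)               # all documents
acc = correct / total

print(correct, total, round(acc, 2))
```

This gives 1008 correct out of 1058 documents, or about 0.95, which agrees with the Naive Bayes accuracy reported above.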
