# Building a "fake news" classifier

In this project, we will apply the basics of natural language processing (NLP) along with some supervised machine learning to build a "fake news" detector. We'll begin with the basics of supervised machine learning, then choose a few important features and test ideas for identifying and classifying fake news articles.

In this exercise, we'll use pandas alongside scikit-learn to create a sparse text vectorizer, whose output we'll use to train and test a supervised model.

In [8]:
# Import the necessary modules
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import pandas as pd

# Read the CSV file
df = pd.read_csv('fake_or_real.csv')

# Print the head of df
print(df.head())

# Create a series to store the labels: y
y = df.label

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], y, test_size=0.33, random_state=89)

# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words='english')

# Fit the vectorizer on the training text and transform it: count_train
count_train = count_vectorizer.fit_transform(X_train)

# Transform the test text with the already-fitted vectorizer: count_test
count_test = count_vectorizer.transform(X_test)

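# Print the first 10 features of the count_vectorizer
# (in scikit-learn 1.2 and later, get_feature_names() was removed; use get_feature_names_out() instead)
print(count_vectorizer.get_feature_names()[:10])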

  Unnamed: 0                                              title  \
0       8476                       You Can Smell Hillary’s Fear   
1      10294  Watch The Exact Moment Paul Ryan Committed Pol...   
2       3608        Kerry to go to Paris in gesture of sympathy   
3      10142  Bernie supporters on Twitter erupt in anger ag...   
4        875   The Battle of New York: Why This Primary Matters   

                                                text label  
0  Daniel Greenfield, a Shillman Journalism Fello...  FAKE  
1  Google Pinterest Digg Linkedin Reddit Stumbleu...  FAKE  
2  U.S. Secretary of State John F. Kerry said Mon...  REAL  
3  — Kaydee King (@KaydeeKing) November 9, 2016 T...  FAKE  
4  It's primary day in New York and front-runners...  REAL  
['00', '000', '0000', '00000031', '000035', '00006', '0002', '000billion', '000ft', '001']
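
To make the idea of a count vector concrete, here is a minimal pure-Python sketch of roughly what a bag-of-words vectorizer does. The two-sentence corpus and the one-word stop list are hypothetical, and the tokenizer only approximates CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters); the real class also stores its result as a sparse matrix.

```python
import re
from collections import Counter

# Hypothetical toy corpus and stop-word list (not taken from fake_or_real.csv)
docs = ["The senate passed the budget", "The budget shocked the senate"]
stop_words = {"the"}

def tokenize(text):
    # Lowercase, keep tokens of two or more word characters, drop stop words
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in stop_words]

# Vocabulary: sorted unique tokens; the list position is the column index
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row of counts per document, one column per vocabulary entry
rows = []
for doc in docs:
    counts = Counter(tokenize(doc))
    rows.append([counts[word] for word in vocab])

print(vocab)
print(rows)
```

Most entries in a real document-term matrix are zero, which is why scikit-learn returns a sparse matrix rather than nested lists like these.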


Similar to the sparse CountVectorizer created in the previous exercise, we'll now create tf-idf vectors for the documents. We'll set up a TfidfVectorizer and investigate some of its features.

In this exercise, we'll use pandas and sklearn along with the same X_train, X_test, y_train, and y_test Series we created before.

In [9]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

# Fit the vectorizer on the training data and transform it: tfidf_train
tfidf_train = tfidf_vectorizer.fit_transform(X_train.values)

# Transform the test data with the already-fitted vectorizer: tfidf_test
tfidf_test = tfidf_vectorizer.transform(X_test.values)

# Print the first 10 features
print(tfidf_vectorizer.get_feature_names()[:10])

# Print the first 5 vectors of the tfidf training data
print(tfidf_train.toarray()[:5])

['00', '000', '0000', '00000031', '000035', '00006', '0002', '000billion', '000ft', '001']
[[0.         0.01855224 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
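
The weights printed above can be reproduced by hand on a small scale. The sketch below assumes TfidfVectorizer's default settings (raw term counts, smoothed idf computed as ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row) and applies the same max_df=0.7 cutoff; the tokenized documents are hypothetical.

```python
import math

# Hypothetical, already-tokenized documents
docs = [
    ["budget", "senate", "senate"],
    ["budget", "aliens", "senate"],
    ["budget", "vote"],
    ["budget"],
]
n_docs = len(docs)
vocab = sorted({tok for doc in docs for tok in doc})

# Document frequency: number of documents containing each term
df = {term: sum(term in doc for doc in docs) for term in vocab}

# max_df=0.7 discards terms that occur in more than 70% of the documents
max_df = 0.7
kept = [term for term in vocab if df[term] / n_docs <= max_df]

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in kept}

def tfidf_row(doc):
    # Term count times idf, then scale the row to unit L2 length
    raw = [doc.count(term) * idf[term] for term in kept]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

rows = [tfidf_row(doc) for doc in docs]
print(kept)
for row in rows:
    print([round(v, 3) for v in row])
```

Here "budget" appears in every document, so the max_df cutoff removes it, and the rarer "aliens" outweighs "senate" within the second document.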


To get a better idea of how the vectors work, we'll investigate them by converting them into pandas DataFrames.

Here, we'll use the same data structures we created in the previous exercises (count_train, count_vectorizer, tfidf_train, tfidf_vectorizer), as well as pandas, which is imported as pd.

In [10]:
# Create the CountVectorizer DataFrame: count_df
count_df = pd.DataFrame(count_train.toarray(), columns=count_vectorizer.get_feature_names())

# Create the TfidfVectorizer DataFrame: tfidf_df
tfidf_df = pd.DataFrame(tfidf_train.toarray(), columns=tfidf_vectorizer.get_feature_names())

# Print the head of count_df
print(count_df.head())

# Print the head of tfidf_df
print(tfidf_df.head())

# Check whether the DataFrames are equal
print(count_df.equals(tfidf_df))

   00  000  0000  00000031  000035  00006  0002  000billion  000ft  001  ...  \
0   0    1     0         0       0      0     0           0      0    0  ...   
1   0    0     0         0       0      0     0           0      0    0  ...   
2   0    0     0         0       0      0     0           0      0    0  ...   
3   0    0     0         0       0      0     0           0      0    0  ...   
4   0    0     0         0       0      0     0           0      0    0  ...   

   تنجح  حلب  عربي  عن  لم  ما  محاولات  من  هذا  والمرضى  
0     0    0     0   0   0   0        0   0    0        0  
1     0    0     0   0   0   0        0   0    0        0  
2     0    0     0   0   0   0        0   0    0        0  
3     0    0     0   0   0   0        0   0    0        0  
4     0    0     0   0   0   0        0   0    0        0  

[5 rows x 55427 columns]
    00       000  0000  00000031  000035  00006  0002  000billion  000ft  001  \
0  0.0  0.018552   0.0       0.0     0.0    0.0   0.

# Training and testing the "fake news" model with CountVectorizer
Now it's time to train the "fake news" model using the features we identified and extracted. We'll train and test a Naive Bayes model using the CountVectorizer data.

The training and test sets have been created, and count_vectorizer, count_train, and count_test have been computed.

In [11]:
# Import the necessary modules
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(count_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(count_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(score)

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)

0.9025911708253359
[[896 131]
 [ 72 985]]
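
As a quick check on how to read this output, the accuracy score can be recomputed from the confusion matrix alone. The sketch below copies the four counts printed above; in scikit-learn's layout, rows are true labels and columns are predicted labels, both in the order given by `labels=['FAKE', 'REAL']`. The variable names are only illustrative.

```python
# Confusion matrix copied from the output above
cm = [[896, 131],
      [72, 985]]

(fake_as_fake, fake_as_real), (real_as_fake, real_as_real) = cm
total = fake_as_fake + fake_as_real + real_as_fake + real_as_real

# Accuracy: correctly classified articles (the diagonal) over all test articles
accuracy = (fake_as_fake + real_as_real) / total

# Precision for FAKE: of the articles predicted FAKE, how many really were
fake_precision = fake_as_fake / (fake_as_fake + real_as_fake)

print(total)
print(accuracy)
print(fake_precision)
```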


# Training and testing the "fake news" model with TfidfVectorizer
Now that we have evaluated the model using the CountVectorizer, we'll do the same using the TfidfVectorizer with a Naive Bayes model.

In [12]:
# Calculate the accuracy score and confusion matrix of Multinomial Naive Bayes classifier predictions trained on 
# tfidf_train, y_train and tested against tfidf_test and y_test

nb_classifier = MultinomialNB()
nb_classifier.fit(tfidf_train, y_train)
pred = nb_classifier.predict(tfidf_test)
score = metrics.accuracy_score(y_test, pred)
print(score)

cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)

0.8512476007677543
[[ 738  289]
 [  21 1036]]
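
Accuracy alone hides where the two models differ. The tf-idf model labels 289 FAKE articles as REAL, against 131 for the count model. A small sketch, using only the two confusion matrices printed in this document, compares per-class recall; the helper name `per_class_recall` is made up for illustration.

```python
def per_class_recall(cm):
    # Recall for class i: correct predictions on row i over all true members of class i
    return [row[i] / sum(row) for i, row in enumerate(cm)]

labels = ['FAKE', 'REAL']
count_cm = [[896, 131], [72, 985]]    # CountVectorizer model
tfidf_cm = [[738, 289], [21, 1036]]   # TfidfVectorizer model

for name, cm in (('count', count_cm), ('tf-idf', tfidf_cm)):
    recalls = per_class_recall(cm)
    print(name, {label: round(r, 3) for label, r in zip(labels, recalls)})
```

With default settings, the tf-idf model finds more of the REAL articles but misses many more FAKE ones, which accounts for its lower overall accuracy.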


# Improving the model
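We're going to test a few different alpha (smoothing) levels to determine whether any of them performs better. Note that the code in this cell fits and scores on the CountVectorizer data (count_train, count_test), which is what the printed scores reflect; swap in tfidf_train and tfidf_test to tune the tf-idf model instead.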

In [34]:
import numpy as np

# Create the list of alphas: alphas
alphas = np.arange(0.05, 1, 0.1)

# Define train_and_predict()
def train_and_predict(alpha):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha=alpha)
    # Fit to the training data
    nb_classifier.fit(count_train, y_train)
    # Predict the labels: pred
    pred = nb_classifier.predict(count_test)
    # Compute accuracy: score
    score = metrics.accuracy_score(y_test, pred)
    return score

# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()

Alpha:  0.05
Score:  0.9064299424184261

Alpha:  0.15000000000000002
Score:  0.9069097888675623

Alpha:  0.25000000000000006
Score:  0.9064299424184261

Alpha:  0.35000000000000003
Score:  0.9073896353166987

Alpha:  0.45000000000000007
Score:  0.9069097888675623

Alpha:  0.5500000000000002
Score:  0.9054702495201535

Alpha:  0.6500000000000001
Score:  0.904510556621881

Alpha:  0.7500000000000002
Score:  0.9049904030710173

Alpha:  0.8500000000000002
Score:  0.9049904030710173

Alpha:  0.9500000000000002
Score:  0.9030710172744721
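
For intuition about what alpha controls: in Multinomial Naive Bayes it is an additive smoothing term, so a word's estimated probability within a class is (count + alpha) / (total count + alpha * vocabulary size). The following is a minimal sketch with hypothetical counts for a three-word vocabulary, not the classifier's actual implementation.

```python
import math

def smoothed_log_probs(word_counts, alpha):
    # Additive smoothing: every word gets alpha pseudo-counts
    total = sum(word_counts)
    vocab_size = len(word_counts)
    return [math.log((c + alpha) / (total + alpha * vocab_size)) for c in word_counts]

# Hypothetical counts of three words within one class; the last word was never seen
counts = [3, 1, 0]

for alpha in (0.05, 0.5, 1.0):
    probs = [math.exp(lp) for lp in smoothed_log_probs(counts, alpha)]
    print(alpha, [round(p, 3) for p in probs])
```

A small alpha keeps the estimates close to the raw frequencies and gives unseen words very little probability, while a larger alpha pulls all words toward a uniform distribution. The scores above suggest this data set is not very sensitive to the choice within the tested range.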



# Inspecting the model
Now that we have built a "fake news" classifier, we'll investigate what it has learned. We can map the important vector weights back to actual words using some simple inspection techniques.

In [41]:
# Get the class labels: class_labels
class_labels = nb_classifier.classes_

# Extract the features: feature_names
feature_names = tfidf_vectorizer.get_feature_names()

# Zip the feature names together with the coefficient array and sort by weight (ascending): feat_with_weights
# Note: for a two-class MultinomialNB, coef_[0] holds the log probabilities of the second class (REAL)
feat_with_weights = sorted(zip(nb_classifier.coef_[0], feature_names))

# Print the first class label and the 20 lowest-weighted feat_with_weights entries
print(class_labels[0], feat_with_weights[:20])
print()
# Print the second class label and the 20 highest-weighted feat_with_weights entries
print(class_labels[1], feat_with_weights[-20:])

FAKE [(-11.300427527153001, '0000'), (-11.300427527153001, '000035'), (-11.300427527153001, '0002'), (-11.300427527153001, '000billion'), (-11.300427527153001, '0011'), (-11.300427527153001, '004s'), (-11.300427527153001, '005'), (-11.300427527153001, '005s'), (-11.300427527153001, '00684'), (-11.300427527153001, '006s'), (-11.300427527153001, '007s'), (-11.300427527153001, '008s'), (-11.300427527153001, '0099'), (-11.300427527153001, '00am'), (-11.300427527153001, '00p'), (-11.300427527153001, '00pm'), (-11.300427527153001, '013c2812c9'), (-11.300427527153001, '014'), (-11.300427527153001, '015'), (-11.300427527153001, '01am')]

REAL [(-7.736122101029444, 'gop'), (-7.715082494631467, 'democratic'), (-7.700132178755215, 'states'), (-7.616062189000394, 'republicans'), (-7.613804573183871, 'voters'), (-7.602069929422667, 'house'), (-7.568255176935045, 'percent'), (-7.517735585484349, 'people'), (-7.504294941910381, 'new'), (-7.418655764303377, 'party'), (-7.416781588287513, 'cruz'), (-7.
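
The FAKE list above consists of tokens that share the same minimum weight, so it says little about which words actually signal fake news. One alternative, sketched here under assumptions, is to rank words by the difference between their log probabilities in the two classes. The numbers and word list below are hypothetical stand-ins for the two rows of a fitted classifier's `feature_log_prob_` attribute (row 0 for FAKE, row 1 for REAL).

```python
# Hypothetical per-class log probabilities for four words
feature_names = ["aliens", "budget", "senate", "shocking"]
log_prob_fake = [-5.0, -7.0, -8.0, -5.5]
log_prob_real = [-9.0, -6.5, -5.0, -8.5]

# Log ratio: negative values lean FAKE, positive values lean REAL
log_ratio = sorted(
    (real - fake, name)
    for name, fake, real in zip(feature_names, log_prob_fake, log_prob_real)
)

most_fake = [name for _, name in log_ratio[:2]]
most_real = [name for _, name in log_ratio[-2:]]
print("FAKE", most_fake)
print("REAL", most_real)
```

Because the ratio compares the classes directly, a word only ranks highly when it is much more likely under one class than under the other.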