#  Detect Fake News - Using a variety of scikit-learn classifiers
> Given a news data set, distinguish real news from fake news.

- toc: false 
- badges: true
- comments: false
- categories: [jupyter, sklearn, tfidfvectorizer, passiveaggresiveclassifier, logisticregressionclassifier, supportvectorclassifier, naivebayesclassifier, decisiontreeclassifier, randomforestclassifier, mlpclassifier]
- author: Venkataramani, Suja

## Overview

Given a corpus of news documents, each labelled "REAL" or "FAKE", the task is to predict the correct label. We split the documents into training and test sets, transform them into tf-idf vectors, feed those vectors to a variety of ML classifiers, and compare the classifiers' accuracy. Along the way we also try to understand the underlying concepts and classifiers.

## Method

First, let's download the [news data set](https://drive.google.com/file/d/1er9NJTLUA3qnRuyhfzuN0XUsoIC4a-_q/view).

In [114]:
# Import packages.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

In [49]:
# Load data.
news_data = pd.read_csv(".\\data\\news.csv")

In [7]:
# Check the number of rows and columns in the dataset.
news_data.shape

(6335, 4)

In [8]:
# Check the first 5 rows.
news_data.head(5)

| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [11]:
# Split the data by column to input and output.
x = news_data['text']
y = news_data['label']

In [23]:
# Split the x and y into 80% train data and 20% test data.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 100)

### TF-IDF - What is the intuition behind it?
The aim is to convert a document of words into a vector of numbers so it can be fed to ML algorithms. For every word in every document tf-idf is calculated to show the importance of the word in the corpus and the document.

t = term, d = document, N = number of documents in the corpus

- Term Frequency: Calculate the frequency of a word in a document.
tf (t, d) = count of the word in the document / total number of words in the document

- Document Frequency: Calculate the number of documents in which the term occurs.
df(t) = count of the documents in which the term occurs

- Inverse Document Frequency: Measures how rare, and therefore how informative, the term is across the corpus.
idf(t) = N / df(t) - When the corpus is large, this number can be large.
idf(t) = log(N / (df(t) + 1)) - Taking the log dampens this large value; the +1 in the denominator avoids division by 0.

- Term Frequency-Inverse Document Frequency:
tf-idf(t, d) = tf(t, d) * idf(t)
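
The definitions above can be worked through by hand on a tiny corpus. The sketch below is only an illustration: the three sentences are made up, tokenization is a plain whitespace split, and the function names are our own. It follows the formulas exactly as written in this section. scikit-learn's `TfidfVectorizer` uses a slightly different smoothed idf and normalizes each document vector, so its numbers will not match these.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration; already lowercased.
corpus = [
    "the senate passed the bill",
    "the shocking secret they hide",
    "senate vote on the budget bill",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)


def tf(term, doc):
    """Term frequency: count of the term divided by the document length."""
    return Counter(doc)[term] / len(doc)


def df(term):
    """Document frequency: number of documents that contain the term."""
    return sum(1 for doc in docs if term in doc)


def idf(term):
    """Inverse document frequency with +1 in the denominator, as in the text."""
    return math.log(n_docs / (df(term) + 1))


def tf_idf(term, doc):
    """Weight of a term in one document."""
    return tf(term, doc) * idf(term)


# Compare a word found in every document with rarer ones, scored in the second document.
for term in ["the", "senate", "shocking"]:
    print(term, "df =", df(term), "tf-idf =", round(tf_idf(term, docs[1]), 4))
```

With this variant of the formula, a word that occurs in every document ends up with a weight below zero, while a word that is frequent in one document and rare elsewhere gets the largest weight.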

[MonkeyLearn](https://monkeylearn.com/blog/what-is-tf-idf/#:~:text=TF%2DIDF%20is%20a%20statistical,across%20a%20set%20of%20documents.)  
[TowardsDataScience](https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089)


In [57]:
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
# stop_words = 'english': Removes common uninformative English words such as "and" and "the".
# max_df = 0.7 : Ignores terms that appear in more than 70% of the documents (corpus-specific stop words).
tf_idf_vectorizer = TfidfVectorizer(stop_words = 'english', max_df = 0.7)

# fit_transform - Learns the vocabulary and idf from the training set and returns a sparse document-term matrix (n_samples, n_features)
tf_idf_x_train = tf_idf_vectorizer.fit_transform(x_train)

# transform - Reuses the vocabulary and idf learned above and returns a sparse document-term matrix (n_samples, n_features)
tf_idf_x_test = tf_idf_vectorizer.transform(x_test)

### 1. Passive Aggressive Classifier - What is it?

PAC is an online (streaming) algorithm suited to classifying massive amounts of data. Examples are fed to the algorithm in sequential order. For each example, if the prediction is correct, the model is left unchanged (passive); if the prediction is incorrect, the model weights are changed to correct it (aggressive). The example is discarded after processing.

This works on the principle of hinge loss: if the prediction is correct with enough margin, there is no loss; if the prediction is incorrect, or correct but too close to the decision boundary, there is a loss, and the larger the shortfall, the larger the loss.

The weight vector is then adjusted by just enough that the margin on that example becomes exactly 1.
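
To make the passive and aggressive steps concrete, here is a minimal sketch of the basic update rule for labels in {-1, +1}. The function names and the three-example stream are made up for illustration, and this is the simplest variant of the algorithm. As far as we understand, scikit-learn's `PassiveAggressiveClassifier` additionally limits the step size through its `C` parameter, so treat this as a sketch of the idea rather than of the library's code.

```python
def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def pa_update(w, x, y):
    """One passive-aggressive step for a label y in {-1, +1}; x must be non-zero."""
    loss = max(0.0, 1.0 - y * dot(w, x))  # hinge loss on this example
    if loss == 0.0:
        return w  # passive: the margin is already at least 1
    tau = loss / dot(x, x)  # smallest step that brings the margin to 1
    return [wi + tau * y * xi for wi, xi in zip(w, x)]  # aggressive correction


# Hypothetical stream of (features, label) pairs, seen one at a time.
stream = [([1.0, 2.0], 1), ([2.0, 0.5], -1), ([1.5, 2.5], 1)]
w = [0.0, 0.0]
for x, y in stream:
    w = pa_update(w, x, y)
    print("weights:", w, "margin on this example:", round(y * dot(w, x), 6))
```

Each example is used once and then dropped, which is why the method suits data that arrives as a stream.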

[GeeksForGeeks](https://www.geeksforgeeks.org/passive-aggressive-classifiers/)  
[YouTube](https://www.youtube.com/watch?v=TJU8NfDdqNQ)   
[MachineCurve](https://www.machinecurve.com/index.php/2019/10/15/how-to-use-hinge-squared-hinge-loss-with-keras/)

In [115]:
# Instantiate and fit PAC.
pac = PassiveAggressiveClassifier(max_iter = 50)
pac.fit(tf_idf_x_train, y_train)

# Predict on the test set.
y_pred = pac.predict(tf_idf_x_test)

# Calculate the accuracy.
score = accuracy_score(y_test, y_pred)
print("PAC Accuracy = ", round((score * 100), 3))

PAC Accuracy =  93.291


### 2. Logistic Regression Classifier

Logistic regression is like linear regression, but with a binary (true/false) outcome. It passes a linear combination of the features through a sigmoid function to predict the probability of a class.
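
As a rough sketch of that idea, the snippet below turns a weighted sum of features into a probability. The weights, bias, and feature values are hypothetical, and treating "REAL" as the positive class is our assumption for the example, not something taken from the fitted model.

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of the positive class for one example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical model parameters and one hypothetical feature vector.
weights, bias = [2.0, -1.5], 0.25
p = predict_proba(weights, bias, [0.8, 0.1])

# Assumption for this sketch: "REAL" is the positive class, with a 0.5 cut-off.
label = "REAL" if p >= 0.5 else "FAKE"
print(round(p, 4), label)
```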

[Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)   
[YouTube](https://www.youtube.com/watch?v=yIYKR4sgzI8)


In [73]:
# random_state - seeds the random number generator so that repeated runs reproduce the same result.
lr = LogisticRegression(random_state = 100)
lr.fit(tf_idf_x_train, y_train)

y_pred = lr.predict(tf_idf_x_test)

score = accuracy_score(y_test, y_pred)
print("LR Accuracy = ", round((score * 100), 3))

LR Accuracy =  90.845


### 3. Support Vector Machine

SVM is a supervised ML model that can be used for classification. It finds the line or plane that best separates the classes, that is, the one with the widest margin to the nearest data points of each class. If the classes are not linearly separable in the given dimensions, a kernel function is used to work in a higher-dimensional space, and the separating hyperplane is found there.
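
The effect of adding a dimension can be seen on a toy example. The sketch below uses four made-up one-dimensional points that no single threshold can split, then adds the squared value as a new feature. A real kernel does this implicitly, without building the new features; the explicit mapping and the helper name here are only for illustration.

```python
# Toy data: the outer points belong to class +1, the inner points to class -1.
points = [-2.0, -0.5, 0.5, 2.0]
labels = [1, -1, -1, 1]


def separable_by_threshold(values, labels):
    """True if one cut-off puts each class entirely on its own side."""
    pos = [v for v, l in zip(values, labels) if l == 1]
    neg = [v for v, l in zip(values, labels) if l == -1]
    return max(pos) < min(neg) or max(neg) < min(pos)


# In the original dimension, no threshold separates the classes.
print(separable_by_threshold(points, labels))

# After mapping x to x squared, a single threshold is enough.
squared = [x * x for x in points]
print(separable_by_threshold(squared, labels))
```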

[MonkeyLearn](https://monkeylearn.com/blog/introduction-to-support-vector-machines-svm/)    
[YouTube](https://www.youtube.com/watch?v=efR1C6CvhmE)

In [89]:
svc = SVC()
svc.fit(tf_idf_x_train, y_train)

y_pred = svc.predict(tf_idf_x_test)

score = accuracy_score(y_test, y_pred)
print("SVC Accuracy = ", round((score * 100), 3))

SVC Accuracy =  50.039


### 4. Naive Bayes Classifier

NB is a probabilistic ML model based on Bayes' theorem:

P(y|X) = P(X|y) P(y) / P(X)

Where X = x1, x2...xn (n features), y = the class to predict.
All the features are assumed to be independent of one another given the class (the naive assumption) and are considered equally important.
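
A small worked example shows how the theorem and the independence assumption combine. All the probabilities below are made up for illustration, not estimated from the news data: we assume equal priors and per-class likelihoods for two words, multiply them as the naive assumption allows, and normalize by P(X).

```python
import math

# Hypothetical priors P(y) and word likelihoods P(x_i | y).
priors = {"REAL": 0.5, "FAKE": 0.5}
likelihood = {
    "REAL": {"senate": 0.30, "shocking": 0.05},
    "FAKE": {"senate": 0.10, "shocking": 0.40},
}


def posterior(words):
    """P(y | X) for each class, treating the words as independent given y."""
    joint = {
        y: priors[y] * math.prod(likelihood[y][w] for w in words)
        for y in priors
    }
    evidence = sum(joint.values())  # P(X), the normalizing constant
    return {y: joint[y] / evidence for y in joint}


post = posterior(["senate", "shocking"])
print({y: round(p, 4) for y, p in post.items()})
```

Because P(X) is the same for every class, the predicted class can be found from the numerators alone; dividing by it only turns the scores into probabilities.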

[TowardsDataScience](https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c)  
[YouTube](https://www.youtube.com/watch?v=O2L2Uv9pdDA)

In [102]:
mnb = MultinomialNB()
mnb.fit(tf_idf_x_train, y_train)

y_pred = mnb.predict(tf_idf_x_test)

score = accuracy_score(y_test, y_pred)
print("MNB Accuracy = ", round((score * 100), 3))

MNB Accuracy =  85.556


### 5. Decision Tree Classifier

DT is a non-parametric supervised model, meaning it makes no assumption about the distribution of the data. The model builds a tree of decision rules from the training data. When new data is presented, it follows the rules from the root down to a leaf to arrive at a class.

[YouTube](https://www.youtube.com/watch?v=7VeUPuFGJHk)    
[Scikit-Learn](https://scikit-learn.org/stable/modules/tree.html)

In [104]:
dtc = DecisionTreeClassifier()
dtc.fit(tf_idf_x_train, y_train)

y_pred = dtc.predict(tf_idf_x_test)

score = accuracy_score(y_test, y_pred)
print("DTC Accuracy = ", round((score * 100), 3))

DTC Accuracy =  80.9


### 6. Random Forest Classifier

This is an ensemble classifier (one that combines the predictions of several models). Several sub-samples of the data are drawn by bootstrapping (random sampling with replacement), a decision tree classifier is built on each, and the final result is the average of the results from the different trees.
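
The two ingredients, bootstrapping and averaging, can be sketched without any trees at all. In the snippet below the row indices and the per-tree class probabilities are made up, and the helper names are our own; it only illustrates how a sample is drawn with replacement and how per-tree outputs are averaged.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def average_probabilities(per_tree_probs):
    """Average the class probabilities predicted by each tree."""
    n_trees = len(per_tree_probs)
    classes = per_tree_probs[0].keys()
    return {c: sum(p[c] for p in per_tree_probs) / n_trees for c in classes}


rng = random.Random(100)
rows = list(range(10))  # stand-in for row indices of the training set
sample = bootstrap_sample(rows, rng)
print(sample)  # some rows repeat, others are left out

# Hypothetical outputs of three trees for a single document.
tree_outputs = [
    {"REAL": 0.9, "FAKE": 0.1},
    {"REAL": 0.4, "FAKE": 0.6},
    {"REAL": 0.3, "FAKE": 0.7},
]
avg = average_probabilities(tree_outputs)
print(max(avg, key=avg.get), avg)
```

In this made-up case two of the three trees lean towards "FAKE", yet the averaged probabilities favour "REAL", which shows that averaging is not the same as a simple majority vote.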


[Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#)    
[YouTube](https://www.youtube.com/watch?v=J4Wdy0Wc_xQ)

In [107]:
rfc = RandomForestClassifier()
rfc.fit(tf_idf_x_train, y_train)

y_pred = rfc.predict(tf_idf_x_test)

score = accuracy_score(y_test, y_pred)
print("RFC Accuracy = ", round((score * 100), 3))

RFC Accuracy =  85.162


### 7. Multilayer Perceptron Classifier

A neural network model based on perceptrons, with hidden layers and backpropagation. The hidden layers help learn complex patterns in the data, and backpropagation adjusts the weights of the classifier after every iteration to minimise the loss.

[Scikit-Learn](https://scikit-learn.org/stable/modules/neural_networks_supervised.html)    
[TowardsDataScience](https://towardsdatascience.com/multi-layer-neural-networks-with-sigmoid-function-deep-learning-for-rookies-2-bf464f09eb7f)



In [112]:
mlp = MLPClassifier(random_state=100, max_iter=100)
mlp.fit(tf_idf_x_train, y_train)

y_pred = mlp.predict(tf_idf_x_test)

score = accuracy_score(y_test, y_pred)
print("MLP Accuracy = ", round((score * 100), 3))

MLP Accuracy =  93.212


## Conclusion

Both PassiveAggressiveClassifier and MLPClassifier reached about **93%** accuracy, while SVC with its default settings produced just 50%. LogisticRegression reached 91% and MultinomialNB 86%. RandomForestClassifier was more accurate (85%) than a single DecisionTreeClassifier (81%). None of the classifiers were tuned for optimal performance; the MLP results in particular can probably be improved with hyperparameter tuning. We also have not measured how long each model takes to train.