# Machine Learning to Detect Fake News Using a Term Frequency-Inverse Document Frequency Vectorizer (TfidfVectorizer) and a Passive Aggressive Classifier
## By Carlos Bello
###### Sources and Data: 
###### https://data-flair.training/blogs/advanced-python-project-detecting-fake-news/
###### https://www.w3resource.com/pandas
### Purpose of Program
The purpose of this program is to detect fake news using a Term Frequency-Inverse Document Frequency vectorizer (TfidfVectorizer) and a Passive Aggressive Classifier.

### Libraries
The necessary libraries are imported.

In [1]:
import numpy as np
import pandas as pd
import itertools
import pickle
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

### Dataset
The data set is imported, its shape and head are printed, and the labels are saved.

In [2]:
data = pd.read_csv('news.csv', sep = ",")

print(data.shape)
data.head()

(6335, 4)


| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [3]:
labels = data.label
labels.head()

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

### Training The Model
Here the model is trained. The split and the training are repeated 30 times, and the run with the highest test accuracy is kept. Each split uses a test size of 0.1, so 10% of the articles are held out for testing. Because no random seed is set, every repetition draws a different random split.

The TfidfVectorizer is then initialized with English stop words removed. Terms with a document frequency higher than 0.7, that is, terms that appear in more than 70% of the articles, are also discarded, since such terms tend to behave like stop words. The vectorizer is fit and transformed on the train set and then used to transform the test set. The PassiveAggressiveClassifier is then fit on tfidf_train and y_train, and an accuracy score is computed on the test set. Whenever a run beats the best accuracy so far, its fitted vectorizer is pickled to fake_news.pickle for future reference. Note that the code saves the vectorizer only; the fitted classifier is not written to the file.
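
To make the document-frequency cutoff concrete, here is a minimal pure-Python sketch of the idea behind max_df = 0.7. The toy corpus is made up for illustration, and the whitespace tokenization is a simplification; TfidfVectorizer tokenizes differently, but it applies the same kind of threshold.

```python
from collections import Counter

# Toy corpus, made up for illustration; "election" occurs in every document.
docs = [
    "election results shock voters",
    "election fraud claims spread online",
    "senate debates election law",
    "election turnout reaches record",
]

# Document frequency: the fraction of documents that contain each term.
doc_count = Counter(term for doc in docs for term in set(doc.split()))
doc_freq = {term: count / len(docs) for term, count in doc_count.items()}

# Mimic max_df = 0.7: keep only terms found in at most 70% of the documents.
max_df = 0.7
kept = sorted(term for term, df in doc_freq.items() if df <= max_df)
dropped = sorted(term for term, df in doc_freq.items() if df > max_df)

print("dropped:", dropped)  # dropped: ['election']
print("kept:", len(kept), "terms")
```

A term that shows up in every article carries little information about whether a given article is fake or real, which is why it is filtered out before the classifier sees the features.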

In [4]:
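best = 0
for _ in range(30):
    # Draw a new random 90/10 train/test split on every repetition
    x_train, x_test, y_train, y_test = train_test_split(data['text'], labels, test_size = 0.1)

    # Drop English stop words and terms that appear in more than 70% of the documents
    tfidf_vectorizer = TfidfVectorizer(stop_words = 'english', max_df = 0.7)

    # Learn the vocabulary on the train set only, then reuse it for the test set
    tfidf_train = tfidf_vectorizer.fit_transform(x_train)
    tfidf_test = tfidf_vectorizer.transform(x_test)

    pac = PassiveAggressiveClassifier(max_iter = 50)
    pac.fit(tfidf_train,y_train)

    # Score the classifier on the held-out test set
    y_pred = pac.predict(tfidf_test)
    acc = accuracy_score(y_test,y_pred)

    print(f'Accuracy: {round(acc*100,2)}%')

    # Keep the best run so far; note that only the vectorizer is pickled, not pac
    if acc > best:
        best = acc
        with open("fake_news.pickle", "wb") as f:
            pickle.dump(tfidf_vectorizer, f)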

Accuracy: 95.27%
Accuracy: 94.48%
Accuracy: 93.22%
Accuracy: 93.22%
Accuracy: 93.85%
Accuracy: 95.43%
Accuracy: 94.95%
Accuracy: 95.11%
Accuracy: 94.16%
Accuracy: 93.53%
Accuracy: 95.27%
Accuracy: 95.58%
Accuracy: 94.95%
Accuracy: 94.01%
Accuracy: 95.11%
Accuracy: 93.38%
Accuracy: 93.53%
Accuracy: 93.85%
Accuracy: 96.21%
Accuracy: 94.64%
Accuracy: 93.22%
Accuracy: 93.85%
Accuracy: 95.58%
Accuracy: 94.79%
Accuracy: 92.59%
Accuracy: 94.01%
Accuracy: 94.79%
Accuracy: 93.53%
Accuracy: 96.06%
Accuracy: 93.22%
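
As a rough check on how much the accuracy varies between random splits, the printed values can be summarized with the standard library. This is a minimal sketch; the list is copied by hand from the output above.

```python
from statistics import mean, stdev

# Accuracy values (in percent) copied from the 30 runs printed above.
accuracies = [
    95.27, 94.48, 93.22, 93.22, 93.85, 95.43, 94.95, 95.11, 94.16, 93.53,
    95.27, 95.58, 94.95, 94.01, 95.11, 93.38, 93.53, 93.85, 96.21, 94.64,
    93.22, 93.85, 95.58, 94.79, 92.59, 94.01, 94.79, 93.53, 96.06, 93.22,
]

print(f"Runs:  {len(accuracies)}")
print(f"Best:  {max(accuracies):.2f}%")
print(f"Worst: {min(accuracies):.2f}%")
print(f"Mean:  {mean(accuracies):.2f}%")
print(f"Std:   {stdev(accuracies):.2f}")
```

The best run is the most favorable split of the 30, so the mean is probably the more representative figure for how the classifier behaves on unseen articles.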


### Importing The Model
The pickled vectorizer from the run with the highest accuracy is then loaded back in. The best run reached an accuracy of 96.21%.

In [5]:
# Use a context manager so the file is closed after loading
with open("fake_news.pickle", "rb") as pickle_in:
    tfidf_vectorizer = pickle.load(pickle_in)
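
Since the pickle file holds only the vectorizer, making predictions later would also require the fitted classifier. One possible way to keep the two together, sketched here under assumptions, is to wrap them in a scikit-learn pipeline and pickle that single object. The tiny training set, the sample headline, and the variable names are hypothetical and serve only to keep the example self-contained; this is not the notebook's own code.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

# Tiny made-up training set, only to keep the sketch self-contained.
texts = [
    "aliens secretly control the senate",
    "miracle pill cures every disease overnight",
    "senate passes budget after long debate",
    "city council approves new transit plan",
]
labels = ["FAKE", "FAKE", "REAL", "REAL"]

# Bundle vectorizer and classifier so both are saved and restored together.
model = make_pipeline(
    TfidfVectorizer(stop_words="english", max_df=0.7),
    PassiveAggressiveClassifier(max_iter=50, random_state=0),
)
model.fit(texts, labels)

# Round-trip through pickle (in memory here; writing to a file works the same way).
restored = pickle.loads(pickle.dumps(model))

sample = ["council debates transit budget"]
print(restored.predict(sample))
```

With this layout, the restored object accepts raw text directly, so the vectorizer and classifier cannot drift out of sync between saving and loading.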

### Confusion Matrix
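A confusion matrix is then computed to look for false positives and false negatives. Because y_test and y_pred still hold the values from the last of the 30 runs, the matrix covers that run's 634 test articles rather than the entire data set. Rows are true labels and columns are predicted labels, both in the order FAKE, REAL. Treating FAKE as the positive class, this gives 315 true positives, 276 true negatives, 21 false negatives (fake articles predicted as real), and 22 false positives (real articles predicted as fake).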

In [6]:
confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])

array([[315,  21],
       [ 22, 276]], dtype=int64)
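
The four cells can be turned into summary metrics by hand. The sketch below copies the counts from the output above and assumes FAKE is the positive class; the choice of positive class is an assumption made for illustration.

```python
# Counts copied from the confusion matrix output above.
# Rows are true labels, columns are predicted labels, both ordered FAKE, REAL.
matrix = [[315, 21],
          [22, 276]]

# Treating FAKE as the positive class (an assumption made for this sketch).
tp, fn = matrix[0]   # true FAKE: predicted FAKE, predicted REAL
fp, tn = matrix[1]   # true REAL: predicted FAKE, predicted REAL

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"Samples:   {total}")           # Samples:   634
print(f"Accuracy:  {accuracy:.2%}")    # Accuracy:  93.22%
print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
```

The accuracy derived from the matrix matches the last accuracy line printed by the training loop, which is consistent with the matrix describing the final run rather than the best one.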