# Project ~ Fake News Detector

This notebook builds a fake news detector: a model that classifies a news article as real or fake, based on the dataset it has been trained and tested on.

It simulates how fake news is detected on online platforms, to help keep people from falling for traps such as fraud and scams, which mostly spread online.

The dataset was downloaded from kaggle.com, where most datasets come from outside Africa. Since fake news spreads all over the world, mostly through online platforms, the dataset still helps us build a solution to this problem.

The project will be presented as follows:
* Data collection
* Exploratory data analysis 
* Text Pre-processing
* Data Modeling
* Model Evaluation
* Model Simulation

## Data collection

The data I am going to use was downloaded from kaggle.com, since I had no other source of data. The dataset contains the following columns:
* Unnamed: 0
* title
* text
* label

#### Importing necessary libraries

In [1]:
# for pandas library
import pandas as pd

# for nltk library (the stopwords corpus must be available, e.g. via nltk.download('stopwords'))
import nltk
from nltk.corpus import stopwords
from string import punctuation

# for sklearn library
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# for flask
from flask import Flask, render_template, request

# suppress Python warnings so they do not clutter the notebook output
import warnings
warnings.filterwarnings('ignore')

#### Loading the dataset

In [2]:
news = pd.read_csv("news.csv")

In [3]:
news.head()

| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


## Exploratory data analysis

In [4]:
news.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6335 entries, 0 to 6334
Data columns (total 4 columns):
Unnamed: 0    6335 non-null int64
title         6335 non-null object
text          6335 non-null object
label         6335 non-null object
dtypes: int64(1), object(3)
memory usage: 198.0+ KB


Now let us confirm whether any column of the dataset has missing data.

In [5]:
news.isnull().sum()

Unnamed: 0    0
title         0
text          0
label         0
dtype: int64

Next, rename the columns from their original names to simpler, capitalized ones.

In [6]:
news.rename(columns = {"title":"Headline","text":"Text","label":"Label"}, inplace=True)

Now let's change the Label column from strings to integers (REAL becomes 1, FAKE becomes 0).

In [7]:
news.loc[news["Label"] == 'REAL', "Label"] = 1
news.loc[news["Label"] == 'FAKE', "Label"] = 0
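
The two `.loc` assignments above leave the Label column with dtype `object`, which some scikit-learn versions may reject as an unknown label type. A minimal sketch of an alternative, assuming the column only ever holds the strings REAL and FAKE, maps the labels and casts them to integers in one step. The small `demo` frame is made-up illustration data, not the Kaggle file.

```python
import pandas as pd

# Made-up miniature frame standing in for the real `news` DataFrame
demo = pd.DataFrame({"Label": ["FAKE", "REAL", "REAL", "FAKE"]})

# Map each string label to an integer, then cast so the dtype is an
# integer type rather than object
demo["Label"] = demo["Label"].map({"REAL": 1, "FAKE": 0}).astype(int)

print(demo["Label"].tolist())
print(demo["Label"].dtype)
```

Any label outside the mapping would become NaN and make the cast fail, which is a useful early signal that the data contains unexpected values.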

Let's delete the Unnamed: 0 column because it is not used in the project.

In [8]:
news.drop(["Unnamed: 0"],axis=1, inplace=True)

In [9]:
news.head()

| | Headline | Text | Label |
|---|---|---|---|
| 0 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | 0 |
| 1 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | 0 |
| 2 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | 1 |
| 3 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | 0 |
| 4 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | 1 |


In [10]:
news.shape

(6335, 3)

## Text Pre-processing

The texts in the dataset contain stopwords and punctuation. To clean the dataset, I will remove stopwords because they are common words that appear in almost every text, so training a model on them adds little meaning. Punctuation also carries little meaning for the model, so we remove it as well before proceeding to the next stage.

Let's use Python's built-in string library to see the list of punctuation characters.

In [11]:
print(punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


Let's use the stopwords corpus from NLTK to see some English stopwords.

In [12]:
sw = stopwords.words('english')
print(sw)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Let's take an example text and see how text preprocessing works.

In [13]:
protype = 'What is Text Preprocessing? It is to do with removing punctuations, removing stop_words, returning a list!!!'
print(len(protype))
print(" ")
print(protype)

108
 
What is Text Preprocessing? It is to do with removing punctuations, removing stop_words, returning a list!!!


In [14]:
# Check characters to see if they are in punctuation
nopunctuation = [word for word in protype if word not in punctuation]

# Join the characters again to form the string.
nopunctuation = ''.join(nopunctuation)

print(nopunctuation.split())

['What', 'is', 'Text', 'Preprocessing', 'It', 'is', 'to', 'do', 'with', 'removing', 'punctuations', 'removing', 'stopwords', 'returning', 'a', 'list']


In [15]:
# Now just remove any stopwords
sw = stopwords.words('english')
clean_text = [word for word in nopunctuation.split() if word.lower() not in sw]
print(len(clean_text))
print(" ")
print(clean_text)

8
 
['Text', 'Preprocessing', 'removing', 'punctuations', 'removing', 'stopwords', 'returning', 'list']


In [16]:
def text_process(probook):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove stopwords
    3. Returns a list of the cleaned text
    """
    # Check characters to see if they are in punctuation
    nopunctuation = [word for word in probook if word not in punctuation]

    # Join the characters again to form the string
    nopunctuation = ''.join(nopunctuation)
    
    # Now just remove any stopwords
    sw = stopwords.words('english')
    return [word for word in nopunctuation.split() if word.lower() not in sw]
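
`text_process` reloads the stopword list and scans it as a list for every word, which can get slow on thousands of articles. The sketch below shows one possible speed-up: build a set once and strip punctuation with `str.translate`. The names `text_process_fast`, `STOPWORDS` and `STRIP_PUNCT` are hypothetical, and the tiny stopword set is an assumption that keeps the example self-contained; in the notebook it would be `set(stopwords.words('english'))`.

```python
from string import punctuation

# Hypothetical mini stopword set for illustration only; the notebook would
# use set(stopwords.words('english')) from NLTK instead
STOPWORDS = {"what", "is", "it", "to", "do", "with", "a"}

# Translation table that deletes every punctuation character
STRIP_PUNCT = str.maketrans("", "", punctuation)

def text_process_fast(text):
    """Remove punctuation, then drop stopwords; returns a list of tokens."""
    cleaned = text.translate(STRIP_PUNCT)
    return [word for word in cleaned.split() if word.lower() not in STOPWORDS]

sample = ("What is Text Preprocessing? It is to do with removing punctuations, "
          "removing stop_words, returning a list!!!")
print(text_process_fast(sample))
```

On the sample sentence this yields the same eight tokens as the step-by-step cells above.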

In [17]:
# show the original text
text = news.Text
text.head()

0    Daniel Greenfield, a Shillman Journalism Fello...
1    Google Pinterest Digg Linkedin Reddit Stumbleu...
2    U.S. Secretary of State John F. Kerry said Mon...
3    — Kaydee King (@KaydeeKing) November 9, 2016 T...
4    It's primary day in New York and front-runners...
Name: Text, dtype: object

In [18]:
# show the processed text
news['Text'].head(7).apply(text_process)

0    [Daniel, Greenfield, Shillman, Journalism, Fel...
1    [Google, Pinterest, Digg, Linkedin, Reddit, St...
2    [US, Secretary, State, John, F, Kerry, said, M...
3    [—, Kaydee, King, KaydeeKing, November, 9, 201...
4    [primary, day, New, York, frontrunners, Hillar...
5    [I’m, immigrant, grandparents, 50, years, ago,...
6    [Share, Baylee, Luciani, left, Screenshot, Bay...
Name: Text, dtype: object

## Data Modeling

#### Train Test Split

Now let's split the data into a training set and a testing set. We will train our model with the training set and then use the test set to evaluate the model.

In [19]:
x_train, x_test, y_train, y_test = train_test_split(news['Text'], news['Label'], test_size=0.3, random_state=7)

In [20]:
print('Training size =', len(x_train))
print('Test size =', len(x_test))

Training size = 4434
Test size = 1901
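
To see where the two sizes come from, here is a small pure-Python sketch of a shuffled 70/30 split. It is not scikit-learn's implementation and does not reproduce the same row order; `simple_split` is a hypothetical helper. Rounding the test share up is an assumption that matches the sizes printed above (30% of 6335 is 1900.5).

```python
import math
import random

def simple_split(items, test_size=0.3, seed=7):
    """Shuffle a copy of items and cut it into (train, test) parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(test_size * len(shuffled))  # test share is rounded up
    return shuffled[n_test:], shuffled[:n_test]

train, test = simple_split(range(6335))
print(len(train), len(test))
```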


#### Vectorization

A computer cannot read text directly; it works with numbers. We therefore need to convert our text into numerical feature vectors so that the model can learn from it and make predictions.

In this section we convert the raw text into vectors using TfidfVectorizer, with the analyzer set to the previously defined function.

Using the earlier protype example, split into three sentences, let's first see how TfidfVectorizer works. We use .fit_transform() to fit and transform at the same time, then print the output as an array to see how the sentences have been vectorized. Calling .get_feature_names() (.get_feature_names_out() in recent scikit-learn versions) reveals the vocabulary words in column order.

In [21]:
cv = TfidfVectorizer(min_df=1,stop_words='english')
traincheck = cv.fit_transform(["What is Text Preprocessing?", "It is to do with removing punctuations", "removing stop_words, returning a list!!!"])

This is how the computer reads the three sentences above:

In [22]:
traincheck.toarray()

array([[0.        , 0.70710678, 0.        , 0.        , 0.        ,
        0.        , 0.70710678],
       [0.        , 0.        , 0.79596054, 0.60534851, 0.        ,
        0.        , 0.        ],
       [0.52863461, 0.        , 0.        , 0.40204024, 0.52863461,
        0.52863461, 0.        ]])
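
The numbers in this array can be reproduced by hand. The sketch below assumes scikit-learn's default settings (smoothed IDF and L2 normalisation) and a token pattern of two or more word characters. The `STOP` set is an assumption containing only the stopwords that occur in these three sentences; the built-in English list is much longer.

```python
import math
import re

docs = ["What is Text Preprocessing?",
        "It is to do with removing punctuations",
        "removing stop_words, returning a list!!!"]

# Assumption: just the stopwords that appear in the sentences above
STOP = {"what", "is", "it", "to", "do", "with"}

def tokenize(doc):
    # Lowercase, keep tokens of two or more word characters, drop stopwords
    return [t for t in re.findall(r"\b\w\w+\b", doc.lower()) if t not in STOP]

tokens = [tokenize(d) for d in docs]
vocab = sorted({t for doc in tokens for t in doc})
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {}
for term in vocab:
    df = sum(term in doc for doc in tokens)
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc_tokens):
    raw = [doc_tokens.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 normalisation
    return [round(v / norm, 8) for v in raw]

print(vocab)
for doc in tokens:
    print(tfidf_row(doc))
```

The word "removing" appears in two sentences, so its IDF is lower than that of the other words, which is why it gets the smaller weight in rows two and three.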

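Let's create the vectorizer that we will use for the whole project. Note that when analyzer is a callable, scikit-learn ignores the stop_words argument, so stopword removal here comes from text_process.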

In [23]:
vectoriza = TfidfVectorizer(analyzer=text_process, stop_words="english", decode_error="ignore", max_df=0.7)

Let's fit the vectorizer on the x_train data.

In [24]:
vectoriza.fit(x_train)

TfidfVectorizer(analyzer=<function text_process at 0x00000199A2B81E18>,
        binary=False, decode_error='ignore', dtype=<class 'numpy.float64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=0.7,
        max_features=None, min_df=1, ngram_range=(1, 1), norm='l2',
        preprocessor=None, smooth_idf=True, stop_words='english',
        strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

#### PassiveAggressive Classifier

Passive Aggressive classifiers are online learning algorithms that remain passive when an example is classified correctly and turn aggressive when it is misclassified, updating and adjusting the weights. Unlike most other algorithms, they do not converge; the purpose of each update is to correct the loss while causing very little change in the norm of the weight vector.
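
The passive and aggressive behaviour can be sketched as a single update step. This is a minimal pure-Python illustration of the hinge-loss (PA-I) variant, assuming labels encoded as -1 and +1. The function name `pa_update` and the toy vectors are hypothetical, and this is not scikit-learn's implementation.

```python
def pa_update(w, x, y, C=1.0):
    """One Passive Aggressive (PA-I) step for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)       # hinge loss
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return list(w)                  # passive: leave the weights alone
    tau = min(C, loss / sq_norm)        # aggressive: step size, capped at C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
w = pa_update(w, [1.0, 0.0], +1)   # loss is 1, so the weights move
print(w)
w = pa_update(w, [1.0, 0.0], +1)   # margin is now 1, so nothing changes
print(w)
w = pa_update(w, [0.0, 2.0], -1)   # new loss, small corrective step
print(w)
```

Each update moves the weights only as far as needed to remove the loss on the current example, which is the "little change in the norm" idea from the description.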

In [25]:
model = PassiveAggressiveClassifier(max_iter=50)

In [26]:
model.fit(vectoriza.transform(x_train), y_train)

PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
              early_stopping=False, fit_intercept=True, loss='hinge',
              max_iter=50, n_iter=None, n_iter_no_change=5, n_jobs=None,
              random_state=None, shuffle=True, tol=None,
              validation_fraction=0.1, verbose=0, warm_start=False)

## Model Evaluation

Now let's see how well our model does on the test set. Let's begin by getting all the predictions.

We predict with .predict(), check the model accuracy, and finish with a classification report, which is similar to accuracy but breaks the scores down per class in table form, followed by a confusion matrix.

In [27]:
confirmation = model.predict(vectoriza.transform(x_test))
print(confirmation)

[1 0 1 ... 0 0 1]


In [28]:
why_pred = model.predict(vectoriza.transform(x_test))
score    = accuracy_score(y_test,why_pred)
print(f'Model Accuracy : {round(score*100,2)}%')

Model Accuracy : 94.63%


In [29]:
print(classification_report(y_test, why_pred))

              precision    recall  f1-score   support

           0       0.94      0.95      0.95       974
           1       0.95      0.94      0.94       927

   micro avg       0.95      0.95      0.95      1901
   macro avg       0.95      0.95      0.95      1901
weighted avg       0.95      0.95      0.95      1901



In [30]:
print(confusion_matrix(y_test, why_pred))

[[926  48]
 [ 54 873]]
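
The accuracy and the per-class scores in the classification report can be recomputed directly from this confusion matrix. The sketch below uses only the four counts printed above, reading rows as true labels and columns as predicted labels.

```python
# Confusion matrix from the output above: rows are true labels,
# columns are predicted labels (0 = fake, 1 = real)
cm = [[926, 48],
      [54, 873]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def precision(cls):
    # Of everything predicted as cls, the share that really is cls
    return cm[cls][cls] / (cm[0][cls] + cm[1][cls])

def recall(cls):
    # Of everything that really is cls, the share predicted as cls
    return cm[cls][cls] / sum(cm[cls])

print(f"accuracy = {accuracy:.4f}")
for cls in (0, 1):
    print(cls, round(precision(cls), 2), round(recall(cls), 2))
```

Here 48 fake articles were predicted as real and 54 real articles were predicted as fake, out of 1901 test articles.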


## Model Simulation
#### Sample Prediction

In [31]:
article  = ["Facebook has announced that due to the success of emoji's and how widely recognized they have become, Facebook\
            has opted to take the bold step of deleting all emoji's."]
analyse1 = vectoriza.transform(article).toarray()
model.predict(analyse1)

array([0], dtype=int64)

In [32]:
article2 = ["Muhammad Ali was deathly afraid of flying. Louisville just named its airport after him."]
analyse2 = vectoriza.transform(article2).toarray()
model.predict(analyse2)

array([1], dtype=int64)

# Flask app that serves the trained model
app = Flask(__name__)

@app.route('/')
def home():
    # page where the user submits an article
    return render_template("home.html")

@app.route('/predict', methods=['POST'])
def predict():
    # this route only accepts POST, so the form field can be read directly
    article = request.form['article']
    data    = [article]
    # vectorize with the fitted vectorizer, then predict (0 = fake, 1 = real)
    sight   = vectoriza.transform(data).toarray()
    analyse = model.predict(sight)

    return render_template('result.html', prediction=analyse)

if __name__ == '__main__':
    app.run(port=8577)