# Fake News Detection Using Scikit-learn

Let’s start this project with a simple question: do you trust all the news you see on social media? How can we tell fake news from real news? It’s a 
tough question. Luckily, we can detect fake news using a supervised machine learning method.

Fake news is news that is not true and is deliberately designed to mislead people. It is usually spread via social media or other online
platforms. Fake news is often politically driven, intended to help or harm a political party. Such news items may contain false and
exaggerated claims and, because of certain algorithms, can trap users in a filter bubble.

In this project, we’ll use two different datasets:

1. A news dataset available on Kaggle.

2. A dataset we’ll create ourselves using the News API. We will use this API to load some data and then append that data to the Kaggle dataset.

In the end, we will use a passive-aggressive classifier to separate the fake news from the real news. The
passive-aggressive classifier is a classification algorithm in machine learning that updates the model whenever it makes a wrong prediction. 
If the prediction is correct, the model stays the same.


# Task 1: Import Modules

Let’s start the project by importing the necessary modules. To start with the News API, import the following modules:

NewsApiClient from newsapi: This class will be used to interact with the API and get the news from different sources.

random: This module will be used to generate a random ID for each news item.


In [1]:
# Task 1: Import Modules

from newsapi import NewsApiClient
import random

# Task 2: Create a Get News Method

In this task, create a method to get the news data from the News API. To interact with the News API, an API key is required.

After you get your API key, complete the following steps:

1. Call the NewsApiClient() method and pass the API key to this method.

2. Create a method to get the news data from the API.

3. Use the get_everything() method from the NewsApiClient to get data and pass the following parameters to the method:

sources: This is used to specify the source to get news from. This is a comma-separated string.

domains: This is used to restrict the domain of search. This is a comma-separated string.

from_param: This is used to specify the date of the oldest news to get from the search.

to: This is used to specify the date of the newest news to get from the search.

language: This is used to specify the language of response from the API. Possible values for this parameter are 
ar, de, en, es, fr, he, it, nl, no, pt, ru, sv, ud, zh. By default, it will use all the available languages.

sort_by: This is used to specify the order of the news to get from the search. Some possible options are relevancy, popularity, publishedAt.
By default, the method will sort the records by their publishing date.

page: This is used to specify the page of the source to get the results from. By default this will use the first page to get the results.

4. After getting the results from the API, add each article to a list and return that list.
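Step 4 can be tried without calling the API. The sketch below is only an illustration: `articles_to_rows` and `sample_response` are hypothetical names, and the sample dictionary merely imitates the `articles` list that the response is assumed to contain. Skipping articles with a missing title or content is an extra safeguard that the steps above do not ask for.

```python
import random


def articles_to_rows(response, label="REAL", rng=None):
    """Turn a News API-style response dict into [id, title, text, label] rows."""
    rng = rng or random.Random()
    rows = []
    for article in response.get("articles", []):
        title = article.get("title")
        content = article.get("content")
        # Assumption: records without a title or body are not useful for
        # training, so they are skipped.
        if not title or not content:
            continue
        rows.append([rng.randint(0, 1000), title, content, label])
    return rows


# Hypothetical response, shaped like the dictionary the client is assumed to return.
sample_response = {
    "status": "ok",
    "totalResults": 2,
    "articles": [
        {"title": "Example headline", "content": "Example body text."},
        {"title": "Headline without a body", "content": None},
    ],
}

rows = articles_to_rows(sample_response, rng=random.Random(0))
print(len(rows), rows[0][1:])
```

Passing a seeded `random.Random` instance keeps the generated IDs repeatable between runs, which can make debugging easier.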



In [2]:
# Task 2: Create a Get News Method

import os
from datetime import datetime, timedelta

# Search window: the last 30 days, formatted as zero-padded YYYY-MM-DD strings.
prev_date = datetime.today() - timedelta(days=30)
next_date = datetime.today()
p_date = prev_date.strftime('%Y-%m-%d')
c_date = next_date.strftime('%Y-%m-%d')

# Read the API key from the environment instead of hard-coding it.
# NEWS_API_KEY is an assumed variable name; set it to your own key.
newsapi = NewsApiClient(api_key=os.environ['NEWS_API_KEY'])

def getNews(sourceId):
    # Fetch one page of English articles for the given source.
    newses = newsapi.get_everything(sources=sourceId,
                                    domains='bbc.co.uk,techcrunch.com',
                                    from_param=p_date,
                                    to=c_date,
                                    language='en',
                                    sort_by='relevancy',
                                    page=2)
    newsData = []
    for news in newses['articles']:
        # Each record: random ID, title, content, and the label 'REAL'.
        record = [random.randint(0, 1000), news['title'], news['content'], 'REAL']
        newsData.append(record)
    return newsData

# Task 3: Get News Sources

The official documentation of the News API says that there are 3000 authenticated news sources.

In this task, let’s get the news sources using the News API. Do the following:

1. Get all the sources from the News API.

2. Add the ID of each source to a list.
    
3. Truncate the list to a size of 10, and get news from those sources.


In [3]:
# Task 3: Get News Sources

sources = newsapi.get_sources()
# Collect the ID of every source and keep only the first 10.
sourceList = [source['id'] for source in sources['sources']][:10]
print('New Sources: ', sourceList)

New Sources:  ['abc-news', 'abc-news-au', 'aftenposten', 'al-jazeera-english', 'ansa', 'argaam', 'ars-technica', 'ary-news', 'associated-press', 'australian-financial-review']


In [4]:
sourceList

['abc-news',
 'abc-news-au',
 'aftenposten',
 'al-jazeera-english',
 'ansa',
 'argaam',
 'ars-technica',
 'ary-news',
 'associated-press',
 'australian-financial-review']

# Task 4: Get News using Multiple Sources

After getting the news sources, complete the following steps to get the news from the API:

1. Use the getNews() method from Task 2 to get the news from the API.

2. Use a loop to pass all sources to the method.
    
3. Add all the returned news to a list.


In [5]:
# Task 4: Get News using Multiple Sources

dataList = []
for sourceId in sourceList:
    newses = getNews(sourceId)
    dataList = dataList + newses

print('Total News: ', len(dataList))

Total News:  1000


In [13]:
print(dataList[1])
print('='*100)
print(dataList[500])
print('='*100)
print(dataList[999])

[806, 'Federal prosecutors charge Ryan Routh with attempted assassination of Donald Trump', 'Federal prosecutors have officially charged Ryan Routh with attempting to assassinate former President Donald Trump, a source familiar with the matter confirmed to ABC News.\r\nThe move was expected an… [+591 chars]', 'REAL']
[979, 'In the Studio: Lenin Tamayo and Q-pop', 'Peruvian singer Lenin Tamayo has been dubbed the founder of Q-pop. He combines traditional Andean folk music with K-pop inspired instrumentation and dance. His songs mix Quechua one of Perus indigeno… [+643 chars]', 'REAL']
[131, '21/09/2024 04:01 GMT', 'The latest five minute news bulletin from BBC World Service.', 'REAL']


# Task 5: Create a DataFrame of News

In this task, create a new DataFrame using the news list. To complete the task, do the following:

1. Use the from_records() method from pandas.DataFrame to create a new DataFrame using the list.
    
2. Add new column headings to the DataFrame.
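As a small variation on the two steps above, `from_records()` can take the column headings directly through its `columns` argument. The rows below are toy data, and the column name `id` is an assumption made for readability (the notebook itself leaves that heading empty).

```python
import pandas as pd

# Toy rows in the same [id, title, text, label] layout as the news list.
rows = [
    [101, "First headline", "First body", "REAL"],
    [202, "Second headline", "Second body", "REAL"],
]

# Passing columns= names the columns while the DataFrame is being built.
toy_df = pd.DataFrame.from_records(rows, columns=["id", "title", "text", "label"])
print(toy_df.shape)
print(list(toy_df.columns))
```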


In [14]:
# Task 5: Create a DataFrame of News

import pandas as pd
df = pd.DataFrame.from_records(dataList)
df.columns = ['','title','text','label']
print(df.head())


                                                    title  \
0  842  2024 election updates: Nebraska governor ends ...   
1  806  Federal prosecutors charge Ryan Routh with att...   
2  102  Trump claims women won't 'be thinking about ab...   
3  562  Missouri executes a man for the 1998 killing o...   
4  897  Death row inmate Marcellus Williams executed b...   

                                                text label  
0  Trump is expected to return to Butler, Pennsyl...  REAL  
1  Federal prosecutors have officially charged Ry...  REAL  
2  Former President Donald Trump appears to be tr...  REAL  
3  Missouri executes a man for the 1998 killing o...  REAL  
4  Missouri death row inmate Marcellus Williams w...  REAL  


# Task 6: Load and Concat the DataFrame

Because the News API claims that all of its news is authenticated, you need another dataset that contains both fake news and real news to train the model. 
To create a DataFrame that consists of fake news and real news, complete the following steps:

1. Load the data from the news.csv file available in the same directory.

2. Add the column headings to the DataFrame.

3. Use pandas to concat both DataFrames to create a new DataFrame.
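The concatenation step can be illustrated on two toy DataFrames. This is a sketch with made-up rows, not the project data. It also passes `ignore_index=True`, an optional choice that is not part of the steps above, so that the combined DataFrame gets a fresh 0..n-1 index instead of repeating the row labels of both inputs.

```python
import pandas as pd

# Stand-ins for the Kaggle data and the News API data.
kaggle_like = pd.DataFrame(
    {"title": ["a", "b"], "text": ["x", "y"], "label": ["FAKE", "REAL"]}
)
api_like = pd.DataFrame({"title": ["c"], "text": ["z"], "label": ["REAL"]})

# Without ignore_index=True the row labels would be 0, 1, 0.
combined = pd.concat([kaggle_like, api_like], ignore_index=True)
print(combined.index.tolist())
print(combined["label"].value_counts().to_dict())
```

Checking `value_counts()` on the label column right after the concat is a quick way to see how balanced the two classes are.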


In [15]:
# Task 6: Load and Concat the DataFrame

trainData = pd.read_csv('news.csv')
trainData.columns = ['', 'title', 'text', 'label']
data = [trainData, df]
df = pd.concat(data)
print(df.head())


                                                      title  \
0   8476                       You Can Smell Hillary’s Fear   
1  10294  Watch The Exact Moment Paul Ryan Committed Pol...   
2   3608        Kerry to go to Paris in gesture of sympathy   
3  10142  Bernie supporters on Twitter erupt in anger ag...   
4    875   The Battle of New York: Why This Primary Matters   

                                                text label  
0  Daniel Greenfield, a Shillman Journalism Fello...  FAKE  
1  Google Pinterest Digg Linkedin Reddit Stumbleu...  FAKE  
2  U.S. Secretary of State John F. Kerry said Mon...  REAL  
3  — Kaydee King (@KaydeeKing) November 9, 2016 T...  FAKE  
4  It's primary day in New York and front-runners...  REAL  


In [34]:
df.label.value_counts()

REAL    4171
FAKE    3164
Name: label, dtype: int64

In [17]:
trainData.head()

Unnamed: 0,Unnamed: 1,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [19]:
df.shape

(7335, 4)

# Task 7: Import Scikit Modules 

Let’s start training a model on the DataFrame. To train the model, you need to import the following modules:

1. train_test_split from sklearn.model_selection: To create random training and testing subsets using the DataFrame.

2. CountVectorizer from sklearn.feature_extraction.text: To create a matrix of token counts from a collection of text documents.

3. PassiveAggressiveClassifier from sklearn.linear_model: To create a linear model that will be used to classify the real news from fake news.

4. accuracy_score from sklearn.metrics: To calculate the model’s accuracy by testing the model using the test data.



In [18]:
# Task 7: Import Scikit Modules

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier


# Task 8: Split the Training and Testing Data

In this task, split the DataFrame into training and testing data. For this project, use 70% of the data for training and 30% for testing. 
To split the data, use the train_test_split() method. This method accepts the following parameters:

*arrays: This parameter accepts lists, NumPy arrays, and pandas DataFrames. Pass the text column and the news labels to this parameter.

test_size: This parameter accepts a floating-point value between 0.0 and 1.0 that sets the proportion of the data reserved for testing.

random_state: This parameter accepts an integer value that seeds the random shuffle applied to the data before the split, which makes the split reproducible.
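To make the roles of test_size and random_state concrete, here is a plain-Python sketch of a seeded shuffle followed by a split. `simple_split` is a hypothetical helper written for illustration only. It is not how scikit-learn implements train_test_split(), and its shuffled order will differ from scikit-learn's for the same seed.

```python
import random


def simple_split(features, labels, test_size=0.3, random_state=None):
    """Shuffle the row indices with a seeded generator, then cut off the test share."""
    indices = list(range(len(features)))
    random.Random(random_state).shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    # Features and labels are picked with the same indices, so pairs stay aligned.
    return (pick(features, train_idx), pick(features, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


texts = [f"article {i}" for i in range(10)]
labels = ["REAL" if i % 2 == 0 else "FAKE" for i in range(10)]

train_x, test_x, train_y, test_y = simple_split(
    texts, labels, test_size=0.3, random_state=7)
print(len(train_x), len(test_x))
```

Calling the helper again with the same random_state returns the same split, which is the property that makes experiments repeatable.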


In [20]:
# Task 8: Split the Training and Testing Data

training_x, testing_x, training_y, testing_y = train_test_split(
    df['text'], df.label, test_size=0.3, random_state=7)


# Task 9: Feature Selection

After splitting the training and testing data, let’s start the feature selection using the scikit-learn method CountVectorizer(). This method 
accepts the following parameters:

1. stop_words: Stop words such as “and,” “the,” and “him” are assumed to be uninformative in representing the content of a text. Therefore they can be 
removed to avoid being interpreted as a signal for prediction. However, such words can be helpful for prediction in some cases, such as classifying 
writing style or personality.

2. max_df: When creating the vocabulary, exclude terms with a document frequency that is strictly greater than the given threshold 
(corpus-specific stop words). If the parameter is a float, it represents a proportion of documents; otherwise, it means absolute counts.
This parameter is ignored if a vocabulary is passed explicitly (that is, if vocabulary is not None).
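The effect of the two parameters can be seen in a simplified bag-of-words sketch written in plain Python. Everything here is an assumption made for illustration: the tiny `STOP_WORDS` set, the regular-expression tokenizer, and the helper names `build_vocabulary` and `count_matrix`. CountVectorizer uses its own tokenizer and a much longer English stop-word list.

```python
import re
from collections import Counter

STOP_WORDS = {"and", "the", "him", "a", "of", "in"}  # tiny stand-in list


def build_vocabulary(docs, stop_words=STOP_WORDS, max_df=0.7):
    """Tokenize, drop stop words, then drop terms that appear in too many documents."""
    tokenized = [
        [t for t in re.findall(r"[a-z0-9]+", doc.lower()) if t not in stop_words]
        for doc in docs
    ]
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    limit = max_df * len(docs)  # a float max_df is a proportion of documents
    kept = sorted(t for t, df in doc_freq.items() if df <= limit)
    return {term: i for i, term in enumerate(kept)}, tokenized


def count_matrix(tokenized, vocab):
    """One row per document, one column per vocabulary term."""
    rows = []
    for tokens in tokenized:
        counts = Counter(t for t in tokens if t in vocab)
        rows.append([counts.get(term, 0) for term in vocab])
    return rows


docs = [
    "The senate passed the budget",
    "Aliens passed the senate budget",
    "The budget shocked him and the senate",
]
vocab, tokenized = build_vocabulary(docs, max_df=0.7)
matrix = count_matrix(tokenized, vocab)
print(list(vocab))
print(matrix)
```

In this toy corpus, "senate" and "budget" appear in all three documents, which is more than 70% of them, so they are dropped as corpus-specific stop words.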


In [21]:
# Task 9: Feature Selection

count_vectorizer = CountVectorizer(stop_words='english', max_df=0.7)
feature_train = count_vectorizer.fit_transform(training_x)
feature_test = count_vectorizer.transform(testing_x)


# Task 10: Initialise and Apply the Classifier

After the feature selection, let’s apply the classifier to the training data. To apply the classifier to the data, use the following steps:

1. Initialize the PassiveAggressiveClassifier. This is an algorithm from the online learning family of machine learning. It stays passive
when a prediction is correct, keeping the model as it is, and turns aggressive when a prediction is incorrect, updating the model to correct the mistake. 
This method accepts the following parameters:

max_iter: This defines the maximum number of passes over the training data. On each pass, the classifier checks its predictions and updates itself.
    
2. Call the fit() method from the PassiveAggressiveClassifier and pass the features of the training data along with the labels.
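The passive and aggressive behaviour described above can be sketched for a single training example. The code below follows the textbook PA-I update with hinge loss and labels encoded as +1 or -1. The name `pa_update` is hypothetical, and this is an illustration of the idea rather than scikit-learn's internal code. Note that in the textbook formulation the model is also adjusted when a prediction is correct but falls inside the margin, which is slightly broader than "only when the prediction is wrong".

```python
def pa_update(weights, x, y, C=1.0):
    """One passive-aggressive (PA-I) step for a single example; y is +1 or -1."""
    margin = y * sum(w * xi for w, xi in zip(weights, x))
    loss = max(0.0, 1.0 - margin)  # hinge loss
    if loss == 0.0:
        return weights  # passive: correct with enough margin, keep the model
    norm_sq = sum(xi * xi for xi in x)
    if norm_sq == 0.0:
        return weights  # an all-zero example gives no direction to move in
    tau = min(C, loss / norm_sq)  # aggressive: step size, capped by C
    return [w + tau * y * xi for w, xi in zip(weights, x)]


weights = [0.0, 0.0]
weights = pa_update(weights, [1.0, 0.0], +1)    # loss > 0, so the model moves
unchanged = pa_update(weights, [2.0, 0.0], +1)  # confident and correct, no change
print(weights, unchanged)
```

The cap `C` limits how far a single example can move the weights, which helps when some training labels are noisy.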



In [22]:
# Task 10: Initialise and Apply the Classifier

classifier = PassiveAggressiveClassifier(max_iter=50)
classifier.fit(feature_train, training_y)


# Task 11: Test the Classifier

After training the classifier, let’s use it to predict the results on the testing data. The testing data is the remaining 30% of the
complete DataFrame we made earlier. To use the classifier, do the following:

1. Use the predict() method from the classifier and pass the testing feature of the dataset.

2. Use the accuracy_score() method to get the model’s score. This method accepts the following parameters:

y_true: The correct labels of the testing data.
    
y_pred: The results from the model.
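For intuition, the accuracy is the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up labels. `simple_accuracy` is a hypothetical helper for illustration, not a replacement for accuracy_score().

```python
def simple_accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


true_labels = ["FAKE", "REAL", "REAL", "FAKE"]
predicted_labels = ["FAKE", "REAL", "FAKE", "FAKE"]
accuracy = simple_accuracy(true_labels, predicted_labels)
print(accuracy)  # 3 of 4 predictions match, so this prints 0.75
```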


In [23]:
# Task 11: Test the classifier

prediction = classifier.predict(feature_test)
score = accuracy_score(testing_y, prediction)

print("Accuracy: ", score*100)

Accuracy:  91.09495683780099


# Task 12: Load the Test Data

After completing the model, try it on test data which is not from any actual source but specifically created to verify the model. This DataFrame 
contains some fake news and some real news to verify the model. To create a DataFrame that consists of fake and real news,
complete the following steps:

1. Load the data from the test_data.csv file available in the same directory.
    
2. Print the head of the DataFrame to see the dataset.


In [28]:
# Task 12: Load the Test Data

test_data = pd.read_csv('test_data.csv')
test_labels = test_data.label
test_data.head()


Unnamed: 0.1,Unnamed: 0,Unnamed: 1,title,text,label
0,914,7014,Trumps Hollywood Walk of Fame star Destroyed w...,Trump's Hollywood Walk of Fame star Destroyed ...,FAKE
1,4058,6440,Corporate Army smashes Dakota barbarians near ...,Corporate Army smashes Dakota barbarians near ...,FAKE
2,4982,6125,German Panzers to Rumble Once More Along Russi...,Citizen journalism with a punch German Panzers...,FAKE
3,800,8389,Contaminated Food from China Now Entering the ...,Contaminated Food from China Now Entering the ...,FAKE
4,4871,976,Cruz likely to block Trump on a second ballot ...,Republican presidential candidate Ted Cruz is ...,REAL


# Task 13: Select Features and Get Predictions

Let’s select the features from the test data and get a prediction based on those features. To complete this task, do the following:

1. Use the transform() method of the count_vectorizer fitted in Task 9 to select features from the test_data.

2. After getting the features from the test_data, use the predict() method of the classifier, as in Task 11, to get the predictions.
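The reason for calling transform() rather than fit_transform() here is that new text has to be mapped onto the vocabulary learned from the training data. The toy example below, with made-up sentences, is a sketch of that behaviour: a word the vectorizer never saw during fitting is simply ignored, and the number of columns stays the same.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(
    ["senate passes budget", "aliens land in senate"]
)

# "praise" was not seen during fitting, so it contributes no column and no count.
new_matrix = vectorizer.transform(["aliens praise budget"])

print(train_matrix.shape, new_matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

Refitting the vectorizer on the test data would produce a different vocabulary, and the classifier's learned weights would no longer line up with the feature columns.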

   

In [29]:
# Task 13: Select Features and Get Predictions

test_data_feature = count_vectorizer.transform(test_data['text'])
prediction = classifier.predict(test_data_feature)

In [40]:
pd.DataFrame(test_data_feature)

Unnamed: 0,0
0,"(0, 109)\t3\n (0, 689)\t3\n (0, 856)\t3\n ..."
1,"(0, 689)\t1\n (0, 700)\t1\n (0, 909)\t1\n ..."
2,"(0, 1)\t1\n (0, 110)\t1\n (0, 673)\t1\n (..."
3,"(0, 176)\t1\n (0, 689)\t1\n (0, 1036)\t1\n..."
4,"(0, 1)\t1\n (0, 176)\t1\n (0, 214)\t1\n (..."
5,"(0, 48)\t1\n (0, 689)\t1\n (0, 1933)\t2\n ..."
6,"(0, 214)\t1\n (0, 343)\t1\n (0, 1865)\t1\n..."
7,"(0, 193)\t2\n (0, 344)\t1\n (0, 689)\t1\n ..."
8,"(0, 110)\t1\n (0, 211)\t1\n (0, 245)\t2\n ..."
9,"(0, 508)\t1\n (0, 689)\t1\n (0, 2302)\t1\n..."


# Task 14: Evaluate the Predictions

After getting the predictions from the classifier, evaluate the predictions using the following methods:

1. Print all the predictions and test_labels side by side to visualize the results.
    
2. Use the accuracy_score() method from Task 11 to print the score of the classifier on the data.
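Besides printing the labels side by side, it can help to count how often each (actual, predicted) pair occurs, since a single accuracy figure does not show which class the mistakes fall in. The sketch below uses made-up labels, and `confusion_counts` is a hypothetical helper name.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count every (actual, predicted) label pair."""
    return Counter(zip(y_true, y_pred))


actual = ["FAKE", "FAKE", "REAL", "REAL", "FAKE"]
predicted = ["FAKE", "REAL", "REAL", "REAL", "FAKE"]

counts = confusion_counts(actual, predicted)
for (truth, guess), n in sorted(counts.items()):
    print(f"actual={truth} predicted={guess}: {n}")
```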


In [39]:
for i in range(len(test_labels)):
    print(test_labels[i], prediction[i])

score = accuracy_score(test_labels, prediction)
print("Accuracy: ", score*100, "%")

FAKE FAKE
FAKE FAKE
FAKE FAKE
FAKE FAKE
REAL REAL
FAKE FAKE
REAL REAL
FAKE FAKE
FAKE FAKE
FAKE FAKE
FAKE FAKE
REAL REAL
FAKE FAKE
REAL REAL
FAKE FAKE
REAL REAL
REAL REAL
FAKE FAKE
REAL REAL
FAKE FAKE
Accuracy:  100.0 %
