# Fake News Detection

In [None]:
import pandas as pd
import numpy as np
import itertools
from sklearn.feature_extraction.text import TfidfVectorizer

Pandas is mainly used for data analysis and associated manipulation of tabular data in DataFrames. Pandas allows importing data from various file formats such as comma-separated values, JSON, Parquet, SQL database tables or queries, and Microsoft Excel.

NumPy can be used to perform a wide variety of mathematical operations on arrays. It adds array and matrix data structures to Python that enable efficient numerical calculations.


In [None]:
df=pd.read_csv("news.csv")

By default, read_csv() treats the first row of the CSV file as the column names (header) and creates an integer index that starts at zero.

CSV stands for comma-separated values. CSV files store tabular data as plain text and are commonly used to move data, such as customer or order records, between spreadsheets and databases. Each line of the file is a record, and the fields within a record are separated by a delimiter, usually a comma (,), although another delimiter can be used when needed.
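To illustrate the point about delimiters, here is a minimal sketch that parses a small in-memory sample with read_csv(). The sample text and its column names are made up for the example and are not taken from news.csv.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data (not from news.csv).
raw = "title;label\nSome headline;FAKE\nAnother headline;REAL\n"

# sep names the field delimiter. The first row becomes the header and
# the rows receive a default integer index starting at zero.
sample = pd.read_csv(io.StringIO(raw), sep=";")

print(sample.columns.tolist())
print(sample.index.tolist())
```

The same sep argument can be passed when reading a real file whose fields are separated by something other than a comma.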

In [None]:
df.head()

,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


The head() method returns the specified number of rows, starting from the top. If no number is given, it returns the first 5 rows.

In [None]:
df.shape

(6335, 4)

The shape attribute of a DataFrame is a tuple giving its number of rows and columns. Here the data set has 6335 rows and 4 columns.

In [None]:
df.isnull().sum()

Unnamed: 0    0
title         0
text          0
label         0
dtype: int64

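The expression df.isnull().sum() returns the number of missing values in each column; chaining a second .sum() would give the total for the whole data set. Here every column reports 0, so there are no missing values.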
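The difference between the per-column count and the overall total can be seen on a toy frame. This is only a sketch; the frame `toy` and its values are invented for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing values (not the news data).
toy = pd.DataFrame({
    "title": ["a", None, "c"],
    "text": ["x", "y", np.nan],
})

per_column = toy.isnull().sum()    # Series with one count per column
total = toy.isnull().sum().sum()   # single number for the whole frame

print(per_column)
print(total)
```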

In [None]:
labels = df.label

In [None]:
labels.head()

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

df.label uses attribute access to select the label column as a Series; it is equivalent to df["label"]. See "Indexing and selecting data" in the pandas documentation for details.
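As a quick check of how the label column can be selected, the sketch below compares attribute access with bracket access on a toy frame and counts the classes. The frame and its contents are assumptions made for the example.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the news data.
toy = pd.DataFrame({
    "text": ["t1", "t2", "t3"],
    "label": ["FAKE", "REAL", "FAKE"],
})

by_attr = toy.label      # attribute access
by_key = toy["label"]    # bracket access, also works for names with spaces

same = by_attr.equals(by_key)
counts = by_key.value_counts()   # number of rows per class

print(same)
print(counts)
```

Bracket access is the safer habit, since attribute access fails for column names that clash with DataFrame methods or are not valid Python identifiers.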

In [None]:
from sklearn.model_selection import train_test_split

The train_test_split() function from sklearn.model_selection splits the data into training and test sets; the feature and label variables are passed in as arguments. test_size sets the proportion of the data that goes into the test set, and random_state fixes the shuffling so that the split is reproducible.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(df["text"], labels, test_size = 0.2, random_state = 20)

Before splitting, the data is divided into features (X), here the text column, and labels (y). train_test_split() then returns x_train, x_test, y_train, and y_test. The x_train and y_train sets are used to fit the model, while x_test and y_test are held back for evaluation. With test_size = 0.2, 20% of the rows go to the test set.
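The effect of test_size and random_state can be checked on a small example. This is a sketch on made-up data; the names `texts` and `labels_toy` are hypothetical stand-ins for the text column and labels.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: ten short documents with alternating labels.
texts = [f"document {i}" for i in range(10)]
labels_toy = ["FAKE", "REAL"] * 5

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels_toy, test_size=0.2, random_state=20
)

# Repeating the call with the same random_state gives the same split.
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(
    texts, labels_toy, test_size=0.2, random_state=20
)

print(len(x_tr), len(x_te))
print(x_te == x_te2)
```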

In [None]:
x_train.head()

4741    NAIROBI, Kenya — President Obama spoke out Sun...
2089    Killing Obama administration rules, dismantlin...
4074    Dean Obeidallah, a former attorney, is the hos...
5376      WashingtonsBlog \nCNN’s Jake Tapper hit the ...
6028    Some of the biggest issues facing America this...
Name: text, dtype: object

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

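PassiveAggressiveClassifier is an online learning algorithm, which means that it can update its weights as new data comes in. It stays passive when a sample is classified correctly with a sufficient margin and updates its weights aggressively when it is not.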
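To make the "passive" and "aggressive" behaviour concrete, here is a simplified sketch of the basic passive-aggressive update for binary labels in {-1, +1}. It is not scikit-learn's implementation: it omits the regularization parameter C, and the helper name `pa_update` is hypothetical.

```python
import numpy as np


def pa_update(w, x, y):
    """One basic passive-aggressive step for a label y in {-1, +1}.

    Assumes x is a non-zero feature vector.
    """
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss on this sample
    if loss == 0.0:
        return w  # passive: margin is already satisfied, keep the weights
    tau = loss / np.dot(x, x)  # step size that just fixes the margin
    return w + tau * y * x     # aggressive: move the weights toward y * x


w0 = np.zeros(2)
x = np.array([1.0, 1.0])

w1 = pa_update(w0, x, +1)  # margin violated, so the weights change
w2 = pa_update(w1, x, +1)  # margin now met, so the weights stay put

print(w1, w2)
```

After the first call the sample sits exactly on the margin, which is why the second call leaves the weights unchanged.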

In [None]:
# initialise a TfidfVectorizer
vector = TfidfVectorizer(stop_words='english', max_df=0.7)

In [None]:
# fit and transform the training text; only transform the test text
tf_train = vector.fit_transform(x_train)
tf_test = vector.transform(x_test)

In [None]:
# initialise a PassiveAggressiveClassifier
pac = PassiveAggressiveClassifier(max_iter=50)
pac.fit(tf_train, y_train)

PassiveAggressiveClassifier(max_iter=50)

In [None]:
# predict on the test dataset
from sklearn.metrics import accuracy_score, confusion_matrix
y_pred = pac.predict(tf_test)

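TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features. It is equivalent to CountVectorizer followed by TfidfTransformer. Above, stop_words='english' removes common English words, and max_df=0.7 ignores terms that appear in more than 70% of the documents.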
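The stated equivalence can be checked directly. The sketch below runs both routes on a tiny made-up corpus (the `docs` list is an assumption for the example) and compares the resulting matrices.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical three-document corpus.
docs = [
    "senate passes budget after long debate",
    "aliens endorse senate budget",
    "debate over aliens continues",
]

# Route 1: TF-IDF in a single step.
one_step = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Route 2: raw term counts first, then TF-IDF weighting.
counts = CountVectorizer(stop_words="english").fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)
```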

In [None]:
score = accuracy_score(y_test, y_pred)

In [None]:
print(f"Accuracy : {round(score*100,2)}%")

Accuracy : 94.87%


In [None]:
# confusion matrix
confusion_matrix(y_test, y_pred, labels=['FAKE', 'REAL'])

array([[623,  25],
       [ 40, 579]])

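A confusion matrix is a table that compares the predicted classes against the true classes. From it you can compute metrics such as accuracy, precision, and recall (an AUC-ROC curve needs decision scores, not just one matrix). In scikit-learn's layout, rows are the true classes and columns are the predicted classes, so with labels=['FAKE', 'REAL'] the output above shows 623 FAKE articles classified correctly, 25 FAKE articles predicted as REAL, 40 REAL articles predicted as FAKE, and 579 REAL articles classified correctly.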
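As a worked example, the metrics can be derived by hand from the matrix printed above. The sketch copies the four counts from that output and treats FAKE as the positive class, which is a choice made for the illustration.

```python
import numpy as np

# Counts copied from the output above.
# Rows are true classes, columns are predicted classes, order ['FAKE', 'REAL'].
cm = np.array([[623, 25],
               [40, 579]])

accuracy = np.trace(cm) / cm.sum()           # correct predictions / all predictions
precision_fake = cm[0, 0] / cm[:, 0].sum()   # of all predicted FAKE, share truly FAKE
recall_fake = cm[0, 0] / cm[0, :].sum()      # of all true FAKE, share predicted FAKE

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision_fake:.4f}")
print(f"recall:    {recall_fake:.4f}")
```

The accuracy computed this way, 1202 correct out of 1267, matches the 94.87% reported by accuracy_score.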

In [None]:
# save model
import pickle
filename = 'finalized_model.pkl'
with open(filename, 'wb') as f:  # the with block closes the file after writing
    pickle.dump(pac, f)

Here pac is the trained PassiveAggressiveClassifier (it is unrelated to the PyPAC proxy auto-config library). Pickling it to finalized_model.pkl lets it be reloaded later without retraining.

In [None]:
# save vectorizer
filename = 'vectorizer.pkl'
with open(filename, 'wb') as f:  # the with block closes the file after writing
    pickle.dump(vector, f)

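Pickle files conventionally use the extension ".pickle" or ".pkl". A pickle file stores a serialized Python object; here one file holds the trained classifier and the other holds the fitted vectorizer.
The vectorizer must be saved as well, because new text has to be transformed with the same vocabulary and IDF weights that were learned from the training data.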

The pickle.dump() function stores an object in a file. Its first argument is the object to store, and the second is a file object opened in write-binary (wb) mode. The third argument, protocol, is optional and selects the pickle protocol version.
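Saved objects are read back with pickle.load(). The sketch below shows a full round trip on a tiny stand-in model: the training texts, labels, and temporary file names are all assumptions for the example, not the notebook's data or files.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Hypothetical toy training data.
texts = [
    "aliens built the pyramids overnight",
    "senate votes on annual budget",
    "miracle cure hidden by doctors",
    "city council approves new budget",
]
labels_toy = ["FAKE", "REAL", "FAKE", "REAL"]

vec = TfidfVectorizer()
clf = PassiveAggressiveClassifier(max_iter=50, random_state=0)
clf.fit(vec.fit_transform(texts), labels_toy)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")
    vec_path = os.path.join(tmp, "vectorizer.pkl")

    # Save both objects in write-binary mode.
    with open(model_path, "wb") as f:
        pickle.dump(clf, f)
    with open(vec_path, "wb") as f:
        pickle.dump(vec, f)

    # Load them back in read-binary mode.
    with open(model_path, "rb") as f:
        loaded_clf = pickle.load(f)
    with open(vec_path, "rb") as f:
        loaded_vec = pickle.load(f)

# The reloaded pair should predict exactly like the original pair.
new_text = ["senate budget vote"]
before = clf.predict(vec.transform(new_text))
after = loaded_clf.predict(loaded_vec.transform(new_text))
print(before, after)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.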