In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/fake-news/news.csv


## Import Packages

In [2]:
import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Fake News Prediction

# Introduction


**Fake news** refers to false or misleading information presented as if it were true. This disinformation can take various forms, including written articles, images, videos, or social media posts, and it is often spread through online platforms and social media networks.

**Fake news is created for various reasons, such as:**

1. **Disinformation:** Some individuals or groups intentionally create false information to deceive the public, advance a particular agenda, or damage the reputation of a person, organization, or group.

2. **Sensationalism:** Some news outlets or websites may publish fake news to generate attention, increase website traffic, and generate ad revenue.

3. **Political Manipulation:** Fake news can be used for political purposes, such as spreading false information about political opponents, undermining the credibility of institutions, or manipulating public opinion.

4. **Hoaxes:** Some fake news stories are created purely for entertainment or as pranks, although they can still have unintended consequences.

5. **Profit:** Some people create fake news stories to make money through advertising revenue or click-throughs on their websites or social media profiles.

Combatting fake news is important for maintaining an informed and responsible society. Individuals can help by critically evaluating information sources, fact-checking claims, and using reputable news outlets. Media organizations and tech companies have also implemented various measures to identify and mitigate fake news, including fact-checking services, content moderation, and algorithm adjustments.

In [3]:
df = pd.read_csv('/kaggle/input/fake-news/news.csv')

This dataset has 4 columns. The first column holds an ID number for each news item, the second holds the title, the third holds the text of the article, and the fourth holds the label (FAKE or REAL).

In [4]:
df.head()

| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [5]:
df.shape

(6335, 4)

The dataset has **4 columns** and **6335 rows**.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6335 entries, 0 to 6334
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  6335 non-null   int64 
 1   title       6335 non-null   object
 2   text        6335 non-null   object
 3   label       6335 non-null   object
dtypes: int64(1), object(3)
memory usage: 198.1+ KB


The `Unnamed: 0` column is not needed for prediction. The `title`, `text`, and `label` columns have the object data type, and no column contains missing values.

In [7]:
df.columns

Index(['Unnamed: 0', 'title', 'text', 'label'], dtype='object')

In [8]:
labels = df.label
labels.head()

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

In [9]:
df['label'].value_counts(normalize=True) * 100

label
REAL    50.055249
FAKE    49.944751
Name: proportion, dtype: float64

The dataset contains almost the same percentage of FAKE and REAL news, so the classes are balanced.

# Model Building

The "**train-test split**" is a fundamental concept in machine learning and data analysis. It involves dividing a dataset into two subsets: a "training set" and a "test set." This division is done to assess the performance of a machine learning model and evaluate its ability to make accurate predictions on new, unseen data.

**Training set: 80%, test set: 20%**

In [10]:
x_train, x_test, y_train, y_test = train_test_split(
    df['text'], labels, test_size=0.2, random_state=7
)
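As a quick sanity check on the 80/20 split, the expected subset sizes can be worked out by hand. This is a minimal sketch that assumes scikit-learn rounds the test-set count up and gives the remainder to the training set; the row count 6335 comes from `df.shape`, and the variable names are only illustrative.

```python
import math

n_samples = 6335   # number of rows reported by df.shape
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train: {n_train}, test: {n_test}")
```

The test count obtained this way matches the total number of predictions in the confusion matrix later in the notebook.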

The **TF-IDF (Term Frequency-Inverse Document Frequency)** vectorizer is a popular text preprocessing technique used in natural language processing (NLP) and information retrieval. It's primarily used to convert a collection of text documents into numerical feature vectors that can be used for machine learning and text analysis tasks. TF-IDF is a way to represent the importance of words or terms in a document relative to a corpus of documents.

**Here's how TF-IDF works:**

**Term Frequency (TF)**: This component calculates how frequently a term (word) occurs in a specific document. It is calculated as the ratio of the number of times a term appears in a document to the total number of terms in that document. A higher TF score indicates that a term is more important within the document.

**Inverse Document Frequency (IDF)**: The IDF component measures how informative a term is across the entire corpus of documents. It is calculated as the logarithm of the total number of documents divided by the number of documents containing the term. Terms that are common across many documents have lower IDF values, while terms that appear in only a few documents have higher IDF values.

**TF-IDF Score**: The TF-IDF score for a term in a document is calculated by multiplying its TF and IDF scores. This score quantifies the importance of a term within a specific document relative to its importance across the entire corpus.

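The TF-IDF vectorizer processes a collection of documents, such as a set of articles, books, or web pages, and converts each document into a numerical vector. Each element in the vector corresponds to the TF-IDF score of a specific term within the document. Note that scikit-learn's `TfidfVectorizer` uses a variant of the formulas above by default: raw term counts for TF, a smoothed IDF, and L2 normalization of each document vector.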

TF-IDF has various applications, including text classification, information retrieval, document clustering, and more. It helps in identifying important terms or keywords within documents and is particularly useful for tasks like content recommendation, search engines, and sentiment analysis.
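To make the TF, IDF, and TF-IDF definitions above concrete, here is a minimal sketch that computes them by hand on a tiny corpus. The three sentences are made up for illustration and are not taken from the dataset, and the helper names `tf`, `idf`, and `tfidf` are hypothetical. The code follows the textbook formulas described above, so its numbers will not match the output of `TfidfVectorizer`.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the dataset).
docs = [
    "the election results were announced today",
    "the senator denied the shocking report",
    "the shocking election fraud report goes viral",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf(term, tokens):
    """Share of the document's tokens that are `term`."""
    return tokens.count(term) / len(tokens)

def idf(term):
    """Log of (number of documents / documents containing `term`)."""
    return math.log(n_docs / doc_freq[term])

def tfidf(term, tokens):
    return tf(term, tokens) * idf(term)

# Score every distinct term of the second document, highest first.
second = tokenized[1]
scores = {term: tfidf(term, second) for term in set(second)}
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {score:.4f}")
```

In this toy corpus "the" occurs in every document, so its IDF and therefore its TF-IDF score is zero, while "senator" and "denied", which occur in only one document, receive the highest scores.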

In [11]:
# Drop English stop words and ignore terms that appear in more than 70% of the documents
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

In [12]:
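# Learn the vocabulary and IDF weights from the training set only, then reuse them on the test set
tfidf_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_test = tfidf_vectorizer.transform(x_test)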

The **Passive-Aggressive Classifier** is a machine learning algorithm used for binary and multiclass classification tasks. It belongs to the family of online learning algorithms, which means it processes data one instance at a time, making it well-suited for scenarios where data is continuously arriving or where you have limited computational resources for batch processing. The algorithm is particularly useful for tasks like text classification and sentiment analysis.

The "Passive-Aggressive" name comes from the behavior of the algorithm during training. It stays passive when an example is classified correctly with a sufficient margin and becomes aggressive otherwise. In other words, it adjusts its model only when an example is misclassified or falls inside the margin.

**Here's a high-level overview of how the Passive-Aggressive Classifier works:**

**Initialization**: The algorithm starts with an initial model, typically with random or zero weights.

**Training**: For each training instance, the model makes a prediction. If the instance is classified correctly with a sufficient margin, the model remains passive, and no changes are made to the weights. Otherwise, the model becomes aggressive and updates its weights to correct the mistake.

**Loss Function**: The loss function used to update the model is typically based on the margin between the predicted class and the true class. The algorithm tries to minimize this loss by adjusting the model's parameters.

**Regularization**: To limit overfitting to noisy examples, an aggressiveness parameter (`C` in scikit-learn) bounds how large each update can be. Smaller values give smaller, more conservative updates.

The Passive-Aggressive Classifier has different variants, such as Passive-Aggressive I (PA-I) and Passive-Aggressive II (PA-II), which differ in how they limit the aggressive updates. PA-I caps the step size at a fixed maximum, while PA-II shrinks every step through a penalty term.
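The passive and aggressive behavior described above can be sketched for the simplest case: a binary problem with labels -1 and +1 and the PA-I update rule. This is an illustration only, not scikit-learn's implementation; the function name `pa1_update`, the toy feature vectors, and the default `C=1.0` are assumptions made for the example.

```python
def pa1_update(w, x, y, C=1.0):
    """One PA-I step for a binary label y in {-1, +1} (illustrative sketch)."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)        # hinge loss with margin 1
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return w                         # passive: weights stay unchanged
    tau = min(C, loss / sq_norm)         # aggressive: step size capped at C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]                           # start from zero weights
w = pa1_update(w, [1.0, 0.0], +1)        # loss > 0, so the weights move
w_same = pa1_update(w, [1.0, 0.0], +1)   # margin now satisfied, no change
w2 = pa1_update(w, [1.0, 1.0], -1)       # misclassified, so the weights move again

print(w, w_same, w2)
```

The second call leaves the weights untouched because the example is already classified with the required margin, which is the "passive" half of the name.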

In [13]:
pac = PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train, y_train)

In [14]:
y_pred = pac.predict(tfidf_test)
score = accuracy_score(y_test, y_pred)
print(f'Accuracy: {round(score * 100, 2)}%')

Accuracy: 92.98%


In [15]:
confusion_matrix(y_test, y_pred, labels=['FAKE', 'REAL'])

array([[591,  47],
       [ 42, 587]])
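The confusion matrix above can be turned into per-class metrics by hand. The sketch below copies the four counts from the output and treats FAKE as the positive class, which is a choice made for this example; rows are true labels and columns are predicted labels, in the order FAKE, REAL.

```python
# Counts copied from the confusion matrix output
# (rows = true label, columns = predicted label, order: FAKE, REAL).
cm = [[591, 47],
      [42, 587]]

# Treat FAKE as the positive class (an assumption for this example).
(tp, fn), (fp, tn) = cm
total = tp + fn + fp + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted FAKE that is truly FAKE
recall = tp / (tp + fn)      # share of true FAKE that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy * 100:.2f}%")
print(f"Precision: {precision * 100:.2f}%")
print(f"Recall:    {recall * 100:.2f}%")
print(f"F1 score:  {f1 * 100:.2f}%")
```

The accuracy computed this way agrees with the value printed by `accuracy_score`, and the two error counts (47 and 42) show that the model makes a similar number of mistakes in both directions.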

# Result: the model reaches an accuracy of about 93% (92.98%) on the test set