# Fake News Identifier

Fake news runs rampant in today's society, and in the age of artificial intelligence it can be difficult to separate what is real from what is false. This project uses data and machine learning to assess whether a news report is authentic.

## Setup

### Step 1: Connect to Kaggle, Download the Dataset

The dataset has three columns (not including the index): the title of a news report, the content of the report, and a truth label, where '0' indicates fake news and '1' indicates real news.

In [None]:
!pip3 install kaggle -q kagglehub
!kaggle datasets download -d saurabhshahane/fake-news-classification

Dataset URL: https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification
License(s): Attribution 4.0 International (CC BY 4.0)
Downloading fake-news-classification.zip to /Users/adarsh/Desktop/vscode-workspace/fake-news-identifier/fake-news-identifier
100%|█████████████████████████████████████▉| 92.0M/92.1M [00:03<00:00, 30.5MB/s]
100%|██████████████████████████████████████| 92.1M/92.1M [00:03<00:00, 26.3MB/s]


### Step 2: Verify the Dataset

In [6]:
import zipfile
import pandas as pd

with zipfile.ZipFile('fake-news-classification.zip', 'r') as zip_file:
  zip_file.extractall('data')

!ls data

df = pd.read_csv('data/WELFake_Dataset.csv')
df.head()

WELFake_Dataset.csv


| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 0 | LAW ENFORCEMENT ON HIGH ALERT Following Threat... | No comment is expected from Barack Obama Membe... | 1 |
| 1 | 1 | | Did they post their votes for Hillary already? | 1 |
| 2 | 2 | UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO... | Now, most of the demonstrators gathered last ... | 1 |
| 3 | 3 | Bobby Jindal, raised Hindu, uses story of Chri... | A dozen politically active pastors came here f... | 0 |
| 4 | 4 | SATAN 2: Russia unvelis an image of its terrif... | The RS-28 Sarmat missile, dubbed Satan 2, will... | 1 |
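The `!ls` call only works inside a notebook shell. As a portable alternative, the archive contents can be inspected with `zipfile` itself. The sketch below is an illustration rather than part of the original notebook: the helper name `list_csv_members` is hypothetical, and the demo builds a small in-memory archive so that it runs without the Kaggle download.

```python
import io
import zipfile


def list_csv_members(archive):
    """Return the names of all .csv files inside a zip archive.

    `archive` may be a path or a file-like object; zipfile.ZipFile accepts both.
    """
    with zipfile.ZipFile(archive, "r") as zip_file:
        return [name for name in zip_file.namelist() if name.lower().endswith(".csv")]


# Demo on an in-memory archive (placeholder contents, not the real dataset).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo_zip:
    demo_zip.writestr("WELFake_Dataset.csv", "title,text,label\n")
    demo_zip.writestr("README.txt", "placeholder")
buffer.seek(0)

csv_members = list_csv_members(buffer)
print(csv_members)  # ['WELFake_Dataset.csv']
```

With the real download, `list_csv_members('fake-news-classification.zip')` would be the equivalent call.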


We can see that the dataset is accessible, so we can now move on to data preprocessing.

## Data Preprocessing

Since this dataset consists almost entirely of free text (the only numeric data column is the label marking each report as real or fake), there are no numeric outliers to address. Instead, we check for duplicate rows and count the missing values in each column, so that we know how much of the dataset any cleanup would affect.

In [None]:
duplicate_count = df.duplicated().sum()
print(f'Dataset has {duplicate_count} duplicate entries')
for column in df.columns:
    missing_count = df[column].isnull().sum()
    print(f'Column "{column}" has {missing_count} missing values')

Dataset has 0 duplicate entries
Column "Unnamed: 0" has 0 missing values
Column "title" has 558 missing values
Column "text" has 39 missing values
Column "label" has 0 missing values
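One caveat about the duplicate count: `df.duplicated()` compares entire rows, and the `Unnamed: 0` column appears to hold a distinct row number for every entry, so two otherwise identical articles would not be flagged. The sketch below restricts the comparison to the content columns. It assumes the column names printed above; the helper `count_content_duplicates` is hypothetical, and the toy frame is made up purely for illustration.

```python
import pandas as pd


def count_content_duplicates(frame, content_columns=("title", "text", "label")):
    """Count rows that repeat an earlier row when only content columns are compared."""
    return int(frame.duplicated(subset=list(content_columns)).sum())


# Made-up toy frame: rows 0 and 1 carry the same article under different row numbers.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "title": ["A", "A", "B"],
    "text": ["x", "x", "y"],
    "label": [1, 1, 0],
})

print(toy.duplicated().sum())         # 0, because the row numbers differ
print(count_content_duplicates(toy))  # 1
```

Running `count_content_duplicates(df)` on the real dataset would show whether any articles are repeated; the result is not known here.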


There are 78,098 entries in the dataset, so at most 0.76% (597 out of 78,098) of them have a missing value. This is an upper bound, because an entry that is missing both its title and its text is counted twice in the 597. As such, we can feel comfortable simply dropping any entry that contains a missing value.
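The bounds behind that estimate can be worked out directly from the counts printed above. The variable names in this sketch are illustrative; the numbers are the ones reported by the notebook.

```python
total_rows = 78_098
missing_title = 558
missing_text = 39

# Upper bound: no row is missing both fields, so the counts simply add up.
most_rows_affected = missing_title + missing_text
# Lower bound: every row with missing text is also missing its title.
fewest_rows_affected = max(missing_title, missing_text)

upper_pct = 100 * most_rows_affected / total_rows
lower_pct = 100 * fewest_rows_affected / total_rows
print(f"{lower_pct:.2f}% to {upper_pct:.2f}%")  # 0.71% to 0.76%
```

The exact number of affected rows could be obtained with `df.isnull().any(axis=1).sum()`, which counts each row once no matter how many of its fields are missing.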

In [13]:
df_cleaned = df.dropna()
df_cleaned.to_csv('data/cleaned_data.csv', index=False)

print('Cleaned data saved to cleaned_data.csv')

Cleaned data saved to cleaned_data.csv


The data preprocessing is complete, and we can now move on to the analysis phase.

## Data Analysis

Identifying trends is especially difficult with data that is predominantly free text, as this dataset is. We cannot simply compute the correlation between two columns. Instead, we have to analyze specific features of the title and content of the news reports to see whether some features are more indicative of fake news than others.

One thing we can try is assigning each news report a "sensationalism factor", based on the percentage of capital letters and exclamation points in its title, and using it as an indicator of whether the report is real or fake.
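To make the idea concrete, a per-title score might look like the sketch below. This is one possible definition, not necessarily the one the notebook uses: the function name `sensationalism_factor`, the choice to ignore whitespace, and the decision to fold capitals and exclamation points into a single ratio are all assumptions. The example titles are made up.

```python
def sensationalism_factor(title):
    """Return the share of non-space characters that are uppercase letters or '!'.

    The score lies in [0, 1]; an empty or whitespace-only title scores 0.
    """
    characters = [ch for ch in title if not ch.isspace()]
    if not characters:
        return 0.0
    flagged = sum(1 for ch in characters if ch.isupper() or ch == "!")
    return flagged / len(characters)


# Made-up titles for illustration only.
calm = sensationalism_factor("Senate passes budget bill")
loud = sensationalism_factor("UNBELIEVABLE! You must read this")
print(loud > calm)  # True
```

On the cleaned dataframe, the score could be attached as a new column with `df['title'].apply(sensationalism_factor)` and then compared across the two label values.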

In [None]:
df = pd.read_csv('data/cleaned_data.csv')
