<a href="https://colab.research.google.com/github/Val2425/MachineLearningProject-Korea2024/blob/main/MachineLearningProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1. Introduction**


**Purpose:**

This project aims to classify news articles as real or fake using machine learning on textual content alone. Through natural language processing (NLP), we seek to detect misleading information—a crucial skill in today’s digital world.


**What is the Problem?**

The rapid spread of fake news, especially via social media, poses significant risks to public opinion and social trust. This project focuses on distinguishing real from fake news articles, an issue highlighted by events like the American elections, where misinformation can heavily influence public sentiment.


**Why is This Problem Important?**

Detecting fake news is both a timely and complex challenge. This project is intellectually engaging as it leverages NLP for a socially relevant task while remaining manageable within a binary classification framework.


**Expected Outcomes of the Model**

Our objective is to develop a model with an F1-score of at least 0.85, balancing precision and recall to effectively minimize misclassifications. This metric underscores the model’s aim to accurately identify fake news while reducing errors, aligning with the critical nature of the task.
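
To make the target concrete, the F1-score is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts in plain Python; the function name `f1_from_counts` and the counts are hypothetical, chosen only to illustrate what a score of 0.85 looks like.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return the F1-score (harmonic mean of precision and recall)."""
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, not results from this project.
score = f1_from_counts(tp=850, fp=150, fn=150)
print(round(score, 2))  # prints 0.85
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach 0.85 by maximizing precision while neglecting recall, or the reverse.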

# **2. Methods**

## **2.1 Import dataset from Kaggle**

We decided to use the "Fake News Detection" dataset ([Link to the dataset](https://www.kaggle.com/datasets/bhavikjikadara/fake-news-detection)).

First, we download our personal Kaggle API key (`kaggle.json`) to our computer, then upload it to the Google Colab session:

In [2]:
from google.colab import files
uploaded = files.upload()
del uploaded

Saving kaggle.json to kaggle.json


We then create a `~/.kaggle` folder, copy `kaggle.json` into it, and restrict the file's permissions so that only the owner can read and write it:

In [3]:
# Create the folder (-p avoids an error if it already exists)
!mkdir -p ~/.kaggle

# Copy kaggle.json into the folder
!cp kaggle.json ~/.kaggle/

# Restrict permissions to owner read/write
!chmod 600 ~/.kaggle/kaggle.json
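
The same three steps can also be sketched in pure Python, which avoids shell commands and makes the setup safe to re-run. This is only an illustration under stated assumptions: `install_kaggle_credentials` is a hypothetical helper, not part of the Kaggle tooling.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_credentials(src: Path, home: Path) -> Path:
    """Copy kaggle.json into <home>/.kaggle and make it owner read/write only."""
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
    target = target_dir / "kaggle.json"
    shutil.copyfile(src, target)                   # like `cp`
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)      # like `chmod 600`
    return target


# Example call in Colab, assuming kaggle.json was uploaded to the working directory:
# install_kaggle_credentials(Path("kaggle.json"), Path.home())
```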

We then paste the code that Kaggle provides when we click the dataset's download button:

In [4]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("bhavikjikadara/fake-news-detection")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/bhavikjikadara/fake-news-detection?dataset_version_number=1...


100%|██████████| 41.0M/41.0M [00:00<00:00, 47.3MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/bhavikjikadara/fake-news-detection/versions/1


We then change into the directory that contains the two CSV files:

In [5]:
cd /root/.cache/kagglehub/datasets/bhavikjikadara/fake-news-detection/versions/1

/root/.cache/kagglehub/datasets/bhavikjikadara/fake-news-detection/versions/1
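
As an alternative to changing the working directory, the file paths can be built from the `path` value returned by `kagglehub.dataset_download`. The sketch below uses only the standard library; `dataset_file` is a hypothetical helper introduced for illustration.

```python
import os


def dataset_file(base_dir: str, filename: str) -> str:
    """Return the full path to a dataset file, failing early if it is missing."""
    full_path = os.path.join(base_dir, filename)
    if not os.path.isfile(full_path):
        raise FileNotFoundError(f"{filename} not found in {base_dir}")
    return full_path


# Example use with the variables from the cells above:
# true_df = pd.read_csv(dataset_file(path, "true.csv"))
```

Building explicit paths keeps the notebook independent of the current directory, so cells can be re-run in any order.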


## **2.2 Dataset Description**

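The dataset consists of two CSV files, `true.csv` (real articles) and `fake.csv` (fake articles). Each file has four columns: `title`, `text`, `subject`, and `date`.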

## **2.3 Data Preprocessing**

In [25]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

true_df = pd.read_csv('true.csv')
fake_df = pd.read_csv('fake.csv')

### **2.3.1 Initial cleaning**

In [26]:
# Some dates in fake.csv use three-letter month abbreviations (e.g. "Nov").
# Expand them so that all dates share a uniform format.
mois = {
    'Jan ': 'January ',
    'Feb ': 'February ',
    'Mar ': 'March ',
    'Apr ': 'April ',
    'May ': 'May ',
    'Jun ': 'June ',
    'Jul ': 'July ',
    'Aug ': 'August ',
    'Sep ': 'September ',
    'Oct ': 'October ',
    'Nov ': 'November ',
    'Dec ': 'December '
}
fake_df['date'] = fake_df['date'].replace(mois, regex=True)
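
The mapping above relies on a trailing space to avoid touching full month names. A variant that uses regex word boundaries instead is sketched below with the standard `re` module. It is an alternative for illustration, not the notebook's method, and `expand_month` is a hypothetical name.

```python
import re

# "May" is omitted because its abbreviation and full name are identical.
MONTHS = {
    "Jan": "January", "Feb": "February", "Mar": "March", "Apr": "April",
    "Jun": "June", "Jul": "July", "Aug": "August", "Sep": "September",
    "Oct": "October", "Nov": "November", "Dec": "December",
}

# \b ... \b matches an abbreviation only when it stands alone as a word,
# so "Mar" inside "March" is left untouched.
_ABBREV = re.compile(r"\b(" + "|".join(MONTHS) + r")\b")


def expand_month(date_str: str) -> str:
    """Replace a three-letter month abbreviation with the full month name."""
    return _ABBREV.sub(lambda m: MONTHS[m.group(1)], date_str)


print(expand_month("Nov 5, 2017"))    # prints November 5, 2017
print(expand_month("March 3, 2017"))  # prints March 3, 2017
```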

In [27]:
# Convert the date strings to datetime values; unparseable entries become NaT
true_df['date'] = pd.to_datetime(true_df['date'], errors='coerce')
fake_df['date'] = pd.to_datetime(fake_df['date'], errors='coerce')

In [28]:
# Count missing values
print(true_df.isnull().sum())
print()
print(fake_df.isnull().sum())

title      0
text       0
subject    0
date       0
dtype: int64

title       0
text        0
subject     0
date       45
dtype: int64


After expanding the abbreviations and converting the column, 45 dates in fake_df could not be parsed and are now null. Since 45 is a small fraction of the more than 23,000 rows, we simply drop them.
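
Before dropping these rows, it can be useful to look at the raw strings that failed to parse. The sketch below does this on a toy frame with invented values. Keeping a copy of the original strings in a `date_raw` column is an assumption made for illustration, not something the notebook does.

```python
import pandas as pd

# Toy stand-in for fake_df; all values are invented.
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "date": ["December 31, 2017", "not a date", "January 15, 2016"],
})

# Keep the raw strings so that unparseable entries can still be inspected.
toy["date_raw"] = toy["date"]
toy["date"] = pd.to_datetime(toy["date"], errors="coerce")

# Rows where parsing failed: the parsed value is NaT although a raw string existed.
failed = toy[toy["date"].isna() & toy["date_raw"].notna()]
print(failed[["title", "date_raw"]])
```

Inspecting `failed` shows whether the lost rows are genuine articles with odd date formats or malformed records that are safe to discard.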

In [29]:
# Drop rows with missing values in fake_df
fake_df.dropna(inplace=True)

In [32]:
# See how many duplicates there are in each dataset

duplicates = true_df.duplicated().sum()
print(duplicates)

duplicates = fake_df.duplicated().sum()
print(duplicates)

0
0


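We then drop the duplicate rows, keeping only the first occurrence of each. The counts above read 0, most likely because this cell (In [32]) was last executed after the duplicates had already been dropped in the next cell (In [31]).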

In [31]:
# Erase duplicates
true_df.drop_duplicates(keep='first', inplace=True)
fake_df.drop_duplicates(keep='first', inplace=True)

In [39]:
# Information about the datasets
print(true_df.info())
print()
print(fake_df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 21211 entries, 0 to 21416
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   title    21211 non-null  object        
 1   text     21211 non-null  object        
 2   subject  21211 non-null  object        
 3   date     21211 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(3)
memory usage: 828.6+ KB
None

<class 'pandas.core.frame.DataFrame'>
Index: 23433 entries, 0 to 23480
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   title    23433 non-null  object        
 1   text     23433 non-null  object        
 2   subject  23433 non-null  object        
 3   date     23433 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(3)
memory usage: 915.4+ KB
None


In [44]:
# Visualize a sample of the dataset
print(true_df.sample(5))
print()
print(fake_df.sample(5))

                                                  title  \
5275  In sweeping move, Trump puts regulation monito...   
6380  Tillerson says China should be barred from Sou...   
6372  AT&T chief executive, Trump meet amid planned ...   
3872    Senate votes to confirm Gottlieb as head of FDA   
9112  Clinton would use executive action to end tax ...   

                                                   text       subject  \
5275   President Donald Trump signed an executive or...  politicsNews   
6380   U.S. President-elect Donald Trump’s nominee f...  politicsNews   
6372   AT&T Inc (T.N) Chief Executive Randall Stephe...  politicsNews   
3872   The U.S. Senate voted on Tuesday to confirm D...  politicsNews   
9112   U.S. Democratic presidential candidate Hillar...  politicsNews   

           date  
5275 2017-02-24  
6380 2017-01-11  
6372 2017-01-12  
3872 2017-05-09  
9112 2016-06-15  

                                                   title  \
12930  KEITH SCOTT’S BROTHER Tells Ch

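The `text` column of true_df contains the marker "(Reuters) - ". Because this marker identifies the source rather than the content, a classifier could key on it instead of the article itself and overfit, so we remove it.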

In [None]:
# Remove everything up to and including the first "(Reuters) - " in true_df
true_df['text'] = true_df['text'].str.replace(r'^.*?\(Reuters\) - ', '', regex=True)
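
The effect of this pattern can be checked on plain strings with the standard `re` module. The sketch below uses a non-greedy `.*?`, so only the text up to the first marker is removed. The sample sentences are invented and `strip_reuters_prefix` is a hypothetical helper.

```python
import re

# Non-greedy prefix: everything up to and including the first "(Reuters) - ".
REUTERS_PREFIX = re.compile(r"^.*?\(Reuters\) - ")


def strip_reuters_prefix(text: str) -> str:
    """Remove a leading dateline such as 'WASHINGTON (Reuters) - ' if present."""
    return REUTERS_PREFIX.sub("", text, count=1)


samples = [
    "WASHINGTON (Reuters) - The Senate voted on Tuesday.",
    "No dateline in this sentence.",
]
for sample in samples:
    print(strip_reuters_prefix(sample))
# prints:
# The Senate voted on Tuesday.
# No dateline in this sentence.
```

Texts without the marker pass through unchanged, so the same function could be applied to both datasets without harm.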

## **2.4 Exploratory Data Analysis (EDA)**

In [15]:
# Statistics
print(true_df.describe())
print()
print(fake_df.describe())

                                date
count                          21211
mean   2017-06-02 21:00:01.527509248
min              2016-01-13 00:00:00
25%              2017-01-27 00:00:00
50%              2017-09-12 00:00:00
75%              2017-11-02 00:00:00
max              2017-12-31 00:00:00

                                date
count                          23433
mean   2016-10-07 05:42:54.062220032
min              2015-03-31 00:00:00
25%              2016-04-06 00:00:00
50%              2016-10-14 00:00:00
75%              2017-04-14 00:00:00
max              2017-12-31 00:00:00


The classes are fairly balanced: with 21,211 real and 23,433 fake articles after cleaning, the split is about 47.5% to 52.5%, which suits binary classification.
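
The class split can be recomputed from the row counts printed by `describe()` above. The exact percentages depend on which cleaning steps have already been applied; the short sketch below uses the post-cleaning counts.

```python
# Row counts after cleaning, taken from the describe() output above.
n_true = 21211
n_fake = 23433

total = n_true + n_fake
pct_true = round(100 * n_true / total, 1)
pct_fake = round(100 * n_fake / total, 1)

print(pct_true, pct_fake)  # prints 47.5 52.5
```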

In [16]:
# Analyze the length of the titles (in characters)
print(true_df['title'].apply(len).describe())
print()
print(fake_df['title'].apply(len).describe())

count    21211.000000
mean        64.658291
std          9.162659
min         26.000000
25%         59.000000
50%         64.000000
75%         70.000000
max        133.000000
Name: title, dtype: float64

count    23433.000000
mean        94.187599
std         27.173264
min         15.000000
25%         77.000000
50%         90.000000
75%        105.000000
max        286.000000
Name: title, dtype: float64
