<a href="https://colab.research.google.com/github/Val2425/MachineLearningProject-Korea2024/blob/main/MachineLearningProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1. Introduction**


**Purpose:**

This project aims to classify news articles as real or fake using machine learning on textual content alone. Using natural language processing (NLP), we seek to detect misleading information, a crucial capability in today’s digital world.

-

**What is the Problem?**

The rapid spread of fake news, especially via social media, poses significant risks to public opinion and social trust. This project focuses on distinguishing real from fake news articles, an issue highlighted by events like the American elections, where misinformation can heavily influence public sentiment.

-

**Why is This Problem Important?**

Detecting fake news is both a timely and complex challenge. This project is intellectually engaging as it leverages NLP for a socially relevant task while remaining manageable within a binary classification framework.

-

**Expected Outcomes of the Model**

Our objective is to develop a model that reaches an F1-score of at least 0.85. Because the F1-score is the harmonic mean of precision and recall, a high value requires the model both to flag few real articles as fake and to miss few fake ones, which matches the critical nature of the task.
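
To make the target concrete, here is a minimal sketch of how an F1-score follows from prediction counts. It assumes that "fake" is treated as the positive class; the function name and the example counts are hypothetical and are not results from this project.

```python
def f1_score_from_counts(tp: int, fp: int, fn: int) -> float:
    """Compute F1 from true positives, false positives and false negatives."""
    if tp == 0:
        # No correct positive prediction: precision and recall are both 0
        return 0.0
    precision = tp / (tp + fp)  # share of flagged articles that are really fake
    recall = tp / (tp + fn)     # share of fake articles that were flagged
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 fake articles caught, 10 real ones flagged by mistake,
# 20 fake ones missed
score = f1_score_from_counts(tp=90, fp=10, fn=20)
print(round(score, 3))  # 0.857
```

With these illustrative counts, precision is 0.90 and recall is about 0.82, so the score lands just above the 0.85 target.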

# **2. Methods**

## **2.1 Import dataset from Kaggle**

We decided to use the "Fake News Detection" dataset ([Link to the dataset](https://www.kaggle.com/datasets/bhavikjikadara/fake-news-detection)).

First, we download our personal Kaggle API key (kaggle.json) to our computer. Then we upload it to the Google Colab files:

In [None]:
from google.colab import files
uploaded = files.upload()
del uploaded

Saving kaggle.json to kaggle.json


We then create a .kaggle folder in the home directory, copy kaggle.json into it, and restrict the file permissions so that only the owner can read and write it:

In [None]:
# Create the folder (-p avoids an error if it already exists)
!mkdir -p ~/.kaggle

# Copy kaggle.json to the folder
!cp kaggle.json ~/.kaggle/

# Restrict permissions to owner read/write only
!chmod 600 ~/.kaggle/kaggle.json
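
The same three steps can also be written in plain Python, which may help where shell commands are unavailable. This is an illustrative sketch only: the helper name install_kaggle_credentials and its home parameter are assumptions, not part of the Kaggle tooling, and the demonstration writes a placeholder key to a temporary directory instead of the real home folder.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def install_kaggle_credentials(src: Path, home: Path) -> Path:
    """Copy kaggle.json into <home>/.kaggle and make it owner read/write only."""
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # equivalent of mkdir -p
    target = target_dir / "kaggle.json"
    shutil.copy(src, target)                       # equivalent of cp
    os.chmod(target, 0o600)                        # equivalent of chmod 600
    return target


# Demonstration in a throwaway directory with a placeholder key (not a real one)
demo_home = Path(tempfile.mkdtemp())
demo_key = demo_home / "kaggle.json"
demo_key.write_text('{"username": "example", "key": "placeholder"}')

installed = install_kaggle_credentials(demo_key, demo_home)
mode = stat.S_IMODE(installed.stat().st_mode)
print(installed.name, oct(mode))
```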

We then paste the code provided by the download button on the dataset's Kaggle page:

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("bhavikjikadara/fake-news-detection")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/bhavikjikadara/fake-news-detection?dataset_version_number=1...


100%|██████████| 41.0M/41.0M [00:00<00:00, 59.8MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/bhavikjikadara/fake-news-detection/versions/1


Then we change the working directory to the folder that contains the two CSV files:

In [None]:
cd /root/.cache/kagglehub/datasets/bhavikjikadara/fake-news-detection/versions/1

/root/.cache/kagglehub/datasets/bhavikjikadara/fake-news-detection/versions/1
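
As an alternative to changing the working directory, the file paths can be built from the folder returned by the download step. A minimal sketch, assuming the folder contains true.csv and fake.csv as in this dataset; the helper name load_news_frames is hypothetical, and the demonstration uses a temporary folder with two tiny stand-in files rather than the real data.

```python
import os
import tempfile

import pandas as pd


def load_news_frames(dataset_dir):
    """Read true.csv and fake.csv from dataset_dir without changing directory."""
    true_df = pd.read_csv(os.path.join(dataset_dir, "true.csv"))
    fake_df = pd.read_csv(os.path.join(dataset_dir, "fake.csv"))
    return true_df, fake_df


# Stand-in files with the same four columns as the dataset (invented rows)
demo_dir = tempfile.mkdtemp()
columns = ["title", "text", "subject", "date"]
pd.DataFrame(
    [["real title", "real body", "worldnews", "December 6, 2017"]], columns=columns
).to_csv(os.path.join(demo_dir, "true.csv"), index=False)
pd.DataFrame(
    [["fake title", "fake body", "News", "Nov 5, 2017"]], columns=columns
).to_csv(os.path.join(demo_dir, "fake.csv"), index=False)

demo_true, demo_fake = load_news_frames(demo_dir)
print(demo_true.shape, demo_fake.shape)
```

In the notebook, the path variable printed by the download cell could be passed as dataset_dir.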


## **2.2 Data Preprocessing**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

true_df = pd.read_csv('true.csv')
fake_df = pd.read_csv('fake.csv')



In [None]:
# Dimensions of the dataset
print(true_df.shape)
print()
print(fake_df.shape)

(21417, 4)

(23481, 4)
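
The two shapes indicate that the classes are close to balanced. A quick check of the proportions, using only the row counts shown in the output; the variable names are illustrative, and the counts change slightly once rows with missing dates are dropped.

```python
# Row counts taken from the shapes printed above
n_true = 21417
n_fake = 23481

total = n_true + n_fake
share_true = n_true / total
share_fake = n_fake / total

print(total)  # 44898
print(f"real: {share_true:.1%}, fake: {share_fake:.1%}")  # real: 47.7%, fake: 52.3%
```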


In [None]:
# Information about the dataset
print(true_df.info())
print()
print(fake_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   title    21417 non-null  object        
 1   text     21417 non-null  object        
 2   subject  21417 non-null  object        
 3   date     21417 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(3)
memory usage: 669.4+ KB
None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23481 entries, 0 to 23480
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   title    23481 non-null  object        
 1   text     23481 non-null  object        
 2   subject  23481 non-null  object        
 3   date     11868 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(3)
memory usage: 733.9+ KB
None


In [None]:
# Some dates in fake.csv contain months written as three-letter abbreviations (e.g. Nov)
# Replace the abbreviations with full month names to get a uniform date format
# (mois is French for "months")
mois = {
    'Jan ': 'January ',
    'Feb ': 'February ',
    'Mar ': 'March ',
    'Apr ': 'April ',
    'May ': 'May ',
    'Jun ': 'June ',
    'Jul ': 'July ',
    'Aug ': 'August ',
    'Sep ': 'September ',
    'Oct ': 'October ',
    'Nov ': 'November ',
    'Dec ': 'December '
}
fake_df['date'] = fake_df['date'].replace(mois, regex=True)
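
To illustrate how a dictionary-based regex replacement behaves, here is a small self-contained sketch on made-up date strings. The sample values and the shortened mapping are assumptions for demonstration, not rows from fake.csv. Each key is treated as a regular expression and replaced wherever it occurs; the trailing space in the keys prevents a full name such as December from being matched by the Dec pattern.

```python
import pandas as pd

# Invented sample dates mixing abbreviated and full month names
dates = pd.Series(["Nov 5, 2017", "December 6, 2017", "Mar 3, 2016"])

# Shortened mapping for the demonstration
month_names = {"Nov ": "November ", "Dec ": "December ", "Mar ": "March "}

cleaned = dates.replace(month_names, regex=True)
print(cleaned.tolist())
# ['November 5, 2017', 'December 6, 2017', 'March 3, 2016']
```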

In [None]:
# We then convert the dates to datetime values; entries that cannot be parsed become NaT
true_df['date'] = pd.to_datetime(true_df['date'], errors='coerce')
fake_df['date'] = pd.to_datetime(fake_df['date'], errors='coerce')
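
The effect of errors='coerce' can be seen on a small example. The sketch below uses invented strings, not rows from the dataset: entries that cannot be parsed are turned into NaT instead of raising an error, so they can be counted afterwards.

```python
import pandas as pd

# Two valid dates and one value that is not a date (all invented)
raw = pd.Series(["December 6, 2017", "September 13, 2017", "invalid entry"])

parsed = pd.to_datetime(raw, errors="coerce")

# Number of entries that could not be parsed
n_unparsed = int(parsed.isna().sum())
print(n_unparsed)  # 1
print(parsed.dropna().dt.year.tolist())  # [2017, 2017]
```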

In [None]:
# Visualize a sample of the dataset
print(true_df.sample(5))
print()
print(fake_df.sample(5))

                                                   title  \
12960  Turkey condemns U.S. move on Jerusalem as 'irr...   
20186  Cambodian opposition blocked from holding memo...   
9108   Senators, Trump open to ban on some gun sales ...   
2070   'Fully committed' NATO backs new U.S. approach...   
18572  There was no independence referendum in Catalo...   

                                                    text       subject  \
12960  ISTANBUL (Reuters) - Turkey s foreign ministry...     worldnews   
20186  PHNOM PENH (Reuters) - Cambodia s main opposit...     worldnews   
9108   WASHINGTON (Reuters) - U.S. senators signaled ...  politicsNews   
2070   BRUSSELS (Reuters) - NATO allies on Tuesday we...  politicsNews   
18572  MADRID (Reuters) - Spain s northeastern region...     worldnews   

                      date  
12960    December 6, 2017   
20186  September 13, 2017   
9108        June 15, 2016   
2070      August 22, 2017   
18572     October 1, 2017   

                   

In [None]:
# Statistics
print(true_df.describe())
print()
print(fake_df.describe())

                                                    title  \
count                                               21417   
unique                                              20826   
top     Factbox: Trump fills top jobs for his administ...   
freq                                                   14   

                                                     text       subject  \
count                                               21417         21417   
unique                                              21192             2   
top     (Reuters) - Highlights for U.S. President Dona...  politicsNews   
freq                                                    8         11272   

                      date  
count                21417  
unique                 716  
top     December 20, 2017   
freq                   182  

                                                    title   text subject  \
count                                               23481  23481   23481   
unique              

In [None]:
# Count missing values
print(true_df.isnull().sum())
print()
print(fake_df.isnull().sum())

title      0
text       0
subject    0
date       0
dtype: int64

title      0
text       0
subject    0
date       0
dtype: int64


After replacing the abbreviations and converting the dates, 45 rows in fake_df have a null date. This is a small number compared with the 23,481 rows of fake_df, so we simply drop these rows.

In [None]:
fake_df.dropna(inplace=True)
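
A short sketch of what dropping rows with missing values does, run on a toy frame rather than the real data (the rows below are invented). Restricting dropna to the date column with subset is an optional variation, not the call used above; it makes explicit which column drives the removal.

```python
import pandas as pd

# Toy frame with one missing date (invented values)
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "date": [pd.Timestamp("2017-12-06"), pd.NaT, pd.Timestamp("2016-06-15")],
})

rows_before = len(toy)

# Drop only the rows whose date is missing, then renumber the index
cleaned_toy = toy.dropna(subset=["date"]).reset_index(drop=True)

print(rows_before - len(cleaned_toy))  # 1
```

Comparing the row count before and after is a simple way to confirm that only the expected number of rows was removed.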