<a target="_blank" href="https://colab.research.google.com/github/Wook22/Fake_News_Classification/blob/main/Fake_News_Analysis.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# **Data Analysis on Fake News**

### Abstract




## Introduction

Have you ever questioned whether the news you see is real? Since the internet became widespread, fake news has increasingly been used as a tool to manipulate public opinion, but the problem predates it. One of the best-known examples is the Nayirah testimony. On October 10, 1990, a 15-year-old Kuwaiti girl gave false testimony before the United States Congressional Human Rights Caucus. She claimed to have been a volunteer nurse at a Kuwaiti hospital during the Iraqi invasion and said she had witnessed Iraqi soldiers removing premature babies from incubators, stealing the equipment, and leaving the babies to die on the floor. This emotional account played a significant role in shaping public support and helped President George H. W. Bush justify military action against Iraq.

However, according to The New York Times, her account "was shown to be almost certainly false by an ABC reporter, John Martin, in March 1991."

In January 1992, it was revealed that she had never been a nurse and was, in fact, the daughter of Saud Nasser Al-Saud Al-Sabah, the Kuwaiti ambassador to the United States at the time of her testimony. This raises an important question: what should we believe, and what should we not? In an age where misinformation spreads quickly, it is increasingly difficult to tell the two apart.

In this project, I develop a model that predicts whether a news article is real or fake based on how often selected words appear in its text.





## Data Description



In [32]:
# Load the real and fake article sets (files are expected in the working directory)
df_real <- read.csv("BuzzFeed_real_news_content.csv")
df_fake <- read.csv("BuzzFeed_fake_news_content.csv")

In [8]:
colnames(df_real)

In [9]:
colnames(df_fake)

In [33]:
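# Label the source of each article: 0 = real, 1 = fake
df_real["real_fake"] <- 0
df_fake["real_fake"] <- 1

# Stack both sets into a single data frame
buzzfeed <- rbind(df_real, df_fake)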
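
The labeling and stacking step can be tried on toy data before running it on the full files. This is a minimal sketch: `toy_real` and `toy_fake` are hypothetical stand-ins for the two BuzzFeed data frames, and the `setequal()` check plays the role of comparing the two `colnames()` outputs by eye.

```r
# Hypothetical toy rows standing in for the two BuzzFeed files
toy_real <- data.frame(id = c("Real_1", "Real_2"), text = c("a b", "c d"))
toy_fake <- data.frame(id = "Fake_1", text = "e f")

# rbind() matches data-frame columns by name, so both inputs
# need the same set of column names
stopifnot(setequal(colnames(toy_real), colnames(toy_fake)))

toy_real$real_fake <- 0  # 0 = real
toy_fake$real_fake <- 1  # 1 = fake
toy_all <- rbind(toy_real, toy_fake)

# Class balance of the combined data
label_counts <- table(toy_all$real_fake)
print(label_counts)
```

Running `table(buzzfeed$real_fake)` on the real data in the same way shows how many real and fake articles the combined set contains.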

In [14]:
install.packages(c("tidytext", "dplyr", "stringr", "tidyr"))

# Load necessary libraries
library(dplyr)
library(stringr)
library(tidytext)
library(tidyr)






In [44]:
# Create temporary dataset
df <- buzzfeed

# Tokenize text and count word frequencies
all_words <- df %>%
  select(text) %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

# Function to estimate syllables by counting vowel groups
estimate_syllables <- function(word) {
  str_count(tolower(word), "[aeiouy]+")
}

# Add syllable counts
all_words <- all_words %>%
  mutate(syllables = estimate_syllables(word)) %>%
  filter(syllables > 2)

# Take the 20 most frequent words with more than 2 estimated syllables
# (the variable keeps the name top10_words, but it holds 20 words)
top10_words <- head(all_words$word, 20)

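# Function to count occurrences of a word in a text, case-insensitively.
# Note: fixed() does substring matching, so "president" is also counted
# inside "presidential" or "presidents".
count_word <- function(text, word) {
  str_count(tolower(text), fixed(tolower(word)))
}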

# Create new columns for each top word
for (w in top10_words) {
  df[[paste0("word_", w)]] <- sapply(df$text, count_word, word = w)
}
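
Two properties of the helpers above are easy to miss, and the sketch below illustrates them on made-up inputs. It uses base-R equivalents of the `str_count()` calls so that it runs without extra packages. `count_whole_word` is a hypothetical alternative, not part of the original analysis, and it assumes the word contains no regex metacharacters.

```r
# Base-R equivalent of str_count(tolower(word), "[aeiouy]+")
estimate_syllables <- function(word) {
  w <- tolower(word)
  lengths(regmatches(w, gregexpr("[aeiouy]+", w)))
}

words <- c("president", "debate", "every", "the")
syllable_estimates <- setNames(estimate_syllables(words), words)
print(syllable_estimates)
# The count is a heuristic: a silent final "e" (as in "debate") and a
# trailing "y" (as in "every") each count as a separate vowel group.

# Base-R equivalent of str_count(tolower(text), fixed(tolower(word)))
count_word <- function(text, word) {
  txt <- tolower(text)
  lengths(regmatches(txt, gregexpr(tolower(word), txt, fixed = TRUE)))
}

# Hypothetical whole-word variant using \b word boundaries
count_whole_word <- function(text, word) {
  txt <- tolower(text)
  pattern <- paste0("\\b", tolower(word), "\\b")
  lengths(regmatches(txt, gregexpr(pattern, txt, perl = TRUE)))
}

sentence <- "The president gave a presidential address."
substring_hits <- count_word(sentence, "president")         # also hits "presidential"
whole_word_hits <- count_whole_word(sentence, "president")  # exact word only
print(c(substring = substring_hits, whole_word = whole_word_hits))
```

Because the top words are selected from tokenized whole-word counts while the per-article columns use substring counts, the two numbers can differ for words that appear inside longer words.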

In [45]:
# Original columns: 'id', 'title', 'text', 'url', 'top_img', 'authors', 'source',
# 'publish_date', 'movies', 'images', 'canonical_link', 'meta_data'

# Remove columns by name
df_drop <- df %>%
  select(-text, -url, -top_img, -movies, -images, -canonical_link, -meta_data)
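
The `publish_date` column that is kept here stores values such as `{'$date': 1474528230000}` or an empty string. The sketch below shows one way such strings might be converted to date-times, assuming the number is milliseconds since the Unix epoch. That unit is an assumption about the data, and `parse_publish_date` is a hypothetical helper, not part of the original analysis.

```r
# Example raw values in the format seen in this dataset
raw_dates <- c("{'$date': 1474528230000}", "", "{'$date': 1474476044000}")

# Hypothetical helper: keep only the digits, then treat them as
# milliseconds since 1970-01-01 UTC (assumed unit)
parse_publish_date <- function(x) {
  ms <- suppressWarnings(as.numeric(gsub("[^0-9]", "", x)))  # "" becomes NA
  as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "UTC")
}

parsed <- parse_publish_date(raw_dates)
print(parsed)
```

Missing dates stay `NA`, so they can be filtered or imputed later rather than silently turning into an epoch date.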

In [46]:
head(df_drop)

| | id | title | authors | source | publish_date | real_fake | word_hillary | word_president | word_debate | word_obama | ⋯ | word_election | word_republican | word_foundation | word_according | word_every | word_september | word_united | word_another | word_political | word_candidate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | ⋯ | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | Real_1-Webpage | Another Terrorist Attack in NYC…Why Are we STILL Being Politically Correct – Eagle Rising | View All Posts,Leonora Cravotta | http://eaglerising.com | {'$date': 1474528230000} | 0 | 1 | 1 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 2 | 1 |
| 2 | Real_10-Webpage | Donald Trump: Drugs a 'Very, Very Big Factor' in Charlotte Protests | More Candace,Adam Kelsey,Abc News,More Adam | http://abcn.ws | | 0 | 0 | 1 | 0 | 0 | ⋯ | 0 | 2 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | Real_11-Webpage | Obama To UN: ‘Giving Up Liberty, Enhances Security In America…’ [VIDEO] | Cassy Fiano | http://rightwingnews.com | {'$date': 1474476044000} | 0 | 0 | 0 | 0 | 9 | ⋯ | 0 | 0 | 0 | 0 | 2 | 0 | 7 | 0 | 0 | 0 |
| 4 | Real_12-Webpage | Trump vs. Clinton: A Fundamental Clash over How the Economy Works | Jack Shafer,Erick Trickey,Zachary Karabell | http://politi.co | {'$date': 1474974420000} | 0 | 3 | 1 | 1 | 1 | ⋯ | 3 | 1 | 1 | 0 | 6 | 0 | 5 | 1 | 2 | 1 |
| 5 | Real_13-Webpage | President Obama Vetoes 9/11 Victims Bill, Setting Up Showdown With Congress | John Parkinson,More John,Abc News,More Alexander | http://abcn.ws | | 0 | 1 | 7 | 0 | 5 | ⋯ | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 6 | Real_14-Webpage | CHAOS! NC Protest MOB Ambushes Female Truck Driver, Loots Truck, Sets Cargo On Fire – No One Helps!? [VIDEO] | Cassy Fiano | http://rightwingnews.com | {'$date': 1474473199000} | 0 | 1 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 |
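
With one `word_*` count column per selected word, the table above has the shape a classifier needs. The sketch below illustrates how such columns might feed a model. Logistic regression via `glm()` is an assumption made here for illustration, since the notebook does not yet name a model, and the data are simulated rather than taken from the BuzzFeed files.

```r
set.seed(1)

# Simulated stand-in for df_drop: two word-count features and a 0/1 label
n <- 200
toy <- data.frame(
  word_hillary   = rpois(n, 2),
  word_president = rpois(n, 1)
)
toy$real_fake <- rbinom(n, 1, plogis(-0.5 + 0.4 * toy$word_hillary))

# Keep only the label and the word_* feature columns
feature_cols <- grep("^word_", names(toy), value = TRUE)
model_data <- toy[, c("real_fake", feature_cols)]

# Logistic regression: probability that an article is fake (real_fake = 1)
fit <- glm(real_fake ~ ., data = model_data, family = binomial)

# In-sample predicted probabilities and a 0.5 decision threshold
pred_prob  <- predict(fit, type = "response")
pred_class <- as.integer(pred_prob > 0.5)
print(mean(pred_class == model_data$real_fake))  # in-sample accuracy
```

On the real data, accuracy should be measured on held-out articles rather than on the rows used for fitting, since in-sample accuracy overstates how well the model generalizes.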


## References

* “Remember Nayirah, Witness for Kuwait?” The New York Times, 6 Jan. 1992, www.nytimes.com/1992/01/06/opinion/remember-nayirah-witness-for-kuwait.html. Accessed 6 May 2025.
* Shu, Kai, et al. “FakeNewsNet: A Data Repository with News Content, Social Context and Spatialtemporal Information for Studying Fake News on Social Media.” arXiv, 27 Mar. 2019, arxiv.org/abs/1809.01286.
* Mahudeswaran, Deepak. “FakeNewsNet.” Kaggle, 2 Nov. 2018, www.kaggle.com/datasets/mdepak/fakenewsnet/data.
