<a target="_blank" href="https://colab.research.google.com/github/Wook22/Fake_News_Classification/blob/main/Fake_News_Analysis.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# **Data Analysis on Fake News**

### Abstract




## Introduction

Have you ever questioned whether the news you see is real or not? Since the internet became widespread, fake news has increasingly been used as a tool to manipulate public opinion. One of the most well-known examples is the Nayirah testimony. On October 10, 1990, a 15-year-old Kuwaiti girl gave a false testimony before the United States Congressional Human Rights Caucus. She claimed to be a volunteer nurse at a Kuwaiti hospital during the Iraqi invasion. In her testimony, she said she witnessed Iraqi soldiers removing premature babies from incubators, stealing the equipment, and leaving the babies to die on the floor. This emotional account played a significant role in shaping public support and helped President George H. W. Bush justify military action against Iraq.

However,

"was shown to be almost certainly false by an ABC reporter, John Martin, in March 1991" (The New York Times)

In January 1992, it was revealed that she had never been a nurse and was, in fact, the daughter of Saud Nasser Al-Saud Al-Sabah, the Kuwaiti ambassador to the United States at the time of her testimony. This raises an important question: What should we believe, and what should we not? In an age where misinformation can spread quickly, it's becoming increasingly difficult to know what is true and what is not.

Throughout this project, I will develop a model that predicts whether a news article is real or fake based on the count of phrases and language used in the text.





## Data Description



In [1]:
df_real = read.csv("BuzzFeed_real_news_content.csv")
df_fake = read.csv("BuzzFeed_fake_news_content.csv")

In [2]:
colnames(df_real)

In [3]:
colnames(df_fake)

In [4]:
df_real["real_fake"] = 0
df_fake["real_fake"] = 1

buzzfeed = rbind(df_real, df_fake)

In [5]:
install.packages(c("tidytext", "dplyr", "stringr", "tidyr"))

# Load necessary libraries
library(dplyr)
library(stringr)
library(tidytext)
library(tidyr)


Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘SnowballC’, ‘janeaustenr’, ‘tokenizers’



Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [7]:
# Create temporary dataset
df <- buzzfeed

# Tokenize text and count word frequencies
all_words <- df %>%
  select(text) %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

# Function to estimate syllables by counting vowel groups
estimate_syllables <- function(word) {
  str_count(tolower(word), "[aeiouy]+")
}

# Add syllable counts
all_words <- all_words %>%
  mutate(syllables = estimate_syllables(word)) %>%
  filter(syllables > 2)

# Get top 50 words with more than 3 syllables
top50_words <- head(all_words$word, 50)

# Function to count word occurrences in text
count_word <- function(text, word) {
  str_count(tolower(text), fixed(tolower(word)))
}

# Create new columns for each top word
for (w in top50_words) {
  df[[paste0("word_", w)]] <- sapply(df$text, count_word, word = w)
}

In [45]:
# 'id''title''text''url''top_img''authors''source'
# 'publish_date''movies''images''canonical_link''meta_data'

# Remove columns by name
df_drop <- df %>%
  select(-title, -source, -id, -text, -url, -top_img, -movies, -images, -canonical_link, -meta_data)

In [46]:
colnames(df_drop)

In [55]:
df_time_adjust <- df_drop %>%
  mutate(
    publish_timestamp = as.numeric(str_extract(publish_date, "\\d+")),
    publish_date = as.Date(as.POSIXct(publish_timestamp / 1000, origin = "1970-01-01", tz = "UTC"))
  ) %>%
  select(-publish_timestamp)

In [48]:
print(colSums(is.na(df_time_adjust)))


          authors      publish_date         real_fake      word_hillary 
                0                49                 0                 0 
   word_president       word_debate        word_obama       word_police 
                0                 0                 0                 0 
     word_because     word_american word_presidential       word_before 
                0                 0                 0                 0 
     word_america     word_election   word_republican   word_foundation 
                0                 0                 0                 0 
   word_according        word_every    word_september       word_united 
                0                 0                 0                 0 
     word_another    word_political    word_candidate    word_americans 
                0                 0                 0                 0 
    word_national    word_charlotte   word_democratic       word_policy 
                0                 0                

In [56]:
df_dropna <- df_time_adjust %>% drop_na(publish_date)

buzzfeed_data <- df_dropna
head(buzzfeed_data)

Unnamed: 0_level_0,authors,publish_date,real_fake,word_hillary,word_president,word_debate,word_obama,word_police,word_because,word_american,⋯,word_business,word_islamic,word_statement,word_anyone,word_debates,word_candidates,word_military,word_including,word_officials,word_terrorist
Unnamed: 0_level_1,<chr>,<date>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,"View All Posts,Leonora Cravotta",2016-09-22,0,1,1,0,0,2,0,0,⋯,0,0,0,1,0,1,0,0,1,4
2,Cassy Fiano,2016-09-21,0,0,0,0,9,0,1,1,⋯,0,0,0,0,0,0,0,0,0,1
3,"Jack Shafer,Erick Trickey,Zachary Karabell",2016-09-27,0,3,1,1,1,0,0,5,⋯,1,0,0,0,0,0,0,0,0,0
4,Cassy Fiano,2016-09-21,0,1,0,0,0,4,0,1,⋯,0,0,1,1,0,0,0,0,0,0
5,"Jack Shafer,Steven Shepard,Glenn Thrush,Nolan D,Shane Goldmacher",2016-09-26,0,9,6,19,1,0,4,0,⋯,0,0,0,0,6,5,0,1,0,0
6,"Jack Shafer,Jeff Greenfield",2016-09-26,0,2,4,24,6,0,2,1,⋯,2,0,0,0,11,1,0,1,0,0


## Data Analysis using Logistic Regression Model

**Problem:**

Can we predict whether a news article is real or fake using the most frequent words in the article text?

**Dataset:**

*   Target: real_fake (binary: 1 = real, 0 = fake)
*   Predictors: Word frequency counts for top 50 words (word_hillary, word_president, etc.)
*   Other metadata: title, authors, publish_date, etc.



In [57]:
# Extract the first author from the authors column
buzzfeed_data$first_author <- sapply(strsplit(as.character(buzzfeed_data$authors), " "), function(x) x[1])


In [58]:
nrow(buzzfeed_data)

In [59]:
# Extract the first author from the authors column
buzzfeed_data$first_author <- sapply(strsplit(as.character(buzzfeed_data$authors), " "), function(x) x[1])

# Convert categorical variables to factors, handling NAs
buzzfeed_data$authors <- as.factor(buzzfeed_data$first_author)
buzzfeed_data$publish_date <- as.factor(buzzfeed_data$publish_date)

# Create dummy variables using model.matrix
dummies <- model.matrix(real_fake ~ authors + publish_date, data = buzzfeed_data)

# Remove the intercept column
dummies <- dummies[, -1]

# Extract the numeric word_ columns
word_columns <- buzzfeed_data %>%
  select(starts_with("word_"))

# Ensure all data frames have the same number of rows before cbind
# This will use the rows present in all dataframes
common_rows <- intersect(rownames(buzzfeed_data), rownames(as.data.frame(dummies)))
buzzfeed_data <- buzzfeed_data[common_rows, ]
word_columns <- word_columns[common_rows, ]
dummies <- dummies[rownames(buzzfeed_data), ]

# Combine all into a new dataframe
buzzfeed_encoded <- cbind(real_fake = buzzfeed_data$real_fake, word_columns, dummies)

# Check the new dataframe
str(buzzfeed_encoded)

'data.frame':	109 obs. of  104 variables:
 $ real_fake             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ word_hillary          : int  1 0 3 1 9 2 1 2 0 4 ...
 $ word_president        : int  1 0 1 0 6 4 2 6 0 1 ...
 $ word_debate           : int  0 0 1 0 19 24 0 0 0 7 ...
 $ word_obama            : int  0 9 1 0 1 6 0 0 0 0 ...
 $ word_police           : int  2 0 0 4 0 0 0 0 0 0 ...
 $ word_because          : int  0 1 0 0 4 2 0 1 0 2 ...
 $ word_american         : int  0 1 5 1 0 1 0 2 0 1 ...
 $ word_presidential     : int  0 0 0 0 3 2 2 3 0 1 ...
 $ word_before           : int  2 1 0 0 0 3 0 2 0 0 ...
 $ word_america          : int  2 2 8 1 0 1 0 3 0 2 ...
 $ word_election         : int  0 0 3 0 0 3 0 3 0 0 ...
 $ word_republican       : int  0 0 1 0 2 0 0 5 0 1 ...
 $ word_foundation       : int  0 0 1 0 0 0 0 1 0 2 ...
 $ word_according        : int  0 0 0 0 0 0 0 1 0 0 ...
 $ word_every            : int  1 2 6 1 1 5 0 3 0 0 ...
 $ word_september        : int  2 0 0 2 4 1 0 0 0 0 ...
 $ w

In [66]:
# Fit full model
model_full <- glm(real_fake ~ ., data = df %>% select(real_fake, starts_with("word_")), family = binomial)
summary(model_full)

“glm.fit: algorithm did not converge”
“glm.fit: fitted probabilities numerically 0 or 1 occurred”



Call:
glm(formula = real_fake ~ ., family = binomial, data = buzzfeed_encoded)

Coefficients: (4 not defined because of singularities)
                           Estimate Std. Error z value Pr(>|z|)
(Intercept)              -3.485e+01  2.758e+06       0        1
word_hillary              1.270e+00  8.149e+04       0        1
word_president            1.510e+00  1.455e+05       0        1
word_debate              -1.111e+00  1.288e+05       0        1
word_obama               -3.122e+00  1.559e+05       0        1
word_police              -3.092e-01  2.009e+05       0        1
word_because             -6.841e+00  2.807e+05       0        1
word_american             7.535e+00  4.950e+05       0        1
word_presidential        -5.474e+00  3.431e+05       0        1
word_before               1.630e+00  2.099e+05       0        1
word_america              1.063e+00  1.938e+05       0        1
word_election             8.796e-01  3.087e+05       0        1
word_republican          -5.526e

Use stepwise selection based on AIC

In [67]:
# Variable Selection (Best Subset)
model_full <- glm(real_fake ~ ., data = buzzFeed_data, family = binomial)

# Perform stepwise selection silently
suppressMessages({
  model_best <- step(model_full, direction = "both", trace = 0)
})

summary(model_best)

ERROR: Error in eval(mf, parent.frame()): object 'buzzFeed_data' not found


In [37]:
# Likelihood ratio test
anova(model_best, test = "Chisq")

# # Hosmer-Lemeshow goodness of fit
# install.packages("ResourceSelection")
# library(ResourceSelection)
# hoslem.test(df$real_fake, fitted(model_best))

Unnamed: 0_level_0,Df,Deviance,Resid. Df,Resid. Dev,Pr(>Chi)
Unnamed: 0_level_1,<int>,<dbl>,<int>,<dbl>,<dbl>
,,,181,252.3056,
word_hillary,1.0,7.963362,180,244.3422,0.004773366
word_police,1.0,0.2231569,179,244.1191,0.636645
word_because,1.0,0.01405605,178,244.105,0.9056254
word_foundation,1.0,2.837639,177,241.2674,0.09207942
word_every,1.0,5.458875,176,235.8085,0.01946918
word_september,1.0,5.579383,175,230.2291,0.01817312
word_candidate,1.0,15.64608,174,214.583,7.637078e-05
word_washington,1.0,3.466645,173,211.1164,0.0626182
word_department,1.0,7.701084e-05,172,211.1163,0.9929982


In [None]:
df <- buzzfeed_data()

df$predicted_prob <- predict(model_best, type = "response")

## Reference

* Opinion | Remember Nayirah, Witness for Kuwait? (Published 1992), www.nytimes.com/1992/01/06/opinion/remember-nayirah-witness-for-kuwait.html. Accessed 6 May 2025.
* Shu, Kai, et al. “FakeNewsNet: A Data Repository with News Content, Social Context and Spatialtemporal Information for Studying Fake News on Social Media.” arXiv.Org, 27 Mar. 2019, arxiv.org/abs/1809.01286.
* Mahudeswaran, Deepak. “FakeNewsNet.” Kaggle, 2 Nov. 2018, www.kaggle.com/datasets/mdepak/fakenewsnet/data.
