# Introduction

The spread of online misinformation poses a serious threat to democracies in the 21st century. It erodes trust in public institutions and increases political polarization, weakening the foundation that democratic systems are built upon.

In [77]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/daphnehe/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/daphnehe/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/daphnehe/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [78]:
import json
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [19]:
# # Load the JSON data into a pandas DataFrame
# # Load Human real news
# with open('data/HR.json', 'r') as file:
#     data = json.load(file)

# HR_df = pd.DataFrame.from_dict(data, orient='index')

In [79]:
# Function to load JSON data into a DataFrame
def load_json_to_df(json_file):
    with open(json_file, 'r') as file:
        data = json.load(file)
    return pd.DataFrame.from_dict(data, orient='index')

# List of JSON files
json_files = ['HR.json', 'HF.json', 'MR.json', 'MF.json']
dfs = []

# Load each JSON file into a DataFrame and store them in a list
for json_file in json_files:
    df = load_json_to_df('data/' + json_file)
    dfs.append(df)

# Naming the DataFrames
HR_df, HF_df, MR_df, MF_df = dfs
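
The loading loop above relies on the file list and the unpacking line staying in the same order. A minimal sketch of an alternative keeps each DataFrame under its own name in a dictionary. `load_datasets` and the temporary sample file are hypothetical and only there so the sketch runs without the real `data/` folder.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_json_to_df(json_file):
    """Read a JSON object keyed by row index into a DataFrame."""
    with open(json_file, 'r', encoding='utf-8') as file:
        data = json.load(file)
    return pd.DataFrame.from_dict(data, orient='index')


def load_datasets(data_dir, names):
    """Hypothetical helper: return {name: DataFrame} for each <name>.json in data_dir."""
    data_dir = Path(data_dir)
    return {name: load_json_to_df(data_dir / f'{name}.json') for name in names}


# Tiny stand-in file (assumed layout: one record per index key).
with tempfile.TemporaryDirectory() as tmp:
    sample = {'0': {'id': 'example-1', 'text': 'Some text', 'title': 'A title'}}
    Path(tmp, 'HR.json').write_text(json.dumps(sample), encoding='utf-8')
    frames = load_datasets(tmp, ['HR'])

print(frames['HR'].shape)
```

With the real files, `load_datasets('data', ['HR', 'HF', 'MR', 'MF'])` would give `frames['HR']`, `frames['HF']`, and so on, assuming the same file layout as above.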

In [80]:
# Human Real News
HR_df

Unnamed: 0,id,text,title,description
0,gossipcop-951329,11 Summer Camp Movies That'll Make You Nostalg...,11 Summer Camp Movies That'll Make You Nostalg...,Nothing says summer like watching a movie all ...
1,gossipcop-861360,Info Category: Richest Business › Executives N...,Charrisse Jackson Jordan Net Worth,What is Charrisse Jackson Jordan's net worth?
2,gossipcop-911046,Warning: This story contains major spoilers fr...,Raúl Esparza exits Law & Order: SVU after six ...,The actor reveals why he decided to leave the ...
3,gossipcop-899120,Lil Peep died of an overdose of fentanyl and g...,Lil Peep Cause of Death Revealed,Pima County Office of the Medical Examiner con...
4,gossipcop-919455,Goop is kicking off its weekly podcast in a bi...,"Gwyneth Paltrow, Oprah talk Weinstein, #MeToo’...",Goop is kicking off its weekly podcast in a bi...
...,...,...,...,...
8163,gossipcop-875489,For free real time breaking news alerts sent s...,The top interior design trends for millennials,From hand-baked clay tiles to LED lights that ...
8164,gossipcop-844263,Gilmore Girls: A Year in the Life made its Net...,"Gilmore Girls Video: Lauren Graham, Alexis Ble...",Gilmore Girls: A Year in the Life made its Net...
8165,gossipcop-917467,Why Is It Airing Now?\n\nAccording to the exec...,"The O.J. Simpson Interview on Fox: Gripping, G...",On Sunday Fox aired “O.J. Simpson: The Lost Co...
8166,gossipcop-924877,Just when you thought this season of Vanderpum...,Kristen Doute and James Kennedy Hooked Up Rumo...,Just when you thought this season of Vanderpum...


In [81]:
# Human Fake News
HF_df

Unnamed: 0,id,text,title,description
0,gossipcop-1991455469,✕ Close Meghan Markle and Prince Harry have an...,As it happened: Prince Harry and Meghan Markle...,The wedding will take place in spring 2018
1,gossipcop-7798039260,Kim Kardashian and Kanye West are pulling out ...,Kim & Kanye Install At-Home Panic Room After P...,'Keeping the kids safe is the couples number o...
2,gossipcop-7817725290,Prince Harry and Meghan currently live at Kens...,£1.4million spent renovating Prince Harry and ...,Prince Harry and Meghan might not be living in...
3,gossipcop-5111151830,They can't get enough of the Biebs on this sho...,Photos from Dancing With the Stars: Special Gu...,Photos from Dancing With the Stars: Special Gu...
4,gossipcop-9658632569,Ben Affleck is keeping life with his three kid...,Jennifer Garner ‘Doesn’t Want’ Her Kids Around...,Jennifer Garner ‘doesn’t want’ her three kids ...
...,...,...,...,...
4079,gossipcop-7065786957,There was no shortage of celebrity beefs in 20...,The Biggest Celebrity Feuds of 2017,There was no shortage of celebrity beefs in 2017.
4080,gossipcop-1188213997,Kim Kardashian and her sisters seem pretty uni...,Kim Kardashian Criticizes Scott Disick for Dat...,See what Kim said on 'KUWTK' inside!
4081,gossipcop-9024002184,"When John and I got together, I found my love ...",Chrissy Teigen Opens Up for the First Time Abo...,"""The mental pain of knowing I let so many peop..."
4082,gossipcop-3520745692,Yikes! Less than 3 months after giving birth t...,Kylie Jenner Suffers Pregnancy Scare 3 Months ...,Yikes! Less than 3 months after giving birth t...


In [82]:
# AI Real News
MR_df

Unnamed: 0,id,description,text,title
0,gossipcop-951329,Nothing says summer like watching a movie all ...,"With summer just around the corner, it's the p...",11 Summer Camp Movies That'll Make You Nostalg...
1,gossipcop-861360,What is Charrisse Jackson Jordan's net worth?,"Charrisse Jackson Jordan, an American reality ...",Charrisse Jackson Jordan Net Worth
2,gossipcop-911046,The actor reveals why he decided to leave the ...,Warning: This story contains major spoilers fr...,Raúl Esparza Exits Law & Order: SVU After Six ...
3,gossipcop-899120,Pima County Office of the Medical Examiner con...,The Pima County Office of the Medical Examiner...,Lil Peep's Cause of Death Revealed
4,gossipcop-919455,Goop is kicking off its weekly podcast in a bi...,Goop is kicking off its weekly podcast in a bi...,"Gwyneth Paltrow, Oprah Discuss Weinstein and #..."
...,...,...,...,...
4164,gossipcop-849360,Kailyn Lowry revealed she was recently 'hookin...,"Kailyn Lowry, star of Teen Mom 2, recently ope...",Kailyn Lowry Reveals Regrets About Relationshi...
4165,gossipcop-923609,"Farrah Abraham, one of the stars of the MTV sh...","Farrah Abraham, star of MTV's Teen Mom OG, has...",Farrah Abraham Drops $5 Million 'Sex Shaming' ...
4166,gossipcop-933361,Kim DePaola can't say enough good things about...,"The Real Housewives of New Jersey star, Kim De...","Real Housewives' Kim DePaola on Botched, Terry..."
4167,gossipcop-902565,See the red carpet looks (and Time's Up black ...,The 2018 Golden Globes red carpet is one unlik...,Black but not boring! See the red carpet looks...


In [83]:
# AI Fake News
MF_df

Unnamed: 0,id,text,title,description
0,gossipcop-1991455469,Excitement and anticipation are in the air as ...,Royal Family prepares to welcome modern bride ...,The wedding will take place in spring 2018
1,gossipcop-7798039260,In the wake of Kim Kardashian's traumatic Pari...,Kim and Kanye's At-Home Panic Room Sparks Outr...,'Keeping the kids safe is the couples number o...
2,gossipcop-7817725290,"uke and Duchess of Sussex, Prince Harry and Me...",£1.4 Million Renovation for Prince Harry and M...,Prince Harry and Meghan might not be living in...
3,gossipcop-5111151830,"In a surprise turn of events, former President...",Former President Obama and Beyoncé grace the D...,Photos from Dancing With the Stars: Special Gu...
4,gossipcop-9658632569,"In an unexpected turn of events, Hollywood act...",Jennifer Garner Caught Banning Lindsay Shookus...,Jennifer Garner ‘doesn’t want’ her three kids ...
...,...,...,...,...
4079,gossipcop-7065786957,As we bid farewell to the drama-filled year th...,The Most Anticipated Celebrity Feuds of 2018,There was no shortage of celebrity beefs in 2017.
4080,gossipcop-1188213997,Reality television star Kim Kardashian is faci...,Kim Kardashian Accused of Hypocrisy After Crit...,See what Kim said on 'KUWTK' inside!
4081,gossipcop-9024002184,"Chrissy Teigen, the popular model and social m...",Chrissy Teigen Reveals Secret Struggle with Po...,"""The mental pain of knowing I let so many peop..."
4082,gossipcop-3520745692,Kylie Jenner and Travis Scott's relationship m...,Kylie Jenner and Travis Scott's Relationship o...,Yikes! Less than 3 months after giving birth t...


In [85]:
# Data Cleaning
def clean_text(text):
    # Keep only ASCII letters and whitespace; digits, punctuation (including
    # apostrophes), and accented letters are dropped
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert text to lowercase
    text = text.lower()
    return text

HR_df['cleaned_text'] = HR_df['text'].apply(clean_text)
HF_df['cleaned_text'] = HF_df['text'].apply(clean_text)
MR_df['cleaned_text'] = MR_df['text'].apply(clean_text)
MF_df['cleaned_text'] = MF_df['text'].apply(clean_text)

# Print the cleaned text for each DataFrame
print("Human Real News:")
print(HR_df['cleaned_text'])
print("\nHuman Fake News:")
print(HF_df['cleaned_text'])
print("\nAI Real News:")
print(MR_df['cleaned_text'])
print("\nAI Fake News:")
print(MF_df['cleaned_text'])

Human Real News:
0        summer camp movies thatll make you nostalgic ...
1       info category richest business  executives net...
3       lil peep died of an overdose of fentanyl and g...
4       goop is kicking off its weekly podcast in a bi...
                              ...                        
8163    for free real time breaking news alerts sent s...
8164    gilmore girls a year in the life made its netf...
8165    why is it airing now\n\naccording to the execu...
8166    just when you thought this season of vanderpum...
8167    a cringeworthy video of katie couric talking a...
Name: cleaned_text, Length: 8168, dtype: object

Human Fake News:
0        close meghan markle and prince harry have ann...
1       kim kardashian and kanye west are pulling out ...
2       prince harry and meghan currently live at kens...
3       they cant get enough of the biebs on this show...
4       ben affleck is keeping life with his three kid...
                              ...              
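
Because `clean_text` keeps only ASCII letters, accented characters are deleted rather than converted, so a name like "Beyoncé" would come out as "beyonc". A minimal sketch of one possible workaround folds accents to their base letters before filtering. `clean_text_ascii_fold` is a hypothetical name, and the extra whitespace collapsing is an addition, not something the original function does.

```python
import re
import unicodedata


def clean_text_ascii_fold(text):
    # Decompose accented letters (e.g. é -> e + combining mark) ...
    text = unicodedata.normalize('NFKD', text)
    # ... then drop everything that has no ASCII form, including the marks.
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Same filter as clean_text: keep only letters and whitespace.
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Assumption: collapse the gaps left behind by removed digits and symbols.
    text = re.sub(r'\s+', ' ', text).strip()
    return text.lower()


print(clean_text_ascii_fold("Beyoncé Knowles and Jay Z welcomed 2 twins"))
# beyonce knowles and jay z welcomed twins
```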

In [87]:
# Text Preprocessing
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Lemmatize the tokens
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    # Join the tokens back into a single string
    processed_text = ' '.join(lemmatized_tokens)
    return processed_text

HR_df['preprocessed_text'] = HR_df['cleaned_text'].apply(preprocess_text)
HF_df['preprocessed_text'] = HF_df['cleaned_text'].apply(preprocess_text)
MR_df['preprocessed_text'] = MR_df['cleaned_text'].apply(preprocess_text)
MF_df['preprocessed_text'] = MF_df['cleaned_text'].apply(preprocess_text)

print("Human Real News:")
print(HR_df['preprocessed_text'])
print("\nHuman Fake News:")
print(HF_df['preprocessed_text'])
print("\nAI Real News:")
print(MR_df['preprocessed_text'])
print("\nAI Fake News:")
print(MF_df['preprocessed_text'])

Human Real News:
0       summer camp movie thatll make nostalgic childh...
1       info category richest business executive net w...
3       lil peep died overdose fentanyl generic xanax ...
4       goop kicking weekly podcast big way oprah big ...
                              ...                        
8163    free real time breaking news alert sent straig...
8164    gilmore girl year life made netflix debut six ...
8165    airing according executive producer terry wron...
8166    thought season vanderpump rule couldnt get sca...
8167    cringeworthy video katie couric talking matt l...
Name: preprocessed_text, Length: 8168, dtype: object

Human Fake News:
0       close meghan markle prince harry announced eng...
1       kim kardashian kanye west pulling stop keep fa...
2       prince harry meghan currently live kensington ...
3       cant get enough biebs show back first week sea...
4       ben affleck keeping life three kid relationshi...
                              ...         
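
In the output above, tokens such as "thatll" and "couldnt" survive stop-word removal. One likely reason is that `clean_text` strips apostrophes before filtering, so these tokens no longer match stop-word entries that are stored with an apostrophe. The sketch below is one way to handle this, assuming the stop-word list does store contractions that way: it adds an apostrophe-free copy of every stop word. The helper names and the small sample set are hypothetical stand-ins for `stopwords.words('english')`.

```python
import re


def strip_non_letters(word):
    """Apply the same character filter as clean_text to a single word."""
    return re.sub(r'[^a-zA-Z\s]', '', word).lower()


def expand_stop_words(stop_words):
    """Return the stop words plus their apostrophe-free forms."""
    return set(stop_words) | {strip_non_letters(w) for w in stop_words}


def remove_stop_words(tokens, stop_words):
    return [t for t in tokens if t not in stop_words]


# Illustrative stand-in for the NLTK list; these entries are assumptions.
sample_stop_words = {"that'll", "couldn't", "the", "you"}
tokens = "summer camp movies thatll make you nostalgic".split()

print(remove_stop_words(tokens, sample_stop_words))
# ['summer', 'camp', 'movies', 'thatll', 'make', 'nostalgic']
print(remove_stop_words(tokens, expand_stop_words(sample_stop_words)))
# ['summer', 'camp', 'movies', 'make', 'nostalgic']
```

In the notebook, the equivalent change would be `stop_words = expand_stop_words(stopwords.words('english'))` before calling `preprocess_text`.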

In [89]:
# Save the preprocessed data to new JSON files under data/, where the later cells load them
# Do this only once!
# HR_df.to_json('data/HR_prep.json', orient='index')
# HF_df.to_json('data/HF_prep.json', orient='index')
# MR_df.to_json('data/MR_prep.json', orient='index')
# MF_df.to_json('data/MF_prep.json', orient='index')
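
Instead of commenting the save lines in and out, the "only once" rule could be enforced in code. This is a minimal sketch, assuming the goal is simply to avoid overwriting an existing file. `save_once` and the demo DataFrame are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_once(df, path):
    """Write df as index-oriented JSON unless the file already exists."""
    path = Path(path)
    if path.exists():
        return False  # leave the existing file untouched
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_json(path, orient='index')
    return True


# Demo in a temporary folder so nothing in the project is written.
with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame({'preprocessed_text': ['summer camp movie']})
    target = Path(tmp) / 'data' / 'demo_prep.json'
    first = save_once(demo, target)
    second = save_once(demo, target)

print(first, second)  # True False
```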

In [96]:
# # Load the JSON data into a pandas DataFrame
# # Load Human real news
# with open('data/HR_prep.json', 'r') as file:
#     data = json.load(file)

# HR_prep_df = pd.DataFrame.from_dict(data, orient='index')
# HR_df = HR_prep_df['preprocessed_text']
# HR_df

0       summer camp movie thatll make nostalgic childh...
1       info category richest business executive net w...
3       lil peep died overdose fentanyl generic xanax ...
4       goop kicking weekly podcast big way oprah big ...
                              ...                        
8163    free real time breaking news alert sent straig...
8164    gilmore girl year life made netflix debut six ...
8165    airing according executive producer terry wron...
8166    thought season vanderpump rule couldnt get sca...
8167    cringeworthy video katie couric talking matt l...
Name: preprocessed_text, Length: 8168, dtype: object

In [103]:
# Function to load JSON data into a DataFrame
def load_json_to_df(json_file):
    with open(json_file, 'r') as file:
        data = json.load(file)
    return pd.DataFrame.from_dict(data, orient='index')

# List of JSON files
json_files = ['HR_prep.json', 'HF_prep.json', 'MR_prep.json', 'MF_prep.json']
dfs = []

# Load each JSON file into a DataFrame and store them in a list
for json_file in json_files:
    df = load_json_to_df('data/' + json_file)
    dfs.append(df)

# Naming the DataFrames
HR_df, HF_df, MR_df, MF_df = dfs

In [105]:
HR = HR_df['preprocessed_text']
HR

0       summer camp movie thatll make nostalgic childh...
1       info category richest business executive net w...
3       lil peep died overdose fentanyl generic xanax ...
4       goop kicking weekly podcast big way oprah big ...
                              ...                        
8163    free real time breaking news alert sent straig...
8164    gilmore girl year life made netflix debut six ...
8165    airing according executive producer terry wron...
8166    thought season vanderpump rule couldnt get sca...
8167    cringeworthy video katie couric talking matt l...
Name: preprocessed_text, Length: 8168, dtype: object

In [106]:
HF = HF_df['preprocessed_text']
HF

0       close meghan markle prince harry announced eng...
1       kim kardashian kanye west pulling stop keep fa...
2       prince harry meghan currently live kensington ...
3       cant get enough biebs show back first week sea...
4       ben affleck keeping life three kid relationshi...
                              ...                        
4079    shortage celebrity beef whether fight costars ...
4080    kim kardashian sister seem pretty unimpressed ...
4081    john got together found love cooking one earli...
4082    yikes le month giving birth baby stormi kylie ...
4083    beyonc knowles jay z welcomed twin according r...
Name: preprocessed_text, Length: 4084, dtype: object

In [107]:
MR = MR_df['preprocessed_text']
MR

0       summer around corner perfect time take trip me...
1       charrisse jackson jordan american reality tele...
3       pima county office medical examiner confirmed ...
4       goop kicking weekly podcast big way oprah winf...
                              ...                        
4164    kailyn lowry star teen mom recently opened reg...
4165    farrah abraham star mtvs teen mom og decided d...
4166    real housewife new jersey star kim depaola rec...
4167    golden globe red carpet one unlike time initia...
4168    kanye west renowned rapper time grammy winner ...
Name: preprocessed_text, Length: 4169, dtype: object

In [108]:
MF = MF_df['preprocessed_text']
MF

0       excitement anticipation air royal family get r...
1       wake kim kardashians traumatic paris robbery r...
2       uke duchess sussex prince harry meghan markle ...
3       surprise turn event former president barack ob...
4       unexpected turn event hollywood actress jennif...
                              ...                        
4079    bid farewell dramafilled year anticipation wha...
4080    reality television star kim kardashian facing ...
4081    chrissy teigen popular model social medium per...
4082    kylie jenner travis scott relationship may jeo...
4083    according exclusive report tmz beyonc jay z na...
Name: preprocessed_text, Length: 4084, dtype: object