
# Fake vs. Real News: Predictive Modeling and Topic Summarization

https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset

By: Ally Devico

# IMPORTING LIBRARIES AND DATA

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re

import nltk
from nltk import Text
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import word_tokenize  
from nltk.tokenize import sent_tokenize 
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.preprocessing import LabelEncoder

from numpy import hstack


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/allydevico/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/allydevico/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/allydevico/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/allydevico/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [5]:
true_df = pd.read_csv('../data/true.csv')

In [6]:
fake_df = pd.read_csv('../data/fake.csv')

# COMBINING DATA

Created a new column called "true" that records whether each row in combined_df came from the true dataframe (1) or the fake dataframe (0):

In [7]:
true_df['true'] = 1
fake_df['true'] = 0

In [8]:
combined_df = pd.concat([true_df, fake_df])
combined_df.reset_index(drop=True, inplace=True)

# CLEANING DATA

Removed news agency names from the "text" and "title" columns. This is important because the agency name is a very strong indicator of whether or not an article is fake. Removing it pushes the assignment toward other signals in the text.

In [9]:
def remove_news_agency_name(text):
    return re.sub(r"Reuters|AP|New York Times|Washington Post|Business Insider|Atlantic|Fox News|National Review|Talking Points Memo|Buzzfeed News|Guardian|NPR|Vox|CNN|BBC|Bloomberg|Daily Mail", "", text)

In [10]:
combined_df['text'] = combined_df['text'].apply(remove_news_agency_name)

In [11]:
combined_df['title'] = combined_df['title'].apply(remove_news_agency_name)
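One thing to watch with a plain alternation pattern like the one above is that short names such as "AP" also match inside longer uppercase words. The sketch below compares the plain pattern with a word-boundary variant on a made-up sentence. The shortened `AGENCIES` list and the sample text are assumptions for illustration, not part of the dataset.

```python
import re

# Hypothetical, shortened agency list used only for illustration.
AGENCIES = ["Reuters", "AP", "CNN"]

# Plain alternation: matches the letters anywhere, even inside longer words.
plain = re.compile("|".join(AGENCIES))

# Word-boundary variant: matches only when the name stands alone as a word.
bounded = re.compile(r"\b(?:" + "|".join(map(re.escape, AGENCIES)) + r")\b")

# Made-up sentence containing "AP" both alone and inside other words.
sample = "WASHINGTON (Reuters) - NAPA officials told AP that CAPS were lifted."

plain_result = plain.sub("", sample)
bounded_result = bounded.sub("", sample)

print(plain_result)    # "NAPA" and "CAPS" lose their inner "AP"
print(bounded_result)  # "NAPA" and "CAPS" are left intact
```

Whether the stricter pattern changes the model's results on this dataset is not something this sketch shows; it only illustrates the difference in matching behavior.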

Made all strings in the dataframe lowercase. This is important because we don't want the same word, capitalized in one place and lowercase in another, to be counted as two different words.

In [12]:
combined_df = combined_df.applymap(lambda x: x.lower() if isinstance(x, str) else x)


Transformed the categorical values in the "subject" column into numeric labels so that they can be used more easily in a model.

In [13]:
label_encoder = LabelEncoder()

In [14]:
combined_df['subject'] = label_encoder.fit_transform(combined_df['subject'])
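To see which integer each subject was assigned, the fitted encoder's `classes_` attribute can be inspected. The sketch below uses a small made-up dataframe; the subject values are assumptions and may not match the ones in the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical subject values, used only to illustrate the encoding.
toy = pd.DataFrame({"subject": ["politics", "worldnews", "news", "politics"]})

encoder = LabelEncoder()
toy["subject_code"] = encoder.fit_transform(toy["subject"])

# classes_ holds the original labels in sorted order; the position is the code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(toy["subject_code"].tolist())
```

The codes follow alphabetical order of the labels, so their numeric size carries no meaning. A model that treats them as ordered numbers may read structure into them that is not there.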

# PREDICTIVE MODELING

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

I decided to use TfidfVectorizer because it weights each word by how often it appears in an article and how rare it is across all articles. It does not measure sentiment directly, but strongly worded vocabulary can still show up as distinctive terms. It is possible that fake news articles use more emotionally charged words, while true articles remain modest. Similarly, it is possible that fake news articles contain terms that are uncommon in legitimate news sources.
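As a rough illustration of how TF-IDF treats common and rare words, the sketch below fits a vectorizer on three made-up sentences and compares two weights within the same document. The sentences are assumptions chosen so that "senate" appears everywhere and "hoax" appears only once.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus: "senate" is in every document, "hoax" in only one.
docs = [
    "the senate passed the bill",
    "the senate rejected the bill",
    "the senate debated a shocking hoax",
]

vec = TfidfVectorizer()
matrix = vec.fit_transform(docs)

# vocabulary_ maps each term to its column index in the matrix.
vocab = vec.vocabulary_
row = matrix[2].toarray().ravel()

common_weight = row[vocab["senate"]]
rare_weight = row[vocab["hoax"]]
print("senate:", common_weight, "hoax:", rare_weight)
```

Both words occur once in the third sentence, so the difference in weight comes only from the inverse document frequency term.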

In [16]:
extract = TfidfVectorizer(stop_words='english', max_features=5000)
X = extract.fit_transform(combined_df['text'])
y = combined_df["true"]

In [17]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Classifiers
classifier_1 = LogisticRegression()
classifier_2 = MultinomialNB()
classifier_3 = RandomForestClassifier()

# Create a Voting Classifier
voting_classifier = VotingClassifier(estimators=[
  ('lr', classifier_1),   # logistic regression
  ('nb', classifier_2),   # multinomial naive Bayes
  ('rf', classifier_3)    # random forest
], voting='hard')

# Train the Voting Classifier
voting_classifier.fit(X_train, y_train)

# Make predictions using the Voting Classifier
voting_predictions = voting_classifier.predict(X_test)

# Calculate accuracy of the Voting Classifier
accuracy = accuracy_score(y_test, voting_predictions)
print("Accuracy:", accuracy)


Accuracy: 0.9798440979955456
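The cells above fit the TF-IDF vocabulary on every row before splitting, so the test articles influence the features. One possible alternative, sketched below on a tiny made-up corpus, wraps the vectorizer and the voting classifier in a scikit-learn pipeline so that the vectorizer is fit on the training split only. The example texts and labels are assumptions for illustration, and nothing here says how the accuracy on the real dataset would change.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus (not from the dataset): 1 = true, 0 = fake.
real = ["officials said the budget passed on tuesday",
        "the minister told reporters the talks will continue"]
fake = ["you will not believe what hillary just did",
        "watch this shocking video the media is hiding"]
texts = real * 5 + fake * 5
labels = [1] * 10 + [0] * 10

# Split the raw text first, so the vectorizer never sees the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    VotingClassifier(
        estimators=[("lr", LogisticRegression()),
                    ("nb", MultinomialNB()),
                    ("rf", RandomForestClassifier(random_state=42))],
        voting="hard"),
)

# fit() learns the vocabulary and the classifiers from the training split only.
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions)
```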


# TOPIC SUMMARIZATION

Accessing cleaned versions of true_df and fake_df from the cleaning we did on combined_df:

In [18]:
true_df_cleaned = combined_df[combined_df["true"] == 1]
fake_df_cleaned = combined_df[combined_df["true"] == 0]

Finding the most commonly occurring words in the text for true and fake news articles:

In [19]:
count_vec = CountVectorizer(stop_words='english', max_features=20)
true_word_count = count_vec.fit_transform(true_df_cleaned["text"])
true_feature_names = count_vec.get_feature_names_out()
true_word_count_df = pd.DataFrame(true_word_count.toarray(), columns=true_feature_names)
true_word_counts = true_word_count_df.sum().sort_values(ascending=False)
true_word_counts

said          99062
trump         54700
president     28177
state         21025
government    18846
states        16652
house         16640
new           16391
republican    16243
united        15590
people        15287
year          14777
told          14245
party         12759
washington    12571
election      12306
campaign      10636
donald        10456
security      10162
percent       10012
dtype: int64

In [20]:
fake_word_count = count_vec.fit_transform(fake_df_cleaned["text"])
fake_feature_names = count_vec.get_feature_names_out()
fake_word_count_df = pd.DataFrame(fake_word_count.toarray(), columns=fake_feature_names)
fake_word_counts = fake_word_count_df.sum().sort_values(ascending=False)
fake_word_counts

trump        79307
said         33763
president    27719
people       26570
just         20511
clinton      19173
obama        18803
like         18097
donald       17671
hillary      14124
time         13844
state        13463
white        13190
new          12824
news         11816
twitter      11724
media        11704
american     11319
america      11185
house        11113
dtype: int64

Words that appear in at least a given percentage of articles:

In [25]:
txt_true = true_df_cleaned['text']
txt_fake = fake_df_cleaned['text']

In [29]:
count_vec = CountVectorizer(stop_words="english", analyzer='word', 
                            ngram_range=(1, 1), max_df=1.0, min_df=0.5, max_features=None)

count_train = count_vec.fit(txt_true)
bag_of_words = count_vec.transform(txt_true)

print(count_vec.get_feature_names_out().tolist())

['president', 'said']


The words "said" and "president" appear in at least 50% of true news articles.

In [31]:
count_vec = CountVectorizer(stop_words="english", analyzer='word', 
                            ngram_range=(1, 1), max_df=1.0, min_df=0.5, max_features=None)

count_train = count_vec.fit(txt_fake)
bag_of_words = count_vec.transform(txt_fake)

print(count_vec.get_feature_names_out().tolist())

['said', 'trump']


The words "said" and "trump" appear in at least 50% of fake news articles.

Most frequent words in true vs fake news articles:

In [34]:
count_vec = CountVectorizer(stop_words="english", analyzer='word', 
                            ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=5)

count_train = count_vec.fit(txt_true)
bag_of_words = count_vec.transform(txt_true)

print(count_vec.get_feature_names_out().tolist())

['government', 'president', 'said', 'state', 'trump']


The 5 most frequent words in true news articles are "government", "president", "said", "state", and "trump" (listed alphabetically, not by frequency).

In [35]:
count_vec = CountVectorizer(stop_words="english", analyzer='word', 
                            ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=5)

count_train = count_vec.fit(txt_fake)
bag_of_words = count_vec.transform(txt_fake)

print(count_vec.get_feature_names_out().tolist())

['just', 'people', 'president', 'said', 'trump']




The 5 most frequent words in fake news articles are "just", "people", "president", "said", and "trump" (listed alphabetically, not by frequency).