In [3]:
import pandas as pd

# Load data
fake_df = pd.read_csv("Fake.csv")
true_df = pd.read_csv("True.csv")

# Add a label column to distinguish between fake and real news
fake_df['label'] = 'FAKE'
true_df['label'] = 'REAL'

# Combine the two dataframes
combined_df = pd.concat([fake_df, true_df], ignore_index=True)

# Preview the combined data
print(combined_df.head())


                                               title  \
0   Donald Trump Sends Out Embarrassing New Year’...   
1   Drunk Bragging Trump Staffer Started Russian ...   
2   Sheriff David Clarke Becomes An Internet Joke...   
3   Trump Is So Obsessed He Even Has Obama’s Name...   
4   Pope Francis Just Called Out Donald Trump Dur...   

                                                text subject  \
0  Donald Trump just couldn t wish all Americans ...    News   
1  House Intelligence Committee Chairman Devin Nu...    News   
2  On Friday, it was revealed that former Milwauk...    News   
3  On Christmas day, Donald Trump announced that ...    News   
4  Pope Francis used his annual Christmas Day mes...    News   

                date label  
0  December 31, 2017  FAKE  
1  December 31, 2017  FAKE  
2  December 30, 2017  FAKE  
3  December 29, 2017  FAKE  
4  December 25, 2017  FAKE  


In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Convert the text into a matrix of token counts, removing English stop words
vectorizer = CountVectorizer(stop_words='english')
text_data = vectorizer.fit_transform(combined_df['text'])

# Fit LDA model
lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(text_data)

# Display topics and top words
n_top_words = 10
words = vectorizer.get_feature_names_out()
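for i, topic in enumerate(lda.components_):
    # argsort is ascending, so the highest-weight word is printed last;
    # use j here so the topic index i is not shadowed
    top_words = [words[j] for j in topic.argsort()[-n_top_words:]]
    print(f"Topic #{i}: {' '.join(top_words)}")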


Topic #0: lives year man officers women gun black people said police
Topic #1: state president reuters military korea states china north united said
Topic #2: percent government federal million new house trump state tax said
Topic #3: presidential campaign clinton house party election president republican said trump
Topic #4: american news wire isis 21st century president war media obama
Topic #5: justice judge said state students rights school law people court
Topic #6: israel state police year people minister party reuters government said
Topic #7: image clinton hillary twitter president like people just donald trump
Topic #8: campaign investigation intelligence clinton russian president fbi russia said trump
Topic #9: border muslim country states mexico refugees united immigration people said


The 10 topics identified by LDA map onto recognizable real-world themes, including policing and social justice (Topic 0), election campaigns (Topic 3), and foreign policy and global affairs (Topics 1 and 6).
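The topic listing above prints each topic's words in ascending order of weight, which makes the strongest word easy to miss. A minimal sketch of an alternative listing, heaviest word first with its normalized weight, is shown below. The helper `top_terms` and the toy matrix are hypothetical stand-ins for `lda.components_` and the vectorizer vocabulary; the numbers are made up for illustration.

```python
import numpy as np

def top_terms(topic_word, vocab, n_top=10):
    """Return, per topic, the n_top (term, weight) pairs, heaviest first.

    Weights are normalized within each topic so they sum to 1.
    """
    topic_word = np.asarray(topic_word, dtype=float)
    vocab = np.asarray(vocab)
    probs = topic_word / topic_word.sum(axis=1, keepdims=True)
    result = []
    for row in probs:
        order = np.argsort(row)[::-1][:n_top]  # descending by weight
        result.append([(str(vocab[j]), float(row[j])) for j in order])
    return result

# Toy 2-topic, 4-term matrix standing in for lda.components_ (made-up numbers)
toy_components = [[8.0, 1.0, 0.5, 0.5],
                  [0.5, 0.5, 6.0, 3.0]]
toy_vocab = ["police", "tax", "trump", "clinton"]

for k, terms in enumerate(top_terms(toy_components, toy_vocab, n_top=2)):
    print(f"Topic #{k}: " + ", ".join(f"{t} ({w:.2f})" for t, w in terms))
```

In this notebook the same helper could presumably be called as `top_terms(lda.components_, words)`, assuming `lda` and `words` from the cell above are still in scope.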

In [9]:
# Map topic numbers to descriptive labels
topic_labels = {
    0: "Law enforcement & social issues",
    1: "International relations",
    2: "Economic issues",
    3: "U.S. elections and campaigns",
    4: "Terrorism and media coverage",
    5: "Legal matters",
    6: "Israeli politics and government",
    7: "Social media and U.S presidents",
    8: "Intelligence and investigations",
    9: "Immigration and border security"
}

import numpy as np

# Display top topics for a document
def display_top_topics(document_topics, doc_type, num_top_topics=2):
    print(f"\n{doc_type} Topic Distributions:")
    for i, dist in enumerate(document_topics):
        # Get the indices of the topics sorted by their prevalence (highest to lowest)
        top_indices = np.argsort(dist)[-num_top_topics:][::-1]
        
        # Display the top topics with labels and percentages
        print(f"Document {i+1}:")
        for idx in top_indices:
            print(f"  - {topic_labels[idx]} ({dist[idx]:.2f})")

# Randomly select 5 real and 5 fake news samples
real_samples = combined_df[combined_df['label'] == 'REAL'].sample(5, random_state=42)
fake_samples = combined_df[combined_df['label'] == 'FAKE'].sample(5, random_state=42)

# Transform these samples using the fitted LDA model
real_topic_distributions = lda.transform(vectorizer.transform(real_samples['text']))
fake_topic_distributions = lda.transform(vectorizer.transform(fake_samples['text']))

# Display topic distributions for real news documents with labels
display_top_topics(real_topic_distributions, "Real News")

# Display topic distributions for fake news documents with labels
display_top_topics(fake_topic_distributions, "Fake News")




Real News Topic Distributions:
Document 1:
  - Israeli politics and government (0.68)
  - Immigration and border security (0.29)
Document 2:
  - Immigration and border security (0.40)
  - Economic issues (0.31)
Document 3:
  - International relations (0.43)
  - Economic issues (0.32)
Document 4:
  - Economic issues (0.48)
  - Intelligence and investigations (0.29)
Document 5:
  - U.S. elections and campaigns (0.61)
  - Legal matters (0.21)

Fake News Topic Distributions:
Document 1:
  - Immigration and border security (0.10)
  - Intelligence and investigations (0.10)
Document 2:
  - Economic issues (0.70)
  - U.S. elections and campaigns (0.25)
Document 3:
  - Social media and U.S presidents (0.61)
  - Legal matters (0.34)
Document 4:
  - Social media and U.S presidents (0.69)
  - U.S. elections and campaigns (0.13)
Document 5:
  - Intelligence and investigations (0.53)
  - U.S. elections and campaigns (0.24)


In this sample of five documents per class, topics such as immigration and border security, international relations, and economic issues are prevalent in the real news documents.

Topics such as social media, U.S. elections and campaigns, and investigations are prevalent in the fake news documents.
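Five documents per class is a small basis for a claim about prevalence. One way to check it across the whole corpus is to average the document-topic vectors within each label. The sketch below assumes a hypothetical helper `mean_topic_share` and uses made-up toy vectors rather than the notebook's data.

```python
import numpy as np

def mean_topic_share(doc_topics, labels):
    """Average the topic distribution of all documents sharing a label."""
    doc_topics = np.asarray(doc_topics, dtype=float)
    labels = np.asarray(labels)
    return {str(lab): doc_topics[labels == lab].mean(axis=0)
            for lab in np.unique(labels)}

# Toy document-topic rows (each sums to 1) with made-up values
toy_doc_topics = [[0.7, 0.2, 0.1],
                  [0.5, 0.4, 0.1],
                  [0.1, 0.2, 0.7],
                  [0.2, 0.1, 0.7]]
toy_labels = ["REAL", "REAL", "FAKE", "FAKE"]

shares = mean_topic_share(toy_doc_topics, toy_labels)
for lab, vec in shares.items():
    print(lab, np.round(vec, 2))
```

Applied here, the call would look something like `mean_topic_share(lda.transform(text_data), combined_df['label'])`, giving one 10-element average per label that can be compared topic by topic.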

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Get LDA vectors for all documents
lda_vectors = lda.transform(text_data)

# Map labels as binary (1 for REAL, 0 for FAKE)
labels = (combined_df['label'] == 'REAL').astype(int)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(lda_vectors, labels, test_size=0.2, random_state=42)

# Train the Logistic Regression classifier
clf = LogisticRegression(random_state=42)
clf.fit(X_train, y_train)

# Evaluate the classifier
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Classification Accuracy: {accuracy * 100:.2f}%")


Classification Accuracy: 90.45%
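
An accuracy figure is easier to judge next to a majority-class baseline, that is, the accuracy obtained by always predicting the most common label. The sketch below is an illustrative helper, not part of the original analysis; `majority_baseline_accuracy` and the toy labels are assumptions.

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy of a classifier that always predicts the most frequent label."""
    y_true = list(y_true)
    if not y_true:
        raise ValueError("y_true must not be empty")
    counts = Counter(y_true)
    return max(counts.values()) / len(y_true)

# Toy labels (made up): three FAKE (0) and two REAL (1)
toy_y_test = [0, 0, 0, 1, 1]
baseline = majority_baseline_accuracy(toy_y_test)
print(f"Majority-class baseline: {baseline * 100:.2f}%")
```

In the notebook, `majority_baseline_accuracy(y_test)` would give the number to compare against the classifier's accuracy.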


In [11]:
# Display topic coefficients
print("Topic Coefficients in Logistic Regression:")
for i, coef in enumerate(clf.coef_[0]):
    print(f"Topic {i} ({topic_labels[i]}): {coef:.2f}")


Topic Coefficients in Logistic Regression:
Topic 0 (Law enforcement & social issues): -2.24
Topic 1 (International relations): 5.23
Topic 2 (Economic issues): 2.80
Topic 3 (U.S. elections and campaigns): 4.26
Topic 4 (Terrorism and media coverage): -9.57
Topic 5 (Legal matters): -2.17
Topic 6 (Israeli politics and government): 7.99
Topic 7 (Social media and U.S presidents): -8.30
Topic 8 (Intelligence and investigations): 0.02
Topic 9 (Immigration and border security): 1.34


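Because REAL is encoded as 1, positive coefficients push a prediction toward real news and negative coefficients push it toward fake news. Topics on foreign policy and international relations (Topics 6 and 1) are most strongly associated with real news, while social media discussion of U.S. presidents (Topic 7), terrorism and media coverage (Topic 4) and, to a lesser extent, law enforcement/social issues (Topic 0) are associated with fake news.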
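
Since each document's topic proportions sum to 1, one topic cannot grow without another shrinking, so the coefficients are most naturally read as differences. The sketch below works through one such comparison using the coefficients printed above; the helper `log_odds_shift` is hypothetical, and the 0.10 shift is an arbitrary illustrative amount.

```python
import math

# Coefficients copied from the printed output above (index = topic number)
coef = [-2.24, 5.23, 2.80, 4.26, -9.57, -2.17, 7.99, -8.30, 0.02, 1.34]

def log_odds_shift(coef, from_topic, to_topic, mass):
    """Change in log-odds of REAL when `mass` of topic share moves
    from one topic to another, all other topics held fixed."""
    return mass * (coef[to_topic] - coef[from_topic])

# Move 10 percentage points of topic share from Topic 7 to Topic 6
delta = log_odds_shift(coef, from_topic=7, to_topic=6, mass=0.10)
print(f"log-odds change: {delta:.3f}, odds multiplier: {math.exp(delta):.2f}")
```

Note that `LogisticRegression` applies L2 regularization by default, so the coefficient magnitudes are shrunk and are better treated as relative indicators than as exact effect sizes.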

In [12]:
from sklearn.cluster import KMeans

# Filter for fake news documents
fake_news_df = combined_df[combined_df['label'] == 'FAKE']

# Transform fake news documents into LDA topic vectors
fake_news_vectors = lda.transform(vectorizer.transform(fake_news_df['text']))


In [14]:
# Apply KMeans clustering with K=10
kmeans = KMeans(n_clusters=10, random_state=42)
# Work on a copy so adding a column does not trigger SettingWithCopyWarning
fake_news_df = fake_news_df.copy()
fake_news_df['cluster'] = kmeans.fit_predict(fake_news_vectors)

# Display the distribution of documents across clusters
print(fake_news_df['cluster'].value_counts())


cluster
1    4800
8    4684
2    2411
3    2290
5    2082
4    1679
0    1631
6    1615
9    1239
7    1050
Name: count, dtype: int64


In [15]:
# Display 5 sample documents from each cluster
for cluster_num in range(10):
    print(f"\nCluster {cluster_num}:")
    sample_docs = fake_news_df[fake_news_df['cluster'] == cluster_num].sample(5, random_state=42)['text']
    for i, doc in enumerate(sample_docs, start=1):
        print(f"Document {i}: {doc[:200]}...")  



Cluster 0:
Document 1: environmental protection agency administrator scott pruitt has received an unprecedented amount of death threats, requiring a 24-hour security detail, according to an editorial published in the wall s...
Document 2: in his final act of putting americans dead last, john boehner will stand with democrats in their rabid desire to keep the abortion industry humming. because government funding for baby part harvesting...
Document 3: corporations first unfortunately, the poor factory worker in haiti couldn t help hillary or her campaign hillary but just ask her, she ll tell you she s always looking out for the little guy! so much ...
Document 4: concerns about trump s conflict of interest reached a new level with the announcement that the national park service is finalizing plans that will give one of his companies $32 million in tax subsidie...
Document 5: nothing like riding on your sister s coattails and making bucketloads of cash just because you re a mooch and a 

The clusters in the fake news documents reveal themes such as attacks on the Democratic Party, government corruption, social media censorship, anti-Obama sentiment, election fraud, and patriotic or nationalist rhetoric. These themes are often present in fake news, especially from reactionary right-wing outlets, where they serve to evoke emotional and ideological biases. This clustering analysis provides insight into the types of content commonly found in fake news and how that content aligns with particular themes or agendas.
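
Reading sample documents is one way to characterize a cluster; another is to look at which topics dominate each cluster centroid, since the clustering was run on LDA topic vectors. The sketch below assumes a hypothetical helper `describe_centroids` and made-up toy centroids in place of `kmeans.cluster_centers_`.

```python
import numpy as np

def describe_centroids(centroids, topic_names, n_top=2):
    """For each cluster centroid, list its n_top heaviest topics."""
    centroids = np.asarray(centroids, dtype=float)
    summary = {}
    for c, center in enumerate(centroids):
        order = np.argsort(center)[::-1][:n_top]  # descending by topic share
        summary[c] = [(topic_names[int(j)], float(center[j])) for j in order]
    return summary

# Toy centroids over three topics (made-up numbers, not the notebook's output)
toy_centroids = [[0.70, 0.20, 0.10],
                 [0.05, 0.15, 0.80]]
toy_names = {0: "Economic issues",
             1: "Legal matters",
             2: "Social media and U.S presidents"}

for c, tops in describe_centroids(toy_centroids, toy_names).items():
    print(f"Cluster {c}: " + ", ".join(f"{name} ({w:.2f})" for name, w in tops))
```

With the fitted objects from the cells above, the call would be roughly `describe_centroids(kmeans.cluster_centers_, topic_labels)`, which gives a compact summary to set beside the sampled documents.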