CS6120 NLP Assignment 6 - LDA
Wing Man, Kwok
Mar 21 2023

Dataset Preparation

In [1]:
# Import libraries and EDA of true and fake news dataset
import numpy as np  
import pandas as pd  

path="/kaggle/input/fake-and-real-news-dataset/"
df_real = pd.read_csv(path + 'True.csv')
df_fake = pd.read_csv(path + 'Fake.csv')

# Add y_true
df_real['RealNews?'] = True
df_fake['RealNews?'] = False

# Combine true news and fake news into one single DataFrame.
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent.
df = pd.concat([df_real, df_fake])

print(df.head())
len(df)

                                               title  \
0  As U.S. budget fight looms, Republicans flip t...   
1  U.S. military to accept transgender recruits o...   
2  Senior U.S. Republican senator: 'Let Mr. Muell...   
3  FBI Russia probe helped by Australian diplomat...   
4  Trump wants Postal Service to charge 'much mor...   

                                                text       subject  \
0  WASHINGTON (Reuters) - The head of a conservat...  politicsNews   
1  WASHINGTON (Reuters) - Transgender people will...  politicsNews   
2  WASHINGTON (Reuters) - The special counsel inv...  politicsNews   
3  WASHINGTON (Reuters) - Trump campaign adviser ...  politicsNews   
4  SEATTLE/WASHINGTON (Reuters) - President Donal...  politicsNews   

                 date  RealNews?  
0  December 31, 2017        True  
1  December 29, 2017        True  
2  December 31, 2017        True  
3  December 30, 2017        True  
4  December 29, 2017        True  


44898

1.  Fit an LDA object to the set of all news text. Examine the top n words from each topic

To recover the words behind each topic, we sort every row of the topic-word matrix with argsort(), which returns word indices in ascending order of weight. A reversed slice then keeps the top_n_words indices with the largest weights, highest first. Finally, we look those indices up in the vectorizer's vocabulary (feature_names) and print the corresponding words. Note that the model below is fit on the article titles (df['title']) rather than on the full body text.
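The argsort-and-slice step can be tried in isolation. The sketch below uses a made-up five-word vocabulary and made-up weights (both are illustrative assumptions, not values from the dataset) to show which indices the reversed slice selects.

```python
import numpy as np

# Hypothetical toy vocabulary and topic weights (not taken from the dataset).
feature_names = ["trump", "court", "iran", "video", "tax"]
topic = np.array([0.40, 0.05, 0.10, 0.30, 0.15])
top_n_words = 3

# argsort() returns indices in ascending order of weight.
# The slice [:-top_n_words - 1:-1] walks backwards from the end,
# so it yields the indices of the top_n_words largest weights, largest first.
top_idx = topic.argsort()[:-top_n_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(top_words)
```

An equivalent spelling is `np.argsort(-topic)[:top_n_words]`, which some readers find easier to follow than the negative-step slice.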

In [2]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Build the document-word count matrix from the article titles
vectorizer = CountVectorizer(stop_words='english')
doc_word_matrix = vectorizer.fit_transform(df['title'])

# Train LDA model
num_topics = 10
top_n_words = 20
lda = LatentDirichletAllocation(n_components=num_topics)
lda.fit(doc_word_matrix)

# After vectorization, extract the features (tokens/words).
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
feature_names = vectorizer.get_feature_names_out()

# lda.components_ has shape (n_topics, n_words); each row holds the unnormalized topic-word weights
for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d:" % (topic_idx))
    print(" ".join([feature_names[i] for i in topic.argsort()[:-top_n_words - 1:-1]]))

Topic 0:
trump house says white korea russia north senate tax china republican new russian election probe republicans putin healthcare plan obama
Topic 1:
court says talks eu opposition supreme brexit chief obama turkey minister myanmar ex party coalition case rohingya uk german leader
Topic 2:
iran state deal says syria islamic military iraq eu foreign attack aid new obama puerto nuclear rico syrian britain air
Topic 3:
trump says ryan paul speaker france catalan obama leader government macron pm lebanon saudi room spain catalonia independence crisis boiler
Topic 4:
trump video watch president donald gop hillary obama republican sanders cruz clinton bernie tweets cnn just debate ted gets senator
Topic 5:
video trump hillary news just black watch obama fox anti media breaking rally don white fake people new gets lives
Topic 6:
trump video twitter ban obama clinton court poll old new year travel tweet just donald tweets campaign voter hillary news
Topic 7:
trump president south new pres
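The rows of `lda.components_` are pseudo-counts rather than probabilities, so reading them as a topic-word distribution requires normalizing each row. The sketch below illustrates this on a small hypothetical matrix standing in for `lda.components_`; the numbers are invented for the example.

```python
import numpy as np

# Hypothetical stand-in for lda.components_: 2 topics x 4 words of pseudo-counts.
components = np.array([[8.0, 1.0, 0.5, 0.5],
                       [1.0, 1.0, 6.0, 2.0]])

# Divide each row by its sum so every topic becomes a distribution over words.
topic_word_probs = components / components.sum(axis=1, keepdims=True)

print(topic_word_probs)
print(topic_word_probs.sum(axis=1))  # each row now sums to 1
```

Normalizing a row does not change the order of its entries, so the top-n word lists are the same either way. The normalized values only matter when the weights themselves are reported or compared across topics.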



2.	Randomly select 5 real news examples and 5 fake news examples, and examine the topic distributions for each document

In [3]:
# Get topic distributions for 5 random examples
real_news_sample = df_real['title'].sample(n=5)
fake_news_sample = df_fake['title'].sample(n=5)

real_news_topics = lda.transform(vectorizer.transform(real_news_sample))
fake_news_topics = lda.transform(vectorizer.transform(fake_news_sample))

print("real_news_sample\n", real_news_sample, "\n")
print("fake_news_sample\n", fake_news_sample, "\n")

print("Projected news into topic space by LDA")
print("real_news_topics\n", real_news_topics, "\n")
print("fake_news_topics\n", fake_news_topics, "\n")

# Examine the topic distributions: record the dominant topic of each document
dominant_topics = []

for doc_topics in np.vstack([real_news_topics, fake_news_topics]):
    print(np.argmax(doc_topics))
    dominant_topics.append(np.argmax(doc_topics))

# Count documents per dominant topic. np.bincount gives one bin per topic id,
# whereas np.histogram would spread its bins over the observed min..max range.
print(np.bincount(dominant_topics, minlength=num_topics))

real_news_sample
 9968     Senate Judiciary Committee chairman Grassley t...
18435    Serbia accuses world of double standards over ...
10483    Clinton, Sanders both say they can beat Trump ...
19774    Factbox: Trump on Twitter (Sept 18) - U.S. Air...
11946    Korean 'comfort woman' dies in Tokyo, age 95, ...
Name: title, dtype: object 

fake_news_sample
 15095    Col. Ralph Peters On Obama’s Refusal To Live I...
16360    BREAKING: WIKILEAKS E-MAILS: SOROS AND CLINTON...
22842    Trump Announces Transgender Ban for US Militar...
9433     JUST IN: Trump Just Spoke Out On Fate Of UCLA ...
Name: title, dtype: object 

Projected news into topic space by LDA
real_news_topics
 [[0.50155733 0.22284723 0.01111156 0.01111185 0.01111281 0.01111204
  0.19781004 0.01111244 0.01111177 0.01111293]
 [0.01250061 0.65115144 0.01250045 0.24884162 0.01250205 0.01250068
  0.01250122 0.01250101 0.01250064 0.01250026]
 [0.01111216 0.01111237 0.01111172 0.01111201 0.52011638 0.01111263
  0.01111224 0.01111
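Each row of the projected matrix is a distribution over topics, so one simple way to summarize a document is by its dominant topic. The sketch below is a minimal illustration on a hypothetical 3-document, 4-topic matrix (invented values) that counts how many documents fall under each topic id.

```python
import numpy as np

num_topics = 4

# Hypothetical document-topic matrix: each row sums to 1.
doc_topics = np.array([[0.7, 0.1, 0.1, 0.1],
                       [0.1, 0.1, 0.7, 0.1],
                       [0.6, 0.2, 0.1, 0.1]])

# Dominant topic per document: index of the largest probability in each row.
dominant = doc_topics.argmax(axis=1)

# One count per topic id; minlength keeps topics that no document selected.
counts = np.bincount(dominant, minlength=num_topics)
print(dominant)
print(counts)
```

`np.bincount` is used here because the values are integer topic ids; it always returns bins 0 through `num_topics - 1`, even when some topics never appear.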

3.	Use the LDA vectors for the documents as features in a Linear Logistic Regression classifier to predict whether each document is real news or fake news. According to the resulting coefficients from the regression, find the topics which are most useful in determining whether or not something is real news or fake news

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Use LDA vectors for the documents as features
y_train = df["RealNews?"].values
x_train = lda.transform(doc_word_matrix)

# Labels for the 10 sampled titles: 5 real followed by 5 fake
y_test = [True] * 5 + [False] * 5

# Predict real or fake news by logistic regression.
# Note: the 10 sampled titles are also part of the training data, so this report
# is a sanity check rather than a held-out estimate.
clf = LogisticRegression().fit(x_train, y_train)
predictions = np.append(clf.predict(real_news_topics), clf.predict(fake_news_topics))
print(classification_report(y_test, predictions))

# Extract the coefficients of the classifier (one per topic).
# Positive values push the prediction toward RealNews? = True, negative toward False.
coefficients = clf.coef_[0]

print(clf.coef_)

topic_names = []
for i, topic in enumerate(lda.components_):
    topic_words = [feature_names[j] for j in topic.argsort()[:-11:-1]]
    topic_names.append(f"Topic {i}: {' '.join(topic_words)}")

for i, coef in enumerate(coefficients):
    print(f"{topic_names[i]}: {coef}")

              precision    recall  f1-score   support

       False       0.83      1.00      0.91         5
        True       1.00      0.80      0.89         5

    accuracy                           0.90        10
   macro avg       0.92      0.90      0.90        10
weighted avg       0.92      0.90      0.90        10

[[ 3.40334667  3.33881254  3.4355016   0.99129287 -3.20753576 -5.12883982
  -0.99781822  0.70037878 -1.65056283 -1.05563124]]
Topic 0: trump house says white korea russia north senate tax china: 3.4033466708987725
Topic 1: court says talks eu opposition supreme brexit chief obama turkey: 3.3388125434696043
Topic 2: iran state deal says syria islamic military iraq eu foreign: 3.435501597826319
Topic 3: trump says ryan paul speaker france catalan obama leader government: 0.9912928702073052
Topic 4: trump video watch president donald gop hillary obama republican sanders: -3.2075357554415187
Topic 5: video trump hillary news just black watch obama fox anti: -5.12883982
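To see which topics matter most, the coefficients can be ranked by absolute value. The sketch below copies the ten coefficients printed above into an array and sorts them; the sign shows whether a topic leans toward real news (positive, since `True` is the positive class) or fake news (negative). Treat it as a reading aid for this particular run, since LDA topics are not guaranteed to come out in the same order on a refit.

```python
import numpy as np

# Coefficients copied from the clf.coef_ output above (index = topic id).
coefficients = np.array([3.40334667, 3.33881254, 3.4355016, 0.99129287,
                         -3.20753576, -5.12883982, -0.99781822, 0.70037878,
                         -1.65056283, -1.05563124])

# Sort topic ids by descending absolute coefficient.
order = np.argsort(-np.abs(coefficients))

for i in order:
    direction = "real" if coefficients[i] > 0 else "fake"
    print(f"Topic {i}: {coefficients[i]:+.3f} (leans {direction})")
```

On these values, Topic 5 has the largest magnitude and leans fake, while Topics 2, 0, and 1 are the strongest indicators of real news.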

4.	Pick real news or fake news, use the LDA vectors for those news documents to cluster them. You can use KMeans clustering with a reasonable value for K.  Then, select 5 news documents from each resulting cluster.

In [5]:
from sklearn.cluster import KMeans

n_clusters = 10

# Create a KMeans model and fit it to the topic distributions of the real news titles
kmeans = KMeans(n_clusters=n_clusters, n_init=10)
kmeans.fit(lda.transform(vectorizer.transform(df_real["title"])))

# Print the first 5 documents in each cluster
for i in range(n_clusters):
    cluster = np.where(kmeans.labels_ == i)[0]
    print(f"\nCluster {i}:")
    # Slicing avoids an IndexError if a cluster has fewer than 5 members
    for j in cluster[:5]:
        print(f"\t- {df_real.iloc[j]['title']}")



Cluster 0:
	- As U.S. budget fight looms, Republicans flip their fiscal script
	- New York governor questions the constitutionality of federal tax overhaul
	- Man says he delivered manure to Mnuchin to protest new U.S. tax law
	- Virginia officials postpone lottery drawing to decide tied statehouse election
	- Exclusive: U.S. memo weakens guidelines for protecting immigrant children in court

Cluster 1:
	- Second court rejects Trump bid to stop transgender military recruits
	- Failed vote to oust president shakes up Peru's politics
	- In victory for Trump, judge tosses suit on foreign payments
	- U.S. court rejects Trump bid to stop transgender military recruits on Jan. 1
	- Second U.S. judge blocks Trump administration birth control rules

Cluster 2:
	- Jones certified U.S. Senate winner despite Moore challenge
	- Alabama to certify Democrat Jones winner of Senate election
	- U.S. tax cuts won't make housing more affordable: analysts
	- Net neutrality repeal gives Democrats fresh way
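Taking the first five rows of each cluster tends to return documents that sit next to each other in the file. A possible alternative, sketched below, draws a random sample per cluster. The `labels` array is a hypothetical stand-in for `kmeans.labels_`, and the seed is an arbitrary choice for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

# Hypothetical stand-in for kmeans.labels_ (cluster id per document).
labels = np.array([0, 1, 0, 2, 1, 0, 0, 2, 0, 0])
n_per_cluster = 5

samples = {}
for c in np.unique(labels):
    members = np.where(labels == c)[0]
    # Small clusters may hold fewer than n_per_cluster documents.
    k = min(n_per_cluster, len(members))
    samples[int(c)] = rng.choice(members, size=k, replace=False)

for c, idx in samples.items():
    print(f"Cluster {c}: rows {sorted(idx.tolist())}")
```

The sampled row positions could then be passed to something like `df_real.iloc[idx]['title']` to print the matching headlines.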