In [19]:
import queue
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import euclidean

In [20]:
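articles = np.load("Q2-data/science2k-doc-word.npy")
# Read the vocabulary and titles; the with-blocks close each file after reading.
with open("Q2-data/science2k-vocab.txt", "r") as vocab_file:
    terms = vocab_file.read().split()
with open("Q2-data/science2k-titles.txt", "r") as titles_file:
    titles = titles_file.read().split("\n")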

In [24]:
print(articles[0].shape)
print(min(articles[50]), max(articles[10]))
print(len(terms))
print(titles[0])

(5476,)
-0.5965484 11.04432
5476
"Archaeology in the Holy Land"


In [41]:
# Only k=3 is fitted here; widen the range to compare other values of k.
for k in range(3, 4):
    kmeans = KMeans(n_clusters=k, random_state=0).fit(articles)
    clusters = kmeans.predict(articles)
    # Group the titles by their assigned cluster label.
    cluster_nodes = []
    for y in range(k):
        cluster_nodes.append([(clusters[x], titles[x]) for x in range(len(clusters)) if clusters[x] == y])
    # Write one block of titles per cluster; the with-block closes the file.
    with open(str(k) + "_clusters.txt", "w") as f:
        for counter, cluster in enumerate(cluster_nodes):
            f.write(str(counter) + "\n")
            for x in cluster:
                f.write(x[1] + "\n")
            f.write("\n")

I pick K=4 as the best k after an empirical review of the cluster files.
At K=4, I notice 4 distinct groupings:
1: Biochem and Genetics
2: Earth Science
3: Biomedicine and Infectious Disease
4: Physics and Material Science
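
As a numeric companion to the empirical review, one option is to compare the within-cluster sum of squares across candidate values of k. The sketch below is an illustration on toy data, not the procedure used above; `within_cluster_ss` and the `toy_points` array are hypothetical names. For a fitted scikit-learn model, `kmeans.inertia_` reports the same quantity.

```python
import numpy as np

def within_cluster_ss(points, labels, centers):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = points - centers[labels]  # row i minus the centroid of its cluster
    return float(np.sum(diffs ** 2))

# Toy data (an assumption for illustration): two tight pairs of points.
toy_points = np.array([[0.0, 0.0], [0.0, 2.0], [5.0, 5.0], [5.0, 7.0]])

# One cluster: every point is assigned to the overall mean.
one_cluster = within_cluster_ss(
    toy_points,
    np.zeros(4, dtype=int),
    toy_points.mean(axis=0, keepdims=True),
)

# Two clusters: each pair is assigned to its own centroid.
two_clusters = within_cluster_ss(
    toy_points,
    np.array([0, 0, 1, 1]),
    np.array([[0.0, 1.0], [5.0, 6.0]]),
)
print(one_cluster, two_clusters)
```

A sharp drop followed by a plateau as k grows is a common heuristic for a reasonable k, though it does not replace reading the clusters themselves.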

For that value, report the top 10 documents of each cluster in order of the largest positive
distance from the average value across all data. More specifically, if $x$ is the 5476-vector of
average values across documents and $m_i$ is the $i$th mean, report the titles associated with the
lowest distance $\lVert m_i - x \rVert_2$. You can find the titles in the science2k-titles.txt file.
Comment on these results. What has the algorithm captured? How might such an algorithm
be useful?
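
The quantity $\lVert m_i - x \rVert_2$ in the prompt can be sketched as follows. This is a minimal illustration on toy arrays, assuming `data` plays the role of the document matrix and `centers` the role of `kmeans.cluster_centers_`; the helper name `centroid_offsets` is hypothetical.

```python
import numpy as np

def centroid_offsets(data, centers):
    """Return ||m_i - x||_2 for each centroid m_i, where x is the mean row of data."""
    overall_mean = data.mean(axis=0)  # x: average over all rows (documents)
    return np.linalg.norm(centers - overall_mean, axis=1)

# Toy stand-ins (assumptions) for the document matrix and the fitted centroids.
toy_data = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [2.0, 4.0]])
toy_centers = np.array([[1.0, 0.0], [1.0, 4.0]])

offsets = centroid_offsets(toy_data, toy_centers)
print(offsets)
```

With the real data, the same call would take `articles` and `kmeans.cluster_centers_` and return one distance per cluster.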

In [25]:
k = 4
kmeans = KMeans(n_clusters=k, random_state=0).fit(articles)
clusters = kmeans.predict(articles)
centers = kmeans.cluster_centers_
print(len(articles))
cluster_nodes = []
for y in range(k):
    cluster_nodes.append([(articles[x], titles[x]) for x in range(len(clusters)) if clusters[x] == y])
print(cluster_nodes[0][0])

1373
(array([ 11.26102  ,   9.41522  ,   9.009771 , ...,  -0.8937662,
        -0.8937662,  -0.8937662]), '"Structural Basis of Smad2 Recognition by the Smad Anchor for Receptor Activation"')


In [26]:
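for x in range(len(cluster_nodes)):
    # Order the cluster's documents by Euclidean distance to its centroid.
    pq = queue.PriorityQueue()
    cluster_points = cluster_nodes[x]
    cluster_center = centers[x]
    for point, title in cluster_points:
        pq.put((euclidean(cluster_center, point), title))
    print("Top 10 in cluster " + str(x))
    # Cap at the queue size so pq.get() cannot block on a small cluster.
    for y in range(min(pq.qsize(), 10)):
        print(pq.get()[1])
    print("\n")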

Top 10 in cluster 0
"Requirement of NAD and SIR2 for Life-Span Extension by Calorie Restriction in Saccharomyces Cerevisiae"
"Suppression of Mutations in Mitochondrial DNA by tRNAs Imported from the Cytoplasm"
"Distinct Classes of Yeast Promoters Revealed by Differential TAF Recruitment"
"Efficient Initiation of HCV RNA Replication in Cell Culture"
"Ubiquitination: More Than Two to Tango"
"Negative Regulation of the SHATTERPROOF Genes by FRUITFULL during Arabidopsis Fruit Development"
"T Cell-Independent Rescue of B Lymphocytes from Peripheral Immune Tolerance"
"Reduced Food Intake and Body Weight in Mice Treated with Fatty Acid Synthase Inhibitors"
"Patterning of the Zebrafish Retina by a Wave of Sonic Hedgehog Activity"
"Coupling of Stress in the ER to Activation of JNK Protein Kinases by Transmembrane Protein Kinase IRE1"


Top 10 in cluster 1
"Population Dynamical Consequences of Climate Change for a Small Temperate Songbird"
"The Formation of Chondrules at High Gas Pressures in th

(b) The file science2k-word-doc.txt is similar, but captures term-wise rather than document-wise
features. That is, for each term, we count the frequency as the number of documents that
term appears in rather than the other way around. This allows us to characterize individual
terms.
This matrix is 5476×1373, where each row is a term in Science described by 1373 “document”
features. These are transformed document frequencies (as above). Repeat the analysis above,
but cluster terms instead of documents. The terms are listed in science2k-vocab.txt.
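
Because the same ranking is needed for documents and for terms, it can be factored into one helper. The sketch below is an assumption-laden alternative to the priority-queue loop, using `np.argsort`; `closest_members` and the toy arrays are hypothetical names, and the arguments stand in for the data matrix, `kmeans.predict(...)` labels, `kmeans.cluster_centers_`, and the list of titles or terms.

```python
import numpy as np

def closest_members(points, labels, centers, names, cluster, n=10):
    """Names of up to n members of `cluster`, nearest to its centroid first."""
    member_idx = np.flatnonzero(labels == cluster)  # rows assigned to this cluster
    dists = np.linalg.norm(points[member_idx] - centers[cluster], axis=1)
    order = np.argsort(dists)[:n]  # slicing handles clusters with fewer than n members
    return [names[i] for i in member_idx[order]]

# Toy stand-ins (assumptions) for the matrix, labels, centroids, and names.
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 3.0], [10.0, 10.0]])
toy_labels = np.array([0, 0, 0, 1])
toy_centers = np.array([[0.0, 1.0], [10.0, 10.0]])
toy_names = ["a", "b", "c", "d"]

nearest = closest_members(toy_points, toy_labels, toy_centers, toy_names, cluster=0, n=2)
singleton = closest_members(toy_points, toy_labels, toy_centers, toy_names, cluster=1)
print(nearest, singleton)
```

Passing `words`, the predicted labels, the fitted centers, and `terms` would give the term-level version of the same report.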

In [27]:
words = np.load("Q2-data/science2k-word-doc.npy")

In [10]:
for k in range(2, 21):
    kmeans = KMeans(n_clusters=k, random_state=0).fit(words)
    clusters = kmeans.predict(words)
    # Group the terms by their assigned cluster label.
    cluster_nodes = []
    for y in range(k):
        cluster_nodes.append([(clusters[x], terms[x]) for x in range(len(clusters)) if clusters[x] == y])
    # Write one block of terms per cluster; the with-block closes the file.
    with open(str(k) + "_r_clusters.txt", "w") as f:
        for counter, cluster in enumerate(cluster_nodes):
            f.write(str(counter) + "\n")
            for x in cluster:
                f.write(x[1] + "\n")
            f.write("\n")

In [39]:
k = 10
kmeans = KMeans(n_clusters=k, random_state=0).fit(words)
clusters = kmeans.predict(words)
centers = kmeans.cluster_centers_
cluster_nodes = []
for y in range(k):
    cluster_nodes.append([(words[x], terms[x]) for x in range(len(clusters)) if clusters[x] == y])

In [40]:
for x in range(len(cluster_nodes)):
    pq = queue.PriorityQueue()
    cluster_points = cluster_nodes[x]
    cluster_center = centers[x]
    for point,term  in cluster_points:
        pq.put((euclidean(cluster_center, point), term))
    print("Top 10 in cluster " + str(x))
    for y in range(min(pq.qsize(), 10)):
        print(pq.get()[1])
    print("\n")

Top 10 in cluster 0
blot
incubated
stained
induction
staining
kinase
intracellular
inhibition
assays
promoter


Top 10 in cluster 1
aptamers
trxr
lcts
dnag
proteorhodopsin
doxy
nompc
neas
lg268
rory


Top 10 in cluster 2
dispersion
photon
approximation
momentum
angular
polarization
finite
excited
coherent
energies


Top 10 in cluster 3
figs
intermediate
natl
acad
start
composed
represented
substantially
follows
marked


Top 10 in cluster 4
polymerase
nucleotide
genomic
pcr
conserved
acids
residues
amino
mrna
mutation


Top 10 in cluster 5
interglacial
clim
volcanism
upwelling
interannual
crater
tectonics
plume
decadal
convective


Top 10 in cluster 6
november


Top 10 in cluster 7
concentration
concentrations


Top 10 in cluster 8
vol
p21
cdnas
triton
cyclin
cytosol
eco
mitochondria
methionine
isoforms


Top 10 in cluster 9
recalls
clinton
geneticist
fight
security
prize
spending
campaign
hes
rights




Clustering by words might be useful for discovering trends in research or for characterizing domain-specific vocabularies (in English). Clustering words gives insight into how the words are used, while clustering documents by the words they contain gives insight into the themes of the documents.