# ST446 Distributed Computing for Big Data
## Homework PART 3
### Milan Vojnovic, Christine Yuen, Simon Schoeller LT 2019
---

## P3: Topic Modelling

In this homework problem, you are asked to perform a semantic analysis of the DBLP author publications dataset `dblp/author_large.txt` that you have encountered before. *You may use GCP or your own computer. Please document your steps. The necessary initialisation actions for `nltk` are provided as part of week 8's class material.*
 


In [1]:
import numpy as np
from pyspark.sql.types import *
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
import nltk

from pyspark.ml.feature import CountVectorizer
from pyspark.sql.functions import monotonically_increasing_id

from pyspark.ml.clustering import LDA
from pyspark mllib.li

nltk.download('all')
sc.defaultParallelism

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package cess_cat is already up-

[nltk_data]    |   Package sentiwordnet is already up-to-date!
[nltk_data]    | Downloading package sentence_polarity to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package sentence_polarity is already up-to-date!
[nltk_data]    | Downloading package shakespeare to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package shakespeare is already up-to-date!
[nltk_data]    | Downloading package sinica_treebank to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package sinica_treebank is already up-to-date!
[nltk_data]    | Downloading package smultron to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package smultron is already up-to-date!
[nltk_data]    | Downloading package state_union to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package state_union is already up-to-date!
[nltk_data]    | Downloading package stopwords to
[nltk_data]    |     /usr/local/share/nltk_data.

4

**A** Use Latent Dirichlet Allocation (LDA) to cluster publications by using words in their titles and represent each publication by 10 topics. Please follow these steps:

**A.1 (max points 0)** Convert titles to tokens by:
   * Tokenizing words in the title of each publication
   * Removing stop words using the nltk package
   * Removing punctuation, numbers or other symbols
   * Lemmatizing tokens

Note that you may skip some of these editing steps or add some additional steps to edit the tokens, but if you do this provide a justification for it.



In [2]:
data_from_file = sc.\
    textFile(
        "gs://anyabucket01apr2019/author-large.txt", 
        4)

data_from_file_conv = data_from_file.map(lambda row: np.array(row.strip().split("\t")))
paperlist = data_from_file_conv.map(lambda row: (row[2]))

In [3]:
print(paperlist.take(3))

['Object SQL - A Language for the Design and Implementation of Object Databases.', 'Object SQL - A Language for the Design and Implementation of Object Databases.', 'Object SQL - A Language for the Design and Implementation of Object Databases.']


In [4]:
stop_words = set(stopwords.words('english'))
table = str.maketrans('', '', string.punctuation)
lmtzr = WordNetLemmatizer()

def get_tokens(line):

    # tokenize
    tokens = word_tokenize(line)
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    # remove punctuations from each word
    stripped = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    words = [word for word in stripped if word.isalpha()]
    # filter out stop words
    words = [w for w in words if w not in stop_words]
    # lemmatizing the words, see https://en.wikipedia.org/wiki/Lemmatisation
    words = [lmtzr.lemmatize(w) for w in words]
    return (words)

titles_rdd = paperlist.map(lambda line: (1, get_tokens(line)))

In [5]:
print(titles_rdd.take(3))

[(1, ['object', 'sql', 'language', 'design', 'implementation', 'object', 'database']), (1, ['object', 'sql', 'language', 'design', 'implementation', 'object', 'database']), (1, ['object', 'sql', 'language', 'design', 'implementation', 'object', 'database'])]


In [6]:
titles_df = spark.createDataFrame(titles_rdd, ["dummy","words"])
titles_df.cache()
titles_df.take(3)

[Row(dummy=1, words=['object', 'sql', 'language', 'design', 'implementation', 'object', 'database']),
 Row(dummy=1, words=['object', 'sql', 'language', 'design', 'implementation', 'object', 'database']),
 Row(dummy=1, words=['object', 'sql', 'language', 'design', 'implementation', 'object', 'database'])]

**A.2 (max points 5)** Convert tokens into sparse vectors



In [7]:
cv = CountVectorizer(inputCol="words", outputCol="features", minDF=2)

cv_model = cv.fit(titles_df)

titles_df_w_features = cv_model.transform(titles_df)
titles_df_w_features.cache()
titles_df_w_features.show(10)

+-----+--------------------+--------------------+
|dummy|               words|            features|
+-----+--------------------+--------------------+
|    1|[object, sql, lan...|(163793,[9,38,41,...|
|    1|[object, sql, lan...|(163793,[9,38,41,...|
|    1|[object, sql, lan...|(163793,[9,38,41,...|
|    1|[object, sql, lan...|(163793,[9,38,41,...|
|    1|[object, sql, lan...|(163793,[9,38,41,...|
|    1|[object, sql, lan...|(163793,[9,38,41,...|
|    1|[oql, c, extendin...|(163793,[41,80,64...|
|    1|[transaction, man...|(163793,[0,23,474...|
|    1|[transaction, man...|(163793,[0,23,474...|
|    1|[transaction, man...|(163793,[0,23,474...|
+-----+--------------------+--------------------+
only showing top 10 rows
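
To make the `features` column easier to read, here is a small plain-Python sketch of what a count vectoriser produces. It is not Spark's implementation: the helper names `build_vocabulary` and `to_sparse` are hypothetical, and the vocabulary ordering (by total count, ties broken alphabetically) is an assumption made for the illustration. Each document becomes a triple of vocabulary size, term indices and counts, which is the same shape as the `(163793,[9,38,41,...` entries above.

```python
from collections import Counter

def build_vocabulary(docs, min_df=2):
    """Keep terms that occur in at least `min_df` documents."""
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term
    for doc in docs:
        term_freq.update(doc)
        doc_freq.update(set(doc))
    kept = [t for t in term_freq if doc_freq[t] >= min_df]
    # Assumption for this sketch: most frequent first, ties alphabetical.
    kept.sort(key=lambda t: (-term_freq[t], t))
    return kept

def to_sparse(doc, vocabulary):
    """Return (size, indices, counts) for one tokenised document."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index[t] for t in doc if t in index)
    indices = sorted(counts)
    return len(vocabulary), indices, [float(counts[i]) for i in indices]

# Toy corpus (made-up titles, already tokenised).
docs = [
    ["object", "sql", "language", "object", "database"],
    ["transaction", "management", "database", "system"],
    ["object", "database", "system"],
]
vocab = build_vocabulary(docs, min_df=2)
sparse_docs = [to_sparse(d, vocab) for d in docs]
print(vocab)
print(sparse_docs[0])
```

With `min_df=2`, words such as `sql` that occur in only one toy document are dropped from the vocabulary, which is the effect `minDF=2` is meant to have in the cell above.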



**A.3 (max points 5)** Use LDA to find 10 topics for each publication and represent each topic by its first few most relevant words. Note that you may choose a number of topics other than 10; if you do so, please provide a justification.



In [8]:

lda = LDA(k=10, maxIter=5)

lda_model = lda.fit(titles_df_w_features)


In [9]:
# Describe topics
topics = lda_model.describeTopics(5)

print("The topics described by their top-weighted terms:")

topics.show(truncate=False)

# Shows the results
import numpy as np
topic_i = topics.select("termIndices").rdd.map(lambda r: r[0]).collect()
for i in topic_i:
    print(np.array(cv_model.vocabulary)[i])


The topics described by their top-weighted terms:
+-----+----------------------+-------------------------------------------------------------------------------------------------------------------+
|topic|termIndices           |termWeights                                                                                                        |
+-----+----------------------+-------------------------------------------------------------------------------------------------------------------+
|0    |[11, 1, 4, 3, 6]      |[0.013312622643395916, 0.010513942737104164, 0.007720714929877752, 0.0071018861541697394, 0.005858425931166294]    |
|1    |[18, 25, 1, 4, 10]    |[0.0027081270284537843, 0.00266206186588479, 0.0026542379276941544, 0.0024924058373042767, 0.0019553111068304945]  |
|2    |[8, 2, 6, 9, 29]      |[0.0027167173338182764, 0.0025410254050279774, 0.0019521300667035147, 0.0018491563838020154, 0.0018304669485381406]|
|3    |[0, 2, 3, 1, 6]       |[0.018232262204466033, 0.0173877552383

**A.4 (max points 0)** Comment on the obtained results.



Most of the top-weighted words are technical terms. Words such as `system` and `network` are among the most frequent and appear in several topics.

**B** 

**B.1-B.4** Address each question as in part A, but with each *document* representing all publication titles of a specific author. For example, if an author $Y$ wrote "introduction to databases" and "database design", then the *document* for the author $Y$ will be "introduction to databases database design". 
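
As a small illustration of the per-author *document* described above, the following plain-Python sketch groups titles by author and joins them with spaces. The function name `build_author_documents` and the sample records are made up for the example; a Spark version would typically do the same with a key-based aggregation such as `reduceByKey`.

```python
from collections import defaultdict

def build_author_documents(records):
    """Map each author to one string containing all of their titles."""
    titles_by_author = defaultdict(list)
    for author, title in records:
        titles_by_author[author].append(title)
    # Join the titles in the order in which they were seen.
    return {a: " ".join(ts) for a, ts in titles_by_author.items()}

# Made-up (author, title) records.
records = [
    ("Y", "introduction to databases"),
    ("Z", "topic models for text"),
    ("Y", "database design"),
]
author_docs = build_author_documents(records)
print(author_docs["Y"])
```

The resulting strings can then go through the same tokenising, stop-word removal and lemmatising steps as the individual titles in part A.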

**(B.1 (max points 0); B.2 (max points 5); B.3 (max points 5); B.4 (max points 0))**



**B.1**

In [10]:
authorpaperlist = data_from_file_conv.map(lambda row: (row[0], [row[2]]))
authorpapers = authorpaperlist.reduceByKey(lambda a,b: [" ".join(a+b)])
authorpapers.take(3)

[('Abraham Silberschatz',
  ["Transaction Management in Multidatabase Systems. Data Models. The Storage and Retrieval of Continuous Media Data. Serializability in multi-level monitor environments. Overview of multidatabase transaction management. On the Storage and Retrieval of Continuous Media Data. Adaptive Commitment for Distributed Real-Time Transactions. Performance Analysis of Storage Systems. Efficient Global Transaction Management in Multidatabase Systems. Efficiently Monitoring Bandwidth and Latency in IP Networks. Topology Discovery in Heterogeneous IP Networks. Optimal ISP subscription for Internet multihoming: algorithm design and implication analysis. A Biased Non-Two-Phase Locking Protocol. On Subjective Measures of Interestingness in Knowledge Discovery. Kernel Support for Recoverable-Persistent Virtual Memory. Move-to-Rear List Scheduling: A New Scheduling Algorithm for Providing QoS Guarantees. System-Wide Multiresolution. Distributed Multi-Level Recovery in Main-Memor

In [11]:
stop_words = set(stopwords.words('english'))
table = str.maketrans('', '', string.punctuation)
lmtzr = WordNetLemmatizer()

def get_tokens(line):

    # tokenize
    tokens = word_tokenize(line)
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    # remove punctuations from each word
    stripped = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    words = [word for word in stripped if word.isalpha()]
    # filter out stop words
    words = [w for w in words if w not in stop_words]
    # lemmatizing the words, see https://en.wikipedia.org/wiki/Lemmatisation
    words = [lmtzr.lemmatize(w) for w in words]
    return (words)

authortitles_rdd = authorpapers.map(lambda line: ( str(line[0]) , get_tokens(line[1][0])))

In [28]:
print((authortitles_rdd.count()))
authortitles_rdd.take(1)

581765


[('Abraham Silberschatz',
  ['transaction',
   'management',
   'multidatabase',
   'system',
   'data',
   'model',
   'storage',
   'retrieval',
   'continuous',
   'medium',
   'data',
   'serializability',
   'multilevel',
   'monitor',
   'environment',
   'overview',
   'multidatabase',
   'transaction',
   'management',
   'storage',
   'retrieval',
   'continuous',
   'medium',
   'data',
   'adaptive',
   'commitment',
   'distributed',
   'realtime',
   'transaction',
   'performance',
   'analysis',
   'storage',
   'system',
   'efficient',
   'global',
   'transaction',
   'management',
   'multidatabase',
   'system',
   'efficiently',
   'monitoring',
   'bandwidth',
   'latency',
   'ip',
   'network',
   'topology',
   'discovery',
   'heterogeneous',
   'ip',
   'network',
   'optimal',
   'isp',
   'subscription',
   'internet',
   'multihoming',
   'algorithm',
   'design',
   'implication',
   'analysis',
   'biased',
   'nontwophase',
   'locking',
   'protocol',


In [13]:
authortitles_df = spark.createDataFrame(authortitles_rdd, ["author", "words"])
authortitles_df.cache()
authortitles_df.take(3)

[Row(author='JiHoon Park', words=['duallan', 'topology', 'dualpath', 'ethernet', 'module', 'research', 'note', 'xmlbased', 'approach', 'software', 'process', 'improvement', 'internet']),
 Row(author='Gilles Parmentier', words=['cachebased', 'parallelization', 'multiple', 'sequence', 'alignment', 'problem', 'construction', 'phylogenetic', 'tree', 'parallel', 'cluster', 'bgee', 'integrating', 'comparing', 'heterogeneous', 'transcriptome', 'data', 'among', 'specie']),
 Row(author='Jrg Fischer', words=['real', 'pram', 'programming', 'integrated', 'object', 'model', 'activity', 'network', 'based', 'simulation', 'formalizing', 'timing', 'diagram', 'causal', 'dependency', 'verification', 'purpose', 'clinical', 'use', 'multimodal', 'resource', 'robot', 'assisted', 'functional', 'neurosurgery', 'development', 'platform', 'design', 'optimization', 'mobile', 'radio', 'network', 'scaled', 'cgem', 'fast', 'accelerated', 'em'])]

**B.2**

In [14]:
cv = CountVectorizer(inputCol="words", outputCol="features", minDF=2)

cv_model = cv.fit(authortitles_df)

authtitles_df_w_features = cv_model.transform(authortitles_df)
authtitles_df_w_features.cache()
authtitles_df_w_features.show(10)

+--------------------+--------------------+--------------------+
|              author|               words|            features|
+--------------------+--------------------+--------------------+
|         JiHoon Park|[duallan, topolog...|(163258,[10,24,39...|
|   Gilles Parmentier|[cachebased, para...|(163258,[6,28,36,...|
|         Jrg Fischer|[real, pram, prog...|(163258,[2,3,4,9,...|
|     Cdric Lichtenau|[real, pram, prog...|(163258,[0,38,64,...|
|  Gerald G. Pechanek|[manarray, proces...|(163258,[0,2,11,1...|
|Jan Bkgaard Pedersen|[pvmbuilder, tool...|(163258,[0,4,25,2...|
|      Alan S. Wagner|[pvmbuilder, tool...|(163258,[0,2,8,20...|
|      Marco Pedicini|[scheduling, v, c...|(163258,[28,45,52...|
|        Andr Schiper|[exploiting, atom...|(163258,[0,1,2,3,...|
|   Joo Gabriel Silva|[wmpi, library, e...|(163258,[0,1,3,4,...|
+--------------------+--------------------+--------------------+
only showing top 10 rows



**B.3**

In [15]:

lda = LDA(k=10, maxIter=5)

lda_model = lda.fit(authtitles_df_w_features)


In [16]:
# Describe topics
topics = lda_model.describeTopics(5)

print("The topics described by their top-weighted terms:")

topics.show(truncate=False)

# Shows the results
import numpy as np
topic_i = topics.select("termIndices").rdd.map(lambda r: r[0]).collect()
for i in topic_i:
    print(np.array(cv_model.vocabulary)[i])


The topics described by their top-weighted terms:
+-----+---------------------+------------------------------------------------------------------------------------------------------------------+
|topic|termIndices          |termWeights                                                                                                       |
+-----+---------------------+------------------------------------------------------------------------------------------------------------------+
|0    |[1, 11, 3, 4, 0]     |[0.01719493825020843, 0.01235987204557846, 0.009442120800160367, 0.007818244909483298, 0.006997585511421438]      |
|1    |[0, 249, 225, 234, 5]|[0.0027651047359932603, 0.0025642233725602513, 0.002397230096790655, 0.0018621328120444253, 0.0017399240715086015]|
|2    |[6, 2, 3, 0, 5]      |[0.00157914067946371, 0.0012263706378067918, 0.0011641039796168378, 0.001116592655212018, 0.0010074761539517628]  |
|3    |[0, 9, 4, 24, 1]     |[0.019314338454217473, 0.008234841858406398, 0.0077

**B.4** The top words are very similar to those in A.4: mostly technical terms, with `system` and `network` among the most common. The word `using` also ranks highly.

**B.5 (max points 5)** In addition, calculate the topic density vector for each author and use the topic density to calculate the cosine similarity for each pair of authors. For example, if the topic density vector for author X is $[x_1, x_2, x_3, \dots]$ and the topic density vector for author Y is $[y_1, y_2, y_3, \dots]$, then the cosine similarity is $\frac{x_1\cdot y_1 + x_2\cdot y_2 + x_3\cdot y_3 +\dots}{\sqrt{x_1^2+ x_2^2+ x_3^2 +\dots}\sqrt{y_1^2+ y_2^2+ y_3^2 +\dots}}$. Show the 10 most similar author pairs and comment on their similarity, if possible taking into consideration the results from the previous section.

I was not able to finish this one, but here is the start of my code.

In [65]:
authtitles_df_w_features.select("features").show(4)



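def densecalc(data):
    # data is a SparseVector of term counts (size, indices, values);
    # return the dense vector of counts normalised to sum to 1
    density = np.zeros(data.size)
    sumwords = data.values.sum()
    for i in range(len(data.indices)):
        wordind = data.indices[i]
        density[int(wordind)] = data.values[i] / sumwords
    return density


def cossim(a, b):
    # cosine similarity of two NumPy vectors of equal length
    numerator = np.sum(a * b)
    denominator = np.sqrt(np.sum(a**2)) * np.sqrt(np.sum(b**2))
    return numerator / denominator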



+--------------------+
|            features|
+--------------------+
|(163258,[10,24,39...|
|(163258,[6,28,36,...|
|(163258,[2,3,4,9,...|
|(163258,[0,38,64,...|
+--------------------+
only showing top 4 rows



In [68]:
densities = authtitles_df_w_features.select("features").rdd.map(lambda r: densecalc(r[0]))

In [None]:
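# wrap each NumPy array in a one-element tuple of plain floats so that Spark can infer the schema
dense_df = spark.createDataFrame(densities.map(lambda d: (d.tolist(),)), ["densities"])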
dense_df.show(5)
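
One possible way to complete B.5, sketched in plain Python on made-up data. It assumes that each author's topic density is already available as a list of floats (in Spark ML, `lda_model.transform(...)` adds a `topicDistribution` column that could be collected for this purpose). The names `cosine_similarity` and `most_similar_pairs` and the three toy authors are hypothetical.

```python
import heapq
import math
from itertools import combinations

def cosine_similarity(x, y):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_x * norm_y)

def most_similar_pairs(topic_density, k=10):
    """Return the k author pairs with the highest cosine similarity."""
    scored = (
        (cosine_similarity(topic_density[a], topic_density[b]), a, b)
        for a, b in combinations(sorted(topic_density), 2)
    )
    return heapq.nlargest(k, scored)

# Toy topic densities over 3 topics (made-up values).
topic_density = {
    "author_x": [0.8, 0.1, 0.1],
    "author_y": [0.7, 0.2, 0.1],
    "author_z": [0.1, 0.1, 0.8],
}
top_pairs = most_similar_pairs(topic_density, k=2)
for score, a, b in top_pairs:
    print(a, b, round(score, 4))
```

Comparing all pairs is quadratic in the number of authors, so for the 581765 authors counted earlier this brute-force loop would not be practical without restricting the comparison to a sample or using an approximate method.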