## Attempted implementation of the Spark mllib package

This section follows the LDA example from [the documentation](https://spark.apache.org/docs/latest/ml-clustering.html), using our data generated in [this file](https://github.com/Galeforse/DST-Assessment-05/blob/main/Gabriel%20Grant/rdd%20lda%20test.ipynb).

We tried to shape our data to match the data used in the examples, using the packages imported below. This proved difficult: we did not fully understand the intricacies of the package, and the documentation offers only very simple examples. After many failed attempts and extensive web research we managed to convert our data into a form the package accepts, but we still did not obtain any particularly useful results.

Without a wide variety of examples to cross-reference, it was hard to apply the package to our data. I managed to convert the data to look like the data in the examples, but in doing so the data seemed to lose its meaning. The package also did not appear to use any form of dictionary, so the topics seemed practically random, with no apparent relation between the top terms.
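The conversion itself lives in the linked notebook and is not reproduced here. As a rough illustration of what such a conversion involves, the sketch below writes word-count dictionaries as libsvm lines against one shared vocabulary. The function names and toy documents are hypothetical, and the input shape (one `term -> count` dictionary per document) is an assumption. As far as we understand, Spark's libsvm reader expects one-based feature indices in ascending order and converts them to zero-based indices on load, so the position of a term in the vocabulary should match the `termIndices` reported by the model.

```python
def build_vocabulary(documents):
    """Assign each distinct term a fixed zero-based position, in first-seen order."""
    vocab = {}
    for counts in documents:
        for term in counts:
            vocab.setdefault(term, len(vocab))
    return vocab


def to_libsvm_line(counts, vocab, label=0):
    """Format one document as 'label index:count ...' with one-based, ascending indices."""
    pairs = sorted((vocab[term] + 1, n) for term, n in counts.items() if n > 0)
    return " ".join([str(label)] + [f"{i}:{n}" for i, n in pairs])


# Hypothetical toy documents standing in for the per-report word counts.
docs = [
    {"ncsc": 3, "spyware": 1},
    {"spyware": 2, "pcs": 4},
]
vocab = build_vocabulary(docs)
lines = [to_libsvm_line(d, vocab, label=i) for i, d in enumerate(docs)]
print("\n".join(lines))
```

Keeping `vocab` alongside the file is the important part of this sketch: it is the only link between a column index and the word it stands for.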

The previously linked file also contains attempts at a similar implementation of LDA using data formatted as RDDs. This again meant converting our data from the familiar Pandas DataFrame to the far less familiar Spark DataFrame format, and the package would not work unless the data was in exactly the format it expected. I tried re-running the pipeline used in a previous part of the report to filter and lemmatize the data, to see whether this would give more suitable results. However, weighing the effort required against the fact that we already had SparkNLP running well and doing a very similar job, it did not seem worth pursuing this any further in the time we had available.

In [1]:
from pyspark.ml.clustering import LDA
from pyspark.sql import SparkSession
from timeit import default_timer as timer

Here we create a SparkSession. Confusingly, certain Spark packages seem to work only when a particular type of Spark instance is running.

In [2]:
spark = SparkSession \
        .builder \
        .appName("LDAExample") \
        .getOrCreate()

In [3]:
type(spark)

pyspark.sql.session.SparkSession

As we see above, this is a `pyspark.sql.session.SparkSession`. This is different from the Spark we ran in the workshop, which worked off a `SparkContext`, although the two seem quite similar in function; a SparkSession wraps a SparkContext, which is available as `spark.sparkContext`.

In [4]:
dataset = spark.read.format("libsvm").load("list_5.txt")

In [5]:
# Time how long it takes to fit a 10-topic model with 10 iterations
start = timer()
lda = LDA(k=10, maxIter=10)
model = lda.fit(dataset)
end = timer()
print(end - start)

3.5073561999999985


In [6]:
ll = model.logLikelihood(dataset)
lp = model.logPerplexity(dataset)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))

The lower bound on the log likelihood of the entire corpus: -345816.93348027
The upper bound on perplexity: 7.8734332106978275
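
The two numbers above are related. As far as we understand Spark's implementation, `logPerplexity` is the negative log likelihood bound divided by the total number of tokens in the corpus; treat that definition as an assumption of this sketch. Under it, the printed values let us recover the corpus size and a per-token perplexity:

```python
import math

# Values printed by the cell above.
log_likelihood = -345816.93348027
log_perplexity = 7.8734332106978275

# Assumption: logPerplexity = -logLikelihood / (total token count of the corpus).
token_count = -log_likelihood / log_perplexity

# Assumption: the bound is in natural-log units, so exp() gives a per-token perplexity.
per_token_perplexity = math.exp(log_perplexity)

print(round(token_count))
print(per_token_perplexity)
```

The ratio comes out within rounding error of 43922, a whole number, which is consistent with the assumed definition and suggests the corpus in `list_5.txt` contains that many tokens.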


In [7]:
topics = model.describeTopics(3)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)

The topics described by their top-weighted terms:
+-----+------------------+---------------------------------------------------------------------+
|topic|termIndices       |termWeights                                                          |
+-----+------------------+---------------------------------------------------------------------+
|0    |[2, 3, 0]         |[2.1220105465170717E-4, 2.0555473265004118E-4, 1.9335447291411753E-4]|
|1    |[2556, 7049, 3276]|[1.7550251272777155E-4, 1.753752814376797E-4, 1.7440217125461052E-4] |
|2    |[299, 237, 474]   |[2.806766970068444E-4, 2.451525197585433E-4, 2.33487290749287E-4]    |
|3    |[125, 101, 116]   |[4.88522444078734E-4, 4.7991674529694676E-4, 3.8305296586093734E-4]  |
|4    |[299, 237, 474]   |[3.995415062855056E-4, 3.4246963533755864E-4, 3.221562049409293E-4]  |
|5    |[1906, 2718, 1200]|[1.7357597850758746E-4, 1.701536244093257E-4, 1.694338983731133E-4]  |
|6    |[0, 1, 7]         |[0.005858437384179291, 0.004798537356599913, 0.0041

In [8]:
import pickle
from urllib.request import urlopen

x = urlopen("https://github.com/Galeforse/DST-Assessment-05/raw/main/Data/Report_counts_dict.p")
counts = pickle.load(x)

In [16]:
# Terms of one report, in dictionary order, used to look up the term indices above
k = list(counts["23rd April 2021"].keys())
print(" ".join(k[i] for i in (2, 3, 0)))

spyware pcs ncsc


In [17]:
print(" ".join(k[i] for i in (2556, 7049, 3276)))

enterprises cyberscoop desire


In [18]:
print(" ".join(k[i] for i in (299, 237, 474)))

recklessly harnessing discreet


I am unsure exactly how this package works. It does not seem to learn LDA in the way we are used to: it makes no use of a dictionary, so the topic clustering appears random and is not particularly interesting to draw conclusions from. This is mostly because we had to convert the data to vectorised counts before the package would accept it; it would not work with strings and constantly threw errors. Without a link back to the strings those counts represent, it makes sense that there is no cohesion in the results. The documentation examples used purely numerical data, and when searching for further examples that used text it was hard to find any in Python; most Python examples instead recommended the SparkNLP package.
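
Since the model only reports term indices, the link back to words has to be kept by hand. A minimal sketch of that lookup follows; the helper name is hypothetical, and the toy vocabulary reuses the three words printed earlier, with a placeholder at position 1 because that word is not shown in this notebook.

```python
def terms_for_topic(term_indices, vocabulary):
    """Translate zero-based term indices into the words they stand for."""
    return [vocabulary[i] for i in term_indices]


# Toy vocabulary: positions 0, 2 and 3 match the lookup printed earlier;
# position 1 is a placeholder, not real data.
vocabulary = ["ncsc", "<placeholder>", "spyware", "pcs"]

top_terms = terms_for_topic([2, 3, 0], vocabulary)
print(" ".join(top_terms))
```

This lookup is only meaningful if `vocabulary` is in exactly the order used to number the columns when the libsvm file was written. If the file was built from a different ordering than the keys of a single report's dictionary, the words returned would be unrelated to what the model saw, which would be one possible explanation for topics that look random.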

In [12]:
# Second run: 8 topics and many more iterations, with the online optimizer set explicitly
start = timer()
lda = LDA(k=8, maxIter=200, optimizer="online")
model = lda.fit(dataset)
end = timer()
print(end - start)

13.224187699999998


In [13]:
topics = model.describeTopics(3)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)

The topics described by their top-weighted terms:
+-----+------------------+---------------------------------------------------------------------+
|topic|termIndices       |termWeights                                                          |
+-----+------------------+---------------------------------------------------------------------+
|0    |[2, 3, 0]         |[1.2995505950716996E-4, 1.2976493016961992E-4, 1.2941609796973176E-4]|
|1    |[2556, 7049, 3276]|[1.2890833247771697E-4, 1.2890468914848928E-4, 1.288768236700432E-4] |
|2    |[299, 237, 474]   |[1.3462094242388926E-4, 1.324381390790369E-4, 1.3165791137095166E-4] |
|3    |[125, 101, 2]     |[1.399312753735091E-4, 1.3858388877681222E-4, 1.376339746352359E-4]  |
|4    |[299, 237, 474]   |[1.3623260160529753E-4, 1.338991027413662E-4, 1.3309318220729968E-4] |
|5    |[1906, 2718, 1200]|[1.2885053889353682E-4, 1.2875273335901836E-4, 1.2873216471309296E-4]|
|6    |[0, 1, 2]         |[0.00966607396047121, 0.0075833849721201755, 0.0065

In [19]:
print(" ".join(k[i] for i in (1906, 2718, 1200)))

deployednfurther defense batik
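
One way to put a number on the impression that these topics are not distinct is to check the second table for repeated top terms and for how flat the weights are. The sketch below copies rows 0 to 5 of that table by hand (row 6 is truncated in the output and is left out); the variable names are our own.

```python
from collections import defaultdict

# Rows 0-5 of the second describeTopics(3) table: (termIndices, termWeights).
topics = {
    0: ([2, 3, 0], [1.2995505950716996e-4, 1.2976493016961992e-4, 1.2941609796973176e-4]),
    1: ([2556, 7049, 3276], [1.2890833247771697e-4, 1.2890468914848928e-4, 1.288768236700432e-4]),
    2: ([299, 237, 474], [1.3462094242388926e-4, 1.324381390790369e-4, 1.3165791137095166e-4]),
    3: ([125, 101, 2], [1.399312753735091e-4, 1.3858388877681222e-4, 1.376339746352359e-4]),
    4: ([299, 237, 474], [1.3623260160529753e-4, 1.338991027413662e-4, 1.3309318220729968e-4]),
    5: ([1906, 2718, 1200], [1.2885053889353682e-4, 1.2875273335901836e-4, 1.2873216471309296e-4]),
}

# Group topics that report exactly the same top terms in the same order.
by_terms = defaultdict(list)
for topic_id, (indices, _) in topics.items():
    by_terms[tuple(indices)].append(topic_id)
duplicates = {terms: ids for terms, ids in by_terms.items() if len(ids) > 1}

# Ratio of the largest to the smallest of the three listed weights, per topic.
spread = {topic_id: w[0] / w[-1] for topic_id, (_, w) in topics.items()}

print(duplicates)
print(max(spread.values()))
```

For these rows, topics 2 and 4 share the same three top terms, and in no topic does the first weight exceed the third by as much as 3%. Weights that flat suggest the model has not separated the vocabulary into meaningfully different topics.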


In [15]:
spark.stop()

## References

[Spark Documentation](https://spark.apache.org/docs/latest/ml-clustering.html)

[Further Spark Documentation (RDD)](https://spark.apache.org/docs/latest/mllib-clustering.html)

[Official Spark GitHub Repo - contains many examples](https://github.com/apache/spark)