## Attempted implementation of the Spark mllib package

Here we follow the LDA example from [the documentation](https://spark.apache.org/docs/latest/ml-clustering.html), using data that we generated in [this file](https://github.com/Galeforse/DST-Assessment-05/blob/main/Gabriel%20Grant/rdd%20lda%20test.ipynb).

We tried to shape our data to match the data used in the examples, using the packages below. However, we did not fully understand the intricacies of how the package works, and the documentation is sparse, offering only very simple examples. As a result, although we did manage to convert our data into a form the package accepts (after many failed attempts and extensive web research), we did not obtain any particularly useful results.

It was hard to apply the package to our data without a wide variety of examples to cross-reference. I managed to convert the data to look like the data in the examples; however, in doing so the data seemed to lose its meaning. The model also did not appear to use any form of dictionary, so the topics seemed practically random, with no apparent relationship between the top terms.

The previously linked file also contains attempts at another, similar implementation of LDA that takes data formatted as RDDs. Once again this meant converting our data from the very familiar Pandas DataFrame to the far less familiar Spark DataFrame format, and the package would not run unless the data was in exactly the format it expected. I tried re-running the pipeline used in an earlier part of the report to filter and lemmatize the data, to see whether this would give more suitable results. However, given the investment required to get this working, and the fact that we already had SparkNLP running well and doing a very similar job, it did not seem worth pursuing this any further in the time available.

In [1]:
from pyspark.ml.clustering import LDA
from pyspark.sql import SparkSession
from timeit import default_timer as timer

Here we create a SparkSession. Confusingly, it seems that certain Spark packages only work when a particular type of Spark entry point is running.

In [2]:
spark = SparkSession \
        .builder \
        .appName("LDAExample") \
        .getOrCreate()

In [3]:
type(spark)

pyspark.sql.session.SparkSession

As we see above, the type is `pyspark.sql.session.SparkSession`. This is different from the Spark we ran in the workshop, which worked off a SparkContext; however, the two seem to be quite similar in function (a SparkSession wraps a SparkContext, which is available as `spark.sparkContext`).

In [4]:
dataset = spark.read.format("libsvm").load("list_5.txt")
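
The contents of `list_5.txt` are not shown here, so the following is only a sketch of how tokenised documents could be written as LIBSVM lines while keeping the vocabulary that gives the indices their meaning. The toy corpus and the helper names `build_vocabulary` and `to_libsvm_line` are hypothetical. It assumes the usual LIBSVM convention of 1-based, ascending indices in the file; to the best of our understanding Spark converts these to 0-based indices on loading, so a term index `i` reported by the model would correspond to `vocabulary[i]`.

```python
from collections import Counter


def build_vocabulary(docs):
    """Return a sorted list of every distinct token across all documents."""
    return sorted({token for doc in docs for token in doc})


def to_libsvm_line(tokens, term_to_index, label=0):
    """Encode one tokenised document as a LIBSVM line of term counts."""
    counts = Counter(tokens)
    # LIBSVM indices are 1-based and must appear in ascending order
    pairs = sorted((term_to_index[t] + 1, c) for t, c in counts.items())
    return " ".join([str(label)] + [f"{i}:{c}" for i, c in pairs])


# Hypothetical toy corpus, not the report data
docs = [
    ["ransomware", "attack", "school", "ransomware"],
    ["phishing", "email", "attack"],
]

vocabulary = build_vocabulary(docs)
term_to_index = {term: i for i, term in enumerate(vocabulary)}
lines = [to_libsvm_line(doc, term_to_index) for doc in docs]

print(vocabulary)
for line in lines:
    print(line)
```

Saving `vocabulary` alongside the LIBSVM file is the important part: without it, the term indices in the fitted topics cannot be translated back into words.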

In [5]:
start = timer()
lda = LDA(k=10, maxIter=10)  # 10 topics, 10 iterations
model = lda.fit(dataset)
end = timer()
print(end - start)  # time taken to fit the model, in seconds

3.5073561999999985


In [6]:
ll = model.logLikelihood(dataset)
lp = model.logPerplexity(dataset)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))

The lower bound on the log likelihood of the entire corpus: -345816.93348027
The upper bound on perplexity: 7.8734332106978275
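
To help interpret these two numbers, the sketch below relates them to each other. It assumes that the reported log perplexity is the negative log-likelihood bound divided by the total number of tokens in the corpus; under that assumption the two printed values imply a token count, and exponentiating the log perplexity gives a per-token perplexity. The variable names are our own.

```python
import math

# Values copied from the output above
log_likelihood = -345816.93348027
log_perplexity = 7.8734332106978275

# Assumption: log_perplexity = -log_likelihood / (total tokens in the corpus)
implied_token_count = -log_likelihood / log_perplexity
print(round(implied_token_count))  # 43922

# Per-token perplexity on the ordinary (non-log) scale
print(math.exp(log_perplexity))
```

The ratio comes out as a whole number, which is consistent with the assumed relationship. Lower perplexity is better, so this figure is mainly useful for comparing runs with different settings on the same dataset.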


In [7]:
topics = model.describeTopics(3)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)

The topics described by their top-weighted terms:
+-----+------------------+---------------------------------------------------------------------+
|topic|termIndices       |termWeights                                                          |
+-----+------------------+---------------------------------------------------------------------+
|0    |[2, 3, 0]         |[2.1220105465170717E-4, 2.0555473265004118E-4, 1.9335447291411753E-4]|
|1    |[2556, 7049, 3276]|[1.7550251272777155E-4, 1.753752814376797E-4, 1.7440217125461052E-4] |
|2    |[299, 237, 474]   |[2.806766970068444E-4, 2.451525197585433E-4, 2.33487290749287E-4]    |
|3    |[125, 101, 116]   |[4.88522444078734E-4, 4.7991674529694676E-4, 3.8305296586093734E-4]  |
|4    |[299, 237, 474]   |[3.995415062855056E-4, 3.4246963533755864E-4, 3.221562049409293E-4]  |
|5    |[1906, 2718, 1200]|[1.7357597850758746E-4, 1.701536244093257E-4, 1.694338983731133E-4]  |
|6    |[0, 1, 7]         |[0.005858437384179291, 0.004798537356599913, 0.0041

In [8]:
import pickle
from urllib.request import urlopen

x = urlopen("https://github.com/Galeforse/DST-Assessment-05/raw/main/Data/Report_counts_dict.p")
counts = pickle.load(x)

In [16]:
k=list(counts["23rd April 2021"].keys())
print(k[2]+str(" ")+k[3]+str(" ")+k[0])

spyware pcs ncsc


In [17]:
print(k[2556]+str(" ")+k[7049]+str(" ")+k[3276])

enterprises cyberscoop desire


In [18]:
print(k[299]+str(" ")+k[237]+str(" ")+k[474])

recklessly harnessing discreet


I'm unsure exactly how this package works. It doesn't seem to learn LDA in a typical way: there is no use of a dictionary, so the topic clustering seems random and not particularly interesting to draw conclusions from. This is mostly because we had to convert the data to vectorised counts before the package would accept it; it would not work with strings and constantly threw errors. Without a link back to the strings that the counts represent, it makes sense that there is no cohesion among the results. The documentation examples used purely numerical data, and when searching for further examples that used text it was hard to find any in Python; most Python examples instead recommended the SparkNLP package.
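
The lookups above index into the keys of a single date's count dictionary. That only recovers the right words if the order of those keys matches the feature order used when `list_5.txt` was written, which is not shown here. As a minimal sketch of a safer lookup, assuming a saved vocabulary list whose position `i` corresponds to term index `i`, the hypothetical helper `decode_topic` below maps indices back to words and flags any index that the vocabulary does not cover.

```python
def decode_topic(term_indices, vocabulary):
    """Map zero-based term indices to words, flagging out-of-range indices."""
    return [
        vocabulary[i] if 0 <= i < len(vocabulary) else f"<unknown:{i}>"
        for i in term_indices
    ]


# Hypothetical vocabulary, in the same order used to build the count vectors
vocabulary = ["attack", "email", "phishing", "ransomware", "school"]

decoded = decode_topic([3, 0, 4], vocabulary)
out_of_range = decode_topic([2, 9], vocabulary)

print(decoded)       # ['ransomware', 'attack', 'school']
print(out_of_range)  # ['phishing', '<unknown:9>']
```

An `<unknown:…>` entry would be a sign that the vocabulary being used for decoding is not the one the vectors were built from.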

In [12]:
start = timer()
lda = LDA(k=8, maxIter=200, optimizer="online")  # 8 topics, 200 iterations
model = lda.fit(dataset)
end = timer()
print(end - start)  # time taken to fit the model, in seconds

13.224187699999998


In [13]:
topics = model.describeTopics(3)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)

The topics described by their top-weighted terms:
+-----+------------------+---------------------------------------------------------------------+
|topic|termIndices       |termWeights                                                          |
+-----+------------------+---------------------------------------------------------------------+
|0    |[2, 3, 0]         |[1.2995505950716996E-4, 1.2976493016961992E-4, 1.2941609796973176E-4]|
|1    |[2556, 7049, 3276]|[1.2890833247771697E-4, 1.2890468914848928E-4, 1.288768236700432E-4] |
|2    |[299, 237, 474]   |[1.3462094242388926E-4, 1.324381390790369E-4, 1.3165791137095166E-4] |
|3    |[125, 101, 2]     |[1.399312753735091E-4, 1.3858388877681222E-4, 1.376339746352359E-4]  |
|4    |[299, 237, 474]   |[1.3623260160529753E-4, 1.338991027413662E-4, 1.3309318220729968E-4] |
|5    |[1906, 2718, 1200]|[1.2885053889353682E-4, 1.2875273335901836E-4, 1.2873216471309296E-4]|
|6    |[0, 1, 2]         |[0.00966607396047121, 0.0075833849721201755, 0.0065

In [19]:
print(k[1906]+str(" ")+k[2718]+str(" ")+k[1200])

deployednfurther defense batik
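
One way to quantify how uninformative these topics are is to look at how flat the weights are. The sketch below uses the top-3 weights copied from the second `describeTopics` output for three of the topics. It assumes the weights of each topic are normalised to sum to 1, in which case the largest weight can never be smaller than 1 divided by the vocabulary size, so `1 / top_weight` is a lower bound on the vocabulary size. The names used are our own.

```python
# Top-3 term weights copied from the second describeTopics output above
top_weights = {
    0: [1.2995505950716996e-4, 1.2976493016961992e-4, 1.2941609796973176e-4],
    1: [1.2890833247771697e-4, 1.2890468914848928e-4, 1.288768236700432e-4],
    5: [1.2885053889353682e-4, 1.2875273335901836e-4, 1.2873216471309296e-4],
}

flatness = {}
vocab_lower_bound = {}
for topic, weights in top_weights.items():
    # Ratio of the 1st to the 3rd weight: a value near 1 means a flat topic
    flatness[topic] = weights[0] / weights[-1]
    # Assumption: weights sum to 1, so the top weight is at least 1 / vocab size
    vocab_lower_bound[topic] = 1 / weights[0]
    print(topic, round(flatness[topic], 4), round(vocab_lower_bound[topic]))
```

Ratios this close to 1 suggest that, for these topics, the leading terms are barely distinguished from one another, which fits the impression that the top terms are close to random.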


In [15]:
spark.stop()

## Further Analysis of words and topics achieved in previous parts of the report

Below we are going to replicate code from the first part of the report. We are going to compare these "top words" that we find with the topics that the original dataset assigned to each article date.

In [10]:
import pandas as pd
import numpy as np
import pickle
from urllib.request import urlopen

In [6]:
df = pd.read_csv('https://raw.githubusercontent.com/Galeforse/DST-Assessment-05/main/Data/NCSC%20Reports.csv')

In [7]:
x = urlopen("https://github.com/Galeforse/DST-Assessment-05/raw/main/Data/Report_counts_dict.p")
report_counts_dict = pickle.load(x)
x = urlopen("https://github.com/Galeforse/DST-Assessment-05/raw/main/Data/TF_IDF_by_date.p")
tf_idf_date_dict = pickle.load(x)

In [11]:
z = np.zeros((220,11))
top10_df = pd.DataFrame(z)
top10_df.columns = ['Article','1st Word','2nd Word','3rd Word','4th Word','5th Word','6th Word','7th Word','8th Word','9th Word','10th Word']

In [12]:
import heapq
from nltk.corpus import words
# A set makes the membership test `word in eng_words` fast
eng_words = set(words.words())

In [13]:
Dates = list(tf_idf_date_dict.keys())

for num, date in enumerate(Dates[0:220]):
    top10_df.iloc[num, 0] = date
    tf_idf_dict = tf_idf_date_dict[date]
    # nlargest already returns the words sorted by descending TF-IDF score
    t10 = heapq.nlargest(10, tf_idf_dict, key=tf_idf_dict.get)

    # Keep only recognised English words; unfilled slots keep their initial 0.0
    counter = 0
    for word in t10:
        if word in eng_words:
            counter += 1
            top10_df.iloc[num, counter] = word

In [14]:
top10_df

Unnamed: 0,Article,1st Word,2nd Word,3rd Word,4th Word,5th Word,6th Word,7th Word,8th Word,9th Word,10th Word
0,23rd April 2021,pulse,connect,education,university,0.0,0.0,0.0,0.0,0.0,0.0
1,16th April 2021,exchange,server,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,12th April 2021,job,harmful,urge,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2nd April 2021,education,primary,board,federation,misinformation,obviously,transformation,0.0,0.0,0.0
4,26th March 2021,digital,career,student,loss,alert,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
215,4th November 2016,media,social,according,commit,fraud,easy,0.0,0.0,0.0,0.0
216,28th October 2016,debit,bank,micro,trend,pager,0.0,0.0,0.0,0.0,0.0
217,24th October 2016,advertising,growing,number,skimming,0.0,0.0,0.0,0.0,0.0,0.0
218,17th October 2016,involved,discreet,variant,financial,banking,0.0,0.0,0.0,0.0,0.0
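
The `0.0` entries in the table arise because the frame is initialised with zeros and the English-word filter is applied after the ten highest-scoring terms have already been selected, so any term that is filtered out leaves an empty slot. As a hedged alternative, the sketch below filters first and then takes the top terms. The helper `top_words`, the toy scores and the allowed-word set are hypothetical.

```python
import heapq


def top_words(tfidf, allowed, n=10):
    """Return up to n allowed words with the highest TF-IDF scores."""
    candidates = {word: score for word, score in tfidf.items() if word in allowed}
    return heapq.nlargest(n, candidates, key=candidates.get)


# Hypothetical scores and word list, not the report data
tfidf = {"pulse": 0.9, "vpn": 0.8, "connect": 0.7, "ncsc": 0.6, "education": 0.5}
allowed = {"pulse", "connect", "education"}

# Approach used above: take the top n, then filter
select_then_filter = [
    w for w in heapq.nlargest(3, tfidf, key=tfidf.get) if w in allowed
]
# Alternative: filter, then take the top n
filter_then_select = top_words(tfidf, allowed, n=3)

print(select_then_filter)  # ['pulse', 'connect']
print(filter_then_select)  # ['pulse', 'connect', 'education']
```

Filtering first fills more of the ten slots, at the cost of admitting lower-scoring words, so which approach is preferable depends on whether a full row or a strict score cut-off matters more.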


In [15]:
df.head()

Unnamed: 0.1,Unnamed: 0,Title,Article,topics,Links
0,0,23rd April 2021,['The NCSC is aware that a malicious piece of ...,"['Cyber attack', 'Cyber strategy', 'Education'...",https://www.ncsc.gov.uk/report/weekly-threat-r...
1,1,16th April 2021,['Cyber security researchers have uncovered a ...,"['Cyber strategy', 'Patching', 'Vulnerabilities']",https://www.ncsc.gov.uk/report/weekly-threat-r...
2,2,12th April 2021,"['Cyber security researchers, Esentire, have w...","['Phishing', 'Social media', 'Personal data', ...",https://www.ncsc.gov.uk/report/weekly-threat-r...
3,3,2nd April 2021,['The UK education sector continues to face an...,"['Education', 'Incident management', 'Secure d...",https://www.ncsc.gov.uk/report/weekly-threat-r...
4,4,26th March 2021,['Earlier this month Microsoft confirmed that ...,"['Cyber attack', 'Education', 'Mitigation', 'P...",https://www.ncsc.gov.uk/report/weekly-threat-r...


Below we will take a few entries in the dataset and compare the top words with the topics that are described in the original dataset.
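
The comparisons in this section are made by eye. As a rough sketch of how they could be automated, the code below finds the words shared between a row of top words and the topic labels. It assumes the `topics` column may be stored as the string form of a Python list (hence `ast.literal_eval`), and the helper names are hypothetical. The example values are those of the 23rd April 2021 entry.

```python
import ast
import re


def topic_tokens(topics_cell):
    """Split topic labels into a set of lower-case word tokens."""
    if isinstance(topics_cell, str):
        labels = ast.literal_eval(topics_cell)  # string form of a list
    else:
        labels = topics_cell
    return {tok for label in labels for tok in re.findall(r"[a-z]+", label.lower())}


def overlap(top_words, topics_cell):
    """Return the sorted words that appear in both the top words and the topics."""
    words = {w.lower() for w in top_words if isinstance(w, str)}  # skip 0.0 padding
    return sorted(words & topic_tokens(topics_cell))


# Values from the 23rd April 2021 entry
top = ["pulse", "connect", "education", "university", 0.0, 0.0]
topics_cell = (
    "['Cyber attack', 'Cyber strategy', 'Education', 'Vulnerabilities', "
    "'Secure design and development', 'Research & ', 'Academia']"
)

shared = overlap(top, topics_cell)
print(shared)  # ['education']
```

Exact matching like this misses related pairs such as "university" and "Academia", so it gives a lower bound on the agreement rather than a full measure of it.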

In [16]:
print(top10_df.iloc[0])
print(df["topics"][0])

Article      23rd April 2021
1st Word               pulse
2nd Word             connect
3rd Word           education
4th Word          university
5th Word                 0.0
6th Word                 0.0
7th Word                 0.0
8th Word                 0.0
9th Word                 0.0
10th Word                0.0
Name: 0, dtype: object
['Cyber attack', 'Cyber strategy', 'Education', 'Vulnerabilities', 'Secure design and development', 'Research & ', 'Academia']


Here we can see a relatively strong correspondence around education: "education" and "university" appear among the top words, and "Education" and "Academia" among the topics. It is worth noting that all of these topics involve cyber security in some way, as that is the data we chose to work with in order to make our project relevant to the cybersecurity field.

In [17]:
print(top10_df.iloc[1])
print(df["topics"][1])

Article      16th April 2021
1st Word            exchange
2nd Word              server
3rd Word                 0.0
4th Word                 0.0
5th Word                 0.0
6th Word                 0.0
7th Word                 0.0
8th Word                 0.0
9th Word                 0.0
10th Word                0.0
Name: 1, dtype: object
['Cyber strategy', 'Patching', 'Vulnerabilities']


In [18]:
print(top10_df.iloc[2])
print(df["topics"][2])

Article      12th April 2021
1st Word                 job
2nd Word             harmful
3rd Word                urge
4th Word                 0.0
5th Word                 0.0
6th Word                 0.0
7th Word                 0.0
8th Word                 0.0
9th Word                 0.0
10th Word                0.0
Name: 2, dtype: object
['Phishing', 'Social media', 'Personal data', 'Vulnerabilities']


In [19]:
print(top10_df.iloc[3])
print(df["topics"][3])

Article      2nd April 2021
1st Word          education
2nd Word            primary
3rd Word              board
4th Word         federation
5th Word     misinformation
6th Word          obviously
7th Word     transformation
8th Word                0.0
9th Word                0.0
10th Word               0.0
Name: 3, dtype: object
['Education', 'Incident management', 'Secure design and development']


Some articles repeatedly use words that do not seem especially relevant to the listed topics, although reading the article itself may reveal how they are relevant:

In [25]:
df["Article"][3]

"['The UK education sector continues to face an increased threat from ransomware attacks with a notable rise since students returned to the classroom.\\nRansomware is a type of malware which can make data or systems unusable until the victim makes a payment. This can obviously have a huge impact in an education environment.\\nThe Harris Federation, who run a number of primary and secondary schools in the London area, issued a statement confirming they had been victims of a ransomware attack. They have been working with the NCSC and the NCA since the incident.\\nLast week we re-issued an alert to the education establishments with updated advice and guidance following the trend of attacks against the sector. The original alert was published in September 2020 and schools, colleges and universities are urged to read and use the advice where possible.\\nThe NCSC has also published a number of resources for schools to help them improve their cyber security.', 'A recent report from PwC cites 

This paragraph is quite short, which highlights a potential weakness of topic modelling: it is sometimes hard to identify a topic from just a few sentences.

In [20]:
print(top10_df.iloc[52])
print(df["topics"][52])

Article      3rd April 2020
1st Word        unfortunate
2nd Word           hospital
3rd Word           convince
4th Word              click
5th Word            spotted
6th Word                try
7th Word              local
8th Word                0.0
9th Word                0.0
10th Word               0.0
Name: 52, dtype: object
['Cyber threat', 'Phishing', 'Security monitoring', 'Vulnerabilities']


In [21]:
print(top10_df.iloc[69])
print(df["topics"][69])

Article      29th November 2019
1st Word                weekend
2nd Word                   shop
3rd Word                  black
4th Word               shopping
5th Word                   help
6th Word                    0.0
7th Word                    0.0
8th Word                    0.0
9th Word                    0.0
10th Word                   0.0
Name: 69, dtype: object
['Cyber attack', 'Cyber threat', 'Personal data', 'Security monitoring', 'Vulnerabilities', 'People-centred security', 'Mitigation']


From the above I conclude that this article is probably about Black Friday sales, an increasing share of which take place online. All of the topics suggested here, particularly "People-centred security", are relevant. Checking the article below confirms this, so the method has given us a good impression of the topics it covers.

In [26]:
df["Article"][69]

"['The Black Friday and Cyber Monday sales are now upon us with consumers set to be tempted with bargains this weekend and beyond.\\nHowever, with the promise of ‘unmissable’ deals there is also an important message for consumers to consider. Ensuring your online accounts are as secure as possible before making the most of those offers is crucial and will help to defend against cyber criminals.\\nLast year we wrote a blog post about the increased risk of cyber criminals taking advantage of online shoppers at this time of year. We have also published guidance about how to shop online securely. This advice will help you avoid scams and help you with next steps if you have been unlucky enough to fall victim to cyber crime. \\nTo help online shoppers we have also been running a social media campaign which uses the 8 tips in our online shopping guidance. The campaign will help customers to focus on three important areas: preparing to shop, while you are shopping and after you’ve shopped.   

In [22]:
print(top10_df.iloc[154])
print(df["topics"][154])

Article      16th February 2018
1st Word                 mining
2nd Word                   urge
3rd Word               forensic
4th Word                    0.0
5th Word                    0.0
6th Word                    0.0
7th Word                    0.0
8th Word                    0.0
9th Word                    0.0
10th Word                   0.0
Name: 154, dtype: object
['Cyber threat']


In [23]:
df.tail(20)

Unnamed: 0.1,Unnamed: 0,Title,Article,topics,Links
201,201,3rd March 2017,['Drone-enabled hacking\nAn organisation’s mos...,['Cyber threat'],https://www.ncsc.gov.uk/report/weekly-threat-r...
202,202,24th February 2017,['Ex-employee threats to business\nA disgruntl...,['Cyber threat'],https://www.ncsc.gov.uk/report/weekly-threat-r...
203,203,17th February 2017,['Official Launch of the National Cyber Securi...,['Cyber threat'],https://www.ncsc.gov.uk/report/weekly-threat-r...
204,204,13th February 2017,"[""Polish banks in watering hole attack\nThe Po...",['Cyber threat'],https://www.ncsc.gov.uk/report/weekly-threat-r...
205,205,27th January 2017,"[""Twitterbots spreading fake news on the inter...",['Cyber threat'],https://www.ncsc.gov.uk/report/weekly-threat-r...
206,206,20th January 2017,"['Password security\nIn November 2016, a study...",['Cyber threat'],https://www.ncsc.gov.uk/report/weekly-threat-r...
207,207,13th January 2017,"[""The year of ransomware...\n...is how 2016 ha...",['Cyber threat'],https://www.ncsc.gov.uk/report/weekly-threat-r...
208,208,6th January 2017,['Vulnerabilities in travel booking systems\nS...,['Cyber threat'],https://www.ncsc.gov.uk/report/weekly-threat-r...
209,209,16th December 2016,"[""Successful take-down of DDoS for hire servic...",['Cyber threat'],https://www.ncsc.gov.uk/report/weekly-threat-r...
210,210,9th December 2016,['Infected routers vulnerable to further attac...,['Cyber threat'],https://www.ncsc.gov.uk/report/weekly-threat-r...


The last set of entries are classified purely as "Cyber threat" and therefore do not suggest any particularly interesting conclusions that we could draw from them.

## References

[Spark Documentation](https://spark.apache.org/docs/latest/ml-clustering.html)

[Further Spark Documentation (RDD)](https://spark.apache.org/docs/latest/mllib-clustering.html)

[Official Spark GitHub Repo - contains many examples](https://github.com/apache/spark)