# Cluster Analysis and Topic Modelling Using LDA

## Task
Cluster the posts using LDA (Latent Dirichlet Allocation)

## Data
* Take the same data that was used with KMeans (posts on Facebook pages), but keep only the cluster that corresponds to English pages

## Notes
* Use LDA instead of KMeans
* You may want to experiment with the number of topics and the size of the vocabulary (the default vocabulary size of CountVectorizer is 262144)
* You may want to do some more preprocessing of the text
 * for instance remove punctuation
 * or add some more words to the list provided to the StopWordsRemover


## About LDA
* for more details about LDA see <a target="_blank" href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">wiki</a>
* The LDA model assumes that each document (a post message in our case) is composed of some topics (the number of topics has to be specified as an input parameter)
* Each of these topics can be characterized by a set of words (below we provide a udf get_words that lets you see the words for each topic)
* For each document you will get a topic distribution (a probability or weight for each topic in the document)
* The most probable topic in the document can be interpreted as its cluster (below we provide a udf get_cluster that gives you the index of the most probable topic)
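
The last two points can be tried out without Spark. The sketch below takes the topic distribution that this notebook reports further down for the first post and picks the most probable topic. The helper name `most_probable_topic` is hypothetical and only used for illustration; the notebook itself uses the udf get_cluster for this.

```python
# Topic distribution for one post (values copied from the notebook output).
topic_distribution = [
    0.027369827058390375, 0.8366525262922494, 0.02695157921452368,
    0.02794141614002113, 0.02712488455425168, 0.027365641061518092,
    0.026594125679045563,
]


def most_probable_topic(distribution):
    """Return the index of the largest weight (hypothetical helper)."""
    return max(range(len(distribution)), key=lambda i: distribution[i])


cluster = most_probable_topic(topic_distribution)

# The weights of a single document add up to (approximately) one.
total = sum(topic_distribution)
```

Here the second entry dominates, so the post would be assigned to the topic with index 1.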

## Documentation
<br>
* Pyspark documentation of DataFrame API is <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html">here</a>

* Pyspark documentation of ML Pipelines library is <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html">here</a>

* Presentation slides are available <a target="_blank" href = "https://docs.google.com/presentation/d/1XNKIfE5Atj_Mzse0wjmbwLecmVs2YkWm9cqOLqDVWPo/edit?usp=sharing">here</a>

### Import functions and modules

In [4]:
# udf is imported explicitly because the helper functions below use it as a decorator
from pyspark.sql.functions import col, count, desc, array_contains, split, explode, regexp_replace, lit, udf

from pyspark.sql.types import ArrayType, StringType

from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF, Normalizer, CountVectorizer

from pyspark.ml.clustering import LDA

from pyspark.ml import Pipeline


import numpy as np

### Load Data

hint
* here we will use the dataset that you saved in the previous notebook so copy the table_name and use it here

In [6]:
# take the generated name from the previous notebook:
table_name = 'fuinstsmrlmpfysimfvk'

data = spark.table(table_name)

### Explore the data

hint
* see how many records you have

In [8]:
data.count()

In [9]:
display(data)

page_id,message
143171562377465,"The British High Commission Bridgetown continues its support of various developmental projects across the region, through the provision of grant funding to launch these initiatives. Most recently, the British High Commission in conjunction with Criminal Justice Advisor Sirah Abraham, has provided funding to support a series of Case Building Workshops in Antigua, Grenada, St. Lucia and St. Vincent. The workshops were held under the theme: ""Strengthening the Rule of Law Through a Conspiracy."""
295699560235,KIOTI Tractor is honoring all Heroes with our Heroes Reward! Purchase a KIOTI tractor or UTV and receive up to a $150 discount. Find out more at
263906084882,"With 7 individual seats, the all-new 5008 SUV can be configured in a way to perfectly meet your needs."
151478384936070,"Superintendent Lynn Goodall has risen through the ranks in her 22 years with us. Over the years she has batted off remarks that she could not build her policing career whilst raising a family. Her advice to any parent wanting to join the police whilst raising a family is you can do both, go and give it a go. We offer our police officers flexible working and various types of leave to achieve a healthy work-balance.  Our officers FitTheBill - could you? Apply here: www.essex.police.uk/fitthebill"
313001204840,ISM Staff wants to thank Parent Connection and all parents who have donated their wonderful goodies to the Staff Holiday Tea yesterday. A big thanks also to the students for your kind words! We really appreciate all you do for our school.
146314478800368,Here are some incredible deals for you. We are so happy to share these deals. doula birtharts herbalist aromatherapist
128603890541167,"Still trying to fit in a cool class for winter quarter? Check out Counterstory, taught by SSP own Jonah Willihnganz! Counterstory is a method developed in critical legal studies that emerges out of the broad “narrative turn” in the humanities and social science. This course explores the value of this turn, especially for marginalized communities, and the use of counterstory as analysis, critique, and self-expression. Using an interdisciplinary approach, we examine counterstory as it has developed in critical theory, critical pedagogy, and critical race theory literatures, and explore it as a framework for liberation, cultural work, and spiritual exploration."
8301814001,"The Computer Science program at Stern College for Women - Yeshiva University stresses both the practical and theoretical aspects of computing, preparing students for employment in various fields of computer science and to pursue advanced studies. Learn more at: www.yu.edu/stern/ug/computer-science"
79010917739,"ATVriders just picked up the New Polaris Industries ACE 150 from Don's in York, PA as we prepare to head to the BITD UTV World Championship next weekend, and Faith Foley will be racing for the Gonzalez Team in GPS built Polaris ACE 150"
227903283902393,Where do I begin... It’s been 2 hours that my son opened his BB8 droid he got for Xmas. While trying to charge it out of nowhere the antenna broke! How is this possible? How can they manufacture an expensive product with such low quality antenna? It was just touched!


### Remove punctuation

hint
* it is reasonable to preprocess the data a bit more; one of the steps is removing the punctuation
* you can use <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.regexp_replace">regexp_replace</a> function of DF API
* you may try this (or a similar) regular expression: "[(.|?|,|:|;|!|>|<)]" (inside the square brackets the characters | ( ) are matched literally, so they are replaced as well)

In [11]:
reg = "[(.|?|,|:|;|!|>|<)]"

pages = data.withColumn('message', regexp_replace('message', reg, ' '))
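
A quick way to check what the pattern really matches is to try it on a small string. The sketch below uses Python's `re` module as a stand-in for the regex engine used by Spark; for a simple character class like this one the two should behave the same, but that is an assumption of the sketch. The `explicit` pattern and the sample string are made up for illustration.

```python
import re

original = r"[(.|?|,|:|;|!|>|<)]"  # pattern used in the notebook
explicit = r"[().|?,:;!<>]"         # the same set of characters, each listed once

sample = "Hello, world! (a|b) <tag>"

# Every matched character is replaced by a single space.
cleaned_original = re.sub(original, " ", sample)
cleaned_explicit = re.sub(explicit, " ", sample)
```

Both patterns give the same result, which shows that the `|` separators inside the brackets are not alternation: they are just one more character that gets replaced.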

### See how many distinct words you have in your documents

hint
* use functions <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.split">split</a> and <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode">explode</a> on the message field
* select the exploded message field and call <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.distinct">distinct</a> on it (or use <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates">dropDuplicates</a> equivalently)
* count the number of rows

In [13]:
(
  pages
  .withColumn('words', split('message', ' '))
  .select(explode('words').alias('word'))
  .distinct()
  .count()
)
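
Note that the punctuation was replaced by spaces, so a message can contain several spaces in a row. Splitting on a single space can then produce empty strings, which end up in the count as if they were a word. The plain Python sketch below mimics the split, explode and distinct steps on two made-up messages; it is only an illustration of the effect, not the Spark implementation.

```python
messages = [
    "Here are some incredible deals for you  We are so happy",
    "We offer our police officers flexible working",
]

# Split on a single space, flatten ("explode"), then de-duplicate ("distinct").
words = {word for message in messages for word in message.split(" ")}
distinct_count = len(words)

# Two consecutive spaces yield an empty string, which is counted as a "word".
has_empty_token = "" in words
distinct_nonempty = len(words - {""})
```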

### Construct the pipeline

hint
* build a vector representation of the texts
 * use:
 * <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.Tokenizer">Tokenizer</a> 
 * <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.StopWordsRemover">StopWordsRemover</a> 
 * <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.CountVectorizer">CountVectorizer</a>
 * <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.IDF">IDF</a> 
 * <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.Normalizer">Normalizer</a> 
 * <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.LDA">LDA</a>
* you will have to choose the number of topics for the LDA
* see slides 83, 84, 85 and 101 in the presentation

Notes
* with KMeans we used HashingTF to compute the term frequencies as input for IDF
* here we use CountVectorizer instead, because it keeps a vocabulary of the actual words, so we can see later on how the topics are described
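
The difference between the two approaches can be sketched in plain Python. This is a simplified illustration under stated assumptions, not how Spark implements either class: the tie-breaking rule for equally frequent words, the tiny `num_buckets` value and the use of CRC32 as the hash are all choices made for this sketch (Spark's HashingTF uses its own hash function).

```python
from collections import Counter
import zlib

docs = [["christmas", "team", "christmas"], ["team", "tractor"]]

# CountVectorizer-style: an explicit vocabulary ordered by corpus frequency.
counts = Counter(token for doc in docs for token in doc)
vocabulary = [
    token for token, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
]
index_of = {token: i for i, token in enumerate(vocabulary)}
count_vector = [docs[0].count(token) for token in vocabulary]

# Hashing-style: the index is computed from the token and nothing is stored,
# so an index cannot be turned back into a word.
num_buckets = 8  # illustrative only; real feature spaces are much larger


def bucket(token):
    return zlib.crc32(token.encode("utf-8")) % num_buckets
```

With the vocabulary at hand, `vocabulary[i]` gives back the word behind index `i`, which is what the udf get_words relies on later in this notebook.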

In [15]:
tokenizer = Tokenizer(inputCol='message', outputCol='words')

stopWordsRemover = StopWordsRemover(inputCol='words', outputCol='noStopWords')

countVectorizer = CountVectorizer(vocabSize=1000, inputCol='noStopWords', outputCol='tf', minDF=1)

idf = IDF(inputCol='tf', outputCol='idf')

normalizer = Normalizer(inputCol='idf', outputCol='features')

lda = LDA(k=7, maxIter=10)

pipeline = Pipeline(stages=[tokenizer, stopWordsRemover, countVectorizer, idf, normalizer, lda])

model = pipeline.fit(pages)

### Apply the model on the data

hint
* just call transform, since the model is a transformer
* pass the training data as argument to the transform function

In [17]:
predictions = model.transform(pages)

### See the result of LDA

hint
* select message and topicDistribution to see the probabilities for each topic in a given document

In [19]:
display(
  predictions
  .select('message', 'topicDistribution')
)

message,topicDistribution
"The British High Commission Bridgetown continues its support of various developmental projects across the region through the provision of grant funding to launch these initiatives Most recently the British High Commission in conjunction with Criminal Justice Advisor Sirah Abraham has provided funding to support a series of Case Building Workshops in Antigua Grenada St Lucia and St Vincent The workshops were held under the theme ""Strengthening the Rule of Law Through a Conspiracy ""","List(1, 7, List(), List(0.027369827058390375, 0.8366525262922494, 0.02695157921452368, 0.02794141614002113, 0.02712488455425168, 0.027365641061518092, 0.026594125679045563))"
KIOTI Tractor is honoring all Heroes with our Heroes Reward Purchase a KIOTI tractor or UTV and receive up to a $150 discount Find out more at,"List(1, 7, List(), List(0.716627158480905, 0.05344773858865361, 0.045570170040631554, 0.04729734443113987, 0.045846468070969364, 0.04627551029965621, 0.044935610088044343))"
With 7 individual seats the all-new 5008 SUV can be configured in a way to perfectly meet your needs,"List(1, 7, List(), List(0.043303692576722785, 0.7416868823688527, 0.04262903520803348, 0.04410500130086344, 0.04290358256686272, 0.04330086049127149, 0.042070945487393555))"
Superintendent Lynn Goodall has risen through the ranks in her 22 years with us Over the years she has batted off remarks that she could not build her policing career whilst raising a family Her advice to any parent wanting to join the police whilst raising a family is you can do both go and give it a go We offer our police officers flexible working and various types of leave to achieve a healthy work-balance Our officers FitTheBill - could you Apply here www essex police uk/fitthebill,"List(1, 7, List(), List(0.028143941211445277, 0.03236846168704487, 0.02768266906697969, 0.8285039731584999, 0.027867241641429636, 0.028109961731960054, 0.027323751502640617))"
ISM Staff wants to thank Parent Connection and all parents who have donated their wonderful goodies to the Staff Holiday Tea yesterday A big thanks also to the students for your kind words We really appreciate all you do for our school,"List(1, 7, List(), List(0.030848964212103487, 0.8160020936905878, 0.030427513487705237, 0.031420427427143384, 0.030552764565314444, 0.030801233998722206, 0.029947002618423448))"
Here are some incredible deals for you We are so happy to share these deals doula birtharts herbalist aromatherapist,"List(1, 7, List(), List(0.04614310451665776, 0.449202108604958, 0.045373409481414564, 0.3226589743961554, 0.045703617930476476, 0.046070481538553545, 0.044848303531784144))"
Still trying to fit in a cool class for winter quarter Check out Counterstory taught by SSP own Jonah Willihnganz Counterstory is a method developed in critical legal studies that emerges out of the broad “narrative turn” in the humanities and social science This course explores the value of this turn especially for marginalized communities and the use of counterstory as analysis critique and self-expression Using an interdisciplinary approach we examine counterstory as it has developed in critical theory critical pedagogy and critical race theory literatures and explore it as a framework for liberation cultural work and spiritual exploration,"List(1, 7, List(), List(0.031000638177963447, 0.8150245090496492, 0.030524887904462228, 0.031606745117413675, 0.03071904285923535, 0.03099540894763968, 0.030128767943636416))"
The Computer Science program at Stern College for Women - Yeshiva University stresses both the practical and theoretical aspects of computing preparing students for employment in various fields of computer science and to pursue advanced studies Learn more at www yu edu/stern/ug/computer-science,"List(1, 7, List(), List(0.034610854841997205, 0.7936710950498518, 0.0340378958056873, 0.03526818359282498, 0.0342833438945689, 0.03454901064212442, 0.03357961617294546))"
ATVriders just picked up the New Polaris Industries ACE 150 from Don's in York PA as we prepare to head to the BITD UTV World Championship next weekend and Faith Foley will be racing for the Gonzalez Team in GPS built Polaris ACE 150,"List(1, 7, List(), List(0.7858744937018848, 0.04036666122334379, 0.034445072743059746, 0.035706391634021534, 0.034674318785524814, 0.034946299242674894, 0.033986762669490495))"
Where do I begin It’s been 2 hours that my son opened his BB8 droid he got for Xmas While trying to charge it out of nowhere the antenna broke How is this possible How can they manufacture an expensive product with such low quality antenna It was just touched,"List(1, 7, List(), List(0.03453046577443737, 0.7940196222233846, 0.034006586510796105, 0.035161215062364616, 0.03420206192719399, 0.034553307066023844, 0.033526741435799506))"


### Helper functions (udfs)

In [21]:
# Some useful UDFs that will help you to do the next tasks

# vocabulary your model is using:
vocab = model.stages[2].vocabulary

# udf to extract the words for the topics
@udf(ArrayType(StringType()))
def get_words(termIndices):
  return [vocab[idx] for idx in termIndices]


# udf to determine the main topic for the document
@udf('integer')
def get_cluster(vec):
  return int(np.argmax(vec))


# udf to get the probability of a given topic in the document
@udf('double')
def get_topic_probability(vec, topic):
  return float(vec[topic])

### Describe topics

hint
* each topic is characterized by a set of words
* use <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.LDAModel.describeTopics">describeTopics()</a> method of the LDA model to get the indices of the words in your vocabulary (model.stages[n].describeTopics(), here n is the index of LDA in your pipeline)
* use the udf get_words to see the actual words

In [23]:
display(
  model.stages[5].describeTopics()
  .withColumn('x', get_words(col('termIndices')))
)

topic,termIndices,termWeights,x
0,"List(27, 15, 12, 1, 320, 135, 874, 508, 152, 193)","List(0.005453356058659963, 0.0042269902356285634, 0.0036853657082672013, 0.003444398562306881, 0.0032383174473739305, 0.003074810944381931, 0.0030279180926769595, 0.002937833729368876, 0.002870150240277002, 0.0028231425558278953)","List(christmas, team, 2017, -, christmascountdown, pro, blood, utv, thanks, enjoy)"
1,"List(1, 3, 4, 59, 9, 6, 0, 5, 8, 2)","List(0.005662129465032932, 0.005228293573671817, 0.004813394602762254, 0.004679283103524866, 0.00439873416104851, 0.0042533979659292575, 0.0042128525699349165, 0.004181779541745846, 0.004020431923736725, 0.003907672586342381)","List(-, new, us, strike, needs, time, , &, day, various)"
2,"List(5, 340, 36, 266, 138, 473, 655, 667, 23, 209)","List(0.003392887386753695, 0.0026773266941220782, 0.0026510183392695996, 0.0025318251737436633, 0.002520698712889777, 0.0024613045786776055, 0.002315796725513754, 0.0023050214502113122, 0.002266826810589063, 0.0021572903591331764)","List(&, narrative, first, usa, video, kitchen, marie, curie, great, facebook)"
3,"List(10, 32, 224, 16, 201, 26, 144, 487, 618, 619)","List(0.005218511502387996, 0.005197038119681677, 0.005117998222880958, 0.004883485874745343, 0.004673550748233838, 0.004510653716246233, 0.003336160901347958, 0.0032437784206264374, 0.0031851806325761964, 0.002986084639904546)","List(get, com, tix, www, apolloecigs, available, ️, @cleanbuilds, flavor, photography)"
4,"List(19, 30, 45, 49, 306, 375, 468, 205, 682, 189)","List(0.013129112630804903, 0.010164132097702724, 0.009584721658693077, 0.008107210999757516, 0.0065774161495065955, 0.005802063540945015, 0.005335126440350322, 0.005232924723779083, 0.005043584102684862, 0.0048957845736303275)","List(na, se, v, je, jsem, dekuji, den, z, jak, si)"
5,"List(24, 125, 169, 126, 102, 104, 82, 157, 199, 186)","List(0.011121290884194692, 0.006885406083673143, 0.006594279732711318, 0.006440372055763428, 0.006030412830609391, 0.005868611723557708, 0.005642332359525896, 0.005458475947091516, 0.005233523319489919, 0.004807454121308286)","List(de, à, en, la, compte, et, le, un, pour, les)"
6,"List(87, 652, 246, 597, 208, 314, 366, 257, 52, 41)","List(0.0025620841667263606, 0.0024847065199286996, 0.002282005005287848, 0.0022297794455796173, 0.00219331687412895, 0.0021733770062443303, 0.002016835124109379, 0.0018936282649489491, 0.0018621400687176407, 0.0018183150204850232)","List(5, we’re, 12, winners, doula, enter, unique, giving, work, 1)"
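
The mapping done by get_words can be reproduced by hand for one row. The sketch below uses a partial vocabulary reconstructed from the row for topic 1 in the output above (only the ten indices that appear in that row are known). It also shows that the list contains tokens that carry little meaning: the empty string at index 0, "-" and "&". The set of tokens treated as uninformative is an assumption of this sketch.

```python
# Partial vocabulary reconstructed from the describeTopics() output for topic 1.
partial_vocab = {
    0: "", 1: "-", 2: "various", 3: "new", 4: "us",
    5: "&", 6: "time", 8: "day", 9: "needs", 59: "strike",
}

term_indices = [1, 3, 4, 59, 9, 6, 0, 5, 8, 2]  # termIndices of topic 1

# Same lookup as in get_words, applied outside Spark.
words = [partial_vocab[idx] for idx in term_indices]

# Tokens that could be filtered out before vectorizing (assumed list).
uninformative = [w for w in words if w in {"", "-", "&"}]
```

Since low indices belong to the most frequent words, the empty string at index 0 is probably the most frequent token in the corpus. Removing such tokens in preprocessing, for instance through the StopWordsRemover list, may give topics that are easier to read.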


### Find the most likely topic for each document

hint
* add a new column named 'cluster' using the udf get_cluster to get the most likely topic for each post
* as the argument for the udf use the column topicDistribution, which is the result of LDA. This column contains a vector with the probabilities of each topic in the post
* you can now groupBy this new column and count how many posts are in each cluster

In [25]:
display(
  predictions
  .select('page_id', 'topicDistribution', 'message')
  .withColumn('cluster', get_cluster('topicDistribution'))
  .groupBy('cluster')
  .count()
)

cluster,count
1,2478
6,2
3,442
5,323
4,244
2,94
0,171
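
The counts are easier to judge as shares of the whole. A short sketch with the numbers copied from the output above:

```python
# Cluster sizes copied from the groupBy output above.
cluster_counts = {1: 2478, 6: 2, 3: 442, 5: 323, 4: 244, 2: 94, 0: 171}

total_posts = sum(cluster_counts.values())
shares = {c: n / total_posts for c, n in cluster_counts.items()}
largest = max(cluster_counts, key=cluster_counts.get)
```

About two thirds of the posts fall into cluster 1 and cluster 6 holds only two posts, so it may be worth trying a different number of topics or more preprocessing.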


### Order the documents by probability of a specific topic

hint
* choose a topic index (for example 0)
* add a new column called 'topicProbability' and extract into it the probability of your selected topic
 * these probabilities are in the column topicDistribution
 * to extract the probability you can use the udf get_topic_probability implemented above. Just pass in the column topicDistribution and the index of your selected topic (you have to use the lit function for the topic index, for example: lit(0))
* order the DataFrame in descending order by this new column topicProbability

In [27]:
display(
  predictions
  .select('page_id', 'topicDistribution', 'message')
  .withColumn('topicProbability', get_topic_probability(col('topicDistribution'), lit(0)))
  .orderBy(desc('topicProbability'))
)

page_id,topicDistribution,message,topicProbability
22018596340,"List(1, 7, List(), List(0.8441248692005097, 0.029333180929942738, 0.0250994886359909, 0.025986600648593396, 0.02524990539558327, 0.025453055815231166, 0.024752899374148866))",We have arrested a man following reports of criminal damage incidents in Popley overnight on Saturday 23 September It was reported that as many as 40 cars had their tyres slashed on various streets including Bermuda Close John Hunt Drive Malta Close Montserrat Place Pershore Road and Timor Close A 27-year-old man has been arrested in connection with this investigation and remains in custody at this time Officers would still like to hear from any witnesses or anyone who would like to report a similar incident that has not done so already Anyone with information should call 101 quoting 44170370120 or contact the charity Crimestoppers anonymously on 0800 555 111,0.8441248692005097
490087721012880,"List(1, 7, List(), List(0.8424951642573122, 0.029675380532121116, 0.02534301220446537, 0.02626531075041647, 0.025496203397497953, 0.025722843654643116, 0.025002085203543706))",Dear newcomers dear expats We all know that starting a life in France or even just speaking French can be quite challenging so let’s love and learn it together SCOLINGUA a team of young and dynamic teachers offers highly personalized playful and efficient French lessons Give yourself the opportunity to deal happily and easily with your French environment join us and enjoy it Scolingua EIRL 06 70 35 02 68 www scolingua com Facebook page Scolingua,0.8424951642573122
104279176284807,"List(1, 7, List(), List(0.8418110901278281, 0.02971636473173872, 0.025454384134097553, 0.02634409952112083, 0.025633586225000515, 0.025926075061103938, 0.025114400199110296))",7 décembre 2017 - Saviez-vous que le Père Noël était gourmand et qu’il faisait souvent appel à nos équipes du Room Service - December 7th 2017 - Did you know that Santa Claus is a foodie and often calls our Room Service team for a snack Christmascountdown  Lancelot Drone Prod LeBristolParis Travellermade,0.8418110901278281
490087721012880,"List(1, 7, List(), List(0.8380667357243676, 0.030502252381440918, 0.02605123546965196, 0.027007982089293902, 0.026216071632149156, 0.02644742883236918, 0.02570829387072719))",Dear expats We know that starting a new life in France or just speaking French is quite challenging so let’s love and learn it together SCOLINGUA a team of young and dynamic teachers offers you and your family highly personalized playful and efficient French lessons Give yourself the opportunity to deal happily and easily with your French environment join us and enjoy it Scolingua EIRL 06 70 35 02 68 www scolingua com,0.8380667357243676
104279176284807,"List(1, 7, List(), List(0.8379469062659648, 0.030334322793523254, 0.026078324073041777, 0.02699282411648116, 0.026254504409288868, 0.02666461674748647, 0.025728501594213863))",13 décembre 2017 - Pas de Noël sans chocolat pour notre équipe pâtisserie - December 13th 2017 - The festive season without chocolate No way for our pastry team Christmascountdown Lancelot Drone Prod LeBristolParis Travellermade,0.8379469062659648
104279176284807,"List(1, 7, List(), List(0.8375804861074007, 0.030433786596265947, 0.02613482708166197, 0.027052246521000807, 0.026316772119617943, 0.026692663174207626, 0.025789218399844962))",6 décembre 2017 - La mission de nos fruitiers Garder l’équipe du Père Noël en forme - December 6th 2017 - Our fruitiers keep Santa’s team healthy Christmascountdown  Lancelot Drone Prod,0.8375804861074007
104279176284807,"List(1, 7, List(), List(0.8369222207008922, 0.030615446330185047, 0.026244550457076393, 0.027167909366404824, 0.026424870007437606, 0.02672877045898762, 0.025896232679016275))",8 décembre 2017 - Nos bagagistes s’entraînent pour la distribution des cadeaux le Jour-J - December 8th 2017 - Our bellmen are training to help Santa’s reindeer deliver the gifts on Christmas Day Christmascountdown  Lancelot Drone Prod LeBristolParis Travellermade,0.8369222207008922
364631603583133,"List(1, 7, List(), List(0.8365529802534368, 0.030726395984918744, 0.026318798039692018, 0.027246852513216253, 0.026483658222275507, 0.02670062113031929, 0.025970693856141223))",T-3 days to Christmas Eve and we’re getting so close we can almost hear those reindeer bells The next team members in our countdown are the wonderful Claire and Petra from our Housekeeping Team What do Claire and Petra love most about Christmas Claire loves spending quality time with her family over the Christmas season and Petra loves all the Christmas movies especially Home Alone TeamSouthampton ChristmasCountdown,0.8365529802534368
129765987186706,"List(1, 7, List(), List(0.8358555333299101, 0.030877943129771223, 0.026409451528252078, 0.027380205590009893, 0.026582873839787562, 0.026808242363139643, 0.0260857502191294))","Alexandre Ziegler who has held several prominent positions with the government of France is now the French Ambassador to India A man with a variety of interests he lives in Delhi which he calls home though he has travelled extensively across India During one of his many visits to Mumbai a city he describes as ""vibrant and creative"" Mr Ziegler talks to BT about the common passions shared by the French and Indians like art cinema politics and of course food",0.8358555333299101
104279176284807,"List(1, 7, List(), List(0.8350350349871655, 0.030890610069402743, 0.026549111312269273, 0.027481914239843797, 0.026740409281424668, 0.027101359165583084, 0.02620156094431069))",4 décembre 2017 - Les préparatifs de Noël ont commencé dans les cuisines du 114 Faubourg - December 4th 2017 - Looks like the 114 Faubourg Team is already in the Christmas spirit ChristmasCountdown  Lancelot Drone Prod,0.8350350349871655
