# Cluster Analysis and Topic Modelling Using LDA

## Task
Cluster the posts using LDA (Latent Dirichlet Allocation)

## Data
* Take the same data that was used with KMeans (posts on Facebook pages), but keep only the cluster that corresponds to English pages

## Notes
* Use LDA instead of KMeans
* You may want to experiment with the number of topics and the size of the vocabulary (the default vocabSize of CountVectorizer is 262144)
* You may want to do some more preprocessing of the text
 * for instance, remove punctuation
 * or add more words to the list provided to the StopWordsRemover


## About LDA
* for more details about LDA see <a target="_blank" href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">wiki</a>
* The LDA model assumes that each document (a post message in our case) is composed of some topics (the number of topics has to be specified as an input parameter)
* Each of these topics can be characterized by a set of words (below we provide a udf get_words that lets you see the words for each topic)
* For each document you will get a topic distribution (a probability or weight for each topic in the document)
* The most probable topic in the document can be interpreted as its cluster (below we provide a udf get_cluster that gives you the index of the most probable topic)
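
To make the last two points concrete, here is a minimal sketch with made-up numbers (the distributions below are invented for illustration and are not output of the model). It uses NumPy's argmax, the same call the get_cluster udf relies on.

```python
from collections import Counter

import numpy as np

# Hypothetical topic distributions for three posts and k=3 topics.
# Each row sums to 1; the values are invented for illustration.
topic_distributions = np.array([
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.60, 0.30, 0.10],
])

# The most probable topic of each post plays the role of its cluster.
clusters = [int(np.argmax(row)) for row in topic_distributions]
print(clusters)  # [0, 2, 0]

# Number of posts per cluster.
cluster_sizes = Counter(clusters)
print(cluster_sizes[0], cluster_sizes[2])  # 2 1
```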

## Documentation
<br>
* Pyspark documentation of DataFrame API is <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html">here</a>

* Pyspark documentation of ML Pipelines library is <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html">here</a>

* Presentation slides can be accessed <a target="_blank" href = "https://docs.google.com/presentation/d/1XNKIfE5Atj_Mzse0wjmbwLecmVs2YkWm9cqOLqDVWPo/edit?usp=sharing">here</a>

### Import functions and modules

In [4]:
# udf is imported explicitly because the helper functions below use it as a decorator
from pyspark.sql.functions import col, count, desc, array_contains, split, explode, regexp_replace, lit, udf

from pyspark.sql.types import ArrayType, StringType

from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF, Normalizer, CountVectorizer

from pyspark.ml.clustering import LDA

from pyspark.ml import Pipeline


import numpy as np

### Load Data

hint
* here we will use the dataset that you saved in the previous notebook, so copy its table_name and use it here

In [6]:
# take the generated name from the previous notebook:
table_name = 'muutodfmuwfcmpxjfvwy'

data = spark.table(table_name)

### Explore the data

hint
* see how many records you have

In [8]:
data.count()

In [9]:
display(data)

### Remove punctuation

hint
* it seems reasonable to do some more preprocessing on the data; one of the steps is removing punctuation
* you can use the <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.regexp_replace">regexp_replace</a> function of the DF API
* you may try this (or a similar) regular expression: "[(.|?|,|:|;|!|>|<)]" (inside the square brackets the characters |, ( and ) are literals, so they are replaced as well)

In [11]:
reg = "[(.|?|,|:|;|!|>|<)]"

pages = data.withColumn('message', regexp_replace('message', reg, ' '))
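
A quick way to see what the pattern matches is Python's re module. This is only a sketch: Spark's regexp_replace uses Java regular expressions, and the assumption here is that both engines treat a simple character class like this one the same way. The sample string is invented.

```python
import re

# The same pattern as in the cell above.
reg = "[(.|?|,|:|;|!|>|<)]"

# Hypothetical sample post, invented for illustration.
sample = "Hello, world! (a|b) <tag>: done."

cleaned = re.sub(reg, " ", sample)
print(cleaned.split())  # ['Hello', 'world', 'a', 'b', 'tag', 'done']

# An equivalent character class without the repeated '|' characters.
simpler = "[(.|?,:;!><)]"
print(re.sub(simpler, " ", sample) == cleaned)  # True
```

Since every matched character is replaced by a space, the cleaned text can contain runs of several spaces.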

### See how many distinct words you have in your documents

hint
* use functions <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.split">split</a> and <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode">explode</a> on the message field
* select the exploded message field and call <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.distinct">distinct</a> on it (or use <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates">dropDuplicates</a> equivalently)
* count the number of rows

In [13]:
(
  pages
  .withColumn('words', split('message', ' '))
  .select(explode('words').alias('word'))
  .distinct()
  .count()
)
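
Note that the result is the number of distinct tokens, and that splitting on a single space produces an empty-string token wherever two spaces are adjacent (which happens once punctuation has been replaced by spaces). The plain-Python sketch below mimics the split, explode and distinct steps on invented messages; it is an illustration, not Spark code. In Spark, passing a pattern such as '\\s+' to split is one possible way to reduce the empty tokens.

```python
import re

# Hypothetical messages after punctuation was replaced by spaces.
messages = ["great news  today", "news  for today "]

# Split on a single space, then "explode" into one flat list of tokens.
tokens = [tok for msg in messages for tok in msg.split(" ")]

# distinct + count: the empty string is counted as a token too.
distinct_tokens = set(tokens)
print(len(distinct_tokens))  # 5: 'great', 'news', 'today', 'for' and ''

# Split on runs of whitespace and drop empty tokens.
clean_tokens = {tok for msg in messages for tok in re.split(r"\s+", msg) if tok}
print(len(clean_tokens))  # 4
```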

### Construct the pipeline

hint
* do vector representation for the texts
 * use: 
 * <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.Tokenizer">Tokenizer</a> 
 * <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.StopWordsRemover">StopWordsRemover</a> 
 * <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.CountVectorizer">CountVectorizer</a>
 * <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.IDF">IDF</a> 
 * <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.Normalizer">Normalizer</a> 
 * <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.LDA">LDA</a>
* you will have to choose the number of topics for the LDA
* see slides 83, 84, 85, and 101 in the presentation

Notes
* with KMeans we used HashingTF to compute the term frequency as input for IDF
* here we are using CountVectorizer, so we can work with actual words and see later how the topics are described
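
The difference can be sketched in plain Python. This is an illustration under simplifying assumptions, not how Spark implements either transformer: zlib.crc32 stands in for the hash function, the tie-breaking rule is arbitrary, and the helper names build_vocabulary and count_vector are hypothetical.

```python
import zlib
from collections import Counter

# Hypothetical tokenized posts, invented for illustration.
docs = [["spark", "lda", "topic"], ["spark", "cluster"], ["topic", "spark"]]


def build_vocabulary(documents, vocab_size):
    """Keep the vocab_size most frequent terms across all documents."""
    counts = Counter(tok for doc in documents for tok in doc)
    # Sort by descending frequency, then alphabetically to break ties.
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    return ranked[:vocab_size]


def count_vector(doc, vocabulary):
    """Term counts of one document, indexed by position in the vocabulary."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]


vocab = build_vocabulary(docs, vocab_size=3)
print(vocab)                         # ['spark', 'topic', 'cluster']
print(count_vector(docs[0], vocab))  # [1, 1, 0]

# With a vocabulary, an index maps back to a word.
print(vocab[1])  # topic

# With hashing, a word maps to a bucket, but a bucket cannot be mapped back.
num_buckets = 8
bucket = zlib.crc32(b"topic") % num_buckets
print(0 <= bucket < num_buckets)  # True
```

The vocab_size argument plays the role of vocabSize: terms outside the most frequent ones (here 'lda') are dropped from the vectors.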

In [15]:
tokenizer = Tokenizer(inputCol='message', outputCol='words')

stopWordsRemover = StopWordsRemover(inputCol='words', outputCol='noStopWords')

countVectorizer = CountVectorizer(vocabSize=1000, inputCol='noStopWords', outputCol='tf', minDF=1)

idf = IDF(inputCol='tf', outputCol='idf')

normalizer = Normalizer(inputCol='idf', outputCol='features')

lda = LDA(k=7, maxIter=10)

pipeline = Pipeline(stages=[tokenizer, stopWordsRemover, countVectorizer, idf, normalizer, lda])

model = pipeline.fit(pages)
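
What the IDF and Normalizer stages do to the term counts can be sketched with NumPy. The sketch assumes the smoothed formula idf = log((m + 1) / (df + 1)) described in the Spark MLlib feature extraction guide (m is the number of documents, df the number of documents containing the term) and the default p = 2 norm of Normalizer. The count matrix is invented.

```python
import numpy as np

# Hypothetical term-count matrix: 3 documents (rows) x 3 vocabulary terms (columns).
tf = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 2.0],
    [1.0, 1.0, 0.0],
])

m = tf.shape[0]                    # number of documents
df = np.count_nonzero(tf, axis=0)  # number of documents containing each term
idf = np.log((m + 1) / (df + 1))   # smoothed inverse document frequency

tfidf = tf * idf                   # rescale the counts term by term

# L2-normalize each row; rows that are all zero stay zero.
norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
features = np.divide(tfidf, norms, out=np.zeros_like(tfidf), where=norms > 0)

print(idf[0])  # 0.0, a term that occurs in every document carries no weight
```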

### Apply the model on the data

hint
* just call transform, since the fitted model is a transformer
* pass the training data as the argument to the transform function

In [17]:
predictions = model.transform(pages)

### See the result of LDA

hint
* select name, message, topicDistribution to see the probabilities for each topic in a given document

In [19]:
display(
  predictions
  .select('message', 'topicDistribution')
)

### Helper functions (udfs)

In [21]:
# Some useful UDFs that will help you to do the next tasks

# vocabulary your model is using:
vocab = model.stages[2].vocabulary

# udf to extract the words for the topics
@udf(ArrayType(StringType()))
def get_words(termIndices):
  return [vocab[idx] for idx in termIndices]


# udf to determine the main topic for the document
@udf('integer')
def get_cluster(vec):
  return int(np.argmax(vec))


# udf to get the probability of a given topic in the document
@udf('double')
def get_topic_probability(vec, topic):
  return float(vec[topic])

### Describe topics

hint
* each topic is characterized by a set of words
* use <a target="_blank" href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.LDAModel.describeTopics">describeTopics()</a> method of the LDA model to get the indices of the words in your vocabulary (model.stages[n].describeTopics(), here n is the index of LDA in your pipeline)
* use the udf get_words to see the actual words

In [23]:
display(
  model.stages[5].describeTopics()
  .withColumn('words', get_words(col('termIndices')))
)
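
describeTopics() returns, for each topic, the columns termIndices and termWeights. The lookup that get_words performs is shown below in plain Python, with an invented vocabulary and one invented row.

```python
# Hypothetical vocabulary and one hypothetical describeTopics() row.
vocab = ["spark", "topic", "cluster", "lda"]
row = {"topic": 0, "termIndices": [3, 1, 0], "termWeights": [0.4, 0.35, 0.25]}

# The same lookup as in get_words: term index -> word.
words = [vocab[idx] for idx in row["termIndices"]]
print(words)  # ['lda', 'topic', 'spark']

# Pair each word with its weight in the topic.
weighted_words = list(zip(words, row["termWeights"]))
print(weighted_words[0])  # ('lda', 0.4)
```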

### Find the most likely topic for each document

hint
* add a new column named 'cluster' using the udf get_cluster to get the most likely topic for each post
* as the argument for the udf use the column topicDistribution, which is the result of LDA. This column contains a vector with the probabilities for each topic in the post
* you can now groupBy this new column and count how many posts are in each cluster

In [25]:
display(
  predictions
  .select('page_id', 'topicDistribution', 'message')
  .withColumn('cluster', get_cluster('topicDistribution'))
  .groupBy('cluster')
  .count()
)

### Order the documents by the probability of a specific topic

hint
* choose a topic index (for example 0)
* add a new column called 'topicProbability' and extract into it the probability of your selected topic
 * these probabilities are in the column topicDistribution
 * to extract the probability you can use the udf get_topic_probability implemented above. Just pass in the column topicDistribution and the index of your selected topic (you have to use the lit function for the topic index, for example: lit(0))
* order the DataFrame in descending order by this new column topicProbability

In [27]:
display(
  predictions
  .select('page_id', 'topicDistribution', 'message')
  .withColumn('topicProbability', get_topic_probability(col('topicDistribution'), lit(0)))
  .orderBy(desc('topicProbability'))
)
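
The same extract-and-sort logic can be written in plain Python on invented data, which may help when checking the Spark result by hand. The posts and their distributions below are hypothetical.

```python
# Hypothetical posts with invented topic distributions (k=3 topics).
posts = [
    {"message": "post a", "topicDistribution": [0.2, 0.5, 0.3]},
    {"message": "post b", "topicDistribution": [0.7, 0.1, 0.2]},
    {"message": "post c", "topicDistribution": [0.4, 0.4, 0.2]},
]

topic = 0  # index of the selected topic

# Extract the probability of the selected topic, as get_topic_probability does.
for post in posts:
    post["topicProbability"] = float(post["topicDistribution"][topic])

# Descending order by the new field.
ranked = sorted(posts, key=lambda post: post["topicProbability"], reverse=True)
print([post["message"] for post in ranked])  # ['post b', 'post c', 'post a']
```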