# Apache Spark (PySpark) Text Clustering - LDA

## Notebook Setup

The following cells load the required libraries and set up the environment.

In [1]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.ml.clustering import LDA
#import pyspark.sql.functions as F
#from pyspark.ml.feature import

In [2]:
spark = SparkSession.builder.master('local[*]').config("spark.driver.memory", "24g").appName('spark').getOrCreate()
sc = spark.sparkContext
print(f'pyspark version: {sc.version}')

22/11/13 23:37:46 WARN Utils: Your hostname, cheetah resolves to a loopback address: 127.0.1.1; using 192.168.1.66 instead (on interface eno1)
22/11/13 23:37:46 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/11/13 23:37:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
pyspark version: 3.3.0


## Load the Prepared App Reviews Dataset

The data was already prepared in the [apache-spark text prep notebook](../../text_prep/apache-spark/text_prep_pyspark_linux.ipynb).

In [3]:
mobile_app_reviews_df = spark.read.parquet("../../data/cleaned/appsearch_reviews_clean.txt")

Show the first few rows to confirm the data was read correctly.

In [4]:
mobile_app_reviews_df.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|              app_id|                text|              tokens|             vectors|                 idf|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|com.lg.crazy.taxi...| Catalat catlat p...|[, catalat, catla...|(262144,[5,11,231...|(262144,[5,11,231...|
|com.lg.crazy.taxi...|Okay I like it bu...|[okay, i, like, i...|(262144,[1,3,14,2...|(262144,[1,3,14,2...|
|com.lg.crazy.taxi...|Love Its amazing ...|[love, its, amazi...|(262144,[3,21,22,...|(262144,[3,21,22,...|
|com.lg.crazy.taxi...| Adhd Circle of life|[adhd, circle, of...|(262144,[13,371,2...|(262144,[13,371,2...|
|com.lg.crazy.taxi...|        Awsome Great|     [awsome, great]|(262144,[17,423],...|(262144,[17,423],...|
|com.lg.crazy.taxi...| Its kind of fun ...|[, its, kind, of,...|(262144,[1,3,11,1...|(262144,[1,3,11,1...|
|com.lg.crazy.taxi...|     Awesome Aw

## Run the LDA model

In [5]:
%%time
lda = LDA(k=6, seed=1, optimizer="em", maxIter=10, featuresCol="idf")

lda_model = lda.fit(mobile_app_reviews_df)



CPU times: user 95.1 ms, sys: 19.5 ms, total: 115 ms
Wall time: 4min 45s

## Summary Information About the Model

Here we look at each topic's top-weighted terms. If the topics look problematic (for example, if several topics are dominated by the same common words), we can re-process the data to remove additional stop words and refit the model.
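One rough way to spot the kind of issue mentioned above is to measure how much the top terms of different topics overlap. The sketch below is plain Python and not part of the original notebook. The helper names `jaccard` and `topic_overlaps` and the sample indices are hypothetical, and it assumes the `describeTopics` result has first been collected into a dictionary of topic id to term indices.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections of term indices."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    # Two empty collections are treated as identical.
    return len(set_a & set_b) / len(union) if union else 1.0


def topic_overlaps(term_indices_by_topic):
    """Return {(topic_i, topic_j): similarity} for every pair of topics."""
    return {
        (i, j): jaccard(term_indices_by_topic[i], term_indices_by_topic[j])
        for i, j in combinations(sorted(term_indices_by_topic), 2)
    }


# Hypothetical sample input. In the notebook, a dictionary like this could be
# built from a describeTopics result with something along the lines of:
#   {row["topic"]: row["termIndices"] for row in topics.collect()}
example = {
    0: [0, 1, 2, 3, 4],
    1: [0, 1, 2, 3, 4],
    2: [0, 1, 2, 5, 9],
}

overlaps = topic_overlaps(example)
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

A score close to 1.0 for most pairs suggests the topics share nearly the same top terms, which is one sign that very common words may be dominating the model and that further stop word removal could be worth trying.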

In [6]:
topics = lda_model.describeTopics(5)
topics.show(truncate=False)

+-----+---------------+---------------------------------------------------------------------------------------------------------------+
|topic|termIndices    |termWeights                                                                                                    |
+-----+---------------+---------------------------------------------------------------------------------------------------------------+
|0    |[0, 1, 2, 3, 4]|[0.009289626977969494, 0.007965217406059433, 0.007870085390111577, 0.007183354118510596, 0.006949654704042732] |
|1    |[0, 1, 2, 3, 4]|[0.009268184801376806, 0.008013803147922329, 0.007870600674189373, 0.007203990668316692, 0.006960621698365707] |
|2    |[0, 1, 2, 3, 4]|[0.009284798522312085, 0.007988603935662137, 0.007870173574687575, 0.007160821343449149, 0.006967605974319328] |
|3    |[0, 1, 2, 3, 4]|[0.009293736749169081, 0.007973858257217002, 0.007866823522359003, 0.007138938329049696, 0.0069616364060373715]|
|4    |[0, 1, 2, 3, 4]|[0.009255196034142392, 0.