# Topic modeling pipelines using Latent Dirichlet Allocation on emojis and on n-grams
A work in progress: the pipeline is designed to preserve symbols, words and phrases of interest (e.g., special vocabulary) in the context of WallStreetBets posts, while splitting off a separate bag of emojis. The token information is preserved with the special Unicode character ⓔ (a circled e; U+24D4). This approach has advantages and disadvantages. It is probably a useful and efficient way of preserving much of the emoji sentiment (and even the evolution of sentiment throughout a post), especially when the emojis are used as 'decorators'. It is partly a workaround for the fact that there is (apparently) no good solution native to Spark NLP for normalizing long strings of emojis.
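
As a rough illustration of the idea (not the notebook's actual Spark NLP stages), the sketch below pulls emojis into a separate bag and leaves a ⓔ marker where each run of emojis stood. The ranges in `EMOJI_RANGES`, the helper name `split_emojis`, and the choice of one marker per run are assumptions made for this example; `pipelines.emoji_ranges` is not shown here and may differ.

```python
import re

# Assumption: a small, illustrative set of emoji code-point ranges.
# The notebook's own `pipelines.emoji_ranges` is not shown and may differ.
EMOJI_RANGES = [
    "\U0001F300-\U0001F5FF",  # symbols and pictographs
    "\U0001F600-\U0001F64F",  # emoticons
    "\U0001F680-\U0001F6FF",  # transport and map symbols
    "\U0001F900-\U0001F9FF",  # supplemental symbols and pictographs
]
EMOJI_RUN = re.compile("[" + "".join(EMOJI_RANGES) + "]+")
PLACEHOLDER = "\u24d4"  # circled e, marks where a run of emojis used to be


def split_emojis(text):
    """Return (text with each emoji run replaced by a marker, list of emojis)."""
    emojis = [ch for run in EMOJI_RUN.findall(text) for ch in run]
    marked = EMOJI_RUN.sub(f" {PLACEHOLDER} ", text)
    # Collapse the extra whitespace introduced around the markers.
    return " ".join(marked.split()), emojis


marked, bag = split_emojis(
    "GME to the moon\U0001F680\U0001F680 hold\U0001F48E\U0001F64C"
)
print(marked)
print(bag)
```

The marked text can then go through ordinary tokenization, while the bag of emojis is modeled on its own.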



References: The O'Reilly Spark NLP book, page 76 and https://github.com/maobedkova/TopicModelling_PySpark_SparkNLP


To do:
- Finalize this version of the pipeline and write a summary. Do general cleanup (e.g., of imports).
- Improve the interaction of lemmatization and the final n-grams. Re-lemmatize?
- Use validation techniques for clustering, then come back and reset the number of topics.
- Some punctuation still makes it through to the n-grams because the POS model scans a partially normalized document.
- Add custom lemmatization rules, as in cant |-> can't (and give up the less common meaning of 'cant')?
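
The custom-rule item in the list above could look something like the following pure-Python sketch. The rule table `CUSTOM_LEMMAS` and the helper `apply_custom_lemmas` are hypothetical names invented here; nothing like them exists in the pipeline yet, and the rules trade away the rarer dictionary senses of words such as 'cant'.

```python
# Hypothetical rule table; not part of the notebook's pipeline.
CUSTOM_LEMMAS = {
    "cant": "can't",
    "dont": "don't",
}


def apply_custom_lemmas(tokens, rules=CUSTOM_LEMMAS):
    """Map each token through the rule table, leaving unknown tokens unchanged."""
    return [rules.get(tok.lower(), tok) for tok in tokens]


fixed = apply_custom_lemmas(["I", "cant", "sell", "and", "dont", "sell"])
print(fixed)
```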

In [1]:
%config Completer.use_jedi = False

%load_ext autoreload
%autoreload 1

import os
import sys
import pandas as pd

import sparknlp
import pyspark.sql.functions as F
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.ml.clustering import LDA
from pyspark.sql import types as T

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

data_path = "../data/reddit_wsb.csv"

spark = sparknlp.start()
sys.path.append('..')
%aimport pipelines

In [2]:
%%time
df = spark.read.csv(data_path, 
                    header=True,
                    multiLine=True, 
                    quote="\"", 
                    escape="\"")

df = df.sample(withReplacement=False, fraction=0.05, seed=1)

df = (df.withColumn("text", 
               F.concat_ws(". ", df.title, df.body))
 .drop("title", "body", "url", "comms_num", "created"))

emojis_regex = "["+"".join(pipelines.emoji_ranges)+"]"

texts = (
    # to do: find a uniform way of preprocessing using pure Spark;
    #        can only find convoluted ways in Spark NLP.
    #        Is using a UDF slower?
    df.withColumn("text_no_emojis",
                  F.regexp_replace("text",
                                   emojis_regex, " "))  # replacing with "" is bad
    .withColumn("text_no_emojis", 
                  F.regexp_replace("text_no_emojis", "[“”]", "\""))
    .withColumn("text_no_emojis", 
                F.regexp_replace("text_no_emojis", "[‘’]", "\'"))
    # to keep positions of emojis (not necessary, currently)
    .select(["text", "text_no_emojis"])
)

CPU times: user 8.6 ms, sys: 4.07 ms, total: 12.7 ms
Wall time: 4.06 s
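
The code comment in the cell above notes that replacing emojis with an empty string is bad. A minimal pure-Python sketch of why, assuming a single illustrative emoji range rather than the real `pipelines.emoji_ranges`:

```python
import re

# Assumption: one broad illustrative range; the notebook builds its
# character class from `pipelines.emoji_ranges`, which is not shown here.
emoji_class = re.compile("[\U0001F300-\U0001FAFF]")

text = "hold\U0001F48Ehands and GME\U0001F680moon"

glued = emoji_class.sub("", text)    # neighbouring words are fused together
spaced = emoji_class.sub(" ", text)  # word boundaries survive

print(glued.split())
print(spaced.split())
```

With the empty replacement, words on either side of an emoji merge into one token, which would then pollute the unigrams and n-grams downstream.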


Look first at some relevant examples.

In [17]:
text_list = [
    "I paid $5. Did you?",
    "'It's' was its own problem, wasn't it?",
    "What's-his-name wasn't selling.",
    "Don't sell GME, I say. I don't sell.",
    "He's a seller. I do not sell!",
    "Shouldn't sell. Should not sell",
    "I'm gonna sell? Should sell!",
    "I don't see why anybody should ever sell.",
    "Some say one musn't hold. Rubbish! One should hold.",
    "They're there. They've been there.",
    "Trading, good trading, and good companies"
]

empty_df = spark.createDataFrame([['']]).toDF("text")
eg_df = spark.createDataFrame(pd.DataFrame({"text": text_list, 
                                            "text_no_emojis": text_list}))

pipeline = pipelines.build_lda_pipeline()
pipeline_model = pipeline.fit(eg_df)
light_model = LightPipeline(pipeline_model)
# We compare inference time with and without using light pipeline.
# Anecdotally, we get a 10-20% speedup in wall time.
# %time processed_texts = pipeline_model.transform(texts)
%time processed_egs = light_model.transform(eg_df)
print(f"Columns: {processed_egs.columns}")
(processed_egs.select(["text", 
                      "finished_unigrams", 
                      "finished_pos_tags", 
                      "finished_ngrams"])
 .show(truncate=50))

CPU times: user 34.8 ms, sys: 11.6 ms, total: 46.3 ms
Wall time: 197 ms
+--------------------------------------------------+--------------------------------------------------+---------------------------------------------+---------------------------------+
|                                              text|                                 finished_unigrams|                            finished_pos_tags|                  finished_ngrams|
+--------------------------------------------------+--------------------------------------------------+---------------------------------------------+---------------------------------+
|                               I paid $5. Did you?|                             [i, pay, $5, do, you]|                      [NNP, VB, NN, VBP, PRP]|                        [paid $5]|
|            'It's' was its own problem, wasn't it?|     [it, have, be, it, own, problem, be, not, it]|     [PRP, VBP, VB, PRP, JJ, NN, VB, RB, PRP]|         

Now fit to WSB posts.

In [15]:
pipeline = pipelines.build_lda_pipeline()
pipeline_model = pipeline.fit(texts)
light_model = LightPipeline(pipeline_model)
%time processed_texts = light_model.transform(texts)
print(f"Processed (and counted) {df.count()} rows.")
(processed_texts.select(["text", 
                         "finished_ngrams", 
                         "finished_emojis"])
 .show(truncate=60))

CPU times: user 39.5 ms, sys: 6.2 ms, total: 45.7 ms
Wall time: 188 ms
Processed (and counted) 1326 rows.
+------------------------------------------------------------+------------------------------------------------------------+---------------+
|                                                        text|                                             finished_ngrams|finished_emojis|
+------------------------------------------------------------+------------------------------------------------------------+---------------+
|Exit the system. The CEO of NASDAQ pushed to halt trading...|[nasdaq push, halt tradi, give investor, disallowing buy,...|             []|
|                             420 wasn’t a meme. GME 🚀 🚀 🚀|                                                          []|   [🚀, 🚀, 🚀]|
|               Y'all broke it. How do we fix it? Any advice?|                                               [y'all broke]|             []|
|   They're trying to say this was all d

## Topic Modeling using meaningful n-grams

In [5]:
%%time
tf_model = (
    CountVectorizer()
    .setInputCol('finished_ngrams')
    .setOutputCol('tfs')
    .fit(processed_texts)
)
lda_feats = tf_model.transform(processed_texts)

idf_model = (
    IDF()
    .setInputCol('tfs')
    .setOutputCol('idfs')
    .fit(lda_feats)
)
lda_feats = idf_model.transform(lda_feats).select(["tfs", "idfs"])

lda = (
    LDA()
    .setFeaturesCol('idfs')
    .setK(5)
    .setMaxIter(5)
)

lda_model = lda.fit(lda_feats)

CPU times: user 37.1 ms, sys: 9.41 ms, total: 46.5 ms
Wall time: 1min 6s
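
To make the feature step more concrete, here is a small pure-Python sketch of what the `tfs` and `idfs` columns hold. The toy n-gram bags are invented for illustration; the IDF formula is the one documented for Spark ML, log((m + 1) / (df + 1)), where m is the number of documents.

```python
import math
from collections import Counter

# Hypothetical n-gram bags standing in for the `finished_ngrams` column.
docs = [
    ["hedge fund", "short seller", "hedge fund"],
    ["short seller", "gme share"],
    ["hedge fund", "retail investor"],
]

# Term frequencies per document (what CountVectorizer stores in `tfs`).
tfs = [Counter(doc) for doc in docs]

# Document frequency: in how many documents each n-gram appears.
df = Counter(term for doc in docs for term in set(doc))

# Spark ML's documented IDF: log((m + 1) / (df + 1)), m = number of documents.
m = len(docs)
idf = {term: math.log((m + 1) / (count + 1)) for term, count in df.items()}

# TF-IDF weights per document (what ends up in `idfs`).
tfidf = [{term: tf * idf[term] for term, tf in doc_tf.items()} for doc_tf in tfs]

for row in tfidf:
    print({term: round(weight, 3) for term, weight in row.items()})
```

N-grams that appear in fewer documents get a larger weight, so rare phrases count for more in the vectors passed to LDA.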


In [6]:
vocab = tf_model.vocabulary
def get_words(token_list):
    return [vocab[token_id] for token_id in token_list]
udf_to_words = F.udf(get_words, T.ArrayType(T.StringType()))

In [7]:
(lda_model
 .describeTopics()
 .withColumn('topic_words', udf_to_words(F.col('termIndices')))
 .select(["topic", "topic_words"])
 .show(truncate=160))

+-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|topic|                                                                                                                                                     topic_words|
+-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|    0|                [gme fall, market manipulation, would be, could be, don't know, limit match, $110k today tomorrow, $gme ride, emergency fund, gamestop. go fuck]|
|    1|                   [pop, go, high performance, worth contact, short seller, real time, house rep, net revenue, creator peripheral, moon insert emojis, old lady]|
|    2|[short position, stock account, do not need, don't have, gme share, don't sell, doge buy doge buy doge buy doge buy doge buy doge, gme go, instituti

## For fun: Topic Modeling using Latent Dirichlet Allocation on Emojis Only

In [8]:
%%time
tf_model = (
    CountVectorizer()
    .setInputCol('finished_emojis')
    .setOutputCol('tfs')
    .fit(processed_texts)
)
lda_feats = tf_model.transform(processed_texts)

idf_model = (
    IDF()
    .setInputCol('tfs')
    .setOutputCol('idfs')
    .fit(lda_feats)
)
lda_feats = idf_model.transform(lda_feats).select(["tfs", "idfs"])

lda = (
    LDA()
    .setFeaturesCol('idfs')
    .setK(5)
    .setMaxIter(5)
)

lda_model = lda.fit(lda_feats)

CPU times: user 26.4 ms, sys: 4.61 ms, total: 31 ms
Wall time: 3.18 s


In [9]:
vocab = tf_model.vocabulary
def get_words(token_list):
    return [vocab[token_id] for token_id in token_list]
udf_to_words = F.udf(get_words, T.ArrayType(T.StringType()))

In [10]:
(lda_model
 .describeTopics()
 .withColumn('topic_words', udf_to_words(F.col('termIndices')))
 .select(["topic", "topic_words"])
 .show(truncate=80))

+-----+----------------------------------------+
|topic|                             topic_words|
+-----+----------------------------------------+
|    0|  [👎, 🥥, 🤑, ️, 📈, 🌴, 🤷, 🚀, 😂, ‍]|
|    1|[💎, 🐻, 🌈, 👐, 🚀, 🚨, 🤲, 🤔, 🙌, 🌝]|
|    2|[🚀, 😌, 🪐, 🤲, 🌘, 🌈, 🐻, 🌕, 🍌, 🙏]|
|    3|[🙌, 😔, 🦍, 💎, 🎥, 🍿, 🤙, 🍌, 🪐, 🙏]|
|    4| [🌑, 🤡, 🍌, 🐸, 🚀, 🌕, 🪐, 🦍, ️, 🎥]|
+-----+----------------------------------------+



For fun: can you match the topics here with the topics extracted using the emojis? 
Note: nothing in the method guarantees that this will be possible.

    +-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |topic|                                                                                                                                                     topic_words|
    +-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |    0|                [gme fall, market manipulation, would be, could be, don't know, limit match, $110k today tomorrow, $gme ride, emergency fund, gamestop. go fuck]|
    |    1|                   [pop, go, high performance, worth contact, short seller, real time, house rep, net revenue, creator peripheral, moon insert emojis, old lady]|
    |    2|[short position, stock account, do not need, don't have, gme share, don't sell, doge buy doge buy doge buy doge buy doge buy doge, gme go, institutional inve...|
    |    3|            [market share, taibbi savage, wall street, short interest, high performance, retail brokerage, former hedge, will shoot, current downward, might be]|
    |    4|                          [hedge fund, will be, retail investor, daily average trade, would have, last quarter, daily trade, trade growth, last week, can't buy]|
    +-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
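
One way to attempt the matching less by eye is to compare which posts each model assigns to which topic. The sketch below is pure Python with made-up assignments; `ngram_topic` and `emoji_topic` are hypothetical lists of dominant topics for the same posts, which in practice would have to be derived from each LDA model's per-document topic distribution.

```python
from collections import Counter

# Hypothetical dominant-topic assignments for the same eight posts,
# one list per model (invented values, for illustration only).
ngram_topic = [0, 0, 1, 1, 2, 2, 2, 0]
emoji_topic = [3, 3, 1, 4, 0, 0, 0, 3]

# Count how often each (n-gram topic, emoji topic) pair labels the same post.
pairs = Counter(zip(ngram_topic, emoji_topic))

# For each n-gram topic, keep the emoji topic it co-occurs with most often.
# Ties keep the first candidate in sorted order.
best_match = {}
for (n_topic, e_topic), count in sorted(pairs.items()):
    if count > best_match.get(n_topic, (None, 0))[1]:
        best_match[n_topic] = (e_topic, count)

print(best_match)
```

In this toy data, n-gram topic 1 splits evenly between two emoji topics, which is the kind of ambiguity the note above warns about. Posts with an empty emoji bag would probably need to be left out of such a comparison.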

