# Topic modeling pipelines using Latent Dirichlet Allocation on emojis and on n-grams
This pipeline is designed to preserve symbols, words and phrases of interest (e.g., special vocabulary) in the context of WallStreetBets posts, while splitting off a separate bag of emojis. We use neural part-of-speech tagging to generate meaningful and relevant n-grams, then do additional normalization for dimensionality reduction. Our approach to handling emojis has advantages and disadvantages. It is probably a useful and efficient way of preserving much of the emoji sentiment (and even the evolution of sentiment throughout a post), especially when the emojis are used more decoratively.
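
To make the idea of "splitting off a separate bag of emojis" concrete, here is a minimal pure-Python sketch. It is not the code in our `pipelines` module: the function name `split_emojis` and the Unicode ranges in `EMOJI_PATTERN` are assumptions chosen for illustration, and the ranges are deliberately coarse.

```python
import re

# Assumption: a coarse set of Unicode ranges that covers most pictographic emojis.
# It is not a complete emoji definition (no special handling of ZWJ sequences,
# flags, or skin-tone modifiers).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]"
)

def split_emojis(text):
    """Return (text_without_emojis, emojis_in_order) for one post."""
    emojis = EMOJI_PATTERN.findall(text)
    stripped = EMOJI_PATTERN.sub(" ", text)
    # Collapse the whitespace left behind by the removed emojis.
    stripped = re.sub(r"\s+", " ", stripped).strip()
    return stripped, emojis

text_no_emojis, emoji_bag = split_emojis("I paid a steep $5🚀🚀🚀")
print(text_no_emojis)
print(emoji_bag)
```

Keeping the emojis in their order of appearance (rather than as a set) is what would let a later step look at how sentiment evolves through a post.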

In our pipeline_development notebook, we tested the performance of our pipeline against a spaCy pipeline and saw a very substantial improvement. TODO: write specifics here. Roughly, a much simpler preprocessing pipeline in spaCy took about 6 minutes, while ours took maybe 30 seconds (on my laptop).

What's more, using Spark NLP's LightPipeline class, we get a 10-20% speedup in inference.

References: The O'Reilly Spark NLP book, page 76 and https://github.com/maobedkova/TopicModelling_PySpark_SparkNLP

TODO:
- Come back and set topics after clustering, using some appropriate validation technique.
- Handle "process_img" in normalization.

In [1]:
%config Completer.use_jedi = False

import os
import sys
import pandas as pd

import sparknlp
import pyspark.sql.functions as F
from pyspark.sql import types as T
from sparknlp.base import LightPipeline
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.ml.clustering import LDA

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
sys.path.append('..')

data_path = "../data/reddit_wsb.csv"

spark = sparknlp.start()

%load_ext autoreload
%autoreload 1
%aimport pipelines

In [2]:
%%time
df = spark.read.csv(data_path, 
                    header=True,
                    multiLine=True, 
                    quote="\"", 
                    escape="\"")

df = df.sample(withReplacement=False, fraction=0.1, seed=1)

df = (df.withColumn("text", 
               F.concat_ws(". ", df.title, df.body))
      .drop("title", "body", "url", "comms_num", "created"))

CPU times: user 6.94 ms, sys: 468 µs, total: 7.4 ms
Wall time: 3.98 s


## Quick illustration of text processing with examples

In [2]:
text_list = [
    "Shouldn't sell 💎 🙌 should not sell",
    "I paid a steep $5🚀🚀🚀",
    "What's-his-name wasn't selling.",
    "Don't sell GME, I say. I don't sell.",
    "He's a seller. I do not sell!",
    "I'm gonna sell? Should sell!",
    "I don't see why anybody should ever sell.",
    "They're there. They've been there.",
    "Trading, it's good trading",
    "'It's' was its own problem, wasn't it?",
]

empty_df = spark.createDataFrame([['']]).toDF("text")
eg_df = spark.createDataFrame(pd.DataFrame({"text": text_list, 
                                            "text_no_emojis": text_list}))

pipeline = pipelines.build_lda_preproc_pipeline()
pipeline_model = pipeline.fit(eg_df)
processed_egs = pipeline_model.transform(eg_df)
processed_egs = pipelines.lda_preproc_finisher(processed_egs)

In [3]:
processed_egs.toPandas()

| | text | finished_ngrams | finished_emojis |
|---|---|---|---|
| 0 | I'm gonna sell? Should sell! | [should_sell] | [] |
| 1 | I paid a steep $5🚀🚀🚀 | [steep_$5] | [🚀, 🚀, 🚀] |
| 2 | 'It's' was its own problem, wasn't it? | [own_problem] | [] |
| 3 | Shouldn't sell 💎 🙌 should not sell | [should_not_sell, should_not_sell] | [💎, 🙌] |
| 4 | Don't sell GME, I say. I don't sell. | [do_not_sell_gme] | [] |
| 5 | I don't see why anybody should ever sell. | [should_ever_sell] | [] |
| 6 | Trading, it's good trading | [good_trade] | [] |
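
The output above shows "Shouldn't sell" and "should not sell" both ending up as `should_not_sell`. As a rough illustration of that kind of normalization, here is a small pure-Python sketch. It is only a stand-in: the real pipeline relies on Spark NLP annotators, and the `IRREGULAR` table and both function names below are hypothetical.

```python
import re

# Hypothetical, hand-picked irregular forms; a real pipeline would need more.
IRREGULAR = {"won't": "will not", "can't": "can not", "shan't": "shall not"}

def expand_negations(text):
    """Lowercase the text and rewrite "n't" contractions as "<verb> not"."""
    text = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    for contraction, expansion in IRREGULAR.items():
        text = text.replace(contraction, expansion)
    # Regular case: "shouldn't" -> "should not", "don't" -> "do not".
    return re.sub(r"(\w+)n't\b", r"\1 not", text)

def join_ngram(tokens):
    """Join the tokens of one n-gram into a single vocabulary entry."""
    return "_".join(tokens)

tokens = expand_negations("Shouldn't sell").split()
ngram = join_ngram(tokens)
print(ngram)
```

Mapping both spellings to one token is part of the dimensionality reduction: the vocabulary that LDA sees gets smaller without losing the negation.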


## Now fit to WallStreetBets posts

In [3]:
texts = pipelines.preprocess_texts(df)

pipeline = pipelines.build_lda_preproc_pipeline()
pipeline_model = pipeline.fit(texts)
light_model = LightPipeline(pipeline_model)
%time processed_texts = light_model.transform(texts)
%time processed_texts = pipelines.lda_preproc_finisher(processed_texts)
print(f"Processed (and counted) {df.count()} rows.")

CPU times: user 39.2 ms, sys: 2.52 ms, total: 41.7 ms
Wall time: 310 ms
CPU times: user 19.2 ms, sys: 3.22 ms, total: 22.4 ms
Wall time: 250 ms
Processed (and counted) 2624 rows.


In [4]:
processed_texts.toPandas()

| | text | finished_ngrams | finished_emojis |
|---|---|---|---|
| 0 | Exit the system. The CEO of NASDAQ pushed to h... | [will_change, may_have, will_look, should_have... | [] |
| 1 | Currently Holding AMC and NOK - Is it retarded... | [should_move, gme_today] | [] |
| 2 | Y'all broke it. How do we fix it? Any advice? | [fix_it] | [] |
| 3 | Are we ready to attack the Citadel !!!!. https... | [citadel] | [] |
| 4 | My brokerage wants to force close my GME calls... | [gme_calls, big_risk, unusual_situation, situa... | [] |
| ... | ... | ... | ... |
| 2025 | Glad I can't trade after hours when I've had a... | [can_not_trade] | [😮] |
| 2026 | Just to clear up the SEC probe misunderstandin... | [world_series, sec_probe_misunderstandings, kn... | [] |
| 2027 | I am your typical retard from Germany. Hold st... | [typical_retard, diamond_hands] | [🚀, 🚀] |
| 2028 | Holding strong here in th UK | [th_uk] | [] |


## Topic Modeling using meaningful n-grams

In [5]:
%%time
tf_model = (
    CountVectorizer()
    .setInputCol('finished_ngrams')
    .setOutputCol('tfs')
    .fit(processed_texts)
)
lda_feats = tf_model.transform(processed_texts)

idf_model = (
    IDF()
    .setInputCol('tfs')
    .setOutputCol('idfs')
    .fit(lda_feats)
)
lda_feats = idf_model.transform(lda_feats).select(["tfs", "idfs"])

lda = (
    LDA()
    .setFeaturesCol('idfs')
    .setK(5)
    .setMaxIter(5)
)

lda_model = lda.fit(lda_feats)

CPU times: user 39.1 ms, sys: 2.21 ms, total: 41.3 ms
Wall time: 1min 25s
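
The cell above passes IDF-weighted counts, not raw counts, to LDA. To show roughly what those feature values look like, here is a pure-Python sketch on a made-up three-document corpus. It assumes raw counts as term frequencies and the smoothed formula documented for Spark MLlib's `IDF`, log((m + 1) / (df + 1)) with m documents. The function name `tf_idf` and the toy data are ours, and the sketch ignores vocabulary-size and minimum-frequency settings.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (assumed to match Spark MLlib's IDF).
    idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}
    # Weight = raw count in the document times the term's IDF.
    return [{term: count * idf[term] for term, count in Counter(doc).items()}
            for doc in docs]

# Made-up corpus for illustration only.
docs = [
    ["short_squeeze", "buy_gme"],
    ["buy_gme"],
    ["buy_gme", "diamond_hands", "diamond_hands"],
]
weights = tf_idf(docs)
for doc_weights in weights:
    print(doc_weights)
```

With this formula, an n-gram that appears in every document gets weight zero, so it cannot contribute to any topic.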


In [6]:
vocab = tf_model.vocabulary
def get_words(token_list):
    return [vocab[token_id] for token_id in token_list]
udf_to_words = F.udf(get_words, T.ArrayType(T.StringType()))
(lda_model
 .describeTopics()
 .withColumn('topic_words', udf_to_words(F.col('termIndices')))
 .select(["topic", "topic_words"])
 .show(truncate=160))

+-----+-----------------------------------------------------------------------------------------------------------------------------------------+
|topic|                                                                                                                              topic_words|
+-----+-----------------------------------------------------------------------------------------------------------------------------------------+
|    0|           [should_be, short_squeeze, stock_market, can_buy, end_game, financial_advice, may_have, short_market, can_afford, moass_crash]|
|    1|[market_manipulation, will_be, free_market, next_week, retail_investors, robin_hood, fuck_robinhood, ballot_box, might_be, single_person]|
|    2|              [wall_street, will_not_be, halt_trade, free_money, will_not_sell, short_sell, may_be, would_like, will_see, short_interest]|
|    3|   [last_week, short_interest, buy_gme, quick_setup, x200b_process_img, bloomberg_terminal, can_say, can_get, diamond

## For fun: Topic Modeling using Latent Dirichlet Allocation on emojis only

In [4]:
%%time
tf_model = (
    CountVectorizer()
    .setInputCol('finished_emojis')
    .setOutputCol('tfs')
    .fit(processed_texts)
)
lda_feats = tf_model.transform(processed_texts)

idf_model = (
    IDF()
    .setInputCol('tfs')
    .setOutputCol('idfs')
    .fit(lda_feats)
)
lda_feats = idf_model.transform(lda_feats).select(["tfs", "idfs"])

lda = (
    LDA()
    .setFeaturesCol('idfs')
    .setK(5)
    .setMaxIter(5)
)

lda_model = lda.fit(lda_feats)

CPU times: user 47.3 ms, sys: 1.95 ms, total: 49.2 ms
Wall time: 1min 44s


In [6]:
vocab = tf_model.vocabulary
def get_words(token_list):
    return [vocab[token_id] for token_id in token_list]
udf_to_words = F.udf(get_words, T.ArrayType(T.StringType()))
(lda_model
 .describeTopics()
 .withColumn('topic_words', udf_to_words(F.col('termIndices')))
 .select(["topic", "topic_words"])
 .show(truncate=80))

+-----+----------------------------------------+
|topic|                             topic_words|
+-----+----------------------------------------+
|    0|[🪐, 🤝, 🥥, 🎲, 😢, 😭, 💎, 📈, 🌴, 🙌]|
|    1|[💎, 🤚, 🤑, 🙌, 🚀, 💍, 🤘, 🦧, 👉, 🌝]|
|    2|[👎, 🌕, 🌙, 😂, 🌚, 🐸, 😴, 🚀, 👍, 🌗]|
|    3|[🚀, 💎, 🙌, 🙏, 👐, 🚨, 🦍, 🤡, 🌈, 🐻]|
|    4| [📈, ️, 🤲, 👾, 💈, 🤝, 🎲, 🚀, 🪐, 💎]|
+-----+----------------------------------------+



For fun: can you match the emoji topics here with the topics extracted using the n-grams?
Note: nothing in the method guarantees that this will be possible.
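
One simple way to try the matching is to compare, post by post, which topic each model considers most probable. The sketch below assumes those dominant-topic labels have already been extracted into two plain Python lists. The function `match_topics` and the labels are hypothetical, and the result is only a heuristic (ties go to the pair seen first).

```python
from collections import Counter

def match_topics(emoji_topics, ngram_topics):
    """Map each emoji topic to the n-gram topic it most often co-occurs with."""
    # Count how often each (emoji topic, n-gram topic) pair labels the same post.
    pair_counts = Counter(zip(emoji_topics, ngram_topics))
    best = {}
    for (e_topic, n_topic), count in pair_counts.items():
        if e_topic not in best or count > best[e_topic][1]:
            best[e_topic] = (n_topic, count)
    return {e_topic: n_topic for e_topic, (n_topic, _) in best.items()}

# Made-up dominant-topic labels for six posts, one label per post and per model.
emoji_topics = [0, 0, 1, 1, 1, 2]
ngram_topics = [3, 3, 4, 4, 0, 0]
matches = match_topics(emoji_topics, ngram_topics)
print(matches)
```

If the co-occurrence counts are spread evenly across pairs, that is a sign the two sets of topics do not correspond, which the method allows.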