# Topic modeling pipelines using Latent Dirichlet Allocation on emojis and on n-grams
This pipeline is designed to preserve symbols, words, and phrases of interest (e.g., special vocabulary) in the context of WallStreetBets posts, while splitting emojis off into a separate bag. We use neural part-of-speech tagging to generate meaningful and relevant n-grams, then apply additional normalization for dimensionality reduction. Our approach to handling emojis has advantages and disadvantages. It is probably a useful and efficient way of preserving much of the emoji sentiment (and even the evolution of sentiment throughout a post), especially when the emojis are used more decoratively.
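
As a rough illustration of the "separate bag of emojis" idea, here is a minimal pure-Python sketch. The function name `split_emojis` and the character ranges in `EMOJI_PATTERN` are assumptions made for this example; the matcher actually used inside the `pipelines` module may differ.

```python
import re

# Assumed, deliberately rough emoji ranges (hypothetical, not taken from `pipelines`).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]"
)


def split_emojis(text):
    """Return (text without emojis, list of emojis in order of appearance)."""
    emojis = EMOJI_PATTERN.findall(text)
    # Replace each emoji with a space, then collapse runs of whitespace.
    text_no_emojis = " ".join(EMOJI_PATTERN.sub(" ", text).split())
    return text_no_emojis, emojis


print(split_emojis("I paid a steep $5🚀🚀🚀"))
```

Keeping the emojis in order of appearance is what would let a downstream step look at how sentiment evolves through a post.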

In our pipeline_development notebook, we benchmarked this pipeline against a spaCy pipeline and saw a very substantial speedup: a much simpler preprocessing pipeline in spaCy took about 6 minutes, while ours took roughly 30 seconds (on a laptop). TODO: record the exact timings here.

What's more, using Spark NLP's LightPipeline class, we get a 10-20% speedup in inference.

References: The O'Reilly Spark NLP book, page 76 and https://github.com/maobedkova/TopicModelling_PySpark_SparkNLP

TODO:
- Come back and set topics after clustering, using an appropriate validation technique.
- Refine the emoji matcher regex to allow certain multi-character strings (but no repetitions).
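
For the second TODO item, one possible direction is sketched below. The emoji topic output later in this notebook appears to contain stray joiner and variation-selector characters as separate tokens, which suggests that multi-character emojis are currently split apart. The pattern and the name `find_emoji_sequences` are hypothetical, and the ranges are rough assumptions, not the regex used in `pipelines`.

```python
import re

# Assumed building blocks (rough ranges, hypothetical names).
BASE = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"   # a single emoji code point
MODIFIER = "[\uFE0F\U0001F3FB-\U0001F3FF]"      # variation selector-16, skin tones
ZWJ = "\u200D"                                  # zero-width joiner

# One base emoji, optional modifiers, then any number of ZWJ-joined emojis.
# Plain repetitions such as three rockets in a row stay three separate matches.
EMOJI_SEQUENCE = re.compile(f"{BASE}{MODIFIER}*(?:{ZWJ}{BASE}{MODIFIER}*)*")


def find_emoji_sequences(text):
    """Return emoji tokens, keeping ZWJ and modifier sequences together."""
    return EMOJI_SEQUENCE.findall(text)


# A ZWJ sequence (man + joiner + laptop) followed by two rockets.
print(find_emoji_sequences("\U0001F468\u200D\U0001F4BB \U0001F680\U0001F680"))
```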

In [1]:
%config Completer.use_jedi = False

import os
import sys
import pandas as pd

import sparknlp
import pyspark.sql.functions as F
from pyspark.sql import types as T
from sparknlp.base import LightPipeline
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.ml.clustering import LDA

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
sys.path.append('..')

data_path = "../data/reddit_wsb.csv"

spark = sparknlp.start()

%load_ext autoreload
%autoreload 1
%aimport pipelines

In [2]:
%%time
df = spark.read.csv(data_path, 
                    header=True,
                    multiLine=True, 
                    quote="\"", 
                    escape="\"")

df = df.sample(withReplacement=False, fraction=0.2, seed=1)

df = (df.withColumn("text", 
               F.concat_ws(". ", df.title, df.body))
      .drop("title", "body", "url", "comms_num", "created"))

CPU times: user 7.6 ms, sys: 499 µs, total: 8.1 ms
Wall time: 3.55 s


## Quick illustration of text processing with examples

In [2]:
text_list = [
    "Shouldn't sell 💎 🙌 should not sell",
    "I paid a steep $5🚀🚀🚀",
    "What's-his-name wasn't selling.",
    "Don't sell GME, I say. I don't sell.",
    "He's a seller. I do not sell!",
    "I'm gonna sell? Should sell!",
    "I don't see why anybody should ever sell.",
    "They're there. They've been there.",
    "Trading, it's good trading",
    "'It's' was its own problem, wasn't it?",
]

# Small example DataFrame; both columns start out as the raw text.
eg_df = spark.createDataFrame(pd.DataFrame({"text": text_list, 
                                            "text_no_emojis": text_list}))

# Fit the preprocessing pipeline on the examples, then run the finisher
# to get the finished_ngrams and finished_emojis columns.
pipeline = pipelines.build_lda_preproc_pipeline()
pipeline_model = pipeline.fit(eg_df)
processed_egs = pipeline_model.transform(eg_df)
processed_egs = pipelines.lda_preproc_finisher(processed_egs)

In [3]:
processed_egs.show(truncate=40)

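+-----------------------------------------+----------------------------------+---------------+
|                                     text|                   finished_ngrams|finished_emojis|
+-----------------------------------------+----------------------------------+---------------+
|             I'm gonna sell? Should sell!|                     [should_sell]|             []|
|                     I paid a steep $5🚀🚀🚀|                        [steep_$5]|      [🚀, 🚀, 🚀]|
|   'It's' was its own problem, wasn't it?|                     [own_problem]|             []|
|       Shouldn't sell 💎 🙌 should not sell|[should_not_sell, should_not_sell]|         [💎, 🙌]|
|     Don't sell GME, I say. I don't sell.|                 [do_not_sell_gme]|             []|
|I don't see why anybody should ever sell.|                [should_ever_sell]|             []|
|               Trading, it's good trading|                      [good_trade]|             []|
+-----------------------------------------+----------------------------------+---------------+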


## Now fit to WallStreetBets posts

In [3]:
texts = pipelines.preprocess_texts(df)
pipeline = pipelines.build_lda_preproc_pipeline()
pipeline_model = pipeline.fit(texts)
light_model = LightPipeline(pipeline_model)
def process_texts():
    processed_texts = light_model.transform(texts)
    processed_texts = pipelines.lda_preproc_finisher(processed_texts)
    return processed_texts
%time processed_texts = process_texts()
print(f"Processed (and counted) {df.count()} rows.")

CPU times: user 55.4 ms, sys: 10.3 ms, total: 65.6 ms
Wall time: 546 ms
Processed (and counted) 5182 rows.


In [4]:
processed_texts.show(truncate=40)

+----------------------------------------+----------------------------------------+----------------------------------------+
|                                    text|                         finished_ngrams|                         finished_emojis|
+----------------------------------------+----------------------------------------+----------------------------------------+
|Exit the system. The CEO of NASDAQ pu...|[will_change, may_have, will_look, sh...|                                      []|
|SHORT STOCK DOESN'T HAVE AN EXPIRATIO...|[next_week, may_be, false_expectation...|                                      []|
|Currently Holding AMC and NOK - Is it...|                [should_move, gme_today]|                                      []|
|We need to stick together and 💎🖐 th...|[fellow_poors, rise_up, ah_manipulati...|                                    [💎]|
|Patcher and other media outlets calli...|                          [ponzi_scheme]|                                      []|
|I'

## Topic Modeling using meaningful n-grams

In [4]:
%%time
tf_model = (
    CountVectorizer()
    .setInputCol('finished_ngrams')
    .setOutputCol('tfs')
    .fit(processed_texts)
)
lda_feats = tf_model.transform(processed_texts)

idf_model = (
    IDF()
    .setInputCol('tfs')
    .setOutputCol('idfs')
    .fit(lda_feats)
)
lda_feats = idf_model.transform(lda_feats).select(["tfs", "idfs"])

lda = (
    LDA()
    .setFeaturesCol('idfs')
    .setK(5)
    .setMaxIter(5)
)

lda_model = lda.fit(lda_feats)
print(f"Processed (and counted) {df.count()} rows.")

Processed (and counted) 5182 rows.
CPU times: user 45.8 ms, sys: 9.39 ms, total: 55.2 ms
Wall time: 3min 2s


In [8]:
vocab = tf_model.vocabulary

def get_words(token_list):
    # Map vocabulary indices back to their n-gram strings.
    return [vocab[token_id] for token_id in token_list]

udf_to_words = F.udf(get_words, T.ArrayType(T.StringType()))
(lda_model
 .describeTopics()
 .withColumn('topic_words', udf_to_words(F.col('termIndices')))
 .select(["topic", "topic_words"])
 .show(truncate=120))

+-----+------------------------------------------------------------------------------------------------------------------------+
|topic|                                                                                                             topic_words|
+-----+------------------------------------------------------------------------------------------------------------------------+
|    0|[can_not_buy, td_ameritrade, will_be, fuck_robinhood, short_sell, right_now, can_do, robinhood_customer, will_have, l...|
|    1|       [buy_gme, next_week, hold_hold, can_get, investor_day, last_year, should_be, wall_street, will_not_let, buy_$gme]|
|    2|[process_img, gme_today, melvin_capital, short_interest, bb_nok, will_win, last_week, buy_more, will_not_allow, bloom...|
|    3|[private_boomer, trade_republic, market_cap, first_time, will_be, huge_profits, buy_buy, positions_hf, ark_invest, cr...|
|    4|[would_be, short_squeeze, wall_street, gamma_squeeze, will_buy, market_manipulation, can_s
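
The TODO at the top mentions validating the topics. One common option, sketched here in plain Python, is the UMass coherence score, which rewards topics whose top words tend to co-occur in the same documents. This is an assumption about what "appropriate validation" could look like, not the method used in this notebook. The name `umass_coherence` is hypothetical, `docs` stands in for the `finished_ngrams` lists collected from `processed_texts`, and `topic_words` for one row of the table above, ordered from most to least probable.

```python
import math


def umass_coherence(topic_words, docs):
    """UMass coherence for one topic; values closer to 0 are more coherent."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(topic_words)):
        for l in range(m):
            denom = doc_freq(topic_words[l])
            if denom == 0:
                continue  # word never observed; skip rather than divide by zero
            co_occurrences = doc_freq(topic_words[m], topic_words[l])
            score += math.log((co_occurrences + 1) / denom)
    return score


# Toy documents standing in for lists of n-grams.
toy_docs = [["buy_gme", "hold_hold"], ["buy_gme", "hold_hold"], ["buy_gme"], ["ark_invest"]]
print(umass_coherence(["buy_gme", "hold_hold"], toy_docs))
print(umass_coherence(["buy_gme", "ark_invest"], toy_docs))
```

Averaging this score over all topics for several values of `K` would give one way to compare models, though it should be read alongside a manual look at the topic words.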

## For fun: Topic Modelling using Latent Dirichlet Allocation on Emojis Only

In [4]:
%%time
tf_model = (
    CountVectorizer()
    .setInputCol('finished_emojis')
    .setOutputCol('tfs')
    .fit(processed_texts)
)
lda_feats = tf_model.transform(processed_texts)

idf_model = (
    IDF()
    .setInputCol('tfs')
    .setOutputCol('idfs')
    .fit(lda_feats)
)
lda_feats = idf_model.transform(lda_feats).select(["tfs", "idfs"])

lda = (
    LDA()
    .setFeaturesCol('idfs')
    .setK(5)
    .setMaxIter(5)
)

lda_model = lda.fit(lda_feats)
print(f"Processed (and counted) {df.count()} rows.")

Processed (and counted) 5182 rows.
CPU times: user 52.6 ms, sys: 3.31 ms, total: 55.9 ms
Wall time: 3min 33s


In [5]:
vocab = tf_model.vocabulary

def get_words(token_list):
    # Map vocabulary indices back to their emoji strings.
    return [vocab[token_id] for token_id in token_list]

udf_to_words = F.udf(get_words, T.ArrayType(T.StringType()))
(lda_model
 .describeTopics()
 .withColumn('topic_words', udf_to_words(F.col('termIndices')))
 .select(["topic", "topic_words"])
 .show(truncate=120))

+-----+----------------------------------------+
|topic|                             topic_words|
+-----+----------------------------------------+
|    0| [👐, 🌕, 😂, 💎, 🦍, 🌈, 💫, 🚀, 😎, ‍]|
|    1|[😈, 🥶, 🚀, 🚨, 🌚, 🔥, 😢, 🍿, 🌕, 💎]|
|    2|[🚀, 🌚, 🔥, 🥜, 🥲, 💎, 😤, 👋, 🙌, 👨]|
|    3| [💎, 🙌, 🤲, 🦍, 🚀, ️, 🤚, 🥥, 🍆, 🍑]|
|    4|[🌙, 🍑, 🥺, 🚀, 🌗, 🌏, 😡, 😠, 🪐, 🤬]|
+-----+----------------------------------------+



For fun: can you match the emoji topics here with the topics extracted using the n-grams?
Note: nothing in the method guarantees that this will be possible.
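
One way to attempt the matching is to compare how strongly each post loads on each topic under the two models. The sketch below is a hedged illustration: it assumes the per-document topic weights have already been collected into plain Python lists (for example from the `topicDistribution` column that a fitted Spark LDA model adds on `transform`), and the names `pearson` and `match_topics` are hypothetical. It brute-forces all pairings, which is fine for 5 topics (120 permutations).

```python
import math
from itertools import permutations


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def match_topics(dist_a, dist_b):
    """Pair topics of model A with topics of model B by maximizing total correlation.

    dist_a and dist_b hold one list of topic weights per document, with the
    same documents in the same order and the same number of topics.
    """
    k = len(dist_a[0])
    cols_a = list(zip(*dist_a))  # one column of weights per topic
    cols_b = list(zip(*dist_b))
    corr = [[pearson(a, b) for b in cols_b] for a in cols_a]
    best = max(
        permutations(range(k)),
        key=lambda perm: sum(corr[i][perm[i]] for i in range(k)),
    )
    return list(best), corr


# Toy example: topic 0 of model A behaves like topic 1 of model B.
ngram_dists = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
emoji_dists = [[0.2, 0.8], [0.1, 0.9], [0.7, 0.3], [0.9, 0.1]]
mapping, _ = match_topics(ngram_dists, emoji_dists)
print(mapping)
```

Weak or near-zero correlations in the returned matrix would be a sign that the two sets of topics simply do not correspond, in line with the note above.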