# Topic modeling pipelines using Latent Dirichlet Allocation on emojis and on n-grams
This pipeline is designed to preserve symbols, words, and phrases of interest (e.g., special vocabulary) in the context of WallStreetBets posts, while splitting off a separate bag of emojis. We use neural part-of-speech tagging to generate meaningful and relevant n-grams, then do additional normalization for dimensionality reduction. Our approach to handling emojis has advantages and disadvantages. It is probably a useful and efficient way of preserving much of the emoji sentiment (and even the evolution of sentiment throughout a post), especially when the emojis are used more decoratively.
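
As a rough illustration of splitting off a bag of emojis, here is a minimal sketch using only Python's `re` module. The character ranges and the `split_emojis` helper are assumptions made for this example; they are not the matcher used in `pipelines`, and real emoji coverage is broader than these ranges.

```python
import re

# Assumed, simplified ranges: main pictograph blocks plus misc symbols/dingbats.
_BASE = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"
# Optional skin-tone modifiers and the emoji variation selector.
_MODS = "[\U0001F3FB-\U0001F3FF\uFE0F]*"
_UNIT = f"{_BASE}{_MODS}"
# Units joined by a zero-width joiner (U+200D) are kept as one emoji;
# plain repetitions (no joiner) are matched one at a time.
EMOJI_RE = re.compile(f"{_UNIT}(?:\u200D{_UNIT})*")


def split_emojis(text):
    """Return (text without emojis, list of emojis in order of appearance)."""
    emojis = EMOJI_RE.findall(text)
    stripped = EMOJI_RE.sub(" ", text)
    stripped = re.sub(r"\s+", " ", stripped).strip()
    return stripped, emojis


text, emojis = split_emojis("I paid a steep $5🚀🚀🚀")
print(text)
print(emojis)
```

Keeping the emojis in order of appearance is what lets a bag like this retain some of the evolution of sentiment through a post.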

In our pipeline_development notebook, we tested the performance of our pipeline against a spaCy pipeline and saw a very substantial improvement. TODO: write specifics here, but a much simpler preprocessing pipeline in spaCy took about 6 minutes, while ours took maybe 30 seconds (on my laptop).

What's more, using Spark NLP's LightPipeline class, we get a 10-20% speedup in inference.
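
To make comparisons like these reproducible, timings can be collected with a small helper. This is a minimal sketch using only the standard library; `best_wall_time`, `speedup_factor`, and `percent_faster` are hypothetical names introduced here, and the numbers in the usage lines are the rough figures quoted above, not new measurements.

```python
import time


def best_wall_time(fn, repeats=3):
    """Return the smallest wall-clock time (in seconds) over `repeats` calls of fn()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup_factor(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


def percent_faster(baseline_s, candidate_s):
    """Relative reduction in run time, in percent."""
    return 100.0 * (baseline_s - candidate_s) / baseline_s


# Rough figures from the text: about 6 minutes versus about 30 seconds.
print(speedup_factor(360.0, 30.0))
print(percent_faster(360.0, 30.0))
```

When timing Spark code, the callable passed to `best_wall_time` should end in an action such as `count()`, so that the work is actually executed inside the timed region.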

References: The O'Reilly Spark NLP book, page 76 and https://github.com/maobedkova/TopicModelling_PySpark_SparkNLP

TODO:
- come back and set the topics after clustering, using some appropriate validation technique.
- refine the emoji matcher regex to allow for certain multi-char strings (but no repetitions).

In [1]:
%config Completer.use_jedi = False

import os
import sys
import pandas as pd

import sparknlp
import pyspark.sql.functions as F
from pyspark.sql import types as T
from sparknlp.base import LightPipeline
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.ml.clustering import LDA

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk"
os.environ["PATH"] = f"{os.environ['JAVA_HOME']}/bin:{os.environ['PATH']}"

# to allow importing from parent directory of notebooks folder
sys.path.append('..')

DATA_PATH = "../data/reddit_wsb.csv"

spark = sparknlp.start()

%load_ext autoreload
%autoreload 1
%aimport pipelines

In [2]:
%%time
df = spark.read.csv(DATA_PATH,
                    header=True,
                    multiLine=True, 
                    quote="\"", 
                    escape="\"")

df = df.sample(withReplacement=False, fraction=0.2, seed=1)

df = (df.withColumn("text", 
               F.concat_ws(". ", df.title, df.body))
      .drop("title", "body", "url", "comms_num", "created"))

CPU times: user 4.81 ms, sys: 2.98 ms, total: 7.79 ms
Wall time: 3.47 s


## Quick illustration of text processing with examples

In [3]:
text_list = [
    "Shouldn't sell 💎 🙌 should not sell",
    "I paid a steep $5🚀🚀🚀",
    "What's-his-name wasn't selling.",
    "Don't sell GME, I say. I don't sell.",
    "He's a seller. I do not sell!",
    "I'm gonna sell? Should sell!",
    "I don't see why anybody should ever sell.",
    "They're there. They've been there.",
    "Trading, it's good trading. 👱‍♀️",
    "'It's' was its own problem, wasn't it?",
]

empty_df = spark.createDataFrame([['']]).toDF("text")
eg_df = spark.createDataFrame(pd.DataFrame({"text": text_list, 
                                            "text_no_emojis": text_list}))

pipeline = pipelines.build_lda_preproc_pipeline()
pipeline_model = pipeline.fit(eg_df)
processed_egs = pipeline_model.transform(eg_df)
processed_egs = pipelines.lda_preproc_finisher(processed_egs)

In [4]:
processed_egs.toPandas()

Unnamed: 0,text,finished_ngrams,finished_emojis
0,I'm gonna sell? Should sell!,[should_sell],[]
1,"Trading, it's good trading. 👱‍♀️",[good_trading],[👱‍♀️]
2,I paid a steep $5🚀🚀🚀,[steep_$5],"[🚀, 🚀, 🚀]"
3,"'It's' was its own problem, wasn't it?",[own_problem],[]
4,Shouldn't sell 💎 🙌 should not sell,"[should_not_sell, should_not_sell]","[💎, 🙌]"
5,"Don't sell GME, I say. I don't sell.",[do_not_sell_gme],[]
6,I don't see why anybody should ever sell.,[should_ever_sell],[]
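
The table shows contractions such as "Shouldn't" and "Don't" ending up as `should_not_sell` and `do_not_sell_gme`. The actual pipeline does this with Spark NLP annotators; the sketch below only illustrates the general idea in plain Python. The `SPECIAL` lookup and both helper functions are assumptions made for this example, and only straight apostrophes are handled.

```python
import re

# Irregular negations that the generic "n't" rule would get wrong.
SPECIAL = {"can't": "can not", "won't": "will not", "shan't": "shall not"}


def expand_negations(text):
    """Lowercase the text and expand "n't" contractions into separate words."""
    text = text.lower()
    for short, full in SPECIAL.items():
        text = text.replace(short, full)
    # Generic rule: "shouldn't" -> "should not", "don't" -> "do not".
    return re.sub(r"(\w+)n't\b", r"\1 not", text)


def to_ngram(phrase):
    """Join the normalized tokens of a phrase with underscores."""
    return "_".join(expand_negations(phrase).split())


print(to_ngram("Shouldn't sell"))
print(to_ngram("Don't sell GME"))
```

Normalizing both "Shouldn't sell" and "should not sell" to the same token is one way this kind of step reduces the vocabulary size before vectorization.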


## Now fit to WallStreetBets posts

In [5]:
texts = pipelines.preprocess_texts(df)
pipeline = pipelines.build_lda_preproc_pipeline()
pipeline_model = pipeline.fit(texts)
light_model = LightPipeline(pipeline_model)
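def process_texts():
    # Spark transformations are lazy, so unless lda_preproc_finisher triggers
    # an action, this likely times plan construction rather than the full
    # annotation; an action such as toPandas() is what runs the work.
    processed_texts = light_model.transform(texts)
    processed_texts = pipelines.lda_preproc_finisher(processed_texts)
    return processed_texts
%time processed_texts = process_texts()
# Note: this counts the sampled input rows (df), not the processed output.
print(f"Processed (and counted) {df.count()} rows.")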

CPU times: user 48 ms, sys: 14.9 ms, total: 62.8 ms
Wall time: 613 ms
Processed (and counted) 5182 rows.


In [6]:
processed_texts.toPandas()

Unnamed: 0,text,finished_ngrams,finished_emojis
0,Exit the system. The CEO of NASDAQ pushed to h...,"[will_change, may_have, will_look, should_have...",[]
1,SHORT STOCK DOESN'T HAVE AN EXPIRATION DATE. H...,"[next_week, may_be, false_expectation, will_se...",[]
2,Currently Holding AMC and NOK - Is it retarded...,"[should_move, gme_today]",[]
3,We need to stick together and 💎🖐 the ever lovi...,"[fellow_poors, rise_up, ah_manipulation, will_...",[💎]
4,Patcher and other media outlets calling this a...,[ponzi_scheme],[]
...,...,...,...
4006,Some loss porn for you degens. $GME to the moo...,[loss_porn],"[🦧, 🍌, 💎, 🚀, 🚀, 🌕]"
4007,"Listen up plebs, Tony hawk just railed a fat l...","[park_lot, chainsmokers_tony_hawk_invest, migh...",[]
4008,"GME Yolo Loss Porn- Day 3, from $23,358k down ...","[financial_advisor, gme_yolo_loss_porn_day, mo...",[🐵]
4009,"Hey, go fuck yourselves!",[hey_go_fuck_yourselves],[]


## Topic Modeling using meaningful n-grams

In [6]:
%%time
tf_model = (
    CountVectorizer()
    .setInputCol('finished_ngrams')
    .setOutputCol('tfs')
    .fit(processed_texts)
)
lda_feats = tf_model.transform(processed_texts)

idf_model = (
    IDF()
    .setInputCol('tfs')
    .setOutputCol('idfs')
    .fit(lda_feats)
)
lda_feats = idf_model.transform(lda_feats).select(["tfs", "idfs"])

lda = (
    LDA()
    .setFeaturesCol('idfs')
    .setK(5)
    .setMaxIter(5)
)

lda_model = lda.fit(lda_feats)
print(f"Processed (and counted) {df.count()} rows.")

Processed (and counted) 5182 rows.
CPU times: user 45.3 ms, sys: 12.1 ms, total: 57.4 ms
Wall time: 2min 43s
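
To make the `CountVectorizer` and `IDF` stages above more concrete, here is a hand-rolled sketch on a toy corpus. It assumes the smoothed formula documented for Spark ML's `IDF`, namely log((m + 1) / (df + 1)) with m the number of documents and df the number of documents containing the term. The toy documents and the `fit_idf` and `tf_idf` helpers are made up for illustration.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Smoothed IDF per term: log((m + 1) / (df + 1)), m = number of documents."""
    m = len(docs)
    # Document frequency: count each term at most once per document.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}


def tf_idf(doc, idf):
    """Weight raw term counts (the 'tfs' column) by IDF (the 'idfs' column)."""
    counts = Counter(doc)
    return {term: tf * idf[term] for term, tf in counts.items()}


docs = [
    ["will_be", "loss_porn", "loss_porn"],
    ["will_be", "short_squeeze"],
    ["will_be"],
]
idf = fit_idf(docs)
print(idf)
print(tf_idf(docs[0], idf))
```

Under this formula, an n-gram that appears in every document gets an IDF of zero, so it contributes nothing to the features passed to LDA.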


In [7]:
vocab = tf_model.vocabulary
def get_words(token_list):
    return [vocab[token_id] for token_id in token_list]
udf_to_words = F.udf(get_words, T.ArrayType(T.StringType()))
(lda_model
 .describeTopics()
 .withColumn('topic_words', udf_to_words(F.col('termIndices')))
 .select(["topic", "topic_words"])
 .show(truncate=120))

+-----+------------------------------------------------------------------------------------------------------------------------+
|topic|                                                                                                             topic_words|
+-----+------------------------------------------------------------------------------------------------------------------------+
|    0|[call_option, can_be, share_price, imply_volatility, standard_definition, price_risk, interest_rate, good_fight, will...|
|    1|[last_week, can_not_buy, melvin_capital, webull_account, big_asshole, massive_conflict, short_position, make_money, n...|
|    2|[short_squeeze, gamma_squeeze, financial_advice, would_be, lose_money, fuck_robinhood, can_do, can_hold, development_...|
|    3|[will_be, loss_porn, robin_hood, would_be, stock_market, federal_reserve, would_have, should_not_be, buy_gme, financi...|
|    4|[wall_street, class_action, market_manipulation, will_see, will_not_be, will_be, free_mark

## For fun: Topic Modeling using Latent Dirichlet Allocation on Emojis Only

In [8]:
%%time
tf_model = (
    CountVectorizer()
    .setInputCol('finished_emojis')
    .setOutputCol('tfs')
    .fit(processed_texts)
)
lda_feats = tf_model.transform(processed_texts)

idf_model = (
    IDF()
    .setInputCol('tfs')
    .setOutputCol('idfs')
    .fit(lda_feats)
)
lda_feats = idf_model.transform(lda_feats).select(["tfs", "idfs"])

lda = (
    LDA()
    .setFeaturesCol('idfs')
    .setK(5)
    .setMaxIter(5)
)

lda_model = lda.fit(lda_feats)
print(f"Processed (and counted) {df.count()} rows.")

Processed (and counted) 5182 rows.
CPU times: user 44.4 ms, sys: 10.2 ms, total: 54.5 ms
Wall time: 3min 6s


In [9]:
vocab = tf_model.vocabulary
def get_words(token_list):
    return [vocab[token_id] for token_id in token_list]
udf_to_words = F.udf(get_words, T.ArrayType(T.StringType()))
(lda_model
 .describeTopics()
 .withColumn('topic_words', udf_to_words(F.col('termIndices')))
 .select(["topic", "topic_words"])
 .show(truncate=80))

+-----+----------------------------------------+
|topic|                             topic_words|
+-----+----------------------------------------+
|    0|[🚀, 🌱, 💎, 🤚, 👿, 👨, 🌕, 🤝, 🙌, 📄]|
|    1|[🚀, 💎, 🙌, 🦍, 👐, 🔥, 🤚, 💪, 🧤, 💰]|
|    2|[🤲, 🌈, 🐻, 💎, 🌑, 👏, 🥲, 🤔, 💥, 🙌]|
|    3|[🌙, 🌚, 💵, 🧃, 🍁, 👎, 🍿, 🎮, 🚀, 😓]|
|    4|[🚨, 📈, 😠, 😡, 🤑, 🤬, 😉, 😌, 🌎, 🪐]|
+-----+----------------------------------------+



For fun: can you match the topics extracted from the n-grams with the topics extracted from the emojis?
Note: nothing in the method guarantees that this will be possible.
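
One way to attempt such a matching is to compare which topic each model assigns to the same post. The sketch below assumes you already have one dominant topic index per post from each model; the two lists are made-up toy data, and `topic_overlap` and `best_matches` are hypothetical helpers, not part of this notebook.

```python
from collections import Counter


def topic_overlap(ngram_topics, emoji_topics):
    """Count how often each (n-gram topic, emoji topic) pair labels the same post."""
    return Counter(zip(ngram_topics, emoji_topics))


def best_matches(ngram_topics, emoji_topics):
    """For each n-gram topic, return the emoji topic it co-occurs with most often."""
    best = {}
    for (n_topic, e_topic), n in topic_overlap(ngram_topics, emoji_topics).items():
        if n_topic not in best or n > best[n_topic][1]:
            best[n_topic] = (e_topic, n)
    return best


# Toy data: dominant topic per post from each model (same post order).
ngram_topics = [0, 0, 1, 1, 1, 2]
emoji_topics = [3, 3, 4, 4, 3, 4]
print(best_matches(ngram_topics, emoji_topics))
```

If the overlap counts are spread evenly across pairs, that suggests the two sets of topics do not line up, which is consistent with the note above.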