# Bag of Words, Bag of Emojis and n-Grams pipeline
A work in progress, this pipeline is designed to preserve symbols, words and phrases of interest (e.g., special vocabulary) in WallStreetBets posts, while splitting the emojis off into a separate bag. The position of each emoji in the text is preserved with a special Unicode placeholder character, ⓔ (a circled e, U+24D4; the preprocessing cell below inserts its capital form Ⓔ, U+24BA). This approach has advantages and disadvantages. It is probably a useful and efficient way of preserving much of the emoji sentiment (and even the evolution of sentiment throughout a post), especially when the emojis are used as 'decorators'. It is also partly a workaround for the fact that Spark NLP (apparently) has no good native solution for normalizing long strings of emojis.

References: The O'Reilly Spark NLP book, page 76 and https://github.com/maobedkova/TopicModelling_PySpark_SparkNLP
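
To make the placeholder idea concrete, here is a minimal pure-Python sketch (no Spark) of what the preprocessing amounts to: each emoji is swapped for a placeholder so that its position survives tokenization, and the emojis themselves are collected into a separate bag. The ranges in `EMOJI_RANGES` and the helper `split_emojis` are assumptions for illustration only; the actual `pipelines.emoji_ranges` is not shown in this notebook and probably covers more Unicode blocks.

```python
import re

# Hypothetical, deliberately incomplete ranges for illustration only;
# the real list lives in pipelines.emoji_ranges.
EMOJI_RANGES = [
    "\U0001F300-\U0001F5FF",  # Miscellaneous Symbols and Pictographs
    "\U0001F600-\U0001F64F",  # Emoticons
    "\U0001F680-\U0001F6FF",  # Transport and Map Symbols
    "\U0001F900-\U0001F9FF",  # Supplemental Symbols and Pictographs
]
# Same construction as emojis_regex in the notebook: one character class.
EMOJI_RE = re.compile("[" + "".join(EMOJI_RANGES) + "]")
PLACEHOLDER = "\u24ba"  # circled capital E, as used in the preprocessing cell


def split_emojis(text):
    """Return (text with each emoji replaced by the placeholder, emojis in order)."""
    emojis = EMOJI_RE.findall(text)
    return EMOJI_RE.sub(PLACEHOLDER, text), emojis


masked, bag = split_emojis("GME to the moon 🚀🚀 💎🙌")
print(masked)
print(bag)
```

Because the placeholder is a single character per emoji, the masked text keeps the same length in code points as the original, which is what makes it possible to recover emoji positions later.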

In [1]:
%config Completer.use_jedi = False

%load_ext autoreload
%autoreload 1

import os
import sys
import pandas as pd

import sparknlp
import pyspark.sql.functions as F
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.ml.clustering import LDA
from pyspark.sql import types as T

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

data_path = "../data/reddit_wsb.csv"

spark = sparknlp.start()
sys.path.append('..')
%aimport pipelines

In [2]:
%%time
df = spark.read.csv(data_path, 
                    header=True,
                    multiLine=True, 
                    quote="\"", 
                    escape="\"")

# df = df.sample(withReplacement=False, fraction=0.05, seed=1)

df = (df.withColumn("text", 
               F.concat_ws(". ", df.title, df.body))
 .drop("title", "body", "url", "comms_num", "created"))

emojis_regex = "["+"".join(pipelines.emoji_ranges)+"]"

texts = (
    df.withColumn("text_no_emojis",
                  F.regexp_replace(df["text"],
                                   emojis_regex, "Ⓔ"))
    .withColumn("text_no_emojis", 
                  F.regexp_replace("text_no_emojis", "[“”]", "\""))
    .withColumn("text_no_emojis", 
                F.regexp_replace("text_no_emojis", "[‘’]", "\'"))
    # to keep positions of emojis (not necessary, currently)
    .select(["text", "text_no_emojis"])
)

CPU times: user 5.6 ms, sys: 6.54 ms, total: 12.1 ms
Wall time: 3.92 s


In [3]:
%%time
pipeline = pipelines.build_bowbae_pipeline()
pipeline_model = pipeline.fit(texts)
light_model = LightPipeline(pipeline_model)

CPU times: user 219 ms, sys: 35.5 ms, total: 255 ms
Wall time: 6.52 s


In [4]:
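# We compare inference time with and without using the light pipeline.
# Anecdotally, we get a 10-20% speedup in wall time.
# Note: transform() on a DataFrame is lazy in Spark, so the %time below
# mostly measures building the query plan, not the annotation itself.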
# %time processed_texts = pipeline_model.transform(texts)
%time processed_texts = light_model.transform(texts)
print(f"Processed (and counted) {df.count()} rows.")
processed_texts

CPU times: user 78.7 ms, sys: 12.3 ms, total: 91 ms
Wall time: 512 ms
Processed (and counted) 25647 rows.


DataFrame[text: string, text_no_emojis: string, finished_tokenized: array<string>, finished_emojis: array<string>, finished_unigrams: array<string>, finished_naive_ngrams: array<string>, finished_pos_tags: array<string>, finished_ngrams: array<string>]

In [5]:
processed_texts.select(["finished_naive_ngrams", "finished_ngrams"]).show(truncate=60)

+------------------------------------------------------------+------------------------------------------------------------+
|                                       finished_naive_ngrams|                                             finished_ngrams|
+------------------------------------------------------------+------------------------------------------------------------+
|                            [money_sending, sending_message]|                                                          []|
|[math_professor, professor_scott, scott_steiner, steiner_...|       [math professor scott steiner, disaster for gamestop]|
|[exit_system, ceo_nasdaq, nasdaq_pushed, pushed_halt, hal...|[enough sentiment, long on gme, clear that this is a rigg...|
|[new_sec, sec_filing, filing_gme, someone_less, less_reta...|                                 [new sec, gme! can someone]|
|[distract_gme, gme_thought, thought_amc, amc_brothers, br...|                                         [distract from gme]|
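
The `finished_naive_ngrams` column above looks like adjacent-token bigrams joined with an underscore after stop words have been dropped. A minimal sketch of that idea in plain Python follows. It assumes whitespace tokenization and a tiny hypothetical stop-word list; the actual stages inside `pipelines.build_bowbae_pipeline` are not shown here and may well differ.

```python
# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"is", "a", "the", "to", "for", "of", "and"}


def naive_bigrams(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, drop stop words, join neighbours with '_'."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


bigrams = naive_bigrams("Math professor Scott Steiner")
print(bigrams)
```

Note that dropping stop words before pairing means a bigram can span a removed word, which is one reason these n-grams are "naive" compared with the part-of-speech-based `finished_ngrams`.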

In [6]:
processed_texts.select(["finished_ngrams", "finished_emojis"]).show(truncate=60)

+------------------------------------------------------------+----------------------------------------------------+
|                                             finished_ngrams|                                     finished_emojis|
+------------------------------------------------------------+----------------------------------------------------+
|                                                          []|                                        [🚀, 💎, 🙌]|
|       [math professor scott steiner, disaster for gamestop]|                                                  []|
|[enough sentiment, long on gme, clear that this is a rigg...|                                                  []|
|                                 [new sec, gme! can someone]|                                                  []|
|                                         [distract from gme]|                                                  []|
|                                                          []|             

## For fun: Topic Modelling using Latent Dirichlet Allocation on Emojis Only
(Will select number of topics later using some clustering validation strategy)

(To do: strip Unicode control characters, or track emoji strings instead of their constituent characters.)

Will we be able to match these with topics described in more reasonable ways? I doubt it.
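
As a possible starting point for the to-do above, the sketch below drops invisible characters from a bag of emojis using Python's standard `unicodedata` module. Which Unicode categories count as noise is an assumption made here: `Cf` (format characters, such as the zero width joiner U+200D) and `Mn` (nonspacing marks, such as the variation selector U+FE0F). The helper name `drop_invisible` is hypothetical, and the same filter would still need to be wrapped in a UDF or applied before tokenization to be used in the pipeline.

```python
import unicodedata

# Assumption: these categories are treated as noise in an emoji bag.
# Cf = format characters (e.g. U+200D ZERO WIDTH JOINER)
# Mn = nonspacing marks (e.g. U+FE0F VARIATION SELECTOR-16)
NOISE_CATEGORIES = {"Cf", "Mn"}


def drop_invisible(tokens):
    """Drop tokens made up entirely of noise characters (empty tokens too)."""
    return [
        t for t in tokens
        if not all(unicodedata.category(ch) in NOISE_CATEGORIES for ch in t)
    ]


raw_bag = ["🤷", "\u200d", "♂", "\ufe0f", "🚀"]
clean_bag = drop_invisible(raw_bag)
print(clean_bag)
```

This only removes the stray joiners and selectors; it does not reassemble multi-character emoji sequences, which is the second option mentioned in the to-do.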

In [7]:
tf_model = (
    CountVectorizer()
    .setInputCol('finished_emojis')
    .setOutputCol('tfs')
    .fit(processed_texts)
)
lda_feats = tf_model.transform(processed_texts)

idf_model = (
    IDF()
    .setInputCol('tfs')
    .setOutputCol('idfs')
    .fit(lda_feats)
)
lda_feats = idf_model.transform(lda_feats).select(["tfs", "idfs"])

lda = (
    LDA()
    .setFeaturesCol('idfs')
    .setK(5)
    .setMaxIter(5)
)

lda_model = lda.fit(lda_feats)
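
For intuition about what the `tfs` and `idfs` columns contain, here is a small pure-Python sketch of the same weighting on a toy corpus of emoji bags. It uses the smoothed inverse document frequency documented for Spark MLlib's `IDF`, log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. The toy `docs` and the helper `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus: each document is a bag of emojis (made-up data).
docs = [["🚀", "🚀", "💎"], ["💎", "🙌"], ["🚀"]]


def tf_idf(docs):
    """Return one {token: tf * idf} dict per document."""
    m = len(docs)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF as documented for Spark MLlib's IDF estimator.
    idf = {tok: math.log((m + 1) / (df + 1)) for tok, df in doc_freq.items()}
    return [
        {tok: tf * idf[tok] for tok, tf in Counter(doc).items()}
        for doc in docs
    ]


weights = tf_idf(docs)
for doc_weights in weights:
    print(doc_weights)
```

LDA is usually formulated over raw term counts, so fitting it on the `tfs` column instead of `idfs` would be the more conventional alternative to try when comparing topics.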

In [8]:
vocab = tf_model.vocabulary
def get_words(token_list):
    return [vocab[token_id] for token_id in token_list]
udf_to_words = F.udf(get_words, T.ArrayType(T.StringType()))

In [9]:
(lda_model
 .describeTopics()
 .withColumn('topic_words', udf_to_words(F.col('termIndices')))
 .select(["topic", "topic_words"])
 .show(truncate=80))

+-----+----------------------------------------+
|topic|                             topic_words|
+-----+----------------------------------------+
|    0| [💎, 🙌, 🦍, ️, 👐, 🌝, 🚀, 🍌, 🌕, 🐻]|
|    1| [🤲, 💎, ‍, 📈, 🚨, 🤬, 🤝, 👊, 🥲, 💵]|
|    2|[💎, 🚀, 🤚, 🙌, 💰, 🌚, 🧻, 👋, 😪, 😊]|
|    3|  [😂, 🦀, 👍, 🤷, ️, 🙏, 🐧, 👇, ‍, 📢]|
|    4|[🚀, 💎, 🌙, 🌑, 🦍, 😡, 🥜, 🤲, 🙌, 👐]|
+-----+----------------------------------------+

