# Illustration of Bag of Words / Bag of Emojis pipeline
A work in progress, this pipeline is designed to preserve symbols, words, and phrases of interest (e.g., special vocabulary) in WallStreetBets posts while splitting off a separate bag of emojis. The position of each emoji among the tokens is preserved with the special Unicode character ⓔ (a circled e; U+24D4); the code substitutes the capital form Ⓔ (U+24BA), which shows up in lowercase in the pipeline output. This approach has advantages and disadvantages. It is probably a useful and efficient way of preserving much of the emoji sentiment (and even the evolution of sentiment throughout a post), especially when the emojis are used as 'decorators'. It is partly a workaround for the fact that there is (apparently) no good solution for normalizing long strings of emojis.
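
To illustrate why keeping the placeholder is useful, here is a minimal sketch of how a placeholder run could be matched back to the emojis it stands for. The helper `align_emojis` is hypothetical and not part of the pipeline. It assumes every emoji is a single code point (no ZWJ sequences or variation selectors) and that the emoji list is in the same order as the placeholders.

```python
PLACEHOLDER = "ⓔ"  # circled small e, U+24D4


def align_emojis(tokens, emojis):
    """Pair each placeholder-only token with the emojis it replaced.

    Hypothetical helper: assumes one placeholder character per emoji
    and that `emojis` is ordered as the emojis appeared in the text.
    """
    aligned = []
    cursor = 0  # index of the next unused emoji
    for index, token in enumerate(tokens):
        # A token made up solely of placeholders marks a run of emojis.
        if token and set(token) == {PLACEHOLDER}:
            run = emojis[cursor:cursor + len(token)]
            aligned.append((index, run))
            cursor += len(token)
    return aligned


tokens = ["money", "send", "message", "ⓔⓔⓔ"]
emojis = ["🚀", "💎", "🙌"]
print(align_emojis(tokens, emojis))
```

Each result pairs a token index with the emojis found at that position, which is the kind of information needed to follow sentiment through a post.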

In [1]:
%config Completer.use_jedi = False

%load_ext autoreload
%autoreload 1

import os
import sys
import pandas as pd

import sparknlp
import pyspark.sql.functions as F
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

data_path = "../data/reddit_wsb.csv"

spark = sparknlp.start()
sys.path.append('..')
%aimport pipelines

In [22]:
%%time
df = spark.read.csv(data_path, 
                    header=True,
                    multiLine=True, 
                    quote="\"", 
                    escape="\"")

# df = df.sample(withReplacement=False, fraction=0.05, seed=1)

df = (df.withColumn("text", 
               F.concat_ws(". ", df.title, df.body))
 .drop("title", "body", "url", "comms_num", "created"))

emojis_regex = "["+"".join(pipelines.emoji_ranges)+"]"

texts = (
    df.withColumn("text_no_emojis",
                  F.regexp_replace(df["text"],
                                   emojis_regex, "Ⓔ"))
    .select(["text", "text_no_emojis"])
)

pipeline = pipelines.build_bowbae_pipeline()
pipeline_model = pipeline.fit(texts)
light_model = LightPipeline(pipeline_model)

CPU times: user 139 ms, sys: 42.1 ms, total: 181 ms
Wall time: 1.45 s
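
The placeholder substitution in the cell above can be tried without Spark. The sketch below is a plain-Python analogue under several assumptions. `emoji_ranges` is a hypothetical two-entry stand-in for `pipelines.emoji_ranges`, whose real contents are not shown here. The lowercasing and whitespace split only approximate whatever normalization and tokenization the actual pipeline performs.

```python
import re

# Hypothetical stand-in for pipelines.emoji_ranges (real contents not shown).
emoji_ranges = ["\U0001F300-\U0001FAFF", "\u2600-\u27BF"]

# Same construction as in the notebook: join the ranges into one character class.
emojis_regex = "[" + "".join(emoji_ranges) + "]"

text = "Send message money 🚀💎🙌"

# Replace every emoji with the capital circled E (U+24BA), as the notebook does.
text_no_emojis = re.sub(emojis_regex, "Ⓔ", text)

# Assumed normalization: lowercasing turns U+24BA into the small form U+24D4.
tokens = text_no_emojis.lower().split()

# The emojis themselves are collected separately.
emojis = re.findall(emojis_regex, text)

print(tokens)
print(emojis)
```

Because the emojis are adjacent in the text, their placeholders end up in a single token, so a run of emojis stays together as one unit.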


In [46]:
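# Compare inference time with and without the LightPipeline.
# Anecdotally, we get a 10-20% speedup in wall time.
# Caveat: Spark evaluates DataFrame transformations lazily, so these timings
# likely reflect building the query plan rather than processing every row.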
%time processed_texts = pipeline_model.transform(texts)
%time processed_texts = light_model.transform(texts)
print(f"Processed (and counted) {df.count()} rows.")

CPU times: user 52.4 ms, sys: 15.4 ms, total: 67.8 ms
Wall time: 147 ms
CPU times: user 49.2 ms, sys: 6.62 ms, total: 55.8 ms
Wall time: 124 ms
Processed (and counted) 25647 rows.


In [32]:
processed_texts.select(["finished_unigrams", "finished_emojis"]).show(100, truncate=60)

+------------------------------------------------------------+------------------------------------------------------------+
|                                           finished_unigrams|                                             finished_emojis|
+------------------------------------------------------------+------------------------------------------------------------+
|                                 [money, send, message, ⓔⓔⓔ]|                                                [🚀, 💎, 🙌]|
|[math, professor, scott, steiner, say, number, spell, dis...|                                                          []|
|[exit, system, ceo, nasdaq, push, halt, trade, give, inve...|                                                          []|
|[new, sec, filing, gme, someone, less, retarded, please, ...|                                                          []|
|                 [distract, gme, think, amc, brother, aware]|                                                          []|
|          