# Bag of Words, Bag of Emojis and n-Grams pipeline
This pipeline is a work in progress. It is designed to preserve symbols, words and phrases of interest (e.g., special vocabulary) in WallStreetBets posts, while splitting the emojis off into a separate bag of emojis. The position of each emoji in the text is preserved with a special Unicode placeholder character, ⓔ (a circled e; U+24D4). This approach has advantages and disadvantages. It is probably a useful and efficient way of preserving much of the emoji sentiment (and even the evolution of sentiment throughout a post), especially when the emojis are used as 'decorators'. It is partly a workaround for the (apparent) lack of a good solution, native to Spark NLP, for normalizing long strings of emojis.
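
To make the idea concrete, here is a minimal pure-Python sketch of the split described above: every emoji is replaced in the text by the placeholder, and the emojis themselves are collected in order into a separate bag. The names `EMOJI_RANGES`, `PLACEHOLDER` and `split_emojis` are hypothetical and exist only for this illustration. The ranges listed are a small assumed subset, not the contents of `pipelines.emoji_ranges`, and the actual pipeline does this work in Spark rather than with Python's `re`.

```python
import re
from collections import Counter

# Assumed, incomplete emoji ranges for illustration only;
# the notebook takes its ranges from pipelines.emoji_ranges.
EMOJI_RANGES = [
    "\U0001F300-\U0001F5FF",  # miscellaneous symbols and pictographs
    "\U0001F600-\U0001F64F",  # emoticons
    "\U0001F680-\U0001F6FF",  # transport and map symbols
    "\U0001F900-\U0001F9FF",  # supplemental symbols and pictographs
]
EMOJI_RE = re.compile("[" + "".join(EMOJI_RANGES) + "]")

# Circled small e (U+24D4), the placeholder named in the text above.
PLACEHOLDER = "\u24d4"


def split_emojis(text):
    """Return (text with emojis replaced by PLACEHOLDER, list of emojis in order)."""
    emojis = EMOJI_RE.findall(text)
    text_no_emojis = EMOJI_RE.sub(PLACEHOLDER, text)
    return text_no_emojis, emojis


text_no_emojis, emojis = split_emojis("GME to the moon 🚀🚀 💎🙌")
bag_of_emojis = Counter(emojis)  # emoji -> number of occurrences
```

Because each emoji maps to exactly one placeholder character, the order of the emoji list lines up with the order of the placeholders in the text. That alignment is what would allow emoji sentiment to be tracked through a post.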

In [1]:
%config Completer.use_jedi = False

%load_ext autoreload
%autoreload 1

import os
import sys
import pandas as pd

import sparknlp
import pyspark.sql.functions as F
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

data_path = "../data/reddit_wsb.csv"

spark = sparknlp.start()
sys.path.append('..')
%aimport pipelines

In [2]:
%%time
df = spark.read.csv(data_path, 
                    header=True,
                    multiLine=True, 
                    quote="\"", 
                    escape="\"")

# df = df.sample(withReplacement=False, fraction=0.05, seed=1)

# Join title and body into a single text column and drop unused columns.
df = (
    df.withColumn("text", F.concat_ws(". ", df.title, df.body))
    .drop("title", "body", "url", "comms_num", "created")
)

# Character class matching any emoji in the ranges defined in pipelines.
emojis_regex = "[" + "".join(pipelines.emoji_ranges) + "]"

texts = (
    # Replace each emoji with a placeholder to keep the positions of emojis
    # (not necessary, currently).
    df.withColumn("text_no_emojis",
                  F.regexp_replace(df["text"], emojis_regex, "Ⓔ"))
    # Normalize curly double and single quotes to straight ones.
    .withColumn("text_no_emojis",
                F.regexp_replace("text_no_emojis", "[“”]", "\""))
    .withColumn("text_no_emojis",
                F.regexp_replace("text_no_emojis", "[‘’]", "'"))
    .select(["text", "text_no_emojis"])
)

CPU times: user 12.5 ms, sys: 353 µs, total: 12.8 ms
Wall time: 3.89 s


In [3]:
%%time
pipeline = pipelines.build_bowbae_pipeline()
pipeline_model = pipeline.fit(texts)
light_model = LightPipeline(pipeline_model)

CPU times: user 270 ms, sys: 21.1 ms, total: 291 ms
Wall time: 6.88 s


In [4]:
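# Compare inference time with and without the light pipeline.
# Anecdotally, we get a 10-20% speedup in wall time.
# Note: transform() is lazy, so the timing below does not include the annotation work,
# and df.count() counts the input rows without triggering the pipeline stages.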
# %time processed_texts = pipeline_model.transform(texts)
%time processed_texts = light_model.transform(texts)
print(f"Processed (and counted) {df.count()} rows.")
processed_texts

CPU times: user 81 ms, sys: 25.1 ms, total: 106 ms
Wall time: 637 ms
Processed (and counted) 25647 rows.


DataFrame[text: string, text_no_emojis: string, finished_tokenized: array<string>, finished_emojis: array<string>, finished_unigrams: array<string>, finished_naive_ngrams: array<string>, finished_pos_tags: array<string>, finished_ngrams: array<string>]

In [5]:
processed_texts.select(["finished_naive_ngrams", "finished_ngrams"]).show(truncate=60)

+------------------------------------------------------------+------------------------------------------------------------+
|                                       finished_naive_ngrams|                                             finished_ngrams|
+------------------------------------------------------------+------------------------------------------------------------+
|                            [money_sending, sending_message]|                                                          []|
|[math_professor, professor_scott, scott_steiner, steiner_...|       [math professor scott steiner, disaster for gamestop]|
|[exit_system, ceo_nasdaq, nasdaq_pushed, pushed_halt, hal...|[enough sentiment, long on gme, clear that this is a rigg...|
|[new_sec, sec_filing, filing_gme, someone_less, less_reta...|                                 [new sec, gme! can someone]|
|[distract_gme, gme_thought, thought_amc, amc_brothers, br...|                                         [distract from gme]|
|       

In [6]:
processed_texts.select(["finished_ngrams", "finished_emojis"]).show(truncate=60)

+------------------------------------------------------------+----------------------------------------------------+
|                                             finished_ngrams|                                     finished_emojis|
+------------------------------------------------------------+----------------------------------------------------+
|                                                          []|                                        [🚀, 💎, 🙌]|
|       [math professor scott steiner, disaster for gamestop]|                                                  []|
|[enough sentiment, long on gme, clear that this is a rigg...|                                                  []|
|                                 [new sec, gme! can someone]|                                                  []|
|                                         [distract from gme]|                                                  []|
|                                                          []|             