# Illustration of Bag of Words / Bag of Emojis pipeline
A work in progress, this pipeline is designed to preserve symbols, words and phrases of interest (e.g., special vocabulary) in WallStreetBets posts, while splitting emojis off into a separate bag. This approach has advantages and disadvantages. It is probably a useful and efficient way of preserving much of the emoji sentiment (and even the evolution of sentiment throughout a post), especially when the emojis are used as 'decorators'.
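
To make the idea concrete, here is a minimal pure-Python sketch of splitting one post into a bag of words and a bag of emojis. It is not the Spark NLP pipeline used below: the function name `split_words_and_emojis`, the deliberately incomplete emoji ranges and the simple word tokenizer are all assumptions made for illustration.

```python
import re

# Assumed, deliberately incomplete emoji ranges (illustration only).
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def split_words_and_emojis(text):
    """Return (words, emojis) for a single post, keeping emoji order."""
    # Collect emojis in the order they appear, so repeated emojis
    # and their sequence within the post are preserved.
    emojis = EMOJI_PATTERN.findall(text)
    # Blank out emojis before tokenizing so they do not stick to words.
    text_no_emojis = EMOJI_PATTERN.sub(" ", text)
    # Hypothetical tokenizer: lowercase runs of letters, digits, $ and '.
    words = re.findall(r"[a-z0-9$']+", text_no_emojis.lower())
    return words, emojis


words, emojis = split_words_and_emojis("GME to the moon 🚀🚀 💎🙌")
print(words)
print(emojis)
```

Because the emoji list keeps order and repetition, it can in principle be used to follow how emoji sentiment changes over the course of a post.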

In [2]:
%config Completer.use_jedi = False

%load_ext autoreload
%autoreload 1

import os
import sys
import pandas as pd

import sparknlp
import pyspark.sql.functions as F
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

data_path = "../data/reddit_wsb.csv"

spark = sparknlp.start()
sys.path.append('..')
%aimport pipelines

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
%%time
df = spark.read.csv(data_path, 
                    header=True,
                    multiLine=True, 
                    quote="\"", 
                    escape="\"")

# df = df.sample(withReplacement=False, fraction=0.05, seed=1)

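# Merge title and body into a single text column and drop unused columns.
df = (
    df.withColumn("text", F.concat_ws(". ", df.title, df.body))
    .drop("title", "body", "url", "comms_num", "created")
)

# Build a regex character class from the emoji ranges in the pipelines module.
emojis_regex = "[" + "".join(pipelines.emoji_ranges) + "]"

# Replace each emoji in the text with the placeholder character "Ⓔ".
texts = (
    df.withColumn("text_no_emojis",
                  F.regexp_replace(df["text"],
                                   emojis_regex, "Ⓔ"))
    .select(["text", "text_no_emojis"])
)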

pipeline = pipelines.build_bowbae_pipeline()
processed_texts = pipeline.fit(texts).transform(texts)

CPU times: user 176 ms, sys: 45.5 ms, total: 222 ms
Wall time: 6.91 s
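
The placeholder substitution in the cell above can be sketched in plain Python with `re.sub`. The `emoji_ranges` list here is a hypothetical stand-in for `pipelines.emoji_ranges`, whose actual contents are not shown in this notebook. The final lowercasing step is also an assumption: it mimics what a later normalization stage presumably does, which would explain why the placeholder shows up as `ⓔ` in the unigram output.

```python
import re

# Hypothetical stand-in for pipelines.emoji_ranges (real contents not shown).
emoji_ranges = [
    "\U0001F300-\U0001F5FF",  # miscellaneous symbols and pictographs
    "\U0001F600-\U0001F64F",  # emoticons
    "\U0001F680-\U0001F6FF",  # transport and map symbols
]

# Same construction as in the notebook: one character class over all ranges.
emojis_regex = "[" + "".join(emoji_ranges) + "]"

text = "Send money 🚀💎🙌"

# Each emoji becomes one placeholder character, so a run of three
# emojis turns into a single three-character token.
text_no_emojis = re.sub(emojis_regex, "Ⓔ", text)

# Assumed normalization step: lowercasing maps "Ⓔ" to "ⓔ".
tokens = text_no_emojis.lower().split()
print(text_no_emojis)
print(tokens)
```

Keeping a placeholder token instead of deleting emojis outright means the word stream still records where emojis occurred and how many appeared in a row.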


In [6]:
processed_texts.select(["finished_unigrams", "finished_emojis"]).show(100, truncate=80)

+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|                                                               finished_unigrams|                                                                 finished_emojis|
+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|                                                     [money, send, message, ⓔⓔⓔ]|                                                                    [🚀, 💎, 🙌]|
|[math, professor, scott, steiner, say, number, spell, disaster, gamestop, sho...|                                                                              []|
|[exit, system, ceo, nasdaq, push, halt, trade, give, investor, chance, recali...|                                                                              []|
|             [new,