# Intro
This is the notebook I used while developing the pipeline.
References:
  - https://github.com/maobedkova/TopicModelling_PySpark_SparkNLP
  - and the O'Reilly Spark NLP book, page 76.
  
Cf https://github.com/GoogleCloudDataproc/cloud-dataproc/blob/master/codelabs/spark-nlp/topic_model.py, which seems to be very coarse (but I should run my data against it, I guess).

See Conclusion section at the bottom for some final comments.

In [1]:
%config Completer.use_jedi = False
# https://stackoverflow.com/questions/40536560/ipython-and-jupyter-autocomplete-not-working
%load_ext autoreload
%autoreload 1

import os
import sys
import pandas as pd
import unicodedata

import sparknlp
import pyspark.sql.functions as F
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

data_path = "../data/reddit_wsb.csv"
print("\nThe character encoding of the csv file:")
! file -i {data_path}


The character encoding of the csv file:
../data/reddit_wsb.csv: application/csv; charset=utf-8


To set multiLine=True, we have to use Java 8. Even then, the body column contains commas inside quoted fields that themselves contain quotes, and this confused the CSV parser. Solved by following https://stackoverflow.com/questions/40413526/reading-csv-files-with-quoted-fields-containing-embedded-commas.
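As a quick illustration of the quoting issue, here is a minimal sketch using Python's standard csv module rather than Spark. The sample row is made up; its body field contains a comma, a line break, and quotes escaped by doubling them, which is the convention that setting the escape character to `"` is meant to handle.

```python
import csv
import io

# Hypothetical sample: the body field contains a comma, a newline,
# and quotes escaped by doubling them.
raw = 'title,body\n"GME","He said ""hold"", then\nbought more"\n'

# csv.DictReader treats doubled quotes inside a quoted field as one
# literal quote and keeps the embedded newline inside the field.
rows = list(csv.DictReader(io.StringIO(raw, newline="")))
print(rows)
```

The whole record comes back as a single row, even though it spans two physical lines.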

In [2]:
spark = sparknlp.start()
sys.path.append('..')
%aimport lda_pipeline

# Test pipeline

In [491]:
# Note: Converting from Pandas df via df = spark.createDataFrame(df_pd) gives
# >> WARN  TaskSetManager:66 - Stage 2 contains a task of very large size 
# >> (1473 KB). The maximum recommended task size is 100 KB.

df = spark.read.csv(data_path, 
                    header=True,
                    multiLine=True, 
                    quote="\"", 
                    escape="\"")

df = df.sample(withReplacement=False, fraction=0.05, seed=1)
# print(f'{df.where(df["timestamp"].isNull()).count()} null timestamp values.')

# combine text columns and drop unwanted columns
df = (
    df.withColumn("text", 
               F.concat_ws(". ", df.title, df.body))
 .drop("title", "body", "url", "comms_num", "created")
)

texts = df.select("text")

pipeline = lda_pipeline.build_unigram_pipeline()

In [492]:
pipeline = lda_pipeline.build_unigram_pipeline()

text_list = [
    "The S&P will go down and we'll have $100000.",
    "Will, let's be sure your calculation lets us be 💯.",
    "Wasnt was its own problem, wasn't it?",
    "420 wasn’t a meme. GME 🚀 🚀 🚀",
    "Don't sell 👱‍♀️, yall shouldn't sell",
    "Y'all: do not sell, should not sell, never sell",
    "🙅 🙅🏻 🙅🏼 🙅🏽 🙅🏾 🙅🏿",
    "stop, game stop, game stonk, cody"
]

empty_df = spark.createDataFrame([['']]).toDF("text")
eg_texts = spark.createDataFrame(pd.DataFrame({"text": text_list}))

eg_processed_texts = pipeline.fit(eg_texts).transform(eg_texts)

eg_processed_texts.show(truncate=False)

+---------------------------------------------------+---------------------------------+
|text                                               |finished_unigrams                |
+---------------------------------------------------+---------------------------------+
|The S&P will go down and we'll have $100000.       |[s&p, go, $100000]               |
|Will, let's be sure your calculation lets us be 💯.|[sure, calculation, we, 💯]      |
|Wasnt was its own problem, wasn't it?              |[problem]                        |
|420 wasn’t a meme. GME 🚀 🚀 🚀                    |[420, meme, gme, 🚀🚀🚀]         |
|Don't sell 👱‍♀️, yall shouldn't sell              |[dontsell, 👱‍♀️, shouldntsell]  |
|Y'all: do not sell, should not sell, never sell    |[notsell, notsell, neversell]    |
|🙅 🙅🏻 🙅🏼 🙅🏽 🙅🏾 🙅🏿                        |[🙅, 🙅, 🙅, 🙅, 🙅, 🙅]         |
|stop, game stop, game stonk, cody                  |[stop, gamestop, gamestonk, cody]|
+---------------------------------------------------+---------------------------------+

In [493]:
%%time
processed_texts = pipeline.fit(texts).transform(texts)
print(processed_texts)
pddf = processed_texts.toPandas()
def examine(i):
    sep_string = "\n" + "-"*100 + "\n"
    print(*pddf.iloc[i], sep=sep_string)
i = 74 # 96

DataFrame[text: string, finished_unigrams: array<string>]
CPU times: user 130 ms, sys: 24.4 ms, total: 154 ms
Wall time: 3.44 s


In [582]:
examine(i)
i += 1

Robinhood should rebrand to Sheriff of Nottingham! What a joke...
----------------------------------------------------------------------------------------------------
['robinhood', 'rebrand', 'sheriff', 'nottingham', 'joke']


assembler -> tokenizer -> cleaner -> lemmatizer -> normalizer ->
with stopwords_cleaner given from pretrained

### Some inspection results (varying i-values in examine(i) & using earlier version of lda_pipeline.py)
  1. emojis dropped
  2. "y'all" |-> "yall"
  3. becomes one word
  4. "am" actually comes from "2am"; should let numerals survive
  8. "I'm" |-> "im"; "its" and "thats" and "isnt" survive ("lets" is a lost cause); 2008 is dropped
  
 Conclusions:
   - ✓ should keep: numerals, $, &
   - ✓ long urls
   - IOU: repeated characters as in "holdddddd" and "woooooo" and 🚀, 🚀, 🚀
   - ✓ contractions not handled properly
       added contractions with RIGHT SINGLE QUOTATION MARK to stopwords list
   - ✓ keep emojis
   - ✓ handle words like "isnt" and "thats" that should have an apostrophe.
   - ✓ "don't sell" should be an exception?
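The open item above about repeated characters ("holdddddd", "woooooo") could be approached with a regex backreference. This is only a sketch and not part of lda_pipeline.py; the helper name `squeeze_letters` and the threshold of three repeats are assumptions.

```python
import re

# Three or more repeats of the same letter collapse to a single letter.
# Digits and underscores are excluded so amounts like $100000 survive.
_REPEATED_LETTER = re.compile(r"([^\W\d_])\1{2,}")


def squeeze_letters(token: str) -> str:
    """Collapse runs of 3+ identical letters to one letter."""
    return _REPEATED_LETTER.sub(r"\1", token)


for tok in ["holdddddd", "woooooo", "$100000", "moon"]:
    print(tok, "->", squeeze_letters(tok))
```

Collapsing to one letter turns "woooooo" into "wo"; replacing with `r"\1\1"` instead would keep a double letter, at the cost of turning "holdddddd" into "holdd".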

## Contractions

In [7]:
assembler = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

tokenizer = (
    Tokenizer()
    .setInputCols(['document'])
    .setOutputCol('tokenized')
)

# tokenizer.addSplitChars(
#     unicodedata.lookup('RIGHT SINGLE QUOTATION MARK')
# )

# char_names = ['LEFT SINGLE QUOTATION MARK',
#               'RIGHT SINGLE QUOTATION MARK',
#               'LEFT DOUBLE QUOTATION MARK',
#               'RIGHT DOUBLE QUOTATION MARK']
# for name in char_names:
#     tokenizer.addContextChars(unicodedata.lookup(name))


stopwords_cleaner = (
    StopWordsCleaner.pretrained("stopwords_en", "en")
    .setInputCols(['tokenized'])
    .setOutputCol('cleaned')
    .setCaseSensitive(False)
)

# char = unicodedata.lookup('APOSTROPHE')
# replacement = unicodedata.lookup('RIGHT SINGLE QUOTATION MARK')
# stopwords = stopwords_cleaner.getStopWords()
# for s in stopwords_cleaner.getStopWords():
#     if char in s:
#         stopwords.append(s.replace(char, replacement))
# stopwords.sort()
# stopwords_cleaner.setStopWords(stopwords)

finisher = (
    Finisher()
    .setInputCols(['tokenized', 
                   'cleaned',
    ])
)

pipeline = Pipeline().setStages([assembler,
                                 tokenizer,
                                 stopwords_cleaner,
                                 finisher])


stopwords_en download started this may take some time.
Approximate size to download 2.9 KB
[OK!]


In [8]:
print(*stopwords_cleaner.getStopWords())

a a's able about above according accordingly across actually after afterwards again against ain't all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywhere apart appear appreciate appropriate are aren't around as aside ask asking associated at available away awfully b be became because become becomes becoming been before beforehand behind being believe below beside besides best better between beyond both brief but by c c'mon c's came can can't cannot cant cause causes certain certainly changes clearly co com come comes concerning consequently consider considering contain containing contains corresponding could couldn't course currently d definitely described despite did didn't different do does doesn't doing don't done down downwards during e each edu eg eight either else elsewhere enough entirely especially et etc even ever every everybody everyone everything everywhere ex exactly example

In [9]:
print(tokenizer.getContextChars())

['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'"]


In [10]:
text_list = [
    "It's was its own problem, wasn't it?",
    "420 wasn’t a meme. GME 🚀 🚀 🚀",
    "halt trading “to give investors a chance to recalibrate their positions”."
]

empty_df = spark.createDataFrame([['']]).toDF("text")
eg_df = spark.createDataFrame(pd.DataFrame({"text": text_list}))

pipeline_model = pipeline.fit(empty_df)
result = pipeline_model.transform(eg_df)
result.show(truncate=50) 

+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|                                              text|                                finished_tokenized|                                  finished_cleaned|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|              It's was its own problem, wasn't it?|  [It's, was, its, own, problem, ,, wasn't, it, ?]|                                   [problem, ,, ?]|
|                   420 wasn’t a meme. GME 🚀 🚀 🚀|        [420, wasn’t, a, meme, ., GME, 🚀, 🚀, 🚀]|           [420, wasn’t, meme, ., GME, 🚀, 🚀, 🚀]|
|halt trading “to give investors a chance to rec...|[halt, trading, “to, give, investors, a, chance...|[halt, trading, “to, give, investors, chance, r...|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+

In [11]:
# a-ha!
s1 = text_list[1]
s2 = "420 wasn't a meme. GME 🚀 🚀 🚀"
assert len(s1) == len(s2)
for c1, c2 in zip(s1, s2):
    ord1, ord2 = ord(c1), ord(c2)
    if ord1 != ord2:
        print((c1, hex(ord1)), (c2, hex(ord2)))

('’', '0x2019') ("'", '0x27')


There is a difference between "apostrophe" (U+0027) and "right single quotation mark" (U+2019). examine(0) suggests that I have a problem with the "* DOUBLE QUOTATION MARK" characters as well, leading to preservation of "to" when I process "“to". On page 281 they replace these characters one by one using Python's str.replace.

In [12]:
print(unicodedata.name("\u2018"))
char_names = ['LEFT SINGLE QUOTATION MARK',
              'RIGHT SINGLE QUOTATION MARK',
              'LEFT DOUBLE QUOTATION MARK',
              'RIGHT DOUBLE QUOTATION MARK']
[unicodedata.lookup(name) for name in char_names]

LEFT SINGLE QUOTATION MARK


['‘', '’', '“', '”']
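One way to sidestep both problems is to map the four curly quotation marks to their ASCII counterparts before the text reaches the tokenizer. The sketch below uses `str.translate` instead of chained `str.replace` calls; `straighten_quotes` is a hypothetical helper, not something defined in lda_pipeline.py.

```python
import unicodedata

# Map each curly quotation mark to its plain ASCII counterpart.
_QUOTE_MAP = str.maketrans({
    unicodedata.lookup("LEFT SINGLE QUOTATION MARK"): "'",
    unicodedata.lookup("RIGHT SINGLE QUOTATION MARK"): "'",
    unicodedata.lookup("LEFT DOUBLE QUOTATION MARK"): '"',
    unicodedata.lookup("RIGHT DOUBLE QUOTATION MARK"): '"',
})


def straighten_quotes(text: str) -> str:
    """Replace curly single and double quotes with straight ones."""
    return text.translate(_QUOTE_MAP)


print(straighten_quotes("420 wasn\u2019t a meme."))
print(straighten_quotes("halt trading \u201cto give investors a chance\u201d"))
```

After this step "wasn’t" is spelled with a plain apostrophe and so matches the "wasn't" entry in the stop word list, and the default context characters of the tokenizer already include the straight double quote.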

## Emojis

In [13]:
emojis = pd.DataFrame([
    ['🙅','🙆','🙇','🙋','🙌','🙍','🙎','🙏'],
    ['🙅🏻','🙆🏻','🙇🏻','🙋🏻','🙌🏻','🙍🏻','🙎🏻','🙏🏻'],
    ['🙅🏼','🙆🏼','🙇🏼','🙋🏼','🙌🏼','🙍🏼','🙎🏼','🙏🏼'],
    ['🙅🏽','🙆🏽','🙇🏽','🙋🏽','🙌🏽','🙍🏽','🙎🏽','🙏🏽'],
    ['🙅🏾','🙆🏾','🙇🏾','🙋🏾','🙌🏾','🙍🏾','🙎🏾','🙏🏾'],
    ['🙅🏿','🙆🏿','🙇🏿','🙋🏿','🙌🏿','🙍🏿','🙎🏿','🙏🏿'],
    ['🙅','🙆','🙇','🙋','🙌','🙍','🙎','🙏'],
    ['🙅🏻','🙆🏻','🙇🏻','🙋🏻','🙌🏻','🙍🏻','🙎🏻','🙏🏻'],
    ['🙅🏼','🙆🏼','🙇🏼','🙋🏼','🙌🏼','🙍🏼','🙎🏼','🙏🏼'],
    ['🙅🏽','🙆🏽','🙇🏽','🙋🏽','🙌🏽','🙍🏽','🙎🏽','🙏🏽'],
    ['🙅🏾','🙆🏾','🙇🏾','🙋🏾','🙌🏾','🙍🏾','🙎🏾','🙏🏾'],
    ['🙅🏿','🙆🏿','🙇🏿','🙋🏿','🙌🏿','🙍🏿','🙎🏿','🙏🏿']
])

Just to check that we understand the decoding of emojis, we compare with the "Emoji Modifiers" section of the Wikipedia article https://en.wikipedia.org/wiki/Emoticons_(Unicode_block).

> Five symbol modifier characters were added with Unicode 8.0 to provide a range of skin tones for human emoji. These modifiers are called EMOJI MODIFIER FITZPATRICK TYPE-1-2, -3, -4, -5, and -6 (U+1F3FB–U+1F3FF): 🏻 🏼 🏽 🏾 🏿. They are based on the Fitzpatrick scale for classifying human skin color. 

In [14]:
import unicodedata
eg = list(emojis[1])
for s in eg:
    print(s, [hex(ord(c)) for c in s])

🙆 ['0x1f646']
🙆🏻 ['0x1f646', '0x1f3fb']
🙆🏼 ['0x1f646', '0x1f3fc']
🙆🏽 ['0x1f646', '0x1f3fd']
🙆🏾 ['0x1f646', '0x1f3fe']
🙆🏿 ['0x1f646', '0x1f3ff']
🙆 ['0x1f646']
🙆🏻 ['0x1f646', '0x1f3fb']
🙆🏼 ['0x1f646', '0x1f3fc']
🙆🏽 ['0x1f646', '0x1f3fd']
🙆🏾 ['0x1f646', '0x1f3fe']
🙆🏿 ['0x1f646', '0x1f3ff']


Note: '\u...' is for 16-bit hex values, while '\U...' is for 32-bit.


In [15]:
print('\U0001f645'+'\U0001f3ff')
print(chr(0x1f645),"+",chr(0x1f3ff),
      "=", chr(0x1f645) + chr(0x1f3ff))

print(*[chr(n) for n in range(0x1f3fb, 0x1f3ff+3)])

🙅🏿
🙅 + 🏿 = 🙅🏿
🏻 🏼 🏽 🏾 🏿 🐀 🐁


So we can just strip `chr(n) for n in range(0x1f3fb, 0x1f3ff+1)` if we don't think there's any useful content in the skin tones used. Of course, in some applications one would definitely want to keep this data, but I don't see a reason to in this context. Why? AFAIK there is no special meaning to the different tones; the set is small; I have no demographic questions; and even if I wanted to use the skin tone info, are there actually good studies about, e.g., how to adjust observed rates to estimate user demographics?
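A minimal sketch of that stripping step with Python's `re` module, assuming it runs on the raw text before tokenization (the helper name `strip_skin_tones` is made up for illustration):

```python
import re

# EMOJI MODIFIER FITZPATRICK TYPE-1-2 through TYPE-6 (U+1F3FB to U+1F3FF)
_SKIN_TONES = re.compile("[\U0001F3FB-\U0001F3FF]")


def strip_skin_tones(text: str) -> str:
    """Remove skin tone modifiers, leaving the base emoji in place."""
    return _SKIN_TONES.sub("", text)


sample = "\U0001F645 \U0001F645\U0001F3FB \U0001F645\U0001F3FF"
print(strip_skin_tones(sample))
```

All three variants in the sample collapse to the same base emoji, so they would be counted as one vocabulary term.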

In [16]:
assembler = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

tokenizer = (
    Tokenizer()
    .setInputCols(['document'])
    .setOutputCol('tokenized')
)

# tokenizer.addSplitChars(
#     unicodedata.lookup('RIGHT SINGLE QUOTATION MARK')
# )

# char_names = ['LEFT SINGLE QUOTATION MARK',
#               'RIGHT SINGLE QUOTATION MARK',
#               'LEFT DOUBLE QUOTATION MARK',
#               'RIGHT DOUBLE QUOTATION MARK']
# for name in char_names:
#     tokenizer.addContextChars(unicodedata.lookup(name))


stopwords_cleaner = (
    StopWordsCleaner.pretrained("stopwords_en", "en")
    .setInputCols(['tokenized'])
    .setOutputCol('cleaned')
    .setCaseSensitive(False)
)

# char = unicodedata.lookup('APOSTROPHE')
# replacement = unicodedata.lookup('RIGHT SINGLE QUOTATION MARK')
# stopwords = stopwords_cleaner.getStopWords()
# for s in stopwords_cleaner.getStopWords():
#     if char in s:
#         stopwords.append(s.replace(char, replacement))
# stopwords.sort()
# stopwords_cleaner.setStopWords(stopwords)

lemmatizer = (
    LemmatizerModel.pretrained()
    .setInputCols(['cleaned'])
    .setOutputCol('lemmatized')
)



finisher = (
    Finisher()
    .setInputCols([# 'tokenized', 
                   # 'cleaned',
                   # 'lemmatized',
                   'normalized'
    ])
)


stopwords_en download started this may take some time.
Approximate size to download 2.9 KB
[OK!]
lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
[OK!]


From: http://unicode.org/faq/emoji_dingbats.html

*Q: Can you point me to some examples of emoji characters in Unicode?*

*A: The emoji are spread throughout many blocks of Unicode. See Unicode Emoji Charts for a listing of the emoji characters.*

I broke out some of my exploration of emojis into a separate notebook because it'll probably be useful to have that down the line.

In [17]:
# testing unicode ranges in regular expressions

keep_regex = "".join(
    ['[^0-9A-Za-z$&%',
     # stop sign and characters for becky 
     # (and gendering in general) are special cases
     '\u200d\u2640\u2641\u26A5\ufe0f\U0001f6d1',
     # now some emoji ranges that cover the ones most
     # commonly used in WSB posts     
     '\U0001f324-\U0001f393',
     '\U0001f39e-\U0001f3f0',
     '\U0001f400-\U0001f4fd',
     '\U0001f5fa-\U0001f64f',
     '\U0001f680-\U0001f6c5',
     '\U0001f90c-\U0001f93a',
     '\U0001f947-\U0001f978',
     '\U0001f9cd-\U0001f9ff]'])
normalizer = (
    Normalizer()
    .setInputCols(['lemmatized'])
    .setOutputCol('normalized')
    .setLowercase(True)
    .setCleanupPatterns([keep_regex,
                         'http.*'])
)

pipeline = Pipeline().setStages([assembler,
                                 tokenizer,
                                 stopwords_cleaner,
                                 lemmatizer,
                                 normalizer,
                                 finisher])
text_list = [
    # these should be kept
    "🧸 🐂 👱‍♀️ 💎🤲 🧻🤲 🎮🛑 🚀 📈 🍗",
    # some of these should be dropped; cf emojis.ipynb
    "🌀 🌤 🎞 🐀 📿 🕐 🗺 🚀 🤌 🥇 🥺 🧍 🪐"
]

empty_df = spark.createDataFrame([['']]).toDF("text")
eg_df = spark.createDataFrame(pd.DataFrame({"text": text_list}))

pipeline_model = pipeline.fit(empty_df)
result = pipeline_model.transform(eg_df)
result.show(truncate=False)

+--------------------------------------+---------------------------------------------+
|text                                  |finished_normalized                          |
+--------------------------------------+---------------------------------------------+
|🧸 🐂 👱‍♀️ 💎🤲 🧻🤲 🎮🛑 🚀 📈 🍗   |[🧸, 🐂, 👱‍♀️, 💎🤲, 🧻🤲, 🎮🛑, 🚀, 📈, 🍗]|
|🌀 🌤 🎞 🐀 📿 🕐 🗺 🚀 🤌 🥇 🥺 🧍 🪐|[🌤, 🎞, 🐀, 🗺, 🚀, 🤌, 🥇, 🧍]             |
+--------------------------------------+---------------------------------------------+



In the end, I will just keep all of the emojis.
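For reference, the ranges in keep_regex can be sanity-checked outside Spark. The sketch below rebuilds the same negated character class with Python's `re` module and reports which code points of the second test string survive. Spark NLP applies Java regular expressions, so this is only an approximation of what the Normalizer does; the `candidates` list is simply the second test string written out as code points.

```python
import re

# Same negated character class as keep_regex in the cell above: any
# character that matches is removed, so only the listed ones survive.
keep_regex = "".join([
    "[^0-9A-Za-z$&%",
    "\u200d\u2640\u2641\u26A5\ufe0f\U0001f6d1",
    "\U0001f324-\U0001f393",
    "\U0001f39e-\U0001f3f0",
    "\U0001f400-\U0001f4fd",
    "\U0001f5fa-\U0001f64f",
    "\U0001f680-\U0001f6c5",
    "\U0001f90c-\U0001f93a",
    "\U0001f947-\U0001f978",
    "\U0001f9cd-\U0001f9ff]",
])

# Code points of the emojis in the second test string.
candidates = [0x1F300, 0x1F324, 0x1F39E, 0x1F400, 0x1F4FF, 0x1F550,
              0x1F5FA, 0x1F680, 0x1F90C, 0x1F947, 0x1F97A, 0x1F9CD, 0x1FA90]

# A code point is kept when the cleanup pattern does not match it.
kept = [cp for cp in candidates if not re.search(keep_regex, chr(cp))]
print([chr(cp) for cp in kept])
print([hex(cp) for cp in kept])
```

The surviving code points agree with the finished_normalized column shown above for the second row.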

In [18]:
assembler = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

sentence_detector = (
    SentenceDetector()
    .setInputCols(['document'])
    .setOutputCol('sentences')
)

tokenizer = (
    Tokenizer()
    .setInputCols(['sentences'])
    .setOutputCol('tokenized')
)

stopwords_cleaner = (
    StopWordsCleaner.pretrained("stopwords_en", "en")
    .setInputCols(['tokenized'])
    .setOutputCol('cleaned')
    .setCaseSensitive(False)
)

finisher = (
    Finisher()
    .setInputCols(['sentences', 
                   'tokenized'
    ])
)

pipeline = Pipeline().setStages([assembler,
                                 sentence_detector,
                                 tokenizer,
                                 finisher])

text_list = [
    # a quoted question followed by a second sentence
    "Can this tell 'what is truly a sentence'? Or can it not.",
    # an ellipsis, which ideally would not be split into several sentences
    "Uhhh.... ok."
]

empty_df = spark.createDataFrame([['']]).toDF("text")
eg_df = spark.createDataFrame(pd.DataFrame({"text": text_list}))

pipeline_model = pipeline.fit(empty_df)
result = pipeline_model.transform(eg_df)
result.show(truncate=False)

stopwords_en download started this may take some time.
Approximate size to download 2.9 KB
[OK!]
+--------------------------------------------------------+-----------------------------------------------------------+---------------------------------------------------------------------------+
|text                                                    |finished_sentences                                         |finished_tokenized                                                         |
+--------------------------------------------------------+-----------------------------------------------------------+---------------------------------------------------------------------------+
|Can this tell 'what is truly a sentence'? Or can it not.|[Can this tell 'what is truly a sentence'?, Or can it not.]|[Can, this, tell, ', what, is, truly, a, sentence, '?, Or, can, it, not, .]|
|Uhhh.... ok.                                            |[Uhhh., ., ., ., ok.]                                      |[Uhhh

## Tokenizer exceptions

In [8]:
(assembler,
 sentence_detector,
 tokenizer,
 stopwords_cleaner,
 lemmatizer,
 normalizer,
 finisher) = lda_pipeline.get_unigram_pipeline_components()

tokenizer.setExceptions([r"\S+ sell", r"\S+ hold"])
tokenizer.setCaseSensitiveExceptions(True)

pipeline = lda_pipeline.build_unigram_pipeline(
    (assembler,
     sentence_detector,
     tokenizer,
     stopwords_cleaner,
     lemmatizer,
     normalizer,
     finisher
    )
)
T = pipeline.getStages()[2]
print(type(T))
T.getExceptions()

<class 'sparknlp.annotator.Tokenizer'>


['\\S+ sell', '\\S+ hold']

In [9]:
text_list = [
    "Don't sell, I say, don't sell.",
    "Do not sell, do not sell!",
    "Shouldn't sell. Should not sell",
    "Why not sell? Shoudl sell!",
    "I don't see why anybody should ever sell.",
    "Some say one mustn't hold. Rubbish! One should hold."
]

empty_df = spark.createDataFrame([['']]).toDF("text")
eg_df = spark.createDataFrame(pd.DataFrame({"text": text_list}))

pipeline_model = pipeline.fit(empty_df)
result = pipeline_model.transform(eg_df)
result.show(truncate=False)

+----------------------------------------------------+---------------------------------+
|text                                                |finished_unigrams                |
+----------------------------------------------------+---------------------------------+
|Don't sell, I say, don't sell.                      |[dontsell, dontsell]             |
|Do not sell, do not sell!                           |[notsell, notsell]               |
|Shouldn't sell. Should not sell                     |[shouldntsell, notsell]          |
|Why not sell? Shoudl sell!                          |[notsell, shoudlsell]            |
|I don't see why anybody should ever sell.           |[eversell]                       |
|Some say one mustn't hold. Rubbish! One should hold.|[mustnthold, rubbish, shouldhold]|
+----------------------------------------------------+---------------------------------+



## Adding n-grams

In [463]:
! cat matcher_rules.csv

🚀, ROCKET


In [587]:
from sparknlp.annotator import NGramGenerator

(assembler,
 sentence_detector,
 tokenizer,
 stopwords_cleaner,
 lemmatizer,
 normalizer,
 finisher) = lda_pipeline.get_unigram_pipeline_components()

pipeline = lda_pipeline.build_unigram_pipeline(
    pipeline_components = (assembler,
                           sentence_detector,
                           tokenizer,
                           stopwords_cleaner,
                           lemmatizer,
                           normalizer,
                           finisher)
)

(assembler,
 sentence_detector,
 tokenizer,
 stopwords_cleaner,
 lemmatizer,
 normalizer,
 finisher) = pipeline.getStages()

# ! echo "[🚀|🚀\s+]+~rockets" > matcher_rules.csv
# matcher = RegexMatcher()
# matcher.setExternalRules("matcher_rules.csv", delimiter="~")
# matcher.setInputCols(["document"]).setOutputCol("matcher_output")
# tokenizer.setInputCols(["matcher_output"])

# ! echo "🚀+,🚀" > slangDict.csv
# normalizer.setSlangDictionary("slangDict.csv", delimiter=",")

ngrammer = (
    NGramGenerator()
    .setN(3)
    .setEnableCumulative(False)
    .setDelimiter("_")
)

(ngrammer.setInputCols(["unigrams"])
 .setOutputCol("ngrams"))

finisher.setInputCols([# "tokenized",
                       # "unigrams", 
#                        "ngrams"
                        "matcher_output"
                      ])


# from sparknlp.annotator import RegexTokenizer
# tokenizer = RegexTokenizer()
# tokenizer_pattern = "".join([
#     "\s+"
# #     "|(🚀\s+){3,}|(🚀){3,}|(💎🙌){2,}"
# ])
# # tokenizer_pattern= "\s+"
# tokenizer.setPattern(tokenizer_pattern)
# tokenizer.setInputCols(["sentences"])
# tokenizer.setOutputCol("tokenized")
# print(tokenizer.getPattern())

# from sparknlp.annotator import DocumentNormalizer
# doc_normalizer = DocumentNormalizer()

pipeline = (
    Pipeline().setStages([
        assembler,
        matcher,
#         sentence_detector,
#         tokenizer,
#         stopwords_cleaner,
#         lemmatizer,
#         normalizer,
#         ngrammer,
         finisher
    ])
)

cleanup_patterns = normalizer.getCleanupPatterns()
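# NOTE: in a non-raw string "\1" is the control character \x01 (see the
# output below), not a regex backreference; that would need r"...\1".
cleanup_patterns.append("(🚀){3,}\1")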
normalizer.setCleanupPatterns(cleanup_patterns)
normalizer.getCleanupPatterns()

['[^0-9A-Za-z$&%=\u200d♀♁⚥️🌀-🌡🌤-🎓🎞-🏰🐀-📽📿-🔽🕐-🕧🗺-🙏🚀-🛅\U0001f90c-🤺🥇-\U0001f978🥺-\U0001f9cb\U0001f9cd-🧿\U0001fa90-\U0001faa8]',
 'http.*',
 '(🚀){3,}\x01']

In [588]:
text_list = [
    "Does doo-dad split? Does it split   right?",
    "🚀",
    "Hey 🚀🚀🚀🚀🚀",
    "Hey 🚀 🚀 🚀 🚀 🚀",
    "💎🙌 💎🙌 💎🙌💎🙌"
]
empty_df = spark.createDataFrame([['']]).toDF("text")
eg_df = spark.createDataFrame(pd.DataFrame({"text": text_list}))
pipeline_model = pipeline.fit(empty_df)
result = pipeline_model.transform(eg_df)
result.select(["text", "finished_matcher_output"]).show(truncate=False)

+------------------------------------------+-----------------------+
|text                                      |finished_matcher_output|
+------------------------------------------+-----------------------+
|Does doo-dad split? Does it split   right?|[ ,  ,  ,  ,  ,    ]   |
|🚀                                        |[🚀]                   |
|Hey 🚀🚀🚀🚀🚀                            |[ 🚀🚀🚀🚀🚀]          |
|Hey 🚀 🚀 🚀 🚀 🚀                        |[ 🚀 🚀 🚀 🚀 🚀]      |
|💎🙌 💎🙌 💎🙌💎🙌                        |[ ,  ]                 |
+------------------------------------------+-----------------------+



In [490]:
eg_df.show(truncate=False)

+------------------------------------------+
|text                                      |
+------------------------------------------+
|Does doo-dad split? Does it split   right?|
|🚀                                        |
|Hey 🚀🚀🚀🚀🚀                            |
|Hey 🚀 🚀 🚀 🚀 🚀                        |
|💎🙌 💎🙌 💎🙌💎🙌                        |
+------------------------------------------+



In [443]:
eg_df.select(F.regexp_replace('text', r'((🚀\s*){2,})', '🚀').alias('reeeplaced')).show()

+--------------------+
|          reeeplaced|
+--------------------+
|Does doo-dad spli...|
|              Hey 🚀|
|              Hey 🚀|
|  💎🙌 💎🙌 💎🙌💎🙌|
+--------------------+
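The same collapsing can be tried out in plain Python before wiring it into Spark. This is a sketch with a slightly different pattern from the cell above: as far as I can tell, `(🚀\s*){2,}` also consumes whitespace after the last rocket, which would glue the rocket to a following word, so the variant here only allows whitespace between rockets. The name `collapse_rockets` is made up for illustration.

```python
import re

ROCKET = "\U0001F680"

# A rocket followed by one or more further rockets, each optionally
# preceded by whitespace. Whitespace after the last rocket is untouched.
_ROCKET_RUN = re.compile(f"{ROCKET}(?:\\s*{ROCKET})+")


def collapse_rockets(text: str) -> str:
    """Replace a run of two or more rockets with a single rocket."""
    return _ROCKET_RUN.sub(ROCKET, text)


for s in ["Hey " + ROCKET * 5,
          "Hey " + " ".join([ROCKET] * 5),
          f"GME {ROCKET} {ROCKET} to the moon",
          ROCKET]:
    print(repr(collapse_rockets(s)))
```

A single rocket is left unchanged, and a run in the middle of a sentence keeps the space before the next word.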



In [441]:
result.show(truncate=False)

+------------------------------------------+----------------------------------------------------+-----------------------------+
|text                                      |finished_tokenized                                  |finished_unigrams            |
+------------------------------------------+----------------------------------------------------+-----------------------------+
|Does doo-dad split? Does it split   right?|[Does, doo-dad, split, ?, Does, it, split, right, ?]|[doodad, split, split, right]|
|Hey 🚀🚀🚀🚀🚀                            |[Hey, 🚀🚀🚀🚀🚀]                                   |[hey, 🚀🚀🚀🚀🚀]            |
|Hey 🚀 🚀 🚀 🚀 🚀                        |[Hey, 🚀 🚀 🚀 🚀 🚀]                               |[hey, 🚀🚀🚀🚀🚀]            |
|💎🙌 💎🙌 💎🙌💎🙌                        |[💎🙌 💎🙌 💎🙌💎🙌]                                |[💎🙌💎🙌💎🙌💎🙌]           |
+------------------------------------------+----------------------------------------------------+-----------------------------+



In [396]:
F.regexp_extract?

Signature: F.regexp_extract(str, pattern, idx)
Docstring:
Extract a specific group matched by a Java regex, from the specified string column.
If the regex did not match, or the specified group did not match, an empty string is returned.

>>> df = spark.createDataFrame([('100-200',)], ['str'])
>>> df.select(regexp_extract('str', r'(\d+)-(\d+)', 1).alias('d')).collect()
[Row(d='100')]
>>> df = spark.createDataFrame([('foo',)], ['str'])
>>> df.select(regexp_extract('str', r'(\d+)', 1).alias('d')).collect()
[Row(d='')]
>>> df = spark.createDataFrame([('aaaac',)], ['str'])
>>> df.select(regexp_extract('str', '(a+)(b)?(c)', 2).alias('d')).collect()
[Row(d='')]

.. versionadded:: 1.5
File:      ~/anaconda3/lib/python3.7/site-packages/pyspark/sql/functions.py
Type:      function


# Topic modelling

In [87]:
normalizer.getCleanupPatterns()

['[^0-9A-Za-z$&%=\u200d♀♁⚥️🌀-🌡🌤-🎓🎞-🏰🐀-📽📿-🔽🕐-🕧🗺-🙏🚀-🛅\U0001f90c-🤺🥇-\U0001f978🥺-\U0001f9cb\U0001f9cd-🧿\U0001fa90-\U0001faa8]',
 'http.*']

In [64]:
df = spark.read.csv(data_path, 
                    header=True,
                    multiLine=True, 
                    quote="\"", 
                    escape="\"")

df = df.sample(withReplacement=False, fraction=0.05, seed=1)
df = (
    df.withColumn("text", 
               F.concat_ws(". ", df.title, df.body))
 .drop("title", "body", "url", "comms_num", "created")
)

texts = df.select("text")

# pipeline = lda_pipeline.build_pipeline()
processed_texts = pipeline.fit(texts).transform(texts)

In [66]:
%%time

from pyspark.ml.feature import CountVectorizer

# tfizer = CountVectorizer(inputCol='finished_normalized',
#                          outputCol='tf_features')

tfizer = CountVectorizer(inputCol='finished_ngrams',
                         outputCol='tf_features')

tf_model = tfizer.fit(processed_texts)
tf_result = tf_model.transform(processed_texts)

CPU times: user 6.09 ms, sys: 4.02 ms, total: 10.1 ms
Wall time: 4.21 s


In [67]:
%%time
from pyspark.ml.feature import IDF
idfizer = IDF(inputCol='tf_features', 
              outputCol='tf_idf_features')
idf_model = idfizer.fit(tf_result)
tfidf_result = idf_model.transform(tf_result)

CPU times: user 16 ms, sys: 0 ns, total: 16 ms
Wall time: 4.06 s


In [68]:
%%time
from pyspark.ml.clustering import LDA
num_topics = 5
max_iter = 10
lda = LDA(k=num_topics, 
          maxIter=max_iter, 
          featuresCol="tf_idf_features")

CPU times: user 238 µs, sys: 3.57 ms, total: 3.81 ms
Wall time: 23.9 ms


In [69]:
%%time
lda_model = lda.fit(tfidf_result)

CPU times: user 17.1 ms, sys: 4.91 ms, total: 22 ms
Wall time: 41.6 s


In [71]:
from pyspark.sql import types as T
vocab = tf_model.vocabulary
def get_words(token_list):
    return [vocab[token_id] for token_id in token_list]
udf_to_words = F.udf(get_words, T.ArrayType(T.StringType()))

In [77]:
num_top_words = 10

topics = lda_model.describeTopics(num_top_words).withColumn('topicWords', udf_to_words(F.col('termIndices')))
lda_result = topics.select('topicWords').toPandas()

In [84]:
for i in range(lda_result.shape[0]):
    print(f"Topic {i}:", 
          *lda_result.iloc[i].topicWords,
         "\n")

Topic 0: aoc_ted_cruz gme_bb_amc buy_gme_bb 🙌_💎_🙌 💎_🙌_💎 paper_trade_contest head_spce_🌕 definitely_nothing_suspicious 🦍_🚀🚀🚀_head spce_🌕_next 

Topic 1: hold_hold_hold buy_doge_buy doge_buy_doge buy_buy_buy fucking_wall_street disclaimer_financial_advice halt_amc_trade volatility_ig_people extreme_volatility_ig gamestop_amc_due 

Topic 2: amc_bb_nok buy_spy_put gme_amc_bb play_spy_put loss_porn_wifes leave_going_hold keep_nakd_strong dip_aint_fucking boyfriend_tell_tohold wifes_boyfriend_tell 

Topic 3: 🚀_🚀_🚀 want_play_dirty occupy_wall_street tendie_rangers_tohold government_infiltrate_disrupt movement_die_like like_occupy_wall past_indicator_government important_movement_die take_step_prevent 

Topic 4: robinhood_cancel_order class_action_lawsuit aint_much_honest much_honest_work history_strap_boy strap_boy_girl fukkkin_history_strap gme_gang_make gang_make_fukkkin make_fukkkin_history 
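
The topic terms above are trigrams joined with "_", which is what the NGramGenerator settings used earlier (N = 3, cumulative disabled, "_" delimiter) produce. As a plain-Python sketch of what that windowing amounts to (this is not the Spark NLP implementation, and the `ngrams` helper is hypothetical):

```python
from typing import List


def ngrams(tokens: List[str], n: int = 3, delimiter: str = "_") -> List[str]:
    """Join every window of n consecutive tokens with the delimiter."""
    return [delimiter.join(tokens[i:i + n])
            for i in range(len(tokens) - n + 1)]


print(ngrams(["robinhood", "cancel", "order", "class", "action"]))
print(ngrams(["hold", "gme"]))  # fewer than n tokens gives no n-grams
```

Because consecutive windows overlap in two tokens, related trigrams such as "strap_boy_girl" and "history_strap_boy" tend to land in the same topic.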



## Compare pipeline time usage with spaCy

In [1]:
%config Completer.use_jedi = False
data_path = "reddit_wsb.csv"

from typing import List, Dict, Union
from spacy.tokens import Doc, Token
from spacy.matcher import Matcher

class FilterTextPreprocessing:
    def __init__(self, nlp):
        Doc.set_extension('bow', default=[], force=True)
        Token.set_extension('keep', default=True, force=True)
        
        self.matcher = Matcher(nlp.vocab)
        
        patterns = [
            {"string_id": "stop_word", "pattern": [[{"IS_STOP": True}]]},
            {"string_id": "punctuation", "pattern": [[{"IS_PUNCT": True}]]},
        ]
        
        
        for patt_obj in patterns:
            string_id = patt_obj.get('string_id')
            pattern = patt_obj.get('pattern')
            self.matcher.add(string_id, pattern, on_match=self.on_match)
   
    def on_match(self, matcher, doc, i, matches):
        _, start, end = matches[i]
        for tkn in doc[start:end]:
            tkn._.keep = False
              
    def __call__(self, doc) :
        self.matcher(doc)
        doc._.bow = [tkn.lemma_ for tkn in doc if tkn._.keep]
        return doc
      
#     @classmethod
#     def from_pattern_file(cls, nlp, path) :
#         patterns = read_json(path)
#         return cls(nlp, patterns)

import spacy
from spacy.lang.en import English

nlp = spacy.load("en_core_web_sm")

@English.factory("preprocessor")
def create_preprocessor(nlp, name):
    return FilterTextPreprocessing(nlp)

# nlp.select_pipes(enable=["tagger", "attribute_ruler", "lemmatizer"])
nlp.add_pipe("preprocessor", last=True)
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7f58d2553d10>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7f58d2569590>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7f58d2830c20>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7f58d2830d70>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7f58d24b7cd0>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7f58d24c6e60>),
 ('preprocessor', <__main__.FilterTextPreprocessing at 0x7f58d27acdd0>)]

In [27]:
%%time
import csv
import pandas as pd

def process(filename):
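    # newline="" is what the csv module recommends for files whose quoted
    # fields may contain embedded newlines.
    with open(filename, "r", newline="", encoding="utf-8") as fobj: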
        datareader = csv.DictReader(fobj)
        for row in datareader:
            text = " ".join([row["title"],
                              row["body"]])
            yield nlp(text)
            
words = []
for i, doc in enumerate(process(data_path), start=1):
    words.append(doc._.bow)
    # progress report every 1000 documents
    if i % 1000 == 0:
        print(f"i = {i}")

words = pd.Series(words)

i = 1000
i = 2000
i = 3000
i = 4000
i = 5000
i = 6000
i = 7000
i = 8000
i = 9000
i = 10000
i = 11000
i = 12000
i = 13000
i = 14000
i = 15000
i = 16000
i = 17000
i = 18000
i = 19000
i = 20000
i = 21000
i = 22000
i = 23000
i = 24000
i = 25000
CPU times: user 6min 7s, sys: 2.59 s, total: 6min 10s
Wall time: 6min 11s


In [28]:
words

0                          [money, send, message, 🚀, 💎, 🙌]
1        [Math, Professor, Scott, Steiner, say, number,...
2        [exit, system, CEO, NASDAQ, push, halt, tradin...
3             [new, SEC, filing, GME, retarded, interpret]
4              [distract, GME, think, AMC, brother, aware]
                               ...                        
25642                                               [sign]
25643                                 [hold, GME, 🚀, 🚀, 🚀]
25644                    [AMC, Yolo, Update, Feb, 3, 2021]
25645                                         [loss, sell]
25646     [post, curiosity, teem, know, store, 👀, 💎, 🖐, 🚀]
Length: 25647, dtype: object

In [42]:
df_post.finished_unigrams

0                                   [money, send, message]
1        [math, professor, scott, steiner, number, spel...
2        [exit, system, ceo, nasdaq, push, halt, trade,...
3               [new, sec, file, gme, retarded, interpret]
4              [distract, gme, think, amc, brother, aware]
                               ...                        
25642                                               [sign]
25643                                          [hold, gme]
25644                             [amc, yolo, update, feb]
25645                                         [loss, sell]
25646           [dont, post, curiosity, teem, know, store]
Name: finished_unigrams, Length: 25647, dtype: object

In [None]:
# %%time
# pipeline = lda_pipeline.build_pipeline()
# processed_texts = pipeline.fit(texts).transform(texts)
# print(processed_texts)

# For a fair comparison with spaCy, we should also build a pandas DataFrame.
# This emits the warning: TaskSetManager:66 - Stage 4 contains a task of very large size
# df_post = processed_texts.toPandas()  
# df_post

In [5]:
371/18.6

19.946236559139784

Speed comparison: the Spark NLP pipeline took 18.6 seconds, while spaCy took 371 seconds (about 20x as long).
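
One caveat on this comparison: the spaCy pipeline listed above still ran the `parser` and `ner` components, and documents were passed to `nlp` one at a time. Below is a minimal sketch of a leaner run, assuming the same CSV columns (`title`, `body`). The helper names `iter_texts` and `bag_of_words` are hypothetical, the filtering uses token attributes rather than the Matcher-based component, and this variant is untimed here, so it says nothing about how much of the gap it would close.

```python
import csv
from typing import Iterator, List


def iter_texts(filename: str) -> Iterator[str]:
    """Yield "title body" strings, one per CSV row."""
    with open(filename, newline="", encoding="utf-8") as fobj:
        for row in csv.DictReader(fobj):
            yield " ".join([row["title"], row["body"]])


def bag_of_words(filename: str, batch_size: int = 256) -> Iterator[List[str]]:
    """Yield the lemmas of non-stop-word, non-punctuation tokens per row."""
    # Imported lazily so iter_texts can be used without spaCy installed.
    import spacy

    # Skip the components that the bag-of-words step does not need.
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    # nlp.pipe processes the texts in batches instead of one call per text.
    for doc in nlp.pipe(iter_texts(filename), batch_size=batch_size):
        yield [tok.lemma_ for tok in doc
               if not (tok.is_stop or tok.is_punct)]
```

Usage would be along the lines of `words = pd.Series(list(bag_of_words(data_path)))`, timed with `%%time` as in the cell above.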

## Playing around with Stanza and spaCy

In [7]:
df_pd = pd.read_csv(data_path,
                 index_col="timestamp", 
                 parse_dates=True, 
                 keep_default_na=False)
# df_pd = df_pd.assign(timestamp=pd.to_datetime(df_pd.timestamp))
df_pd = df_pd[["id", "title", "body"]]
df_pd.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 25647 entries, 2021-01-28 21:37:41 to 2021-02-04 07:54:27
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      25647 non-null  object
 1   title   25647 non-null  object
 2   body    25647 non-null  object
dtypes: object(3)
memory usage: 801.5+ KB


In [12]:
bin_size = df.shape[0]//ddf.shape[0]
dfs = [df.iloc[bin_size*i : bin_size*(i+1)] for i in range(ddf.shape[0])]

In [13]:
df0 = dfs[0]
X = df0.iloc[2]
X.title, X.body

('I got in late on GME but I believe in the cause and am willing to lose it all.',
 "You guys are amazing. Thank you for sending GME to the moon! I know I'm going to lose most of my money here because I'll hold the line until the end. Let's send a clear message to wall street with GME, BB, AMC, and any others. I've never day traded before but I'm in it now. 🚀")

In [14]:
import stanza
# stanza.download("en")
nlp = stanza.Pipeline("en")
text = df.iloc[2].body
doc = nlp(text)
d_sent = {0:"-", 1:"Ⓝ", 2:"+"}
for sent in doc.sentences:
    print(d_sent[sent.sentiment], sent.text)

2021-02-14 15:42:21 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| pos       | combined  |
| lemma     | combined  |
| depparse  | combined  |
| sentiment | sstplus   |
| ner       | ontonotes |

2021-02-14 15:42:21 INFO: Use device: cpu
2021-02-14 15:42:21 INFO: Loading: tokenize
2021-02-14 15:42:21 INFO: Loading: pos
2021-02-14 15:42:21 INFO: Loading: lemma
2021-02-14 15:42:21 INFO: Loading: depparse
2021-02-14 15:42:21 INFO: Loading: sentiment
2021-02-14 15:42:22 INFO: Loading: ner
2021-02-14 15:42:22 INFO: Done loading processors!


+ You guys are amazing.
+ Thank you for sending GME to the moon!
- I know I'm going to lose most of my money here because I'll hold the line until the end.
Ⓝ Let's send a clear message to wall street with GME, BB, AMC, and any others.
Ⓝ I've never day traded before but I'm in it now.
Ⓝ 🚀


In [15]:
%%time
for s in ["I just love it when the regulators step in.",
          "That was amazingly boring.",
          "That was amazingly tolerable.",
          "At least it wasn't boring."]:
    sentiment = nlp(s).sentences[0].sentiment
    print(d_sent[sentiment], s)

+ I just love it when the regulators step in.
- That was amazingly boring.
+ That was amazingly tolerable.
- At least it wasn't boring.
CPU times: user 1.43 s, sys: 21.1 ms, total: 1.45 s
Wall time: 731 ms


In [16]:
doc = nlp("I knew you were trouble when you walked in!")
for word in doc.sentences[0].words:
    print(word.lemma)

I
know
you
be
trouble
when
you
walk
in
!


In [35]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [129]:
import spacy
text = df.iloc[2].body
d_sent = {0:"-", 1:"Ⓝ", 2:"+"}
doc = nlp(text)
for sent in doc.sents:
    print(sent.sentiment, sent.text)

0.0 You guys are amazing.
0.0 Thank you for sending GME to the moon!
0.0 I know I'm going to lose most of my money here because I'll hold the line until the end.
0.0 Let's send a clear message to wall street with GME, BB, AMC, and any others.
0.0 I've never day traded before
0.0 but I'm in it now.
0.0 🚀


In [47]:
doc = nlp("I knew you were trouble when you walked in!")
for token in doc:
    print(token.lemma_)

I
know
you
be
trouble
when
you
walk
in
!


# Conclusion
This notebook has grown to an unwieldy size. I accomplished a lot of what I wanted, but, after lots of playing around, I can't find a good way of normalizing repetitions. This is of particular interest in the case of emojis. For instance, I would like to do replacements such as
 - 🚀       maps to 🚀
 - 🚀🚀...🚀 maps to 🚀🚀
 - 🚀\s    maps to 🚀
but I can't do this all simultaneously.
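
For reference, outside of Spark NLP the three rules can be applied in one pass with Python's `re` module. This is only a sketch in plain Python, not something that plugs into the pipeline. It reads the third rule as "whitespace between repeats of the same emoji is dropped", uses one broad code point range as a stand-in for the emoji ranges, and `collapse_emoji_runs` is a hypothetical helper name.

```python
import re

# One broad range used as a stand-in for "an emoji"; an assumption,
# not the exact ranges used elsewhere in this notebook.
EMOJI_CLASS = "[\U0001f300-\U0001faff]"

# An emoji, followed by one or more repeats of the same emoji,
# with optional whitespace between the repeats.
_RUN = re.compile("(" + EMOJI_CLASS + r")(?:\s*\1)+")


def collapse_emoji_runs(text: str) -> str:
    """Collapse any run of a repeated emoji to exactly two copies."""
    return _RUN.sub(lambda m: m.group(1) * 2, text)
```

A single emoji is left alone, and runs of two or more (with or without spaces) become exactly two. Multi-emoji sequences such as 💎🙌 repeated as a unit are not handled by this sketch.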

Some notes (re Spark NLP 2.7.3):
 - I can do F.regexp_replace at the document level, but this has to be done for one output character at a time.
 - I could probably try to use a UDF, but I think that would be a significant slowdown.
 - I can set tokenizer rules to get rid of long strings of a single emoji. I can probably deal with space-separated emojis at the same time. Can normalizer rules also do the same thing?
 - It seems complicated to set your own rules for a stemmer or a lemmatizer.
 
I tried lots of things, but in the end I have settled on just extracting a string of all of the emojis that appear, as a feature separate from the texts. Texts that use an emoji as a stand-in for a word will be ruined, those that use them as punctuation will become ungrammatical, but those that use them as decoration will be well-preserved. Below is the final form of lda_pipeline.py, representative of how the file looked during the writing of this notebook.

In [None]:
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import (Tokenizer,
                                SentenceDetector,
                                Normalizer,
                                LemmatizerModel,
                                StopWordsCleaner,
                                # NGramGenerator,
                                # PerceptronModel
                                )
from pyspark.ml import Pipeline
import unicodedata

from download_pretrained import PretrainedCacheManager

emoji_ranges = [
        # ranges covering all emojis expressible
        # using only one unicode character.
        '\U0001f300-\U0001f321',
        '\U0001f324-\U0001f393',
        '\U0001f39e-\U0001f3f0',
        '\U0001f400-\U0001f4fd',
        '\U0001f4ff-\U0001f53d',
        '\U0001f550-\U0001f567',
        '\U0001f5fa-\U0001f64f',
        '\U0001f680-\U0001f6c5',
        '\U0001f90c-\U0001f93a',
        '\U0001f947-\U0001f978',
        '\U0001f97a-\U0001f9cb',
        '\U0001f9cd-\U0001f9ff',
        '\U0001fa90-\U0001faa8'
]


def get_unigram_pipeline_components():
    # get cache manager to avoid repeated downloads
    cache_manager = PretrainedCacheManager()
    cache_manager.get_pretrained_components()
    # this is a dict with entries as in
    # ('lemmatizer', path-to-downloaded-unzipped-lemmatizer)
    pretrained_components = cache_manager.pretrained_components

    # get document assembler
    assembler = DocumentAssembler()

    # get sentence detector
    sentence_detector = SentenceDetector()

    # build tokenizer
    tokenizer = Tokenizer()
    # add ['‘', '’', '“', '”'] as context characters
    # doing this in a verbose way because it may be clarifying.
    char_names = ['LEFT SINGLE QUOTATION MARK',
                  'RIGHT SINGLE QUOTATION MARK',
                  'LEFT DOUBLE QUOTATION MARK',
                  'RIGHT DOUBLE QUOTATION MARK']
    for name in char_names:
        tokenizer.addContextChars(unicodedata.lookup(name))
    # now set exceptions.
    # 1) to preserve word preceding "sell" and "hold",
    #    so that, e.g., "don't sell" is not normalized to "sell".
    # 2) don't split "game stop" (...or "game stonk")
    # 3) if an emoji is repeated with spaces, don't split
    #    so we can normalize later
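    # Raw strings / doubled backslashes keep \S and \s intact for the regex
    # engine; (?:stop|stonk) is an alternation, whereas [stop|stonk] would
    # be a character class matching a single character.
    tokenizer.setExceptions([r"\S+ sell",
                             r"\S+ hold",
                             r"game (?:stop|stonk)",
                             "\U0001f680\\s+",  # rocket ship
                             "\U0001f48e\U0001f64c\\s+"  # gem stone + raised hands
                            ])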
    tokenizer.setCaseSensitiveExceptions(True)

    # build stopwords cleaner
    stopwords_cleaner = (
        StopWordsCleaner()
        # on second thought, maybe the larger list of stopwords that I
        # was downloading was too expansive. e.g., it has the word "example"
        # .load(pretrained_components["stopwords"])
        .setCaseSensitive(False)
    )
    # for each stopword involving an apostrophe ('), append
    # a version of the stopword using the character (’) instead,
    # and a version with the apostrophe missing
    char = unicodedata.lookup('APOSTROPHE')
    replacement = unicodedata.lookup('RIGHT SINGLE QUOTATION MARK')
    stopwords = stopwords_cleaner.getStopWords()
    stopwords += ["y'all", "yall"]
    for s in stopwords_cleaner.getStopWords():
        if char in s:
            stopwords.append(s.replace(char, replacement))
            stopwords.append(s.replace(char, ""))
    stopwords.sort()
    stopwords_cleaner.setStopWords(stopwords)

    # build lemmatizer
    lemmatizer = (
        LemmatizerModel().load(pretrained_components["lemmatizer"])
    )

    # build normalizer
    normalizer = (
        Normalizer()
        .setLowercase(True)
    )
    # this does not keep all emojis, but it keeps a lot of them.
    # for instance, it does not distinguish skin color, but it has
    # enough characters to express the Becky emoji.
    keeper_regex = ''.join([
        '[^0-9A-Za-z$&%=',
        # special characters for Becky
        '\u200d\u2640\u2641\u26A5\ufe0f',
        ''.join(emoji_ranges),
        ']'
    ])
    normalizer.setCleanupPatterns([keeper_regex,
                                   'http.*'])

    # build finisher
    finisher = Finisher()

    return (assembler, sentence_detector, tokenizer,
            stopwords_cleaner, lemmatizer, normalizer, finisher)


def get_Ngram_pipeline_components(N=2):
    # stub in this version of the file; returns None
    return


def build_unigram_pipeline(pipeline_components=None):
    # get_pipeline_components
    if not pipeline_components:
        _ = get_unigram_pipeline_components()
    else:
        _ = pipeline_components
    (assembler,
     sentence_detector,
     tokenizer,
     stopwords_cleaner,
     lemmatizer,
     normalizer,
     finisher) = _

    # assemble the pipeline
    (assembler
     .setInputCol('text')
     .setOutputCol('document'))

    (sentence_detector
     .setInputCols(['document'])
     .setOutputCol('sentences'))

    (tokenizer
     .setInputCols(['sentences'])
     .setOutputCol('tokenized'))

    (stopwords_cleaner
     .setInputCols(['tokenized'])
     .setOutputCol('cleaned'))

    (lemmatizer
     .setInputCols(['cleaned'])
     .setOutputCol('lemmatized'))

    (normalizer
     .setInputCols(['lemmatized'])
     .setOutputCol('unigrams'))

    (finisher
     .setInputCols(['unigrams']))

    pipeline = (Pipeline()
                .setStages([assembler,
                            sentence_detector,
                            tokenizer,
                            stopwords_cleaner,
                            lemmatizer,
                            normalizer,
                            finisher]))

    # to do: try LightPipeline as in
    # https://nlp.johnsnowlabs.com/docs/en/concepts#lightpipeline

    return pipeline
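
The listing above does not appear to include the emoji-extraction step described at the start of the Conclusion. A minimal sketch of that idea in plain Python (not Spark NLP) follows. It uses only a subset of the `emoji_ranges` list for brevity, and `extract_emojis` is a hypothetical helper name rather than a function from lda_pipeline.py.

```python
import re

# Subset of the emoji_ranges list above, for illustration only.
EMOJI_RANGES = [
    "\U0001f300-\U0001f321",
    "\U0001f400-\U0001f4fd",
    "\U0001f5fa-\U0001f64f",
    "\U0001f680-\U0001f6c5",
]
EMOJI_RE = re.compile("[" + "".join(EMOJI_RANGES) + "]")


def extract_emojis(text):
    """Return (emoji_string, text_without_emojis) for one document."""
    # All emojis, in order of appearance, joined into one feature string.
    emojis = "".join(EMOJI_RE.findall(text))
    # Replace each emoji by a space, then squeeze the whitespace.
    stripped = " ".join(EMOJI_RE.sub(" ", text).split())
    return emojis, stripped
```

For example, `extract_emojis("hold GME 🚀🚀🚀")` separates the rockets from the words. As noted above, a text that uses an emoji in place of a word loses that word in the stripped text.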
