# Introduction to Spark and ML pipelines

This is OPTIONAL reading for your information. Nothing in this notebook is required for the rest of the project to run.

This notebook covers many of the basic functions of Spark, including:
- Spark user-defined functions: udf()
- Spark Transformers
- Customized Transformers
- Spark Pipelines

In [1]:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType, FloatType, IntegerType
from pyspark.ml import Pipeline
from pyspark.ml.pipeline import Transformer
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.feature import IDF

from spacy.en import English  # spaCy 1.x import path; spaCy 2+ uses spacy.lang.en

from src.custom_transformers import SpacyTokenizer
from src.nlp_pipeline import get_pipeline

# from pyspark.ml.feature import Word2Vec
# from pyspark.ml.feature import NGram

%load_ext autoreload
%autoreload 2

In [2]:
# when starting jupyter with the sparkjupyter script, pyspark is already imported

print("sql session setup by script:\t", spark)
print("spark context setup by script:\t", sc)
print("pyspark imported by script:\t", str(pyspark)[:56], "...")

sql session setup by script:	 <pyspark.sql.session.SparkSession object at 0x109fb9fd0>
spark context setup by script:	 <pyspark.context.SparkContext object at 0x101adc278>
pyspark imported by script:	 <module 'pyspark' from '/usr/local/Cellar/apache-spark/2 ...


In [3]:
data_file = 'data/excerpts.json'
raw_df = spark.read.json(data_file)

raw_df.printSchema()
print(type(raw_df))
print("row count: ", raw_df.count())
raw_df.show(3)


# keep a second reference to raw_df in case I mess things up :P
# (dataframes are immutable, so this is an alias rather than a true copy)
df = raw_df



root
 |-- author: string (nullable = true)
 |-- excerpt: string (nullable = true)
 |-- excerpt_number: long (nullable = true)
 |-- title: string (nullable = true)

<class 'pyspark.sql.dataframe.DataFrame'>
row count:  9050
+--------------+--------------------+--------------+---------------+
|        author|             excerpt|excerpt_number|          title|
+--------------+--------------------+--------------+---------------+
|CharlesDickens|A CHRISTMAS CAROL...|             0|AChristmasCarol|
|CharlesDickens|Mind! I don't mea...|             1|AChristmasCarol|
|CharlesDickens|Scrooge never pai...|             2|AChristmasCarol|
+--------------+--------------------+--------------+---------------+
only showing top 3 rows



In [4]:
# a tiny sample dataframe for testing


# Random Sample:

# tiny_df = df.sample(False, 1/1000).limit(5)
# print(type(tiny_df))
# print(tiny_df.count())
# tiny_df.show()


# One excerpt from each book

df.createOrReplaceTempView("df")
tiny_df = spark.sql("""
        SELECT author, title, excerpt, excerpt_number
        FROM df
        WHERE excerpt_number = 25
        ORDER BY author, title
        """).persist()


print(type(tiny_df))
print(tiny_df.count())
tiny_df.show()

<class 'pyspark.sql.dataframe.DataFrame'>
20
+--------------+--------------------+--------------------+--------------+
|        author|               title|             excerpt|excerpt_number|
+--------------+--------------------+--------------------+--------------+
|CharlesDickens|     AChristmasCarol|It was not an agr...|            25|
|CharlesDickens|    ATaleOfTwoCities|“So soon?” || Mis...|            25|
|CharlesDickens|    DavidCopperfield|‘Peggotty,’ says ...|            25|
|CharlesDickens|   GreatExpectations|“What’s in the bo...|            25|
|CharlesDickens|         OliverTwist|'Walk in,' said t...|            25|
|    JaneAusten|                Emma|She was so busy i...|            25|
|    JaneAusten|       MansfieldPark|Fanny was too muc...|            25|
|    JaneAusten|          Persuasion|But Mrs Clay was ...|            25|
|    JaneAusten|   PrideAndPrejudice|“Not as you repre...|            25|
|    JaneAusten| SenseAndSensibility|"It is but a cott...|         

## spaCy: a brief aside

spaCy is a production-oriented natural language processing package with (among other things) very nice tokenization options. I use spaCy here because it tokenizes punctuation and contractions better than Spark's tokenizer.

Here we will wrap the tokenization in a Spark UDF. Later we will include it in our customized transformer.

In [5]:
%%time
# timing to ensure spaCy is set up properly (should take ~100ms)

parser = English()


CPU times: user 77.6 ms, sys: 20.9 ms, total: 98.4 ms
Wall time: 116 ms


In [6]:
# Grab a couple of excerpts for testing (fetch the rows once and index into them)

rows = df.take(100)
excerpt = rows[80]['excerpt']
excerpt2 = rows[99]['excerpt']

In [7]:
%%time
parsedData = parser(excerpt)

# sentences = [sent.string.strip() for sent in parsedData.sents]
# for s in sentences:
#     print(s, '\n')

tokens = [tok.lower_ for tok in parsedData]
# print(type(tokens[1]))
print(tokens[:8])

['but', 'they', 'did', "n't", 'devote', 'the', 'whole', 'evening']
CPU times: user 10.7 ms, sys: 2.21 ms, total: 12.9 ms
Wall time: 13.3 ms


## UDF demonstration
A quick way to create a user-defined function (UDF) in Spark:

Take (or write) a Python function and pass it to "udf()", either directly or wrapped in a lambda.

Don't forget to declare the return type as a Spark DataType! If you leave it out, "udf()" assumes StringType.

```
Other excerpt metadata to include via UDF:
num_chars, num_words, num_sent, num_para
(use these to calculate word_len, word_per_sent, word_per_para, sent_per_para, etc.
per excerpt, book and author)
```
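
As a minimal sketch of the metadata idea above, the plain-Python helpers below compute `num_chars` and `num_words`, and `with_excerpt_stats` wraps each one in a UDF with an explicit `IntegerType()` return type. `with_excerpt_stats` is a hypothetical helper that is not part of the project, and the whitespace-based word count is an assumption made for brevity; a spaCy-based count would treat punctuation differently.

```python
def num_chars(text):
    """Number of characters in an excerpt (0 for a missing value)."""
    return len(text) if text is not None else 0


def num_words(text):
    """Number of whitespace-separated words (0 for a missing value)."""
    return len(text.split()) if text is not None else 0


def with_excerpt_stats(df, text_col="excerpt"):
    """Append num_chars and num_words columns to a Spark DataFrame.

    Hypothetical helper for illustration. pyspark is imported here so that
    the two functions above can be tried without a Spark installation.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    # Declare the return type explicitly; udf() defaults to StringType.
    num_chars_udf = udf(num_chars, IntegerType())
    num_words_udf = udf(num_words, IntegerType())

    return (df
            .withColumn("num_chars", num_chars_udf(df[text_col]))
            .withColumn("num_words", num_words_udf(df[text_col])))
```

In this notebook it could be tried with something like `with_excerpt_stats(tiny_df).show(3)`.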

In [8]:
%%time

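def tokenize(text):
    # The parser is rebuilt on every call (once per row), which is slow
    # but avoids serializing a spaCy object for the workers.
    parser = English()
    return [tok.lower_ for tok in parser(text)]

# tokenize can be passed to udf() directly; no lambda wrapper is needed
tokenize_udf = udf(tokenize, ArrayType(StringType()))

# reference the column on tiny_df, the dataframe actually being transformed
df_tokens = tiny_df.withColumn("tokens", tokenize_udf(tiny_df.excerpt))
df_tokens.show(3)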

+--------------+----------------+--------------------+--------------+--------------------+
|        author|           title|             excerpt|excerpt_number|              tokens|
+--------------+----------------+--------------------+--------------+--------------------+
|CharlesDickens| AChristmasCarol|It was not an agr...|            25|[it, was, not, an...|
|CharlesDickens|ATaleOfTwoCities|“So soon?” || Mis...|            25|[“, so, soon, ?, ...|
|CharlesDickens|DavidCopperfield|‘Peggotty,’ says ...|            25|[‘, peggotty,’, s...|
+--------------+----------------+--------------------+--------------+--------------------+
only showing top 3 rows

CPU times: user 21.3 ms, sys: 6.22 ms, total: 27.5 ms
Wall time: 7.34 s


# Transformers in Spark
A transformer is an object whose ".transform()" method takes a dataframe, performs some action on one or more of its columns and returns a new dataframe with the result attached as a new column.

## Native Transformers

Many of the transformers in Spark's ML library are great. Unfortunately, Spark's Tokenizer only lowercases the text and splits it on whitespace, so punctuation stays attached to the adjacent word.
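
The attached punctuation is easy to reproduce in plain Python. The sketch below contrasts two behaviours: lowercasing and splitting on whitespace (roughly what Spark's `Tokenizer` does), and a simple regular expression that emits each punctuation mark as its own token. The regex is only an illustrative stand-in, not spaCy's actual algorithm (it does not handle contractions the way spaCy does), and both function names are made up for this example.

```python
import re


def whitespace_tokenize(text):
    """Lowercase and split on whitespace, roughly like Spark's Tokenizer."""
    return text.lower().split()


def punct_tokenize(text):
    """Lowercase, then emit runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


sample = "“So soon?”"
print(whitespace_tokenize(sample))  # ['“so', 'soon?”']
print(punct_tokenize(sample))       # ['“', 'so', 'soon', '?', '”']
```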

In [9]:
tokenizer = Tokenizer(inputCol="excerpt", outputCol="tokenized")
df_spark_tokens = tokenizer.transform(tiny_df)
df_spark_tokens.show()

+--------------+--------------------+--------------------+--------------+--------------------+
|        author|               title|             excerpt|excerpt_number|           tokenized|
+--------------+--------------------+--------------------+--------------+--------------------+
|CharlesDickens|     AChristmasCarol|It was not an agr...|            25|[it, was, not, an...|
|CharlesDickens|    ATaleOfTwoCities|“So soon?” || Mis...|            25|[“so, soon?”, ||,...|
|CharlesDickens|    DavidCopperfield|‘Peggotty,’ says ...|            25|[‘peggotty,’, say...|
|CharlesDickens|   GreatExpectations|“What’s in the bo...|            25|[“what’s, in, the...|
|CharlesDickens|         OliverTwist|'Walk in,' said t...|            25|['walk, in,', sai...|
|    JaneAusten|                Emma|She was so busy i...|            25|[she, was, so, bu...|
|    JaneAusten|       MansfieldPark|Fanny was too muc...|            25|[fanny, was, too,...|
|    JaneAusten|          Persuasion|But Mrs Clay 

## Customized Transformers
Luckily, we can make our own transformers as well.

Here we build the spaCy tokenizer (which treats punctuation as separate tokens) into a customized Spark transformer.
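
SpacyTokenizer is imported from `src.custom_transformers`, whose source is not shown in this notebook. As a rough sketch of the general pattern only, a custom transformer can subclass `Transformer`, mix in `HasInputCol` and `HasOutputCol` for the column parameters, and implement `_transform`. Everything named below (`split_tokens`, `make_punct_tokenizer`, `PunctTokenizer`) is hypothetical, and a regex stands in for spaCy to keep the example short.

```python
import re


def split_tokens(text):
    """Lowercase text and split it into words and single punctuation marks."""
    if text is None:
        return []
    return re.findall(r"\w+|[^\w\s]", text.lower())


def make_punct_tokenizer(input_col, output_col):
    """Build a custom Transformer that applies split_tokens to one column.

    pyspark is imported lazily so split_tokens can be tried without Spark;
    in a real module the imports and the class would sit at top level.
    """
    from pyspark.ml import Transformer
    from pyspark.ml.param.shared import HasInputCol, HasOutputCol
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    class PunctTokenizer(Transformer, HasInputCol, HasOutputCol):
        def __init__(self):
            super().__init__()
            # Store the column names as regular Spark ML params.
            self._set(inputCol=input_col, outputCol=output_col)

        def _transform(self, dataset):
            # Called by .transform(); returns the dataframe plus one new column.
            tokenize = udf(split_tokens, ArrayType(StringType()))
            return dataset.withColumn(
                self.getOutputCol(), tokenize(dataset[self.getInputCol()])
            )

    return PunctTokenizer()
```

Under these assumptions, `make_punct_tokenizer('excerpt', 'words').transform(tiny_df)` would be used the same way as a built-in transformer, including as a stage in a Pipeline.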

### spaCy Transformer:

In [10]:
%%time
tokenizer = SpacyTokenizer(inputCol='excerpt', outputCol='words')

CPU times: user 854 µs, sys: 599 µs, total: 1.45 ms
Wall time: 1.68 ms


In [11]:
%%time

df_tokens = tokenizer.transform(tiny_df)
df_tokens.show(10)


+--------------+-------------------+--------------------+--------------+--------------------+
|        author|              title|             excerpt|excerpt_number|               words|
+--------------+-------------------+--------------------+--------------+--------------------+
|CharlesDickens|    AChristmasCarol|It was not an agr...|            25|[it, was, not, an...|
|CharlesDickens|   ATaleOfTwoCities|“So soon?” || Mis...|            25|[“, so, soon, ?, ...|
|CharlesDickens|   DavidCopperfield|‘Peggotty,’ says ...|            25|[‘, peggotty,’, s...|
|CharlesDickens|  GreatExpectations|“What’s in the bo...|            25|[“, what, ’s, in,...|
|CharlesDickens|        OliverTwist|'Walk in,' said t...|            25|[', walk, in, ,, ...|
|    JaneAusten|               Emma|She was so busy i...|            25|[she, was, so, bu...|
|    JaneAusten|      MansfieldPark|Fanny was too muc...|            25|[fanny, was, too,...|
|    JaneAusten|         Persuasion|But Mrs Clay was ...|   

# Pipelines in Spark

Pipelines allow multiple transformers to be strung together and run as a single unit.

By passing each stage's ".getOutputCol()" as the next stage's input column, every column name is set in a single location.

Columns can then be added or dropped simply by adding or removing the corresponding transformers from the "stages" list in the Pipeline.


```python
# Pipeline example (illustrative)
from pyspark.ml import Pipeline
from pyspark.ml.feature import (RegexTokenizer, StopWordsRemover, NGram,
                                CountVectorizer, IDF, Word2Vec)
# Stemming_Transformer is assumed to be a custom transformer defined elsewhere

# List all your transformers:
tokenizer = RegexTokenizer(inputCol="parsed_text", outputCol="raw_tokens",
                           pattern="\\W", minTokenLength=3)
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol='tokens_stop')
stemmer = Stemming_Transformer(inputCol=remover.getOutputCol(), outputCol='tokens')
bigram = NGram(inputCol=stemmer.getOutputCol(), outputCol='bigrams', n=2)
trigram = NGram(inputCol=stemmer.getOutputCol(), outputCol='trigrams', n=3)
cv = CountVectorizer(inputCol=stemmer.getOutputCol(), outputCol='token_countvector',
                     minDF=10.0)
idf = IDF(inputCol=cv.getOutputCol(), outputCol='token_idf', minDocFreq=10)
w2v_2d = Word2Vec(vectorSize=2, minCount=2, inputCol=stemmer.getOutputCol(),
                  outputCol='word2vec_2d')
w2v_large = Word2Vec(vectorSize=250, minCount=2, inputCol=stemmer.getOutputCol(),
                     outputCol='word2vec_large')

# Include the desired transformers in the "stages" list
# (bigram and trigram are defined above but left out of this run)
pipe = Pipeline(stages=[tokenizer, remover, stemmer, cv, idf, w2v_2d, w2v_large])

# And voilà! An entire dataframe can now be created with a single line of code,
# e.g. pipe.fit(some_df).transform(some_df)
```

In [12]:
# Here is a small functional pipeline example:

# Set up transformers
tokenizer = SpacyTokenizer(inputCol='excerpt', outputCol='words')
countvec = CountVectorizer(inputCol=tokenizer.getOutputCol(), outputCol='termfreq')
idf = IDF(inputCol=countvec.getOutputCol(), outputCol='tfidf')

In [13]:
%%time
# Now create the pipeline and build the dataframe by calling .fit() and .transform()
pipeline = Pipeline(stages=[tokenizer, countvec, idf])
sample_data = pipeline.fit(tiny_df).transform(tiny_df)


CPU times: user 51.4 ms, sys: 10.9 ms, total: 62.3 ms
Wall time: 4.61 s
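
The pipeline needs `.fit()` before `.transform()` because `CountVectorizer` and `IDF` are estimators: they first learn something from the data (a vocabulary, document frequencies) and only then transform it. The plain-Python sketch below imitates that two-step pattern for IDF, using the smoothed formula given in Spark's documentation, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. The function names and toy documents are made up for illustration, and giving unseen terms a weight of 0.0 is a simplification.

```python
import math
from collections import Counter


def fit_idf(docs):
    """'Fit' step: learn one IDF weight per term from tokenized documents."""
    m = len(docs)
    # Count each term once per document it appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}


def transform_tfidf(doc, idf):
    """'Transform' step: scale raw term counts by the learned IDF weights."""
    counts = Counter(doc)
    # Terms never seen during fitting get weight 0.0 in this sketch.
    return {term: n * idf.get(term, 0.0) for term, n in counts.items()}


docs = [["it", "was", "not"], ["so", "soon"], ["it", "was", "so"]]
idf = fit_idf(docs)
print(transform_tfidf(docs[0], idf))
```

Terms that appear in most documents ("it", "was") end up with smaller weights than rarer ones ("not", "soon"), which is the point of the IDF stage.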


## Using nlp_pipeline.py

We can now put our entire pipeline in a script and apply it to new data with two lines of code.

In [14]:
%%time
nlp_pipeline = get_pipeline()
sample_data = nlp_pipeline.fit(tiny_df).transform(tiny_df)
sample_data.printSchema()

root
 |-- author: string (nullable = true)
 |-- title: string (nullable = true)
 |-- excerpt: string (nullable = true)
 |-- excerpt_number: long (nullable = true)
 |-- author_id: double (nullable = true)
 |-- title_id: double (nullable = true)
 |-- id_vector: vector (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- termfreq: vector (nullable = true)
 |-- tfidf: vector (nullable = true)
 |-- w2v: vector (nullable = true)
 |-- w2v_2d: vector (nullable = true)

CPU times: user 103 ms, sys: 20.8 ms, total: 124 ms
Wall time: 12.7 s


## Bonus Section: SparkSQL


Congratulations on reading this far! If you are familiar with SQL, here is a handy trick for viewing your data: Spark dataframes can be queried with standard SQL!

#### To view your dataframe as a SQL table run:

```python
df.createOrReplaceTempView("Table_Name")

spark.sql("""
        SELECT *
        FROM Table_Name
        LIMIT 5
        """).show()

```


#### If your dataframe is saved as a parquet file, it can be queried directly from disk:

```python
T = "parquet.`path/to/dataframe.parquet`"

spark.sql("""
        SELECT *
        FROM {}
        LIMIT 5
        """.format(T)).show()
```


In [15]:
sample_data.createOrReplaceTempView("nlp")

spark.sql("""
        SELECT author, title
             , words, w2v_2d
        FROM nlp
        """).show()

+--------------+--------------------+--------------------+--------------------+
|        author|               title|               words|              w2v_2d|
+--------------+--------------------+--------------------+--------------------+
|CharlesDickens|     AChristmasCarol|[it, was, not, an...|[-0.0590827333764...|
|CharlesDickens|    ATaleOfTwoCities|[“, so, soon, ?, ...|[-0.1139507902954...|
|CharlesDickens|    DavidCopperfield|[‘, peggotty,’, s...|[-0.0822526921615...|
|CharlesDickens|   GreatExpectations|[“, what, ’s, in,...|[-0.0584115711364...|
|CharlesDickens|         OliverTwist|[', walk, in, ,, ...|[-0.0862761910674...|
|    JaneAusten|                Emma|[she, was, so, bu...|[-0.1213585160292...|
|    JaneAusten|       MansfieldPark|[fanny, was, too,...|[-0.0106421834240...|
|    JaneAusten|          Persuasion|[but, mrs, clay, ...|[-0.0279451959840...|
|    JaneAusten|   PrideAndPrejudice|[“, not, as, you,...|[-0.0371378898880...|
|    JaneAusten| SenseAndSensibility|[",