# Chapter 7.3 - Spark Streaming

Paul E. Anderson

## Ice Breaker

Best breakfast burrito in town?

In [1]:
%load_ext autoreload
%autoreload 2
    
import os
from pathlib import Path
home = str(Path.home())

import pandas as pd

## Problem Statement:
* You are approached by a company that has a machine learning pipeline trained and tested on historical data.
* The company uses this pipeline to sort each tweet into one of three categories, each with a numeric label (shown in parentheses):
    * Negative (0)
    * Positive (1)
    * Neutral (2)
    
The company has heard about your amazing skills as a Spark Streaming expert. They would like you to take their pre-trained classifier and update it with new incoming data processed via Spark Streaming.

## Detours

In order to implement our streaming approach, we need to take a couple of brief detours into machine learning. We need to answer the following questions:
* How do we represent text as a vector of numbers such that a machine can mathematically learn from data?
* How do we use and evaluate an algorithm that classifies those numeric vectors into three categories (negative, positive, and neutral)?

### Representing text as a vector using `scikit-learn`

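`scikit-learn` is a popular Python package for machine learning.

We will use the `CountVectorizer` class in `scikit-learn` to build a term-frequency matrix: one row per document, one column per vocabulary word, and each entry the number of times that word appears in that document.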

A couple of famous book openings:

> The story so far: in the beginning, the universe was created. This has made a lot of people very angry and been widely regarded as a bad move. - Douglas Adams, The Restaurant at the End of the Universe (1980)

> Whether I shall turn out to be the hero of my own life, or whether that station will be held by anybody else, these pages must show. - Charles Dickens, David Copperfield (1850)

How will a computer understand these sentences when it can only add, multiply, and compare numbers?

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

famous_book_openings = [
    "The story so far: in the beginning, the universe was created. This has made a lot of people very angry and been widely regarded as a bad move",
    "Whether I shall turn out to be the hero of my own life, or whether that station will be held by anybody else, these pages must show."
]

vec = CountVectorizer()
vec.fit(famous_book_openings) # This determines the vocabulary.
tf_sparse = vec.transform(famous_book_openings)
tf_sparse

<2x46 sparse matrix of type '<class 'numpy.int64'>'
	with 48 stored elements in Compressed Sparse Row format>

## Printing in a readable format

In [3]:
import pandas as pd

pd.DataFrame(
    tf_sparse.toarray(),
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    columns=vec.get_feature_names_out()
)

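| | and | angry | anybody | as | bad | be | been | beginning | by | created | ... | these | this | to | turn | universe | very | was | whether | widely | will |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 2 | 0 | 1 |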
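
To make the term-frequency matrix less of a black box, here is a minimal pure-Python sketch that approximates what `CountVectorizer` does with its default settings: lowercase the text, keep tokens of two or more word characters, and count them. The helper name `tokenize` and the variables `counts_per_doc`, `vocabulary`, and `tf_matrix` are illustrative and not part of `scikit-learn`.

```python
import re
from collections import Counter

famous_book_openings = [
    "The story so far: in the beginning, the universe was created. This has made a lot of people very angry and been widely regarded as a bad move",
    "Whether I shall turn out to be the hero of my own life, or whether that station will be held by anybody else, these pages must show."
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    # Single-character tokens such as "a" and "I" are dropped.
    return re.findall(r"\b\w\w+\b", text.lower())

# One Counter of word -> count per document
counts_per_doc = [Counter(tokenize(doc)) for doc in famous_book_openings]

# The vocabulary is the sorted set of all words seen in any document
vocabulary = sorted(set().union(*counts_per_doc))

# One row per document, one column per vocabulary word
tf_matrix = [[counts[word] for word in vocabulary] for counts in counts_per_doc]

print(len(vocabulary))
print(tf_matrix[1][vocabulary.index("whether")])
```

The dense list-of-lists here stores every zero explicitly; `CountVectorizer` returns a sparse matrix instead because most entries are zero once the vocabulary gets large.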


## Applying this process to our Twitter data
We will do the following:
1. Load the tweets into a dataframe
2. Convert those tweets into a term-frequency matrix using the code from above

### Loading tweets into a dataframe

In [4]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

In [5]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("ItemID", IntegerType(), True),
    StructField("Sentiment", IntegerType(), True),
    StructField("SentimentText", StringType(), True),
])

spark_df = spark.read.schema(schema).csv(f'{home}/csc-369-student/data/twitter_sentiment_analysis/historical/xa*')
spark_df.first()

Row(ItemID=9000, Sentiment=0, SentimentText='&quot;Gravity is not always convenient!&quot;-Mr. Donde lol I just dropped my phone  but I thought of this saying.')

### Convert to Pandas DataFrame for sklearn

In [6]:
historical_training_data = spark_df.toPandas()
historical_training_data.head()

| | ItemID | Sentiment | SentimentText |
|---|---|---|---|
| 0 | 9000 | 0 | &quot;Gravity is not always convenient!&quot;-... |
| 1 | 9001 | 1 | &quot;Ha-ha!&quot; to the premature PSP Go! re... |
| 2 | 9002 | 0 | &quot;Hahah your just jealous your not as thin... |
| 3 | 9003 | 0 | &quot; lesbian, gay and bisexual students are ... |
| 4 | 9004 | 1 | &quot; mileycyrus i wanna perform with lady ga... |


### Convert to a term frequency matrix

In [7]:
vec = CountVectorizer()
vec.fit(historical_training_data['SentimentText']) # This determines the vocabulary.
tf_sparse = vec.transform(historical_training_data['SentimentText'])

## Mathematical model for prediction
* We will use a multinomial naive Bayes classifier (`MultinomialNB` in `scikit-learn`).
* It is a statistical classifier that gives good baseline performance for text analysis.
* It can be updated incrementally as new data arrives (i.e., online learning), without retraining from scratch.
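
As a rough illustration of the online-learning point, the sketch below fits a `MultinomialNB` on a tiny batch and then updates it with a second batch through `partial_fit`. The toy texts, labels, and the names `toy_vec` and `toy_model` are invented for this example. Note the assumption that the vocabulary is fixed by the first `fit`: words in later batches that the vectorizer has never seen are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data, invented for illustration (0 = negative, 1 = positive, 2 = neutral)
initial_texts = ["i hate mondays", "i love burritos", "the bus is here"]
initial_labels = [0, 1, 2]

toy_vec = CountVectorizer()
toy_model = MultinomialNB()

# Initial training: fixes the vocabulary and the set of classes
toy_model.fit(toy_vec.fit_transform(initial_texts), initial_labels)

# A later batch: transform with the SAME fitted vectorizer, then update in place
new_texts = ["love love love this", "hate the bus"]
new_labels = [1, 0]
toy_model.partial_fit(toy_vec.transform(new_texts), new_labels)

# class_count_ holds how many training examples each class has seen so far
print(toy_model.class_count_)
```

Because the initial `fit` already saw all three labels, the later `partial_fit` call does not need the `classes` argument; a model trained only with `partial_fit` would need `classes=[0, 1, 2]` on its first call.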

In [8]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(tf_sparse,historical_training_data['Sentiment'])

MultinomialNB()

**How well does this model predict on historical data?**

In [9]:
test_spark_df = spark.read.schema(schema).csv(f'{home}/csc-369-student/data/twitter_sentiment_analysis/historical/xb*')
historical_test_data = test_spark_df.toPandas()
test_tf_sparse = vec.transform(historical_test_data['SentimentText'])
print("Accuracy on new historical test data:",sum(model.predict(test_tf_sparse) == historical_test_data['Sentiment'])/len(historical_test_data))

Accuracy on new historical test data: 0.7223846153846154


**We've got a predictive model that does better than guessing!**
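
What counts as "guessing" depends on the label distribution. A common reference point is the majority-class baseline: the accuracy you would get by always predicting the most frequent label. The sketch below computes both numbers on made-up labels; the helper names `accuracy` and `majority_baseline_accuracy` and the toy lists are assumptions for illustration, not values from the tweet data.

```python
from collections import Counter

def accuracy(predicted, actual):
    # Fraction of positions where the prediction matches the true label
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

def majority_baseline_accuracy(actual):
    # Accuracy of a "model" that always predicts the most common label
    most_common_count = Counter(actual).most_common(1)[0][1]
    return most_common_count / len(actual)

# Made-up labels for illustration only
actual = [0, 1, 1, 2, 1, 0, 1, 1]
predicted = [0, 1, 1, 1, 1, 0, 0, 1]

print("model accuracy:   ", accuracy(predicted, actual))
print("majority baseline:", majority_baseline_accuracy(actual))
```

Applied to the notebook's data, the same comparison would use `model.predict(test_tf_sparse)` as `predicted` and `historical_test_data['Sentiment']` as `actual`.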

That's enough for this illustrative example. Now how would we update this using Spark Streaming?

### The usual SparkContext

In [10]:
from pyspark import SparkConf
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

### Grab a streaming context

In [11]:
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)  # batch interval of 1 second

### Create a directory where we can add tweets

In [21]:
data_dir="/tmp/tweets"
!rm -rf {data_dir}
!mkdir {data_dir}
!chmod 777 {data_dir}
!ls {data_dir}

### Code from 7.2

```python
data_dir = "/tmp/add_books_here"


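from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.streaming import StreamingContext
import traceback

# `sc` is the SparkContext created earlier in the notebook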

# Lazily instantiated global instance of SparkSession
def getSparkSessionInstance(sparkConf):
    if ("sparkSessionSingletonInstance" not in globals()):
        globals()["sparkSessionSingletonInstance"] = SparkSession \
            .builder \
            .config(conf=sparkConf) \
            .getOrCreate()
    return globals()["sparkSessionSingletonInstance"]

ssc = StreamingContext(sc, 1)
ssc.checkpoint("checkpoint")

lines = ssc.textFileStream(data_dir)

def process(time, rdd):
    print("========= %s =========" % str(time))
    if rdd.isEmpty():
        return
    # Get the singleton instance of SparkSession
    try:
        spark = getSparkSessionInstance(rdd.context.getConf())
        # Convert RDD[String] to RDD[Row] to DataFrame
        words = rdd.flatMap(lambda line: line.split(" "))
        rowRdd = words.map(lambda w: Row(word=w))
        wordsDataFrame = spark.createDataFrame(rowRdd)

        # Creates a temporary view using the DataFrame
        wordsDataFrame.createOrReplaceTempView("words")

        # Do word count on table using SQL and print it
        wordCountsDataFrame = spark.sql("select word, count(*) as total from words group by word")
        wordCountsDataFrame.show()  # show() prints the table itself and returns None
    except Exception:
        print(traceback.format_exc())

lines.foreachRDD(process)

ssc.start()
import time; time.sleep(30)
#ssc.awaitTerminationOrTimeout(60) # wait 60 seconds
ssc.stop(stopSparkContext=False)
```

### How would you modify this so it updates the model via Spark Streaming?

In [17]:
# Your solution here

ssc.start()
import time; time.sleep(10)
ssc.stop(stopSparkContext=False)

MultinomialNB()
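
One possible shape for a solution, sketched without Spark so the update logic can be tested on its own. It assumes each streamed line has the same `ItemID,Sentiment,SentimentText` layout as the historical files, and the function name `update_model_from_lines` and the `demo_*` variables are hypothetical. Inside the streaming job, something like `lines.foreachRDD(lambda time, rdd: update_model_from_lines(rdd.collect(), vec, model))` could call it once per batch; collecting to the driver is an assumption that each batch is small enough to fit in memory there.

```python
import csv

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def update_model_from_lines(lines, vec, model):
    """Parse CSV lines (assumed: ItemID,Sentiment,SentimentText) and update model in place.

    Returns the number of tweets used for the update.
    """
    texts, labels = [], []
    for row in csv.reader(lines):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        try:
            label = int(row[1])
        except ValueError:
            continue  # skip a header row or a non-numeric label
        labels.append(label)
        texts.append(",".join(row[2:]))  # rejoin text that contained unquoted commas
    if texts:
        # Reuse the already-fitted vectorizer so the feature columns line up
        model.partial_fit(vec.transform(texts), labels)
    return len(texts)

# Tiny demonstration with invented data (0 = negative, 1 = positive, 2 = neutral)
demo_texts = ["i hate mondays", "i love burritos", "the bus is here"]
demo_vec = CountVectorizer().fit(demo_texts)
demo_model = MultinomialNB().fit(demo_vec.transform(demo_texts), [0, 1, 2])

batch = ['1,1,"love, love burritos"', "2,0,hate mondays", ""]
n_used = update_model_from_lines(batch, demo_vec, demo_model)
print(n_used, demo_model.class_count_)
```

Keeping the parsing and `partial_fit` call in a plain function means the same code can be exercised on a Python list before it is wired into `foreachRDD`.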


**Help our algorithm by copying some of the data files into the streaming directory!**

In [18]:
!ls {home}/csc-369-student/data/twitter_sentiment_analysis/streaming/

xca  xce  xci  xcm  xcq  xcu  xcy  xdc	xdg  xdk  xdo  xds  xdw
xcb  xcf  xcj  xcn  xcr  xcv  xcz  xdd	xdh  xdl  xdp  xdt
xcc  xcg  xck  xco  xcs  xcw  xda  xde	xdi  xdm  xdq  xdu
xcd  xch  xcl  xcp  xct  xcx  xdb  xdf	xdj  xdn  xdr  xdv


In [23]:
!echo cp \~/csc-369-student/data/twitter_sentiment_analysis/streaming/xca {data_dir}

cp ~/csc-369-student/data/twitter_sentiment_analysis/streaming/xca /tmp/tweets


**When we are ready, we can check the accuracy again. In theory, the model should improve as it sees more data.**

In [24]:
print("Accuracy on new historical test data:",sum(model.predict(test_tf_sparse) == historical_test_data['Sentiment'])/len(historical_test_data))

Accuracy on new historical test data: 0.7566538461538461


Thank you! Don't forget to push.