## Recursive Least Square with Kafka and Spark streaming 

This notebook provides an example for estimating the coefficients of a linear model on streaming data coming from a Kafka producer. The coefficient estimation is achieved using the recursive least square (RLS) algorithm, using two different RLS models in parallel (with different forgetting factors).

The linear model has 10 parameters, with coefficients [1,0,0,0,0,0,0,0,0,1] (see notebook KafkaSendRLS).

This notebook uses 
* the [Python client for the Apache Kafka distributed stream processing system](http://kafka-python.readthedocs.io/en/master/index.html) to receive messages from a Kafka cluster. 
* [Spark streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html) for processing the streaming data



### General import

In [1]:
import time
import re, ast
import numpy as np
import os

### Start Spark session


In [2]:
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

os.environ['PYSPARK_SUBMIT_ARGS'] = '--conf spark.ui.port=4040 '+\
                                '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.1 '+\
                                '--conf spark.driver.memory=2g  pyspark-shell'

spark = SparkSession \
    .builder \
    .master("local[2]") \
    .appName("KafkaReceive") \
    .getOrCreate()

### Connect to Kafka cluster on topic dataLinearModel

In [3]:
#This function creates a connection to a Kafka stream
#You may change the topic, or batch interval
#The Zookeeper server is assumed to be running at 127.0.0.1:2181
#The function returns the Spark context, Spark streaming context, and DStream object
def getKafkaDStream(spark,topic='persistence',batch_interval=10):

    #Get Spark context
    sc=spark.sparkContext

    #Create streaming context, with required batch interval
    ssc = StreamingContext(sc, batch_interval)

    #Checkpointing needed for stateful transforms
    ssc.checkpoint("checkpoint")
    
    #Create a DStream that represents streaming data from Kafka, for the required topic 
    dstream = KafkaUtils.createStream(ssc, "127.0.0.1:2181", "spark-streaming-consumer", {topic: 1})
    
    return [sc,ssc,dstream]


In [4]:
def updateFunction(new_values, state): 
    ## RLS update function
    ## Only update with first value of RDD. You should transofrm new_values to array, and update models for all values 
    if (len(new_values)>0 ):
        
        key=new_values[0][0]
        yx=new_values[0][1]
        i=yx[0]
        y=yx[1]
        x=yx[2:]
        n=len(x)
        
        beta=state[1]
        beta.shape=(n,1)
        V=state[2]
        mu=state[3]
        sse=state[4]  ## sum of squared errors
        N=state[5]   ## number of treated samples
        x.shape=(1,n)
        err=y-x.dot(beta)
        sse=sse+pow(err,2.0)
        V=1.0/mu*(V-V.dot(x.T).dot(x).dot(V)/(1.0+float(x.dot(V).dot(x.T))))
        gamma=V.dot(x.T)
        beta=beta+gamma*err
        
        return (key,beta,V,mu,sse/(N+1.0),N+1)  ## update formula mod1
        
    else:
        return state

### Define streaming pipeline

* We define a stream with two states, for updating two RLS models in paralell. Each state contains a state of variables to keep the state of the model, as well as to keep track of MSE estimates. A state is a list of 5 elements:
    * The first three are beta, V and mu, and define the state of the model (see RLS formulas in course)
    * The last two are an estimate of the MSE of the model, and the number of treated samples 
* We create a DStream, flat map with the sensor ID as key, update state for the stream, and save MSE

In [5]:
import re, ast

n=10 # number of features

beta1=np.zeros(n)  ## initial parameter vector for model 1
V1=np.diag(np.zeros(n)+10) ## initial covariance matrix for model 1
mu1=1.0 # forgetting factor for model 1

state1=('mod1',beta1,V1,mu1,0,0)

beta2=np.zeros(n)  ## initial parameter vector for model 2
V2=np.diag(np.zeros(n)+1) ## initial covariance matrix for model 2
mu2=0.99 # forgetting factor for model 2

state2=('mod2',beta2,V2,mu2,0,0)

[sc,ssc,dstream]=getKafkaDStream(spark=spark,topic='dataLinearModel',batch_interval=1)

#Evaluate input (a list - see KafkaSendRLS notebook)
dstream = dstream.map(lambda x: np.array(ast.literal_eval(x[1])))
#dstream.pprint()

dstream=dstream.flatMap(lambda x: [('mod1',('mod1',1.0*np.array(x))),
                            ('mod2',('mod2',1.0*np.array(x)))])
#dstream.pprint()

initialStateRDD = sc.parallelize([(u'mod1', state1),
                                  (u'mod2', state2)])

dstream=dstream.updateStateByKey(updateFunction,initialRDD=initialStateRDD)

#Only display beta and error
#beta should converge to [1,0,0,0,0,0,0,0,0,1] (send KafkaSend notebook)
dstream.map(lambda x: x[1][0:2]+x[1][4:6]).pprint()
#dstream.pprint()

### Start streaming application

In [6]:
ssc.start()

-------------------------------------------
Time: 2018-04-11 10:24:24
-------------------------------------------
('mod1', array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]), 0, 0)
('mod2', array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]), 0, 0)

-------------------------------------------
Time: 2018-04-11 10:24:25
-------------------------------------------
('mod1', array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]), 0, 0)
('mod2', array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]), 0, 0)

-------------------------------------------
Time: 2018-04-11 10:24:26
-------------------------------------------
('mod1', array([[ 0.07345426],
       [ 0.17443528],
       [ 0.10719983],
       [ 0.2122113 ],
       [ 0.18384878],
       [ 0.19141452],
       [ 0.01971397],
       [ 0.24240384],
       [ 0.1413216 ],
       [ 0.06929117]]), array([[ 1.03321559]]), 1)
('mod2', array([[ 0.06080198],
       [ 0.14438931],
       [ 0.08873497],
       [ 0.17565852],
     

### Stop streaming

In [7]:
ssc.stop(stopSparkContext=False,stopGraceFully=False)

-------------------------------------------
Time: 2018-04-11 10:24:29
-------------------------------------------
('mod1', array([[ 0.22694852],
       [ 0.35408383],
       [ 0.00752181],
       [ 0.15257772],
       [ 0.01549361],
       [ 0.11455939],
       [ 0.41975984],
       [ 0.16142979],
       [ 0.38114068],
       [ 0.28878947]]), array([[ 0.07124946]]), 4)
('mod2', array([[ 0.1937677 ],
       [ 0.30240471],
       [ 0.07113931],
       [ 0.17484219],
       [ 0.11190291],
       [ 0.16065204],
       [ 0.29383222],
       [ 0.16671696],
       [ 0.29899335],
       [ 0.21592691]]), array([[ 0.0875207]]), 4)

