# Chapter 2. Architecture and Components of Spark and Spark Streaming

- Strengths of Hadoop
    - Revolutionized data processing and storage
    - A low-cost solution with reliable batch processing
    - http://en.wikipedia.org/wiki/MapReduce
- Limitations of Hadoop MapReduce
    - Excessive and intensive use of disks for all intermediate stages
    - Provides only map and reduce operations, with no other operations such as joining, flattening, or grouping of datasets
- Strengths of Spark
    - Enables in-memory data storage and near real-time data processing
    - Offers operations such as joins, merging, grouping, and many more
    - Faster than Hadoop applications that rely on disk
- Spark's core APIs
    - SQL for structured data processing
    - MLlib for iterative data processing (machine learning)
    - GraphX for graph processing
    - Spark Streaming for real-time processing of streaming data
- Goals of this chapter
    - Batch versus real-time data processing
    - Architecture of Spark
    - Architecture of Spark Streaming
    - Your first Spark Streaming program

## Section 01. Batch versus real-time data processing
- Let's discuss how processing large datasets differs between batch and real time.
- Doing so helps a great deal in understanding the Spark architecture.

### Batch processing
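- Batch processing is a process made up of multiple jobs that are chained to one another, or that run one after another in sequence or in parallel.
- Input data is collected over a period of time, and the output of one batch can become the input of another job.
- Fast response time is not critical.
- Jobs are processed within a time window, also called a batch window.
- Batch windows are typically scheduled during periods of less intensive online activity.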
- examples of batch jobs
    - Log analysis
    - Billing applications
    - Backups
    - Data warehouses
- complexity involved in batch processing systems
    - Large data
    - Scalability
    - Distributed processing
    - Fault tolerant
    
### Real-time data processing
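- Real-time data processing receives constantly changing data, as in air traffic control or travel booking systems, and processes it fast enough to be able to control the source of the data.
- Response time in real time is expected to be immediate, within a few milliseconds.
- The difference between the time data is received and the time of the response is called **latency**; the lower it is, the better.
- Because of latency and service requirements, real time is often referred to as near real-time.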
- examples of real-time systems
    - Bank ATMs
    - Real-time monitoring
    - Real-time business intelligence
    - Operational intelligence (OI) 
    - Point of Sale (POS) systems
    - Assembly lines
- complexity involved in real-time data processing systems 
    - System responsiveness: data must be processed without delay.
    - Fault-tolerant
    - Scalable
    - In memory
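
As a minimal sketch of the latency definition in this section, the following plain-Java snippet computes latency as the difference between the time data is received and the time the response is produced, then checks it against a budget. The class `LatencySketch`, its method names, and the 100 ms budget are hypothetical and chosen only for illustration.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration; not part of Spark or any library.
final class LatencySketch {

    // Latency = time the response was produced minus time the data was received.
    static Duration latency(Instant receivedAt, Instant respondedAt) {
        return Duration.between(receivedAt, respondedAt);
    }

    // True when the measured latency fits within the allowed budget.
    static boolean withinBudget(Duration latency, Duration budget) {
        return latency.compareTo(budget) <= 0;
    }

    public static void main(String[] args) {
        Instant received = Instant.parse("2015-01-01T00:00:00.000Z");
        Instant responded = Instant.parse("2015-01-01T00:00:00.040Z");
        Duration budget = Duration.ofMillis(100); // assumed service requirement

        Duration measured = latency(received, responded);
        System.out.println("latency ms = " + measured.toMillis());
        System.out.println("within budget = " + withinBudget(measured, budget));
    }
}
```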

## Section 02. Architecture of Spark

### Spark versus Hadoop
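- Spark is an open source cluster computing framework that is similar to Hadoop but improves on it in the areas below.
- Why it is better
    - Iterative and interactive computations and workloads: for example, it works well for machine learning algorithms that reuse intermediate results and for data tasks that need multiple parallel operations.
    - Real-time data processing: Hadoop is built for batch processing and lacks the in-memory processing capability that real-time work needs.
- RDD (Resilient Distributed Datasets)
    - Introduces a new abstraction layer: distributed datasets that are partitioned across the cluster and cached in memory to minimize latency.
    - An RDD is an immutable (read-only) collection.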
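
The two RDD properties noted here, partitioning and immutability, can be illustrated with a small single-machine sketch in plain Java. This is not Spark's RDD API: the class `PartitionedDataset` and its methods are hypothetical, the round-robin split is an arbitrary choice, and nothing here is distributed or cached. It only shows that a transformation leaves the original dataset untouched and returns a new one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of a partitioned, read-only dataset (NOT Spark's RDD API).
final class PartitionedDataset<T> {

    private final List<List<T>> partitions;

    private PartitionedDataset(List<List<T>> partitions) {
        this.partitions = partitions;
    }

    // Splits the input round-robin into numPartitions read-only partitions.
    static <T> PartitionedDataset<T> of(List<T> data, int numPartitions) {
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (int i = 0; i < data.size(); i++) {
            parts.get(i % numPartitions).add(data.get(i));
        }
        return new PartitionedDataset<>(freeze(parts));
    }

    // A transformation never modifies this dataset; it builds a new one, partition by partition.
    <R> PartitionedDataset<R> map(Function<T, R> f) {
        List<List<R>> mapped = new ArrayList<>();
        for (List<T> part : partitions) {
            List<R> out = new ArrayList<>();
            for (T item : part) {
                out.add(f.apply(item));
            }
            mapped.add(out);
        }
        return new PartitionedDataset<>(freeze(mapped));
    }

    int numPartitions() {
        return partitions.size();
    }

    // Gathers all partitions, in partition order, into one read-only list.
    List<T> collect() {
        List<T> all = new ArrayList<>();
        for (List<T> part : partitions) {
            all.addAll(part);
        }
        return Collections.unmodifiableList(all);
    }

    // Wraps every partition, and the list of partitions, in read-only views.
    private static <E> List<List<E>> freeze(List<List<E>> parts) {
        List<List<E>> frozen = new ArrayList<>();
        for (List<E> part : parts) {
            frozen.add(Collections.unmodifiableList(part));
        }
        return Collections.unmodifiableList(frozen);
    }

    public static void main(String[] args) {
        PartitionedDataset<Integer> numbers = PartitionedDataset.of(Arrays.asList(1, 2, 3, 4, 5), 2);
        PartitionedDataset<Integer> doubled = numbers.map(x -> x * 2);
        System.out.println("original: " + numbers.collect());
        System.out.println("doubled:  " + doubled.collect());
    }
}
```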
    
### Layered architecture – Spark

![Spark](sparkstreaming02_01.jpg)

- High-level architecture of Spark
    - Data storage layer : local filesystems, HDFS, NoSQL databases such as HBase, Cassandra, and MongoDB, as well as S3 and Elasticsearch
    - Resource manager APIs : YARN, Mesos, Standalone
    - Spark Core libraries : Spark general execution engine,  in-memory distributed data processing
    - Spark extensions/libraries 
    
![Spark](sparkstreaming02_02.jpg)    

- Interaction between the different layers of the Spark architecture.

## Section 03. Architecture of Spark Streaming 

### What is Spark Streaming? 
- Spark Streaming is a Spark extension for processing streaming, or fast-moving, data.
- It is used for spam filtering, intrusion detection, clickstream data analysis, and similar tasks.
- Because Spark supports in-memory processing, Spark Streaming also processes live/streaming data in memory.
- Data sources include HDFS, Flume, Kafka, Twitter, and TCP sockets.

### High-level architecture – Spark Streaming
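- Spark Streaming implements the concept of **micro-batching**, which divides live/streaming data into a series of small, predefined batches.
- Each batch processes its individual records, and the result of the batch is sent to a user-defined output stream; it can be stored in HDFS, NoSQL, or a database, or used to build a live dashboard.
- The batch size is chosen to reflect the latency that is acceptable for each use case.
- It can be specified in milliseconds or seconds, and the data collected for one batch is processed in one go.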
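
To make the micro-batching idea concrete, here is a minimal plain-Java sketch that assigns timestamped records to fixed-interval batches. It is not Spark Streaming code: `MicroBatchSketch`, `Event`, and the sample arrival times are hypothetical, and the 2000 ms interval is an assumption that mirrors the 2-second interval used in the example programs of this chapter. Arrival times are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of micro-batching; not Spark Streaming code.
final class MicroBatchSketch {

    // A record tagged with its arrival time in milliseconds.
    static final class Event {
        final long arrivalMillis;
        final String payload;

        Event(long arrivalMillis, String payload) {
            this.arrivalMillis = arrivalMillis;
            this.payload = payload;
        }
    }

    // Index of the batch that an arrival time falls into for a given interval.
    static long batchIndex(long arrivalMillis, long intervalMillis) {
        return arrivalMillis / intervalMillis;
    }

    // Groups event payloads by batch index; the TreeMap keeps batches in time order.
    static Map<Long, List<String>> toBatches(List<Event> events, long intervalMillis) {
        Map<Long, List<String>> batches = new TreeMap<>();
        for (Event e : events) {
            long index = batchIndex(e.arrivalMillis, intervalMillis);
            batches.computeIfAbsent(index, k -> new ArrayList<>()).add(e.payload);
        }
        return batches;
    }

    public static void main(String[] args) {
        long intervalMillis = 2000; // assumed 2-second batch interval
        List<Event> events = Arrays.asList(
                new Event(100, "a"), new Event(1900, "b"),
                new Event(2000, "c"), new Event(4500, "d"));
        toBatches(events, intervalMillis).forEach((index, payloads) ->
                System.out.println("batch " + index + " -> " + payloads));
    }
}
```

With these sample events, "a" and "b" land in batch 0, "c" in batch 1, and "d" in batch 2. A smaller interval lowers the delay before a record is processed but produces more, smaller batches. In this sketch an interval with no events simply has no entry.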

![Spark](sparkstreaming02_03.jpg)  
- High-level architecture of Spark Streaming
    - Input data streams 
        - Basic data sources : provided by the core Spark Streaming API
        - Advanced data sources : external libraries that must be downloaded and installed separately. http://spark-packages.org/?q=tags%3A%22Streaming%22
    - Spark Streaming : the part where you code the program that processes streaming data, using the Streaming API and the Spark API
    - Batch: an abstraction layer over a series of RDDs, known as DStreams (Discretized streams); a DStream holds the input stream data and handles the transformation of that data
    - Spark Core engine : receives the input data as RDDs, processes it, and emits the results.
    - Output data streams : the output stream (in DStream form) that carries the result of each processed batch to the next step. It can be sent to a raw file system, NoSQL, queues, or web sockets, which naturally requires additional coding.
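
The flow described in this list, where a stream is treated as a series of batches and each batch is transformed on its own, can be sketched in plain Java without Spark. In this hypothetical example, ordinary lists stand in for the RDDs of a DStream, and `BatchWordCount`, `countWords`, and the sample batches are invented for illustration. The per-batch logic follows the same split, pair, and sum-by-key steps as the word count programs later in this chapter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration: lists stand in for the per-batch RDDs of a DStream.
final class BatchWordCount {

    // Word count for one batch: split each line on spaces, then sum 1 per occurrence of a word.
    static Map<String, Integer> countWords(List<String> batchLines) {
        return batchLines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        // Each inner list represents the lines received during one batch interval.
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("spark streaming", "spark core"),
                Arrays.asList("hello spark"));

        // Every batch is processed independently, so counts restart with each batch.
        for (List<String> batch : batches) {
            System.out.println(countWords(batch));
        }
    }
}
```

Because each batch is counted separately, "spark" is reported as 2 for the first batch and 1 for the second rather than as a running total of 3.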
   

## Section 04. Your first Spark Streaming program


### Coding Spark Streaming jobs in Scala

In [None]:
package chapter.two

import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming._
import org.apache.spark.storage.StorageLevel._
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.dstream.ForEachDStream


object ScalaFirstStreamingExample {
  
  def main(args:Array[String]){
    
    println("Creating Spark Configuration")
    //Create an Object of Spark Configuration
    val conf = new SparkConf()
    //Set the logical and user defined Name of this Application
    conf.setAppName("My First Spark Streaming Application")
    
    println("Retrieving Streaming Context from Spark Conf")
    //Retrieving Streaming Context from SparkConf Object.
    //Second parameter is the time interval at which streaming data will be divided into batches  
    val streamCtx = new StreamingContext(conf, Seconds(2))

    //Define the type of Stream. Here we are using TCP Socket as text stream, 
    //It will keep watching for the incoming data from a specific machine (localhost) and port (9087) 
    //Once the data is retrieved it will be saved in the memory and in case memory
    //is not sufficient, then it will store it on the Disk
    //It will further read the Data and convert it into DStream
    val lines = streamCtx.socketTextStream("localhost", 9087, MEMORY_AND_DISK_SER_2)
    
    //Apply the Split() function to all elements of DStream 
    //which will further generate multiple new records from each record in Source Stream
    //And then use flatmap to consolidate all records and create a new DStream.
    val words = lines.flatMap(x => x.split(" "))
    
    //Now, we will count these words by applying map()
    //map() applies a given function to each element of the DStream, here producing (word, 1) pairs.
    val pairs = words.map(word => (word, 1))
    
    //Further we will aggregate the value of each key by using/ applying the given function.
    val wordCounts = pairs.reduceByKey(_ + _)
    
    //Lastly we will print all Values
    //wordCounts.print(20)
    
    printValues(wordCounts,streamCtx)
    //Most important statement which will initiate the Streaming Context
    streamCtx.start();
    //Wait till the execution is completed.
    streamCtx.awaitTermination();  
  
  }
  
  /**
   * Simple Print function, for printing all elements of RDD
   */
  def printValues(stream:DStream[(String,Int)],streamCtx: StreamingContext){
    stream.foreachRDD(foreachFunc)
    def foreachFunc = (rdd: RDD[(String,Int)]) => {
      val array = rdd.collect()
      println("---------Start Printing Results----------")
      for(res<-array){
        println(res)
      }
      println("---------Finished Printing Results----------")
    }
  }
  
}

### Coding Spark Streaming jobs in Java

In [None]:
package chapter.two;

import java.util.Arrays;

import org.apache.spark.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.*;
import org.apache.spark.streaming.api.java.*;

import scala.Tuple2;

public class JavaFirstStreamingExample {
	  
	public static void main(String[] s){
	    
	    System.out.println("Creating Spark Configuration");
	    //Create an Object of Spark Configuration
	    SparkConf conf = new SparkConf();
	    //Set the logical and user defined Name of this Application
	    conf.setAppName("My First Spark Streaming Application");
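	    //Define the URL of the Spark Master.
	    //Needed only when the master is not supplied externally, for example when running this app directly from an IDE or the console.
	    //To run against a cluster, use a master URL such as the commented one below instead of local[2].
	    //conf.setMaster("spark://ip-10-237-224-94:7077")
	    //local[2] runs Spark locally with two threads: one for the socket receiver and one for processing.
	    conf.setMaster("local[2]");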
	    
	    System.out.println("Retrieving Streaming Context from Spark Conf");
	    //Retrieving Streaming Context from SparkConf Object.
	    //Second parameter is the time interval at which streaming data will be divided into batches  
	    JavaStreamingContext streamCtx = new JavaStreamingContext(conf, Durations.seconds(2));

	    //Define the type of Stream. Here we are using TCP Socket as text stream, 
	    //It will keep watching for the incoming data from a specific machine (localhost) and port (9087) 
	    //Once the data is retrieved it will be saved in the memory and in case memory
	    //is not sufficient, then it will store it on the Disk.  
	    //It will further read the Data and convert it into DStream
	    JavaReceiverInputDStream<String> lines = streamCtx.socketTextStream("localhost", 9087,StorageLevel.MEMORY_AND_DISK_SER_2());
	    
	    //Apply the x.split() function to all elements of JavaReceiverInputDStream 
	    //which will further generate multiple new records from each record in Source Stream
	    //And then use flatmap to consolidate all records and create a new JavaDStream.
	    JavaDStream<String> words = lines.flatMap( new FlatMapFunction<String, String>() {
	    			    @Override public Iterable<String> call(String x) {
	    			      return Arrays.asList(x.split(" "));
	    			    }
	    			  });
	    		
	    
	    //Now, we will count these words by applying mapToPair()
	    //mapToPair() applies a given function to each element of the DStream
	    //and returns a Scala Tuple2 with the word as key and 1 as its initial count.
	    JavaPairDStream<String, Integer> pairs = words.mapToPair(
	    		  new PairFunction<String, String, Integer>() {
	    		    @Override 
	    		    public Tuple2<String, Integer> call(String s) throws Exception {
	    		      return new Tuple2<String, Integer>(s, 1);
	    		    }
	    		  });
	    		
	    
	    //Further we will aggregate the value of each key by using/ applying the given function.
	    JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(
	    		  new Function2<Integer, Integer, Integer>() {
	    		    @Override public Integer call(Integer i1, Integer i2) throws Exception {
	    		      return i1 + i2;
	    		    }
	    		  });
	    		
	    
	    //Lastly we will print First 10 Words.
	    //We can also implement custom print method for printing all values,
	    //as we did in Scala example.
	    wordCounts.print(10);
	    //Most important statement which will initiate the Streaming Context
	    streamCtx.start();
	    //Wait till the execution is completed.
	    streamCtx.awaitTermination();  
	  
	  }

}


### The client application

In [None]:
package chapter.two;

import java.net.ServerSocket;
import java.net.Socket;
import java.io.*;

public class ClientApp {

	public static void main(String[] args) {
		try{
			System.out.println("Defining new Socket");
			ServerSocket soc = new ServerSocket(9087);
			System.out.println("Waiting for Incoming Connection");
			Socket clientSocket = soc.accept();

			System.out.println("Connection Received");
			OutputStream outputStream = clientSocket.getOutputStream();
			//Create the auto-flushing writer and the console reader once, outside the loop
			PrintWriter out =  new PrintWriter(outputStream, true);
			BufferedReader read = new BufferedReader(new InputStreamReader(System.in));
			//Keep reading data from the console in a loop and send it over to the Socket.
			while(true){
				System.out.println("Waiting for user to input some data");
				String data = read.readLine();
				//readLine() returns null at end of input, so stop instead of sending "null"
				if(data == null){
					break;
				}
				System.out.println("Data received and now writing it to Socket");
				out.println(data);
			}
			
		}catch(Exception e ){
			e.printStackTrace();
		}


	}

}