# Spark Streaming using a socket

### 1. Set up a stream source

I used a CentOS Docker image to set up a stream data source.

#### A. Create a Docker network

```bash
docker network create stream-net
```

#### B. Use a CentOS image to install netcat

```bash
docker run -t centos yum -y install nc
```

#### C. Save the container as a new image

The container exits as soon as `yum` finishes, so it is listed only with `docker ps -a`.

* Get the Docker container ID

```bash
docker ps -a | grep 'yum -y install nc'
```

* Save it as a Docker image

```bash
docker commit CONTAINER_ID nc-image
```

* Remove the stopped container

```bash
docker rm CONTAINER_ID
```

### 2. Start the data stream

From a new terminal, start a container from the new image.

```bash
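# --rm                   remove the container after it stops
# -it                    run it interactively from the terminal
# -p 9999                publish container port 9999 (on a random host port)
# --network stream-net   use the new Docker network
# --name nc-server       assign the container name
# nc -lk nc-server 9999  netcat listens on port 9999 and keeps accepting connections
docker run \
  --rm \
  -it \
  -p 9999 \
  --network stream-net \
  --name nc-server \
  nc-image \
  nc -lk nc-server 9999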
```

While the netcat command is running, type messages into the terminal. Each line is sent to the connected client.

`<Type some text>`
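
For testing without Docker, the netcat source can be approximated by a small TCP server that writes newline-terminated text to the first client that connects. This is only a sketch using the Python standard library, not part of the original setup; `serve_lines` and `read_all` are hypothetical helper names.

```python
import socket
import threading


def serve_lines(lines, host="127.0.0.1", port=0):
    """Start a one-client TCP server that sends each line, newline-terminated.

    Returns the (host, port) pair the server listens on. port=0 lets the
    operating system pick a free port.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    address = server.getsockname()

    def handle():
        # Wait for a single client, send every line, then shut down.
        conn, _ = server.accept()
        with conn:
            for line in lines:
                conn.sendall((line + "\n").encode("utf-8"))
        server.close()

    threading.Thread(target=handle, daemon=True).start()
    return address


def read_all(address):
    """Connect to the server and return every line received until it closes."""
    data = b""
    with socket.create_connection(address, timeout=5) as client:
        while True:
            chunk = client.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode("utf-8").splitlines()
```

Unlike `nc -lk`, this sketch serves one client and then exits, which is usually enough for a quick local check.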

### 3. Start the Spark Jupyter notebook

I used a Docker image to run the Jupyter notebook for PySpark. Run the following commands from a new terminal.

#### A. Download the Jupyter notebook Docker image
```bash
docker pull avikdatta/sparkjupyterdockerimage
```
#### B. Start the Jupyter notebook and link it to the stream source
```bash
# --network stream-net   use the new Docker network
# --link nc-server:bash  link to the stream source
# -d                     run in detached mode
# -p 8887:8887           publish port 8887 for the notebook
# -p 4040:4040           publish port 4040 for the Spark UI
# --name spark-client    name of the container
# --ip / --port          notebook IP and port
# --no-browser           run the notebook without a browser
docker run \
  --network stream-net \
  --link nc-server:bash \
  -d \
  -p 8887:8887 \
  -p 4040:4040 \
  --name spark-client \
  avikdatta/sparkjupyterdockerimage \
  jupyter-notebook \
  --ip=0.0.0.0 \
  --port=8887 \
  --no-browser
```

#### C. Access the notebook from a browser
Use the following address to connect to the notebook.

`http://<DOCKER HOST IP ADDRESS>:8887`

The notebook server will then ask for a password.

### 4. Process the data stream from the notebook

Open a new notebook and run the following steps.

```python
# Load pyspark package
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
```

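```python
# Create the Spark context with two local threads:
# one for the socket receiver and one for processing
sc = SparkContext("local[2]", "NetworkWordCount")
# Create the streaming context with a 20-second batch interval
ssc = StreamingContext(sc, 20)
```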

```python
# Connect to the stream data source
lines = ssc.socketTextStream("nc-server", 9999)
```
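
Before starting the stream, it can help to confirm that the notebook container can actually reach the source over TCP. The following is a minimal sketch using only the standard library; `can_connect` is a hypothetical helper, and the host and port are the ones assumed throughout this walkthrough.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

In the notebook, `can_connect("nc-server", 9999)` should return `True` while the netcat container is running on the same Docker network.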

```python
# Data transformation using Spark
words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
wordCounts.pprint()
```
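
To see what the transformation computes for a single batch, the same split-and-count logic can be written in plain Python. This is an illustrative sketch, not Spark code; `word_count` is a hypothetical helper and the input lines are made up.

```python
from collections import Counter


def word_count(lines):
    """Count space-separated tokens across lines, like one batch of the stream."""
    counts = Counter()
    for line in lines:
        # Same tokenisation as the notebook: a plain split on single spaces,
        # so quotes and punctuation stay attached to the word.
        for word in line.split(" "):
            counts[word] += 1
    return dict(counts)


batch = ['it is "one" of many', "one of one"]
print(word_count(batch))
```

Because the split is on single spaces only, `"one"` (with quotes) and `one` are counted as different words, which is why tokens such as `'"one'` show up in the streaming output.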

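```python
# Start the stream processing; awaitTermination() blocks this cell
# until the stream is stopped or the kernel is interrupted
ssc.start()
ssc.awaitTermination()
```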

```
-------------------------------------------
Time: 2018-02-18 22:47:20
-------------------------------------------
('systems', 1)
('"one', 1)
('in', 2)
('is', 1)
('world".', 1)
('of', 2)
```
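
The `Time:` stamp in the output falls on a multiple of the 20-second batch interval. Assuming batches close on multiples of the interval counted from the Unix epoch (an assumption about Spark's scheduling, not something stated in this walkthrough), the batch a line lands in can be estimated with simple arithmetic. `batch_time` is a hypothetical helper.

```python
def batch_time(arrival_epoch_seconds, interval_seconds=20):
    """Estimate the epoch second at which the batch holding this arrival closes.

    Assumes batch boundaries sit on multiples of interval_seconds.
    """
    return (int(arrival_epoch_seconds) // interval_seconds + 1) * interval_seconds


# A line typed 45 seconds into a minute would be reported in the batch
# that closes at second 60.
print(batch_time(45))
```

Under that assumption, text typed into the netcat terminal can take up to one full interval to appear in the notebook output.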

```python
# Stop streaming and the Spark context
# (interrupt the kernel first if awaitTermination() is still blocking)
ssc.stop()
```