# W205 Project 2
## Priscilla Burity

#### Commands explained

Move to the project's directory.

````
cd ~/w205/project-2-burityp

````


Copy the `docker-compose.yml` file from class. `cp` copies files; `-r` copies recursively, so `cp` copies the contents of directories, including any subdirectories. For a single file such as this one, `-r` is not needed but is harmless.

```
cp -r ~/w205/spark-with-kafka-and-hdfs/docker-compose.yml ~/w205/project-2-burityp
```

Spin up the cluster. `docker-compose up` starts the containers defined in `docker-compose.yml`. `-d` is for detached mode, i.e., it starts the containers in the background.

```
docker-compose up -d
```

`docker-compose logs` shows the logs of the running services. Check if there are any error messages. `-f` follows the log as new lines arrive. `kafka` because I'm interested in the logs of this specific container.

```
docker-compose logs -f kafka
```

Got the error message: `There is insufficient memory for the Java Runtime Environment to continue.`

Increase the swap space by creating a 2 GiB swap file on the hard disk:

```
sudo dd if=/dev/zero of=/var/myswap bs=1M count=2048

sudo mkswap /var/myswap

sudo swapon /var/myswap

```
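As a quick sanity check on the size of that swap file, the arithmetic behind `bs=1M count=2048` can be sketched in Python. This assumes GNU `dd`, where the `M` suffix means 1024 * 1024 bytes; the variable names are only illustrative.

```python
# Size of the swap file written by `dd ... bs=1M count=2048`.
# Assumption: GNU dd, where the `M` suffix means 1 MiB = 1024 * 1024 bytes.
block_size_bytes = 1024 * 1024
block_count = 2048

swap_bytes = block_size_bytes * block_count
swap_gib = swap_bytes / 1024 ** 3

print(swap_bytes, swap_gib)  # 2147483648 2.0
```

So the three commands above should add about 2 GiB of swap.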

Rerun the commands and check whether there are any error messages.

````
docker-compose up -d

docker-compose logs -f kafka

````



Looks OK. Download the data.
````
curl -L -o assessment-attempts-20180128-121051-nested.json https://goo.gl/ME6hjp
````


Check out the Hadoop file system, using the `cloudera` container to talk to `hadoop`, with `fs` for file system. `-ls` lists the contents of the temporary (`/tmp`) directory. Check that the automatically created folders are there.

````
docker-compose exec cloudera hadoop fs -ls /tmp/
````



Looks OK. Now I need to create a topic in Kafka. `exec` so I can run arbitrary commands in the services; `kafka` because my topic lives in Kafka; then `kafka-topics` with `--create` to create a `--topic` named `assessments`, with the usual options (`--partitions 1`, `--replication-factor 1`). `--if-not-exists` skips the creation if the topic already exists. Finally, `--zookeeper` sets the connection to Zookeeper with the appropriate port number.

````
docker-compose exec kafka kafka-topics --create --topic assessments --partitions 1 --replication-factor 1 --if-not-exists --zookeeper zookeeper:32181
````


Next, I use kafkacat to produce messages to the `assessments` topic. 

Use pipes (`|`): a pipe takes the output of the previous command and passes it on as input to the next one.

`mids` because we're in the mids container, `bash` to start a shell in the container, `-c` so the commands are read from the string right after `-c`. Inside the quotes, `cat` prints the file named `/w205/project-2-burityp/assessment-attempts-20180128-121051-nested.json`, and `jq` processes the JSON: `'.[]'` returns each element of the top-level array, one at a time, and `-c` prints each element compactly on a single line, which helps us produce one message in Kafka per data record.

`kafkacat` allows us to produce, consume, and list topic and partition information for Kafka. In the current case, we're producing (thus `-P`). We should also supply kafkacat with a broker (`-b kafka:29092`) and a topic (`-t assessments`).  

```
docker-compose exec mids bash -c "cat /w205/project-2-burityp/assessment-attempts-20180128-121051-nested.json | jq '.[]' -c | kafkacat -P -b kafka:29092 -t assessments"
```
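To see what the `jq '.[]' -c` step does to the file before `kafkacat` receives it, here is a minimal Python sketch of roughly the same transformation. The function name `to_json_lines` and the two-record sample are hypothetical and are not taken from the real data file.

```python
import json


def to_json_lines(array_text):
    """Roughly mimic `jq '.[]' -c`: one compact JSON line per top-level array element."""
    records = json.loads(array_text)
    # separators=(",", ":") drops the spaces, similar to jq's compact output.
    return [json.dumps(record, separators=(",", ":")) for record in records]


# Hypothetical sample input: a JSON array with two records.
sample = '[{"id": 1, "exam": "a"}, {"id": 2, "exam": "b"}]'

lines = to_json_lines(sample)
for line in lines:
    print(line)
# {"id":1,"exam":"a"}
# {"id":2,"exam":"b"}
```

Each printed line corresponds to one message that `kafkacat -P` would produce to the topic, since by default it treats each input line as a separate message.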


Next, spin up a `pyspark` process using the `spark` container.

````
docker-compose exec spark pyspark
````


We want to read in the assessments data. So at the pyspark prompt, we `read` from Kafka into a DataFrame named `raw_assessments`, with the `format` `kafka`. To read from Kafka, we need the `kafka.bootstrap.servers` option to connect to our `kafka` server with the appropriate port number. Then we `subscribe` to one topic (`assessments`) and set the `startingOffsets` and `endingOffsets` to `earliest` and `latest` respectively to get the whole dataset. And then `load` the data.

````
raw_assessments = spark.read.format("kafka").option("kafka.bootstrap.servers", "kafka:29092").option("subscribe","assessments").option("startingOffsets", "earliest").option("endingOffsets", "latest").load()
````


`cache` this to keep it in memory and cut back on warnings later.

```
raw_assessments.cache()
```


Check the `Schema` of the data.

````
raw_assessments.printSchema()
````


In [None]:
import sys