# Project 3 Report
***Chandler Haukap and Hassan Saad***

In this project, we simulated several different events that could occur in Minesweeper. The `main.py` file contains the code that:
* Creates the playing board
* Simulates the location of mines specific to a game session
* Allows for actions such as clicking on a cell, flagging a cell, or checking the solution of the board
    * The flag and check actions create metadata that depends on whether the specific cell contains an underlying mine or is a "safe spot"
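
As a rough illustration of the board and the check action, here is a minimal sketch that uses only the standard library. It is not the code in `main.py`: the function names (`create_board`, `neighboring_mines`, `check_cell`) and the board dimensions are assumptions made for this example, and only the event field names are taken from the logged events shown later in this report. The sketch counts mines in the eight surrounding cells; the logged sample events contain larger values for `neighboring_bombs`, so the actual simulation presumably fills that field differently.

```python
import random
import uuid


def create_board(width, height, n_mines, seed=None):
    """Return a set of (x, y) mine coordinates for a width x height board."""
    rng = random.Random(seed)
    cells = [(x, y) for x in range(width) for y in range(height)]
    return set(rng.sample(cells, n_mines))


def neighboring_mines(mines, x, y):
    """Count mines in the eight cells surrounding (x, y)."""
    return sum(
        (x + dx, y + dy) in mines
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )


def check_cell(mines, session_id, x, y):
    """Build a 'check' event dict for clicking on cell (x, y)."""
    return {
        "event_type": "check",
        "session_id": session_id,
        "x_coord": x,
        "y_coord": y,
        "neighboring_bombs": neighboring_mines(mines, x, y),
        "outcome": "hit_mine" if (x, y) in mines else "safe",
    }


# Hypothetical session: a 10 x 10 board with 15 mines.
session_id = str(uuid.uuid4())
mines = create_board(width=10, height=10, n_mines=15, seed=0)
event = check_cell(mines, session_id, x=3, y=4)
```

Storing the mines as a set of coordinates keeps both the membership test and the neighbor count short; a 2D array would work equally well.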

## Explanation of Data Pipeline

* Before anything else, we modified an existing `docker-compose.yml` file and updated the "mids" image so that the latest version of Redis is available to our Minesweeper API. The image is on DockerHub:\
    https://hub.docker.com/layers/180930452/hassansaadca/saad_project3/latest/images/sha256-e44f736674fb069e48aa8c2d1ecf072e5e5be26cfa2d1948b0284f59ddc5c6c2?context=repo \
    With that in place, we're ready to run the API and store/access the data.
* First, the data is created by the Flask API (our `main.py` file) with the help of a Python file that simulates gameplay (`event_generator.py`). Each event described above generates data in string format, which is fed into a Kafka queue.
* Next, there are 3 ways in which we can log this data to a parquet file within the Hadoop environment. We explain these processes below, but for a summary:
    1. We generate the data, then open a Jupyter notebook within the pyspark environment and use PySpark to create the parquet file.
    2. We generate the data, then run a Python file within the pyspark environment that takes a batch of data from the Kafka queue and writes it to a new parquet file.
    3. We run a Python file that loops indefinitely and continuously streams data to the Hadoop environment every few seconds. As we run the `generate_events.py` file, this data is automatically appended to the same parquet file.
* Finally, we query the data within the PySpark notebook (below).
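
The difference between the batch approaches and the streaming approach can be sketched without Kafka or Spark. In the toy example below, a `queue.Queue` stands in for the Kafka topic and a plain list stands in for the parquet file; both stand-ins, and the `drain` helper, are assumptions made purely for illustration. A batch job runs one pass after the data exists, while a streaming job repeats the same pass on a timer so that later events land in the same sink.

```python
import json
import queue


def drain(source, sink):
    """Move every message currently waiting in `source` into `sink`.

    Returns the number of messages moved in this pass.
    """
    moved = 0
    while True:
        try:
            message = source.get_nowait()
        except queue.Empty:
            return moved
        sink.append(json.loads(message))
        moved += 1


events = queue.Queue()  # stand-in for the Kafka topic
table = []              # stand-in for the parquet file

# Batch mode: generate the data first, then move everything in one pass.
for i in range(3):
    events.put(json.dumps({"event_type": "check", "x_coord": i}))
first_pass = drain(events, table)

# Streaming mode: the same pass is repeated (in practice, every few
# seconds), so events produced later are appended to the same sink.
events.put(json.dumps({"event_type": "flag", "x_coord": 9}))
second_pass = drain(events, table)
```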

## Part 1: Setup of the Data Pipeline

We do all of this from the repository's working directory on our GCP VM:

`~w205/mids-205-project-3/`

**Spin up Docker Container:**

`docker-compose up -d`

**Create the Kafka topic, in this case called `events`:**

`docker-compose exec kafka kafka-topics --create --topic events --partitions 1 --replication-factor 1 --if-not-exists --bootstrap-server kafka:29092`


In a new shell terminal, **watch the incoming Kafka queue** (remember to navigate to the working directory first):

`docker-compose exec mids kafkacat -C -b kafka:29092 -t events -o beginning`

In [1]:
import json
from pyspark.sql import Row
from pyspark.sql.functions import udf

In [2]:
spark

In [1]:
raw_events = spark \
    .read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:29092") \
    .option("subscribe", "events") \
    .option("startingOffsets", "earliest") \
    .option("endingOffsets", "latest") \
    .load() 


In [4]:
raw_events.show()

+----+--------------------+------+---------+------+--------------------+-------------+
| key|               value| topic|partition|offset|           timestamp|timestampType|
+----+--------------------+------+---------+------+--------------------+-------------+
|null|[7B 22 65 76 65 6...|events|        0|     0|2021-12-07 03:06:...|            0|
|null|[7B 22 65 76 65 6...|events|        0|     1|2021-12-07 03:06:...|            0|
|null|[7B 22 65 76 65 6...|events|        0|     2|2021-12-07 03:06:...|            0|
|null|[7B 22 65 76 65 6...|events|        0|     3|2021-12-07 03:06:...|            0|
|null|[7B 22 65 76 65 6...|events|        0|     4|2021-12-07 03:06:...|            0|
|null|[7B 22 65 76 65 6...|events|        0|     5|2021-12-07 03:06:...|            0|
|null|[7B 22 65 76 65 6...|events|        0|     6|2021-12-07 03:06:...|            0|
|null|[7B 22 65 76 65 6...|events|        0|     7|2021-12-07 03:06:...|            0|
|null|[7B 22 65 76 65 6...|events|        0

In [5]:
raw_events.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)
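
The schema shows that Kafka delivers `value` as binary, which is why `show()` prints it as hex bytes. As a small sketch of what those bytes contain, the snippet below decodes the prefix visible in the table output and round-trips a complete message. The complete message is a hypothetical example, not one taken from our topic.

```python
import json

# First bytes of the `value` column as printed by raw_events.show().
prefix = bytes.fromhex("7B 22 65 76 65")
decoded_prefix = prefix.decode("utf-8")  # the start of a JSON object: {"eve

# A hypothetical complete message, encoded the way a JSON string would be.
message = json.dumps({"event_type": "check", "outcome": "safe"}).encode("utf-8")
event = json.loads(message.decode("utf-8"))
```

This is the reason the next step casts `value` to a string before parsing it as JSON.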



In [136]:
raw_events.count()

780

In [13]:
all_events = raw_events.select(raw_events.value.cast('string'))
all_events.show()

+--------------------+
|               value|
+--------------------+
|{"event_type": "a...|
|{"event_type": "c...|
|{"event_type": "c...|
|{"event_type": "c...|
|{"event_type": "c...|
|{"event_type": "c...|
|{"event_type": "c...|
|{"event_type": "c...|
|{"event_type": "c...|
|{"event_type": "c...|
|{"event_type": "c...|
|{"event_type": "c...|
|{"event_type": "c...|
|{"event_type": "c...|
|{"event_type": "c...|
|{"event_type": "c...|
|{"event_type": "c...|
|{"event_type": "c...|
|{"event_type": "c...|
|{"event_type": "c...|
+--------------------+
only showing top 20 rows



In [15]:
json.loads(all_events.collect()[6].value)

{'event_type': 'check',
 'neighboring_bombs': 35,
 'outcome': 'hit_mine',
 'session_id': 'ac796b7b-cf99-4aad-bb93-2c87a46e946a',
 'x_coord': 34,
 'y_coord': 98}

In [16]:
events_list= ['a_startup_event','check','flag','solution']

In [7]:
name = 'check'

@udf('boolean')
def test(event_as_json):
    # Keep only events whose event_type matches the notebook-level `name`.
    event = json.loads(event_as_json)
    return event['event_type'] == name

In [27]:
check_events = raw_events \
    .select(raw_events.value.cast('string').alias('stats'),
            raw_events.timestamp.cast('string')) \
    .filter(test('stats'))

In [29]:
extracted_check_events = check_events \
    .rdd \
    .map(lambda r: Row(timestamp=r.timestamp, **json.loads(r.stats))) \
    .toDF()
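
The filter and flatten steps above can be mirrored in plain Python, which may make the transformation easier to follow. This is only a sketch: the sample rows are hypothetical, `is_event_type` and `flatten` are names introduced for this example, and a dict stands in for the Spark `Row`.

```python
import json

# Hypothetical (timestamp, stats) pairs shaped like the rows of check_events.
rows = [
    ("2021-12-07 03:06:01",
     '{"event_type": "check", "outcome": "safe", "x_coord": 52, "y_coord": 37}'),
    ("2021-12-07 03:06:02",
     '{"event_type": "flag", "x_coord": 1, "y_coord": 2}'),
    ("2021-12-07 03:06:03",
     '{"event_type": "check", "outcome": "hit_mine", "x_coord": 35, "y_coord": 43}'),
]


def is_event_type(event_as_json, name):
    """Same predicate as the UDF: does the JSON payload have this event_type?"""
    return json.loads(event_as_json)["event_type"] == name


def flatten(timestamp, stats):
    """Mirror Row(timestamp=..., **json.loads(stats)) with a plain dict."""
    return {"timestamp": timestamp, **json.loads(stats)}


check_rows = [flatten(ts, stats) for ts, stats in rows if is_event_type(stats, "check")]

# Sorted alphabetically, matching the column order of the show() output.
columns = sorted(check_rows[0])
```

Each key of the JSON payload becomes its own column, so events of different types would produce different schemas; that is one reason to filter by `event_type` before flattening.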

In [30]:
extracted_check_events.show()

+----------+-----------------+--------+--------------------+--------------------+-------+-------+
|event_type|neighboring_bombs| outcome|          session_id|           timestamp|x_coord|y_coord|
+----------+-----------------+--------+--------------------+--------------------+-------+-------+
|     check|               38|hit_mine|ac796b7b-cf99-4aa...|2021-12-07 03:06:...|     29|     95|
|     check|               31|    safe|ac796b7b-cf99-4aa...|2021-12-07 03:06:...|     52|     37|
|     check|               45|hit_mine|ac796b7b-cf99-4aa...|2021-12-07 03:06:...|     35|     43|
|     check|               25|    safe|ac796b7b-cf99-4aa...|2021-12-07 03:06:...|     58|     57|
|     check|               12|    safe|ac796b7b-cf99-4aa...|2021-12-07 03:06:...|     76|     87|
|     check|               35|hit_mine|ac796b7b-cf99-4aa...|2021-12-07 03:06:...|     34|     98|
|     check|               14|    safe|ac796b7b-cf99-4aa...|2021-12-07 03:06:...|     56|     71|
|     check|        

In [31]:
extracted_check_events \
    .write \
    .mode('overwrite') \
    .parquet('/tmp/check_cell')

In [1]:
check_batch = spark.read.parquet('/tmp/check_cell')

In [2]:
# createOrReplaceTempView replaces the deprecated registerTempTable (Spark 2.0+).
check_batch.createOrReplaceTempView('check_event_table')

In [5]:
spark.sql("select * from check_event_table").toPandas()

| | event_type | neighboring_bombs | outcome | session_id | timestamp | x_coord | y_coord |
|---|---|---|---|---|---|---|---|
| 0 | check | 6 | safe | cd1bb2d0-539f-49f0-9df2-84ccfdf34afc | 2021-12-07 21:10:37.741 | 87 | 77 |
| 1 | check | 42 | safe | cd1bb2d0-539f-49f0-9df2-84ccfdf34afc | 2021-12-07 21:10:37.753 | 23 | 68 |
| 2 | check | 3 | safe | cd1bb2d0-539f-49f0-9df2-84ccfdf34afc | 2021-12-07 21:10:37.764 | 91 | 43 |
| 3 | check | 14 | hit_mine | cd1bb2d0-539f-49f0-9df2-84ccfdf34afc | 2021-12-07 21:10:37.777 | 75 | 4 |
| 4 | check | 6 | safe | cd1bb2d0-539f-49f0-9df2-84ccfdf34afc | 2021-12-07 21:10:37.787 | 90 | 68 |
| 5 | check | 38 | safe | cd1bb2d0-539f-49f0-9df2-84ccfdf34afc | 2021-12-07 21:10:37.799 | 36 | 61 |
| 6 | check | 39 | hit_mine | cd1bb2d0-539f-49f0-9df2-84ccfdf34afc | 2021-12-07 21:10:37.812 | 9 | 27 |
| 7 | check | 39 | safe | cd1bb2d0-539f-49f0-9df2-84ccfdf34afc | 2021-12-07 21:10:37.828 | 17 | 57 |
| 8 | check | 31 | hit_mine | cd1bb2d0-539f-49f0-9df2-84ccfdf34afc | 2021-12-07 21:10:37.84 | 41 | 70 |
| 9 | check | 2 | hit_mine | cd1bb2d0-539f-49f0-9df2-84ccfdf34afc | 2021-12-07 21:10:37.851 | 89 | 90 |
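
Once the table is registered, aggregate queries follow the same pattern as the `select *` above. The sketch below runs one such query over the ten rows shown, using the standard-library `sqlite3` module as a stand-in for Spark SQL so that it runs without a cluster. The choice of query (counting outcomes) is our illustration, not a result reported elsewhere in this document.

```python
import sqlite3

# The ten rows of the query result above, reduced to the columns used here.
rows = [
    ("safe", 87, 77), ("safe", 23, 68), ("safe", 91, 43), ("hit_mine", 75, 4),
    ("safe", 90, 68), ("safe", 36, 61), ("hit_mine", 9, 27), ("safe", 17, 57),
    ("hit_mine", 41, 70), ("hit_mine", 89, 90),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE check_event_table (outcome TEXT, x_coord INTEGER, y_coord INTEGER)"
)
conn.executemany("INSERT INTO check_event_table VALUES (?, ?, ?)", rows)

# Count how many checks were safe and how many hit a mine.
outcome_counts = dict(
    conn.execute(
        "SELECT outcome, COUNT(*) FROM check_event_table GROUP BY outcome"
    ).fetchall()
)
conn.close()
```

The same `SELECT ... GROUP BY` statement could be passed to `spark.sql(...)` against `check_event_table`.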


In [2]:
check = spark.read.parquet('/tmp/check_stream_data')

In [3]:
check.show()

+---------+---------+---+-----+-----+---------+------+
|raw_event|timestamp|key|value|topic|partition|offset|
+---------+---------+---+-----+-----+---------+------+
+---------+---------+---+-----+-----+---------+------+

