# Week 11

#### copy in the docker-compose.yml and game_api.py files

`cp ~/w205/course-content/11-Storing-Data-III/docker-compose.yml .`

`cp ~/w205/course-content/11-Storing-Data-III/game_api.py .`

#### bring up cluster

`docker-compose up -d` 

`docker ps -a`

#### create a w205 symlink in the spark container so Jupyter can reach /w205

`docker-compose exec spark bash`

`ln -s /w205 w205`

`exit`

#### create topic
`docker-compose exec kafka kafka-topics --create --topic events --partitions 1 --replication-factor 1 --if-not-exists --zookeeper zookeeper:32181`

#### run flask, then send game events with curl from a second terminal

`docker-compose exec mids env FLASK_APP=/w205/project-3-frankbruni/game_api.py flask run --host 0.0.0.0`

`docker-compose exec mids curl http://localhost:5000/`

`docker-compose exec mids curl http://localhost:5000/purchase_a_sword`

`docker-compose exec mids curl http://localhost:5000/buy_a_sword`

`docker-compose exec mids curl http://localhost:5000/join_guild`

#### open jupyter notebook

`docker-compose exec spark env PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port 8888 --ip 0.0.0.0 --allow-root' pyspark`

In [1]:
import json
from pyspark.sql import Row
from pyspark.sql.functions import udf

In [3]:
@udf('string')
def munge_event(event_as_json):
    # Parse the raw JSON, overwrite two header fields, and re-serialize.
    event = json.loads(event_as_json)
    event['Host'] = "moe"
    event['Cache-Control'] = "no-cache"
    return json.dumps(event)
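The UDF above wraps an ordinary Python function, so its effect on a single message can be checked without Spark. A minimal sketch, assuming a made-up sample payload (`sample` and `munge_event_plain` are hypothetical names for illustration, not part of the notebook):

```python
import json

def munge_event_plain(event_as_json):
    """Same body as munge_event, without the @udf wrapper."""
    event = json.loads(event_as_json)
    event['Host'] = "moe"
    event['Cache-Control'] = "no-cache"
    return json.dumps(event)

# Hypothetical payload; the real events come from game_api.py via Kafka.
sample = json.dumps({"Host": "localhost:5000", "event_type": "default"})
munged = json.loads(munge_event_plain(sample))
print(munged)
```

Keys that the function does not touch (here `event_type`) pass through unchanged.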

In [4]:
raw_events = spark \
        .read \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "kafka:29092") \
        .option("subscribe", "events") \
        .option("startingOffsets", "earliest") \
        .option("endingOffsets", "latest") \
        .load()

In [6]:
raw_events.show()

+----+--------------------+------+---------+------+--------------------+-------------+
| key|               value| topic|partition|offset|           timestamp|timestampType|
+----+--------------------+------+---------+------+--------------------+-------------+
|null|[7B 22 48 6F 73 7...|events|        0|     0|2020-12-06 00:29:...|            0|
|null|[7B 22 48 6F 73 7...|events|        0|     1|2020-12-06 00:29:...|            0|
|null|[7B 22 48 6F 73 7...|events|        0|     2|2020-12-06 00:29:...|            0|
|null|[7B 22 41 63 63 6...|events|        0|     3|2020-12-06 00:29:...|            0|
+----+--------------------+------+---------+------+--------------------+-------------+



In [5]:
munged_events = raw_events \
        .select(raw_events.value.cast('string').alias('raw'),
                raw_events.timestamp.cast('string')) \
        .withColumn('munged', munge_event('raw'))

In [9]:
munged_events.show()

+--------------------+--------------------+--------------------+
|                 raw|           timestamp|              munged|
+--------------------+--------------------+--------------------+
|{"Host": "localho...|2020-12-06 00:29:...|{"Host": "moe", "...|
|{"Host": "localho...|2020-12-06 00:29:...|{"Host": "moe", "...|
|{"Host": "localho...|2020-12-06 00:29:...|{"Host": "moe", "...|
|{"Accept": "*/*",...|2020-12-06 00:29:...|{"Accept": "*/*",...|
+--------------------+--------------------+--------------------+



In [10]:
# Turn each munged JSON string into its own columns, keeping the Kafka timestamp.
extracted_events = munged_events \
        .rdd \
        .map(lambda r: Row(timestamp=r.timestamp, **json.loads(r.munged))) \
        .toDF()

In [12]:
sword_purchases = extracted_events \
        .filter(extracted_events.event_type == 'purchase_sword')
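The extract and filter steps can be pictured in plain Python. This is only a sketch of the idea, assuming two made-up rows (`rows` is hypothetical and stands in for `munged_events`); it uses dicts and list comprehensions, not Spark:

```python
import json

# Hypothetical (timestamp, munged JSON) pairs standing in for munged_events rows.
rows = [
    ("2020-12-06 00:29:01", '{"Host": "moe", "event_type": "default"}'),
    ("2020-12-06 00:29:02", '{"Host": "moe", "event_type": "purchase_sword"}'),
]

# Analogue of Row(timestamp=r.timestamp, **json.loads(r.munged)):
# every JSON key becomes its own field next to the timestamp.
extracted = [dict(timestamp=ts, **json.loads(munged)) for ts, munged in rows]

# Analogue of .filter(extracted_events.event_type == 'purchase_sword').
sword_purchases = [e for e in extracted if e["event_type"] == "purchase_sword"]
print(sword_purchases)
```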


#### bring down cluster
`docker-compose down`

`docker ps -a`

# Week 12

#### copy in the week 12 yml and edit the week 11 game file to add metadata (no need to copy game_api.py again)

`cp ~/w205/course-content/12-Querying-Data-II/docker-compose.yml .`


#### bring up cluster

`docker-compose up -d` 

`docker ps -a`

#### create a w205 symlink in the spark container so Jupyter can reach /w205

`docker-compose exec spark bash`

`ln -s /w205 w205`

`exit`

#### create topic
`docker-compose exec kafka kafka-topics --create --topic events --partitions 1 --replication-factor 1 --if-not-exists --zookeeper zookeeper:32181`

#### run flask, then generate game events with apache bench from a second terminal

`docker-compose exec mids env FLASK_APP=/w205/project-3-frankbruni/game_api.py flask run --host 0.0.0.0`

`docker-compose exec mids ab -n 10 -H "Host: user1.comcast.com" http://localhost:5000/`

`docker-compose exec mids ab -n 10 -H "Host: user1.comcast.com" http://localhost:5000/purchase_a_sword`

`docker-compose exec mids ab -n 10 -H "Host: user1.comcast.com" http://localhost:5000/buy_a_sword`

`docker-compose exec mids ab -n 10 -H "Host: user1.comcast.com" http://localhost:5000/join_guild`


`docker-compose exec mids ab -n 10 -H "Host: user2.att.com" http://localhost:5000/`

`docker-compose exec mids ab -n 10 -H "Host: user2.att.com" http://localhost:5000/purchase_a_sword`

`docker-compose exec mids ab -n 10 -H "Host: user2.att.com" http://localhost:5000/buy_a_sword`

`docker-compose exec mids ab -n 10 -H "Host: user2.att.com" http://localhost:5000/join_guild`

#### open jupyter notebook

`docker-compose exec spark env PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port 8888 --ip 0.0.0.0 --allow-root' pyspark`

#### code from filtered_writes.py

In [1]:
import json
from pyspark.sql import Row
from pyspark.sql.functions import udf

In [2]:
@udf('boolean')
def is_purchase(event_as_json):
    # True only for sword purchase events; used below as a row filter.
    event = json.loads(event_as_json)
    return event['event_type'] == 'purchase_sword'
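The predicate above raises if a message is not valid JSON or has no `event_type` key. A defensive variant, sketched in plain Python under the assumption that such messages should simply be filtered out (`is_purchase_plain` is a hypothetical name; wrapping it with `@udf('boolean')` would be the next step):

```python
import json

def is_purchase_plain(event_as_json):
    """Return True only for well-formed purchase_sword events."""
    try:
        event = json.loads(event_as_json)
    except (TypeError, ValueError):
        # None input or malformed JSON: treat as "not a purchase".
        return False
    return isinstance(event, dict) and event.get('event_type') == 'purchase_sword'

print(is_purchase_plain('{"event_type": "purchase_sword"}'))
print(is_purchase_plain('{"Host": "user1.comcast.com"}'))
print(is_purchase_plain('not json'))
```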

In [3]:
raw_events = spark \
        .read \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "kafka:29092") \
        .option("subscribe", "events") \
        .option("startingOffsets", "earliest") \
        .option("endingOffsets", "latest") \
        .load()

In [4]:
purchase_events = raw_events \
        .select(raw_events.value.cast('string').alias('raw'),
                raw_events.timestamp.cast('string')) \
        .filter(is_purchase('raw'))

In [5]:
extracted_purchase_events = purchase_events \
        .rdd \
        .map(lambda r: Row(timestamp=r.timestamp, **json.loads(r.raw))) \
        .toDF()

In [6]:
extracted_purchase_events.printSchema()

root
 |-- Accept: string (nullable = true)
 |-- Host: string (nullable = true)
 |-- User-Agent: string (nullable = true)
 |-- event_type: string (nullable = true)
 |-- sword_type: string (nullable = true)
 |-- timestamp: string (nullable = true)



In [7]:
extracted_purchase_events.show()

+------+-----------------+---------------+--------------+----------+--------------------+
|Accept|             Host|     User-Agent|    event_type|sword_type|           timestamp|
+------+-----------------+---------------+--------------+----------+--------------------+
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|us

In [8]:
extracted_purchase_events \
        .write \
        .mode('overwrite') \
        .parquet('/tmp/purchases')

#### pyspark code: read the parquet back and query it with spark sql

In [9]:
purchases = spark.read.parquet('/tmp/purchases')

In [10]:
purchases.show()

+------+-----------------+---------------+--------------+----------+--------------------+
|Accept|             Host|     User-Agent|    event_type|sword_type|           timestamp|
+------+-----------------+---------------+--------------+----------+--------------------+
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|us

In [11]:
# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it.
purchases.createOrReplaceTempView('purchases')

In [12]:
purchases_by_example2 = spark.sql("select * from purchases where Host = 'user1.comcast.com'")

In [13]:
purchases_by_example2.show()

+------+-----------------+---------------+--------------+----------+--------------------+
|Accept|             Host|     User-Agent|    event_type|sword_type|           timestamp|
+------+-----------------+---------------+--------------+----------+--------------------+
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|   knights|2020-12-06 01:12:...|
|   */*|us

In [15]:
df = purchases_by_example2.toPandas()

df

   Accept               Host       User-Agent      event_type  sword_type                timestamp
0     */*  user1.comcast.com  ApacheBench/2.3  purchase_sword     knights  2020-12-06 01:12:03.812
1     */*  user1.comcast.com  ApacheBench/2.3  purchase_sword     knights  2020-12-06 01:12:03.817
2     */*  user1.comcast.com  ApacheBench/2.3  purchase_sword     knights  2020-12-06 01:12:03.822
3     */*  user1.comcast.com  ApacheBench/2.3  purchase_sword     knights  2020-12-06 01:12:03.824
4     */*  user1.comcast.com  ApacheBench/2.3  purchase_sword     knights  2020-12-06 01:12:03.827
5     */*  user1.comcast.com  ApacheBench/2.3  purchase_sword     knights  2020-12-06 01:12:03.832
6     */*  user1.comcast.com  ApacheBench/2.3  purchase_sword     knights  2020-12-06 01:12:03.835
7     */*  user1.comcast.com  ApacheBench/2.3  purchase_sword     knights  2020-12-06 01:12:03.837
8     */*  user1.comcast.com  ApacheBench/2.3  purchase_sword     knights  2020-12-06 01:12:03.839
9     */*  user1.comcast.com  ApacheBench/2.3  purchase_sword     knights  2020-12-06 01:12:03.842


In [16]:
df.describe()

        Accept               Host       User-Agent      event_type  sword_type                timestamp
count       10                 10               10              10          10                       10
unique       1                  1                1               1           1                       10
top        */*  user1.comcast.com  ApacheBench/2.3  purchase_sword     knights  2020-12-06 01:12:03.839
freq        10                 10               10              10          10                        1


### Analytics

##### What percent of sword purchases are the knights type?

In [29]:
query = spark.sql("select count(*) as knights from purchases where sword_type = 'knights'")
number_of_knights = query.toPandas()

query = spark.sql("select count(*) as total_swords from purchases")
total_swords = query.toPandas()

print('The percent of swords that are Knight type is {0} %'.format((number_of_knights['knights'][0] / total_swords['total_swords'][0])*100))

The percent of swords that are Knight type is 100.0 %
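The two counts above can also be folded into a single query. The sketch below runs that query against an in-memory SQLite table as a stand-in for the `purchases` view, with made-up rows (not the data from this run). The SQL uses only `case when`, `sum`, and `count`, so it should also be accepted by `spark.sql`, though that is not verified here:

```python
import sqlite3

# Hypothetical (Host, sword_type) rows; "longsword" is invented for illustration.
rows = [
    ("user1.comcast.com", "knights"),
    ("user1.comcast.com", "knights"),
    ("user2.att.com", "knights"),
    ("user2.att.com", "longsword"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table purchases (Host text, sword_type text)")
conn.executemany("insert into purchases values (?, ?)", rows)

# One pass over the table: conditional sum divided by the total row count.
pct_query = """
    select 100.0 * sum(case when sword_type = 'knights' then 1 else 0 end) / count(*)
           as pct_knights
    from purchases
"""
pct_knights = conn.execute(pct_query).fetchone()[0]
print('The percent of swords that are Knight type is {0} %'.format(pct_knights))
conn.close()
```

Multiplying by `100.0` first keeps the division in floating point rather than integer arithmetic.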


##### How many different hosts are there?

In [41]:
query = spark.sql("select count(distinct(Host)) as number_of_hosts from purchases")
number_of_hosts = query.toPandas()

print('There are {0} unique hosts'.format(number_of_hosts['number_of_hosts'][0]))

There are 2 unique hosts


##### Show the number of Knights for each Host

In [45]:
query = spark.sql("select * from purchases where sword_type = 'knights'")
knights = query.toPandas()

In [57]:
knights['Host'].value_counts()

user1.comcast.com    10
user2.att.com        10
Name: Host, dtype: int64
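The per-host count can also be computed in SQL with `group by` instead of pulling every row into pandas. A minimal sketch, again using an in-memory SQLite table with made-up rows as a stand-in for the `purchases` view; the query is plain SQL and is assumed, not verified, to run unchanged under `spark.sql`:

```python
import sqlite3

# Hypothetical (Host, sword_type) rows; "longsword" is invented for illustration.
rows = [
    ("user1.comcast.com", "knights"),
    ("user1.comcast.com", "knights"),
    ("user2.att.com", "knights"),
    ("user2.att.com", "longsword"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table purchases (Host text, sword_type text)")
conn.executemany("insert into purchases values (?, ?)", rows)

# Count knights purchases per host; the aggregation happens in the query itself.
host_counts = conn.execute(
    "select Host, count(*) as knights from purchases "
    "where sword_type = 'knights' group by Host order by Host"
).fetchall()
for host, knights in host_counts:
    print(host, knights)
conn.close()
```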

#### bring down cluster
`docker-compose down`

`docker ps -a`