# Project 3: Understanding User Behavior - End to End Pipeline
### Carlos Moreno

#### Group Members: kris.junghee.lee and Lily S.

### W205 - Fall 2021


#### **This project is organized in the following sections:**

> #### I. Project Description  
> #### II. Data Pipeline Architecture  
> #### III. Steps for Implementing End-to-End Pipeline  
> #### IV. Business Questions  
> #### V. Appendix  

-------------------------------------------------------------------------------------------

## I. Project Description
You are a data scientist at a game development company. Your latest mobile game generates several events that the company is interested in tracking. The events are as follows:

> 1. `buy a sword` 
> 2. `buy a knife`
> 3. `buy a shield`
> 4. `join guild - The_Avengers, Game_of_Thrones, Justice_League`
> 5. `fight - users may engage in a fight with dragons, this event captures if the user won or lost the fight`

**Each event has the following metadata:**

(1) `buy a sword`;  
- userid: provided by the user generating the event.
- event_type: buy a sword
- name: excalibur
- strength: 1000 points
- number of purchases = provided by the user (integer)
- price = 2000 credits

(2) `buy a knife`;  
- userid: provided by the user generating the event.
- event_type: buy a knife
- name: kukri
- strength: 500 points
- number of purchases = provided by the user (integer)
- price = 1000 credits

(3) `buy a shield`;  
- userid: provided by the user generating the event.
- event_type: buy a shield
- name: parma
- strength: 800 points
- number of purchases = provided by the user (integer)
- price = 1500 credits

(4) `join guild`;  
- userid: provided by the user generating the event.
- event_type: Join Guild
- name: one of three guilds to join: The_Avengers, Game_of_Thrones, Justice_League
- strength: 1500 points
- number of purchases = provided by the user (integer)
- price = 2000 credits

(5) `fight event`;  
- userid: provided by the user generating the event.
- event_type: fight event
- score: -10 points if the user lost, 100 points if the user won
- win_status: won or lost

## Purpose
The purpose of this project is to build an end-to-end data pipeline. In this pipeline, a mobile device runs a game that generates events. The events are published to a Kafka topic, and a streaming application consumes them and writes them to files. Once the events are stored, we read the files as tables, analyze the data, and answer relevant business and game-related questions.

**Note:** The data is synthetic. It is generated with Apache Bench by the script "game_event_generator.sh".

### **Key Business Questions**

### Purpose: To understand the favorite item of customers

> ##### Q1. Which item is most popular, and how much revenue does it generate?   
> ##### Q2. Which item do users purchase first? (Most interested item)  
> ##### Q3. On average, how many items does a player purchase? (to understand item preference)

### Purpose: To understand heavy users

> ##### Q4. On average, how much money do players spend on items?  
> ##### Q5. Which customers spent the most money on items (top 5)?  
> ##### Q6. Who are the top 3 players who spent the most money on items (sword, knife, shield) and guilds?  
> ##### Q7. Who are the top 3 players with the most power? (measured as the sum of the strength of a player's items and guild memberships)  
> ##### Q8. Who are the top 5 players who fought the most dragons?  

### Purpose: To understand the fight event status

> ##### Q9. What is the average score for all players?  
> ##### Q10. Who are the top 5 players with the highest fighting score?  
> ##### Q11. Who has the lowest score, and what is it?  
> ##### Q12. On average, how many times does a player fight a dragon?  
> ##### Q13. What is the winning rate for each player?  
> ##### Q14. Who are the top three players with the most wins?  

### Understanding Guilds

> ##### Q15. What is the most popular guild (by number of members)?  


## Deliverables

> - docker-compose.yml file (**Appendix A**)

> - game_api.py file (**Appendix B**)

> - write_events_stream.py file (**Appendix C**)

> - game_event_generator.sh (**Appendix D**)

> - Jupyter Notebook (**this notebook**)

# II. Data Pipeline Architecture

The pipeline structure being implemented includes the following four steps:

> **Step 1** Use Apache Bench to generate game events (purchase, guild, fight events). Generated events hit endpoints in a Flask application (game_api.py).

> **Step 2** Publish the events to Kafka: as events reach "game_api.py", they are logged to a Kafka topic (`events`).

> **Step 3** Spark Streaming filters select event types from Kafka and lands them on HDFS (write_events_stream.py).

> **Step 4** Query the HDFS tables using the Presto query engine as well as PySpark to answer business questions.

The following sections present the specific steps and commands used to implement the pipeline.


# III. Steps for Implementing End-to-End Pipeline

## Step 1: Use Apache Bench to generate events (purchase, guild & fight events). 

Generated events hit endpoints in a Flask application (game_api.py).

> **(1). Activate Docker Compose Cluster:**
```bash
> docker-compose up -d
```

> **(2). Run Game Application:**
```bash
> docker-compose exec mids env FLASK_APP=/w205/project-3-cmorenoUCB2021/game_api.py flask run --host 0.0.0.0
```

> **(3). Set up to Watch Kafka:** 
This lets you observe how messages are being captured. Open a new terminal and run the following (run it twice; the first run creates the topic):
```bash
> docker-compose exec mids kafkacat -C -b kafka:29092 -t events -o beginning
```

> **(4). Use Apache Bench to generate data:**  

> To generate events, we created a shell script called "**game_event_generator.sh**". The script is listed in the Appendix of this document. It takes three required arguments and one optional argument. The required arguments are as follows:
- **u** = Number of users (integer > 1). Use this argument to define the number of users that the game has.
- **e** = Number of endpoints (always set to 5). This argument indicates the number of endpoints used in the game. The game has five endpoints: purchase_a_sword, purchase_a_knife, purchase_a_shield, join_guild, and fight_event.
- **n** = Total number of requests (integer > 1). Use this argument to indicate how many events should be generated.

> The script includes one optional argument (**b**). When **-b** is included, the script calls the game_api using **"Apache Bench"**. When **-b** is omitted, the script calls the game_api using **"curl"**.

> The following lines present two examples of calls to the "game_event_generator.sh" script:

>> **(a) Example of a call including option for Apache Bench:**
```bash
> bash game_event_generator.sh -u 20 -e 5 -n 100 -b
```

>> **(b) Example of call without option for Apache Bench:**
```bash
> bash game_event_generator.sh -u 20 -e 5 -n 100
```
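As a rough illustration of what one run of the generator produces, the sketch below builds `n` request URLs spread over `u` users and the five endpoints. It is not the script from the Appendix: the base URL (`http://localhost:5000`), the URL paths derived from the endpoint names, the `user-00<i>` id pattern, and the helper `build_requests` are all assumptions made for this example. No requests are actually sent.

```python
import random

# Endpoint names as listed above; the URL paths are assumed to match them.
ENDPOINTS = ["purchase_a_sword", "purchase_a_knife", "purchase_a_shield",
             "join_guild", "fight_event"]

def build_requests(n_users, n_requests, base_url="http://localhost:5000", seed=0):
    """Return n_requests URLs, each for a random user and a random endpoint."""
    rng = random.Random(seed)  # seeded so repeated runs give the same URLs
    urls = []
    for _ in range(n_requests):
        user = "user-00{}".format(rng.randint(1, n_users))  # assumed id pattern
        endpoint = rng.choice(ENDPOINTS)
        urls.append("{}/{}/?userid={}".format(base_url, endpoint, user))
    return urls

# Mirrors the example call: -u 20 -n 100
sample_urls = build_requests(n_users=20, n_requests=100)
print(sample_urls[0])
```

Each URL could then be passed to Apache Bench or curl, which is the choice the `-b` flag controls.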

## Step 2: Consume events using Kafka

The game events that reach the game_api are published to Kafka: as events hit "game_api.py", they are logged to a Kafka topic (`events`).

The following sections present the structure of the game_api.py application, which reads the requests from the game (game_event_generator.sh) and logs the information to a Kafka topic. The structure of the code is as follows:

**(a)** Import the required libraries, and set-up the variables.
```python
#!/usr/bin/env python
import json
from kafka import KafkaProducer
from flask import Flask, request

app = Flask(__name__)
producer = KafkaProducer(bootstrap_servers='kafka:29092')

```

**(b)** Define the routine that logs events to Kafka (to the topic called `events`).
```python
def log_to_kafka(topic, event):
    event.update(request.headers)
    producer.send(topic, json.dumps(event).encode())
    
```
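To make the payload concrete, the sketch below reproduces the two steps of `log_to_kafka` without Flask or Kafka: merge the request headers into the event, then serialize the result to JSON bytes. The header values and the helper name `build_payload` are made up for illustration.

```python
import json

def build_payload(event, headers):
    """Merge request headers into the event and encode it as JSON bytes."""
    merged = dict(event)      # copy so the caller's dict is not modified
    merged.update(headers)    # same merge that log_to_kafka performs
    return json.dumps(merged).encode()

# Illustrative values only
example_headers = {"Host": "user-001.comcast.com", "Accept": "*/*"}
example_event = {"userid": "user-001", "event_type": "fight_event",
                 "win_status": "won", "score": 100}

payload = build_payload(example_event, example_headers)
decoded = json.loads(payload.decode())
print(sorted(decoded))
```

This is why header fields such as `Host` and `Accept` appear alongside the event fields in the schemas used later by the streaming job.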

**(c)** Define the routines (functions) that dictate behavior depending on the API route being hit. The API endpoints are the following:
- default_response
- purchase_a_sword
- purchase_a_knife
- purchase_a_shield
- join_a_guild
- fight_event

The following code shows the **fight_event** API endpoint as an example:

```python    
@app.route("/fight_event/", methods=['POST','GET'])
def fight_event():
    """
    @function: This function generates a fight event from a user mobile device request or Apache Bench
    @param: User Request (via URL endpoint) 
    @return: Returns string of User Id and Event 
    """
    userid = request.args.get('userid', default='001', type=str)
    win_status = request.args.get('win_status', default="'won'", type=str)
    n = request.args.get('n',default=1,type=int)

    if win_status == "'lost'":
        score = -10
    else: 
        score = 100

    win_status = win_status.replace("'", '')
        
    fight_event = {'userid': userid,
                   'event_type': 'fight_event',
                   'win_status': win_status,
                   'score': score}
    log_to_kafka('events', fight_event)
    return "fight event - you:" + " " + win_status + " " + "Score: " + str(score) + "\n"
```
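The scoring rule can be checked in isolation. The helper `score_fight` below is a hypothetical refactoring of the logic inside `fight_event`, not part of game_api.py; it keeps the convention that `win_status` arrives wrapped in single quotes.

```python
def score_fight(win_status):
    """Return (clean_status, score) for a quoted win_status such as "'won'"."""
    # -10 points for a loss, 100 points otherwise, as in fight_event
    score = -10 if win_status == "'lost'" else 100
    return win_status.replace("'", ""), score

print(score_fight("'lost'"))
print(score_fight("'won'"))
```

Note that any value other than `'lost'` is scored as a win, which matches the `else` branch of the endpoint.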

## Step 3: Spark Streaming/Save filtered data

Spark Streaming filters select event types from Kafka and land them on HDFS (write_events_stream.py)

#### **(5). Run Application to Read Messages from Kafka and Write Them to HDFS:** 
In a separate terminal, run the streaming application, which reads the messages for the key game events from Kafka:  

```
> docker-compose exec spark spark-submit /w205/project-3-cmorenoUCB2021/write_events_stream.py
```

The streaming application (**write_events_stream.py**) has the following structure:

> **(a) Define the schema for the events:** the example below shows `purchase_events_schema`.  
```python
def purchase_events_schema():
    """
    @function: This function provides the table schema for purchase events (knife, sword, shield)
    @param: None 
    @return: Returns the table schema for purchase events
    """  
    return StructType(
    [
        StructField('Accept', StringType(), True),
        StructField('Host', StringType(), True),
        StructField('User-Agent', StringType(), True),
        StructField('price', StringType(), True),
        StructField('n_purchased', LongType(), True),
        StructField('strength', StringType(), True),
        StructField('name', StringType(), True),
        StructField('event_type', StringType(), True),
        StructField('userid', StringType(), True)
    ]
)
```
  
> **(b) Create a SparkSession:**   
```python
    spark = SparkSession \
        .builder \
        .appName("ExtractEventsJob") \
        .getOrCreate()
```  
  
> **(c) Read the Raw Events (from Kafka topic):**   
```python
    raw_events = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "kafka:29092") \
        .option("subscribe", "events") \
        .load()
```
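> **(d) Filter Events for a specific Event:** the following code filters the events related to item purchases. The `is_purchase` UDF is listed after the query; in the script it must be defined before the query runs. 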
```python
    purchases = raw_events \
        .filter(is_purchase(raw_events.value.cast('string'))) \
        .select(raw_events.value.cast('string').alias('raw_event'),
                raw_events.timestamp.cast('string'),
                from_json(raw_events.value.cast('string'),
                          purchase_events_schema()).alias('json')) \
        .select('timestamp', 'json.*') \
        .select( \
                  F.col('timestamp').alias('event_ts') \
                 ,F.col('userid') \
                 ,F.col('Host') \
                 ,F.col('event_type') \
                 ,F.col('name') \
                 ,F.col('strength') \
                 ,F.col('n_purchased') \
                 ,F.col('price') \
                ) \
        .distinct()    
# FUNCTION USED TO FILTER EVENTS RELATED TO PURCHASE OF OBJECTS (SWORD, KNIFE, SHIELD)
@udf('boolean')
def is_purchase(event_as_json):
    """
    @function: This function uses a json to filter out records by purchase event type (knife, sword, shield)
    @param: Takes in extracted json data as a string
    @return: Returns a boolean value
    """    
    event = json.loads(event_as_json)
    if 'purchase' in event['event_type']:
        return True
    return False
```
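Outside Spark, the predicate is an ordinary function on a JSON string. The sketch below applies the same test to three hand-written messages; `is_purchase_plain` is a name introduced here, and the sample messages are illustrative.

```python
import json

def is_purchase_plain(event_as_json):
    """Same check as the UDF body: True when event_type contains 'purchase'."""
    event = json.loads(event_as_json)
    return 'purchase' in event['event_type']

# Illustrative messages, one per kind of event
messages = [
    '{"userid": "user-001", "event_type": "purchase_knife"}',
    '{"userid": "user-007", "event_type": "join_guild"}',
    '{"userid": "user-0010", "event_type": "fight_event"}',
]

kept = [m for m in messages if is_purchase_plain(m)]
print(len(kept))
```

Only the purchase message passes, which is the behavior the streaming filter relies on to separate purchase events from guild and fight events.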

> **(e) Write filtered events to file:** the following code writes the filtered purchase events as Parquet files to **"/tmp/purchase_events"**. 
```python
    sink_purchases = purchases \
        .writeStream \
        .format("parquet") \
        .option("checkpointLocation", "/tmp/checkpoints_for_purchase_events") \
        .option("path", "/tmp/purchase_events") \
        .trigger(processingTime="10 seconds") \
        .start()
```

#### **(6). Check what was written in Hadoop:** 

The following commands list what was written for each event type (purchase, guild, and fight events) as a sanity check.
```bash
> docker-compose exec cloudera hadoop fs -ls /tmp/purchase_events  
> docker-compose exec cloudera hadoop fs -ls /tmp/guild_events  
> docker-compose exec cloudera hadoop fs -ls /tmp/fight_events  
```


## Step 4: Query HDFS tables using the Presto query engine

#### **(7). Create table schemas in Hive:**

> **(a).** Run Hive in the Hadoop container:
```bash
> docker-compose exec cloudera hive
```

> **(b).** Create the tables (schemas) in Hive.
> **purchase_events**
```sql
create external table if not exists default.purchase_events (
    event_ts string,
    userid string,
    Host string,
    event_type string,
    name string,
    strength int,
    n_purchased int,
    price int
  )
  stored as parquet 
  location '/tmp/purchase_events'
  tblproperties ("parquet.compress"="SNAPPY");
```
```sql
create external table if not exists default.purchase_events (event_ts string, userid string, Host string, event_type string, name string, strength int, n_purchased int, price int) stored as parquet location '/tmp/purchase_events' tblproperties ("parquet.compress"="SNAPPY");
```

> **guild_events**  
```sql
create external table if not exists default.guild_events (
    event_ts string,
    userid string,
    Host string,
    event_type string,
    name string,
    strength int,
    n_purchased int,
    price int
  )
  stored as parquet 
  location '/tmp/guild_events'
  tblproperties ("parquet.compress"="SNAPPY");
```
```sql
create external table if not exists default.guild_events (event_ts string, userid string, Host string, event_type string, name string, strength int, n_purchased int, price int) stored as parquet location '/tmp/guild_events' tblproperties ("parquet.compress"="SNAPPY");
```

> **fight_events**  
```sql
create external table if not exists default.fight_events (
    event_ts string,
    userid string,
    Host string,
    event_type string,
    score int,
    win_status string
  )
  stored as parquet 
  location '/tmp/fight_events'
  tblproperties ("parquet.compress"="SNAPPY");
```
```sql
create external table if not exists default.fight_events (event_ts string, userid string, Host string, event_type string, score int, win_status string) stored as parquet location '/tmp/fight_events' tblproperties ("parquet.compress"="SNAPPY");
```

**Note:** `ctrl-D` to exit the hive shell.

#### **(8). Query Tables with Presto:**  

&nbsp;&nbsp;&nbsp;**(a). Run Presto:**

```bash
> docker-compose exec presto presto --server presto:8080 --catalog hive --schema default
```

&nbsp;&nbsp;&nbsp;**(b). Examples of Queries with Presto:**

&nbsp;&nbsp;&nbsp;&nbsp;-What tables are there in Presto?

```bash
> presto:default> show tables;
```

&nbsp;&nbsp;&nbsp;&nbsp;-Describe the tables:
```bash
> presto:default> describe purchase_events;
> presto:default> describe guild_events;
> presto:default> describe fight_events;
```

&nbsp;&nbsp;&nbsp;&nbsp;-Query the `purchase_events` and `guild_events` tables:  
```bash
> presto:default> select * from purchase_events;
> presto:default> select * from guild_events;
```
&nbsp;&nbsp;&nbsp;&nbsp;-Count the number of events in the `purchase_events` and `guild_events` tables:  
```bash
> presto:default> select count(*) from purchase_events;
> presto:default> select count(*) from guild_events;
```

#### **(9). Run Queries Using PySpark:**
- **(a) Create a symbolic link to /w205 in the Spark container so the directory is available when running Jupyter Notebook:**

```bash
> docker-compose exec spark ln -s /w205 w205
```

- **(b) Run Jupyter Notebook in Google Cloud to read kafka topic and explore data:**

```bash
> docker-compose exec spark env PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port 7000 --ip 0.0.0.0 --allow-root' pyspark
```

- **(c) Copy the URL with the token printed by Jupyter and open it using the address of your notebook instance in Google Cloud:** For example:

```bash
> http://34.139.108.62:7000/?token=7f33a1a57a80d7a1fe52c607a1670c9144baa0e1f97d953b

```
Replace 0.0.0.0 in the URL printed by Jupyter with the external IP address of your Google Cloud instance.


# IV. Business Questions:

### Business Recommendation:

Although the data generated by the game_event_generator script is synthetic, we would like to provide examples of the type of analytics that may be relevant for our business partners. As a game company, we are interested in improving the "Customer Experience", which leads to satisfied, engaged customers who are willing to invest in in-game purchases and activities.

#### How to improve revenue for game company? 

In our effort to improve the "Player Experience", we want to make sure that players are satisfied with the items being offered. We would like to offer more of what users want and rework items that have less demand. Based on current analytics, the knife is the most popular item by number of units purchased (1,343), while the sword generates the most revenue (408,000 credits) because it is the most expensive item. Players are willing to invest in swords, as they help them win fights, which improves their score. We should consider offering additional types of swords. This is supported by the fact that, for most players, the first item purchased is the sword. On average, a player has purchased 179 items and spent 45,900 credits on items so far. Therefore, if we expand the lineup of swords and other items with additional features, we can expect more purchases.
    

####  Customer Engagement -  How to motivate engagement? 

We would like to identify customers who are engaged in fighting and earning points. We seek to reward users who win the most battles with dragons and those with the best record (wins versus losses), as reflected in the total score. Customers with the top scores will be offered special perks, such as promotion within the ranks of players (e.g., Elite Warrior) and unlocking special fighting items such as new swords, shields, and knives. From this perspective, we have identified the following customers who are eligible for these perks:

> - Users with the best scores: user-006 (580), user-0010 (460), and user-004 (390) -> Move to next level, unlock special sword.

> - Users who spent the most money: user-003 (20500), user-001 (19500), user-009 (60500) -> Offer coupons

> - Users with the most power (based on items that they own): user-0012 (74300), user-006 (61000), user-0013 (53800) -> Give a special badge recognizing their power.

As fighting dragons is a popular activity within the game that earns points and status for the players, we want players to feel they have a fair chance to win fights. We want to keep the chance of winning at 50% or more. Currently, around 50% of players have a winning record (a winning rate of 50% or higher), thus no adjustment is needed.

Guilds are another activity that helps increase engagement and provides a sense of community to the players. The most popular guild is "Game_of_Thrones"; however, membership across the guilds is comparable. Players are actively seeking to join guilds, and the current offering seems sufficient for the number of players in the game.

The following section presents the Presto queries used to answer the questions that support this analysis.

### A. Answering Business Questions Using Presto:

### Purpose: To understand the favorite item of customers

#### Q1. Which item is most popular, and how much revenue does it generate?
(to understand item preference and purchasing behavior)

**Presto Query:**  
```sql
select event_type, sum(n_purchased) as number_purchase, sum(price) as revenue 
from purchase_events 
group by event_type 
order by revenue desc; 
```
```
   event_type    | number_purchase | revenue 
-----------------+-----------------+---------
 purchase_sword  |            1113 |  408000 
 purchase_shield |            1119 |  294000 
 purchase_knife  |            1343 |  216000 
(3 rows)
```

#### Q2. Which item do users purchase first? (Most interested item)

```sql
select A.event_type,count(*) as first_purchased_item from
(select event_ts,userid ,event_type,
RANK() OVER(PARTITION BY userid ORDER BY event_ts DESC) Rank
from purchase_events) A where Rank=1 group by A.event_type order by first_purchased_item DESC;
```
```
   event_type    | first_purchased_item 
-----------------+----------------------
 purchase_sword  |                   11 
 purchase_shield |                    5 
 purchase_knife  |                    4 
(3 rows)
```
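One detail worth checking when adapting this query: ordering by `event_ts DESC` gives rank 1 to each user's most recent purchase, while ascending order gives rank 1 to the earliest one. The sketch below shows the difference in plain Python on three made-up rows; the helper `pick_per_user` is introduced here for illustration only.

```python
# Made-up rows: (event_ts, userid, event_type)
rows = [
    ("2021-12-11 14:55:57", "user-001", "purchase_knife"),
    ("2021-12-11 14:58:10", "user-001", "purchase_sword"),
    ("2021-12-11 14:56:05", "user-005", "purchase_shield"),
]

def pick_per_user(rows, latest=False):
    """Return {userid: event_type} for each user's earliest (or latest) row."""
    picked = {}
    # Timestamps in this format sort correctly as strings.
    for ts, user, event_type in sorted(rows, reverse=latest):
        picked.setdefault(user, event_type)  # keep the first row seen per user
    return picked

first_items = pick_per_user(rows)               # like ORDER BY event_ts ASC
last_items = pick_per_user(rows, latest=True)   # like ORDER BY event_ts DESC
print(first_items["user-001"], last_items["user-001"])
```

For `user-001` the two orderings pick different items, so the sort direction determines whether the count reflects first or latest purchases.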

#### Q3. On average, how many items does a player purchase? (to understand item preference)

**Presto Query:**
```sql
select round(AVG(A.number_of_items)) as average_item from(
Select userid,sum(n_purchased) as number_of_items
from purchase_events
group by userid ) A;
```
```
 average_item 
--------------
        179.0 
(1 row)
```

### Purpose: To understand heavy users

#### Q4. On average, how much money do players spend on items?

```sql
select round(AVG(A.price_of_items)) as average_item 
from(
select userid,sum(price) as price_of_items 
from purchase_events 
group by userid 
) A;
```
```
 average_item 
--------------
      45900.0 
(1 row)
```

#### Q5. Which customers spent the most money on items (top 5)?

```sql
select userid,sum(price) as price_of_items 
from purchase_events 
group by userid limit 5;
```
```
  userid   | price_of_items 
-----------+----------------
 user-003  |          20500 
 user-001  |          19500 
 user-009  |          60500 
 user-0019 |          49500 
 user-002  |          70000 
(5 rows)
```
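Because the query has no `ORDER BY`, `limit 5` returns five arbitrary users rather than the five largest totals, so the rows can differ between runs and between engines. A minimal sketch of the ranking step follows, using only the ten per-user totals that appear in the Presto and PySpark outputs of this document. Totals for the remaining users are not shown, so the resulting list is illustrative, not the actual top 5.

```python
# Per-user totals copied from the Presto and PySpark outputs in this document.
totals = {
    "user-003": 20500, "user-001": 19500, "user-009": 60500,
    "user-0019": 49500, "user-002": 70000, "user-0013": 54000,
    "user-005": 32000, "user-004": 74000, "user-006": 71500,
    "user-008": 26000,
}

# Equivalent of: ORDER BY price_of_items DESC LIMIT 5
top5 = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:5]
for userid, spent in top5:
    print(userid, spent)
```

In SQL, the same effect comes from adding `order by price_of_items desc` before the `limit` clause.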

#### Q6. Who are the top 3 players who spent the most money on items (sword, knife, shield) and guilds?

```sql
select userid, sum(price) as tot_spend
from
(
    select userid, price
    from purchase_events
    union all
    select userid, price
    from guild_events
) t
group by userid
order by tot_spend
desc
limit 3;
```
```
  userid   | tot_spend 
-----------+-----------
 user-0012 |    108000 
 user-006  |     86500 
 user-004  |     82000 
(3 rows)
```

#### Q7. Who are the top 3 players with the most power? (measured as the sum of the strength of a player's items and guild memberships)

```sql
select userid, sum(strength) as power
from
(
    select userid, strength
    from purchase_events
    union all
    select userid, strength
    from guild_events
) t
group by userid
order by power
desc
limit 3;
```

```
  userid   | power 
-----------+-------
 user-0012 | 74300 
 user-006  | 61000 
 user-0013 | 53800 
(3 rows)
```

#### Q8. Who are the top 5 players who fought the most dragons?

```sql
select userid,count(*) as num_of_fights 
from fight_events 
group by userid 
order by num_of_fights DESC limit 5;
```
```
  userid   | num_of_fights 
-----------+---------------
 user-0011 |            10 
 user-001  |             9 
 user-0010 |             9 
 user-006  |             8 
 user-0018 |             7 
(5 rows)
```

### Purpose: To understand the fight event status

#### Q9. What is the average score for all players?

```sql
select AVG(sum_of_score) as avg_score 
from (select userid,sum(score) as sum_of_score 
from fight_events 
group by userid);
```
> Output:

```
     avg_score      
--------------------
 177.22222222222223 
(1 row)
```

#### Q10. Who are the top 5 players with the highest fighting score?

```sql
select userid, sum(score) as sum_of_score 
from fight_events 
group by userid
order by sum_of_score DESC limit 5;
```
```
  userid   | sum_of_score 
-----------+--------------
 user-006  |          580 
 user-0010 |          460 
 user-004  |          390 
 user-0011 |          230 
 user-007  |          190 
(5 rows)
```


#### Q11. Who has the lowest score, and what is it?

```sql
select userid, sum(score) as sum_of_score 
from fight_events 
group by userid
order by sum_of_score ASC limit 1;
```
```
  userid   | sum_of_score 
-----------+--------------
 user-0019 |          -20 
(1 row)
```

#### Q12. On average, how many times does a player fight a dragon?

```sql
select AVG(num_of_fights) as avg_fight 
from (
select userid,count(*) as num_of_fights 
from fight_events 
group by userid 
order by num_of_fights);
```
```
     avg_fight     
-------------------
 4.888888888888889 
(1 row)
```

#### Q13. What is the winning rate for each player?

```sql
select userid, fights, num_of_win, (cast(num_of_win as double)/fights)*100 as winning_rate 
from 
(select A.userid, A.fights, B.num_of_win from
(select userid,count(*) as fights from fight_events group by userid) A
INNER JOIN
(select userid,count(*) as num_of_win from fight_events where win_status = 'won' group by userid) B
on A.userid=B.userid
) order by winning_rate desc;
```
```
  userid   | fights | num_of_win |    winning_rate    
-----------+--------+------------+--------------------
 user-009  |      1 |          1 |              100.0 
 user-004  |      5 |          4 |               80.0 
 user-006  |      8 |          6 |               75.0 
 user-007  |      3 |          2 |  66.66666666666666 
 user-0010 |      9 |          5 |  55.55555555555556 
 user-0016 |      2 |          1 |               50.0 
 user-005  |      4 |          2 |               50.0 
 user-0015 |      4 |          2 |               50.0 
 user-002  |      2 |          1 |               50.0 
 user-0020 |      5 |          2 |               40.0 
 user-0013 |      3 |          1 |  33.33333333333333 
 user-0012 |      3 |          1 |  33.33333333333333 
 user-003  |      6 |          2 |  33.33333333333333 
 user-0011 |     10 |          3 |               30.0 
 user-001  |      9 |          2 |  22.22222222222222 
 user-008  |      5 |          1 |               20.0 
 user-0018 |      7 |          1 | 14.285714285714285 
(17 rows)
```

#### Q14. Who are the top three players with the most wins?

```sql
select userid, win_status, count(*) as wins 
from fight_events 
where win_status='won' 
group by userid, win_status 
order by wins desc limit 3;
```
```
  userid   | win_status | wins 
-----------+------------+------
 user-006  | won        |    6 
 user-0010 | won        |    5 
 user-004  | won        |    4 
(3 rows)
```

### Understanding Guilds

#### Q15. What is the most popular guild (by number of members)?


```sql
select name, count(*) as members 
from guild_events 
group by name 
order by members desc;
```
```
      name       | members 
-----------------+---------
 Game_of_Thrones |      36 
 The_Avengers    |      33 
 Justice_League  |      24 
(3 rows)
```

### B. Answering Business Questions Using PySpark:

**Import libraries**

```python
import pandas as pd
import sqlalchemy
#from sqlalchemy import create_engine
from sqlalchemy.engine import create_engine
```

**Check tables**

```python
#Purchase Events Table
purchase_events_df = spark.read.parquet("/tmp/purchase_events")
purchase_events_df.show(5,truncate=False)
```
```
+-----------------------+---------+---------------------+---------------+---------+--------+-----------+-----+
|event_ts               |userid   |Host                 |event_type     |name     |strength|n_purchased|price|
+-----------------------+---------+---------------------+---------------+---------+--------+-----------+-----+
|2021-12-11 14:55:57.335|user-001 |user-001.comcast.com |purchase_knife |kukri    |500     |2          |1000 |
|2021-12-11 14:56:05.702|user-005 |user-005.comcast.com |purchase_shield|parma    |800     |10         |1500 |
|2021-12-11 14:56:08.18 |user-006 |user-006.comcast.com |purchase_sword |excalibur|1000    |2          |2000 |
|2021-12-11 14:56:12.111|user-004 |user-004.comcast.com |purchase_sword |excalibur|1000    |3          |2000 |
|2021-12-11 14:56:14.632|user-0020|user-0020.comcast.com|purchase_shield|parma    |800     |5          |1500 |
+-----------------------+---------+---------------------+---------------+---------+--------+-----------+-----+
only showing top 5 rows
```

```python
#Guild Events Table
guid_events_df = spark.read.parquet("/tmp/guild_events")
guid_events_df.show(5,truncate=False)
```
```
+-----------------------+---------+---------------------+----------+---------------+--------+-----------+-----+
|event_ts               |userid   |Host                 |event_type|name           |strength|n_purchased|price|
+-----------------------+---------+---------------------+----------+---------------+--------+-----------+-----+
|2021-12-11 14:56:28.843|user-007 |user-007.comcast.com |join_guild|Game_of_Thrones|1500    |1          |2000 |
|2021-12-11 14:56:36.03 |user-0011|user-0011.comcast.com|join_guild|Game_of_Thrones|1500    |1          |2000 |
|2021-12-11 14:56:29.819|user-0013|user-0013.comcast.com|join_guild|Game_of_Thrones|1500    |1          |2000 |
|2021-12-11 14:56:52.658|user-008 |user-008.comcast.com |join_guild|Game_of_Thrones|1500    |1          |2000 |
|2021-12-11 14:57:04.578|user-009 |user-009.comcast.com |join_guild|Game_of_Thrones|1500    |1          |2000 |
+-----------------------+---------+---------------------+----------+---------------+--------+-----------+-----+
```

```python
#Fight Events Table
fight_events_df = spark.read.parquet("/tmp/fight_events")
fight_events_df.show(5,truncate=False)
```
```
+-----------------------+---------+---------------------+-----------+-----+----------+
|event_ts               |userid   |Host                 |event_type |score|win_status|
+-----------------------+---------+---------------------+-----------+-----+----------+
|2021-12-11 14:56:00.942|user-0010|user-0010.comcast.com|fight_event|100  |won       |
|2021-12-11 14:56:07.233|user-0010|user-0010.comcast.com|fight_event|100  |won       |
|2021-12-11 14:55:26.492|user-0011|user-0011.comcast.com|fight_event|-10  |lost      |
|2021-12-11 14:55:35.172|user-0013|user-0013.comcast.com|fight_event|-10  |lost      |
|2021-12-11 14:55:41.027|user-0018|user-0018.comcast.com|fight_event|-10  |lost      |
+-----------------------+---------+---------------------+-----------+-----+----------+
only showing top 5 rows
```



**Create View for Tables**

```python
#Create Views to Access Dataframes with SQL queries
purchase_events_df.registerTempTable('purchase_events')
guid_events_df.registerTempTable('guild_events')
fight_events_df.registerTempTable('fight_events')
```

### Purpose: To understand the favorite item of customers

#### Q1. Which item is most popular, and how much revenue does it generate?
(to understand item preference and purchasing behavior)

```python
query = """
select event_type, sum(n_purchased) as number_purchase, sum(price) as revenue 
from purchase_events 
group by event_type 
order by revenue desc        
"""
spark.sql(query).show(truncate=False)
```
```
+---------------+---------------+-------+
|event_type     |number_purchase|revenue|
+---------------+---------------+-------+
|purchase_sword |1113           |408000 |
|purchase_shield|1119           |294000 |
|purchase_knife |1343           |216000 |
+---------------+---------------+-------+
```



##### Q2. What items users purchased first? (Most interested item)

In [8]:
query = """
select A.event_type,count(*) as first_purchased_item from
(select event_ts,userid ,event_type,
RANK() OVER(PARTITION BY userid ORDER BY event_ts DESC) Rank
from purchase_events) A where Rank=1 group by A.event_type order by first_purchased_item DESC
"""
spark.sql(query).show(truncate=False)

+---------------+--------------------+
|event_type     |first_purchased_item|
+---------------+--------------------+
|purchase_sword |11                  |
|purchase_shield|5                   |
|purchase_knife |4                   |
+---------------+--------------------+
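
With `ORDER BY event_ts DESC`, rank 1 is each user's most recent purchase; ordering ascending selects the earliest purchase instead. A minimal sketch on made-up rows, assuming Python's built-in `sqlite3` (with window-function support, SQLite 3.25 or later) as a stand-in for Spark SQL. The alias `rnk` is an illustrative choice.

```python
import sqlite3

# Hypothetical rows (event_ts, userid, event_type); not pipeline data.
rows = [
    ("2021-12-11 14:55:00", "user-001", "purchase_knife"),
    ("2021-12-11 14:56:00", "user-001", "purchase_sword"),
    ("2021-12-11 14:55:30", "user-002", "purchase_knife"),
    ("2021-12-11 14:57:00", "user-002", "purchase_shield"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table purchase_events (event_ts text, userid text, event_type text)"
)
conn.executemany("insert into purchase_events values (?, ?, ?)", rows)

# Rank each user's purchases from earliest to latest and keep rank 1.
query = """
select A.event_type, count(*) as first_purchased_item
from (select userid, event_type,
             rank() over (partition by userid order by event_ts asc) as rnk
      from purchase_events) A
where rnk = 1
group by A.event_type
order by first_purchased_item desc
"""
first_items = conn.execute(query).fetchall()
print(first_items)
```

In this toy data both users bought a knife first, so only `purchase_knife` appears in the result.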



#### Q3. On average, how many items does a player purchase? (To understand item preferences)

In [10]:
query = """
select round(AVG(A.number_of_items)) as average_item from(
select userid,sum(n_purchased) as number_of_items
from purchase_events
group by userid ) A
"""
spark.sql(query).show(truncate=False)

+------------+
|average_item|
+------------+
|179.0       |
+------------+



### Purpose: To understand heavy users

#### Q4. On average, how much money do players spend on items?

In [11]:
query = """
select round(AVG(A.price_of_items)) as average_item 
from(
select userid,sum(price) as price_of_items 
from purchase_events 
group by userid 
) A
"""
spark.sql(query).show(truncate=False)

+------------+
|average_item|
+------------+
|45900.0     |
+------------+



#### Q5. Which customers spent the most money on items (top 5)?

In [13]:
query = """
select userid,sum(price) as price_of_items 
from purchase_events 
group by userid limit 5
"""
spark.sql(query).show(truncate=False)

+---------+--------------+
|userid   |price_of_items|
+---------+--------------+
|user-0013|54000         |
|user-005 |32000         |
|user-004 |74000         |
|user-006 |71500         |
|user-008 |26000         |
+---------+--------------+
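
The query above has no `ORDER BY`, so `limit 5` returns five arbitrary users rather than the five highest spenders. A sketch of the ranked version on made-up rows, assuming Python's built-in `sqlite3` as a stand-in for Spark SQL:

```python
import sqlite3

# Hypothetical rows (userid, price); not pipeline data.
rows = [
    ("user-001", 2000), ("user-001", 1500),
    ("user-002", 1000),
    ("user-003", 2000), ("user-003", 2000),
    ("user-004", 1500),
    ("user-005", 1000), ("user-005", 1000), ("user-005", 1000),
    ("user-006", 2000), ("user-006", 1000), ("user-006", 1500),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table purchase_events (userid text, price int)")
conn.executemany("insert into purchase_events values (?, ?)", rows)

# Sort by total spend before applying the limit.
query = """
select userid, sum(price) as price_of_items
from purchase_events
group by userid
order by price_of_items desc
limit 5
"""
top_spenders = conn.execute(query).fetchall()
for row in top_spenders:
    print(row)
```

Of the six toy users, the lowest spender (`user-002`) is the one left out.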



#### Q6. Who are the top 3 players who spent the most money on items (sword, knife, shield) and guilds?

In [69]:
query = """select userid, sum(price) as tot_spend
from
(
    select userid, price
    from purchase_events
    union all
    select userid, price
    from guild_events
) t
group by userid
order by tot_spend
desc
limit 3"""
spark.sql(query).show(truncate=False)

+---------+---------+
|userid   |tot_spend|
+---------+---------+
|user-0012|108000   |
|user-006 |86500    |
|user-004 |82000    |
+---------+---------+



#### Q7. Who are the top 3 players with the most power? (Measured as the summed strength of the items purchased and guilds joined)

In [17]:
query = """select userid, sum(strength) as power
from
(
    select userid, strength
    from purchase_events
    union all
    select userid, strength
    from guild_events
) t
group by userid
order by power
desc
limit 3"""
spark.sql(query).show(truncate=False)

+---------+-----+
|userid   |power|
+---------+-----+
|user-0012|74300|
|user-006 |61000|
|user-0013|53800|
+---------+-----+



#### Q8. Who are the top 5 players who fought the most dragons?

In [31]:
query = """
select userid,count(*) as num_of_fights 
from fight_events 
group by userid 
order by num_of_fights DESC limit 5"""
spark.sql(query).show(truncate=False)

+---------+-------------+
|userid   |num_of_fights|
+---------+-------------+
|user-0011|10           |
|user-001 |9            |
|user-0010|9            |
|user-006 |8            |
|user-0018|7            |
+---------+-------------+



### Purpose: To understand the fight event status

#### Q9. What is the average score for all players?

In [28]:
query = """
select AVG(sum_of_score) as avg_score 
from (select userid,sum(score) as sum_of_score 
from fight_events 
group by userid)"""
spark.sql(query).show(truncate=False)

+------------------+
|avg_score         |
+------------------+
|177.22222222222223|
+------------------+



#### Q10. Who are the top 5 players with the highest fighting score?

In [53]:
query = """
select userid, sum(score) as sum_of_score 
from fight_events 
group by userid
order by sum_of_score DESC limit 5"""
spark.sql(query).show(truncate=False)

+---------+------------+
|userid   |sum_of_score|
+---------+------------+
|user-006 |580         |
|user-0010|460         |
|user-004 |390         |
|user-0011|230         |
|user-007 |190         |
+---------+------------+



#### Q11. Who has the lowest score, and what is it?

In [43]:
query = """
select userid, sum(score) as sum_of_score 
from fight_events 
group by userid
order by sum_of_score ASC limit 1"""
spark.sql(query).show(truncate=False)

+---------+------------+
|userid   |sum_of_score|
+---------+------------+
|user-0019|-20         |
+---------+------------+



#### Q12. On average, how many times does a player fight dragons?

In [44]:
query = """
select AVG(num_of_fights) as avg_fight 
from (
select userid,count(*) as num_of_fights 
from fight_events 
group by userid 
order by num_of_fights)"""
spark.sql(query).show(truncate=False)

+-----------------+
|avg_fight        |
+-----------------+
|4.888888888888889|
+-----------------+



#### Q13. What is the winning rate for each player?

In [73]:
query = """
select userid, fights, num_of_win, (num_of_win/fights)*100 as winning_rate 
from 
(select A.userid, A.fights, B.num_of_win from
(select userid,count(*) as fights from fight_events group by userid) A
INNER JOIN
(select userid,count(*) as num_of_win from fight_events where win_status = 'won' group by userid) B
on A.userid=B.userid
) order by winning_rate desc"""
spark.sql(query).show(truncate=False)

+---------+------+----------+------------------+
|userid   |fights|num_of_win|winning_rate      |
+---------+------+----------+------------------+
|user-009 |1     |1         |100.0             |
|user-004 |5     |4         |80.0              |
|user-006 |8     |6         |75.0              |
|user-007 |3     |2         |66.66666666666666 |
|user-0010|9     |5         |55.55555555555556 |
|user-0016|2     |1         |50.0              |
|user-005 |4     |2         |50.0              |
|user-0015|4     |2         |50.0              |
|user-002 |2     |1         |50.0              |
|user-0020|5     |2         |40.0              |
|user-0013|3     |1         |33.33333333333333 |
|user-0012|3     |1         |33.33333333333333 |
|user-003 |6     |2         |33.33333333333333 |
|user-0011|10    |3         |30.0              |
|user-001 |9     |2         |22.22222222222222 |
|user-008 |5     |1         |20.0              |
|user-0018|7     |1         |14.285714285714285|
+---------+------+----------+------------------+
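
Because the two subqueries are combined with an `INNER JOIN`, players who never won have no row in `B` and drop out of the result (`user-0019` from Q11, for example, does not appear in the table above). A conditional aggregate computed in a single pass keeps such players with a 0% rate. A sketch on made-up rows, assuming Python's built-in `sqlite3` as a stand-in for Spark SQL:

```python
import sqlite3

# Hypothetical rows (userid, win_status); not pipeline data.
rows = [
    ("user-001", "won"), ("user-001", "lost"),
    ("user-002", "lost"), ("user-002", "lost"),
    ("user-003", "won"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table fight_events (userid text, win_status text)")
conn.executemany("insert into fight_events values (?, ?)", rows)

# Count wins with a CASE expression instead of joining a wins-only subquery.
query = """
select userid,
       count(*) as fights,
       sum(case when win_status = 'won' then 1 else 0 end) as num_of_win,
       100.0 * sum(case when win_status = 'won' then 1 else 0 end) / count(*)
           as winning_rate
from fight_events
group by userid
order by winning_rate desc, userid
"""
win_rates = conn.execute(query).fetchall()
for row in win_rates:
    print(row)
```

Here `user-002` lost both fights and still appears, with `num_of_win` 0 and `winning_rate` 0.0.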

#### Q14. Who are the top three players with the most wins?

In [49]:
query1 = """
select userid, win_status, count(*) as wins 
from fight_events 
where win_status='won' 
group by userid, win_status 
order by wins desc limit 3
"""
spark.sql(query1).show(truncate=False)

+---------+----------+----+
|userid   |win_status|wins|
+---------+----------+----+
|user-006 |won       |6   |
|user-0010|won       |5   |
|user-004 |won       |4   |
+---------+----------+----+



### Purpose: To understand guilds

#### Q15. What is the most popular guild (by number of members)?

In [60]:
query = """
select name, count(*) as members 
from guild_events 
group by name 
order by members desc
"""
spark.sql(query).show(truncate=False)

+---------------+-------+
|name           |members|
+---------------+-------+
|Game_of_Thrones|36     |
|The_Avengers   |33     |
|Justice_League |24     |
+---------------+-------+



## APPENDIX

### Appendix A: Docker Compose Content

```yaml
version: '2'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 32181
      ZOOKEEPER_TICK_TIME: 2000
    expose:
      - "2181"
      - "2888"
      - "32181"
      - "3888"
    extra_hosts:
      - "moby:127.0.0.1"

  kafka:
    image: confluentinc/cp-kafka:latest
    depends_on:
      - zookeeper
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:32181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    expose:
      - "9092"
      - "29092"
    extra_hosts:
      - "moby:127.0.0.1"

  cloudera:
    image: midsw205/hadoop:0.0.2
    hostname: cloudera
    expose:
      - "8020" # nn
      - "8888" # hue
      - "9083" # hive thrift
      - "10000" # hive jdbc
      - "50070" # nn http
    ports:
      - "8888:8888"
    extra_hosts:
      - "moby:127.0.0.1"

  spark:
    image: midsw205/spark-python:0.0.6
    stdin_open: true
    tty: true
    volumes:
      - ~/w205:/w205
    expose:
      - "8888"
    ports:
      - "8889:8888" # 8888 conflicts with hue
    depends_on:
      - cloudera
    environment:
      HADOOP_NAMENODE: cloudera
      HIVE_THRIFTSERVER: cloudera:9083
    extra_hosts:
      - "moby:127.0.0.1"
    command: bash

  presto:
    image: midsw205/presto:0.0.1
    hostname: presto
    volumes:
      - ~/w205:/w205
    expose:
      - "8080"
    environment:
      HIVE_THRIFTSERVER: cloudera:9083
    extra_hosts:
      - "moby:127.0.0.1"

  mids:
    image: midsw205/base:0.1.9
    stdin_open: true
    tty: true
    volumes:
      - ~/w205:/w205
    expose:
      - "5000"
    ports:
      - "5000:5000"
    extra_hosts:
      - "moby:127.0.0.1"
```

### Appendix B: Game Application - game_api.py

```python
#!/usr/bin/env python
import json
from kafka import KafkaProducer
from flask import Flask, request

app = Flask(__name__)
producer = KafkaProducer(bootstrap_servers='kafka:29092')

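def log_to_kafka(topic, event):
    # Merge the HTTP request headers (Host, User-Agent, Accept, ...) into the
    # event so they are stored with the game fields, then publish it as JSON.
    event.update(request.headers)
    producer.send(topic, json.dumps(event).encode())
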

@app.route("/")
def default_response():
    default_event = {'event_type': 'default',
                     'name': 'doing_nothing',
                     'strength':'NA',
                     'price': 'NA'}
    log_to_kafka('events', default_event)
    return "What are you waiting for?\n"


@app.route("/purchase_a_sword/", methods=['POST','GET'])
def purchase_a_sword():
    """
    @function: This function generates a purchase-sword event from a user's mobile device request or Apache Bench
    @param: User Request (via URL endpoint) 
    @return: Returns string of User Id and Event 
    """

    userid = request.args.get('userid', default='001', type=str)
    n = request.args.get('n',default=1,type=int)
    userid = userid.replace("'", '')

    purchase_sword_event = {'userid':userid,
                            'event_type': 'purchase_sword',
                            'name': 'excalibur',
                            'strength': 1000,
                            'n_purchased': n,
                            'price': 2000}
    log_to_kafka('events', purchase_sword_event)
    return "USER " + userid + ": "+ str(n)+" "+ " Sword(s) Purchased!\n"


@app.route("/join_guild/", methods=['POST','GET'])
def join_guild():
    """
    @function: This function generates a join-guild event from a user's mobile device request or Apache Bench
    @param: User Request (via URL endpoint) 
    @return: Returns string of User Id and Event 
    """
    userid = request.args.get('userid', default='001', type=str)
    guild_name = request.args.get('guild_name', default="'The_Avengers'", type=str)

    # Price and strength depend on the guild. Any name other than the two
    # checked below (including the default 'The_Avengers' and the
    # 'Justice_League' requests sent by the event generator) uses the else branch.
    if guild_name == "'Game_of_Thrones'":
        price = 2000
        strength = 1500
    elif guild_name == "'Castle_of_Rock'":
        price = 1000
        strength = 1200
    else:
        price = 3000
        strength = 5000
        
    userid = userid.replace("'", '')
    guild_name = guild_name.replace("'", '')
    join_guild_event = {'userid': userid,
                        'event_type': 'join_guild',
                        'name': guild_name,
                        'strength': strength,
                        'n_purchased': 1,
                        'price': price}
    log_to_kafka('events', join_guild_event)
    return "USER: " + userid +" Joined" +" "+ guild_name +" "+ "Guild!\n"


@app.route("/purchase_a_knife/", methods=['POST','GET'])
def purchase_a_knife():
    """
    @function: This function generates a purchase-knife event from a user's mobile device request or Apache Bench
    @param: User Request (via URL endpoint) 
    @return: Returns string of User Id and Event 
    """    
    userid = request.args.get('userid', default='001', type=str)
    n = request.args.get('n',default=1,type=int)
    userid = userid.replace("'", '')

    purchase_knife_event = {'userid': userid,
                            'event_type': 'purchase_knife',
                            'name': 'kukri',
                            'strength': 500,
                            'n_purchased': n,
                            'price': 1000}
    log_to_kafka('events', purchase_knife_event)
    return "USER " + userid + ": "+ str(n)+" "+ " Knife(s) Purchased!\n"


@app.route("/purchase_a_shield/", methods=['POST','GET'])
def purchase_a_shield():
    """
    @function: This function generates a purchase-shield event from a user's mobile device request or Apache Bench
    @param: User Request (via URL endpoint) 
    @return: Returns string of User Id and Event 
    """    
    userid = request.args.get('userid', default='001', type=str)
    n = request.args.get('n',default=1,type=int)
    userid = userid.replace("'", '')

    purchase_shield_event = {'userid': userid,
                             'event_type': 'purchase_shield',
                             'name': 'parma',
                             'strength': 800,
                             'n_purchased': n,
                             'price': 1500}
    log_to_kafka('events', purchase_shield_event)
    return "USER " + userid + ": "+ str(n)+" "+ " Shield(s) Purchased!\n"

@app.route("/fight_event/", methods=['POST','GET'])
def fight_event():
    """
    @function: This function generates a fight event from a user's mobile device request or Apache Bench
    @param: User Request (via URL endpoint) 
    @return: Returns string of User Id and Event 
    """
    userid = request.args.get('userid', default='001', type=str)
    win_status = request.args.get('win_status', default="'won'", type=str)

    if win_status == "'lost'":
        score = -10
    else: 
        score = 100

    userid = userid.replace("'", '')
    win_status = win_status.replace("'", '')
        
    fight_event = {'userid': userid,
                   'event_type': 'fight_event',
                   'win_status': win_status,
                   'score': score}
    log_to_kafka('events', fight_event)
    return "Fight event: USER " + userid + " "+ win_status +" "+ ". Score: " + str(score) + "\n"

```
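
The guild branch in `join_guild` can also be read as a lookup table. The sketch below copies the prices and strengths from the code above into a dictionary; `GUILD_TERMS`, `DEFAULT_TERMS`, and `guild_terms` are hypothetical names introduced only for illustration. It makes visible that any guild not listed, including `Justice_League`, receives the default terms.

```python
# (price, strength) per guild, copied from the if/elif branches in join_guild.
GUILD_TERMS = {
    "Game_of_Thrones": (2000, 1500),
    "Castle_of_Rock": (1000, 1200),
}
# The else branch: applies to every other guild name.
DEFAULT_TERMS = (3000, 5000)


def guild_terms(guild_name):
    """Return (price, strength) for a guild name, with or without quotes."""
    return GUILD_TERMS.get(guild_name.replace("'", ""), DEFAULT_TERMS)


for name in ("'The_Avengers'", "'Game_of_Thrones'", "'Justice_League'"):
    print(name, guild_terms(name))
```

If `Justice_League` were meant to have its own price and strength, adding one dictionary entry would be enough in this form.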

### Appendix C: Streaming Job to Extract Events from Kafka and Write Them to HDFS (write_events_stream.py)

```python
#!/usr/bin/env python
"""Extract events from kafka and write them to hdfs
"""
import json
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, from_json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType

import pyspark.sql.functions as F


def purchase_events_schema():
    """
    @function: This function provides the table schema for purchase events (knife, sword, shield)
    @param: None 
    @return: Returns the table schema for purchase events
    """  
    return StructType(
    [
        StructField('Accept', StringType(), True),
        StructField('Host', StringType(), True),
        StructField('User-Agent', StringType(), True),
        StructField('price', LongType(), True),
        StructField('n_purchased', LongType(), True),
        StructField('strength', LongType(), True),
        StructField('name', StringType(), True),
        StructField('event_type', StringType(), True),
        StructField('userid', StringType(), True)
    ]
)


def guild_events_schema():
    """
    @function: This function provides the table schema for guild events
    @param: None 
    @return: Returns the table schema for guild events 
    """  
    return StructType(
    [
        StructField('Accept', StringType(), True),
        StructField('Host', StringType(), True),
        StructField('User-Agent', StringType(), True),
        StructField('price', LongType(), True),
        StructField('n_purchased', LongType(), True),
        StructField('strength', LongType(), True),
        StructField('name', StringType(), True),
        StructField('event_type', StringType(), True),
        StructField('userid', StringType(), True)
    ]
)

def fight_events_schema():
    """
    @function: This function provides the table schema for fight events
    @param: None 
    @return: Returns the table schema for fight events 
    """  
    return StructType(
    [
        StructField('Accept', StringType(), True),
        StructField('Host', StringType(), True),
        StructField('User-Agent', StringType(), True),
        StructField('score', LongType(), True),
        StructField('win_status', StringType(), True),
        StructField('event_type', StringType(), True),
        StructField('userid', StringType(), True)
    ]
)


@udf('boolean')
def is_purchase(event_as_json):
    """
    @function: UDF that parses the raw event JSON and returns True for purchase events (knife, sword, shield)
    @param: Takes in extracted json data as a string
    @return: Returns a boolean value
    """    
    event = json.loads(event_as_json)
    if 'purchase' in event['event_type']:
        return True
    return False

@udf('boolean')
def is_join_guild(event_as_json):
    """
    @function: UDF that parses the raw event JSON and returns True for join-guild events
    @param: Takes in extracted json data as a string
    @return: Returns a boolean value
    """   
    event = json.loads(event_as_json)
    if event['event_type'] == 'join_guild':
        return True
    return False

@udf('boolean')
def is_fight_event(event_as_json):
    """
    @function: UDF that parses the raw event JSON and returns True for fight events
    @param: Takes in extracted json data as a string
    @return: Returns a boolean value
    """   
    event = json.loads(event_as_json)
    if event['event_type'] == 'fight_event':
        return True
    return False

def main():

    """
    @main function: This is a main function that executes a spark job - extracting string, parsing json using a provided schema, etc.
    @param: none, uses previously defined functions
    @return: none, lands tables via streaming on HDFS
    """ 
    
    spark = SparkSession \
        .builder \
        .appName("ExtractEventsJob") \
        .getOrCreate()

    raw_events = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "kafka:29092") \
        .option("subscribe", "events") \
        .load()

    purchases = raw_events \
        .filter(is_purchase(raw_events.value.cast('string'))) \
        .select(raw_events.value.cast('string').alias('raw_event'),
                raw_events.timestamp.cast('string'),
                from_json(raw_events.value.cast('string'),
                          purchase_events_schema()).alias('json')) \
        .select('timestamp', 'json.*') \
        .select( \
#                   F.from_utc_timestamp(F.col('timestamp'),'GMT').alias('event_ts') \
                  F.col('timestamp').alias('event_ts') \
                 ,F.col('userid') \
                 ,F.col('Host') \
                 ,F.col('event_type') \
                 ,F.col('name') \
                 ,F.col('strength') \
                 ,F.col('n_purchased') \
                 ,F.col('price') \
                ) \
        .distinct()
    
    guild = raw_events \
        .filter(is_join_guild(raw_events.value.cast('string'))) \
        .select(raw_events.value.cast('string').alias('raw_event'),
                raw_events.timestamp.cast('string'),
                from_json(raw_events.value.cast('string'),
                          guild_events_schema()).alias('json')) \
        .select('timestamp', 'json.*') \
        .select( \
#                   F.from_utc_timestamp(F.col('timestamp'),'GMT').alias('event_ts') \
                  F.col('timestamp').alias('event_ts') \
                 ,F.col('userid') \
                 ,F.col('Host') \
                 ,F.col('event_type') \
                 ,F.col('name') \
                 ,F.col('strength') \
                 ,F.col('n_purchased') \
                 ,F.col('price') \
                ) \
        .distinct()

    fight = raw_events \
        .filter(is_fight_event(raw_events.value.cast('string'))) \
        .select(raw_events.value.cast('string').alias('raw_event'),
                raw_events.timestamp.cast('string'),
                from_json(raw_events.value.cast('string'),
                          fight_events_schema()).alias('json')) \
        .select('timestamp', 'json.*') \
        .select( \
                  F.col('timestamp').alias('event_ts') \
                 ,F.col('userid') \
                 ,F.col('Host') \
                 ,F.col('event_type') \
                 ,F.col('score') \
                 ,F.col('win_status') \
                ) \
        .distinct()
    
    
    sink_purchases = purchases \
        .writeStream \
        .format("parquet") \
        .option("checkpointLocation", "/tmp/checkpoints_for_purchase_events") \
        .option("path", "/tmp/purchase_events") \
        .trigger(processingTime="10 seconds") \
        .start()
    
    sink_guild = guild \
        .writeStream \
        .format("parquet") \
        .option("checkpointLocation", "/tmp/checkpoints_for_guild_events") \
        .option("path", "/tmp/guild_events") \
        .trigger(processingTime="10 seconds") \
        .start()

    sink_fight = fight \
        .writeStream \
        .format("parquet") \
        .option("checkpointLocation", "/tmp/checkpoints_for_fight_events") \
        .option("path", "/tmp/fight_events") \
        .trigger(processingTime="10 seconds") \
        .start()
        
    sink_purchases.awaitTermination()
    sink_guild.awaitTermination()
    sink_fight.awaitTermination()


if __name__ == "__main__":
    main()
```
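
The three UDFs route each Kafka message to one of the sinks based on `event_type`. The plain-Python sketch below applies the same tests to a few sample JSON strings without Spark. `route_event` is a hypothetical helper, the sample payloads are made up, and `.get` is used (an assumption, unlike the UDFs) so that a missing `event_type` key does not raise an error.

```python
import json


def route_event(event_as_json):
    """Return the sink name an event would be written to, or None."""
    event = json.loads(event_as_json)
    event_type = event.get("event_type", "")
    if "purchase" in event_type:        # same test as is_purchase
        return "purchase_events"
    if event_type == "join_guild":      # same test as is_join_guild
        return "guild_events"
    if event_type == "fight_event":     # same test as is_fight_event
        return "fight_events"
    return None                         # e.g. the 'default' event matches no filter


# Made-up payloads, one per event family plus the default event.
samples = [
    '{"userid": "user-001", "event_type": "purchase_sword", "price": 2000}',
    '{"userid": "user-002", "event_type": "join_guild", "name": "The_Avengers"}',
    '{"userid": "user-003", "event_type": "fight_event", "win_status": "won"}',
    '{"event_type": "default", "name": "doing_nothing"}',
]
routes = [route_event(s) for s in samples]
print(routes)
```

Events from the `/` route carry `event_type: default`, so none of the three filters selects them and they are not written to any of the Parquet paths.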

### Appendix D: Game Event Generator Script (game_event_generator.sh)

```bash
#! /usr/bin/bash

# usage: ./game_event_generator.sh -u 15 -e 5 -n 20 -b

helpFunction()
{
   echo ""
   echo "Welcome CARLOS 2020 to mids 205 Project 3 synthetic data generator"
   echo "Usage: $0 -u NOOFUSERS -e ENDPOINTS -n GENERATEREQS"
   echo -e "\t-u Number of users"
   echo -e "\t-e Number of endpoints"
   echo -e "\t-n Number of total requests"
   echo -e "\t-b (Optional) use Apache Bench to send the requests to flask, uses no args just the flag"
   exit 1 # Exit script after printing help
}

while getopts "u:e:n:b" opt
do
   case "$opt" in
      u ) NOOFUSERS="$OPTARG" ;;
      e ) ENDPOINTS="$OPTARG" ;;
      n ) GENERATEREQS="$OPTARG" ;;
      b ) ABFLAG="SET" ;;
      ? ) helpFunction ;; # Print helpFunction in case parameter is non-existent
   esac
done

# Print helpFunction in case parameters are empty
if [ -z "$NOOFUSERS" ] || [ -z "$ENDPOINTS" ] || [ -z "$GENERATEREQS" ]
then
   echo "Some or all of the parameters are empty";
   helpFunction
fi


## ** Set limits to per user items
REQS=0
MAXNOOFSWORDS=10 # Max swords purchased at a time per user
MAXNOOFSHIELDS=10 # max shields purchased at a time per user
MAXNOOFKNIFES=10 # max knives purchased at a time per user
MAXGUILDS=3 # max number of guilds joined at a time per user
CONCURRENTUSERS=1 #max users accessing the flask API (* Cannot use concurrency level greater than total number of requests [CONCURRENTUSERS < GENERATEREQS] )
RANDVAR=2

## Event types are randomly assigned a number between 1 and ENDPOINTS (at most 5).
# 1) Purchase a Sword
# 2) Purchase a Shield
# 3) Purchase a Knife
# 4) Join a Guild - randomly picks one of three guild names: The_Avengers (the API default), Game_of_Thrones, or Justice_League
# 5) fight_event - randomly won or lost

## ** Check if apache bench optional param is passed 
if [ "$ABFLAG" ]
then
    echo "Apache Bench flag is $ABFLAG";
    # docker exec -it project-3-elizkhan_mids_1 ab -n 2 -H "Host: liz.comcast.com" 'http://localhost:5000/purchase_a_potion/?userid=002&n=10' ## works
    # docker-compose exec mids ab -n 2 -H "Host: liz.comcast.com" http://localhost:5000/ #does not work in windows
    until [ $REQS -gt $GENERATEREQS ]; do
        ID=$(( ( RANDOM % $NOOFUSERS )  + 1 ))
        EP=$(( ( RANDOM % $ENDPOINTS )  + 1 ))
        NOOFSWORDS=$(( ( RANDOM % $MAXNOOFSWORDS )  + 1 ))
        NOOFSHIELDS=$(( ( RANDOM % $MAXNOOFSHIELDS )  + 1 ))
        NOOFKNIFES=$(( ( RANDOM % $MAXNOOFKNIFES )  + 1 ))
        case $EP in
            1)
            docker-compose exec mids ab -n 5 -H "Host: user-00$ID.comcast.com" "http://localhost:5000/purchase_a_sword/?userid=%27user-00$ID%27&n=$NOOFSWORDS"
            ;;
        esac
        case $EP in
            2)
            docker-compose exec mids ab -n 4 -H "Host: user-00$ID.comcast.com" "http://localhost:5000/purchase_a_shield/?userid=%27user-00$ID%27&n=$NOOFSHIELDS"
            ;;
        esac
        case $EP in
            3)
            docker-compose exec mids ab -n 4 -H "Host: user-00$ID.comcast.com" "http://localhost:5000/purchase_a_knife/?userid=%27user-00$ID%27&n=$NOOFKNIFES"
            ;;
        esac    
        case $EP in
            4)
              GUILDID=$(( ( RANDOM % $MAXGUILDS )  + 1 ))
              case $GUILDID in
                1)
                docker-compose exec mids ab -n 2 -H "Host: user-00$ID.comcast.com" "http://localhost:5000/join_guild/?userid=%27user-00$ID%27"
                ;;
                2)
                docker-compose exec mids ab -n 2 -H "Host: user-00$ID.comcast.com" "http://localhost:5000/join_guild/?userid=%27user-00$ID%27&guild_name=%27Game_of_Thrones%27"
                ;;
                3)
                docker-compose exec mids ab -n 2 -H "Host: user-00$ID.comcast.com" "http://localhost:5000/join_guild/?userid=%27user-00$ID%27&guild_name=%27Justice_League%27"
                ;;
              esac
        esac
        case $EP in
                5)
                WINFLAG=$(( ( RANDOM % $RANDVAR ) + 1))
                case $WINFLAG in
                   1)
                   docker-compose exec mids ab -n 2 -H "Host: user-00$ID.comcast.com" "http://localhost:5000/fight_event/?userid=%27user-00$ID%27&win_status=%27lost%27"
                   ;;
                   2)
                   docker-compose exec mids ab -n 2 -H "Host: user-00$ID.comcast.com" "http://localhost:5000/fight_event/?userid=%27user-00$ID%27&win_status=%27won%27"               
                   ;;
                 esac
        esac
        let REQS=REQS+1
    done
else
    until [ $REQS -gt $GENERATEREQS ]; do
        ID=$(( ( RANDOM % $NOOFUSERS )  + 1 ))
        EP=$(( ( RANDOM % $ENDPOINTS )  + 1 ))
        NOOFSWORDS=$(( ( RANDOM % $MAXNOOFSWORDS )  + 1 ))
        NOOFSHIELDS=$(( ( RANDOM % $MAXNOOFSHIELDS )  + 1 ))
        NOOFKNIFES=$(( ( RANDOM % $MAXNOOFKNIFES )  + 1 ))
        case $EP in
            1)
            docker-compose exec mids curl "http://localhost:5000/purchase_a_sword/?userid=%27user-00$ID%27&n="$NOOFSWORDS
            ;;
        esac
        case $EP in
            2)
            docker-compose exec mids curl "http://localhost:5000/purchase_a_shield/?userid=%27user-00$ID%27&n="$NOOFSHIELDS
            ;;
        esac
        case $EP in
            3)
            docker-compose exec mids curl "http://localhost:5000/purchase_a_knife/?userid=%27user-00$ID%27&n="$NOOFKNIFES
            ;;
        esac
        case $EP in
            4)
            GUILDID=$(( ( RANDOM % $MAXGUILDS )  + 1 ))
              case $GUILDID in
                1)
                docker-compose exec mids curl -H "Host: user-00$ID.comcast.com" "http://localhost:5000/join_guild/?userid=%27user-00$ID%27"
                ;;
                2)
                docker-compose exec mids curl -H "Host: user-00$ID.comcast.com" "http://localhost:5000/join_guild/?userid=%27user-00$ID%27&guild_name=%27Game_of_Thrones%27"
                ;;
                3)
                docker-compose exec mids curl -H "Host: user-00$ID.comcast.com" "http://localhost:5000/join_guild/?userid=%27user-00$ID%27&guild_name=%27Justice_League%27"
                ;;
              esac
        esac
        case $EP in
            5)
             WINFLAG=$(( ( RANDOM % $RANDVAR ) + 1))
             case $WINFLAG in
               1)
               docker-compose exec mids curl -H "Host: user-00$ID.comcast.com" "http://localhost:5000/fight_event/?userid=%27user-00$ID%27&win_status=%27lost%27"
               ;;
               2)
               docker-compose exec mids curl -H "Host: user-00$ID.comcast.com" "http://localhost:5000/fight_event/?userid=%27user-00$ID%27&win_status=%27won%27"
               ;;
             esac
        esac
        let REQS=REQS+1
    done
fi
```
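
The request URLs in the script wrap each string value in URL-encoded single quotes (`%27`), which `game_api.py` strips again before logging the event. As a sketch, the same URLs can be built in Python with `urllib.parse.urlencode`; `build_url` is a hypothetical helper, and the base URL is the one used in the script.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:5000"  # Flask address used in the script


def build_url(endpoint, user_number, **params):
    """Build a request URL in the script's format: string values in single quotes."""
    query = {"userid": "'user-00{}'".format(user_number)}
    for key, value in params.items():
        # Integers (e.g. n) are sent bare; strings are wrapped in quotes.
        query[key] = value if isinstance(value, int) else "'{}'".format(value)
    return "{}/{}/?{}".format(BASE_URL, endpoint, urlencode(query))


sword_url = build_url("purchase_a_sword", 7, n=3)
guild_url = build_url("join_guild", 10, guild_name="Game_of_Thrones")
print(sword_url)
print(guild_url)
```

As in the script, the user ID is formed by appending the number to `user-00`, so user 10 becomes `user-0010`.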


#### Useful References
- https://towardsdatascience.com/jupyter-magics-with-sql-921370099589