# W205 Project 3 Data Pipeline and Analytics Report
## Eric Liu, Ian Dela Cruz, Luke Schanne

## Data Pipeline Files, Components and Setup

The data pipeline for this project consisted of the following components:

### Pipeline Files
- `docker-compose.yml`: defines the Docker containers and environments needed to set up and run the entire data pipeline.
- `stream` directory: stores the files used to stream, filter, and write events to Hive.
    - `[action].py` files: each sets up the Spark streaming job for its [action].
    - `template.py`: template file used to facilitate the setup of Spark jobs.
- `app` directory: stores the files that generate the events, dictate business logic, and write to SQLite tables.
    - `game_api.py`: stores business logic for the various types of actions (discussed below).
    - `models.py`: writes event data to SQLite tables.
    - `events.py`: Python script that generates a stream of random events.

### Pipeline Components
- Kafka topic: there is a single Kafka topic, `events`, used in this data pipeline. From this topic, the various event types are filtered and written to Hive tables using separate, action-specific Spark streams.
    - Kafka observer creation: `docker-compose exec mids kafkacat -C -b kafka:29092 -t events -o beginning`
- SQLite tables: there are five tables that store the data generated by the game API. They serve as a reference when applying some light business logic.
    - swords
    - transactions
    - guilds
    - players
    - guild_interactions
- Spark jobs: there are action-specific Spark jobs that filter the events into their respective Hive tables.
- Hive tables: there are five action-specific Hive tables that are created for querying by Presto:
    - swords
    - sword_transactions
    - guilds
    - players
    - guild_membership
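
The per-action filtering described above can be illustrated with a small, Spark-free sketch. It shows only the predicate that decides whether a raw message from the `events` topic belongs to a given action. The `event_type` and `attributes` keys and the sample messages are assumptions made for illustration; the actual event layout is defined in `game_api.py` and may differ.

```python
import json


def matches_action(raw_event, action):
    """Return True when a raw message is a JSON event of the given action."""
    try:
        event = json.loads(raw_event)
    except (TypeError, ValueError):
        return False  # skip messages that are not valid JSON
    # "event_type" is an assumed key name, not confirmed by the project code.
    return isinstance(event, dict) and event.get("event_type") == action


# Hypothetical raw messages as they might appear on the `events` topic.
raw_messages = [
    '{"event_type": "add_guild", "attributes": {"name": "jokers"}}',
    '{"event_type": "purchase_sword", "attributes": {"sword_id": 1}}',
    "not json",
]

guild_events = [m for m in raw_messages if matches_action(m, "add_guild")]
print(guild_events)
```

In a streaming job, a predicate like this would typically be applied to each message before writing the matching events to the action's table.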

## Game API Details

Our group decided to expand on the initial project prompt by adding to the actions available within the game API and to the data generated with each transaction. In addition to the 'buy sword' action, we added 'add player', 'add guild', 'join guild', and 'add sword'. We felt that this would be an appropriate level of complexity to test our data pipeline creation, coding, and querying skills. As you will observe within the code, a few of these actions take additional parameters that must be specified upon creation; we chose these based on a standard understanding of typical game dynamics. These actions may also return additional data, such as data about interactions between players and between players and objects, as well as temporal data, providing us with a rich dataset on which to flex our Presto querying skills.

The noteworthy implications of this added complexity that we would like to highlight are as follows:

- Base event data written to the final Hive tables are captured in a single string that is then parsed using regex, depending on the analytics question. See the Basic Event Analysis section below.
- A persistent sqllite.db file is created, which allows the application to query existing data when applying business logic prior to writing to Hive.
- The resulting business logic (for example, how to validate a transaction or guild interaction) became tricky to implement. The group is sure there are edge cases where the code will fail. We assumed that correcting for every possible case was outside the scope of the project, and instead focused on ensuring a viable MVP.
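
As an illustration of the kind of check a persistent SQLite store makes possible, the sketch below validates a sword purchase before an event would be emitted. It is a minimal example under stated assumptions, not the logic in `game_api.py`: the `players` and `swords` table names come from this report, but the column names (`id`, `money`, `cost`) and the `can_purchase` helper are hypothetical.

```python
import sqlite3


def can_purchase(conn, buyer_id, sword_id):
    """Return True if both rows exist and the buyer can afford the sword."""
    row = conn.execute(
        "SELECT p.money, s.cost FROM players p, swords s "
        "WHERE p.id = ? AND s.id = ?",
        (buyer_id, sword_id),
    ).fetchone()
    if row is None:
        return False  # unknown player or unknown sword
    money, cost = row
    return money >= cost


# In-memory database with an assumed schema, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT, money INTEGER)")
conn.execute("CREATE TABLE swords (id INTEGER PRIMARY KEY, cost INTEGER)")
conn.execute("INSERT INTO players (id, name, money) VALUES (1, 'batman', 999999)")
conn.execute("INSERT INTO players (id, name, money) VALUES (2, 'robin', 50)")
conn.execute("INSERT INTO swords (id, cost) VALUES (1, 100)")

print(can_purchase(conn, 1, 1))
print(can_purchase(conn, 2, 1))
```

A check like this covers only the simplest cases (missing rows and insufficient funds); it does not address concurrent purchases or swords that have already been sold.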

## Basic Event Generation

We can generate events manually. Depending on the action, there may be parameters that will need to be defined. Examples of how to run each action from the terminal are provided below:

- add a guild: `docker-compose exec mids curl "http://localhost:5000/add_guild?name=jokers"`
- add a player: `docker-compose exec mids curl "http://localhost:5000/add_player?name=batman&money=999999"`
- join a guild: `docker-compose exec mids curl "http://localhost:5000/join_guild?player_id=1&guild_id=1&join=1"`
- add a sword: `docker-compose exec mids curl "http://localhost:5000/add_sword?cost=100"`
- buy a sword: `docker-compose exec mids curl "http://localhost:5000/purchase_sword?buyer_id=1&sword_id=1"`

We can also generate events automatically, using the `events.py` file.
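
To give a sense of what automatic generation can look like, here is a minimal sketch that builds request URLs for randomly chosen actions. It is not the contents of `events.py`: the endpoints and parameter names are taken from the manual examples above, while the sample names, the value ranges, and the `random_event_url` helper are assumptions. The sketch only builds URLs and does not send any requests.

```python
import random
from urllib.parse import urlencode

BASE_URL = "http://localhost:5000"


def random_event_url(rng):
    """Build the URL for one randomly chosen game action."""
    # Each builder returns the query parameters for its endpoint.
    # Sample names and numeric ranges are illustrative assumptions.
    builders = {
        "add_guild": lambda: {"name": rng.choice(["jokers", "riddlers"])},
        "add_player": lambda: {
            "name": rng.choice(["batman", "robin"]),
            "money": rng.randint(0, 1000),
        },
        "add_sword": lambda: {"cost": rng.randint(1, 500)},
        "join_guild": lambda: {
            "player_id": rng.randint(1, 5),
            "guild_id": rng.randint(1, 3),
            "join": rng.choice([0, 1]),
        },
        "purchase_sword": lambda: {
            "buyer_id": rng.randint(1, 5),
            "sword_id": rng.randint(1, 5),
        },
    }
    action = rng.choice(sorted(builders))
    return f"{BASE_URL}/{action}?{urlencode(builders[action]())}"


rng = random.Random(0)  # seeded so that runs are repeatable
urls = [random_event_url(rng) for _ in range(5)]
for url in urls:
    print(url)
```

Using `urlencode` joins multiple parameters with `&`, which avoids hand-written query-string mistakes. Each printed URL could then be requested with `curl` in the same way as the manual examples.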

## Basic Event Analysis: Sample Presto Queries to answer simple business questions

Presto can be started from the terminal using the command below:

- `docker-compose exec presto presto --server presto:8080 --catalog hive --schema default`

From here, a variety of business questions can be explored. We provide a few examples of Presto queries that may be of business value to explore.

Some notes:

- The logic of these queries (especially those dealing with transaction data) may not robustly account for faulty business logic. The group felt that the proper place to enforce business logic would be further upstream in the application pipeline (not in the queries), and that the vast majority of this sort of robustness was outside the scope of this project.

- Because of how the event data are captured in the expanded game, some initial regex extraction is needed to isolate the parameters of interest. The group felt that it was beyond the scope of this project to create custom tables to service each of the numerous types of business questions that we could potentially explore.

- Finally, while the results of each query are not explicitly shown in this ipynb, we will run through some of these queries during our live demo, to demonstrate how they are used. 


### Who are my players?
SELECT
    distinct regexp_extract(event_body,'(?<=name\": \")(.*?)(?="}})') as distinct_players
FROM
    players;
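
The lookbehind/lookahead pattern in the query above can be checked offline with Python's `re` module before running it in Presto. The `event_body` string below is a hypothetical example built to match the pattern; the real strings stored in the `players` table may be laid out differently.

```python
import re

# Hypothetical event_body; the key names and ordering are assumptions.
event_body = (
    '{"event_type": "add_player", '
    '"attributes": {"money": 999999, "name": "batman"}}'
)

# Same idea as the Presto pattern: match the text that follows `name": "`
# and precedes the closing `"}}`.
NAME_PATTERN = re.compile(r'(?<=name": ")(.*?)(?="}})')

match = NAME_PATTERN.search(event_body)
player_name = match.group(0) if match else None
print(player_name)
```

Note that the pattern relies on `name` being the final field before the closing braces. If another field followed it, the lazy match would run on until the next `"}}` and return more than the name.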


### How many guilds are there?
SELECT
    count(distinct regexp_extract(event_body,'(?<=name\": \")(.*?)(?="}})') ) as number_of_guilds
FROM
    guilds;


### Which sword is exchanged the most?
SELECT
    regexp_extract(event_body,'(?<=sword_id\": )(.*?)(?=, \"trans)') as sword_id, 
    count( regexp_extract(event_body,'(?<=sword_id\": )(.*?)(?=, \"trans)') ) as count_of_transactions
FROM
    sword_transactions
GROUP BY
    regexp_extract(event_body,'(?<=sword_id\": )(.*?)(?=, \"trans)')
ORDER BY 
    count( regexp_extract(event_body,'(?<=sword_id\": )(.*?)(?=, \"trans)') ) DESC;


### Which guild has the longest membership?

SELECT c.guild_id, c.player_id, sum(duration) as membership_length
FROM (
    SELECT b.guild_id, b.player_id,
        CASE
            WHEN b.action = 'true' THEN date_diff('minute', b.timestamp, current_timestamp)
            WHEN b.action = 'false' THEN -date_diff('minute', b.timestamp, current_timestamp)
            END as duration
    FROM(
        SELECT *
        FROM (
            SELECT * , rank() over (
                PARTITION BY a.guild_id, a.player_id
                ORDER BY a.timestamp DESC) as rank
            FROM (
                SELECT
                    regexp_extract(event_body,'(?<=guild_id\": )(.*?)(?=}})') as guild_id,
                    regexp_extract(event_body,'(?<=player_id\": )(.*?)(?=, \"timestamp)') as player_id, 
                    regexp_extract(event_body,'(?<=join\": )(.*?)(?=, \"player)') as action,
                    date_add('year', 100, date_parse(regexp_extract(event_body,'(?<=timestamp\": \")(.*?)(?=\", \"guild)'), '%m/%d/%Y %T')) as timestamp
                FROM
                    guild_membership) a )
        WHERE rank = 1 OR rank = 2 ) b ) c
GROUP BY
    c.guild_id, c.player_id
ORDER BY 
    sum(duration) DESC;
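
The arithmetic in the query above can be easier to follow in a small Python sketch. For each guild and player pair, the query keeps the two most recent events, counts minutes since a join as positive and minutes since a leave as negative, and sums them. The `membership_minutes` helper and the sample timestamps are hypothetical and only mirror that logic; they are not part of the pipeline.

```python
from datetime import datetime


def membership_minutes(events, now):
    """Sum signed minutes for the two latest (timestamp, joined) events."""
    latest_two = sorted(events, key=lambda e: e[0], reverse=True)[:2]
    total = 0
    for timestamp, joined in latest_two:
        minutes = int((now - timestamp).total_seconds() // 60)
        # A join adds the time elapsed since it; a leave subtracts it.
        total += minutes if joined else -minutes
    return total


now = datetime(2020, 1, 1, 12, 0)

# Joined at 11:00 and never left: 60 minutes of membership so far.
still_member = [(datetime(2020, 1, 1, 11, 0), True)]

# Joined at 10:00 and left at 10:45: 120 - 75 = 45 minutes of membership.
left_guild = [
    (datetime(2020, 1, 1, 10, 0), True),
    (datetime(2020, 1, 1, 10, 45), False),
]

print(membership_minutes(still_member, now))  # 60
print(membership_minutes(left_guild, now))    # 45
```

In this sketch, a player who leaves and then rejoins has a leave followed by a join as the two latest events, which produces a negative total. This is one of the edge cases the simple two-event approach does not handle.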