# Project Title
### Data Engineering Capstone Project

#### Project Summary
The goal of this project is to make League of Legends match data available in an analytical database so that insights can be drawn from it.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [None]:
# Do all imports and installs here
import getpass
import os
from datetime import datetime
import pandas as pd
import matplotlib as mat
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import from_json, col, udf, schema_of_json, lit, explode
from pyspark.sql import functions as  F
from pyspark.sql.types import TimestampType

pd.set_option("display.max_columns", 150)
pd.set_option("display.max_colwidth", 250)
pd.set_option("display.max_rows", 50)

In [14]:
aws_access_key_id = getpass.getpass()

········


In [15]:
aws_secret_access_key = getpass.getpass()

········


In [20]:
os.environ['AWS_ACCESS_KEY_ID'] = aws_access_key_id
os.environ['AWS_SECRET_ACCESS_KEY'] = aws_secret_access_key

In [25]:
# Confirm the variable is set without echoing the credential itself
print('AWS_ACCESS_KEY_ID' in os.environ)

True


### Step 1: Scope the Project and Gather Data

#### Scope 
> Explain what you plan to do in the project in more detail. What data do you use? What is your end solution look like? What tools did you use? etc

- This project aims to gather and make available detailed information about League of Legends matches, so that analytical insights can guide game-plan creation and strategy validation, answering questions such as:
    - What would be the best starting items given the current champions and players' levels?
    - Given a specific distribution of players across lanes and the players' levels, where is the first gank most likely to succeed?
    - Given specific game stats at a certain moment in the game, what would be the best team strategy: push or retreat?
    - Given a player's number of last hits, what are their kills/deaths/assists stats likely to be?
    
- The data used for this project reports, for each match, the players' individual stats and team-level details;
- The end solution is a data pipeline that retrieves data from Riot's API, stages it in Amazon S3, processes it with Amazon EMR, and loads it into an analytical database, for which I use Amazon Redshift;

#### Describe and Gather Data 
> Describe the data sets you're using. Where did it come from? What type of information is included?

- The data used for this project comes from [Riot's open API](https://developer.riotgames.com/apis);
- The `matches*.json` dataset contains game occurrences (fact table);
- The `A.json` dataset contains champion details (dimension table);
- [Data Dragon API](https://developer.riotgames.com/docs/lol)
- [Champions data](http://ddragon.leagueoflegends.com/cdn/10.13.1/data/en_US/champion.json)
    - [Champion details data](http://ddragon.leagueoflegends.com/cdn/10.13.1/data/en_US/champion/Aatrox.json)
        - `Aatrox` is the name of one champion;
- [Items](http://ddragon.leagueoflegends.com/cdn/10.13.1/data/en_US/item.json)
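The Data Dragon links above share one URL pattern: a CDN base, a patch version (`10.13.1`), a locale (`en_US`) and a resource name. A minimal sketch of building them follows. The helper name `ddragon_url` is hypothetical, and the pattern is inferred only from the three links listed here, so other resources may not follow it.

```python
# Hypothetical helper; the URL pattern is inferred from the links above.
DDRAGON_BASE = "http://ddragon.leagueoflegends.com/cdn"


def ddragon_url(resource, version="10.13.1", locale="en_US"):
    """Build a Data Dragon URL for a resource such as 'champion' or 'item'."""
    return f"{DDRAGON_BASE}/{version}/data/{locale}/{resource}.json"


champions_url = ddragon_url("champion")
aatrox_url = ddragon_url("champion/Aatrox")  # details for a single champion
items_url = ddragon_url("item")
print(champions_url)
```

Keeping the version as a parameter makes it easier to refresh the dimension data when a new patch is released.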

In [None]:
spark = SparkSession.builder\
.master("local[*]")\
.config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0")\
.config("spark.driver.memory", "10g")\
.getOrCreate()

spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

In [None]:
# Read in the data here
raw_data = spark.read.json('../riot-scraper/results/matches')
raw_data.count()

Register the raw data as a temporary table so it can be queried with SQL:

In [4]:
raw_data.createOrReplaceTempView("matches")

In [5]:
sql_context = SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)

In [6]:
sql_context.sql("Select * from matches LIMIT 1").toPandas()

Unnamed: 0,gameCreation,gameDuration,gameId,gameMode,gameType,gameVersion,mapId,participantIdentities,participants,platformId,queueId,seasonId,teams
0,1497568122350,2833,2525196351,CLASSIC,MATCHED_GAME,7.12.190.9002,11,"[(1, (hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, NA1, /v1/stats/player_history/NA1/200106415, NA1, 749, BDP8U_QMNal824JGX-cmavXAO75ad8wVywPWmKV5jqZaGwM, mohammadmatahen1)), (2, (5_MnglWKGij6y...","[(11, BRONZE, [Row(masteryId=6111, rank=5), Row(masteryId=6121, rank=1), Row(masteryId=6134, rank=5), Row(masteryId=6142, rank=1), Row(masteryId=6312, rank=5), Row(masteryId=6323, rank=1), Row(masteryId=6331, rank=5), Row(masteryId=6343, rank=1),...",NA1,420,9,"[([Row(championId=17, pickTurn=1), Row(championId=51, pickTurn=2), Row(championId=105, pickTurn=3), Row(championId=86, pickTurn=4), Row(championId=72, pickTurn=5)], 0, 0, 3, False, True, True, False, False, True, 0, 0, 100, 4, 0, Fail), ([Row(cha..."


In [7]:
participant_identities = raw_data.limit(1).withColumn(
    "participantIdentity",
    F.explode(F.col("participantIdentities"))
)

participant_identities.toPandas()
# raw_data.limit(1).select("gameId").union(participant_identities).toPandas()

Unnamed: 0,gameCreation,gameDuration,gameId,gameMode,gameType,gameVersion,mapId,participantIdentities,participants,platformId,queueId,seasonId,teams,participantIdentity
0,1497568122350,2833,2525196351,CLASSIC,MATCHED_GAME,7.12.190.9002,11,"[(1, (hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, NA1, /v1/stats/player_history/NA1/200106415, NA1, 749, BDP8U_QMNal824JGX-cmavXAO75ad8wVywPWmKV5jqZaGwM, mohammadmatahen1)), (2, (5_MnglWKGij6y...","[(11, BRONZE, [Row(masteryId=6111, rank=5), Row(masteryId=6121, rank=1), Row(masteryId=6134, rank=5), Row(masteryId=6142, rank=1), Row(masteryId=6312, rank=5), Row(masteryId=6323, rank=1), Row(masteryId=6331, rank=5), Row(masteryId=6343, rank=1),...",NA1,420,9,"[([Row(championId=17, pickTurn=1), Row(championId=51, pickTurn=2), Row(championId=105, pickTurn=3), Row(championId=86, pickTurn=4), Row(championId=72, pickTurn=5)], 0, 0, 3, False, True, True, False, False, True, 0, 0, 100, 4, 0, Fail), ([Row(cha...","(1, (hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, NA1, /v1/stats/player_history/NA1/200106415, NA1, 749, BDP8U_QMNal824JGX-cmavXAO75ad8wVywPWmKV5jqZaGwM, mohammadmatahen1))"
1,1497568122350,2833,2525196351,CLASSIC,MATCHED_GAME,7.12.190.9002,11,"[(1, (hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, NA1, /v1/stats/player_history/NA1/200106415, NA1, 749, BDP8U_QMNal824JGX-cmavXAO75ad8wVywPWmKV5jqZaGwM, mohammadmatahen1)), (2, (5_MnglWKGij6y...","[(11, BRONZE, [Row(masteryId=6111, rank=5), Row(masteryId=6121, rank=1), Row(masteryId=6134, rank=5), Row(masteryId=6142, rank=1), Row(masteryId=6312, rank=5), Row(masteryId=6323, rank=1), Row(masteryId=6331, rank=5), Row(masteryId=6343, rank=1),...",NA1,420,9,"[([Row(championId=17, pickTurn=1), Row(championId=51, pickTurn=2), Row(championId=105, pickTurn=3), Row(championId=86, pickTurn=4), Row(championId=72, pickTurn=5)], 0, 0, 3, False, True, True, False, False, True, 0, 0, 100, 4, 0, Fail), ([Row(cha...","(2, (5_MnglWKGij6y8l7Y442WzlGxoZSj8DcY569YUKVMmLBsA, 5_MnglWKGij6y8l7Y442WzlGxoZSj8DcY569YUKVMmLBsA, NA1, /v1/stats/player_history/NA1/51360237, NA1, 23, pX0DANDsG8wcP4lXWURjDU4pfDI53Hu6b9Zc5cJkEswpoPU, KrazieKrush))"
2,1497568122350,2833,2525196351,CLASSIC,MATCHED_GAME,7.12.190.9002,11,"[(1, (hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, NA1, /v1/stats/player_history/NA1/200106415, NA1, 749, BDP8U_QMNal824JGX-cmavXAO75ad8wVywPWmKV5jqZaGwM, mohammadmatahen1)), (2, (5_MnglWKGij6y...","[(11, BRONZE, [Row(masteryId=6111, rank=5), Row(masteryId=6121, rank=1), Row(masteryId=6134, rank=5), Row(masteryId=6142, rank=1), Row(masteryId=6312, rank=5), Row(masteryId=6323, rank=1), Row(masteryId=6331, rank=5), Row(masteryId=6343, rank=1),...",NA1,420,9,"[([Row(championId=17, pickTurn=1), Row(championId=51, pickTurn=2), Row(championId=105, pickTurn=3), Row(championId=86, pickTurn=4), Row(championId=72, pickTurn=5)], 0, 0, 3, False, True, True, False, False, True, 0, 0, 100, 4, 0, Fail), ([Row(cha...","(3, (gVUCaZcrTMk-NGBMvFxnh8aqYv-22QgoL1lpquT_J92a1ew, gVUCaZcrTMk-NGBMvFxnh8aqYv-22QgoL1lpquT_J92a1ew, NA1, /v1/stats/player_history/NA1/237529792, NA1, 1666, 747QTgKa2MFyw6NPPLA_lNuWnmXLYm1q3tZneodMAKJh-H8, Rhythm7))"
3,1497568122350,2833,2525196351,CLASSIC,MATCHED_GAME,7.12.190.9002,11,"[(1, (hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, NA1, /v1/stats/player_history/NA1/200106415, NA1, 749, BDP8U_QMNal824JGX-cmavXAO75ad8wVywPWmKV5jqZaGwM, mohammadmatahen1)), (2, (5_MnglWKGij6y...","[(11, BRONZE, [Row(masteryId=6111, rank=5), Row(masteryId=6121, rank=1), Row(masteryId=6134, rank=5), Row(masteryId=6142, rank=1), Row(masteryId=6312, rank=5), Row(masteryId=6323, rank=1), Row(masteryId=6331, rank=5), Row(masteryId=6343, rank=1),...",NA1,420,9,"[([Row(championId=17, pickTurn=1), Row(championId=51, pickTurn=2), Row(championId=105, pickTurn=3), Row(championId=86, pickTurn=4), Row(championId=72, pickTurn=5)], 0, 0, 3, False, True, True, False, False, True, 0, 0, 100, 4, 0, Fail), ([Row(cha...","(4, (Rkx5N6XWQ6r0hgCZOUnkDHpokupCfNCEGK54BZ5n7ZOmAQ, Rkx5N6XWQ6r0hgCZOUnkDHpokupCfNCEGK54BZ5n7ZOmAQ, NA1, /v1/stats/player_history/NA1/50577866, NA1, 1109, NPo5z6MXbUM5jfkHCI3RI6PF2h3GlC4euQBg55zstZtd7gM, QUETIPP))"
4,1497568122350,2833,2525196351,CLASSIC,MATCHED_GAME,7.12.190.9002,11,"[(1, (hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, NA1, /v1/stats/player_history/NA1/200106415, NA1, 749, BDP8U_QMNal824JGX-cmavXAO75ad8wVywPWmKV5jqZaGwM, mohammadmatahen1)), (2, (5_MnglWKGij6y...","[(11, BRONZE, [Row(masteryId=6111, rank=5), Row(masteryId=6121, rank=1), Row(masteryId=6134, rank=5), Row(masteryId=6142, rank=1), Row(masteryId=6312, rank=5), Row(masteryId=6323, rank=1), Row(masteryId=6331, rank=5), Row(masteryId=6343, rank=1),...",NA1,420,9,"[([Row(championId=17, pickTurn=1), Row(championId=51, pickTurn=2), Row(championId=105, pickTurn=3), Row(championId=86, pickTurn=4), Row(championId=72, pickTurn=5)], 0, 0, 3, False, True, True, False, False, True, 0, 0, 100, 4, 0, Fail), ([Row(cha...","(5, (1WrmKAolmQuwJ0UXTqYo-9ctnscFhdd3L_786hQrFl2U_3s, 1WrmKAolmQuwJ0UXTqYo-9ctnscFhdd3L_786hQrFl2U_3s, NA1, /v1/stats/player_history/NA1/228948774, NA1, 1395, AJXIIezOsJNUUu9FRByRj4C_n66ZgCuhb46vZmoyZAV1ssw, AarKal07))"
5,1497568122350,2833,2525196351,CLASSIC,MATCHED_GAME,7.12.190.9002,11,"[(1, (hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, NA1, /v1/stats/player_history/NA1/200106415, NA1, 749, BDP8U_QMNal824JGX-cmavXAO75ad8wVywPWmKV5jqZaGwM, mohammadmatahen1)), (2, (5_MnglWKGij6y...","[(11, BRONZE, [Row(masteryId=6111, rank=5), Row(masteryId=6121, rank=1), Row(masteryId=6134, rank=5), Row(masteryId=6142, rank=1), Row(masteryId=6312, rank=5), Row(masteryId=6323, rank=1), Row(masteryId=6331, rank=5), Row(masteryId=6343, rank=1),...",NA1,420,9,"[([Row(championId=17, pickTurn=1), Row(championId=51, pickTurn=2), Row(championId=105, pickTurn=3), Row(championId=86, pickTurn=4), Row(championId=72, pickTurn=5)], 0, 0, 3, False, True, True, False, False, True, 0, 0, 100, 4, 0, Fail), ([Row(cha...","(6, (Z-bmVcKoruv7e5EbwcEHYsxtwLJWANBrAK5ledwz_oK8g0E, Z-bmVcKoruv7e5EbwcEHYsxtwLJWANBrAK5ledwz_oK8g0E, NA1, /v1/stats/player_history/NA1/211817424, NA1, 1666, _9s0uGliqmoNBTB_-00Dd7d-a8anvhDudQtL6zIBlqjYvIc, Cephalopodd))"
6,1497568122350,2833,2525196351,CLASSIC,MATCHED_GAME,7.12.190.9002,11,"[(1, (hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, NA1, /v1/stats/player_history/NA1/200106415, NA1, 749, BDP8U_QMNal824JGX-cmavXAO75ad8wVywPWmKV5jqZaGwM, mohammadmatahen1)), (2, (5_MnglWKGij6y...","[(11, BRONZE, [Row(masteryId=6111, rank=5), Row(masteryId=6121, rank=1), Row(masteryId=6134, rank=5), Row(masteryId=6142, rank=1), Row(masteryId=6312, rank=5), Row(masteryId=6323, rank=1), Row(masteryId=6331, rank=5), Row(masteryId=6343, rank=1),...",NA1,420,9,"[([Row(championId=17, pickTurn=1), Row(championId=51, pickTurn=2), Row(championId=105, pickTurn=3), Row(championId=86, pickTurn=4), Row(championId=72, pickTurn=5)], 0, 0, 3, False, True, True, False, False, True, 0, 0, 100, 4, 0, Fail), ([Row(cha...","(7, (mYZP7JV2XIwY0ySs_n8_Gq_cmoAKLW_rIB99jlfAtk26CZA, mYZP7JV2XIwY0ySs_n8_Gq_cmoAKLW_rIB99jlfAtk26CZA, NA1, /v1/stats/player_history/NA1/227754657, NA1, 745, du3ucDDU1mqooL_4AQ7mA75ZsBo6AbDs3asJefTxeHke1ZM, Oenonexus))"
7,1497568122350,2833,2525196351,CLASSIC,MATCHED_GAME,7.12.190.9002,11,"[(1, (hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, NA1, /v1/stats/player_history/NA1/200106415, NA1, 749, BDP8U_QMNal824JGX-cmavXAO75ad8wVywPWmKV5jqZaGwM, mohammadmatahen1)), (2, (5_MnglWKGij6y...","[(11, BRONZE, [Row(masteryId=6111, rank=5), Row(masteryId=6121, rank=1), Row(masteryId=6134, rank=5), Row(masteryId=6142, rank=1), Row(masteryId=6312, rank=5), Row(masteryId=6323, rank=1), Row(masteryId=6331, rank=5), Row(masteryId=6343, rank=1),...",NA1,420,9,"[([Row(championId=17, pickTurn=1), Row(championId=51, pickTurn=2), Row(championId=105, pickTurn=3), Row(championId=86, pickTurn=4), Row(championId=72, pickTurn=5)], 0, 0, 3, False, True, True, False, False, True, 0, 0, 100, 4, 0, Fail), ([Row(cha...","(8, (-vHa5Y8fI5OnlEsfhU4rCeS4T3vd20gt3r_pcKWVwoRM8hc, -vHa5Y8fI5OnlEsfhU4rCeS4T3vd20gt3r_pcKWVwoRM8hc, NA1, /v1/stats/player_history/NA1/219451056, NA1, 1588, UR-nVK1-Xqc902KQXQnGHIwK1iBiUZ5ieJIi2P8H-udNpDU, BeamoFailz))"
8,1497568122350,2833,2525196351,CLASSIC,MATCHED_GAME,7.12.190.9002,11,"[(1, (hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, NA1, /v1/stats/player_history/NA1/200106415, NA1, 749, BDP8U_QMNal824JGX-cmavXAO75ad8wVywPWmKV5jqZaGwM, mohammadmatahen1)), (2, (5_MnglWKGij6y...","[(11, BRONZE, [Row(masteryId=6111, rank=5), Row(masteryId=6121, rank=1), Row(masteryId=6134, rank=5), Row(masteryId=6142, rank=1), Row(masteryId=6312, rank=5), Row(masteryId=6323, rank=1), Row(masteryId=6331, rank=5), Row(masteryId=6343, rank=1),...",NA1,420,9,"[([Row(championId=17, pickTurn=1), Row(championId=51, pickTurn=2), Row(championId=105, pickTurn=3), Row(championId=86, pickTurn=4), Row(championId=72, pickTurn=5)], 0, 0, 3, False, True, True, False, False, True, 0, 0, 100, 4, 0, Fail), ([Row(cha...","(9, (x-kpth0IiWdvnw-w6GXetT_iKRmn1O8rLD48-0oekhR0-EQ, x-kpth0IiWdvnw-w6GXetT_iKRmn1O8rLD48-0oekhR0-EQ, NA1, /v1/stats/player_history/NA1/231722313, NA1, 1627, CFkCXwO0DuKuPDsKUGRJbCswGySfpL_kvP8xs30Lj81MSRk, ParanoiDD))"
9,1497568122350,2833,2525196351,CLASSIC,MATCHED_GAME,7.12.190.9002,11,"[(1, (hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, NA1, /v1/stats/player_history/NA1/200106415, NA1, 749, BDP8U_QMNal824JGX-cmavXAO75ad8wVywPWmKV5jqZaGwM, mohammadmatahen1)), (2, (5_MnglWKGij6y...","[(11, BRONZE, [Row(masteryId=6111, rank=5), Row(masteryId=6121, rank=1), Row(masteryId=6134, rank=5), Row(masteryId=6142, rank=1), Row(masteryId=6312, rank=5), Row(masteryId=6323, rank=1), Row(masteryId=6331, rank=5), Row(masteryId=6343, rank=1),...",NA1,420,9,"[([Row(championId=17, pickTurn=1), Row(championId=51, pickTurn=2), Row(championId=105, pickTurn=3), Row(championId=86, pickTurn=4), Row(championId=72, pickTurn=5)], 0, 0, 3, False, True, True, False, False, True, 0, 0, 100, 4, 0, Fail), ([Row(cha...","(10, (eoxaPYeYaI0VkcrNqGs1H78pjLl-5wnPJJYvjw9jjPXLnbw, eoxaPYeYaI0VkcrNqGs1H78pjLl-5wnPJJYvjw9jjPXLnbw, NA1, /v1/stats/player_history/NA1/215691122, NA1, 1606, EIC5qnqvpeH2_g63ugCkHhO5Uh-NI6cciIgB2uH5Hh79QFw, 6969wolf6969))"


#### Add timestamp field

In [None]:
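@udf(TimestampType())
def get_datetime_from(long_value):
    # gameCreation is in epoch milliseconds; pass nulls through instead of raising
    if long_value is None:
        return None
    return datetime.fromtimestamp(long_value / 1000)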

In [None]:
raw_data = raw_data\
.withColumn('ts', get_datetime_from(raw_data.gameCreation))\
.withColumn('year', F.year('ts'))\
.withColumn('month', F.month('ts'))
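The UDF above calls `datetime.fromtimestamp` without a timezone, which returns the local time of the machine running it. The derived `year` and `month` can therefore shift for matches created near a month boundary, depending on where the job runs. A minimal sketch of a timezone-explicit conversion in plain Python follows. The helper name `epoch_ms_to_utc` is hypothetical, and choosing UTC is an assumption rather than something the notebook specifies.

```python
from datetime import datetime, timedelta, timezone


def epoch_ms_to_utc(epoch_ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    if epoch_ms is None:
        return None
    # Integer arithmetic avoids float rounding in the millisecond part.
    seconds, millis = divmod(epoch_ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


ts = epoch_ms_to_utc(1497568122350)  # gameCreation of the sample match
print(ts.isoformat(), ts.year, ts.month)
```

For the sample match this gives 23:08:42 UTC on 2017-06-15, while the notebook output shows 20:08:42, which is consistent with the notebook having run in a UTC-3 timezone.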

In [10]:
raw_data.limit(1).toPandas()

Unnamed: 0,gameCreation,gameDuration,gameId,gameMode,gameType,gameVersion,mapId,participantIdentities,participants,platformId,queueId,seasonId,teams,ts,year,month
0,1497568122350,2833,2525196351,CLASSIC,MATCHED_GAME,7.12.190.9002,11,"[(1, (hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo, NA1, /v1/stats/player_history/NA1/200106415, NA1, 749, BDP8U_QMNal824JGX-cmavXAO75ad8wVywPWmKV5jqZaGwM, mohammadmatahen1)), (2, (5_MnglWKGij6y...","[(11, BRONZE, [Row(masteryId=6111, rank=5), Row(masteryId=6121, rank=1), Row(masteryId=6134, rank=5), Row(masteryId=6142, rank=1), Row(masteryId=6312, rank=5), Row(masteryId=6323, rank=1), Row(masteryId=6331, rank=5), Row(masteryId=6343, rank=1),...",NA1,420,9,"[([Row(championId=17, pickTurn=1), Row(championId=51, pickTurn=2), Row(championId=105, pickTurn=3), Row(championId=86, pickTurn=4), Row(championId=72, pickTurn=5)], 0, 0, 3, False, True, True, False, False, True, 0, 0, 100, 4, 0, Fail), ([Row(cha...",2017-06-15 20:08:42.350,2017,6


In [None]:
# Keep only 2020 matches and write them to Parquet, partitioned by year and month
raw_data = raw_data.where("year = 2020")
raw_data = raw_data.repartition('year', 'month')

data_destination = "data/parquet/matches_data"
# data_destination = "s3a://udacity-capstone-lol/lol_ready_data/match"
raw_data.write.partitionBy('year', 'month').mode('overwrite').parquet(data_destination)
raw_data_filtered = spark.read.parquet(data_destination)
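`partitionBy('year', 'month')` writes one directory per distinct pair of values, named in the `column=value` style. A small sketch of the directory a given match would land in follows. The helper `partition_dir` is hypothetical, the timestamp is an illustrative 2020 value that does not come from the dataset, and UTC is assumed for deriving the year and month.

```python
from datetime import datetime, timezone


def partition_dir(base, epoch_ms):
    """Return the column=value style directory for a gameCreation value."""
    ts = datetime.fromtimestamp(epoch_ms // 1000, tz=timezone.utc)
    return f"{base}/year={ts.year}/month={ts.month}"


# 1593561600000 is 2020-07-01 00:00:00 UTC (illustrative value only).
example_dir = partition_dir("data/parquet/matches_data", 1593561600000)
print(example_dir)
```

Reading the dataset back with a filter on `year` or `month` lets Spark skip the directories that do not match.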

In [None]:
raw_data_filtered.count()

## Data Assessment

### Data schema

In [None]:
raw_data_filtered.printSchema()

#### Useful columns

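- `gameCreation`: game creation time, in epoch milliseconds;
- `gameDuration`: game duration, in seconds;
- `gameId`: id of the match;
- `gameType`: type of the game (e.g. `MATCHED_GAME`);
- `gameVersion`: version of the game the match was played on (e.g. `7.12.190.9002`);
- `mapId`: id of the map the match was played on (e.g. `11`);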

### Data sample

In [14]:
raw_data_filtered.where(raw_data_filtered.gameId == 3474165150).toPandas()

Unnamed: 0,gameCreation,gameDuration,gameId,gameMode,gameType,gameVersion,mapId,participantIdentities,participants,platformId,queueId,seasonId,teams,ts,year,month


### Step 2: Explore and Assess the Data
#### Explore the Data 
> Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
> Document steps necessary to clean the data

In [31]:
raw_data_filtered = raw_data.withColumn(
    "team", F.explode(F.col("teams"))
).withColumn(
    "participant", F.explode(F.col("participants"))
).withColumn(
    "participantIdentity", F.explode(F.col("participantIdentities"))
).select(
    "gameCreation",
    "gameDuration",
    "gameId",
    "gameMode",
    "gameType",
    "team.teamId",
    "participant.participantId",
    "team.win",
    "team.firstBlood",
    "team.firstTower",
    "team.firstInhibitor",
    "team.firstBaron",
    "team.firstDragon",
    "team.firstRiftHerald",
    "team.towerKills",
    "team.baronKills",
    "team.dragonKills",
    "team.vilemawKills",
    "team.riftHeraldKills",
    "participant.championId",
    "participant.spell1Id",
    "participant.spell2Id",
    F.col("participant.stats.win").alias("participant.stats.win"),
    "participant.stats.item0",
    "participant.stats.item1",
    "participant.stats.item2",
    "participant.stats.item3",
    "participant.stats.item4",
    "participant.stats.item5",
    "participant.stats.item6",
    "participant.stats.kills",
    "participant.stats.deaths",
    "participant.stats.assists",
    "participant.stats.largestKillingSpree",
    "participant.stats.largestMultiKill",
    "participant.stats.killingSprees",
    "participant.stats.longestTimeSpentLiving",
    "participant.stats.doubleKills",
    "participant.stats.tripleKills",
    "participant.stats.quadraKills",
    "participant.stats.pentaKills",
    "participant.stats.unrealKills",
    "participant.stats.totalDamageDealt",
    "participant.stats.magicDamageDealt",
    "participant.stats.physicalDamageDealt",
    "participant.stats.trueDamageDealt",
    "participant.stats.largestCriticalStrike",
    "participant.stats.totalDamageDealtToChampions",
    "participant.stats.magicDamageDealtToChampions",
    "participant.stats.physicalDamageDealtToChampions",
    "participant.stats.trueDamageDealtToChampions",
    "participant.stats.totalHeal",
    "participant.stats.totalUnitsHealed",
    "participant.stats.damageSelfMitigated",
    "participant.stats.damageDealtToObjectives",
    "participant.stats.damageDealtToTurrets",
    "participant.stats.totalDamageTaken",
    "participant.stats.magicalDamageTaken",
    "participant.stats.physicalDamageTaken",
    "participant.stats.trueDamageTaken",
    "participant.stats.goldEarned",
    "participant.stats.goldSpent",
    "participant.stats.turretKills",
    "participant.stats.inhibitorKills",
    "participant.stats.totalMinionsKilled",
    "participant.stats.neutralMinionsKilled",
    "participant.stats.neutralMinionsKilledTeamJungle",
    "participant.stats.neutralMinionsKilledEnemyJungle",
    "participant.stats.totalTimeCrowdControlDealt",
    "participant.stats.champLevel",
    "participant.stats.visionWardsBoughtInGame",
    "participant.stats.sightWardsBoughtInGame",
    "participant.stats.wardsPlaced",
    "participant.stats.wardsKilled",
    "participant.stats.firstBloodKill",
    "participant.stats.firstBloodAssist",
    "participant.stats.firstTowerKill",
    "participant.stats.firstTowerAssist",
    "participant.stats.firstInhibitorKill",
    "participant.stats.firstInhibitorAssist",
    F.col("participant.timeline.creepsPerMinDeltas.0-10").alias("creepsPerMinDeltas-0-10"),
    F.col("participant.timeline.creepsPerMinDeltas.10-20").alias("creepsPerMinDeltas-10-20"),
    F.col("participant.timeline.creepsPerMinDeltas.20-30").alias("creepsPerMinDeltas-20-30"),
    F.col("participant.timeline.creepsPerMinDeltas.30-end").alias("creepsPerMinDeltas-30-end"),
    F.col("participant.timeline.xpPerMinDeltas.0-10").alias("xpPerMinDeltas-0-10"),
    F.col("participant.timeline.xpPerMinDeltas.10-20").alias("xpPerMinDeltas-10-20"),
    F.col("participant.timeline.xpPerMinDeltas.20-30").alias("xpPerMinDeltas-20-30"),
    F.col("participant.timeline.xpPerMinDeltas.30-end").alias("xpPerMinDeltas-30-end"),
    F.col("participant.timeline.goldPerMinDeltas.0-10").alias("goldPerMinDeltas-0-10"),
    F.col("participant.timeline.goldPerMinDeltas.10-20").alias("goldPerMinDeltas-10-20"),
    F.col("participant.timeline.goldPerMinDeltas.20-30").alias("goldPerMinDeltas-20-30"),
    F.col("participant.timeline.goldPerMinDeltas.30-end").alias("goldPerMinDeltas-30-end"),
    F.col("participant.timeline.csDiffPerMinDeltas.0-10").alias("csDiffPerMinDeltas-0-10"),
    F.col("participant.timeline.csDiffPerMinDeltas.10-20").alias("csDiffPerMinDeltas-10-20"),
    F.col("participant.timeline.csDiffPerMinDeltas.20-30").alias("csDiffPerMinDeltas-20-30"),
    F.col("participant.timeline.csDiffPerMinDeltas.30-end").alias("csDiffPerMinDeltas-30-end"),
    F.col("participant.timeline.xpDiffPerMinDeltas.0-10").alias("xpDiffPerMinDeltas-0-10"),
    F.col("participant.timeline.xpDiffPerMinDeltas.10-20").alias("xpDiffPerMinDeltas-10-20"),
    F.col("participant.timeline.xpDiffPerMinDeltas.20-30").alias("xpDiffPerMinDeltas-20-30"),
    F.col("participant.timeline.xpDiffPerMinDeltas.30-end").alias("xpDiffPerMinDeltas-30-end"),
    F.col("participant.timeline.damageTakenPerMinDeltas.0-10").alias("damageTakenPerMinDeltas-0-10"),
    F.col("participant.timeline.damageTakenPerMinDeltas.10-20").alias("damageTakenPerMinDeltas-10-20"),
    F.col("participant.timeline.damageTakenPerMinDeltas.20-30").alias("damageTakenPerMinDeltas-20-30"),
    F.col("participant.timeline.damageTakenPerMinDeltas.30-end").alias("damageTakenPerMinDeltas-30-end"),
    F.col("participant.timeline.damageTakenDiffPerMinDeltas.0-10").alias("damageTakenDiffPerMinDeltas-0-10"),
    F.col("participant.timeline.damageTakenDiffPerMinDeltas.10-20").alias("damageTakenDiffPerMinDeltas-10-20"),
    F.col("participant.timeline.damageTakenDiffPerMinDeltas.20-30").alias("damageTakenDiffPerMinDeltas-20-30"),
    F.col("participant.timeline.damageTakenDiffPerMinDeltas.30-end").alias("damageTakenDiffPerMinDeltas-30-end"),
    "participant.timeline.role",
    "participant.timeline.lane",
    F.col("participantIdentity.participantId").alias("participantIdentity.participantId"),
    "participantIdentity.player.platformId",
    "participantIdentity.player.accountId",
    "participantIdentity.player.summonerId",
)

In [32]:
raw_data_filtered.count()

811040
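Each `explode` multiplies the rows of the previous step, so chaining three of them produces every combination of team, participant and participant identity within a match. For a standard ten-player match that is 2 × 10 × 10 = 200 rows, of which only 10 pair a participant with their own team and identity. The sketch below reproduces the effect on toy data in plain Python. It assumes each participant record carries a `teamId` field, which is not visible in the output shown in this notebook.

```python
from itertools import product

# Toy match with the standard shape: 2 teams, 10 participants, 10 identities.
teams = [{"teamId": 100}, {"teamId": 200}]
participants = [
    {"participantId": i, "teamId": 100 if i <= 5 else 200} for i in range(1, 11)
]
identities = [{"participantId": i} for i in range(1, 11)]

# Three chained explodes behave like a Cartesian product of the arrays.
exploded = list(product(teams, participants, identities))

# Keep only combinations that describe the same player on their own team.
matched = [
    (t, p, i)
    for t, p, i in exploded
    if t["teamId"] == p["teamId"] and p["participantId"] == i["participantId"]
]

print(len(exploded), len(matched))
```

In the Spark job, the same idea could be expressed as a `where` clause comparing `participant.teamId` with `team.teamId` and `participant.participantId` with `participantIdentity.participantId` before the `select`, provided those fields exist in the schema.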

In [33]:
raw_data_filtered.limit(1).toPandas()

Unnamed: 0,gameCreation,gameDuration,gameId,gameMode,gameType,teamId,participantId,win,firstBlood,firstTower,firstInhibitor,firstBaron,firstDragon,firstRiftHerald,towerKills,baronKills,dragonKills,vilemawKills,riftHeraldKills,championId,spell1Id,spell2Id,participant.stats.win,item0,item1,item2,item3,item4,item5,item6,kills,deaths,assists,largestKillingSpree,largestMultiKill,killingSprees,longestTimeSpentLiving,doubleKills,tripleKills,quadraKills,pentaKills,unrealKills,totalDamageDealt,magicDamageDealt,physicalDamageDealt,trueDamageDealt,largestCriticalStrike,totalDamageDealtToChampions,magicDamageDealtToChampions,physicalDamageDealtToChampions,trueDamageDealtToChampions,totalHeal,totalUnitsHealed,damageSelfMitigated,damageDealtToObjectives,damageDealtToTurrets,totalDamageTaken,magicalDamageTaken,physicalDamageTaken,trueDamageTaken,goldEarned,goldSpent,turretKills,inhibitorKills,totalMinionsKilled,neutralMinionsKilled,neutralMinionsKilledTeamJungle,neutralMinionsKilledEnemyJungle,totalTimeCrowdControlDealt,champLevel,visionWardsBoughtInGame,sightWardsBoughtInGame,wardsPlaced,wardsKilled,firstBloodKill,firstBloodAssist,firstTowerKill,firstTowerAssist,firstInhibitorKill,firstInhibitorAssist,creepsPerMinDeltas-0-10,creepsPerMinDeltas-10-20,creepsPerMinDeltas-20-30,creepsPerMinDeltas-30-end,xpPerMinDeltas-0-10,xpPerMinDeltas-10-20,xpPerMinDeltas-20-30,xpPerMinDeltas-30-end,goldPerMinDeltas-0-10,goldPerMinDeltas-10-20,goldPerMinDeltas-20-30,goldPerMinDeltas-30-end,csDiffPerMinDeltas-0-10,csDiffPerMinDeltas-10-20,csDiffPerMinDeltas-20-30,csDiffPerMinDeltas-30-end,xpDiffPerMinDeltas-0-10,xpDiffPerMinDeltas-10-20,xpDiffPerMinDeltas-20-30,xpDiffPerMinDeltas-30-end,damageTakenPerMinDeltas-0-10,damageTakenPerMinDeltas-10-20,damageTakenPerMinDeltas-20-30,damageTakenPerMinDeltas-30-end,damageTakenDiffPerMinDeltas-0-10,damageTakenDiffPerMinDeltas-10-20,damageTakenDiffPerMinDeltas-20-30,damageTakenDiffPerMinDeltas-30-end,role,lane,participantIdentity.participantId,platformId,accountId,summonerId
0,1497568122350,2833,2525196351,CLASSIC,MATCHED_GAME,100,1,Fail,True,True,False,False,True,False,4,0,3,0,0,11,11,4,False,1419,3087,3074,3072,3046,3031,3340,12,11,10,3,2,3,520,2,0,0,0,0,493632,82600,370237,40796,1084,35458,3914,27348,4195,9747,1,45204,22645,449,41854,3504,38016,334,22191,18875,0,0,257,90,58,32,474,18,0,0,18,2,False,False,False,False,False,False,1.1,1.2,2.3,11.866667,314.5,386.5,552.9,907.533333,237.7,410.0,440.2,627.6,-0.4,-0.6,-0.7,9.066667,-48.0,-29.4,52.1,451.533333,489.1,666.8,1088.5,949.266667,106.4,123.2,-319.2,-558.466667,NONE,JUNGLE,1,NA1,hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo,BDP8U_QMNal824JGX-cmavXAO75ad8wVywPWmKV5jqZaGwM


### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
> Map out the conceptual data model and explain why you chose that model

The goal of this data model is to provide a fact table with game stats and dimension tables with details about data referenced in the fact table.

![lol-conceptual-data-model](resources/img/lol-conceptual-data-model.png)

#### 3.2 Mapping Out Data Pipelines
> List the steps necessary to pipeline the data into the chosen data model
- First, the data is persisted to S3 in Parquet format for performance reasons;
- The raw data comes in a deeply nested JSON structure that would be hard to work with once in Redshift, which runs counter to the goal of an easy-to-understand, intuitive data warehouse;
- The transformations that produce the final data model use the `pyspark.sql.functions.explode` function, which creates a separate row for each item inside an array, so data consumers do not have to navigate JSON structures.
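Besides exploding arrays, the flattening step renames nested timeline fields such as `participant.timeline.creepsPerMinDeltas.0-10` to flat column names such as `creepsPerMinDeltas-0-10`. Those 28 aliases follow a regular pattern, so they could be generated rather than typed out. A minimal sketch in plain Python follows. The metric and bucket names are copied from the select list in this notebook, while the helper `timeline_aliases` is hypothetical.

```python
# Metric and bucket names are copied from the select list in this notebook.
METRICS = [
    "creepsPerMinDeltas", "xpPerMinDeltas", "goldPerMinDeltas",
    "csDiffPerMinDeltas", "xpDiffPerMinDeltas",
    "damageTakenPerMinDeltas", "damageTakenDiffPerMinDeltas",
]
BUCKETS = ["0-10", "10-20", "20-30", "30-end"]


def timeline_aliases(prefix="participant.timeline"):
    """Return (nested_path, flat_alias) pairs for every timeline delta."""
    return [
        (f"{prefix}.{metric}.{bucket}", f"{metric}-{bucket}")
        for metric in METRICS
        for bucket in BUCKETS
    ]


pairs = timeline_aliases()
print(len(pairs), pairs[0])
```

Each pair could then be turned into a column expression with `F.col(path).alias(alias)` and passed to `select` alongside the other fields.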

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
> Build the data pipelines to create the data model.

1. The data is extracted from Riot's API (the source);
1. It is staged to S3 as-is;
1. A Spark application consumes the data and performs the transformations;
1. The transformed data is loaded into the data lake storage (S3) in Parquet format;
1. The data is loaded from S3 into Redshift.
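The last step, loading Parquet files from S3 into Redshift, is typically done with a `COPY` statement. The sketch below only builds the SQL text. The table name comes from the data dictionary in this notebook and the S3 path mirrors the commented-out destination used earlier, but the IAM role ARN is a placeholder and the helper `build_copy_statement` is hypothetical. It also assumes the target table's columns line up with the columns in the Parquet files.

```python
def build_copy_statement(table, s3_path, iam_role_arn):
    """Return a Redshift COPY statement that loads Parquet files from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


copy_sql = build_copy_statement(
    table="fact_game_match",
    s3_path="s3://udacity-capstone-lol/lol_ready_data/match",
    # Placeholder ARN: replace with a role that can read the bucket.
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-role",
)
print(copy_sql)
```

The resulting statement would be executed through whichever Redshift client the pipeline uses.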

In [17]:
# Write code here

#### 4.2 Data Quality Checks
> Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
> * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
> * Unit tests for the scripts to ensure they are doing the right thing
> * Source/Count checks to ensure completeness
 
> Run Quality Checks

In [23]:
# Perform quality checks here
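The checks listed above (completeness, integrity and uniqueness) can be sketched on plain Python rows before porting them to Spark or SQL. The function below is illustrative only. Treating `(gameId, participantId)` as the key of the fact table is an assumption about its grain, and `run_quality_checks` is a hypothetical name.

```python
from collections import Counter


def run_quality_checks(rows, key_fields=("gameId", "participantId")):
    """Return a list of failure messages; an empty list means all checks passed."""
    if not rows:
        return ["completeness: table is empty"]
    failures = []
    keys = [tuple(row.get(field) for field in key_fields) for row in rows]
    if any(value is None for key in keys for value in key):
        failures.append("integrity: null value in key fields")
    duplicates = [key for key, count in Counter(keys).items() if count > 1]
    if duplicates:
        failures.append(f"uniqueness: {len(duplicates)} duplicated key(s)")
    return failures


sample = [
    {"gameId": 2525196351, "participantId": 1},
    {"gameId": 2525196351, "participantId": 2},
    {"gameId": 2525196351, "participantId": 2},  # deliberate duplicate
]
print(run_quality_checks(sample))
```

A uniqueness check of this kind would also surface any row multiplication introduced by the flattening step.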

In [18]:
raw_data_filtered.printSchema()

root
 |-- gameCreation: long (nullable = true)
 |-- gameDuration: long (nullable = true)
 |-- gameId: long (nullable = true)
 |-- gameMode: string (nullable = true)
 |-- gameType: string (nullable = true)
 |-- teamId: long (nullable = true)
 |-- participantId: long (nullable = true)
 |-- win: string (nullable = true)
 |-- firstBlood: boolean (nullable = true)
 |-- firstTower: boolean (nullable = true)
 |-- firstInhibitor: boolean (nullable = true)
 |-- firstBaron: boolean (nullable = true)
 |-- firstDragon: boolean (nullable = true)
 |-- firstRiftHerald: boolean (nullable = true)
 |-- towerKills: long (nullable = true)
 |-- baronKills: long (nullable = true)
 |-- dragonKills: long (nullable = true)
 |-- vilemawKills: long (nullable = true)
 |-- riftHeraldKills: long (nullable = true)
 |-- dominionVictoryScore: long (nullable = true)
 |-- championId: long (nullable = true)
 |-- spell1Id: long (nullable = true)
 |-- spell2Id: long (nullable = true)
 |-- participant.stats.win: boolean (nu

#### 4.3 Data dictionary 
> Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

There are two types of data dictionary:
- Active data dictionary: Created and maintained within the database. Updated automatically based on real data;
- Passive data dictionary: Created and maintained separate from the database. This type of document tends to get out of date as changes are not automatically replicated;

The data dictionary presented here is a passive one, as it resides outside the database.

##### Table: `dim_champion`


##### Table: `dim_item`


##### Table: `dim_summoner`


##### Table: `fact_game_match`

Field Name | Data Type | Field size for display | Description | Example
-----------|-----------|------------------------|-------------|--------
gameCreation | BIGINT | 19 | Game creation time in epoch milliseconds | 1497568122350
gameDuration | SMALLINT | 5 | Game duration in seconds | 2833
gameId | BIGINT | 19 | Id of the match | 2525196351
gameMode | VARCHAR(10) | 10 | Mode of the game | CLASSIC
gameType | VARCHAR(20) | 20 | Type of the game | MATCHED_GAME
teamId | SMALLINT | 5 | Id of the team a participant is part of in the match [100, 200] | 100
participantId | SMALLINT | 5 | Id of the player in the game [1-10] | 3
win | BOOLEAN | 1 | Whether the team won the match [0,1] | 1
firstBlood | BOOLEAN | 1 | Whether the team scored the first blood [0,1] | 1
firstTower | BOOLEAN | 1 | Whether the team destroyed the first tower [0,1] | 1
firstInhibitor | BOOLEAN | 1 | Whether the team destroyed the first inhibitor [0,1] | 1
firstBaron | BOOLEAN | 1 | Whether the team killed the first Baron [0,1] | 1
firstDragon | BOOLEAN | 1 | Whether the team killed the first dragon [0,1] | 1
firstRiftHerald | BOOLEAN | 1 | Whether the team killed the first Rift Herald [0,1] | 1
towerKills | SMALLINT | 5 | Number of towers the team destroyed | 4
baronKills | SMALLINT | 5 | Number of times the team killed Baron | 0
dragonKills | SMALLINT | 5 | Number of dragons the team killed | 3
vilemawKills | SMALLINT | 5 | Number of times the team killed Vilemaw | 0
riftHeraldKills | SMALLINT | 5 | Number of times the team killed Rift Herald | 0
championId | INTEGER | 10 | Id of the champion the player chose | 11
spell1Id | SMALLINT | 5 | Summoner's spell 1 | 11
spell2Id | SMALLINT | 5 | Summoner's spell 2 | 4
participant.stats.win | BOOLEAN | 1 | Whether the participant won the match [0,1] | 1
item0 | SMALLINT | 5 | Item 0 of the participant | 1419
item1 | SMALLINT | 5 | Item 1 of the participant | 3087
item2 | SMALLINT | 5 | Item 2 of the participant | 3074
item3 | SMALLINT | 5 | Item 3 of the participant | 3072
item4 | SMALLINT | 5 | Item 4 of the participant | 3046
item5 | SMALLINT | 5 | Item 5 of the participant | 3031
item6 | SMALLINT | 5 | Item 6 of the participant | 3340
kills | SMALLINT | 5 | Number of kills a participant scored | 12
deaths | SMALLINT | 5 | Number of deaths a participant suffered | 11
assists | SMALLINT | 5 | Number of assists a participant did | 10
largestKillingSpree | SMALLINT | 5 | Largest killing spree a participant got | 3
largestMultiKill | SMALLINT | 5 | Largest multi-kill a participant got | 2
killingSprees | SMALLINT | 5 | Number of killing sprees a participant got | 3
longestTimeSpentLiving | SMALLINT | 5 | Longest time a participant stayed alive | 520
doubleKills | SMALLINT | 5 | Number of double kills a participant got | 2
tripleKills | SMALLINT | 5 | Number of triple kills a participant got | 0
quadraKills | SMALLINT | 5 | Number of quadra kills a participant got | 0
pentaKills | SMALLINT | 5 | Number of penta kills a participant got | 0
unrealKills | SMALLINT | 5 | Number of unreal kills a participant got | 0
totalDamageDealt | INTEGER | 10 | Total damage dealt during the match a participant did | 493632
magicDamageDealt | INTEGER | 10 | Total magic damage dealt during the match a participant did | 82600
physicalDamageDealt | INTEGER | 10 | Total physical damage dealt during the match a participant did | 370237
trueDamageDealt | INTEGER | 10 | Total true damage dealt during the match a participant did | 40796
largestCriticalStrike | INTEGER | 10 | Max critical strike a participant did | 1084
totalDamageDealtToChampions | INTEGER | 10 | Total damage dealt to opponents | 35458
magicDamageDealtToChampions | INTEGER | 10 | Total magic damage dealt to opponents | 3914
physicalDamageDealtToChampions | INTEGER | 10 | Total physical damage dealt to opponents | 27348
trueDamageDealtToChampions | INTEGER | 10 | Total true damage dealt to opponents | 4195
totalHeal | INTEGER | 10 | Total amount of healing a participant did | 9747
totalUnitsHealed | INTEGER | 10 | Number of units a participant healed | 1
damageSelfMitigated | INTEGER | 10 | Total damage a participant mitigated | 45204
damageDealtToObjectives | INTEGER | 10 | Total damage a participant did to objectives | 22645
damageDealtToTurrets | INTEGER | 10 | Total damage a participant did to turrets | 449
totalDamageTaken | INTEGER | 10 | Total damage taken | 41854
magicalDamageTaken | INTEGER | 10 | Total magical damage taken | 3504
physicalDamageTaken | INTEGER | 10 | Total physical damage taken | 38016
trueDamageTaken | INTEGER | 10 | Total true damage taken | 334
goldEarned | INTEGER | 10 | Total gold earned during the match | 22191
goldSpent | INTEGER | 10 | Total gold spent during the match | 18875
turretKills | INTEGER | 10 | Number of turrets a participant destroyed | 0
inhibitorKills | INTEGER | 10 | Number of inhibitors a participant destroyed | 0
totalMinionsKilled | SMALLINT | 5 | Total minions a participant killed | 257
neutralMinionsKilled | SMALLINT | 5 | Total neutral minions a participant killed | 90
neutralMinionsKilledTeamJungle | SMALLINT | 5 | Total neutral minions a participant killed within own jungle | 58
neutralMinionsKilledEnemyJungle | SMALLINT | 5 | Total neutral minions a participant killed within opponents' jungle | 32
totalTimeCrowdControlDealt | INTEGER | 10 | Total time of crowd control a participant dealt | 474
champLevel | SMALLINT | 5 | Level of the champion of a participant | 18
visionWardsBoughtInGame | INTEGER | 10 | Number of vision wards bought by the participant during the match | 0
sightWardsBoughtInGame | INTEGER | 10 | Number of sight wards bought by the participant during the match | 0
wardsPlaced | SMALLINT | 5 | Number of wards placed by the participant during the match | 18
wardsKilled | SMALLINT | 5 | Number of wards destroyed by the participant during the match | 2
firstBloodKill | BOOLEAN | 1 | Whether the participant scored the first blood [0,1] | 1
firstBloodAssist | BOOLEAN | 1 | Whether the participant assisted on the first blood [0,1] | 1
firstTowerKill | BOOLEAN | 1 | Whether the participant destroyed the first tower [0,1] | 1
firstTowerAssist | BOOLEAN | 1 | Whether the participant assisted on the first tower [0,1] | 1
firstInhibitorKill | BOOLEAN | 1 | Whether the participant destroyed the first inhibitor [0,1] | 1
firstInhibitorAssist | BOOLEAN | 1 | Whether the participant assisted on the first inhibitor [0,1] | 1
creepsPerMinDeltas-0-10 | DECIMAL(18,15) | 10 | Deltas of creeps per min for the first 10 min | 1.1
creepsPerMinDeltas-10-20 | DECIMAL(18,15) | 10 | Deltas of creeps per min between 10 and 20 min | 1.2
creepsPerMinDeltas-20-30 | DECIMAL(18,15) | 10 | Deltas of creeps per min between 20 and 30 min | 2.3
creepsPerMinDeltas-30-end | DECIMAL(18,15) | 10 | Deltas of creeps per min from 30 min on | 11.866667
xpPerMinDeltas-0-10 | DECIMAL(18,15) | 10 | Deltas of experience per min for the first 10 min | 314.5
xpPerMinDeltas-10-20 | DECIMAL(18,15) | 10 | Deltas of experience per min between 10 and 20 min | 386.5
xpPerMinDeltas-20-30 | DECIMAL(18,15) | 10 | Deltas of experience per min between 20 and 30 min | 552.9
xpPerMinDeltas-30-end | DECIMAL(18,15) | 10 | Deltas of experience per min from 30 min on | 907.533333
goldPerMinDeltas-0-10 | DECIMAL(18,15) | 10 | Deltas of gold per min for the first 10 min | 237.7
goldPerMinDeltas-10-20 | DECIMAL(18,15) | 10 | Deltas of gold per min between 10 and 20 min | 410.0
goldPerMinDeltas-20-30 | DECIMAL(18,15) | 10 | Deltas of gold per min between 20 and 30 min | 440.2
goldPerMinDeltas-30-end | DECIMAL(18,15) | 10 | Deltas of gold per min from 30 min on | 627.6
csDiffPerMinDeltas-0-10 | DECIMAL(18,15) | 10 | Differential of creeps per min for the first 10 min | -0.4
csDiffPerMinDeltas-10-20 | DECIMAL(18,15) | 10 | Differential of creeps per min between 10 and 20 min | -0.6
csDiffPerMinDeltas-20-30 | DECIMAL(18,15) | 10 | Differential of creeps per min between 20 and 30 min | -0.7
csDiffPerMinDeltas-30-end | DECIMAL(18,15) | 10 | Differential of creeps per min from 30 min on | 9.066667
xpDiffPerMinDeltas-0-10 | DECIMAL(18,15) | 10 | Differential of experience per min for the first 10 min | -48.0
xpDiffPerMinDeltas-10-20 | DECIMAL(18,15) | 10 | Differential of experience per min between 10 and 20 min | -29.4
xpDiffPerMinDeltas-20-30 | DECIMAL(18,15) | 10 | Differential of experience per min between 20 and 30 min | 52.1
xpDiffPerMinDeltas-30-end | DECIMAL(18,15) | 10 | Differential of experience per min from 30 min on | 451.533333
damageTakenPerMinDeltas-0-10 | DECIMAL(18,15) | 10 | Deltas of damage taken per min for the first 10 min | 489.1
damageTakenPerMinDeltas-10-20 | DECIMAL(18,15) | 10 | Deltas of damage taken per min between 10 and 20 min | 666.8
damageTakenPerMinDeltas-20-30 | DECIMAL(18,15) | 10 | Deltas of damage taken per min between 20 and 30 min | 1088.5
damageTakenPerMinDeltas-30-end | DECIMAL(18,15) | 10 | Deltas of damage taken per min from 30 min on | 949.266667
damageTakenDiffPerMinDeltas-0-10 | DECIMAL(18,15) | 10 | Differential of damage taken per min for the first 10 min | 106.4
damageTakenDiffPerMinDeltas-10-20 | DECIMAL(18,15) | 10 | Differential of damage taken per min between 10 and 20 min | 123.2
damageTakenDiffPerMinDeltas-20-30 | DECIMAL(18,15) | 10 | Differential of damage taken per min between 20 and 30 min | -319.2
damageTakenDiffPerMinDeltas-30-end | DECIMAL(18,15) | 10 | Differential of damage taken per min from 30 min on | -558.466667
role | VARCHAR(10) | 10 | Role of the participant during the match | NONE
lane | VARCHAR(10) | 10 | Initial lane of the participant | JUNGLE
participantIdentity.participantId | SMALLINT | 5 | Id of the player in the game [1-10] | 3
platformId | VARCHAR(5) | 5 | Region of the match | NA1
accountId | VARCHAR(255) | 255 | Account of the participant | hmjECtXPbAJB9QhyMeeCUJQu6i6gp0HHNQVfxEv_4qkOHJo
summonerId | VARCHAR(255) | 255 | Summoner ID of the participant | BDP8U_QMNal824JGX-cmavXAO75ad8wVywPWmKV5jqZaGwM
ts | BIGINT | 19 | Timestamp of the match | 1497568122350
year | SMALLINT | 5 | Year of the match | 2017
month | SMALLINT | 5 | Month of the match | 3
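
Because a passive data dictionary is maintained by hand, it can help to check that each example value actually fits its declared type. The sketch below is one possible way to do that, assuming the standard integer ranges of Redshift's `SMALLINT`, `INTEGER`, and `BIGINT` types and byte-based `VARCHAR` lengths. The `fits` helper and the `sample_rows` list are hypothetical names used only for illustration; the rows are copied from the table above.

```python
# Integer ranges of the Redshift types used in the data dictionary.
INT_RANGES = {
    "SMALLINT": (-2**15, 2**15 - 1),
    "INTEGER": (-2**31, 2**31 - 1),
    "BIGINT": (-2**63, 2**63 - 1),
}


def fits(data_type, example):
    """Return True if the example value fits the declared data type."""
    if data_type in INT_RANGES:
        low, high = INT_RANGES[data_type]
        return low <= int(example) <= high
    if data_type.startswith("VARCHAR(") and data_type.endswith(")"):
        max_bytes = int(data_type[len("VARCHAR("):-1])
        # VARCHAR lengths are counted in bytes, not characters.
        return len(example.encode("utf-8")) <= max_bytes
    raise ValueError(f"Type not covered by this sketch: {data_type}")


# (field name, data type, example) taken from the fact_game_match table.
sample_rows = [
    ("gameCreation", "BIGINT", "1497568122350"),
    ("gameDuration", "SMALLINT", "2833"),
    ("totalDamageDealt", "INTEGER", "493632"),
    ("gameMode", "VARCHAR(10)", "CLASSIC"),
    ("gameType", "VARCHAR(10)", "MATCHED_GAME"),
]

for name, data_type, example in sample_rows:
    status = "ok" if fits(data_type, example) else "does not fit"
    print(f"{name}: {example!r} as {data_type} -> {status}")
```

A check like this only covers the integer and `VARCHAR` rows; `BOOLEAN` and `DECIMAL` fields would need their own rules.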

#### Step 5: Complete Project Write Up

##### Steps of this project
1. Riot's API does not make a list of match ids available, so these had to be crawled. Riot also requires an API key to fetch match data, so one had to be generated;
1. Wrote a Python application to download champion, item, and match data from Riot's API and save it to an S3 bucket;
1. Used a Spark application to extract the raw data from the S3 bucket, transform it to a suitable relational format, and load it back to S3 in Parquet format;
1. Ran Data Definition Language statements to set up the Redshift tables;
1. Transferred data from S3 to staging tables in Redshift with the COPY statement;
1. Ran Data Manipulation Language statements to create the dimension and fact tables from the staging tables;
1. Ran quality checks to ensure the data was successfully copied to the right tables;
1. Ran Data Query Language statements to get analytical insights.
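
The load and quality-check steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `build_copy_statement` and `find_empty_tables` are hypothetical helpers, and the staging table name, bucket path, IAM role, and row counts are placeholder assumptions. Only the `COPY ... FROM ... IAM_ROLE ... FORMAT AS PARQUET` shape and the final table names come from this notebook.

```python
def build_copy_statement(table, s3_path, iam_role):
    """Build a Redshift COPY statement that loads Parquet files from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS PARQUET;"
    )


def find_empty_tables(row_counts):
    """Return the names of the tables whose row count is zero."""
    return [table for table, count in row_counts.items() if count == 0]


# Placeholder staging table, bucket, and role (assumptions for illustration).
statement = build_copy_statement(
    table="staging_game_match",
    s3_path="s3://my-bucket/game_match/parquet/",
    iam_role="arn:aws:iam::123456789012:role/MyRedshiftRole",
)
print(statement)

# Row counts would normally come from `SELECT COUNT(*)` queries on Redshift;
# the numbers here are made up to show how the check behaves.
row_counts = {
    "dim_champion": 140,
    "dim_item": 250,
    "dim_summoner": 0,
    "fact_game_match": 2000,
}
print("Empty tables:", find_empty_tables(row_counts))
```

Keeping the statement builder and the check as plain functions makes them easy to call from whichever database client or orchestration task runs the pipeline.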

##### Technology choices
> Clearly state the rationale for the choice of tools and technologies for the project.

- Processing Engine: Spark
    - Heavy processing was performed on the raw data in order to transform it to the final data model. Such a process would take too long if executed serially instead of being parallelized with Spark.
- Analytical database: Redshift
    - Each match results in ~200 rows. If we consider that League of Legends has about [27 million players](https://www.unrankedsmurfs.com/blog/players-2017), and suppose that each player plays an average of 1 match per day, we would have `27000000/10*200 = 540000000` new rows per day:
        - 27,000,000: 27 million players;
        - 10: 10 players per match;
        - 200: Number of records generated per match;
        - 540,000,000: Total records added to Redshift per day (five hundred forty million);
        - Considering that the `fact_game_match` table alone has 120 columns, most of them INTEGER (4 bytes), we would need roughly 260 GB of disk storage per day (540,000,000 × 120 × 4 bytes ≈ 259 GB).
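
The estimate above can be reproduced with a few lines of arithmetic. The variable names are illustrative, and the calculation treats every column as a 4-byte INTEGER, which is a simplification of the real schema.

```python
players = 27_000_000      # about 27 million players
players_per_match = 10    # 10 players per match
rows_per_match = 200      # records generated per match
columns = 120             # columns in fact_game_match
bytes_per_column = 4      # simplification: every column is a 4-byte INTEGER

# One match per player per day means players / 10 matches per day.
matches_per_day = players // players_per_match
rows_per_day = matches_per_day * rows_per_match
bytes_per_day = rows_per_day * columns * bytes_per_column

print(f"Matches per day: {matches_per_day:,}")
print(f"Rows per day:    {rows_per_day:,}")
print(f"Storage per day: {bytes_per_day / 1e9:.1f} GB")
```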

##### Data update frequency
> Propose how often the data should be updated and why

- As the goal of the data source is to provide an analytical base for gaming strategies based on historical facts, new data can be appended to the dataset on a daily basis;
- Due to the rate limit of Riot's API (100 requests per 2 minutes), the ETL will be started once a day and run until there is no more data left to be fetched.
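
One simple way to respect a limit of 100 requests per 2 minutes is to space requests evenly, 1.2 seconds apart. The sketch below is an assumption about how the downloader could throttle itself, not the project's actual implementation; the `Throttle` class is hypothetical, and its `clock` and `sleep` parameters exist only so the behaviour can be tested without waiting.

```python
import time


class Throttle:
    """Space calls evenly so at most max_requests happen per period (seconds)."""

    def __init__(self, max_requests, period, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period / max_requests
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next request may start

    def wait(self):
        """Block until the next request is allowed, then reserve the slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


throttle = Throttle(max_requests=100, period=120)
print(f"Seconds between requests: {throttle.min_interval}")

# Upper bound on requests in a 24-hour run at this rate.
requests_per_day = 100 * (24 * 60 * 60) // 120
print(f"Max requests per day: {requests_per_day:,}")
```

Calling `throttle.wait()` before each API request keeps the crawler under the limit. Even spacing is stricter than the limit requires, but it avoids bursts that could trigger rejected requests.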

#### Approaching the problem at a larger scale
> Write a description of how you would approach the problem differently under the following scenarios:

- The data was increased by 100x.
- The data populates a dashboard that must be updated on a daily basis by 7am every day.
    - For this purpose, a schedule interval (with a cron expression of `0 7 * * *`) would be configured in a workflow orchestration tool (Airflow) so that the ETL is executed automatically, with retries and alerts configured. Since that expression starts the run at 07:00, the start time would need to be moved earlier by the expected run duration for the dashboard to be ready by 7am;
- The database needed to be accessed by 100+ people.
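
The scheduling, retry, and alert settings for the dashboard scenario could look roughly like the fragment below. It is written as plain Python dictionaries so it runs without Airflow installed; in an Airflow project these values would typically be passed to the `DAG` constructor and to its `default_args`. Apart from the cron expression, every value here (DAG id, retry count, delay, e-mail address) is a placeholder assumption, and the name of the schedule parameter differs between Airflow versions.

```python
from datetime import timedelta

# DAG-level settings. Only the cron expression comes from the text above.
dag_settings = {
    "dag_id": "lol_matches_etl",       # hypothetical DAG name
    "schedule_interval": "0 7 * * *",  # every day at 07:00
    "catchup": False,                  # do not backfill missed runs
}

# Task-level defaults covering the "retry and alerts" requirement.
default_args = {
    "retries": 3,                         # placeholder retry count
    "retry_delay": timedelta(minutes=5),  # placeholder delay between retries
    "email_on_failure": True,
    "email": ["data-team@example.com"],   # placeholder address
}

minute, hour, day_of_month, month, day_of_week = dag_settings["schedule_interval"].split()
print(f"Runs daily at {int(hour):02d}:{int(minute):02d}, retrying up to {default_args['retries']} times")
```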

#### References

- [Optimizing Performance](https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html)
- [Hadoop Scalability and Performance Testing in Heterogeneous Clusters](https://www.researchgate.net/publication/291356207_Hadoop_Scalability_and_Performance_Testing_in_Heterogeneous_Clusters)
- [Scaling Uber’s Apache Hadoop Distributed File System for Growth](https://eng.uber.com/scaling-hdfs)
- [Data dictionary](https://www.tutorialspoint.com/What-is-Data-Dictionary)
- [Building An Analytics Data Pipeline In Python](https://www.dataquest.io/blog/data-pipelines-tutorial)
- [Redshift data types](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html)
- [Redshift numeric types](https://docs.aws.amazon.com/redshift/latest/dg/r_Numeric_types201.html)
- [PySpark extension types](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-types.html)
- [Remotely submit EMR Spark job](https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/)
- [Terminate EMR cluster](https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_TerminateJobFlow.html)
- [Airflow SSH operator](https://airflow.readthedocs.io/en/stable/howto/connection/ssh.html)

In [56]:
# COPY listing
# FROM 's3://mybucket/data/listings/parquet/'
# IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
# FORMAT AS PARQUET;