Wittline/apache-spark-course

Week 1 - Class 1

Week 1 - Class 2

Week 1 - Class 3

Week 2 - Class 1

Week 2 - Class 2

Week 2 - Class 3

Week 3 - Class 1 - EMR Serverless

Create Roles

  1. Create EMR Notebook Role
{
    "Version": "2008-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "Service": "elasticmapreduce.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

  • Attach AmazonElasticMapReduceEditorsRole policy
  • Attach AmazonS3FullAccess policy
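These console steps can also be scripted. A minimal boto3 sketch, assuming the role name emr-notebook-role (the name referenced later in the EMR Studio steps) and the AWS-managed policy ARNs noted in the comments:

import json
import boto3

iam = boto3.client("iam")

# Trust policy shown above: lets EMR assume the role
trust_policy = {
    "Version": "2008-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "elasticmapreduce.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="emr-notebook-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the two managed policies listed above (ARNs assumed to be the AWS-managed ones)
for arn in (
    "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceEditorsRole",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
):
    iam.attach_role_policy(RoleName="emr-notebook-role", PolicyArn=arn)
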
  2. Create EMR Serverless Execution Role
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "emr-serverless.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadAccessForEMRSamples",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::*.elasticmapreduce",
                "arn:aws:s3:::*.elasticmapreduce/*"
            ]
        },
        {
            "Sid": "FullAccessToOutputBucket",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::*",
                "arn:aws:s3:::*/*"
            ]
        },
        {
            "Sid": "GlueCreateAndReadDataCatalog",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:CreateDatabase",
                "glue:GetDataBases",
                "glue:CreateTable",
                "glue:GetTable",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:CreatePartition",
                "glue:BatchCreatePartition",
                "glue:GetUserDefinedFunctions"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
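
As with the notebook role, this step can be scripted. A minimal boto3 sketch, assuming the role name emr-serverless-execution-role and that the two JSON documents above are saved locally as trust-policy.json and permissions.json (the role, policy, and file names are placeholders):

import boto3

iam = boto3.client("iam")

# The two JSON documents shown above, saved locally (file names are placeholders)
with open("trust-policy.json") as f:
    trust_policy = f.read()
with open("permissions.json") as f:
    permissions = f.read()

# Role name is an assumption; use whatever you will select when submitting jobs
iam.create_role(
    RoleName="emr-serverless-execution-role",
    AssumeRolePolicyDocument=trust_policy,
)

# Attach the S3/Glue permissions as an inline policy
iam.put_role_policy(
    RoleName="emr-serverless-execution-role",
    PolicyName="emr-serverless-s3-glue-access",
    PolicyDocument=permissions,
)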

Create S3 Buckets

  1. Create a new S3 bucket
  • Open the S3 console
  • Create an S3 bucket to use for this class
  2. Create folders in the S3 bucket
  • Create a pyspark folder
  • Create a hive folder
  • Create a datasets folder (used to upload the CSV dataset)
  • Create an outputs folder
  • Create a results folder
  • Upload the class files to their folders
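
The bucket layout can also be created with boto3; a sketch assuming a placeholder bucket name, the us-east-1 region, and placeholder local file names:

import boto3

bucket = "my-emr-serverless-class-bucket"  # placeholder; bucket names must be globally unique
s3 = boto3.client("s3", region_name="us-east-1")

# Outside us-east-1, add CreateBucketConfiguration={"LocationConstraint": "<region>"}
s3.create_bucket(Bucket=bucket)

# S3 has no real folders; zero-byte keys ending in "/" appear as folders in the console
for prefix in ("pyspark/", "hive/", "datasets/", "outputs/", "results/"):
    s3.put_object(Bucket=bucket, Key=prefix)

# Upload the class files into their folders (file names are placeholders)
s3.upload_file("my_script.py", bucket, "pyspark/my_script.py")
s3.upload_file("my_dataset.csv", bucket, "datasets/my_dataset.csv")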

EMR Studio

  1. Navigate to the EMR home page from the AWS Console and select EMR Studio from the left-hand side.

  2. Select Get started.

  3. Select Create Studio.

  4. Enter a Studio name.

  5. Under Networking and security, select your default VPC and three public subnets.

  6. Select the EMR Studio role emr-notebook-role created earlier.

  7. Select the S3 bucket created earlier.

  8. Select the Studio access URL.

Spark App

  1. Select Applications under Serverless in the left-hand side menu.

  2. Select Create application at the top right.

  3. Enter a name for the application, leave the type as Spark, and click Create application.

  4. Click into the application via its name.

  5. Click Submit job.

  6. Name the job and select the service role created in the setup steps.

  7. Click Submit job.

  8. The job status will go from Pending -> Running -> Success (or Failed).
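
Steps 5-8 can also be done from code. A hedged boto3 sketch, where the application ID, execution role ARN, bucket, and script path are placeholders for the resources created above:

import boto3

emr = boto3.client("emr-serverless")

app_id = "00f0abcdexample"  # placeholder: the application ID shown in the console

# Submit a PySpark script stored in the pyspark/ folder of the class bucket
run = emr.start_job_run(
    applicationId=app_id,
    executionRoleArn="arn:aws:iam::123456789012:role/emr-serverless-execution-role",  # placeholder ARN
    name="pyspark-job",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-emr-serverless-class-bucket/pyspark/my_script.py",
            "sparkSubmitParameters": "--conf spark.executor.memory=2G --conf spark.executor.cores=2",
        }
    },
)

# The run moves through PENDING -> RUNNING -> SUCCESS or FAILED
state = emr.get_job_run(applicationId=app_id, jobRunId=run["jobRunId"])["jobRun"]["state"]
print(run["jobRunId"], state)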

Hive App

  1. Create an application from Applications.

  2. Name it and select the Hive application type.

  3. Open the Hive application.

  4. Click Submit job.

  5. Name the Hive job, select the Hive script (change the bucket name in the script), and select the service role.

  6. Copy and paste the Hive config (change the bucket name in the JSON).

  7. Submit the job and monitor it. The job status will go from Pending -> Running -> Success.

  8. Navigate to the Glue databases and click emrdb.

  9. Check the table that was created.

  10. Select data using AWS Athena and check the created table.
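
Step 10 can also be scripted against the Athena API. A sketch assuming the Hive job created a table named my_table in the emrdb database (table and bucket names are placeholders):

import time
import boto3

athena = boto3.client("athena")

# Table name is an assumption; use whatever the Hive script actually created in emrdb
query = athena.start_query_execution(
    QueryString="SELECT * FROM my_table LIMIT 10",
    QueryExecutionContext={"Database": "emrdb"},
    ResultConfiguration={"OutputLocation": "s3://my-emr-serverless-class-bucket/results/"},
)
qid = query["QueryExecutionId"]

# Wait for the query to finish before reading results
while athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(2)

rows = athena.get_query_results(QueryExecutionId=qid)
for row in rows["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])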

Week 3 - Class 2 - Dataframes

Week 3 - Class 3 - Project - Data Modelling and Planning

Dataset:

Iowa-Liquor-Sales

Dataset exploration results

Columns:

  • 'Invoice/Item Number'
  • 'Date'
  • 'Store Number'
  • 'Store Name'
  • 'Address'
  • 'City'
  • 'Zip Code'
  • 'Store Location'
  • 'County Number'
  • 'County'
  • 'Category'
  • 'Category Name'
  • 'Vendor Number'
  • 'Vendor Name'
  • 'Item Number'
  • 'Item Description'
  • 'Pack'
  • 'Bottle Volume (ml)'
  • 'State Bottle Cost'
  • 'State Bottle Retail'
  • 'Bottles Sold'
  • 'Sale (Dollars)'
  • 'Volume Sold (Liters)'
  • 'Volume Sold (Gallons)'

Tables

  • Dimensions (COUNTY, ITEMS, STORE, VENDOR, DATES, CATEGORY)
  • Facts (SALES)

DATA MODEL (Snowflake Schema)

[Snowflake schema diagram]

ETL Plan

  • Create a new schema for the large CSV dataset using StructType and StructField (see the sketch after this list).
  • Read the .csv file from S3 and load it into a DataFrame with the already-defined schema, keeping it in memory with .cache() or .persist().
  • Be careful with date columns and columns containing currency symbols; they need cleaning before being cast to date or numeric types.
  • Write six queries to create the six DIMENSION tables from the already-persisted DataFrame.
  • Write one query to create the fact table (FACT); this query also uses the already-persisted DataFrame.
  • Additionally, you could add another job that performs data quality checks to verify the data.
  • After this exercise, please delete the Glue catalog tables, the created workgroup, the applications in EMR Studio, the S3 bucket folders, and the created roles.
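
A condensed PySpark sketch of this plan, using the placeholder bucket and file names from above and showing only a handful of the 24 columns (the real schema needs all of them; the date format and currency cleanup are assumptions about the raw file):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

spark = SparkSession.builder.appName("iowa-liquor-etl").getOrCreate()

# Partial schema only; extend it with the remaining columns from the list above
schema = StructType([
    StructField("Invoice/Item Number", StringType(), True),
    StructField("Date", StringType(), True),             # parsed to a date below
    StructField("Store Number", IntegerType(), True),
    StructField("Store Name", StringType(), True),
    StructField("County", StringType(), True),
    StructField("Bottles Sold", IntegerType(), True),
    StructField("Sale (Dollars)", StringType(), True),   # may carry a currency symbol
])

df = (spark.read
      .option("header", "true")
      .schema(schema)
      .csv("s3://my-emr-serverless-class-bucket/datasets/iowa_liquor_sales.csv")
      .withColumn("Date", F.to_date(F.col("Date"), "MM/dd/yyyy"))  # format assumed
      .withColumn("Sale (Dollars)",
                  F.regexp_replace(F.col("Sale (Dollars)"), "[$,]", "").cast(DoubleType()))
      .cache())  # keep the large dataset in memory for the queries below

# One of the six dimensions: distinct stores
dim_store = df.select("Store Number", "Store Name").dropDuplicates(["Store Number"])
dim_store.write.mode("overwrite").parquet("s3://my-emr-serverless-class-bucket/outputs/dim_store/")

# Fact table: measures plus the keys that point at the dimensions
fact_sales = df.select("Invoice/Item Number", "Date", "Store Number", "Bottles Sold", "Sale (Dollars)")
fact_sales.write.mode("overwrite").parquet("s3://my-emr-serverless-class-bucket/outputs/fact_sales/")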