- Create EMR Notebook Role
- Open IAM and create the IAM role for the EMR notebook using the EMR notebook role trust policy JSON below:
```json
{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "elasticmapreduce.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```
- Attach the `AmazonElasticMapReduceEditorsRole` policy
- Attach the `AmazonS3FullAccess` policy
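These console steps can also be scripted. A minimal boto3 sketch, using the trust policy above; the managed-policy ARN paths are my assumption:

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy from above: lets EMR assume this role.
trust_policy = {
    "Version": "2008-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {"Service": "elasticmapreduce.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="emr-notebook-role",  # name referenced later when creating the Studio
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the two managed policies from the steps above
# (the service-role/ path is an assumption).
for arn in (
    "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceEditorsRole",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
):
    iam.attach_role_policy(RoleName="emr-notebook-role", PolicyArn=arn)
```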
- Create EMR Serverless Execution Role
- Open IAM and create the IAM role for the EMR Serverless execution using the EMR Serverless trust policy JSON below:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "emr-serverless.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```
- Attach the following policy for permissions:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadAccessForEMRSamples",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::*.elasticmapreduce",
        "arn:aws:s3:::*.elasticmapreduce/*"
      ]
    },
    {
      "Sid": "FullAccessToOutputBucket",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::*",
        "arn:aws:s3:::*/*"
      ]
    },
    {
      "Sid": "GlueCreateAndReadDataCatalog",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:CreateDatabase",
        "glue:GetDatabases",
        "glue:CreateTable",
        "glue:GetTable",
        "glue:UpdateTable",
        "glue:DeleteTable",
        "glue:GetTables",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:CreatePartition",
        "glue:BatchCreatePartition",
        "glue:GetUserDefinedFunctions"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
```
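The same two steps can be scripted with boto3. A minimal sketch, assuming the two JSON documents above are saved locally (the filenames and role name are placeholders):

```python
import boto3

iam = boto3.client("iam")
ROLE_NAME = "emr-serverless-execution-role"  # placeholder; pick your own name

# The trust policy and permissions policy are the two JSON documents above,
# saved to local files (filenames are assumptions).
with open("emr-serverless-trust.json") as f:
    trust_policy = f.read()
with open("emr-serverless-permissions.json") as f:
    permissions_policy = f.read()

iam.create_role(RoleName=ROLE_NAME, AssumeRolePolicyDocument=trust_policy)

# Attach the S3/Glue permissions as an inline policy.
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="emr-serverless-permissions",
    PolicyDocument=permissions_policy,
)
```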
- Create a new S3 bucket
- Open the S3 console
- Create an S3 bucket to use for this class
- Create folders to use in the S3 bucket:
  - Create a `pyspark` folder
  - Create a `hive` folder
  - Create a `datasets` folder (we use this to upload a CSV to)
  - Create an `outputs` folder
  - Create a `results` folder
- Upload files to the folders (a scripted version of this setup follows)
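A boto3 sketch of the bucket and folder setup; the bucket name and CSV filename are placeholders:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-emr-class-bucket"  # placeholder; bucket names are globally unique

# Outside us-east-1, add CreateBucketConfiguration={"LocationConstraint": region}.
s3.create_bucket(Bucket=BUCKET)

# S3 has no real folders; zero-byte keys ending in "/" appear as folders
# in the console.
for folder in ("pyspark/", "hive/", "datasets/", "outputs/", "results/"):
    s3.put_object(Bucket=BUCKET, Key=folder)

# Upload files into the folders, e.g. the class CSV into datasets/.
s3.upload_file("sales.csv", BUCKET, "datasets/sales.csv")
```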
- Navigate to EMR home from the AWS Console and select EMR Studio from the left-hand side.
- Select `Get Started`.
- Select `Create Studio`.
- Enter a Studio name.
- Under `Networking and Security`, select your default VPC and 3 public subnets.
- Select the EMR Studio role `emr-notebook-role` created initially.
- Select the S3 bucket created initially.
- Select the `Studio access URL`.
- Select `Applications` under `Serverless` from the left-hand side menu.
- Select `Create application` from the top right.
- Enter a name for the application, leave the type as `Spark`, and click `Create application`.
- Click into the application via its name.
- Click `Submit job`.
- Name the job and select the service role created in the setup steps.
- Click `Submit Job`.
- The job status will go from Pending -> Running -> (Success or Failed).
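A boto3 sketch of the same submission; the application ID, account ID, role name, bucket, and script path are placeholders:

```python
import boto3

emr = boto3.client("emr-serverless")

response = emr.start_job_run(
    applicationId="00f0000000000000",  # the Spark application created above
    executionRoleArn="arn:aws:iam::123456789012:role/emr-serverless-execution-role",
    name="pyspark-job",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://YOUR_BUCKET/pyspark/script.py",
        }
    },
)

# Poll the run; the state moves PENDING -> RUNNING -> SUCCESS or FAILED.
state = emr.get_job_run(
    applicationId="00f0000000000000",
    jobRunId=response["jobRunId"],
)["jobRun"]["state"]
print(state)
```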
- Create an application from `Applications`.
- Name it and select the `Hive` application type.
- Open the Hive application.
- Submit the job.
- Name the Hive job, select the Hive script (change the bucket name in the script), and select the service role.
- Copy and paste the Hive config (change the bucket name in the JSON).
- Submit the job and monitor it. The job status will go from Pending -> Running -> Success.
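A boto3 sketch of the Hive submission; the IDs, paths, and hive-site properties are placeholders standing in for the workshop's actual script and config:

```python
import boto3

emr = boto3.client("emr-serverless")

response = emr.start_job_run(
    applicationId="00f0000000000001",  # the Hive application created above
    executionRoleArn="arn:aws:iam::123456789012:role/emr-serverless-execution-role",
    name="hive-job",
    jobDriver={
        "hive": {
            # The Hive script from the hive/ folder (edit the bucket name).
            "query": "s3://YOUR_BUCKET/hive/create_table.sql",
        }
    },
    configurationOverrides={
        "applicationConfiguration": [
            {
                "classification": "hive-site",
                # Placeholder standing in for the Hive config JSON above.
                "properties": {
                    "hive.metastore.warehouse.dir": "s3://YOUR_BUCKET/hive/warehouse"
                },
            }
        ]
    },
)
print(response["jobRunId"])
```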
- Navigate to Glue databases and click `emrdb`.
- Check the table created.
- Select data using AWS Athena and check the created table (a scripted query is sketched below).
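A boto3 sketch of the Athena check; the table name and output location are placeholders:

```python
import time
import boto3

athena = boto3.client("athena")

# Query the table the Hive job created in the emrdb database
# (your_table is a placeholder).
qid = athena.start_query_execution(
    QueryString="SELECT * FROM emrdb.your_table LIMIT 10",
    ResultConfiguration={"OutputLocation": "s3://YOUR_BUCKET/results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then print the rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
    print(row["Data"])
```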
- 'Invoice/Item Number'
- 'Date'
- 'Store Number'
- 'Store Name'
- 'Address'
- 'City'
- 'Zip Code'
- 'Store Location'
- 'County Number'
- 'County'
- 'Category'
- 'Category Name'
- 'Vendor Number'
- 'Vendor Name'
- 'Item Number'
- 'Item Description'
- 'Pack'
- 'Bottle Volume (ml)'
- 'State Bottle Cost'
- 'State Bottle Retail'
- 'Bottles Sold'
- 'Sale (Dollars)'
- 'Volume Sold (Liters)'
- 'Volume Sold (Gallons)'
- Dimensions (COUNTY, ITEMS, STORE, VENDOR, DATES, CATEGORY)
- Facts (sales)
- Create a new schema for the large CSV dataset using StructType and StructField.
- Read the .csv file from S3 and load the dataset into a DataFrame with the already-defined schema, using .cache() or .persist().
- Be careful with date columns and columns containing currency symbols.
- Write 6 queries to create the 6 DIMENSION tables from the persisted DataFrame.
- Write a query to create the fact table, FACT, also from the persisted DataFrame (a PySpark sketch of these steps follows the list).
- Additionally, you could add another job that performs data-quality checks to verify the data.
- After this exercise, please delete the Glue catalog tables, delete the created workgroup, delete the applications in EMR Studio, delete the S3 bucket folders, and delete the created roles.
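A minimal PySpark sketch of the exercise steps above; the bucket, file name, date format, column types, and the choice of dimension/fact columns are assumptions:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, DoubleType,
)

spark = SparkSession.builder.appName("liquor-sales-star-schema").getOrCreate()

# One StructField per CSV column, in file order; currency and date columns
# are read as strings and converted after load (types are assumptions).
schema = StructType([
    StructField("Invoice/Item Number", StringType(), True),
    StructField("Date", StringType(), True),
    StructField("Store Number", IntegerType(), True),
    StructField("Store Name", StringType(), True),
    StructField("Address", StringType(), True),
    StructField("City", StringType(), True),
    StructField("Zip Code", StringType(), True),
    StructField("Store Location", StringType(), True),
    StructField("County Number", IntegerType(), True),
    StructField("County", StringType(), True),
    StructField("Category", StringType(), True),
    StructField("Category Name", StringType(), True),
    StructField("Vendor Number", IntegerType(), True),
    StructField("Vendor Name", StringType(), True),
    StructField("Item Number", StringType(), True),
    StructField("Item Description", StringType(), True),
    StructField("Pack", IntegerType(), True),
    StructField("Bottle Volume (ml)", IntegerType(), True),
    StructField("State Bottle Cost", StringType(), True),
    StructField("State Bottle Retail", StringType(), True),
    StructField("Bottles Sold", IntegerType(), True),
    StructField("Sale (Dollars)", StringType(), True),
    StructField("Volume Sold (Liters)", DoubleType(), True),
    StructField("Volume Sold (Gallons)", DoubleType(), True),
])

df = (
    spark.read.csv("s3://YOUR_BUCKET/datasets/sales.csv", header=True, schema=schema)
    # Date format and currency symbols need explicit handling.
    .withColumn("Date", F.to_date("Date", "MM/dd/yyyy"))
    .withColumn("Sale (Dollars)",
                F.regexp_replace("Sale (Dollars)", r"[$,]", "").cast(DoubleType()))
    .cache()  # or .persist(): reused by all 6 dimension queries and the fact query
)

# One of the 6 dimension tables (STORE); COUNTY, ITEMS, VENDOR, DATES and
# CATEGORY follow the same select-distinct pattern.
dim_store = df.select(
    "Store Number", "Store Name", "Address", "City", "Zip Code"
).distinct()
dim_store.write.mode("overwrite").parquet("s3://YOUR_BUCKET/outputs/dim_store/")

# Fact table: foreign keys into each dimension plus the measures.
fact_sales = df.select(
    "Invoice/Item Number", "Date", "Store Number", "County Number",
    "Vendor Number", "Item Number", "Category",
    "Bottles Sold", "Sale (Dollars)", "Volume Sold (Liters)",
)
fact_sales.write.mode("overwrite").parquet("s3://YOUR_BUCKET/outputs/fact_sales/")

# Simple data-quality check: the fact table keeps every row and has no null keys.
assert fact_sales.count() == df.count()
assert fact_sales.filter(F.col("Store Number").isNull()).count() == 0
```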