## Data Lake Vs. Data Warehouse

One of the fundamental differences between data lakes and data warehouses is that while you ingest and store huge amounts of raw, unprocessed data in your data lake, you normally only load some fraction of your recent data into your data warehouse. Depending on your business and analytics use case, this might be data from the past couple of months, a year, or maybe the past 2 years. 

Let’s assume we want to have the past 2 years of our `Amazon Customer Reviews` data in a data warehouse to analyze customer behavior and review trends. We will use [Amazon Redshift](https://aws.amazon.com/redshift/) as our data warehouse. 

## Set Up IAM Access To Read From S3 and Athena

[AWS Identity and Access Management (IAM)](https://aws.amazon.com/iam/) is a service that helps you to manage access to AWS resources. IAM controls who is authenticated and authorized to use resources. 

You can create individual IAM users for people accessing your AWS account. Each user will have a unique set of security credentials. You can also assign IAM users to IAM groups with defined access permissions (e.g., for specific job functions), and the IAM users inherit those permissions. 

The preferred way to delegate access permissions is through IAM roles. In contrast to an IAM user, which is uniquely associated with one person, a role can be assumed by anyone who needs it and provides only temporary security credentials for the duration of the role session. AWS service roles control which actions a service can perform on your behalf. 

Access permissions are defined using IAM policies. It’s a standard security best practice to grant least privilege; in other words, grant only the permissions required to perform a task. 
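
As an illustration of least privilege, the sketch below builds a read-only S3 policy scoped to a single bucket and prefix, instead of allowing every S3 action on every resource. The bucket name `example-reviews-bucket`, the prefix `amazon-reviews/`, and the helper name `s3_read_only_policy` are hypothetical and not part of this notebook; which actions Redshift needs depends on how you load the data, so treat the action list as an assumption to verify.

```python
import json


def s3_read_only_policy(bucket_name, prefix=""):
    """Build a policy document that only allows listing one bucket
    and reading objects under one prefix (hypothetical helper)."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    object_arn = f"{bucket_arn}/{prefix}*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": [bucket_arn],
            },
            {
                # Object-level actions apply to the objects under the prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [object_arn],
            },
        ],
    }


# Assumed bucket and prefix, for illustration only.
policy = s3_read_only_policy("example-reviews-bucket", prefix="amazon-reviews/")
print(json.dumps(policy, indent=2))
```

A document built this way can be passed to `iam.create_policy` through `json.dumps`, in the same way as the policy documents defined in this notebook.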

In [72]:
import json
import boto3
from botocore.exceptions import ClientError

iam = boto3.client('iam')
sts = boto3.client('sts')

In [73]:
# Create AssumeRolePolicyDocument
assume_role_policy_doc = {
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "redshift.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

In [74]:
# Create Role
iam_redshift_role_name = 'COEAWS_Redshift'

In [75]:
try:
    iam_role_redshift = iam.create_role(
        RoleName=iam_redshift_role_name,
        AssumeRolePolicyDocument=json.dumps(assume_role_policy_doc),
        Description='COEAWS Redshift Role'
    )
except ClientError as e:
    if e.response['Error']['Code'] == 'EntityAlreadyExists':
        print("Role already exists")
    else:
        print("Unexpected error: %s" % e)

Role already exists


In [76]:
# Get the Role ARN
role = iam.get_role(RoleName=iam_redshift_role_name)
iam_role_redshift_arn = role['Role']['Arn']
print(iam_role_redshift_arn)

arn:aws:iam::533787958253:role/COEAWS_Redshift
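
The create-then-look-up pattern in the two cells above can also be folded into one helper that returns the role ARN whether or not the role already existed. This is a minimal sketch, not the notebook's own method: the function name `ensure_role` is hypothetical, and it relies on the modeled exception `EntityAlreadyExistsException` that boto3 exposes on the IAM client's `exceptions` attribute.

```python
import json


def ensure_role(iam_client, role_name, trust_policy, description=None):
    """Create the role if it is missing and return its ARN either way.

    `iam_client` is expected to be a boto3 IAM client, e.g. boto3.client('iam').
    """
    kwargs = {
        "RoleName": role_name,
        "AssumeRolePolicyDocument": json.dumps(trust_policy),
    }
    if description:
        kwargs["Description"] = description
    try:
        response = iam_client.create_role(**kwargs)
    except iam_client.exceptions.EntityAlreadyExistsException:
        # The role exists already, so fetch it instead of failing.
        response = iam_client.get_role(RoleName=role_name)
    # Both create_role and get_role return the role under the 'Role' key.
    return response["Role"]["Arn"]
```

With the names used in this notebook, the call would look like `ensure_role(iam, iam_redshift_role_name, assume_role_policy_doc, 'COEAWS Redshift Role')`.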


In [77]:
# Create Self-Managed Policies
# Define Policies
# arn:aws:iam::aws:policy/AmazonS3FullAccess
my_redshift_to_s3 = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": "*"
        }
    ]
}

In [78]:
# arn:aws:iam::aws:policy/AmazonAthenaFullAccess
my_redshift_to_athena = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "athena:*"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateDatabase",
                "glue:DeleteDatabase",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:UpdateDatabase",
                "glue:CreateTable",
                "glue:DeleteTable",
                "glue:BatchDeleteTable",
                "glue:UpdateTable",
                "glue:GetTable",
                "glue:GetTables",
                "glue:BatchCreatePartition",
                "glue:CreatePartition",
                "glue:DeletePartition",
                "glue:BatchDeletePartition",
                "glue:UpdatePartition",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:BatchGetPartition"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload",
                "s3:CreateBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::aws-athena-query-results-*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::athena-examples*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:ListAllMyBuckets"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "sns:ListTopics",
                "sns:GetTopicAttributes"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricAlarm",
                "cloudwatch:DescribeAlarms",
                "cloudwatch:DeleteAlarms"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

In [79]:
# Create Policy Objects
try:
    policy_redshift_s3 = iam.create_policy(
      PolicyName='COEAWS_RedshiftPolicyToS3',
      PolicyDocument=json.dumps(my_redshift_to_s3)
    )
except ClientError as e:
    if e.response['Error']['Code'] == 'EntityAlreadyExists':
        print("Policy already exists")
    else:
        print("Unexpected error: %s" % e)

Policy already exists


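If the cell above reports an unexpected error when calling the `CreatePolicy` operation, the role running this notebook may be missing the `iam:CreatePolicy` permission. In that case, you may have to add an inline policy such as the following to that role.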

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "jsr5",
            "Effect": "Allow",
            "Action": [
                "iam:CreatePolicy"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
```
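
An inline policy like this can be added in the IAM console, or programmatically with the IAM `put_role_policy` call. The sketch below assumes it is run by a principal that is itself allowed to call `iam:PutRolePolicy` (typically an administrator, not the notebook role that is missing the permission). The helper name `grant_inline_actions` and the inline policy name used in the usage note are hypothetical.

```python
import json


def grant_inline_actions(iam_client, role_name, policy_name, actions):
    """Attach an inline policy to `role_name` that allows `actions` on all resources.

    `iam_client` is expected to be a boto3 IAM client with permission
    to call PutRolePolicy. Returns the policy document that was sent.
    """
    document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": ["*"],
            }
        ],
    }
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(document),
    )
    return document
```

For example, an administrator could call `grant_inline_actions(iam, 'CoE_AI_SageMaker_Notebook', 'AllowCreatePolicy', ['iam:CreatePolicy'])`, where the role name is the notebook execution role that appears in the outputs of this notebook.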


In [80]:
# Get ARN

account_id = sts.get_caller_identity()['Account']
policy_redshift_s3_arn = f'arn:aws:iam::{account_id}:policy/COEAWS_RedshiftPolicyToS3'
print(policy_redshift_s3_arn)


arn:aws:iam::533787958253:policy/COEAWS_RedshiftPolicyToS3
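
The ARN of a customer managed policy follows a fixed pattern, so the string formatting used above can be wrapped in a small helper and reused for both policies. This is a sketch under two assumptions: the policy was created with the default path `/` (as in this notebook, which passes no `Path`), and the account lives in the standard `aws` partition unless stated otherwise. The helper name `managed_policy_arn` and the account ID `123456789012` are placeholders.

```python
def managed_policy_arn(account_id, policy_name, partition="aws"):
    """Build the ARN of a customer managed IAM policy created with the default path."""
    return f"arn:{partition}:iam::{account_id}:policy/{policy_name}"


# Placeholder account ID, for illustration only.
print(managed_policy_arn("123456789012", "COEAWS_RedshiftPolicyToS3"))
# prints: arn:aws:iam::123456789012:policy/COEAWS_RedshiftPolicyToS3
```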


In [81]:
try:
    policy_redshift_athena = iam.create_policy(
      PolicyName='COEAWS_RedshiftPolicyToAthena',
      PolicyDocument=json.dumps(my_redshift_to_athena)
    )
except ClientError as e:
    if e.response['Error']['Code'] == 'EntityAlreadyExists':
        print("Policy already exists")
    else:
        print("Unexpected error: %s" % e)

Policy already exists


In [82]:
# Get ARN

account_id = sts.get_caller_identity()['Account']
policy_redshift_athena_arn = f'arn:aws:iam::{account_id}:policy/COEAWS_RedshiftPolicyToAthena'
print(policy_redshift_athena_arn)

arn:aws:iam::533787958253:policy/COEAWS_RedshiftPolicyToAthena


In [83]:
# Attach the policies to the role
# Attach COEAWS_RedshiftPolicyToAthena policy
try:
    response = iam.attach_role_policy(
        PolicyArn=policy_redshift_athena_arn,
        RoleName=iam_redshift_role_name
    )
except ClientError as e:
    if e.response['Error']['Code'] == 'EntityAlreadyExists':
        print("Policy is already attached. This is ok.")
    else:
        print("Unexpected error: %s" % e)

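If the cell above reports an unexpected error when calling the `AttachRolePolicy` operation, the role running this notebook may be missing the `iam:AttachRolePolicy` permission. In that case, you may have to add an inline policy such as the following to that role.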

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "jsr6",
            "Effect": "Allow",
            "Action": [
                "iam:AttachRolePolicy"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
```

In [84]:
# Attach COEAWS_RedshiftPolicyToS3 policy
try:
    response = iam.attach_role_policy(
        PolicyArn=policy_redshift_s3_arn,
        RoleName=iam_redshift_role_name
    )
except ClientError as e:
    if e.response['Error']['Code'] == 'EntityAlreadyExists':
        print("Policy is already attached. This is ok.")
    else:
        print("Unexpected error: %s" % e)
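
Because the attach cells print nothing on success, it can help to confirm which managed policies ended up on the role. The sketch below lists them with the IAM `list_attached_role_policies` call and follows the `Marker` field when the result is truncated. The helper name `attached_policy_arns` is hypothetical.

```python
def attached_policy_arns(iam_client, role_name):
    """Return the ARNs of all managed policies attached to `role_name`.

    `iam_client` is expected to be a boto3 IAM client.
    """
    arns = []
    kwargs = {"RoleName": role_name}
    while True:
        page = iam_client.list_attached_role_policies(**kwargs)
        arns.extend(p["PolicyArn"] for p in page["AttachedPolicies"])
        if not page.get("IsTruncated"):
            return arns
        # More results are available; continue from the returned marker.
        kwargs["Marker"] = page["Marker"]
```

With the names from this notebook, `policy_redshift_s3_arn in attached_policy_arns(iam, iam_redshift_role_name)` should evaluate to `True` once the attachment has succeeded.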

### Get Security Group ID 

* Make sure Redshift runs in the same VPC as this notebook.
* Make sure the VPC has the following 2 properties enabled:
  * DNS resolution = Enabled
  * DNS hostnames = Enabled
* This allows private, internal access to Redshift from this SageMaker notebook using the fully qualified endpoint name.
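
The two VPC properties in the checklist above can also be checked from code. The sketch below uses the EC2 `describe_vpc_attribute` call, which returns one attribute per request. The helper name `vpc_dns_settings` is hypothetical, and the VPC ID has to be supplied by you since this notebook does not look it up.

```python
def vpc_dns_settings(ec2_client, vpc_id):
    """Return whether DNS resolution and DNS hostnames are enabled for a VPC.

    `ec2_client` is expected to be a boto3 EC2 client, e.g. boto3.client('ec2').
    """
    support = ec2_client.describe_vpc_attribute(
        VpcId=vpc_id, Attribute="enableDnsSupport"
    )
    hostnames = ec2_client.describe_vpc_attribute(
        VpcId=vpc_id, Attribute="enableDnsHostnames"
    )
    return {
        "dns_resolution": support["EnableDnsSupport"]["Value"],
        "dns_hostnames": hostnames["EnableDnsHostnames"]["Value"],
    }
```

Both values should be `True` for the private endpoint name to resolve from the notebook.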

In [88]:
import sagemaker
sm = boto3.client('sagemaker')

notebook_instance_name = sm.list_notebook_instances()['NotebookInstances'][0]['NotebookInstanceName']
notebook_instance_name

'CoE-AI-Notebook-Server'

In [89]:
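# 'SecurityGroups' appears in the response only when the notebook instance is attached to a VPC
# (it is absent from the output below), so read it with .get() to avoid a KeyError.
notebook_instance = sm.describe_notebook_instance(NotebookInstanceName=notebook_instance_name)
security_groups = notebook_instance.get('SecurityGroups', [])
security_group_id = security_groups[0] if security_groups else None
notebook_instance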

{'NotebookInstanceArn': 'arn:aws:sagemaker:us-east-2:533787958253:notebook-instance/coe-ai-notebook-server',
 'NotebookInstanceName': 'CoE-AI-Notebook-Server',
 'NotebookInstanceStatus': 'InService',
 'Url': 'coe-ai-notebook-server.notebook.us-east-2.sagemaker.aws',
 'InstanceType': 'ml.t2.medium',
 'RoleArn': 'arn:aws:iam::533787958253:role/CoE_AI_SageMaker_Notebook',
 'LastModifiedTime': datetime.datetime(2020, 4, 27, 12, 49, 1, 157000, tzinfo=tzlocal()),
 'CreationTime': datetime.datetime(2020, 4, 27, 12, 46, 37, 446000, tzinfo=tzlocal()),
 'DirectInternetAccess': 'Enabled',
 'VolumeSizeInGB': 5,
 'RootAccess': 'Enabled',
 'ResponseMetadata': {'RequestId': '987bfab6-3032-48c6-a039-57dd2c729331',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '987bfab6-3032-48c6-a039-57dd2c729331',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '498',
   'date': 'Thu, 07 May 2020 17:21:39 GMT'},
  'RetryAttempts': 0}}

### Create Secret in Secrets Manager

AWS Secrets Manager is a service that enables you to easily rotate, manage, and retrieve database credentials, API keys, and other secrets throughout their lifecycle. Using Secrets Manager, you can secure and manage secrets used to access resources in the AWS Cloud, on third-party services, and on-premises.

In [87]:
# TODO: The notebook's role needs the secretsmanager:CreateSecret permission for this call to succeed

secretsmanager = boto3.client('secretsmanager')

try:
    response = secretsmanager.create_secret(
        Name='coeaws_redshift_login',
        Description='COEAWS Redshift Login',
        SecretString='[{"username":"coeaws"},{"password":"xxxxxx"}]',
        Tags=[
            {
                'Key': 'name',
                'Value': 'coeaws_redshift_login'
            },
        ]
    )
except ClientError as e:
    if e.response['Error']['Code'] == 'ResourceExistsException':
        print("Secret already exists. This is ok.")
    else:
        print("Unexpected error: %s" % e)

Unexpected error: An error occurred (AccessDeniedException) when calling the CreateSecret operation: User: arn:aws:sts::533787958253:assumed-role/CoE_AI_SageMaker_Notebook/SageMaker is not authorized to perform: secretsmanager:CreateSecret on resource: arn:aws:secretsmanager:us-east-2:533787958253:secret:coeaws_redshift_login-L6IqLZ


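If the cell above reports an unexpected error when calling the `CreateSecret` operation, as in the `AccessDeniedException` output shown above, you may have to add an inline policy such as the following to the role running this notebook.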

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "jsr7",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:CreateSecret"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
```
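
Returning to the `SecretString` passed to `create_secret`: the notebook stores it as a JSON list of single-key objects. A flat JSON object with `username` and `password` keys is a common alternative that is simpler to parse when the secret is read back. The sketch below shows one way to build and parse such a payload; the helper names `build_login_secret` and `parse_login_secret` are hypothetical, and switching the format is a suggestion, not something the notebook itself does.

```python
import json


def build_login_secret(username, password):
    """Serialize the login as a flat JSON object suitable for SecretString."""
    return json.dumps({"username": username, "password": password})


def parse_login_secret(secret_string):
    """Return (username, password) from a payload built by build_login_secret."""
    data = json.loads(secret_string)
    return data["username"], data["password"]


# Placeholder credentials, for illustration only.
secret_string = build_login_secret("coeaws", "xxxxxx")
print(parse_login_secret(secret_string))
# prints: ('coeaws', 'xxxxxx')
```

Using `json.dumps` also escapes quotes or backslashes in the password, which a hand-written JSON string would not.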