# Amazon Redshift - Create Cluster

TODO: Describe scenario

<img src="img/redshift_setup.png" width="45%" align="left">

## Setup Amazon Redshift

To create an Amazon Redshift cluster, follow these steps:


### Collect Configuration Parameters (VPC ID, Security Group ID etc.)

#### Get VPC ID

* Make sure this VPC is the same one this notebook is running in.
* Make sure this VPC has the following two properties enabled:
  * DNS resolution = Enabled
  * DNS hostnames = Enabled
* These settings allow private, internal access to Redshift from this SageMaker notebook using the cluster's fully qualified endpoint name.
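The two DNS properties can also be checked programmatically. The following is a minimal sketch, not part of the original notebook: the helper name `vpc_dns_settings` is hypothetical, and it assumes you pass in a boto3 EC2 client and the VPC ID collected in this section.

```python
def vpc_dns_settings(ec2_client, vpc_id):
    """Return the two DNS attributes of a VPC as a dict of booleans.

    ec2_client is expected to be a boto3 EC2 client, e.g. boto3.client('ec2').
    """
    # describe_vpc_attribute accepts exactly one attribute per call.
    support = ec2_client.describe_vpc_attribute(
        VpcId=vpc_id, Attribute="enableDnsSupport"
    )
    hostnames = ec2_client.describe_vpc_attribute(
        VpcId=vpc_id, Attribute="enableDnsHostnames"
    )
    return {
        "dns_resolution": support["EnableDnsSupport"]["Value"],
        "dns_hostnames": hostnames["EnableDnsHostnames"]["Value"],
    }
```

In the notebook this could be called as `vpc_dns_settings(boto3.client('ec2'), '<your-vpc-id>')`; both values should be `True` before continuing.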

In [None]:
%%bash

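# NOTE: this selects the first VPC returned and its first security group; verify that they match this notebook's VPC
export vpc_id=$(aws ec2 describe-vpcs --query "Vpcs[0].VpcId" --output text)
export sec_group_id=$(aws ec2 describe-security-groups --filters "Name=vpc-id,Values=${vpc_id}" --query "SecurityGroups[0].GroupId" --output text)
echo "$vpc_id"
echo "$sec_group_id"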

#### Set Security Group ID

TODO: This security group might need to have port 5439 (the Redshift port used when creating the cluster) open.

Copy the value of `sec_group_id` printed above.

In [1]:
security_group_id='xxxxxx'
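To follow up on the TODO about the open port, the inbound rules of the security group can be inspected. This is a rough sketch under assumptions: `allows_port` is a hypothetical helper, it only looks at protocol and port range, and it does not check which sources (CIDR blocks or other security groups) a rule admits.

```python
REDSHIFT_PORT = 5439  # port passed to create_cluster in this notebook


def allows_port(security_group, port=REDSHIFT_PORT):
    """Return True if any inbound rule of the group covers the given TCP port.

    security_group is one entry of the 'SecurityGroups' list returned by
    the EC2 describe_security_groups call.
    """
    for rule in security_group.get("IpPermissions", []):
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            # "-1" means all protocols, which implies all ports.
            return True
        if protocol == "tcp" and rule["FromPort"] <= port <= rule["ToPort"]:
            return True
    return False
```

A possible usage: fetch the group with `boto3.client('ec2').describe_security_groups(GroupIds=[security_group_id])['SecurityGroups'][0]` and pass it to `allows_port`.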

### Define Redshift Parameters

In [2]:
# Redshift configuration parameters
redshift_cluster_identifier = 'dsoaws'
database_name = 'dsoaws'
cluster_type = 'multi-node'

# Note that only some node types support the Redshift Query Editor
# (https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor.html)
node_type = 'dc2.large'
number_nodes = '2' 

master_user_name = 'dsoaws'
master_user_pw = '<password>'


In [4]:
import boto3
iam = boto3.client('iam')

# TODO: Set up an IAM role with at least S3 access to the data bucket you are loading into Redshift
redshift_role = iam.get_role(RoleName='DSOAWS_Redshift')
redshift_role_arn = redshift_role['Role']['Arn']
print(redshift_role_arn)


arn:aws:iam::806570384721:role/DSOAWS_Redshift
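If the `DSOAWS_Redshift` role does not exist yet, `get_role` fails. One way to create it is sketched below. This is an illustration rather than the notebook's own setup: the helper names are hypothetical, and which S3 policy to attach is an assumption you should adapt to your bucket.

```python
import json


def redshift_trust_policy():
    """Trust policy that lets the Redshift service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "redshift.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_redshift_role(iam_client, role_name, s3_policy_arn):
    """Create the role, attach one S3 access policy, and return the role ARN.

    iam_client is expected to be a boto3 IAM client.
    """
    role = iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(redshift_trust_policy()),
    )
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=s3_policy_arn)
    return role["Role"]["Arn"]
```

For example, `create_redshift_role(iam, 'DSOAWS_Redshift', 'arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess')` would attach the AWS managed read-only S3 policy, assuming read access is sufficient for loading your data.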


### Create Redshift Cluster

In [None]:
import boto3
redshift = boto3.client('redshift')

response = redshift.create_cluster(
        DBName=database_name,
        ClusterIdentifier=redshift_cluster_identifier,
        ClusterType=cluster_type,
        NodeType=node_type,
        NumberOfNodes=int(number_nodes),       
        MasterUsername=master_user_name,
        MasterUserPassword=master_user_pw,
        IamRoles=[redshift_role_arn],
        VpcSecurityGroupIds=[security_group_id],
        Port=5439,
        PubliclyAccessible=False
)

print(response)
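`create_cluster` returns before the cluster is ready to accept connections. A minimal sketch for waiting until it is available and reading its endpoint could look like this; `wait_for_endpoint` is a hypothetical helper, and it assumes a boto3 Redshift client such as the `redshift` object created above.

```python
def wait_for_endpoint(redshift_client, cluster_identifier):
    """Block until the cluster is available, then return (host, port)."""
    # 'cluster_available' is a built-in boto3 waiter that polls describe_clusters.
    waiter = redshift_client.get_waiter("cluster_available")
    waiter.wait(ClusterIdentifier=cluster_identifier)

    cluster = redshift_client.describe_clusters(
        ClusterIdentifier=cluster_identifier
    )["Clusters"][0]
    endpoint = cluster["Endpoint"]
    return endpoint["Address"], endpoint["Port"]
```

Calling `wait_for_endpoint(redshift, redshift_cluster_identifier)` may take several minutes; the returned host is the fully qualified endpoint name that the notebook uses to reach the cluster from inside the VPC.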
