# Get started with Redshift and the Feature Store

This tutorial notebook helps you get started with the Hopsworks Feature Store and Amazon Redshift.


* [Create security group for Redshift cluster](#sg_redhsift)
* [Create a sample Amazon Redshift cluster](#setup_redhsift)
* [Create IAM role for EC2 instance to access Redshift cluster](#aim_ec2_redhsift)
* [Attach IAM role to your Hopsworks cluster](#attach_aim_ec2)
* [Load sample data into Redshift cluster](#load_data_redhsift)

# Create security group for Redshift cluster <a name="sg_redhsift"></a>


## From AWS management console go to VPC

![1.jpg](images/VPC_steps/1.jpg)


## Choose security groups
![2.jpg](images/VPC_steps/2.jpg)


## Create security groups

![3.jpg](images/VPC_steps/3.jpg)


## Add inbound rules for Redshift cluster traffic 

![5.jpg](images/VPC_steps/5.jpg)
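
Once the inbound rule is in place, it can help to confirm that the Redshift port is reachable from the machine that runs Hopsworks. The following is a minimal sketch, not part of the original tutorial: the `PortCheck` object and its `isReachable` method are hypothetical helpers, and the port 5439 is the one used later in this notebook.

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

// Hypothetical helper (not a Hopsworks or AWS API): tests plain TCP reachability.
object PortCheck {
  // Returns true if a TCP connection to host:port succeeds within timeoutMs.
  def isReachable(host: String, port: Int, timeoutMs: Int = 3000): Boolean = {
    val socket = new Socket()
    try {
      Try(socket.connect(new InetSocketAddress(host, port), timeoutMs)).isSuccess
    } finally {
      socket.close()
    }
  }
}

// Example call; replace the placeholder with your own cluster endpoint:
// PortCheck.isReachable("<your-cluster-endpoint>", 5439)
```

A `false` result usually points at the security group rule or at the cluster not being in the same VPC, although this check only tests the TCP connection, not authentication.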



# Create a sample Amazon Redshift cluster<a name="setup_redhsift"></a>

### Step 1) From AWS management console go to Redshift and select Create Cluster 


![1.jpg](images/redshift/1.png)


### Step 2) From cluster configuration decide size of your Redshift cluster 


![2.jpg](images/redshift/2.png)


### Step 3) Scroll down to Database Configuration and enter username and password

![3.jpg](images/redshift/3.png)


### Step 4) Scroll down to Cluster permissions. This step is optional. However, if you want to load data from S3, you must give the Redshift cluster an IAM role that has an S3 read policy


![4.jpg](images/redshift/4.png)


### Step 5) Scroll down to Additional Configurations and add the security group created above

![5.jpg](images/redshift/5.png)


### Step 6) Scroll down and Create Cluster 

![6.jpg](images/redshift/6.png)



# Create IAM role for EC2 instance to access Redshift cluster <a name="aim_ec2_redhsift"></a>


## Step 1) From AWS management console go to the IAM section

![1.jpg](images/EC2_AIM/1.jpg)


## Step 2) Select roles

![2.jpg](images/EC2_AIM/2.jpg)


## Step 3) Choose use case EC2

![3.jpg](images/EC2_AIM/3.jpg)


## Step 4) Select the necessary policy. For demo purposes we will select the full access policy

![4.jpg](images/EC2_AIM/4.jpg)


## Step 5) This step is optional. You can leave the tags empty

![5.jpg](images/EC2_AIM/5.jpg)


## Step 6) In the final step create the IAM role

![6.jpg](images/EC2_AIM/6.jpg)


# Attach IAM role to your Hopsworks cluster <a name="attach_aim_ec2"></a>


## Step 1) From AWS management console go to EC2

![1.jpg](images/EC2_change_AIM/1.jpg)


## Step 2) Select instances

![2.jpg](images/EC2_change_AIM/2.jpg)


## Step 3) Go to Actions and select Instance Settings and then Attach/Replace IAM Role

![3.jpg](images/EC2_change_AIM/3.jpg)

# Additional dependencies
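The Spark job in the last section connects to Redshift over JDBC, so the driver must be available to your Hopsworks job or notebook: [Download an Amazon Redshift JDBC driver](https://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html#download-jdbc-driver)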
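
Before running the Spark cells, you may want to check that the driver class is actually visible on the classpath. The sketch below is an illustration only: `DriverCheck` and `isOnClasspath` are hypothetical names, and the class name is the one passed to the `driver` option later in this notebook.

```scala
import scala.util.Try

// Hypothetical helper: reports whether a class can be loaded by name.
object DriverCheck {
  // Class.forName throws ClassNotFoundException when the class is missing;
  // Try turns that into a Failure, so the result is false.
  def isOnClasspath(className: String): Boolean =
    Try(Class.forName(className)).isSuccess
}

// Example call in the notebook:
// DriverCheck.isOnClasspath("com.amazon.redshift.jdbc42.Driver")
```

If the call returns `false`, the driver JAR has most likely not been attached to the job or notebook session.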

# Load sample data into Redshift cluster <a name="load_data_redhsift"></a>

### Import a CSV into Redshift from an S3 bucket

Importing a CSV into Redshift requires you to create a table first. 

<code>
    CREATE TABLE telcom (
        customer_id VARCHAR PRIMARY KEY,
        gender VARCHAR,
        senior_citizen VARCHAR,
        partner VARCHAR,
        dependents VARCHAR,
        tenure INTEGER,
        phone_service VARCHAR,
        multiple_lines VARCHAR,
        internet_service VARCHAR,
        online_security VARCHAR,
        online_backup VARCHAR,
        device_protection VARCHAR,
        tech_support VARCHAR,
        streaming_tv VARCHAR,
        streaming_movies VARCHAR,
        contract VARCHAR,
        paperless_billing VARCHAR,
        payment_method INTEGER,
        monthly_charges INTEGER,
        total_charges INTEGER,
        churn VARCHAR
    );
</code>

Then load the CSV file from S3 into the table with the `COPY` command:

<code>
    COPY telcom
        FROM 's3://<your-bucket-name>/load/file_name.csv'
        credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
    CSV;
</code>
    
Refer to the [Redshift COPY Command Specification](https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html) for a complete list of COPY options.
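
If the Redshift cluster was given an IAM role with an S3 read policy when it was created, `COPY` can authenticate with that role through its `IAM_ROLE` parameter instead of embedding access keys. The following sketch only builds the statement text. `RedshiftCopy` and `copyFromS3` are hypothetical helper names, and the bucket, path and role ARN are placeholders you must replace.

```scala
// Hypothetical helper: assembles a COPY statement that uses an IAM role.
object RedshiftCopy {
  // ignoreHeader = true adds IGNOREHEADER 1, which skips the first line of the file.
  def copyFromS3(table: String,
                 s3Path: String,
                 iamRoleArn: String,
                 ignoreHeader: Boolean = false): String = {
    val header = if (ignoreHeader) " IGNOREHEADER 1" else ""
    s"COPY $table FROM '$s3Path' IAM_ROLE '$iamRoleArn' CSV$header;"
  }
}

// Placeholder values, for illustration only:
// RedshiftCopy.copyFromS3(
//   "telcom",
//   "s3://<your-bucket-name>/load/file_name.csv",
//   "arn:aws:iam::<account-id>:role/<role-name>")
```

The resulting string can be pasted into the Redshift query editor. Keeping credentials out of the statement avoids leaking access keys through query logs or shared notebooks.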

    
### Import telcom data into Redshift from Hopsworks

In [None]:
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.DataFrameWriter
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession
import io.hops.util.Hops
import org.apache.spark.sql._
import spark.implicits._
import org.apache.spark.sql.types._

In [None]:
val jdbcUsername = "YOUR_REDSHIFT_USER_NAME"
val iamRole = "IAM_role_name_of_EC2_with_redshift_access"
val jdbcHostname = "redshift-cluster-1.citpxgaovgkr.eu-north-1.redshift.amazonaws.com"
val jdbcPort = 5439
val jdbcDatabase = "telcom"
// Target table, created with the CREATE TABLE statement above
val jdbcTable = "telcom"
// Password-based alternative:
//val jdbcUrl = s"jdbc:redshift://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}"
// IAM-based URL, used here because the EC2 instance carries the IAM role
val jdbcUrl = s"jdbc:redshift:iam://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}"

In [None]:
// Read the CSV file; replace "path" with the location of your file
val telcom = spark.read.csv("path")

// Append the rows to the Redshift table over JDBC
telcom.
  write.
  format("jdbc").
  option("driver", "com.amazon.redshift.jdbc42.Driver").
  option("url", jdbcUrl).
  option("dbtable", jdbcTable).
  option("user", jdbcUsername).
  option("aws_iam_role", iamRole).
  mode("append").
  save()
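
To check that the rows arrived, the same connection settings can be reused to read the table back. The sketch below is illustrative rather than part of the original notebook: `RedshiftJdbc`, `url` and `options` are hypothetical helpers that only assemble strings, and the commented usage assumes an active `SparkSession` named `spark` plus the connection values defined in the cells above.

```scala
// Hypothetical helper: collects the JDBC settings used in this notebook in one place.
object RedshiftJdbc {
  // useIam = true selects the "jdbc:redshift:iam://" scheme, false the plain one.
  def url(host: String, port: Int, database: String, useIam: Boolean = true): String = {
    val scheme = if (useIam) "jdbc:redshift:iam" else "jdbc:redshift"
    s"$scheme://$host:$port/$database"
  }

  // Options in the shape accepted by DataFrameReader.options / DataFrameWriter.options.
  def options(jdbcUrl: String, table: String, user: String): Map[String, String] = Map(
    "driver" -> "com.amazon.redshift.jdbc42.Driver",
    "url" -> jdbcUrl,
    "dbtable" -> table,
    "user" -> user
  )
}

// Usage in the notebook (needs a running Spark session and a reachable cluster):
// val opts = RedshiftJdbc.options(
//   RedshiftJdbc.url(jdbcHostname, jdbcPort, jdbcDatabase), "telcom", jdbcUsername)
// val readBack = spark.read.format("jdbc").options(opts).load()
// readBack.show(5)
```

Keeping the options in a single map means the write and the read-back use exactly the same connection settings.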