# Amazon Redshift

Amazon Redshift is an AWS service for data warehousing. For some, this definition might sound awkward, because data warehousing is more of a concept than a tool. However, even though we can theoretically use a regular database as a unified data repository (a data warehouse), it gets harder and harder to scale as data volume increases.

That is where Amazon Redshift comes in. As [stated by Amazon](https://docs.aws.amazon.com/de_de/redshift/index.html), the service scales well from gigabytes to petabytes of data while being completely managed (that means we get the infrastructure, but don't need to worry about maintenance, updates, configuration and so on).

# How Does Amazon Redshift Work?

Put simply, a Redshift data warehouse is a cluster of computing nodes. Whenever we communicate with the database (to insert or query data, for example), the leader node splits the task among the compute nodes and combines the partial results that each of them returns. This enables Redshift to handle huge data volumes.
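The split-and-combine idea can be sketched in a few lines of Python. This is only a toy illustration, with threads standing in for nodes; it is not how Redshift is implemented, and every function name below is made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_slices(rows, n_nodes):
    """Distribute the rows round-robin across n_nodes slices."""
    return [rows[i::n_nodes] for i in range(n_nodes)]


def node_partial_sum(rows):
    """Each node aggregates only its own slice of the data."""
    return sum(rows)


def coordinator_sum(rows, n_nodes=4):
    """The coordinating node splits the work, runs it in parallel,
    and merges the partial results into the final answer."""
    slices = split_into_slices(rows, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(node_partial_sum, slices))
    return sum(partials)


# Sum the numbers 1..100 across four simulated nodes.
total = coordinator_sum(list(range(1, 101)))
print(total)
```

Each simulated node only ever sees a quarter of the rows, which is why adding nodes lets the same query cover more data.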

Another thing that improves Redshift's performance is column-oriented storage. While regular databases are row-oriented (they store all the values of a record together), Redshift stores the values of each column together. This makes it faster to read a lot of data from a single column at once, which is the typical access pattern in analytical queries (think: fact tables and dimension tables).
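To make the difference concrete, here is a minimal sketch of the same three records in both layouts. The table and its values are invented for illustration, and real columnar storage adds compression and on-disk block layouts that this ignores.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "customer": "a", "price": 10.0},
    {"order_id": 2, "customer": "b", "price": 25.5},
    {"order_id": 3, "customer": "a", "price": 4.5},
]

# Column-oriented layout: each column keeps all of its values together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Summing prices in the row layout touches every field of every record...
total_row_store = sum(row["price"] for row in rows)

# ...while the column layout only needs to read the "price" column.
total_column_store = sum(columns["price"])

print(total_row_store, total_column_store)
```

Both layouts give the same answer; the column layout simply reads less data to get there when a query only needs a few columns.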

# Creating a Redshift Cluster

A Redshift cluster can become expensive; however, AWS offers a two-month free trial for the first Redshift cluster we create. So I went ahead and created one.

![redshift_start](Redshift/redshift_start.png)

# Loading Data Into Redshift

Redshift comes with a query editor, which makes it very easy to load and query the data.

![query_editor](Redshift/redshift_query_editor.png)

Notice that I created the database olist (dev and sample_data_dev come out of the box with Redshift).

**Manually Creating a Table and Loading Data From S3**

The first option we have is to manually create a table and insert the data from S3 (notice that I'm putting the table in the staging schema, inside of the olist database).

![manual](Redshift/redshift_manual.png)

When we then click on "load data", we can pick a file from an S3 bucket to load into the table.

![s3](Redshift/redshift_s3_import.png)

*Note: for this to work, the Redshift cluster needs an IAM role attached to it. To attach one, select an IAM role under Properties on the start page of the Redshift cluster. Of course, this requires an existing IAM role with the correct permissions (such as read access to the S3 bucket).*

![iam](Redshift/redshift_iam.png)

Redshift does the heavy lifting for us. We can now query the data we just loaded:

![query_result](Redshift/redshift_query_result.png)
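In Redshift, loading from S3 is done with the `COPY` command, so the load dialog ends up producing a statement of roughly the following shape. The sketch below only builds the SQL text in Python; the table name, bucket, file name and role ARN are hypothetical placeholders, and the CSV options are an assumption about the file rather than something taken from the screenshots.

```python
def build_copy_statement(table, s3_uri, iam_role_arn, ignore_header=1):
    """Return a Redshift COPY statement that loads one CSV file from S3.

    The IAM role must be attached to the cluster and allowed to read
    the bucket, as described in the note above.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS CSV\n"
        f"IGNOREHEADER {ignore_header};"
    )


# Hypothetical names, for illustration only.
statement = build_copy_statement(
    table="staging.orders",
    s3_uri="s3://my-olist-bucket/orders.csv",
    iam_role_arn="arn:aws:iam::123456789012:role/MyRedshiftRole",
)
print(statement)
```

The printed statement can be pasted into the query editor and run like any other query.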

**Automatically Creating A Table**

Let's create another table, this time with the "Create" option in Redshift.

![create_table](Redshift/redshift_create_table.png)

We can import the schema from a local CSV file, and Redshift creates the table for us:

![table_successful](Redshift/redshift_products.png)

Now we can simply repeat the same process from above (point Redshift to an S3 bucket) and load the data.

Another smart way of doing it is to copy the statements that Redshift generates when we load a table and execute them all at once:

![bulk](Redshift/redshift_bulk.png)

And that's it. We've loaded data into Redshift.

# Loading Data With Manifest

There is another thing that helps when loading data into Redshift (especially when we want to load data from multiple files), which is using the [manifest option to specify files](https://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html). Since this is a bit more advanced, I will leave it for now. Another interesting option is to [load data from JSON files](https://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-copy-from-json.html), which does not apply in our case. I'm mentioning both here because it's nice to know that they exist, even though there is no need to use them in this project.
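For reference, a manifest is a small JSON file that lists the S3 objects to load. A minimal sketch of generating one in Python could look like this; the bucket and file names are hypothetical, and the project itself does not use a manifest.

```python
import json


def build_manifest(s3_uris, mandatory=True):
    """Build a COPY manifest: one entry per file to load.

    With mandatory=True, the load fails if a listed file is missing.
    """
    return {"entries": [{"url": uri, "mandatory": mandatory} for uri in s3_uris]}


# Hypothetical file names, for illustration only.
manifest = build_manifest([
    "s3://my-olist-bucket/orders_part_1.csv",
    "s3://my-olist-bucket/orders_part_2.csv",
])
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

The resulting file is uploaded to S3, and the `COPY` statement then points at the manifest instead of a data file and adds the `MANIFEST` keyword.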

# Conclusion

Redshift is a very powerful data warehousing tool. It offers some interesting features that help handle larger data volumes, like massively parallel processing and columnar storage. On top of this, AWS has added more and more features over the years, which has made it easier to load and query data with the service.

Now that the data is loaded into Redshift, we can remodel it into a star schema, but that is the content of [this other project of mine](link to project).