# Cloud Based Snowflake

### Introduction

In the last lesson, we spoke about the three different layers of Snowflake -- the storage layer, the compute layer, and the services layer.  As we saw, unlike previous architectures, Snowflake's storage layer is completely isolated from its compute layer.

<img src="./snowflake-storage-compute.jpg" width="60%">

We also saw that the services layer stores metadata, which keeps track of where data in our storage layer lives.  When the compute layer runs a query, it may first consult the services layer to determine which files to read to find the relevant data.

<img src="./snowflake-with-services.jpg" width="60%">
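
To make the idea of consulting metadata concrete, here is a minimal sketch in Python.  It is not Snowflake's actual implementation: the `FileMetadata` class, the file paths, and the idea of tracking a minimum and maximum date per file are all assumptions made for illustration.  The point is only that, with a little metadata, we can skip files that cannot contain the rows we want.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMetadata:
    """Hypothetical metadata the services layer might keep for one stored file."""
    path: str
    min_order_date: str  # smallest order date in the file (ISO format)
    max_order_date: str  # largest order date in the file (ISO format)


def files_to_scan(metadata, start, end):
    """Return only the files whose date range overlaps [start, end]."""
    return [
        m.path
        for m in metadata
        if m.max_order_date >= start and m.min_order_date <= end
    ]


# Hypothetical catalog describing three files in the storage layer.
catalog = [
    FileMetadata("orders/part-001", "2023-01-01", "2023-03-31"),
    FileMetadata("orders/part-002", "2023-04-01", "2023-06-30"),
    FileMetadata("orders/part-003", "2023-07-01", "2023-09-30"),
]

# A query for orders between May 1 and July 15 only needs two of the files.
selected = files_to_scan(catalog, "2023-05-01", "2023-07-15")
print(selected)  # ['orders/part-002', 'orders/part-003']
```

Here the compute layer never has to open `orders/part-001`, because the metadata already tells us that none of its rows fall in the requested date range.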

Now under the hood, Snowflake relies on a cloud provider (AWS, Microsoft Azure, or Google Cloud) to store and query the data.  We choose the provider when we sign up for Snowflake -- but for our purposes, we'll just use AWS.

### Seeing the Storage Layer

In Snowflake on AWS, when we store data in the storage layer, we are really storing it in Amazon S3.

<img src="./aws-s3.png" width="70%">

Now S3 is just a file storage service -- and we can think of it like storing files in Dropbox or iCloud.  Of course, it does have capabilities designed for programmatic use.  For example, it offers fast access to the data (that is, low latency), high availability (our data is rarely unavailable), and storage that scales easily.

Snowflake uses S3 to store our data -- and takes advantage of the low latency, high availability, and scalability that S3 provides.  But most importantly, there are *no* computing resources here.  We are simply storing our data, using S3 as a cloud based hard drive -- and there is no compute available to us at this layer.

> So when we see a diagram of the storage layer, just remember that really this just represents files being stored in S3.

<img src="./snowflake-storage.jpg" width="20%"> $= $   <img src="./aws-s3.png" width="40%">

### Seeing the Compute Layer

Now because the storage layer is just a hard drive, to perform a query on this data we need to add one or more virtual warehouses, which query the data in our S3 storage.  We refer to these virtual warehouses as our compute layer.

<img src="./snowflake-storage-compute.jpg" width="60%">

Now each virtual warehouse is just one or more EC2 machines.  If S3 is a hard drive that exists in the cloud, then an EC2 machine is a cloud based computer that AWS makes available to us.  

<img src="./ec2-dashboard.png" width="80%">

Remember, when we think of a computer, really this is a machine that has a CPU, memory, and a hard drive.  And so the EC2 machine will read the data from S3, and then query and process that data.

Now when we say virtual warehouse in Snowflake, this can actually be more than one EC2 machine.

<img src="./dw-3.jpg" width="30%">

So, we can scale up our computing power to allow for faster queries.  These queries can run faster with more computing power because Snowflake will process the data in parallel, just like we saw with our shared nothing architecture.

<img src="./distributed-data.jpg" width="40%">

So each compute node can query a different partition of the data in parallel, and then these results can be aggregated to return the result to the user.

> Remember that distributing these queries is called mapping, and aggregating the results is called reducing.  This approach is also referred to as massively parallel processing, or MPP.
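
The map and reduce steps can be sketched in a few lines of Python.  This is only an illustration of the pattern, not how Snowflake runs queries: the `partitions` data is made up, and threads on a single machine stand in for the separate EC2 machines of a virtual warehouse.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of an orders table: one list of order amounts each.
partitions = [
    [120, 80, 200],
    [50, 50],
    [300, 10, 40, 60],
]


def partial_sum(partition):
    """Map step: one worker totals only the partition it was given."""
    return sum(partition)


def parallel_total(partitions, workers=3):
    """Total every partition in parallel, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker thread plays the role of one compute node.
        partials = list(pool.map(partial_sum, partitions))
    # Reduce step: aggregate the per-partition results into one answer.
    return sum(partials)


total = parallel_total(partitions)
print(total)  # 910
```

Each worker only ever sees its own partition, and the final `sum` over the partial results is the aggregation step that produces the answer returned to the user.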

Then of course, we may have multiple virtual warehouses -- oftentimes there is a separate virtual warehouse for each team (marketing, data science, etc.).  This way an organization can better keep track of the types and number of queries that each team is performing.

<img src="./data-warehouses.jpg" width="40%">

And each virtual warehouse is just a separate cluster of EC2 machines.
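
As a rough mental model, we can picture this setup in Python.  Everything below is hypothetical (the `VirtualWarehouse` class, the warehouse names, and the node counts are invented for illustration, and none of it is a Snowflake API).  It only shows separate clusters reading from the same shared storage, with each cluster's query activity tracked on its own.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualWarehouse:
    """Hypothetical stand-in for one team's cluster of machines."""
    name: str
    node_count: int
    queries_run: list = field(default_factory=list)

    def run_query(self, storage, table):
        """Read a table from shared storage and record that this warehouse did so."""
        self.queries_run.append(table)
        return len(storage[table])  # pretend the query is a simple row count


# One shared storage layer, standing in for the files in S3.
shared_storage = {"orders": [1, 2, 3], "customers": [1, 2]}

# Separate compute for each team, sized independently.
marketing = VirtualWarehouse("marketing_wh", node_count=1)
data_science = VirtualWarehouse("data_science_wh", node_count=4)

marketing.run_query(shared_storage, "customers")
data_science.run_query(shared_storage, "orders")
data_science.run_query(shared_storage, "customers")

# Because each team has its own warehouse, usage is easy to attribute.
usage = {wh.name: len(wh.queries_run) for wh in (marketing, data_science)}
print(usage)  # {'marketing_wh': 1, 'data_science_wh': 2}
```

Both warehouses read the same stored data, but the queries each team runs are recorded against that team's own warehouse.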

### Summary

In this lesson, we learned how Snowflake uses cloud based services, and how they power both the storage and compute layers of Snowflake.  The storage layer just consists of S3, a service that stores files, which we use to store the data in our database.

<img src="./snowflake-storage.jpg" width="20%"> $= $   <img src="./aws-s3.png" width="40%">

The important thing is that there are no compute resources available at this layer.  Instead, to perform queries, we create a virtual warehouse, which consists of one or more EC2 machines.  And an EC2 machine is simply a computer, made available through AWS.

<img src="./dw-3.jpg" width="20%">

By having a cluster of computers, a virtual warehouse can use map reduce to query different partitions of data in parallel when performing a query.

<img src="./data-warehouses.jpg" width="30%">

### Resources

[Micropartitions](https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html)

[FoundationDB and Metadata](https://www.snowflake.com/blog/how-foundationdb-powers-snowflake-metadata-forward/)