# Snowflake Architecture

### Introduction

So far, we've learned about two different kinds of architectures for data warehouses.  

1. Shared everything architecture

We learned about the shared everything architecture, where multiple nodes connect to a single storage layer in which data is permanently stored.

<img src="./share-storage.jpg" width="60%">

Because some processing occurs in the storage layer, this shared everything architecture can have difficulty scaling as more nodes are added.

2. Shared nothing architecture

In the shared-nothing architecture, our data is partitioned across multiple nodes.

<img src="./distributed-query.jpg" width="100%">

Each partition of data is allocated its own set of memory and CPU resources.  This way, our computing resources scale up as our data scales up.
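As a rough illustration of how data might be spread across nodes, here is a minimal sketch.  It assumes rows are assigned to nodes by hashing a key; the `customer_id` column, the node count, and the function names are all hypothetical, and real systems use their own distribution schemes.

```python
import zlib


def assign_node(key: str, num_nodes: int) -> int:
    """Map a row key to a node using a stable hash (an assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition_rows(rows, num_nodes):
    """Place every row on exactly one node, based on its customer_id."""
    nodes = {n: [] for n in range(num_nodes)}
    for row in rows:
        nodes[assign_node(row["customer_id"], num_nodes)].append(row)
    return nodes


# Hypothetical sample data: eight rows spread over four nodes.
rows = [{"customer_id": f"c{i}", "amount": i * 10} for i in range(8)]
partitions = partition_rows(rows, num_nodes=4)

for node_id, node_rows in partitions.items():
    print(node_id, [r["customer_id"] for r in node_rows])
```

Each node only ever works on the rows it holds, which is why adding data in this model generally means adding nodes, and with them more CPU and memory.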

### From shared nothing to isolated compute

While it may seem sensible to scale CPU and memory resources (i.e. compute resources) as our data scales, the amount of compute we need does not necessarily depend on the amount of stored data.  For example, we may need more computing power simply because more users are querying the same data.  And if the number of queries later drops, we may need less computing power than the amount a shared nothing architecture dedicates to that data.

Snowflake recognizes that computing needs may vary independently of storage needs, and that computing needs may increase or decrease over time as the consumers of the data change.  Because of this, with Snowflake, the storage layer is completely isolated from the compute layer.

> In Snowflake, each of the four compute groups is referred to as a separate virtual warehouse.  We no longer refer to them as nodes, because each virtual warehouse may actually consist of a cluster of nodes, if needed.

<img src="./snowflake-storage-compute.jpg" width="60%">

Notice that the architecture above is almost like our shared storage architecture -- but the main difference with Snowflake is that there is no compute at the storage layer.  And because of this, the storage layer does not become a bottleneck as different requests are issued.

So by completely isolating compute from storage we avoid the issues of the previous two systems:

1. Concurrency - Concurrency issues occur when a system has difficulty handling many simultaneous requests.  This occurred with our shared everything system: as more nodes issued requests, the storage layer's compute could not keep up.

2. Inflexible Compute and Storage Allocations - The shared nothing approach of databases like Amazon Redshift scales compute along with the size of the database.  With Snowflake, compute can be scaled up or down independently of storage, and storage capacity can be scaled independently of compute.
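To make the second point concrete, here is a toy model contrasting the two approaches.  It is only a sketch: the class names, the 500 GB-per-node figure, and the node counts are made up for illustration and do not reflect how either system is actually sized.

```python
import math
from dataclasses import dataclass


@dataclass
class SharedNothingCluster:
    """Toy model: compute is tied to how much data is stored."""
    gb_per_node: int
    stored_gb: int

    @property
    def compute_nodes(self) -> int:
        # Every slice of data brings its own node along with it.
        return math.ceil(self.stored_gb / self.gb_per_node)


@dataclass
class SeparatedWarehouse:
    """Toy model: storage and compute are tracked independently."""
    stored_gb: int
    compute_nodes: int

    def load_data(self, gb: int) -> None:
        self.stored_gb += gb  # compute is untouched

    def resize_compute(self, nodes: int) -> None:
        self.compute_nodes = nodes  # storage is untouched


coupled = SharedNothingCluster(gb_per_node=500, stored_gb=2000)
print(coupled.compute_nodes)  # 4
coupled.stored_gb = 4000
print(coupled.compute_nodes)  # 8, compute grew because the data grew

separated = SeparatedWarehouse(stored_gb=2000, compute_nodes=4)
separated.load_data(2000)
print(separated.compute_nodes)  # 4, more data but the same compute
separated.resize_compute(1)
print(separated.stored_gb)  # 4000, less compute but the same data
```

In the coupled model there is no way to change one number without the other, while in the separated model each can be adjusted on its own.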

### Services Layer

Now let's take another look at our diagram of Snowflake's architecture below.  A user may issue a query through one of the virtual warehouses in the compute layer, and that virtual warehouse is then responsible for finding the relevant data in the storage layer.

<img src="./snowflake-storage-compute.jpg" width="60%">

Now remember that there may be a massive number of files holding our data in the storage layer.  So it would be nice if a virtual warehouse did not have to search through all of these files to find the relevant data.  For this purpose (and others), Snowflake also has a services layer.

<img src="./snowflake-with-services.jpg" width="60%">

In the services layer, Snowflake stores metadata -- like which files hold the data for a given table, and the range of values within each of those files.  So when a query is issued through a virtual warehouse, the virtual warehouse can first use the services layer to find which files contain the relevant data, and then perform the query against only those files.
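The idea of using metadata to skip irrelevant files can be sketched as follows.  This is a simplified illustration, not Snowflake's actual implementation: the `FileMetadata` class, the file names, and the single integer value range per file are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMetadata:
    """Hypothetical metadata record kept for one file in storage."""
    file_name: str
    table: str
    min_value: int  # smallest value of the filtered column in this file
    max_value: int  # largest value of the filtered column in this file


def files_to_scan(catalog, table, low, high):
    """Return only the files whose value range overlaps [low, high]."""
    return [
        m.file_name
        for m in catalog
        if m.table == table and m.max_value >= low and m.min_value <= high
    ]


catalog = [
    FileMetadata("orders_001", "orders", 1, 100),
    FileMetadata("orders_002", "orders", 101, 200),
    FileMetadata("orders_003", "orders", 201, 300),
    FileMetadata("customers_001", "customers", 1, 50),
]

# A query on the orders table filtering for values between 150 and 220.
relevant = files_to_scan(catalog, "orders", low=150, high=220)
print(relevant)  # ['orders_002', 'orders_003']
```

Here the lookup rules out `orders_001` and `customers_001` without opening them, so the virtual warehouse only reads two of the four files.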

The services layer performs other useful functions as well.  It contains a query optimizer to help plan queries, and it also handles authentication and authorization.

### Summary

In this lesson we learned about the different layers in Snowflake.  We learned that Snowflake completely separates the storage layer from the compute layer.  This allows us to scale the compute layer up or down irrespective of the amount of data stored.  This differs from the shared nothing approach, where each partition of data is allocated its own compute resources.

We also learned that, unlike in the shared everything approach, no compute resources live at the storage layer.  Instead, Snowflake employs a services layer that keeps track of metadata -- like which files hold which tables -- so that the virtual warehouses can efficiently find the data they need.