# Understanding Hardware

### Computer Hardware

When we think of a computer, there are three main components to consider: (1) the hard drive (or disk), (2) the CPU, and (3) the memory.  Let's go through each of these in turn.


<img src="./cpu.jpg" width="40%">

1. **Hard drive**

    The hard drive is a data storage device. We need this to store any long term data, which is stored in various files.  Generally our hard drive is built into our computer, but we may also add additional storage with an external hard drive.

2. **CPU**

    The CPU is the central processing unit, and this is what performs tasks on our computer.  The CPU executes a series of stored instructions that is a program.

3. **Memory** 

    We use the hard drive to store data permanently, but reading from and writing to disk can take a significant amount of time.  Because of this, data we need readily is kept in memory for faster access.  We can see this in Python.  

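When we first read data from a file, like a CSV file, it may take some time.  Note that the example here fetches the CSV from a URL rather than from a local disk, so the timing also includes the network transfer, but the idea is the same.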

```python
%%timeit

import pandas as pd
url = "https://raw.githubusercontent.com/analytics-engineering-jigsaw/snowflake/main/0-hardware/houston_claims.csv"
df = pd.read_csv(url)
df[:2]
```

24.5 ms ± 261 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


But once we read the CSV file, it is stored in memory, and we can access it with our `df` variable.  Notice that accessing the data in memory is over 2,000 times faster than reading the file: about 10.7 µs compared with 24.5 ms.

```python
%%timeit

df
df[:2]
```

10.7 µs ± 91.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
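The same comparison can be sketched outside a notebook with only the standard library. This is a minimal sketch, not the lesson's original measurement: the sample rows, the temporary file, and the function names `read_from_disk` and `read_from_memory` are all assumptions made for illustration. Because the operating system may cache a recently written file, the "disk" timing here mostly reflects opening and parsing the file, so the gap on a cold read could be larger.

```python
import csv
import os
import tempfile
import timeit

# Hypothetical sample data, written to a temporary CSV file on local disk.
rows = [[str(i), f"claim-{i}"] for i in range(1_000)]
with tempfile.NamedTemporaryFile("w", newline="", suffix=".csv", delete=False) as f:
    csv.writer(f).writerows(rows)
    path = f.name


def read_from_disk():
    """Open the file and parse every row again."""
    with open(path, newline="") as csv_file:
        return list(csv.reader(csv_file))


# After one read, the parsed rows live in memory.
in_memory = read_from_disk()


def read_from_memory():
    """Slice the rows that are already held in memory."""
    return in_memory[:2]


# Time 100 runs of each approach.
disk_seconds = timeit.timeit(read_from_disk, number=100)
memory_seconds = timeit.timeit(read_from_memory, number=100)
print(f"disk: {disk_seconds:.6f}s, memory: {memory_seconds:.6f}s (100 runs each)")

os.remove(path)  # clean up the temporary file
```

The exact numbers depend on the machine, so no expected output is shown; the point is only that the in-memory slice avoids the file read and the parsing entirely.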


With these time differences, we may think that we should perhaps only store data in memory.  But there are two downsides to using memory:

1. It's more expensive

The first is simply that it's more expensive to store data in memory than to store it on a disk.  Because of this, computers generally have less space for memory than hard drive space.

2. It isn't permanent

The second downside is that storage in memory is not persisted.  When we shut down the computer, the information stored in memory is lost. 

### Database Hardware

As you may have guessed, to perform any SQL operations, we'll need a hard drive, memory, and a CPU.  Let's go through each.

1. Hard Drive

The most obvious is the hard drive.  The data in a database needs to be kept long term, so it is stored on the hard drive.  

* Each table is stored in a separate file.  

* Each file consists of multiple pages.  A page is just a fixed amount of space in the file.  
* The smallest unit the database can read from disk is one page at a time (in Postgres, a page is 8 kB by default).
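To make the idea of a page concrete, here is a minimal sketch that treats a file as a sequence of fixed-size pages and reads exactly one of them. It is only an illustration, not how a database engine is actually implemented: the temporary file stands in for a table file, `read_page` is a hypothetical helper, and the 8192-byte `PAGE_SIZE` is an assumption chosen to match the Postgres default.

```python
import os
import tempfile

PAGE_SIZE = 8192  # assumed page size in bytes (8 kB)

# Hypothetical table file: three pages, each filled with a different byte.
with tempfile.NamedTemporaryFile("wb", delete=False) as f:
    for fill in (b"a", b"b", b"c"):
        f.write(fill * PAGE_SIZE)
    path = f.name


def read_page(file_path, page_number):
    """Return the whole page with the given zero-based page number."""
    with open(file_path, "rb") as table_file:
        table_file.seek(page_number * PAGE_SIZE)  # jump to the start of the page
        return table_file.read(PAGE_SIZE)  # read the entire page, never less


page_count = os.path.getsize(path) // PAGE_SIZE
second_page = read_page(path, 1)
print(page_count, len(second_page), second_page[:1])  # 3 8192 b'b'

os.remove(path)  # clean up the temporary file
```

Even if only a few bytes of the second page were needed, `read_page` still returns all 8192 bytes, which mirrors the page-at-a-time behavior described in the list.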

2. CPU 

What performs the task of reading these pages of data from disk?  This is the CPU.  When we perform a SELECT query, the CPU will perform a task to find the relevant pages on disk and then load their contents into memory.

<img src="./loaded-to-mem.jpg" width="40%">

3. Memory

Once the data is in memory, the CPU will perform further manipulations of the data.  For example, remember that an entire *page* of data is read from disk into memory at a time.  So what is loaded may include entire rows of data, even when only certain columns from those rows are needed.

```SQL
SELECT users.name FROM users ORDER BY city;
```

<img src="./loaded-data.png" width="40%">

> Above, Postgres loads all columns from each row, even columns like id that are not needed, because it reads each page of relevant data and id is stored in that page.  

Once the data is loaded from disk into memory, the CPU can then order the data as specified, and only return certain columns of data.

<img src="./order-data.png" width="40%"> 

<img src="./final-return.png" width="20%">
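The three steps for the query above (load whole pages, order the rows, return only the requested column) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not what Postgres actually runs: the `pages` list and its sample rows are invented, and each page is modeled as a list of complete rows.

```python
# Hypothetical pages of a users table; each row holds every column.
pages = [
    [
        {"id": 1, "name": "Ana", "city": "Houston"},
        {"id": 2, "name": "Bo", "city": "Austin"},
    ],
    [
        {"id": 3, "name": "Cy", "city": "Dallas"},
    ],
]

# 1. Load: whole pages are read, so id and city come along with name.
rows_in_memory = [row for page in pages for row in page]

# 2. Order: sort the in-memory rows by city.
ordered = sorted(rows_in_memory, key=lambda row: row["city"])

# 3. Project: return only the requested column.
result = [row["name"] for row in ordered]
print(result)  # ['Bo', 'Cy', 'Ana']
```

Note that `id` is loaded in step 1 and never used, which is the extra work the note about unneeded columns points out.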

### Summary

In this lesson, we learned about the hardware that powers both computers and databases.  As we saw, computers consist of a hard drive for long term storage, memory for impermanent storage with fast retrieval, and the CPU, which reads data from both memory and disk and performs tasks.

We then saw how databases make use of this hardware.  Data is stored on disk for long term storage, and when a query is performed, the CPU finds *pages* (small chunks of a file) that contain the relevant data and loads those pages into memory.  Then, once the relevant data is in memory, the CPU continues to operate on it, performing sorts and group bys, and returning only the specified columns of data to the user.

### Resources

[Postgres Data Organization](http://etutorials.org/SQL/Postgresql/Part+I+General+PostgreSQL+Use/Chapter+4.+Performance/How+PostgreSQL+Organizes+Data/)

[Postgres Physical Storage](https://www.postgresql.org/docs/8.0/storage.html)

[Databases CPU](https://severalnines.com/database-blog/how-high-cpu-utilization-effects-database-performance)

[Columnar Stores - Towards DS](https://towardsdatascience.com/columnar-stores-when-how-why-2d6759914319)