## Lesson 9 - SQL and BigQuery

So you have just learnt how to use Python to carry out some basic tasks. There is a lot more to learn about writing code, especially when it comes to analysing the OpenPrescribing data. But before we do that, we thought you might be itching to get your hands on some of this rich data straight away. And why not? Hopefully by now you have your BigQuery credentials from one of the admin team members. If you do, please follow the step-by-step guide in the [README.md](../README.md) file of this repo. Once you are set up, try running the code below.

```python
import os
from ebmdatalab import bq
from pathlib import Path

DATA_FOLDER = Path("data")

sql = """
    SELECT
        code,
        name
    FROM
        ebmdatalab.hscic.ccgs
    WHERE
        name IS NOT NULL
    GROUP BY
        code,
        name
    """

ccg_names = bq.cached_read(sql, os.path.join(DATA_FOLDER, "ccg_names.csv"), use_cache=False)
ccg_names
```



|     | code | name                                              |
|-----|------|---------------------------------------------------|
| 0   | 13H  | BRISTOL, NORTH SOMERSET, SOMERSET AND SOUTH GL... |
| 1   | 13K  | KENT AND MEDWAY COMMISSIONING HUB                 |
| 2   | 13L  | SURREY AND SUSSEX COMMISSIONING HUB               |
| 3   | 13M  | THAMES VALLEY COMMISSIONING HUB                   |
| 4   | 12V  | DERBYSHIRE AND NOTTINGHAMSHIRE COMMISSIONING HUB  |
| ... | ...  | ...                                               |
| 452 | 01K  | NHS MORECAMBE BAY                                 |
| 453 | 06Q  | NHS MID ESSEX                                     |
| 454 | 11X  | NHS SOMERSET                                      |
| 455 | 36J  | NHS BRADFORD DISTRICT AND CRAVEN                  |


So what happened here?

At a high level, the code above fetched some data from BigQuery and presented it to you as a table. There is a lot to unpack here, so let's do just that!



## Introduction to databases and Google BigQuery

Most people have used a `spreadsheet`, such as Excel or Google Sheets. A spreadsheet has rows and columns. Each row is a single record — perhaps one prescription, one patient visit, or one payment. Each column is a type of information — the patient name, the date, the medicine, or the amount. In this sense, a spreadsheet is already behaving like a simple database.

| Patient | Date       | Medicine     | Quantity |
|---------|------------|--------------|----------|
| Alice   | 2024-01-05 | Omeprazole   | 28       |
| Ben     | 2024-01-06 | Paracetamol  | 16       |
| Carla   | 2024-01-06 | Amoxicillin  | 21       |
| Dan     | 2024-01-07 | Omeprazole   | 56       |

The trouble is that spreadsheets only go so far. They work well for relatively small numbers of rows, from hundreds up to many thousands, but once you reach millions of rows a spreadsheet can slow down, become difficult to share, and sometimes fail to open at all. This is where `databases` come in. A database is built to handle large amounts of information, keep it structured, and let you ask questions about it.

`Google BigQuery` is one of these databases, but it is designed for truly massive amounts of data. Instead of sitting on your own laptop, the information lives in the cloud on Google’s servers. You do not scroll through it as you would with a spreadsheet. Instead, you ask it questions using `SQL`, which stands for `Structured Query Language`.

By the way, a `server` is simply a computer, usually with lots of storage and computing power, that you reach over the internet. The `cloud` is a large collection of these servers run by a company such as Google.

So, imagine your spreadsheet of prescriptions has grown so large that it will no longer open. With BigQuery, you can still work with it. You could ask, “How many prescriptions were written in 2024?” and BigQuery would return the number in seconds. You could go further and ask, “Break that total down by GP practice” and BigQuery would return a neat table with one row per practice and the prescription totals beside it.

BigQuery is, in short, a database that you can think of as a supercharged spreadsheet in the cloud. It can handle billions of rows, often answer questions in seconds, and let many people work with the same information at the same time.

## Structured Query Language (SQL)

We have already seen that a spreadsheet is like a database. Rows are records (one prescription, one patient, one order). Columns are fields (the patient name, the medicine, the date, the quantity). With a spreadsheet you scroll, filter, and add formulas. With a database you do something similar, but instead of clicking around you use a language called `SQL`.

SQL is how you ask questions about the information inside a database. Instead of saying “filter this column” or “sum that column” with your mouse, you write a short text command. The database then reads the command, searches through the rows, and gives you the result.


### Give me everything!

Let's say, to keep things very simple, that we put the data from the spreadsheet above into a database, either on your laptop or on Google BigQuery. The data goes into an area of the database called a `table`. Think of a table as being like a single tab (sheet) in a spreadsheet file. We decide to name this table "prescriptions" (it could be almost anything you like). Now let's say you want to get hold of all of the data in the "prescriptions" table. What you would say in SQL (often pronounced "sequel") is "from the prescriptions table, give me all the data you have". The exact syntax for this message to the database looks like this:

```sql
SELECT *
FROM prescriptions;
```

The asterisk `*` means `every column`. The word `syntax` just means the `wording` that we use in SQL.

The results of the above SQL question would look like this:

| Patient | Date       | Medicine     | Quantity |
|---------|------------|--------------|----------|
| Alice   | 2024-01-05 | Omeprazole   | 28       |
| Ben     | 2024-01-06 | Paracetamol  | 16       |
| Carla   | 2024-01-06 | Amoxicillin  | 21       |
| Dan     | 2024-01-07 | Omeprazole   | 56       |
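
If you would like to try this query before your BigQuery credentials arrive, one option is Python's built-in `sqlite3` module, which can run a tiny database in memory. This is only a practice sketch under that assumption (SQLite is a stand-in here, not what OpenPrescribing uses); it builds the four-row `prescriptions` table from above and runs the same query.

```python
import sqlite3

# An in-memory SQLite database stands in for BigQuery (an assumption made
# for practice only; the data disappears when the program ends).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prescriptions (Patient TEXT, Date TEXT, Medicine TEXT, Quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO prescriptions VALUES (?, ?, ?, ?)",
    [
        ("Alice", "2024-01-05", "Omeprazole", 28),
        ("Ben", "2024-01-06", "Paracetamol", 16),
        ("Carla", "2024-01-06", "Amoxicillin", 21),
        ("Dan", "2024-01-07", "Omeprazole", 56),
    ],
)

# SELECT * asks for every column of every row in the table.
all_rows = conn.execute("SELECT * FROM prescriptions;").fetchall()
for row in all_rows:
    print(row)
```

Each row comes back as a Python tuple, one value per column.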

### No names please

Now what if you did not want to know who was prescribed each drug or when, but only which drugs were prescribed and in what quantity? This is really useful if you want to remove personal details about who is actually prescribed what. In that case, we can ask for only the `Medicine` and `Quantity` columns. The SQL for this question looks like this:

```sql
SELECT Medicine, Quantity
FROM prescriptions;
```

By the way, when we ask the database a question using SQL, we say we `query` the database.

The above SQL query would return the result below:

| Medicine     | Quantity |
|--------------|----------|
| Omeprazole   | 28       |
| Paracetamol  | 16       |
| Amoxicillin  | 21       |
| Omeprazole   | 56       |
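
To check for yourself that the patient names and dates really are left out, here is a minimal sketch that assumes Python's built-in `sqlite3` module as a practice stand-in for BigQuery. It runs the query on the example table and prints the names of the columns that come back.

```python
import sqlite3

# Practice database in memory (SQLite is an assumption, not the real BigQuery setup).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prescriptions (Patient TEXT, Date TEXT, Medicine TEXT, Quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO prescriptions VALUES (?, ?, ?, ?)",
    [
        ("Alice", "2024-01-05", "Omeprazole", 28),
        ("Ben", "2024-01-06", "Paracetamol", 16),
        ("Carla", "2024-01-06", "Amoxicillin", 21),
        ("Dan", "2024-01-07", "Omeprazole", 56),
    ],
)

# Ask for two named columns instead of *.
cursor = conn.execute("SELECT Medicine, Quantity FROM prescriptions;")

# cursor.description holds one entry per returned column; the name comes first.
column_names = [column[0] for column in cursor.description]
rows = cursor.fetchall()

print(column_names)
print(rows)
```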

### Only a single type of drug please

Perhaps you only want the rows with Omeprazole in them. In SQL we would ask the database: "look at all of the data in the prescriptions table and return only the rows that have Omeprazole in the Medicine column":

```sql
SELECT *
FROM prescriptions
WHERE Medicine = 'Omeprazole'
```

This SQL query would give you the below result:

| Patient | Date       | Medicine     | Quantity |
|---------|------------|--------------|----------|
| Alice   | 2024-01-05 | Omeprazole   | 28       |
| Dan     | 2024-01-07 | Omeprazole   | 56       |
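
One detail worth knowing is that the text inside the quotes has to match the stored value exactly. The sketch below assumes Python's built-in `sqlite3` module as a practice stand-in for BigQuery, and shows that, in SQLite at least, `'omeprazole'` with a lower-case "o" matches nothing.

```python
import sqlite3

# Practice database in memory (SQLite is an assumption, not the real BigQuery setup).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prescriptions (Patient TEXT, Date TEXT, Medicine TEXT, Quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO prescriptions VALUES (?, ?, ?, ?)",
    [
        ("Alice", "2024-01-05", "Omeprazole", 28),
        ("Ben", "2024-01-06", "Paracetamol", 16),
        ("Carla", "2024-01-06", "Amoxicillin", 21),
        ("Dan", "2024-01-07", "Omeprazole", 56),
    ],
)

# Exact match: returns the rows for Alice and Dan.
exact_match = conn.execute(
    "SELECT * FROM prescriptions WHERE Medicine = 'Omeprazole';"
).fetchall()

# Different capitalisation: SQLite's default text comparison is case-sensitive,
# so no rows match.
wrong_case = conn.execute(
    "SELECT * FROM prescriptions WHERE Medicine = 'omeprazole';"
).fetchall()

print(exact_match)
print(wrong_case)
```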

### Count them all up!

Perhaps you want to know the total quantity of drugs prescribed across all the rows in front of you. You can use the `SUM` function for that! It adds up every value in a column.

```sql
SELECT SUM(Quantity)
FROM prescriptions;
```

You would get a result of:


```text
121
```
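
`SUM` is easy to mix up with `COUNT`: `SUM(Quantity)` adds up the values in the column, while `COUNT(*)` counts how many rows there are. A minimal sketch of the difference, assuming Python's built-in `sqlite3` module as a practice stand-in for BigQuery:

```python
import sqlite3

# Practice database in memory (SQLite is an assumption, not the real BigQuery setup).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prescriptions (Patient TEXT, Date TEXT, Medicine TEXT, Quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO prescriptions VALUES (?, ?, ?, ?)",
    [
        ("Alice", "2024-01-05", "Omeprazole", 28),
        ("Ben", "2024-01-06", "Paracetamol", 16),
        ("Carla", "2024-01-06", "Amoxicillin", 21),
        ("Dan", "2024-01-07", "Omeprazole", 56),
    ],
)

# One query can return both figures; fetchone() gives back a single row.
total_quantity, row_count = conn.execute(
    "SELECT SUM(Quantity), COUNT(*) FROM prescriptions;"
).fetchone()

print(total_quantity)  # 28 + 16 + 21 + 56 = 121
print(row_count)       # 4 rows in the table
```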

### Put them in groups!

What if you want to sum up the quantity for each type of medicine, e.g. all the omeprazole, then all the paracetamol, and so on? To do this, you ask the database to `group` the rows by the `Medicine` column and then sum within each group. You can do this as below:

```sql
SELECT Medicine, SUM(Quantity)
FROM prescriptions
GROUP BY Medicine
```

And your SQL query returns:

| Medicine     | SUM(Quantity) |
|--------------|---------------|
| Omeprazole   | 84            |
| Paracetamol  | 16            |
| Amoxicillin  | 21            |
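
Two optional extras can make grouped results easier to read: `AS` gives the summed column a friendlier name than `SUM(Quantity)`, and `ORDER BY` fixes the order of the rows, which a database does not otherwise guarantee. The sketch below assumes Python's built-in `sqlite3` module as a practice stand-in for BigQuery; the column name `total_quantity` is just an illustrative choice.

```python
import sqlite3

# Practice database in memory (SQLite is an assumption, not the real BigQuery setup).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prescriptions (Patient TEXT, Date TEXT, Medicine TEXT, Quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO prescriptions VALUES (?, ?, ?, ?)",
    [
        ("Alice", "2024-01-05", "Omeprazole", 28),
        ("Ben", "2024-01-06", "Paracetamol", 16),
        ("Carla", "2024-01-06", "Amoxicillin", 21),
        ("Dan", "2024-01-07", "Omeprazole", 56),
    ],
)

query = """
    SELECT Medicine, SUM(Quantity) AS total_quantity
    FROM prescriptions
    GROUP BY Medicine
    ORDER BY total_quantity DESC;
"""
# One row per medicine, largest total first.
totals = conn.execute(query).fetchall()

for medicine, total in totals:
    print(medicine, total)
```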

### How about two groups?
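
You can also group by more than one column at a time: list the columns after `GROUP BY`, separated by commas, and you get one row for each combination of values that appears in the data. The sketch below is an illustration under assumptions rather than part of the OpenPrescribing setup: it uses Python's built-in `sqlite3` module in place of BigQuery, and it adds one made-up row (Eve) to the example table so that two rows share the same date and medicine.

```python
import sqlite3

# Practice database in memory (SQLite is an assumption, not the real BigQuery setup).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prescriptions (Patient TEXT, Date TEXT, Medicine TEXT, Quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO prescriptions VALUES (?, ?, ?, ?)",
    [
        ("Alice", "2024-01-05", "Omeprazole", 28),
        ("Ben", "2024-01-06", "Paracetamol", 16),
        ("Carla", "2024-01-06", "Amoxicillin", 21),
        ("Dan", "2024-01-07", "Omeprazole", 56),
        # Made-up extra row so that one (Date, Medicine) pair occurs twice.
        ("Eve", "2024-01-06", "Paracetamol", 32),
    ],
)

query = """
    SELECT Date, Medicine, SUM(Quantity)
    FROM prescriptions
    GROUP BY Date, Medicine
    ORDER BY Date, Medicine;
"""
# One row per combination of date and medicine.
grouped = conn.execute(query).fetchall()

for date, medicine, total in grouped:
    print(date, medicine, total)
```

Ben's and Eve's paracetamol on 2024-01-06 are combined into a single row with a total of 48, while every other combination keeps its own row.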



### Put it all together!

So, look again at the SQL query that we sent to OpenPrescribing's BigQuery database.

```sql
SELECT
    code,
    name
FROM
    ebmdatalab.hscic.ccgs
WHERE
    name IS NOT NULL
GROUP BY
    code,
    name
```

Can you work out what it is trying to say?

If you have been following along with the above examples, you should be able to translate it into something like this:

```text
Get the "code" and "name" columns from the "ebmdatalab.hscic.ccgs" table. Do not give me any rows where the name column is empty (i.e. NULL). Group the results by the code and name columns, so that each unique pair of code and name appears only once.
```
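
Notice that this query groups without using `SUM`. Grouping by every selected column simply collapses repeated rows, so any code and name pair that appeared more than once would come back a single time. The sketch below shows that behaviour on a toy table; it assumes Python's built-in `sqlite3` module as a stand-in for BigQuery, uses a local table called `ccgs`, and includes a duplicated row and a row with a missing name that are invented purely for the demonstration.

```python
import sqlite3

# Practice database in memory (SQLite is an assumption, not the real BigQuery setup).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ccgs (code TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO ccgs VALUES (?, ?)",
    [
        ("01K", "NHS MORECAMBE BAY"),
        ("01K", "NHS MORECAMBE BAY"),  # duplicate row, invented for the demo
        ("11X", "NHS SOMERSET"),
        ("99Z", None),                 # invented code with a missing (NULL) name
    ],
)

query = """
    SELECT code, name
    FROM ccgs
    WHERE name IS NOT NULL
    GROUP BY code, name
"""
# sorted() just makes the printed order predictable.
ccg_rows = sorted(conn.execute(query).fetchall())

print(ccg_rows)
```

The row with the missing name is filtered out by `WHERE`, and the duplicated pair appears only once.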