# Lesson 11 - Coding 103

Let's remind ourselves of the code we used to fetch some data from OpenPrescribing's BigQuery data platform.

```python
from ebmdatalab import bq
from pathlib import Path

DATA_FOLDER = Path("data")

sql = """
    SELECT code, name
    FROM ebmdatalab.hscic.ccgs
    WHERE name IS NOT NULL
    GROUP BY code, name
    """

ccg_names = bq.cached_read(sql, DATA_FOLDER / "ccg_names.csv", use_cache=False)
ccg_names
```

So you have learnt what the SQL part does, ie it gets a list of CCG names and codes. But what are these `import` parts and this `bq.cached_read` part?

Let's jump in and find out!

## Imports

As we have mentioned before, we are using the `python` programming language to do all of our database retrieval work. You can use any language you like, but `python` works well with data. Another language you may come across in the data analytics world is `R`.

Most languages are shipped with certain functionality, such as printing to the screen or storing values in memory. A lot of this functionality just works by default when you write it out (eg `print("Hello World!")`).

There is a lot of default functionality that the python language can do out of the box. Sometimes, however, you want to use other, more advanced, functionalities. Some of these other functionalities are shipped with python, and just need to be added to your code to be usable. Other functionalities, written by other people (or yourself) can also be added. To bring in extra functionalities we use the `import` keyword.
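Here is a small sketch of the difference, using only things that ship with python. The `math` module is just an illustrative choice; it is not part of the BigQuery sample.

```python
# Built-in functionality: works straight away, no import needed.
print("Hello World!")
name_length = len("OpenPrescribing")
print(name_length)

# Shipped with python, but needs an import before you can use it.
import math

square_root = math.sqrt(16)
print(square_root)
```

Code written by other people, such as `ebmdatalab`, is imported in exactly the same way, but it has to be installed first.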

## Package for Mr Robinson!

If you have seen snippets of code elsewhere, you might have seen something like this:

```python
import os
```

This line of code tells the python language interpreter to "go and fetch this `os` module". The `os` module, short for `operating system`, gives your code the power to work with files and folders on your computer (or in a Codespace). The `os` module is used less in modern code, but you will still see it from time to time. Most coders now prefer `pathlib` to manage files and folders.

NB: people often use the words `package`, `library` and `module` interchangeably. Strictly speaking, `os` is a single module in python's standard library, so `module` is the most accurate term for what you are importing when you write `import os`.
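To see the two styles side by side, here is a minimal sketch that asks both `os` and `pathlib` for the folder your code is currently running in. The variable names are made up for this example.

```python
import os
from pathlib import Path

# Older style: os hands back the current folder as a plain string.
folder_as_text = os.getcwd()

# Newer style: pathlib hands back a Path object for the same folder.
folder_as_path = Path.cwd()

print(folder_as_text)
print(folder_as_path)

# Both styles can check whether the folder exists.
print(os.path.isdir(folder_as_text))
print(folder_as_path.is_dir())
```

Both approaches point at the same place; the `pathlib` version simply gives you an object to work with instead of a piece of text.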

## Home grown goodness!

So let's look at the first line of code in our BigQuery sample above:

```python
from ebmdatalab import bq
```

You will often see `import` on its own or the `from ... import ...` combo. The combo of `from` and `import` is saying "from the ebmdatalab package, import the bq module". Remember the terms `package` and `module` from the previous lesson? If not, please have a quick read through it again.

> By the way, if you have been looking through some of the Bennett Institute websites and GitHub repositories (look at [https://github.com/ebmdatalab](https://github.com/ebmdatalab) for example), you may have seen the term `ebmdatalab` used quite a bit. This actually stands for the `Evidence Based Medicine Datalab`. This was the name of the Bennett Institute before 2021. So the `ebmdatalab` library is a library created by the Bennett Institute! The `bq` module has been created to help users access and analyse the OpenPrescribing BigQuery data.


## Class is in session - again

We’ve already talked about pathlib, as a more modern alternative to `os` for file and folder management. The main class in the `pathlib` module (again, look at lesson 10 for a refresher on these terms) is called `Path`, and we can import it like this:

```python
from pathlib import Path
```

## Wait a minute!

If you have been paying close attention, you will have noticed something interesting: sometimes we import from a `package` (like `ebmdatalab`) and sometimes we import from a `module` (like `pathlib`). And what we get can also be different: sometimes we get another `module` (`bq`) and sometimes it’s a `class` (`Path`).

So it looks like Python lets you mix and match: different starting points (package or module) and different things at the end (module, class, function, variable).

How can this be? Well, because in Python, the `from ... import ...` statement is really just saying:

> Open this box, and grab something inside it.

The “box” can be a folder of code (a package) or a single file (a module), and the “something inside” can be almost anything that’s been defined there. So really we have boxes of boxes in boxes. It just depends on which size box you start with and which boxes are inside of those!
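Here is a rough sketch of the different box combinations, using only code that ships with python. These particular modules were picked purely for illustration and have nothing to do with the BigQuery sample.

```python
# Box: a package (a folder of code). Inside: a module.
from urllib import parse

# Box: a package. Inside: a function.
from json import dumps

# Box: a module (a single file). Inside: a variable.
from string import ascii_lowercase

# Box: a module. Inside: a class.
from fractions import Fraction

# Whatever we grabbed can now be used by its own name.
print(parse.quote("ccg names"))
print(dumps({"code": "A1"}))
print(ascii_lowercase[:3])
print(Fraction(1, 2) + Fraction(1, 2))
```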

## Constants

So we have this line of code:

```python
DATA_FOLDER = Path("data")
```

Why all the caps? Well, a convention in python (and a lot of other languages) is that when you store a value that is not going to change, ie a `constant`, you write its name in all capitals.

The next question, naturally, is: what does this code do? Well, we use the `Path` class to create an object. We give it the starting information of "data". In the case of `Path`, the starting information, aka the argument, of "data" tells `Path` to create an object with associated variables and functions relating to the folder called "data". This object is then stored in the variable `DATA_FOLDER`.
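To make "associated variables and functions" a bit more concrete, here is a minimal sketch of a few things a `Path` object can tell you. The `csv_path` name is made up for this example.

```python
from pathlib import Path

DATA_FOLDER = Path("data")

# A variable (attribute) that comes with the object.
print(DATA_FOLDER.name)  # the last part of the path

# Functions (methods) that come with the object.
print(DATA_FOLDER.is_absolute())  # is this a full path from the top of the disk?
print(DATA_FOLDER.exists())  # is there really a folder called "data" here?

# The "/" operator joins a file name onto the folder.
csv_path = DATA_FOLDER / "ccg_names.csv"
print(csv_path.suffix)  # the file extension
```

Creating a `Path` object does not create the folder itself; it only describes where the folder is (or would be).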

## Let's get some data!

So you know what the SQL code does now, so we will skip over that. Let's talk about this bit of code:

```python
ccg_names = bq.cached_read(sql, DATA_FOLDER / "ccg_names.csv", use_cache=False)
```

So far we have imported the `bq` module by using `from ebmdatalab import bq`. Now what we are saying with `bq.cached_read` is to use the `bq` module (remember this is a file) and the `cached_read()` function (a bunch of code) inside of said module. The `dot operator` here links `bq` and `cached_read`.

The `cached_read()` function takes the arguments of `sql`, `csv_path` and `use_cache`, meaning `SQL code`, `place to store the downloaded data (in csv format)` and `shall I use the data we have stored from last time or not` respectively.

So we give the `cached_read` function the SQL code we have written as the first argument, then a `destination to store the downloaded data`, and then we say, `please download every time we run this code and ignore what we have already saved`. The function saves the data to that destination and also hands it back to us, and we keep it in memory in a variable called `ccg_names`.

## How much?

The reason that we have the caching option with `cached_read` is that it can take time to run your SQL query, and it also costs a little bit of money to run. This is not to scare you away from getting the data you need, but when you can just use the data you have already downloaded, please do. If you have `use_cache` set to `True` and the SQL query has not changed between data analysis runs, then only the saved data is used, and BigQuery is not queried again. However, if you change the SQL query, even by just one character, and `use_cache` is still `True`, then BigQuery WILL be queried again.
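To get a feel for this behaviour, here is a toy sketch of how a cached read could work. This is NOT the real `ebmdatalab` code: `cached_read_sketch`, `fake_run_query` and `QUERY_LOG` are hypothetical names invented for this example, the "query" just returns made-up rows, and storing a fingerprint of the SQL in a separate file is only one assumed way of spotting that the SQL has changed.

```python
import csv
import hashlib
import tempfile
from pathlib import Path

QUERY_LOG = []  # records every time our pretend database is queried


def fake_run_query(sql):
    """Hypothetical stand-in for BigQuery: returns made-up rows."""
    QUERY_LOG.append(sql)
    return [{"code": "X01", "name": "EXAMPLE CCG"}]


def cached_read_sketch(sql, csv_path, use_cache=True):
    """Toy cached read, for illustration only."""
    csv_path = Path(csv_path)
    # A "fingerprint" of the SQL text; changing one character changes it.
    fingerprint = hashlib.sha256(sql.encode("utf-8")).hexdigest()
    fingerprint_path = csv_path.with_suffix(".fingerprint")

    cache_is_fresh = (
        use_cache
        and csv_path.exists()
        and fingerprint_path.exists()
        and fingerprint_path.read_text() == fingerprint
    )
    if not cache_is_fresh:
        # Run the query and save the results for next time.
        rows = fake_run_query(sql)
        with csv_path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
        fingerprint_path.write_text(fingerprint)

    # Either way, hand back what is stored in the csv file.
    with csv_path.open(newline="") as f:
        return list(csv.DictReader(f))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ccg_names.csv"
    cached_read_sketch("SELECT 1", path)  # nothing saved yet: query runs
    cached_read_sketch("SELECT 1", path)  # same SQL: saved data is used
    cached_read_sketch("SELECT 2", path)  # SQL changed: query runs again
    last_result = cached_read_sketch("SELECT 2", path, use_cache=False)  # always runs

total_queries = len(QUERY_LOG)
print(total_queries)
```

Four reads, but the pretend database is only queried three times, because the second read found saved data that matched its SQL.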

If you want to find out an estimate of how much it will cost to run your query, try out the code below:

```python
import sys

# Let python also look for code in the parent folder, where utils lives.
sys.path.append("..")
from utils.bq_costings import how_much

sql = """
    SELECT code, name
    FROM ebmdatalab.hscic.ccgs
    WHERE name IS NOT NULL
    GROUP BY code, name
"""

how_much(sql)
```

## And can you show us the results?

The final icing on the cake is to display the downloaded contents. We do just that by typing `ccg_names` on the last line of the cell, because a notebook automatically shows the value of the last line.