# Lesson 14 - what does our data look like?

In lesson 13 you printed out a percentage for how well the NHS is prescribing 5 days of 500 mg Amoxicillin. However, this only gave you an average over all of the data available. The results from lesson 13 did not show you whether things were improving, worsening or staying the same over time. For this, you need to plot the data over time. Let's do just that.

First let's import some components you have seen before and create a Path object pointing to the "data" folder. We won't add comments to code that you have seen before!

P.S. If you click on the `Run` menu at the top of JupyterLab and then on `Run All Cells`, all of the code below will be run and you will hopefully get a nice plot. Check it out. Afterwards, come back up to the top here and we can walk through the code together.

In [None]:
from pathlib import Path
from ebmdatalab import bq

DATA_FOLDER = Path("data")

In [None]:
denominator_sql = """
    -- get the month and sum of items, with an alias of denominator_items
    SELECT month, SUM(items) AS denominator_items
    -- from the raw_prescribing_normalised table
    FROM `ebmdatalab.hscic.raw_prescribing_normalised`
    -- where the bnf_code matches the Amoxicillin pattern
    WHERE bnf_code LIKE '0501013B0%AB'
    -- group by month
    GROUP BY month
    -- order by month
    ORDER BY month
"""

denominator = bq.cached_read(denominator_sql, DATA_FOLDER / "amoxicillin_denominator_by_month.csv", use_cache=True)
denominator
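
If the `LIKE '0501013B0%AB'` part of the query looks cryptic: the `%` stands for "any characters here", so the pattern matches codes that start with `0501013B0` and end with `AB`. The sketch below imitates that with Python's `fnmatchcase`, where `*` plays the role of `%`. The three codes are illustrative strings chosen for this example only, and this is not how BigQuery itself evaluates `LIKE`.

```python
from fnmatch import fnmatchcase

# In fnmatch, * means "any characters", much like % in SQL's LIKE
pattern = "0501013B0*AB"

# Starts with 0501013B0 and ends with AB, so it matches
starts_and_ends_right = fnmatchcase("0501013B0AAABAB", pattern)

# Starts correctly but ends with AA, so it does not match
wrong_ending = fnmatchcase("0501013B0AAAAAA", pattern)

# Ends with AB but starts differently, so it does not match
wrong_start = fnmatchcase("0501011P0AAABAB", pattern)

print(starts_and_ends_right, wrong_ending, wrong_start)
```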

In [None]:
numerator_sql = """
    SELECT month, SUM(items) AS numerator_items
    FROM `ebmdatalab.hscic.raw_prescribing_normalised`
    -- Only difference here is that we look for order quantities of 15.
    WHERE bnf_code LIKE '0501013B0%AB' AND quantity_per_item = 15
    GROUP BY month
    ORDER BY month
"""

numerator = bq.cached_read(numerator_sql, DATA_FOLDER / "amoxicillin_numerator_by_month.csv", use_cache=True)
numerator

In [None]:
import pandas as pd

# Here we are tidying up the data

# First we convert the month column into a standard pandas date and time
denominator["month"] = pd.to_datetime(denominator["month"])

# Then we loop through the columns in the data
for col in denominator.columns:
    #  We look only for the denominator_items column
    if col == "denominator_items":
        # We then convert the items in this column into numbers. If an item cannot be converted, we store a NaN value instead.
        # The "coerce" option tells pandas to use NaN whenever there is a conversion error.
        denominator[col] = pd.to_numeric(denominator[col], errors="coerce")

denominator

In [None]:
# We do the same here, as above, with the numerators
numerator["month"] = pd.to_datetime(numerator["month"])
for col in numerator.columns:
    if col == "numerator_items":
        numerator[col] = pd.to_numeric(numerator[col], errors="coerce")

numerator

## Hold your horses!

What is this new keyword `for`?

What is this `NaN` thing?

### for loops

When we do things in code, as in life, we often need to do them over and over: we loop over things. Brushing our teeth in the morning, back and forth, back and forth, is a repeating loop. So is mixing the eggs for your morning omelet: mix, mix, mix, until the eggs are completely mixed. We use the same kind of logic for loops in code. We can say "loop over these things until some state is reached" or we can say "for each of these things, do this other thing". The Python `for loop` uses the second idea. To use a for loop, you state which items you want to loop through, and each item in turn is assigned to a temporary variable. You then undertake some task on that item. So with

```python
for col in denominator.columns:
```
    
you are saying: for each of the items in `denominator.columns`, store the item as `col`. Then we do something with each item (e.g. an `if statement` and a number conversion).
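
Here is a small, self-contained version of the same idea that you can run on its own. It loops over a plain Python list rather than a real DataFrame. The list `column_names` and the list `found` are made up for this sketch; only the column names are borrowed from the lesson.

```python
# A made-up list standing in for denominator.columns
column_names = ["month", "denominator_items"]

# We will collect the columns we want to convert in here
found = []

# For every item in column_names, store the item as `col` ...
for col in column_names:
    # ... and then do something with it
    if col == "denominator_items":
        found.append(col)

print(found)
```

The loop body runs twice, once per item, but only the second item passes the `if` check.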


### NaN

`Not a Number`. That is all it stands for. We use it in computing when we have some value that is not a number, or cannot be converted to a number. Simples!
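
To get a feel for how `NaN` behaves, here is a minimal sketch in plain Python (no pandas needed). The variable names `missing` and `total` are made up for this example.

```python
import math

# float("nan") is how plain Python writes a NaN value
missing = float("nan")

# math.isnan tells us whether a value is NaN
print(math.isnan(missing))
print(math.isnan(12.0))

# Any arithmetic with NaN gives NaN back
total = 5 + missing
print(math.isnan(total))
```

Because a single `NaN` turns a whole sum into `NaN`, it is usually worth deciding what to do with missing values before doing any maths with them.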

## Moving on

Now we are going to combine the data from the two pandas DataFrames, `denominator` and `numerator`, with `pd.merge`:

In [None]:
df = pd.merge(denominator, numerator, on="month", how="outer")
df

And now we convert any `NaN`s into zeros `0` by using `fillna`:

In [None]:
df[["denominator_items", "numerator_items"]] = df[["denominator_items", "numerator_items"]].fillna(0)
df
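
To see what the outer merge and `fillna` do, here is a toy sketch with made-up numbers (the `toy_*` names and all of the values are invented for illustration). March appears only in the denominator, so the merge leaves a `NaN` in its numerator, which `fillna(0)` then replaces.

```python
import pandas as pd

# Made-up monthly totals: March is missing from the numerator
toy_denominator = pd.DataFrame({
    "month": ["2024-01", "2024-02", "2024-03"],
    "denominator_items": [100, 120, 90],
})
toy_numerator = pd.DataFrame({
    "month": ["2024-01", "2024-02"],
    "numerator_items": [40, 60],
})

# how="outer" keeps every month found in either table
toy = pd.merge(toy_denominator, toy_numerator, on="month", how="outer")

# March has no numerator, so pandas stores NaN there
march_is_missing = toy["numerator_items"].isna().tolist()

# fillna(0) swaps those NaN values for zeros
toy[["denominator_items", "numerator_items"]] = toy[["denominator_items", "numerator_items"]].fillna(0)
print(toy)
```

With `how="inner"` the March row would have been dropped instead, which would leave a gap in the plot.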

In [None]:
def safe_fraction(r):
    if r["denominator_items"] != 0:
        return r["numerator_items"] / r["denominator_items"]
    else:
        return 0

df["fraction"] = df.apply(safe_fraction, axis=1)

df["percentage"] = df["fraction"] * 100

## What did we do there?

We want the fraction of `numerator` divided by `denominator` for each row in the pandas DataFrame. We do this with the `df.apply` function, which takes a function (`safe_fraction` in our case) and applies it to each row (`axis=1`).

*NB: axis=0 would apply our function to the columns instead.*

We have built a function called `safe_fraction` to make sure we never divide a number by zero. Dividing by zero causes an error (or a meaningless result) in most programming languages. Can you work out how it does this? Hint: `!=` means "not equal to", and the `else` keyword says "do this other thing if the `if statement` is false".
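
If you want to experiment, here is a rough sketch that does by hand what `df.apply(safe_fraction, axis=1)` does for us. It uses plain Python dictionaries in place of DataFrame rows, and the `rows` values are made up for this example.

```python
def safe_fraction(r):
    # Same logic as the lesson's function
    if r["denominator_items"] != 0:
        return r["numerator_items"] / r["denominator_items"]
    else:
        return 0

# Made-up rows, written as plain dictionaries instead of DataFrame rows
rows = [
    {"numerator_items": 40, "denominator_items": 100},
    {"numerator_items": 0, "denominator_items": 0},
]

# Call safe_fraction once per row and keep each answer
fractions = []
for r in rows:
    fractions.append(safe_fraction(r))

print(fractions)

# Without the check, plain Python refuses to divide by zero
try:
    0 / 0
except ZeroDivisionError as error:
    print("Python says:", error)
```

The second row has a denominator of zero, so `safe_fraction` takes the `else` branch and returns `0` rather than attempting the division.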

## Now plot it!

In [None]:
# Here we import matplotlib's plotting module, pyplot, and store it under the alias `plt`
import matplotlib.pyplot as plt

# Here we import PercentFormatter, which can format axis tick labels as percentages
from matplotlib.ticker import PercentFormatter

# Let's create a plot in the computer's memory
# Let's make this plot 10 inches wide and 5 inches tall
plt.figure(figsize=(10, 5))

# Now we plot month on the x-axis and percentage on the y-axis
# We use the circular marker for our data points (via marker="o")
plt.plot(df["month"], df["percentage"], marker="o")

# We need a title
plt.title("Graph of percentage of 5 days Amoxicillin prescriptions vs date")

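# We need to label the axes
plt.xlabel("Month")
plt.ylabel("Percentage of 5 days Amoxicillin prescriptions")

# Show the y-axis tick labels as percentages, using the PercentFormatter imported above
plt.gca().yaxis.set_major_formatter(PercentFormatter())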

# Let's make the plot fit nicely in JupyterLab
plt.tight_layout()

# Now show the plot!
plt.show()