### Helpers

In [1]:
import pandas as pd
from functools import partial

In [2]:
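def get_data(area_type, release, earliest=None, latest=None, area=None):
    """Total newCasesBySpecimenDate for one release, optionally filtered."""
    data = pd.read_csv(
        f'https://api.coronavirus.data.gov.uk/v2/data?areaType={area_type}'
        f'&metric=newCasesBySpecimenDate&format=csv&release={release}',
    )
    # Dates are ISO strings, so string comparison orders them chronologically.
    if earliest:
        data = data[data['date'] >= earliest]
    if latest:
        data = data[data['date'] <= latest]
    if area:
        data = data[data['areaName'] == area]
    # Select the metric explicitly so text columns such as areaName are never summed.
    return data.groupby('date')[['newCasesBySpecimenDate']].sum().sum()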

In [3]:
def diff(area_type, new_release, old_release, earliest=None, latest=None, area=None):
    """Change in total cases between two releases over the same filters."""
    new_data = get_data(area_type, new_release, earliest, latest=latest, area=area)
    old_data = get_data(area_type, old_release, earliest, latest=latest, area=area)
    return new_data - old_data

In [4]:
diff_ = partial(diff, new_release='2022-01-31', old_release='2022-01-30')

In [5]:
def diff__(earliest):
    # Revision to the nation total minus revision to the region total, from `earliest` onwards.
    return (diff_('nation', earliest=earliest) - diff_('region', earliest=earliest)).squeeze()

### Narrative
The difference comes from where the series is truncated. For the `nation` series, the data was truncated at 25th July 2021:

In [6]:
diff_('nation',  earliest='2021-07-25')

newCasesBySpecimenDate    731988
dtype: int64

Meanwhile, the `region` series went all the way back to the 1st April 2020:

In [7]:
region_diff = diff_('region', earliest='2020-04-01')
region_diff

newCasesBySpecimenDate    814099
dtype: int64
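
The effect of the cut-off can be seen on a toy example. This is a minimal sketch using made-up per-date revisions, not real API output; `total_change` and the `change` column are hypothetical names for illustration only:

```python
import pandas as pd

# Made-up per-date revisions (new release minus old release); not real figures.
revisions = pd.DataFrame({
    'date': ['2020-04-01', '2020-12-31', '2021-07-25', '2022-01-29'],
    'change': [5, 20, 30, 100],
})

def total_change(revisions, earliest=None):
    """Sum the revisions on or after `earliest` (an ISO date string)."""
    if earliest:
        # ISO 8601 date strings sort chronologically, so a plain string
        # comparison is enough here.
        revisions = revisions[revisions['date'] >= earliest]
    return int(revisions['change'].sum())

print(total_change(revisions, earliest='2020-04-01'))  # 155
print(total_change(revisions, earliest='2021-07-25'))  # 130
```

The same set of revisions gives a smaller total when the series is cut off later, which is the pattern seen in the two figures above.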

Even over the same time ranges, the `nation` and `region` *areaType* figures still differ, and the differences are still significant:

In [8]:
print(f"whole series: {diff__(earliest=None):,}")
# '2020-01-04' is 4th January 2020 (ISO format), not 1st April, so this matches the whole series.
print(f"to 4th Jan 20: {diff__(earliest='2020-01-04'):,}")
print(f"to 25th July 21: {diff__(earliest='2021-07-25'):,}")

whole series: 37,163
to 4th Jan 20: 37,163
to 25th July 21: 25,250


However, `region` only includes England. Once I filter the `nation` data to England as well, the difference becomes less dramatic:

In [9]:
def diff_england(earliest):
    return (diff_('nation', earliest=earliest, area='England') - diff_('region', earliest=earliest)).squeeze()

In [10]:
print(f"whole series: {diff_england(earliest=None):,}")
# '2020-01-04' is 4th January 2020 (ISO format), not 1st April, so this matches the whole series.
print(f"to 4th Jan 20: {diff_england(earliest='2020-01-04'):,}")
print(f"to 25th July 21: {diff_england(earliest='2021-07-25'):,}")

whole series: 8,407
to 4th Jan 20: 8,407
to 25th July 21: 4,383


### Simplest possible

In [11]:
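def simple_get_data(area_filter, release):
    # `area_filter` is a raw query-string fragment, e.g. 'areaType=region'
    # (renamed from `filter` to avoid shadowing the builtin).
    return pd.read_csv(
        f'https://api.coronavirus.data.gov.uk/v2/data?{area_filter}'
        f'&metric=newCasesBySpecimenDate&format=csv&release={release}',
        usecols=['newCasesBySpecimenDate']
    ).sum().squeeze()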

nation_new = simple_get_data('areaType=nation&areaName=England', '2022-01-31')
region_new = simple_get_data('areaType=region', '2022-01-31')
nation_old = simple_get_data('areaType=nation&areaName=England', '2022-01-30')
region_old = simple_get_data('areaType=region', '2022-01-30')

It's interesting that the difference between the case counts for the `nation` of *England* and the sum of its `region`-level data is still over 100k in both releases:

In [12]:
print(f"{nation_new - region_new:,}")
print(f"{nation_old - region_old:,}")

112,002
103,595
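
To see where a gap like this accumulates, the England `nation` series could be compared with the summed `region` series date by date. A minimal sketch on made-up rows (not real API output); `daily_gap` is a hypothetical helper, not part of the notebook above:

```python
import pandas as pd

METRIC = 'newCasesBySpecimenDate'

# Made-up rows in the same shape as the CSV downloads; not real figures.
nation = pd.DataFrame({
    'date': ['2022-01-01', '2022-01-01', '2022-01-02', '2022-01-02'],
    'areaName': ['England', 'Wales', 'England', 'Wales'],
    METRIC: [100, 10, 120, 12],
})
region = pd.DataFrame({
    'date': ['2022-01-01', '2022-01-01', '2022-01-02', '2022-01-02'],
    'areaName': ['London', 'North West', 'London', 'North West'],
    METRIC: [60, 38, 70, 47],
})

def daily_gap(nation, region, area='England'):
    """Per-date nation count for `area` minus the sum over all regions."""
    nation_daily = nation[nation['areaName'] == area].groupby('date')[METRIC].sum()
    region_daily = region.groupby('date')[METRIC].sum()
    # fill_value=0 keeps dates that appear in only one of the two series.
    return nation_daily.sub(region_daily, fill_value=0)

gap = daily_gap(nation, region)
print(gap)
print(f"total gap: {gap.sum():,}")
```

Sorting `gap` by absolute value would then show whether the discrepancy is spread evenly or concentrated in a few specimen dates.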


### Cases still being removed...

Interesting that the 31st Jan release appears to have removed some cases that had been reported for specimen dates up to and including 1st April 2020: 601 cases at a national level and 326 cases at a regional level!

In [13]:
diff_('nation', latest='2020-04-01')

newCasesBySpecimenDate   -601
dtype: int64

In [14]:
diff_('region', latest='2020-04-01')

newCasesBySpecimenDate   -326
dtype: int64
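
The totals above only give the net change. A per-date comparison of the two releases would show which specimen dates were revised. The following is a minimal sketch on made-up rows (not real API output); `revisions_by_date` is a hypothetical helper:

```python
import pandas as pd

METRIC = 'newCasesBySpecimenDate'

# Made-up rows standing in for two releases of the same series; not real figures.
old = pd.DataFrame({
    'date': ['2020-03-30', '2020-03-31', '2020-04-01'],
    METRIC: [50, 60, 70],
})
new = pd.DataFrame({
    'date': ['2020-03-30', '2020-03-31', '2020-04-01'],
    METRIC: [48, 60, 69],
})

def revisions_by_date(new, old, metric=METRIC):
    """Per-date change between releases, keeping only dates that changed."""
    new_daily = new.groupby('date')[metric].sum()
    old_daily = old.groupby('date')[metric].sum()
    # fill_value=0 treats a date missing from one release as zero cases.
    change = new_daily.sub(old_daily, fill_value=0)
    return change[change != 0]

removed = revisions_by_date(new, old)
print(removed)
```

Negative values mark specimen dates where the newer release reports fewer cases than the older one.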