# Exploring the July dip in SSEN data

We've been working on getting hold of more smart meter data than just February 2024, so that we can look for seasonal trends and run other analyses over a longer time period. When we did, we noticed a weird "dip" in consumption around mid-July.

At the time we were only looking at a small random sample of feeders, so first we checked whether there was a bug in our data pipeline. Maybe we'd missed some files to download? It seems not: the original raw files on [the data portal](https://data.ssen.co.uk/@ssen-distribution/ssen_smart_meter_prod_lv_feeder/r/1cce1fb4-d7f4-4309-b9e3-943bd4d18618) are noticeably smaller for the period roughly between 10/07 and 25/07.

To get to the bottom of it, we loaded two days into Pandas dataframes and compared them:

In [13]:
import pandas as pd
sixteenth = pd.read_csv("https://ssen-smart-meter-prod.portaljs.com/LV_FEEDER_USAGE/2024-07-16.csv", storage_options={'User-Agent': 'Mozilla/5.0'})
thirtyfirst = pd.read_csv("https://ssen-smart-meter-prod.portaljs.com/LV_FEEDER_USAGE/2024-07-31.csv", storage_options={'User-Agent': 'Mozilla/5.0'})

Take a general look at the data. Note that the file for 16/07 is noticeably smaller (394MB vs 539MB).

In [4]:
sixteenth.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3043011 entries, 0 to 3043010
Data columns (total 17 columns):
 #   Column                               Non-Null Count    Dtype  
---  ------                               --------------    -----  
 0   dataset_id                           3043011 non-null  int64  
 1   dno_name                             3043011 non-null  object 
 2   dno_alias                            3043011 non-null  object 
 3   secondary_substation_id              3043011 non-null  int64  
 4   secondary_substation_name            2983468 non-null  object 
 5   lv_feeder_id                         3043011 non-null  int64  
 6   lv_feeder_name                       791008 non-null   object 
 7   substation_geo_location              0 non-null        float64
 8   aggregated_device_count_active       2720582 non-null  float64
 9   primary_consumption_active_import    2720582 non-null  float64
 10  secondary_consumption_active_import  0 non-null        float64
 11

In [5]:
thirtyfirst.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4157612 entries, 0 to 4157611
Data columns (total 17 columns):
 #   Column                               Non-Null Count    Dtype  
---  ------                               --------------    -----  
 0   dataset_id                           4157612 non-null  int64  
 1   dno_name                             4157612 non-null  object 
 2   dno_alias                            4157612 non-null  object 
 3   secondary_substation_id              4157612 non-null  int64  
 4   secondary_substation_name            4075437 non-null  object 
 5   lv_feeder_id                         4157612 non-null  int64  
 6   lv_feeder_name                       1049872 non-null  object 
 7   substation_geo_location              0 non-null        float64
 8   aggregated_device_count_active       4155444 non-null  float64
 9   primary_consumption_active_import    4155444 non-null  float64
 10  secondary_consumption_active_import  0 non-null        float64
 11

Let's try to work out why that is. Are we missing some half-hours of the day?

In [9]:
thirtyfirst["data_collection_log_timestamp"].nunique()

48

In [8]:
sixteenth["data_collection_log_timestamp"].nunique()

48
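
Note that `nunique()` over the whole file only tells us that each of the 48 half-hours appears somewhere in the data; an individual feeder could still be missing some periods. A minimal sketch of a per-feeder check, assuming each row is one reading for one `dataset_id` in one half-hour (the helper name `half_hours_per_feeder`, the toy frame and its timestamp format are made up for illustration):

```python
import pandas as pd


def half_hours_per_feeder(df: pd.DataFrame) -> pd.Series:
    """Count the distinct half-hour timestamps reported for each dataset_id."""
    return df.groupby("dataset_id")["data_collection_log_timestamp"].nunique()


# Toy data, not real SSEN readings: feeder 1 reports three half-hours, feeder 2 only one.
toy = pd.DataFrame(
    {
        "dataset_id": [1, 1, 1, 2],
        "data_collection_log_timestamp": [
            "2024-07-16T00:30:00",
            "2024-07-16T01:00:00",
            "2024-07-16T01:30:00",
            "2024-07-16T00:30:00",
        ],
    }
)

coverage = half_hours_per_feeder(toy)
# On the real data the threshold would be 48 rather than 3.
incomplete = coverage[coverage < 3]
```

On the real frames, `half_hours_per_feeder(sixteenth).value_counts()` would show how many feeders report a full 48 half-hours and how many fall short.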

No half-hours are missing from either day, so maybe we are missing values for some feeders?

In [6]:
sixteenth["dataset_id"].nunique()

63420

In [7]:
thirtyfirst["dataset_id"].nunique()

86637
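
To see which feeders are affected rather than just how many, the two sets of IDs can be compared directly. A sketch, assuming `dataset_id` identifies the same feeder on both days (which the rest of this notebook takes for granted); `missing_ids` is a hypothetical helper, and the frames below are toy stand-ins for `sixteenth` and `thirtyfirst`:

```python
import pandas as pd


def missing_ids(earlier: pd.DataFrame, later: pd.DataFrame, key: str = "dataset_id") -> set:
    """Return the key values present in `later` but absent from `earlier`."""
    return set(later[key].unique().tolist()) - set(earlier[key].unique().tolist())


# Toy stand-ins for the two days (not real SSEN IDs).
toy_sixteenth = pd.DataFrame({"dataset_id": [10, 10, 20]})
toy_thirtyfirst = pd.DataFrame({"dataset_id": [10, 20, 30, 30, 40]})

gone = missing_ids(toy_sixteenth, toy_thirtyfirst)
print(sorted(gone))  # [30, 40]
```

With the real data, `thirtyfirst[thirtyfirst["dataset_id"].isin(missing_ids(sixteenth, thirtyfirst))]` would give the rows for the absent feeders, which could then be grouped by `secondary_substation_id` to check whether whole substations dropped out together.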

## Conclusion
We're missing values for a lot of feeders in the middle of July: 63,420 report on 16/07 against 86,637 on 31/07, a difference of 23,217 (about 27%). Where have they gone?
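
One way to start answering that would be to count distinct feeders for every day of the month and see exactly when they drop out and come back. A rough sketch, assuming the daily files all follow the same URL pattern as the two loaded above (only those two URLs are confirmed here); `feeder_usage_url` and `daily_feeder_counts` are names made up for this example:

```python
from datetime import date, timedelta

import pandas as pd

BASE_URL = "https://ssen-smart-meter-prod.portaljs.com/LV_FEEDER_USAGE"


def feeder_usage_url(day: date) -> str:
    """Build the URL of one daily file, assuming the YYYY-MM-DD.csv naming holds."""
    return f"{BASE_URL}/{day.isoformat()}.csv"


def daily_feeder_counts(start: date, end: date) -> pd.Series:
    """Download each day from start to end (inclusive) and count distinct dataset_ids."""
    counts = {}
    day = start
    while day <= end:
        # Read only the column we need; each full file is several hundred MB.
        df = pd.read_csv(
            feeder_usage_url(day),
            usecols=["dataset_id"],
            storage_options={"User-Agent": "Mozilla/5.0"},
        )
        counts[day] = df["dataset_id"].nunique()
        day += timedelta(days=1)
    return pd.Series(counts, name="feeders")
```

Calling `daily_feeder_counts(date(2024, 7, 1), date(2024, 7, 31))` downloads a month of files, so it will be slow, but plotting the result should show whether the drop is abrupt or gradual.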