# Missing GDPS data audit

I found that some GDPS files are missing.
This notebook computes the list of missing files.

We have two places where we store GDPS files: `data/gdps/` and `thinned_data/gdps` under `DATA_DIR`.

For the parquet files (which are generated from the GDPS files), we store them in `interpolated/2021-12-20-gdps-metar/` under `DATA_DIR`.

In [None]:
import os
import pathlib
import pandas as pd
import tqdm.notebook as tqdm
import plotly.express as px

## Audit after data transfer, 2022-03-30

We have 3 sources of data:

- The data in `data`
- The data in `thinned_data`
- The incoming data that was recently transferred

### Data

In [None]:
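# os.environ raises a clear KeyError if DATA_DIR is unset,
# whereas pathlib.Path(os.getenv(...)) would fail with a TypeError on None.
DATA_DIR = pathlib.Path(os.environ['DATA_DIR'])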
not_thinned_path = DATA_DIR / 'data/gdps/'

In [None]:
not_thinned_dirs = sorted(list(not_thinned_path.iterdir()))

In [None]:
not_thinned_labels = set([p.stem for p in not_thinned_dirs])

In [None]:
rows = []
for p in tqdm.tqdm(not_thinned_dirs):
    rows.append({
        'path': p,
        'count': len(list(p.iterdir())),
    })
    
not_thinned_file_counts = pd.DataFrame(rows)

In [None]:
not_thinned_file_counts

### Thinned Data

In [None]:
thinned_path = DATA_DIR / 'thinned_data/gdps'

In [None]:
thinned_runs = sorted(list(thinned_path.iterdir()))

In [None]:
rows = []
for p in tqdm.tqdm(thinned_runs):
    rows.append({
        'path': p,
        'count': len(list(p.iterdir())),
    })
    
file_counts = pd.DataFrame(rows)

In [None]:
file_counts['count'].value_counts()

In [None]:
file_counts[file_counts['count'] < 81]
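
The counting loop is written once per source, so it could be factored into a helper. The following is a minimal sketch, not the notebook's actual code: `count_files_per_run` and `incomplete_runs` are hypothetical names, the default of 81 is simply the threshold used in the filter above, and the demonstration runs on a temporary toy directory tree rather than on the real GDPS data.

```python
import pathlib
import tempfile


def count_files_per_run(root):
    """Map each run directory name under `root` to its number of entries."""
    root = pathlib.Path(root)
    return {
        p.name: sum(1 for _ in p.iterdir())
        for p in sorted(root.iterdir())
        if p.is_dir()
    }


def incomplete_runs(counts, expected=81):
    """Return the runs that hold fewer than `expected` entries."""
    return {label: n for label, n in counts.items() if n < expected}


# Toy demonstration on a temporary tree (not the real GDPS data).
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    for label, n_files in [("2020010100", 3), ("2020010112", 1)]:
        run_dir = root / label
        run_dir.mkdir()
        for i in range(n_files):
            (run_dir / f"file_{i}").touch()

    demo_counts = count_files_per_run(root)
    # In the toy tree a complete run has 3 files, so the threshold is overridden.
    demo_incomplete = incomplete_runs(demo_counts, expected=3)
```

The resulting dictionaries can be passed to `pd.DataFrame` if a tabular view like `file_counts` is preferred.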

In [None]:
thinned_labels = set([p.stem for p in thinned_runs])

### Incoming data

In [None]:
incoming_path = DATA_DIR / 'incoming/'

In [None]:
incoming_files = sorted(list(incoming_path.iterdir()))


In [None]:
incoming_labels = set([f.stem[:10] for f in incoming_files])

In [None]:
rows = []
for f in incoming_files:
    rows.append({
        'file': str(f),
        'size': f.stat().st_size
    })
    
incoming_files_df = pd.DataFrame(rows)

In [None]:
incoming_files_df

In [None]:
px.histogram(data_frame=incoming_files_df, x='size', marginal="rug", hover_name='file')

This histogram looks as expected to me. The oldest dates have smaller file sizes because of the thinning.
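
One way to check the size trend noted above numerically would be to group file sizes by the year encoded in each label. This is only a sketch under assumptions: `median_size_by_year` is a hypothetical helper, labels are assumed to start with a four-digit year as in the `YYYYMMDDHH` labels used in this notebook, and the sizes below are toy values, not measurements from the incoming files.

```python
from collections import defaultdict
from statistics import median


def median_size_by_year(files):
    """Group (label, size) pairs by the year in the label and take the median size."""
    sizes = defaultdict(list)
    for label, size in files:
        sizes[label[:4]].append(size)
    return {year: median(values) for year, values in sorted(sizes.items())}


# Toy (label, size-in-bytes) pairs, not real measurements.
toy_files = [
    ("2019010100", 100),
    ("2019010112", 120),
    ("2021010100", 300),
    ("2021010112", 340),
]
toy_medians = median_size_by_year(toy_files)
```

With the real data, the pairs could be built from the `label` and `size` of each incoming file.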

## Compare against desired dataset

In [None]:
should_be_available = pd.date_range('2019-01-01', '2022-01-01', freq='12H')

In [None]:
expected_labels = {x.strftime('%Y%m%d%H') for x in should_be_available}

In [None]:
((expected_labels - incoming_labels) - thinned_labels) - not_thinned_labels

In summary:

- 2021013000 to 2021013112 are missing and were requested previously.
- 2021071100, 2021102012 and 2021120812 were supposed to have been acquired through Sarracenia but are only partly available.
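
The set difference only reports labels that are absent from every source. A per-label coverage listing can also show which source holds each run. The sketch below is illustrative only: `expected_run_labels` and `coverage` are hypothetical helpers, the 12-hour step and `YYYYMMDDHH` format mirror the cells above, and the label sets are toy values rather than results of this audit.

```python
from datetime import datetime, timedelta


def expected_run_labels(start, end, step_hours=12):
    """Generate YYYYMMDDHH labels from `start` to `end` inclusive."""
    labels = []
    current = start
    while current <= end:
        labels.append(current.strftime("%Y%m%d%H"))
        current += timedelta(hours=step_hours)
    return labels


def coverage(expected, sources):
    """For each expected label, list the names of the sources that contain it."""
    return {
        label: [name for name, labels in sources.items() if label in labels]
        for label in expected
    }


# Toy label sets standing in for the three sources (not real audit results).
toy_sources = {
    "incoming": {"2020010100"},
    "thinned": {"2020010100", "2020010112"},
    "not_thinned": set(),
}
toy_expected = expected_run_labels(datetime(2020, 1, 1), datetime(2020, 1, 2, 12))
toy_coverage = coverage(toy_expected, toy_sources)

# Labels found in no source, equivalent to the chained set difference.
toy_missing = [label for label, found in toy_coverage.items() if not found]
```

Labels that appear in more than one source show up with several names in their list, which covers the same ground as the pairwise intersections.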

In [None]:
incoming_labels & thinned_labels

In [None]:
thinned_labels & not_thinned_labels

## Check for missing files

In [None]:
file_counts['label'] = file_counts['path'].astype(str).str[-10:]

In [None]:
file_counts[file_counts['count'] < 81]

In [None]:
not_thinned_file_counts['label'] = not_thinned_file_counts['path'].astype(str).str[-10:]

In [None]:
not_thinned_file_counts[not_thinned_file_counts['label'] == '2021102012']

In [None]:
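# Derive the label from the file stem, as for incoming_labels, instead of a
# fixed character offset that depends on the length of DATA_DIR.
incoming_files_df['label'] = incoming_files_df['file'].map(lambda f: pathlib.Path(f).stem[:10])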

In [None]:
incoming_files_df[incoming_files_df['label'] == '2021071100']