# Filename parsing

## Introduction

piece of code that delves into the intricacies of space weather monitoring using data from the China Seismo-Electromagnetic Satellite (CSES). 

## Project Setup

The project directory, named CSES_files, serves as the repository for our data files. These files, stored in `HDF5` format, contain valuable measurements from various instruments aboard the CSES satellite. The initial step involves importing the necessary libraries:

In [5]:
import os
import geopandas as gpd
import pandas as pd
import numpy as np
from datetime import datetime, timezone
import h5py
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import xarray as xr
import xarray
from shapely import geometry
from glob import glob

Each library plays a crucial role in data manipulation, visualization, and geographic data handling. For instance, `xarray` is used for handling **multi-dimensional arrays** efficiently, while `geopandas` provides tools for **geographic data manipulation**.

## File Paths and Dataset Handling

### define project directory and the names for different datasets

The code begins by defining the project directory and listing the `HDF5` files to be processed:

In [6]:
project_dir = "./CSES_files"

print(f"Percorso cartella di progetto: {project_dir}")

EFD1 = 'CSES_01_EFD_1_L02_A1_213330_20211206_164953_20211206_172707_000.h5'
HEP1 = 'CSES_01_HEP_1_L02_A4_176401_20210407_182209_20210407_190029_000.h5'
HEP4 = 'CSES_01_HEP_4_L02_A4_202091_20210923_184621_20210923_192441_000.h5'
LAP1 = 'CSES_01_LAP_1_L02_A3_174201_20210324_070216_20210324_073942_000.h5'
SCM1 = 'CSES_01_SCM_1_L02_A2_183380_20210523_154551_20210523_162126_000.h5'
HEPD = 'CSES_HEP_DDD_0219741_20220117_214156_20220117_230638_L3_0000267631.h5'


Percorso cartella di progetto: ./CSES_files


The `dataset` function is then defined to open an `xarray` dataset from a given file path:

In [7]:
# Function to open an xarray dataset from a given path
def dataset(path):
    return xarray.open_dataset(path, engine = 'h5netcdf', phony_dims = 'sort')

# Function to list all variable names in a dataset
def variables(data):
    return list(data.keys())

### List of file paths to be processed

In [8]:
file_list = [
    os.path.join(project_dir, EFD1),
    os.path.join(project_dir, HEP1),
    os.path.join(project_dir, HEP4),
    os.path.join(project_dir, LAP1),
    os.path.join(project_dir, SCM1),
    os.path.join(project_dir, HEPD)
]

# Redefine the dataset function to open xarray datasets
def dataset(path):
    return xarray.open_dataset(path, engine = 'h5netcdf', phony_dims = 'sort')

# Redefine the variables function to list all variable names in a dataset
def variables(data):
    return list(data.keys())


## Extracting Metadata

To understand the data better, the code extracts `metadata` such as `start` and `end dates`, and `orbit numbers` from the filenames. This is achieved using the `extract_dates` and `extract_orbit` functions:

### Extract satellite number from a file name

In [9]:
def extract_satellite_number(file_name):
    try:
        parts = file_name.split('_')
        satellite_number = parts[1]
        return satellite_number
    except IndexError:
        print(f"Errore nell'estrazione del numero del satellite per il file {file_name}")
        return None

### Extract instrument code from a file name

In [10]:
def extract_instrument_code(file_name):
    try:
        parts = file_name.split('_')
        instrument_code = parts[2]
        return instrument_code
    except IndexError:
        print(f"Errore nell'estrazione del codice strumento per il file {file_name}")
        return None

### Extract instrument number from a file name

In [11]:
def extract_instrument_number(file_name):
    try:
        parts = file_name.split('_')
        instrument_number = parts[3]
        return instrument_number
    except IndexError:
        print(f"Errore nell'estrazione del numero strumento per il file {file_name}")
        return None

### Extract data level from a file name

In [12]:
def extract_data_level(file_name):
    try:
        parts = file_name.split('_')
        data_level = parts[4]
        return data_level
    except IndexError:
        print(f"Errore nell'estrazione del livello dei dati per il file {file_name}")
        return None

### Extract orbit number from a file name

In [13]:
def extract_orbit(file_name):
    try:
        base_name = os.path.basename(file_name)
        parts = base_name.split('_')
        start_index = None
        for i in range(len(parts)):
            if parts[i].isdigit() and len(parts[i]) == 8: 
                start_index = i
                break
    
        if start_index is None:
            raise ValueError(f"Formato data non trovato nel nome del file: {file_name}")
        
        orbit = parts[start_index - 1]  
        return orbit
    except ValueError as e:
        print(f"Errore nel parsing dell'orbita per il file {file_name}: {e}")
        return None

### Extract start and end dates from a file name

In [14]:

def extract_dates(file_name):
    try:
        base_name = os.path.basename(file_name) #returns the final component of a pathname
        parts = base_name.split('_')
        
        #find the index of the part that contains the start_date
        start_index = None
        for i in range(len(parts)):
            if parts[i].isdigit() and len(parts[i]) == 8:  # find the part with data format YYYYMMDD
                start_index = i
                break
        
        if start_index is None:
            raise ValueError(f"Formato data non trovato nel nome del file: {file_name}")
        
        start_date_str = '_'.join(parts[start_index:start_index + 2]) 
        end_date_str = '_'.join(parts[start_index + 2:start_index + 4])  
        
        start_date = datetime.strptime(start_date_str, '%Y%m%d_%H%M%S')
        end_date = datetime.strptime(end_date_str, '%Y%m%d_%H%M%S')
        
        return start_date, end_date
    except ValueError as e:
        print(f"Errore nel parsing delle date per il file {file_name}: {e}")
        return None, None

### parse_filename function to include the data


In [16]:
def parse_filename(file_name):
    satellite_nr = extract_satellite_number(file_name)
    instrument_code = extract_instrument_code(file_name)
    instrument_nr = extract_instrument_number(file_name)
    data_l = extract_data_level(file_name)
    start_date, end_date = extract_dates(file_name)
    semiorbit_nr = extract_orbit(file_name)
    return {
        'file_name': file_name,
        "satellite_nr": satellite_nr,
        "instrument_code": instrument_code,
        "instrument_nr": instrument_nr,
        "data_l":data_l,
        "semiorbit_nr": semiorbit_nr,
        "start_date": start_date, 
        "end_date" : end_date
    }

## Creating the DataFrame

The parsed metadata is stored in a list of dictionaries, which is then converted into a pandas DataFrame:

In [17]:
data = []

for file in file_list:
    metadata = parse_filename(file)
    if metadata:
        data.append(metadata)
    #metadata["semiorbit_nr"]
    #semiorbits_geo[metadata["semiorbit_nr"]]
    # {
    #     "start_date": ...
    #     "start_date": ...
    #     "start_date": ...
    # }
if data:
    columns = list(data[0].keys())
else:
    columns = []


df = pd.DataFrame(data, columns=columns)

df

Unnamed: 0,file_name,satellite_nr,instrument_code,instrument_nr,data_l,semiorbit_nr,start_date,end_date
0,./CSES_files/CSES_01_EFD_1_L02_A1_213330_20211...,files/CSES,01,EFD,1,213330,2021-12-06 16:49:53,2021-12-06 17:27:07
1,./CSES_files/CSES_01_HEP_1_L02_A4_176401_20210...,files/CSES,01,HEP,1,176401,2021-04-07 18:22:09,2021-04-07 19:00:29
2,./CSES_files/CSES_01_HEP_4_L02_A4_202091_20210...,files/CSES,01,HEP,4,202091,2021-09-23 18:46:21,2021-09-23 19:24:41
3,./CSES_files/CSES_01_LAP_1_L02_A3_174201_20210...,files/CSES,01,LAP,1,174201,2021-03-24 07:02:16,2021-03-24 07:39:42
4,./CSES_files/CSES_01_SCM_1_L02_A2_183380_20210...,files/CSES,01,SCM,1,183380,2021-05-23 15:45:51,2021-05-23 16:21:26
5,./CSES_files/CSES_HEP_DDD_0219741_20220117_214...,files/CSES,HEP,DDD,219741,219741,2022-01-17 21:41:56,2022-01-17 23:06:38


## Geographic Data Handling

The `polygon` function allows us to create a polygon from geographic coordinates and filter out data points that fall outside this polygon:

In [18]:
def polygon(points, data):
    
    ds = dataset(data)

    geo_lat = ds.GEO_LAT
    geo_lon = ds.GEO_LON


    latitudes = [point[1] for point in points]
    longitudes = [point[0] for point in points]

    lat_min = min(latitudes)
    lat_max = max(latitudes)
    lon_min = min(longitudes)
    lon_max = max(longitudes)

    lat_mask = (geo_lat >= lat_min) & (geo_lat <= lat_max)
    lon_mask = (geo_lon >= lon_min) & (geo_lon <= lon_max)

    print(f"Bounding Box - lat_min: {lat_min}, lat_max: {lat_max}, lon_min: {lon_min}, lon_max: {lon_max}")

    final_mask = lat_mask + lon_mask

    filtered_subset = ds.where(final_mask, drop=True)

    if filtered_subset.GEO_LAT.size > 0 and filtered_subset.GEO_LON.size > 0:
        return(filtered_subset)

In [19]:
polygon_points = [(100.0, 30.0), (120.0, 30.0), (120.0, 50.0), (100.0, 50.0)]

# Test the polygon function with a polygon and one of the data files
filtered_points = polygon(polygon_points, file_list[0])

print(filtered_points)

FileNotFoundError: [Errno 2] Unable to synchronously open file (unable to open file: name = '/home/wvuser/scientific-dashboard/parsing/CSES_files/CSES_01_EFD_1_L02_A1_213330_20211206_164953_20211206_172707_000.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)