This notebook filters the Akebono dataset to remove incorrect or noisy data that would make model training more difficult.

The following filters are considered:
- Keep only the following columns:
    - 'Altitude', 'GCLAT', 'GCLON', 'ILAT', 'GLAT', 'GMLT', 'XXLAT', 'XXLON', 'Te1', 'Ne1', 'Pv1', 'I1', 'DateFormatted'
- Remove all rows where both XXLAT and XXLON equal 999
- Remove all rows where Altitude = 2.0
- ILAT should lie between 0 and 90. Values above 90 are clamped to 90.
- No date filtering
    - Reason: The data is valid for those dates and there's no reason to exclude it.
    - Note: This differs from the methodology the PI used. The PI did not explain the rationale, so we can revisit this decision if a reason is provided.
- No altitude filtering (e.g. 1000km - 8000km)
    - Reason: The data is valid for those altitudes and there's no reason to exclude it.
    - Note: This differs from the methodology the PI used. The PI did not explain the rationale, so we can revisit this decision if a reason is provided.
- TODO: Add solar index data

Questions:
- Why does XXLON range from 0 to 24 rather than 0 to 360?
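
As a minimal sketch of the row filters listed above, the following applies them to a small synthetic frame. The helper name `apply_filters` and the toy values are assumptions for illustration only; the column names come from the list above.

```python
import pandas as pd

def apply_filters(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the row filters described above (hypothetical helper)."""
    # Drop rows only when XXLAT and XXLON are BOTH 999
    both_999 = (df['XXLAT'] == 999) & (df['XXLON'] == 999)
    out = df[~both_999]
    # Drop rows where Altitude is 2.0
    out = out[out['Altitude'] != 2.0]
    # Clamp ILAT into [0, 90]; assign() returns a new frame
    return out.assign(ILAT=out['ILAT'].clip(lower=0, upper=90))

# Toy data, not real Akebono values
toy = pd.DataFrame({
    'Altitude': [2.0, 1500.0, 3000.0, 4500.0, 2.0],
    'ILAT': [45.0, 95.0, 60.0, 30.0, 50.0],
    'XXLAT': [999.0, 10.0, 999.0, -20.0, 5.0],
    'XXLON': [999.0, 5.0, 12.0, 18.0, 5.0],
})

filtered = apply_filters(toy)
print(filtered)
```

In this toy frame, the row with only XXLAT equal to 999 is kept, because the filter requires both columns to be 999 before a row is dropped.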

In [1]:
import pandas as pd

In [13]:
# Load the TSV file
df = pd.read_csv('../data/Akebono_combined.tsv', sep='\t')

In [14]:
# Keep only specific columns
columns_to_keep = ['Altitude', 'GCLAT', 'GCLON', 'ILAT', 'GLAT', 'GMLT', 'XXLAT', 'XXLON', 'Te1', 'Ne1', 'Pv1', 'I1', 'DateFormatted']
df = df[columns_to_keep]

# Remove rows where XXLAT and XXLON are 999
rows_before = len(df)
df = df[(df['XXLAT'] != 999) | (df['XXLON'] != 999)]
print(f"Rows removed where XXLAT and XXLON are 999: {rows_before - len(df)}")
print(f"XXLAT range after filtering: {df['XXLAT'].min()} - {df['XXLAT'].max()}")
print(f"XXLON range after filtering: {df['XXLON'].min()} - {df['XXLON'].max()}")
print()

# Remove rows where altitude is 2.0.
# This is expected to remove 0 rows, since these rows coincide with the
# XXLAT/XXLON = 999 rows already removed above.
rows_before = len(df)
df = df[df['Altitude'] != 2.0]
print(f"Rows removed where altitude is 2.0: {rows_before - len(df)}")
print(f"Altitude range after filtering: {df['Altitude'].min()} - {df['Altitude'].max()} km")
print()
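# Clamp ILAT to the range [0, 90]; values > 90 become 90.
# assign() avoids a SettingWithCopyWarning on the filtered frame.
ilat_affected = (df['ILAT'] > 90).sum()
df = df.assign(ILAT=df['ILAT'].clip(lower=0, upper=90))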
print(f"ILAT values > 90 squashed to 90: {ilat_affected} rows affected")
print(f"ILAT range after squashing: {df['ILAT'].min()} - {df['ILAT'].max()}")
print()
# Print the total number of rows removed.
# Only the first column is re-read, which is enough to count the original rows.
original_rows = len(pd.read_csv('../data/Akebono_combined.tsv', sep='\t', usecols=[0]))
print(f"Rows in original dataset: {original_rows}")
print(f"Rows in filtered dataset: {len(df)}")
print(f"Total rows removed: {original_rows - len(df)}")


Rows removed where XXLAT and XXLON are 999: 133659
XXLAT range after filtering: -89.78 - 86.64
XXLON range after filtering: 0.0 - 24.0

Rows removed where altitude is 2.0: 0
Altitude range after filtering: 225.0 - 10475.0 km

ILAT values > 90 squashed to 90: 22891 rows affected
ILAT range after squashing: 19.91 - 90.0

Rows in original dataset: 4405241
Rows in filtered dataset: 4271582
Total rows removed: 133659
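
The XXLON range of 0.0 to 24.0 printed above is narrower than a 0 to 360 longitude range would suggest. One way to investigate is to compare each column's observed range with an expected one. The sketch below does this on synthetic data; the helper `summarize_ranges` and the bounds in `EXPECTED_RANGES` are assumptions for illustration, not documented properties of the Akebono dataset.

```python
import pandas as pd

# Assumed bounds, for illustration only
EXPECTED_RANGES = {
    'XXLAT': (-90.0, 90.0),
    'XXLON': (0.0, 360.0),
    'ILAT': (0.0, 90.0),
}

def summarize_ranges(df: pd.DataFrame, expected: dict) -> pd.DataFrame:
    """Compare observed column ranges with expected ones (hypothetical helper)."""
    rows = []
    for col, (lo, hi) in expected.items():
        s = df[col]
        rows.append({
            'column': col,
            'min': s.min(),
            'max': s.max(),
            # Number of values outside the expected bounds
            'n_outside': int(((s < lo) | (s > hi)).sum()),
            # Fraction of the expected span that the data actually covers
            'span_used': (s.max() - s.min()) / (hi - lo),
        })
    return pd.DataFrame(rows).set_index('column')

# Toy data, not real Akebono values
toy = pd.DataFrame({
    'XXLAT': [-80.0, 0.0, 85.0],
    'XXLON': [0.0, 12.0, 24.0],
    'ILAT': [20.0, 95.0, 70.0],
})

summary = summarize_ranges(toy, EXPECTED_RANGES)
print(summary)
```

A small `span_used` value flags a column whose values occupy only a small part of the expected range, which is the pattern XXLON shows here.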


In [15]:
# Save the filtered dataset to a new TSV file
df.to_csv('../data/Akebono_v2.tsv', sep='\t', index=False)


In [16]:
# Split the data into train, validation, and test sets
from sklearn.model_selection import train_test_split

# Sizes for each split; the training set receives the remaining rows
total_rows = len(df)
test_size = 100000
val_size = 100000
train_size = total_rows - test_size - val_size

# Create the splits
train_df, temp_df = train_test_split(df, test_size=test_size+val_size, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=test_size, random_state=42)

# Print the sizes of each split
print(f"Train set size: {len(train_df)}")
print(f"Validation set size: {len(val_df)}")
print(f"Test set size: {len(test_df)}")

# Save each split to a separate TSV file
train_df.to_csv('../data/Akebono_v2_train.tsv', sep='\t', index=False)
val_df.to_csv('../data/Akebono_v2_val.tsv', sep='\t', index=False)
test_df.to_csv('../data/Akebono_v2_test.tsv', sep='\t', index=False)



Train set size: 4071582
Validation set size: 100000
Test set size: 100000
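
As a quick sanity check on splits like the ones above, one can confirm that the three sets are disjoint and together cover every row. The sketch below uses a small synthetic frame and builds the splits by shuffling with `DataFrame.sample` instead of `train_test_split`, purely to keep the example light on dependencies. `check_splits` is a hypothetical helper, and it assumes the full frame has a unique index.

```python
import pandas as pd

def check_splits(full, train, val, test) -> bool:
    """Return True if the splits partition `full` by index (hypothetical helper)."""
    combined = train.index.append([val.index, test.index])
    # No row label may appear in more than one split
    no_overlap = not combined.has_duplicates
    # Every row label of the full frame must appear in some split
    full_coverage = sorted(combined) == sorted(full.index)
    return no_overlap and full_coverage

# Toy data, not real Akebono values
toy = pd.DataFrame({'Altitude': range(100, 120)})

# Shuffle once, then slice: a stand-in for train_test_split
shuffled = toy.sample(frac=1, random_state=42)
test_df = shuffled.iloc[:4]
val_df = shuffled.iloc[4:8]
train_df = shuffled.iloc[8:]

print(check_splits(toy, train_df, val_df, test_df))
```

The same check can be run on the real `train_df`, `val_df`, and `test_df`, since the filtered frame keeps the unique row labels from the original file.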
