This notebook filters the Akebono dataset to remove incorrect or noisy rows that make model training more difficult.

V2 has the following filters applied:
- Keep only the following columns:
    - 'Altitude', 'GCLAT', 'GCLON', 'ILAT', 'GLAT', 'GMLT', 'XXLAT', 'XXLON', 'Te1', 'Ne1', 'Pv1', 'I1', 'DateFormatted'
- Remove all rows where both XXLAT and XXLON = 999
- Remove all rows where altitude = 2.0
- ILAT should lie between 0 and 90. Values above 90 are clipped to 90.
- Date should be on or after 1990-01-01
    - Earlier data was taken whilst the instrument was still calibrating.
- Altitude should be between 1000 km and 8000 km (inclusive)
    - The instrument wasn't built to record data outside this range.


V2A has the following filters applied:
- Keep only the following columns:
    - 'Altitude', 'GCLAT', 'GCLON', 'ILAT', 'GLAT', 'GMLT', 'XXLAT', 'XXLON', 'Te1', 'Ne1', 'Pv1', 'I1', 'DateFormatted'
- Remove all rows where both XXLAT and XXLON = 999
- Remove all rows where altitude = 2.0
- Remove all rows where ILAT > 90.
- Date should be on or after 1990-01-01
    - Earlier data was taken whilst the instrument was still calibrating.
- Altitude should be between 1000 km and 8000 km (inclusive)
    - The instrument wasn't built to record data outside this range.
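
The only difference between V2 and V2A is how ILAT values above 90 are handled. A minimal sketch of the two behaviours on a made-up frame (the values are illustrative and not taken from the Akebono data):

```python
import pandas as pd

# Toy ILAT values (hypothetical); two of them exceed 90.
toy = pd.DataFrame({"ILAT": [45.0, 89.5, 95.0, 120.0]})

# V2: clip into [0, 90]. The row count is preserved; out-of-range values become 90.
v2 = toy.assign(ILAT=toy["ILAT"].clip(0, 90))

# V2A: drop rows with ILAT > 90. The remaining values are untouched.
v2a = toy[toy["ILAT"] <= 90]

print(len(v2), v2["ILAT"].tolist())    # 4 [45.0, 89.5, 90.0, 90.0]
print(len(v2a), v2a["ILAT"].tolist())  # 2 [45.0, 89.5]
```

Clipping keeps every row but piles the out-of-range values up at exactly 90, whereas filtering leaves the ILAT distribution untouched at the cost of fewer rows.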

In [1]:
import pandas as pd

In [9]:
# Load the TSV file
df = pd.read_csv('../data/Akebono_combined.tsv', sep='\t')

In [10]:
# Keep only specific columns
columns_to_keep = ['Altitude', 'GCLAT', 'GCLON', 'ILAT', 'GLAT', 'GMLT', 'XXLAT', 'XXLON', 'Te1', 'Ne1', 'Pv1', 'I1', 'DateFormatted']
df = df[columns_to_keep].copy()  # explicit copy so the column assignment below does not raise SettingWithCopyWarning

# Convert DateFormatted to datetime
df['DateFormatted'] = pd.to_datetime(df['DateFormatted'])

# Remove rows where XXLAT and XXLON are 999
rows_before = len(df)
df = df[(df['XXLAT'] != 999) | (df['XXLON'] != 999)]
print(f"Rows removed where XXLAT and XXLON are 999: {rows_before - len(df)}")
print(f"XXLAT range after filtering: {df['XXLAT'].min()} - {df['XXLAT'].max()}")
print(f"XXLON range after filtering: {df['XXLON'].min()} - {df['XXLON'].max()}")
print()

# Remove rows where altitude is 2.0. Expect 0 removals here: these rows coincide with the XXLAT/XXLON = 999 rows already removed above.
rows_before = len(df)
df = df[df['Altitude'] != 2.0]
print(f"Rows removed where altitude is 2.0: {rows_before - len(df)}")
print(f"Altitude range after filtering: {df['Altitude'].min()} - {df['Altitude'].max()} km")
print()

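# Version 2: Clip ILAT into [0, 90], so values > 90 become 90
ilat_affected = (df['ILAT'] > 90).sum()
df = df.assign(ILAT=df['ILAT'].clip(0, 90))  # assign returns a new frame, avoiding SettingWithCopyWarning on the filtered slice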
print(f"ILAT values > 90 squashed to 90: {ilat_affected} rows affected")
print(f"ILAT range after squashing: {df['ILAT'].min()} - {df['ILAT'].max()}")
print()

# # Version 2A: Remove rows where ILAT > 90
# rows_before = len(df)
# df = df[df['ILAT'] <= 90]
# print(f"Rows removed where ILAT > 90: {rows_before - len(df)}")
# print(f"ILAT range after filtering: {df['ILAT'].min()} - {df['ILAT'].max()}")
# print()

# Remove rows where DateFormatted is before 1990-01-01
rows_before = len(df)
df = df[df['DateFormatted'].dt.year >= 1990]
print(f"Rows removed where DateFormatted is before 1990-01-01: {rows_before - len(df)}")
print(f"DateFormatted range after filtering: {df['DateFormatted'].min()} - {df['DateFormatted'].max()}")
print()

# Remove rows where Altitude is less than 1000km or greater than 8000km
altitude_affected = (df['Altitude'] < 1000) | (df['Altitude'] > 8000)
df = df[~altitude_affected]
print(f"Rows removed where Altitude is less than 1000km or greater than 8000km: {altitude_affected.sum()}")
print(f"Altitude range after filtering: {df['Altitude'].min()} - {df['Altitude'].max()} km")
print()

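# Print the total number of rows removed.
# The original frame was overwritten above, so re-read the file; loading only the first column is enough to count rows.
original_rows = len(pd.read_csv('../data/Akebono_combined.tsv', sep='\t', usecols=[0]))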
print(f"Rows in original dataset: {original_rows}")
print(f"Rows in filtered dataset: {len(df)}")
print(f"Total rows removed: {original_rows - len(df)}")


Rows removed where XXLAT and XXLON are 999: 133659
XXLAT range after filtering: -89.78 - 86.64
XXLON range after filtering: 0.0 - 24.0

Rows removed where altitude is 2.0: 0
Altitude range after filtering: 225.0 - 10475.0 km

Rows removed where ILAT > 90: 22891
ILAT range after filtering: 19.91 - 85.0

Rows removed where DateFormatted is before 1990-01-01: 144109
DateFormatted range after filtering: 1990-01-08 00:00:00 - 2001-12-30 00:00:00

Rows removed where Altitude is less than 1000km or greater than 8000km: 844752
Altitude range after filtering: 1000.0 - 8000.0 km

Rows in original dataset: 4405241
Rows in filtered dataset: 3259830
Total rows removed: 1145411
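
The XXLAT/XXLON filter in the cell above keeps a row if either column differs from 999, so only rows where both columns hold the 999 sentinel are dropped. A small sketch with made-up values, showing the equivalent "not both" form and, for contrast, a stricter variant that this notebook does not use:

```python
import pandas as pd

# Hypothetical rows covering all four sentinel combinations.
toy = pd.DataFrame({
    "XXLAT": [999.0, 999.0, 10.0, -45.0],
    "XXLON": [999.0, 12.0, 999.0, 3.0],
})

# Form used in the notebook: keep rows where at least one column is not 999.
keep_or = (toy["XXLAT"] != 999) | (toy["XXLON"] != 999)

# Equivalent form (De Morgan): drop rows where both columns are 999.
keep_not_both = ~((toy["XXLAT"] == 999) & (toy["XXLON"] == 999))

print(keep_or.equals(keep_not_both))   # True
print(toy[keep_or].index.tolist())     # [1, 2, 3]

# Stricter alternative (NOT what the notebook does): drop a row if either column is 999.
strict = toy[(toy["XXLAT"] != 999) & (toy["XXLON"] != 999)]
print(strict.index.tolist())           # [3]
```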


In [11]:
# Save the filtered dataset to a new TSV file
df.to_csv('../data/Akebono_v2.tsv', sep='\t', index=False)
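
`to_csv` writes `DateFormatted` as text, so a plain `read_csv` of the saved file will not return a datetime column. A minimal sketch of the round trip, using an in-memory buffer in place of the real file (the two rows are made up):

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the filtered dataset.
toy = pd.DataFrame({
    "Altitude": [1500.0, 7200.0],
    "DateFormatted": pd.to_datetime(["1990-01-08", "2001-12-30"]),
})

buf = io.StringIO()
toy.to_csv(buf, sep="\t", index=False)

# Without parse_dates the column comes back as strings.
buf.seek(0)
plain = pd.read_csv(buf, sep="\t")
print(pd.api.types.is_datetime64_any_dtype(plain["DateFormatted"]))   # False

# parse_dates restores the datetime dtype.
buf.seek(0)
parsed = pd.read_csv(buf, sep="\t", parse_dates=["DateFormatted"])
print(pd.api.types.is_datetime64_any_dtype(parsed["DateFormatted"]))  # True
```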


In [12]:
# Split the filtered data into train, validation, and test sets
from sklearn.model_selection import train_test_split

# Calculate the sizes for each split
total_rows = len(df)
test_size = 100000
val_size = 100000
train_size = total_rows - test_size - val_size

# Create the splits
train_df, temp_df = train_test_split(df, test_size=test_size+val_size, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=test_size, random_state=42)

# Print the sizes of each split
print(f"Train set size: {len(train_df)}")
print(f"Validation set size: {len(val_df)}")
print(f"Test set size: {len(test_df)}")

# Save each split to a separate TSV file
train_df.to_csv('../data/Akebono_v2_train.tsv', sep='\t', index=False)
val_df.to_csv('../data/Akebono_v2_val.tsv', sep='\t', index=False)
test_df.to_csv('../data/Akebono_v2_test.tsv', sep='\t', index=False)



Train set size: 3059830
Validation set size: 100000
Test set size: 100000
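
Passing an integer as `test_size` tells `train_test_split` to hold out that many rows rather than a fraction. A scaled-down sketch of the same two-stage split (the 1,000-row frame and the 100-row split sizes are made up for illustration), with a check that the three splits do not overlap:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the filtered dataset.
toy = pd.DataFrame({"x": range(1000)})
test_size = 100
val_size = 100

# Stage 1: hold out test + validation rows together.
train_df, temp_df = train_test_split(toy, test_size=test_size + val_size, random_state=42)
# Stage 2: split the held-out rows into validation and test.
val_df, test_df = train_test_split(temp_df, test_size=test_size, random_state=42)

print(len(train_df), len(val_df), len(test_df))  # 800 100 100

# The three splits should partition the original index with no overlap.
all_idx = train_df.index.union(val_df.index).union(test_df.index)
print(len(all_idx) == len(toy))                           # True
print(train_df.index.intersection(test_df.index).empty)   # True
print(train_df.index.intersection(val_df.index).empty)    # True
print(val_df.index.intersection(test_df.index).empty)     # True
```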
