# Feature Engineering

Convert the data into more ML-friendly formats. The conversions are reversible, so model output can later be converted back to TLE-style format.

This conversion needs to be performed on all datasets.

Features:

| Column | Description / action needed | Effect on SGP4 |
| :------------- | :------| :----|
| `NORAD_CAT_ID` | Satellite identifier, not used in training, no action needed | n/a |
| `OBJECT_TYPE` | Satellite metadata, not used in training, no action needed (only in `full` version) | n/a |
| `TLE_LINE1` | Actual TLE line 1, not used in training, no action needed (only in `full` version) | n/a |
| `TLE_LINE2` | Actual TLE line 2, not used in training, no action needed (only in `full` version) | n/a |
| `MEAN_MOTION_DOT` | Some sort of scaling may be needed | NOT used in SGP4 propagation |
| `MEAN_MOTION_DDOT` | Some sort of scaling may be needed | NOT used in SGP4 propagation |
| `BSTAR` | Some sort of scaling may be needed | Affects `v` component based on `r` (assuming higher bstar = higher drag = more decay) |
| `INCLINATION` | Convert cyclic 0 .. 180 | Defines path of possible `r` values (0 = equator, 90 = polar orbit) |
| `RA_OF_ASC_NODE` | Convert cyclic 0 .. 360 | Defines path of possible `r` values (roughly, rotates the orbit as viewed from the poles) |
| `ECCENTRICITY` | Some scaling needed; range 0 .. 0.25 | Defines path of possible `r` values (0 = circular orbit) |
| `ARG_OF_PERICENTER` | Convert cyclic 0 .. 360 | Defines path of possible `r` values (0 = closest approach at the ascending node, i.e. when crossing the reference plane from south to north) |
| `MEAN_ANOMALY` | Convert cyclic 0 .. 360; this loops multiple times per day and most cycles are unobserved in the data | Defines which `r` position is used |
| `MEAN_MOTION` | Range: > 11.25 | Defines path of possible `r` values (smaller = longer orbit) |
| `REV_AT_EPOCH` | 0-99999, but the data is sometimes inconsistent, with an offset that may come from different ground stations (a guess) | NOT used in SGP4 propagation |
| `EPOCH` | Time; no scaling is needed, but it is used for constructing `X` and `y` | Time and time offset used for propagation |
| `GP_ID` | Unique identifier for the TLE entry, not used in training, no action needed | n/a |

`MEAN_ANOMALY` is expressed in degrees, but the data is sparse enough that many full cycles go unobserved between entries. Combining `REV_AT_EPOCH` with `MEAN_ANOMALY` may therefore represent this feature better than a sin/cos encoding. The other conversions can be done without grouping, but because `REV_AT_EPOCH` rolls over at 100k and is inconsistent between ground stations, it may need to be handled per satellite.
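
As a minimal sketch of the non-cyclic conversion described above, the snippet below unwraps a slowly drifting angle by counting wrap-arounds between consecutive samples and shows that taking the value modulo 360 recovers the original. The helper name `unwrap_degrees` and the sample values are hypothetical and exist only for illustration. The approach assumes consecutive samples differ by less than half a period, which is why it can suit `RA_OF_ASC_NODE` and `ARG_OF_PERICENTER` but not `MEAN_ANOMALY` on its own.

```python
import numpy as np
import pandas as pd


def unwrap_degrees(angles: pd.Series, period: float = 360.0) -> pd.Series:
    """Remove wrap-around jumps from a cyclic series that is sorted by time.

    Hypothetical helper for illustration; assumes the true change between
    consecutive samples is smaller than half a period.
    """
    # A jump of roughly -period means one forward wrap (e.g. 355 -> 2),
    # a jump of roughly +period means one backward wrap (e.g. 8 -> 359).
    wraps = np.around(angles.diff().fillna(0) / -period)
    return angles + wraps.cumsum() * period


# Made-up right-ascension values that cross the 0/360 boundary twice
raan = pd.Series([350.0, 355.0, 2.0, 8.0, 359.0, 353.0])
unwrapped = unwrap_degrees(raan)

# Reverting is a plain modulo, which is what makes the conversion reversible
reverted = unwrapped % 360
print(unwrapped.tolist())
print(reverted.tolist())
```

Differences between unwrapped values are then ordinary signed changes in degrees, with no artificial jump of about 360 at the boundary.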



Datasets:

```
2_min/train.pkl
0_min/test.pkl
0_min/secret_test.pkl
```

Only the `min` versions are converted for now to save memory and disk space. They can be replaced with `full` if needed.

In [1]:
import pandas as pd
import numpy as np
import os

from tqdm.notebook import tqdm
tqdm.pandas()

import matplotlib.pyplot as plt

import sys
sys.path.append('../models/model_0')

from clean_data import __jday_convert

In [2]:
version = "min" # "min" or "full" data

In [3]:
input_files = [
    (2, "train.pkl"),
    (0, "test.pkl"),
    (0, "secret_test.pkl")
]

for n,f in input_files:
    print(f"{os.environ['GP_HIST_PATH']}/../{n}_{version}/{f}")

train_df = pd.read_pickle(f"{os.environ['GP_HIST_PATH']}/../2_{version}/train.pkl")
# test_df = pd.read_pickle(f"{os.environ['GP_HIST_PATH']}/../0_{version}/test.pkl")
# secret_test_df = pd.read_pickle(f"{os.environ['GP_HIST_PATH']}/../0_{version}/secret_test.pkl")

/mistorage/mads/data/gp_history/../2_min/train.pkl
/mistorage/mads/data/gp_history/../0_min/test.pkl
/mistorage/mads/data/gp_history/../0_min/secret_test.pkl


In [4]:
train_df = pd.read_pickle(f"{os.environ['GP_HIST_PATH']}/../2_{version}/train.pkl")

In [5]:
def convert_feature_values(df):
    name = df.name
    df = df.set_index("EPOCH").sort_index()
    # convert ARG_OF_PERICENTER and RA_OF_ASC_NODE to non-cyclic versions (MEAN_ANOMALY is handled further down)
    df["ARG_OF_PERICENTER_ADJUSTED"] = np.cumsum(np.around(df.ARG_OF_PERICENTER.diff().fillna(0) / -360))*360 + df.ARG_OF_PERICENTER
    df["RA_OF_ASC_NODE_ADJUSTED"] = np.cumsum(np.around(df.RA_OF_ASC_NODE.diff().fillna(0) / -360))*360 + df.RA_OF_ASC_NODE
    
    # When REV_AT_EPOCH reaches 100,000 it is recorded as 10,000 instead of 0, so reset those entries to 0.
    # This does not handle multiple ground stations reporting a different previous value.
    # Would it be safer to drop these rows as outliers instead?
    # The diff window is -90,000 +- 20, the max offset based on the MEAN_MOTION maximum from earlier steps.
    df.loc[(df.REV_AT_EPOCH==10000) & df.REV_AT_EPOCH.diff().between(-90020,-89980),'REV_AT_EPOCH'] = 0

    # combine REV_AT_EPOCH and MEAN_ANOMALY for a non-cyclic representation
    adjusted_rev = df.REV_AT_EPOCH + np.cumsum(np.around(df.REV_AT_EPOCH.diff().fillna(0) / -100000)) * 100000
    df["REV_MEAN_ANOMALY_COMBINED"] = adjusted_rev * 360 + df.MEAN_ANOMALY
    
    # Handle the REV_AT_EPOCH inconsistency problem by splitting the data into
    # subgroups wherever the revolution count jumps by roughly 2000 or more;
    # otherwise the REV_MEAN_ANOMALY_COMBINED difference may be incorrect.
    # bfill because the first diff is NaN and the first row may not start at zero after earlier data removal
    a = np.round(adjusted_rev.diff().bfill() / 2000)  # .bfill() replaces the deprecated fillna(method='bfill')
    df["SUBGROUP"] = np.cumsum(a).astype(int)
    return df

# Leaving here for reference only, not actually used anymore
def revert_feature_values(df):
    df['REV_AT_EPOCH'] = ((df.REV_MEAN_ANOMALY_COMBINED // 360) % 100000).astype(int)
    df['MEAN_ANOMALY'] = df.REV_MEAN_ANOMALY_COMBINED % 360
    df['RA_OF_ASC_NODE'] = df.RA_OF_ASC_NODE_ADJUSTED % 360
    df['ARG_OF_PERICENTER'] = df.ARG_OF_PERICENTER_ADJUSTED % 360
    return df
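
The rollover handling and the revert step can be checked on a toy example. This is a sketch on made-up values, not the notebook's data; it repeats the same arithmetic as `convert_feature_values` and `revert_feature_values` for a satellite whose revolution counter rolls over from 99,998 to 5.

```python
import numpy as np
import pandas as pd

# Made-up entries for one satellite, already sorted by EPOCH
toy = pd.DataFrame({
    "REV_AT_EPOCH": [99990, 99998, 5, 17],
    "MEAN_ANOMALY": [10.0, 200.0, 45.0, 300.0],
})

# A drop of roughly 100,000 between consecutive entries counts as one rollover
rollovers = np.around(toy.REV_AT_EPOCH.diff().fillna(0) / -100000).cumsum()
adjusted_rev = toy.REV_AT_EPOCH + rollovers * 100000

# Whole revolutions expressed in degrees plus the position within the current revolution
combined = adjusted_rev * 360 + toy.MEAN_ANOMALY

# Reverting splits the combined value back into its two TLE fields
rev_back = ((combined // 360) % 100000).astype(int)
mean_anomaly_back = combined % 360

print(combined.tolist())
print(rev_back.tolist(), mean_anomaly_back.tolist())
```

The combined value keeps increasing across the rollover, so differences between two entries reflect the total angle travelled rather than a wrapped remainder.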

In [6]:
converted_df = train_df.groupby(by="NORAD_CAT_ID", as_index=False).progress_apply(convert_feature_values).reset_index()

  0%|          | 0/12298 [00:00<?, ?it/s]

In [8]:
def generate_X_y(df):
    idx = df.name
    
    df = df[['NORAD_CAT_ID', 'EPOCH', 'BSTAR', 'INCLINATION', 'RA_OF_ASC_NODE', 'ECCENTRICITY', 'ARG_OF_PERICENTER',
             'MEAN_ANOMALY', 'MEAN_MOTION', 'REV_AT_EPOCH',
             'ARG_OF_PERICENTER_ADJUSTED', 'RA_OF_ASC_NODE_ADJUSTED', 'REV_MEAN_ANOMALY_COMBINED',
             'GP_ID']]
    df = df.drop_duplicates(subset=['EPOCH']).sort_values("EPOCH")
    dfs = []
    for i in range(1,11):
        dfi = pd.concat([df,df.shift(-i).add_suffix("_b")], axis=1).dropna()
        dfs.append(dfi)
    ddf = pd.concat(dfs)

    # Reference variables only, DO NOT USE TO TRAIN
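    # .copy() so the column assignments below do not write to a slice of ddf (avoids SettingWithCopyWarning)
    df = ddf[['NORAD_CAT_ID','GP_ID','GP_ID_b','EPOCH','EPOCH_b']].copy()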
    df.columns = ['__NORAD_CAT_ID','__GP_ID_1','__GP_ID_2','__EPOCH_1','__EPOCH_2']
    df['__GP_ID_2'] = df['__GP_ID_2'].astype(int)
    
    # X
    x_cols = ['BSTAR', 'INCLINATION', 'RA_OF_ASC_NODE', 'ECCENTRICITY', 'ARG_OF_PERICENTER',
              'MEAN_ANOMALY', 'MEAN_MOTION', 'REV_AT_EPOCH']
    df[['X_EPOCH_JD', 'X_EPOCH_FR']] = ddf.EPOCH.apply(__jday_convert).to_list()
    df[['X_'+x for x in x_cols]] = ddf[x_cols]
    df['X_delta_EPOCH'] = (ddf.EPOCH_b - ddf.EPOCH) / pd.Timedelta(days=1) # in days
    # y
    df['y_delta_INCLINATION'] = ddf.INCLINATION_b - ddf.INCLINATION
    df['y_delta_ECCENTRICITY'] = ddf.ECCENTRICITY_b - ddf.ECCENTRICITY
    df['y_delta_MEAN_MOTION'] = ddf.MEAN_MOTION_b - ddf.MEAN_MOTION
    df['y_delta_ARG_OF_PERICENTER'] = ddf.ARG_OF_PERICENTER_ADJUSTED_b - ddf.ARG_OF_PERICENTER_ADJUSTED
    df['y_delta_RA_OF_ASC_NODE'] = ddf.RA_OF_ASC_NODE_ADJUSTED_b - ddf.RA_OF_ASC_NODE_ADJUSTED
    df['y_delta_REV_MEAN_ANOMALY_COMBINED'] = ddf.REV_MEAN_ANOMALY_COMBINED_b - ddf.REV_MEAN_ANOMALY_COMBINED
    
    # keep only pairs that are more than 0.25 and less than 5 days apart (not sure this limit makes sense)
    df = df[(df['X_delta_EPOCH'] < 5) & (df['X_delta_EPOCH'] > 0.25)]
    return df
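
The pairing step in `generate_X_y` can be illustrated on a toy frame. The sketch below uses made-up epochs and inclinations and a hypothetical `max_offset` of 2 (the notebook uses offsets 1 to 10); each row is paired with the row `i` positions later, and the same time-gap filter is applied.

```python
import pandas as pd

# Made-up entries for one satellite, sorted by EPOCH
toy = pd.DataFrame({
    "EPOCH": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-04", "2021-01-09"]),
    "INCLINATION": [51.60, 51.61, 51.63, 51.70],
})

max_offset = 2  # hypothetical; the notebook pairs each row with the next 10
pairs = []
for i in range(1, max_offset + 1):
    # Row k is paired with row k + i; the trailing rows have no partner and are dropped
    later = toy.shift(-i).add_suffix("_b")
    pairs.append(pd.concat([toy, later], axis=1).dropna())
paired = pd.concat(pairs, ignore_index=True)

# Input feature: time gap in days. Target: change in the element over that gap.
paired["delta_days"] = (paired.EPOCH_b - paired.EPOCH) / pd.Timedelta(days=1)
paired["delta_INCLINATION"] = paired.INCLINATION_b - paired.INCLINATION

# Same gap filter as the notebook: strictly between 0.25 and 5 days
kept = paired[(paired.delta_days > 0.25) & (paired.delta_days < 5)]
print(kept[["EPOCH", "EPOCH_b", "delta_days", "delta_INCLINATION"]])
```

Four entries yield five candidate pairs here, and three survive the filter, which shows how the row count can grow well beyond the number of TLE entries when ten offsets are used.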

In [15]:
# %%time

# sample_df = converted_df[converted_df.INCLINATION.between(65,67)]
# random_ids = np.append(np.random.choice(sample_df.NORAD_CAT_ID.unique(), 16), [10615, 10417, 36954, 21723])
# sample_df = sample_df[sample_df.NORAD_CAT_ID.isin(random_ids)]
# processed_sample_df = sample_df.groupby(["NORAD_CAT_ID","SUBGROUP"], as_index=False).progress_apply(generate_X_y).reset_index(drop=True)

# # save sample
# processed_sample_df.to_pickle(f"{os.environ['GP_HIST_PATH']}/../3_min/sample_train.pkl")

  0%|          | 0/37 [00:00<?, ?it/s]

CPU times: user 9.16 s, sys: 422 ms, total: 9.58 s
Wall time: 9.97 s


In [16]:
processed_df = converted_df.groupby(["NORAD_CAT_ID","SUBGROUP"], as_index=False).progress_apply(generate_X_y).reset_index(drop=True)

  0%|          | 0/40240 [00:00<?, ?it/s]

In [17]:
processed_df

Unnamed: 0,__NORAD_CAT_ID,__GP_ID_1,__GP_ID_2,__EPOCH_1,__EPOCH_2,X_EPOCH_JD,X_EPOCH_FR,X_BSTAR,X_INCLINATION,X_RA_OF_ASC_NODE,...,X_MEAN_ANOMALY,X_MEAN_MOTION,X_REV_AT_EPOCH,X_delta_EPOCH,y_delta_INCLINATION,y_delta_ECCENTRICITY,y_delta_MEAN_MOTION,y_delta_ARG_OF_PERICENTER,y_delta_RA_OF_ASC_NODE,y_delta_REV_MEAN_ANOMALY_COMBINED
0,51,44157681,44157682,1990-01-01 20:45:14.021568,1990-01-02 22:21:09.744192,2447892.5,0.864746,0.000000,47.2306,195.6283,...,147.5683,12.179665,30802,1.066617,0.0003,-2.700000e-06,2.570000e-06,3.0466,-3.2978,4676.9018
1,51,44157682,44157683,1990-01-02 22:21:09.744192,1990-01-03 21:58:56.583552,2447893.5,0.931363,0.000000,47.2309,192.3305,...,144.4701,12.179668,30815,0.984570,0.0000,4.900000e-06,1.420000e-06,2.8523,-3.0445,4317.1000
2,51,44157683,44157684,1990-01-03 21:58:56.583552,1990-01-04 21:36:43.503264,2447894.5,0.915933,0.000000,47.2309,189.2860,...,141.5701,12.179669,30827,0.984571,0.0000,5.000000e-07,6.800000e-07,2.9042,-3.0439,4317.0492
3,51,44157684,44157685,1990-01-04 21:36:43.503264,1990-01-07 22:28:13.305503,2447895.5,0.900504,0.000000,47.2309,186.2421,...,138.6193,12.179670,30839,3.035762,0.0002,-3.260000e-05,5.040000e-06,9.1717,-9.3911,13310.7026
4,51,44157685,44157686,1990-01-07 22:28:13.305503,1990-01-09 19:45:38.411999,2447898.5,0.936265,0.000000,47.2311,176.8510,...,129.3219,12.179675,30876,1.887096,0.0004,-1.400000e-06,2.820000e-06,5.5625,-5.8331,8274.3654
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
271995012,47853,175617569,175826201,2021-04-16 20:28:14.372832,2021-04-20 12:42:44.088480,2459320.5,0.852944,0.000036,51.6436,286.0303,...,145.1726,15.492509,571,3.676733,-0.0006,-9.700000e-06,2.789300e-04,16.3577,-18.1942,20503.6372
271995013,47853,175648770,175854603,2021-04-16 23:34:00.778944,2021-04-20 20:27:09.611712,2459320.5,0.981953,0.000036,51.6436,285.3920,...,144.6344,15.492516,573,3.870241,-0.0006,-1.150000e-05,2.939300e-04,17.2533,-19.1517,21582.7416
271995014,47853,175676928,175887975,2021-04-17 21:14:25.340064,2021-04-21 11:56:00.521664,2459321.5,0.885016,0.000048,51.6436,280.9231,...,140.7709,15.492588,587,3.612213,-0.0004,-1.040000e-05,2.566500e-04,16.3184,-17.8750,20143.6772
271995015,47853,175681852,175892372,2021-04-17 22:47:18.516768,2021-04-21 16:34:39.772704,2459321.5,0.949520,0.000048,51.6436,280.6039,...,140.4380,15.492591,588,3.741218,-0.0004,-1.070000e-05,2.646300e-04,17.0130,-18.5133,20862.9825


In [18]:
%%time
processed_df.to_pickle(f"{os.environ['GP_HIST_PATH']}/../3_min/train.pkl")

CPU times: user 3.19 s, sys: 33.4 s, total: 36.6 s
Wall time: 13min 10s
