# Feature Engineering

Convert the data into more ML-friendly formats. Every conversion must be reversible so that model output can later be converted back to TLE-style format.

This conversion needs to be performed on all datasets.

Features:

| Column        | Description / action | Effect on SGP4 |
| :------------- | :------| :----|
| `NORAD_CAT_ID` | Satellite identifier, not used in training, no action needed | |
| `OBJECT_TYPE` | Satellite metadata, not used in training, no action needed (only in `full` version) | |
| `TLE_LINE1` | Actual TLE line 1, not used in training, no action needed (only in `full` version) | |
| `TLE_LINE2` | Actual TLE line 2, not used in training, no action needed (only in `full` version) | |
| `MEAN_MOTION_DOT` | Some sort of scaling may be needed | NOT used in SGP4 propagation |
| `MEAN_MOTION_DDOT` | Some sort of scaling may be needed | NOT used in SGP4 propagation |
| `BSTAR` | Some sort of scaling may be needed | Affects `v` component based on `r` (assuming higher bstar = higher drag = more decay) |
| `INCLINATION` | Convert cyclic 0 .. 180 | Defines path of possible `r` values (0 = equator, 90 = polar orbit) |
| `RA_OF_ASC_NODE` | Convert cyclic 0 .. 360 | Defines path of possible `r` values (kind of like rotating the orbit viewed from the poles?) |
| `ECCENTRICITY` | Some scaling needed, 0 .. 0.25 | Defines path of possible `r` values (0 = circular orbit) |
| `ARG_OF_PERICENTER` | Convert cyclic 0 .. 360 | Defines path of possible `r` values (0 means closest when crossing north-south reference plane) |
| `MEAN_ANOMALY` | Convert cyclic 0 .. 360; this loops multiple times per day and most cycles are unobserved in the data | Defines which `r` position is used |
| `MEAN_MOTION` | Values are > 11.25 (revolutions per day) | Defines path of possible `r` values (smaller = longer orbital period) |
| `REV_AT_EPOCH` | 0-99999, but the data is sometimes inconsistent, with an offset that may come from different ground stations (a guess) | NOT used in SGP4 propagation |
| `EPOCH` | Time; no scaling is needed, but we will need this for constructing `X` and `y` | Time and time offset used for propagation |
| `GP_ID` | Unique identifier for the TLE entry, not used in training, no action needed | |

While `MEAN_ANOMALY` is represented in degrees, the data is sparse enough that many cycles go unobserved, so a combination of `REV_AT_EPOCH` + `MEAN_ANOMALY` may represent the feature better than a sin/cos representation. The other conversions can be done without grouping, but because `REV_AT_EPOCH` rolls over at 100k and is inconsistent between ground stations, it may need to be handled per satellite.
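
One possible way to handle the rollover per satellite is sketched below. The function name `unwrap_rev_at_epoch`, the constant `REV_MODULUS`, and the "drop of more than half the range" rule are all assumptions made for illustration, not something taken from the dataset documentation. The sketch only addresses the 100k rollover; it does nothing about offsets between ground stations.

```python
import pandas as pd

REV_MODULUS = 100_000  # REV_AT_EPOCH runs 0..99999, then wraps (assumed)

def unwrap_rev_at_epoch(rev: pd.Series) -> pd.Series:
    """Return a non-wrapping revolution count for ONE satellite.

    Expects the rows to be sorted by EPOCH already.
    Assumed heuristic: a drop of more than half the modulus between
    consecutive rows is treated as a rollover, not as a data glitch.
    """
    step = rev.diff().fillna(0)
    rollovers = (step < -REV_MODULUS / 2).cumsum()  # running count of wraps
    return rev + rollovers * REV_MODULUS

# Toy series that wraps between the second and third observation.
toy = pd.Series([99_997, 99_999, 1, 4])
unwrapped = unwrap_rev_at_epoch(toy)
print(unwrapped.tolist())
```

The unwrapped count could then replace `REV_AT_EPOCH` when building the combined feature, for example `MEAN_ANOMALY + unwrapped * 360`.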


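For comparison, the sin/cos representation mentioned above can be sketched as follows. It is reversible for the angle itself, but it discards how many full revolutions have passed, which is the reason a revolution-count-based feature may suit `MEAN_ANOMALY` better. The helper names `encode_angle` and `decode_angle` are hypothetical.

```python
import numpy as np

def encode_angle(deg):
    """Map angles in degrees to a (sin, cos) pair on the unit circle."""
    rad = np.radians(deg)
    return np.sin(rad), np.cos(rad)

def decode_angle(sin_v, cos_v):
    """Invert encode_angle, returning degrees in the range [0, 360)."""
    return np.mod(np.degrees(np.arctan2(sin_v, cos_v)), 360.0)

angles = np.array([0.0, 90.0, 270.0, 359.5])
sin_v, cos_v = encode_angle(angles)
decoded = decode_angle(sin_v, cos_v)

# The round trip recovers the angles up to floating-point error.
print(np.allclose(decoded, angles))
```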

Datasets:

```
2_min/train.pkl
0_min/test.pkl
0_min/secret_test.pkl
```

Converting `min` versions only for now to save some memory and disk space.  Can be replaced with `full` if needed.

In [1]:
import pandas as pd
import numpy as np
import os

from tqdm.notebook import tqdm
tqdm.pandas()

import matplotlib.pyplot as plt


In [2]:
version = "min"  # "min" or "full" data

In [3]:
input_files = [
    (2, "train.pkl"),
    (0, "test.pkl"),
    (0, "secret_test.pkl")
]

for n,f in input_files:
    print(f"{os.environ['GP_HIST_PATH']}/../{n}_{version}/{f}")

train_df = pd.read_pickle(f"{os.environ['GP_HIST_PATH']}/../2_{version}/train.pkl")
# test_df = pd.read_pickle(f"{os.environ['GP_HIST_PATH']}/../0_{version}/test.pkl")
# secret_test_df = pd.read_pickle(f"{os.environ['GP_HIST_PATH']}/../0_{version}/secret_test.pkl")

/mistorage/mads/data/gp_history/../2_min/train.pkl
/mistorage/mads/data/gp_history/../0_min/test.pkl
/mistorage/mads/data/gp_history/../0_min/secret_test.pkl


In [12]:
train_df = pd.read_pickle(f"{os.environ['GP_HIST_PATH']}/../2_{version}/train.pkl")

In [13]:
def convert_feature_values(df):
    """Convert one satellite's rows to non-cyclic features (apply per NORAD_CAT_ID)."""
    df = df.set_index("EPOCH").sort_index()
    # Unwrap ARG_OF_PERICENTER and RA_OF_ASC_NODE: whenever consecutive values
    # jump by roughly a full turn, shift by 360 so the series stays continuous.
    df["ARG_OF_PERICENTER_ADJUSTED"] = np.cumsum(np.around(df.ARG_OF_PERICENTER.diff().fillna(0) / -360))*360 + df.ARG_OF_PERICENTER
    df["RA_OF_ASC_NODE_ADJUSTED"] = np.cumsum(np.around(df.RA_OF_ASC_NODE.diff().fillna(0) / -360))*360 + df.RA_OF_ASC_NODE
    # MEAN_ANOMALY cycles too often between observations to unwrap from diffs,
    # so combine it with the revolution counter instead.
    # NOTE: this does not handle REV_AT_EPOCH rolling over at 100k.
    df["MEAN_ANOMALY_ADJUSTED"] = df.MEAN_ANOMALY + df.REV_AT_EPOCH*360

    remove_cols = [
        'ARG_OF_PERICENTER',
        'RA_OF_ASC_NODE',
        'MEAN_ANOMALY',
        'REV_AT_EPOCH',
    ]
    return df.drop(columns=remove_cols)

# FIXME: values seem to be correct but COLUMNS ARE OUT OF ORDER
def revert_feature_values(df):
    """Inverse of convert_feature_values: recover the cyclic TLE-style columns."""
    df = df.copy()  # work on a copy so the caller's frame is not modified
    df['REV_AT_EPOCH'] = (df.MEAN_ANOMALY_ADJUSTED // 360).astype(int)
    df['MEAN_ANOMALY'] = df.MEAN_ANOMALY_ADJUSTED % 360
    df['RA_OF_ASC_NODE'] = df.RA_OF_ASC_NODE_ADJUSTED % 360
    df['ARG_OF_PERICENTER'] = df.ARG_OF_PERICENTER_ADJUSTED % 360

    remove_cols = [
        'ARG_OF_PERICENTER_ADJUSTED',
        'RA_OF_ASC_NODE_ADJUSTED',
        'MEAN_ANOMALY_ADJUSTED',
    ]
    return df.drop(columns=remove_cols)
# converted = convert_feature_values(sample_df)
# display(converted)
# reverted = revert_feature_values(converted)
# display(reverted)
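
To make the `cumsum`/`around` expression in `convert_feature_values` easier to follow, here is a small check on a toy series. It evaluates the same expression on hand-made angles and compares the result with `np.unwrap`. The toy values are invented for illustration, and the `period` argument assumes NumPy 1.21 or newer.

```python
import numpy as np
import pandas as pd

# Toy angle series in degrees that wraps 355 -> 5 and later 10 -> 350.
angles = pd.Series([350.0, 355.0, 5.0, 10.0, 350.0])

# Same expression as in convert_feature_values: each jump of about a full
# turn adds or removes one whole turn from the running offset.
turns = np.cumsum(np.around(angles.diff().fillna(0) / -360))
manual = turns * 360 + angles

# NumPy's built-in phase unwrapping, used here only as a cross-check.
reference = np.unwrap(angles.to_numpy(), period=360)

print(manual.tolist())
print(np.allclose(manual, reference))
```

Taking `manual % 360` returns the original angles, which is the property `revert_feature_values` relies on.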


In [14]:
converted_df = train_df.groupby(by="NORAD_CAT_ID", as_index=False).progress_apply(convert_feature_values)


In [15]:
train_df

| | NORAD_CAT_ID | MEAN_MOTION_DOT | MEAN_MOTION_DDOT | BSTAR | INCLINATION | RA_OF_ASC_NODE | ECCENTRICITY | ARG_OF_PERICENTER | MEAN_ANOMALY | MEAN_MOTION | REV_AT_EPOCH | EPOCH | GP_ID |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 18549 | 1.801000e-05 | 0.0 | 0.002592 | 62.2415 | 180.1561 | 0.070489 | 265.6761 | 86.2771 | 12.852684 | 58561 | 2004-04-27 14:18:48.216960 | 2 |
| 1 | 18727 | -2.000000e-08 | 0.0 | 0.000100 | 73.3600 | 345.6887 | 0.008815 | 270.3999 | 88.6911 | 12.642166 | 75486 | 2004-04-27 15:59:40.727904 | 3 |
| 2 | 19027 | 1.280000e-05 | 0.0 | 0.001076 | 83.0239 | 250.9465 | 0.008493 | 184.3222 | 175.7249 | 13.856401 | 95359 | 2004-04-27 19:45:13.686048 | 5 |
| 3 | 19128 | 1.320000e-06 | 0.0 | 0.000166 | 70.9841 | 207.4830 | 0.020756 | 161.3777 | 199.5075 | 13.715209 | 79821 | 2004-04-27 15:43:11.393472 | 6 |
| 4 | 19242 | 2.280000e-06 | 0.0 | 0.000739 | 90.1460 | 192.1834 | 0.002746 | 300.4617 | 59.3655 | 12.992417 | 47996 | 2004-04-27 03:43:04.015775 | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 55239834 | 47040 | 5.020000e-06 | 0.0 | 0.000352 | 100.3388 | 129.4893 | 0.006708 | 69.8194 | 291.0167 | 14.029620 | 65304 | 2021-04-21 18:21:15.174144 | 175915050 |
| 55239835 | 47056 | 4.883000e-05 | 0.0 | 0.000726 | 74.0327 | 243.8364 | 0.003218 | 190.1928 | 182.7247 | 14.748841 | 48496 | 2021-04-21 15:15:24.708096 | 175915052 |
| 55239836 | 47107 | 6.130000e-06 | 0.0 | 0.002987 | 66.0638 | 230.2248 | 0.006675 | 67.9694 | 292.8394 | 12.786370 | 6566 | 2021-04-21 16:39:29.953152 | 175915058 |
| 55239837 | 47199 | 4.112000e-05 | 0.0 | 0.000541 | 97.8943 | 161.7037 | 0.000385 | 65.8106 | 294.3508 | 14.806983 | 2321 | 2021-04-21 13:56:56.002272 | 175915068 |


In [16]:
converted_df

| | EPOCH | NORAD_CAT_ID | MEAN_MOTION_DOT | MEAN_MOTION_DDOT | BSTAR | INCLINATION | ECCENTRICITY | MEAN_MOTION | GP_ID | ARG_OF_PERICENTER_ADJUSTED | RA_OF_ASC_NODE_ADJUSTED | MEAN_ANOMALY_ADJUSTED |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 1990-01-01 20:45:14.021568 | 51 | -2.300000e-07 | 0.0 | 0.000000 | 47.2306 | 0.010245 | 12.179665 | 44157681 | 211.8889 | 195.6283 | 1.108887e+07 |
| 0 | 1990-01-02 22:21:09.744192 | 51 | -2.400000e-07 | 0.0 | 0.000000 | 47.2309 | 0.010242 | 12.179668 | 44157682 | 214.9355 | 192.3305 | 1.109354e+07 |
| 0 | 1990-01-03 21:58:56.583552 | 51 | -2.400000e-07 | 0.0 | 0.000000 | 47.2309 | 0.010247 | 12.179669 | 44157683 | 217.7878 | 189.2860 | 1.109786e+07 |
| 0 | 1990-01-04 21:36:43.503264 | 51 | -2.300000e-07 | 0.0 | 0.000000 | 47.2309 | 0.010247 | 12.179670 | 44157684 | 220.6920 | 186.2421 | 1.110218e+07 |
| 0 | 1990-01-07 22:28:13.305503 | 51 | -2.400000e-07 | 0.0 | 0.000000 | 47.2311 | 0.010215 | 12.179675 | 44157685 | 229.8637 | 176.8510 | 1.111549e+07 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12297 | 2021-04-20 12:42:44.088480 | 47853 | 2.701000e-05 | 0.0 | 0.000057 | 51.6430 | 0.000240 | 15.492788 | 175826201 | 231.2676 | -92.1639 | 2.262088e+05 |
| 12297 | 2021-04-20 20:27:09.611712 | 47853 | 2.757000e-05 | 0.0 | 0.000058 | 51.6430 | 0.000238 | 15.492810 | 175854603 | 232.7012 | -93.7597 | 2.280074e+05 |
| 12297 | 2021-04-21 11:56:00.521664 | 47853 | 2.493000e-05 | 0.0 | 0.000053 | 51.6432 | 0.000237 | 15.492845 | 175887975 | 235.6284 | -96.9519 | 2.316044e+05 |
| 12297 | 2021-04-21 16:34:39.772704 | 47853 | 2.458000e-05 | 0.0 | 0.000052 | 51.6432 | 0.000236 | 15.492856 | 175892372 | 236.6558 | -97.9094 | 2.326834e+05 |

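The FIXME on `revert_feature_values` notes that the reverted columns come back out of order. One possible fix is to reselect the columns in the original layout after reverting, as sketched here. The helper name `restore_column_order` is hypothetical, the column list is copied from the `train_df` display in this notebook, and the toy frame is invented for illustration.

```python
import pandas as pd

# Column order copied from the `train_df` display in this notebook.
ORIGINAL_COLUMNS = [
    "NORAD_CAT_ID", "MEAN_MOTION_DOT", "MEAN_MOTION_DDOT", "BSTAR",
    "INCLINATION", "RA_OF_ASC_NODE", "ECCENTRICITY", "ARG_OF_PERICENTER",
    "MEAN_ANOMALY", "MEAN_MOTION", "REV_AT_EPOCH", "EPOCH", "GP_ID",
]

def restore_column_order(df, columns=ORIGINAL_COLUMNS):
    """Hypothetical helper: put reverted columns back in the original order."""
    if "EPOCH" not in df.columns and "EPOCH" in df.index.names:
        # convert_feature_values moved EPOCH into the index; bring it back.
        df = df.reset_index(level="EPOCH")
    # Selecting with an explicit list reorders the columns and raises a
    # KeyError if an expected column is missing.
    return df[list(columns)]

# Toy frame with the reverted columns deliberately in reverse order
# and EPOCH stored in the index.
toy = pd.DataFrame(
    {name: [0] for name in reversed(ORIGINAL_COLUMNS) if name != "EPOCH"},
    index=pd.DatetimeIndex(["2021-04-21"], name="EPOCH"),
)
restored = restore_column_order(toy)
print(list(restored.columns) == ORIGINAL_COLUMNS)
```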