# Feature Backfill Flight Data
* Get arrival data for Arlanda for each day since the specified backfill date.

./trino --server https://trino.opensky-network.org \
--user fredrikschultz \
--external-authentication \
--catalog minio --schema osky

See the link below for how to set up Trino, which is needed to get the historical data. After that, we can use the OpenSky API to get daily data. As the page explains, access to Trino requires approval: you have to submit an application form to OpenSky.

https://opensky-network.org/data/trino
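
For the daily updates mentioned above, here is a minimal sketch of how one day's arrivals could be requested from the OpenSky REST API. It assumes the `/flights/arrival` endpoint with `airport`, `begin`, and `end` query parameters, and the ICAO code `ESSA` for Arlanda. The helper names (`day_window`, `arrivals_url`, `count_landings`) are hypothetical, and authentication and rate limits are left out, so check the current OpenSky documentation before relying on it.

```python
import json
import urllib.parse
import urllib.request
from datetime import date, datetime, timedelta, timezone

# Assumed endpoint; verify against the current OpenSky API documentation.
ARRIVALS_URL = "https://opensky-network.org/api/flights/arrival"


def day_window(day: date) -> tuple[int, int]:
    """Return (begin, end) Unix timestamps covering one UTC calendar day."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    return int(start.timestamp()), int(end.timestamp())


def arrivals_url(day: date, airport: str = "ESSA") -> str:
    """Build the request URL for all arrivals at `airport` on `day`."""
    begin, end = day_window(day)
    query = urllib.parse.urlencode({"airport": airport, "begin": begin, "end": end})
    return f"{ARRIVALS_URL}?{query}"


def count_landings(day: date, airport: str = "ESSA") -> int:
    """Fetch one day's arrivals and count them (performs a network call)."""
    with urllib.request.urlopen(arrivals_url(day, airport), timeout=30) as resp:
        flights = json.load(resp)
    return len(flights)
```

Only `count_landings` touches the network; the two helpers are pure functions, which makes the day boundaries easy to check before any request is sent.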

In [1]:
import pandas as pd
import hopsworks 

In [2]:
csv_file = "data/arlanda_flights_2020_2025.csv"
df = pd.read_csv(csv_file)

# Parse the timestamps as timezone-aware UTC datetimes
df['flight_date'] = pd.to_datetime(
    df['flight_date'],
    format='%Y-%m-%d %H:%M:%S.%f %Z',
    utc=True
)

# Truncate to midnight and drop the timezone to get naive calendar dates
df['flight_date'] = (
    df['flight_date']
      .dt.tz_convert('UTC')
      .dt.normalize()
      .dt.tz_localize(None)   # remove the timezone entirely
)

df['city'] = "Märsta"

df.head()

   flight_date  total_landings    city
0   2019-12-31             178  Märsta
1   2020-01-01             239  Märsta
2   2020-01-02             234  Märsta
3   2020-01-03             166  Märsta
4   2020-01-04             241  Märsta


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2186 entries, 0 to 2185
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   flight_date     2186 non-null   datetime64[ns]
 1   total_landings  2186 non-null   int64         
 2   city            2186 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 51.4+ KB


## Data Cleaning 

In [4]:
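df = df.rename(columns={'flight_date': 'date'})
# 'date' is already datetime64; re-parsing is a safeguard that turns any invalid value into NaT
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Store the landing counts as float32
df['total_landings'] = df['total_landings'].astype('float32')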

In [5]:
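# Build the complete daily calendar for the covered period and list the dates absent from the data
full = pd.date_range(df['date'].min(), df['date'].max(), freq='D')
missing_days = full.difference(df['date'])

print("Missing days:", len(missing_days))
missing_days[:10]   # show the first few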

Missing days: 3


DatetimeIndex(['2023-12-02', '2023-12-03', '2023-12-04'], dtype='datetime64[ns]', freq=None)
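
The gap check above can be taken one step further by reindexing onto the full calendar, so that missing days show up as explicit rows. The sketch below uses a small made-up frame (`toy`) rather than the real data. Whether to leave the gaps as NaN, interpolate them, or drop them is a modelling decision this notebook does not make.

```python
import pandas as pd

# Toy data with a gap on 2023-12-02 and 2023-12-03 (values are made up)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2023-12-01", "2023-12-04", "2023-12-05"]),
    "total_landings": [200.0, 210.0, 190.0],
})

# Same gap check as above, on the toy frame
full = pd.date_range(toy["date"].min(), toy["date"].max(), freq="D")
missing_days = full.difference(pd.DatetimeIndex(toy["date"]))

# Reindex onto the full calendar; missing days become rows with NaN landings
filled = (
    toy.set_index("date")
       .reindex(full)
       .rename_axis("date")
       .reset_index()
)
```

After this, `filled` has one row per calendar day, and the days listed in `missing_days` carry NaN in `total_landings`.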

## Create the Feature Group in Hopsworks 

In [6]:
project = hopsworks.login()
fs = project.get_feature_store()

2026-01-05 16:35:57,976 INFO: Initializing external client
2026-01-05 16:35:57,978 INFO: Base URL: https://c.app.hopsworks.ai:443
2026-01-05 16:35:59,850 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1286325


In [7]:
feature_group = fs.get_or_create_feature_group(
    name="flight_data_arlanda",
    version=1,
    primary_key=['city'],
    event_time="date",
    description="Data from OpenSky. Total number of landings each day at Arlanda since 2019-12-31",
)
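
Before inserting, it can be worth checking that the frame matches the key layout declared above. This is a minimal sketch, assuming each (`city`, `date`) pair should appear at most once and that landing counts cannot be negative. The helper `check_before_insert` and the toy rows are hypothetical and not part of the Hopsworks API.

```python
import pandas as pd


def check_before_insert(frame: pd.DataFrame) -> list[str]:
    """Return a list of problems found; an empty list means no issue was detected."""
    problems = []
    # The primary key and the event time should never be null
    if frame[["city", "date"]].isna().any().any():
        problems.append("null key or event time")
    # Assumption: one row per city and day
    if frame.duplicated(subset=["city", "date"]).any():
        problems.append("duplicate (city, date) rows")
    if (frame["total_landings"] < 0).any():
        problems.append("negative landing count")
    return problems


# Toy rows (values are made up); the second frame repeats a day on purpose
good = pd.DataFrame({
    "city": ["Märsta", "Märsta"],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "total_landings": [239.0, 234.0],
})
bad = pd.concat([good, good.iloc[[0]]], ignore_index=True)

good_report = check_before_insert(good)
bad_report = check_before_insert(bad)
```

In the notebook, the same call would be `check_before_insert(df)` right before `feature_group.insert(...)`.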

In [8]:
feature_group.insert(
    df,
    write_options={"wait_for_job": True}
)

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1286325/fs/1265794/fg/1908063


Uploading Dataframe: 100.00% |█| Rows 2186/2186 | Elapsed Time: 00:01 | Remainin


Launching job: flight_data_arlanda_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1286325/jobs/named/flight_data_arlanda_1_offline_fg_materialization/executions
2026-01-05 16:36:26,558 INFO: Waiting for execution to finish. Current state: INITIALIZING. Final status: UNDEFINED
2026-01-05 16:36:29,764 INFO: Waiting for execution to finish. Current state: RUNNING. Final status: UNDEFINED
2026-01-05 16:38:25,362 INFO: Waiting for execution to finish. Current state: AGGREGATING_LOGS. Final status: SUCCEEDED
2026-01-05 16:38:25,597 INFO: Waiting for log aggregation to finish.
2026-01-05 16:38:34,394 INFO: Execution finished successfully.


(Job('flight_data_arlanda_1_offline_fg_materialization', 'SPARK'), None)