# Feature Backfill Flight Data
* Get Arrival Data for Arlanda for each day since the specified backfill date

./trino --server https://trino.opensky-network.org \
--user fredrikschultz \
--external-authentication \
--catalog minio --schema osky

See the link below to set up Trino, which is needed to get historical data. After that, we can use the OpenSky API to get daily data. Note that access to Trino requires approval: you have to submit a form to OpenSky and wait to be approved.

https://opensky-network.org/data/trino
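
The same connection can also be opened from Python instead of the CLI. The following is only a sketch under several assumptions that are not confirmed by this notebook: that the `trino` Python client is installed, that OpenSky exposes a `flights_data4` table with `estarrivalairport` and `day` (Unix timestamp of midnight UTC) columns, and that the helper names `day_epoch`, `build_arrivals_query`, and `fetch_daily_arrivals` are hypothetical. Host, catalog, and schema are taken from the CLI invocation above; `ESSA` is the ICAO code for Arlanda.

```python
from datetime import date, datetime, timezone

ARLANDA_ICAO = "ESSA"  # ICAO code for Stockholm Arlanda


def day_epoch(d: date) -> int:
    """Unix timestamp (seconds) of midnight UTC on the given date."""
    return int(datetime(d.year, d.month, d.day, tzinfo=timezone.utc).timestamp())


def build_arrivals_query(backfill_start: date, airport: str = ARLANDA_ICAO) -> str:
    """Build a query counting arrivals per day since the backfill date.

    ASSUMPTION: table `flights_data4` and columns `estarrivalairport` / `day`
    are taken from memory of the OpenSky schema, not from this notebook.
    """
    # Only allow plain 4-character ICAO codes, since the value is inlined in SQL.
    if not (len(airport) == 4 and airport.isalnum()):
        raise ValueError(f"not a valid ICAO airport code: {airport!r}")
    return (
        "SELECT day, count(*) AS arrivals "
        "FROM flights_data4 "
        f"WHERE estarrivalairport = '{airport}' "
        f"AND day >= {day_epoch(backfill_start)} "
        "GROUP BY day ORDER BY day"
    )


def fetch_daily_arrivals(backfill_start: date, user: str):
    """Run the query over Trino; requires an approved OpenSky account."""
    # Imported lazily so the query builder works without the client installed.
    from trino.auth import OAuth2Authentication
    from trino.dbapi import connect

    conn = connect(
        host="trino.opensky-network.org",
        port=443,
        user=user,
        http_scheme="https",
        auth=OAuth2Authentication(),  # browser-based login, like --external-authentication
        catalog="minio",
        schema="osky",
    )
    cur = conn.cursor()
    cur.execute(build_arrivals_query(backfill_start))
    return cur.fetchall()
```

Keeping the query builder separate from the connection makes it possible to inspect the SQL before an account has been approved.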

In [4]:
import pandas as pd
import hopsworks 

In [10]:
csv_file="data/hotel_nights_stockholm.csv"

df = pd.read_csv(
    csv_file,
    sep=";",                 # semicolon-delimited
    encoding="utf-8",
)

# Clean the numeric column (handles normal + non-breaking spaces)
col = df.columns[1]
df[col] = (
    df[col].astype(str)
          .str.replace("\u00a0", "", regex=False)  # NBSP
          .str.replace(" ", "", regex=False)       # normal spaces
          .astype(int)
)

df["Datum"] = pd.to_datetime(df["Datum"], format="%Y-%m-%d")

print(df.head())

       Datum  Gästnätter i snitt per dag
0 2021-01-01                       10924
1 2021-01-02                       11302
2 2021-01-03                        8949
3 2021-01-04                        9511
4 2021-01-05                        9841


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1795 entries, 0 to 1794
Data columns (total 2 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   Datum                       1795 non-null   datetime64[ns]
 1   Gästnätter i snitt per dag  1795 non-null   int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 28.2 KB
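
Since the backfill is meant to cover every day, it may be worth checking that the series has no gaps before uploading it. The 1795 rows reported above match the number of calendar days from 2021-01-01 through 2025-11-30, which suggests the data is complete, but that end date is an inference and not shown in the output. The sketch below uses only the standard library; `missing_dates` is a hypothetical helper, not part of pandas or Hopsworks.

```python
from datetime import date, timedelta


def missing_dates(dates):
    """Return the calendar days between the first and last date that are absent."""
    present = set(dates)
    if not present:
        return []
    start, end = min(present), max(present)
    span = (end - start).days + 1  # inclusive number of days in the range
    all_days = (start + timedelta(days=i) for i in range(span))
    return [d for d in all_days if d not in present]


# Small illustration with one day left out on purpose.
sample = [date(2021, 1, 1), date(2021, 1, 2), date(2021, 1, 4)]
print(missing_dates(sample))

# With the DataFrame from this notebook, the call could look like:
# missing_dates(df["Datum"].dt.date)
```

An empty list means every day between the first and last date is present.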


## Data Cleaning 

In [18]:
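# Rename the Swedish column headers to English feature names
# ('Gästnätter i snitt per dag' = average guest nights per day)
df = df.rename(columns={
    'Datum': 'date',
    'Gästnätter i snitt per dag': 'nights_per_day',
})

df['nights_per_day'] = df['nights_per_day'].astype('float32')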
df['city'] = "Märsta"

In [19]:
df

           date  nights_per_day    city
0    2021-01-01         10924.0  Märsta
1    2021-01-02         11302.0  Märsta
2    2021-01-03          8949.0  Märsta
3    2021-01-04          9511.0  Märsta
4    2021-01-05          9841.0  Märsta
...         ...             ...     ...
1790 2025-11-26         39776.0  Märsta
1791 2025-11-27         38664.0  Märsta
1792 2025-11-28         48580.0  Märsta
1793 2025-11-29         52335.0  Märsta


## Create the Feature Group in Hopsworks 

In [20]:
project = hopsworks.login()
fs = project.get_feature_store()

2026-01-06 17:20:11,789 INFO: Closing external client and cleaning up certificates.
Connection closed.
2026-01-06 17:20:11,794 INFO: Initializing external client
2026-01-06 17:20:11,795 INFO: Base URL: https://c.app.hopsworks.ai:443
2026-01-06 17:20:13,559 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1286325


In [21]:
feature_group = fs.get_or_create_feature_group(
    name="hotel_data",
    version=1,
    primary_key=['city'],
    event_time="date",
    description="Data from Swedish tourist database. Daily hotel nights since 2021",
)

In [22]:
feature_group.insert(
    df,
    write_options={"wait_for_job": True}
)

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1286325/fs/1265794/fg/1893841


Uploading Dataframe: 100.00% |█| Rows 1795/1795 | Elapsed Time: 00:00 | Remainin


Launching job: hotel_data_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1286325/jobs/named/hotel_data_1_offline_fg_materialization/executions
2026-01-06 17:20:34,836 INFO: Waiting for execution to finish. Current state: SUBMITTED. Final status: UNDEFINED
2026-01-06 17:20:41,270 INFO: Waiting for execution to finish. Current state: RUNNING. Final status: UNDEFINED
2026-01-06 17:22:46,895 INFO: Waiting for log aggregation to finish.
2026-01-06 17:23:05,805 INFO: Execution finished successfully.


(Job('hotel_data_1_offline_fg_materialization', 'SPARK'), None)