# Backfill and Feature Engineering Notebook
This notebook consists of 5 parts:
1. Importing libraries and loading packages
2. Loading Sensor Data
3. Loading Weather Data
4. Merging Sensor and Weather Data (including feature engineering and clustering)
5. Hopsworks Feature Storage

Throughout this notebook, our decisions are informed by insights from the exploratory data analysis (EDA) we conducted. This analysis helped us identify the most relevant information for our methods and strategies.

## 1. Importing libraries and loading packages
In this section, we import the necessary libraries and load the environment variables.

In [1]:
# Package for hopsworks integration
# !pip install -U hopsworks --quiet

# Import standard Python libraries
import pandas as pd 
import hopsworks 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

# Import machine learning tools
from sklearn.preprocessing import StandardScaler  
from sklearn.cluster import KMeans  
from sklearn.metrics import silhouette_score  

# Import other useful libraries
import uuid  # Unique identifier generation
import requests  # For making API requests
import json  
import io 
import os
import base64 
from datetime import datetime, timedelta  # Date/time handling and manipulation
import pytz  # Timezone conversions and support

import openmeteo_requests
import requests_cache
from retry_requests import retry

# Environment variable management
from dotenv import load_dotenv
load_dotenv()

True

## 2. Loading Sensor Data
We load the historic data used in the notebook.

- The data is gathered through an API provided by the company, and the API results are stored in *bikelane_historic_df.csv* and *building_historic_df.csv*. The data located in the *data* folder is older data on which the EDA and the end-to-end pipeline are based. More information about the data is available in the readme.txt file in that folder.

- The data derived from the API comes from two parking spots: one close to a building and the other close to a bike lane. These locations serve as the identifiers for the two spots.

- In this section, we ping the API and save historic data from March and April in two CSV files, one for the parking spot close to the building and one for the spot close to the bike lane.

In [2]:
# getting the time for now, tomorrow and yesterday
now = datetime.now()  # Get current time 
today = now 
yesterday = today - timedelta(days=1)
tomorrow = today + timedelta(days=1)
print(today)

2024-05-24 12:35:02.881736


In [3]:
# Format 'today', 'tomorrow', and 'yesterday' as "YYYY-MM-DD HH:MM:SS"
formatted_today = today.strftime('%Y-%m-%d %H:%M:%S')
formatted_tomorrow = tomorrow.strftime('%Y-%m-%d %H:%M:%S')
formatted_yesterday = yesterday.strftime('%Y-%m-%d %H:%M:%S')

In [4]:
# defining API information
dev_eui_building = "0080E115003BEA91"
dev_eui_bikelane = "0080E115003E3597"
url = "https://data.sensade.com"

basic_auth = base64.b64encode(f"{os.getenv('API_USERNAME')}:{os.getenv('API_PASSWORD')}".encode())
headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Basic {basic_auth.decode("utf-8")}'
}

In [5]:
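# Function to ping the API and get data in a given time interval
def API_call(dev_eui, from_date, to_date):
    payload = json.dumps({
        "dev_eui": dev_eui,
        "from": from_date,
        "to": to_date
    })

    API_response = requests.request("GET", url, headers=headers, data=payload)

    # Raise an error instead of calling exit(), which would shut down the notebook kernel
    if API_response.status_code != 200:
        raise RuntimeError(f"API request failed with status code {API_response.status_code}")

    # The API returns CSV text, which we parse into a dataframe
    csv_data = API_response.text
    df = pd.read_csv(io.StringIO(csv_data))
    return df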

The cells below are commented out because the CSV files have already been created.

In [6]:
# march_building = API_call(dev_eui_building, "2024-03-01", "2024-04-01")
# march_bikelane = API_call(dev_eui_bikelane, "2024-03-01", "2024-04-01")
# april_building = API_call(dev_eui_building, "2024-04-01", "2024-05-01")
# april_bikelane = API_call(dev_eui_bikelane, "2024-04-01", "2024-05-01")

In [7]:
# building_historic_df = pd.concat([march_building, april_building], ignore_index=True)
# bikelane_historic_df = pd.concat([march_bikelane, april_bikelane], ignore_index=True)

In [8]:
# saving the data as CSV
# building_historic_df.to_csv('building_historic_df.csv', index=False)
# bikelane_historic_df.to_csv('bikelane_historic_df.csv', index=False)

In [9]:
# loading the data saved in the directory
building_historic_df = pd.read_csv('building_historic_df.csv')
bikelane_historic_df = pd.read_csv('bikelane_historic_df.csv')

In [10]:
# Defining a function that tries to parse the datetime with microseconds first, and if it fails, parses it without microseconds
def parse_datetime(dt_str):
    try:
        return datetime.strptime(dt_str, '%Y-%m-%d %H:%M:%S.%f')
    except ValueError:
        return datetime.strptime(dt_str, '%Y-%m-%d %H:%M:%S')
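
As a quick check of the fallback logic, the sketch below re-declares the same function and parses one timestamp with and one without fractional seconds. The sample strings are made up for illustration; the real values come from the `time` column.

```python
from datetime import datetime

def parse_datetime(dt_str):
    """Parse a timestamp with microseconds, falling back to whole seconds."""
    try:
        return datetime.strptime(dt_str, '%Y-%m-%d %H:%M:%S.%f')
    except ValueError:
        return datetime.strptime(dt_str, '%Y-%m-%d %H:%M:%S')

# Hypothetical sample values, not taken from the sensor data
with_fraction = parse_datetime('2024-03-01 12:30:45.123456')
without_fraction = parse_datetime('2024-03-01 12:30:45')

print(with_fraction.microsecond)     # 123456
print(without_fraction.microsecond)  # 0
```

Any string that matches neither format still raises a `ValueError`, so malformed timestamps surface immediately rather than being silently skipped.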

In [11]:
# Applying the function on the dataframes
building_historic_df = building_historic_df.copy()
building_historic_df['time'] = building_historic_df['time'].apply(parse_datetime)
bikelane_historic_df = bikelane_historic_df.copy()
bikelane_historic_df['time'] = bikelane_historic_df['time'].apply(parse_datetime)

In [12]:
# Creating a column for the time in the format of "YYYY-MM-DD HH" to merge with weather data
bikelane_historic_df['time_hour'] = bikelane_historic_df['time'].dt.strftime('%Y-%m-%d %H')
building_historic_df['time_hour'] = building_historic_df['time'].dt.strftime('%Y-%m-%d %H')

# Converting the time_hour column to datetime
bikelane_historic_df['time_hour'] = pd.to_datetime(bikelane_historic_df['time_hour'])
building_historic_df['time_hour'] = pd.to_datetime(building_historic_df['time_hour'])
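
The string round-trip above truncates each timestamp to the start of its hour. A shorter route, assuming the `time` column already holds datetime values, may be `Series.dt.floor`. The sketch below uses made-up timestamps and compares both approaches.

```python
import pandas as pd

# Made-up timestamps standing in for the sensor 'time' column
times = pd.Series([
    pd.Timestamp('2024-03-01 12:30:45'),
    pd.Timestamp('2024-03-01 13:05:10.500'),
])

# Round-trip through a "YYYY-MM-DD HH" string, as in the cell above
via_string = pd.to_datetime(times.dt.strftime('%Y-%m-%d %H'), format='%Y-%m-%d %H')

# Direct truncation to the hour
floored = times.dt.floor('h')

print((via_string == floored).all())  # True
```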

## 3. Loading Weather Data

In [13]:
# Setup the Open-Meteo API client with cache and retry on error
cache_session = requests_cache.CachedSession('.cache', expire_after = -1)
retry_session = retry(cache_session, retries = 5, backoff_factor = 0.2)
openmeteo = openmeteo_requests.Client(session = retry_session)

In [14]:
weather_url = "https://archive-api.open-meteo.com/v1/archive"
weather_params = {
	"latitude": 57.01,
	"longitude": 9.99,
	"start_date": "2024-03-01",
	"end_date": "2024-04-30",
	"hourly": ["temperature_2m", "relative_humidity_2m", "precipitation", "surface_pressure", "cloud_cover", "et0_fao_evapotranspiration", "wind_speed_10m", "soil_temperature_0_to_7cm", "soil_moisture_0_to_7cm"]
}
responses = openmeteo.weather_api(weather_url, params=weather_params)

In [15]:
# Process first location. Add a for-loop for multiple locations or weather models
response = responses[0]
print(f"Coordinates {response.Latitude()}°N {response.Longitude()}°E")
print(f"Elevation {response.Elevation()} m asl")
print(f"Timezone {response.Timezone()} {response.TimezoneAbbreviation()}")
print(f"Timezone difference to GMT+0 {response.UtcOffsetSeconds()} s")

# Process hourly data. The order of variables needs to be the same as requested.
hourly = response.Hourly()
hourly_temperature_2m = hourly.Variables(0).ValuesAsNumpy()
hourly_relative_humidity_2m = hourly.Variables(1).ValuesAsNumpy()
hourly_precipitation = hourly.Variables(2).ValuesAsNumpy()
hourly_surface_pressure = hourly.Variables(3).ValuesAsNumpy()
hourly_cloud_cover = hourly.Variables(4).ValuesAsNumpy()
hourly_et0_fao_evapotranspiration = hourly.Variables(5).ValuesAsNumpy()
hourly_wind_speed_10m = hourly.Variables(6).ValuesAsNumpy()
hourly_soil_temperature_0_to_7cm = hourly.Variables(7).ValuesAsNumpy()
hourly_soil_moisture_0_to_7cm = hourly.Variables(8).ValuesAsNumpy()

hourly_data = {"date": pd.date_range(
	start = pd.to_datetime(hourly.Time(), unit = "s", utc = True),
	end = pd.to_datetime(hourly.TimeEnd(), unit = "s", utc = True),
	freq = pd.Timedelta(seconds = hourly.Interval()),
	inclusive = "left"
)}
hourly_data["temperature_2m"] = hourly_temperature_2m
hourly_data["relative_humidity_2m"] = hourly_relative_humidity_2m
hourly_data["precipitation"] = hourly_precipitation
hourly_data["surface_pressure"] = hourly_surface_pressure
hourly_data["cloud_cover"] = hourly_cloud_cover
hourly_data["et0_fao_evapotranspiration"] = hourly_et0_fao_evapotranspiration
hourly_data["wind_speed_10m"] = hourly_wind_speed_10m
hourly_data["soil_temperature_0_to_7cm"] = hourly_soil_temperature_0_to_7cm
hourly_data["soil_moisture_0_to_7cm"] = hourly_soil_moisture_0_to_7cm

hourly_dataframe = pd.DataFrame(data = hourly_data)
hourly_dataframe.head()

Coordinates 56.977149963378906°N 10.0632905960083°E
Elevation 23.0 m asl
Timezone None None
Timezone difference to GMT+0 0 s


| | date | temperature_2m | relative_humidity_2m | precipitation | surface_pressure | cloud_cover | et0_fao_evapotranspiration | wind_speed_10m | soil_temperature_0_to_7cm | soil_moisture_0_to_7cm |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-03-01 00:00:00+00:00 | 6.2305 | 87.634949 | 0.1 | 1001.978699 | 100.0 | 0.006267 | 16.904673 | 5.3805 | 0.296 |
| 1 | 2024-03-01 01:00:00+00:00 | 5.7305 | 88.205788 | 0.1 | 1002.073425 | 80.400009 | 0.004743 | 16.299694 | 5.1305 | 0.292 |
| 2 | 2024-03-01 02:00:00+00:00 | 5.2305 | 89.097878 | 0.0 | 1001.470093 | 60.600002 | 0.003818 | 16.87398 | 4.9305 | 0.289 |
| 3 | 2024-03-01 03:00:00+00:00 | 4.7805 | 90.966789 | 0.0 | 1000.767395 | 39.600002 | 0.00123 | 16.981165 | 4.7305 | 0.286 |
| 4 | 2024-03-01 04:00:00+00:00 | 3.6805 | 94.174324 | 0.0 | 1000.855896 | 53.400002 | 0.0 | 17.0763 | 4.2305 | 0.284 |


In [16]:
#remove the timezone from the date column
hourly_dataframe['date'] = hourly_dataframe['date'].dt.tz_localize(None)
#Convert to datetime object
hourly_dataframe['date'] = pd.to_datetime(hourly_dataframe['date'])
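
The weather timestamps arrive in UTC, and `tz_localize(None)` simply drops the timezone while keeping the UTC clock values. Whether the sensor timestamps are also in UTC is not stated here. If they were in local time (an assumption, with `Europe/Copenhagen` as a hypothetical zone chosen from the coordinates), one option would be to convert before dropping the timezone, roughly as follows.

```python
import pandas as pd

# Two made-up hourly UTC timestamps
utc_times = pd.Series(pd.date_range('2024-03-01 00:00', periods=2, freq='h', tz='UTC'))

# What the cell above does: drop the timezone, keeping UTC clock values
naive_utc = utc_times.dt.tz_localize(None)

# Hypothetical alternative: convert to local time first, then drop the timezone
naive_local = utc_times.dt.tz_convert('Europe/Copenhagen').dt.tz_localize(None)

print(naive_utc.iloc[0])    # 2024-03-01 00:00:00
print(naive_local.iloc[0])  # 2024-03-01 01:00:00
```

A one-hour offset like this would shift which weather row each sensor reading is matched to, so it is worth confirming the sensor timezone before merging.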

## 4. Merging Sensor and Weather Data
Before storing the data in a feature store, we do some preprocessing to make the datasets easier to use.

This preprocessing consists of:
- Creating unique identifiers for each datapoint
- Combining the three datasets into one
- Creating clusters used for labeling, which is necessary when we want to train our models later
- Converting the date column to pandas datetime
- Renaming the radar columns, because Hopsworks does not accept column names that start with a number, and converting the relevant columns to float

In [17]:
# Merging the weather data with the building sensor data
building_historic_df = pd.merge(building_historic_df, hourly_dataframe, left_on='time_hour', right_on='date', how='left')
# Merging the weather data with the bikelane sensor data
bikelane_historic_df = pd.merge(bikelane_historic_df, hourly_dataframe, left_on='time_hour', right_on='date', how='left')
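
A left merge keeps every sensor row and attaches the weather values of the matching hour; sensor rows whose hour is absent from the weather table get NaN. The toy frames below are made up. The `validate` and `indicator` arguments of `pd.merge` are not used in this notebook and are shown only as one way to catch duplicate weather hours and count unmatched rows.

```python
import pandas as pd

# Made-up sensor readings, two of them in the same hour
sensor = pd.DataFrame({
    'time_hour': pd.to_datetime(['2024-03-01 00:00', '2024-03-01 00:00', '2024-03-01 05:00']),
    'x': [10.0, 12.0, 11.0],
})
# Made-up hourly weather, with no row for 05:00
weather = pd.DataFrame({
    'date': pd.to_datetime(['2024-03-01 00:00', '2024-03-01 01:00']),
    'temperature_2m': [6.2, 5.7],
})

merged = pd.merge(
    sensor, weather,
    left_on='time_hour', right_on='date',
    how='left',
    validate='many_to_one',  # raises MergeError if a weather hour appears twice
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

unmatched = (merged['_merge'] == 'left_only').sum()
print(len(merged), unmatched)  # 3 1
```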

In [18]:
# removing date column
building_historic_df = building_historic_df.drop(columns=['date'])
bikelane_historic_df = bikelane_historic_df.drop(columns=['date'])

## Feature Engineering

In [19]:
# Create a unique identifier for each row in the datasets
def create_id(df, dataset_name):
    # Assign the sensor prefix based on the dataset name
    if dataset_name == 'building_historic_df':
        df['psensor'] = "BUILDING"
    elif dataset_name == 'bikelane_historic_df':
        df['psensor'] = "BIKELANE"
    else:
        raise ValueError("Unknown dataset name provided")

    # Create a new column 'id' with a unique identifier for each row
    df['id'] = df['time'].astype(str) + '_' + df['psensor']

    return df
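
Because the `id` is built from the timestamp and the sensor name, it is only unique if no two readings from the same sensor share a timestamp. The sketch below, on made-up rows, shows one way to check this before the `id` is used as a key.

```python
import pandas as pd

# Made-up readings; the last two share a timestamp on purpose
df = pd.DataFrame({
    'time': pd.to_datetime(['2024-03-01 00:00:00', '2024-03-01 00:10:00', '2024-03-01 00:10:00']),
})
df['psensor'] = 'BUILDING'
df['id'] = df['time'].astype(str) + '_' + df['psensor']

# Rows whose id occurs more than once
duplicates = df[df['id'].duplicated(keep=False)]
print(df['id'].is_unique, len(duplicates))  # False 2
```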

In [20]:
# Applying the ID creator function to the datasets
df_bikelane = create_id(bikelane_historic_df, 'bikelane_historic_df')
df_building = create_id(building_historic_df, 'building_historic_df')

In [21]:
# Renaming the radar columns so they start with 'radar', because Hopsworks column names cannot start with a number
df_building = df_building.rename(columns={'0_radar': 'radar_0', '1_radar': 'radar_1', '2_radar': 'radar_2', '3_radar': 'radar_3', '4_radar': 'radar_4', '5_radar': 'radar_5', '6_radar': 'radar_6', '7_radar': 'radar_7'})
df_bikelane = df_bikelane.rename(columns={'0_radar': 'radar_0', '1_radar': 'radar_1', '2_radar': 'radar_2', '3_radar': 'radar_3', '4_radar': 'radar_4', '5_radar': 'radar_5', '6_radar': 'radar_6', '7_radar': 'radar_7'})
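
If more columns with a leading number turn up, writing the mapping by hand gets tedious. A small helper (hypothetical, not part of the notebook) could move the leading digits to the end of the name with a regular expression.

```python
import re

def move_leading_digits(name):
    """Turn '<digits>_<rest>' into '<rest>_<digits>'; leave other names unchanged."""
    return re.sub(r'^(\d+)_(.+)$', r'\2_\1', name)

columns = ['0_radar', '7_radar', 'battery', 'x']
renamed = [move_leading_digits(c) for c in columns]
print(renamed)  # ['radar_0', 'radar_7', 'battery', 'x']
```

Since `DataFrame.rename` accepts a function for `columns`, the helper could be applied as `df.rename(columns=move_leading_digits)`.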

In [22]:
# Converting the columns to float
float_columns = ['x', 'y', 'z', 'radar_0', 'radar_1', 'radar_2', 'radar_3', 'radar_4', 'radar_5', 'radar_6', 'radar_7', 'f_cnt', 'dr', 'rssi']
df_building[float_columns] = df_building[float_columns].astype(float)
df_bikelane[float_columns] = df_bikelane[float_columns].astype(float)

## Creating full dataframe with backfilled battery and radar columns

In [23]:
building_full_df = df_building.copy()
bikelane_full_df = df_bikelane.copy()

In [24]:
# Backfill missing values in the radar and battery columns with the next valid value
fill_columns = ['radar_0', 'radar_1', 'radar_2', 'radar_3', 'radar_4', 'radar_5', 'radar_6', 'radar_7', 'battery']
building_full_df[fill_columns] = building_full_df[fill_columns].bfill()
bikelane_full_df[fill_columns] = bikelane_full_df[fill_columns].bfill()
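
Note that `bfill` fills each gap with the next valid observation, whereas `ffill` would carry the previous one forward. Gaps at the very end of a column have no later value and stay NaN under `bfill`. A minimal illustration on a made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

print(s.bfill().tolist())  # [1.0, 4.0, 4.0, 4.0, nan]
print(s.ffill().tolist())  # [1.0, 1.0, 1.0, 4.0, 4.0]
```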

In [25]:
print(building_full_df.shape)
print(bikelane_full_df.shape)

(7667, 32)
(7595, 32)


In [26]:
# dropping observations that still have a missing radar value after backfilling (checked on radar_0)
building_full_df = building_full_df.dropna(subset=['radar_0'])
bikelane_full_df = bikelane_full_df.dropna(subset=['radar_0'])

## Creating dataframe with radar data

In [27]:
building_radar_df = df_building[['time', 'battery', 'temperature', 'radar_0', 'radar_1',
       'radar_2', 'radar_3', 'radar_4', 'radar_5', 'radar_6', 'radar_7',
       'package_type', 'f_cnt', 'dr', 'snr', 'rssi', 'hw_fw_version',
       'time_hour', 'temperature_2m', 'relative_humidity_2m', 'precipitation',
       'surface_pressure', 'cloud_cover', 'et0_fao_evapotranspiration',
       'wind_speed_10m', 'soil_temperature_0_to_7cm', 'soil_moisture_0_to_7cm',
       'psensor', 'id']]
bikelane_radar_df = df_bikelane[['time', 'battery', 'temperature', 'radar_0', 'radar_1',
       'radar_2', 'radar_3', 'radar_4', 'radar_5', 'radar_6', 'radar_7',
       'package_type', 'f_cnt', 'dr', 'snr', 'rssi', 'hw_fw_version',
       'time_hour', 'temperature_2m', 'relative_humidity_2m', 'precipitation',
       'surface_pressure', 'cloud_cover', 'et0_fao_evapotranspiration',
       'wind_speed_10m', 'soil_temperature_0_to_7cm', 'soil_moisture_0_to_7cm',
       'psensor', 'id']]

In [28]:
building_radar_df = building_radar_df.dropna(subset=['battery'])
bikelane_radar_df = bikelane_radar_df.dropna(subset=['battery'])

In [29]:
print(building_radar_df.shape)
print(bikelane_radar_df.shape)

(2685, 29)
(2628, 29)


## Magnetic Field Clustering
In this step, we develop a method to label the data points as either 'detection' or 'no_detection.' 

Our exploratory data analysis revealed that the electromagnetic field data is best suited for our objectives. Therefore, we focus on the x, y, and z data from this dataset.

We use KMeans as the clustering method, with the magnetic sensor readings along the x, y, and z axes as features. The data is normalized with StandardScaler before clustering.
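
The approach can be sketched on synthetic data as follows. The two blobs of x, y, z values are invented for illustration, and the silhouette score is added only as one possible sanity check on how well two clusters separate; it is not part of the labeling itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic x, y, z readings: 80 points around one level, 20 around another
group_a = rng.normal(loc=[0.0, 0.0, 0.0], scale=5.0, size=(80, 3))
group_b = rng.normal(loc=[200.0, -150.0, 100.0], scale=5.0, size=(20, 3))
mag = np.vstack([group_a, group_b])

# Standardize each axis, then cluster into two groups
mag_norm = StandardScaler().fit_transform(mag)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(mag_norm)

# Silhouette score close to 1 indicates well-separated clusters
score = silhouette_score(mag_norm, kmeans.labels_)
print(np.bincount(kmeans.labels_), round(score, 2))
```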

In [30]:
# Making a dataframe for the features we wish to cluster on
building_mag = building_full_df[["x","y","z"]]
bikelane_mag = bikelane_full_df[["x","y","z"]]

In [31]:
# Normalizing the data
scaler = StandardScaler()
building_mag_norm = scaler.fit_transform(building_mag)
bikelane_mag_norm = scaler.fit_transform(bikelane_mag)
# Clustering the magnetic field data with 2 clusters using kmeans
building_kmeans = KMeans(n_clusters=2, random_state=0).fit(building_mag_norm)
bikelane_kmeans = KMeans(n_clusters=2, random_state=0).fit(bikelane_mag_norm)


In [32]:
# Adding cluster labels to the mag dataframe
building_mag = building_mag.copy()  # explicit copy, avoids pandas' SettingWithCopyWarning
bikelane_mag = bikelane_mag.copy()  # explicit copy, avoids pandas' SettingWithCopyWarning
building_mag['mag_cluster'] = building_kmeans.labels_
bikelane_mag['mag_cluster'] = bikelane_kmeans.labels_
building_full_df = building_full_df.copy()  # explicit copy, avoids pandas' SettingWithCopyWarning
bikelane_full_df = bikelane_full_df.copy()  # explicit copy, avoids pandas' SettingWithCopyWarning
building_full_df['mag_cluster'] = building_mag['mag_cluster']
bikelane_full_df['mag_cluster'] = bikelane_mag['mag_cluster']

In [33]:
# Renaming the cluster labels to 'detection' and 'no_detection'
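# Note: KMeans numbers its clusters arbitrarily, so this mapping should be checked against the cluster sizes
building_full_df['mag_cluster'] = building_full_df['mag_cluster'].replace({0: 'no_detection', 1: 'detection'})
bikelane_full_df['mag_cluster'] = bikelane_full_df['mag_cluster'].replace({0: 'no_detection', 1: 'detection'})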
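
KMeans assigns the numbers 0 and 1 arbitrarily, so cluster 0 is not guaranteed to be the empty-spot cluster on every run or dataset. One possible safeguard, assuming the parking spot is empty most of the time (an assumption, not something established in this notebook), is to name the larger cluster `no_detection`:

```python
import pandas as pd

# Made-up cluster numbers as KMeans might return them
labels = pd.Series([1, 1, 0, 1, 1, 0, 1])

# Assumption: the majority cluster corresponds to an empty parking spot
majority = labels.value_counts().idxmax()
named = labels.map(lambda c: 'no_detection' if c == majority else 'detection')

print(named.value_counts().to_dict())  # {'no_detection': 5, 'detection': 2}
```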

In [34]:
# Fixing an error with the mag_cluster column type
building_full_df['mag_cluster'] = building_full_df['mag_cluster'].astype(str)
building_full_df['mag_cluster'] = building_full_df['mag_cluster'].replace('nan', None)  # Replace 'nan' string with actual None
bikelane_full_df['mag_cluster'] = bikelane_full_df['mag_cluster'].astype(str)
bikelane_full_df['mag_cluster'] = bikelane_full_df['mag_cluster'].replace('nan', None)  # Replace 'nan' string with actual None

In [35]:
building_full_df['mag_cluster'].value_counts()

mag_cluster
no_detection    7077
detection        588
Name: count, dtype: int64

In [36]:
bikelane_full_df['mag_cluster'].value_counts()

mag_cluster
no_detection    5006
detection       2589
Name: count, dtype: int64

In [37]:
# saving the data as CSV for EDA purposes
building_full_df.to_csv('building_full_df.csv', index=False)
bikelane_full_df.to_csv('bikelane_full_df.csv', index=False)

## Clustering with radar data

In [38]:
# Making a dataframe for the features we wish to cluster on
building_radar = building_radar_df[['radar_0', 'radar_1', 'radar_2', 'radar_3', 'radar_4', 'radar_5', 'radar_6', 'radar_7']]
bikelane_radar = bikelane_radar_df[['radar_0', 'radar_1', 'radar_2', 'radar_3', 'radar_4', 'radar_5', 'radar_6', 'radar_7']]

In [39]:
# Normalizing the data
scaler = StandardScaler()
building_radar_norm = scaler.fit_transform(building_radar)
bikelane_radar_norm = scaler.fit_transform(bikelane_radar)
# Clustering the radar data with 2 clusters using kmeans
building_kmeans = KMeans(n_clusters=2, random_state=0).fit(building_radar_norm)
bikelane_kmeans = KMeans(n_clusters=2, random_state=0).fit(bikelane_radar_norm)

In [40]:
# Adding cluster labels to the radar dataframe
building_radar = building_radar.copy()  # explicit copy, avoids pandas' SettingWithCopyWarning
bikelane_radar = bikelane_radar.copy()  # explicit copy, avoids pandas' SettingWithCopyWarning
building_radar['radar_cluster'] = building_kmeans.labels_
bikelane_radar['radar_cluster'] = bikelane_kmeans.labels_
building_radar_df = building_radar_df.copy()  # explicit copy, avoids pandas' SettingWithCopyWarning
bikelane_radar_df = bikelane_radar_df.copy()  # explicit copy, avoids pandas' SettingWithCopyWarning
building_radar_df['radar_cluster'] = building_radar['radar_cluster']
bikelane_radar_df['radar_cluster'] = bikelane_radar['radar_cluster']


In [41]:
# Renaming the cluster labels to 'detection' and 'no_detection'
building_radar_df['radar_cluster'] = building_radar_df['radar_cluster'].replace({0: 'no_detection', 1: 'detection'})
bikelane_radar_df['radar_cluster'] = bikelane_radar_df['radar_cluster'].replace({0: 'no_detection', 1: 'detection'})

In [42]:
building_radar_df['radar_cluster'].value_counts()

radar_cluster
no_detection    2248
detection        437
Name: count, dtype: int64

In [43]:
bikelane_radar_df['radar_cluster'].value_counts()

radar_cluster
no_detection    2204
detection        424
Name: count, dtype: int64

In [44]:
# saving the data as CSV for EDA purposes
building_radar_df.to_csv('building_radar_df.csv', index=False)
bikelane_radar_df.to_csv('bikelane_radar_df.csv', index=False)

## Combining building and bikelane dataframes

In [45]:
combined_full_df = pd.concat([building_full_df, bikelane_full_df], axis=0)
combined_radar_df = pd.concat([building_radar_df, bikelane_radar_df], axis=0)

In [46]:
combined_radar_df.columns

Index(['time', 'battery', 'temperature', 'radar_0', 'radar_1', 'radar_2',
       'radar_3', 'radar_4', 'radar_5', 'radar_6', 'radar_7', 'package_type',
       'f_cnt', 'dr', 'snr', 'rssi', 'hw_fw_version', 'time_hour',
       'temperature_2m', 'relative_humidity_2m', 'precipitation',
       'surface_pressure', 'cloud_cover', 'et0_fao_evapotranspiration',
       'wind_speed_10m', 'soil_temperature_0_to_7cm', 'soil_moisture_0_to_7cm',
       'psensor', 'id', 'radar_cluster'],
      dtype='object')

In [47]:
radar_cluster = combined_radar_df[['id', 'radar_cluster']]

In [48]:
combined_full_df = pd.merge(combined_full_df, radar_cluster, left_on='id', right_on='id', how='left')

In [49]:
combined_full_df['radar_cluster'] = combined_full_df['radar_cluster'].bfill()
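
Because the combined dataframe stacks the building rows on top of the bike lane rows, a plain `bfill` can fill a trailing gap for one sensor with a label from the other. A possible alternative, sketched here on made-up rows and assuming the rows are in time order within each sensor, is to backfill per sensor with `groupby`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'psensor': ['BUILDING', 'BUILDING', 'BIKELANE', 'BIKELANE'],
    'radar_cluster': ['detection', np.nan, 'no_detection', np.nan],
})

# Plain bfill: the trailing BUILDING gap is filled from the first BIKELANE row
plain = df['radar_cluster'].bfill()

# Grouped bfill: gaps are only filled from later rows of the same sensor
grouped = df.groupby('psensor')['radar_cluster'].bfill()

print(plain.tolist())    # ['detection', 'no_detection', 'no_detection', nan]
print(grouped.tolist())  # ['detection', nan, 'no_detection', nan]
```

With the grouped variant, rows that have no later radar label from their own sensor stay missing and can be handled explicitly.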

In [50]:
combined_full_df.to_csv('combined_full_df.csv', index=False)

## 5. Hopsworks Feature Storage

Next, we connect to the Hopsworks Feature Store so that we can create and access feature groups.

When creating feature groups, we take all the relevant columns and store them in Hopsworks, so that we can later access and interpret them for further use.

We also specify a 'primary_key', which is used for relating different dimension tables to each other. In our case this is the unique ID that we created in the preprocessing step.

The 'time' column is used as the event time key.

We also set `online_enabled` to `True` so that the feature group is available online and can be accessed through an API when we create feature views.

Finally, we give each column a description based on the information in the *README.txt* in *data*.


In [51]:
# Connecting to the Hopsworks project
project = hopsworks.login(project="annikaij")
fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/549019
Connected. Call `.close()` to terminate connection gracefully.


In [53]:
# Create a feature group for the combined dataset with backfilled radar and battery values
hist_combined_full_fg = fs.get_or_create_feature_group(
    name="hist_combined_full_fg",
    version=1,
    description="Data from Sensor and Weather API for both Parking Spots",
    primary_key=['id'],
    event_time='time',
    online_enabled=True
)

In [54]:
# Insert the combined full dataframe into the feature group
hist_combined_full_fg.insert(combined_full_df, write_options={"wait_for_job" : False})


Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/549019/fs/544841/fg/844145


Uploading Dataframe: 0.00% |          | Rows 0/15260 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: hist_combined_full_fg_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/549019/jobs/named/hist_combined_full_fg_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x7fa302b77c50>, None)

In [19]:
# Making descriptions for the features
# feature_descriptions = [
#     {"name": "time", "description": "Timepoint of the datapoint"},
#     {"name": "battery", "description": "Battery level of the sensor"},
#     {"name": "temperature", "description": "Temperature recorded by the sensor"},
#     {"name": "x", "description": "Magnetic field reading in the x direction"},
#     {"name": "y", "description": "Magnetic field reading in the y direction"},
#     {"name": "z", "description": "Magnetic field reading in the z direction"},
#     {"name": "radar_0", "description": "Radar reading from sensor radar sensor 0"},
#     {"name": "radar_1", "description": "Radar reading from sensor radar sensor 1"},
#     {"name": "radar_2", "description": "Radar reading from sensor radar sensor 2"},
#     {"name": "radar_3", "description": "Radar reading from sensor radar sensor 3"},
#     {"name": "radar_4", "description": "Radar reading from sensor radar sensor 4"},
#     {"name": "radar_5", "description": "Radar reading from sensor radar sensor 5"},
#     {"name": "radar_6", "description": "Radar reading from sensor radar sensor 6"},
#     {"name": "radar_7", "description": "Radar reading from sensor radar sensor 7"},
#     {"name": "package_type", "description": "Heartbeat indicates no significant change since last reading or change package type means that x, y or z has changed significantly +-30"},
#     {"name": "f_cnt", "description": "number of packages transmitted since last network registration"},
#     {"name": "dr", "description": "data rate parameter in LoRaWAN. It ranges between 1 and 5 where 1 is the slowest transmission data rate and 5 is the highest. This datarate is scaled by the network server depending on the signal quality of the past packages send"},
#     {"name": "snr", "description": "signal to noise ratio – the higher value, the better the signal quality"},
#     {"name": "rssi", "description": "signal strength – the higher value, the better the signal quality"},
#     {"name": "psensor", "description": "sensor identifier (ex. EL1, EL2, EL3)"},
#     {"name": "hw_fw_version", "description": "hardware and firmware version of the sensor"},
#     {"name": "id", "description": "unique identifier for each datapoint, built from the timestamp and the sensor name"}
# ]
# 
# for desc in feature_descriptions: 
#     hist_combined_full_fg.update_feature_description(desc["name"], desc["description"])

In [55]:
# Create a feature group for the combined dataset that only contains rows with non-backfilled radar data
hist_combined_radar_fg = fs.get_or_create_feature_group(
    name="hist_combined_radar_fg",
    version=1,
    description="Data from Sensor and Weather API for both Parking Spots but only with Non-Backfilled Radar Data",
    primary_key=['id'],
    event_time='time',
    online_enabled=True
)

In [56]:
# Insert the combined radar dataframe into the feature group
hist_combined_radar_fg.insert(combined_radar_df, write_options={"wait_for_job" : False})


Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/549019/fs/544841/fg/845145


Uploading Dataframe: 0.00% |          | Rows 0/5313 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: hist_combined_radar_fg_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/549019/jobs/named/hist_combined_radar_fg_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x7fa30486f750>, None)

In [22]:
# Making descriptions for the features
# feature_descriptions = [
#     {"name": "time", "description": "Timepoint of the datapoint"},
#     {"name": "battery", "description": "Battery level of the sensor"},
#     {"name": "temperature", "description": "Temperature recorded by the sensor"},
#     {"name": "x", "description": "Magnetic field reading in the x direction"},
#     {"name": "y", "description": "Magnetic field reading in the y direction"},
#     {"name": "z", "description": "Magnetic field reading in the z direction"},
#     {"name": "radar_0", "description": "Radar reading from sensor radar sensor 0"},
#     {"name": "radar_1", "description": "Radar reading from sensor radar sensor 1"},
#     {"name": "radar_2", "description": "Radar reading from sensor radar sensor 2"},
#     {"name": "radar_3", "description": "Radar reading from sensor radar sensor 3"},
#     {"name": "radar_4", "description": "Radar reading from sensor radar sensor 4"},
#     {"name": "radar_5", "description": "Radar reading from sensor radar sensor 5"},
#     {"name": "radar_6", "description": "Radar reading from sensor radar sensor 6"},
#     {"name": "radar_7", "description": "Radar reading from sensor radar sensor 7"},
#     {"name": "package_type", "description": "Heartbeat indicates no significant change since last reading or change package type means that x, y or z has changed significantly +-30"},
#     {"name": "f_cnt", "description": "number of packages transmitted since last network registration"},
#     {"name": "dr", "description": "data rate parameter in LoRaWAN. It ranges between 1 and 5 where 1 is the slowest transmission data rate and 5 is the highest. This datarate is scaled by the network server depending on the signal quality of the past packages send"},
#     {"name": "snr", "description": "signal to noise ratio – the higher value, the better the signal quality"},
#     {"name": "rssi", "description": "signal strength – the higher value, the better the signal quality"},
#     {"name": "psensor", "description": "sensor identifier (ex. EL1, EL2, EL3)"},
#     {"name": "hw_fw_version", "description": "hardware and firmware version of the sensor"},
#     {"name": "id", "description": "unique identifier for each datapoint, built from the timestamp and the sensor name"}
# ]
# 
# for desc in feature_descriptions: 
#     hist_combined_radar_fg.update_feature_description(desc["name"], desc["description"])

## **Next up:** 2: Latest API data
Go to the 2_latest_api_feature_pipeline.ipynb notebook