# Backfill and Feature Engineering Notebook
This notebook consists of 5 parts:
1. Importing libraries and loading packages
2. Data Loading
3. Data Preprocessing
4. Feature Engineering
5. Hopsworks Feature Storage

Throughout this notebook, our decisions are informed by the exploratory data analysis (EDA) we conducted, which helped us identify the most relevant information for our methods and strategies.

## 1. Importing libraries and loading packages
In this section, we import necessary libraries and define key functions.

In [1]:
# Package for hopsworks
# !pip install -U hopsworks --quiet

#Packages
import random
import pandas as pd
import hopsworks
import numpy as np
import uuid

# For visualisations
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Unsupervised machine learning (scaling and clustering)
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score



## 2. Data Loading
We load historic data to be used in the notebook.

- The data provided by Sensade consists of 9 CSV files, located in the *data* folder.
- The first three datasets are NORM1, NORM2 and NORM3. These CSV files contain data from *normal* parking spots, but the data was not collected in the same way as it will be collected in the future. Therefore, they will not be used in our project.
- Then there are FORB1, FORB2 and FORB3. These datasets contain data from places where parking is *forbidden* and are therefore not suitable for our project.
- The last three datasets, EL1, EL2 and EL3, come from electric car parking spots and were collected the way data will be collected in the future. They will be used in the project for modelling.
- More information on the specific datasets is found in the README.txt file in the *data* folder.
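
As a minimal sketch of how the loading step could be made independent of a hard-coded path, the helper below reads the three EL files from a configurable directory. The function name `load_el_datasets` and its parameters are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_el_datasets(data_dir, names=("EL1", "EL2", "EL3")):
    """Read <name>.csv for each name in data_dir.

    Returns a dict mapping the sensor name to its DataFrame.
    (Hypothetical helper, not part of the original notebook.)
    """
    data_dir = Path(data_dir)
    return {name: pd.read_csv(data_dir / f"{name}.csv") for name in names}
```

Calling `load_el_datasets('/workspaces/MLOps_Project/data')` would then return the same three frames as the cell below, keyed by sensor name, so only one path needs to change when the pipeline moves.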

In [3]:
# Load the data as df_1, df_2, df_3
# NOTE: the paths to the datasets should probably be changed when we're setting up the serverless-ml-pipeline
df_1 = pd.read_csv('/workspaces/MLOps_Project/data/EL1.csv')
df_2 = pd.read_csv('/workspaces/MLOps_Project/data/EL2.csv')
df_3 = pd.read_csv('/workspaces/MLOps_Project/data/EL3.csv')


## 3. Data Preprocessing
Before storing the data in a feature store, we do some preprocessing to make the datasets easier to use.

This preprocessing consists of:
- Making unique identifiers for each data point
- Combining the three datasets into one
- Making clusters used for labeling, which is necessary when we want to train our models later
- Converting the date column (`time`) to pandas datetime
- Renaming the radar columns, because Hopsworks does not accept column names that start with a number, and converting the relevant columns to float
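
The steps above (except the clustering) can be sketched end to end on a toy frame. This is an illustration under assumptions, not the notebook's implementation: the helper `preprocess`, the dict-based input and the toy values are hypothetical, and it uses `ignore_index=True` so that the combined frame gets a fresh index, which the cells below do not do.

```python
import uuid

import pandas as pd


def preprocess(frames):
    """Combine raw sensor frames (dict: sensor name -> DataFrame) into one frame."""
    parts = []
    for sensor, df in frames.items():
        df = df.copy()
        df["psensor"] = sensor
        # One random UUID per row serves as the unique identifier
        df["id"] = [str(uuid.uuid4()) for _ in range(len(df))]
        parts.append(df)
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    out = pd.concat(parts, axis=0, ignore_index=True)
    out["time"] = pd.to_datetime(out["time"])
    # '0_radar' -> 'radar_0', so that no column name starts with a digit
    out = out.rename(
        columns=lambda c: f"radar_{c.split('_')[0]}" if c.endswith("_radar") else c
    )
    float_cols = [c for c in out.columns if c.startswith("radar_") or c in {"x", "y", "z"}]
    out[float_cols] = out[float_cols].astype(float)
    return out


# Toy data (made-up values, only a subset of the real columns)
toy = {
    "EL1": pd.DataFrame({"time": ["2024-01-01 00:00:00", "2024-01-01 00:10:00"],
                         "x": [1, 2], "y": [3, 4], "z": [5, 6], "0_radar": [10, 11]}),
    "EL2": pd.DataFrame({"time": ["2024-01-02 00:00:00"],
                         "x": [7], "y": [8], "z": [9], "0_radar": [12]}),
}
df_toy = preprocess(toy)
print(df_toy.dtypes)
```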

In [6]:
# Create a unique identifier for each row in the datasets
def create_id(df, dataset_name):
    # Assign the sensor prefix based on the dataset name
    if dataset_name == 'df_1':
        df['psensor'] = "EL1"
    elif dataset_name == 'df_2':
        df['psensor'] = "EL2"
    elif dataset_name == 'df_3':
        df['psensor'] = "EL3"
    else:
        raise ValueError("Unknown dataset name provided")

    # Create a new column 'id' with a unique identifier for each row
    df['id'] = [str(uuid.uuid4()) for _ in df.index]

    return df

In [7]:
# Applying the function to the datasets
df_1 = create_id(df_1, 'df_1')
df_2 = create_id(df_2, 'df_2')
df_3 = create_id(df_3, 'df_3')

In [8]:
# Concatenate the datasets
df_main = pd.concat([df_1, df_2, df_3], axis=0)

In [10]:
# Converting the date to datetime
df_main['time'] = pd.to_datetime(df_main['time'])

In [11]:
#Renaming the radar columns to start with radar
df_main = df_main.rename(columns={'0_radar': 'radar_0', '1_radar': 'radar_1', '2_radar': 'radar_2', '3_radar': 'radar_3', '4_radar': 'radar_4', '5_radar': 'radar_5', '6_radar': 'radar_6', '7_radar': 'radar_7'})


In [12]:
# Converting the columns to float
float_cols = ['x', 'y', 'z', 'radar_0', 'radar_1', 'radar_2', 'radar_3', 'radar_4', 'radar_5', 'radar_6', 'radar_7', 'f_cnt', 'dr', 'rssi']
df_main[float_cols] = df_main[float_cols].astype(float)

## 4. Feature Engineering
In this step, we develop a method to label each data point as either 'detection' or 'no_detection'.

Our exploratory data analysis revealed that the electromagnetic field data is best suited for our objectives. Therefore, we focus on the x, y, and z data from this dataset.

We chose KMeans as our clustering method, with the magnetic sensor readings along the x, y, and z axes as features. The data is standardized with StandardScaler before clustering.
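
The scale-then-cluster approach can be tried out on synthetic data before running it on the sensor readings. The sketch below is only an illustration: the two blobs are made-up stand-ins for the x, y, z readings, and `n_init=10` is set explicitly as an assumption. It also computes `silhouette_score`, which the notebook imports, as one possible sanity check on how well separated the two clusters are.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for x, y, z readings: two well-separated blobs (not real sensor data)
blob_a = rng.normal(loc=[0, 0, 0], scale=5.0, size=(200, 3))
blob_b = rng.normal(loc=[100, -80, 150], scale=5.0, size=(200, 3))
mag_toy = np.vstack([blob_a, blob_b])

# Standardize each axis to zero mean and unit variance, then cluster into two groups
mag_scaled = StandardScaler().fit_transform(mag_toy)
kmeans_toy = KMeans(n_clusters=2, random_state=0, n_init=10).fit(mag_scaled)

# Silhouette score lies in [-1, 1]; values near 1 indicate well-separated clusters
score = silhouette_score(mag_scaled, kmeans_toy.labels_)
print(f"silhouette score: {score:.2f}")
```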

In [13]:
# Making a dataframe for the features we wish to cluster on
mag = df_main[["x","y","z"]]

In [14]:
# Normalizing the data
scaler = StandardScaler()
mag_normalized = scaler.fit_transform(mag)
# Clustering the magnetic field data with 2 clusters using kmeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(mag_normalized)

In [15]:
# Adding cluster labels to the mag dataframe
mag = mag.copy()  # explicit copy, so the assignment below does not trigger a SettingWithCopyWarning
mag['mag_cluster'] = kmeans.labels_
df_label = df_main[['id', 'time']]
df_label = df_label.copy()  # explicit copy, for the same reason
df_label['mag_cluster'] = mag['mag_cluster']

In [16]:
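# Map the cluster ids to labels; assigning the result back avoids an in-place replace on a column selection
df_label['mag_cluster'] = df_label['mag_cluster'].replace({0: 'detection', 1: 'no_detection'})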
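
KMeans numbers its clusters arbitrarily, so which id corresponds to 'detection' depends on the fitted model and is worth confirming, for example by inspecting the cluster centers. As a hedged sketch, the hypothetical helper `attach_labels` below attaches the labels by row position and fails loudly if a cluster id has no entry in the mapping. The toy frame deliberately has a duplicated index, as a frame has after `pd.concat` without `ignore_index=True`.

```python
import numpy as np
import pandas as pd


def attach_labels(df_ids, cluster_ids, mapping):
    """Return a copy of df_ids with a 'mag_cluster' column of string labels.

    cluster_ids must be in the same row order as df_ids, as kmeans.labels_
    is for the frame the model was fitted on.
    """
    if len(df_ids) != len(cluster_ids):
        raise ValueError("row count and label count differ")
    out = df_ids.copy()
    # Assigning a NumPy array is positional, so the frame's index plays no role
    out["mag_cluster"] = pd.Series(cluster_ids).map(mapping).to_numpy()
    if out["mag_cluster"].isna().any():
        raise ValueError("some cluster ids have no label in the mapping")
    return out


# Toy frame with a duplicated index
ids = pd.DataFrame({"id": ["a", "b", "c"]}, index=[0, 1, 0])
labelled = attach_labels(ids, np.array([0, 1, 0]), {0: "detection", 1: "no_detection"})
print(labelled["mag_cluster"].tolist())
```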

## 5. Hopsworks Feature Storage

Now we would like to connect to the Hopsworks Feature Store so we can access and create feature groups.

When creating feature groups, we take all the relevant columns and store them in Hopsworks, so that we can later access and interpret them for further use.

We also specify a `primary_key`, which is used for relating different dimension tables to each other. In our case, this is the unique ID that we made in the preprocessing step.

The 'time' column is used as the event time key.

We also set `online_enabled` to `True` to make the feature group available online, so that it can be accessed through an API when we make feature views.

Finally, we give each column a description based on the information in the *README.txt* in *data*.
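
Before inserting a frame, it can help to check the assumptions described above locally. The sketch below uses pandas only and does not call Hopsworks. The helper `check_feature_frame` is hypothetical, and the digit rule reflects only the column-name problem reported in the preprocessing section, not a complete statement of Hopsworks naming rules.

```python
import pandas as pd


def check_feature_frame(df, primary_key="id", event_time="time"):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if not df[primary_key].is_unique:
        problems.append(f"primary key '{primary_key}' has duplicate values")
    if not pd.api.types.is_datetime64_any_dtype(df[event_time]):
        problems.append(f"event time '{event_time}' is not a datetime column")
    digit_first = [c for c in df.columns if str(c)[:1].isdigit()]
    if digit_first:
        problems.append(f"column names starting with a digit: {digit_first}")
    return problems


# Toy frame that violates all three checks (made-up values)
toy_bad = pd.DataFrame({
    "id": ["a", "a"],
    "time": ["2024-01-01", "2024-01-02"],
    "0_radar": [1.0, 2.0],
})
for problem in check_feature_frame(toy_bad):
    print(problem)
```

On the toy frame this reports three problems. A frame prepared as in the preprocessing section (unique `id`, datetime `time`, renamed radar columns) should return an empty list.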


In [17]:
# Connecting to the Hopsworks project

project = hopsworks.login()

fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/549014
Connected. Call `.close()` to terminate connection gracefully.


In [18]:
# Create a feature group for the magnetic field features
mag_fg = fs.get_or_create_feature_group(
    name="historic_parking_detection_features",
    version=1,
    description="Historical data for parking detection",
    primary_key=['id'],
    event_time='time',
    online_enabled=True
)

In [19]:
# Insert the magnetic field features into the feature group
mag_fg.insert(df_main, write_options={"wait_for_job" : False})


Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/549014/fs/544837/fg/766332


Uploading Dataframe: 100.00% |██████████| Rows 20570/20570 | Elapsed Time: 00:08 | Remaining Time: 00:00


Launching job: historic_parking_detection_features_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/549014/jobs/named/historic_parking_detection_features_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x7f274ba81e70>, None)

In [20]:
# Making descriptions for the features
feature_descriptions = [
    {"name": "time", "description": "Timepoint of the datapoint"},
    {"name": "battery", "description": "Battery level of the sensor"},
    {"name": "temperature", "description": "Temperature recorded by the sensor"},
    {"name": "x", "description": "Magnetic field reading in the x direction"},
    {"name": "y", "description": "Magnetic field reading in the y direction"},
    {"name": "z", "description": "Magnetic field reading in the z direction"},
    {"name": "radar_0", "description": "Radar reading from radar sensor 0"},
    {"name": "radar_1", "description": "Radar reading from radar sensor 1"},
    {"name": "radar_2", "description": "Radar reading from radar sensor 2"},
    {"name": "radar_3", "description": "Radar reading from radar sensor 3"},
    {"name": "radar_4", "description": "Radar reading from radar sensor 4"},
    {"name": "radar_5", "description": "Radar reading from radar sensor 5"},
    {"name": "radar_6", "description": "Radar reading from radar sensor 6"},
    {"name": "radar_7", "description": "Radar reading from radar sensor 7"},
    {"name": "package_type", "description": "Heartbeat indicates no significant change since the last reading; the change package type means that x, y or z has changed significantly (+-30)"},
    {"name": "f_cnt", "description": "number of packages transmitted since last network registration"},
    {"name": "dr", "description": "data rate parameter in LoRaWAN. It ranges between 1 and 5 where 1 is the slowest transmission data rate and 5 is the highest. This data rate is scaled by the network server depending on the signal quality of the past packages sent"},
    {"name": "snr", "description": "signal to noise ratio – the higher value, the better the signal quality"},
    {"name": "rssi", "description": "signal strength – the higher value, the better the signal quality"},
    {"name": "psensor", "description": "sensor identifier (ex. EL1, EL2, EL3)"},
    {"name": "id", "description": "unique identifier for each datapoint made with uuid4"}
]

for desc in feature_descriptions: 
    mag_fg.update_feature_description(desc["name"], desc["description"])

In [21]:
# Create a feature group for the magnetic field labels
mag_label_fg = fs.get_or_create_feature_group(
    name="historic_parking_detection_labels",
    version=1,
    description="Historical labels on data for parking detection",
    primary_key=['id'],
    event_time='time',
    online_enabled=True
)

In [22]:
# Insert the magnetic field labels into the feature group
mag_label_fg.insert(df_label, write_options={"wait_for_job" : False})

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/549014/fs/544837/fg/765325


Uploading Dataframe: 100.00% |██████████| Rows 20570/20570 | Elapsed Time: 00:06 | Remaining Time: 00:00


Launching job: historic_parking_detection_labels_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/549014/jobs/named/historic_parking_detection_labels_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x7f277442b7c0>, None)

In [23]:
# Making descriptions for the labels
feature_descriptions = [
    {"name": "time", "description": "Timepoint of the datapoint"},
    {"name": "id", "description": "unique identifier for each datapoint made with uuid4"},
    {"name": "mag_cluster", "description": "Label for the datapoint, whether a parking detection was made or not"}
]

for desc in feature_descriptions: 
    mag_label_fg.update_feature_description(desc["name"], desc["description"])

## **Next up:** 2: Feature Pipeline
Go to the 2_feature_pipeline.ipynb notebook