# Air Quality Index (AQI) - Feature Pipeline

In this notebook we will:

1. Run in either "Backfill" or "Normal" operation.
2. If *BACKFILL==True*, load our DataFrame with historical data from the aqi.csv file.

   Otherwise (*BACKFILL==False*), load our DataFrame with one synthetic AQI sample.
3. Write our DataFrame to a Feature Group.

In [1]:
!pip install hopsworks 




Set **BACKFILL=True** if you want to create features from the aqi.csv file containing historical data.

In [162]:
import random
import pandas as pd
import hopsworks

# Set to True on the first run to backfill from aqi.csv; afterwards leave it False
BACKFILL = False

### Synthetic Data Functions

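These synthetic data functions can be used to create a DataFrame containing a single AQI sample. The location fields are drawn from a random row of aqi.csv, and the AQI values are random integers within fixed ranges.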

In [178]:
import pandas as pd
import random
import uuid

def generate_aqi_data_dynamic():
    """
    Returns a single random AQI data row as a DataFrame, dynamically generated from the existing dataset.
    """
    # Load your dataset, so it can refresh and generate new values
    aqi_df = pd.read_csv("../aqi.csv")
    # Drop rows with any empty (NaN or None) values
    aqi_df = aqi_df.dropna()
    # Randomly select a row from the dataset for Country, City, lat, and lng
    location_data = aqi_df.sample(1).iloc[0]

    # Define ranges for AQI components
    aqi_ranges = {
        "aqi_value": (0, 500),
        "co_aqi_value": (0, 50),
        "ozone_aqi_value": (0, 200),
        "no2_aqi_value": (0, 100),
        "pm25_aqi_value": (0, 500),
    }

    # Generate random integer values for AQI components
    aqi_data = {key: random.randint(int(value[0]), int(value[1])) for key, value in aqi_ranges.items()}

    # Add location data
    aqi_data["country"] = location_data["Country"]
    aqi_data["city"] = location_data["City"]
    aqi_data["lat"] = location_data["lat"]
    aqi_data["lng"] = location_data["lng"]

    # Generate AQI Category
    aqi_data["aqi_category"] = (
        "Good" if aqi_data["aqi_value"] <= 50 else
        "Moderate" if aqi_data["aqi_value"] <= 100 else
        "Unhealthy for Sensitive Groups" if aqi_data["aqi_value"] <= 150 else
        "Unhealthy" if aqi_data["aqi_value"] <= 200 else
        "Very Unhealthy" if aqi_data["aqi_value"] <= 300 else
        "Hazardous" if aqi_data["aqi_value"] <= 500 else
        "Beyond Hazardous"
    )


    # Create DataFrame with the correct order and column names
    result_df = pd.DataFrame([aqi_data], columns=[
        "country", "city", "aqi_value", "aqi_category", "co_aqi_value", 
        "ozone_aqi_value", "no2_aqi_value", "pm25_aqi_value", "lat", "lng"
    ])
    
    # Ensure correct data types
    result_df = result_df.astype({
        'country': 'string', 
        'city': 'string', 
        'aqi_value': 'int64', 
        'aqi_category': 'string', 
        'co_aqi_value': 'int64', 
        'ozone_aqi_value': 'int64', 
        'no2_aqi_value': 'int64', 
        'pm25_aqi_value': 'int64', 
        'lat': 'float64', 
        'lng': 'float64', 
    })

    return result_df

def get_random_aqi_values():
    """
    Returns a DataFrame containing random AQI values using the existing dataset for location data.
    """
    return generate_aqi_data_dynamic()
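
The chained conditional that assigns `aqi_category` can also be written as a threshold lookup. The sketch below is one possible alternative, not part of the original pipeline; the helper name `aqi_category` and the two module-level lists are hypothetical, and the thresholds simply mirror the ones used in `generate_aqi_data_dynamic()`.

```python
from bisect import bisect_left

# Inclusive upper bounds, mirroring the conditional chain in
# generate_aqi_data_dynamic(). These names are hypothetical.
_AQI_UPPER_BOUNDS = [50, 100, 150, 200, 300, 500]
_AQI_LABELS = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
    "Beyond Hazardous",  # anything above 500
]


def aqi_category(aqi_value: int) -> str:
    """Map an overall AQI value to its category label."""
    # bisect_left returns the index of the first bound >= aqi_value,
    # so a value equal to a bound stays in the lower category.
    return _AQI_LABELS[bisect_left(_AQI_UPPER_BOUNDS, aqi_value)]


print(aqi_category(303))
```

Keeping the bounds and labels in two parallel lists makes it easier to check the thresholds at a glance than a nested conditional expression.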

## Backfill or create new synthetic input data

You can run this pipeline in either *backfill* or *synthetic-data* mode.

In [194]:
import uuid
import pandas as pd

if BACKFILL:
    aqi_df = pd.read_csv("../aqi.csv")
    # Per-pollutant category columns to drop; the overall 'AQI Category' column is kept
    columns_to_drop = ['CO AQI Category', 'Ozone AQI Category', 'NO2 AQI Category', 'PM2.5 AQI Category']

    # Drop only the listed columns that are actually present in the file
    aqi_df = aqi_df.drop(columns=[col for col in columns_to_drop if col in aqi_df.columns])
else:
    aqi_df = get_random_aqi_values()
   
# Add UUID to each row
aqi_df["uuid"] = [str(uuid.uuid4()) for _ in range(len(aqi_df))]

# Drop rows with any empty (NaN or None) values
aqi_df = aqi_df.dropna()
# Refactoring column names
aqi_df.columns = [col.lower().replace(' ', '_').replace('.', '') for col in aqi_df.columns]
aqi_df

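|   | country | city | aqi_value | aqi_category | co_aqi_value | ozone_aqi_value | no2_aqi_value | pm25_aqi_value | lat | lng | uuid |
|---|---------|------|-----------|--------------|--------------|-----------------|---------------|----------------|-----|-----|------|
| 0 | Italy | Massarosa | 303 | Hazardous | 16 | 123 | 58 | 200 | 43.8667 | 10.3333 | cee2bf5f-9542-4aff-bbea-13ee87634eab |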
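
The column renaming step is what turns a header such as `PM2.5 AQI Value` into `pm25_aqi_value`. A minimal, self-contained sketch of that step follows; the helper name `normalize_column_names` and the tiny example frame are hypothetical and stand in for the real aqi.csv data.

```python
import pandas as pd


def normalize_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lower-case, underscore-separated column names."""
    renamed = df.copy()
    # Same rule as in the pipeline cell: lower-case, spaces to underscores,
    # dots removed.
    renamed.columns = [
        col.lower().replace(" ", "_").replace(".", "") for col in renamed.columns
    ]
    return renamed


# Hypothetical one-row frame using header styles like those in the CSV
raw = pd.DataFrame({"Country": ["Italy"], "PM2.5 AQI Value": [200], "lat": [43.8667]})
clean = normalize_column_names(raw)
print(list(clean.columns))  # ['country', 'pm25_aqi_value', 'lat']
```

Because the synthetic-data path already produces lower-case names, applying the same rule to both paths leaves those columns unchanged.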


## Authenticate with Hopsworks using your API Key

Hopsworks will prompt you to paste in your API key and provide you with a link to find your API key if you have not stored it securely already.
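
If you would rather not paste the key on every run, one option is to keep it outside the notebook. The sketch below assumes the key is stored in an environment variable named `HOPSWORKS_API_KEY`; the helper `resolve_api_key` is hypothetical, and the commented-out `api_key_value` keyword is an assumption that should be checked against the `hopsworks` version you have installed.

```python
import os
from getpass import getpass


def resolve_api_key(env_var: str = "HOPSWORKS_API_KEY") -> str:
    """Return the API key from the environment, or prompt for it without echoing."""
    key = os.environ.get(env_var)
    if not key:
        # Fall back to an interactive prompt so the key never appears in the notebook
        key = getpass("Hopsworks API key: ")
    return key


# Assumed usage (not executed here):
# import hopsworks
# project = hopsworks.login(api_key_value=resolve_api_key())
```

Either way, the key itself should never be written into a cell, a comment, or a committed file.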

In [145]:
# Never store the API key in the notebook; hopsworks.login() prompts for it when needed
project = hopsworks.login()
fs = project.get_feature_store()


2025-01-03 00:53:45,254 INFO: Initializing external client
2025-01-03 00:53:45,256 INFO: Base URL: https://c.app.hopsworks.ai:443
2025-01-03 00:53:48,614 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1207459


## Create and write to a feature group - primary keys

To prevent duplicate entries, Hopsworks requires that each DataFrame has a *primary_key*. 
A *primary_key* is one or more columns that uniquely identify the row. Here, we add a generated
`uuid` column to every row and use it as the primary key, so each inserted row, whether backfilled
or synthetic, is treated as a distinct sample.

The *feature group* will create its online schema using the schema of the Pandas DataFrame.
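
As a small illustration of what a primary key has to guarantee, the sketch below checks with pandas that a candidate set of key columns is unique before anything is inserted. The helper `assert_unique_key` and the two-row example frame are hypothetical; the check runs locally and says nothing about how Hopsworks itself validates keys.

```python
import uuid

import pandas as pd


def assert_unique_key(df: pd.DataFrame, key_columns: list) -> None:
    """Raise ValueError if any two rows share the same values in key_columns."""
    duplicated = df.duplicated(subset=key_columns, keep=False)
    if duplicated.any():
        raise ValueError(
            f"{int(duplicated.sum())} rows share a value for key {key_columns}"
        )


# Two samples that happen to have identical feature values
df = pd.DataFrame({"city": ["Massarosa", "Massarosa"], "aqi_value": [303, 303]})
df["uuid"] = [str(uuid.uuid4()) for _ in range(len(df))]

assert_unique_key(df, ["uuid"])  # passes: each row has its own UUID

try:
    assert_unique_key(df, ["city", "aqi_value"])
except ValueError as err:
    print(err)  # the feature columns alone do not identify a row
```

This is why a generated identifier is a convenient key for synthetic samples: two randomly generated rows can coincide on every feature value and still remain distinguishable.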

In [195]:
aqi_fg = fs.get_or_create_feature_group(name="aqi",
                                  version=1,
                                  primary_key=["uuid"],
                                  description="Air Quality Prediction dataset project"
                                 )
aqi_fg.insert(aqi_df)

Uploading Dataframe: 100.00% |██████████| Rows 1/1 | Elapsed Time: 00:02 | Remaining Time: 00:00


Launching job: aqi_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1207459/jobs/named/aqi_1_offline_fg_materialization/executions


(Job('aqi_1_offline_fg_materialization', 'SPARK'), None)

In [1]:
# Uncomment to read the feature group back and inspect the inserted rows
# aqi_fg.read()