This Jupyter Notebook is designed to preprocess and extract features from time-series data. The notebook loads datasets, processes them to remove any invalid data, and extracts meaningful features using both custom functions and the tsfresh library. The processed data is then prepared for downstream machine learning tasks.

Key Steps in the Notebook
1. Importing Necessary Libraries
The notebook begins by importing essential Python libraries such as:
Pandas: For data manipulation and analysis.
NumPy: For numerical operations.
tsfresh: For time-series feature extraction.
datetime and timedelta: For handling time-related operations.
ast: For safely evaluating expressions from the data.

2. Loading the Data
The notebook loads time-series data from CSV files corresponding to five different blade feeding stations:
Station 1: BIC_Blade_Feeding_Station1_full.csv
Station 2: BIC_Blade_Feeding_Station2_full.csv
Station 3: BIC_Blade_Feeding_Station3_full.csv
Station 4: BIC_Blade_Feeding_Station4_full.csv
Station 5: BIC_Blade_Feeding_Station5_full.csv
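
The code cells shown here load only the Station 1 file. As a minimal sketch, assuming all five files follow the naming pattern listed above and sit in the working directory, the file names could be generated and checked before loading. The `station_files` dictionary is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path

# Assumption: the five files share one naming pattern and live in the working directory
station_files = {
    station: f"BIC_Blade_Feeding_Station{station}_full.csv"
    for station in range(1, 6)
}

# Report which files are present before attempting to read them
for station, filename in station_files.items():
    status = "found" if Path(filename).exists() else "missing"
    print(f"Station {station}: {filename} ({status})")
```

Each existing file could then be passed to `pd.read_csv` in the same way as the Station 1 file.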

3. Data Cleaning
Removing Rows with Empty Arrays:
The function remove_empty_array_rows identifies and removes rows containing empty arrays across various columns. This step ensures that subsequent feature extraction processes do not encounter errors due to invalid data.
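
The check behind this step can be illustrated on a few standalone values. The sketch below mirrors the idea of parsing each cell with `ast.literal_eval` and flagging only empty lists. The sample values are hypothetical and the helper name `is_empty_list_literal` is chosen for illustration.

```python
import ast

def is_empty_list_literal(value):
    """Return True when value is a string that parses to an empty list."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Not a Python literal (or not a string at all): treat as "not empty"
        return False
    return isinstance(parsed, list) and len(parsed) == 0

# Hypothetical cell values, as they might appear after reading a CSV
samples = ["[]", "[1.5, 2.0]", "hello", "3", 0.7]
flags = [is_empty_list_literal(v) for v in samples]
print(flags)
```

Only the first sample is flagged. Non-list values and values that cannot be parsed are kept, so the cleaning step removes a row only when a cell is literally an empty list.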

4. Feature Extraction
Custom Feature Extraction:
The notebook computes a custom set of statistical features for each selected time-series column. In the cells shown here, that set is passed to tsfresh as a settings dictionary and contains maximum, mean, minimum, standard_deviation, and variance.

Using tsfresh for Feature Extraction:
The tsfresh library performs the extraction through its extract_features function, which computes the requested features for every series identified by an id column. This automates the calculation of features commonly used in time-series analysis.
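
To make the five selected features concrete, the sketch below computes the same statistics with pandas alone on hypothetical long-format data. It is a cross-check rather than the notebook's method. The column names imitate tsfresh's `<column>__<feature>` naming, and `ddof=0` assumes that tsfresh computes standard deviation and variance with NumPy's population defaults.

```python
import pandas as pd

# Hypothetical long-format data: one row per sample, 'id' marks the source row
long_df = pd.DataFrame({
    "id":    [0, 0, 0, 1, 1],
    "value": [1.0, 2.0, 3.0, 4.0, 6.0],
})

grouped = long_df.groupby("id")["value"]
features = pd.DataFrame({
    "value__maximum": grouped.max(),
    "value__mean": grouped.mean(),
    "value__minimum": grouped.min(),
    # ddof=0 gives population statistics (assumed to match tsfresh)
    "value__standard_deviation": grouped.std(ddof=0),
    "value__variance": grouped.var(ddof=0),
})
print(features)
```

The result has one row per id and one column per feature, which is the shape that the later merging steps rely on.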

5. Processing Individual Stations
The notebook processes each station's dataset individually using the process_station function, which:
    Calculates time differences between specific start and end times.
    Safely evaluates string representations of lists in the data.
    Expands rows based on timestamp increments for better feature extraction.
    Combines the extracted features with additional data columns.
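
The row-expansion step can be sketched in isolation. Assuming the samples in each list are evenly spaced between the start and end times, each value receives an interpolated timestamp. The function name `spread_timestamps` and the sample values are hypothetical.

```python
from datetime import datetime, timedelta

def spread_timestamps(values, start, end):
    """Pair each value with a timestamp spaced evenly between start and end."""
    if len(values) > 1:
        step = (end - start) / (len(values) - 1)
    else:
        step = timedelta(0)  # a single sample sits at the start time
    return [(start + i * step, value) for i, value in enumerate(values)]

# Hypothetical measurement window of 40 ms holding five samples
start = datetime(2024, 1, 1, 12, 0, 0)
end = start + timedelta(milliseconds=40)
samples = spread_timestamps([0.1, 0.4, 0.9, 0.4, 0.1], start, end)

for timestamp, value in samples:
    print(timestamp.isoformat(timespec="milliseconds"), value)

# Window duration in milliseconds, as in the time-difference step
duration_ms = (end - start).total_seconds() * 1000
print(duration_ms)
```

The first and last samples land exactly on the start and end times, so the expanded series spans the whole measurement window.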

6. Merging and Saving Data
After feature extraction, the notebook combines the features with columns from the original data. In the cells shown here, only the reject column is merged; the time-difference columns are commented out and the station_id argument is not used.
The processed data for each station is then saved to a CSV file for further analysis or modeling.

7. Final DataFrame Construction
The notebook constructs a final DataFrame for each station, aligning indices and merging feature data with additional columns.
The resulting DataFrames are ready for use in machine learning models or other analyses.
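
Index alignment matters because `pd.concat` with `axis=1` matches rows by index label, not by position. The sketch below uses hypothetical data to show the difference between concatenating with mismatched indices and concatenating after both indices are reset.

```python
import pandas as pd

# Hypothetical feature table indexed by the ids of the rows that were processed
features = pd.DataFrame({"carb_pos_low__mean": [0.5, 0.7, 0.9]}, index=[10, 11, 12])
# Hypothetical label column with a default 0..n-1 index
labels = pd.DataFrame({"reject": [0, 1, 0]})

# Without resetting, concat aligns on index labels and pads with NaN
misaligned = pd.concat([features, labels], axis=1)

# Resetting both indices makes concat pair rows by position
aligned = pd.concat(
    [features.reset_index(drop=True), labels.reset_index(drop=True)], axis=1
)
print(misaligned.shape, aligned.shape)
```

Positional pairing is only correct when both tables hold the same rows in the same order, so it is worth checking that the row counts match before concatenating.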

In [1]:
import os
import numpy as np
import pandas as pd
from tsfresh.feature_extraction import extract_features, MinimalFCParameters, ComprehensiveFCParameters
from datetime import datetime, timedelta
import ast

In [2]:
df1 = pd.read_csv('BIC_Blade_Feeding_Station1_full.csv')



In [3]:
print(df1.shape)


(137186, 27)


In [4]:


def remove_empty_array_rows(df):
    """
    This function removes rows that contain empty arrays from a single DataFrame.
    
    Parameters:
    df (pd.DataFrame): The DataFrame to process.
    
    Returns:
    pd.DataFrame: The cleaned DataFrame with rows containing empty arrays removed.
    """
    
    def is_empty_array(value):
        try:
            # Try to parse the value as a literal (e.g., list)
            parsed_value = ast.literal_eval(value)
            # Check if it's a list and if it's empty
            return isinstance(parsed_value, list) and len(parsed_value) == 0
        except (ValueError, SyntaxError):
            # If it fails to parse, it's not a list
            return False
    
    # Identify the rows with empty arrays
    rows_with_empty_arrays = []
    
    for column in df.columns:
        empty_array_rows = df[df[column].apply(is_empty_array)].index.tolist()
        rows_with_empty_arrays.extend(empty_array_rows)
    
    # Remove duplicates (if the same row has empty arrays in multiple columns)
    rows_with_empty_arrays = list(set(rows_with_empty_arrays))
    
    # Remove the rows from the DataFrame
    df_cleaned = df.drop(rows_with_empty_arrays)
    
    # Reset the index of the cleaned DataFrame
    df_cleaned.reset_index(drop=True, inplace=True)
    
    # Display the number of rows removed
    print(f"Removed {len(rows_with_empty_arrays)} rows with empty arrays.")
    
    return df_cleaned



In [5]:
df1 = remove_empty_array_rows(df1)


Removed 614 rows with empty arrays.


In [8]:


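def process_station(df, station_id):
    """
    Extract tsfresh features for one station and attach the 'reject' column.
    
    Parameters:
    df (pd.DataFrame): Station data with empty-array rows already removed.
    station_id (int): Station number; accepted but not used in the function body.
    
    Returns:
    pd.DataFrame: The extracted features combined with the 'reject' column.
    """
    # Drop unnecessary columns
    df = df.drop(['Unnamed: 0', 'razor_id', 'datetime_cam_start', 'datetime_cam_end', 'cam_mov_pos', 'cam_mov_id'], axis=1)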
    
    # Define the calculate_time_difference function
    def calculate_time_difference(df, start_col, end_col):
        df[start_col] = pd.to_datetime(df[start_col])
        df[end_col] = pd.to_datetime(df[end_col])
        df['required_time'] = df[end_col] - df[start_col]
        new_col_name = f'{start_col}_to_{end_col}_milliseconds'
        df[new_col_name] = df['required_time'].dt.total_seconds() * 1000
        df = df.drop(['required_time'], axis=1)
        return df
    
    # List of tuples with start and end datetime column pairs
    datetime_columns = [('datetime_cl_start', 'datetime_cl_end'), 
                        ('datetime_ch_start', 'datetime_ch_end'),
                        ('datetime_ph_start', 'datetime_ph_end'),
                        ('datetime_pl_start', 'datetime_pl_end'),
                        ('datetime_mov_start', 'datetime_mov_end')]
    
    # Calculate time differences for each pair of datetime columns
    for start_col, end_col in datetime_columns:
        df = calculate_time_difference(df, start_col, end_col)

    # Define the safe_literal_eval function
    def safe_literal_eval(val):
        try:
            return ast.literal_eval(val)
        except (ValueError, SyntaxError):
            return val

    # Apply the safe_literal_eval function to all specified columns
    columns_to_parse = [
        'carb_pos_low', 'carb_val_low',
        'carb_pos_high', 'carb_val_high',
        'pp_pos_high', 'pp_val_high',
        'pp_pos_low', 'pp_val_low',
    ]
    
    for col in columns_to_parse:
        df[col] = df[col].apply(safe_literal_eval)

    # Function to expand rows based on a single column
    def expand_single_column(df, column, datetime_start_col, datetime_end_col):
        expanded_rows = []

        for idx, row in df.iterrows():
            datetime_start = row[datetime_start_col]
            datetime_end = row[datetime_end_col]

            list_length = len(row[column])
            if list_length > 1:
                time_increment = (datetime_end - datetime_start) / (list_length - 1)
            else:
                time_increment = timedelta(0)

            for i, value in enumerate(row[column]):
                timestamp = datetime_start + i * time_increment
                expanded_rows.append([idx, timestamp, value])

        expanded_df = pd.DataFrame(expanded_rows, columns=['id', 'time', column])
        
        return expanded_df

    # Define custom settings for feature extraction
    custom_settings = {
        "maximum": None,
        "mean": None,
        "minimum": None,
        "standard_deviation": None,
        "variance": None
    }


    # Function to process and extract features for a single column
    def process_and_extract(df, column, datetime_start_col, datetime_end_col):
        expanded_df = expand_single_column(df, column, datetime_start_col, datetime_end_col)

        # Fill NaN values with zeros
        # expanded_df = expanded_df.fillna(0)

        extracted_features = extract_features(
            expanded_df, column_id='id', column_sort='time', default_fc_parameters=custom_settings
        )
        return extracted_features

    # Process each column with its respective datetime columns
    carb_pos_low_features = process_and_extract(df, 'carb_pos_low', 'datetime_cl_start', 'datetime_cl_end')
    carb_val_low_features = process_and_extract(df, 'carb_val_low', 'datetime_cl_start', 'datetime_cl_end')

    # carb_pos_high_features = process_and_extract(df, 'carb_pos_high', 'datetime_ch_start', 'datetime_ch_end')
    # carb_val_high_features = process_and_extract(df, 'carb_val_high', 'datetime_ch_start', 'datetime_ch_end')

    # pp_pos_high_features = process_and_extract(df, 'pp_pos_high', 'datetime_ph_start', 'datetime_ph_end')
    # pp_val_high_features = process_and_extract(df, 'pp_val_high', 'datetime_ph_start', 'datetime_ph_end')
    
    pp_pos_low_features = process_and_extract(df, 'pp_pos_low', 'datetime_pl_start', 'datetime_pl_end')
    pp_val_low_features = process_and_extract(df, 'pp_val_low', 'datetime_pl_start', 'datetime_pl_end')
    
    # mov_pos_features = process_and_extract(df, 'mov_pos', 'datetime_mov_start', 'datetime_mov_end')


    # Combine all features
    all_features = pd.concat([carb_pos_low_features, carb_val_low_features, pp_pos_low_features, pp_val_low_features], axis=1)

    # List of columns to merge from df
    columns_to_merge = [
        # 'datetime_cl_start_to_datetime_cl_end_milliseconds',
        # 'datetime_ch_start_to_datetime_ch_end_milliseconds',
        # 'datetime_ph_start_to_datetime_ph_end_milliseconds',
        # 'datetime_pl_start_to_datetime_pl_end_milliseconds',
        'reject'
    ]  

    cleaned_station_df = all_features
    selected_df = df[columns_to_merge]

    # Reset both indices so that concat pairs rows by position.
    # This assumes every row of df produced a row of features, in the same order.
    updated_station_df = cleaned_station_df.reset_index(drop=True)
    selected_df = selected_df.reset_index(drop=True)

    # Append the selected columns to the feature table
    final_station_df = pd.concat([updated_station_df, selected_df], axis=1)
    
    return final_station_df


In [9]:
station1_df = process_station(df1, station_id=1)


Feature Extraction: 100%|██████████████████████████████████████████████████████████████| 30/30 [04:34<00:00,  9.14s/it]
Feature Extraction: 100%|██████████████████████████████████████████████████████████████| 30/30 [03:44<00:00,  7.48s/it]
Feature Extraction: 100%|██████████████████████████████████████████████████████████████| 30/30 [03:38<00:00,  7.29s/it]
Feature Extraction: 100%|██████████████████████████████████████████████████████████████| 30/30 [04:11<00:00,  8.39s/it]


In [10]:



# Save the DataFrame to a CSV file
station1_df.to_csv('final_station1_df_mover_full_.csv', index=False)
