# Question 1

This notebook breaks down the `transform_features_py` function that transforms raw exercise results data into aggregated features.

## Data

### Input Data Structure

The input data contains exercise results with columns including:

| Field                      | Meaning |
| :---                       | :---    |
| `session_exercise_result_id` | Identifier of an exercise performed by a patient in a given moment in time (primary key). Each time a patient performs the same exercise, even if in the same session, it will have a different `session_exercise_result_id`. |
| `session_group`              | Identifier of the physical therapy session in which this exercise was performed. Each time a patient performs a session it will have a different `session_group` (all exercises of the same session will have the same value). |
| `patient_id`              | Identifier of the patient that performed the session  (all sessions of the same patient will have the same value). |
| `patient_name`              | Name of the patient that performed the session (all sessions of the same patient will have the same value). |
| `patient_age`               | Age of the patient that performed the session (all sessions of the same patient will have the same value). |
| `exercise_name`              | Name of the performed exercise. |
| `exercise_side`              | Body side that the exercise regards. |
| `exercise_order`             | Order of the exercise within the session (the first exercise of the session has `exercise_order` 1, the second has `exercise_order` 2 and so on). |
| `prescribed_repeats`         | Number of repetitions (individual movements) the patient was supposed to perform in this specific exercise. Can be different among two performances of the same exercise in the same session. The exercise finishes when the number of performed repetitions - either correct or wrong - reaches this value. |
| `training_time`              | Time, in seconds, the patient spent performing the exercise |
| `correct_repeats`                         | Number of correct repetitions performed. |
| `wrong_repeats`              | Number of incorrect repetitions performed. |
| `leave_exercise`             | If the patient leaves the exercise before finishing it, this field stores the reason why. If the patient leaves the exercise, he is led into the following exercise in the session. |
| `leave_session`              | If the patient leaves the session before finishing it, this field stores the reason why (all exercises of the same session will have the same value). If the patient leaves the session, no more exercises are performed. |
| `pain`                       | Amount of pain reported by the patient at the end of the session where this exercise was performed, between 0 and 10, where 0 is no pain and 10 is the worst possible pain (all exercises of the same session will have the same value). |
| `fatigue`                    | Amount of fatigue reported by the patient at the end of the session where this exercise was performed, between 0 and 10, where 0 is no fatigue and 10 is the worst possible fatigue (all exercises of the same session will have the same value). |
| `therapy_name`               | Name of the therapy the patient is undertaking (the same for all exercises in the same session). |
| `session_number`               | Session number for that patient. The first session performed by the patient will have `session_number` equal to 1, the second 2, ... (the same for all exercises in the same session). |
| `quality`                    | Score from 1 to 5 reported by the patient at the end of the session when replying to the "How would you rate your experience today?" question (all exercises of the same session will have the same value).|
| `quality_reason_*`                  | Additional context collected when a quality below 5 is reported by the patient (see image below). Possible values are `movement_detection`, `my_self_personal`, `other`, `exercises`, `tablet`, `tablet_and_or_motion_trackers`, `easy_of_use`, `session_speed` (all exercises of the same session will have the same value). |
| `session_is_nok` | Classification model score on each session (1 corresponds to a `nok` session and 0 to an `ok` session). |

### Output Data Structure

The `transform_features_py` function transforms the data so that each row is indexed by `session_group` and has the following fields:

| Field                           | Meaning |
| :---                            | :---    |
| `session_group`                 | explained above (primary key) |
| `patient_id`                    | explained above |
| `patient_name`                  | explained above |
| `patient_age`                   | explained above |
| `pain`                          | explained above |
| `fatigue`                       | explained above |
| `therapy_name`                  | explained above |
| `session_number`                | explained above |
| `leave_session`                 | explained above |
| `quality`                       | explained above |
| `quality_reason_*`              | explained above |
| `session_is_nok`                | explained above |
| `leave_exercise_*`           | Number of exercises in the session that were left for each reason: `system_problem`, `other`, `unable_perform`, `pain`, `tired`, `technical_issues` and `difficulty` (one column per reason). |
| `prescribed_repeats`            | Total number of repetitions (among all exercises) the patient was supposed to perform. |
| `training_time`                 | Time, in seconds, the patient spent performing the session. |
| `perc_correct_repeats`                       | Percentage of correct repetitions in the session. |
| `number_exercises`                | Number of exercises performed in the session. |  
| `number_of_distinct_exercises`    | Number of distinct exercises performed in the session. |
| `exercise_with_most_incorrect`  | Name of the exercise with the highest number of incorrect movements, if any. If two or more exercises tie for the highest number of incorrect movements, you can pick any of them. |
| `first_exercise_skipped`        | Name of the first skipped exercise, if any. |


## Transformations Breakdown

These are the applied transformations:
1. **Aggregation**: Condenses multiple exercise rows into a single session row
2. **Feature Engineering**: Creates derived metrics like percentage of correct repetitions
3. **Categorization**: Creates indicator columns for reasons exercises were left or quality issues
4. **Problem Detection**: Identifies problematic exercises (most incorrect, first skipped)

## Code Breakdown

The code is split into several modules:
- `io.py`: handles file I/O; for this question specifically, `load_exercise_data`.
- `transform.py`: contains all applicable transformations to the exercise results.
- `data.py`: contains the `transform_features_py` function that leverages the `transform` module to transform the exercise results into features.
- **Tests**: tests are located in the `tests/data_test.py` module. For more information refer to [Question 1 - Testing Structure and Methodology](Question_1_Tests.md)

## Step by Step Guide

This section breaks down the `transform_features_py` function and its components. The function transforms raw exercise results data into aggregated, session-level features.

### Quick Overview

`transform_features_py` is the main function that orchestrates the entire process. It is composed of the following functions:

1. `load_exercise_data` - Handles data loading from the parquet file
2. `aggregate_session_data` - Performs the initial groupby and aggregation
3. `calculate_performance_metrics` - Adds percentage of correct repetitions
4. `add_reason_counts` - Adds columns for counting different reasons for leaving exercises or quality issues
5. `identify_first_exercise_skipped` - Identifies the first exercise skipped by the patient
6. `identify_most_incorrect_exercise` - Identifies the exercise with the most incorrect repetitions in each session
7. `order_columns` - Orders the columns in a logical sequence

Each function has a single responsibility and a clear purpose, which allows testing each component separately. It also makes the code easier to maintain and reuse, since it's easier to modify specific parts without affecting the whole. 

### 1. Load the Data - `load_exercise_data`

Reads a Parquet file containing the exercise results data into a pandas DataFrame.


In [1]:
import pandas as pd
from pathlib import Path

from message.config import DATA_DIR  # no it module


def load_exercise_data(data_dir: str | Path) -> pd.DataFrame:
    """Load exercise results data from parquet file.

    Parameters
    ----------
    data_dir : str or Path
        Directory containing the exercise results data.

    Returns
    -------
    pd.DataFrame
        Raw exercise results data.
    """
    return pd.read_parquet(Path(data_dir, "exercise_results.parquet"))


df = load_exercise_data(DATA_DIR)
df

Unnamed: 0,session_exercise_result_sword_id,session_group,patient_id,therapy_name,exercise_name,exercise_side,exercise_order,prescribed_repeats,training_time,correct_repeats,...,quality_reason_other,quality_reason_exercises,quality_reason_tablet_and_or_motion_trackers,quality_reason_easy_of_use,quality_reason_tablet,quality_reason_session_speed,session_number,session_is_nok,patient_name,patient_age
0,39810278,lg1c88p/9QtkOmQwiwd5stMlmOU=,glRS/3uRDZt6RpmB+LaLyx/a7wk=,low_back,prone_press_ups,center,10,5,35,1,...,0,0,0,0,0,0,318,False,Sonya Berg,76
1,39810303,lg1c88p/9QtkOmQwiwd5stMlmOU=,glRS/3uRDZt6RpmB+LaLyx/a7wk=,low_back,child's_pose,center,12,1,33,1,...,0,0,0,0,0,0,318,False,Sonya Berg,76
2,39810255,lg1c88p/9QtkOmQwiwd5stMlmOU=,glRS/3uRDZt6RpmB+LaLyx/a7wk=,low_back,plank,center,7,3,36,1,...,0,0,0,0,0,0,318,False,Sonya Berg,76
3,39810227,lg1c88p/9QtkOmQwiwd5stMlmOU=,glRS/3uRDZt6RpmB+LaLyx/a7wk=,low_back,pelvic_anterior_posterior_tilt,center,4,20,23,20,...,0,0,0,0,0,0,318,False,Sonya Berg,76
4,39810238,lg1c88p/9QtkOmQwiwd5stMlmOU=,glRS/3uRDZt6RpmB+LaLyx/a7wk=,low_back,pelvic_side_tilt,center,5,20,24,20,...,0,0,0,0,0,0,318,False,Sonya Berg,76
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1126649,40189500,BvM0aISy9U19fgHIIliUzbJ3B9M=,N2dimTApy2aibVhQf3sKNqv+XQ8=,low_back,squat,bilateral,6,8,29,6,...,0,0,0,0,0,0,1,False,Nicole Acevedo,69
1126650,40189427,BvM0aISy9U19fgHIIliUzbJ3B9M=,N2dimTApy2aibVhQf3sKNqv+XQ8=,low_back,side_step,bilateral,5,8,23,8,...,0,0,0,0,0,0,1,False,Nicole Acevedo,69
1126651,40190119,BvM0aISy9U19fgHIIliUzbJ3B9M=,N2dimTApy2aibVhQf3sKNqv+XQ8=,low_back,side_lying_clamshells,right,8,8,98,8,...,0,0,0,0,0,0,1,False,Nicole Acevedo,69
1126652,40190002,BvM0aISy9U19fgHIIliUzbJ3B9M=,N2dimTApy2aibVhQf3sKNqv+XQ8=,low_back,side_lying_clamshells,left,7,8,263,8,...,0,0,0,0,0,0,1,False,Nicole Acevedo,69


### 2. Group and Aggregate Data - `aggregate_session_data`

Applies simple aggregations:
1. Groups the data by `session_group`
2. Applies different aggregation functions:
   - `first`: Takes the first value in each group for patient info and session details
   - `sum`: Adds up numerical values across all exercises in the session
   - `count`: Counts the total number of exercises
   - `nunique`: Counts the number of distinct exercises
3. Applies type coercions
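To make the aggregation functions concrete, here is a minimal sketch on a hypothetical four-row toy DataFrame (the values are made up for illustration and are not taken from the real data). It uses the same pandas named-aggregation pattern as `aggregate_session_data`, restricted to a few columns.

```python
import pandas as pd

# Hypothetical toy data: two sessions, one row per performed exercise.
toy = pd.DataFrame(
    {
        "session_group": ["s1", "s1", "s1", "s2"],
        "patient_id": ["p1", "p1", "p1", "p2"],
        "exercise_name": ["squat", "plank", "squat", "plank"],
        "prescribed_repeats": [8, 3, 8, 5],
    }
)

toy_sessions = (
    toy.groupby("session_group")
    .agg(
        # Session-level attribute: identical on every row, so take the first.
        patient_id=("patient_id", "first"),
        # Exercise-level numeric value: add up across the session.
        prescribed_repeats=("prescribed_repeats", "sum"),
        # Rows per session versus distinct exercise names per session.
        number_exercises=("exercise_name", "count"),
        number_of_distinct_exercises=("exercise_name", "nunique"),
    )
    .reset_index()
)
print(toy_sessions)
```

Session `s1` contains three exercise rows but only two distinct exercise names, which is the difference between `count` and `nunique`.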


In [2]:
def aggregate_session_data(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate exercise data by session group.

    Parameters
    ----------
    df : pd.DataFrame
        Raw exercise results data.

    Returns
    -------
    pd.DataFrame
        Data aggregated by session_group.
    """
    grouped = (
        df.groupby("session_group")
        .agg(
            patient_id=("patient_id", "first"),
            patient_name=("patient_name", "first"),
            patient_age=("patient_age", "first"),
            pain=("pain", "first"),
            fatigue=("fatigue", "first"),
            therapy_name=("therapy_name", "first"),
            session_number=("session_number", "first"),
            leave_session=("leave_session", "first"),
            quality=("quality", "first"),
            session_is_nok=("session_is_nok", "first"),
            prescribed_repeats=("prescribed_repeats", "sum"),
            training_time=("training_time", "sum"),
            correct_repeats=("correct_repeats", "sum"),
            wrong_repeats=("wrong_repeats", "sum"),
            number_exercises=("exercise_name", "count"),
            number_of_distinct_exercises=("exercise_name", "nunique"),
            quality_reason_movement_detection=(
                "quality_reason_movement_detection",
                "first",
            ),
            quality_reason_my_self_personal=(
                "quality_reason_my_self_personal",
                "first",
            ),
            quality_reason_other=("quality_reason_other", "first"),
            quality_reason_exercises=("quality_reason_exercises", "first"),
            quality_reason_tablet=("quality_reason_tablet", "first"),
            quality_reason_tablet_and_or_motion_trackers=(
                "quality_reason_tablet_and_or_motion_trackers",
                "first",
            ),
            quality_reason_easy_of_use=("quality_reason_easy_of_use", "first"),
            quality_reason_session_speed=("quality_reason_session_speed", "first"),
        )
        .reset_index()
    )

    grouped["session_is_nok"] = grouped["session_is_nok"].astype("object")
    grouped["pain"] = grouped["pain"].astype("float64")
    grouped["fatigue"] = grouped["fatigue"].astype("float64")
    grouped["session_number"] = grouped["session_number"].astype("int64")
    grouped["quality"] = grouped["quality"].astype("float64")
    grouped["quality_reason_movement_detection"] = grouped[
        "quality_reason_movement_detection"
    ].astype("int64")
    grouped["quality_reason_my_self_personal"] = grouped[
        "quality_reason_my_self_personal"
    ].astype("int64")
    grouped["quality_reason_other"] = grouped["quality_reason_other"].astype("int64")
    grouped["quality_reason_exercises"] = grouped["quality_reason_exercises"].astype(
        "int64"
    )
    grouped["quality_reason_tablet"] = grouped["quality_reason_tablet"].astype("int64")
    grouped["quality_reason_tablet_and_or_motion_trackers"] = grouped[
        "quality_reason_tablet_and_or_motion_trackers"
    ].astype("int64")
    grouped["quality_reason_easy_of_use"] = grouped[
        "quality_reason_easy_of_use"
    ].astype("int64")
    grouped["quality_reason_session_speed"] = grouped[
        "quality_reason_session_speed"
    ].astype("int64")

    return grouped


df_step_2 = aggregate_session_data(df)
df_step_2

Unnamed: 0,session_group,patient_id,patient_name,patient_age,pain,fatigue,therapy_name,session_number,leave_session,quality,...,number_exercises,number_of_distinct_exercises,quality_reason_movement_detection,quality_reason_my_self_personal,quality_reason_other,quality_reason_exercises,quality_reason_tablet,quality_reason_tablet_and_or_motion_trackers,quality_reason_easy_of_use,quality_reason_session_speed
0,++//wixk6DpH8NMGvqLqvpzWbzY=,3FDm7kzjNVgmqUPhyODoZpMIIGc=,Taylor Mendez,55,6.0,4.0,shoulder,6,,4.0,...,8,8,0,0,1,1,0,0,1,1
1,++2JgoMe8JGBtUHdsOiLGO8UZ18=,KRnwvSlSa6U62Edl3dHJa0nVM5A=,Danielle Miller,75,4.0,0.0,knee,8,,5.0,...,19,10,0,0,0,0,0,0,0,0
2,++4kUzy7ewH5u7FjNoU8CW6thbY=,euuPYQwygUQC94V0LvmFOzHPoXQ=,Jeremy Randall,70,4.0,2.0,low_back,9,,5.0,...,18,12,0,0,0,0,0,0,0,0
3,++8Q+lFKrp9IKCWsBT0IO0XEV1Y=,R0S8jUhp1lC00zuUYuB0QAnj6as=,Richard Robinson,98,4.0,4.0,low_back,2,,4.0,...,7,7,0,1,0,0,0,0,0,0
4,++9PC4/46Jmrl/PHbzkM1BCPg2g=,Uj48yKb4R53oteHJHRcnzpUjBdY=,Hector Perry,28,2.0,2.0,knee,17,,5.0,...,16,11,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74785,zzjYItanfFoaxcOT72WuLVdFlpU=,p5r85M4UOxA+ooLsAnpGA091yec=,Brianna George,28,2.0,0.0,shoulder,11,,5.0,...,23,12,0,0,0,0,0,0,0,0
74786,zzkMg0F+Nr92E2UwQld7T2c9DxY=,Q8xrPkPRSOYhbjrzTxF77Fp40IA=,Sophia Harris,85,8.0,4.0,neck,13,,5.0,...,18,11,0,0,0,0,0,0,0,0
74787,zzwyhOME0/jCt/TlocGDlnM7Nx4=,4tqtZxzLS9+7QJtKa3pW0beUG3s=,Donna Moreno,34,0.0,0.0,elbow,13,,5.0,...,18,8,0,0,0,0,0,0,0,0
74788,zzy4uJWf1oWcSbMEHF4RG15ELcU=,bWap/fmf6koGTRDMpejv7/J+ovw=,Nancy Walter,96,4.0,0.0,low_back,1,,3.0,...,13,7,1,0,0,0,0,0,0,0


### 3. Calculate Percentage of Correct Repetitions: `calculate_performance_metrics`

Creates a new feature measuring exercise accuracy by dividing the correct repetitions by the total repetitions.
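As a quick check of the formula, the sketch below applies the same division to a hypothetical toy frame (made-up values). It illustrates two properties of the result: the value is a fraction between 0 and 1 rather than a number between 0 and 100, and a session with no recorded repetitions yields `NaN` because pandas evaluates 0 / 0 as missing.

```python
import pandas as pd

# Hypothetical session-level totals; s3 has no repetitions at all.
toy = pd.DataFrame(
    {
        "session_group": ["s1", "s2", "s3"],
        "correct_repeats": [95, 44, 0],
        "wrong_repeats": [1, 6, 0],
    }
)

# Same formula as calculate_performance_metrics: correct / (correct + wrong).
toy["perc_correct_repeats"] = toy["correct_repeats"] / (
    toy["correct_repeats"] + toy["wrong_repeats"]
)
print(toy)
```

If downstream consumers cannot handle missing values, the `NaN` rows would need an explicit policy (for example leaving them missing or filling them), which is a design decision outside this function.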

In [3]:
def calculate_performance_metrics(grouped: pd.DataFrame) -> pd.DataFrame:
    """Calculate performance metrics for each session.

    Parameters
    ----------
    grouped : pd.DataFrame
        Aggregated session data.

    Returns
    -------
    pd.DataFrame
        Session data with performance metrics added.
    """
    grouped["perc_correct_repeats"] = grouped["correct_repeats"] / (
        grouped["correct_repeats"] + grouped["wrong_repeats"]
    )
    return grouped


df_step_3 = calculate_performance_metrics(df_step_2)
df_step_3

Unnamed: 0,session_group,patient_id,patient_name,patient_age,pain,fatigue,therapy_name,session_number,leave_session,quality,...,number_of_distinct_exercises,quality_reason_movement_detection,quality_reason_my_self_personal,quality_reason_other,quality_reason_exercises,quality_reason_tablet,quality_reason_tablet_and_or_motion_trackers,quality_reason_easy_of_use,quality_reason_session_speed,perc_correct_repeats
0,++//wixk6DpH8NMGvqLqvpzWbzY=,3FDm7kzjNVgmqUPhyODoZpMIIGc=,Taylor Mendez,55,6.0,4.0,shoulder,6,,4.0,...,8,0,0,1,1,0,0,1,1,0.989583
1,++2JgoMe8JGBtUHdsOiLGO8UZ18=,KRnwvSlSa6U62Edl3dHJa0nVM5A=,Danielle Miller,75,4.0,0.0,knee,8,,5.0,...,10,0,0,0,0,0,0,0,0,0.995
2,++4kUzy7ewH5u7FjNoU8CW6thbY=,euuPYQwygUQC94V0LvmFOzHPoXQ=,Jeremy Randall,70,4.0,2.0,low_back,9,,5.0,...,12,0,0,0,0,0,0,0,0,0.993333
3,++8Q+lFKrp9IKCWsBT0IO0XEV1Y=,R0S8jUhp1lC00zuUYuB0QAnj6as=,Richard Robinson,98,4.0,4.0,low_back,2,,4.0,...,7,0,1,0,0,0,0,0,0,0.88
4,++9PC4/46Jmrl/PHbzkM1BCPg2g=,Uj48yKb4R53oteHJHRcnzpUjBdY=,Hector Perry,28,2.0,2.0,knee,17,,5.0,...,11,0,0,0,0,0,0,0,0,0.986395
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74785,zzjYItanfFoaxcOT72WuLVdFlpU=,p5r85M4UOxA+ooLsAnpGA091yec=,Brianna George,28,2.0,0.0,shoulder,11,,5.0,...,12,0,0,0,0,0,0,0,0,0.968182
74786,zzkMg0F+Nr92E2UwQld7T2c9DxY=,Q8xrPkPRSOYhbjrzTxF77Fp40IA=,Sophia Harris,85,8.0,4.0,neck,13,,5.0,...,11,0,0,0,0,0,0,0,0,0.992958
74787,zzwyhOME0/jCt/TlocGDlnM7Nx4=,4tqtZxzLS9+7QJtKa3pW0beUG3s=,Donna Moreno,34,0.0,0.0,elbow,13,,5.0,...,8,0,0,0,0,0,0,0,0,0.990741
74788,zzy4uJWf1oWcSbMEHF4RG15ELcU=,bWap/fmf6koGTRDMpejv7/J+ovw=,Nancy Walter,96,4.0,0.0,low_back,1,,3.0,...,7,1,0,0,0,0,0,0,0,0.949153


### 4. Process Reasons for Leaving Exercises: `add_reason_counts`

Adds one column per reason, counting how many exercises in each session were left for that reason:
1. Defines the list of predefined reasons why a patient might leave an exercise
2. Sets `session_group` as the index of both `df` and `grouped` for faster lookup
3. For each reason, counts occurrences and updates the data:
    1. Adds a new `leave_exercise_{reason}` column to `grouped`, initialized to 0
    2. Filters `df` to find rows where `leave_exercise` matches the current reason: `df[df["leave_exercise"] == reason]`
    3. Groups by `session_group` and counts occurrences of the reason: `.groupby("session_group")["leave_exercise"].count()`
    4. Updates `grouped` with these counts: `grouped.loc[df_leave_exercise.index, f"leave_exercise_{reason}"] = df_leave_exercise.values`
    5. Fills NaN values with 0 to ensure consistency
4. Resets the index of `grouped` to restore `session_group` as a column

> Resetting the index is inefficient but necessary for further operations.
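For comparison, the per-reason counts can also be computed in a single pass without a loop or index juggling. The following is only a sketch of an alternative, not the notebook's implementation: it casts `leave_exercise` to a categorical with the known reasons, one-hot encodes it with `pd.get_dummies`, and sums per session. The toy data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: None means the exercise was not left.
toy = pd.DataFrame(
    {
        "session_group": ["s1", "s1", "s1", "s2", "s3"],
        "leave_exercise": ["pain", "pain", None, "tired", None],
    }
)
reasons = [
    "system_problem",
    "other",
    "unable_perform",
    "pain",
    "tired",
    "technical_issues",
    "difficulty",
]

# A categorical with fixed categories guarantees one dummy column per
# reason, even for reasons that never occur in the data.
dummies = pd.get_dummies(
    toy["leave_exercise"].astype(pd.CategoricalDtype(reasons)),
    prefix="leave_exercise",
    dtype=int,
)

# Missing reasons encode as all zeros, so every session keeps a row.
counts = dummies.groupby(toy["session_group"]).sum().reset_index()
print(counts)
```

The resulting `counts` frame has one row per session and could be merged into the aggregated data on `session_group`. Note that any reason value outside the `reasons` list is silently treated as missing in this sketch.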

In [4]:
def add_reason_counts(df: pd.DataFrame, grouped: pd.DataFrame) -> pd.DataFrame:
    """Add counts for different reasons for leaving exercises.

    Parameters
    ----------
    df : pd.DataFrame
        Raw exercise results data.
    grouped : pd.DataFrame
        Aggregated session data.

    Returns
    -------
    pd.DataFrame
        Session data with reason counts added.
    """
    leave_exercise_reasons = [
        "system_problem",
        "other",
        "unable_perform",
        "pain",
        "tired",
        "technical_issues",
        "difficulty",
    ]

    # Index both frames by session so counts align by label. Non-inplace
    # calls return new frames, so the caller's DataFrames are not mutated.
    df = df.set_index("session_group")
    grouped = grouped.set_index("session_group")

    for reason in leave_exercise_reasons:
        column = f"leave_exercise_{reason}"
        grouped[column] = 0
        # Number of exercises left for this reason, per session.
        df_leave_exercise = (
            df[df["leave_exercise"] == reason]
            .groupby("session_group")["leave_exercise"]
            .count()
        )
        grouped.loc[df_leave_exercise.index, column] = df_leave_exercise.values
        grouped[column] = grouped[column].fillna(0)

    # Restore session_group as a column (inefficient, but needed downstream).
    return grouped.reset_index()


df_step_4 = add_reason_counts(df, df_step_3)
df_step_4

Unnamed: 0,session_group,patient_id,patient_name,patient_age,pain,fatigue,therapy_name,session_number,leave_session,quality,...,quality_reason_easy_of_use,quality_reason_session_speed,perc_correct_repeats,leave_exercise_system_problem,leave_exercise_other,leave_exercise_unable_perform,leave_exercise_pain,leave_exercise_tired,leave_exercise_technical_issues,leave_exercise_difficulty
0,++//wixk6DpH8NMGvqLqvpzWbzY=,3FDm7kzjNVgmqUPhyODoZpMIIGc=,Taylor Mendez,55,6.0,4.0,shoulder,6,,4.0,...,1,1,0.989583,0,0,0,0,0,0,0
1,++2JgoMe8JGBtUHdsOiLGO8UZ18=,KRnwvSlSa6U62Edl3dHJa0nVM5A=,Danielle Miller,75,4.0,0.0,knee,8,,5.0,...,0,0,0.995,0,0,0,0,0,0,0
2,++4kUzy7ewH5u7FjNoU8CW6thbY=,euuPYQwygUQC94V0LvmFOzHPoXQ=,Jeremy Randall,70,4.0,2.0,low_back,9,,5.0,...,0,0,0.993333,0,0,0,0,0,0,0
3,++8Q+lFKrp9IKCWsBT0IO0XEV1Y=,R0S8jUhp1lC00zuUYuB0QAnj6as=,Richard Robinson,98,4.0,4.0,low_back,2,,4.0,...,0,0,0.88,0,0,0,0,0,0,0
4,++9PC4/46Jmrl/PHbzkM1BCPg2g=,Uj48yKb4R53oteHJHRcnzpUjBdY=,Hector Perry,28,2.0,2.0,knee,17,,5.0,...,0,0,0.986395,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74785,zzjYItanfFoaxcOT72WuLVdFlpU=,p5r85M4UOxA+ooLsAnpGA091yec=,Brianna George,28,2.0,0.0,shoulder,11,,5.0,...,0,0,0.968182,0,0,0,0,0,0,0
74786,zzkMg0F+Nr92E2UwQld7T2c9DxY=,Q8xrPkPRSOYhbjrzTxF77Fp40IA=,Sophia Harris,85,8.0,4.0,neck,13,,5.0,...,0,0,0.992958,0,0,0,0,0,0,0
74787,zzwyhOME0/jCt/TlocGDlnM7Nx4=,4tqtZxzLS9+7QJtKa3pW0beUG3s=,Donna Moreno,34,0.0,0.0,elbow,13,,5.0,...,0,0,0.990741,0,0,0,0,0,0,0
74788,zzy4uJWf1oWcSbMEHF4RG15ELcU=,bWap/fmf6koGTRDMpejv7/J+ovw=,Nancy Walter,96,4.0,0.0,low_back,1,,3.0,...,0,0,0.949153,0,0,0,0,0,0,0


### 5. Identify Exercise with Most Incorrect Repetitions: `identify_most_incorrect_exercise`

Identifies the exercise with the most incorrect repetitions in each session:
1. Groups `df` by "session_group" and "exercise_name"
2. Sums the "wrong_repeats" column to get the total incorrect repetitions for each exercise in each session
3. Finds the exercise with the highest "wrong_repeats" count for each "session_group" (`grouped_wrong_reps.loc[grouped_wrong_reps.groupby("session_group")["wrong_repeats"].idxmax()]`)
4. Drops the "wrong_repeats" column after identifying the most incorrect exercise
5. Merges the identified most incorrect exercises into `grouped` using "session_group" as the key. Uses a **left join** to retain all session data, even if no incorrect repetitions exist
6. Renames "exercise_name" to "exercise_with_most_incorrect"
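The `idxmax` pattern in step 3 can be illustrated on a hypothetical toy frame (made-up values). The sketch also shows one property worth knowing: `idxmax` returns the first row holding the maximum, so a session whose exercises all have zero wrong repetitions still gets an exercise name. The final filter is an optional variant, labeled as an assumption, for cases where such sessions should be left empty instead.

```python
import pandas as pd

# Hypothetical toy data: s1 has errors, s2 has none.
toy = pd.DataFrame(
    {
        "session_group": ["s1", "s1", "s1", "s2", "s2"],
        "exercise_name": ["squat", "plank", "squat", "plank", "bridge"],
        "wrong_repeats": [1, 3, 1, 0, 0],
    }
)

# Total wrong repetitions per (session, exercise).
wrong_per_exercise = (
    toy.groupby(["session_group", "exercise_name"])["wrong_repeats"]
    .sum()
    .reset_index()
)

# Row label of the maximum within each session, then select those rows.
top_rows = wrong_per_exercise.loc[
    wrong_per_exercise.groupby("session_group")["wrong_repeats"].idxmax()
]
print(top_rows)

# Assumed variant (not in the notebook): keep only sessions with errors,
# so a later left join leaves the other sessions as missing.
with_errors_only = top_rows[top_rows["wrong_repeats"] > 0]
print(with_errors_only)
```

In `s1` the two `squat` rows sum to 2, so `plank` with 3 wins. In `s2` every total is 0 and the first row in sorted order (`bridge`) is returned.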

In [5]:
def identify_most_incorrect_exercise(
    df: pd.DataFrame, grouped: pd.DataFrame
) -> pd.DataFrame:
    """Identify the exercise with the most incorrect repetitions in each session.

    Parameters
    ----------
    df : pd.DataFrame
        Raw exercise results data.
    grouped : pd.DataFrame
        Aggregated session data.

    Returns
    -------
    pd.DataFrame
        Session data with problematic exercise information added.
    """

    grouped_wrong_reps = (
        df.groupby(["session_group", "exercise_name"])["wrong_repeats"]
        .sum()
        .reset_index()
    )
    grouped_incorrect_ex = grouped_wrong_reps.loc[
        grouped_wrong_reps.groupby("session_group")["wrong_repeats"].idxmax()
    ].drop(columns="wrong_repeats")
    grouped = grouped.merge(grouped_incorrect_ex, on="session_group", how="left")
    grouped = grouped.rename(columns={"exercise_name": "exercise_with_most_incorrect"})

    return grouped


df_step_5 = identify_most_incorrect_exercise(df, df_step_4)
df_step_5

Unnamed: 0,session_group,patient_id,patient_name,patient_age,pain,fatigue,therapy_name,session_number,leave_session,quality,...,quality_reason_session_speed,perc_correct_repeats,leave_exercise_system_problem,leave_exercise_other,leave_exercise_unable_perform,leave_exercise_pain,leave_exercise_tired,leave_exercise_technical_issues,leave_exercise_difficulty,exercise_with_most_incorrect
0,++//wixk6DpH8NMGvqLqvpzWbzY=,3FDm7kzjNVgmqUPhyODoZpMIIGc=,Taylor Mendez,55,6.0,4.0,shoulder,6,,4.0,...,1,0.989583,0,0,0,0,0,0,0,shoulder_abduction
1,++2JgoMe8JGBtUHdsOiLGO8UZ18=,KRnwvSlSa6U62Edl3dHJa0nVM5A=,Danielle Miller,75,4.0,0.0,knee,8,,5.0,...,0,0.995,0,0,0,0,0,0,0,hip_abduction
2,++4kUzy7ewH5u7FjNoU8CW6thbY=,euuPYQwygUQC94V0LvmFOzHPoXQ=,Jeremy Randall,70,4.0,2.0,low_back,9,,5.0,...,0,0.993333,0,0,0,0,0,0,0,hip_hyperextension
3,++8Q+lFKrp9IKCWsBT0IO0XEV1Y=,R0S8jUhp1lC00zuUYuB0QAnj6as=,Richard Robinson,98,4.0,4.0,low_back,2,,4.0,...,0,0.88,0,0,0,0,0,0,0,knee_flexion
4,++9PC4/46Jmrl/PHbzkM1BCPg2g=,Uj48yKb4R53oteHJHRcnzpUjBdY=,Hector Perry,28,2.0,2.0,knee,17,,5.0,...,0,0.986395,0,0,0,0,0,0,0,airplane
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74785,zzjYItanfFoaxcOT72WuLVdFlpU=,p5r85M4UOxA+ooLsAnpGA091yec=,Brianna George,28,2.0,0.0,shoulder,11,,5.0,...,0,0.968182,0,0,0,0,0,0,0,diagonal_1_flexion
74786,zzkMg0F+Nr92E2UwQld7T2c9DxY=,Q8xrPkPRSOYhbjrzTxF77Fp40IA=,Sophia Harris,85,8.0,4.0,neck,13,,5.0,...,0,0.992958,0,0,0,0,0,0,0,sitting_neck_side_bending
74787,zzwyhOME0/jCt/TlocGDlnM7Nx4=,4tqtZxzLS9+7QJtKa3pW0beUG3s=,Donna Moreno,34,0.0,0.0,elbow,13,,5.0,...,0,0.990741,0,0,0,0,0,0,0,diagonal_1_flexion
74788,zzy4uJWf1oWcSbMEHF4RG15ELcU=,bWap/fmf6koGTRDMpejv7/J+ovw=,Nancy Walter,96,4.0,0.0,low_back,1,,3.0,...,0,0.949153,0,0,0,0,0,0,0,hip_hyperextension


### 6. Identify First Skipped Exercise: `identify_first_exercise_skipped`

Identifies the first exercise skipped in each session by:
1. Filtering for exercises where `leave_exercise` is not null (skipped exercises)
2. Sorting by session group and exercise order
3. Getting the first skipped exercise for each session group
4. Merging this information back into the grouped DataFrame
5. Renaming the column to `first_exercise_skipped`
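The steps above can be sketched on a hypothetical toy frame (made-up values). The rows are deliberately out of order to show why sorting by `exercise_order` matters before taking the first row per session. Unlike the notebook code, this sketch selects `exercise_name` before calling `first()`, which is a small simplification.

```python
import pandas as pd

# Hypothetical toy data: s1 skipped two exercises, s2 skipped none.
toy = pd.DataFrame(
    {
        "session_group": ["s1", "s1", "s1", "s2"],
        "exercise_name": ["squat", "plank", "bridge", "squat"],
        "exercise_order": [3, 1, 2, 1],
        "leave_exercise": ["pain", None, "tired", None],
    }
)

# Keep skipped exercises only, ordered by position within each session.
skipped = toy[toy["leave_exercise"].notnull()].sort_values(
    by=["session_group", "exercise_order"]
)

# groupby preserves row order within groups, so first() is the earliest skip.
first_skipped = (
    skipped.groupby("session_group")["exercise_name"].first().reset_index()
)

# Left join keeps sessions without skipped exercises (value is missing).
sessions = pd.DataFrame({"session_group": ["s1", "s2"]})
result = sessions.merge(first_skipped, on="session_group", how="left").rename(
    columns={"exercise_name": "first_exercise_skipped"}
)
print(result)
```

For `s1` the first skipped exercise is `bridge` (order 2), not `squat` (order 3), even though `squat` appears first in the raw rows.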


In [6]:
def identify_first_exercise_skipped(
    df: pd.DataFrame, grouped: pd.DataFrame
) -> pd.DataFrame:
    """Identify the first exercise skipped by the patient.

    Parameters
    ----------
    df : pd.DataFrame
        Raw exercise results data.
    grouped : pd.DataFrame
        Aggregated session data.

    Returns
    -------
    pd.DataFrame
        Session data with first skipped exercise added.
    """
    skipped_exercises = df[df["leave_exercise"].notnull()].sort_values(
        by=["session_group", "exercise_order"]
    )
    first_skipped = (
        skipped_exercises.groupby("session_group")
        .first()
        .reset_index()[["session_group", "exercise_name"]]
    )
    grouped = grouped.merge(first_skipped, on="session_group", how="left")
    grouped.rename(columns={"exercise_name": "first_exercise_skipped"}, inplace=True)
    return grouped


df_step_6 = identify_first_exercise_skipped(df, df_step_5)
df_step_6

Unnamed: 0,session_group,patient_id,patient_name,patient_age,pain,fatigue,therapy_name,session_number,leave_session,quality,...,perc_correct_repeats,leave_exercise_system_problem,leave_exercise_other,leave_exercise_unable_perform,leave_exercise_pain,leave_exercise_tired,leave_exercise_technical_issues,leave_exercise_difficulty,exercise_with_most_incorrect,first_exercise_skipped
0,++//wixk6DpH8NMGvqLqvpzWbzY=,3FDm7kzjNVgmqUPhyODoZpMIIGc=,Taylor Mendez,55,6.0,4.0,shoulder,6,,4.0,...,0.989583,0,0,0,0,0,0,0,shoulder_abduction,
1,++2JgoMe8JGBtUHdsOiLGO8UZ18=,KRnwvSlSa6U62Edl3dHJa0nVM5A=,Danielle Miller,75,4.0,0.0,knee,8,,5.0,...,0.995,0,0,0,0,0,0,0,hip_abduction,
2,++4kUzy7ewH5u7FjNoU8CW6thbY=,euuPYQwygUQC94V0LvmFOzHPoXQ=,Jeremy Randall,70,4.0,2.0,low_back,9,,5.0,...,0.993333,0,0,0,0,0,0,0,hip_hyperextension,
3,++8Q+lFKrp9IKCWsBT0IO0XEV1Y=,R0S8jUhp1lC00zuUYuB0QAnj6as=,Richard Robinson,98,4.0,4.0,low_back,2,,4.0,...,0.88,0,0,0,0,0,0,0,knee_flexion,
4,++9PC4/46Jmrl/PHbzkM1BCPg2g=,Uj48yKb4R53oteHJHRcnzpUjBdY=,Hector Perry,28,2.0,2.0,knee,17,,5.0,...,0.986395,0,0,0,0,0,0,0,airplane,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74785,zzjYItanfFoaxcOT72WuLVdFlpU=,p5r85M4UOxA+ooLsAnpGA091yec=,Brianna George,28,2.0,0.0,shoulder,11,,5.0,...,0.968182,0,0,0,0,0,0,0,diagonal_1_flexion,
74786,zzkMg0F+Nr92E2UwQld7T2c9DxY=,Q8xrPkPRSOYhbjrzTxF77Fp40IA=,Sophia Harris,85,8.0,4.0,neck,13,,5.0,...,0.992958,0,0,0,0,0,0,0,sitting_neck_side_bending,
74787,zzwyhOME0/jCt/TlocGDlnM7Nx4=,4tqtZxzLS9+7QJtKa3pW0beUG3s=,Donna Moreno,34,0.0,0.0,elbow,13,,5.0,...,0.990741,0,0,0,0,0,0,0,diagonal_1_flexion,
74788,zzy4uJWf1oWcSbMEHF4RG15ELcU=,bWap/fmf6koGTRDMpejv7/J+ovw=,Nancy Walter,96,4.0,0.0,low_back,1,,3.0,...,0.949153,0,0,0,0,0,0,0,hip_hyperextension,


### 7. Define Column Order and Return Final DataFrame: `order_columns`

Orders the output columns according to the question description:
1. Defines a specific column order for the output
2. Returns the grouped DataFrame with the specified column order
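Selecting with a list of column names both reorders the columns and drops any column that is not listed. A minimal sketch on a hypothetical one-row frame shows the effect; in the real pipeline this is how intermediate columns such as `correct_repeats` and `wrong_repeats`, which are absent from `columns_order`, disappear from the final output.

```python
import pandas as pd

# Hypothetical frame with columns in arbitrary order.
toy = pd.DataFrame(
    {
        "wrong_repeats": [1],
        "session_group": ["s1"],
        "perc_correct_repeats": [0.9],
        "correct_repeats": [9],
    }
)

# List-based selection returns exactly these columns, in this order.
columns_order = ["session_group", "perc_correct_repeats"]
ordered = toy[columns_order]
print(list(ordered.columns))
```

If a name in the list does not exist in the frame, pandas raises a `KeyError`, so a misspelled or missing feature fails at this step rather than silently.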


In [7]:
def order_columns(grouped: pd.DataFrame) -> pd.DataFrame:
    """Order columns in a logical sequence.

    Parameters
    ----------
    grouped : pd.DataFrame
        Session data with all features.

    Returns
    -------
    pd.DataFrame
        Session data with columns in the specified order.
    """
    columns_order = [
        "session_group",
        "patient_id",
        "patient_name",
        "patient_age",
        "pain",
        "fatigue",
        "therapy_name",
        "session_number",
        "leave_session",
        "quality",
        "quality_reason_movement_detection",
        "quality_reason_my_self_personal",
        "quality_reason_other",
        "quality_reason_exercises",
        "quality_reason_tablet",
        "quality_reason_tablet_and_or_motion_trackers",
        "quality_reason_easy_of_use",
        "quality_reason_session_speed",
        "session_is_nok",
        "leave_exercise_system_problem",
        "leave_exercise_other",
        "leave_exercise_unable_perform",
        "leave_exercise_pain",
        "leave_exercise_tired",
        "leave_exercise_technical_issues",
        "leave_exercise_difficulty",
        "prescribed_repeats",
        "training_time",
        "perc_correct_repeats",
        "number_exercises",
        "number_of_distinct_exercises",
        "exercise_with_most_incorrect",
        "first_exercise_skipped",
    ]
    grouped = grouped[columns_order]

    return grouped


df_step_7 = order_columns(df_step_6)
df_step_7

Unnamed: 0,session_group,patient_id,patient_name,patient_age,pain,fatigue,therapy_name,session_number,leave_session,quality,...,leave_exercise_tired,leave_exercise_technical_issues,leave_exercise_difficulty,prescribed_repeats,training_time,perc_correct_repeats,number_exercises,number_of_distinct_exercises,exercise_with_most_incorrect,first_exercise_skipped
0,++//wixk6DpH8NMGvqLqvpzWbzY=,3FDm7kzjNVgmqUPhyODoZpMIIGc=,Taylor Mendez,55,6.0,4.0,shoulder,6,,4.0,...,0,0,0,96,356,0.989583,8,8,shoulder_abduction,
1,++2JgoMe8JGBtUHdsOiLGO8UZ18=,KRnwvSlSa6U62Edl3dHJa0nVM5A=,Danielle Miller,75,4.0,0.0,knee,8,,5.0,...,0,0,0,200,767,0.995,19,10,hip_abduction,
2,++4kUzy7ewH5u7FjNoU8CW6thbY=,euuPYQwygUQC94V0LvmFOzHPoXQ=,Jeremy Randall,70,4.0,2.0,low_back,9,,5.0,...,0,0,0,150,683,0.993333,18,12,hip_hyperextension,
3,++8Q+lFKrp9IKCWsBT0IO0XEV1Y=,R0S8jUhp1lC00zuUYuB0QAnj6as=,Richard Robinson,98,4.0,4.0,low_back,2,,4.0,...,0,0,0,50,279,0.88,7,7,knee_flexion,
4,++9PC4/46Jmrl/PHbzkM1BCPg2g=,Uj48yKb4R53oteHJHRcnzpUjBdY=,Hector Perry,28,2.0,2.0,knee,17,,5.0,...,0,0,0,147,466,0.986395,16,11,airplane,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74785,zzjYItanfFoaxcOT72WuLVdFlpU=,p5r85M4UOxA+ooLsAnpGA091yec=,Brianna George,28,2.0,0.0,shoulder,11,,5.0,...,0,0,0,220,731,0.968182,23,12,diagonal_1_flexion,
74786,zzkMg0F+Nr92E2UwQld7T2c9DxY=,Q8xrPkPRSOYhbjrzTxF77Fp40IA=,Sophia Harris,85,8.0,4.0,neck,13,,5.0,...,0,0,0,142,750,0.992958,18,11,sitting_neck_side_bending,
74787,zzwyhOME0/jCt/TlocGDlnM7Nx4=,4tqtZxzLS9+7QJtKa3pW0beUG3s=,Donna Moreno,34,0.0,0.0,elbow,13,,5.0,...,0,0,0,216,556,0.990741,18,8,diagonal_1_flexion,
74788,zzy4uJWf1oWcSbMEHF4RG15ELcU=,bWap/fmf6koGTRDMpejv7/J+ovw=,Nancy Walter,96,4.0,0.0,low_back,1,,3.0,...,0,0,0,118,693,0.949153,13,7,hip_hyperextension,
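
The selection `grouped[columns_order]` raises a `KeyError` if any listed column is missing from the frame. As a minimal sketch of a more explicit check, the helper below reports the missing names before reordering. The name `reorder_columns_checked` and the toy frame are hypothetical and only serve as illustration; they are not part of the original module.

```python
import pandas as pd


def reorder_columns_checked(frame: pd.DataFrame, columns_order: list) -> pd.DataFrame:
    """Return `frame` with columns in `columns_order`, failing early on missing names."""
    # Collect every requested column that the frame does not contain.
    missing = [col for col in columns_order if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    # Same selection as in `order_columns`, now known to be safe.
    return frame[columns_order]


# Toy data (hypothetical values) with columns in an arbitrary order.
toy = pd.DataFrame({"training_time": [356], "session_group": ["s1"], "pain": [6.0]})
ordered = reorder_columns_checked(toy, ["session_group", "pain", "training_time"])
print(list(ordered.columns))
```

Note that columns not listed in `columns_order` are dropped by the selection, both here and in `order_columns`.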


### 8. Putting it all together - `transform_features_py`

The function was modified to call all of the steps above in sequence and produce the intended features.

In [8]:
def transform_features_py() -> pd.DataFrame:
    """Loads the exercise results and transforms them into features.

    Returns
    -------
    pd.DataFrame
        The transformed features.
    """

    df = load_exercise_data(DATA_DIR)
    grouped = aggregate_session_data(df)
    grouped = calculate_performance_metrics(grouped)
    grouped = add_reason_counts(df, grouped)
    grouped = identify_first_exercise_skipped(df, grouped)
    grouped = identify_most_incorrect_exercise(df, grouped)
    grouped = order_columns(grouped)

    return grouped


df_final = transform_features_py()
df_final

Unnamed: 0,session_group,patient_id,patient_name,patient_age,pain,fatigue,therapy_name,session_number,leave_session,quality,...,leave_exercise_tired,leave_exercise_technical_issues,leave_exercise_difficulty,prescribed_repeats,training_time,perc_correct_repeats,number_exercises,number_of_distinct_exercises,exercise_with_most_incorrect,first_exercise_skipped
0,++//wixk6DpH8NMGvqLqvpzWbzY=,3FDm7kzjNVgmqUPhyODoZpMIIGc=,Taylor Mendez,55,6.0,4.0,shoulder,6,,4.0,...,0,0,0,96,356,0.989583,8,8,shoulder_abduction,
1,++2JgoMe8JGBtUHdsOiLGO8UZ18=,KRnwvSlSa6U62Edl3dHJa0nVM5A=,Danielle Miller,75,4.0,0.0,knee,8,,5.0,...,0,0,0,200,767,0.995,19,10,hip_abduction,
2,++4kUzy7ewH5u7FjNoU8CW6thbY=,euuPYQwygUQC94V0LvmFOzHPoXQ=,Jeremy Randall,70,4.0,2.0,low_back,9,,5.0,...,0,0,0,150,683,0.993333,18,12,hip_hyperextension,
3,++8Q+lFKrp9IKCWsBT0IO0XEV1Y=,R0S8jUhp1lC00zuUYuB0QAnj6as=,Richard Robinson,98,4.0,4.0,low_back,2,,4.0,...,0,0,0,50,279,0.88,7,7,knee_flexion,
4,++9PC4/46Jmrl/PHbzkM1BCPg2g=,Uj48yKb4R53oteHJHRcnzpUjBdY=,Hector Perry,28,2.0,2.0,knee,17,,5.0,...,0,0,0,147,466,0.986395,16,11,airplane,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74785,zzjYItanfFoaxcOT72WuLVdFlpU=,p5r85M4UOxA+ooLsAnpGA091yec=,Brianna George,28,2.0,0.0,shoulder,11,,5.0,...,0,0,0,220,731,0.968182,23,12,diagonal_1_flexion,
74786,zzkMg0F+Nr92E2UwQld7T2c9DxY=,Q8xrPkPRSOYhbjrzTxF77Fp40IA=,Sophia Harris,85,8.0,4.0,neck,13,,5.0,...,0,0,0,142,750,0.992958,18,11,sitting_neck_side_bending,
74787,zzwyhOME0/jCt/TlocGDlnM7Nx4=,4tqtZxzLS9+7QJtKa3pW0beUG3s=,Donna Moreno,34,0.0,0.0,elbow,13,,5.0,...,0,0,0,216,556,0.990741,18,8,diagonal_1_flexion,
74788,zzy4uJWf1oWcSbMEHF4RG15ELcU=,bWap/fmf6koGTRDMpejv7/J+ovw=,Nancy Walter,96,4.0,0.0,low_back,1,,3.0,...,0,0,0,118,693,0.949153,13,7,hip_hyperextension,


## Running from module

Finally, we call the implementation directly from the module.

In [9]:
from message.data import transform_features_py as transform_features_py_mod

df_features_mod = transform_features_py_mod()

In [10]:
df_features_mod.describe()

Unnamed: 0,patient_age,pain,fatigue,session_number,quality,quality_reason_movement_detection,quality_reason_my_self_personal,quality_reason_other,quality_reason_exercises,quality_reason_tablet,...,leave_exercise_unable_perform,leave_exercise_pain,leave_exercise_tired,leave_exercise_technical_issues,leave_exercise_difficulty,prescribed_repeats,training_time,perc_correct_repeats,number_exercises,number_of_distinct_exercises
count,74790.0,68575.0,68573.0,74790.0,68312.0,74790.0,74790.0,74790.0,74790.0,74790.0,...,74790.0,74790.0,74790.0,74790.0,74790.0,74790.0,74790.0,74790.0,74790.0,74790.0
mean,58.582992,2.479533,1.800067,28.571814,4.505475,0.116393,0.067509,0.048696,0.046089,0.013892,...,0.034029,0.022182,0.009039,0.0,0.0,135.370825,592.987458,,15.064233,8.792258
std,23.773241,2.065735,1.996785,55.128341,0.806024,0.320697,0.250903,0.215234,0.209679,0.117045,...,0.319079,0.253819,0.201078,0.0,0.0,53.943975,258.906156,,5.526578,2.418361
min,18.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,,1.0,1.0
25%,38.0,0.0,0.0,5.0,4.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,94.0,397.0,,10.0,7.0
50%,58.0,2.0,2.0,11.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,134.0,569.0,,15.0,9.0
75%,79.0,4.0,2.0,26.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,177.0,757.0,,20.0,11.0
max,99.0,10.0,10.0,812.0,5.0,1.0,1.0,1.0,1.0,1.0,...,13.0,11.0,16.0,0.0,0.0,725.0,5123.0,,54.0,43.0
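
One way to confirm that the module implementation and the step-by-step notebook version agree is to compare the two resulting frames directly. The sketch below assumes both frames contain a `session_group` column that identifies each row; the helper name `frames_match` and the toy frames are hypothetical stand-ins, not part of the original code.

```python
import pandas as pd


def frames_match(left: pd.DataFrame, right: pd.DataFrame, key: str = "session_group") -> bool:
    """Return True when both frames hold the same data, ignoring row and column order."""
    # Sort by the key so that row order does not affect the comparison.
    left_sorted = left.sort_values(key).reset_index(drop=True)
    right_sorted = right.sort_values(key).reset_index(drop=True)
    try:
        # check_like=True ignores the order of columns (and index labels).
        pd.testing.assert_frame_equal(left_sorted, right_sorted, check_like=True)
    except AssertionError:
        return False
    return True


# Toy stand-ins (hypothetical values) for the notebook and module results.
notebook_result = pd.DataFrame({"session_group": ["a", "b"], "number_exercises": [8, 19]})
module_result = pd.DataFrame({"number_exercises": [19, 8], "session_group": ["b", "a"]})
print(frames_match(notebook_result, module_result))
```

In the notebook itself, the equivalent call would be `frames_match(df_final, df_features_mod)`, assuming both variables are still in scope.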
