# Question 1

This notebook breaks down the `transform_features_py` function that transforms raw exercise results data into aggregated features.

## Data

### Input Data Structure

The input data contains exercise results with columns including:

| Field                      | Meaning |
| :---                       | :---    |
| `session_exercise_result_id` | Identifier of an exercise performed by a patient in a given moment in time (primary key). Each time a patient performs the same exercise, even if in the same session, it will have a different `session_exercise_result_id`. |
| `session_group`              | Identifier of the physical therapy session in which this exercise was performed. Each time a patient performs a session it will have a different `session_group` (all exercises of the same session will have the same value). |
| `patient_id`              | Identifier of the patient that performed the session  (all sessions of the same patient will have the same value). |
| `patient_name`              | Name of the patient that performed the session (all sessions of the same patient will have the same value). |
| `patient_age`               | Age of the patient that performed the session (all sessions of the same patient will have the same value). |
| `exercise_name`              | Name of the performed exercise. |
| `exercise_side`              | Body side that the exercise regards. |
| `exercise_order`             | Order of the exercise within the session (the first exercise of the session has `order` 1, the second has `order` 2 and so on). |
| `prescribed_repeats`         | Number of repetitions (individual movements) the patient was supposed to perform in this specific exercise. Can be different among two performances of the same exercise in the same session. The exercise finishes when the number of performed repetitions - either correct or wrong - reaches this value. |
| `training_time`              | Time, in seconds, the patient spent performing the exercise |
| `correct_repeats`                         | Number of correct repetitions performed. |
| `wrong_repeats`              | Number of incorrect repetitions performed. |
| `leave_exercise`             | If the patient leaves the exercise before finishing it, this field stores the reason why. If the patient leaves the exercise, he is led into the following exercise in the session. |
| `leave_session`              | If the patient leaves the session before finishing it, this field stores the reason why (all exercises of the same session will have the same value). If the patient leaves the session, no more exercises are performed. |
| `pain`                       | Amount of pain reported by the patient at the end of the session where this exercise was performed, between 0 and 10, where 0 is no pain and 10 is the worst possible pain (all exercises of the same session will have the same value). |
| `fatigue`                    | Amount of fatigue reported by the patient at the end of the session where this exercise was performed, between 0 and 10, where 0 is no fatigue and 10 is the worst possible fatigue (all exercises of the same session will have the same value). |
| `therapy_name`               | Name of the therapy the patient is undertaking (the same for all exercises in the same session). |
| `session_number`               | Session number for that patient. The first session performed by the patient will have `session_number` equal to 1, the second 2, ... (the same for all exercises in the same session). |
| `quality`                    | Score from 1 to 5 reported by the patient at the end of the session when replying to the "How would you rate your experience today?" question (all exercises of the same session will have the same value).|
| `quality_reason_*`                  | Additional context collected when the patient reports a quality below 5 (see image below). Possible values are `movement_detection`, `my_self_personal`, `other`, `exercises`, `tablet`, `tablet_and_or_motion_trackers`, `easy_of_use`, `session_speed` (all exercises of the same session will have the same value). |
| `session_is_nok` | Classification model score on each session (1 corresponds to a `nok` session and 0 to an `ok` session). |

### Output Data Structure

The `transform_features_py` function transforms the data so that each row is indexed by `session_group` and has the following fields:

| Field                           | Meaning |
| :---                            | :---    |
| `session_group`                 | explained above (primary key) |
| `patient_id`                    | explained above |
| `patient_name`                  | explained above |
| `patient_age`                   | explained above |
| `pain`                          | explained above |
| `fatigue`                       | explained above |
| `therapy_name`                  | explained above |
| `session_number`                | explained above |
| `leave_session`                 | explained above |
| `quality`                       | explained above |
| `quality_reason_*`              | explained above |
| `session_is_nok`                | explained above |
| `leave_exercise_*`           | Number of exercises in the session that were left for each reason, one column per reason: `system_problem`, `other`, `unable_perform`, `pain`, `tired`, `technical_issues`, `difficulty`. |
| `prescribed_repeats`            | Total number of repetitions (among all exercises) the patient was supposed to perform. |
| `training_time`                 | Time, in seconds, the patient spent performing the session. |
| `perc_correct_repeats`                       | Percentage of correct repetitions in the session. |
| `number_exercises`                | Number of exercises performed in the session. |  
| `number_of_distinct_exercises`    | Number of distinct exercises performed in the session. |
| `exercise_with_most_incorrect`  | Name of the exercise with the highest number of incorrect movements, if any. If two or more exercises tie for the highest number of incorrect movements, any of them can be picked. |
| `first_exercise_skipped`        | Name of the first skipped exercise, if any. |


## Transformations Breakdown

These are the applied transformations:
1. **Aggregation**: Condenses multiple exercise rows into a single session row
2. **Feature Engineering**: Creates derived metrics like percentage of correct repetitions
3. **Categorization**: Creates indicator columns for reasons exercises were left or quality issues
4. **Problem Detection**: Identifies problematic exercises (most incorrect, first skipped)

## Code Breakdown

The code is split into three modules:
- `io.py`: handles file I/O; for this question specifically, `load_exercise_data`.
- `transform.py`: contains all applicable transformations to the exercise results.
- `data.py`: contains the `transform_features_py` function that leverages the `transform` module to transform the exercise results into features.


## Step by Step Guide

This breaks down the `transform_features_py` function and its components. The function transforms raw exercise results data into aggregated features.

### Quick Overview

`transform_features_py` is the main function that orchestrates the entire process. It is composed of the following functions:

1. `load_exercise_data` - Handles data loading from the parquet file
2. `aggregate_session_data` - Performs the initial groupby and aggregation
3. `calculate_performance_metrics` - Adds calculated metrics like percentage of correct repetitions
4. `add_reason_counts` - Adds columns for counting different reasons for leaving exercises or quality issues
5. `identify_problematic_exercises` - Identifies and adds information about problematic exercises
6. `order_columns` - Orders the columns in a logical sequence

#### Benefits
Each function has a single responsibility and a clear purpose, which allows testing each component separately. It also makes the code easier to maintain and reuse, since it's easier to modify specific parts without affecting the whole. 
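
As an illustration of testing one component separately, here is a minimal sketch of a test for `calculate_performance_metrics`. The function body is restated from the step that introduces it so the block runs on its own; the test name and the two-session toy frame are hypothetical and not part of the original modules.

```python
import pandas as pd


def calculate_performance_metrics(grouped: pd.DataFrame) -> pd.DataFrame:
    # Restated from the notebook so this sketch is self-contained.
    grouped["perc_correct_repeats"] = grouped["correct_repeats"] / (
        grouped["correct_repeats"] + grouped["wrong_repeats"]
    )
    return grouped


def test_perc_correct_repeats() -> None:
    # Hypothetical toy input: two sessions with known totals.
    sessions = pd.DataFrame(
        {"correct_repeats": [9, 5], "wrong_repeats": [1, 5]},
        index=pd.Index(["s1", "s2"], name="session_group"),
    )
    result = calculate_performance_metrics(sessions)
    # 9 / (9 + 1) and 5 / (5 + 5)
    assert result["perc_correct_repeats"].tolist() == [0.9, 0.5]


test_perc_correct_repeats()
```

Because the function only needs a DataFrame with two columns, the test does not have to load the parquet file.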

### 1. Load the Data - `load_exercise_data`

This function reads the `exercise_results.parquet` file from the given directory into a pandas DataFrame.


In [1]:

import pandas as pd
from pathlib import Path

from message.config import DATA_DIR # no it module

def load_exercise_data(data_dir: str | Path) -> pd.DataFrame:
    """Load exercise results data from parquet file.
    
    Parameters
    ----------
    data_dir : str or Path
        Directory containing the exercise results data.
        
    Returns
    -------
    pd.DataFrame
        Raw exercise results data.
    """
    return pd.read_parquet(Path(data_dir, "exercise_results.parquet"))


df = load_exercise_data(DATA_DIR)
df

Unnamed: 0,session_exercise_result_sword_id,session_group,patient_id,therapy_name,exercise_name,exercise_side,exercise_order,prescribed_repeats,training_time,correct_repeats,...,quality_reason_other,quality_reason_exercises,quality_reason_tablet_and_or_motion_trackers,quality_reason_easy_of_use,quality_reason_tablet,quality_reason_session_speed,session_number,session_is_nok,patient_name,patient_age
0,39810278,lg1c88p/9QtkOmQwiwd5stMlmOU=,glRS/3uRDZt6RpmB+LaLyx/a7wk=,low_back,prone_press_ups,center,10,5,35,1,...,0,0,0,0,0,0,318,False,Sonya Berg,76
1,39810303,lg1c88p/9QtkOmQwiwd5stMlmOU=,glRS/3uRDZt6RpmB+LaLyx/a7wk=,low_back,child's_pose,center,12,1,33,1,...,0,0,0,0,0,0,318,False,Sonya Berg,76
2,39810255,lg1c88p/9QtkOmQwiwd5stMlmOU=,glRS/3uRDZt6RpmB+LaLyx/a7wk=,low_back,plank,center,7,3,36,1,...,0,0,0,0,0,0,318,False,Sonya Berg,76
3,39810227,lg1c88p/9QtkOmQwiwd5stMlmOU=,glRS/3uRDZt6RpmB+LaLyx/a7wk=,low_back,pelvic_anterior_posterior_tilt,center,4,20,23,20,...,0,0,0,0,0,0,318,False,Sonya Berg,76
4,39810238,lg1c88p/9QtkOmQwiwd5stMlmOU=,glRS/3uRDZt6RpmB+LaLyx/a7wk=,low_back,pelvic_side_tilt,center,5,20,24,20,...,0,0,0,0,0,0,318,False,Sonya Berg,76
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1126649,40189500,BvM0aISy9U19fgHIIliUzbJ3B9M=,N2dimTApy2aibVhQf3sKNqv+XQ8=,low_back,squat,bilateral,6,8,29,6,...,0,0,0,0,0,0,1,False,Nicole Acevedo,69
1126650,40189427,BvM0aISy9U19fgHIIliUzbJ3B9M=,N2dimTApy2aibVhQf3sKNqv+XQ8=,low_back,side_step,bilateral,5,8,23,8,...,0,0,0,0,0,0,1,False,Nicole Acevedo,69
1126651,40190119,BvM0aISy9U19fgHIIliUzbJ3B9M=,N2dimTApy2aibVhQf3sKNqv+XQ8=,low_back,side_lying_clamshells,right,8,8,98,8,...,0,0,0,0,0,0,1,False,Nicole Acevedo,69
1126652,40190002,BvM0aISy9U19fgHIIliUzbJ3B9M=,N2dimTApy2aibVhQf3sKNqv+XQ8=,low_back,side_lying_clamshells,left,7,8,263,8,...,0,0,0,0,0,0,1,False,Nicole Acevedo,69


### 2. Group and Aggregate Data - `aggregate_session_data`

This step:
1. Groups the data by `session_group`
2. Applies different aggregation functions:
   - `first`: Takes the first value in each group for patient info and session details
   - `sum`: Adds up numerical values across all exercises in the session
   - `count`: Counts the total number of exercises
   - `nunique`: Counts the number of distinct exercises
3. Calls `reset_index` followed by `set_index("session_group")`, a round trip that leaves `session_group` as the DataFrame index
4. Keeps that index so later steps can align per-session results on `session_group`


In [2]:
def aggregate_session_data(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate exercise data by session group.
    
    Parameters
    ----------
    df : pd.DataFrame
        Raw exercise results data.
        
    Returns
    -------
    pd.DataFrame
        Data aggregated by session_group.
    """
    grouped = df.groupby("session_group").agg(
        patient_id=("patient_id", "first"),
        patient_name=("patient_name", "first"),
        patient_age=("patient_age", "first"),
        pain=("pain", "first"),
        fatigue=("fatigue", "first"),
        therapy_name=("therapy_name", "first"),
        session_number=("session_number", "first"),
        leave_session=("leave_session", "first"),
        quality=("quality", "first"),
        session_is_nok=("session_is_nok", "first"),
        prescribed_repeats=("prescribed_repeats", "sum"),
        training_time=("training_time", "sum"),
        correct_repeats=("correct_repeats", "sum"),
        wrong_repeats=("wrong_repeats", "sum"),
        number_exercises=("exercise_name", "count"),
        number_of_distinct_exercises=("exercise_name", "nunique"),
    ).reset_index()
    
    grouped.set_index("session_group", inplace=True)
    return grouped

df_agg = aggregate_session_data(df)
df_agg

session_group,patient_id,patient_name,patient_age,pain,fatigue,therapy_name,session_number,leave_session,quality,session_is_nok,prescribed_repeats,training_time,correct_repeats,wrong_repeats,number_exercises,number_of_distinct_exercises
++//wixk6DpH8NMGvqLqvpzWbzY=,3FDm7kzjNVgmqUPhyODoZpMIIGc=,Taylor Mendez,55,6,4,shoulder,6,,4,True,96,356,95,1,8,8
++2JgoMe8JGBtUHdsOiLGO8UZ18=,KRnwvSlSa6U62Edl3dHJa0nVM5A=,Danielle Miller,75,4,0,knee,8,,5,False,200,767,199,1,19,10
++4kUzy7ewH5u7FjNoU8CW6thbY=,euuPYQwygUQC94V0LvmFOzHPoXQ=,Jeremy Randall,70,4,2,low_back,9,,5,False,150,683,149,1,18,12
++8Q+lFKrp9IKCWsBT0IO0XEV1Y=,R0S8jUhp1lC00zuUYuB0QAnj6as=,Richard Robinson,98,4,4,low_back,2,,4,True,50,279,44,6,7,7
++9PC4/46Jmrl/PHbzkM1BCPg2g=,Uj48yKb4R53oteHJHRcnzpUjBdY=,Hector Perry,28,2,2,knee,17,,5,False,147,466,145,2,16,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zzjYItanfFoaxcOT72WuLVdFlpU=,p5r85M4UOxA+ooLsAnpGA091yec=,Brianna George,28,2,0,shoulder,11,,5,False,220,731,213,7,23,12
zzkMg0F+Nr92E2UwQld7T2c9DxY=,Q8xrPkPRSOYhbjrzTxF77Fp40IA=,Sophia Harris,85,8,4,neck,13,,5,True,142,750,141,1,18,11
zzwyhOME0/jCt/TlocGDlnM7Nx4=,4tqtZxzLS9+7QJtKa3pW0beUG3s=,Donna Moreno,34,0,0,elbow,13,,5,False,216,556,214,2,18,8
zzy4uJWf1oWcSbMEHF4RG15ELcU=,bWap/fmf6koGTRDMpejv7/J+ovw=,Nancy Walter,96,4,0,low_back,1,,3,False,118,693,112,6,13,7
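
To see the named-aggregation pattern used above in isolation, the following sketch applies it to a toy frame. The data and the names `toy` and `toy_agg` are made up for illustration; only the `output_column=(input_column, function)` form is taken from `aggregate_session_data`.

```python
import pandas as pd

# Toy stand-in for the exercise results: two sessions, three exercise rows.
toy = pd.DataFrame(
    {
        "session_group": ["s1", "s1", "s2"],
        "patient_id": ["p1", "p1", "p2"],
        "exercise_name": ["squat", "squat", "plank"],
        "correct_repeats": [8, 6, 3],
    }
)

# Named aggregation: output_column=(input_column, aggregation_function).
toy_agg = toy.groupby("session_group").agg(
    patient_id=("patient_id", "first"),
    correct_repeats=("correct_repeats", "sum"),
    number_exercises=("exercise_name", "count"),
    number_of_distinct_exercises=("exercise_name", "nunique"),
)
print(toy_agg)
```

Session `s1` performs `squat` twice, so it counts as two exercises but one distinct exercise. The group key is already the index of the result, without any further index manipulation.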


### 3. Calculate Percentage of Correct Repetitions - `calculate_performance_metrics`

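Creates a new feature measuring exercise accuracy: the correct repetitions divided by the total repetitions performed (correct plus wrong). Despite the `perc_` prefix, the value is a fraction between 0 and 1, not a number scaled to 100.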

In [3]:
def calculate_performance_metrics(grouped: pd.DataFrame) -> pd.DataFrame:
    """Calculate performance metrics for each session.
    
    Parameters
    ----------
    grouped : pd.DataFrame
        Aggregated session data.
        
    Returns
    -------
    pd.DataFrame
        Session data with performance metrics added.
    """
    grouped["perc_correct_repeats"] = grouped["correct_repeats"] / (grouped["correct_repeats"] + grouped["wrong_repeats"])
    return grouped

df_agg_pct = calculate_performance_metrics(df_agg)
df_agg_pct

session_group,patient_id,patient_name,patient_age,pain,fatigue,therapy_name,session_number,leave_session,quality,session_is_nok,prescribed_repeats,training_time,correct_repeats,wrong_repeats,number_exercises,number_of_distinct_exercises,perc_correct_repeats
++//wixk6DpH8NMGvqLqvpzWbzY=,3FDm7kzjNVgmqUPhyODoZpMIIGc=,Taylor Mendez,55,6,4,shoulder,6,,4,True,96,356,95,1,8,8,0.989583
++2JgoMe8JGBtUHdsOiLGO8UZ18=,KRnwvSlSa6U62Edl3dHJa0nVM5A=,Danielle Miller,75,4,0,knee,8,,5,False,200,767,199,1,19,10,0.995
++4kUzy7ewH5u7FjNoU8CW6thbY=,euuPYQwygUQC94V0LvmFOzHPoXQ=,Jeremy Randall,70,4,2,low_back,9,,5,False,150,683,149,1,18,12,0.993333
++8Q+lFKrp9IKCWsBT0IO0XEV1Y=,R0S8jUhp1lC00zuUYuB0QAnj6as=,Richard Robinson,98,4,4,low_back,2,,4,True,50,279,44,6,7,7,0.88
++9PC4/46Jmrl/PHbzkM1BCPg2g=,Uj48yKb4R53oteHJHRcnzpUjBdY=,Hector Perry,28,2,2,knee,17,,5,False,147,466,145,2,16,11,0.986395
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zzjYItanfFoaxcOT72WuLVdFlpU=,p5r85M4UOxA+ooLsAnpGA091yec=,Brianna George,28,2,0,shoulder,11,,5,False,220,731,213,7,23,12,0.968182
zzkMg0F+Nr92E2UwQld7T2c9DxY=,Q8xrPkPRSOYhbjrzTxF77Fp40IA=,Sophia Harris,85,8,4,neck,13,,5,True,142,750,141,1,18,11,0.992958
zzwyhOME0/jCt/TlocGDlnM7Nx4=,4tqtZxzLS9+7QJtKa3pW0beUG3s=,Donna Moreno,34,0,0,elbow,13,,5,False,216,556,214,2,18,8,0.990741
zzy4uJWf1oWcSbMEHF4RG15ELcU=,bWap/fmf6koGTRDMpejv7/J+ovw=,Nancy Walter,96,4,0,low_back,1,,3,False,118,693,112,6,13,7,0.949153
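
The ratio divides by `correct_repeats + wrong_repeats`, which is zero for a session in which no repetition was recorded. Whether such sessions occur in this data set is not stated here, so the following is only a sketch of one way to make that case explicit. The helper name `safe_perc_correct` is hypothetical.

```python
import pandas as pd


def safe_perc_correct(correct: pd.Series, wrong: pd.Series) -> pd.Series:
    """Fraction of correct repetitions, NaN where no repetition was performed."""
    total = correct + wrong
    # Replace zero totals with NaN before dividing, so those sessions yield NaN.
    return correct / total.where(total > 0)


example = safe_perc_correct(pd.Series([95, 0, 0]), pd.Series([1, 0, 4]))
print(example)
```

The first session gives 95/96, the second has no repetitions and gives NaN, and the third gives 0.0 because all four repetitions were wrong.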


### 4. Process Reasons for Leaving Exercises and Quality Ratings - `add_reason_counts`

This step:
1. Creates one column for each reason a patient might leave an exercise
2. Creates one column for each reason related to quality ratings
3. Counts the occurrences of each reason within each session group
4. Fills missing values with 0, so sessions without a given reason get a count of 0


In [4]:
def add_reason_counts(df: pd.DataFrame, grouped: pd.DataFrame) -> pd.DataFrame:
    """Add counts for different reasons for leaving exercises and quality ratings.
    
    Parameters
    ----------
    df : pd.DataFrame
        Raw exercise results data.
    grouped : pd.DataFrame
        Aggregated session data.
        
    Returns
    -------
    pd.DataFrame
        Session data with reason counts added.
    """
    leave_exercise_reasons = ["system_problem", "other", "unable_perform", "pain", "tired", "technical_issues", "difficulty"]
    quality_reasons = ["movement_detection", "my_self_personal", "other", "exercises", "tablet", "tablet_and_or_motion_trackers", "easy_of_use", "session_speed"]

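    for reason in leave_exercise_reasons:
        column = f"leave_exercise_{reason}"
        # Count matching rows per session; sessions without a match get NaN from index alignment.
        grouped[column] = df[df["leave_exercise"] == reason].groupby("session_group")["leave_exercise"].count()
        # Assign the result back: `fillna(..., inplace=True)` on a selected column is chained assignment.
        grouped[column] = grouped[column].fillna(0)
    for reason in quality_reasons:
        column = f"quality_{reason}"
        # NOTE: `quality` holds the 1 to 5 score, so this comparison never matches and the
        # column is always 0; the per-reason flags are the input's `quality_reason_*` columns.
        grouped[column] = df[df["quality"] == reason].groupby("session_group")["quality"].count()
        grouped[column] = grouped[column].fillna(0)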

    return grouped

df_agg_pct_reasons = add_reason_counts(df, df_agg_pct)
df_agg_pct_reasons

session_group,patient_id,patient_name,patient_age,pain,fatigue,therapy_name,session_number,leave_session,quality,session_is_nok,...,leave_exercise_technical_issues,leave_exercise_difficulty,quality_movement_detection,quality_my_self_personal,quality_other,quality_exercises,quality_tablet,quality_tablet_and_or_motion_trackers,quality_easy_of_use,quality_session_speed
++//wixk6DpH8NMGvqLqvpzWbzY=,3FDm7kzjNVgmqUPhyODoZpMIIGc=,Taylor Mendez,55,6,4,shoulder,6,,4,True,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
++2JgoMe8JGBtUHdsOiLGO8UZ18=,KRnwvSlSa6U62Edl3dHJa0nVM5A=,Danielle Miller,75,4,0,knee,8,,5,False,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
++4kUzy7ewH5u7FjNoU8CW6thbY=,euuPYQwygUQC94V0LvmFOzHPoXQ=,Jeremy Randall,70,4,2,low_back,9,,5,False,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
++8Q+lFKrp9IKCWsBT0IO0XEV1Y=,R0S8jUhp1lC00zuUYuB0QAnj6as=,Richard Robinson,98,4,4,low_back,2,,4,True,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
++9PC4/46Jmrl/PHbzkM1BCPg2g=,Uj48yKb4R53oteHJHRcnzpUjBdY=,Hector Perry,28,2,2,knee,17,,5,False,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zzjYItanfFoaxcOT72WuLVdFlpU=,p5r85M4UOxA+ooLsAnpGA091yec=,Brianna George,28,2,0,shoulder,11,,5,False,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zzkMg0F+Nr92E2UwQld7T2c9DxY=,Q8xrPkPRSOYhbjrzTxF77Fp40IA=,Sophia Harris,85,8,4,neck,13,,5,True,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zzwyhOME0/jCt/TlocGDlnM7Nx4=,4tqtZxzLS9+7QJtKa3pW0beUG3s=,Donna Moreno,34,0,0,elbow,13,,5,False,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zzy4uJWf1oWcSbMEHF4RG15ELcU=,bWap/fmf6koGTRDMpejv7/J+ovw=,Nancy Walter,96,4,0,low_back,1,,3,False,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
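
The input already carries `quality_reason_*` columns, and the input description says they are constant within a session, while `quality` itself is the 1 to 5 score. A possible variant of this step, sketched below, therefore carries those columns over per session instead of comparing `quality` against reason names. It assumes `grouped` is indexed by `session_group` and that the `quality_reason_*` columns are per-session flags; the function name `add_reason_counts_sketch` and the toy data are hypothetical.

```python
import pandas as pd

LEAVE_EXERCISE_REASONS = [
    "system_problem", "other", "unable_perform", "pain",
    "tired", "technical_issues", "difficulty",
]


def add_reason_counts_sketch(df: pd.DataFrame, grouped: pd.DataFrame) -> pd.DataFrame:
    out = grouped.copy()
    for reason in LEAVE_EXERCISE_REASONS:
        # Number of exercises left for this reason, per session.
        counts = df.loc[df["leave_exercise"] == reason].groupby("session_group").size()
        # Align on the session index; sessions without the reason get 0.
        out[f"leave_exercise_{reason}"] = counts.reindex(out.index, fill_value=0)
    # Assumption: quality_reason_* values are identical for all rows of a session.
    quality_cols = [c for c in df.columns if c.startswith("quality_reason_")]
    if quality_cols:
        out = out.join(df.groupby("session_group")[quality_cols].first())
    return out


toy_df = pd.DataFrame(
    {
        "session_group": ["s1", "s1", "s2"],
        "leave_exercise": ["pain", "pain", None],
        "quality_reason_tablet": [1, 1, 0],
    }
)
toy_grouped = pd.DataFrame(index=pd.Index(["s1", "s2"], name="session_group"))
toy_result = add_reason_counts_sketch(toy_df, toy_grouped)
print(toy_result[["leave_exercise_pain", "quality_reason_tablet"]])
```

Using `reindex(..., fill_value=0)` keeps the counts as integers, and the resulting column names keep the `quality_reason_` prefix that the column-ordering step looks for.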


### 5. Identify Most Incorrect and First Skipped Exercises - `identify_problematic_exercises`

This step finds the exercise with the most incorrect repetitions in each session by:
1. Filtering for exercises with wrong repetitions
2. Finding the exercise with the maximum wrong repetitions for each session group
3. Merging this information back into the grouped DataFrame
4. Renaming the column to `exercise_with_most_incorrect`

It then finds the first exercise skipped in each session by:
1. Filtering for exercises where `leave_exercise` is not null (skipped exercises)
2. Sorting by session group and exercise order
3. Getting the first skipped exercise for each session group
4. Merging this information back into the grouped DataFrame
5. Renaming the column to `first_exercise_skipped`

> This function could be split into two. It is kept as one because the two features are conceptually related and both are extras.

In [5]:
def identify_problematic_exercises(df: pd.DataFrame, grouped: pd.DataFrame) -> pd.DataFrame:
    """Identify exercises with problems (most incorrect, first skipped).
    
    Parameters
    ----------
    df : pd.DataFrame
        Raw exercise results data.
    grouped : pd.DataFrame
        Aggregated session data.
        
    Returns
    -------
    pd.DataFrame
        Session data with problematic exercise information added.
    """
    df_nonzero_wrong = df[df["wrong_repeats"] > 0]
    if not df_nonzero_wrong.empty:
        most_incorrect = df_nonzero_wrong.loc[df_nonzero_wrong.groupby("session_group")["wrong_repeats"].idxmax(), 
                                            ["session_group", "exercise_name"]]
    else:
        most_incorrect = pd.DataFrame(columns=["session_group", "exercise_name"])  # Empty DataFrame to merge

    grouped = grouped.merge(most_incorrect, on="session_group", how="left")
    grouped.rename(columns={"exercise_name": "exercise_with_most_incorrect"}, inplace=True)

    skipped_exercises = df[df["leave_exercise"].notnull()].sort_values(by=["session_group", "exercise_order"])
    first_skipped = skipped_exercises.groupby("session_group").first().reset_index()[["session_group", "exercise_name"]]
    grouped = grouped.merge(first_skipped, on="session_group", how="left")
    grouped.rename(columns={"exercise_name": "first_exercise_skipped"}, inplace=True)
    return grouped
    
df_agg_pct_reasons_prob = identify_problematic_exercises(df, df_agg_pct_reasons)
df_agg_pct_reasons_prob

Unnamed: 0,session_group,patient_id,patient_name,patient_age,pain,fatigue,therapy_name,session_number,leave_session,quality,...,quality_movement_detection,quality_my_self_personal,quality_other,quality_exercises,quality_tablet,quality_tablet_and_or_motion_trackers,quality_easy_of_use,quality_session_speed,exercise_with_most_incorrect,first_exercise_skipped
0,++//wixk6DpH8NMGvqLqvpzWbzY=,3FDm7kzjNVgmqUPhyODoZpMIIGc=,Taylor Mendez,55,6,4,shoulder,6,,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,shoulder_abduction,
1,++2JgoMe8JGBtUHdsOiLGO8UZ18=,KRnwvSlSa6U62Edl3dHJa0nVM5A=,Danielle Miller,75,4,0,knee,8,,5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,hip_abduction,
2,++4kUzy7ewH5u7FjNoU8CW6thbY=,euuPYQwygUQC94V0LvmFOzHPoXQ=,Jeremy Randall,70,4,2,low_back,9,,5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,hip_hyperextension,
3,++8Q+lFKrp9IKCWsBT0IO0XEV1Y=,R0S8jUhp1lC00zuUYuB0QAnj6as=,Richard Robinson,98,4,4,low_back,2,,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,knee_flexion,
4,++9PC4/46Jmrl/PHbzkM1BCPg2g=,Uj48yKb4R53oteHJHRcnzpUjBdY=,Hector Perry,28,2,2,knee,17,,5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,hip_hyperextension,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74785,zzjYItanfFoaxcOT72WuLVdFlpU=,p5r85M4UOxA+ooLsAnpGA091yec=,Brianna George,28,2,0,shoulder,11,,5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,diagonal_1_flexion,
74786,zzkMg0F+Nr92E2UwQld7T2c9DxY=,Q8xrPkPRSOYhbjrzTxF77Fp40IA=,Sophia Harris,85,8,4,neck,13,,5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,sitting_neck_side_bending,
74787,zzwyhOME0/jCt/TlocGDlnM7Nx4=,4tqtZxzLS9+7QJtKa3pW0beUG3s=,Donna Moreno,34,0,0,elbow,13,,5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,standing_row,
74788,zzy4uJWf1oWcSbMEHF4RG15ELcU=,bWap/fmf6koGTRDMpejv7/J+ovw=,Nancy Walter,96,4,0,low_back,1,,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,hip_hyperextension,
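
For reference, the split mentioned in the note before this cell could look roughly like the sketch below, where each helper returns a Series indexed by `session_group`. The helper names and the toy data are hypothetical, and the sketch assumes the raw frame has a unique row index so that `idxmax` labels select single rows.

```python
import pandas as pd


def most_incorrect_exercise(df: pd.DataFrame) -> pd.Series:
    """Exercise with the most wrong repetitions per session (sessions with none are absent)."""
    wrong = df[df["wrong_repeats"] > 0]
    if wrong.empty:
        return pd.Series(dtype="object", name="exercise_with_most_incorrect")
    # Row label of the maximum wrong_repeats within each session.
    idx = wrong.groupby("session_group")["wrong_repeats"].idxmax()
    picked = wrong.loc[idx].set_index("session_group")["exercise_name"]
    return picked.rename("exercise_with_most_incorrect")


def first_skipped_exercise(df: pd.DataFrame) -> pd.Series:
    """First skipped exercise per session, by exercise_order (sessions with none are absent)."""
    skipped = df[df["leave_exercise"].notna()].sort_values(["session_group", "exercise_order"])
    return skipped.groupby("session_group")["exercise_name"].first().rename("first_exercise_skipped")


toy_rows = pd.DataFrame(
    {
        "session_group": ["s1", "s1", "s1", "s2"],
        "exercise_name": ["squat", "plank", "lunge", "bridge"],
        "exercise_order": [2, 3, 1, 1],
        "wrong_repeats": [2, 5, 0, 0],
        "leave_exercise": [None, "pain", "tired", None],
    }
)
most = most_incorrect_exercise(toy_rows)
skipped_first = first_skipped_exercise(toy_rows)
print(most)
print(skipped_first)
```

In the toy data, session `s1` has `plank` as its most incorrect exercise and `lunge` as its first skipped one, while `s2` appears in neither result and would end up with missing values after a left join.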


### 6. Define Column Order and Return Final DataFrame - `order_columns`

Orders the output columns according to the question description:
1. Defines a specific column order for the output
2. Uses the asterisk (`*`) to unpack lists of columns that match certain patterns
3. Returns the grouped DataFrame with the specified column order


In [6]:
def order_columns(grouped: pd.DataFrame) -> pd.DataFrame:
    """Order columns in a logical sequence.
    
    Parameters
    ----------
    grouped : pd.DataFrame
        Session data with all features.
        
    Returns
    -------
    pd.DataFrame
        Session data with columns in the specified order.
    """
    columns_order = [
        "session_group",
        "patient_id",
        "patient_name",
        "patient_age",
        "pain",
        "fatigue",
        "therapy_name",
        "session_number",
        "leave_session",
        "quality",
        * grouped.columns[grouped.columns.str.startswith("quality_reason_")],
        "session_is_nok",
        * grouped.columns[grouped.columns.str.startswith("leave_exercise_")],
        "prescribed_repeats",
        "training_time",
        "perc_correct_repeats",
        "number_exercises",
        "number_of_distinct_exercises",
        "exercise_with_most_incorrect",
        "first_exercise_skipped",
    ]
    return grouped[columns_order]

df_features = order_columns(df_agg_pct_reasons_prob)
df_features

Unnamed: 0,session_group,patient_id,patient_name,patient_age,pain,fatigue,therapy_name,session_number,leave_session,quality,...,leave_exercise_tired,leave_exercise_technical_issues,leave_exercise_difficulty,prescribed_repeats,training_time,perc_correct_repeats,number_exercises,number_of_distinct_exercises,exercise_with_most_incorrect,first_exercise_skipped
0,++//wixk6DpH8NMGvqLqvpzWbzY=,3FDm7kzjNVgmqUPhyODoZpMIIGc=,Taylor Mendez,55,6,4,shoulder,6,,4,...,0.0,0.0,0.0,96,356,0.989583,8,8,shoulder_abduction,
1,++2JgoMe8JGBtUHdsOiLGO8UZ18=,KRnwvSlSa6U62Edl3dHJa0nVM5A=,Danielle Miller,75,4,0,knee,8,,5,...,0.0,0.0,0.0,200,767,0.995,19,10,hip_abduction,
2,++4kUzy7ewH5u7FjNoU8CW6thbY=,euuPYQwygUQC94V0LvmFOzHPoXQ=,Jeremy Randall,70,4,2,low_back,9,,5,...,0.0,0.0,0.0,150,683,0.993333,18,12,hip_hyperextension,
3,++8Q+lFKrp9IKCWsBT0IO0XEV1Y=,R0S8jUhp1lC00zuUYuB0QAnj6as=,Richard Robinson,98,4,4,low_back,2,,4,...,0.0,0.0,0.0,50,279,0.88,7,7,knee_flexion,
4,++9PC4/46Jmrl/PHbzkM1BCPg2g=,Uj48yKb4R53oteHJHRcnzpUjBdY=,Hector Perry,28,2,2,knee,17,,5,...,0.0,0.0,0.0,147,466,0.986395,16,11,hip_hyperextension,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74785,zzjYItanfFoaxcOT72WuLVdFlpU=,p5r85M4UOxA+ooLsAnpGA091yec=,Brianna George,28,2,0,shoulder,11,,5,...,0.0,0.0,0.0,220,731,0.968182,23,12,diagonal_1_flexion,
74786,zzkMg0F+Nr92E2UwQld7T2c9DxY=,Q8xrPkPRSOYhbjrzTxF77Fp40IA=,Sophia Harris,85,8,4,neck,13,,5,...,0.0,0.0,0.0,142,750,0.992958,18,11,sitting_neck_side_bending,
74787,zzwyhOME0/jCt/TlocGDlnM7Nx4=,4tqtZxzLS9+7QJtKa3pW0beUG3s=,Donna Moreno,34,0,0,elbow,13,,5,...,0.0,0.0,0.0,216,556,0.990741,18,8,standing_row,
74788,zzy4uJWf1oWcSbMEHF4RG15ELcU=,bWap/fmf6koGTRDMpejv7/J+ovw=,Nancy Walter,96,4,0,low_back,1,,3,...,0.0,0.0,0.0,118,693,0.949153,13,7,hip_hyperextension,
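
The prefix-based unpacking used in `order_columns` can be tried on a small frame, as in the sketch below. The frame and variable names are made up for illustration. One property worth noting is that a prefix with no matching columns contributes nothing to the list and raises no error.

```python
import pandas as pd

# Empty toy frame; only the column names matter here.
frame = pd.DataFrame(
    columns=["leave_exercise_pain", "quality", "session_group", "leave_exercise_tired"]
)

ordered = [
    "session_group",
    "quality",
    # Unpack every column whose name starts with the prefix, in existing order.
    *frame.columns[frame.columns.str.startswith("leave_exercise_")],
]
reordered = frame[ordered]

# A prefix that matches no column silently yields an empty selection.
no_match = list(frame.columns[frame.columns.str.startswith("quality_reason_")])

print(ordered)
print(no_match)
```

Since a non-matching prefix is silent, a check such as asserting that the selection is non-empty may be useful when a group of columns is expected in the output.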


### 7. Putting It All Together - `transform_features_py`

The function calls all of the steps above in sequence to produce the intended features.

In [7]:
def transform_features_py() -> pd.DataFrame:
    """Loads the exercise results and transforms them into features.
    
    Returns
    -------
    pd.DataFrame
        The transformed features.
    """

    df = load_exercise_data(DATA_DIR)
    grouped = aggregate_session_data(df)
    grouped = calculate_performance_metrics(grouped)
    grouped = add_reason_counts(df, grouped)
    grouped = identify_problematic_exercises(df, grouped)
    grouped = order_columns(grouped)

    return grouped

df_final = transform_features_py()
df_final


Unnamed: 0,session_group,patient_id,patient_name,patient_age,pain,fatigue,therapy_name,session_number,leave_session,quality,...,leave_exercise_tired,leave_exercise_technical_issues,leave_exercise_difficulty,prescribed_repeats,training_time,perc_correct_repeats,number_exercises,number_of_distinct_exercises,exercise_with_most_incorrect,first_exercise_skipped
0,++//wixk6DpH8NMGvqLqvpzWbzY=,3FDm7kzjNVgmqUPhyODoZpMIIGc=,Taylor Mendez,55,6,4,shoulder,6,,4,...,0.0,0.0,0.0,96,356,0.989583,8,8,shoulder_abduction,
1,++2JgoMe8JGBtUHdsOiLGO8UZ18=,KRnwvSlSa6U62Edl3dHJa0nVM5A=,Danielle Miller,75,4,0,knee,8,,5,...,0.0,0.0,0.0,200,767,0.995,19,10,hip_abduction,
2,++4kUzy7ewH5u7FjNoU8CW6thbY=,euuPYQwygUQC94V0LvmFOzHPoXQ=,Jeremy Randall,70,4,2,low_back,9,,5,...,0.0,0.0,0.0,150,683,0.993333,18,12,hip_hyperextension,
3,++8Q+lFKrp9IKCWsBT0IO0XEV1Y=,R0S8jUhp1lC00zuUYuB0QAnj6as=,Richard Robinson,98,4,4,low_back,2,,4,...,0.0,0.0,0.0,50,279,0.88,7,7,knee_flexion,
4,++9PC4/46Jmrl/PHbzkM1BCPg2g=,Uj48yKb4R53oteHJHRcnzpUjBdY=,Hector Perry,28,2,2,knee,17,,5,...,0.0,0.0,0.0,147,466,0.986395,16,11,hip_hyperextension,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74785,zzjYItanfFoaxcOT72WuLVdFlpU=,p5r85M4UOxA+ooLsAnpGA091yec=,Brianna George,28,2,0,shoulder,11,,5,...,0.0,0.0,0.0,220,731,0.968182,23,12,diagonal_1_flexion,
74786,zzkMg0F+Nr92E2UwQld7T2c9DxY=,Q8xrPkPRSOYhbjrzTxF77Fp40IA=,Sophia Harris,85,8,4,neck,13,,5,...,0.0,0.0,0.0,142,750,0.992958,18,11,sitting_neck_side_bending,
74787,zzwyhOME0/jCt/TlocGDlnM7Nx4=,4tqtZxzLS9+7QJtKa3pW0beUG3s=,Donna Moreno,34,0,0,elbow,13,,5,...,0.0,0.0,0.0,216,556,0.990741,18,8,standing_row,
74788,zzy4uJWf1oWcSbMEHF4RG15ELcU=,bWap/fmf6koGTRDMpejv7/J+ovw=,Nancy Walter,96,4,0,low_back,1,,3,...,0.0,0.0,0.0,118,693,0.949153,13,7,hip_hyperextension,


## Running from module

Finally, we call the implementation directly from the module.

In [8]:

df_features_mod = transform_features_py()


In [9]:
df_features_mod.describe()

Unnamed: 0,patient_age,pain,fatigue,session_number,quality,leave_exercise_system_problem,leave_exercise_other,leave_exercise_unable_perform,leave_exercise_pain,leave_exercise_tired,leave_exercise_technical_issues,leave_exercise_difficulty,prescribed_repeats,training_time,perc_correct_repeats,number_exercises,number_of_distinct_exercises
count,74790.0,68575.0,68573.0,74790.0,68312.0,74790.0,74790.0,74790.0,74790.0,74790.0,74790.0,74790.0,74790.0,74790.0,74790.0,74790.0,74790.0
mean,58.582992,2.479533,1.800067,28.571814,4.505475,0.059126,0.047493,0.034029,0.022182,0.009039,0.0,0.0,135.370825,592.987458,,15.064233,8.792258
std,23.773241,2.065735,1.996785,55.128341,0.806024,0.429565,0.486721,0.319079,0.253819,0.201078,0.0,0.0,53.943975,258.906156,,5.526578,2.418361
min,18.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,,1.0,1.0
25%,38.0,0.0,0.0,5.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,94.0,397.0,,10.0,7.0
50%,58.0,2.0,2.0,11.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,134.0,569.0,,15.0,9.0
75%,79.0,4.0,2.0,26.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,177.0,757.0,,20.0,11.0
max,99.0,10.0,10.0,812.0,5.0,26.0,26.0,13.0,11.0,16.0,0.0,0.0,725.0,5123.0,,54.0,43.0
