## Feature Engineering

In this notebook, we take our original dataset and condense its 600,000+ rows into around 2,700 rows that represent the workout programs. From there, we one-hot encode our categorical columns to be understood cleanly by our model, and create the first version of our the description column's vector embeddings. These embeddings will allow the categorization to be very specific to a user's program query.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sentence_transformers import SentenceTransformer

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
huge_data = pd.read_csv('../data/cleaned_600k.csv')
huge_data.drop(columns=['Unnamed: 0'], inplace=True)

In [3]:
goal_map = {
    'Olympic Weightlifting': 'goal_olympic_weightlifting',
    'Muscle & Sculpting': 'goal_muscle_&_sculpting',
    'Bodyweight Fitness': 'goal_bodyweight_fitness',
    'Powerbuilding': 'goal_powerbuilding',
    'Bodybuilding': 'goal_bodybuilding',
    'Powerlifting': 'goal_powerlifting',
    'Athletics': 'goal_athletics'
}

level_map = {
    'Beginner': 'level_beginner',
    'Novice': 'level_novice',
    'Intermediate': 'level_intermediate',
    'Advanced': 'level_advanced'
}

def add_multilabel_onehot(df, col, value_map, prefix):
    exploded = df[[col]].explode(col)
    exploded[col] = exploded[col].apply(lambda x: f"{prefix}{x}")

    one_hot = pd.get_dummies(exploded[col])
    one_hot = one_hot.groupby(exploded.index).sum()

    expected_cols = list(value_map.values())
    for colname in expected_cols:
        if colname not in one_hot.columns:
            one_hot[colname] = 0
    one_hot = one_hot[expected_cols]

    for c in one_hot.columns:
        df[c] = one_hot[c]

In [4]:
# Clean up sets and reps columns, create new columns for the model to learn on
# Sets is average per week, reps is average per exercise

huge_data['is_rep_based'] = huge_data['reps'] > 0
huge_data['reps_count'] = huge_data['reps'].apply(lambda x: x if x > 0 else 0)
huge_data['reps_time'] = huge_data['reps'].apply(lambda x: -x if x < 0 else 0)


# Precompute program_length for each (title, description) pair to avoid repeated lookups
program_length_map = huge_data.drop_duplicates(['title', 'description']) \
    .set_index(['title', 'description'])['program_length'].to_dict()

def per_week(series, title, description):
    program_length = program_length_map.get((title, description), 0)
    return series.sum() / program_length if program_length else 0

# Group by program, aggregate features, and compute sets & reps per week

grouped = huge_data.groupby(['title', 'description'])
program_features = grouped.agg({
    'reps_count': 'mean',   # mean reps per exercise
    'reps_time': 'mean',
    'is_rep_based': 'mean'
}).reset_index()

# Compute sets per week and reps per week
program_features['sets'] = [
    per_week(group['sets'], title, description)
    for (title, description), group in grouped
]
program_features['reps_per_week'] = [
    per_week(group['reps_count'], title, description)
    for (title, description), group in grouped
]

In [5]:
# Extract categorical and numerical features from original dataset
# Aggregate to program-level to ensure one row per program
program_metadata = huge_data.groupby(['title', 'description']).agg({
    'level': 'first',
    'goal': 'first',
    'equipment': 'first',
    'program_length': 'mean',
    'time_per_workout': 'mean',
    'intensity': 'mean'
}).reset_index().reset_index().drop(columns=['index'])

# One-hot encode nested categorical features
categorical_cols = ['level', 'goal', 'equipment']
add_multilabel_onehot(program_metadata, 'level', level_map, 'level_')
add_multilabel_onehot(program_metadata, 'goal', goal_map, 'goal_')

# One hot encode normal categorical feature
ohe = OneHotEncoder(sparse_output=False)
equip_ohe = ohe.fit_transform(program_metadata[['equipment']])

feature_names = ohe.get_feature_names_out(['equipment'])
equip_df = pd.DataFrame(equip_ohe, columns=feature_names, index=program_metadata.index)

program_metadata = program_metadata.join(equip_df)

# Merge back sets and reps columns to rest of the dataset features
program_features = program_features.merge(
    program_metadata,
    on=['title', 'description'],
    how='left'
)

program_features = program_features.drop(columns=['level', 'goal', 'equipment'])
# Combining textual columns for model training
program_features['text'] = program_features['title'] + program_features['description']
program_features


Unnamed: 0,title,description,reps_count,reps_time,is_rep_based,sets,reps_per_week,program_length,time_per_workout,intensity,...,goal_bodyweight_fitness,goal_powerbuilding,goal_bodybuilding,goal_powerlifting,goal_athletics,equipment_at home,equipment_dumbbell only,equipment_full gym,equipment_garage gym,text
0,(mass monster) high intensity 4 day upper lowe...,Build tones of muscular with this high intensi...,9.994624,0.000000,1.000000,53.000000,309.833333,12.0,90.0,8.276882,...,0,0,0,0,0,0.0,0.0,1.0,0.0,(mass monster) high intensity 4 day upper lowe...
1,(not my program)shj jotaro,Build strength and size,7.906250,0.000000,1.000000,76.000000,189.750000,8.0,60.0,7.098958,...,0,0,0,0,0,0.0,0.0,1.0,0.0,(not my program)shj jotaroBuild strength and size
2,1 powerlift per day powerbuilding 5 day bro split,Based off of Andy Baker's KCS (Kingwood Streng...,10.920188,0.000000,1.000000,85.833333,387.666667,6.0,90.0,8.352113,...,0,0,0,0,0,0.0,0.0,1.0,0.0,1 powerlift per day powerbuilding 5 day bro sp...
3,10 week deadlift focus,Increase deadlift,11.988764,0.000000,1.000000,112.300000,426.800000,10.0,80.0,7.365169,...,0,0,0,0,0,0.0,0.0,1.0,0.0,10 week deadlift focusIncrease deadlift
4,10 week mass building program,This workout is designed to increase your musc...,13.792857,0.000000,1.000000,65.000000,386.200000,10.0,70.0,6.460714,...,0,0,0,0,0,0.0,0.0,0.0,1.0,10 week mass building programThis workout is d...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2627,🎧,Lihaskasvu,10.635965,0.000000,1.000000,66.000000,202.083333,12.0,90.0,7.890351,...,0,0,0,0,0,0.0,0.0,0.0,1.0,🎧Lihaskasvu
2628,👾reza's routine👾,This is a beginner friendly routine made for m...,11.200000,0.000000,1.000000,94.000000,560.000000,1.0,60.0,6.600000,...,0,0,0,0,0,0.0,1.0,0.0,0.0,👾reza's routine👾This is a beginner friendly ro...
2629,"🔥 ""upper body dominance: 3-day growth system"" 🔥","""Upper Body Dominance: A science-based 3-day w...",8.906250,0.625000,0.937500,32.000000,142.500000,6.0,60.0,6.750000,...,0,0,0,0,0,0.0,0.0,1.0,0.0,"🔥 ""upper body dominance: 3-day growth system"" ..."
2630,🙈🙉🙊🐵,Muscle Memory Training,10.640777,0.058252,0.995146,72.500000,274.000000,8.0,90.0,8.092233,...,0,0,0,0,0,0.0,0.0,1.0,0.0,🙈🙉🙊🐵Muscle Memory Training


In [6]:
program_features.describe()

Unnamed: 0,reps_count,reps_time,is_rep_based,sets,reps_per_week,program_length,time_per_workout,intensity,level_beginner,level_novice,...,goal_muscle_&_sculpting,goal_bodyweight_fitness,goal_powerbuilding,goal_bodybuilding,goal_powerlifting,goal_athletics,equipment_at home,equipment_dumbbell only,equipment_full gym,equipment_garage gym
count,2632.0,2632.0,2632.0,2632.0,2632.0,2632.0,2632.0,2632.0,2632.0,2632.0,...,2632.0,2632.0,2632.0,2632.0,2632.0,2632.0,2632.0,2632.0,2632.0,2632.0
mean,10.378716,1.04687,0.985826,63.996121,252.533132,8.824468,68.952128,7.86816,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.045593,0.026596,0.712386,0.215426
std,2.136301,11.76719,0.033964,33.601289,128.853524,4.179955,24.324504,0.734808,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.20864,0.160929,0.452736,0.411195
min,3.639706,0.0,0.375,1.0,7.8,1.0,10.0,3.92803,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,9.082108,0.0,0.997293,40.5,165.5,5.0,60.0,7.448607,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,10.301503,0.0,1.0,59.464286,237.708333,8.0,60.0,7.925,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
75%,11.512931,0.023886,1.0,82.0,316.677083,12.0,90.0,8.315242,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
max,30.0,450.5,1.0,336.0,1678.4375,18.0,180.0,10.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0


In [7]:
texts = program_features['text'].to_list()
BATCH_SIZE = 64
embeddings = []

# The device can be changed to 'cuda' or 'cpu' for windows computers or Colab notebooks
model = SentenceTransformer('all-MiniLM-L6-V2', device='mps')

for i in range(0, len(texts), BATCH_SIZE):
    batch = texts[i: i + BATCH_SIZE]
    embedding = model.encode(batch)
    embeddings.append(embedding)

embeddings = np.vstack(embeddings)

# Add embeddings back to features dataframe
embd_cols = [f'embd_{i}' for i in range(embeddings.shape[1])]
embd_df = pd.DataFrame(embeddings, columns=embd_cols, index=program_features.index)
program_features = pd.concat([program_features, embd_df], axis=1)


In [8]:
md_cols = [
    'reps_count', 'reps_time', 'is_rep_based',
    'sets', 'reps_per_week', 'program_length', 'time_per_workout',
    'intensity', 'level_beginner', 'level_novice', 'level_intermediate',
    'level_advanced', 'goal_olympic_weightlifting',
    'goal_muscle_&_sculpting', 'goal_bodyweight_fitness',
    'goal_powerbuilding', 'goal_bodybuilding', 'goal_powerlifting',
    'goal_athletics', 'equipment_at home', 'equipment_dumbbell only',
    'equipment_full gym', 'equipment_garage gym'
]

final_features = program_features[md_cols + embd_cols]

In [9]:
final_features

Unnamed: 0,reps_count,reps_time,is_rep_based,sets,reps_per_week,program_length,time_per_workout,intensity,level_beginner,level_novice,...,embd_374,embd_375,embd_376,embd_377,embd_378,embd_379,embd_380,embd_381,embd_382,embd_383
0,9.994624,0.000000,1.000000,53.000000,309.833333,12.0,90.0,8.276882,0,0,...,-0.045354,0.161003,-0.034845,-0.006268,-0.051289,-0.042462,0.077358,-0.122119,-0.095263,0.018117
1,7.906250,0.000000,1.000000,76.000000,189.750000,8.0,60.0,7.098958,0,0,...,-0.046805,0.009465,-0.058725,-0.005001,-0.027342,0.037221,-0.046139,-0.018821,-0.033384,0.049609
2,10.920188,0.000000,1.000000,85.833333,387.666667,6.0,90.0,8.352113,0,0,...,-0.034850,0.111400,-0.125062,-0.045001,-0.002156,0.005530,0.022983,-0.080369,-0.037186,0.013385
3,11.988764,0.000000,1.000000,112.300000,426.800000,10.0,80.0,7.365169,0,0,...,-0.027893,0.090241,-0.047062,-0.018680,-0.049775,0.069561,0.024157,-0.054875,-0.012202,0.012184
4,13.792857,0.000000,1.000000,65.000000,386.200000,10.0,70.0,6.460714,0,0,...,0.012278,0.118403,-0.048055,-0.030562,0.008969,0.015169,-0.021078,-0.029177,-0.020832,-0.031500
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2627,10.635965,0.000000,1.000000,66.000000,202.083333,12.0,90.0,7.890351,0,0,...,0.047251,0.033072,0.031344,0.048920,0.010956,0.042035,0.112500,0.047732,-0.008581,0.006375
2628,11.200000,0.000000,1.000000,94.000000,560.000000,1.0,60.0,6.600000,0,0,...,0.045862,0.120242,-0.011955,0.050975,-0.000107,0.019679,0.069140,0.043699,-0.054506,0.008461
2629,8.906250,0.625000,0.937500,32.000000,142.500000,6.0,60.0,6.750000,0,0,...,0.003194,0.152046,-0.035946,-0.020077,-0.037656,0.006636,0.087951,-0.003587,-0.079230,0.019457
2630,10.640777,0.058252,0.995146,72.500000,274.000000,8.0,90.0,8.092233,0,0,...,0.081989,0.045929,0.028703,-0.022043,-0.067185,0.085646,0.011623,0.031443,-0.054076,-0.034579


We have now converted our dataset into purely numerical features that we can feed into a clustering model like KMeans to quickly and efficiently cluster the programs. We will also be able to cosine similarity to find the closest programs to the one we search for.