# Overview

This notebook cleans each table individually, merges all tables into a single comprehensive dataframe, and finally flattens that dataframe so that each row represents one session execution.

**Author**: Oscar Javier Bastidas Jossa. 

**Email**: oscar.jossa@deusto.es.

In [1]:
import pandas as pd
import numpy as np

from utilities import Data_cleaning
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 500)

# Data Cleaning - Stage 1

Initially, we conduct a primary check for NaN values. Following this, we adjust the data types for the columns as necessary. Finally, we examine the data for any duplicated entries.
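
The three checks described above can be sketched with plain pandas calls. This is a minimal illustration, not the notebook's own implementation: the toy dataframe and its column names are invented and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy table (hypothetical data) with missing values, a duplicated row
# and a timestamp column stored as text.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "score": [5.0, np.nan, np.nan, 10.0],
    "created_at": ["2021-10-29 13:07:01", "2021-10-30 08:00:00",
                   "2021-10-30 08:00:00", "2021-11-01 09:30:00"],
})

# 1) Count NaN values per column.
nan_counts = toy.isna().sum()

# 2) Adjust data types where needed.
toy["created_at"] = pd.to_datetime(toy["created_at"])

# 3) Count fully duplicated rows (NaN values in the same position compare as equal).
n_duplicates = int(toy.duplicated().sum())

print(nan_counts)
print(toy.dtypes)
print(n_duplicates)
```

Summarizing duplicates as a single count is easier to read than printing the full boolean Series for tables with hundreds of thousands of rows.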

In [2]:
def run_individual_table_cleaner(df,
                                 change_data_type_columns=None,
                                 columns_to_check_units=None):
    """
        Create a Data_cleaning instance for the given dataframe and run its checks.

        Parameters:
        df (dataframe): The dataframe passed for each table.
        change_data_type_columns (dict): A dictionary specifying the data types to change for some columns.
        columns_to_check_units (list): A list of columns whose units are checked.

        Returns:
        The dataframe with the data types of the specified columns changed.
    """

    # Use None as default to avoid sharing mutable default arguments between calls
    change_data_type_columns = change_data_type_columns or {}
    columns_to_check_units = columns_to_check_units or []

    # Create an instance of the Data_cleaning class
    cleaner = Data_cleaning(df)

    # Convert the data types of the indicated columns
    df = cleaner.converting_data_types(change_data_type_columns)

    # Check data types
    cleaner.check_data_type()

    # Check for duplicated rows
    cleaner.check_duplicated()

    # Check the units of the indicated columns
    cleaner.check_unit_columns(columns_to_check_units)

    return df
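
The `utilities` module that provides `Data_cleaning` is not shown in this notebook. The sketch below is a hypothetical stand-in, written only from the method names and printed outputs that appear here. The class name `DataCleaningSketch`, the conversion rules and the sample data are all assumptions, and the real class may behave differently.

```python
import pandas as pd


class DataCleaningSketch:
    """Hypothetical stand-in for utilities.Data_cleaning (not the original code)."""

    def __init__(self, df):
        self.df = df

    def converting_data_types(self, columns_by_type):
        # Keys mirror the ones used in this notebook:
        # 'int_cols', 'float_cols', 'datetime', 'boolean'.
        for kind, cols in columns_by_type.items():
            for col in cols:
                if kind in ("int_cols", "float_cols"):
                    # Unparseable values become NaN, so integer columns
                    # containing gaps end up as float64.
                    self.df[col] = pd.to_numeric(self.df[col], errors="coerce")
                elif kind == "datetime":
                    self.df[col] = pd.to_datetime(self.df[col], errors="coerce")
                elif kind == "boolean":
                    self.df[col] = self.df[col].astype(bool)
        return self.df

    def check_data_type(self):
        print(self.df.dtypes)

    def check_duplicated(self):
        # Print a single count instead of the full boolean Series.
        print(int(self.df.duplicated().sum()))

    def check_unit_columns(self, columns):
        for col in columns:
            print(self.df[col].value_counts())


# Hypothetical sample data to exercise the sketch.
sample = pd.DataFrame({"rating": ["5", "x", "1"], "done": [1, 0, 1]})
cleaner = DataCleaningSketch(sample)
converted = cleaner.converting_data_types({"int_cols": ["rating"], "boolean": ["done"]})
cleaner.check_data_type()
cleaner.check_duplicated()
cleaner.check_unit_columns(["rating"])
```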

## session executions summary

In [3]:
# Loading data
import ast

session_executions_summary = pd.read_csv('../data/session_execution_summaries.csv', 
                                 on_bad_lines='skip', # skip bad lines without raising or warning when they are encountered.
                                 escapechar='\\',
                                 header = None)
session_executions_summary = session_executions_summary.replace('N', '0')
session_executions_summary[6] = session_executions_summary[6].apply(ast.literal_eval)
session_executions_summary[7] = session_executions_summary[7].apply(ast.literal_eval)
session_executions_summary[8] = session_executions_summary[8].apply(ast.literal_eval)
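
Columns 6 to 8 are read as text and then parsed with `ast.literal_eval`. The sketch below shows the idea on invented values, under the assumption that each cell holds a Python literal such as a list. The actual cell contents are not shown in this notebook.

```python
import ast

import pandas as pd

# Hypothetical raw values: each cell holds a Python list literal as text.
raw = pd.DataFrame({6: ["[10, 12, 8]", "[5, 5]"]})

# literal_eval only accepts Python literals (lists, dicts, numbers, strings...),
# so it cannot run arbitrary code the way eval() could.
raw[6] = raw[6].apply(ast.literal_eval)

first_cell = raw.loc[0, 6]
print(type(first_cell), sum(first_cell))
```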

In [4]:
# assigning the headers
session_executions_summary.columns = ['id', 'session_execution_id', 'total_reps', 
                              'total_time', 'reps_per_min',
                              'total_kcal', 'reps_per_exercise',
                              'secs_per_exercise', 'reps_per_min_per_exercise', 'created_at', 'updated_at',
                              'reps_set_per_block', 'time_set_per_block', 'average_reps_min_set_per_block',
                              'reps_min_set_block', 'effort', 'points', 'value_of_session', 'body_parts_spider', 'name']

session_executions_summary = session_executions_summary.drop(['body_parts_spider'], axis = 1)

# First check of table
change_data_type_columns = {'int_cols' : ['difficulty_feedback', 'enjoyment_feedback', 'discard_reason'],
                            'float_cols' : ['execution_time', 'reps_executed'],
                            'datetime' : ['created_at','updated_at'],
                            'boolean': ['discarded', 'imported']}
columns_to_check_units = ['difficulty_feedback', 'enjoyment_feedback']

# session_executions_summary = run_individual_table_cleaner(session_executions_summary, 
#                                                          change_data_type_columns,
#                                                          columns_to_check_units)

In [5]:
session_executions_summary.shape

(734011, 19)
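
The loading cells use `on_bad_lines='skip'`, which drops malformed lines silently. To see how many lines are lost, pandas (1.4 or later) also accepts a callable for `on_bad_lines` when `engine="python"` is used. The following is a minimal sketch on an invented CSV string. The helper name `remember_bad_line` is hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV text: the third data line has one field too many.
csv_text = "1,a,2021-01-01\n2,b,2021-01-02\n3,c,2021-01-03,extra\n4,d,2021-01-04\n"

skipped = []


def remember_bad_line(fields):
    # Called by pandas with the list of fields of each malformed line.
    skipped.append(fields)
    return None  # returning None drops the line, like on_bad_lines='skip'


table = pd.read_csv(io.StringIO(csv_text), header=None,
                    on_bad_lines=remember_bad_line, engine="python")

print(len(table), len(skipped))
```

The Python engine is slower than the default C engine, so for files of this size it may be preferable to run such a check once rather than on every load.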

## session executions

In [6]:
# Loading data
session_executions = pd.read_csv('../data/session_executions.csv', 
                                 on_bad_lines='skip', # skip bad lines without raising or warning when they are encountered. 
                                 header = None)

# assigning the headers
session_executions.columns = ['id', 'scheduled_at', 'user_program_id', 
                              'difficulty_feedback', 'enjoyment_feedback',
                              'feedback_comment', 'reps_executed',
                              'execution_time', 'order', 'created_at',
                              'updated_at', 'front_end_id', 'session_id',
                              'discarded', 'discard_reason', 'imported']

session_executions = session_executions.drop(['scheduled_at', 'feedback_comment',
                                              'order', 'front_end_id'], axis = 1)

# First check of table
change_data_type_columns = {'int_cols' : ['difficulty_feedback', 'enjoyment_feedback', 'discard_reason'],
                            'float_cols' : ['execution_time', 'reps_executed'],
                            'datetime' : ['created_at','updated_at'],
                            'boolean': ['discarded', 'imported']}
columns_to_check_units = ['difficulty_feedback', 'enjoyment_feedback']

session_executions = run_individual_table_cleaner(session_executions, 
                                                  change_data_type_columns,
                                                  columns_to_check_units)


 --- converting_data_types method executed --- 

converting int_cols
converting float_cols
converting datetime
converting boolean

 --- check_data_type method executed --- 

id                              int64
user_program_id                 int64
difficulty_feedback           float64
enjoyment_feedback            float64
reps_executed                 float64
execution_time                float64
created_at             datetime64[ns]
updated_at             datetime64[ns]
session_id                      int64
discarded                        bool
discard_reason                float64
imported                         bool
dtype: object

 --- check_duplicate method executed --- 

0         False
1         False
2         False
3         False
4         False
          ...  
738683    False
738684    False
738685    False
738686    False
738687    False
Length: 738688, dtype: bool

 --- check_unit_columns method executed --- 

5.0     423691
10.0     95045
1.0      52418
6.0       7788


## user_programs

In [7]:
# Loading data
user_programs = pd.read_csv('../data/user_programs.csv', on_bad_lines='skip', low_memory=False)

# Dropping some columns
user_programs = user_programs.drop(['enjoyment_notes'], axis = 1)

# First check of table
change_data_type_columns = {
    'datetime' : ['created_at','updated_at'],
    'boolean': ['active', 'completed']
}

user_programs = run_individual_table_cleaner(user_programs, 
                                  change_data_type_columns)



 --- converting_data_types method executed --- 

converting datetime
converting boolean

 --- check_data_type method executed --- 

id                             int64
user_id                        int64
program_id                     int64
created_at            datetime64[ns]
updated_at            datetime64[ns]
active                          bool
current_session_id           float64
completed                       bool
enjoyment                    float64
dtype: object

 --- check_duplicate method executed --- 

0        False
1        False
2        False
3        False
4        False
         ...  
81316    False
81317    False
81318    False
81319    False
81320    False
Length: 81321, dtype: bool

 --- check_unit_columns method executed --- 



## Users

In [8]:
users = pd.read_csv('../data/users.csv', low_memory=False)
users = users.drop(['email', 'encrypted_password', 
                     'reset_password_token','reset_password_sent_at',
                     'remember_created_at','is_admin','names', 'last_name',
                     'current_sign_in_ip', 'last_sign_in_ip', 
                     'recover_password_code','recover_password_attempts', 
                     'facebook_uid','workout_setting_voice_coach', 'workout_setting_sound',
                     'workout_setting_vibration', 'workout_setting_mobility',
                     'workout_setting_cardio_warmup', 'workout_setting_countdown',
                     'google_uid','t1_push','t1_core', 
                     't1_legs', 't1_full', 't1_push_exercise', 
                     't1_pull_up','t2_reps', 't2_steps', 
                     't2_reps_push', 't2_reps_core', 't2_reps_legs',
                     't2_reps_full', 't2_time_push', 't2_time_core',
                     't2_time_legs', 't2_time_full', 't1_full_exercise', 
                     't1_pull_up_exercise','warmup_setting', 
                     'warmup_session_id', 'stripe_id', 'provider', 'uid',
                     'affiliate_code', 'moengage_id', 'mix_panel_id',
                     'apple_id_token','platform', 'login_token', 'current_sign_in_at', 
                     'last_sign_in_at','login_token_generated_at', 'current_weekly_streak'], 
                      axis = 1)

# First check of table
change_data_type_columns = {'datetime' : ['created_at', 'updated_at', 'date_of_birth'],
                            'boolean': ['gender', 'newsletter_subscription',
                                        'notifications_setting', 'imported', 
                                        'scientific_data_usage',
                                        'affiliate_code_signup']}

columns_to_check_units = ['height', 'weight', 'activity_level',
                          'goal', 'body_type', 
                          'body_fat', 'training_days_setting',
                          'best_weekly_streak']

users = run_individual_table_cleaner(users, 
                                     change_data_type_columns,
                                     columns_to_check_units)


 --- converting_data_types method executed --- 

converting datetime
converting boolean

 --- check_data_type method executed --- 

id                                  int64
created_at                 datetime64[ns]
updated_at                 datetime64[ns]
gender                               bool
date_of_birth              datetime64[ns]
height                            float64
weight                            float64
activity_level                      int64
goal                                int64
body_type                           int64
body_fat                          float64
newsletter_subscription              bool
sign_in_count                       int64
notifications_setting                bool
training_days_setting               int64
language                           object
country                            object
points                              int64
scientific_data_usage                bool
best_weekly_streak                  int64
affiliate_code_signup      

Be careful when analyzing the data of some users: fields such as **height** or **weight** contain incorrect values. Check these variables before the final analysis.
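
One way to act on this warning is to flag values outside a plausible range. The sketch below is illustrative only: the data are invented, the units are assumed to be centimetres and kilograms, and the range limits are arbitrary choices that should be adjusted to the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical user records; units are assumed to be cm and kg.
people = pd.DataFrame({
    "height": [172.0, 1.75, 999.0, 165.0],
    "weight": [70.0, 68.0, 0.0, 300.5],
})

# Assumed plausible ranges, chosen only for illustration.
plausible = {"height": (100.0, 230.0), "weight": (30.0, 250.0)}

for col, (low, high) in plausible.items():
    # Values outside the range are replaced by NaN instead of dropping the row.
    people[col] = people[col].where(people[col].between(low, high), np.nan)

print(people)
```

Replacing implausible values with NaN keeps the user row available for analyses that do not depend on height or weight.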

## Programs

In [9]:
programs = pd.read_csv('../data/programs.csv', on_bad_lines='skip', low_memory=False)
programs = programs.drop(['user_id', 'code_name', 'name_es', 
                           'description_es', 'auto_generated', 'priority_order', 
                           'next_program_id'], axis = 1)


# First check of table
change_data_type_columns = {'datetime' : ['created_at', 'updated_at'],
                            'boolean': ['pro', 'available']}

columns_to_check_units = ['strength', 'endurance','technique',
                          'flexibility', 'intensity']

programs = run_individual_table_cleaner(programs, 
                                        change_data_type_columns,
                                        columns_to_check_units)
#programs


 --- converting_data_types method executed --- 

converting datetime
converting boolean

 --- check_data_type method executed --- 

id                         int64
created_at        datetime64[ns]
updated_at        datetime64[ns]
pro                         bool
available                   bool
strength                   int64
endurance                  int64
technique                  int64
flexibility                int64
intensity                  int64
name_en                   object
description_en            object
dtype: object

 --- check_duplicate method executed --- 

0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29     False
30     False
31     Fal

In [10]:
programs.loc[programs['id'] == 30, 'description_en']

291    The simplest equipment is often the most effec...
Name: description_en, dtype: object

## Table of program characteristics

In [11]:

program_characteristics = pd.read_csv('../data/program_characteristics.csv', on_bad_lines='skip', low_memory=False)
program_characteristics = program_characteristics.drop(['created_at', 'updated_at'], axis = 1)


# First check of table
change_data_type_columns = {'boolean': ['objective']}

program_characteristics = run_individual_table_cleaner(program_characteristics, 
                                                       change_data_type_columns)
program_characteristics



 --- converting_data_types method executed --- 

converting boolean

 --- check_data_type method executed --- 

id             int64
program_id     int64
objective       bool
value_en      object
value_es      object
dtype: object

 --- check_duplicate method executed --- 

0       False
1       False
2       False
3       False
4       False
        ...  
2802    False
2803    False
2804    False
2805    False
2806    False
Length: 2807, dtype: bool

 --- check_unit_columns method executed --- 



Unnamed: 0,id,program_id,objective,value_en,value_es
0,164,30,False,A program created by Marcos Vázquez from Fitne...,Un programa creado por Marcos Vázquez de Fitne...
1,165,30,False,Combine strength and hypertrophy with progress...,Combina fuerza e hipertrofia con ejercicios pr...
2,33,17,False,Sessions designed to speed up your metabolism ...,Sesiones diseñadas para acelerar tu metabolism...
3,60,19,False,Sessions designed to speed up your metabolism ...,Un programa adaptado a tu nivel para crear con...
4,69,13,False,Sessions created to help you progressively imp...,Un programa adaptado a tu nivel para afianzar ...
...,...,...,...,...,...
2802,2925,503,True,Avoid loss of muscle mass.,Evitar la pérdida de masa muscular.
2803,2926,503,True,Prevent cardiovascular diseases.,Prevenir enfermedades cardiovasculares.
2804,2922,503,False,"Training program focused on balance, flexibili...",Programa de entrenamiento enfocado a trabajar ...
2805,2914,500,False,14 training sessions to create a constant phys...,14 sesiones de entrenamiento para crear una ru...


## program_sessions

In [12]:
program_sessions = pd.read_csv('../data/program_sessions.csv', on_bad_lines='skip', low_memory=False)

# First check of table
change_data_type_columns = {'datetime': ['created_at', 'updated_at']}

program_sessions = run_individual_table_cleaner(program_sessions, 
                                                       change_data_type_columns)
program_sessions


 --- converting_data_types method executed --- 

converting datetime

 --- check_data_type method executed --- 

id                     int64
program_id             int64
session_id             int64
created_at    datetime64[ns]
updated_at    datetime64[ns]
dtype: object

 --- check_duplicate method executed --- 

0       False
1       False
2       False
3       False
4       False
        ...  
1386    False
1387    False
1388    False
1389    False
1390    False
Length: 1391, dtype: bool

 --- check_unit_columns method executed --- 



Unnamed: 0,id,program_id,session_id,created_at,updated_at
0,662,49,586,2021-03-12 16:12:46.327341,2021-03-12 16:12:46.327341
1,855,24,778,2021-06-08 09:07:21.138981,2021-06-08 09:07:21.138981
2,856,24,779,2021-06-08 09:07:40.740510,2021-06-08 09:07:40.740510
3,857,24,780,2021-06-08 09:07:59.657941,2021-06-08 09:07:59.657941
4,279,5,288,2020-12-22 20:21:04.037297,2020-12-22 20:21:04.037297
...,...,...,...,...,...
1386,1783,503,1938,2022-01-07 19:14:08.725028,2022-01-07 19:14:08.725028
1387,1784,503,1939,2022-01-07 19:14:46.715396,2022-01-07 19:14:46.715396
1388,1785,503,1940,2022-01-07 19:15:23.371020,2022-01-07 19:15:23.371020
1389,1786,504,1941,2022-01-14 03:56:58.062909,2022-01-14 03:56:58.062909


## subscriptions

In [13]:
subscriptions = pd.read_csv('../data/subscriptions.csv', on_bad_lines='skip', low_memory=False)
subscriptions = subscriptions.drop(['platform', 'transaction_body', 'start_date', 
                                     'end_date', 'subscription_type', 'cancelled_at',
                                     'store_metadata','offer_code', 'affiliate_code',
                                     'cancellation_reason', 'receipt_data'], axis = 1)


# First check of table

change_data_type_columns = {'datetime': ['created_at', 'updated_at'],
                            'boolean': ['cancelled', 'status']}

subscriptions = run_individual_table_cleaner(subscriptions, 
                                                       change_data_type_columns)
subscriptions


 --- converting_data_types method executed --- 

converting datetime
converting boolean

 --- check_data_type method executed --- 

id                     int64
user_id                int64
product_id             int64
program_id             int64
status                object
created_at    datetime64[ns]
updated_at    datetime64[ns]
cancelled               bool
dtype: object

 --- check_duplicate method executed --- 

0        False
1        False
2        False
3        False
4        False
         ...  
13153    False
13154    False
13155    False
13156    False
13157    False
Length: 13158, dtype: bool

 --- check_unit_columns method executed --- 



Unnamed: 0,id,user_id,product_id,program_id,status,created_at,updated_at,cancelled
0,1353,1907,74,10,True,2021-10-25 11:03:03.245558,2021-10-26 11:50:42.563295,False
1,1352,645,74,8,True,2021-10-25 11:03:02.992647,2021-10-26 11:51:50.627662,False
2,1036,1604,74,6,True,2021-10-25 11:01:41.622483,2021-10-26 22:25:02.543061,False
3,272,529,74,13,True,2021-04-22 10:22:02.773388,2021-11-14 10:53:21.222846,False
4,1173,1737,74,6,True,2021-10-25 11:02:17.157806,2021-12-21 10:00:10.592175,True
...,...,...,...,...,...,...,...,...
13153,13543,16493,74,25,False,2022-03-24 09:13:26.441364,2022-03-24 09:13:26.441364,False
13154,13544,16494,74,10,False,2022-03-24 09:13:26.454334,2022-03-24 09:13:26.454334,False
13155,13545,16495,74,9,False,2022-03-24 09:13:26.467400,2022-03-24 09:13:26.467400,False
13156,13546,16497,74,14,False,2022-03-24 09:13:26.482328,2022-03-24 09:13:26.482328,False


### program profiles

In [14]:
program_profiles = pd.read_csv('../data/program_profiles.csv', on_bad_lines='skip', low_memory=False)

program_profiles = program_profiles.drop(['created_at', 'updated_at'], axis = 1)
program_profiles

Unnamed: 0,id,program_id,profile_id
0,1,34,1
1,2,28,20
2,3,28,19
3,4,27,17
4,5,27,15
...,...,...,...
623,626,500,16
624,627,500,17
625,628,500,18
626,629,500,19


### profiles

In [15]:
profiles = pd.read_csv('../data/profiles.csv')
profiles = profiles.drop(['fat_level', 'created_at', 'updated_at'], axis = 1)

### sessions

In [16]:
# Table of sessions 
sessions = pd.read_csv('../data/sessions.csv', on_bad_lines='skip', low_memory=False)
sessions = sessions.drop(['level', 'reps', 'created_at', 'updated_at', 'strength', 
                           'endurance', 'technique', 'flexibility', 'intensity',
                           'name_es'], axis = 1)

### session_blocks

In [17]:
session_blocks = pd.read_csv('../data/session_blocks.csv', on_bad_lines='skip', low_memory=False, header = None)
session_blocks.columns = ['id', 'session_id', 'time_duration', 
                              'created_at', 'updated_at',
                              'order', 'block_type', 'loop']
session_blocks = session_blocks.drop(['created_at', 'updated_at', 'time_duration', 'loop'], axis = 1)

# First check of table
change_data_type_columns = {'int_cols': ['order']}

session_blocks = run_individual_table_cleaner(session_blocks, 
                                             change_data_type_columns)
session_blocks



 --- converting_data_types method executed --- 

converting int_cols

 --- check_data_type method executed --- 

id              int64
session_id      int64
order         float64
block_type      int64
dtype: object

 --- check_duplicate method executed --- 

0       False
1       False
2       False
3       False
4       False
        ...  
2981    False
2982    False
2983    False
2984    False
2985    False
Length: 2986, dtype: bool

 --- check_unit_columns method executed --- 



Unnamed: 0,id,session_id,order,block_type
0,62,41,1.0,0
1,3239,1849,1.0,19
2,3240,1850,1.0,19
3,3241,1851,1.0,19
4,5,8,1.0,0
...,...,...,...,...
2981,3226,1836,1.0,19
2982,3227,1837,1.0,19
2983,3228,1838,1.0,19
2984,3229,1839,1.0,19


### session sets

In [18]:
session_sets = pd.read_csv('../data/session_sets.csv', on_bad_lines='skip', low_memory=False)
session_sets = session_sets.drop(['level', 'time_duration', 'reps', 
                                  'session_set_type','created_at', 'updated_at'], axis = 1)

# First check of table
change_data_type_columns = {'boolean': ['loop']}

session_sets = run_individual_table_cleaner(session_sets, 
                                            change_data_type_columns)


 --- converting_data_types method executed --- 

converting boolean

 --- check_data_type method executed --- 

id                  int64
order               int64
session_block_id    int64
loop                 bool
dtype: object

 --- check_duplicate method executed --- 

0        False
1        False
2        False
3        False
4        False
         ...  
11540    False
11541    False
11542    False
11543    False
11544    False
Length: 11545, dtype: bool

 --- check_unit_columns method executed --- 



### exercise_sets

In [19]:
# Table of exercise sets
exercise_sets = pd.read_csv('../data/exercise_sets.csv', on_bad_lines='skip', low_memory=False)
exercise_sets = exercise_sets.drop(['intensity_modificator'], axis = 1)

# First check of table
change_data_type_columns = {'datetime': ['created_at', 'updated_at'],
                            'boolean': ['track_reps']}

exercise_sets = run_individual_table_cleaner(exercise_sets, 
                                             change_data_type_columns)



 --- converting_data_types method executed --- 

converting datetime
converting boolean

 --- check_data_type method executed --- 

id                         int64
session_set_id             int64
exercise_id                int64
order                      int64
time_duration            float64
reps                     float64
created_at        datetime64[ns]
updated_at        datetime64[ns]
track_reps                  bool
dtype: object

 --- check_duplicate method executed --- 

0        False
1        False
2        False
3        False
4        False
         ...  
40833    False
40834    False
40835    False
40836    False
40837    False
Length: 40838, dtype: bool

 --- check_unit_columns method executed --- 



In [20]:
# Table of exercises
exercises = pd.read_csv('../data/exercises.csv', sep = ';', on_bad_lines='skip', low_memory=False)
exercises = exercises.drop(['video','reps', 'time','legacy_id', 'replacement_legacy_id', 
                             'family', 'sub_family', 'video_female', 'video_male', 
                             'harder_variation_id', 'easier_variation_id', 'name_es', 'description_es', 
                             'implement_variation_id', 'thumbnail', 'thumbnail_male', 'thumbnail_female', 
                             'notes_en', 'notes_es', 'thumbnail_400', 'thumbnail_400_male',
                             'thumbnail_400_female', 'coach_id', 'test_equivalent_id', 'excluded'], axis = 1)

# First check of table
change_data_type_columns = {'datetime' : ['created_at','updated_at'],
                            'boolean': ['deprecated']}
exercises = run_individual_table_cleaner(exercises, 
                                         change_data_type_columns)



 --- converting_data_types method executed --- 

converting datetime
converting boolean

 --- check_data_type method executed --- 

id                             int64
created_at            datetime64[ns]
updated_at            datetime64[ns]
deprecated                      bool
body_parts_focused            object
muscles                       object
joints                        object
met_multiplier               float64
name_en                       object
description_en                object
test_correction              float64
execution_time               float64
t1_min                         int64
t1_max                         int64
dtype: object

 --- check_duplicate method executed --- 

0      False
1      False
2      False
3      False
4      False
       ...  
968    False
969    False
970    False
971    False
972    False
Length: 973, dtype: bool

 --- check_unit_columns method executed --- 



In [21]:

# Table of blocks of session executions
session_block_executions = pd.read_csv('../data/session_block_executions.csv', on_bad_lines='skip', low_memory=False)
session_block_executions = session_block_executions.drop(['block_type', 
                                                           'reps_executed',
                                                           'execution_time'], axis = 1)

# First check of table
change_data_type_columns = {'datetime' : ['created_at','updated_at']}
session_block_executions = run_individual_table_cleaner(session_block_executions, 
                                                        change_data_type_columns)


 --- converting_data_types method executed --- 

converting datetime

 --- check_data_type method executed --- 

id                               int64
session_execution_id             int64
order                            int64
created_at              datetime64[ns]
updated_at              datetime64[ns]
dtype: object

 --- check_duplicate method executed --- 

0         False
1         False
2         False
3         False
4         False
          ...  
207967    False
207968    False
207969    False
207970    False
207971    False
Length: 207972, dtype: bool

 --- check_unit_columns method executed --- 



In [22]:
# Table of sets of executions
session_set_executions = pd.read_csv('../data/session_set_executions.csv', on_bad_lines='skip', low_memory=False)
session_set_executions = session_set_executions.drop(['reps_executed', 'execution_time'], axis = 1)

# First check of table
change_data_type_columns = {'datetime' : ['created_at','updated_at']}
session_set_executions = run_individual_table_cleaner(session_set_executions, 
                                                      change_data_type_columns)


 --- converting_data_types method executed --- 

converting datetime

 --- check_data_type method executed --- 

id                                     int64
order                                  int64
created_at                    datetime64[ns]
updated_at                    datetime64[ns]
session_block_execution_id             int64
dtype: object

 --- check_duplicate method executed --- 

0         False
1         False
2         False
3         False
4         False
          ...  
849284    False
849285    False
849286    False
849287    False
849288    False
Length: 849289, dtype: bool

 --- check_unit_columns method executed --- 



### exercise_executions 

In [23]:
# Table of exercise executions 
exercise_executions = pd.read_csv('../data/exercise_executions.csv', on_bad_lines='skip', low_memory=False, header = None)
exercise_executions.columns = ['id', 'exercise_id', 'session_set_execution_id', 
                              'reps_executed', 'execution_time',
                              'order', 'created_at', 'updated_at']

# First check of table
change_data_type_columns = {'datetime' : ['created_at','updated_at']}

columns_to_check_units = ['order']

exercise_executions = run_individual_table_cleaner(exercise_executions, 
                                                   change_data_type_columns,
                                                   columns_to_check_units)
exercise_executions


 --- converting_data_types method executed --- 

converting datetime

 --- check_data_type method executed --- 

id                                   int64
exercise_id                          int64
session_set_execution_id             int64
reps_executed                        int64
execution_time                       int64
order                                int64
created_at                  datetime64[ns]
updated_at                  datetime64[ns]
dtype: object

 --- check_duplicate method executed --- 

0          False
1          False
2          False
3          False
4          False
           ...  
2190284    False
2190285    False
2190286    False
2190287    False
2190288    False
Length: 2190289, dtype: bool

 --- check_unit_columns method executed --- 

1     659413
2     636362
3     520425
4     138942
5      66593
13     47360
6      39927
7      19290
8      16047
9      11001
10      9258
11      8015
12      7133
14      3630
15      1341
16      1294
17      1196


Unnamed: 0,id,exercise_id,session_set_execution_id,reps_executed,execution_time,order,created_at,updated_at
0,1279660,5236,47654,10,66,1,2021-10-29 13:07:01.918992,2021-10-29 13:07:01.918992
1,1279661,5968,47654,0,15,2,2021-10-29 13:07:01.924569,2021-10-29 13:07:01.924569
2,1279662,5317,47654,10,11,3,2021-10-29 13:07:01.929730,2021-10-29 13:07:01.929730
3,1279663,5968,47654,0,15,4,2021-10-29 13:07:01.934805,2021-10-29 13:07:01.934805
4,1279664,5222,47654,10,19,5,2021-10-29 13:07:01.940808,2021-10-29 13:07:01.940808
...,...,...,...,...,...,...,...,...
2190284,3116554,5968,669974,0,31,2,2022-05-27 07:41:53.378292,2022-05-27 07:41:53.378292
2190285,3116555,5870,669975,6,31,1,2022-05-27 07:41:53.386738,2022-05-27 07:41:53.386738
2190286,3116556,5968,669975,0,31,2,2022-05-27 07:41:53.390694,2022-05-27 07:41:53.390694
2190287,3116557,5870,669976,6,24,1,2022-05-27 07:41:53.399237,2022-05-27 07:41:53.399237


# Merging - Stage 2

We merge the tables following the foreign key relationships specified in the SQL schema, which we inspected in MySQL Workbench using the mh_database_schema.mwb file. Before merging, we rename all columns so that each column header consists of the name of its table followed by the variable name within that table. This naming convention removes ambiguity between columns that share a name across tables, such as id or created_at.

In [24]:
# Preparing the columns before merge. 

def rename_columns(df, name_to_append):
    """
        Rename every column by prefixing it with the name of its table.

        Parameters:
        df (dataframe): The dataframe to rename
        name_to_append (str): The name of the table to append.

        Returns:
        df (dataframe): The dataframe with the columns renamed.
    """
    df = df.rename(columns=lambda x: name_to_append + "_" + x)
    return df

user_programs = rename_columns(user_programs, "user_programs")
users = rename_columns(users, "users")
programs = rename_columns(programs, "programs")
program_sessions = rename_columns(program_sessions, "program_sessions")
program_characteristics = rename_columns(program_characteristics, "program_characteristics")
subscriptions = rename_columns(subscriptions, "subscriptions")
program_profiles = rename_columns(program_profiles, "program_profiles")
profiles = rename_columns(profiles, "profiles")

session_executions = rename_columns(session_executions, "session_executions")
session_block_executions = rename_columns(session_block_executions, "session_block_executions")
session_set_executions = rename_columns(session_set_executions, "session_set_executions")
exercise_executions = rename_columns(exercise_executions, "exercise_executions")

sessions = rename_columns(sessions, "sessions")
session_blocks = rename_columns(session_blocks, "session_blocks")
session_sets = rename_columns(session_sets, "session_sets")
exercise_sets = rename_columns(exercise_sets, "exercise_sets")
exercises = rename_columns(exercises, "exercises")
session_executions_summary = rename_columns(session_executions_summary, "session_executions_summary")


In [25]:
# Merging session executions with user_programs
# It is useful to retrieve the information about the user id and its programs

merge = session_executions.merge(user_programs,
                                  how ='inner',
                                  left_on = 'session_executions_user_program_id',
                                  right_on= 'user_programs_id')
merge.drop(columns=['user_programs_id'], inplace=True)
merge.shape

(736682, 20)

In [26]:
# Useful to compare final results

session_executions_summary_with_users = merge.merge(session_executions_summary,
                                        how ='inner',
                                        left_on = 'session_executions_id',
                                        right_on= 'session_executions_summary_session_execution_id')
#merge2.drop(columns=[ser_programs_id'], inplace=True)
session_executions_summary_with_users.shape
# merge2.loc[merge2['session_executions_summary_reps_per_exercise'] == '0']

(732025, 39)

In [27]:
# Merging session_block_executions
# It is useful to connect with the session set executions

merge = merge.merge(session_block_executions,
                    how ='inner',
                    left_on = 'session_executions_id',
                    right_on = 'session_block_executions_session_execution_id')
merge.shape

(154582, 25)
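
The row count changes at this step because an inner join keeps only the keys present in both tables, while a one-to-many relationship repeats the left row once per match. A minimal sketch of how the unmatched keys could be inspected, using toy data with hypothetical values and the `indicator=True` option of `DataFrame.merge`:

```python
import pandas as pd

# Toy data (hypothetical): session 2 has no block executions, session 1 has two.
session_executions = pd.DataFrame({"session_executions_id": [1, 2, 3]})
session_block_executions = pd.DataFrame({
    "session_block_executions_id": [100, 101, 102],
    "session_block_executions_session_execution_id": [1, 1, 3],
})

# A left join with indicator=True labels each row as 'both' or 'left_only'.
checked = session_executions.merge(
    session_block_executions,
    how="left",
    left_on="session_executions_id",
    right_on="session_block_executions_session_execution_id",
    indicator=True,
)

# Session executions that an inner join would drop.
unmatched = checked.loc[checked["_merge"] == "left_only", "session_executions_id"]
# Number of rows that an inner join would return.
inner_rows = int((checked["_merge"] == "both").sum())

print(unmatched.tolist())
print(inner_rows)
```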

In [28]:
# Merging session_set_executions
# It is useful to connect with the exercise executions

merge = merge.merge(session_set_executions,
                      how ='inner',
                      left_on = 'session_block_executions_id',
                      right_on = 'session_set_executions_session_block_execution_id')
merge.shape

(632541, 30)

In [29]:
# Merging exercise_executions
# It is useful to know the individual exercises executed during the session.

merge= merge.merge(exercise_executions,
                   how ='inner',
                   left_on = 'session_set_executions_id',
                   right_on = 'exercise_executions_session_set_execution_id')
print(merge['session_executions_id'].value_counts(dropna=False))
print(merge.shape)


916       737
917       736
919       735
920       734
918       733
         ... 
732672      1
731407      1
731128      1
4861        1
15888       1
Name: session_executions_id, Length: 51405, dtype: int64
(2180871, 38)


In [30]:
# Merging exercises
# It is useful to obtain information about the exercises

merge = merge.merge(exercises,
                     how ='inner',
                     left_on = 'exercise_executions_exercise_id',
                     right_on = 'exercises_id')
print(merge['session_executions_id'].value_counts(dropna=False))
print(merge.shape)

916     737
917     736
919     735
920     734
918     733
       ... 
9129      1
5911      1
9490      1
4861      1
4930      1
Name: session_executions_id, Length: 51405, dtype: int64
(2180871, 52)


In [31]:
merge = merge.merge(sessions,
             how='inner',
             left_on='session_executions_session_id',
             right_on='sessions_id')

In [32]:
merge = merge.merge(users,
             how='inner',
             left_on='user_programs_user_id',
             right_on='users_id')

In [33]:
merge.shape

(2180871, 89)

In [34]:
# Drop exercise executions that have both zero reps and zero execution time
condition_equal_exercises = merge.loc[(merge['exercise_executions_reps_executed'] == 0) &
                                      (merge['exercise_executions_execution_time'] == 0)].index
merge = merge.drop(condition_equal_exercises)
merge.shape

(2129008, 89)

In [35]:
# Merging session_executions_summary
# It is useful to compare the final results against the session summaries

merge = merge.merge(session_executions_summary,
                                        how ='inner',
                                        left_on = 'session_executions_id',
                                        right_on= 'session_executions_summary_session_execution_id')
#merge2.drop(columns=[ser_programs_id'], inplace=True)
merge.shape
# merge2.loc[merge2['session_executions_summary_reps_per_exercise'] == '0']

(2127754, 108)

In [36]:
merge.shape

(2127754, 108)

In [37]:
# Selecting a single session execution from a single user
'''
single_user = merge1.loc[(merge1['session_executions_id'] == 4201)&(merge1['user_programs_user_id'] == 640)]
single_user[['session_executions_discarded', 'session_executions_imported','session_block_executions_id', 'session_set_executions_id', 'exercise_executions_id', 
   'session_block_executions_order', 'session_set_executions_order', 'exercise_executions_order',
   'exercise_executions_reps_executed', 'exercise_executions_execution_time','exercise_executions_updated_at',
   'exercises_body_parts_focused', 'exercises_name_en', 'sessions_description_es']].head(50)
'''

#merge2 = merge.loc[0:(merge.shape[0]*0.01), :]

#exercises_description_en	exercises_test_correction	exercises_execution_time	exercises_t1_min	exercises_t1_max
# Uncomment when testing the exercise counts
'''
merge2.loc[merge2['session_executions_id'] == 4201, ['exercise_executions_reps_executed', 
                                                     'exercise_executions_execution_time', 
                                                     'exercises_name_en', 
                                                     'sessions_order']].head(50)
'''

"\nmerge2.loc[merge2['session_executions_id'] == 4201, ['exercise_executions_reps_executed', \n                                                     'exercise_executions_execution_time', \n                                                     'exercises_name_en', \n                                                     'sessions_order']].head(50)\n"

# Data flattening - Stage 3

Instead of treating each row as an exercise execution, as indicated in the merged table, we flattened it to consider each row as a session execution. This restructuring proved significantly more useful for analyzing session data rather than individual exercises. Moreover, this format facilitated easier manipulation of the data for various analytical purposes.
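
The idea can be sketched on a toy table. The example below is a vectorized illustration with hypothetical values, not the row-by-row implementation used in the following cells. It assumes the same column naming as the notebook, `<exercise>_reps_<n>` and `<exercise>_time_<n>`, where `n` counts the occurrences of an exercise within a session execution.

```python
import pandas as pd

# Toy long table (hypothetical values): one row per exercise execution.
long_df = pd.DataFrame({
    "session_executions_id": [1, 1, 1, 2],
    "exercises_name_en": ["Squat", "Rest", "Squat", "Rest"],
    "exercise_executions_reps_executed": [10, 0, 8, 0],
    "exercise_executions_execution_time": [30, 15, 25, 20],
})

# Occurrence counter of each exercise within a session execution (1, 2, ...).
occurrence = (long_df.groupby(["session_executions_id", "exercises_name_en"])
                     .cumcount() + 1).astype(str)
long_df["reps_col"] = long_df["exercises_name_en"] + "_reps_" + occurrence
long_df["time_col"] = long_df["exercises_name_en"] + "_time_" + occurrence

# One wide row per session execution; exercises not executed become NaN.
reps_wide = long_df.pivot(index="session_executions_id", columns="reps_col",
                          values="exercise_executions_reps_executed")
time_wide = long_df.pivot(index="session_executions_id", columns="time_col",
                          values="exercise_executions_execution_time")
flat = pd.concat([reps_wide, time_wide], axis=1)
flat = flat.reindex(sorted(flat.columns), axis=1)

print(flat)
```

Each session execution ends up as a single row, and an exercise repeated within the session gets one pair of columns per occurrence.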

## Columns Deleted

- All NaN columns:
    - session_executions_execution_time
    - session_executions_discard_reason

 - Information not relevant for data analysis
    - user_programs_created_at
    - user_programs_updated_at
    - user_programs_current_session_id
    - session_block_executions_id
    - session_block_executions_session_execution_id
    - session_block_executions_created_at
    - session_block_executions_updated_at
    - session_set_executions_session_block_execution_id
    - exercise_executions_session_set_execution_id
    - session_id
    - sessions_code_name
    - sessions_warmup_id
    - session_executions_summary_id
    - session_executions_summary_reps_per_exercise
    - session_executions_summary_secs_per_exercise
    - session_executions_summary_reps_per_min_per_exercise
    - session_executions_summary_created_at
    - session_executions_summary_reps_set_per_block
    - session_executions_summary_time_set_per_block
    - session_executions_summary_average_reps_min_set_per_block
    - session_executions_summary_reps_min_set_block
    - session_executions_summary_name

 - Repeated
    - exercise_executions_exercise_id == exercises_id
    - sessions_description_es == sessions_description_en
    - session_executions_summary_session_execution_id == session_executions_id

 - Many NaNs
    - sessions_session_type
    - users_total_sessions
    - users_total_time
    - users_kcal_per_session
    - users_reps_per_session
 
 - Wrong data
   - session_executions_summary_total_time
   - session_executions_summary_reps_per_min (this was not wrong; however, I could not figure out what exactly it represents)
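
A minimal sketch of how the "All NaN" and "Many NaNs" groups could be identified programmatically. The toy dataframe and the `NAN_THRESHOLD` value are assumptions made for illustration; the notebook does not state which threshold, if any, was used.

```python
import numpy as np
import pandas as pd

# Toy dataframe (hypothetical values) reusing three column names from the lists above.
df = pd.DataFrame({
    "session_executions_discard_reason": [np.nan, np.nan, np.nan, np.nan],
    "users_total_time": [np.nan, np.nan, np.nan, 120.0],
    "users_gender": ["f", "m", "f", "m"],
})

# Columns in which every value is NaN.
all_nan_columns = df.columns[df.isna().all()].tolist()

# Share of NaN values per column; the threshold is a hypothetical choice.
NAN_THRESHOLD = 0.5
nan_share = df.isna().mean()
mostly_nan_columns = nan_share[(nan_share > NAN_THRESHOLD) & (nan_share < 1.0)].index.tolist()

print(all_nan_columns)
print(mostly_nan_columns)
```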

In [38]:
def fill_data(flat_df, row, exercise_name, session_execution_id):
    """
    Function to fill data in a flat dictionary from a row of another DataFrame.

    Parameters:
    flat_df (dict): Dictionary representing the flat DataFrame.
    row (pandas Series): A row from a DataFrame containing data to be filled into flat_df.
    exercise_name (str): The name of the exercise.
    session_execution_id (int): The id of the session execution.

    Returns:
    dict: The updated flat dict that will be converted into a DataFrame.
    """

    # Fill reps and time data; the suffix is the occurrence counter of this exercise in the session
    occurrence = str(flat_df[session_execution_id][exercise_name])
    exercise_name_reps = exercise_name + '_reps_' + occurrence
    flat_df[session_execution_id][exercise_name_reps] = row['exercise_executions_reps_executed']
    exercise_name_time = exercise_name + '_time_' + occurrence
    flat_df[session_execution_id][exercise_name_time] = row['exercise_executions_execution_time']

    # fill session execution data
    flat_df[session_execution_id]['session_executions_user_program_id'] = row['session_executions_user_program_id']
    flat_df[session_execution_id]['session_executions_difficulty_feedback'] = row['session_executions_difficulty_feedback']
    flat_df[session_execution_id]['session_executions_enjoyment_feedback'] = row['session_executions_enjoyment_feedback']
    flat_df[session_execution_id]['session_executions_reps_executed'] = row['session_executions_reps_executed']
    flat_df[session_execution_id]['session_executions_created_at'] = row['session_executions_created_at']
    flat_df[session_execution_id]['session_executions_updated_at'] = row['session_executions_updated_at']
    flat_df[session_execution_id]['session_executions_discarded'] = row['session_executions_discarded']
    flat_df[session_execution_id]['session_executions_imported'] = row['session_executions_imported']
    
    # fill session user programs data
    flat_df[session_execution_id]['user_programs_user_id'] = row['user_programs_user_id']
    flat_df[session_execution_id]['user_programs_program_id'] = row['user_programs_program_id']
    flat_df[session_execution_id]['user_programs_active'] = row['user_programs_active']
    flat_df[session_execution_id]['user_programs_completed'] = row['user_programs_completed']
    flat_df[session_execution_id]['user_programs_enjoyment'] = row['user_programs_enjoyment'] # Check later, A lot of rows are NaNs

    # Fill users data
    flat_df[session_execution_id]['users_created_at'] = row['users_created_at']
    flat_df[session_execution_id]['users_updated_at'] = row['users_updated_at']
    flat_df[session_execution_id]['users_gender'] = row['users_gender']
    flat_df[session_execution_id]['users_activity_level'] = row['users_activity_level']
    flat_df[session_execution_id]['users_body_type'] = row['users_body_type']
    flat_df[session_execution_id]['users_newsletter_subscription'] = row['users_newsletter_subscription']
    flat_df[session_execution_id]['users_sign_in_count'] = row['users_sign_in_count']
    flat_df[session_execution_id]['users_notifications_setting'] = row['users_notifications_setting']
    flat_df[session_execution_id]['users_training_days_setting'] = row['users_training_days_setting']
    flat_df[session_execution_id]['users_country'] = row['users_country']
    flat_df[session_execution_id]['users_points'] = row['users_points']
    flat_df[session_execution_id]['users_scientific_data_usage'] = row['users_scientific_data_usage']
    flat_df[session_execution_id]['users_best_weekly_streak'] = row['users_best_weekly_streak']
    flat_df[session_execution_id]['users_imported'] = row['users_imported']
    flat_df[session_execution_id]['users_goal'] = row['users_goal']
    flat_df[session_execution_id]['users_date_of_birth'] = row['users_date_of_birth']
    flat_df[session_execution_id]['users_height'] = row['users_height']
    flat_df[session_execution_id]['users_weight'] = row['users_weight']
    flat_df[session_execution_id]['users_body_fat'] = row['users_body_fat']

    # TODO: check if it is useful to fill: exercise_executions_ids, exercise_executions_order
    # flat_df[session_execution_id]['exercises_id'] = row['exercises_id']
    # flat_df[session_execution_id][''] = row['']
    # Fill session block executions and session set executions data
    # flat_df[session_execution_id]['session_block_executions_order'] = row['session_block_executions_order'] # Check later, because it is likely not useful
    # flat_df[session_execution_id]['session_set_executions_order'] = row['session_set_executions_order']

    # fill exercises data
    flat_df[session_execution_id]['body_parts_focused'][exercise_name] =  row['exercises_body_parts_focused']
    flat_df[session_execution_id]['exercises_muscles'][exercise_name] = row['exercises_muscles']
    flat_df[session_execution_id]['exercises_joints'][exercise_name] = row['exercises_joints']
    flat_df[session_execution_id]['exercises_met_multiplier'][exercise_name] = row['exercises_met_multiplier']
    flat_df[session_execution_id]['exercises_description_en'][exercise_name] = row['exercises_description_en']
    flat_df[session_execution_id]['exercises_test_correction'][exercise_name] = row['exercises_test_correction']
    flat_df[session_execution_id]['exercises_execution_time'][exercise_name] = row['exercises_execution_time']
    flat_df[session_execution_id]['exercises_t1_min'][exercise_name] = row['exercises_t1_min']
    flat_df[session_execution_id]['exercises_t1_max'][exercise_name] = row['exercises_t1_max']

    # fill session data
    flat_df[session_execution_id]['sessions_order'] = row['sessions_order']
    flat_df[session_execution_id]['sessions_time_duration'] = row['sessions_time_duration']
    flat_df[session_execution_id]['sessions_name_en'] = row['sessions_name_en']
    flat_df[session_execution_id]['sessions_description_en'] = row['sessions_description_en']
    flat_df[session_execution_id]['sessions_calories'] = row['sessions_calories']

    # Fill session executions summary data
    flat_df[session_execution_id]['session_executions_summary_total_reps'] = row['session_executions_summary_total_reps']
    flat_df[session_execution_id]['session_executions_summary_total_kcal'] = row['session_executions_summary_total_kcal']
    flat_df[session_execution_id]['session_executions_summary_effort'] = row['session_executions_summary_effort']
    flat_df[session_execution_id]['session_executions_summary_points'] = row['session_executions_summary_points']
    flat_df[session_execution_id]['session_executions_summary_value_of_session'] = row['session_executions_summary_value_of_session']
    flat_df[session_execution_id]['session_executions_summary_updated_at'] = row['session_executions_summary_updated_at']

    return flat_df

In [39]:
flat_df = {} # Dict that will be converted into a flat dataframe

for index, row in merge.iterrows():    

    session_execution_id = row['session_executions_id']
    exercise_name = row['exercises_name_en']

    
    # Check if the session execution exists in the flattened dataframe. If it does, proceed to check the exercise name; 
    # if not, initialize the exercise name counter and initialize the exercise data dicts.
    if session_execution_id in flat_df.keys():
        
        # Check if the exercise name exists in the flattened dataframe. If it does, proceed to increase the exercise name counter; 
        # if not, initialize the exercise name counter.
        if exercise_name in flat_df[session_execution_id].keys():
            
            flat_df[session_execution_id][exercise_name] = flat_df[session_execution_id][exercise_name] + 1
            flat_df = fill_data(flat_df, row, exercise_name, session_execution_id)

        else:
            
            flat_df[session_execution_id][exercise_name] = 1
            flat_df = fill_data(flat_df, row, exercise_name, session_execution_id)
            
    else:
        
        flat_df[session_execution_id] = {}
        flat_df[session_execution_id][exercise_name] = 1
        
        #Initialization useful for exercises data
        flat_df[session_execution_id]['body_parts_focused'] = {}
        flat_df[session_execution_id]['exercises_muscles'] = {}
        flat_df[session_execution_id]['exercises_joints'] = {}
        flat_df[session_execution_id]['exercises_met_multiplier'] = {}
        flat_df[session_execution_id]['exercises_description_en'] = {}
        flat_df[session_execution_id]['exercises_test_correction'] = {}
        flat_df[session_execution_id]['exercises_execution_time'] = {}
        flat_df[session_execution_id]['exercises_t1_min'] = {}
        flat_df[session_execution_id]['exercises_t1_max'] = {}
    
        flat_df = fill_data(flat_df, row, exercise_name, session_execution_id)

In [40]:
merge_df = pd.DataFrame.from_dict(flat_df, orient='index')

In [41]:
# Create the dataframe and organize columns alphabetically
# merge_df = pd.DataFrame.from_dict(flat_df, orient='index')
merge_df = merge_df.reindex(sorted(merge_df.columns), axis=1)

# This dataset is the dataframe after flattening the exercise executions into session executions
merge_df.to_hdf('../data/flattened_database_merged_with_session_executions_v02.h5', key='data', mode='w')
del merge_df    # allow df to be garbage collected

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->block1_values] [items->Index(['body_parts_focused', 'exercises_description_en',
       'exercises_execution_time', 'exercises_joints',
       'exercises_met_multiplier', 'exercises_muscles', 'exercises_t1_max',
       'exercises_t1_min', 'exercises_test_correction',
       'session_executions_summary_effort',
       'session_executions_summary_updated_at',
       'session_executions_summary_value_of_session',
       'sessions_description_en', 'sessions_name_en', 'users_country'],
      dtype='object')]

  merge_df.to_hdf('../data/flattened_database_merged_with_session_executions_v02.h5', key='data', mode='w')


# Final check of the database: Step 4

In this step, we check that the number of exercises in the flattened dataframe corresponds to the number of rows in the merged exercise table.
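
The check can be sketched on a toy flattened table as follows. The values are hypothetical, and the regular expression `_time_\d+$` is an assumption that relies on the `<exercise>_time_<n>` naming produced by the flattening step; the cell below instead matches any column containing `time` after dropping the non-exercise columns.

```python
import numpy as np
import pandas as pd

# Toy flattened table (hypothetical values): session 1 executed Rest and Squat, session 2 only Rest.
flat = pd.DataFrame(
    {
        "Rest_reps_1": [0, 0],
        "Rest_time_1": [15, 20],
        "Squat_reps_1": [10, np.nan],
        "Squat_time_1": [30, np.nan],
        "users_gender": ["f", "m"],
    },
    index=[1, 2],
)
n_long_rows = 3  # number of exercise executions in the corresponding toy long table

# Each executed exercise contributes exactly one non-NaN time column.
time_columns = flat.filter(regex=r"_time_\d+$")
n_executions = int(time_columns.notna().sum().sum())

print(n_executions == n_long_rows)
```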

In [42]:
df = pd.read_hdf('../data/flattened_database_merged_with_session_executions_v01.h5',  key='data')

# Remove the columns that do not hold per-exercise reps or time
df_only_exercises = df.drop(columns=['body_parts_focused',
'exercises_description_en',
'exercises_execution_time',
'exercises_joints',
'exercises_met_multiplier',
'exercises_muscles',
'exercises_t1_max',
'exercises_t1_min',
'exercises_test_correction',
'session_executions_created_at',
'session_executions_difficulty_feedback',
'session_executions_discarded',
'session_executions_enjoyment_feedback',
'session_executions_imported',
'session_executions_reps_executed',
'session_executions_updated_at',
'session_executions_user_program_id',
'sessions_calories',
'sessions_description_en',
'sessions_name_en',
'sessions_order',
'sessions_time_duration',
'user_programs_active',
'user_programs_completed',
'user_programs_program_id',
'user_programs_user_id',
'users_created_at',
'users_updated_at',
'users_gender',
'users_activity_level',
'users_body_type',
'users_newsletter_subscription',
'users_sign_in_count',
'users_notifications_setting',
'users_training_days_setting',
'users_country',
'users_points',
'users_scientific_data_usage',
'users_best_weekly_streak', 
'users_imported',
'users_goal',
'users_date_of_birth',
'users_height',
'users_weight',
'users_body_fat',
'session_executions_summary_total_reps',
'session_executions_summary_total_kcal',
'session_executions_summary_effort',
'session_executions_summary_points',
'session_executions_summary_value_of_session',
'session_executions_summary_updated_at'
])

# Iterate over the dataframe rows, drop the NaNs of the exercises that were not executed in that
# session execution, and count only the columns containing 'time' (the reps columns are skipped because
# each reps column and its time column refer to the same exercise execution).
# The total should match the number of rows of the merged dataframe (merge.shape[0]).
sum_total = 0
for index, row in df_only_exercises.iterrows():
    row = row.dropna()
    sum_exercises = sum(row.keys().str.contains('time'))
    sum_total = sum_total + sum_exercises

KeyError: "['users_goal', 'users_date_of_birth', 'users_height', 'users_weight', 'users_body_fat'] not found in axis"

In [45]:
print(df.shape, sum_total, merge.shape)

(51355, 9181) 2127754 (2127754, 108)


## Test to corroborate the total reps and total time: Step 4.1

We compare the session executions summary table, which stores the total reps and total time of each session execution, with the dataframe that calculates **total_reps** and **total_time** in the file **data_celaning_dataframe_session_v0.ipynb**.
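
A minimal sketch of such a comparison for the total reps, on toy data with hypothetical values. The column `computed_total_reps` and the `matches` flag are illustrative names, and the calculation in the referenced notebook may differ from the plain per-session sum assumed here.

```python
import pandas as pd

# Toy merged table (hypothetical values): one row per exercise execution.
merged = pd.DataFrame({
    "session_executions_id": [1, 1, 2],
    "exercise_executions_reps_executed": [10, 8, 5],
})
# Toy summary table (hypothetical values): one row per session execution.
summary = pd.DataFrame({
    "session_executions_summary_session_execution_id": [1, 2],
    "session_executions_summary_total_reps": [18, 6],
})

# Sum the reps of every exercise execution within each session execution.
computed = (merged.groupby("session_executions_id")["exercise_executions_reps_executed"]
                  .sum()
                  .rename("computed_total_reps")
                  .reset_index())

# Put the computed totals next to the stored ones and flag disagreements.
comparison = computed.merge(summary,
                            left_on="session_executions_id",
                            right_on="session_executions_summary_session_execution_id")
comparison["matches"] = (comparison["computed_total_reps"]
                         == comparison["session_executions_summary_total_reps"])
mismatched = comparison.loc[~comparison["matches"], "session_executions_id"].tolist()

print(mismatched)
```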

In [68]:
session_executions_summary.loc[(session_executions_summary['session_executions_summary_session_execution_id'] == 746205)]

Unnamed: 0,session_executions_summary_id,session_executions_summary_session_execution_id,session_executions_summary_total_reps,session_executions_summary_total_time,session_executions_summary_reps_per_min,session_executions_summary_total_kcal,session_executions_summary_reps_per_exercise,session_executions_summary_secs_per_exercise,session_executions_summary_reps_per_min_per_exercise,session_executions_summary_created_at,session_executions_summary_updated_at,session_executions_summary_reps_set_per_block,session_executions_summary_time_set_per_block,session_executions_summary_average_reps_min_set_per_block,session_executions_summary_reps_min_set_block,session_executions_summary_effort,session_executions_summary_points,session_executions_summary_value_of_session,session_executions_summary_name
733711,740498,746205,230,1337,10,90,"{'5235': {'reps': 34, 'name_en': 'Star jump', ...","{'5235': {'name_en': 'Star jump', 'name_es': '...","{'5235': {'value': 17, 'name_en': 'Star jump',...",2022-05-26 03:05:45.917324,2022-05-26 03:05:45.917324,"{""1"": {""1"": {""reps"": 25, ""block_name"": ""Ir""}, ...","{""1"": {""1"": {""block_name"": ""Ir"", ""execution_ti...","{""1"": {""1"": 12.195121951219512, ""2"": 23.809523...",0,9,100,5,0


In [63]:
import datetime

datetime1 = datetime.datetime(2022, 1, 23)
datetime2 = datetime.datetime(2022, 1, 25)

merge.loc[(merge['user_programs_user_id'] == 108) &
          (merge['session_executions_updated_at'] > datetime1) &
          (merge['session_executions_updated_at'] < datetime2), 
          ['user_programs_user_id', 'session_executions_updated_at', 'session_executions_summary_total_reps']]

Unnamed: 0,user_programs_user_id,session_executions_updated_at,session_executions_summary_total_reps
1198905,108,2022-01-24 11:38:17.767057,90
1198906,108,2022-01-24 11:38:17.767057,90
1198907,108,2022-01-24 11:38:17.767057,90
1198908,108,2022-01-24 11:38:17.767057,90
1198909,108,2022-01-24 11:38:17.767057,90
1198910,108,2022-01-24 11:38:17.767057,90
1198911,108,2022-01-24 11:38:17.767057,90
1198912,108,2022-01-24 11:38:17.767057,90
1198913,108,2022-01-24 11:38:17.767057,90
1198914,108,2022-01-24 11:38:17.767057,90


## Test to corroborate the flattening procedure: Step 4.2

Experiments to corroborate whether the flattening process was carried out correctly.

In [47]:
# Fix the merge: remove the error from here, condition = (time_diff < 10) | (time_diff.shift(-1) < 10)

import datetime
a = merge.loc[(merge['session_executions_updated_at'] > datetime.datetime(2022, 1, 23)) & 
          (merge['session_executions_updated_at'] < datetime.datetime(2022, 1, 24)) & 
          (merge['exercises_name_en'] == 'Rest') &
          (merge['user_programs_user_id'] == 6243),
          ['session_executions_id', 'exercise_executions_reps_executed', 'exercise_executions_execution_time', 'exercises_name_en', 'session_executions_updated_at', 'user_programs_user_id']]
# The following values were not removed by the condition (time_diff < 30) | (time_diff.shift(-1) < 30)
# in the data_cleaning_merged_table file
# a.drop([715728, 715729, 715728, 715730, 715731, 715737, 715738, 715739, 715740]).sum()