# Overview

This notebook covers the design and development of artificial intelligence models. Its main goal is to predict users' adherence to physical exercise from how long they have been exercising.

**Author**: Jon Maestre Escobar

**Email**: jonmaestre@opendeusto.es

In [1]:
import pandas as pd
import numpy as np
import warnings
from utilities import Data_cleaning

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.express.colors import sample_colorscale

import math
import copy
import re
%matplotlib inline

warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

In [2]:
df_filtered_merged = pd.read_hdf('../data/filtered_merged_dataset_v1.h5', key='df')
df_filtered_merged.fillna(0, inplace=True)
df_filtered_merged.shape

(46020, 8705)

## **Users Filtered by Different Amount of Training Days**

In this section, I filter and analyze users by the number of training days recorded in the dataset. The objective is to group users by training consistency and to fill the gaps in their training records, so that each user has a continuous daily series.

**Steps Taken**

- **Initial Filtering**:
    - **30 Days or Less**: I first identified users who have trained for 30 days or less. This involved grouping the data by user ID and counting the unique training days.
    - **31 to 90 Days**: Next, I filtered users who have trained for more than 30 days but not exceeding 90 days.
    - **91 to 180 Days**: Similarly, I filtered users who have trained for more than 90 days but up to 180 days.
    - **181 to 365 Days**: Finally, I identified users who have trained for more than 180 days but not exceeding 365 days.

- **Data Completion**:
    - For each of these groups, I ensured that the training records were complete. Specifically, I filled in any missing days with zeros from the user's first recorded training day to ensure there are no gaps in the data.
    - This involved generating a complete date range for each user based on their first training session and merging it with the existing records, filling in the missing entries.

- **Detailed Analysis**:
    - The filtered and completed datasets were then analyzed to understand user behavior and training patterns better. This step helps in identifying trends and making data-driven recommendations for improving user engagement and training adherence.

Together, these steps provide an analysis framework that reflects users' training habits and supports the development of personalized fitness interventions. Because non-training days are represented explicitly as rows of zeros, each user's activity is described over a continuous daily calendar rather than only on the days when a session was logged.
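The four filters described above differ only in their bounds, so they can be expressed with one helper. The following is a minimal sketch rather than the notebook's actual implementation: the function name `filter_users_by_training_days` and the toy data are hypothetical, and only the column names `user_programs_user_id` and `date` come from the dataset.

```python
import pandas as pd

def filter_users_by_training_days(df, low, high,
                                  user_col="user_programs_user_id",
                                  date_col="date"):
    """Keep the rows of users whose number of distinct training dates is in (low, high]."""
    days_per_user = df.groupby(user_col)[date_col].nunique()
    selected = days_per_user[(days_per_user > low) & (days_per_user <= high)].index
    # .copy() returns an independent DataFrame, so later column assignments are safe
    return df[df[user_col].isin(selected)].copy()

# Toy data (hypothetical values): user 1 trains 2 days, user 2 trains 3 days, user 3 trains 1 day
toy = pd.DataFrame({
    "user_programs_user_id": [1, 1, 2, 2, 2, 3],
    "date": pd.to_datetime([
        "2021-06-01", "2021-06-02",
        "2021-06-01", "2021-06-03", "2021-06-05",
        "2021-06-01",
    ]),
})

short_group = filter_users_by_training_days(toy, low=0, high=2)   # users 1 and 3
long_group = filter_users_by_training_days(toy, low=2, high=90)   # user 2
```

With the real data, the four groups would correspond to the bounds `(0, 30]`, `(30, 90]`, `(90, 180]` and `(180, 365]`.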

In [13]:
# Convert 'date' to datetime
df_filtered_merged['date'] = pd.to_datetime(df_filtered_merged['date'])

# Calculate the amount of days trained per user
user_training_days = df_filtered_merged.groupby('user_programs_user_id')['date'].nunique().reset_index()
user_training_days.columns = ['user_programs_user_id', 'training_days']

# Filter users with 30 or fewer training days
users_30_days_or_less = user_training_days[user_training_days['training_days'] <= 30]

# Dataframe with the filtered users (copied so later column assignments do not act on a slice)
df_filtered_30days = df_filtered_merged[df_filtered_merged['user_programs_user_id'].isin(users_30_days_or_less['user_programs_user_id'])].copy()
df_filtered_30days.shape


(17707, 8705)

In [12]:
# Convert 'date' to datetime
df_filtered_merged['date'] = pd.to_datetime(df_filtered_merged['date'])

# Calculate the amount of days trained per user
user_training_days = df_filtered_merged.groupby('user_programs_user_id')['date'].nunique().reset_index()
user_training_days.columns = ['user_programs_user_id', 'training_days']

# Filter users with more than 30 and at most 90 training days
users_30_to_90_days = user_training_days[(user_training_days['training_days'] > 30) & (user_training_days['training_days'] <= 90)]

# Dataframe with the filtered users (copied so later column assignments do not act on a slice)
df_filtered_30_to_90_days = df_filtered_merged[df_filtered_merged['user_programs_user_id'].isin(users_30_to_90_days['user_programs_user_id'])].copy()
df_filtered_30_to_90_days.shape

(21691, 8705)

In [11]:
# Convert 'date' to datetime
df_filtered_merged['date'] = pd.to_datetime(df_filtered_merged['date'])

# Calculate the amount of days trained per user
user_training_days = df_filtered_merged.groupby('user_programs_user_id')['date'].nunique().reset_index()
user_training_days.columns = ['user_programs_user_id', 'training_days']

# Filter users with more than 90 and at most 180 training days
users_90_to_180_days = user_training_days[(user_training_days['training_days'] > 90) & (user_training_days['training_days'] <= 180)]

# Dataframe with the filtered users (copied so later column assignments do not act on a slice)
df_filtered_90_to_180_days = df_filtered_merged[df_filtered_merged['user_programs_user_id'].isin(users_90_to_180_days['user_programs_user_id'])].copy()
df_filtered_90_to_180_days.shape

(6051, 8705)

In [14]:
# Convert 'date' to datetime
df_filtered_merged['date'] = pd.to_datetime(df_filtered_merged['date'])

# Calculate the amount of days trained per user
user_training_days = df_filtered_merged.groupby('user_programs_user_id')['date'].nunique().reset_index()
user_training_days.columns = ['user_programs_user_id', 'training_days']

# Filter users with more than 180 and at most 365 training days
users_180_to_365_days = user_training_days[(user_training_days['training_days'] > 180) & (user_training_days['training_days'] <= 365)]

# Dataframe with the filtered users (copied so later column assignments do not act on a slice)
df_filtered_180_to_365_days = df_filtered_merged[df_filtered_merged['user_programs_user_id'].isin(users_180_to_365_days['user_programs_user_id'])].copy()
df_filtered_180_to_365_days.shape

(571, 8705)

In [33]:
print('The number of users who have trained 30 days or less is:', df_filtered_30days.user_programs_user_id.nunique())
print('The number of users who have trained between 31 and 90 days is:', df_filtered_30_to_90_days.user_programs_user_id.nunique())
print('The number of users who have trained between 91 and 180 days is:', df_filtered_90_to_180_days.user_programs_user_id.nunique())
print('The number of users who have trained between 181 and 365 days is:', df_filtered_180_to_365_days.user_programs_user_id.nunique())

The number of users who have trained 30 days or less is: 2701
The number of users who have trained between 31 and 90 days is: 432
The number of users who have trained between 91 and 180 days is: 52
The number of users who have trained between 181 and 365 days is: 3
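The per-group user counts can also be obtained in a single pass by binning the per-user totals with `pd.cut`. This is a sketch under stated assumptions: `count_users_per_bucket` and the bucket labels are illustrative names, and the toy series stands in for the `training_days` column of `user_training_days`.

```python
import pandas as pd

def count_users_per_bucket(training_days):
    """Count users per training-day bucket; intervals are right-closed: (0, 30], (30, 90], ..."""
    bins = [0, 30, 90, 180, 365]
    labels = ["<=30", "31-90", "91-180", "181-365"]
    buckets = pd.cut(training_days, bins=bins, labels=labels)
    # observed=False keeps empty buckets in the result, in label order
    return training_days.groupby(buckets, observed=False).size()

# Toy per-user totals (hypothetical values), indexed by user id
toy_days = pd.Series([5, 30, 31, 90, 91, 200], index=[101, 102, 103, 104, 105, 106])
counts = count_users_per_bucket(toy_days)
print(counts)
```

Users with more than 365 training days fall outside every bin and are left out of the counts, which mirrors the upper bound of the last filter.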


### **Fill Non-Training Days With Zeros**

In this part of the analysis, I make each user's training record complete by filling non-training days with zeros. This step is needed for time series analysis, since missing rows would otherwise hide the gaps in a user's training consistency. These are the steps I followed:

- **Identify First Training Day**:
    - For each user in the filtered datasets, I identified the first day they recorded a training session. This serves as the starting point for generating a complete date range for each user.

- **Generate Complete Date Ranges**:
    - Using the first training day as the starting point, I generated a complete date range for each user up to the specified number of days (30 days for users with up to 30 training days, 90 days for users with 31-90 training days, etc.).

- **Merge with Original Data and Fill Missing Days**:
    - I merged these generated date ranges with the original user training data to identify days that were not recorded.
    - For the missing days, I filled every remaining column with zeros (including timestamp columns such as `session_executions_updated_at`), so that there are no gaps in the training data.

- **Combine Completed Data**:
    - The completed data for each user was then combined into a single DataFrame. This new DataFrame includes all the original training data as well as the newly added rows for the non-training days filled with zeros.

This process ensures that each user has a continuous record of training activity, allowing for more accurate analysis and modeling. By filling in the non-training days with zeros, I can better understand user behavior and training patterns, which is essential for developing effective AI models and making informed recommendations.
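The steps above can be sketched as one reusable function. This is a minimal illustration, not the notebook's own code: the name `complete_user_days`, the `reps` column and the toy values are assumptions. Note that a left join onto a fixed-length window keeps only the sessions that fall inside that window, so a session logged after the window ends does not appear in the result.

```python
import pandas as pd

def complete_user_days(df, n_days,
                       user_col="user_programs_user_id",
                       date_col="date"):
    """Build an n_days daily calendar per user from the first training day; zero-fill missing days."""
    first_day = df.groupby(user_col)[date_col].min()
    # One small calendar frame per user, concatenated once at the end
    calendar = pd.concat(
        [
            pd.DataFrame({user_col: user_id,
                          date_col: pd.date_range(start=start, periods=n_days)})
            for user_id, start in first_day.items()
        ],
        ignore_index=True,
    )
    completed = calendar.merge(df, on=[user_col, date_col], how="left")
    # Zero-fill every column except the two keys
    value_cols = completed.columns.difference([user_col, date_col])
    completed[value_cols] = completed[value_cols].fillna(0)
    return completed

# Toy data (hypothetical values); user 1 also has a session outside the 3-day window
toy = pd.DataFrame({
    "user_programs_user_id": [1, 1, 1, 2],
    "date": pd.to_datetime(["2021-06-01", "2021-06-03", "2021-06-20", "2021-06-10"]),
    "reps": [10, 12, 99, 8],
})
completed = complete_user_days(toy, n_days=3)
print(completed)
```

Collecting the per-user frames in a list and concatenating them once avoids growing a DataFrame inside the loop, which becomes slow when there are thousands of users.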

In [38]:
# Ensure that 'date' is in datetime format
df_filtered_30days['date'] = pd.to_datetime(df_filtered_30days['date'])

# Get the first training day for each user
first_training_day = df_filtered_30days.groupby('user_programs_user_id')['date'].min().reset_index()
first_training_day.columns = ['user_programs_user_id', 'first_training_date']

# Create an empty DataFrame to store the completed data
completed_data_30days = pd.DataFrame()

# Iterate over each user and complete the missing days
for user_id, first_day in zip(first_training_day['user_programs_user_id'], first_training_day['first_training_date']):
    # Generate a 30-day date range starting on the first training day
    date_range = pd.date_range(start=first_day, periods=30)
    
    # Create a DataFrame with the date range and user_id
    user_dates = pd.DataFrame({'user_programs_user_id': user_id, 'date': date_range})
    
    # Merge with the original DataFrame to identify the days with and without training
    user_data = pd.merge(user_dates, df_filtered_30days[df_filtered_30days['user_programs_user_id'] == user_id], on=['user_programs_user_id', 'date'], how='left')
    
    # Fill NaN values (days without training) with zeros in every column
    user_data.fillna(0, inplace=True)
    
    # Add the completed data to the final DataFrame
    completed_data_30days = pd.concat([completed_data_30days, user_data])

# Reset the index of the final DataFrame
completed_data_30days.reset_index(drop=True, inplace=True)
print(completed_data_30days)

       user_programs_user_id       date session_executions_updated_at  \
0                        108 2021-06-11    2021-06-11 18:00:35.640406   
1                        108 2021-06-12                             0   
2                        108 2021-06-13                             0   
3                        108 2021-06-14                             0   
4                        108 2021-06-15                             0   
...                      ...        ...                           ...   
81025                  18174 2022-06-20                             0   
81026                  18174 2022-06-21                             0   
81027                  18174 2022-06-22                             0   
81028                  18174 2022-06-23                             0   
81029                  18174 2022-06-24                             0   

       1 leg bridge (left)_reps_1  1 leg bridge (left)_reps_10  \
0                             0.0                        

In [43]:
# Ensure that 'date' is in datetime format
df_filtered_30_to_90_days['date'] = pd.to_datetime(df_filtered_30_to_90_days['date'])

# Get the first training day for each user
first_training_day_30_to_90 = df_filtered_30_to_90_days.groupby('user_programs_user_id')['date'].min().reset_index()
first_training_day_30_to_90.columns = ['user_programs_user_id', 'first_training_date']

# Create an empty DataFrame to store the completed data
completed_data_30_to_90_days = pd.DataFrame()

# Iterate over each user and complete the missing days
for user_id, first_day in zip(first_training_day_30_to_90['user_programs_user_id'], first_training_day_30_to_90['first_training_date']):
    # Generate a 90-day date range starting on the first training day
    date_range = pd.date_range(start=first_day, periods=90)
    
    # Create a DataFrame with the date range and user_id
    user_dates = pd.DataFrame({'user_programs_user_id': user_id, 'date': date_range})
    
    # Merge with the original DataFrame to identify the days with and without training
    user_data = pd.merge(user_dates, df_filtered_30_to_90_days[df_filtered_30_to_90_days['user_programs_user_id'] == user_id], on=['user_programs_user_id', 'date'], how='left')
    
    # Fill NaN values (days without training) with zeros in every column
    user_data.fillna(0, inplace=True)
    
    # Add the completed data to the final DataFrame
    completed_data_30_to_90_days = pd.concat([completed_data_30_to_90_days, user_data])

# Reset the index of the final DataFrame
completed_data_30_to_90_days.reset_index(drop=True, inplace=True)
print(completed_data_30_to_90_days)

       user_programs_user_id       date session_executions_updated_at  \
0                        601 2021-06-10    2021-06-10 20:39:30.633499   
1                        601 2021-06-11                             0   
2                        601 2021-06-12                             0   
3                        601 2021-06-13                             0   
4                        601 2021-06-14    2021-06-14 21:09:12.676070   
...                      ...        ...                           ...   
38875                  16888 2022-07-09                             0   
38876                  16888 2022-07-10                             0   
38877                  16888 2022-07-11                             0   
38878                  16888 2022-07-12                             0   
38879                  16888 2022-07-13                             0   

       1 leg bridge (left)_reps_1  1 leg bridge (left)_reps_10  \
0                             0.0                        

In [44]:
# Ensure that 'date' is in datetime format
df_filtered_90_to_180_days['date'] = pd.to_datetime(df_filtered_90_to_180_days['date'])

# Get the first training day for each user
first_training_day_90_to_180 = df_filtered_90_to_180_days.groupby('user_programs_user_id')['date'].min().reset_index()
first_training_day_90_to_180.columns = ['user_programs_user_id', 'first_training_date']

# Create an empty DataFrame to store the completed data
completed_data_90_to_180_days = pd.DataFrame()

# Iterate over each user and complete the missing days
for user_id, first_day in zip(first_training_day_90_to_180['user_programs_user_id'], first_training_day_90_to_180['first_training_date']):
    # Generate a 180-day date range starting on the first training day
    date_range = pd.date_range(start=first_day, periods=180)
    
    # Create a DataFrame with the date range and user_id
    user_dates = pd.DataFrame({'user_programs_user_id': user_id, 'date': date_range})
    
    # Merge with the original DataFrame to identify the days with and without training
    user_data = pd.merge(user_dates, df_filtered_90_to_180_days[df_filtered_90_to_180_days['user_programs_user_id'] == user_id], on=['user_programs_user_id', 'date'], how='left')
    
    # Fill NaN values (days without training) with zeros in every column
    user_data.fillna(0, inplace=True)
    
    # Add the completed data to the final DataFrame
    completed_data_90_to_180_days = pd.concat([completed_data_90_to_180_days, user_data])

# Reset the index of the final DataFrame
completed_data_90_to_180_days.reset_index(drop=True, inplace=True)
print(completed_data_90_to_180_days)

      user_programs_user_id       date session_executions_updated_at  \
0                       603 2021-06-15    2021-06-15 06:51:57.272306   
1                       603 2021-06-16    2021-06-16 07:14:22.351211   
2                       603 2021-06-17                             0   
3                       603 2021-06-18    2021-06-18 07:21:37.276224   
4                       603 2021-06-19                             0   
...                     ...        ...                           ...   
9355                  13418 2022-06-26                             0   
9356                  13418 2022-06-27                             0   
9357                  13418 2022-06-28                             0   
9358                  13418 2022-06-29                             0   
9359                  13418 2022-06-30                             0   

      1 leg bridge (left)_reps_1  1 leg bridge (left)_reps_10  \
0                            0.0                          0.0   
1    

In [45]:
# Ensure that 'date' is in datetime format
df_filtered_180_to_365_days['date'] = pd.to_datetime(df_filtered_180_to_365_days['date'])

# Get the first training day for each user
first_training_day_180_to_365 = df_filtered_180_to_365_days.groupby('user_programs_user_id')['date'].min().reset_index()
first_training_day_180_to_365.columns = ['user_programs_user_id', 'first_training_date']

# Create an empty DataFrame to store the completed data
completed_data_180_to_365_days = pd.DataFrame()

# Iterate over each user and complete the missing days
for user_id, first_day in zip(first_training_day_180_to_365['user_programs_user_id'], first_training_day_180_to_365['first_training_date']):
    # Generate a 365-day date range starting on the first training day
    date_range = pd.date_range(start=first_day, periods=365)
    
    # Create a DataFrame with the date range and user_id
    user_dates = pd.DataFrame({'user_programs_user_id': user_id, 'date': date_range})
    
    # Merge with the original DataFrame to identify the days with and without training
    user_data = pd.merge(user_dates, df_filtered_180_to_365_days[df_filtered_180_to_365_days['user_programs_user_id'] == user_id], on=['user_programs_user_id', 'date'], how='left')
    
    # Fill NaN values (days without training) with zeros in every column
    user_data.fillna(0, inplace=True)
    
    # Add the completed data to the final DataFrame
    completed_data_180_to_365_days = pd.concat([completed_data_180_to_365_days, user_data])

# Reset the index of the final DataFrame
completed_data_180_to_365_days.reset_index(drop=True, inplace=True)
print(completed_data_180_to_365_days)

      user_programs_user_id       date session_executions_updated_at  \
0                      1718 2021-10-29    2021-10-29 13:07:02.352803   
1                      1718 2021-10-30    2021-10-30 04:36:24.227478   
2                      1718 2021-10-31    2021-10-31 05:47:59.501080   
3                      1718 2021-11-01    2021-11-01 05:51:05.385606   
4                      1718 2021-11-02    2021-11-02 14:16:08.048692   
...                     ...        ...                           ...   
1090                   2677 2022-10-24                             0   
1091                   2677 2022-10-25                             0   
1092                   2677 2022-10-26                             0   
1093                   2677 2022-10-27                             0   
1094                   2677 2022-10-28                             0   

      1 leg bridge (left)_reps_1  1 leg bridge (left)_reps_10  \
0                            0.0                          0.0   
1    

In [49]:
# Save the datasets in HDF5
completed_data_30days.to_hdf('../data/completed_data_30days.h5', key='df', mode='w')
completed_data_30_to_90_days.to_hdf('../data/completed_data_30_to_90_days.h5', key='df', mode='w')
completed_data_90_to_180_days.to_hdf('../data/completed_data_90_to_180_days.h5', key='df', mode='w')
completed_data_180_to_365_days.to_hdf('../data/completed_data_180_to_365_days.h5', key='df', mode='w')

In [50]:
# Read the datasets
completed_data_30days = pd.read_hdf('../data/completed_data_30days.h5', key='df')
completed_data_30_to_90_days = pd.read_hdf('../data/completed_data_30_to_90_days.h5', key='df')
completed_data_90_to_180_days = pd.read_hdf('../data/completed_data_90_to_180_days.h5', key='df')
completed_data_180_to_365_days = pd.read_hdf('../data/completed_data_180_to_365_days.h5', key='df')

print('30 days dataset shape:', completed_data_30days.shape)
print('30-90 days dataset shape:', completed_data_30_to_90_days.shape)
print('90-180 days dataset shape:', completed_data_90_to_180_days.shape)
print('180-365 days dataset shape:', completed_data_180_to_365_days.shape)

30 days dataset shape: (81030, 8705)
30-90 days dataset shape: (38880, 8705)
90-180 days dataset shape: (9360, 8705)
180-365 days dataset shape: (1095, 8705)
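As a quick consistency check, each completed dataset should hold exactly one row per user per day of its window, so its row count should equal the number of users times the window length. The sketch below redoes that arithmetic with the user counts and shapes reported in this notebook; the dictionary names are illustrative.

```python
# Users per group, as printed earlier in the notebook
users_per_group = {"30 days": 2701, "30-90 days": 432, "90-180 days": 52, "180-365 days": 3}
# Window length (periods passed to pd.date_range) for each group
window_days = {"30 days": 30, "30-90 days": 90, "90-180 days": 180, "180-365 days": 365}
# Row counts of the completed datasets, as printed above
reported_rows = {"30 days": 81030, "30-90 days": 38880, "90-180 days": 9360, "180-365 days": 1095}

expected_rows = {group: users_per_group[group] * window_days[group] for group in users_per_group}

for group, expected in expected_rows.items():
    status = "OK" if expected == reported_rows[group] else "MISMATCH"
    print(f"{group}: expected {expected}, reported {reported_rows[group]} -> {status}")
```

All four groups match, which is consistent with every user contributing exactly one row per calendar day in the window.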
