**PROJECT TITLE**:  Multivariate health status to human activities
- **Date Created**: April 2 2024
- **Last Updated**: April 5 2024
- **Project Type**: Research Project
- **Author**: Aniekan Charles Ekanem
- **Designation**: Data Analyst

## 1 PROJECT INTRODUCTION

This project examines how different aspects of a person's health relate to one another and how they are affected by daily activities.  The health measures studied in this project are:
- Distance covered
- Total steps
- Calories burnt
- Heart rate
- Sleep efficiency
- Weight
- Body Mass Index

This project reflects the writer's personal experience and professional point of view. It is therefore open to criticism, and corrections or additions are welcome.

The data used for this project was obtained from [Fitbit Fitness Tracker Data](https://www.kaggle.com/datasets/arashnic/fitbit/). It was chosen for the following reasons:

- The data comes from a widely used public source (Kaggle), which supports its reliability.
- The data is from a first party source and is therefore original.
- The data is comprehensive enough to answer the business question.
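- The data is largely complete: among the datasets checked for missing values in Section 3, only the weight log contains rows with missing values.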
- The data is cited making it more credible.

## 2 LOADING THE LIBRARIES AND DATASETS

In [1]:
# Importing the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import os
import glob

In [2]:
# Loading the datasets

folder_path = '/home/aniekan/Documents/datasets for data analysis practice/bellabeats project/'

# Getting a list of all CSV files in the folder
csv_files = glob.glob(os.path.join(folder_path, '*.csv'))

# Creating an empty dictionary to store DataFrames
dataframes = {}

# Iterating through each CSV file and loading it into a DataFrame
for file in csv_files:
    # Extract the file name without extension
    file_name = os.path.splitext(os.path.basename(file))[0]
    # Load the CSV file into a DataFrame and store it in the dictionary
    dataframes[file_name] = pd.read_csv(file)

# Now you can access each DataFrame using its corresponding file name
dailyActivity_merged = dataframes['dailyActivity_merged']
heartrate_seconds_merged = dataframes['heartrate_seconds_merged']
hourlySteps_merged = dataframes['hourlySteps_merged']
sleepDay_merged = dataframes['sleepDay_merged']
weightLogInfo_merged = dataframes['weightLogInfo_merged']
hourlyCalories_merged = dataframes['hourlyCalories_merged']
hourlyIntensities_merged = dataframes['hourlyIntensities_merged']
minuteCaloriesNarrow_merged = dataframes['minuteCaloriesNarrow_merged']
minuteCaloriesWide_merged = dataframes['minuteCaloriesWide_merged']
minuteIntensitiesNarrow_merged = dataframes['minuteIntensitiesNarrow_merged']
minuteIntensitiesWide_merged = dataframes['minuteIntensitiesWide_merged']
minuteMETsNarrow_merged = dataframes['minuteMETsNarrow_merged']
minuteSleep_merged = dataframes['minuteSleep_merged']
minuteStepsNarrow_merged = dataframes['minuteStepsNarrow_merged']
minuteStepsWide_merged = dataframes['minuteStepsWide_merged']
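The loading loop above assumes that every expected file is present in the folder; a missing CSV would only surface later as a `KeyError` when the dictionary is indexed. A minimal sketch of a loader that fails early is shown here. The helper name `load_csv_folder` and the two tiny stand-in files are hypothetical and only serve to make the example runnable on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_folder(folder, expected=()):
    """Load every CSV in `folder` into a dict keyed by file name (no extension).

    `expected` is an optional iterable of names that must be present.
    Hypothetical helper, not part of the original notebook.
    """
    frames = {p.stem: pd.read_csv(p) for p in sorted(Path(folder).glob("*.csv"))}
    missing = sorted(set(expected) - set(frames))
    if missing:
        raise FileNotFoundError(f"Missing CSV files: {missing}")
    return frames


# Demonstration on a throwaway folder holding two small stand-in files
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Id": [1], "TotalSteps": [100]}).to_csv(
        Path(tmp) / "dailyActivity_merged.csv", index=False)
    pd.DataFrame({"Id": [1], "TotalMinutesAsleep": [327]}).to_csv(
        Path(tmp) / "sleepDay_merged.csv", index=False)
    demo_frames = load_csv_folder(
        tmp, expected=["dailyActivity_merged", "sleepDay_merged"])

print(sorted(demo_frames))
```

With the real folder, the same call would take `folder_path` and the list of dataset names used above.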

## 3 DATA PROCESSING

The datasets loaded above are large and will go through the following processing steps:

- Data inspection: checking the structure of each dataset, its descriptive statistics, data types, and missing values, then applying feature engineering and data cleaning where needed.  This prepares the data for analysis.
- Exploratory Data Analysis
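The inspection steps listed above are repeated for every dataset in this section. As a rough sketch, they could be gathered into one helper; `inspect_dataset` and the small demo frame are hypothetical names introduced only for illustration.

```python
import pandas as pd


def inspect_dataset(df):
    """Collect the basic inspection checks into one dictionary (hypothetical helper)."""
    return {
        "shape": df.shape,                                   # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),           # column -> dtype name
        "rows_with_missing": int(df.isna().any(axis=1).sum()),
        "summary": df.describe(),                            # descriptive statistics
    }


# Small stand-in frame; the None becomes NaN, so one row has a missing value
demo = pd.DataFrame({"Id": [1, 1, 2], "TotalSteps": [13162, 10735, None]})
report = inspect_dataset(demo)
print(report["shape"], report["rows_with_missing"])
```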

**I. PROCESSING DAILYACTIVITY_MERGED DATASET**

In [3]:
# General viewing and inspection of the dailyActivity_merged dataset variables and data

dailyActivity_merged.head(5)

Unnamed: 0,Id,ActivityDate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories
0,1503960366,4/12/2016,13162,8.5,8.5,0.0,1.88,0.55,6.06,0.0,25,13,328,728,1985
1,1503960366,4/13/2016,10735,6.97,6.97,0.0,1.57,0.69,4.71,0.0,21,19,217,776,1797
2,1503960366,4/14/2016,10460,6.74,6.74,0.0,2.44,0.4,3.91,0.0,30,11,181,1218,1776
3,1503960366,4/15/2016,9762,6.28,6.28,0.0,2.14,1.26,2.83,0.0,29,34,209,726,1745
4,1503960366,4/16/2016,12669,8.16,8.16,0.0,2.71,0.41,5.04,0.0,36,10,221,773,1863


In [4]:
# Checking the datatypes of the dataset variables

dailyActivity_merged.dtypes

Id                            int64
ActivityDate                 object
TotalSteps                    int64
TotalDistance               float64
TrackerDistance             float64
LoggedActivitiesDistance    float64
VeryActiveDistance          float64
ModeratelyActiveDistance    float64
LightActiveDistance         float64
SedentaryActiveDistance     float64
VeryActiveMinutes             int64
FairlyActiveMinutes           int64
LightlyActiveMinutes          int64
SedentaryMinutes              int64
Calories                      int64
dtype: object

In [5]:
# Converting Id and ActivityDate datatype to string and date format respectively
dailyActivity_merged['Id'] = dailyActivity_merged['Id'].astype(str)
dailyActivity_merged['ActivityDate'] = pd.to_datetime(dailyActivity_merged['ActivityDate'])

In [6]:
#  Cross-checking the datatypes
dailyActivity_merged.dtypes

Id                                  object
ActivityDate                datetime64[ns]
TotalSteps                           int64
TotalDistance                      float64
TrackerDistance                    float64
LoggedActivitiesDistance           float64
VeryActiveDistance                 float64
ModeratelyActiveDistance           float64
LightActiveDistance                float64
SedentaryActiveDistance            float64
VeryActiveMinutes                    int64
FairlyActiveMinutes                  int64
LightlyActiveMinutes                 int64
SedentaryMinutes                     int64
Calories                             int64
dtype: object

In [7]:
# Checking for unique datatype

In [8]:
# Function to count unique data types in a column and return counts
def count_unique_data_types(col):
    data_type_count = len(set(map(type, col)))
    return data_type_count

In [9]:
# Counting the number of unique data types in each column and store the result in a dictionary
result = {}
for column in dailyActivity_merged.columns:
    result[column] = count_unique_data_types(dailyActivity_merged[column])

# Printing the result with column names
print("Column Name: Count of Unique Data Types")
for column, count in result.items():
    print(f"{column}: {count}")

Column Name: Count of Unique Data Types
Id: 1
ActivityDate: 1
TotalSteps: 1
TotalDistance: 1
TrackerDistance: 1
LoggedActivitiesDistance: 1
VeryActiveDistance: 1
ModeratelyActiveDistance: 1
LightActiveDistance: 1
SedentaryActiveDistance: 1
VeryActiveMinutes: 1
FairlyActiveMinutes: 1
LightlyActiveMinutes: 1
SedentaryMinutes: 1
Calories: 1


In [10]:
# Checking the descriptive statistics of the dataset
dailyActivity_merged.describe()

Unnamed: 0,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories
count,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0
mean,7637.910638,5.489702,5.475351,0.108171,1.502681,0.567543,3.340819,0.001606,21.164894,13.564894,192.812766,991.210638,2303.609574
std,5087.150742,3.924606,3.907276,0.619897,2.658941,0.88358,2.040655,0.007346,32.844803,19.987404,109.1747,301.267437,718.166862
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3789.75,2.62,2.62,0.0,0.0,0.0,1.945,0.0,0.0,0.0,127.0,729.75,1828.5
50%,7405.5,5.245,5.245,0.0,0.21,0.24,3.365,0.0,4.0,6.0,199.0,1057.5,2134.0
75%,10727.0,7.7125,7.71,0.0,2.0525,0.8,4.7825,0.0,32.0,19.0,264.0,1229.5,2793.25
max,36019.0,28.030001,28.030001,4.942142,21.92,6.48,10.71,0.11,210.0,143.0,518.0,1440.0,4900.0


**II. PROCESSING HEARTRATE_SECONDS_MERGED DATASET**

In [11]:
# General viewing and inspection of the heartrate_seconds_merged dataset variables and data

heartrate_seconds_merged.head(5)

Unnamed: 0,Id,Time,Value
0,2022484408,4/12/2016 7:21:00 AM,97
1,2022484408,4/12/2016 7:21:05 AM,102
2,2022484408,4/12/2016 7:21:10 AM,105
3,2022484408,4/12/2016 7:21:20 AM,103
4,2022484408,4/12/2016 7:21:25 AM,101


In [12]:
# Checking the datatypes of heartrate_seconds_merged dataset variables

heartrate_seconds_merged.dtypes

Id        int64
Time     object
Value     int64
dtype: object

In [13]:
# Converting the Id and Time datatype to 'string' and 'date format' data type respectively

# Extracting date from Time column
heartrate_seconds_merged['Time'] = pd.to_datetime(heartrate_seconds_merged['Time'])
heartrate_seconds_merged['ActivityDate'] = heartrate_seconds_merged['Time'].dt.date

# Rearranging columns
heartrate_seconds_merged = heartrate_seconds_merged[['Id', 'Time', 'ActivityDate', 'Value']].copy()  # copy so later assignments act on an independent DataFrame

# Converting to the respective data types
heartrate_seconds_merged['Id'] = heartrate_seconds_merged['Id'].astype(str)
heartrate_seconds_merged['ActivityDate'] = pd.to_datetime(heartrate_seconds_merged['ActivityDate'])
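The same conversion pattern (parse the timestamp, derive `ActivityDate`, cast `Id` to string) recurs for most datasets in this section. One way to express it once is sketched here. The helper name `add_activity_date` is hypothetical, and the explicit `fmt` string is an assumption based on timestamps such as `4/12/2016 7:21:00 AM`; `dt.normalize()` is used in place of `.dt.date` followed by `pd.to_datetime`, since it keeps the `datetime64` dtype in one step.

```python
import pandas as pd


def add_activity_date(df, ts_col, fmt="%m/%d/%Y %I:%M:%S %p"):
    """Return a copy with `ts_col` parsed, `Id` as string, and an `ActivityDate` column.

    Hypothetical helper; `fmt` is an assumed timestamp layout.
    """
    out = df.copy()  # work on a copy so the input frame is left untouched
    out[ts_col] = pd.to_datetime(out[ts_col], format=fmt)
    out["Id"] = out["Id"].astype(str)
    # normalize() resets the time to midnight and keeps the datetime64 dtype
    position = out.columns.get_loc(ts_col) + 1
    out.insert(position, "ActivityDate", out[ts_col].dt.normalize())
    return out


demo_hr = pd.DataFrame({
    "Id": [2022484408, 2022484408],
    "Time": ["4/12/2016 7:21:00 AM", "4/12/2016 7:21:05 AM"],
    "Value": [97, 102],
})
demo_hr = add_activity_date(demo_hr, "Time")
print(demo_hr.dtypes)
```

Passing an explicit format also tends to make parsing faster on large tables than letting pandas infer it, which may matter for the per-second heart rate data.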

In [14]:
#  Cross-checking the datatypes
heartrate_seconds_merged.dtypes

Id                      object
Time            datetime64[ns]
ActivityDate    datetime64[ns]
Value                    int64
dtype: object

In [15]:
# Checking for unique datatype

In [16]:
# Counting the number of unique data types in each column and store the result in a dictionary
result = {}
for column in heartrate_seconds_merged.columns:
    result[column] = count_unique_data_types(heartrate_seconds_merged[column])

# Printing the result with column names
print("Column Name: Count of Unique Data Types")
for column, count in result.items():
    print(f"{column}: {count}")

Column Name: Count of Unique Data Types
Id: 1
Time: 1
ActivityDate: 1
Value: 1


In [17]:
# Checking the descriptive statistics of the dataset
heartrate_seconds_merged.describe()

Unnamed: 0,Value
count,2483658.0
mean,77.32842
std,19.4045
min,36.0
25%,63.0
50%,73.0
75%,88.0
max,203.0


In [18]:
# Number of rows with missing data
heartrate_seconds_merged.isna().any(axis=1).sum()

0

**III. PROCESSING HOURLYSTEPS_MERGED DATASET**

In [19]:
# General viewing and inspection of the hourlySteps_merged dataset variables and data

hourlySteps_merged.head(5)

Unnamed: 0,Id,ActivityHour,StepTotal
0,1503960366,4/12/2016 12:00:00 AM,373
1,1503960366,4/12/2016 1:00:00 AM,160
2,1503960366,4/12/2016 2:00:00 AM,151
3,1503960366,4/12/2016 3:00:00 AM,0
4,1503960366,4/12/2016 4:00:00 AM,0


In [20]:
# Checking the datatypes of hourlySteps_merged dataset variables

hourlySteps_merged.dtypes

Id               int64
ActivityHour    object
StepTotal        int64
dtype: object

In [21]:
# Converting the Id and ActivityHour datatype to 'string' and 'date format' data type respectively

# Extracting date from the ActivityHour column
hourlySteps_merged['ActivityHour'] = pd.to_datetime(hourlySteps_merged['ActivityHour'])
hourlySteps_merged['ActivityDate'] = hourlySteps_merged['ActivityHour'].dt.date

# Rearranging columns
hourlySteps_merged = hourlySteps_merged[['Id', 'ActivityHour', 'ActivityDate', 'StepTotal']].copy()

# Converting to the respective data types
hourlySteps_merged['Id'] = hourlySteps_merged['Id'].astype(str)
hourlySteps_merged['ActivityDate'] = pd.to_datetime(hourlySteps_merged['ActivityDate'])

In [22]:
#  Cross-checking the datatypes
hourlySteps_merged.dtypes

Id                      object
ActivityHour    datetime64[ns]
ActivityDate    datetime64[ns]
StepTotal                int64
dtype: object

In [23]:
# Checking for unique datatype

In [24]:
# Counting the number of unique data types in each column and store the result in a dictionary
result = {}
for column in hourlySteps_merged.columns:
    result[column] = count_unique_data_types(hourlySteps_merged[column])

# Printing the result with column names
print("Column Name: Count of Unique Data Types")
for column, count in result.items():
    print(f"{column}: {count}")

Column Name: Count of Unique Data Types
Id: 1
ActivityHour: 1
ActivityDate: 1
StepTotal: 1


In [25]:
# Checking the descriptive statistics of the dataset
hourlySteps_merged.describe()

Unnamed: 0,StepTotal
count,22099.0
mean,320.166342
std,690.384228
min,0.0
25%,0.0
50%,40.0
75%,357.0
max,10554.0


In [26]:
# Number of rows with missing data
hourlySteps_merged.isna().any(axis=1).sum()

0

**IV. PROCESSING SLEEPDAY_MERGED DATASET**

In [27]:
# General viewing and inspection of the sleepDay_merged dataset variables and data

sleepDay_merged.head(5)

Unnamed: 0,Id,SleepDay,TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed
0,1503960366,4/12/2016 12:00:00 AM,1,327,346
1,1503960366,4/13/2016 12:00:00 AM,2,384,407
2,1503960366,4/15/2016 12:00:00 AM,1,412,442
3,1503960366,4/16/2016 12:00:00 AM,2,340,367
4,1503960366,4/17/2016 12:00:00 AM,1,700,712


With this dataset, areas that can be addressed include sleep patterns, outlier identification, sleep over time, and sleep quality/efficiency.  For this project, I will focus on sleep efficiency, computed below as the percentage of time in bed that was spent asleep.

In [28]:
sleepDay_merged['SleepEfficiency'] = sleepDay_merged['TotalMinutesAsleep']/sleepDay_merged['TotalTimeInBed'] * 100
sleepDay_merged = sleepDay_merged[['Id', 'SleepDay', 'TotalSleepRecords', 'TotalMinutesAsleep', 'TotalTimeInBed', 'SleepEfficiency']].copy()
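As a small worked example of the sleep efficiency calculation, the sketch below applies the same formula (`TotalMinutesAsleep / TotalTimeInBed * 100`) to the first three rows shown in the preview above. The frame name `demo_sleep` is only a stand-in for illustration.

```python
import pandas as pd

# First three rows of the sleepDay_merged preview
demo_sleep = pd.DataFrame({
    "TotalMinutesAsleep": [327, 384, 412],
    "TotalTimeInBed": [346, 407, 442],
})

# Percentage of time in bed that was spent asleep
demo_sleep["SleepEfficiency"] = (
    demo_sleep["TotalMinutesAsleep"] / demo_sleep["TotalTimeInBed"] * 100
)
print(demo_sleep.round(2))
```

For the first row this gives 327 / 346 * 100, roughly 94.51, i.e. about 19 of the 346 minutes in bed were not spent asleep.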

In [29]:
sleepDay_merged

Unnamed: 0,Id,SleepDay,TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed,SleepEfficiency
0,1503960366,4/12/2016 12:00:00 AM,1,327,346,94.508671
1,1503960366,4/13/2016 12:00:00 AM,2,384,407,94.348894
2,1503960366,4/15/2016 12:00:00 AM,1,412,442,93.212670
3,1503960366,4/16/2016 12:00:00 AM,2,340,367,92.643052
4,1503960366,4/17/2016 12:00:00 AM,1,700,712,98.314607
...,...,...,...,...,...,...
408,8792009665,4/30/2016 12:00:00 AM,1,343,360,95.277778
409,8792009665,5/1/2016 12:00:00 AM,1,503,527,95.445920
410,8792009665,5/2/2016 12:00:00 AM,1,415,423,98.108747
411,8792009665,5/3/2016 12:00:00 AM,1,516,545,94.678899


In [30]:
# Checking the datatypes of sleepDay_merged dataset variables

sleepDay_merged.dtypes

Id                      int64
SleepDay               object
TotalSleepRecords       int64
TotalMinutesAsleep      int64
TotalTimeInBed          int64
SleepEfficiency       float64
dtype: object

In [31]:
# Converting the Id and SleepDay datatype to 'string' and 'date format' data type respectively

# Extracting date from SleepDay column
sleepDay_merged['SleepDay'] = pd.to_datetime(sleepDay_merged['SleepDay']) # First convert to date format
sleepDay_merged['ActivityDate'] = sleepDay_merged['SleepDay'].dt.date # Extract the date part

# Rearranging columns
sleepDay_merged = sleepDay_merged[['Id', 'SleepDay', 'ActivityDate', 'TotalSleepRecords', 'TotalMinutesAsleep', 'TotalTimeInBed', 'SleepEfficiency']].copy()

# Converting to the respective data types
sleepDay_merged['Id'] = sleepDay_merged['Id'].astype(str)
sleepDay_merged['ActivityDate'] = pd.to_datetime(sleepDay_merged['ActivityDate'])

In [32]:
sleepDay_merged.dtypes

Id                            object
SleepDay              datetime64[ns]
ActivityDate          datetime64[ns]
TotalSleepRecords              int64
TotalMinutesAsleep             int64
TotalTimeInBed                 int64
SleepEfficiency              float64
dtype: object

In [33]:
sleepDay_merged

Unnamed: 0,Id,SleepDay,ActivityDate,TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed,SleepEfficiency
0,1503960366,2016-04-12,2016-04-12,1,327,346,94.508671
1,1503960366,2016-04-13,2016-04-13,2,384,407,94.348894
2,1503960366,2016-04-15,2016-04-15,1,412,442,93.212670
3,1503960366,2016-04-16,2016-04-16,2,340,367,92.643052
4,1503960366,2016-04-17,2016-04-17,1,700,712,98.314607
...,...,...,...,...,...,...,...
408,8792009665,2016-04-30,2016-04-30,1,343,360,95.277778
409,8792009665,2016-05-01,2016-05-01,1,503,527,95.445920
410,8792009665,2016-05-02,2016-05-02,1,415,423,98.108747
411,8792009665,2016-05-03,2016-05-03,1,516,545,94.678899


In [34]:
# Checking for unique datatype

In [35]:
# Counting the number of unique data types in each column and store the result in a dictionary
result = {}
for column in sleepDay_merged.columns:
    result[column] = count_unique_data_types(sleepDay_merged[column])

# Printing the result with column names
print("Column Name: Count of Unique Data Types")
for column, count in result.items():
    print(f"{column}: {count}")

Column Name: Count of Unique Data Types
Id: 1
SleepDay: 1
ActivityDate: 1
TotalSleepRecords: 1
TotalMinutesAsleep: 1
TotalTimeInBed: 1
SleepEfficiency: 1


In [36]:
# Checking the descriptive statistics of the dataset
sleepDay_merged.describe()

Unnamed: 0,TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed,SleepEfficiency
count,413.0,413.0,413.0,413.0
mean,1.118644,419.467312,458.639225,91.676921
std,0.345521,118.344679,127.101607,8.703885
min,1.0,58.0,61.0,49.836066
25%,1.0,361.0,403.0,91.21813
50%,1.0,433.0,463.0,94.312796
75%,1.0,490.0,526.0,96.068796
max,3.0,796.0,961.0,100.0


In [37]:
# Number of rows with missing data
sleepDay_merged.isna().any(axis=1).sum()

0

**V. PROCESSING WEIGHTLOGINFO_MERGED DATASET**

In [38]:
# General viewing and inspection of the weightLogInfo_merged dataset variables and data

weightLogInfo_merged.head(5)

Unnamed: 0,Id,Date,WeightKg,WeightPounds,Fat,BMI,IsManualReport,LogId
0,1503960366,5/2/2016 11:59:59 PM,52.599998,115.963147,22.0,22.65,True,1462233599000
1,1503960366,5/3/2016 11:59:59 PM,52.599998,115.963147,,22.65,True,1462319999000
2,1927972279,4/13/2016 1:08:52 AM,133.5,294.31712,,47.540001,False,1460509732000
3,2873212765,4/21/2016 11:59:59 PM,56.700001,125.002104,,21.450001,True,1461283199000
4,2873212765,5/12/2016 11:59:59 PM,57.299999,126.324875,,21.690001,True,1463097599000


In [39]:
# Checking the datatypes of weightLogInfo_merged dataset variables

weightLogInfo_merged.dtypes

Id                  int64
Date               object
WeightKg          float64
WeightPounds      float64
Fat               float64
BMI               float64
IsManualReport       bool
LogId               int64
dtype: object

In [40]:
# Converting datatype of Id and Date to 'string' and 'date format' data type respectively

# Extracting date part from the Date column and assigning it to a new column 'ActivityDate'
weightLogInfo_merged['Date'] = pd.to_datetime(weightLogInfo_merged['Date'])
weightLogInfo_merged['ActivityDate'] = weightLogInfo_merged['Date'].dt.date

# Rearrange columns
weightLogInfo_merged = weightLogInfo_merged[['Id', 'Date', 'ActivityDate', 'WeightKg', 'WeightPounds', 'Fat', 'BMI', 'IsManualReport', 'LogId']].copy()

# Converting to the respective datatypes
weightLogInfo_merged['Id'] = weightLogInfo_merged['Id'].astype(str)
weightLogInfo_merged['ActivityDate'] = pd.to_datetime(weightLogInfo_merged['ActivityDate'])

In [41]:
weightLogInfo_merged.dtypes

Id                        object
Date              datetime64[ns]
ActivityDate      datetime64[ns]
WeightKg                 float64
WeightPounds             float64
Fat                      float64
BMI                      float64
IsManualReport              bool
LogId                      int64
dtype: object

In [42]:
# Checking for unique datatype

In [43]:
# Counting the number of unique data types in each column and store the result in a dictionary
result = {}
for column in weightLogInfo_merged.columns:
    result[column] = count_unique_data_types(weightLogInfo_merged[column])

# Printing the result with column names
print("Column Name: Count of Unique Data Types")
for column, count in result.items():
    print(f"{column}: {count}")

Column Name: Count of Unique Data Types
Id: 1
Date: 1
ActivityDate: 1
WeightKg: 1
WeightPounds: 1
Fat: 1
BMI: 1
IsManualReport: 1
LogId: 1


In [44]:
# Checking the descriptive statistics of the dataset
weightLogInfo_merged.describe()

Unnamed: 0,WeightKg,WeightPounds,Fat,BMI,LogId
count,67.0,67.0,2.0,67.0,67.0
mean,72.035821,158.811801,23.5,25.185224,1461772000000.0
std,13.923206,30.695415,2.12132,3.066963,782994800.0
min,52.599998,115.963147,22.0,21.450001,1460444000000.0
25%,61.400002,135.363832,22.75,23.959999,1461079000000.0
50%,62.5,137.788914,23.5,24.389999,1461802000000.0
75%,85.049999,187.503152,24.25,25.559999,1462375000000.0
max,133.5,294.31712,25.0,47.540001,1463098000000.0


In [45]:
# Number of rows with missing data
weightLogInfo_merged.isna().any(axis=1).sum()

65

The above shows that 65 rows have at least one missing value. These come from the 'Fat' column, which holds 'NaN' in all but 2 of the 67 rows (see the count row of the descriptive statistics above).  The 'Fat' column is therefore of little use and can be disregarded.  Instead, the 'BMI' column is sufficient for the health questions addressed here.
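To see which column is responsible for the missing rows, a per-column count can be used alongside the per-row count. The sketch below uses a three-row stand-in built from the preview values above; `demo_weight` and `bmi_only` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Stand-in for weightLogInfo_merged using three rows from the preview
demo_weight = pd.DataFrame({
    "WeightKg": [52.599998, 52.599998, 133.5],
    "Fat": [22.0, np.nan, np.nan],
    "BMI": [22.65, 22.65, 47.540001],
})

# Missing values per column rather than per row
missing_per_column = demo_weight.isna().sum()
print(missing_per_column)

# Dropping the sparse 'Fat' column leaves no rows with missing values
bmi_only = demo_weight.drop(columns=["Fat"])
print(bmi_only.isna().any(axis=1).sum())
```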

**VI. PROCESSING HOURLYCALORIES_MERGED DATASET**

In [46]:
# General viewing and inspection of the hourlyCalories_merged dataset variables and data

hourlyCalories_merged.head()

Unnamed: 0,Id,ActivityHour,Calories
0,1503960366,4/12/2016 12:00:00 AM,81
1,1503960366,4/12/2016 1:00:00 AM,61
2,1503960366,4/12/2016 2:00:00 AM,59
3,1503960366,4/12/2016 3:00:00 AM,47
4,1503960366,4/12/2016 4:00:00 AM,48


In [47]:
# Checking the datatypes of the dataset variables

hourlyCalories_merged.dtypes

Id               int64
ActivityHour    object
Calories         int64
dtype: object

In [48]:
# Converting datatype to 'string' and 'date format' datatype respectively

# Extracting the date part in the ActivityHour column and assigning it to another column "ActivityDate"
hourlyCalories_merged['ActivityHour'] = pd.to_datetime(hourlyCalories_merged['ActivityHour'])
hourlyCalories_merged['ActivityDate'] = hourlyCalories_merged['ActivityHour'].dt.date

# Rearranging columns
hourlyCalories_merged = hourlyCalories_merged[['Id', 'ActivityHour', 'ActivityDate', 'Calories']].copy()

# Converting to the respective datatype
hourlyCalories_merged['Id'] = hourlyCalories_merged['Id'].astype(str)
hourlyCalories_merged['ActivityDate'] = pd.to_datetime(hourlyCalories_merged['ActivityDate'])

In [49]:
#  Cross-checking the datatypes
hourlyCalories_merged.dtypes

Id                      object
ActivityHour    datetime64[ns]
ActivityDate    datetime64[ns]
Calories                 int64
dtype: object

In [50]:
# Checking for unique datatype

In [51]:
# Counting the number of unique data types in each column and store the result in a dictionary
result = {}
for column in hourlyCalories_merged.columns:
    result[column] = count_unique_data_types(hourlyCalories_merged[column])

# Printing the result with column names
print("Column Name: Count of Unique Data Types")
for column, count in result.items():
    print(f"{column}: {count}")

Column Name: Count of Unique Data Types
Id: 1
ActivityHour: 1
ActivityDate: 1
Calories: 1


In [52]:
# Checking the descriptive statistics of the dataset

hourlyCalories_merged.describe()

Unnamed: 0,Calories
count,22099.0
mean,97.38676
std,60.702622
min,42.0
25%,63.0
50%,83.0
75%,108.0
max,948.0


In [53]:
# Number of rows with missing data
hourlyCalories_merged.isna().any(axis=1).sum()

0

**VII. PROCESSING HOURLYINTENSITIES_MERGED DATASET**

In [54]:
# General viewing and inspection of the hourlyIntensities_merged dataset variables and data

hourlyIntensities_merged.head()

Unnamed: 0,Id,ActivityHour,TotalIntensity,AverageIntensity
0,1503960366,4/12/2016 12:00:00 AM,20,0.333333
1,1503960366,4/12/2016 1:00:00 AM,8,0.133333
2,1503960366,4/12/2016 2:00:00 AM,7,0.116667
3,1503960366,4/12/2016 3:00:00 AM,0,0.0
4,1503960366,4/12/2016 4:00:00 AM,0,0.0


In [55]:
# Checking the datatypes of the dataset variables

hourlyIntensities_merged.dtypes

Id                    int64
ActivityHour         object
TotalIntensity        int64
AverageIntensity    float64
dtype: object

In [56]:
# Converting datatype to 'string' and 'date format' datatype respectively

# Extracting the date part from the ActivityHour column and assigning it to a new column 'ActivityDate'
hourlyIntensities_merged['ActivityHour'] = pd.to_datetime(hourlyIntensities_merged['ActivityHour'])
hourlyIntensities_merged['ActivityDate'] = hourlyIntensities_merged['ActivityHour'].dt.date

# Rearranging columns
hourlyIntensities_merged = hourlyIntensities_merged[['Id', 'ActivityHour', 'ActivityDate', 'TotalIntensity', 'AverageIntensity']].copy()

# Converting to the respective datatype
hourlyIntensities_merged['Id'] = hourlyIntensities_merged['Id'].astype(str)
hourlyIntensities_merged['ActivityDate'] = pd.to_datetime(hourlyIntensities_merged['ActivityDate'])

In [57]:
#  Cross-checking the datatypes

hourlyIntensities_merged.dtypes

Id                          object
ActivityHour        datetime64[ns]
ActivityDate        datetime64[ns]
TotalIntensity               int64
AverageIntensity           float64
dtype: object

In [58]:
# Checking for unique datatype

In [59]:
# Counting the number of unique data types in each column and store the result in a dictionary

result = {}
for column in hourlyIntensities_merged.columns:
    result[column] = count_unique_data_types(hourlyIntensities_merged[column])

print('Column Name: Count of Unique Data Types')
for column, count in result.items():
    print(f"{column}: {count}")

Column Name: Count of Unique Data Types
Id: 1
ActivityHour: 1
ActivityDate: 1
TotalIntensity: 1
AverageIntensity: 1


In [60]:
# Checking the descriptive statistics of the dataset

hourlyIntensities_merged.describe()

Unnamed: 0,TotalIntensity,AverageIntensity
count,22099.0,22099.0
mean,12.035341,0.200589
std,21.13311,0.352219
min,0.0,0.0
25%,0.0,0.0
50%,3.0,0.05
75%,16.0,0.266667
max,180.0,3.0


In [61]:
hourlyIntensities_merged['TotalIntensity'].describe()

count    22099.000000
mean        12.035341
std         21.133110
min          0.000000
25%          0.000000
50%          3.000000
75%         16.000000
max        180.000000
Name: TotalIntensity, dtype: float64

In [62]:
# Number of rows with missing data
hourlyIntensities_merged.isna().any(axis=1).sum()

0

**VIII. PROCESSING MINUTECALORIESNARROW_MERGED DATASET**

In [63]:
# General viewing and inspection of the minuteCaloriesNarrow_merged dataset variables and data

minuteCaloriesNarrow_merged.head()

Unnamed: 0,Id,ActivityMinute,Calories
0,1503960366,4/12/2016 12:00:00 AM,0.7865
1,1503960366,4/12/2016 12:01:00 AM,0.7865
2,1503960366,4/12/2016 12:02:00 AM,0.7865
3,1503960366,4/12/2016 12:03:00 AM,0.7865
4,1503960366,4/12/2016 12:04:00 AM,0.7865


In [64]:
# Checking the datatypes of the dataset variables

minuteCaloriesNarrow_merged.dtypes

Id                  int64
ActivityMinute     object
Calories          float64
dtype: object

In [65]:
# Converting datatype to 'string' and 'date format' datatype respectively

# Extracting the date part from the ActivityMinute column and assigning it to a new column 'ActivityDate'
minuteCaloriesNarrow_merged['ActivityMinute'] = pd.to_datetime(minuteCaloriesNarrow_merged['ActivityMinute'])
minuteCaloriesNarrow_merged['ActivityDate'] = minuteCaloriesNarrow_merged['ActivityMinute'].dt.date

# Rearranging columns
minuteCaloriesNarrow_merged = minuteCaloriesNarrow_merged[['Id','ActivityMinute','ActivityDate','Calories']].copy()

# Converting to the respective datatype
minuteCaloriesNarrow_merged['Id'] = minuteCaloriesNarrow_merged['Id'].astype(str)
minuteCaloriesNarrow_merged['ActivityDate'] = pd.to_datetime(minuteCaloriesNarrow_merged['ActivityDate'])

In [66]:
#  Cross-checking the datatypes
minuteCaloriesNarrow_merged.dtypes

Id                        object
ActivityMinute    datetime64[ns]
ActivityDate      datetime64[ns]
Calories                 float64
dtype: object

In [67]:
# Checking for unique datatype

In [68]:
# Counting the number of unique data types in each column and store the result in a dictionary
result = {}
for column in minuteCaloriesNarrow_merged.columns:
    result[column] = count_unique_data_types(minuteCaloriesNarrow_merged[column])

print('Column Name: Count of Unique Data Types')
for column, count in result.items():
    print(f"{column}: {count}")

Column Name: Count of Unique Data Types
Id: 1
ActivityMinute: 1
ActivityDate: 1
Calories: 1


In [69]:
# Checking the descriptive statistics of the dataset
minuteCaloriesNarrow_merged.describe()

Unnamed: 0,Calories
count,1325580.0
mean,1.62313
std,1.410447
min,0.0
25%,0.9357
50%,1.2176
75%,1.4327
max,19.74995


In [70]:
# Number of rows with missing data
minuteCaloriesNarrow_merged.isna().any(axis=1).sum()

0

**IX. PROCESSING MINUTECALORIESWIDE_MERGED DATASET**

In [71]:
# General viewing and inspection (this cell displays minuteCaloriesNarrow_merged, as processed in section VIII, rather than minuteCaloriesWide_merged)

minuteCaloriesNarrow_merged.head()

Unnamed: 0,Id,ActivityMinute,ActivityDate,Calories
0,1503960366,2016-04-12 00:00:00,2016-04-12,0.7865
1,1503960366,2016-04-12 00:01:00,2016-04-12,0.7865
2,1503960366,2016-04-12 00:02:00,2016-04-12,0.7865
3,1503960366,2016-04-12 00:03:00,2016-04-12,0.7865
4,1503960366,2016-04-12 00:04:00,2016-04-12,0.7865


In [72]:
# Checking the datatypes of the dataset variables

minuteCaloriesNarrow_merged.dtypes

Id                        object
ActivityMinute    datetime64[ns]
ActivityDate      datetime64[ns]
Calories                 float64
dtype: object

In [73]:
# Converting datatype to 'string' and 'date format' datatype respectively

# Extracting the date part from 'ActivityMinute' column and assigning it to a new column 'ActivityDate'
minuteCaloriesNarrow_merged['ActivityMinute'] = pd.to_datetime(minuteCaloriesNarrow_merged['ActivityMinute'])
minuteCaloriesNarrow_merged['ActivityDate'] = minuteCaloriesNarrow_merged['ActivityMinute'].dt.date

# Rearranging column
minuteCaloriesNarrow_merged = minuteCaloriesNarrow_merged[['Id','ActivityMinute','ActivityDate','Calories']].copy()

# Converting to the respective datatype
minuteCaloriesNarrow_merged['Id'] = minuteCaloriesNarrow_merged['Id'].astype(str)
minuteCaloriesNarrow_merged['ActivityDate'] = pd.to_datetime(minuteCaloriesNarrow_merged['ActivityDate'])

In [74]:
# Cross-checking the datatypes
minuteCaloriesNarrow_merged.dtypes

Id                        object
ActivityMinute    datetime64[ns]
ActivityDate      datetime64[ns]
Calories                 float64
dtype: object

In [75]:
# Checking for unique datatype

In [76]:
# Counting the number of unique data types in each column and store the result in a dictionary
result = {}
for column in minuteCaloriesNarrow_merged.columns:
    result[column] = count_unique_data_types(minuteCaloriesNarrow_merged[column])

print('Column Name: Count of Unique Data Types')
for column, count in  result.items():
    print(f"{column}: {count}")

Column Name: Count of Unique Data Types
Id: 1
ActivityMinute: 1
ActivityDate: 1
Calories: 1


In [77]:
# Checking the descriptive statistics of the dataset
minuteCaloriesNarrow_merged.describe()

Unnamed: 0,Calories
count,1325580.0
mean,1.62313
std,1.410447
min,0.0
25%,0.9357
50%,1.2176
75%,1.4327
max,19.74995


In [78]:
# Number of rows with missing data
minuteCaloriesNarrow_merged.isna().any(axis=1).sum()

0
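This section's heading names minuteCaloriesWide_merged, while the cells above operate on minuteCaloriesNarrow_merged. One possible way to bring a wide table into the same shape as the narrow one is to melt it. The sketch below assumes the wide file has `Id`, `ActivityHour`, and one column per minute named `Calories00` to `Calories59`; these column names and the values in the stand-in frame are assumptions for illustration and are not taken from the notebook output.

```python
import pandas as pd

# Stand-in for the wide layout: one row per hour, one column per minute (assumed names)
demo_wide = pd.DataFrame({
    "Id": [1503960366],
    "ActivityHour": ["4/13/2016 12:00:00 AM"],
    "Calories00": [1.8876],
    "Calories01": [2.2022],
    "Calories02": [0.9438],
})

# Reshape to one row per minute
demo_long = demo_wide.melt(
    id_vars=["Id", "ActivityHour"],
    var_name="MinuteColumn",
    value_name="Calories",
)

# The last two characters of the column name give the minute offset within the hour
demo_long["Minute"] = demo_long["MinuteColumn"].str[-2:].astype(int)
demo_long["ActivityMinute"] = (
    pd.to_datetime(demo_long["ActivityHour"], format="%m/%d/%Y %I:%M:%S %p")
    + pd.to_timedelta(demo_long["Minute"], unit="min")
)
print(demo_long[["Id", "ActivityMinute", "Calories"]])
```

After this reshaping, the long table could go through the same type conversion and inspection steps as the narrow dataset.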