# Process Notebook --- Personal Product Notebook

### Author: Yunshuo Zhang | SID: 500025673
#### Start date: 21/10/2022    
#### End date: 21/10/2022

## 1. Introduction


**Data Collection:** 

Most of these datasets were created with the QS Access app for iOS, which reads step data from Apple Health. Files whose names end in "detail" were exported directly from Apple Health and converted from XML to CSV with a Python script.
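
The conversion script itself is not included in this notebook. As a rough sketch of what such an XML-to-CSV step could look like, the example below assumes that step entries appear as `Record` elements with `type`, `startDate`, `endDate` and `value` attributes. The sample XML and the function name `step_records_to_csv` are made up for illustration and are not the script that was actually used.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample; a real export is much larger
# and carries more attributes per record.
SAMPLE_XML = """<HealthData>
  <Record type="HKQuantityTypeIdentifierStepCount" startDate="2022-01-01 00:05:00 +1100" endDate="2022-01-01 00:10:00 +1100" value="14"/>
  <Record type="HKQuantityTypeIdentifierHeartRate" startDate="2022-01-01 00:06:00 +1100" endDate="2022-01-01 00:06:00 +1100" value="72"/>
  <Record type="HKQuantityTypeIdentifierStepCount" startDate="2022-01-01 08:00:00 +1100" endDate="2022-01-01 08:09:00 +1100" value="893"/>
</HealthData>"""


def step_records_to_csv(xml_text):
    """Keep only step-count records and return them as CSV text."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Start", "Finish", "Steps (count)"])
    for record in root.iter("Record"):
        if record.get("type") == "HKQuantityTypeIdentifierStepCount":
            writer.writerow(
                [record.get("startDate"), record.get("endDate"), record.get("value")]
            )
    return out.getvalue()


csv_text = step_records_to_csv(SAMPLE_XML)
print(csv_text)
```

The column names mirror the ones in the participant files so that the output could be read with `pd.read_csv` in the same way.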

**Data Ownership:** Apple Health

--------------------------------------------------------------------------------------------------------------

**Stakeholder:** 

The stakeholders in this research are the participants, the health analysis team, Apple Health, and the health promotion organization.

1. For participants, the results show their health compliance, which they could improve by following our suggestions.

2. For the health analysis team, we can exchange research results and observe the general monthly steps of participants. The data cleaning, data sorting, and data visualization methods used in this research can also be reused in other related research.

3. For the health promotion organization, our results can roughly indicate which months suit health promotion and support reasonable health suggestions based on participants' monthly exercise.

4. For Apple Health, the results could be used to remind users to take more steps and to monitor the campaign.

--------------------------------------------------------------------------------------------------------------

**Potential Conflicts of interest:**

1. For participants, if they use our product notebook to track the monthly trend of their own steps and find that they have reached the standard this month, they may become complacent, which could lower the rate at which they reach the standard in later months.

2. For the health analysis team, our research results or research directions may conflict with theirs, and the method used in our research may not be applicable to their data sets.

3. For some health promotion organizations and Apple Health, our research may conflict with their interests. For example, when customers know that their steps are up to standard, it is harder for gym publicity staff to promote fitness courses to them, and they may not follow the plan that Apple Health provides.

--------------------------------------------------------------------------------------------------------------

**Anonymisation:** 

The data only includes step data from the start of this year. We obtained the informed consent of the participants, and the datasets are named by participant ID. Hence, we carried out this analysis without knowing the participants' age, gender, ethnicity, favorite breakfast food, or anything else, which supports anonymization and reduces the risk of re-identifying the people behind the data.

--------------------------------------------------------------------------------------------------------------

**Data Management:** 

We manage the data by keeping the data sets separate. Data cleaning is performed on each data set individually. Under each sub-question, data processing and analysis are carried out with each data set as its own sub-section. Finally, the results of each data set are combined for comparative analysis.

--------------------------------------------------------------------------------------------------------------

**Purpose:** 

1. To reflect the participant's real health compliance each month, that is, whether the participant keeps taking steps and meets the health standard.

2. To show whether exam months, semester months, or months with more holidays affect the participant's step count, and how.

3. To explore the impact of seasonal changes on participants’ steps and find the season in which participants take more steps.

--------------------------------------------------------------------------------------------------------------

**Importance:** 

Analyzing participants' data from five different perspectives reflects the monthly trend more clearly and in more detail.

1. Observing the proportion of days that reach the standard each month better reflects the participant's real health compliance. A month's total steps may reach the health standard because of very high step counts on a few days, while most days fall short. Hence, the monthly total alone does not represent the participant's compliance with healthy steps.

2. By analyzing the number of steps in three special time periods, we can help participants better plan their exercise time. Moreover, motion monitoring software such as Apple Health can issue appropriate reminders and plans to help participants plan their time reasonably.

3. By exploring seasonal changes and differences, we can better infer the season in which participants are more accustomed to exercising. This also supports the seasonal publicity of health promotion organizations.
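
To make the first point concrete, here is a small worked example with made-up numbers (not participant data): a 30-day month in which six very active days push the monthly total over the standard even though most days fall short. The 7000-step daily threshold is the one used later in this notebook.

```python
import pandas as pd

DAILY_STANDARD = 7000

# Hypothetical 30-day month: 6 very active days and 24 quiet days.
daily_steps = pd.Series([30000] * 6 + [1500] * 24)

monthly_total = daily_steps.sum()                     # 180000 + 36000 = 216000
monthly_standard = DAILY_STANDARD * len(daily_steps)  # 7000 * 30 = 210000
share_of_days = (daily_steps >= DAILY_STANDARD).mean() * 100  # 6 of 30 days

print(monthly_total >= monthly_standard)  # the monthly total meets the standard
print(round(share_of_days, 1))            # but only 20.0% of the days do
```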

--------------------------------------------------------------------------------------------------------------

**Uncertainty:**

1. Device:

We are not sure how the iPhone and Apple Watch count steps. In daily use, some common hand movements can increase the step count. At other times, the pedometer does not increase even though the participant is walking, for example when the movement is slight or the device has poor contact. Therefore, the data provided by the devices can only be used as a reference for our investigation and cannot be completely relied on.

2. Participant:

We do not know the participants' habits for carrying their watches and mobile phones, and accidents can happen. For example, participants may not bring a counting device when going out, or their phone may lie still while they exercise, which will affect our results.

3. Data Background:

Detailed information about the participants is unknown. For example, we do not know their age, nationality, or other personal details. Therefore, in sub-question 1, we assume that the participant is a normal adult. In the remaining sub-questions, we can only use the public holidays and seasonal months in Australia as a reference. In addition, semester and examination months differ between schools, so we use the most common dates as a reference.


--------------------------------------------------------------------------------------------------------------


### 1.1 Import package and datasets

Import the packages

In [1]:
from __future__ import print_function
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Set the default renderer of pio to "svg" so that the visualizations generated by cufflinks can be displayed on GitHub

In [2]:
import plotly.io as pio  # import added: pio was used without being imported
pio.renderers.default = "svg"

Import the datasets used in the analysis: Participant-03.csv, Participant-01.csv and Participant-07.csv

In [3]:
data03 = pd.read_csv("Participant-03.csv", sep=",")

data01 = pd.read_csv("Participant-01.csv", sep = ",")

data07 = pd.read_csv("Participant-07.csv", sep = ",")


### 1.2 Preliminary observation data

Observe the head and tail of each dataset to get a preliminary understanding of how the data is composed.

In [4]:
data03.head()

Unnamed: 0,Start,Finish,Steps (count)
0,31/12/2021 23:00,01/01/2022 0:00,0.0
1,01/01/2022 0:00,01/01/2022 1:00,14.0
2,01/01/2022 1:00,01/01/2022 2:00,0.0
3,01/01/2022 2:00,01/01/2022 3:00,0.0
4,01/01/2022 3:00,01/01/2022 4:00,0.0


In [5]:
data03.tail()

Unnamed: 0,Start,Finish,Steps (count)
6015,08/09/2022 13:00,08/09/2022 14:00,0.0
6016,08/09/2022 14:00,08/09/2022 15:00,893.959035
6017,08/09/2022 15:00,08/09/2022 16:00,534.040965
6018,08/09/2022 16:00,08/09/2022 17:00,390.020268
6019,08/09/2022 17:00,08/09/2022 18:00,1191.979732


The p3 dataset records the start time and finish time of each one-hour period and the number of steps the participant took in it. The finish times range from 01/01/2022 0:00 to 08/09/2022 18:00.
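
The p3 timestamps are written day first (dd/mm/yyyy), which matters when they are parsed. The short sketch below uses the last Finish value from the table above to show how the same string is read with and without `dayfirst=True`.

```python
import pandas as pd

stamp = "08/09/2022 18:00"  # last Finish value in the p3 dataset

day_first = pd.to_datetime(stamp, dayfirst=True)  # read as 8 September 2022
month_first = pd.to_datetime(stamp)               # pandas default: 9 August 2022

print(day_first.month, month_first.month)
```

Without `dayfirst=True`, ambiguous dates such as this one would be assigned to the wrong month, which would distort any monthly aggregation.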

In [6]:
data01.head()

Unnamed: 0,Start,Finish,Steps (count)
0,31-Dec-2021 23:00,01-Jan-2022 00:00,0.0
1,01-Jan-2022 00:00,01-Jan-2022 01:00,0.0
2,01-Jan-2022 01:00,01-Jan-2022 02:00,0.0
3,01-Jan-2022 02:00,01-Jan-2022 03:00,0.0
4,01-Jan-2022 03:00,01-Jan-2022 04:00,0.0


In [7]:
data01.tail()

Unnamed: 0,Start,Finish,Steps (count)
5862,02-Sep-2022 04:00,02-Sep-2022 05:00,9.0
5863,02-Sep-2022 05:00,02-Sep-2022 06:00,0.0
5864,02-Sep-2022 06:00,02-Sep-2022 07:00,410.584935
5865,02-Sep-2022 07:00,02-Sep-2022 08:00,1153.415065
5866,02-Sep-2022 08:00,02-Sep-2022 09:00,1.0


The p1 dataset records the start time and finish time of each one-hour period and the number of steps the participant took in it. The finish times range from 01/01/2022 0:00 to 02/09/2022 09:00.

In [8]:
data07.head()

Unnamed: 0,Start,Finish,Steps (count)
0,01-Jan-2022 00:00,01-Jan-2022 01:00,32.0
1,01-Jan-2022 01:00,01-Jan-2022 02:00,0.0
2,01-Jan-2022 02:00,01-Jan-2022 03:00,0.0
3,01-Jan-2022 03:00,01-Jan-2022 04:00,0.0
4,01-Jan-2022 04:00,01-Jan-2022 05:00,0.0


In [9]:
data07.tail()

Unnamed: 0,Start,Finish,Steps (count)
6128,13-Sep-2022 07:00,13-Sep-2022 08:00,798.0
6129,13-Sep-2022 08:00,13-Sep-2022 09:00,139.0
6130,13-Sep-2022 09:00,13-Sep-2022 10:00,392.0
6131,13-Sep-2022 10:00,13-Sep-2022 11:00,328.0
6132,13-Sep-2022 11:00,13-Sep-2022 12:00,0.0


The data07 dataset records the start time and finish time of each one-hour period and the number of steps the participant took in it. The finish times range from 01/01/2022 01:00 to 13/09/2022 12:00.


## 2. Data Cleaning

### 2.1 Dataset: Participant_03.csv

Check the NaN value in the dataset

In [10]:
print(data03.isnull().sum())

Start            0
Finish           0
Steps (count)    0
dtype: int64


Each step count covers a one-hour period. The first period starts at 31/12/2021 23:00, so only one hour falls in December 2021; we count it as January 2022 by aggregating the monthly data on the Finish time.

First, extract the month and the date from the Finish time and store them as new columns called "month" and "date" respectively. Then drop the "Start" and "Finish" columns because they are not used in the later analysis.

In [11]:
data03['month'] = pd.to_datetime(data03['Finish'], dayfirst=True).dt.month 
data03['date'] = pd.to_datetime(data03['Finish'], dayfirst=True).dt.date
data03 = data03.drop('Start', axis=1)
data03 = data03.drop('Finish', axis=1)
data03

Unnamed: 0,Steps (count),month,date
0,0.000000,1,2022-01-01
1,14.000000,1,2022-01-01
2,0.000000,1,2022-01-01
3,0.000000,1,2022-01-01
4,0.000000,1,2022-01-01
...,...,...,...
6015,0.000000,9,2022-09-08
6016,893.959035,9,2022-09-08
6017,534.040965,9,2022-09-08
6018,390.020268,9,2022-09-08


Based on the extracted "date", we aggregate the daily step count by summing the hourly values, keep the month alongside it, and store the result as a new data frame called "cleaned_date".

Furthermore, September contains only 8 days of data, which would distort the month-to-month trend analysis. Hence, we remove the September data.

In [12]:
cleaned_date = data03.groupby('date').agg(total_steps_per_day=pd.NamedAgg(column="Steps (count)", aggfunc="sum"),
                                      month=pd.NamedAgg(column="month", aggfunc="max"))

cleaned_date = cleaned_date.drop(cleaned_date[cleaned_date.month == 9].index)
cleaned_date

date,total_steps_per_day,month
2022-01-01,1614.0,1
2022-01-02,5822.0,1
2022-01-03,1959.0,1
2022-01-04,61.0,1
2022-01-05,215.0,1
...,...,...
2022-08-27,2295.0,8
2022-08-28,1118.0,8
2022-08-29,12679.0,8
2022-08-30,9600.0,8


This cleaned_date can be used to calculate the monthly steps and in the sub-question analysis.
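
Because the same cleaning steps are repeated for every participant, they could be wrapped in one helper. The sketch below is one possible way to do that; the function name `daily_totals`, its `drop_months` parameter and the tiny sample frame are hypothetical and only mirror the layout of the participant files.

```python
import pandas as pd


def daily_totals(raw, drop_months=(9,)):
    """Sum hourly steps per calendar day, keyed on the Finish timestamp."""
    finish = pd.to_datetime(raw["Finish"], dayfirst=True)  # parse once, reuse twice
    tidy = pd.DataFrame(
        {
            "date": finish.dt.date,
            "month": finish.dt.month,
            "steps": raw["Steps (count)"],
        }
    )
    daily = tidy.groupby("date").agg(
        total_steps_per_day=("steps", "sum"),
        month=("month", "max"),
    )
    # Drop months that are known to be incomplete (September in this notebook).
    return daily[~daily["month"].isin(drop_months)]


# Tiny made-up frame in the same layout as the participant files.
sample = pd.DataFrame(
    {
        "Start": ["31/12/2021 23:00", "01/01/2022 00:00", "01/09/2022 09:00"],
        "Finish": ["01/01/2022 00:00", "01/01/2022 01:00", "01/09/2022 10:00"],
        "Steps (count)": [0.0, 14.0, 500.0],
    }
)
result = daily_totals(sample)
print(result)
```

Keying on Finish places the 31/12/2021 23:00 hour in January 2022, as described above, and the September row is removed by the month filter.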

### 2.2 Dataset: Participant_01.csv
Repeat the same process as before

Check the NaN value

In [13]:
print(data01.isnull().sum())

Start            0
Finish           0
Steps (count)    0
dtype: int64


Extract month and date. Then, drop the "Start" and "Finish" columns

In [14]:
data01['month'] = pd.to_datetime(data01['Finish'], dayfirst=True).dt.month 
data01['date'] = pd.to_datetime(data01['Finish'], dayfirst=True).dt.date
data01 = data01.drop('Start', axis=1)
data01 = data01.drop('Finish', axis=1)
data01

Unnamed: 0,Steps (count),month,date
0,0.000000,1,2022-01-01
1,0.000000,1,2022-01-01
2,0.000000,1,2022-01-01
3,0.000000,1,2022-01-01
4,0.000000,1,2022-01-01
...,...,...,...
5862,9.000000,9,2022-09-02
5863,0.000000,9,2022-09-02
5864,410.584935,9,2022-09-02
5865,1153.415065,9,2022-09-02


Aggregate the total steps per day and drop September data

In [15]:
cleaned_date_p1 = data01.groupby('date').agg(total_steps_per_day=pd.NamedAgg(column="Steps (count)", aggfunc="sum"),
                                      month=pd.NamedAgg(column="month", aggfunc="max"))

cleaned_date_p1 = cleaned_date_p1.drop(cleaned_date_p1[cleaned_date_p1.month == 9].index)
cleaned_date_p1

date,total_steps_per_day,month
2022-01-01,11500.000000,1
2022-01-02,10446.000000,1
2022-01-03,6294.000000,1
2022-01-04,14411.000000,1
2022-01-05,11026.000000,1
...,...,...
2022-08-27,13906.000000,8
2022-08-28,11704.000000,8
2022-08-29,8283.918443,8
2022-08-30,8899.000000,8


This cleaned_date_p1 can be used to calculate the monthly steps and in the sub-question analysis.

### 2.3 Dataset: Participant_07.csv

Repeat the same process as before

Check NaN

In [16]:
print(data07.isnull().sum())

Start            0
Finish           0
Steps (count)    0
dtype: int64


Extract month and date. Then, drop the "Start" and "Finish" columns

In [17]:
data07['month'] = pd.to_datetime(data07['Finish'], dayfirst=True).dt.month 
data07['date'] = pd.to_datetime(data07['Finish'], dayfirst=True).dt.date
data07 = data07.drop('Start', axis=1)
data07 = data07.drop('Finish', axis=1)
data07

Unnamed: 0,Steps (count),month,date
0,32.0,1,2022-01-01
1,0.0,1,2022-01-01
2,0.0,1,2022-01-01
3,0.0,1,2022-01-01
4,0.0,1,2022-01-01
...,...,...,...
6128,798.0,9,2022-09-13
6129,139.0,9,2022-09-13
6130,392.0,9,2022-09-13
6131,328.0,9,2022-09-13


Aggregate the total steps per day and drop September data

In [18]:
cleaned_date_p7 = data07.groupby('date').agg(total_steps_per_day=pd.NamedAgg(column="Steps (count)", aggfunc="sum"),
                                      month=pd.NamedAgg(column="month", aggfunc="max"))

cleaned_date_p7 = cleaned_date_p7.drop(cleaned_date_p7[cleaned_date_p7.month == 9].index)
cleaned_date_p7

date,total_steps_per_day,month
2022-01-01,6758.000000,1
2022-01-02,7982.000000,1
2022-01-03,7294.000000,1
2022-01-04,5918.851895,1
2022-01-05,11143.148105,1
...,...,...
2022-08-27,6133.000000,8
2022-08-28,8702.888453,8
2022-08-29,8034.911086,8
2022-08-30,11837.384971,8


This cleaned_date_p7 can be used to calculate the monthly steps and in the sub-question analysis.

## 3 Driving question: What are the trends in the month-to-month step data?
### 3.1 Sub-Question: What is the proportion of days reaching the health standard every month?


#### 3.1.1 Dataset: Participant_03.csv

Healthy steps reference link: https://ijbnpa.biomedcentral.com/articles/10.1186/1479-5868-8-79

According to the research, 10,000 steps/day is reasonable for the adult population, and estimates of habitual activity levels equate to 7,000 to 11,000 steps/day. We assume that participant 3 is a normal adult and set the minimum number of healthy steps to 7000 steps per day, which is equivalent to 210000 steps in a 30-day month.

We want to explore the proportion of days that meet the health standard in each month. This ratio shows how often the participant takes a healthy number of steps in each month and helps them understand their own step pattern.

Based on cleaned_date, we aggregate the monthly step count by summing the daily totals and calculate the monthly health standard by multiplying the number of days in the month by the standard daily steps.
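
As a quick check of the arithmetic, the monthly standard can also be derived from the calendar alone. This sketch assumes every day of the month is present in the data and uses the 7000-step daily standard chosen above.

```python
import calendar

DAILY_STANDARD = 7000
YEAR = 2022

# Days in each month from January to August 2022, times the daily standard.
monthly_standard = {
    month: calendar.monthrange(YEAR, month)[1] * DAILY_STANDARD
    for month in range(1, 9)
}
print(monthly_standard)
```

Months with 31 days give 217000 steps, months with 30 days give 210000, and February 2022 gives 196000.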

In [19]:
clean_month_steps = cleaned_date.groupby('month').agg(total_steps_per_month=pd.NamedAgg(column="total_steps_per_day", aggfunc="sum"))
clean_month_date = cleaned_date.groupby('month').count()
clean_healthy = clean_month_date * 7000

In [20]:
clean_p3 = pd.concat([clean_month_steps, clean_healthy], axis=1, join="inner")
clean_p3.columns.values[1] = "health_standard"
clean_p3.round(0).astype(int)

month,total_steps_per_month,health_standard
1,52121,217000
2,211422,196000
3,205875,217000
4,146538,210000
5,218108,217000
6,162690,210000
7,119113,217000
8,250056,217000


After rounding the data to the nearest integer, the table is cleaner and more organized.

In [21]:
import pandas as pd
import cufflinks as cf
import numpy as np

In [22]:
cf.set_config_file(theme='pearl', sharing='public', offline=True)

Find the total number of days with at least 7000 steps in each month

In [23]:
cleaned_count = cleaned_date.groupby('month')['total_steps_per_day'].apply(lambda x: x[x >= 7000].count())
cleaned_count = pd.DataFrame(data = cleaned_count)
cleaned_count.columns.values[0] = "total_days_over_7000"
cleaned_count

month,total_days_over_7000
1,1
2,20
3,19
4,13
5,16
6,13
7,8
8,19


Calculate the proportion by dividing the number of days with at least 7000 steps by the total number of days in the month and store it in the "proportion" dataframe

In [24]:
proportion = cleaned_count['total_days_over_7000'] / clean_month_date['total_steps_per_day']
proportion = pd.DataFrame(data = proportion)
proportion.columns = proportion.columns.astype(str)
proportion.columns.values[0] = "health_proportion(%)"
proportion

month,health_proportion(%)
1,0.032258
2,0.714286
3,0.612903
4,0.433333
5,0.516129
6,0.433333
7,0.258065
8,0.612903
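
The same proportion can also be computed in one step, because the mean of a True/False column is the share of True values. The sketch below uses made-up daily totals in the layout of cleaned_date; it is an alternative formulation, not the code used for the results in this notebook.

```python
import pandas as pd

# Made-up daily totals for two short "months", in the layout of cleaned_date.
daily = pd.DataFrame(
    {
        "total_steps_per_day": [8000, 6500, 7000, 1200, 9000, 3000, 2000, 1000],
        "month": [1, 1, 1, 1, 2, 2, 2, 2],
    }
)

# True where the day reaches the 7000-step standard.
met_standard = daily["total_steps_per_day"] >= 7000

# Mean of the boolean flags per month = share of days meeting the standard.
health_proportion = (met_standard.groupby(daily["month"]).mean() * 100).round(2)
print(health_proportion)
```

In this toy data, month 1 has two of four days at or above 7000 steps and month 2 has one of four.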


Multiply the healthy proportion by 100 and round it to 2 decimal places.

In [25]:
proportion = proportion * 100
proportion.round(2)

month,health_proportion(%)
1,3.23
2,71.43
3,61.29
4,43.33
5,51.61
6,43.33
7,25.81
8,61.29


In [26]:
p3_mean = proportion['health_proportion(%)'].mean()
p3_mean

45.165130568356375

Below we created an interactive area chart covering the area under the health proportion of each month.

In [27]:
proportion.figure(y="health_proportion(%)",
               fill=True,
               xTitle="Month", yTitle="Healthy Proportion（%）", title="Figure 1: Health proportion in each month of Participant03"
               )


Figure 1 is an area chart of the health proportion in each month for participant 03; the x-axis is the month and the y-axis is the proportion in percent. It shows how the proportion changes from month to month.

Also, we created an interactive heat map to observe the proportion based on the intensity of the blue color.

In [28]:
proportion.figure(kind="heatmap",
                   colorscale="Blues",xTitle="Month",title="Figure 2: Health proportion intensity of Participant03",
                   dimensions=(700,500))

Figure 2 is a heat map of the health proportion of participant 03, where the x-axis represents the month and the intensity of blue represents the healthy proportion in that month.

Compare the monthly total steps with the health standard and generate a spread chart to observe the month-to-month trend in total steps

In [29]:
clean_p3.figure(kind="spread", keys=["total_steps_per_month", "health_standard"],
               title="Figure 3: Participant 3 Monthly Steps and Standard Spread Chart")


The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly instead.


Figure 3 is a spread chart that compares the actual total steps in each month with the total steps required by the health standard, where the x-axis is the month and the y-axis is the number of steps. The orange line represents the total steps in each month and the blue line represents the health standard. The spread chart also shows the difference between the two lines.

According to Figure 1 and Figure 2, the proportion of days reaching the health standard is lowest in January at only about 3%, and the corresponding color in the heat map is close to white, which means lower than 10%. The proportion is highest in February at approximately 70%, where the color is also the darkest. March and August are also relatively high at about 60%. The average proportion is around 45%.

In addition, based on Figure 3, even though the total steps in May and August reach the monthly health standard, only approximately 50%-60% of the days in those months reach the daily standard. Hence, the total steps alone cannot reflect the real health status. Furthermore, there is a large gap between the total steps and the standard in January, April, June, and July, which are the months with a small proportion of healthy days.
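
The gap that the spread chart visualizes can also be computed directly. The sketch below rebuilds the clean_p3 table from the values printed earlier in this section and subtracts the standard from the monthly total; a negative value means the month fell short.

```python
import pandas as pd

# Monthly totals and standards copied from the clean_p3 table above.
clean_p3 = pd.DataFrame(
    {
        "total_steps_per_month": [52121, 211422, 205875, 146538, 218108, 162690, 119113, 250056],
        "health_standard": [217000, 196000, 217000, 210000, 217000, 210000, 217000, 217000],
    },
    index=pd.Index(range(1, 9), name="month"),
)

# Positive gap: total above the standard; negative gap: shortfall.
gap = clean_p3["total_steps_per_month"] - clean_p3["health_standard"]
print(gap)

# Months whose total meets or exceeds the monthly standard.
print(gap[gap >= 0].index.tolist())
```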

In conclusion, the participant has extremely irregular exercise habits and does not meet the health standard in most months. Moreover, both the proportion trend and the total steps trend fluctuate greatly. We suggest that the participant increase exercise and steps in most months and try to meet the healthy standard every day instead of concentrating activity in a few days.



#### 3.1.2  Dataset: Participant_01.csv


Repeat the same process as before

Aggregate the total steps per month and the healthy standard steps

In [30]:
clean_month_steps = cleaned_date_p1.groupby('month').agg(total_steps_per_month=pd.NamedAgg(column="total_steps_per_day", aggfunc="sum"))
clean_month_date = cleaned_date_p1.groupby('month').count()
clean_healthy = clean_month_date * 7000

In [31]:
clean_p1 = pd.concat([clean_month_steps, clean_healthy], axis=1, join="inner")
clean_p1.columns.values[1] = "health_standard"
clean_p1.round(0).astype(int)

month,total_steps_per_month,health_standard
1,327437,217000
2,236492,196000
3,243643,217000
4,224814,210000
5,220166,217000
6,301786,210000
7,254796,217000
8,283125,217000


Calculate the total number of days with at least 7000 steps

In [32]:
cleaned_count_p1 = cleaned_date_p1.groupby('month')['total_steps_per_day'].apply(lambda x: x[x >= 7000].count())
cleaned_count_p1 = pd.DataFrame(data = cleaned_count_p1)
cleaned_count_p1.columns.values[0] = "total_days_over_7000"
cleaned_count_p1

month,total_days_over_7000
1,24
2,18
3,18
4,15
5,13
6,20
7,21
8,24


Calculate the proportion

In [33]:
proportion = cleaned_count_p1['total_days_over_7000'] / clean_month_date['total_steps_per_day']
proportion = pd.DataFrame(data = proportion)
proportion.columns = proportion.columns.astype(str)
proportion.columns.values[0] = "health_proportion(%)"
proportion

month,health_proportion(%)
1,0.774194
2,0.642857
3,0.580645
4,0.5
5,0.419355
6,0.666667
7,0.677419
8,0.774194


In [34]:
proportion = proportion * 100
proportion.round(2)

month,health_proportion(%)
1,77.42
2,64.29
3,58.06
4,50.0
5,41.94
6,66.67
7,67.74
8,77.42


Average proportion:

In [35]:
p1_mean = proportion['health_proportion(%)'].mean()
p1_mean

62.94162826420891

Proportion area chart of participant 01

In [36]:
proportion.figure(y="health_proportion(%)",
               fill=True,
               xTitle="Month", yTitle="Health Proportion（%）", title="Figure 4: Healthy proportion in each month of Participant01"
               )

Heatmap of participant 01

In [37]:
proportion.figure(kind="heatmap",
                   colorscale="Reds",xTitle="Month",title="Figure 5: Healthy proportion intensity of Participant01",
                   dimensions=(700,500))

Build a spread chart of total steps month to month trend

In [38]:
clean_p1.figure(kind="spread", keys=["total_steps_per_month", "health_standard"],
               title="Figure 6: Participant 1 Monthly Steps and Standard Spread Chart")




According to Figure 4 and Figure 5, the proportion of days reaching the health standard is lowest in May at about 40%, where the color in the heat map is close to white. The proportion in January and August is higher than in the other months at almost 80%, and the color is correspondingly darker. The proportions in the remaining months are relatively stable and moderate, all at 50% or more. The average proportion is around 63%. Furthermore, based on Figure 6, the total steps exceed the health standard in every month; January and June exceed it by approximately 100k steps.

In conclusion, the participant reached the monthly health standard every month, but the proportion of healthy days per month fluctuated slightly. The average proportion of around 63% is higher than that of Participant 03. We suggest that the participant mainly increase daily steps between March and May. Overall, compared with Participant 3, this participant's activity level is relatively high.

#### 3.1.3 Dataset: Participant_07.csv


Aggregate the total steps per month and the healthy standard steps

In [39]:
clean_month_steps = cleaned_date_p7.groupby('month').agg(total_steps_per_month=pd.NamedAgg(column="total_steps_per_day", aggfunc="sum"))
clean_month_date = cleaned_date_p7.groupby('month').count()
clean_healthy = clean_month_date * 7000

In [40]:
clean_p7 = pd.concat([clean_month_steps, clean_healthy], axis=1, join="inner")
clean_p7.columns.values[1] = "health_standard"
clean_p7.round(0).astype(int)

month,total_steps_per_month,health_standard
1,231945,217000
2,226701,196000
3,251729,217000
4,226522,210000
5,232167,217000
6,221728,210000
7,230560,217000
8,261579,217000


Calculate the total number of days with at least 7000 steps

In [41]:
cleaned_count_p7 = cleaned_date_p7.groupby('month')['total_steps_per_day'].apply(lambda x: x[x >= 7000].count())
cleaned_count_p7 = pd.DataFrame(data = cleaned_count_p7)
cleaned_count_p7.columns.values[0] = "total_days_over_7000"
cleaned_count_p7

month,total_days_over_7000
1,18
2,20
3,19
4,20
5,19
6,15
7,18
8,24


Calculate the proportion

In [42]:
proportion = cleaned_count_p7['total_days_over_7000'] / clean_month_date['total_steps_per_day']
proportion = pd.DataFrame(data = proportion)
proportion.columns = proportion.columns.astype(str)
proportion.columns.values[0] = "health_proportion(%)"
proportion

month,health_proportion(%)
1,0.580645
2,0.714286
3,0.612903
4,0.666667
5,0.612903
6,0.5
7,0.580645
8,0.774194


In [43]:
proportion = proportion * 100
proportion.round(2)

month,health_proportion(%)
1,58.06
2,71.43
3,61.29
4,66.67
5,61.29
6,50.0
7,58.06
8,77.42


Average proportion:

In [44]:
p7_mean = proportion['health_proportion(%)'].mean()
p7_mean

63.02803379416283

Proportion area chart of participant 07

In [45]:
proportion.figure(y="health_proportion(%)",
               fill=True,
               xTitle="Month", yTitle="Health Proportion（%）", title="Figure 7: Healthy proportion in each month of Participant07"
               )

Heatmap of participant 07

In [46]:
proportion.figure(kind="heatmap",
                   colorscale="Greens",xTitle="Month",title="Figure 8: Healthy proportion intensity of Participant07",
                   dimensions=(700,500))

Build a spread chart of total steps month to month trend

In [47]:
clean_p7.figure(kind="spread", keys=["total_steps_per_month", "health_standard"],
               title="Figure 9: Participant 7 Monthly Steps and Standard Spread Chart")




According to Figure 7 and Figure 8, the proportion of days reaching the health standard is lowest in June at about 50%, where the color in the heat map is close to white. The proportions in February and August are higher than in the other months at approximately 71% and 77%, and the color is correspondingly darker. Also, the monthly total steps closely follow the health standard trend and exceed the standard in every month. The average proportion is also around 63%.

Therefore, the participant exceeded the health standard every month, and the trend of total steps was similar to that of the health standard. Although the proportion of healthy days per month fluctuates a little, the trend is relatively stable compared with the other participants. We suggest that the participant increase daily steps in most months and try to reach the healthy standard every day.

## Conclusion

The average proportion across the three participants

In [48]:
average = np.array([p3_mean,p1_mean,p7_mean])
average.mean()

57.044930875576036

In conclusion, Participant03 was relatively inactive and did not reach the minimum monthly health standard in most months. In contrast, Participant01 and Participant07 had similar average proportions and met the minimum monthly health standard every month. However, the average proportion for both participants is approximately 63%, and the highest monthly proportion is only about 77%. Therefore, no participant reached the minimum health standard on every day of any month, and the average proportion of healthy days per month across the three participants is about 57%.
