# Women's Health Study Accelerometry Data

This notebook goes over the code needed to reproduce the results from Table 3, using WiSER with the Women's Health Study (WHS) accelerometry data. We also compare its use to fitting a linear mixed effects model via MixedModels.jl.

#### Packages and Reproducibility

Julia makes reproducibility easy: because the repository includes a `Manifest.toml` and `Project.toml` pair, the user can simply run `] activate .` to load the correct environment with the dependencies used.

In [1]:
]activate .

[32m[1m Activating[22m[39m environment at `~/WiSER_Reproduce/womens_health_study_accelerometry_analysis/Project.toml`


Note: We use the KNITRO solver in our analysis, which requires a KNITRO license. If you wish to run the analysis without it, you can use another solver, but the results will be slightly different. Commented-out code for this alternative is provided in the analysis section.

## Availability & Description

Due to confidentiality concerns, access to the WHS Accelerometry dataset is only available through the National Institutes of Health (NIH) database of Genotypes and Phenotypes (dbGaP). Researchers can apply for access to download this dataset through dbGaP.

The URL for the webpage is https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001964.v1.p1 and the dbGaP Study Accession identifier is phs001964.v1.p1. This page includes a description of the dataset, study, and details on how to request access to the data. We cannot give more details on the data due to dbGaP's data use agreement. 

Due to data confidentiality concerns, we suppress output of the dataframes that show subject-level data.

This notebook presents code that, when used with dbGaP's WHS Accelerometry data, reproduces the results in Table 3 of the paper.

In [2]:
versioninfo()

Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-9920X CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 8


## Data Cleaning

The following goes through the steps to clean the data once it is downloaded from dbGaP.

In [3]:
using DataFrames, CSV, StatsBase, Statistics, CodecZlib, Dates
ENV["COLUMNS"]=1000

1000

Import the data and add some variables. 

The following assumes you have downloaded the data from dbGaP and are in the downloaded data folder:

In [4]:
pathofdata = "WHS_Accelerometer_phs001964/PhenoGenotypeFiles/RootStudyConsentSet_phs001964.WHS_Accelerometer.v1.p1.c1.GRU/PhenotypeFiles/"

accelerometer_subject = open(pathofdata * "phs001964.v1.pht009959.v1.p1.WHS_Accelerometer_Subject.MULTI.txt.gz") do io
    DataFrame!(CSV.File(GzipDecompressorStream(io), delim="\t", comment = "#", ignoreemptylines=true))
end

accelerometer_mins = open(pathofdata * "phs001964.v1.pht009963.v1.p1.c1.WHS_Accelerometer_60sec.GRU.txt.gz") do io
#     CSV.read(GzipDecompressorStream(io), delim="\t", comment = "#", ignoreemptylines=true)
    DataFrame!(CSV.File(GzipDecompressorStream(io), delim="\t", comment = "#", ignoreemptylines=true))
end

sum_data = open(pathofdata * "phs001964.v1.pht009960.v1.p1.c1.WHS_Accelerometer_d20180514_Pub.GRU.txt.gz") do io
#      CSV.read(GzipDecompressorStream(io), delim="\t", comment = "#", ignoreemptylines=true)
    DataFrame!(CSV.File(GzipDecompressorStream(io), delim="\t", comment = "#", ignoreemptylines=true))
end


smoking_pub = open(pathofdata * "phs001964.v1.pht009961.v1.p1.c1.WHS_Accelerometer_Smoking_Pub.GRU.txt.gz") do io
     DataFrame!(CSV.File(GzipDecompressorStream(io), delim="\t", comment = "#", ignoreemptylines=true))
end

# get hour of day 
accelerometer_mins[!, :hour] = Dates.hour.(accelerometer_mins[!, :timeHMS]) ;
# get minute of the day
accelerometer_mins[!, :mins] = Dates.minute.(accelerometer_mins[!, :timeHMS]);
# group every 5 minutes for each hour
accelerometer_mins[!, :mingroup] = floor.(accelerometer_mins[!, :mins] / 5) #every 5 minutes
# create smoking variable
smoking_pub[!, :smoker] = map(x -> ismissing(x) ? missing : x == 1 ? "never" : x == 2 ? "past" : "current",
    smoking_pub[!, :smoke])
# create race variable from numeric definitions 
sum_data[!, :RACE] = map(x -> ismissing(x) ? missing : x == 1 ? "white" : x == 2 ? "hispanic" : 
    x == 3 ? "african american" : x == 4 ? "asian" : x == 5 ? "native american" : "other",
    sum_data[!, :RACE]);


# subset to just these variables

keepvars = [:dbGaP_Subject_ID
 :Subject_ID
 :wday
 :ep60_maxsteps
 :m_total
 :sum_valid
:EP60_total_steps
 :compliant
:RACE
:stairs
 :genhealth
 :bmi
 :ageaccel
 :day_worn
 :season]

# get data we need from the daily summary data
sum_data = sum_data[!, keepvars];

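
To illustrate the time-derived variables, here is a minimal sketch on made-up timestamps (not WHS data). It assumes, as the cell above does, that the time column holds `Dates.Time` values; the names `times`, `hours`, `mins`, and `mingroup` are illustrative only.

```julia
using Dates

# Hypothetical minute-level timestamps.
times = [Time(9, 0), Time(9, 4), Time(9, 5), Time(9, 59), Time(23, 30)]

hours = Dates.hour.(times)      # hour of day, 0-23
mins = Dates.minute.(times)     # minute within the hour, 0-59

# Minutes 0-4 fall in group 0, minutes 5-9 in group 1, ..., minutes 55-59 in group 11.
mingroup = floor.(mins / 5)

println(hours)
println(mingroup)
```

Because `mins / 5` is floating-point division, `mingroup` is stored as `Float64` rather than as an integer.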


In [5]:
# summarize data by hour, to get steps and vector magnitude 
summarized_hour = combine(DataFrames.groupby(accelerometer_mins, 
        [:Subject_ID, :wday, :hour]), :steps => sum, :count_vm => sum);
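
The next cell refers to a column named `count_vm_sum`. That name comes from the default naming of `combine`, which appends the function name to the source column. A minimal sketch on a made-up table (`toy` and `toy_hourly` are hypothetical names, and the values are not WHS data):

```julia
using DataFrames

# Toy minute-level records for two subjects.
toy = DataFrame(Subject_ID = [1, 1, 1, 2],
                hour       = [8, 8, 9, 8],
                steps      = [10, 20, 5, 7],
                count_vm   = [100, 150, 40, 60])

# Sum steps and vector magnitude within each subject-hour.
toy_hourly = combine(groupby(toy, [:Subject_ID, :hour]),
                     :steps => sum, :count_vm => sum)

# The aggregated columns are named steps_sum and count_vm_sum.
println(names(toy_hourly))
```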

In [6]:
# Find the hour with the p-th largest total
function findtop_p(x, p)
    sortedx = sort(x, rev=true)
    # positions start at 1, so subtract 1 to get the hour of day (0-23);
    # this relies on each group holding all 24 hours in order
    return findfirst(x .== sortedx[p]) - 1
end

# get max hour for each person, each day 
maxhours = combine(DataFrames.groupby(summarized_hour, 
        [:Subject_ID, :wday]),
        :count_vm_sum => (x -> findtop_p(x, 1)) => :maxvm_hour1,
        :count_vm_sum => (x -> findtop_p(x, 2)) => :maxvm_hour2);
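
A small sketch of how `findtop_p` behaves, using hypothetical hourly totals for a single subject-day. The function is repeated so the block runs on its own. It assumes the vector has one entry per hour, ordered from hour 0 to hour 23.

```julia
# Same logic as findtop_p above, repeated so this block is self-contained.
function findtop_p(x, p)
    sortedx = sort(x, rev = true)
    return findfirst(x .== sortedx[p]) - 1
end

# Hypothetical hourly totals: position 1 is hour 0, position 24 is hour 23.
counts = zeros(Int, 24)
counts[9]  = 500   # hour 8
counts[13] = 300   # hour 12
counts[18] = 800   # hour 17

top1 = findtop_p(counts, 1)   # most active hour
top2 = findtop_p(counts, 2)   # second most active hour

# With a tie for the largest value, both calls return the first matching position.
tied = [5, 5, 1]
tie1 = findtop_p(tied, 1)
tie2 = findtop_p(tied, 2)

println((top1, top2, tie1, tie2))
```

In this toy case the top two hours are 17 and 8. The tie case shows that when two hours share exactly the same total, `findfirst` returns the same position for both ranks, so the two "top" hours coincide.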

In [7]:
# get the minute data for the top hour found above for each person, each day
tophour_data = DataFrames.innerjoin(accelerometer_mins, maxhours[!, 1:3];
    on = [:Subject_ID, :wday, :hour] .=> [:Subject_ID, :wday, :maxvm_hour1],
    makeunique = false,
    validate = (false, false));

In [8]:
# get the minute data for the second highest hour found above for each person, each day
top2hour_data = DataFrames.innerjoin(accelerometer_mins, maxhours[!, [1;2;4]];
    on = [:Subject_ID, :wday, :hour] .=> [:Subject_ID, :wday, :maxvm_hour2],
    makeunique = false,
    validate = (false, false));

In [9]:
# combine top 1 hour and top 2 hour data together
top2hour_data = vcat(tophour_data, top2hour_data);

In [10]:
# sum steps over each 5 minutes to use as outcome variable
summarized_5min = combine(DataFrames.groupby(top2hour_data, 
        [:Subject_ID, :wday, :hour, :mingroup]),
    :day_worn => first => :day_worn, :season => first => :season, :steps => sum => :steps);

In [11]:
# add the daily summary data
top2hour_sumdatacomb = DataFrames.leftjoin(summarized_5min, sum_data; on = [:Subject_ID, :wday, :day_worn, :season], makeunique = false,
         indicator = nothing, validate = (false, false));

In [12]:
# add the smoking data
top2hourdata = DataFrames.leftjoin(top2hour_sumdatacomb, smoking_pub; on = [:Subject_ID, :dbGaP_Subject_ID], makeunique = false,
         indicator = nothing, validate = (false, false));

In [None]:
# Don't run twice 

top2hourdata[!, :steps] = Float64.(top2hourdata[!, :steps]);
top2hourdata[!, :smoker] = map(x -> ismissing(x) ? missing : x == 1 ? "Never" : x == 2 ? "Past" : "Current",
    top2hourdata[!, :smoke])
top2hourdata[!, :smoker] = levels!(CategoricalArray(top2hourdata[!, :smoker]),
    ["Never"; "Past"; "Current"]);

top2hourdata[!, :RACE] = map(x -> ismissing(x) ? missing : titlecase(String(x)),
    top2hourdata[!, :RACE])
top2hourdata[!, :RACE] = levels!(CategoricalArray(top2hourdata[!, :RACE]),
    ["White"; "African American"; "Asian"; "Hispanic"; "Native American"; "Other"])

top2hourdata[!, :wday] = levels!(CategoricalArray(top2hourdata[!, :wday]),
    ["Sun"; "Mon"; "Tues"; "Wed"; "Thurs"; "Fri"; "Sat"])

top2hourdata[!, :season] = map(x -> ismissing(x) ? missing : x == 1 ? "Winter" :
    x == 2 ? "Spring" : x== 3 ?  "Summer" : "Autumn", top2hourdata[!, :season]);

top2hourdata[!, :Weekend] = map(x -> ismissing(x) ? missing :
    x in ["Mon"; "Tues"; "Wed"; "Thurs"; "Fri"] ? "Weekday" :
    "Weekend", top2hourdata[!, :wday]);

# rename the columns to their final, more presentable names
renamenames = ["Subject_ID"
 "Wday"
 "Hour"
 "mingroup"
 "Day_worn"
 "Season"
 "Steps"
 "dbGaP_Subject_ID"
 "ep60_maxsteps"
 "m_total"
 "sum_valid"
 "EP60_total_steps"
 "Compliant"
 "Race"
 "Stairs"
 "Genhealth"
 "BMI"
 "Age"
 "Smoke"
 "Smoker"
 "Weekend"] 
rename!(top2hourdata, renamenames);
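
The recoding steps above all follow the same pattern: a chained ternary inside `map` that passes `missing` through unchanged. A minimal sketch on made-up season codes (`season_label` and `codes` are hypothetical names introduced for illustration):

```julia
# Map numeric season codes to labels, keeping missing values as missing.
season_label(x) = ismissing(x) ? missing :
    x == 1 ? "Winter" :
    x == 2 ? "Spring" :
    x == 3 ? "Summer" : "Autumn"

codes = [1, 3, missing, 4, 2]
labels = map(season_label, codes)

println(labels)
```

Note that the final branch is a catch-all: any non-missing code other than 1, 2, or 3 is labeled "Autumn", so this pattern relies on the codes being clean.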

In [14]:
# Drop 0s and log10 transform steps 
keepinds = findall(top2hourdata[!, :Steps] .> 0.0)
top2hour_restricted = top2hourdata[keepinds, :]
top2hour_restricted[!, :Transformed_steps] = log10.(top2hour_restricted[!, :Steps]);
# optionally save this dataset 
CSV.write("WHS_final_cleaned.csv", top2hour_restricted);
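
The zero counts are dropped before transforming because `log10(0.0)` is `-Inf`. A minimal sketch on made-up step counts (the vector `steps` here is illustrative, not WHS data):

```julia
# Hypothetical 5-minute step totals, including intervals with no steps.
steps = [0.0, 1.0, 10.0, 100.0, 0.0]

# Keep only the intervals with at least one step, then log10-transform.
keepinds = findall(steps .> 0.0)
transformed = log10.(steps[keepinds])

println(keepinds)
println(transformed)
println(log10(0.0))   # -Inf, which is why zeros are removed first
```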

## Analysis

As stated in the paper, we use the KNITRO solver. If you do not have access to KNITRO, you can remove the solver argument and the `KNITRO` import and the code will still run, giving slightly different but very similar results.

The following produces the results found in Table 3 of the paper.

In [17]:
using DataFrames, CSV, WiSER, MixedModels, KNITRO
ENV["COLUMNS"]=1000 #extends the number of columns printed when displaying a dataframe. 

#load in data
WHSdata = DataFrame!(CSV.File("WHS_final_cleaned.csv"));

# set reference levels
WHSdata[!, :Smoker] = levels!(CategoricalArray(WHSdata[!, :Smoker]),
    ["Never"; "Past"; "Current"]);

WHSdata[!, :Race] = levels!(CategoricalArray(WHSdata[!, :Race]),
    ["White"; "African American"; "Asian"; "Hispanic"; "Native American"; "Other"])

WHSdata[!, :Wday] = levels!(CategoricalArray(WHSdata[!, :Wday]),
    ["Sun"; "Mon"; "Tues"; "Wed"; "Thurs"; "Fri"; "Sat"]);

┌ Info: Precompiling WiSER [2ff19380-1883-49fc-9d10-450face6b90c]
└ @ Base loading.jl:1260
┌ Info: Precompiling MixedModels [ff71e718-51f3-5ec2-a782-8ffcbfa3c316]
└ @ Base loading.jl:1260
┌ Info: Precompiling KNITRO [67920dd8-b58e-52a8-8622-53c4cffbe346]
└ @ Base loading.jl:1260


In [18]:
# Write a function to compare mixed models with WiSER
function comparemixedmodel(mixedmodel, wsvarmodel)
    coefnames = MixedModels.coefnames(mixedmodel)
    mixedbeta = mixedmodel.β
    mixedbetapval = MixedModels.coeftable(mixedmodel).cols[4]
    wsvarbeta = wsvarmodel.β
    wsvarbetapval = WiSER.coeftable(wsvarmodel).cols[4][1:wsvarmodel.p] 
    return DataFrame(coefnames = coefnames, mixedbeta = mixedbeta,
        mixedbetapval = mixedbetapval, wsvarbeta = wsvarbeta,
        wsvarbetapval = wsvarbetapval)
end

comparemixedmodel (generic function with 1 method)

The following constructs and fits the model. The `Optimization unsuccessful` warnings can be ignored: KNITRO uses a very stringent convergence criterion by default, and a `FeasibleApproximate` status indicates that the solution is adequate. Other nonlinear optimization solvers, such as IPOPT, return an `Optimal` status.
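
For orientation, the three formulas passed to `WSVarLmmModel` correspond, per the WiSER package documentation, to the mean fixed effects, the mean random effects, and the within-subject variance fixed effects. The sketch below summarizes the model in the notation of the WiSER paper; it is an outside summary and is not derived in this notebook.

$$
y_{ij} = x_{ij}^T \beta + z_{ij}^T \gamma_i + \epsilon_{ij}, \qquad
\sigma^2_{\epsilon_{ij}} = \exp\left(w_{ij}^T \tau + \ell_{\gamma\omega}^T \gamma_i + \omega_i\right)
$$

Here $x_{ij}$ holds the covariates of the first formula, $z_{ij}$ those of the second (an intercept and `Day_worn`), and $w_{ij}$ those of the third. The table printed by the fit reports the $\beta$ estimates first.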

In [19]:
wisermodel_transformed = WSVarLmmModel(
    @formula(Transformed_steps ~ 1 + BMI + Wday + Hour + 
                Race + Stairs + Age + Smoker + Season + m_total),
    @formula(Transformed_steps ~ 1 + Day_worn), 
    @formula(Transformed_steps ~ 1 + BMI + Wday + Hour + 
                Race + Age + Smoker), 
                :Subject_ID, WHSdata);
@time WiSER.fit!(wisermodel_transformed, KNITRO.KnitroSolver(outlev=0, ftol = 2), parallel = false, runs = 4)

### IF NO KNITRO LICENSE, comment out last line above and run:
# solver = Ipopt.IpoptSolver(print_level=0, watchdog_shortened_iter_trigger=3, max_iter=100)
# @time WiSER.fit!(wisermodel_transformed, solver, parallel = false, runs = 4)

run = 1, ‖Δβ‖ = 0.031092, ‖Δτ‖ = 0.103484, ‖ΔL‖ = 0.003194, status = Optimal, time(s) = 7.289397
run = 2, ‖Δβ‖ = 0.003093, ‖Δτ‖ = 0.025854, ‖ΔL‖ = 0.000621, status = FeasibleApproximate, time(s) = 7.664289


└ @ WiSER /home/cgerman/.julia/packages/WiSER/tXr2S/src/fit.jl:63


run = 3, ‖Δβ‖ = 0.000335, ‖Δτ‖ = 0.002050, ‖ΔL‖ = 0.000057, status = FeasibleApproximate, time(s) = 7.005404


└ @ WiSER /home/cgerman/.julia/packages/WiSER/tXr2S/src/fit.jl:63


run = 4, ‖Δβ‖ = 0.000030, ‖Δτ‖ = 0.000203, ‖ΔL‖ = 0.000005, status = FeasibleApproximate, time(s) = 14.505302


└ @ WiSER /home/cgerman/.julia/packages/WiSER/tXr2S/src/fit.jl:63


 41.520894 seconds (8.05 M allocations: 388.029 MiB)



Within-subject variance estimation by robust regression (WiSER)
Number of individuals/clusters: 15390
Total observations: 2314611

Fixed-effects parameters:
────────────────────────────────────────────────────────────────────────
                                 Estimate   Std. Error       Z  Pr(>|Z|)
────────────────────────────────────────────────────────────────────────
β1: (Intercept)               2.86951      0.0281224    102.04    <1e-99
β2: BMI                      -0.0136865    0.000337856  -40.51    <1e-99
β3: Wday: Mon                 0.063499     0.00263896    24.06    <1e-99
β4: Wday: Tues                0.0516142    0.00272096    18.97    <1e-79
β5: Wday: Wed                 0.0474884    0.00276237    17.19    <1e-65
β6: Wday: Thurs               0.0421016    0.00274567    15.33    <1e-52
β7: Wday: Fri                 0.053267     0.00272439    19.55    <1e-84
β8: Wday: Sat                 0.0516685    0.00259806    19.89    <1e-87
β9: Hour                     -0.0058610

In [20]:
@time mixedmodel_transformed = fit(LinearMixedModel, 
        @formula(Transformed_steps ~ 1 + BMI + Wday + Hour + 
            Race + Stairs + Age + Smoker + Season + m_total + (1 + Day_worn|Subject_ID)),
        WHSdata)

 24.194260 seconds (53.44 M allocations: 9.293 GiB, 22.26% gc time)


Linear mixed model fit by maximum likelihood
 Transformed_steps ~ 1 + BMI + Wday + Hour + Race + Stairs + Age + Smoker + Season + m_total + (1 + Day_worn | Subject_ID)
     logLik       -2 logLik         AIC            BIC      
 -1.9976733×10⁶ 3.99534661×10⁶ 3.99539861×10⁶ 3.99572763×10⁶

Variance components:
              Column     Variance    Std.Dev.    Corr.
Subject_ID (Intercept)  0.0614063036 0.24780295
           Day_worn     0.0013593018 0.03686871 -0.59
Residual                0.3198837226 0.56558264
 Number of obs: 2314611; levels of grouping factors: 15390

  Fixed-effects parameters:
──────────────────────────────────────────────────────────────────────
                            Estimate    Std.Error     z value  P(>|z|)
──────────────────────────────────────────────────────────────────────
(Intercept)              2.8692       0.0262518    109.295      <1e-99
BMI                     -0.0136871    0.0003436    -39.8344     <1e-99
Wday: Mon                0.0632695    0.

In [21]:
dfcompare_transformed = comparemixedmodel(mixedmodel_transformed, wisermodel_transformed) 

| Row | coefnames | mixedbeta | mixedbetapval | wsvarbeta | wsvarbetapval |
|---|---|---|---|---|---|
| | String | Float64 | Float64 | Float64 | Float64 |
| 1 | (Intercept) | 2.8692 | 0.0 | 2.86951 | 0.0 |
| 2 | BMI | -0.0136871 | 0.0 | -0.0136865 | 0.0 |
| 3 | Wday: Mon | 0.0632695 | 0.0 | 0.063499 | 6.23574e-128 |
| 4 | Wday: Tues | 0.0512991 | 4.48154e-248 | 0.0516142 | 3.06983e-80 |
| 5 | Wday: Wed | 0.047101 | 2.74106e-204 | 0.0474884 | 3.08931e-66 |
| 6 | Wday: Thurs | 0.04182 | 1.54973e-161 | 0.0421016 | 4.54255e-53 |
| 7 | Wday: Fri | 0.0529619 | 4.58897e-267 | 0.053267 | 3.97417e-85 |
| 8 | Wday: Sat | 0.0518392 | 1.0021e-274 | 0.0516685 | 5.2364e-88 |
| 9 | Hour | -0.00583023 | 0.0 | -0.00586103 | 8.42222e-154 |
| 10 | Race: African American | -0.0541292 | 8.86671e-05 | -0.0541461 | 0.000153524 |


#### Supplementary Table S.3

The following computes the summary statistics reported in Supplementary Table S.3.

In [22]:
using DataFrames, CSV, StatsBase, Statistics

keepvars = [:BMI; :Steps; :Wday; :Hour; :Race; :Age; :Smoker; :Season; :Day_worn; :m_total; :Stairs]
WHSdata = DataFrame!(CSV.File("WHS_final_cleaned.csv"));
descrstats = dropmissing(WHSdata, keepvars)
describe(descrstats[!, [:Steps; :m_total; :BMI; :Genhealth; :Age; :Day_worn; :Season;
                :Stairs]], :mean, :std, :min, :q25, :median, :q75, :max)

| Row | variable | mean | std | min | q25 | median | q75 | max |
|---|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Union… | Any | Union… | Union… | Union… | Any |
| 1 | Steps | 93.6051 | 134.945 | 1.0 | 19.0 | 48.0 | 96.0 | 1173.0 |
| 2 | m_total | 884.457 | 123.118 | 1 | 829.0 | 896.0 | 954.0 | 1440 |
| 3 | BMI | 26.0846 | 4.92855 | 13.7318 | 22.5964 | 25.2939 | 28.7962 | 58.4195 |
| 4 | Genhealth | 2.02188 | 0.755191 | 1 | 1.0 | 2.0 | 2.0 | 4 |
| 5 | Age | 71.4021 | 5.61028 | 62 | 67.0 | 70.0 | 75.0 | 89 |
| 6 | Day_worn | 3.96429 | 1.98116 | 1 | 2.0 | 4.0 | 6.0 | 8 |
| 7 | Season | | | Autumn | | | | Winter |
| 8 | Stairs | 2.61839 | 1.48501 | 1 | 1.0 | 2.0 | 4.0 | 6 |


In [23]:
# Race
countmap(combine(DataFrames.groupby(descrstats, :Subject_ID), :Race => first)[!, 2]),
proportionmap(combine(DataFrames.groupby(descrstats, :Subject_ID), :Race => first)[!, 2])

(Dict{CategoricalValue{String,UInt32},Int64}("African American" => 228,"Asian" => 177,"Other" => 17,"Native American" => 27,"White" => 14806,"Hispanic" => 135), Dict{CategoricalValue{String,UInt32},Float64}("African American" => 0.014814814814814815,"Asian" => 0.011500974658869395,"Other" => 0.0011046133853151398,"Native American" => 0.0017543859649122807,"White" => 0.962053281351527,"Hispanic" => 0.008771929824561403))

In [24]:
# Days worn
mean(combine(DataFrames.groupby(descrstats, :Subject_ID), :Day_worn => maximum)[!, 2]), 
std(combine(DataFrames.groupby(descrstats, :Subject_ID), :Day_worn => maximum)[!, 2])

(6.867316439246264, 0.5467940202914422)
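
The subject-level statistics above first reduce the data to one value per subject, because each subject contributes many rows. A minimal sketch on made-up records shows why this matters for days worn (`subject`, `day_worn`, and `maxday` are illustrative names, and the values are not WHS data):

```julia
using Statistics

# Hypothetical long-format records: one row per observation, several per subject.
subject  = ["a", "a", "a", "b", "b", "c"]
day_worn = [1, 2, 7, 1, 6, 7]

# Reduce to one value per subject: the largest day index observed.
maxday = Dict{String,Int}()
for (s, d) in zip(subject, day_worn)
    maxday[s] = max(get(maxday, s, d), d)
end
per_subject = collect(values(maxday))

println(mean(per_subject))   # average number of days worn per subject
println(mean(day_worn))      # row-level average of the day index, a different quantity
```

In this toy example the per-subject average is 20/3, while the row-level average is 4.0. The same distinction explains why `Day_worn` has a mean near 4 in the `describe` output but a per-subject maximum near 7.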

In [25]:
# Day of week
countmap(descrstats[!, :Wday]), proportionmap(descrstats[!, :Wday])

(Dict{CategoricalValue{String,UInt32},Int64}("Thurs" => 332057,"Wed" => 334061,"Tues" => 329882,"Sun" => 324040,"Mon" => 331770,"Fri" => 333073,"Sat" => 329728), Dict{CategoricalValue{String,UInt32},Float64}("Thurs" => 0.1434612554766222,"Wed" => 0.14432705970895326,"Tues" => 0.14252157273943655,"Sun" => 0.13999760650925794,"Mon" => 0.14333726055911772,"Fri" => 0.14390020612534893,"Sat" => 0.14245503888126343))

In [26]:
# Season
countmap(descrstats[!, :Season]), proportionmap(descrstats[!, :Season])

(Dict("Summer" => 749037,"Autumn" => 560744,"Spring" => 545729,"Winter" => 459101), Dict("Summer" => 0.32361247743141286,"Autumn" => 0.24226273874962143,"Spring" => 0.23577568757773984,"Winter" => 0.19834909624122585))

In [27]:
# Smoking Status
countmap(combine(DataFrames.groupby(descrstats, :Subject_ID), :Smoker => first)[!, 2]),
    proportionmap(combine(DataFrames.groupby(descrstats, :Subject_ID), :Smoker => first)[!, 2])

(Dict{CategoricalValue{String,UInt32},Int64}("Past" => 7082,"Current" => 548,"Never" => 7760), Dict{CategoricalValue{String,UInt32},Float64}("Past" => 0.46016894087069526,"Current" => 0.035607537361923326,"Never" => 0.5042235217673814))