In [5]:
library(tidyverse, quietly = TRUE)
library(caret, quietly = TRUE)

In [14]:
dat0 <- read.csv("../datasets/Ultrasound/training_ultrasound.csv")

# Use the Hadlock equation to estimate fetal weight from the 4 key ultrasound measurements
dat.raw = dat0 %>%
    mutate(
        # log10 of the estimated fetal weight in grams
        LOG10.FWT.GM = 1.3596 + 0.0064*HCIRCM + 0.0424*ABCIRCM + 0.174*FEMURCM +
            0.00061*BPDCM*ABCIRCM - 0.00386*ABCIRCM*FEMURCM,
        # for AGEDAYS < 1 use the estimate converted to kg; otherwise keep the recorded WTKG
        WTKG.estimate = ifelse(AGEDAYS < 1, (10^LOG10.FWT.GM)/1000, WTKG),
        Study = paste('Study', STUDYID)
        )

head(dat.raw)
cat("Total number of unique entries per column:\n")
dat.raw %>% summarise(across(everything(), n_distinct))


cat("Distribution of measurements by subject\n")
dat.raw %>% count(SUBJID) %>% mutate(tot=n) %>% count(tot) %>% t

STUDYID,SUBJID,SEXN,SEX,GAGEBRTH,BIRTHWT,BIRTHLEN,BIRTHHC,DELIVERY,PARITY,⋯,FLAZ,BHC_Z,BLEN_Z,BWT_Z,BWT_40,BLEN_40,BHC_40,LOG10.FWT.GM,WTKG.estimate,Study
1,1002,2,Female,276,3540,50.3,,Category 2.0,1,⋯,2.142646,,0.8916001,0.8604704,3.614882,50.61003,,3.486183,3.06325,Study 1
1,1002,2,Female,276,3540,50.3,,Category 2.0,1,⋯,,,0.8916001,0.8604704,3.614882,50.61003,,,3.54,Study 1
1,1002,2,Female,276,3540,50.3,,Category 2.0,1,⋯,,,0.8916001,0.8604704,3.614882,50.61003,,,10.74,Study 1
1,1003,1,Male,280,3100,50.3,,Category 2.0,1,⋯,1.616571,,0.235298,-0.7255638,3.1,50.3,,3.23363,1.712499,Study 1
1,1003,1,Male,280,3100,50.3,,Category 2.0,1,⋯,1.495569,,0.235298,-0.7255638,3.1,50.3,,3.23363,1.712499,Study 1
1,1003,1,Male,280,3100,50.3,,Category 2.0,1,⋯,1.1069,,0.235298,-0.7255638,3.1,50.3,,3.419602,2.627857,Study 1


Total number of unique entries per column:


STUDYID,SUBJID,SEXN,SEX,GAGEBRTH,BIRTHWT,BIRTHLEN,BIRTHHC,DELIVERY,PARITY,⋯,FLAZ,BHC_Z,BLEN_Z,BWT_Z,BWT_40,BLEN_40,BHC_40,LOG10.FWT.GM,WTKG.estimate,Study
2,2525,2,2,61,229,92,40,7,8,⋯,1656,193,717,1973,1932,710,185,6777,7217,2


Distribution of measurements by subject

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
tot,1,3,4,5,6,7,8,9,10,11,12,13,14,15
nn,34,220,449,469,231,97,116,220,250,292,84,46,11,6
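
To make the Hadlock step easier to check by hand, the estimate can be pulled out into a small standalone function. This is a minimal sketch in base R. The function name `hadlock_fwt_kg` and the example measurements are illustrative assumptions and do not come from the dataset. The coefficients are copied from the `mutate()` call above.

```r
# Hypothetical helper (the name is an assumption): Hadlock estimate of fetal
# weight in kg from four ultrasound measurements, all given in centimetres.
hadlock_fwt_kg <- function(hcircm, abcircm, femurcm, bpdcm) {
  log10_fwt_gm <- 1.3596 + 0.0064 * hcircm + 0.0424 * abcircm +
    0.174 * femurcm + 0.00061 * bpdcm * abcircm -
    0.00386 * abcircm * femurcm
  (10^log10_fwt_gm) / 1000   # grams -> kilograms
}

# Made-up measurements, for illustration only
example_kg <- hadlock_fwt_kg(hcircm = 33, abcircm = 34, femurcm = 7.3, bpdcm = 9.2)
print(example_kg)
```

Because the arithmetic is vectorised, the same function should also accept whole columns, for example inside `mutate()`.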


# Data preparation and cleaning

Procedure:

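- Remove all subjects with only one measurement.
- Work with the four ultrasound measurements that appear to be related to fetal size:
    - ABCIRCM (abdominal circumference)
    - HCIRCM (head circumference)
    - BPDCM (biparietal diameter)
    - FEMURCM (femur length)
- Center and scale these measurements, impute missing values with k-nearest neighbours, and summarise them with PCA.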

In [34]:
# Clean the data: drop subjects that have only one measurement.
# Starts from dat.raw, the data frame built in the loading cell above.

dat = dat.raw %>% group_by(SUBJID) %>%
    mutate(tot.measurements = n()) %>%
    filter(tot.measurements > 1)

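# Fit the pre-processing model on the four size-related measurements
dat.preproc = dat %>% 
    ungroup %>% 
    select(ABCIRCM, HCIRCM, BPDCM, FEMURCM) %>% 
    mutate(across(everything(), as.numeric)) %>%
    preProcess(., method=c("center", "scale", "knnImpute", "pca"))

# Alternative that also applies a Yeo-Johnson transform:
#    preProcess(., method=c("center", "scale", "knnImpute", "YeoJohnson", "pca"))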

dat.preproc

Created from 7940 samples and 4 variables

Pre-processing:
  - centered (4)
  - ignored (0)
  - 5 nearest neighbor imputation (4)
  - principal component signal extraction (4)
  - scaled (4)

PCA needed 2 components to capture 95 percent of the variance
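
The idea behind the "95 percent of the variance" line can be sketched with base R's `prcomp`. This is only an illustration under stated assumptions: the data below are synthetic (four columns driven by one shared "size" factor), there are no missing values, so the k-nearest-neighbour imputation step is not reproduced, and `caret` may apply its own rules when choosing the final number of components, so the count here need not match the output above.

```r
set.seed(1)
n <- 200
size <- rnorm(n)   # shared latent "size" factor (synthetic assumption)

# Synthetic stand-ins for the four measurements; values are made up
toy <- data.frame(
  ABCIRCM = 30 + 3.0 * size + rnorm(n, sd = 0.50),
  HCIRCM  = 32 + 2.0 * size + rnorm(n, sd = 0.50),
  BPDCM   =  9 + 0.6 * size + rnorm(n, sd = 0.15),
  FEMURCM =  7 + 0.5 * size + rnorm(n, sd = 0.15)
)

# Center and scale each column, then run PCA
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Cumulative share of variance explained by the first k components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components whose cumulative share reaches 95%
n_comp <- which(cum_var >= 0.95)[1]

print(round(cum_var, 3))
print(n_comp)
```

When the four measurements are strongly correlated, as they are in this toy data, most of the variance ends up in the first component and only a few components are needed to reach the threshold.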

In [35]:
dat.preproc$rotation


Variable,PC1,PC2
ABCIRCM,-0.4988103,-0.7221488
HCIRCM,-0.501041,0.3786742
BPDCM,-0.5001775,0.5425691
FEMURCM,-0.4999687,-0.2018064
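
One way to read this rotation matrix is to apply it by hand. The sketch below, in base R, copies the loadings printed above into a matrix and projects a single hypothetical standardized observation onto the two components. The observation `z` (one standard deviation above the mean on all four measurements) is an assumption chosen for illustration, not a row from the data.

```r
# Loadings copied from dat.preproc$rotation as printed above
rot <- matrix(
  c(-0.4988103, -0.7221488,
    -0.5010410,  0.3786742,
    -0.5001775,  0.5425691,
    -0.4999687, -0.2018064),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("ABCIRCM", "HCIRCM", "BPDCM", "FEMURCM"), c("PC1", "PC2"))
)

# Each column should have unit length, and the two columns should be orthogonal
print(colSums(rot^2))
print(sum(rot[, "PC1"] * rot[, "PC2"]))

# Hypothetical observation: +1 SD on every (centered and scaled) measurement
z <- c(ABCIRCM = 1, HCIRCM = 1, BPDCM = 1, FEMURCM = 1)

# PC scores are the standardized values multiplied by the rotation matrix
scores <- z %*% rot
print(scores)
```

For this observation the PC1 score is close to -2 and the PC2 score is close to 0. That fits a reading where PC1 weights all four measurements about equally (overall size, with a negative sign), while PC2 contrasts the head measurements (HCIRCM, BPDCM) with ABCIRCM and FEMURCM. This interpretation is a suggestion based on the signs of the loadings, not a result stated in the notebook.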
