## Case study: Lead exposure effects on neurological and psychological function in children

#### This example is taken from Fundamentals of Biostatistics, 7th ed.

This study investigates the effect of lead exposure on children using several biometric measurements. It was carried out by Dr. Philip Landrigan at Mt. Sinai Medical Center. The dataset contains multiple measurements intended to relate medical outcomes to environmental exposure.

Blood lead levels were measured in two groups of children: group 1 (controls), with lead levels below 40 µg/100 mL, and group 2 (exposed), with lead levels of 40 µg/100 mL or higher.

We will evaluate two outcome variables:
A. The number of finger-wrist taps in the dominant hand in 10 seconds (MAXFWT), a measure of neurodevelopmental function.
B. The Wechsler full-scale IQ score (IQF), a measure of intellectual development.

Task: load and inspect the data, then run a t-test that compares each variable between the exposed and control groups. (Remember to evaluate equality of variances first.)

In [None]:
comp = read.csv("LEAD.DAT.csv")


In [None]:
##Subset by column index (column names would work as well);
##column 21 is Group and column 40 is maxfwt
lead_sub = comp[,c(21,40)]

##remove rows where maxfwt is coded as missing (99)
lead_tap = lead_sub[lead_sub$maxfwt!=99,]
nrow(lead_tap)

In [None]:
##subset IQ variable and group
lead_iq = comp[,c(18,21)]

In [None]:
##make the group for both datasets a factor variable
lead_tap$Group = as.factor(lead_tap$Group)
lead_iq$Group = as.factor(lead_iq$Group)

In [None]:
##Produce Boxplots
library(car)
Boxplot(lead_tap$maxfwt ~ lead_tap$Group)

In [None]:
Boxplot(lead_iq$iqf ~ lead_iq$Group)

In [None]:
##Evaluate equality of the variances between the two groups on 
##the tapping test outcome
var.test(lead_tap$maxfwt ~ lead_tap$Group)

In [None]:
##Evaluate equality of the variances between the two groups on 
##the IQ test outcome
var.test(lead_iq$iqf ~ lead_iq$Group)

In [None]:
##As the variance test was not significant, we can run the t-test
##assuming equal variances
t.test(lead_tap$maxfwt ~ lead_tap$Group, var.equal = TRUE)

In [None]:
t.test(lead_iq$iqf ~ lead_iq$Group, var.equal = TRUE)
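
The "check the variances, then choose the t-test" logic used above can be sketched as a small helper. This is only an illustration on simulated data, not the LEAD dataset; the function name `t_test_by_variance`, the toy data frame, and the 0.05 cutoff are assumptions made for the example.

```r
# Hypothetical helper: run the F test for equal variances, then a two-sample
# t-test with var.equal set according to the F-test result.
t_test_by_variance <- function(y, g, alpha = 0.05) {
  g <- as.factor(g)
  f <- var.test(y ~ g)              # H0: the two group variances are equal
  equal <- f$p.value >= alpha       # not significant -> pooled-variance t-test
  list(f_p = f$p.value,
       var_equal = equal,
       t = t.test(y ~ g, var.equal = equal))
}

# Simulated scores for two groups (made-up numbers, for illustration only)
set.seed(1)
toy <- data.frame(
  score = c(rnorm(40, mean = 54, sd = 12), rnorm(30, mean = 47, sd = 12)),
  grp   = rep(c(1, 2), times = c(40, 30))
)
res <- t_test_by_variance(toy$score, toy$grp)
res$var_equal
res$t$p.value
```

When the F test is significant, the helper falls back to the Welch (unequal-variance) t-test, which is what `t.test` does when `var.equal = FALSE`.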

In [None]:
##Are the data normally distributed? Are there bad outliers?
qqnorm(lead_tap$maxfwt)

In [None]:
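##Another way to look for outliers: stem-and-leaf plot for group 2
##(the group indicator must come from the same data frame as the scores)
stem(lead_tap$maxfwt[lead_tap$Group==2])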

In [None]:
##Get mean, sd, and counts by group using dplyr
library(dplyr)
lead_tap %>% group_by(Group) %>% summarise(Mean = mean(maxfwt, na.rm=TRUE))

In [None]:
lead_tap %>% group_by(Group) %>% summarise(SD = sd(maxfwt, na.rm=TRUE))

In [None]:
lead_tap %>% group_by(Group) %>% count()

### To calculate the ESD (Extreme Studentized Deviate) statistic:

$$ESD = \max_{i=1,\ldots,n} \dfrac{|x_i - \bar{x}|}{s}$$
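
The formula above translates directly into a few lines of R. The sketch below is a minimal illustration; the function name `esd_stat` and the toy vector are assumptions for the example and are not part of the LEAD data.

```r
# Hypothetical helper: single-outlier ESD statistic and the value that produces it
esd_stat <- function(x) {
  x <- x[!is.na(x)]
  dev <- abs(x - mean(x))        # absolute deviation of each value from the mean
  i <- which.max(dev)            # position of the most extreme value
  list(value = x[i], esd = dev[i] / sd(x))
}

# Toy data (made up): the value 13 sits far from the rest
toy <- c(50, 52, 47, 55, 49, 51, 13)
esd_stat(toy)
```

On the real data the same helper could be applied to the scores of one group at a time, for example the `maxfwt` values of the controls.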

#### Compute the ESD statistic for the finger–wrist tapping scores for the control group.

For group 1 (controls): $\bar{x} = 54.4$, $s = 12.05$ (about 12.1), $n = 64$.
### There are two candidate outliers to evaluate, 13 and 84. We calculate the absolute difference between each value and the mean to see which one is farther from it.

In [None]:
abs(13 - 54.4)
abs(84-54.4)

### The value 13 is farther from the mean, so we select it and calculate its ESD.

In [None]:
41.4/12.1

### We then use the ESD table to obtain the critical value for this sample size

$ESD_{(64,0.95)} = 3.22$

### As 3.42 $>$ 3.22, we reject the null hypothesis of no outliers: the ESD statistic for the value 13 in the finger-wrist tapping control group is larger than the critical value, so 13 is declared an outlier.
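
When a printed ESD table is not at hand, the critical value can be approximated from the t distribution using the formula commonly given for the generalized ESD procedure: with $p = 1 - \alpha/(2n)$ and $t = t_{n-2,\,p}$, the critical value is $(n-1)\,t / \sqrt{(n-2+t^2)\,n}$. The sketch below assumes this approximation; the function name `esd_crit` is made up for the example, and the table in the textbook remains the reference.

```r
# Hypothetical helper: approximate two-sided ESD critical value for a sample
# of size n, based on a t-distribution formula (an assumption, not the table)
esd_crit <- function(n, alpha = 0.05) {
  p <- 1 - alpha / (2 * n)
  tq <- qt(p, df = n - 2)
  (n - 1) * tq / sqrt((n - 2 + tq^2) * n)
}

# Compare with the tabulated values used in this notebook for n = 64 and n = 35
round(esd_crit(c(64, 35)), 2)
```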

### How large would we expect the largest standardized deviation to be if there were no outliers?
In a sample of size $n$ from a normal distribution, the largest value corresponds approximately to the

$100\% \times \dfrac{n}{n+1}$th percentile

In [None]:
##percentile expected for the largest value in the control group (n = 64)
100 * 64/65 #= 98.46

In [None]:
##The corresponding standardized value is about 2.17: with no outliers, the
##most extreme control value should lie roughly 2.17 SDs from the mean
qnorm(0.985, mean=0, sd=1)
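
The two steps above (percentile, then standard normal quantile) can be combined into one expression. The sketch below is a small convenience wrapper; the name `expected_max_z` is an assumption for the example. It uses the unrounded percentile $n/(n+1)$, so its result can differ slightly from the value obtained with the rounded 0.985.

```r
# Hypothetical helper: standardized value expected for the largest observation
# in a normal sample of size n, i.e. the 100 * n/(n+1) percentile of N(0, 1)
expected_max_z <- function(n) qnorm(n / (n + 1))

# Control group (n = 64) and exposed group (n = 35)
expected_max_z(c(64, 35))
```

An observed ESD far above this value is a hint that the extreme point may not belong to the same normal distribution as the rest of the sample.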

### If we evaluate the exposed group the same way, we get:
$\bar{x}$ group 2 = 47.4, s = 13.2, n = 35.
The minimum and maximum values are 13 and 83.

In [None]:
abs(13 - 47.4)
abs(83-47.4)

## We therefore select 83 as the most extreme value, calculate its ESD, and compare it with the critical value

In [None]:
abs(83-47.4)/13.2

$ESD_{(35,0.95)} = 2.98$

### As 2.70 $<$ 2.98, we do not reject the null hypothesis: the ESD statistic for the value 83 in the finger-wrist tapping exposed group is smaller than the critical value.

### However, this does not agree with the descriptive statistics (Q-Q plot, stem-and-leaf plot, boxplot, etc.). There may be a masking problem: several outliers together inflate the standard deviation, which shrinks the ESD statistic of each one. We should use the multi-step (many-outlier) approach.

## How many outliers should we evaluate? We can use an upper bound of min([n/10], 5), where [n/10] is the largest integer less than or equal to n/10.

#### From the fundamentals of biostatistics textbook "If there are more than five outliers in a data set, then we most likely have an underlying nonnormal distribution, unless the sample size is very large."

In [None]:
##35/10 = 3.5, so we evaluate at most 3 outliers
min(floor(35/10), 5)

In [None]:
exp = lead_tap %>% filter(Group == 2)
nrow(exp)
min(exp$maxfwt)
max(exp$maxfwt)
mean(exp$maxfwt)
sd(exp$maxfwt)

In [None]:
##distance of the minimum (13) and maximum (83) from the exposed-group mean
abs(13 - 47.4)
abs(83 - 47.4)

In [None]:
##remove the most extreme value (83) and recompute the summary statistics
exp2 = exp[!(exp$maxfwt==83),]
nrow(exp2)
min(exp2$maxfwt)
max(exp2$maxfwt)
mean(exp2$maxfwt)
sd(exp2$maxfwt)

In [None]:
abs(13 - 46.4)
abs(70-46.4)

In [None]:
##13 is now the most extreme value; its ESD with 83 removed (n = 34)
abs(13 - 46.4)/11.8

### We then remove 13 from the sample and evaluate the next set of outliers

In [None]:
exp3 = exp2[!(exp2$maxfwt==13),]
nrow(exp3)
min(exp3$maxfwt)
max(exp3$maxfwt)
mean(exp3$maxfwt)
sd(exp3$maxfwt)

In [None]:
abs(14 - 47.4)
abs(70 - 47.4)

### We select 14 as our last outlier.

In [None]:
abs(14 - 47.4)/10.35

$ESD_{(33,0.95)} = 2.98$

### As 3.22 $>$ 2.98, we reject the null hypothesis: the ESD statistic for the value 14 in the finger-wrist tapping exposed group is larger than the critical value. Because the test at the third step is significant, the procedure declares all three values (83, 13, and 14) to be outliers, even though 83 was not significant when tested on its own.
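
The step-by-step removal above can be automated. The sketch below is one possible implementation, not the textbook's code: the function name `esd_many` is made up, the critical values come from a t-distribution approximation, $(n-1)\,t/\sqrt{(n-2+t^2)\,n}$ with $t = t_{n-2,\,1-\alpha/(2n)}$, rather than from the printed table, and the toy vector is invented for illustration.

```r
# Hypothetical helper: multi-step ESD procedure.
# k defaults to min([n/10], 5); critical values are approximated (assumption).
esd_many <- function(x, k = min(floor(length(x) / 10), 5), alpha = 0.05) {
  x <- x[!is.na(x)]
  steps <- data.frame(matrix(NA_real_, nrow = k, ncol = 4,
                             dimnames = list(NULL, c("n", "value", "esd", "crit"))))
  for (i in seq_len(k)) {
    n <- length(x)
    dev <- abs(x - mean(x))
    j <- which.max(dev)                        # most extreme remaining value
    tq <- qt(1 - alpha / (2 * n), df = n - 2)
    crit <- (n - 1) * tq / sqrt((n - 2 + tq^2) * n)
    steps[i, ] <- c(n, x[j], dev[j] / sd(x), crit)
    x <- x[-j]                                 # remove it and repeat
  }
  # Work backwards: the last significant step fixes the number of outliers
  sig <- which(steps$esd > steps$crit)
  n_out <- if (length(sig) > 0) max(sig) else 0
  list(steps = steps, n_outliers = n_out,
       outliers = steps$value[seq_len(n_out)])
}

# Toy data (made up): 30 evenly spaced values plus three extreme ones
toy <- c(seq(40, 60, length.out = 30), 2, 100, 105)
res <- esd_many(toy)
res$steps
res$outliers
```

Taking the largest significant step, rather than stopping at the first non-significant one, is what protects the procedure against masking.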

## Next, we could remove the 3 outliers from the exposed group, evaluate the control group in the same way, and rerun the t-test without the outliers to compare the two groups again
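
A minimal sketch of that re-analysis is shown here on a made-up data frame, so that it runs without the LEAD file. The column names mirror `lead_tap`, but the values, the object names `toy` and `toy_clean`, and the choice to drop every group 2 row equal to a flagged value are assumptions for the example.

```r
# Toy data (invented): same column names as lead_tap, not the real measurements
toy <- data.frame(
  maxfwt = c(54, 60, 48, 52, 65, 58, 13, 50,    # group 1 (controls)
             47, 13, 14, 83, 45, 50, 52, 44),   # group 2 (exposed)
  Group  = factor(rep(c(1, 2), each = 8))
)

# Values declared outliers in the exposed group by the multi-step ESD procedure
flagged <- c(13, 14, 83)

# Drop the flagged values from group 2 only; group 1 is left untouched here
keep <- !(toy$Group == 2 & toy$maxfwt %in% flagged)
toy_clean <- toy[keep, ]

# Rerun the equal-variance t-test on the cleaned data
t.test(maxfwt ~ Group, data = toy_clean, var.equal = TRUE)
```

With the real data, `lead_tap` would take the place of `toy`, and any outliers found in the control group would be removed in the same way before the final t-test.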

#### For the controls, we calculate how many outliers to evaluate:

min([64/10], 5) = min(6, 5) = 5

#### And follow the same procedure as before.

## Note that in the following table the last outlier tested is not significant. In that case you step back to the previous value and evaluate its significance, and so on until you find a significant value. Here that is the first value, which is the same result we obtained using the single-outlier approach.

![title](Table_controls_outliers.png)