P-values for each individual row #120

Vonyx1000 · 2024-01-23T02:37:35Z

Hello Benjamin!

Thank you so much for this wonderful package and your incredible work. I have read your guide titled "Using the table1 Package to Create HTML Tables of Descriptive Statistics" and have run into a question in regards to the section titled "Example: a column of p-values".

I have used this to add a column of p-values in the past, but I run into an issue with it using the entire column for each p-value. Here is a screenshot from your guide:

It generates a single p-value ("<0.001") from one chi-square test of the entire "Race" column (i.e., "White", "Black", AND "Hispanic" together). If I want to look at each specific race and how it varies between the Control and Treatment groups, I would not be able to do that.

In that case, I would need a separate p-value for "White", a separate p-value for "Black", and a separate p-value for "Hispanic" (three separate chi-square analyses). I have tried different solutions to separate the chi-square analyses but none of them have worked, and I am quite unfamiliar with how to do this properly in table1. This is how the output would ideally look (chi-square analysis for each race category separately) if I could get it to work:

Is it possible for table1 to do this? Or is it a lost cause?

Thank you so much in advance.

Warmest regards,
Vaish

benjaminrich · 2024-01-24T13:44:00Z

It is possible. Can you explain exactly how you are computing the p-value, like how did you get 0.011 for the third p-value (Hispanic)?

Vonyx1000 · 2024-01-25T05:30:34Z

I'm sorry, I think I did the chi-square math wrong when I tried to do the Hispanic variables in the example table I sent. I am not a statistician by any means, so please take the exact math with a grain of salt.

Here is how you would do it just mathematically:

The formula involves the observed counts in each cell of the table, and it compares the observed counts to the counts that would be expected if there was no association between the variables.

For larger contingency tables, the formula becomes more complex, but statistical software like R should be able to do it.
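For reference, the description above matches the standard Pearson chi-square statistic. The notation here is my own choice, not something from the guide: O_ij is the observed count in row i and column j, E_ij is the count expected under no association, R_i and C_j are the row and column totals, and n is the grand total. For a 2x2 table with cells a, b (first row) and c, d (second row), the sum reduces to the shortcut on the second line; for larger tables the same sum simply runs over more cells.

```latex
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad E_{ij} = \frac{R_i \, C_j}{n}

\chi^2_{2 \times 2} = \frac{n\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}
```

The degrees of freedom are (rows - 1) x (columns - 1), which gives 1 for a 2x2 table.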

For the "Hispanic" variable, the observed counts in the contingency table are as follows:

So, the correct chi-square test statistic is approximately 8.547. The degrees of freedom is 1 (for a 2x2 table). Now, when comparing this statistic to the chi-square distribution with 1 degree of freedom, the correct p-value is approximately 0.0035. (sorry the 0.011 was incorrect)
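The arithmetic can be checked step by step in base R. This is only a sketch: the four counts are the ones passed to chisq.test() in this thread, the row and column labels are my reading of them (rows are Control and Treatment, columns are not Hispanic and Hispanic), and the variable names are illustrative.

```r
# Observed counts (labels are an assumption about how the table is laid out)
observed <- matrix(c(368, 61, 174, 11), nrow = 2, byrow = TRUE,
                   dimnames = list(group = c("Control", "Treatment"),
                                   hispanic = c("No", "Yes")))

n <- sum(observed)

# Expected count in each cell under no association: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-square statistic, without any continuity correction
statistic <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(statistic, df = df, lower.tail = FALSE)

round(statistic, 3)
signif(p_value, 2)
```

The two printed values should agree with the 8.547 and 0.0035 quoted above.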

In R, you could make the contingency table and then feed it into the chisq.test() function like this:

counts <- matrix(c(368, 61, 174, 11), nrow = 2, byrow = TRUE)
t <- chisq.test(counts, correct = FALSE)
print(t)

Setting "correct = FALSE" turns off Yates' continuity correction, which chisq.test() applies by default to 2x2 tables. You would typically leave it on (TRUE), but for this example I turned it off so that it lines up with the math I did above. I tried repeating it with the code from your guide, and there are definitely much better (and probably more statistically accurate) ways to do this, but this is what I ended up with:

library(MatchIt)  # provides the lalonde data
library(table1)   # provides label() and table1()
data(lalonde)
lalonde$treat <- factor(lalonde$treat, levels=c(0, 1), labels=c("Control", "Treatment"))
lalonde$race <- factor(lalonde$race, levels=c("white", "black", "hispan"), labels=c("White", "Black", "Hispanic"))
label(lalonde$race) <- "Race"

table1(~ race | treat, data = lalonde)

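# 2x2 table: treatment group by "is Hispanic" (FALSE/TRUE).
# Same test as comparing against White and Black combined, but easier to read.
hispanic <- table(lalonde$treat, lalonde$race == "Hispanic")
print(hispanic)
p <- chisq.test(hispanic, correct = FALSE)
print(p)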

You should also be able to do this with more than 2 variables but I am unsure of how the code would look for that. chisq.test() should support doing that.
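For what it is worth, here is a minimal sketch of chisq.test() on a table with more than two categories. To keep it runnable without MatchIt, the counts are typed in by hand as I understand them from the lalonde data (the Hispanic column matches the numbers used above); please confirm them with table(lalonde$treat, lalonde$race). The name race_by_treat is just illustrative.

```r
# Treatment group by race, 2 rows x 3 columns (hand-entered counts, please verify)
race_by_treat <- matrix(c(281,  87, 61,
                           18, 156, 11),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(treat = c("Control", "Treatment"),
                                        race  = c("White", "Black", "Hispanic")))

# One test across all three categories at once.
# The continuity correction only applies to 2x2 tables, so it plays no role here.
overall <- chisq.test(race_by_treat)

overall$parameter  # degrees of freedom: (2 - 1) * (3 - 1) = 2
overall$p.value
```

This is the single whole-column test from the guide, as opposed to one test per category.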

benjaminrich · 2024-01-25T18:12:16Z

Thanks for the detailed explanation. I just wanted to be sure because I wasn't getting 0.011, which turned out to be incorrect. So, here's how you can do it (basically, I just modified one line from the example in the vignette):

pvalue <- function(x, ...) {
    # Construct vectors of data y, and groups (strata) g
    y <- unlist(x)
    g <- factor(rep(1:length(x), times=sapply(x, length)))
    if (is.numeric(y)) {
        # For numeric variables, perform a standard 2-sample t-test
        p <- t.test(y ~ g)$p.value
    } else {
        # For categorical variables, perform individual chi-squared tests for each category
        p <- sapply(levels(y), function(z) chisq.test(table(y==z, g), correct=FALSE)$p.value)
    }
    # Format the p-value, using an HTML entity for the less-than sign.
    # The initial empty string places the output on the line below the variable label.
    c("", sub("<", "&lt;", format.pval(p, digits=3, eps=0.001)))
}

table1(~ age + race | treat, data=lalonde, overall=FALSE, extra.col=list(`P-value`=pvalue))

Just a note of caution though, that the p-values for the different categories won't be independent of each other.
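One way to see this, sketched in base R: every subject belongs to exactly one category, so the per-category 2x2 tables are all carved out of the same counts, and a surplus in one category has to be balanced by a deficit in the others. The counts below are typed in by hand as I understand them from the lalonde data (confirm with table(lalonde$treat, lalonde$race)), and race_by_treat, deviation and p_each are illustrative names.

```r
# Treatment group by race (hand-entered counts, please verify)
race_by_treat <- matrix(c(281,  87, 61,
                           18, 156, 11),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(treat = c("Control", "Treatment"),
                                        race  = c("White", "Black", "Hispanic")))

# Observed minus expected counts for the full table
fit <- chisq.test(race_by_treat)
deviation <- fit$observed - fit$expected

# Within each group the deviations sum to zero (up to rounding),
# so the deviations of two categories determine the third.
rowSums(deviation)

# Per-category tests: each one collapses the other categories into "not z"
p_each <- sapply(colnames(race_by_treat), function(z) {
  in_z  <- race_by_treat[, z]
  not_z <- rowSums(race_by_treat) - in_z
  chisq.test(cbind(in_z, not_z), correct = FALSE)$p.value
})
p_each
```

Because the three tests reuse the same subjects in this way, they should not be read as three separate pieces of evidence.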

EDIT: Note that I used correct=FALSE (no continuity correction), but you can change it to TRUE if you want.
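To get a feel for how much that setting matters, here is a small sketch comparing the two options on the Hispanic counts quoted earlier in this thread. The names p_plain and p_yates are illustrative.

```r
# Same 2x2 counts as in the earlier chisq.test() example
counts <- matrix(c(368, 61, 174, 11), nrow = 2, byrow = TRUE)

# Without and with Yates' continuity correction
p_plain <- chisq.test(counts, correct = FALSE)$p.value
p_yates <- chisq.test(counts, correct = TRUE)$p.value

c(uncorrected = p_plain, yates = p_yates)
```

The corrected p-value is the larger (more conservative) of the two, since the correction shrinks each observed-minus-expected difference before squaring.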

Vonyx1000 · 2024-01-26T05:35:39Z

What do you mean when you say that the p-values for the different categories will not be independent of each other?

Thank you so much for your help!
