P-values for each individual row #120

Open
Vonyx1000 opened this issue Jan 23, 2024 · 4 comments

Comments

@Vonyx1000

Hello Benjamin!

Thank you so much for this wonderful package and your incredible work. I have read your guide titled "Using the table1 Package to Create HTML Tables of Descriptive Statistics" and have a question about the section titled "Example: a column of p-values".

I have used this to add a column of p-values in the past, but the issue is that each p-value is computed from the variable as a whole. Here is a screenshot from your guide:

image

It generates a single p-value ("<0.001") from a chi-square test on the entire "Race" variable (i.e., "White", "Black", AND "Hispanic" together). If I want to look at each specific race and how it differs between the Control and Treatment groups, I cannot do that.

In that case, I would need a separate p-value for "White", a separate p-value for "Black", and a separate p-value for "Hispanic" (three separate chi-square analyses). I have tried different solutions to separate the chi-square analyses but none of them have worked, and I am quite unfamiliar with how to do this properly in table1. This is how the output would ideally look (chi-square analysis for each race category separately) if I could get it to work:

image

Is it possible for table1 to do this? Or is it a lost cause?

Thank you so much in advance.

Warmest regards,
Vaish

@benjaminrich
Owner

It is possible. Can you explain exactly how you are computing the p-value, like how did you get 0.011 for the third p-value (Hispanic)?

@Vonyx1000
Author

I'm sorry, I think I did the chi-square math wrong for the Hispanic category in the example table I sent. I am not a statistician by any means, so please take the exact math with a grain of salt.

Here is how you would do it just mathematically:
image

The formula involves the observed counts in each cell of the table, and it compares the observed counts to the counts that would be expected if there was no association between the variables.

For larger contingency tables, the formula becomes more complex, but statistical software like R should be able to do it.
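
The formula image above is not reproduced in this text. Assuming it shows the standard Pearson chi-square statistic, it can be written as below, where O_ij is the observed count in row i and column j, E_ij is the count expected under no association, R_i and C_j are the row and column totals, N is the grand total, and r and c are the numbers of rows and columns (these symbol names are my own choice, not taken from the image):

```latex
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
\mathrm{df} = (r - 1)(c - 1)
```

The same sum applies to tables of any size; only the number of cells changes.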

For the "Hispanic" variable, the observed counts in the contingency table are as follows:
image
image
image
image
image

So, the correct chi-square test statistic is approximately 8.547. The degrees of freedom is 1 (for a 2x2 table). Now, when comparing this statistic to the chi-square distribution with 1 degree of freedom, the correct p-value is approximately 0.0035. (sorry the 0.011 was incorrect)
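
The arithmetic can be reproduced step by step in base R. This is only a sketch: it uses the four counts that appear in the chisq.test() call below, and the variable names (observed, expected, stat, df, p_value) are chosen here for illustration.

```r
# Observed counts: rows = Control, Treatment; columns = not Hispanic, Hispanic
observed <- matrix(c(368, 61,
                     174, 11), nrow = 2, byrow = TRUE)

# Expected counts under no association: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic, without continuity correction
stat <- sum((observed - expected)^2 / expected)

# Degrees of freedom for an r x c table
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(stat, df = df, lower.tail = FALSE)

print(round(stat, 3))     # about 8.547
print(round(p_value, 4))  # about 0.0035
```

If the hand calculation is right, these values should agree with what chisq.test(observed, correct = FALSE) reports.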

In R, you could make the contingency table and then feed it into the chisq.test() function like this

counts <- matrix(c(368, 61, 174, 11), nrow = 2, byrow = TRUE)
t <- chisq.test(counts, correct = FALSE)
print(t)

The "correct = FALSE" turns off Yates' continuity correction, which R applies by default to 2x2 tables and which you would typically leave at TRUE, but for this example I set it to FALSE so it lines up with the math I did above. I tried repeating it with the code from your guide and there are definitely much better (and probably more statistically accurate) ways to do this, but this is what I ended up with:

library(table1)   # provides table1() and label()
library(MatchIt)  # provides the lalonde data set
data(lalonde)
lalonde$treat <- factor(lalonde$treat, levels=c(0, 1), labels=c("Control", "Treatment"))
lalonde$race <- factor(lalonde$race, levels=c("white", "black", "hispan"), labels=c("White", "Black", "Hispanic"))
label(lalonde$race) <- "Race"

table1(~ race | treat, data = lalonde)

# 2x2 table: treatment group by "is Hispanic" (FALSE/TRUE)
hispanic <- table(lalonde$treat, lalonde$race == "Hispanic")
print(hispanic)
p <- chisq.test(hispanic, correct = FALSE)
print(p)

You should also be able to do this with more than 2 variables but I am unsure of how the code would look for that. chisq.test() should support doing that.

@benjaminrich
Owner

benjaminrich commented Jan 25, 2024

Thanks for the detailed explanation. I just wanted to be sure because I wasn't getting 0.011, which turned out to be incorrect. So, here's how you can do it (basically, I just modified one line from the example in the vignette):

pvalue <- function(x, ...) {
    # Construct vectors of data y, and groups (strata) g
    y <- unlist(x)
    g <- factor(rep(1:length(x), times=sapply(x, length)))
    if (is.numeric(y)) {
        # For numeric variables, perform a standard 2-sample t-test
        p <- t.test(y ~ g)$p.value
    } else {
        # For categorical variables, perform individual chi-squared tests for each category
        p <- sapply(levels(y), function(z) chisq.test(table(y==z, g), correct=F)$p.value)
    }
    # Format the p-value, using an HTML entity for the less-than sign.
    # The initial empty string places the output on the line below the variable label.
    c("", sub("<", "&lt;", format.pval(p, digits=3, eps=0.001)))
}

table1(~ age + race | treat, data=lalonde, overall=F, extra.col=list(`P-value`=pvalue))

image

One note of caution, though: the p-values for the different categories won't be independent of each other.

EDIT: Note that I used correct=F, but you can change it to T if you want.
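
One way to see the dependence: every subject falls into exactly one category, so within each group the category counts must add up to the group total, and a surplus in one category forces a deficit in another. The extreme case is a variable with only two levels, where the two per-category tests are the same test. A minimal sketch with made-up data (the vectors y and g and all counts are hypothetical, not taken from lalonde):

```r
# Hypothetical data: a two-level categorical variable y and a two-level group g
y <- factor(c(rep("A", 30), rep("B", 20), rep("A", 15), rep("B", 35)))
g <- factor(rep(c("Control", "Treatment"), each = 50))

# One chi-squared test per category, as in the pvalue() function above
p <- sapply(levels(y), function(z) {
    chisq.test(table(y == z, g), correct = FALSE)$p.value
})

print(p)
# Both p-values are identical: table(y == "A", g) and table(y == "B", g)
# hold the same counts with the rows swapped.
```

With three or more categories the p-values are no longer identical, but they still move together for the same reason.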

@Vonyx1000
Author

What do you mean when you say that the p-values for the different categories will not be independent of each other?

Thank you so much for your help!
