
Incorrectly Calculated P-Values #127

Open
Vonyx1000 opened this issue May 23, 2024 · 1 comment

Vonyx1000 commented May 23, 2024

Hello Benjamin!

Thank you so much for this wonderful package and your incredible work. I have benefited a lot from it and appreciate the time and effort you put into it. I've run into an issue where some of the p-values are not being calculated the way I expect. I am working on a project and ran the following code:

pvalue <- function(x, ...) {
  # Construct vectors of data y and groups (strata) g
  y <- unlist(x)
  g <- factor(rep(seq_along(x), times = sapply(x, length)))
  if (is.numeric(y)) {
    # For numeric variables, perform a standard 2-sample t-test
    # (commented out because all variables in this project are categorical;
    # p is set to NA so the function still returns something for this branch)
    # p <- t.test(y ~ g)$p.value
    p <- NA
  } else {
    # For categorical variables, perform an individual chi-squared test
    # for each level of the variable
    p <- sapply(levels(y), function(z) chisq.test(table(y == z, g))$p.value)
  }
  # Format the p-value, using an HTML entity for the less-than sign.
  # The initial empty string places the output on the line below the variable label.
  c("", sub("<", "&lt;", format.pval(p, digits = 3, eps = 0.001)))
}

# 2014/2015 data
t <- table1(~ `Command` | CombinedYear*TextScore,
            data = s201415,
            extra.col = list(`P-value` = pvalue),
            digits = 5, overall = FALSE,
            topclass = "Rtable1-zebra Rtable1-shade Rtable1-times")

The t-test is commented out because all of the data in this project is categorical. I got the following result:

[screenshot of the table1 output]

I am looking at evaluation results where 97 groups performed certain commands, and each group was marked as either "Not Done" or "Well Done" for each command. Since all 97 groups were evaluated, each row adds up to 97 unless data was missing (missing data was excluded for that row). When I manually calculate these p-values, for example the first one, "Command3", the result should be significant. See this example from a basic online chi-square calculator:
[screenshot of the online chi-square calculator result]
Source: https://www.socscistatistics.com/tests/chisquare2/default2.aspx

I have been troubleshooting for some time but I am unsure what the issue is. I apologize if it is something that should be obvious, as I am not very experienced. I suspect that instead of using 97 as the total, it is using the totals in the header (Not Done N=143 and Well Done N=831). Could you provide some insight into why the p-values are being calculated this way, and how I can fix it?
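For reference, here is a tiny self-contained example of how I am calling `pvalue()`. The data here is made up (my real data is confidential), but the structure is the same: a list of factors, one per group, as produced by `split()`:

```r
# Tiny reproducible example with made-up data (not my real data)
set.seed(1)
scores <- factor(sample(c("Not Done", "Well Done"), 97, replace = TRUE))
years  <- sample(c("2014", "2015"), 97, replace = TRUE)
x <- split(scores, years)   # a list of factors, one element per group
pvalue(x)
```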

Thank you so much in advance.

Warmest regards,
Vaish

benjaminrich (Owner) commented

First, you have to ask yourself what hypotheses you are testing (a p-value is always associated with a hypothesis test). You need to formulate this clearly in order for the p-value to have the desired meaning.

Note that in the screenshot you posted from the website, I don't think the test is formulated correctly, because you have a 2x2 contingency table with a grand total of 194(!). I don't think that represents your situation, but again, you need to formulate the hypotheses clearly first.
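For example, if the hypothesis for a single command is that the proportion marked "Well Done" differs between two groups, then the test for that one command is a chi-squared test on a 2x2 table whose grand total is 97. The counts below are made up purely for illustration:

```r
# Hypothetical counts for a single command (97 groups in total):
#              Group A  Group B
#  Not Done         20        8
#  Well Done        30       39
m <- matrix(c(20, 8, 30, 39), nrow = 2, byrow = TRUE,
            dimnames = list(c("Not Done", "Well Done"), c("A", "B")))
chisq.test(m)
```

Whether this is the right table to test depends entirely on what your hypothesis actually is, which is why you need to state it clearly first.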
