Mismatches between significance test result of R and SPSS #100

Closed
khanhhtt opened this issue Nov 24, 2022 · 6 comments

@khanhhtt

Hi @gdemin,

Thank you for the great package. This thread is a question rather than a bug report.
I have a couple of questions about mismatches between the significance test output of R and SPSS and would appreciate your help:

1. Rounding: when a cell percentage has exactly 5 after the decimal point, e.g. 12.5, it is rounded half down to 12 instead of half up to 13 in the significance test output. Rounding still works as expected when the test is not performed.
This is the R script I used:

# ==============================================================================
# Required packages
# ==============================================================================
library(haven)
library(expss)

# ==============================================================================
# Required functions
# ==============================================================================
# Source: https://github.com/gdemin/expss/issues/28
empty_to_zero = function(tbl, value_to_add = 0, digits = get_expss_digits()){
  # for numerics
  if_na(tbl) = value_to_add
  # for characters after significance testing
  for(i in seq_along(tbl)[-1]){
    if(is.character(tbl[[i]])){
      empty = grepl("^\\s*$", tbl[[i]])
      max_padding = max(nchar(gsub("^(.+?)(\\s*)$", "\\2", tbl[[i]], perl = TRUE)))
      replacement = paste(c(format(value_to_add, nsmall = digits), rep(" ", max_padding)), collapse = "")
      tbl[[i]][empty] = sub("\\s?", replacement, tbl[[i]][empty], perl = TRUE)
    }
  }
  tbl
}

add_percent = function(x, digits = get_expss_digits(), excluded_rows = "count", ...){
  nas = is.na(x)
  x[nas] = ""
  
  cols_idx = 2:dim(x)[2]
  
  for (col in cols_idx) {
    for (row in 1:dim(x)[1]){
      if (!grepl(excluded_rows, x[row, 1], perl = TRUE)){
        if (suppressWarnings(is.na(as.numeric(as.character(x[row,col]))))) {
          x[row,col] = sub(" ", "% ", trimws(x[row,col]))
        }
        else {
          x[row,col] = paste0(trimws(x[row,col]), "%")
        }
      }  
    }
  }
  x <- x[!grepl("Std. dev.", x$row_labels),]
  x <- x[!grepl("Unw. valid N", x$row_labels),]
  x
}

# ==============================================================================
# Report
# ==============================================================================
### Get data
df <- haven::read_sav("01. Data - Cleaned.sav") 

### Results sigtest rounded to 0 decimal

tbl_1 <- df %>% tab_cols(total(), q3_BG) %>%
  tab_cells(q4) %>%
  tab_stat_cpct(total_label = c("Total"), 
                total_statistic = c("u_cases"),
                total_row_position = "above") %>%
  tab_last_sig_cpct(digits = 0,
                    subtable_marks = "greater",
                    sig_labels = LETTERS) %>%
  tab_pivot() %>%
  empty_to_zero(digits = 0) %>%
  add_percent(excluded_rows = "#")
tbl_1

### Results without sigtest

tbl_1 <- df %>% tab_cols(total(), q3_BG) %>%
  tab_cells(q4) %>%
  tab_stat_cpct(total_label = c("Total"), 
                total_statistic = c("u_cases"),
                total_row_position = "above") %>%
  # tab_last_sig_cpct(digits = 10,
  #                   subtable_marks = "greater",
  #                   sig_labels = LETTERS) %>%
  tab_pivot() %>%
  empty_to_zero(digits = 0) %>%
  add_percent(excluded_rows = "#")
tbl_1

And here is the SPSS syntax:

GET FILE '01. Data - Cleaned.sav'.

COMPUTE Totalt = 1.
EXECUTE.

* Sigtest.
CTABLES
    /VLABELS VARIABLES=q4 DISPLAY=LABEL
    /VLABELS VARIABLES=q3_BG DISPLAY=LABEL
    /TABLE q4 [C][COLPCT.COUNT PCT40.0, TOTALS [COUNT F40.0]] by (Totalt + q3_BG) [C]
    /SLABELS POSITION=ROW VISIBLE = NO
    /CATEGORIES VARIABLES=q4 EMPTY=INCLUDE TOTAL=YES POSITION=BEFORE
    /COMPARETEST TYPE=PROP ALPHA=0.05 ADJUST=NONE ORIGIN=COLUMN INCLUDEMRSETS=YES 
    CATEGORIES=ALLVISIBLE MERGE=YES STYLE=SIMPLE SHOWSIG=NO. 

The results with the significance test then look like this:
image

The results without the significance test are still fine:
image

2. Some pairwise comparisons are marked as significant in R but not in SPSS.
In particular, R still performs the significance test when a proportion is 1.
image

However, the SPSS Statistics Algorithms 22 documentation (page 264) states that the test is not performed in this case.
image

I suspect I have used the R functions incorrectly and that this causes the mismatch.
Could you please take a look and give me some advice on this matter?
The attachment contains the data file, the R script, the SPSS script, and a comparison of the R and SPSS results for your reference.
The first sheet of the Excel file holds the results with the significance test, and the second sheet holds the results without it.

Thank you in advance!

Reproducible examples.zip

@gdemin
Owner

gdemin commented Nov 25, 2022

Hi!
Thank you for the detailed report.

  1. expss uses the built-in R round function. Its behaviour is documented:

Note that for rounding off a 5, the IEC 60559 standard (see also ‘IEEE 754’) is expected to be used, ‘go to the even digit’. Therefore round(0.5) is 0 and round(-1.5) is -2.

Personally, I would prefer to leave it as is, to stay consistent with the rest of R. But if you consider it a serious issue, I can change this behaviour.
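To illustrate the quoted rule, here is a minimal base-R check. It does not involve expss, and the variable names are only for illustration. Every value below is an exact half, so each one is rounded to the nearest even integer:

```r
# Exact halves are rounded to the nearest even digit by base R's round()
half_values = c(0.5, 1.5, 2.5, 12.5, 13.5, -1.5)
rounded = round(half_values)

# Show each input next to its rounded value
print(data.frame(value = half_values, rounded = rounded))
```

In particular, 12.5 becomes 12 while 13.5 becomes 14, which matches the behaviour described in the first post.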

  2. I will investigate it; perhaps it is a bug. However, it is rather strange that SPSS ignores 0% and 100% values. For sufficiently large bases, the R function that I use internally calculates significance for this edge case without any issues:
prop.test(x = c(9, 19), n = c(85, 19)) # x is number of successes, n - number of trials

#	2-sample test for equality of proportions with continuity correction
#
# data:  c(9, 19) out of c(85, 19)
# X-squared = 58.636, df = 1, p-value = 1.897e-14
# alternative hypothesis: two.sided
# 95 percent confidence interval:
#  -0.9917263 -0.7965090
# sample estimates:
#    prop 1    prop 2 
# 0.1058824 1.0000000 
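
For comparison, below is a rough sketch of a two-proportion test that skips the comparison when either proportion is exactly 0 or 1, which is the behaviour the SPSS documentation is reported to describe earlier in this thread. The function name `prop_z_skip_extremes` and the pooled z-test formula are assumptions made for illustration; this is neither the SPSS algorithm itself nor the code used inside expss.

```r
# Hypothetical helper: pooled two-proportion z-test that returns NA
# when either sample proportion is exactly 0 or 1 (test not performed).
prop_z_skip_extremes = function(x1, n1, x2, n2) {
    p1 = x1 / n1
    p2 = x2 / n2
    # Assumed skip rule: no test for 0% or 100% cells
    if (p1 %in% c(0, 1) || p2 %in% c(0, 1)) {
        return(NA_real_)
    }
    # Pooled proportion and standard error of the difference
    p_pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the normal distribution
    2 * pnorm(-abs(z))
}

skipped = prop_z_skip_extremes(9, 85, 19, 19)   # second proportion is 1
tested = prop_z_skip_extremes(9, 85, 10, 19)    # both proportions strictly inside (0, 1)
```

With the counts from the `prop.test` call above, the second proportion is 19/19 = 1, so this sketch returns `NA` instead of a p-value.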

@khanhhtt
Author

Hi @gdemin,

Thank you for the quick and informative response.

  1. As someone still getting used to R, I was surprised that the built-in round function behaves differently from the rounding I usually apply. I understand the value of keeping R packages consistent, and that this rounding behaviour may be useful in other fields. But it would also be great to have an option for choosing the rounding behaviour that suits my purpose. I think that would ease concerns when people compare the results against SPSS.

  2. I look forward to hearing more from you on this.

@khanhhtt
Author

Hi @gdemin,

I hope you are doing well 😊
I would just like to know whether there is any news on your side.
Do you plan to adjust the rounding in expss? If not, could you please suggest a workaround that would address my concern?

Many thanks!

@gdemin
Owner

gdemin commented Dec 25, 2022

Hi @khanhhtt

I will add a rounding option in the next version, but I can't promise anything about when it will be ready.

As a workaround, you can set expss_digits(3) and then round the numbers with the code below:

library(expss)
round2 = function(x, digits = 0) {
    posneg = sign(x)
    z = abs(x)*10^digits
    z = z + 0.5 + sqrt(.Machine$double.eps)
    z = trunc(z)
    z = z/10^digits
    z*posneg
}

round_table_values = function(tbl, digits = 1){
    col_index = seq_along(tbl)[-1]
    cell_pattern = "^(.*?)([-0-9.]+)(.*?)$"
    for(i in col_index){
        curr = tbl[[i]]
        if(is.character(curr)){
           numeric_values = suppressWarnings(as.numeric(gsub(cell_pattern, "\\2", curr)))
           numeric_values = round2(numeric_values, digits = digits)
           not_na_index = which(!is.na(numeric_values))
           curr[not_na_index] = sapply(not_na_index, 
                                                 function(cell_index)
                                                     gsub(cell_pattern, 
                                                          paste0("\\1", numeric_values[cell_index], "\\3"), 
                                                          curr[cell_index])
                                                 )
        } else {
            curr = round2(curr, digits = digits)
        }
        tbl[[i]] = curr
        
    }
    tbl
}

data(mtcars)
expss_digits(3)
mtcars = apply_labels(mtcars,
                      mpg = "Miles/(US) gallon",
                      cyl = "Number of cylinders",
                      disp = "Displacement (cu.in.)",
                      hp = "Gross horsepower",
                      drat = "Rear axle ratio",
                      wt = "Weight (lb/1000)",
                      qsec = "1/4 mile time",
                      vs = "Engine",
                      vs = c("V-engine" = 0,
                             "Straight engine" = 1),
                      am = "Transmission",
                      am = c("Automatic" = 0,
                             "Manual"=1),
                      gear = "Number of forward gears",
                      carb = "Number of carburetors"
)

mtcars_table = cross_cpct(mtcars, 
                          list(cyl, gear),
                          list(total(), vs, am)
)

res = significance_cpct(mtcars_table)

round_table_values(res)
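
As a quick sanity check of the workaround, the sketch below repeats the `round2` definition from the code above so that it runs on its own, without expss, and compares it with base `round()` on a few exact halves. The variable names are only for illustration.

```r
# round2 as defined in the workaround above: round half away from zero
round2 = function(x, digits = 0) {
    posneg = sign(x)
    z = abs(x) * 10^digits
    # The small epsilon guards against values that sit just below a half
    # because of floating-point representation
    z = z + 0.5 + sqrt(.Machine$double.eps)
    z = trunc(z)
    z = z / 10^digits
    z * posneg
}

half_values = c(0.5, 2.5, 12.5, -1.5)
base_rounded = round(half_values)       # half to even
half_up_rounded = round2(half_values)   # half away from zero

print(data.frame(value = half_values, round = base_rounded, round2 = half_up_rounded))
```

Here 12.5 is rounded to 12 by `round()` and to 13 by `round2()`, which is the half-up result the original question expected.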

@khanhhtt
Author

Hi @gdemin,

That's great! Thank you so much for spending time on Christmas day to give me the workaround solution.
This is all I need for now 😊

@gdemin
Owner

gdemin commented Jul 16, 2023

Fixed in version 0.11.6
Rounding is set with expss_round_half_to_even(FALSE).
For SPSS significance there is an argument as_spss in significance_cpct and others.
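
A minimal usage sketch of the two settings named above, assuming expss 0.11.6 or later is installed. Passing `as_spss = TRUE` as a logical flag is an assumption based on this comment; please check `?significance_cpct` for the exact semantics. The table reuses the mtcars example from earlier in the thread.

```r
library(expss)

# Switch expss from round-half-to-even to round-half-up (expss >= 0.11.6)
expss_round_half_to_even(FALSE)

data(mtcars)
mtcars_table = cross_cpct(mtcars,
                          list(cyl, gear),
                          list(total(), vs, am)
)

# as_spss = TRUE is assumed to request SPSS-compatible significance testing
res = significance_cpct(mtcars_table, digits = 0, as_spss = TRUE)
res
```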

@gdemin gdemin closed this as completed Jul 16, 2023