guess possible primary keys #1824

moodymudskipper · 2023-03-01T14:56:26Z

I had this use case and wondered if it'd be relevant as a {dm} utility

Given an unknown data frame, what columns could form a primary key ?

I have an implementation here for local data frames but this could be adapted.
We test combinations of columns starting with columns that have the most distinct values, and dismiss right away irrelevant combinations (e.g. 2 columns with 2 distinct values each cannot be a PK for a dataset of 10 rows because 2*2 < 10).
We try first with 1 col, then 2, then 3, and stop there by default.
We can choose to return early, as soon as we find enough candidates, or to give all possible candidates for the minimum required n of columns.
There's a progress bar, which might be improved a bit.

guess_pk <- function(data, max_n = 3, max_res = Inf) { 
  n_col <- ncol(data)
  n_row <- nrow(data)
  max_n <- min(max_n, n_col)
  distincts <- purrr::map_int(data, dplyr::n_distinct)
  ordered_col_ids <- order(distincts, decreasing = TRUE)
  res <- list()
  for (n in seq_len(max_n)) {
    combs <- combn(ordered_col_ids, n)
    writeLines(sprintf("Testing %s combinations of %s", ncol(combs), n))
    pb <- progress::progress_bar$new(total = ncol(combs))
    for (i in seq_len(ncol(combs))) {
      pb$tick()
      comb <- combs[,i]
      if (prod(distincts[comb]) < n_row) next
      valid <- nrow(unique(data[,comb])) == n_row
      if (valid) {
        res[[length(res) + 1]] <- names(data)[comb]
        if (length(res) == max_res) return(res)
      }
    }
    # if we found matches for n cols, no need to check for n+1
    if (length(res) > 0) return(res)
  }
  res
}

# testing on mtcars, which doesn't actually have a proper pk if we ignore row names :

guess_pk(mtcars, max_res = 1)
#> Testing 11 combinations of 1
#> Testing 55 combinations of 2
#> [[1]]
#> [1] "qsec" "wt"

guess_pk(mtcars)
#> Testing 11 combinations of 1
#> Testing 55 combinations of 2
#> [[1]]
#> [1] "qsec" "wt"  
#> 
#> [[2]]
#> [1] "qsec" "disp"
#> 
#> [[3]]
#> [1] "qsec" "mpg" 
#> 
#> [[4]]
#> [1] "qsec" "hp"  
#> 
#> [[5]]
#> [1] "qsec" "drat"
#> 
#> [[6]]
#> [1] "qsec" "carb"
#> 
#> [[7]]
#> [1] "qsec" "cyl" 
#> 
#> [[8]]
#> [1] "qsec" "am"  
#> 
#> [[9]]
#> [1] "wt"  "mpg"

^{Created on 2023-03-01 with reprex v2.0.2}

krlmlr · 2023-03-14T17:19:29Z

We have dm::enum_pk_candidates(), but this only looks at simple keys.

A more efficient strategy for compound keys might be to look at the columns from the left (first column, then the first two, etc..)

moodymudskipper · 2023-03-24T13:23:47Z

More efficient assuming some common practice but less robust, we'll might get a 5 variables compound key where variables 3 and 5 would suffice.
Or maybe we do a first pass as you suggest, we find our 5 variable compound key, and do a second pass trying to remove variables one by one, maybe starting from before last and going back to the first ? and these would eliminate in turn variables 4, 2, and 1 in this case. This will find a single solution but that seems a reasonable compromise.
The function might do the second pass optionally.

krlmlr added this to the keys milestone Aug 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

guess possible primary keys #1824

guess possible primary keys #1824

moodymudskipper commented Mar 1, 2023 •

edited

Loading

krlmlr commented Mar 14, 2023

moodymudskipper commented Mar 24, 2023

guess possible primary keys #1824

guess possible primary keys #1824

Comments

moodymudskipper commented Mar 1, 2023 • edited Loading

krlmlr commented Mar 14, 2023

moodymudskipper commented Mar 24, 2023

moodymudskipper commented Mar 1, 2023 •

edited

Loading