Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

guess possible primary keys #1824

Open
moodymudskipper opened this issue Mar 1, 2023 · 2 comments
Open

guess possible primary keys #1824

moodymudskipper opened this issue Mar 1, 2023 · 2 comments
Milestone

Comments

@moodymudskipper
Copy link
Collaborator

moodymudskipper commented Mar 1, 2023

I had this use case and wondered if it'd be relevant as a {dm} utility

Given an unknown data frame, what columns could form a primary key ?

I have an implementation here for local data frames but this could be adapted.
We test combinations of columns starting with columns that have the most distinct values, and dismiss right away irrelevant combinations (e.g. 2 columns with 2 distinct values each cannot be a PK for a dataset of 10 rows because 2*2 < 10).
We try first with 1 col, then 2, then 3, and stop there by default.
We can choose to return early, as soon as we find enough candidates, or to give all possible candidates for the minimum required n of columns.
There's a progress bar, which might be improved a bit.

guess_pk <- function(data, max_n = 3, max_res = Inf) { 
  n_col <- ncol(data)
  n_row <- nrow(data)
  max_n <- min(max_n, n_col)
  distincts <- purrr::map_int(data, dplyr::n_distinct)
  ordered_col_ids <- order(distincts, decreasing = TRUE)
  res <- list()
  for (n in seq_len(max_n)) {
    combs <- combn(ordered_col_ids, n)
    writeLines(sprintf("Testing %s combinations of %s", ncol(combs), n))
    pb <- progress::progress_bar$new(total = ncol(combs))
    for (i in seq_len(ncol(combs))) {
      pb$tick()
      comb <- combs[,i]
      if (prod(distincts[comb]) < n_row) next
      valid <- nrow(unique(data[,comb])) == n_row
      if (valid) {
        res[[length(res) + 1]] <- names(data)[comb]
        if (length(res) == max_res) return(res)
      }
    }
    # if we found matches for n cols, no need to check for n+1
    if (length(res) > 0) return(res)
  }
  res
}

# testing on mtcars, which doesn't actually have a proper pk if we ignore row names :

guess_pk(mtcars, max_res = 1)
#> Testing 11 combinations of 1
#> Testing 55 combinations of 2
#> [[1]]
#> [1] "qsec" "wt"

guess_pk(mtcars)
#> Testing 11 combinations of 1
#> Testing 55 combinations of 2
#> [[1]]
#> [1] "qsec" "wt"  
#> 
#> [[2]]
#> [1] "qsec" "disp"
#> 
#> [[3]]
#> [1] "qsec" "mpg" 
#> 
#> [[4]]
#> [1] "qsec" "hp"  
#> 
#> [[5]]
#> [1] "qsec" "drat"
#> 
#> [[6]]
#> [1] "qsec" "carb"
#> 
#> [[7]]
#> [1] "qsec" "cyl" 
#> 
#> [[8]]
#> [1] "qsec" "am"  
#> 
#> [[9]]
#> [1] "wt"  "mpg"

Created on 2023-03-01 with reprex v2.0.2

@krlmlr
Copy link
Collaborator

krlmlr commented Mar 14, 2023

We have dm::enum_pk_candidates(), but this only looks at simple keys.

A more efficient strategy for compound keys might be to look at the columns from the left (first column, then the first two, etc..)

@moodymudskipper
Copy link
Collaborator Author

More efficient assuming some common practice but less robust, we'll might get a 5 variables compound key where variables 3 and 5 would suffice.
Or maybe we do a first pass as you suggest, we find our 5 variable compound key, and do a second pass trying to remove variables one by one, maybe starting from before last and going back to the first ? and these would eliminate in turn variables 4, 2, and 1 in this case. This will find a single solution but that seems a reasonable compromise.
The function might do the second pass optionally.

@krlmlr krlmlr added this to the keys milestone Aug 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants