You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I had this use case and wondered if it'd be relevant as a {dm} utility
Given an unknown data frame, what columns could form a primary key ?
I have an implementation here for local data frames but this could be adapted.
We test combinations of columns starting with columns that have the most distinct values, and dismiss right away irrelevant combinations (e.g. 2 columns with 2 distinct values each cannot be a PK for a dataset of 10 rows because 2*2 < 10).
We try first with 1 col, then 2, then 3, and stop there by default.
We can choose to return early, as soon as we find enough candidates, or to give all possible candidates for the minimum required n of columns.
There's a progress bar, which might be improved a bit.
guess_pk<-function(data, max_n=3, max_res=Inf) {
n_col<- ncol(data)
n_row<- nrow(data)
max_n<- min(max_n, n_col)
distincts<-purrr::map_int(data, dplyr::n_distinct)
ordered_col_ids<- order(distincts, decreasing=TRUE)
res<-list()
for (nin seq_len(max_n)) {
combs<- combn(ordered_col_ids, n)
writeLines(sprintf("Testing %s combinations of %s", ncol(combs), n))
pb<-progress::progress_bar$new(total= ncol(combs))
for (iin seq_len(ncol(combs))) {
pb$tick()
comb<-combs[,i]
if (prod(distincts[comb]) <n_row) nextvalid<- nrow(unique(data[,comb])) ==n_rowif (valid) {
res[[length(res) +1]] <- names(data)[comb]
if (length(res) ==max_res) return(res)
}
}
# if we found matches for n cols, no need to check for n+1if (length(res) >0) return(res)
}
res
}
# testing on mtcars, which doesn't actually have a proper pk if we ignore row names :
guess_pk(mtcars, max_res=1)
#> Testing 11 combinations of 1#> Testing 55 combinations of 2#> [[1]]#> [1] "qsec" "wt"
guess_pk(mtcars)
#> Testing 11 combinations of 1#> Testing 55 combinations of 2#> [[1]]#> [1] "qsec" "wt" #> #> [[2]]#> [1] "qsec" "disp"#> #> [[3]]#> [1] "qsec" "mpg" #> #> [[4]]#> [1] "qsec" "hp" #> #> [[5]]#> [1] "qsec" "drat"#> #> [[6]]#> [1] "qsec" "carb"#> #> [[7]]#> [1] "qsec" "cyl" #> #> [[8]]#> [1] "qsec" "am" #> #> [[9]]#> [1] "wt" "mpg"
More efficient assuming some common practice but less robust, we'll might get a 5 variables compound key where variables 3 and 5 would suffice.
Or maybe we do a first pass as you suggest, we find our 5 variable compound key, and do a second pass trying to remove variables one by one, maybe starting from before last and going back to the first ? and these would eliminate in turn variables 4, 2, and 1 in this case. This will find a single solution but that seems a reasonable compromise.
The function might do the second pass optionally.
I had this use case and wondered if it'd be relevant as a {dm} utility
Given an unknown data frame, what columns could form a primary key ?
I have an implementation here for local data frames but this could be adapted.
We test combinations of columns starting with columns that have the most distinct values, and dismiss right away irrelevant combinations (e.g. 2 columns with 2 distinct values each cannot be a PK for a dataset of 10 rows because 2*2 < 10).
We try first with 1 col, then 2, then 3, and stop there by default.
We can choose to return early, as soon as we find enough candidates, or to give all possible candidates for the minimum required n of columns.
There's a progress bar, which might be improved a bit.
Created on 2023-03-01 with reprex v2.0.2
The text was updated successfully, but these errors were encountered: