When scanning numeric columns, I want to quickly find out which columns hold a distinct value on every row.
dfSummary is useful for scanning columns quickly and figuring out the structural and statistical properties of each one. Normally, when I dig into a dataset, I try to quickly find out whether natural keys (social security number, housing address, customer id, etc.) are duplicated. The simplest way right now is to do a count-distinct (e.g. n_distinct(x) in dplyr) and compare the number of distinct values to the number of rows in the data frame. I use dfSummary a lot and think this would be a great enhancement.
One possible solution is to add a "% distinct" value on the relevant columns, since you already have a (% of valid) in the column header. Another is a flag, such as a string saying "Unique" or "(all unique)". Right now I have to check the Freqs against the row count, which of course is just a minor inconvenience.
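For reference, the manual check described above can be sketched in base R as follows. This is only an illustration of the count-distinct comparison, not dfSummary's implementation; the data frame `df` and the names `n_dist`, `pct_distinct`, and `all_distinct` are made up for the example, and `length(unique(x))` stands in for `dplyr::n_distinct(x)`.

```r
# Hypothetical example data: customer_id is a candidate natural key,
# while ssn contains one duplicated value.
df <- data.frame(
  customer_id = c(101, 102, 103, 104),
  ssn         = c("111", "222", "222", "333"),
  stringsAsFactors = FALSE
)

# Count distinct values per column (base R stand-in for dplyr::n_distinct(x)).
n_dist <- vapply(df, function(x) length(unique(x)), integer(1))

# The proposed "% distinct" figure: distinct values relative to the row count.
pct_distinct <- round(100 * n_dist / nrow(df), 1)

# The proposed flag: a column is "all distinct" when the counts match.
all_distinct <- n_dist == nrow(df)

print(data.frame(n_distinct = n_dist, pct_distinct, all_distinct))
```

Here `customer_id` would be flagged as all distinct and `ssn` would not, which is the information I currently have to work out by comparing Freqs to the row count.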
This is a good idea. I'd go for "All distinct values"; however, a new term ("All") would need to be added to the translations dataset, which will require some work. Help is always welcome.