plot_bar function shows the frequencies blurry in create_report #123

Closed
myildir3 opened this issue Jul 17, 2019 · 3 comments

myildir3 commented Jul 17, 2019

Hello,

Firstly, thank you for writing an awesome package. I have been using bar plots and histograms to look at the distribution of each variable. My dataset has around 80 variables, and some of them have more than 100 levels, so I need to lump the insignificant levels into "other" by setting a threshold of, say, 0.05. I also want to show the frequency of each bar in the bar plot, so I tweaked the source code of plot_bar to add the frequencies; the modified function is included below. It works for a single plot, but not in the create_report function, where the numbers become unreadable. Could you help me create a readable report?

# Code for plot_bar
function (data, with = NULL, maxcat = 50, order_bar = TRUE, 
  binary_as_factor = TRUE, title = NULL, ggtheme = theme_gray(), 
  theme_config = list(), nrow = 3L, ncol = 3L, parallel = FALSE) 
{
  .doTrace({
  }, "on entry")
  {
    frequency <- measure <- variable <- value <- NULL
    if (!is.data.table(data)) 
      data <- data.table(data)
    split_data <- split_columns(data, binary_as_factor = binary_as_factor)
    if (split_data$num_discrete == 0) 
      stop("No discrete features found!")
    discrete <- split_data$discrete
    ind <- .ignoreCat(discrete, maxcat = maxcat)
    if (length(ind)) {
      message(length(ind), " columns ignored with more than ", 
        maxcat, " categories.\n", paste0(names(ind), 
          ": ", ind, " categories\n"))
      drop_columns(discrete, names(ind))
      if (length(discrete) == 0) 
        stop("Note: All discrete features ignored! Nothing to plot!")
    }
    feature_names <- names(discrete)
    if (is.null(with)) {
      dt <- discrete[, list(frequency = .N), by = feature_names]
    }
    else {
      if (is.factor(data[[with]])) {
        measure_var <- suppressWarnings(as.numeric(levels(data[[with]]))[data[[with]]])
      }
      else if (is.character(data[[with]])) {
        measure_var <- as.numeric(data[[with]])
      }
      else {
        measure_var <- data[[with]]
      }
      if (all(is.na(measure_var))) 
        stop("Failed to convert `", with, "` to continuous!")
      if (with %in% names(discrete)) 
        drop_columns(discrete, with)
      tmp_dt <- data.table(discrete, measure = measure_var)
      dt <- tmp_dt[, list(frequency = sum(measure, na.rm = TRUE)), 
        by = feature_names]
    }
    dt2 <- suppressWarnings(melt.data.table(dt, measure.vars = feature_names))
    layout <- .getPageLayout(nrow, ncol, ncol(discrete))
    plot_list <- .lapply(parallel = parallel, X = layout, 
      FUN = function(x) {
        if (order_bar) {
          base_plot <- ggplot(dt2[variable %in% feature_names[x]], 
            aes(x = reorder(value, frequency), y = frequency))
        }
        else {
          base_plot <- ggplot(dt2[variable %in% feature_names[x]], 
            aes(x = value, y = frequency))
        }
        base_plot + geom_bar(stat = "identity") + geom_text(stat = "identity", 
          position = "identity", aes(label = frequency, 
            color = "red", angle = 90, fontface = "bold", 
            vjust = -0.5)) + coord_flip() + xlab("") + 
          ylab(ifelse(is.null(with), "Frequency", toTitleCase(with)))
      })
    class(plot_list) <- c("multiple", class(plot_list))
    plotDataExplorer(plot_obj = plot_list, page_layout = layout, 
      title = title, ggtheme = ggtheme, theme_config = theme_config, 
      facet_wrap_args = list(facet = ~variable, nrow = nrow, 
        ncol = ncol, scales = "free"))
  }
}
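
For comparison, here is a minimal sketch of the label layer on made-up data, assuming ggplot2 is installed. It is not the package's code: the toy `counts` table and the `size` and `hjust` values are illustrative choices. Constant settings such as the colour are passed outside aes() here, because constants placed inside aes() are treated as data to be mapped, so color = "red" inside aes() yields a legend and a default palette colour rather than red text.

```r
library(ggplot2)

# Toy frequency table standing in for one feature of the melted data (made up).
counts <- data.frame(
  value = c("a", "b", "c", "d"),
  frequency = c(50, 40, 7, 3)
)

p <- ggplot(counts, aes(x = reorder(value, frequency), y = frequency)) +
  geom_bar(stat = "identity") +
  # Only the label is mapped from the data; colour, fontface, size and hjust
  # are constants, so they are set outside aes().
  geom_text(aes(label = frequency),
            colour = "red", fontface = "bold", size = 3, hjust = -0.2) +
  # After coord_flip() the bars run horizontally, so a negative hjust pushes
  # each label just past the end of its bar.
  coord_flip() +
  xlab("") +
  ylab("Frequency")

print(p)
```

Whether a smaller `size` alone makes the labels readable inside the report probably also depends on how many levels each panel has to fit.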

By the way, is there any way to use ColorBrewer palettes to make the plots more appealing? In addition, create_report chooses between bar plots and histograms after detecting the type of each variable. When I use str(), binary variables with values 0 and 1 are reported as numeric, but create_report treats them as discrete. How can I determine and set the numerical and categorical variables automatically, without explicitly listing each variable?

Thank you,
Mehmet

@boxuancui
Owner

Let me know if I missed anything:

  1. To group levels, you can use group_category(). Do that first, then pass the result to create_report().
  2. Is this the line you added? aes(label = frequency, color = "red", angle = 90, fontface = "bold", vjust = -0.5)). Here is the code for bar charts in the report. If you overwrite plot_bar, it should just work. Could you make sure it is named plot_bar? Maybe try temporarily assigning it globally, i.e., with <<-?
  3. Adding colors is difficult at the moment, but possible. See "Group-wise color in scatterplot" (#113) for more details.
  4. I fixed some bugs in the latest develop version. Could you update and see if the problem still exists? You might have to manually set the value for binary_as_factor.
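
Point 1 could look roughly like the sketch below, assuming DataExplorer and data.table are installed. The toy table, its column names, and the 0.2 threshold are made up for illustration. With update = TRUE, group_category() modifies a data.table in place. The create_report() call is left commented out because it renders a full HTML report.

```r
library(data.table)
library(DataExplorer)

# Toy data (made up): two discrete columns with a few rare levels, one numeric column.
set.seed(1)
dt <- data.table(
  city = rep(c("A", "B", "C", "D", "E"), times = c(50, 35, 10, 3, 2)),
  plan = rep(c("basic", "plus", "pro", "trial"), times = c(60, 30, 6, 4)),
  spend = runif(100)
)

# Discrete columns are taken here to be the character or factor columns.
discrete_cols <- names(dt)[
  vapply(dt, function(x) is.character(x) || is.factor(x), logical(1))
]

# Lump the rare levels of each discrete column into a single "OTHER" level.
for (col in discrete_cols) {
  group_category(dt, feature = col, threshold = 0.2, update = TRUE)
}

# Frequencies after grouping.
print(dt[, .N, by = city])

# Then build the report from the grouped data:
# create_report(dt)
```

Looping over the discrete columns avoids naming each of the roughly 80 variables by hand; a single threshold for every column is a simplification.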

@myildir3
Author

The first bullet makes sense.

I added that code to the source code, as in the second bullet, but I am confused about where or how to overwrite plot_bar. Should I replace the code chunk I wrote with what you have suggested?

The split_columns function works fine, except that if a continuous variable contains NULL, the function treats it as a discrete variable.

@boxuancui
Owner

If you just run the new function, it should replace the old one. You can also try setting it globally to verify, e.g., plot_bar <<- function(...) {...}.
I will look into the split_columns issue.
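
One way to check which plot_bar is actually being called is sketched below, assuming DataExplorer is installed. This is a general R technique rather than something the package documents: utils::assignInNamespace() swaps the function inside the package namespace for the current session only. The names `original_plot_bar` and `my_plot_bar` are placeholders, and the wrapper only prints a message before delegating to the original.

```r
library(DataExplorer)

# Keep a handle on the packaged function so the wrapper can delegate to it.
original_plot_bar <- DataExplorer::plot_bar

# Placeholder replacement: announces itself, then behaves like the original.
my_plot_bar <- function(...) {
  message("custom plot_bar called")
  original_plot_bar(...)
}

# Replace plot_bar inside the package namespace for this session only.
# Note: a modified copy of the package source (rather than a wrapper) calls
# unexported helpers, so it would also need
#   environment(fn) <- asNamespace("DataExplorer")
utils::assignInNamespace("plot_bar", my_plot_bar, ns = "DataExplorer")

# Calls resolved through the namespace now reach the wrapper.
DataExplorer::plot_bar(data.frame(g = c("x", "x", "y")))
```

If the message also appears while create_report() runs, the report is picking up the replacement. Restarting the R session restores the packaged function.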

FYI, I am traveling at the moment, so I might be slow to respond.

@boxuancui boxuancui added the N/A: nothing to do here Nothing to do here label Jan 2, 2020