not work on titanic dataset #13

Closed
edvardoss opened this issue Jun 10, 2019 · 7 comments
Labels
bug

Comments

@edvardoss edvardoss commented Jun 10, 2019

I was impressed by your blog post about model interpretation and tried to test this package on the popular “titanic” dataset, but all my attempts failed.

install.packages("titanic") # only data in package
data("titanic_train",package="titanic")
library(tidyverse)
str(titanic_train)

d <- titanic_train %>% as_tibble %>%
  mutate(title=str_replace_all(string = Name, # extract title as general feature
                               pattern = "^[[:alpha:][:space:]'-]+,\\s+(the\\s)?(\\w+)\\..+",
                               replacement = "\\2")) %>%
  mutate(title=str_trim(title),
         title=case_when(title %in% c('Mlle','Ms')~'Miss', # normalize some titles
                         title=='Mme'~ 'Mrs',
                         title %in% c('Capt','Don','Major','Sir','Jonkheer', 'Col')~'Sir',
                         title %in% c('Dona', 'Lady', 'Countess')~'Lady',
                         TRUE~title)) %>%
  mutate(title=as_factor(title),
         Survived=factor(Survived,levels = c(0,1),labels=c("no","yes")),
         Sex=as_factor(Sex),
         Pclass=factor(Pclass,ordered = T)) %>%
  group_by(title) %>% # impute Age by median in current title
  mutate(Age=replace_na(Age,replace = median(Age,na.rm = T))) %>% ungroup
table(d$title,d$Sex) # look at the title distribution
caret::nearZeroVar(x = d,saveMetrics = TRUE) # search for uninformative features to drop (PassengerId, Name, Ticket)
d <- d %>% select_at(vars(-c(PassengerId,Name,Ticket)))
d %>% summarise_all(~sum(is.na(.))) # control NAs

library(ranger)
m <- ranger(formula = Survived~.,data = d,mtry = 6,min.node.size = 5, num.trees = 600,
            importance = "permutation")

library(easyalluvial)
imp <- importance(m) %>% as.data.frame %>% tidy_imp(imp = .,df=d)
alluvial_wide(data = select(d,Survived,title,Pclass,Sex,Fare),fill_by = "first_variable") # OK, this works, but I want to describe the model, not the data

gds <- get_data_space(df = d,imp,degree = 4) # Error in Summary.factor(c(1L, 2L, 3L, 2L, 1L, 1L, 1L, 4L, 2L, 2L, 3L,  : ‘max’ not meaningful for factors

# OK, don't give up; try caret instead
library(caret)
trc <- trainControl(method = "none")
m <- train(Survived~.,data = d,method="rf",trControl=trc,importance=T)
alluvial_model_response_caret(train = m,degree = 4,bins=5,stratum_label_size = 2.8) # Error in tidy_imp(imp, df) : not all listed important variables found in input data


@erblast erblast added the bug label Jun 10, 2019
Owner

@erblast erblast commented Jun 10, 2019

Thanks for reporting. I did not think to test with an all-factor dataset. I will fix this as soon as possible.
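
For context, the reported error can be reproduced outside the package. The following is a minimal base R sketch, not easyalluvial's actual code; the toy vectors `title` and `pclass` and the data frame `toy` are made up for illustration. It shows that `max()` is undefined for unordered factors, is defined for ordered ones, and that restricting a summary to numeric columns is one way to avoid the error.

```r
# Unordered factor: max() has no meaning, so R raises an error.
title <- factor(c("Mr", "Mrs", "Miss", "Mrs", "Mr"))
err_msg <- tryCatch(
  max(title),
  error = function(e) conditionMessage(e)
)
print(err_msg)  # message contains "not meaningful for factors"

# Ordered factor: the levels have a defined order, so max() works.
pclass <- factor(c(3, 1, 2), ordered = TRUE)
print(max(pclass))

# One possible guard: only summarise the numeric columns.
toy <- data.frame(title = title, fare = c(7.25, 71.28, 7.92, 53.10, 8.05))
is_num <- vapply(toy, is.numeric, logical(1))
num_max <- vapply(toy[is_num], max, numeric(1))
print(num_max)
```

In the dataset above, `title` and `Sex` are unordered factors while `Pclass` is ordered, so only the unordered columns would be expected to trigger this kind of error.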

erblast added a commit that referenced this issue Jun 13, 2019
Author

@edvardoss edvardoss commented Jun 21, 2019

get_data_space now works, thank you!
But the next step does not.

library(ranger)
m <- ranger(formula = Survived~.,data = d,mtry = 6,min.node.size = 5, num.trees = 600,
            importance = "permutation")
library(easyalluvial)
imp <- importance(m) %>% as.data.frame %>% tidy_imp(imp = .,df=d)

dspace <- get_data_space(df = d,imp,degree = 4) # works now!
pred = predict(m, data = dspace)
p = alluvial_model_response(pred, dspace, imp, degree = 4) # Error in alluvial_model_response: "pred" needs to be a numeric or a factor vector

Owner

@erblast erblast commented Jun 24, 2019

Fixing some issues that arise when the training data contains character and factor columns (eb74c37).

Owner

@erblast erblast commented Jun 24, 2019

Hi, sorry, I am not checking back on this as frequently as I would like to. The problem is with predict in the ranger package: it does not return a plain vector of predictions but a list-like object that needs to be indexed to get to the predictions.

Try:
p = alluvial_model_response(pred = pred$predictions, dspace = gds, imp = imp, degree = 4)

This works for me. Could you install the most recent development version and tell me if it works for you now, including the caret bit?

Thanks for reporting this, it uncovered a few issues when using factors that I should have anticipated. I have added your example as a new test case. It will go to CRAN in the next two weeks hopefully.
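
The indexing step described above can be sketched in base R. This is only an illustration under stated assumptions: `extract_predictions()` is a hypothetical helper that exists in neither ranger nor easyalluvial, and `mock_pred` is a hand-built stand-in for the list-like object that ranger's `predict()` returns, so the sketch runs without a fitted model.

```r
# Hypothetical helper (not part of ranger or easyalluvial): return a plain
# vector whether `pred` is already a vector/factor or a list-like prediction
# object that stores its values in `$predictions`.
extract_predictions <- function(pred) {
  if (is.list(pred) && !is.null(pred$predictions)) {
    return(pred$predictions)
  }
  pred
}

# Stand-in for a ranger prediction object: a list with a `predictions`
# element. A real object carries more fields; only this one matters here.
mock_pred <- structure(
  list(predictions = factor(c("no", "yes", "no"))),
  class = "ranger.prediction"
)

p_list <- extract_predictions(mock_pred)          # factor taken from the list
p_vec  <- extract_predictions(c(0.1, 0.9, 0.4))   # numeric vector passes through

print(is.factor(p_list))
print(is.numeric(p_vec))
```

With a real model, `extract_predictions(predict(m, data = dspace))` would be expected to give the same vector as `pred$predictions` in the call above.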

Author

@edvardoss edvardoss commented Jun 25, 2019

Hi!
Yes, I have installed the latest dev version.
Sorry about ranger::predict; I had not properly checked the type of the returned object. Thank you for your answer, it works well!
But caret still generates an error for me:

# OK, don't give up; try caret instead
devtools::install_local(path = "C:\\Users\\AnanevHA\\Downloads\\easyalluvial-master",force = TRUE)
library(caret)
trc <- trainControl(method = "none")
m <- train(Survived~.,data = d,method="rf",trControl=trc,importance=T)
library(easyalluvial)
alluvial_model_response_caret(train = m,degree = 4,bins=5,stratum_label_size = 2.8) # Error in tidy_imp(imp, df) : not all listed important variables found in input data

Owner

@erblast erblast commented Jun 25, 2019

Could you make sure that you have the latest dev version installed?
devtools::install_github('https://github.com/erblast/easyalluvial.git')

When you execute easyalluvial::tidy_imp you see the function source code. You should find the following lines:

 # correct dummyvariable names back to original name

  df_ori_var = tibble( ori_var = names( select_if(df, ~ is.factor(.) | is.character(.) ) ) ) %>%

Note that the | is.character(.) was added. This should resolve the error you were getting.
Let me know how it goes.
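
To illustrate why the extra `is.character(.)` check matters, here is a minimal base R sketch. It is not the package's implementation: `map_to_original()` is a hypothetical helper, the data frame `toy` is made up, and the names in `imp_names` assume the usual convention of variable name plus level for dummy variables created by a formula interface.

```r
toy <- data.frame(
  Sex   = factor(c("male", "female")),
  title = c("Mr", "Mrs"),          # character column, not a factor
  Fare  = c(7.25, 71.28),
  stringsAsFactors = FALSE
)

# Importance names as a dummy-encoding model might report them (assumed).
imp_names <- c("Sexmale", "titleMrs", "Fare")

# Hypothetical helper: map a dummy name back to the longest categorical
# column name it starts with; leave it unchanged if nothing matches.
map_to_original <- function(nm, cat_vars) {
  hit <- cat_vars[startsWith(nm, cat_vars)]
  if (length(hit) == 0) return(nm)
  hit[which.max(nchar(hit))]
}

# Factor columns only: "titleMrs" cannot be mapped back.
fac_only <- names(toy)[vapply(toy, is.factor, logical(1))]
ori_old <- vapply(imp_names, map_to_original, character(1),
                  cat_vars = fac_only, USE.NAMES = FALSE)
print(all(ori_old %in% names(toy)))  # FALSE

# Factor or character columns: every name maps to an input column.
cat_vars <- names(toy)[vapply(toy, function(x) is.factor(x) || is.character(x),
                              logical(1))]
ori_new <- vapply(imp_names, map_to_original, character(1),
                  cat_vars = cat_vars, USE.NAMES = FALSE)
print(ori_new)                       # "Sex" "title" "Fare"
print(all(ori_new %in% names(toy)))  # TRUE
```

When a mapped name is missing from the input data, a check like the one behind "not all listed important variables found in input data" would be expected to fail, which matches the error reported above.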

Author

@edvardoss edvardoss commented Jun 26, 2019

Hi!
Everything is working, thank you!

@edvardoss edvardoss closed this Jun 26, 2019