Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import directory #149

Closed
wants to merge 6 commits into from
Closed

Import directory #149

wants to merge 6 commits into from

Conversation

ruaridhw
Copy link
Contributor

Hi Thomas,

Huge fan of the package! Thought I would add a function I've been using that is very similar to an idea that was suggested in #141. Obviously you've added import_list since that issue but this may be of some use if not absorbed into import_list itself as a special case.

Use Case:
Provides a simple wrapper over import_list for the case where a directory of files is to be loaded. Handles the expansion of files in the path and merging the list of data.frames.

return(data.table::rbindlist(data))
},
error = function(e){
warning("Error is likely due to the presence of integer64 columns which cannot be coerced in the data.table's merge. To fix this pass the argument integer64=\"character\" to import_directory")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very special case where data.table throws an error when using rbind with integer64 columns as per their discussion. Arguably this trycatch block could be removed and the function allowed to fail to avoid returning either a data.frame or a list

@leeper
Copy link
Contributor

leeper commented Apr 12, 2017

A couple of reactions:

  1. Tests appear to have failed.
  2. I'm not sure what the advantage of a new function is over a statement like: import_list(dir(pattern = "..."))
  3. I like the core idea of offering an rbind-ing procedure. Maybe this could just be added as an option to import_list() instead, like:
import_list(c("file1.csv", "file2.csv"), rbind = TRUE)

which would call rbindlist().

@ruaridhw
Copy link
Contributor Author

Apologies, I've corrected the tests.

I agree that it mightn't warrant an entirely new function. It would be easy enough to add the two features (default behaviour to pattern and rbindlist) into import_list.

For default behaviour I'm referring to choosing a pattern based on the most common file extension for that directory. For example a scenario where a folder contains a README.txt or similar that should be ignored among 10 CSV files.

@ruaridhw
Copy link
Contributor Author

Another idea is a handler for within import_list to allow for

  1. One or more of the datasets in the list to be corrupted. You would have the option of returning a list that warns of any errors but continues with the rest of the files without breaking
  2. Adding an additional column in each data.frame to identify to which file the data.frame belongs

See ruaridhw/rio@58c190c and if it seems useful I can submit a separate PR

@leeper leeper closed this in b276a1f Apr 19, 2017
@leeper
Copy link
Contributor

leeper commented Apr 19, 2017

Just a note that I implemented an rbind argument in import_list() but decided against merging this PR. I added an example showing the use of dir() to grab a pattern of file names. Thank you for the idea!

@ruaridhw
Copy link
Contributor Author

Sounds great!

Regarding my comment above, would either of those features suit import_list? Use case is when reading many large files I wish to treat the one or two that are corrupt / contain "bad" data by exception without causing the entire import_list to fail. I also find it useful to attach the name of the source file to each member of the list if there is no other defining feature in the data. The two features are mutually exclusive of course...

Couple of illustrations:

cat("a,b,c","1,1,1","2,2,2",file = "good.csv", sep = "\n")
cat("a,b,c","333","4,4,4",file = "bad.csv", sep = "\n")

import_list(c("good.csv","bad.csv"))
## Error in fread(input = "bad.csv" ...

import_list_new(c("good.csv","bad.csv"), allow_failure = TRUE)
## [[1]]
##   a b c
## 1 1 1 1
## 2 2 2 2
##
## [[2]]
## NULL
##
## Warning message:
## In value[[3L]](cond) :
## Expected sep (',') but new line ...
import_list_new(c("good.csv","bad.csv"), allow_failure = TRUE, add_source = TRUE)
## [[1]]
##   a b c Source.File
## 1 1 1 1 good.csv
## 2 2 2 2 good.csv
##
## [[2]]
## NULL

@ruaridhw ruaridhw deleted the import_dir branch April 19, 2017 12:09
@leeper
Copy link
Contributor

leeper commented Apr 19, 2017

I like add_source (maybe call the column _file); I've also currently hard coded fill = TRUE which it might be good to make optional via a new argument.

I'm not sure about making decisions about merging - that seems like more than what rio is supposed to do.

@ruaridhw
Copy link
Contributor Author

Which merging decision? I'm imagining a try-catch in case one of the imports fails in the lapply. You would be notified if (and which) files fail but the idea would be to not override the rbind argument - continue merging the "good" files or returning a NULL df in the list

@leeper
Copy link
Contributor

leeper commented Apr 19, 2017

@ruaridhw Ah, I misunderstood. Yes, I try catch on the call to import() is probably a good idea, with a warning and NULL return value for that list entry.

leeper added a commit that referenced this pull request Apr 19, 2017
@leeper
Copy link
Contributor

leeper commented Apr 19, 2017

I've implemented this in 433a4bc

@ruaridhw
Copy link
Contributor Author

Love it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants