Import directory #149

ruaridhw · 2017-04-12T11:16:57Z

Hi Thomas,

Huge fan of the package! Thought I would add a function I've been using that is very similar to an idea that was suggested in #141. Obviously you've added import_list since that issue but this may be of some use if not absorbed into import_list itself as a special case.

Use Case:
Provides a simple wrapper over import_list for the case where a directory of files is to be loaded. Handles the expansion of files in the path and merging the list of data.frames.

ruaridhw · 2017-04-12T11:21:41Z

R/import_directory.R

+      return(data.table::rbindlist(data))
+    },
+    error = function(e){
+      warning("Error is likely due to the presence of integer64 columns which cannot be coerced in the data.table's merge. To fix this pass the argument integer64=\"character\" to import_directory")


This is a very special case where data.table throws an error when using rbind with integer64 columns, as per their discussion. Arguably this tryCatch block could be removed and the function allowed to fail, so that it never returns a data.frame in one case and a list in another.

leeper · 2017-04-12T13:19:16Z

A couple of reactions:

Tests appear to have failed.
I'm not sure what the advantage of a new function is over a statement like: import_list(dir(pattern = "..."))
I like the core idea of offering an rbind-ing procedure. Maybe this could just be added as an option to import_list() instead, like:

import_list(c("file1.csv", "file2.csv"), rbind = TRUE)

which would call rbindlist().
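To illustrate the idea of an optional row-binding step, here is a minimal sketch in base R. It is not rio's implementation: `read_many` and its `bind` argument are hypothetical names, `utils::read.csv` stands in for `import()`, and `base::rbind` stands in for `data.table::rbindlist()`, so it assumes every file has the same columns.

```r
# Hypothetical helper, not part of rio: read several CSV files and
# optionally stack them into a single data.frame.
read_many <- function(files, bind = FALSE) {
  data <- lapply(files, utils::read.csv, stringsAsFactors = FALSE)
  names(data) <- basename(files)
  if (bind) {
    # Base-R stand-in for data.table::rbindlist(); requires matching columns.
    data <- do.call(base::rbind, unname(data))
  }
  data
}

# Usage with two small temporary files.
tmp <- tempfile("read_many_")
dir.create(tmp)
f1 <- file.path(tmp, "file1.csv")
f2 <- file.path(tmp, "file2.csv")
cat("a,b", "1,2", file = f1, sep = "\n")
cat("a,b", "3,4", file = f2, sep = "\n")

as_list  <- read_many(c(f1, f2))               # list of two data.frames
combined <- read_many(c(f1, f2), bind = TRUE)  # one data.frame with both rows
```

Keeping the binding behind a flag means the default return type stays a list, and callers opt in to a single data.frame.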

ruaridhw · 2017-04-12T14:19:54Z

Apologies, I've corrected the tests.

I agree that it mightn't warrant an entirely new function. It would be easy enough to add the two features (default behaviour to pattern and rbindlist) into import_list.

For default behaviour I'm referring to choosing a pattern based on the most common file extension for that directory. For example a scenario where a folder contains a README.txt or similar that should be ignored among 10 CSV files.
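A rough sketch of that default, assuming "most common extension" simply means the extension with the highest file count in the directory. The helper name `files_by_common_ext` is hypothetical and only base R and `tools::file_ext()` are used.

```r
# Hypothetical helper: return the files in `path` that share the most
# common file extension, ignoring sub-directories and extension-less files.
files_by_common_ext <- function(path = ".") {
  files <- list.files(path, full.names = TRUE)
  files <- files[!dir.exists(files)]
  ext <- tolower(tools::file_ext(files))
  counts <- table(ext[nzchar(ext)])
  if (length(counts) == 0L) {
    return(character(0))
  }
  # On a tie, which.max() picks the first extension in alphabetical order.
  common <- names(counts)[which.max(counts)]
  files[ext == common]
}

# Usage: a folder with one README.txt among three CSV files.
folder <- tempfile("common_ext_")
dir.create(folder)
cat("notes", file = file.path(folder, "README.txt"))
for (i in 1:3) {
  cat("a,b", "1,2", file = file.path(folder, sprintf("data%d.csv", i)), sep = "\n")
}
csv_files <- files_by_common_ext(folder)
```

The resulting vector could then be passed to import_list() in place of a hand-written dir(pattern = "...") call.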

ruaridhw · 2017-04-12T14:27:13Z

Another idea is a handler within import_list to allow for:

One or more of the datasets in the list to be corrupted. You would have the option of returning a list that warns of any errors but continues with the rest of the files without breaking
Adding an additional column in each data.frame to identify to which file the data.frame belongs

See ruaridhw/rio@58c190c and if it seems useful I can submit a separate PR

leeper · 2017-04-19T10:59:10Z

Just a note that I implemented an rbind argument in import_list() but decided against merging this PR. I added an example showing the use of dir() to grab a pattern of file names. Thank you for the idea!

ruaridhw · 2017-04-19T12:08:18Z

Sounds great!

Regarding my comment above, would either of those features suit import_list? The use case is reading many large files, where I want to handle the one or two that are corrupt or contain "bad" data by exception without causing the entire import_list to fail. I also find it useful to attach the name of the source file to each member of the list if there is no other defining feature in the data. The two features are independent of each other, of course...

Couple of illustrations:

cat("a,b,c","1,1,1","2,2,2",file = "good.csv", sep = "\n")
cat("a,b,c","333","4,4,4",file = "bad.csv", sep = "\n")

import_list(c("good.csv","bad.csv"))
## Error in fread(input = "bad.csv" ...

import_list_new(c("good.csv","bad.csv"), allow_failure = TRUE)
## [[1]]
##   a b c
## 1 1 1 1
## 2 2 2 2
##
## [[2]]
## NULL
##
## Warning message:
## In value[[3L]](cond) :
## Expected sep (',') but new line ...

import_list_new(c("good.csv","bad.csv"), allow_failure = TRUE, add_source = TRUE)
## [[1]]
##   a b c Source.File
## 1 1 1 1 good.csv
## 2 2 2 2 good.csv
##
## [[2]]
## NULL

leeper · 2017-04-19T12:17:36Z

I like add_source (maybe call the column _file); I've also currently hard coded fill = TRUE which it might be good to make optional via a new argument.
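For illustration, tagging each imported data.frame with its source file could look roughly like this. The helper name `add_source_column` is hypothetical, the input is assumed to be a list named by file, and the `_file` column name follows the suggestion above. NULL entries (failed imports) are passed through untouched.

```r
# Hypothetical helper: add a column naming the source file to every
# data.frame in a named list; NULL entries are left as they are.
add_source_column <- function(data, column = "_file") {
  stopifnot(!is.null(names(data)))
  Map(function(df, name) {
    if (!is.null(df)) {
      df[[column]] <- rep(name, nrow(df))
    }
    df
  }, data, names(data))
}

# Usage: one successful import and one failed (NULL) entry.
imported <- list(
  "good.csv" = data.frame(a = 1:2, b = 1:2, c = 1:2),
  "bad.csv"  = NULL
)
tagged <- add_source_column(imported)
```

Because `_file` is not a syntactic R name, the column has to be accessed as tagged[["good.csv"]][["_file"]] or with backticks.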

I'm not sure about making decisions about merging - that seems like more than what rio is supposed to do.

ruaridhw · 2017-04-19T12:35:48Z

Which merging decision? I'm imagining a try-catch in case one of the imports fails in the lapply. You would be notified if (and which) files fail but the idea would be to not override the rbind argument - continue merging the "good" files or returning a NULL df in the list

leeper · 2017-04-19T13:11:44Z

@ruaridhw Ah, I misunderstood. Yes, a tryCatch on the call to import() is probably a good idea, with a warning and a NULL return value for that list entry.
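A minimal sketch of that pattern, assuming a per-file reader wrapped in tryCatch(). This is not the code from the commit referenced in this thread: `import_each` is a hypothetical name, and `utils::read.csv` stands in for `import()` so the example runs without rio.

```r
# Hypothetical helper: read each file, converting any error into a
# warning that names the file, and returning NULL for that list entry.
import_each <- function(files, reader = utils::read.csv) {
  out <- lapply(files, function(f) {
    tryCatch(
      reader(f),
      error = function(e) {
        warning(sprintf("Import failed for '%s': %s", f, conditionMessage(e)),
                call. = FALSE)
        NULL
      }
    )
  })
  names(out) <- basename(files)
  out
}

# Usage: one readable file and one path that does not exist.
good <- tempfile(fileext = ".csv")
cat("a,b,c", "1,1,1", "2,2,2", file = good, sep = "\n")
missing_file <- file.path(tempdir(), "does-not-exist.csv")
result <- suppressWarnings(import_each(c(good, missing_file)))
```

The list keeps one slot per input file, so the positions of the NULL entries show which imports failed; drop suppressWarnings() to see the per-file warnings.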

leeper · 2017-04-19T13:52:54Z

I've implemented this in 433a4bc

ruaridhw · 2017-04-19T14:04:19Z

Love it!

Ruaridh Williamson added 5 commits April 12, 2017 20:09

Add import_directory

101ab4b

Remove stringr dependency

29e89a2

Add function argument documentation

2c2fe51

List and Directory tests

780b01e

Remove debugging line

1250782

Correct tests

fea66a6

leeper closed this in b276a1f Apr 19, 2017

ruaridhw deleted the import_dir branch April 19, 2017 12:09

leeper added a commit that referenced this pull request Apr 19, 2017

add rbind functionality in import_list() (#149)

433a4bc

ruaridhw mentioned this pull request Apr 29, 2018

Enhancement: allow folder paths and patterns in import_list #181

Closed
