-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
XLSX Importer #28
XLSX Importer #28
Conversation
This somehow got missed while adding fixture files for other formats.
lib/daru/io/importers/xlsx.rb
Outdated
module Importers | ||
class XLSX < Base | ||
Daru::DataFrame.register_io_module :from_xlsx, self | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the XLSX Importer intended be used directly as Daru::DataFrame.from_xlsx(...)
, or to be redirected from the Excel Importer if filename ends with .xlsx
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Redirect from the excel importer if the file name ends in .xlsx
. from_excel
should be the only method.
spec/fixtures/README.md
Outdated
- `LOBSTAHS_rt.windows.xlsx` : Downloaded from [here](https://github.com/vanmooylipidomics/LOBSTAHS/blob/master/inst/doc/xlsx/LOBSTAHS_rt.windows.xlsx). Contains data about Lipid and Oxylipin Biomarker Screening Through Adduct Hierarchy Sequences (LOBSTASHS). Conatins two sheets called `LOBSTAHS_rt.windows` and `Notes`. | ||
- `Microcode.xlsx` : Downloaded from [here](https://github.com/tkim371/CS2200/blob/master/proj2/project2/Microcode.xlsx). | ||
- `Stock-counts-sheet.xlsx` : Downloaded from [here](https://www.exact.com/uk/images/downloads/getting-started-excel-sheets/Stock-counts-sheet.xlsx). Contains data about stocks. Helps in ensuring that HTML tags of cell attributes are striped off, while constructing the DataFrame. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I somehow completely missed to add xlsx fixture files, during PR #17. Sorry! :-/
lib/daru/io/importers/xlsx.rb
Outdated
book = Roo::Excelx.new(@path) | ||
worksheet = book.sheet(@sheet) | ||
data = worksheet.to_a | ||
data = strip_html_tags(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this necessary? Is there a lot of examples in the wild with HTML tags in cells?..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh damn. Roo converts any formatting to HTML unconditionally :( What could possibly lead any sane developer to this decision 😡
lib/daru/io/importers/xlsx.rb
Outdated
end | ||
end | ||
end | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Questions to the impl:
- what about skipping some cols and rows? I believe I've seen a lot of governmental open data, for ex., with large headers above the data itself;
- a lot of Excel file I've seen also had "index" (e.g. row names) in them, what about it?
- it is high probability of merged cells, especially in index/order, WDYT of it?
- a lot of Excel files have formulae in cells, WDYT of it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll soon add the :skiprows
and :skipcols
options. They seem to be quite necessary for an initial draft.
Regarding the rest of the features, let me open issue #36 to make sure these suggestions don't get missed. 😄
spec/daru/io/importers/xlsx_spec.rb
Outdated
let(:sheet) { 0 } | ||
let(:headers) { true } | ||
|
||
context 'parses default sheet' do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why "default" is 0?..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be logical to parse the 1st sheet by default? The default is set as 0, even for the from_excel
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is logical. What I am concerned about here is by concept description, you are testing "what it would be if there's no sheet param passed" (I am guessing it by "default" word). But in fact, you are testing the sheet: 0
explicitly passed.
spec/daru/io/importers/xlsx_spec.rb
Outdated
WebMock | ||
.stub_request(:get,"http://dummy-remote-url/#{file}.xlsx") | ||
.to_return(status: 200, body: File.read("spec/fixtures/excelx/#{file}.xlsx")) | ||
WebMock.disable_net_connect!(allow: /dummy-remote-url/) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it needed?.. I believe we can just disable_net_connect in spec_helper.rb (or nowhere, it is done by default), without those dances.
spec/daru/io/importers/xlsx_spec.rb
Outdated
its(:ncols) { is_expected.to eq(7) } | ||
its(:nrows) { is_expected.to eq(15) } | ||
its(:index) { is_expected.to eq((0..14).to_a.to_index) } | ||
its('Item code.first') { is_expected.to eq(nil) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What it was in source file? Empty cell? Then let's name the context this way. Like context "when sheet contains empty cells"
.
And the second question: how do you decide that it should be nil
, and not, say, ""
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, on context names. There are several agreements, but generally we follow this structure:
describe Class do
describe '#method_name' do
context 'in this circumstances' do # e.g. "when this and that passed", "on erroneous request", "if requested from network"
end
end
end
E.g., context description should .... well, you know, describe context of this group of tests :)
Updated YARD docs, and added RSpec test for skip options
The filename got changed to an incorrect one in the spec file, before pushing the last commit. Note to self : Run bundle exec rspec before pushing. :-/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rest of the code looks good, but I have some reservations about the requiring and documentation.
Daru::DataFrame.register_io_module :from_excel, self | ||
Daru::DataFrame.register_io_module :from_excel do |*args| | ||
if args.first.end_with? '.xlsx' | ||
require 'daru/io/importers/excelx' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that a require
will be called each time the from_excel
method is called? That's inefficient. You should ideally perform all the requires when the gem is required or when a specific importer/exporter is required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, a require
will be called each time the from_excel
method is called with a .xlsx
file. But I'm not sure why it'd be inefficient - isn't require
by default done only once? And would this make it better?
require 'daru/io/importers/excelx' unless defined?(Daru::IO::Importers::Excelx)
perform all the requires when the gem is required or when a specific importer/exporter is required.
When a user's use-case is only .xls
files, the excelx importer isn't needed to be require
d right? (Or is unconditionally requiring excelx importer when requiring excel importer, a better standard?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's inefficient.
Not very. require
works only once, the next time Ruby's enternal mechanizms will just check whether require'd file is loaded already.
lib/daru/io/importers/excel.rb
Outdated
@@ -45,22 +55,26 @@ class Excel < Base | |||
# # 3 4 Franz nil Paris nil | |||
# # 4 5 George 5.5 Tome a,b,c | |||
# # 5 6 Fernand nil nil nil | |||
def initialize(path, worksheet_id: 0) | |||
def initialize(path, sheet: 0, headers: true) | |||
optional_gem 'spreadsheet', '~> 1.1.1' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to have the optional_gem
method run at the time of the gem require? Or how about "trying" to require all gems at gem require first and then throw warnings only when the specific method is called?
For example, the spreadsheet
gem will be required when the user calls require 'daru/io'
or require 'daro/io/importers/excel'
. This will not throw any warnings. However, when the user calls from_excel
you can throw the error. You check whether a library is present or not with this: https://stackoverflow.com/questions/8424769/how-do-you-check-if-a-library-ruby-gem-has-been-loaded
@zverok thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@athityakumar also, can you add documentation to the optional_gem
method?
https://github.com/athityakumar/daru-io/blob/master/lib/daru/io/base.rb#L7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@v0dro - Sure, will document the #optional_gem
method. Meanwhile, intention of having optional_gem
run at #initialize
is same as that of conditionally requiring excelx
: require only when a particular IO module is used. Sure, optional_gem
can be tweaked to try requiring dependencies when an Importer is required and raise errors (LoadError
, etc.) when the importer is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to have the optional_gem method run at the time of the gem require? Or how about "trying" to require all gems at gem require first and then throw warnings only when the specific method is called?
@v0dro To be honest, I am strictly against it. optional_gem
is not effect-less, it outputs "Hey, we need this and that gem to do what you want". Imagine daru-io v.2.3.4, which somebody wants to use to just import some CSV? It will output two full screens of please install <this irrelevant stuff>
just for nothing.
I believe, calling optional_gem
exactly at the moment when somebody tries to really import/export xlsx
is a good idea.
Side note: In future, there can exist code like
format = controller.params[:format]
dataframe = Daru.from(format, controller.params[:input])
...which could probably fail too late (e.g. works for 1000 users, and fails for 1001st, because format was dbf
, and dependency is absent), but it is another story (about APIs where format
is a variable, and then about APIs of "early check" that we support this and that format, and probably explicitly forbid those and those).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok makes sense to call optional_gem
only when the method is called.
lib/daru/io/importers/excel.rb
Outdated
@@ -45,22 +55,26 @@ class Excel < Base | |||
# # 3 4 Franz nil Paris nil | |||
# # 4 5 George 5.5 Tome a,b,c | |||
# # 5 6 Fernand nil nil nil | |||
def initialize(path, worksheet_id: 0) | |||
def initialize(path, sheet: 0, headers: true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any specific reason for changing to sheet
? It will break backwards compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the excelx (xlsx) importer, name of the keyword argument has been chosen as sheet
, as it could even be a String
and worksheet_id
seems a bit misleading to only Integer
type (Of course, the documentation should take care of it). And to maintain consistency between excel and excelx importer, worksheet_id
was renamed to sheet
.
But this was prior to commit f9a8356, where excel importer redirected to excelx importer and both methods were forced to have same argument names. Now that redirect is done by register_io_module
, this can certainly be reverted back to worksheet_id
.
# 4 nil 1 OUT30045 3 New stock 2014-08-01 51035 | ||
# ... ... ... ... ... ... ... ... | ||
# | ||
# @example Importing from a remote URL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will this documentation be easily accessible to someone who is using the from_excel
method? This implementation of using importer and exporter classes is opaque to the user, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the linkages are done dynamically, YARD won't be able to recognize the linkages. The simplest workaround I can think for this, is to include their respective hyperlinks to the Daru::DataFrame#register_io_module
YARD comments. Like,
module Daru
class DataFrame
class << self
# Links +Daru::DataFrame+ Import / Export methods to corresponding
# +Daru::IO+ Importer / Exporter classes.
#
# - {Daru::IO::Importers::CSV#initialize #from_csv method}
# - {Daru::IO::Importers::Redis#initialize #from_redis method}
# ...
def register_io_module(function, instance)
...
end
end
end
end
I think it's reasonable to expect users to open the DataFrame page to have a look at the Importers & Exporters, and find a list of them as shown below.
Does this suffice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@v0dro I believe that daru-io
should think about it as a complementary yet separate gem.
It is a necessary compromise we've already discussed with Athitya: whether we document exact modules that do the work (Daru::IO::Whatever
) or their generated methods? (DataFrame.from_whatever
)
My answer is the former.
Let's think about somebody who uses daru-io as a foundation to add another format, say, FITS (it is astronomy data).
What they do?
require 'daru-io'
- Create
Daru::IO::FITS
class by a simple straightforward instructions, and document it. - Call
Daru::IO.register_io_module
.
Of course, we can ask them to implement pseudo-docs for from_fits
, YARD allows this... But do we really faithfully want to?..
So, I believe Daru::IO
docs should be saying something like... "Blah, blah, when you include this and that, you receive a ton of Daru::DataFrame.from_this_and_that
, please refer to Daru::IO::ThisAndThatImporter
for docs".
It is one option. Another one is stop polluting DF with from_foo
/to_bar
methods, and define for Daru::IO
API like Daru::IO::FromExcel.new('path').call
/ Daru::IO::ToExcel.new(df).call('path')
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets just stick to referring users to the relevant module in that case. Changing the API would not look good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm willing to accept this, just don't forget my comments ;)
Daru::DataFrame.register_io_module :from_excel, self | ||
Daru::DataFrame.register_io_module :from_excel do |*args| | ||
if args.first.end_with? '.xlsx' | ||
require 'daru/io/importers/excelx' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's inefficient.
Not very. require
works only once, the next time Ruby's enternal mechanizms will just check whether require'd file is loaded already.
lib/daru/io/importers/excel.rb
Outdated
@@ -45,22 +55,26 @@ class Excel < Base | |||
# # 3 4 Franz nil Paris nil | |||
# # 4 5 George 5.5 Tome a,b,c | |||
# # 5 6 Fernand nil nil nil | |||
def initialize(path, worksheet_id: 0) | |||
def initialize(path, sheet: 0, headers: true) | |||
optional_gem 'spreadsheet', '~> 1.1.1' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to have the optional_gem method run at the time of the gem require? Or how about "trying" to require all gems at gem require first and then throw warnings only when the specific method is called?
@v0dro To be honest, I am strictly against it. optional_gem
is not effect-less, it outputs "Hey, we need this and that gem to do what you want". Imagine daru-io v.2.3.4, which somebody wants to use to just import some CSV? It will output two full screens of please install <this irrelevant stuff>
just for nothing.
I believe, calling optional_gem
exactly at the moment when somebody tries to really import/export xlsx
is a good idea.
Side note: In future, there can exist code like
format = controller.params[:format]
dataframe = Daru.from(format, controller.params[:input])
...which could probably fail too late (e.g. works for 1000 users, and fails for 1001st, because format was dbf
, and dependency is absent), but it is another story (about APIs where format
is a variable, and then about APIs of "early check" that we support this and that format, and probably explicitly forbid those and those).
# 4 nil 1 OUT30045 3 New stock 2014-08-01 51035 | ||
# ... ... ... ... ... ... ... ... | ||
# | ||
# @example Importing from a remote URL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@v0dro I believe that daru-io
should think about it as a complementary yet separate gem.
It is a necessary compromise we've already discussed with Athitya: whether we document exact modules that do the work (Daru::IO::Whatever
) or their generated methods? (DataFrame.from_whatever
)
My answer is the former.
Let's think about somebody who uses daru-io as a foundation to add another format, say, FITS (it is astronomy data).
What they do?
require 'daru-io'
- Create
Daru::IO::FITS
class by a simple straightforward instructions, and document it. - Call
Daru::IO.register_io_module
.
Of course, we can ask them to implement pseudo-docs for from_fits
, YARD allows this... But do we really faithfully want to?..
So, I believe Daru::IO
docs should be saying something like... "Blah, blah, when you include this and that, you receive a ton of Daru::DataFrame.from_this_and_that
, please refer to Daru::IO::ThisAndThatImporter
for docs".
It is one option. Another one is stop polluting DF with from_foo
/to_bar
methods, and define for Daru::IO
API like Daru::IO::FromExcel.new('path').call
/ Daru::IO::ToExcel.new(df).call('path')
.
# 3 nil 1 OUT30045 3 New stock 2014-08-01 51035 | ||
# ... ... ... ... ... ... ... ... | ||
def initialize(path, sheet: 0, order: true, index: false, skiprows: 0, skipcols: 0) | ||
optional_gem 'roo', '~> 2.7.0' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, this can be metaprogrammed. Like this:
class Excelx < Base
requires_gem 'roo', '~> 2.7.0' # says that on real creation of object the check should be done
Another thought is you probably should not specify dependencies version so narrow for general-usage gem (which leads to idea of testing with multiple dependencies versions... So we should postpone this, just don't forget)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lib/daru/io/link.rb
Outdated
@@ -1,7 +1,9 @@ | |||
module Daru | |||
class DataFrame | |||
class << self | |||
def register_io_module(function, instance) | |||
def register_io_module(function, instance=nil) | |||
return define_singleton_method(function) { |*args| yield(*args) } if block_given? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With explicit &block
in args, you could've just define_singleton_method(function, &block)
here.
Approved. Merge? |
@v0dro - Sure, merging this in. 😃 |
Tests are yet to be beautified. Those changes are intended to be either here or, in PR #27 - depending on whichever PR is merged later.