Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XLSX Importer #28

Merged
merged 13 commits into from
Aug 4, 2017
Merged

XLSX Importer #28

merged 13 commits into from
Aug 4, 2017

Conversation

athityakumar
Copy link
Member

Tests are yet to be beautified. Those changes are intended to be either here or, in PR #27 - depending on whichever PR is merged later.

module Importers
class XLSX < Base
Daru::DataFrame.register_io_module :from_xlsx, self

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the XLSX Importer intended be used directly as Daru::DataFrame.from_xlsx(...), or to be redirected from the Excel Importer if filename ends with .xlsx?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redirect from the excel importer if the file name ends in .xlsx. from_excel should be the only method.

- `LOBSTAHS_rt.windows.xlsx` : Downloaded from [here](https://github.com/vanmooylipidomics/LOBSTAHS/blob/master/inst/doc/xlsx/LOBSTAHS_rt.windows.xlsx). Contains data about Lipid and Oxylipin Biomarker Screening Through Adduct Hierarchy Sequences (LOBSTASHS). Conatins two sheets called `LOBSTAHS_rt.windows` and `Notes`.
- `Microcode.xlsx` : Downloaded from [here](https://github.com/tkim371/CS2200/blob/master/proj2/project2/Microcode.xlsx).
- `Stock-counts-sheet.xlsx` : Downloaded from [here](https://www.exact.com/uk/images/downloads/getting-started-excel-sheets/Stock-counts-sheet.xlsx). Contains data about stocks. Helps in ensuring that HTML tags of cell attributes are striped off, while constructing the DataFrame.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I somehow completely missed to add xlsx fixture files, during PR #17. Sorry! :-/

book = Roo::Excelx.new(@path)
worksheet = book.sheet(@sheet)
data = worksheet.to_a
data = strip_html_tags(data)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary? Is there a lot of examples in the wild with HTML tags in cells?..

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh damn. Roo converts any formatting to HTML unconditionally :( What could possibly lead any sane developer to this decision 😡

end
end
end
end
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Questions to the impl:

  • what about skipping some cols and rows? I believe I've seen a lot of governmental open data, for ex., with large headers above the data itself;
  • a lot of Excel file I've seen also had "index" (e.g. row names) in them, what about it?
  • it is high probability of merged cells, especially in index/order, WDYT of it?
  • a lot of Excel files have formulae in cells, WDYT of it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll soon add the :skiprows and :skipcols options. They seem to be quite necessary for an initial draft.

Regarding the rest of the features, let me open issue #36 to make sure these suggestions don't get missed. 😄

let(:sheet) { 0 }
let(:headers) { true }

context 'parses default sheet' do
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why "default" is 0?..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be logical to parse the 1st sheet by default? The default is set as 0, even for the from_excel method.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is logical. What I am concerned about here is by concept description, you are testing "what it would be if there's no sheet param passed" (I am guessing it by "default" word). But in fact, you are testing the sheet: 0 explicitly passed.

WebMock
.stub_request(:get,"http://dummy-remote-url/#{file}.xlsx")
.to_return(status: 200, body: File.read("spec/fixtures/excelx/#{file}.xlsx"))
WebMock.disable_net_connect!(allow: /dummy-remote-url/)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it needed?.. I believe we can just disable_net_connect in spec_helper.rb (or nowhere, it is done by default), without those dances.

its(:ncols) { is_expected.to eq(7) }
its(:nrows) { is_expected.to eq(15) }
its(:index) { is_expected.to eq((0..14).to_a.to_index) }
its('Item code.first') { is_expected.to eq(nil) }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What it was in source file? Empty cell? Then let's name the context this way. Like context "when sheet contains empty cells".
And the second question: how do you decide that it should be nil, and not, say, ""?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, on context names. There are several agreements, but generally we follow this structure:

describe Class do
  describe '#method_name' do
    context 'in this circumstances' do # e.g. "when this and that passed", "on erroneous request", "if requested from network"
    end
  end
end

E.g., context description should .... well, you know, describe context of this group of tests :)

Updated YARD docs, and added RSpec test for skip options
The filename got changed to an incorrect one in the spec file, before pushing the last commit. Note to self : Run bundle exec rspec before pushing. :-/
Copy link
Member

@v0dro v0dro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest of the code looks good, but I have some reservations about the requiring and documentation.

Daru::DataFrame.register_io_module :from_excel, self
Daru::DataFrame.register_io_module :from_excel do |*args|
if args.first.end_with? '.xlsx'
require 'daru/io/importers/excelx'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that a require will be called each time the from_excel method is called? That's inefficient. You should ideally perform all the requires when the gem is required or when a specific importer/exporter is required.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a require will be called each time the from_excel method is called with a .xlsx file. But I'm not sure why it'd be inefficient - isn't require by default done only once? And would this make it better?

require 'daru/io/importers/excelx' unless defined?(Daru::IO::Importers::Excelx)

perform all the requires when the gem is required or when a specific importer/exporter is required.

When a user's use-case is only .xls files, the excelx importer isn't needed to be required right? (Or is unconditionally requiring excelx importer when requiring excel importer, a better standard?)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's inefficient.

Not very. require works only once, the next time Ruby's enternal mechanizms will just check whether require'd file is loaded already.

@@ -45,22 +55,26 @@ class Excel < Base
# # 3 4 Franz nil Paris nil
# # 4 5 George 5.5 Tome a,b,c
# # 5 6 Fernand nil nil nil
def initialize(path, worksheet_id: 0)
def initialize(path, sheet: 0, headers: true)
optional_gem 'spreadsheet', '~> 1.1.1'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have the optional_gem method run at the time of the gem require? Or how about "trying" to require all gems at gem require first and then throw warnings only when the specific method is called?

For example, the spreadsheet gem will be required when the user calls require 'daru/io' or require 'daro/io/importers/excel'. This will not throw any warnings. However, when the user calls from_excel you can throw the error. You check whether a library is present or not with this: https://stackoverflow.com/questions/8424769/how-do-you-check-if-a-library-ruby-gem-has-been-loaded

@zverok thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@athityakumar also, can you add documentation to the optional_gem method?

https://github.com/athityakumar/daru-io/blob/master/lib/daru/io/base.rb#L7

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@v0dro - Sure, will document the #optional_gem method. Meanwhile, intention of having optional_gem run at #initialize is same as that of conditionally requiring excelx : require only when a particular IO module is used. Sure, optional_gem can be tweaked to try requiring dependencies when an Importer is required and raise errors (LoadError, etc.) when the importer is called.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have the optional_gem method run at the time of the gem require? Or how about "trying" to require all gems at gem require first and then throw warnings only when the specific method is called?

@v0dro To be honest, I am strictly against it. optional_gem is not effect-less, it outputs "Hey, we need this and that gem to do what you want". Imagine daru-io v.2.3.4, which somebody wants to use to just import some CSV? It will output two full screens of please install <this irrelevant stuff> just for nothing.

I believe, calling optional_gem exactly at the moment when somebody tries to really import/export xlsx is a good idea.

Side note: In future, there can exist code like

format = controller.params[:format]
dataframe = Daru.from(format, controller.params[:input])

...which could probably fail too late (e.g. works for 1000 users, and fails for 1001st, because format was dbf, and dependency is absent), but it is another story (about APIs where format is a variable, and then about APIs of "early check" that we support this and that format, and probably explicitly forbid those and those).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok makes sense to call optional_gem only when the method is called.

@@ -45,22 +55,26 @@ class Excel < Base
# # 3 4 Franz nil Paris nil
# # 4 5 George 5.5 Tome a,b,c
# # 5 6 Fernand nil nil nil
def initialize(path, worksheet_id: 0)
def initialize(path, sheet: 0, headers: true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any specific reason for changing to sheet? It will break backwards compatibility.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the excelx (xlsx) importer, name of the keyword argument has been chosen as sheet, as it could even be a String and worksheet_id seems a bit misleading to only Integer type (Of course, the documentation should take care of it). And to maintain consistency between excel and excelx importer, worksheet_id was renamed to sheet.

But this was prior to commit f9a8356, where excel importer redirected to excelx importer and both methods were forced to have same argument names. Now that redirect is done by register_io_module, this can certainly be reverted back to worksheet_id.

# 4 nil 1 OUT30045 3 New stock 2014-08-01 51035
# ... ... ... ... ... ... ... ...
#
# @example Importing from a remote URL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will this documentation be easily accessible to someone who is using the from_excel method? This implementation of using importer and exporter classes is opaque to the user, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the linkages are done dynamically, YARD won't be able to recognize the linkages. The simplest workaround I can think for this, is to include their respective hyperlinks to the Daru::DataFrame#register_io_module YARD comments. Like,

module Daru
  class DataFrame  
    class << self
      # Links +Daru::DataFrame+ Import / Export methods to corresponding 
      # +Daru::IO+ Importer / Exporter classes.
      #
      # - {Daru::IO::Importers::CSV#initialize #from_csv method}
      # - {Daru::IO::Importers::Redis#initialize #from_redis method}
      # ...
      def register_io_module(function, instance)
          ...
      end
    end
  end
end

I think it's reasonable to expect users to open the DataFrame page to have a look at the Importers & Exporters, and find a list of them as shown below.
image
Does this suffice?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@v0dro I believe that daru-io should think about it as a complementary yet separate gem.
It is a necessary compromise we've already discussed with Athitya: whether we document exact modules that do the work (Daru::IO::Whatever) or their generated methods? (DataFrame.from_whatever)
My answer is the former.
Let's think about somebody who uses daru-io as a foundation to add another format, say, FITS (it is astronomy data).
What they do?

  1. require 'daru-io'
  2. Create Daru::IO::FITS class by a simple straightforward instructions, and document it.
  3. Call Daru::IO.register_io_module.

Of course, we can ask them to implement pseudo-docs for from_fits, YARD allows this... But do we really faithfully want to?..

So, I believe Daru::IO docs should be saying something like... "Blah, blah, when you include this and that, you receive a ton of Daru::DataFrame.from_this_and_that, please refer to Daru::IO::ThisAndThatImporter for docs".

It is one option. Another one is stop polluting DF with from_foo/to_bar methods, and define for Daru::IO API like Daru::IO::FromExcel.new('path').call / Daru::IO::ToExcel.new(df).call('path').

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets just stick to referring users to the relevant module in that case. Changing the API would not look good.

Copy link
Collaborator

@zverok zverok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm willing to accept this, just don't forget my comments ;)

Daru::DataFrame.register_io_module :from_excel, self
Daru::DataFrame.register_io_module :from_excel do |*args|
if args.first.end_with? '.xlsx'
require 'daru/io/importers/excelx'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's inefficient.

Not very. require works only once, the next time Ruby's enternal mechanizms will just check whether require'd file is loaded already.

@@ -45,22 +55,26 @@ class Excel < Base
# # 3 4 Franz nil Paris nil
# # 4 5 George 5.5 Tome a,b,c
# # 5 6 Fernand nil nil nil
def initialize(path, worksheet_id: 0)
def initialize(path, sheet: 0, headers: true)
optional_gem 'spreadsheet', '~> 1.1.1'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have the optional_gem method run at the time of the gem require? Or how about "trying" to require all gems at gem require first and then throw warnings only when the specific method is called?

@v0dro To be honest, I am strictly against it. optional_gem is not effect-less, it outputs "Hey, we need this and that gem to do what you want". Imagine daru-io v.2.3.4, which somebody wants to use to just import some CSV? It will output two full screens of please install <this irrelevant stuff> just for nothing.

I believe, calling optional_gem exactly at the moment when somebody tries to really import/export xlsx is a good idea.

Side note: In future, there can exist code like

format = controller.params[:format]
dataframe = Daru.from(format, controller.params[:input])

...which could probably fail too late (e.g. works for 1000 users, and fails for 1001st, because format was dbf, and dependency is absent), but it is another story (about APIs where format is a variable, and then about APIs of "early check" that we support this and that format, and probably explicitly forbid those and those).

# 4 nil 1 OUT30045 3 New stock 2014-08-01 51035
# ... ... ... ... ... ... ... ...
#
# @example Importing from a remote URL
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@v0dro I believe that daru-io should think about it as a complementary yet separate gem.
It is a necessary compromise we've already discussed with Athitya: whether we document exact modules that do the work (Daru::IO::Whatever) or their generated methods? (DataFrame.from_whatever)
My answer is the former.
Let's think about somebody who uses daru-io as a foundation to add another format, say, FITS (it is astronomy data).
What they do?

  1. require 'daru-io'
  2. Create Daru::IO::FITS class by a simple straightforward instructions, and document it.
  3. Call Daru::IO.register_io_module.

Of course, we can ask them to implement pseudo-docs for from_fits, YARD allows this... But do we really faithfully want to?..

So, I believe Daru::IO docs should be saying something like... "Blah, blah, when you include this and that, you receive a ton of Daru::DataFrame.from_this_and_that, please refer to Daru::IO::ThisAndThatImporter for docs".

It is one option. Another one is stop polluting DF with from_foo/to_bar methods, and define for Daru::IO API like Daru::IO::FromExcel.new('path').call / Daru::IO::ToExcel.new(df).call('path').

# 3 nil 1 OUT30045 3 New stock 2014-08-01 51035
# ... ... ... ... ... ... ... ...
def initialize(path, sheet: 0, order: true, index: false, skiprows: 0, skipcols: 0)
optional_gem 'roo', '~> 2.7.0'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, this can be metaprogrammed. Like this:

class Excelx < Base
  requires_gem 'roo', '~> 2.7.0' # says that on real creation of object the check should be done

Another thought is you probably should not specify dependencies version so narrow for general-usage gem (which leads to idea of testing with multiple dependencies versions... So we should postpone this, just don't forget)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. Keeping #25 open and also opened #41 for future references. 👍

@@ -1,7 +1,9 @@
module Daru
class DataFrame
class << self
def register_io_module(function, instance)
def register_io_module(function, instance=nil)
return define_singleton_method(function) { |*args| yield(*args) } if block_given?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With explicit &block in args, you could've just define_singleton_method(function, &block) here.

@v0dro
Copy link
Member

v0dro commented Aug 4, 2017

Approved. Merge?

@athityakumar
Copy link
Member Author

@v0dro - Sure, merging this in. 😃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants