More approachable names #19

Closed
eugene-eeo opened this issue Apr 12, 2015 · 26 comments
@eugene-eeo
Contributor

I think @msiemens has raised an important issue about the terminology used in our project. We should always strive to make it more "intuitive", so that developers can start delving into the source code without having to look up the technical terminology first. Some suggestions:

  • quantifiers -> metrics
  • pruners -> selectors/mappers

As for maximisers I don't know of a better name, but it sounds like a mouthful ;)

cc @libextract/owners @libextract/contrib

@rodricios
Contributor

I'm on board with metrics instead of quantifiers, but I'm personally more convinced that "pruning" fits what's going on in that submodule. Here's a snippet from the wiki:

A top down pruning will traverse nodes and trim subtrees starting at the root [..]

And from google's auto definition:

Pruning is a horticultural and silvicultural practice involving the selective removal of parts of a plant, such as branches, buds, or roots

And ditto on the "mouthful" sentiment lol.

@eugene-eeo
Contributor Author

What about renaming @pruner to @prunes or @selects?

@rodricios
Contributor

I think @selects is appropriate in that case. We provide the user with the selects decorator for custom pruners, or they can also use our builtin pruners?
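For the sake of discussion, here is a minimal sketch of what such a @selects decorator could look like, assuming selectors are generators over an lxml-style etree. The xpath argument, the selector attribute, and text_length are all illustrative names, not the actual libextract API:

```python
import functools

def selects(xpath):
    """Mark a function as a selector applied to each node matching *xpath*.

    (Hypothetical sketch: real pruners may carry more state.)
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(etree):
            # Delegate node matching to the etree's own xpath method,
            # then apply the wrapped per-node function.
            for node in etree.xpath(xpath):
                yield func(node)
        wrapper.selector = True  # flag so the pipeline can recognize it
        return wrapper
    return decorator

@selects('//p')
def text_length(node):
    # Yields (node, length-of-text) pairs for every <p> element.
    return node, len(node.text_content())
```

A user-defined pruner would then be just a decorated per-node function, with the traversal handled uniformly by the decorator.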

@eugene-eeo
Contributor Author

Builtin ones can be used as well, but my current concern is that the pruners don't have a uniform type signature. For example, prune_by_node_text_length returns the node's parent and the node's text length. So it should be per-strategy, i.e. `libextract.html.article.get_parent_textlen_pairs`.

Update: Also, what's basket()?

@rodricios
Contributor

I took basket out, forgot I left it there :P

prune_by_node_text_length returns the node's parent and the node text length

A node's parent or a node shouldn't be much of a problem since they're both of type lxml.html.HtmlElement

I agree that the second item yielded should stick to a single type instead of numerical or Counter
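One way to stick to a single type would be a small normalizing helper that collapses whatever a pruner measures into a plain float, so every pruner can yield (node, float) pairs. This is a sketch; as_score is an illustrative name, not part of libextract:

```python
from collections import Counter

def as_score(measure):
    """Collapse a numeric or Counter measure into one float.

    (Hypothetical helper: normalizes the second item yielded by a
    pruner so downstream code always sees the same type.)
    """
    if isinstance(measure, Counter):
        # A Counter of, say, tag frequencies reduces to its total count.
        return float(sum(measure.values()))
    return float(measure)
```

With this, a strategy could wrap each pruner's output as (node, as_score(measure)) and keep a uniform signature across the submodule.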

@eugene-eeo
Contributor Author

With that being said, I think we may be approaching a zone of overly generic code. I think that domain-specific functions should remain domain-specific. If the user wants to create a function that traverses the tree, they should be required to write their own traversal function. For example, some traversals are more efficient if they share state, e.g.:

def get_pairs(etree):
    # SELECTOR and func are placeholders for a strategy's
    # XPath expression and its per-node metric.
    cache = {}
    for node in etree.xpath(SELECTOR):
        id = node.get('id')
        if id not in cache:
            cache[id] = func(node)
        yield cache[id]

@rodricios
Contributor

If the user wants to create a function that traverses the tree [..]

I'm under the impression that tree traversal is kind of a must in this domain. Couldn't we apply a similar mechanism, like an internal cache variable, if the user would like to share state?

@eugene-eeo
Contributor Author

About the overly generic code: it seems like we won't be hitting that zone too soon, and no, we shouldn't provide caching or memoization. That is to be provided by the users themselves. The decorators we provide merely "enforce" a common pattern.

@rodricios
Contributor

I should note that part of the motivation for "pruning", "maximizing", etc. is that those are basically the steps laid out in most A.I. resources that talk about searching & predicting.

@eugene-eeo
Contributor Author

In that case, 👍. I misunderstood.

@rodricios
Contributor

The question I think we both have, though, is: are these patterns going to be helpful? I mean, to me, they make sense, but I understand that most users are going to need a brief explanation of their usage.

@eugene-eeo
Contributor Author

I think I'll work on a re-architecting of the package:

We will have a protocol module which defines wrappers for common use-cases of parsing, pruning, prediction, and post-processing. It will include decorators like @select and defaults like parse_html and perhaps enforce the four stages (Parsing, Pruning, Prediction, Post-processing).

libextract/
  __init__.py
  coretools.py
  protocol.py
  metrics.py
  article.py
  tabular.py
  formatters.py
  strategies.py
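The four stages named in the proposal could be wired together as a simple function pipeline. The stage names (Parsing, Pruning, Prediction, Post-processing) come from the proposal above; the pipeline function itself and its signature are only a sketch, not the planned protocol module:

```python
def pipeline(parse, prune, predict, postprocess):
    """Compose the four proposed stages into one extractor.

    Each argument is a callable for one stage; the result is a
    function from a raw document to the post-processed output.
    (Hypothetical sketch of the protocol, not libextract's API.)
    """
    def extract(document):
        etree = parse(document)          # Parsing: raw input -> tree
        candidates = prune(etree)        # Pruning: tree -> candidates
        best = predict(candidates)       # Prediction: pick the winner
        return postprocess(best)         # Post-processing: final shape
    return extract
```

Defaults like parse_html would then just be the stock callable for the first slot, with users free to swap any stage.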

@rodricios
Contributor

Sounds good! I would be in favor of dropping article and tabular from the module and refactoring their methods into an appropriately named module?

@eugene-eeo
Contributor Author

What do you mean by that?

@rodricios
Contributor

In this branch, article.py only has one function, get_text, and tabular.py has two that are used by tabular's strategy: sort_best_pair and filter_tags.

@rodricios
Contributor

I think we should promote the individual methods used by the strategy, as opposed to the module itself.

@eugene-eeo
Contributor Author

I think I'll revert the library to the state where it was before we started adding the @prunes decorator. I am a firm believer in simplicity being the key to composability, and I think that the library was in the best state then. The YAGNI methodology should be adopted here, especially for our library. Look at this stackoverflow question.

Let's focus on building a nice-to-use API and worry about extensibility when the problem comes to us.

@rodricios
Contributor

Ehh, but wouldn't we be trying to refactor get_pairs again? Or is that a fool's errand?

@eugene-eeo
Contributor Author

IMHO, that's a fool's errand.

@rodricios
Contributor

[..] worry about extensibility when the problem comes to us.

It's sort of already been requested by Beluki:

My rationale for this is that the important thing is to make the pipeline easily extensible/configurable and documenting how users can extend it. Closures allow that.

As for making this API useful, I'm wondering where our efforts should be concentrated. I'm thinking we can keep these changes active, as they do, imo, implement a good base for further abstraction of the general HTML extraction process flow (this is coming from personal experience). Reverting to a state with the plumper versions of html.article and html.tabular is basically going back to a version that's closer to what eatiht was and currently is.
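The closure-based extensibility Beluki describes could look like the following sketch: a pipeline step is produced by a factory that closes over its options, so users configure behaviour without touching the step's internals. All names here (text_length_filter, step) are illustrative:

```python
def text_length_filter(minimum):
    """Build a pipeline step that drops (node, length) pairs
    below *minimum*.

    (Hypothetical sketch of closure-based configuration; the
    returned step closes over the *minimum* option.)
    """
    def step(pairs):
        for node, length in pairs:
            if length >= minimum:
                yield node, length
    return step

# Two differently configured instances of the same step:
short_ok = text_length_filter(minimum=0)
long_only = text_length_filter(minimum=100)
```

Because each configured step is just a callable, it slots directly into a pipeline, which is what makes closures attractive for extensibility here.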

@eugene-eeo
Contributor Author

What about moving the @prunes decorator to coretools but keep the functions in html.article and html.tabular? It offers a nice compromise between code-reuse and locality.

@rodricios
Contributor

Unfortunately, I'm opting for dropping html.article and html.tabular because imo, with those two submodules, we're simply reimplementing eatiht.

The reason for my position is that if we're trying to create a library that's supposed to be "purely functional" and supposed to work with HT/XML documents, the html name, as well as the article and tabular distinction, don't really allow for that.

There are functions that are similar but not 100% the same; those are the ones I think should be refactored and generalized in order to move towards a functional lib.

A flat structure without the html folder would be a step in the right direction I think.

@eugene-eeo
Contributor Author

👍 will do.

eugene-eeo pushed a commit that referenced this issue Apr 12, 2015
@rodricios
Contributor

:) While you're doing that, I'm going to try to see if I can create a -to-json formatter.

@eugene-eeo
Contributor Author

I'm really sorry about the whole commotion caused. I was just really frustrated and confused, and I should've just merged and refactored from there. But I've come to realise that what we are building is not just a mere extraction tool but a "framework" (if you will) to help users develop their own extraction algorithms.

@rodricios
Contributor

Not a problem dude! I hope I didn't come off too firm on my ideas, but you're right, we can say that we are actually making an extraction framework as opposed to just a single extraction tool.

I mean, I think we'll do the bare minimum for users to start using extractors at the get-go, like supplying them default "main article" and "tabular data" extraction algos, but what could make this unique from the rest of what's out there is the underlying predictive and traversal components.
