Sequences #45

Closed
xkollar opened this issue Oct 7, 2016 · 2 comments
Labels
fixed at head — The issue has been addressed but the fix is not yet included in a released version of the library.
xkollar commented Oct 7, 2016

Hi. I like your library 👍.
However, I do not see any clear or obvious way to parse (or scrape)

<body>
  <h1>title1</h1>
  <h2>title2 1</h2>
  <p>text 1</p>
  <p>text 2</p>
  <h2>title2 2</h2>
  <p>text 3</p>
  <h2>title2 3</h2>
</body>

into something like

type Title = String
type Paragraph = String -- For simplicity
data Part = Part Title [Paragraph]

expected :: [Part]
expected =
    [ Part "title2 1" ["text 1", "text 2"]
    , Part "title2 2" ["text 3"]
    , Part "title2 3" []
    ]
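For what it's worth, the grouping step itself is easy to express in plain Haskell once the sibling nodes are available as a flat sequence. A minimal sketch, where the `Node` pairs of tag name and inner text are a hypothetical stand-in for scraped siblings (not scalpel's own representation):

```haskell
-- Hypothetical flat sibling stream: (tag name, inner text).
-- A stand-in for scraped nodes, not scalpel's own types.
type Node = (String, String)

type Title = String
type Paragraph = String
data Part = Part Title [Paragraph] deriving (Eq, Show)

-- Split the sibling sequence at each <h2>, pairing every title with
-- the <p> nodes that follow it up to the next <h2>.
parts :: [Node] -> [Part]
parts ns = case dropWhile (not . isH2) ns of
    []              -> []
    (_, title) : xs ->
        let (body, rest) = break isH2 xs
        in  Part title [t | ("p", t) <- body] : parts rest
  where
    isH2 (tag, _) = tag == "h2"

main :: IO ()
main = print $ parts
    [ ("h1", "title1")
    , ("h2", "title2 1"), ("p", "text 1"), ("p", "text 2")
    , ("h2", "title2 2"), ("p", "text 3")
    , ("h2", "title2 3")
    ]
```

Running this on the sibling list above yields the `expected` value: the `<h1>` is skipped, each `<h2>` opens a new `Part`, and the trailing `<h2>` with no following `<p>` gets an empty paragraph list. The hard part, of course, is getting the flat sibling sequence out of scalpel in the first place.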

If I have just missed something, would you consider adding this to the examples? Or maybe a slight change to the combinators, or perhaps some kind of sequence operator?

Probably related to issue #41.

Thanks :-).


fimad commented Oct 16, 2016

Yeah, unfortunately there isn't a great way to do this today :(

A lot of scalpel's internals currently rely on the assumption that an HTML document is a tree and scraping/parsing involves selecting the sub-tree that you care about and extracting data from that sub-tree.

I've opened up #48 as a sort of meta-issue to solve the general problem of selecting multiple sub-trees. If you have any ideas for what a good API would look like, please post them there :)

fimad modified the milestone: 0.4.1 Oct 17, 2016
fimad modified the milestone: 0.6.0 Feb 14, 2019
fimad added a commit that referenced this issue Feb 16, 2019
The approach uses a zipper-like API to focus on sibling nodes and
execute scrapers on the focused nodes. Regression tests have been added
that show that this solves the requirements in issues #41 and #45.

The documentation needs to be cleaned up a bit, and I'm not quite happy
with the names `visitSerially` and `visitChildrenSerially`.

Also, there is a bad interaction with the newly added textSelector that
makes stepNext/Back behave unexpectedly: whitespace between tags counts
as nodes, which leads to unintuitive behavior.

Issue #48
fimad added the "fixed at head" label Feb 17, 2019

fimad commented Feb 18, 2019

This is now supported in version 0.6.0. This specific issue was added as a regression test:

    ,   scrapeTest
            "Issue #45 regression test"
            (unlines [
              "<body>"
            , "  <h1>title1</h1>"
            , "  <h2>title2 1</h2>"
            , "  <p>text 1</p>"
            , "  <p>text 2</p>"
            , "  <h2>title2 2</h2>"
            , "  <p>text 3</p>"
            , "  <h2>title2 3</h2>"
            , "</body>"
            ])
            (Just [
              ("title2 1", ["text 1", "text 2"])
            , ("title2 2", ["text 3"])
            , ("title2 3", [])
            ])
            (chroot "body" $ inSerial $ many $ do
                title <- seekNext $ text "h2"
                ps <- untilNext (matches "h2") (many $ do
                  -- New lines between tags count as text nodes, skip over
                  -- these.
                  optional $ stepNext $ matches textSelector
                  stepNext $ text "p")
                return (title, ps))
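For readers finding this later: assuming scalpel >= 0.6.0, the same scraper can be run as a standalone program with `scrapeStringLike`. A sketch, with the HTML from this issue collapsed onto one line so no whitespace text nodes appear between tags (the `optional` skip then simply succeeds without consuming anything):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative (many, optional)
import Text.HTML.Scalpel

-- The HTML from this issue, without inter-tag whitespace.
page :: String
page = "<body><h1>title1</h1><h2>title2 1</h2><p>text 1</p>\
       \<p>text 2</p><h2>title2 2</h2><p>text 3</p>\
       \<h2>title2 3</h2></body>"

partScraper :: Scraper String [(String, [String])]
partScraper = chroot "body" $ inSerial $ many $ do
    -- Seek forward to the next <h2> and take its text as the title.
    title <- seekNext (text "h2")
    -- Collect every <p> until the next <h2> (or the end of the
    -- sibling sequence).
    ps <- untilNext (matches "h2") $ many $ do
        -- Whitespace between tags shows up as text nodes; skip it.
        optional $ stepNext (matches textSelector)
        stepNext (text "p")
    return (title, ps)

main :: IO ()
main = print (scrapeStringLike page partScraper)
-- prints Just [("title2 1",["text 1","text 2"]),
--              ("title2 2",["text 3"]),("title2 3",[])]
```

The `seekNext`/`untilNext`/`stepNext` combinators come from the `SerialScraper` context opened by `inSerial`, which focuses one sibling at a time instead of selecting a sub-tree.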

fimad closed this as completed Feb 18, 2019