Sequences #45

Closed
xkollar opened this issue Oct 7, 2016 · 2 comments
Labels
fixed at head — The issue has been addressed but the fix is not yet included in a released version of the library.
xkollar commented Oct 7, 2016

Hi. I like your library 👍.
However, I do not see any clear or obvious way to parse (or scrape)

<body>
  <h1>title1</h1>
  <h2>title2 1</h2>
  <p>text 1</p>
  <p>text 2</p>
  <h2>title2 2</h2>
  <p>text 3</p>
  <h2>title2 3</h2>
</body>

into something like

type Title = String
type Paragraph = String -- For simplicity
data Part = Part Title [Paragraph]

expected :: [Part]
expected =
    [ Part "title2 1" ["text 1", "text 2"]
    , Part "title2 2" ["text 3"]
    , Part "title2 3" []
    ]
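For what it's worth, the grouping step itself is easy to express in plain Haskell once the sibling nodes are available as a flat sequence. A minimal sketch, where the `Node` pairs of tag name and inner text are a hypothetical stand-in for scraped siblings (not scalpel's own representation):

```haskell
-- Hypothetical flat sibling stream: (tag name, inner text).
-- A stand-in for scraped nodes, not scalpel's own types.
type Node = (String, String)

type Title = String
type Paragraph = String
data Part = Part Title [Paragraph] deriving (Eq, Show)

-- Split the sibling sequence at each <h2>, pairing every title with
-- the <p> nodes that follow it up to the next <h2>.
parts :: [Node] -> [Part]
parts ns = case dropWhile (not . isH2) ns of
    []              -> []
    (_, title) : xs ->
        let (body, rest) = break isH2 xs
        in  Part title [t | ("p", t) <- body] : parts rest
  where
    isH2 (tag, _) = tag == "h2"

main :: IO ()
main = print $ parts
    [ ("h1", "title1")
    , ("h2", "title2 1"), ("p", "text 1"), ("p", "text 2")
    , ("h2", "title2 2"), ("p", "text 3")
    , ("h2", "title2 3")
    ]
```

Running this on the sibling list above yields the `expected` value: the `<h1>` is skipped, each `<h2>` opens a new `Part`, and the trailing `<h2>` with no following `<p>` gets an empty paragraph list. The hard part, of course, is getting the flat sibling sequence out of scalpel in the first place.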

If I have just missed something, would you consider adding this to the examples? Or maybe a slight change to the combinators, or perhaps some kind of sequence operator?

Probably related to issue #41.

Thanks :-).


fimad commented Oct 16, 2016

Yeah, unfortunately there isn't a great way to do this today :(

A lot of scalpel's internals currently rely on the assumption that an HTML document is a tree and scraping/parsing involves selecting the sub-tree that you care about and extracting data from that sub-tree.

I've opened up #48 as a sort of meta-issue to solve the general problem of selecting multiple sub-trees. If you have any ideas for what a good API would look like, please post them there :)

fimad modified the milestone: 0.4.1 Oct 17, 2016
fimad modified the milestone: 0.6.0 Feb 14, 2019
fimad added a commit that referenced this issue Feb 16, 2019
The approach uses a zipper-like API to focus on sibling nodes and
execute scrapers on the focused nodes. Regression tests have been added
that show that this solves the requirements in issues #41 and #45.

The documentation needs to be cleaned up a bit, and I'm not quite happy
with the names `visitSerially` and `visitChildrenSerially`.

Also, there is a bad interaction with the newly added textSelector that
makes stepNext/Back behave unexpectedly: whitespace between tags counts
as nodes, which leads to unintuitive behavior.

Issue #48
fimad added the "fixed at head" label Feb 17, 2019

fimad commented Feb 18, 2019

This is now supported in version 0.6.0. This specific issue was added as a regression test:

    ,   scrapeTest
            "Issue #45 regression test"
            (unlines [
              "<body>"
            , "  <h1>title1</h1>"
            , "  <h2>title2 1</h2>"
            , "  <p>text 1</p>"
            , "  <p>text 2</p>"
            , "  <h2>title2 2</h2>"
            , "  <p>text 3</p>"
            , "  <h2>title2 3</h2>"
            , "</body>"
            ])
            (Just [
              ("title2 1", ["text 1", "text 2"])
            , ("title2 2", ["text 3"])
            , ("title2 3", [])
            ])
            (chroot "body" $ inSerial $ many $ do
                title <- seekNext $ text "h2"
                ps <- untilNext (matches "h2") (many $ do
                  -- New lines between tags count as text nodes, skip over
                  -- these.
                  optional $ stepNext $ matches textSelector
                  stepNext $ text "p")
                return (title, ps))
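For readers finding this later: assuming scalpel >= 0.6.0, the same scraper can be run as a standalone program with `scrapeStringLike`. A sketch, with the HTML from this issue collapsed onto one line so no whitespace text nodes appear between tags (the `optional` skip then simply succeeds without consuming anything):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative (many, optional)
import Text.HTML.Scalpel

-- The HTML from this issue, without inter-tag whitespace.
page :: String
page = "<body><h1>title1</h1><h2>title2 1</h2><p>text 1</p>\
       \<p>text 2</p><h2>title2 2</h2><p>text 3</p>\
       \<h2>title2 3</h2></body>"

partScraper :: Scraper String [(String, [String])]
partScraper = chroot "body" $ inSerial $ many $ do
    -- Seek forward to the next <h2> and take its text as the title.
    title <- seekNext (text "h2")
    -- Collect every <p> until the next <h2> (or the end of the
    -- sibling sequence).
    ps <- untilNext (matches "h2") $ many $ do
        -- Whitespace between tags shows up as text nodes; skip it.
        optional $ stepNext (matches textSelector)
        stepNext (text "p")
    return (title, ps)

main :: IO ()
main = print (scrapeStringLike page partScraper)
-- prints Just [("title2 1",["text 1","text 2"]),
--              ("title2 2",["text 3"]),("title2 3",[])]
```

The `seekNext`/`untilNext`/`stepNext` combinators come from the `SerialScraper` context opened by `inSerial`, which focuses one sibling at a time instead of selecting a sub-tree.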

fimad closed this as completed Feb 18, 2019