PDF Type Provider #414

Closed
ptrelford opened this issue Feb 17, 2014 · 20 comments

Comments

@ptrelford
Contributor

commented Feb 17, 2014

Data is often published as tables in PDF documents, making it hard to access programmatically.

Open source .NET PDF readers:

A type provider might make such data more accessible.

@ovatsus

Member

commented Feb 17, 2014

I think that's @colinbull's next project after the HTML type provider, right? ;)

@tpetricek tpetricek added type-feature and removed type-feature labels Feb 17, 2014

@tpetricek

Member

commented Feb 17, 2014

This would be an amazing feature! Scraping PDFs is a thing that could make F# Data really nice for data journalism (I think publishing "open government data" as unreadable PDF is a popular trick :-))

@ovatsus

Member

commented Feb 17, 2014

There's a bunch of UK train related data in pdfs...

@ghost

commented Feb 17, 2014

This probably belongs in a new FSharp.Documents project? Could encompass Word, Excel, google docs, PDF etc.

@ovatsus

Member

commented Feb 17, 2014

Yes, that makes sense. Or at least in a separate assembly/NuGet package.

@quintusm

commented Feb 17, 2014

I vote for the FSharp.Documents project. We just need to make sure that we don't bring in a dependency on Office to do this

@tpetricek

Member

commented Feb 17, 2014

I think the right split is tricky: if we add the HTML provider, it could belong both to "documents" and to "data". But I agree that some splitting is definitely needed, at least as a separate assembly and NuGet package (for better reference management).

Equally, I'd love to see SQL provider and SQL Commands provider in F# data, but perhaps in a separate package FSharp.Data.Databases. I think that is more actively being worked on, so perhaps that's the first thing we should do.

For now, I think tracking HTML and PDF providers as work items here makes sense.

@ghost

commented Feb 18, 2014

IMHO an HTML provider would belong in "documents".

Such a thing would inevitably get dragged away from the core concerns of data (missing values, integration with data visualization, type inference) towards the same sort of issues that arise for other document formats (natural language, entity extraction, table extraction, link following, examination of active content/code, to name a few).

A good rule of thumb: if it has embedded macros or ActionScript or JavaScript etc, it's a document, not a data format :)

@tpetricek

Member

commented Feb 18, 2014

Somehow, this thread keeps reminding me of: http://www.charlespetzold.com/etc/csaml.html

@sgtz

commented Feb 18, 2014

That's crazy in a verbose kind of way... And so much enthusiasm to go with it too.

@ovatsus

Member

commented Feb 19, 2014

I agree and disagree at the same time :)

I think there's room for having type providers for HTML and even Excel as part of FSharp.Data, as long as they focus on the core concerns that @dsyme mentioned and fit well with the existing XML/JSON/CSV providers. And I think the one we have coming up from @colinbull does that. But that doesn't preclude the existence of a separate project with a full-fledged Excel provider, for example, and maybe the same for HTML. Deedle handles CSV files, but CsvProvider still has its place. Most other languages/communities have more than one alternative for each thing, and that's OK. We are also seeing that with the two new community SQL type providers.

As for PDF, I'm not sure which category it will belong to, but if we're talking about having a separate package, we might as well make it a separate project. That said, there's no point in putting the cart before the horse, so we can keep having the discussion here, and when we have a concrete prototype or pull request, we can then discuss whether it fits here or diverges enough that it should spawn a new project.

PS: I think the database related type providers, and the ones targeted at Hadoop-like systems, have different enough goals and concerns that they should be kept separate.

@ptrelford

Contributor Author

commented Feb 19, 2014

@ovatsus agree on cart before horse, I started the thread here to get some eyes on it which seems to have worked. There have been some great suggestions on project location, I guess we can work out where it goes if and when we have a prototype PDF type provider.

I did some research on available libraries for reading PDF files to kick things off.

Another part is having a set of PDF examples, so we can understand the challenges in interpreting particular sets of data.

Please post links to example PDFs containing data that people would like to access (e.g. train information).

Cheers,
Phil

@ovatsus

Member

commented Feb 22, 2014

PDFs with the national rail timetables: http://www.networkrail.co.uk/aspx/3828.aspx
@ptrelford as you created this issue, I assume you also have some example pdfs?

@ptrelford

Contributor Author

commented Mar 10, 2014

@ovatsus yes, an example is US government budget: http://www.whitehouse.gov/sites/default/files/omb/budget/fy2014/assets/budget.pdf

I've taken a look at PDFsharp, and created a snippet for extracting text: http://fssnip.net/lT

PDFsharp currently does not support all features in PDF 1.5 documents.

It is fairly easy to extract the text content of, for example, a train timetable: http://www.networkrail.co.uk/browse%20documents/eNRT/Dec13/timetables/Table%20001.pdf

Some of the structure of a document is easily discovered:

  • pages
  • lines (as rows) via operators; see Appendix A (Operator Summary)

However, the text on a line appears to be laid out on a coordinate basis, with no consistent mark-up for delimiting columns. In the case of the train timetable, the columns might be reverse engineered using the vertical line positions as delimiters. Unfortunately, vertical line delimiters are not consistent across documents, which makes the task of programmatically inferring the structure of a page non-trivial.
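The vertical-line idea can be sketched roughly as follows. This is in Python for brevity rather than F#, and it is only an illustration under assumptions: the `(x, y, text)` fragment tuples, the `rule_xs` list of vertical line positions, and the `y_tolerance` value are all hypothetical inputs, not something PDFsharp hands back in this shape.

```python
from bisect import bisect_right
from collections import defaultdict


def rows_from_fragments(fragments, rule_xs, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows and split columns at vertical rules.

    fragments: hypothetical (x, y, text) tuples already extracted from a page.
    rule_xs:   hypothetical x-positions of the vertical lines on that page.
    """
    rule_xs = sorted(rule_xs)
    n_cols = len(rule_xs) + 1
    rows = defaultdict(lambda: [[] for _ in range(n_cols)])

    for x, y, text in fragments:
        # Bucket by y so fragments on (almost) the same baseline share a row.
        # Simplification: fragments straddling a bucket boundary may be split.
        row_key = round(y / y_tolerance)
        # The column index is the number of vertical rules at or left of x.
        col = bisect_right(rule_xs, x)
        rows[row_key][col].append((x, text))

    table = []
    # PDF user space has y growing upward, so larger y means higher on the page.
    for key in sorted(rows, reverse=True):
        table.append([" ".join(t for _, t in sorted(cell)) for cell in rows[key]])
    return table


# Made-up fragments, not taken from any real timetable.
fragments = [
    (10, 700, "Station A"), (120, 700, "0905"), (200, 700, "0935"),
    (10, 688, "Station B"), (120, 688, "1120"), (200, 688, "1150"),
]
table = rows_from_fragments(fragments, rule_xs=[100, 180])
```

This only works when the page actually draws vertical rules and they can be recovered, which is exactly the part that varies between documents.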

@taylorwood

Contributor

commented May 8, 2014

The hard problem of extracting tabular data from a PDF, as @ptrelford mentioned, is that its internal representation of tabular data (page-description operators derived from PostScript) retains none of the structural row/column info. Imagine a rendering of a simple HTML table, but instead of the markup consisting of simple table/row/cell elements, each word is defined as its own div with an absolute CSS position, and the divs are defined in no logical order.

Under those circumstances, I think the readily apparent ways to recognize tabular data are to:

  • infer the rows/columns by fitting the text to a grid and finding the best-fit alignments (not trivial or 100% reliable), and
  • look for vector-based lines that hint at the table structure

I've used those approaches to good effect for PDF and OCR data extraction but only when I had a reasonable idea of the page and table layout beforehand, never for anything as general purpose as this would require.
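As a very rough illustration of the grid-fitting idea when there are no ruling lines to lean on, one could cluster the left edges of the text fragments and treat each cluster as a column. The sketch below is in Python rather than F#, and the function name, the `min_gap` threshold, and the sample coordinates are all assumptions made up for the example; a real best-fit approach would need to be considerably more robust.

```python
def infer_column_starts(xs, min_gap=15.0):
    """Cluster left-edge x-coordinates into columns.

    xs:      hypothetical left-edge x-positions of text fragments on one page.
    min_gap: assumed threshold; a gap wider than this starts a new column.
    """
    clusters = []
    for x in sorted(xs):
        if clusters and x - clusters[-1][-1] <= min_gap:
            # Close enough to the previous position: same column.
            clusters[-1].append(x)
        else:
            # Gap too wide (or first value): start a new column.
            clusters.append([x])
    # Represent each column by its smallest left edge.
    return [cluster[0] for cluster in clusters]


# Made-up coordinates with a little jitter, as left-aligned text often has.
starts = infer_column_starts([10, 11.5, 120, 121, 119.5, 200, 10.2])
```

Right-aligned or centred columns, cells spanning several columns, and wrapped cell text all break this simple left-edge heuristic, which is part of why knowing the layout beforehand helps so much.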

And to even get to that point you have to take a dependency on an open source PDF library that will no doubt have its own issues with certain versions and features of the PDF spec unless you want to implement your own PostScript parser/layout engine... Oi.

@dsyme

Contributor

commented May 8, 2014

I've mentioned this before, but it's my gut feeling that the HTML, PDF and other "document" providers should probably go in a separate project FSharp.Data.Documents.

@ovatsus

Member

commented May 8, 2014

I agree with @dsyme on the PDF provider: it should be in its own project if/when someone does it.
I disagree on the HTML provider. I think it's so pervasive that it should go in FSharp.Data, and the HtmlTypeProvider branch is very close to being merged to master and released.

@dsyme

Contributor

commented May 8, 2014

Yup, html is ok. But do consider that it could also make a good seed project to help form a viable documents provider, with critical mass. Your call though! :)

@colinbull

Contributor

commented May 8, 2014

I must admit I'm confused about the definition of documents in this context. Aren't CSV, XML and JSON also document types? @dsyme when you consider documents such as pdf and html are you thinking in a purely web context, in which case things like syndication and RDF formats should be grouped together? Otherwise isn't it better to just keep each provider separate?

@tpetricek

Member

commented May 8, 2014

I think the best strategy is to let things grow organically and adapt as needed (for example PDF provider could as well be placed in "data journalism" package - it really depends on what kinds of things people start building!)

At the moment, it makes sense to have the HTML provider in F# Data. Having a PDF provider somewhere would be nice, but it is not going to happen anytime soon and I'm not sure what other things could be treated as "documents" (perhaps Excel, but then, you might want to do much more than read data from Excel files).

As for the "seed project", this is pretty much the goal of the F# Data Toolbox and I think that is a great place for experimenting with other data sources (and can also include code that is good for interactive data access, but perhaps not for deployment/production).

@dsyme

Contributor

commented May 9, 2014

Document = format whose primary purpose is visual display, storage and editing of information for human consumption. (my definition) PDF, PS, HTML, Markdown, Word, OneNote, Google docs, ...

Or "documents have fonts"

By "seed" I meant "core quality project" - the bit that's not experimental, and sets the standard for all the other work in the area, much as the original CSV provider here.

It doesn't really matter, but it just seems logical to draw a dividing line between "data formats" and "document formats". Both to preserve the focus of FSharp.Data on true data formats and mass data services, and to support the creation of new great projects for the panoply of document formats.

"Data journalism" would be an aggregation package, like FsLab, combining both of these and more.

For HTML the question may depend on whether you focus on the visual or tabular parts of HTML.

@ovatsus

Member

commented Aug 30, 2014

Closing until there is a concrete pull request

@ovatsus ovatsus closed this Aug 30, 2014
