PDF support added #88

Closed
the-fallen wants to merge 5 commits into master from the-fallen:master

Conversation

the-fallen (Contributor) commented Apr 10, 2017

Sorry for the late commit, I've been really busy with college assignments lately.
I closed the last pull request and made this new one due to some issues.
As @Treora said, this commit takes care of the pdfjs-worker cleanup, doesn't load the PDF twice, and doesn't check whether it's a PDF twice.
The reason I think this is the better solution: if we did the check in the background script, any other module would have to wait on an async chrome.tabs.query call every time just to get the URL and check whether it's a PDF. It also saves a lot of code duplication.
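For illustration, the background-script alternative would mean something like this for every such check (just a rough sketch; the function name is made up):

function isCurrentTabPdf() {
    // Every caller would first have to wait for this async lookup
    // before it could even tell whether the active tab shows a PDF.
    return new Promise(resolve => {
        chrome.tabs.query({ active: true, currentWindow: true }, tabs => {
            const url = tabs[0] && tabs[0].url
            resolve(typeof url === 'string' && url.toLowerCase().endsWith('.pdf'))
        })
    })
}
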
Thanks a lot for reviewing.

Treora (Collaborator) commented Apr 10, 2017

Nice, thanks for continuing with this!

Just FYI: you know you can just reset the branch and git push -f to continue on the same PR with the new code? This is okay too, though.

Your choice of combining extractPageMetadata and extractPageText into one function makes sense, though I would leave the ugly details of calling Readability in their own file. Note that the current extraction code was all written rather quickly and tentatively. I would not mind at all reorganising it more thoroughly and thinking again about which data we want to store and how (suggestions welcome).

As for code details: some style needs cleaning (try npm run lint-fix), try to refactor and comment extract-pdf-data to make it pleasant to read (also feel free to use await instead of then), and use the newer fetch API instead of XMLHttpRequest.
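
For example, fetching the PDF data with fetch could look roughly like this (just a sketch, not the PR's actual code):

async function fetchPdfData(url) {
    const response = await fetch(url)
    if (!response.ok) {
        throw new Error(`Could not fetch PDF: ${response.status}`)
    }
    // pdf.js can then consume the raw bytes, e.g. getDocument({data: new Uint8Array(buffer)})
    const buffer = await response.arrayBuffer()
    return buffer
}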

the-fallen force-pushed the the-fallen:master branch from 4430fa0 to 0f2cdd1 on Apr 10, 2017

the-fallen (Author) commented Apr 10, 2017

@Treora As you mentioned, the Readability mess is now in a separate file, comments have been added in extract-pdf-data wherever needed, I used await instead of then everywhere (thanks for the suggestion, it really does look a lot more understandable!), used the fetch API to get the PDF, ran lint-fix, and removed most of the errors.
Also, to improve search for PDFs, I added a few fields to searchableTextFields for metadata like author, subject, and keywords. Let me know if you think any other search fields should be added.
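Roughly, that change looks like this (simplified sketch; the exact surrounding config is omitted, field names as in the diff below):

const searchableTextFields = [
    // ...existing fields used for HTML pages...
    'extractedMetadata.Title',
    'extractedMetadata.Author',
    'extractedMetadata.Subject',
    'extractedMetadata.Keywords',
]
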
Thanks a lot for your input.

Treora (Collaborator) left a comment

Good changes. Now I could understand the code better, and thereby find still more things to comment on. :)


var totalContent = []
var promises = []
collectContent()

Treora (Collaborator) commented Apr 10, 2017

Function can just be inlined if it is only called and defined right here.

callback(pageContent)
})
})
}

Treora (Collaborator) commented Apr 10, 2017

Why such a mix of callbacks and promises? And two arrays, for promises and text contents? Would something like this be more elegant? (probably contains mistakes)

const pageNumbers = range(1, pdf.pdfInfo.numPages + 1) // import from lodash/fp/range; getPage is 1-indexed
// (or come up with a cleaner for-loop replacement without range)

const pagesTextContentPs = pageNumbers.map(async i => {
    const page = await pdf.getPage(i)
    const pageTextPieces = await page.getTextContent()
    const pageText = pageTextPieces.items.map(item => item.str).join(' ')
    return pageText
})
const pagesTextContent = await Promise.all(pagesTextContentPs)
const totalDocumentText = pagesTextContent.join(' ')

Treora (Collaborator) commented Apr 10, 2017

Also, perhaps joining the pages with a few newlines instead of a space could still preserve some readability of the output. (I'm not sure what the 'items' are; I just named them textPieces above. Somewhat meaningful names really help readers understand the code.)

function getData(blob) {
return new Promise(function (resolve, reject) {
var fileReader = new FileReader()
fileReader.onload = async function (blob) {

Treora (Collaborator) commented Apr 10, 2017

This function feels too long to be nested. I would give it a name, define it above, and have it return the value that is then passed to resolve here.
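
For instance, something along these lines (just a sketch with made-up names, assuming the blob is read into an ArrayBuffer for pdf.js):

// The former onload body, now a named, top-level function that returns the extracted value.
async function extractFromArrayBuffer(arrayBuffer) {
    // ...the pdf.js parsing that used to live inside fileReader.onload,
    // returning the extracted text and metadata...
}

function getData(blob) {
    return new Promise((resolve, reject) => {
        const fileReader = new FileReader()
        // Resolving with a promise adopts its result (or its rejection).
        fileReader.onload = event => resolve(extractFromArrayBuffer(event.target.result))
        fileReader.onerror = () => reject(fileReader.error)
        fileReader.readAsArrayBuffer(blob)
    })
}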

@@ -0,0 +1,34 @@
import { getMetadata, metadataRules } from 'page-metadata-parser'
import extractPdfData from './extract-pdf-data'
import getText from './gettext'

Treora (Collaborator) commented Apr 10, 2017

I would try to keep more informative names; extract-page-text was quite accurate, I think?

resolve(
{
pageText: { bodyInnerText: totalContent },
pageMetaData: JSON.parse(JSON.stringify(data.info, null, 2)),

Treora (Collaborator) commented Apr 10, 2017

stringify and parse? What does that achieve?

var fileReader = new FileReader()
fileReader.onload = async function (blob) {
require('pdfjs-dist')
require('fs')

Treora (Collaborator) commented Apr 10, 2017

Do we need to import fs in the browser as well? Could be, but I thought it was a NodeJS thing.

'extractedMetadata.Title',
'extractedMetadata.Author',
'extractedMetadata.Subject',
'extractedMetadata.Keywords',

Treora (Collaborator) commented Apr 10, 2017

Good idea to make these searchable. We may want to harmonise the metadata of HTML pages and PDFs; having title for one and Title for the other looks awful. The whole choice of fields and names is up for revision anyway, but we could do all that after this PR.

the-fallen (Author) commented Apr 11, 2017

Yes, it would be great if we could have the same property keys.

@@ -0,0 +1,58 @@
// Returns a promise for an Object containing PDF Text and MetaData
function getData(blob) {

Treora (Collaborator) commented Apr 10, 2017

A more meaningful name? We could actually expose just one extractPdfData, which accepts either a blob or a URL, especially now that url-to-blob adds just two lines of code.

export default async function extractPdfData({blob, url}) {
    if (blob === undefined) {
        const response = await fetch(url)
        blob = await response.blob()
    }
    // ...extract text and metadata from the blob as before...
}
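
A caller could then pass whichever it has: extractPdfData({blob}) for an already-loaded file, or extractPdfData({url: pdfUrl}) when only the location is known.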
the-fallen (Author) commented Apr 11, 2017

@Treora My bad, I hadn't touched that part of Oliver's code with the callbacks; I should have rewritten it. It is now implemented as you suggested, with minor changes. I've also made all the other changes you pointed out.
Thanks for your time.

Treora (Collaborator) commented Apr 11, 2017

Beautiful. I will test it out; if it works, it seems ready for merging (I will go through it once more and do some small comment rewording and style nitpicking).

Please do let me know if you copy parts of other people's code; I am quite conscious about copyright, and try to keep everything in this repo in the public domain (I should still declare that somewhere); see e.g. the Unlicense. On that note, would you be willing to waive all copyright on your contributions?

the-fallen (Author) commented Apr 11, 2017

@Treora Yes, of course, I am willing. Also, no, I haven't copied code from outside; there was nothing there to copy really, and even the last traces of Oliver's code are gone after rewriting the PDF text extraction.
On another note, before merging and once you're done testing, please remove the console.log statement in page-analysis/background/index.js, line 35.

Treora (Collaborator) commented Apr 24, 2017

I just tested this with a few more PDF files; it seems to work so far. As discussed, the data model should be revised, and problems may still pop up; performance on large documents especially could be an issue. But merging now anyway: we will discover and solve those problems as we go.

Merged in d59ce46.

Treora closed this on Apr 24, 2017

Treora (Collaborator) commented Apr 24, 2017

And, of course, thanks for the effort!

the-fallen (Author) commented Apr 26, 2017

Happy to contribute!

reficul31 pushed a commit to reficul31/webmemex-extension that referenced this pull request Dec 18, 2017
