Support for .epub text extraction? #106

Closed
josephrocca opened this issue Dec 20, 2016 · 7 comments

Comments

@josephrocca

josephrocca commented Dec 20, 2016

I'd have thought the .epub format would be popular enough to make it into this lib. Is there any reason why it's missing? There are already a few good npm packages for parsing epub files, so it wouldn't be hard to integrate it into textract.

Great lib in any case - thanks for releasing it 👍

@jsinmotion

That's an excellent idea! Have you used any of the npm packages? Which one(s) would you recommend?

@josephrocca
Author

josephrocca commented Dec 20, 2016

I had a look at the packages, and then realised that it'd make much more sense just to extract the epub (epubs are just zip files in disguise) and then grab all the .html/.htm files and extract the text with cheerio:

var AdmZip = require('adm-zip');
var cheerio = require('cheerio');

function epubToText(path) {

  let zip = new AdmZip(path);
  let zipEntries = zip.getEntries(); // an array of ZipEntry records

  let output = "";

  // look through all files in zip
  for(let entry of zipEntries) {

    // get file extension:
    let nameParts = entry.entryName.split(".");
    let lastPart = nameParts[nameParts.length-1];

    if(lastPart === "html" || lastPart === "htm") {

      // extract text with cheerio
      let $ = cheerio.load( zip.readAsText(entry.entryName) );
      output += $("body").text();

    }

  }

  return output;

}

There's probably a much more efficient way to do a lot of that, but as a prototype it works fine. adm-zip extracts the files into memory, so there's nothing to clean up. I think the epub packages on npm are more targeted at reading metadata than at simple text extraction. Here's the script with a book from Project Gutenberg:

epubToText.zip
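
One caveat with the extension check above: many EPUBs store their chapters as `.xhtml`, and extensions are not always lower-case. A minimal sketch of a more tolerant check, using only Node's built-in `path` module. The helper name `isContentDocument` is made up for illustration and is not part of textract or adm-zip:

```javascript
const path = require('path');

// Hypothetical helper: decides whether a zip entry name looks like an
// (X)HTML content document, ignoring the case of the extension.
function isContentDocument(entryName) {
  // path.posix is used because zip entry names use forward slashes
  const ext = path.posix.extname(entryName).toLowerCase();
  return ext === '.html' || ext === '.htm' || ext === '.xhtml';
}
```

This could stand in for the `lastPart === "html" || lastPart === "htm"` test, called as `isContentDocument(entry.entryName)`.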

@jsinmotion

@josephrocca sounds pretty straightforward based on https://en.wikipedia.org/wiki/EPUB: just parse the container.xml, figure out the order of the files, then use cheerio (already a dependency) to process them. It's probably better to rely on a module, but if I have time I'll try to whip something up.
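
For reference, the reading order in an EPUB comes from the `<spine>` element of the package (.opf) file: each `<itemref idref="...">` points at a manifest `<item id="..." href="...">`. A rough sketch of that lookup on already-parsed data follows. The function name `orderBySpine` and the plain-object input shapes are assumptions for illustration, not an existing API:

```javascript
// manifestItems: [{ id, href }] taken from <manifest><item .../></manifest>
// spineIdrefs:   ["id1", "id2", ...] taken from <spine><itemref idref="..."/></spine>
function orderBySpine(manifestItems, spineIdrefs) {
  // index manifest hrefs by id so each spine entry is a single lookup
  const hrefById = new Map(manifestItems.map(item => [item.id, item.href]));
  return spineIdrefs
    .filter(idref => hrefById.has(idref)) // skip dangling references
    .map(idref => hrefById.get(idref));
}

const manifest = [
  { id: 'ch2', href: 'chapter2.html' },
  { id: 'css', href: 'style.css' },
  { id: 'ch1', href: 'chapter1.html' },
];
console.log(orderBySpine(manifest, ['ch1', 'ch2']));
// → [ 'chapter1.html', 'chapter2.html' ]
```

Driving the order from the spine also drops non-content manifest entries such as stylesheets and images, since they are not referenced there.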

@josephrocca
Author

josephrocca commented Dec 20, 2016

@jsinmotion Whoops! You're right, I completely forgot about the ordering of the files. Here's a mock-up of what it might look like with proper ordering:

var AdmZip = require('adm-zip');
var cheerio = require('cheerio');

function epubToText(path) {

  let zip = new AdmZip(path);
  let zipEntries = zip.getEntries(); // an array of ZipEntry records

  // get content.opf path and containing folder
  let $ = cheerio.load( zip.readAsText('META-INF/container.xml') );
  let contentOpfPath = $("container rootfiles rootfile").attr("full-path");
  let contentOpfFolder = contentOpfPath.split("/");
  contentOpfFolder.pop();
  // join with "/" so nested folders (e.g. "a/b/content.opf") keep their separators
  contentOpfFolder = contentOpfFolder.join("/");

  // push html/htm files into our array of paths to convert to text
  $ = cheerio.load(  zip.readAsText(contentOpfPath) );
  let contentFilePaths = [];
  $("package manifest item").each((i, el) => {

    let path = $(el).attr("href");
    let pathParts = path.split(".");
    let lastPart = pathParts[pathParts.length-1];

    if(lastPart === "html" || lastPart === "htm") {
      // avoid a leading "/" when content.opf sits at the root of the zip
      contentFilePaths.push(contentOpfFolder ? contentOpfFolder+"/"+path : path);
    }

  });

  // extract text from each file with cheerio
  let output = "";
  for(let path of contentFilePaths) {

    let $ = cheerio.load( zip.readAsText(path) );
    output += $("body").text();

  }

  return output;

}

And fixed demo:

epubToText.zip
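
The `href` values in the manifest are relative to the folder containing the .opf file, so they have to be joined onto that folder before they can be looked up in the zip. A small sketch of that step using Node's built-in `path.posix` (zip entries use forward slashes). `resolveManifestHref` is a hypothetical name, and the sketch assumes hrefs are percent-encoded relative paths without a fragment:

```javascript
const path = require('path');

// Hypothetical helper: turns a manifest href into a zip entry name.
function resolveManifestHref(opfPath, href) {
  // dirname returns "." when the .opf file sits at the zip root
  const folder = path.posix.dirname(opfPath);
  // join normalizes "./" and "../" segments; decode e.g. "%20" to a space
  return path.posix.join(folder, decodeURIComponent(href));
}

console.log(resolveManifestHref('OEBPS/content.opf', 'text/ch1.html')); // → OEBPS/text/ch1.html
console.log(resolveManifestHref('content.opf', 'ch1.html'));            // → ch1.html
```

Compared with splitting and re-joining the path by hand, this handles an .opf file at the zip root and nested folders in the same way.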

@dbashford
Owner

Will take a look! No one has clamored for it, and when I looked a year or so ago there weren't a lot of good options available (and I wasn't in the mood to figure it out myself given the lack of interest).

@dbashford dbashford added this to the 2.2.0 milestone Dec 23, 2016
@dbashford dbashford removed this from the 2.2.0 milestone Sep 11, 2017
@andineck

epub support would be really helpful. I think @josephrocca's mockup looks quite good already.
Is any help needed with this?

@dbashford
Owner

Sorry, this totally fell off my radar. Happy to take a PR! But I may be able to dig in in the next few weeks.


4 participants