Support for .epub text extraction? #106

josephrocca · 2016-12-20T05:23:39Z

I'd have thought the .epub format would be popular enough to make it into this lib. Is there any reason why it's missing? There are already a few good npm packages for parsing epub files, so it wouldn't be hard to integrate it into textract.

Great lib in any case - thanks for releasing it 👍

jsinmotion · 2016-12-20T05:29:38Z

That's an excellent idea! Have you used any of the npm packages? Which one(s) would you recommend?

josephrocca · 2016-12-20T06:33:01Z

I had a look at the packages, and then realised that it'd make much more sense just to extract the epub (epubs are just zip files in disguise) and then grab all the .html/.htm files and extract the text with cheerio:

var AdmZip = require('adm-zip');
var cheerio = require('cheerio');

function epubToText(path) {

  let zip = new AdmZip(path);
  let zipEntries = zip.getEntries(); // an array of ZipEntry records

  let output = "";

  // look through all files in zip
  for(let entry of zipEntries) {

    // get file extension:
    let nameParts = entry.entryName.split(".");
    let lastPart = nameParts[nameParts.length-1];

    if(lastPart === "html" || lastPart === "htm") {

      // extract text with cheerio
      let $ = cheerio.load( zip.readAsText(entry.entryName) );
      output += $("body").text();

    }

  }

  return output;

}

There's probably a much more efficient way to do a lot of that, but as a prototype it works fine. adm-zip reads the archive into memory, so there are no temporary files to clean up. I think the epub packages on npm are aimed more at reading metadata and the like than at simple text extraction. Here's the script with a book from Project Gutenberg:

epubToText.zip
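
For what it's worth, the extension check in the prototype could probably be pulled into a small helper. This is only a sketch: `isContentDocument` is a made-up name, and treating `.xhtml` as content is an assumption that goes beyond the html/htm check above. It only uses Node's built-in `path` module.

```javascript
const path = require('path');

// Extensions treated as readable content. ".xhtml" is an addition of this
// sketch (an assumption); the prototype above checks only html and htm.
const CONTENT_EXTENSIONS = new Set(['.html', '.htm', '.xhtml']);

// Hypothetical helper: true if a zip entry name looks like an (X)HTML file.
function isContentDocument(entryName) {
  // Zip entry names use forward slashes, so path.posix is used on every OS.
  const ext = path.posix.extname(entryName).toLowerCase();
  return CONTENT_EXTENSIONS.has(ext);
}

console.log(isContentDocument('OEBPS/chapter1.XHTML')); // true
console.log(isContentDocument('OEBPS/toc.ncx'));        // false
```

Lower-casing the extension means entries such as `Chapter1.HTML` would be picked up too, which the strict `===` comparison in the prototype would skip.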

jsinmotion · 2016-12-20T06:38:45Z

@josephrocca Sounds pretty straightforward based on https://en.wikipedia.org/wiki/EPUB : just parse container.xml, figure out the order of the files, then use cheerio (already a dependency) to process them. It's probably better to rely on a module, but if I have time I'll try to whip something up.
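
One way to read "figure out the order of the files": the package (.opf) file that container.xml points to has a `<manifest>` listing every file by `id`, and a `<spine>` whose `<itemref idref="...">` entries give the reading order. A rough sketch of that lookup is below. `orderBySpine` is a hypothetical helper, and it assumes the manifest items and spine idrefs have already been parsed out of the .opf file into plain arrays.

```javascript
// manifest: array of { id, href } objects taken from <manifest><item> elements.
// spineIdrefs: array of idref strings taken from <spine><itemref> elements.
// Both are assumed to be already parsed out of the .opf file.
function orderBySpine(manifest, spineIdrefs) {
  const hrefById = new Map(manifest.map((item) => [item.id, item.href]));
  return spineIdrefs
    .filter((idref) => hrefById.has(idref)) // skip dangling references
    .map((idref) => hrefById.get(idref));
}

// Made-up example data: the manifest lists ch2 before ch1, the spine does not.
const manifest = [
  { id: 'ch2', href: 'text/ch2.xhtml' },
  { id: 'css', href: 'style.css' },
  { id: 'ch1', href: 'text/ch1.xhtml' },
];
console.log(orderBySpine(manifest, ['ch1', 'ch2']));
// [ 'text/ch1.xhtml', 'text/ch2.xhtml' ]
```

Going through the spine also drops stylesheets, images and anything else that is in the manifest but not part of the reading order.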

josephrocca · 2016-12-20T08:40:56Z

@jsinmotion Whoops! You're right, I completely forgot about the ordering of the files. Here's a mock-up of what it might look like with proper ordering:

var AdmZip = require('adm-zip');
var cheerio = require('cheerio');

function epubToText(path) {

  let zip = new AdmZip(path);
  let zipEntries = zip.getEntries(); // an array of ZipEntry records

  // get content.opf path and containing folder
  // (xmlMode makes cheerio parse these files as XML rather than HTML)
  let $ = cheerio.load( zip.readAsText('META-INF/container.xml'), { xmlMode: true } );
  let contentOpfPath = $("container rootfiles rootfile").attr("full-path");
  let contentOpfFolder = contentOpfPath.split("/");
  contentOpfFolder.pop();
  // rejoin with "/" so nested folders survive; empty string if content.opf is at the zip root
  contentOpfFolder = contentOpfFolder.join("/");

  // push html/htm files into our array of paths to convert to text
  $ = cheerio.load( zip.readAsText(contentOpfPath), { xmlMode: true } );
  let contentFilePaths = [];
  $("package manifest item").each((i, el) => {

    let path = $(el).attr("href");
    let pathParts = path.split(".");
    let lastPart = pathParts[pathParts.length-1];

    if(lastPart === "html" || lastPart === "htm") {
      // avoid a leading "/" when content.opf sits at the zip root
      contentFilePaths.push(contentOpfFolder ? contentOpfFolder+"/"+path : path);
    }

  });

  // extract text from each file with cheerio
  let output = "";
  for(let path of contentFilePaths) {

    let $ = cheerio.load( zip.readAsText(path) );
    output += $("body").text();

  }

  return output;

}

And fixed demo:

epubToText.zip
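
The mock-up above builds each zip path by gluing the content.opf folder and the manifest href together as strings. A possible alternative, sketched here with Node's built-in `path.posix`, is to resolve the href relative to the .opf file. `resolveManifestHref` is a hypothetical name, and the `decodeURIComponent` call rests on the assumption that hrefs may be percent-encoded (it would throw on a malformed sequence).

```javascript
const path = require('path');

// Hypothetical helper: turn a manifest href into a zip entry name.
// opfPath is the "full-path" value from container.xml, e.g. "OEBPS/content.opf".
function resolveManifestHref(opfPath, href) {
  // dirname returns "." when the .opf file sits at the zip root.
  const folder = path.posix.dirname(opfPath);
  // Assumption: hrefs may be percent-encoded ("ch%201.html" -> "ch 1.html").
  const decoded = decodeURIComponent(href);
  // join also normalizes, so "." and ".." segments are collapsed.
  return path.posix.join(folder, decoded);
}

console.log(resolveManifestHref('content.opf', 'ch1.html'));
// ch1.html
console.log(resolveManifestHref('OEBPS/content.opf', 'text/ch1.html'));
// OEBPS/text/ch1.html
```

This covers a root-level content.opf (no leading slash in the result) and hrefs that climb out of the folder with `../`.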

dbashford · 2016-12-20T13:29:36Z

Will take a look! No one has clamored for it, and when I looked a year or so ago there weren't a lot of good options available (and I wasn't in the mood to figure it out myself given the lack of interest).

andineck · 2017-09-17T19:24:57Z

EPUB support would be really helpful. I think @josephrocca's mock-up looks quite good already.
Is any help needed with this?

dbashford · 2017-09-21T15:27:11Z

Sorry, this totally fell off my radar. Happy to take a PR! But I may be able to dig in in the next few weeks.

josephrocca mentioned this issue Dec 20, 2016

Support for .mobi extension? #107

Open

dbashford added this to the 2.2.0 milestone Dec 23, 2016

dbashford removed this from the 2.2.0 milestone Sep 11, 2017

dbashford closed this as completed in 8951230 Aug 3, 2018
