How to search and replace text within a document? #71

Closed
distracteddev opened this issue May 22, 2016 · 19 comments
Comments

@distracteddev

First off, thanks for writing and maintaining Hummus!

From reading the documentation and perusing the issues, I've gathered so far that this is not supported at a high level by Hummus. I also found your explanation suggesting that it wouldn't necessarily be too difficult, you just had to understand the structure/anatomy of a pdf document.

Was just curious if this was actually simpler to do than I've discovered? Or perhaps an example exists but I've just missed it?

Either way, I've started to read the PDF spec, focusing on the text portions and am starting to understand some of the low level API calls now. Any tips to set me down the right path would be appreciated.

@galkahana
Owner

Hi,
I've never tried implementing search and replace, but yes, I think it should be possible. I'm not sure it would be easy. I'll provide some notes on how I would approach it, but it might be a good idea to consult someone who has done this, or to read library code that actually implements it.
I'd allocate a few weeks for it, just as an offhand estimate.

you need to tackle these problems:

  1. how to read the text in the pdf
  2. how to correctly replace and display the new text

How to parse the text

Content (text, graphics) is placed in the content streams of pages, so you need to look into each page's content streams (a page may have more than one).

Text is placed in content streams inside blocks delimited by the "BT" and "ET" operators.
Inside these blocks, track text-showing operators such as "Tj". Tj takes a single operand, the string that precedes it, which is the encoded text. The string is encoded according to the current font, so you need to track that font as well (it is set by the Tf operator). You can decode the text using the font's encoding or its ToUnicode map. The PDF specification explains how Acrobat decodes text, so you can use that as a guide for the implementation.
You also need to work out where words begin and end, meaning where the spaces fall. Hopefully you can split words by relying on each word having its own Tj command, but I'm not sure that holds.

You need something to tokenize the content stream. I have a good class for this in the C++ implementation, PDFParserTokenizer, which I didn't expose via the hummusjs module. If it makes sense, we may want to expose it or reimplement it. Its definition is here. This one here shows basic tokenization of a content stream. I hope it's OK that it's in C++.
By tokenizing the streams you can find the commands and then go back to the relevant operands (or rather, save the operands as you go).
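The tokenizing step described above can be sketched in plain JavaScript. This is not PDFParserTokenizer and not part of the hummus API; `tokenize` and `listOperations` are hypothetical helpers that assume the content stream has already been read into a string. The sketch ignores inline images (BI ... ID ... EI) and does not decode escape sequences inside strings.

```javascript
// Minimal content-stream tokenizer sketch (hypothetical helper, not hummus API).
// Recognizes comments, literal strings "(...)", hex strings "<...>",
// dictionary and array delimiters, names, numbers and operators.
function tokenize(content) {
  const tokens = [];
  let i = 0;
  while (i < content.length) {
    const ch = content[i];
    if (/\s/.test(ch)) { i++; continue; }
    if (ch === '%') { // a comment runs to the end of the line
      while (i < content.length && content[i] !== '\n' && content[i] !== '\r') i++;
      continue;
    }
    if (ch === '(') { // literal string, possibly with nested or escaped parentheses
      let depth = 0;
      let j = i;
      do {
        if (content[j] === '\\') j++; // skip the escaped character
        else if (content[j] === '(') depth++;
        else if (content[j] === ')') depth--;
        j++;
      } while (depth > 0 && j < content.length);
      tokens.push({ type: 'string', value: content.slice(i + 1, j - 1) });
      i = j;
      continue;
    }
    if (content.startsWith('<<', i) || content.startsWith('>>', i)) {
      tokens.push({ type: 'dict', value: content.slice(i, i + 2) });
      i += 2;
      continue;
    }
    if (ch === '<') { // hex string
      const end = content.indexOf('>', i);
      const stop = end === -1 ? content.length : end;
      tokens.push({ type: 'hex', value: content.slice(i + 1, stop).replace(/\s+/g, '') });
      i = stop + 1;
      continue;
    }
    if (ch === '[' || ch === ']') {
      tokens.push({ type: 'bracket', value: ch });
      i++;
      continue;
    }
    // Anything else: read up to the next whitespace or delimiter.
    let j = i + 1;
    while (j < content.length && !/[\s()<>\[\]\/%]/.test(content[j])) j++;
    const word = content.slice(i, j);
    const isNumber = /^[+-]?(\d+\.?\d*|\.\d+)$/.test(word);
    tokens.push({ type: ch === '/' ? 'name' : isNumber ? 'number' : 'operator', value: word });
    i = j;
  }
  return tokens;
}

// Group tokens into operations: operands are saved until their operator arrives.
function listOperations(content) {
  const operations = [];
  let operands = [];
  for (const token of tokenize(content)) {
    if (token.type === 'operator') {
      operations.push({ operator: token.value, operands });
      operands = [];
    } else {
      operands.push(token);
    }
  }
  return operations;
}

// Usage: print the raw operand of every Tj in a small sample stream.
const sample = 'BT /F1 12 Tf 72 700 Td (Hello) Tj ET';
for (const { operator, operands } of listOperations(sample)) {
  if (operator === 'Tj') console.log(operands[0].value);
}
```

Saving operands as they arrive, rather than scanning backwards from each operator, keeps the loop to a single pass and makes it easy to remember the most recent Tf font while walking the operations.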

Note that you may get form xobjects placed. these are pieces of reusable graphics that function like pages within pages. you need to track their content stream too in case they are placed in a page.

Get this up and running, and if you're happy with getting the text in a document/page you can move on to replacing the text.

How to replace the text

If you want to replace the text, find the original placement commands responsible for it and replace them with new commands that place the new text. You'll probably have to replace the whole paragraph (you have to work out what counts as a paragraph), because the text length will change, placements will shift, and you don't want the replaced text to look funny. By funny I mean that it runs over the text that follows it or leaves too much space. So replacing the whole paragraph's text is probably the better approach: work out the new paragraph text and place it. Hopefully this will work.

You can use hummus commands to place the new text, or use lower-level commands.
The tricky part is adding any new characters to the font definition. Assuming the PDF embeds only the characters it needs to render the text it already has, you probably need to know which original font was used. Working that out from the PDF is not very easy, but it can be done. As for the actual embedding, you are probably better off creating a new font with hummus, with the same name, and writing all the text with that font. Simply replace the Tf command that sets the old font with one that sets the new font, and use Tj commands to place the new text (sticking to that avoids having to know the size and color of the original text).
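When writing a replacement Tj command by hand, the new text has to be a valid PDF literal string, which means backslashes and parentheses must be escaped. A minimal sketch follows; `escapePdfLiteral` and `showTextCommand` are hypothetical helper names, and the sketch assumes a simple single-byte font whose character codes match the bytes of the text. It does nothing about glyphs missing from an embedded font subset.

```javascript
// Escape the characters that have special meaning inside a PDF literal string.
function escapePdfLiteral(text) {
  return text.replace(/[\\()]/g, (ch) => '\\' + ch);
}

// Build a text-showing command for the given text (hypothetical helper).
function showTextCommand(text) {
  return '(' + escapePdfLiteral(text) + ') Tj';
}

console.log(showTextCommand('Total (net)'));
```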

Good luck,
Gal.

@galkahana
Owner

how to parse text with hummus - http://pdfhummus.com/post/156548561656/extracting-text-from-pdf-files

@filmerjarred

I managed to implement this the following way

var hummus = require('hummus');

//write our example pdf
var pdfWriter = hummus.createWriter('./source.pdf', {compress:false});
var arialFont = pdfWriter.getFontForFile('./LucidaBrightDemiBold.ttf');
var page = pdfWriter.createPage(0,0,600,800);
var cxt = pdfWriter.startPageContentContext(page);

var textOptions = {font:arialFont, size:14, color:0x222222};
cxt.writeText('Example text',75,75,textOptions)
pdfWriter.writePage(page)
pdfWriter.end();


//init modification writer
var modPdfWriter = hummus.createWriterToModify('./source.pdf', {modifiedFilePath:'./output.pdf', compress:false});

//get references to the contents stream on the relevant page (first, in this instance)
var sourceParser = modPdfWriter.createPDFCopyingContextForModifiedFile().getSourceDocumentParser();
var pageObject = sourceParser.parsePage(0);
var textObjectID = pageObject.getDictionary().toJSObject().Contents.getObjectID();
var textStream = sourceParser.queryDictionaryObject(pageObject.getDictionary(), 'Contents');

//read the original block of text data
var data = [];
var readStream = sourceParser.startReadingFromStream(textStream);
while(readStream.notEnded()){
  var readData = readStream.read(10000);
  data = data.concat(readData);
}

//create new string
var string = Buffer.from(data).toString();
string = string.replace(/Example text/g, 'Exmpl txt');

//Create and write our new text object
var objectsContext = modPdfWriter.getObjectsContext();
objectsContext.startModifiedIndirectObject(textObjectID);

var stream = objectsContext.startUnfilteredPDFStream();
stream.getWriteStream().write(strToByteArray(string));
objectsContext.endPDFStream(stream);

objectsContext.endIndirectObject();

modPdfWriter.end();

//removes old objects no longer in use
hummus.recrypt('./output.pdf', './outputClean.pdf');

function strToByteArray(str) {
  var myBuffer = [];
  var buffer = Buffer.from(str);
  for (var i = 0; i < buffer.length; i++) {
      myBuffer.push(buffer[i]);
  }
  return myBuffer;
}

Note that this only works if every character of the new text already appears in the PDF (I think fonts are embedded with only the characters the document uses, so anything else has no glyph data). To run the code you also need to supply a font file for writing the example text.

@alexey-sh

@BrighTide it seems like your code doesn't work. The outputClean.pdf and output.pdf files are empty.

@filmerjarred

Revisited the old code and this is what shook out in the end. It is working for us to this day.


const hummus = require('hummus')

module.exports = function redactPDF ({filePath, patterns}) {
	const modPdfWriter = hummus.createWriterToModify(filePath, {modifiedFilePath: `${filePath}-modified`, compress: false})
	const numPages = modPdfWriter.createPDFCopyingContextForModifiedFile().getSourceDocumentParser().getPagesCount()

	for (let page = 0; page < numPages; page++) {
		const copyingContext = modPdfWriter.createPDFCopyingContextForModifiedFile()
		const objectsContext = modPdfWriter.getObjectsContext()

		const pageObject = copyingContext.getSourceDocumentParser().parsePage(page)
		const textStream = copyingContext.getSourceDocumentParser().queryDictionaryObject(pageObject.getDictionary(), 'Contents')
		const textObjectID = pageObject.getDictionary().toJSObject().Contents.getObjectID()

		let data = []
		const readStream = copyingContext.getSourceDocumentParser().startReadingFromStream(textStream)
		while (readStream.notEnded()) {
			const readData = readStream.read(10000)
			data = data.concat(readData)
		}

		const pdfPageAsString = Buffer.from(data).toString()

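		const toRedactString = findInText({patterns, string: pdfPageAsString})

		// Leave the page unchanged when nothing matches; otherwise escape regex
		// metacharacters and replace each match with the same number of dashes
		const redactedPdfPageAsString = toRedactString
			? pdfPageAsString.replace(new RegExp(toRedactString.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'), 'g'), '-'.repeat(toRedactString.length))
			: pdfPageAsString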

		// Create what will become our new text object
		objectsContext.startModifiedIndirectObject(textObjectID)

		const stream = objectsContext.startUnfilteredPDFStream()
		stream.getWriteStream().write(strToByteArray(redactedPdfPageAsString))
		objectsContext.endPDFStream(stream)

		objectsContext.endIndirectObject()
	}

	modPdfWriter.end()

	hummus.recrypt(`${filePath}-modified`, filePath)
}

function findInText ({patterns, string}) {
	for (let pattern of patterns) {
		const match = new RegExp(pattern, 'g').exec(string)
		if (match) {
			if (match[1]) {
				return match[1]
			}
			else {
				return match[0]
			}
		}
	}

	return false
}

function strToByteArray (str) {
	let myBuffer = []
	let buffer = Buffer.from(str)
	for (let i = 0; i < buffer.length; i++) {
		myBuffer.push(buffer[i])
	}
	return myBuffer
}

@dongnthut19

Please explain 'let toRedactString = findInText({patterns, string: pdfPageAsString})'. I don't understand that code.

@filmerjarred

findInText is defined further down. It tries each regex in the patterns array in order and returns the first match (the first capture group if the pattern has one). For example,
findInText({patterns: [/abc/], string: pdfPageAsString})
would try to find 'abc' somewhere in the page's content stream, after which redactPDF would redact it.

You might also be unfamiliar with the ES6 property value shorthand that's being used: http://www.benmvp.com/learning-es6-enhanced-object-literals/#property-value-shorthand

@dongnthut19

Thanks @BrighTide. I copied your code and ran it, but it could not find the text I need: 'toRedactString = undefined'. Please see my code:

function findInText(
  patterns: any,
  strRplace: string,
) {
  for (const pattern of patterns) {
    const match = new RegExp(pattern, 'g').exec(strRplace);
    if (match) {
      if (match[1]) {
        return match[1];
      }
      return match[0];
    }
  }
}

function replaceText1(sourceFile: string, targetFile: string, patterns: any) {
  const modPdfWriter = hummus.createWriterToModify(sourceFile, { modifiedFilePath: targetFile, compress: false });
  const numPages = modPdfWriter.createPDFCopyingContextForModifiedFile().getSourceDocumentParser().getPagesCount();

  for (let page = 0; page < numPages; page += 1) {
    const copyingContext = modPdfWriter.createPDFCopyingContextForModifiedFile();
    const objectsContext = modPdfWriter.getObjectsContext();

    const pageObject = copyingContext.getSourceDocumentParser().parsePage(page);
    const textStream = copyingContext.getSourceDocumentParser().queryDictionaryObject(pageObject.getDictionary(), 'Contents');
    const textObjectID = pageObject.getDictionary().toJSObject().Contents.getObjectID();

    let data: any = [];
    const readStream = copyingContext.getSourceDocumentParser().startReadingFromStream(textStream);
    while (readStream.notEnded()) {
      const readData = readStream.read(10000);
      data = data.concat(readData);
    }

    const pdfPageAsString = Buffer.from(data).toString();
    console.log('pdfPageAsString = ', pdfPageAsString);

    const toRedactString = findInText(patterns, pdfPageAsString);

    console.log('toRedactString = ', toRedactString);

    // Default to the unmodified content so a page without a match is not blanked
    let redactedPdfPageAsString: string = pdfPageAsString;
    if (toRedactString !== undefined) {
      redactedPdfPageAsString = pdfPageAsString.replace(new RegExp(toRedactString, 'g'), new Array(toRedactString.length).join('-'));
    }

    // Create what will become our new text object
    objectsContext.startModifiedIndirectObject(textObjectID);

    const stream = objectsContext.startUnfilteredPDFStream();
    stream.getWriteStream().write(strToByteArray(redactedPdfPageAsString));
    objectsContext.endPDFStream(stream);

    objectsContext.endIndirectObject();
  }

  modPdfWriter.end();

  return;
}

replaceText1(sourcePDF, destinationPDF, [/amount/]);

@DNikolic-Paycor

Hey!
I really need help with the pattern-matching part of your function.
I think there is a problem in:
function findInText ({patterns, string}) {}
when you pass more than one character in the pattern, because console.log('pdfPageAsString = ', pdfPageAsString) returns a string that looks like
5.27877 0 Td (c)Tj 5.2798 0 Td (t)Tj 6.35974 0 Td (t)Tj 3.35985 0 Td (h)Tj 6 0 Td (e)Tj 8.27865 0 Td (g)Tj 5.87977 0 Td (r)Tj 3.95982 0 Td (a)Tj 5.27977 0 Td (p)Tj 6 0 Td
So how can a regex like /abc/g find anything in this string?
The function works when you pass a single character to the regex, so could you help me with this?
Best regards!
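One way to match a word that is split into one Tj per character is to join the string operands, search the joined text, and map each matched character back to its position in the stream. The sketch below does that for redaction. `collectShownText` and `redactAcrossOperators` are hypothetical helpers, and the sketch assumes literal (parenthesized) strings in a single-byte encoding where the bytes equal the visible characters. It ignores the TJ array operator, hex strings, nested parentheses and escape decoding, and it cannot see spaces that are produced only by positioning.

```javascript
// Collect every literal-string Tj operand with its offset in the stream.
function collectShownText(content) {
  const pieces = [];
  const re = /\(((?:\\.|[^\\()])*)\)\s*Tj/g;
  let m;
  while ((m = re.exec(content)) !== null) {
    // m.index points at "(", so the operand text starts one character later.
    pieces.push({ text: m[1], start: m.index + 1 });
  }
  return pieces;
}

// Replace every occurrence of `needle` with dashes, even when its
// characters are spread over several Tj commands.
function redactAcrossOperators(content, needle) {
  const offsets = []; // offsets[k] = stream position of joined[k]
  let joined = '';
  for (const piece of collectShownText(content)) {
    for (let k = 0; k < piece.text.length; k++) {
      offsets.push(piece.start + k);
      joined += piece.text[k];
    }
  }
  const chars = content.split('');
  let from = needle.length > 0 ? joined.indexOf(needle) : -1;
  while (from !== -1) {
    for (let k = 0; k < needle.length; k++) chars[offsets[from + k]] = '-';
    from = joined.indexOf(needle, from + needle.length);
  }
  return chars.join('');
}

const stream = '5.27 0 Td (c)Tj 5.27 0 Td (a)Tj 6.35 0 Td (t)Tj 3.35 0 Td (s)Tj';
console.log(redactAcrossOperators(stream, 'cat'));
```

Because each character is overwritten in place, the positioning operands stay untouched and the stream keeps its length, which avoids re-laying out the line.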

@nithinkashyapn

Hey,

Thanks for the snippet

But when running it I am getting the following error:

var textObjectId = pageObject.getDictionary().toJSObject().Contents.getObjectID();                                                            

TypeError: pageObject.getDictionary(...).toJSObject(...).Contents.getObjectID is not a function

I also tried downgrading the package, but the downgrade did not take effect.
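One possible cause of this TypeError (not confirmed anywhere in this thread) is that the page's Contents entry is an array of stream references rather than a single reference, in which case the value has no getObjectID method. A defensive sketch is below. `contentStreamIds` is a hypothetical helper; it assumes that hummus indirect references expose getObjectID() and that hummus arrays expose toJSArray(), so check both against the hummus version you have installed.

```javascript
// Return the object IDs of all content streams referenced by a page's
// Contents entry, whether it is one reference or an array of references.
function contentStreamIds(contentsEntry) {
  if (!contentsEntry) return []; // page without content
  if (typeof contentsEntry.getObjectID === 'function') {
    return [contentsEntry.getObjectID()]; // single indirect reference
  }
  if (typeof contentsEntry.toJSArray === 'function') {
    return contentsEntry.toJSArray() // array of indirect references
      .filter((item) => typeof item.getObjectID === 'function')
      .map((item) => item.getObjectID());
  }
  return []; // a direct object or an unknown shape: nothing to modify by ID
}

// Intended use with the snippet above (requires hummus and a parsed page):
// const ids = contentStreamIds(pageObject.getDictionary().toJSObject().Contents);
```

With several content streams, a search string may also straddle two streams, so each ID would need its own read, replace and rewrite pass.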

@tinwinaung

I also get the same error

@bingomvm

@kicaUBUNTU I have the same question. Did you solve this problem? Can you tell me how you solved it?

@venkatarajeshm

Hi, thank you for the code. With the help of this snippet I could extract the text in the TJ operators and replace it. However, the text from all TJs disappeared in the output PDF. I guess it has something to do with the font? How can I embed a font via the CopyingContext? Please help.
Best regards.

@mohammedabualsoud

@venkatarajeshm I got this error
TypeError: pageObject.getDictionary(...).toJSObject(...).Contents.getObjectID is not a function

@Cali93

Cali93 commented Oct 8, 2019

I'm having the same problem, and the issue arises even before findInText: the bytes coming from the readStream are already formatted like that. I have no clue how to solve it, as I'm new to Hummus and PDF manipulation. I'm not sure, but it might be because the text is vectorised, since in some cases the text does come through as a normal string.

But it might be related to what @galkahana said above under "How to parse the text": the strings passed to Tj are encoded per the current font, so they have to be decoded with that font's encoding or unicode map, and words may be split across several Tj commands.

@apic-apps

Did anyone find a solution for this error?

TypeError: pageObject.getDictionary(...).toJSObject(...).Contents.getObjectID is not a function

@duchiep123

@venkatarajeshm Hi venkatarajeshm, have you successfully replaced the text?
So far I have located the text between BT and ET:

BT
/F6 16 Tf 1 0 0 -1 0 0 Tm
230 -286 Td <0012> Tj
10.6718750 0 Td <0003> Tj
6.21875000 0 Td <0004> Tj
8.89062500 0 Td <0015> Tj
8.89062500 0 Td <0040> Tj
8.89062500 0 Td <0010> Tj
9.76562500 0 Td <000A> Tj
ET

<0012>, <0003> and so on are characters, right?
But I don't know how they are encoded; I only know that the encoding depends on the current font in the file.
I want to find the email address in the PDF file, so I have to locate the @ character. But each font has a different encoding:
in the example above <0040> is the @ character, but when I tested a different font it was not <0040>.
So is there a way to find out which code represents the @ character in a specific PDF file?

I really need it.
Thank you so much.
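If the font carries a ToUnicode CMap (the unicode map mentioned earlier in the thread), the code for a given character can be looked up there. The sketch below parses only the bfchar sections of a CMap that has already been read into a string. `parseBfChars`, `hexToText` and `codesFor` are hypothetical helpers, and the sample CMap text is made up for illustration. Reading the font's ToUnicode stream with hummus is left out, bfrange sections are not handled, and fonts without a ToUnicode map need their Encoding entry instead.

```javascript
// Convert a ToUnicode destination value (UTF-16BE, four hex digits per unit) to text.
function hexToText(hex) {
  let out = '';
  for (let i = 0; i + 4 <= hex.length; i += 4) {
    out += String.fromCharCode(parseInt(hex.slice(i, i + 4), 16));
  }
  return out;
}

// Parse the bfchar sections of a ToUnicode CMap into a Map of code -> text.
function parseBfChars(cmapText) {
  const map = new Map();
  const section = /beginbfchar([\s\S]*?)endbfchar/g;
  let s;
  while ((s = section.exec(cmapText)) !== null) {
    const pair = /<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g;
    let p;
    while ((p = pair.exec(s[1])) !== null) {
      map.set(p[1].toUpperCase(), hexToText(p[2]));
    }
  }
  return map;
}

// Find every character code that the font maps to the given character.
function codesFor(map, character) {
  return [...map].filter(([, text]) => text === character).map(([code]) => code);
}

// Made-up CMap fragment: in this font "@" is stored as <0023>, not <0040>.
const sampleCMap = '2 beginbfchar\n<0012> <0041>\n<0023> <0040>\nendbfchar';
console.log(codesFor(parseBfChars(sampleCMap), '@'));
```

The same map, read in the other direction, decodes each hex string in the content stream, so the whole email address can be reconstructed before searching for it.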

@creativebull

I got the error which the others mentioned before. Did anyone find a solution for this error?

TypeError: pageObject.getDictionary(...).toJSObject(...).Contents.getObjectID is not a function

@Suraj0704

@galkahana @filmerjarred
I want to modify a PDF as follows:
first search for a string in the PDF, then make that string bold.
I need this for a project, so please suggest any solution.
Thanks in advance.
