problem with hyphen #10

Open
phisu opened this issue Aug 8, 2016 · 8 comments

Comments

@phisu

phisu commented Aug 8, 2016

hello Christian,

we find hyphens in almost every pdf. when a hyphen is at the end of a line, i guess we are mostly not interested in it. the quality of the extracted text would probably be better if such hyphens were eliminated. this could be done by an extra cleanup of the output of your class, or by your class itself. what do you think about that?

philipp
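
The "extra cleanup of the output" suggested above could look roughly like the following. This is a minimal sketch in Python, assuming the extracted text is available as a plain string; `unhyphenate` is a hypothetical helper and not part of PdfToText. It joins two word fragments only when a hyphen sits directly after a letter at the end of a line and the next line starts with a letter.

```python
import re

# Hypothetical post-processing helper, NOT part of PdfToText.
# Matches: word character, hyphen, optional trailing blanks, a line break,
# optional leading blanks, then the first word character of the next line.
_EOL_HYPHEN = re.compile(r"(\w)-[ \t]*\r?\n[ \t]*(\w)")


def unhyphenate(text: str) -> str:
    """Remove end-of-line hyphens and join the two word fragments."""
    return _EOL_HYPHEN.sub(r"\1\2", text)


if __name__ == "__main__":
    sample = "the quality of the ex-\ntracted text is maybe bet-\nter"
    print(unhyphenate(sample))
    # prints: the quality of the extracted text is maybe better
```

Hyphens inside a line (such as "white-spaces") and dashes surrounded by spaces are left alone, because the pattern requires a line break right after the hyphen.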

@christian-vigh-phpclasses
Owner

Hello Philipp,

Well, to tell the truth, the initial version of my class did suppress
hyphens; I noticed that when running it with the Microsoft RTF
Specifications, converted to a PDF file.

I eventually removed that feature because, during the following weeks, I did
not come across any new sample showing such hyphens, and I was afraid of
side effects.

However, it now seems to make sense to put it back. I think I will add
a PDFOPT_UNHYPHENATE option in the constructor, so that the output text will
be post-processed to remove hyphens.

I will get back to you when the new version is available.

Christian.


@christian-vigh-phpclasses
Owner

Oops, I completely forgot: do you have a sample to give me, or can you
point me to a sample you have already sent me?


@christian-vigh-phpclasses
Owner

Hello Philipp,

I’m glad to tell you that the PdfToText V1.2.36 class is now able to
“un-hyphenate” words. Simply specify PDFOPT_NO_HYPHENATED_WORDS in the
$options parameter of the constructor or of the Load() method.

I’ve noticed one unwanted side effect in your sample
“150701-DSE-Katalog-verlinkt.pdf”: the output text

        à-la-carte-

        Speisen

is displayed as:

        à-la-carteSpeisen

Maybe it will get better once I have implemented a more robust management
of x/y coordinates, but don’t expect miracles!

However, the rest of the text, which contains many hyphenated
words, looks fine.

Christian.
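
The "à-la-carte" side effect above comes from an ambiguity: a hyphen at the end of a line can be either a soft line-break hyphen or a real part of a compound word. One possible text-only heuristic, sketched below in Python, keeps the hyphen when the fragment before it already contains a hyphen. This is an assumption made for illustration, not how PdfToText decides; `unhyphenate_keep_compounds` is a hypothetical name, and the heuristic will still misjudge compounds that contain only one hyphen.

```python
import re

# Hypothetical heuristic, NOT the library's behaviour.
# group(1): the non-blank fragment before the trailing hyphen
# group(2): the first word character after one or more line breaks
_BROKEN_WORD = re.compile(r"(\S+)-[ \t]*(?:\r?\n[ \t]*)+(\w)")


def _join(match):
    left, right = match.group(1), match.group(2)
    # A fragment that already contains a hyphen ("à-la-carte") is treated
    # as a compound word, so its trailing hyphen is kept.
    separator = "-" if "-" in left else ""
    return left + separator + right


def unhyphenate_keep_compounds(text: str) -> str:
    """Join words broken across lines, keeping hyphens of compounds."""
    return _BROKEN_WORD.sub(_join, text)


if __name__ == "__main__":
    print(unhyphenate_keep_compounds("à-la-carte-\n\nSpeisen"))
    # prints: à-la-carte-Speisen
    print(unhyphenate_keep_compounds("ex-\ntracted"))
    # prints: extracted
```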


@phisu
Author

phisu commented Aug 9, 2016

hello christian,
i will take a closer look at the hyphens in my pdf files. but i tried your new version [Version : 1.2.36] [Date : 2016/08/07] with the following file:

https://www.digitales.oesterreich.gv.at/documents/22124/30428/BarrierefreiesInternet_WCAG_Aspekte_SdOeB_20100818.pdf/9dc7ffb9-6420-406d-be6e-a0624e91547b

the output starts with:

61111113111399111111111111391121111911111311137111111146111911113 43
6111111311139911111111111139112111191111131113711111119111911113
7111111111361111111111119113113111381911911131213211111119111111911111111114336111111911111911387745311137444434545454443
61311113111111111111311131111911399191311 111111137111111111111921911114311131113911211119111113111371111111911191111312139
11139111111111131911111311131111111131119111111436111111119113191213111153721381111111311138119111111111111
3■3
1111311111111113111911111381111111 11111139114361111111131113811111111111111347112437119111
911119143911911911119114433■3

with version [Version : 1.2.35] [Date : 2016/08/06] the output for the same file was fine.

philipp

@phisu
Author

phisu commented Aug 9, 2016

hello christian,
i think the elimination of hyphens is less important than accurate output of whitespace and line breaks.

philipp

@christian-vigh-phpclasses
Owner

Hello Philipp,

It’s too late! I implemented this feature in the early versions of my class,
then removed it because I feared side effects.

I have added it again: it was a small change and took me an hour to complete.
Sometimes I need to work on easy things…

Christian.


@christian-vigh-phpclasses
Owner

Hello Philipp,

I solved this problem late last night, before you ran your tests.

It was due to my complete reworking of how I’m handling Unicode to UTF8 translations. One internal function, which used to accept a character as a parameter, now accepts an integer value. I just missed 2 calls in my code which were still supplying a character value as a parameter.

The latest version, 1.2.38, fixes that (I tried it on the sample you sent me).

Christian.
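
To illustrate the kind of function described above, here is a minimal Python sketch of a converter that takes an integer code point and returns its UTF-8 bytes. It is not the library's internal function; `codepoint_to_utf8` is a hypothetical name. The explicit type check shows one way to make a missed call site, one that still passes a character instead of an integer, fail loudly instead of producing garbled output.

```python
def codepoint_to_utf8(codepoint: int) -> bytes:
    """Encode one Unicode code point (an integer) as UTF-8 bytes.

    Hypothetical illustration, not PdfToText code. Surrogate code points
    are not rejected, to keep the sketch short.
    """
    # Reject callers that still pass a character (or anything non-integer).
    if not isinstance(codepoint, int) or isinstance(codepoint, bool):
        raise TypeError("expected an integer code point, got %r" % (codepoint,))
    if codepoint < 0 or codepoint > 0x10FFFF:
        raise ValueError("code point out of range: %r" % (codepoint,))

    if codepoint < 0x80:            # 1 byte:  0xxxxxxx
        return bytes([codepoint])
    if codepoint < 0x800:           # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:         # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])


if __name__ == "__main__":
    print(codepoint_to_utf8(0xE0))       # prints: b'\xc3\xa0'
    print(codepoint_to_utf8(ord("à")))   # same result: convert the character first
```

A caller that holds a character has to convert it with `ord()` first; passing the character itself raises `TypeError` in this sketch.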


@shravspy

I want to keep the hyphens in my PDF. Is there an option not to remove them with layout? As of now it removes all the hyphens from the table in my PDF.
