Is it possible to do word by word comparison? #21

Closed
GoogleCodeExporter opened this issue Mar 27, 2015 · 8 comments

Comments

@GoogleCodeExporter

This is a great piece of code. I have one concern about the comparison. Is it 
possible to change the logic so that it does a word-by-word 
comparison? 

I have written the following test code in Java:
========================================================================
// Requires: import java.util.LinkedList;
String text1 = "Hello how are you. <br/> My name is Prathyusha. Shravanthi is my friend.";
String text2 = "Hello. <br/>My name is Shravanthi. Prathyusha was my friend.";
diff_match_patch diff = new diff_match_patch();
// Character-level diff, followed by the semantic cleanup pass.
LinkedList<diff_match_patch.Diff> diffs = diff.diff_main(text1, text2);
diff.diff_cleanupSemantic(diffs);
String result = diff.diff_prettyHtml(diffs);
System.out.println(result);
=====================================================================

I see the following output
--------------------------------------------------------
Hello<DEL> how are you</DEL>. <br/><DEL> </DEL>My name is <DEL>Prathyusha. 
Shravanthi i</DEL><INS>Shravanthi. Prathyusha wa</INS>s my friend.
--------------------------------------------------------

As the output shows, the logic doesn't break the differences at word 
boundaries. Rather, it compares arbitrary chunks of the string. A word-by-word 
comparison would help in getting a precise count of the newly added words 
and the deleted words. Additionally, if we look at the deleted text 
<DEL>Prathyusha. Shravanthi i</DEL> and the inserted text <INS>Shravanthi. 
Prathyusha wa</INS>, the middle characters ' ' and '.' are common to both. 
They should not be reported as deleted and inserted. This problem wouldn't 
have arisen with a word-by-word comparison. As it stands, the overall word 
count is increased by 4, counting ' ' and '.' as 2 words.
Is it possible to do a word-by-word comparison?
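
To illustrate the counting described above: once a diff is reported at word 
granularity, the added and deleted word counts can be tallied per operation. 
The following is only a minimal sketch. The Op enum and the Edit class are 
hypothetical stand-ins written for this example (they are not the library's 
own types), and a "word" is assumed to be anything separated by whitespace.

```java
import java.util.List;

public class WordCountSketch {
    /** Stand-in for a diff operation; defined here only for illustration. */
    public enum Op { DELETE, INSERT, EQUAL }

    /** Hypothetical stand-in for one diff entry: an operation plus its text. */
    public static class Edit {
        public final Op op;
        public final String text;

        public Edit(Op op, String text) {
            this.op = op;
            this.text = text;
        }
    }

    /** Counts whitespace-separated words in all edits that carry the given operation. */
    public static int countWords(List<Edit> edits, Op op) {
        int count = 0;
        for (Edit e : edits) {
            if (e.op != op) {
                continue;
            }
            String trimmed = e.text.trim();
            if (!trimmed.isEmpty()) {
                // Assumption: any run of whitespace separates two words.
                count += trimmed.split("\\s+").length;
            }
        }
        return count;
    }
}
```

Calling countWords once with Op.INSERT and once with Op.DELETE gives the two 
totals. With character-level chunks such as "Shravanthi i" the result would 
be misleading, which is why the diff needs to be word-aligned first.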

Regards,
Pratap

Original issue reported on code.google.com by pratapma...@yahoo.com on 4 Jul 2009 at 1:51

@GoogleCodeExporter

Word-by-word diffing is not a native function of this library.  However, it is 
easy to do.  You need to break your text into words (how you define a word is a 
more interesting problem than you might think), create a lookup table of Unicode 
characters to words, build two strings made up of the Unicode characters 
associated with each word, diff those two strings, then convert the diff back 
into the text.

Sounds complicated, but it's not -- because the code has already been written 
for you.  Just look at the diff_linesToChars and diff_charsToLines functions.  
Copy them and make them split on words instead of lines.  Then your code will 
just be:

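  // diff_wordsToChars and diff_charsToWords are the word-based copies you write.
  Object[] b = diff_wordsToChars(text1, text2);
  String wordText1 = (String) b[0];
  String wordText2 = (String) b[1];
  ArrayList<String> wordarray = (ArrayList<String>) b[2];
  // Diff the encoded strings, then map each character back to its word.
  LinkedList<Diff> diffs = diff_main(wordText1, wordText2, false);
  diff_charsToWords(diffs, wordarray);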

Have fun defining what a "word" is.  Been there, done that on another project.  :)
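
For illustration, the encoding and decoding steps described above might look 
roughly like the sketch below. It is self-contained and is not the library's 
code: the class name WordEncodingSketch, the Result holder and the munge 
helper are made up for this example, and a "word" is assumed to be a run of 
characters ending just after the next space or newline. The diff_main call 
that would sit between encoding and decoding is left out, and charsToWords 
here decodes a plain string rather than a list of Diff objects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordEncodingSketch {
    /** Holds both encoded strings plus the lookup table (index -> word). */
    public static class Result {
        public final String chars1;
        public final String chars2;
        public final List<String> wordArray;

        Result(String chars1, String chars2, List<String> wordArray) {
            this.chars1 = chars1;
            this.chars2 = chars2;
            this.wordArray = wordArray;
        }
    }

    /** Encodes both texts so that every distinct word becomes a single char. */
    public static Result wordsToChars(String text1, String text2) {
        List<String> wordArray = new ArrayList<>();
        Map<String, Integer> wordHash = new HashMap<>();
        wordArray.add(""); // reserve index 0 so that no word maps to '\0'
        String chars1 = munge(text1, wordArray, wordHash);
        String chars2 = munge(text2, wordArray, wordHash);
        return new Result(chars1, chars2, wordArray);
    }

    // Splits text into words and appends one char per word.
    // Caveat: more than 65535 distinct words would overflow a Java char.
    private static String munge(String text, List<String> wordArray,
                                Map<String, Integer> wordHash) {
        StringBuilder chars = new StringBuilder();
        int start = 0;
        while (start < text.length()) {
            int end = start;
            // Assumption: only space and newline act as word delimiters.
            while (end < text.length()
                    && text.charAt(end) != ' ' && text.charAt(end) != '\n') {
                end++;
            }
            if (end < text.length()) {
                end++; // keep the delimiter attached to the word
            }
            String word = text.substring(start, end);
            Integer index = wordHash.get(word);
            if (index == null) {
                index = wordArray.size();
                wordArray.add(word);
                wordHash.put(word, index);
            }
            chars.append((char) index.intValue());
            start = end;
        }
        return chars.toString();
    }

    /** Expands an encoded string back into the original text. */
    public static String charsToWords(String chars, List<String> wordArray) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < chars.length(); i++) {
            text.append(wordArray.get(chars.charAt(i)));
        }
        return text.toString();
    }
}
```

Because each word is replaced by exactly one character, a character-level 
diff of chars1 and chars2 can only ever split the text at word boundaries.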

Original comment by neil.fra...@gmail.com on 4 Jul 2009 at 2:24

  • Changed state: WontFix
  • Added labels: Type-Enhancement
  • Removed labels: Type-Defect

@GoogleCodeExporter

Thanks for your input Neil. I will try that.

Original comment by pratapma...@yahoo.com on 6 Jul 2009 at 5:24


@GoogleCodeExporter

I need this enhancement too:)

Original comment by chunhaic...@gmail.com on 17 Apr 2010 at 12:31


@GoogleCodeExporter

I implemented the word-by-word diff (yes, it was easy), and it does a pretty 
good job just tokenizing on spaces and newlines.
Something like:
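  wordEndSpace = text.indexOf(' ', wordStart);
  wordEndNewline = text.indexOf('\n', wordStart);
  // indexOf returns -1 when nothing is found; treat that as end of text.
  if (wordEndSpace == -1) wordEndSpace = text.length();
  if (wordEndNewline == -1) wordEndNewline = text.length();
  wordEnd = Math.min(wordEndSpace, wordEndNewline);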
That will do the trick effectively. You could of course do a nicer array 
version if you have more delimiters to match (punctuation etc.), or perhaps 
use a regexp as well.
I guess the reason the simple version works well for me is that the text is
preprocessed (from HTML) in a rather cool way, so whitespace in the text 
matches the HTML rendering quite closely.
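
A regexp version could look something like the sketch below. It is only one 
possible definition of a "word" and is an assumption for this example: a 
token is a run of letters and digits, a run of whitespace, or a single other 
character such as a punctuation mark. The class name WordTokenizerSketch is 
made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordTokenizerSketch {
    // Alternatives, in order: letters/digits run, whitespace run, any other single character.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+|\\s+|[^\\p{L}\\p{N}\\s]");

    /** Splits text into tokens; joining the tokens again reproduces the text. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

With this definition, "Prathyusha." becomes two tokens, so a name followed by 
a full stop still matches the same name followed by a space.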

Thanks for a great little piece of code, Neil.

Regards,
Mads Buus Westmark

Original comment by madsbuus...@gmail.com on 20 Apr 2010 at 11:45


@GoogleCodeExporter

I need this feature also.

Original comment by g33.ad...@gmail.com on 25 Jun 2011 at 11:26


@GoogleCodeExporter

Hi,
The diff_linesToChars function has LinesToCharsResult as its return type. 
Are any changes required for diff_wordsToChars()?

  Object b[] = diff_wordsToChars(text1, text2);
  String wordText1 = (String) b[0];
  String wordText2 = (String) b[1];
  wordarray = (ArrayList<String>) b[2];
  LinkedList<Diff> diffs = diff_main(wordText1, wordText2, false);
  diff_charsToWords(diffs, wordarray);

Using the diff_match_patch class, we get a character comparison, not a word 
comparison. What changes are required for a word comparison?

Thanks in advance

Original comment by monigov...@gmail.com on 17 Nov 2011 at 7:06

