Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Html diff render #78

Merged

Conversation

janakrajchadha
Copy link
Contributor

@janakrajchadha janakrajchadha commented Aug 10, 2017

WIP

I've tried and tested this differ. It works well on the few test cases I've tried it on.
There are better ways to implement some parts of the differ but most of the ones I've found include the reconstruction of the HTML tree which leads to loss of attribute data from tags like <head> and <body>.
Some changes in the source code of lxml would also make this job easier but there's probably a good reason they haven't done that.

Please review @danielballan :)

@janakrajchadha
Copy link
Contributor Author

Fixes #69

@Mr0grog
Copy link
Member

Mr0grog commented Aug 20, 2017

@janakrajchadha What’s the status on this? Just curious since it sounded like we came to an agreement over a week ago that we should not just remove ins and del elements and add our own, but there have been no updates to this PR since before then. The tests here still look for the addition of simple ins and del elements, too.

@janakrajchadha
Copy link
Contributor Author

@Mr0grog I went deeper into lxml's code to understand where we could alter it to add our own tags. Turns out that there's no parameter which allows us to pass our own tags, we'll have to alter the source code in a few places.
Also found out that lxml uses difflib in one of the layers for this diff. I had shared the possible paths to take with @danielballan, waiting for his views on it.
Understanding and reconstructing this based on our needs may take a little longer than earlier expected.

@janakrajchadha janakrajchadha changed the title Html diff render WIP:Html diff render Aug 21, 2017
@janakrajchadha janakrajchadha changed the title WIP:Html diff render WIP: Html diff render Aug 21, 2017
@Mr0grog
Copy link
Member

Mr0grog commented Sep 6, 2017

@danielballan @janakrajchadha does it make sense to merge and deploy this in the mean time, improving the implementation not to cause problems on pages with <ins>/<del> tags and the like later?

@Mr0grog
Copy link
Member

Mr0grog commented Sep 6, 2017

Update: decided in weekly meeting to merge this and iterate later, pending a deeper review of the existing implementation here by @danielballan.

@janakrajchadha
Copy link
Contributor Author

janakrajchadha commented Sep 9, 2017

@Mr0grog @danielballan I was planning to create my own version of htmldiff based on lxml's source code. As discussed earlier, lxml uses InsensitiveSequenceMatcher which is inherited from SequenceMatcher class of difflib. The problem we faced with lxml's implementation is that it does not allow custom tags to be added instead of simple <ins>/<del> tags.
Anastasia, who has joined us on Slack, has worked on similar tools and I was lucky enough to come across a solution for our problem while going through her diff repositories.
You can take a look at her implementation of htmldiff.
Also, lxml's htmldiff_tokens .
I'll test it out and give you an update soon.
Let me know if you have any suggestions.

@danielballan
Copy link
Contributor

Sounds good. Shall we leave this PR in limbo in the meantime?

@Mr0grog
Copy link
Member

Mr0grog commented Sep 11, 2017

Sorry I have been slow to respond. Unless someone has a PR they are ready to submit, I would not like to leave this in limbo. I would like to have something for our most critical diff, even if we completely replace it in the near future.

@danielballan
Copy link
Contributor

That is a very fair point. Post v0 a lot of our diffing stuff might get superceded by collaborative projects, but this will suffice for v0. I'll add some additional tests. @janakrajchadha, carry on with your collaborative work with Anastasia.

As requested by @danielballan, I've added more tests for html diff
render.
@danielballan, take a look and let me know if these are along the lines
of what you had in mind.
@janakrajchadha
Copy link
Contributor Author

@danielballan I'm not sure why I'm getting the FileNotFound error. I'm looking for reasons.
Would you rather have all the test cases defined in the code itself?

@danielballan
Copy link
Contributor

Good test cases. As you've discovered, bundling static files into a Python package properly is not straightforward. We'll probably find a good reason to do that later (and I'll show you how) but in this case I think it's easier and better to just define the test cases in the code itself.

@janakrajchadha
Copy link
Contributor Author

@danielballan Fixed, should this be merged now?

@danielballan danielballan merged commit 303239f into edgi-govdata-archiving:master Sep 14, 2017
@danielballan
Copy link
Contributor

Great, thanks @janakrajchadha.

@janakrajchadha janakrajchadha deleted the html-diff-render branch September 14, 2017 14:46
Mr0grog added a commit that referenced this pull request Sep 16, 2017
This should have been done in #78, but was apparently missed.
Mr0grog added a commit that referenced this pull request Sep 17, 2017
This should have been done in #78, but was apparently missed.
@janakrajchadha janakrajchadha mentioned this pull request Oct 9, 2017
janakrajchadha pushed a commit to janakrajchadha/web-monitoring-processing that referenced this pull request Oct 13, 2017
This should have been done in edgi-govdata-archiving#78, but was apparently missed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants