Introduction
============
This code implements a solution to the problem detailed below.


How to build
============
This project uses a simple Maven build.
Prerequisite: Maven must be installed and accessible from the command prompt.
From the project root folder, run "mvn clean install".


How to use
==========
Run the following command against the generated jar:
java -jar Exercice-0.1.1-SNAPSHOT-jar-with-dependencies.jar [firstFile.xml] [secondFile.xml]
(or, from the project sources, cd to the bin directory and launch ./exercice.sh)



Problem to solve
================

The problem to solve is the synchronization of two filesystem 
directories. Supposing the original structure were:
 
<tree name="root">
<tree name="home">
<tree name="joe">
<file name="a.txt" modif-date="20120102T1030" size="5032"/>
</tree>
</tree>
</tree>
 
(which comes down to having one file /home/joe/a.txt). Now, if the 
modified structure were something like:
 
<tree name="root">
<tree name="home">
<tree name="joe">
<file name="a.txt" modif-date="20120102T1030" size="5032"/>
<file name="b.txt" modif-date="20120102T1030" size="1234"/>
</tree>
</tree>
</tree>
 
the diff between the two would be the new file /home/joe/b.txt.

The action to be taken is to create /home/joe/b.txt using a 
possibly mock implementation of the following Java interface:
https://github.com/dtrott/fuse4j/blob/master/maven/fuse4j-core/src/main/java/fuse/Filesystem3.java
 
The goal of the exercise is to write a program in the Java language that 
is able to detect diffs between such XML descriptions for a number of 
common cases:
* file update
* new file
* new directory
* file rename
* ...
 
The provided tests will be automatically reproducible, will prove the 
correct working of the code and will illustrate the usage of the 
software. The provided package should respect the standard Java layout 
and have an automated command-line build which can compile the sources 
and run the tests.


Technical Specifications
========================
  
  Program inputs :
  ----------------
  Two distinct XML files, each representing a filesystem structure with tree tags (for directories) and file tags (for files). 
  
  Program output :
  ----------------
  A data structure identifying the differences between the two filesystem structures provided as XML input files.
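  
  As an illustration, such a result could be as simple as a list of entries pairing a change type with the affected path. The names below are hypothetical and only sketch the idea; they are not the actual classes of this project.
  
      // Hypothetical sketch of the diff result; names are illustrative only.
      public enum ChangeType { ADDED, DELETED, MODIFIED, RENAMED }
  
      public class Difference {
          public final ChangeType type;
          public final String path;      // e.g. "/home/joe/b.txt"
          public final String details;   // optional extra info (e.g. which attribute changed)
  
          public Difference(ChangeType type, String path, String details) {
              this.type = type;
              this.path = path;
              this.details = details;
          }
      }
  
  The program output would then essentially be a List<Difference> summarizing all detected changes.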
  
  Development process :
  ---------------------
  First we'll write this document and add some commentary in empty code structures as //TODO items. 
  Then, for each point, we'll write the appropriate JUnit test.
  Next we'll implement the method that satisfies the preceding tests so that they all pass.
  At this point some refactoring may be done on the existing code, and the tests should still pass. 
  The overall development process will be iterative.
  Even once the final goal is achieved, we'll review the work done to determine whether there are ways to improve it or to add missing functionality.
  
  Problem analysis
  ----------------
  
  At first sight the problem is quite simple. 
  We need to detect added or deleted tags, and those whose attributes differ.
  
  * Possible solution:
      - load the files as DOM Documents to manipulate them easily.
      - list all elements that should be tested.
      - for each element:
          _ test if it is present in the other document; if missing, the file/tree is only present in the analyzed document (i.e. added) => record this difference for the final report. 
          _ test if it is present but with different attributes (i.e. modified) => record this difference for the final report. 
          _ if it is present in the other document, delete the Element from that document, whether or not its attributes differ.
          _ list all remaining elements in the other document; because of the preceding step, those files are not present in the first document (i.e. deleted from it) => record this difference for the final report. 
      - parse the result object to summarize all the detected changes (a minimal sketch of this comparison loop follows this list).
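  
  A minimal sketch of this comparison loop, assuming both documents have already been flattened into maps keyed by absolute path (as discussed just below) and reusing the hypothetical Difference/ChangeType structure sketched in the "Program output" section; names are illustrative, not the project's actual API.
  
      import org.jdom2.Element;
      import java.util.*;
  
      class DiffSketch {
          // Compares the analyzed document against the other one, both given as path -> Element maps.
          static List<Difference> diff(Map<String, Element> analyzed, Map<String, Element> other) {
              List<Difference> report = new ArrayList<>();
              Map<String, Element> remaining = new HashMap<>(other);
              for (Map.Entry<String, Element> entry : analyzed.entrySet()) {
                  Element counterpart = remaining.remove(entry.getKey()); // also removes the match from the other side
                  if (counterpart == null) {
                      report.add(new Difference(ChangeType.ADDED, entry.getKey(), null));    // only in the analyzed document
                  } else if (!sameAttributes(entry.getValue(), counterpart)) {
                      report.add(new Difference(ChangeType.MODIFIED, entry.getKey(), null)); // present on both sides, attributes differ
                  }
              }
              // whatever is left on the other side was never matched, i.e. deleted from the analyzed document
              for (String path : remaining.keySet()) {
                  report.add(new Difference(ChangeType.DELETED, path, null));
              }
              return report;
          }
  
          // Compares the attributes used in the XML description of the example above.
          static boolean sameAttributes(Element a, Element b) {
              return Objects.equals(a.getAttributeValue("modif-date"), b.getAttributeValue("modif-date"))
                  && Objects.equals(a.getAttributeValue("size"), b.getAttributeValue("size"));
          }
      }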
  This solution raises one question: how to identify an element selected in one document in order to find it in the other.
  We propose to store each reference in a map. The key is the element's absolute path, built by concatenating, with a path separator, the ancestors' @name attributes and the element's own name; the value is the Element object itself.
  e.g. <root><tree name="home"><tree name="joe"><file name="a.txt" date... /></tree></tree></root> would add the entry ["/home/joe/a.txt", Element] 
  This scheme lets us store elements in a map, since a filesystem does not allow several files/folders with the same path.
  Otherwise it would cause problems, because a Map does not allow duplicate keys.
  We then iterate over that map for the core treatment (i.e. difference detection).
  (Note: all classes are made public static for easy testing).
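  
  A minimal sketch of that flattening step, assuming JDOM2 Elements and the key scheme described above (class and method names are illustrative):
  
      import org.jdom2.Element;
      import java.util.*;
  
      class PathMapSketch {
          // Flattens a parsed XML tree into a map from absolute path to Element.
          static Map<String, Element> flatten(Element root) {
              Map<String, Element> byPath = new LinkedHashMap<>();
              collect(root, "", byPath);
              return byPath;
          }
  
          static void collect(Element element, String parentPath, Map<String, Element> byPath) {
              for (Element child : element.getChildren()) {          // both <tree> and <file> children
                  String path = parentPath + "/" + child.getAttributeValue("name");
                  byPath.put(path, child);                           // duplicate paths would silently overwrite
                  collect(child, path, byPath);                      // recurse into sub-directories
              }
          }
      }
  
  Applied to the first example above, flatten(root) would produce the keys /home, /home/joe and /home/joe/a.txt.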
  
  Tracking of renaming: 
  The first version used the path and file name to identify files.
  A better way is to include the file's content hash in the XML description, which makes it possible to track simple renamings and to distinguish them from the addition/deletion of two separate files.
  What we did was to add a final step that compares the deleted/added file lists by content hash, in order to detect the same content under a different path/name.
  Prior to this, another step was added to index all deleted/added results by hash.
  This approach is used in well-known distributed source control systems such as Git.
  The idea behind using hash values is that the content matters more than the file itself.
  For folders we should also be able to compute a hash, by taking the hashes of the folder's contents, sorting and concatenating them, and hashing the result... 
  But this does not solve the case of a simultaneous rename and content change, which is too close to the deletion of one file and the creation of a new one.
  To handle that case, the only reliable solution seems to be to record renames at save time and to pass this meta-information along when synchronizing with the remote filesystem.
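  
  A minimal sketch of that matching step, assuming the XML description carries a hash attribute on file entries and reusing the hypothetical Difference/ChangeType structure from above (the attribute name and method names are assumptions):
  
      import org.jdom2.Element;
      import java.util.*;
  
      class RenameSketch {
          // Turns added/deleted entries sharing the same content hash into RENAMED differences.
          static List<Difference> detectRenames(Map<String, Element> added, Map<String, Element> deleted) {
              Map<String, String> deletedByHash = new HashMap<>();
              for (Map.Entry<String, Element> e : deleted.entrySet()) {
                  String hash = e.getValue().getAttributeValue("hash");
                  if (hash != null) {
                      deletedByHash.put(hash, e.getKey());           // index deleted files by content hash
                  }
              }
              List<Difference> renames = new ArrayList<>();
              for (Map.Entry<String, Element> e : added.entrySet()) {
                  String hash = e.getValue().getAttributeValue("hash");
                  String oldPath = (hash == null) ? null : deletedByHash.get(hash);
                  if (oldPath != null) {
                      // same content under a different path/name: report a rename instead of an add + delete
                      renames.add(new Difference(ChangeType.RENAMED, oldPath + " -> " + e.getKey(), null));
                  }
              }
              return renames;
          }
      }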
  
  
    
  Next steps
  ----------
  Benchmark the code with huge amounts of data and complex structures.
  Add hashes on folders to follow renaming.
  Find ways to improve the algorithm:
  if an entire folder has been deleted/added, we could only consider the top-level element involved in the deletion/addition instead of handling each sub-element one by one.
  Find ways to parallelize the treatment.
  JDom is a DOM API that won't scale well on large documents because it loads the whole document into memory.
  Alternatives are streaming parser APIs such as SAX or StAX, or binding APIs such as JAXB.
  If the XML can be split into small chunks, parallelized treatment of those chunks might be preferable.
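  
  A minimal sketch of the StAX alternative mentioned above, building the same kind of path keys while streaming instead of loading the whole document (shown only as an illustration, not part of the current implementation):
  
      import javax.xml.stream.*;
      import java.io.FileInputStream;
      import java.util.ArrayDeque;
      import java.util.Deque;
  
      class StreamingSketch {
          static void streamPaths(String xmlFile) throws Exception {
              XMLStreamReader reader = XMLInputFactory.newInstance()
                      .createXMLStreamReader(new FileInputStream(xmlFile));
              Deque<String> stack = new ArrayDeque<>();
              while (reader.hasNext()) {
                  int event = reader.next();
                  if (event == XMLStreamConstants.START_ELEMENT) {
                      // skip the document root so paths keep the /home/joe/a.txt form
                      String path = stack.isEmpty()
                              ? ""
                              : stack.peek() + "/" + reader.getAttributeValue(null, "name");
                      stack.push(path);
                      if (!path.isEmpty()) {
                          System.out.println(path);   // here we would feed the diff structures instead
                      }
                  } else if (event == XMLStreamConstants.END_ELEMENT) {
                      stack.pop();
                  }
              }
              reader.close();
          }
      }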
  
  
Todo
====
// TODO check program usage
// TODO use fuse4j-hadoopfs on results from diff to apply some kind of patch.
Improve the packaging and distribution of the software (zip with /bin/, /README, /lib/Exercice.jar ...).
Add a mocking framework for better unit testing.




