Laryn edited this page Mar 28, 2023 · 2 revisions

What does this module do?

In a multi-step form (similar to update.php), this module performs the following steps:

1) Search for duplicates

  • Searches are done on the public and/or private file system.
  • It searches for file names (the 'filename' part as returned by pathinfo (2)) that end with _{n} and for which the part before _{n} also exists as a separate file name.
  • It compares file size and MD5 hash (see md5_file (3)) to determine whether these are real duplicates or possible duplicates.
  • The results are presented as a list. For possible duplicates, clickable thumbnails are shown so you can compare them visually. For documents, only a clickable icon is shown.
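
The search described above can be sketched roughly as follows. This is an illustration in Python, not the module's own PHP code: the function names are hypothetical, and it assumes that only a single directory is scanned, that the original has the same extension as the duplicate, and that "real" means both size and MD5 hash match.

```python
import hashlib
import re
from pathlib import Path

# A stem ending in _{n}, e.g. "photo_0" -> base "photo", n = 0.
SUFFIX_RE = re.compile(r"^(?P<base>.+)_(?P<n>\d+)$")


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(directory: Path) -> list[tuple[Path, Path, str]]:
    """List (duplicate, original, kind) for files directly in `directory`.

    kind is "real" when size and MD5 both match (an assumption about how
    the classification works), otherwise "possible".
    """
    results = []
    for candidate in sorted(directory.iterdir()):
        if not candidate.is_file():
            continue
        match = SUFFIX_RE.match(candidate.stem)
        if match is None:
            continue
        # Assumption: the original carries the same extension.
        original = candidate.with_name(match.group("base") + candidate.suffix)
        if not original.is_file():
            continue
        same = (
            candidate.stat().st_size == original.stat().st_size
            and md5_of(candidate) == md5_of(original)
        )
        results.append((candidate, original, "real" if same else "possible"))
    return results
```

Comparing sizes first means the hash is only computed when the sizes already agree, which keeps the scan cheap for files that clearly differ.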

2) Search for usages

  • Checks whether a managed file record is defined for the duplicate (and for the original).
  • Searches for references to the managed file in user pictures, in all image and file fields, and in all fields that, according to their field schema, have a foreign key to the file_managed table.
  • Searches for URI references to the file, or to an image style derivative of it, in selected text and link fields.
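
As an illustration of the textual search, the sketch below looks for references to a public file or to one of its image style derivatives in a piece of text. It is written in Python rather than the module's PHP, the function name is hypothetical, and it assumes the public files directory is sites/default/files and that derivatives live under styles/<style name>/public/ inside it. The private file system is not covered.

```python
import re


def find_uri_references(
    text: str,
    relative_path: str,
    files_dir: str = "sites/default/files",  # assumed public files directory
) -> list[str]:
    """Return every reference to the file or an image style derivative of it."""
    pattern = re.compile(
        re.escape(files_dir)
        # Optional derivative prefix, e.g. "styles/thumbnail/public/".
        + r"/(?:styles/[^/]+/public/)?"
        + re.escape(relative_path)
        # Do not match longer names such as "photo_0.jpg.bak" or "photo_0.jpgx".
        + r"(?!\.?\w)"
    )
    return [match.group(0) for match in pattern.finditer(text)]
```

For example, calling `find_uri_references(body, "images/photo_0.jpg")` on a text field value would return the direct link and any thumbnail link to the same file, while ignoring a file named photo_0.jpg.bak.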

3) Update usages

  • References found to the managed file record of the duplicate are updated to refer to the managed file record of the original (or, if that record does not exist yet, the uri field is simply updated).
  • Textual usages that were found are changed to refer to the URI of the original document.
  • Note 1: this phase uses the entity_save() function of the Entity API module (a contrib module and thus a dependency) to ensure that caches are cleared, hooks are called, file_usage is updated, rules are executed, etc.
  • Note 2: this phase keeps track of failed updates so that the next phase can skip managed file records or files that are still being referred to.

4) Delete duplicates

  • All managed file records that are no longer referred to are deleted.
  • All duplicate files that are no longer referred to are deleted.
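
A minimal sketch of the file deletion, again in Python and not taken from the module: it assumes the update phase hands over the set of duplicates that are still referred to (for example because an update failed), and it skips those. The function and parameter names are hypothetical, and deleting managed file records is left out.

```python
from pathlib import Path


def delete_duplicates(duplicates: list[Path], still_referenced: set[Path]) -> list[Path]:
    """Delete duplicates that are no longer referred to; return the deleted paths."""
    deleted = []
    for path in duplicates:
        if path in still_referenced:
            # A failed update still points at this file, so keep it.
            continue
        path.unlink(missing_ok=True)
        deleted.append(path)
    return deleted
```

Passing the still-referenced set explicitly keeps the delete step from removing a file that some content still links to.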