# Getting Started with difPy

This notebook contains code samples showing how to use the [difPy](https://github.com/elisemercury/Duplicate-Image-Finder) Python package to find duplicate or similar images.

In [1]:
import difPy
difPy.__version__

'4.0.0-beta11'

The examples below use a directory with this folder structure:

&emsp;.<br>
&emsp;|- **Folder1**<br>
&emsp;|&emsp;|- image1.jpg<br>
&emsp;|&emsp;|- ...<br>
&emsp;|&emsp;|- imageN.jpg<br>
&emsp;|- **Folder2**<br>
&emsp;|&emsp;|- image1.jpg<br>
&emsp;|&emsp;|- ...<br>
&emsp;|&emsp;|- imageN.jpg<br>
&emsp;|- image1.jpg<br>
&emsp;|- ...<br>
&emsp;|- imageN.jpg<br>

It contains two subdirectories, `Folder1` and `Folder2`, plus a few images in the root folder. In our example there are 22 images in total, including 7 pairs of duplicates.

## I. Basic Single Folder Search

First, we build a `dif` object, which contains the repository of images and their image tensors.

In [3]:
dif = difPy.build("C:/Pictures/difPy Test")

difPy preparing files: [100%]


After building the repository, we can invoke difPy's search feature to look for duplicates among the images.

In [4]:
search = difPy.search(dif)

difPy searching files: [100%]


Output the result of the search process:

In [None]:
search.result

In [None]:
# > Output

{152101248435483263223868727742281888434: 
    {'location': 'C:\\Pictures\\difPy Test\\Folder1\\Folder1_Files (1).png',
     'matches': {218998084323469732336137050027392537538: 
                    {'location': 'C:\\Pictures\\difPy Test\\Folder1\\Folder1_Files (2).png',
                     'mse': 0.0},
                 149063264974240208220478733702797119308: 
                    {'location': 'C:\\Pictures\\difPy Test\\Main_Folder_files (1).png',
                     'mse': 0.0}}},
 22636187817081990587967668041826814190: 
    {'location': 'C:\\Pictures\\difPy Test\\Folder1\\Folder1_Files (3).png',
     'matches': {174403622781798519274435373860283880785: 
                    {'location': 'C:\\Pictures\\difPy Test\\Folder1\\Folder1_Files (4).png',
                     'mse': 0.0},
                 338410529253241118027323176907943126769: 
                    {'location': 'C:\\Pictures\\difPy Test\\Main_Folder_files (2).png',
                     'mse': 0.0}}},
 76665284592033273964231284549474486828: 
    {'location': 'C:\\Pictures\\difPy Test\\Folder2\\Folder2_Files (1).jpg',
     'matches': {204994251620760622645353609131380975785: 
                    {'location': 'C:\\Pictures\\difPy Test\\Folder2\\Folder2_Files (4).jpg',
                     'mse': 0.0}}},
 90973933985186530600072841488145022274: 
    {'location': 'C:\\Pictures\\difPy Test\\Folder2\\Folder2_Files (2).jpg',
     'matches': {320625470485905842454128517620228860689: 
                    {'location': 'C:\\Pictures\\difPy Test\\Folder2\\Folder2_Files (3).jpg',
                     'mse': 0.0}}},
 155282432310800845768869493841794469647: 
    {'location': 'C:\\Pictures\\difPy Test\\Main_Folder_files (5).png',
     'matches': {76376570623726054665503729074941265231: 
                    {'location': 'C:\\Pictures\\difPy Test\\Main_Folder_files (6).png',
                     'mse': 0.0}}}}
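
The nested dictionary above can be hard to scan by eye. As a minimal sketch, assuming the result keeps the shape shown in this output (an ID mapped to a `location` and a `matches` dictionary), it can be flattened into rows. The helper `list_matches` and the abbreviated `sample_result` are hypothetical and are not part of difPy; in practice you would pass `search.result` instead.

```python
# Hypothetical helper, not part of difPy: flattens a result dictionary of the
# shape shown above into (original, match, mse) tuples.
def list_matches(result):
    rows = []
    for entry in result.values():
        for match in entry["matches"].values():
            rows.append((entry["location"], match["location"], match["mse"]))
    return rows


# Abbreviated sample copied from the first entry of the output above.
sample_result = {
    152101248435483263223868727742281888434: {
        "location": "C:\\Pictures\\difPy Test\\Folder1\\Folder1_Files (1).png",
        "matches": {
            218998084323469732336137050027392537538: {
                "location": "C:\\Pictures\\difPy Test\\Folder1\\Folder1_Files (2).png",
                "mse": 0.0,
            },
            149063264974240208220478733702797119308: {
                "location": "C:\\Pictures\\difPy Test\\Main_Folder_files (1).png",
                "mse": 0.0,
            },
        },
    },
}

for original, match, mse in list_matches(sample_result):
    print(f"{match} matches {original} (mse={mse})")
```

Each row pairs one matched file with the image it was compared against, which is convenient for writing the matches to a CSV file or reviewing them before deleting anything.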

Inspect the statistics of the difPy build and search processes:

In [None]:
search.stats

In [None]:
# > Output

{'directory': ['C:/Pictures/difPy Test'],
 'process': {'build':
                {'duration': {'start': '2023-08-29T23:31:20.210531',
                              'end': '2023-08-29T23:31:21.377826',
                              'seconds_elapsed': 1.1673},
                 'parameters': {'recursive': True,
                                'in_folder': False,
                                'limit_extensions': True,
                                'px_size': 50}},
             'search': 
                {'duration': {'start': '2023-08-29T23:31:25.408573',
                              'end': '2023-08-29T23:31:26.435525',
                              'seconds_elapsed': 1.027},
                 'parameters': {'similarity_mse': 0},
                 'files_searched': 17,
                 'matches_found': {'duplicates': 7, 
                                   'similar': 0}}},
 'invalid_files': {'count': 5,
                   'logs': {'C:\\Pictures\\difPy Test\\Folder1\\Folder1_Files (1).avif': 'ImageFilterWarning: invalid image extension.',
                            'C:\\Pictures\\difPy Test\\Folder1\\Folder1_Files (2).avif': 'ImageFilterWarning: invalid image extension.',
                            'C:\\Pictures\\difPy Test\\Main_Folder_files (1).avif': 'ImageFilterWarning: invalid image extension.',
                            'C:\\Pictures\\difPy Test\\Main_Folder_files (1).lnk': 'ImageFilterWarning: invalid image extension.',
                            'C:\\Pictures\\difPy Test\\Main_Folder_files (2).avif': 'ImageFilterWarning: invalid image extension.'}}}
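
To pull the key figures out of the statistics, a short sketch like the following may help. It assumes the stats dictionary has the keys shown in the output above; the helper `summarize_stats` and the trimmed `sample_stats` are hypothetical and are not part of difPy. In practice you would pass `search.stats`.

```python
# Hypothetical helper, not part of difPy: condenses a stats dictionary of the
# shape shown above into a handful of headline numbers.
def summarize_stats(stats):
    build = stats["process"]["build"]
    search = stats["process"]["search"]
    total_seconds = (
        build["duration"]["seconds_elapsed"]
        + search["duration"]["seconds_elapsed"]
    )
    return {
        "files_searched": search["files_searched"],
        "duplicates": search["matches_found"]["duplicates"],
        "similar": search["matches_found"]["similar"],
        "invalid_files": stats["invalid_files"]["count"],
        "total_seconds": round(total_seconds, 4),
    }


# Trimmed sample using values from the output above (logs and parameters omitted).
sample_stats = {
    "directory": ["C:/Pictures/difPy Test"],
    "process": {
        "build": {"duration": {"seconds_elapsed": 1.1673}},
        "search": {
            "duration": {"seconds_elapsed": 1.027},
            "files_searched": 17,
            "matches_found": {"duplicates": 7, "similar": 0},
        },
    },
    "invalid_files": {"count": 5},
}

print(summarize_stats(sample_stats))
```

In the sample, the 17 searched files plus the 5 invalid files account for the 22 files in the test directory.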

## II. Basic Multi Folder Search

difPy supports searching among multiple input folders:

In [20]:
dif = difPy.build("C:/Pictures/difPy Test/Folder1", 
                  "C:/Pictures/difPy Test/Folder2")

search = difPy.search(dif)

difPy preparing files: [100%]
difPy searching files: [100%]


## III. In-Folder Search

By default, difPy searches for matches across the union of all directories passed to the `directory` parameter. In this example, we want difPy to search for matches only within each folder separately.

We can do this by setting `in_folder` to `True`.

In [10]:
dif = difPy.build("C:/Pictures/difPy Test", in_folder = True)

difPy preparing files: [100%]


Run the search:

In [12]:
search = difPy.search(dif)

difPy searching files: [100%]


In [None]:
search.result

In [None]:
# > Output

{'group_0': {'location': 'C:\\Pictures\\difPy Test',
             'contents': {98709272196713849344848018014435789097: 
                            {'location': 'C:\\Pictures\\difPy Test\\Main_Folder_files (5).png',
                             'matches': {71478117174850943214577159598373355637: 
                                            {'location': 'C:\\Pictures\\difPy Test\\Main_Folder_files (6).png',
                                             'mse': 0.0}}}}},
 'group_1': {'location': 'C:\\Pictures\\difPy Test\\Folder1',
             'contents': {239884997223852077903937741023120511368: 
                            {'location': 'C:\\Pictures\\difPy Test\\Folder1\\Folder1_Files (1).png',
                             'matches': {255164768522949980820784701796222517077: 
                                            {'location': 'C:\\Pictures\\difPy Test\\Folder1\\Folder1_Files (2).png',
                                             'mse': 0.0}}},
                          328291229701187883127704154870577749509: 
                            {'location': 'C:\\Pictures\\difPy Test\\Folder1\\Folder1_Files (3).png',
                             'matches': {322539927784269412928170787873655878487: 
                                            {'location': 'C:\\Pictures\\difPy Test\\Folder1\\Folder1_Files (4).png',
                                             'mse': 0.0}}}}},
 'group_2': {'location': 'C:\\Pictures\\difPy Test\\Folder2',
             'contents': {159772767701427368585036770862084098312: 
                            {'location': 'C:\\Pictures\\difPy Test\\Folder2\\Folder2_Files (1).jpg',
                             'matches': {62855373167845162697259548417873655065: 
                                            {'location': 'C:\\Pictures\\difPy Test\\Folder2\\Folder2_Files (4).jpg',
                                             'mse': 0.0}}},
                          40115952529100222280989417554751604167: 
                            {'location': 'C:\\Pictures\\difPy Test\\Folder2\\Folder2_Files (2).jpg',
                             'matches': {251561559539487576337507392003670007376: 
                                            {'location': 'C:\\Pictures\\difPy Test\\Folder2\\Folder2_Files (3).jpg',
                                             'mse': 0.0}}}}}}

As the output above shows, the results and matches are now grouped by folder.
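
To work with the grouped result, a sketch like the one below counts the matches found in each folder. It assumes the `group_N` / `location` / `contents` layout shown in the output above; the helper `matches_per_folder` and the abbreviated `sample_grouped` are hypothetical and are not part of difPy. In practice you would pass the `search.result` of an in-folder search.

```python
# Hypothetical helper, not part of difPy: counts matches per folder in a
# grouped (in_folder = True) result of the shape shown above.
def matches_per_folder(grouped_result):
    counts = {}
    for group in grouped_result.values():
        counts[group["location"]] = sum(
            len(entry["matches"]) for entry in group["contents"].values()
        )
    return counts


# Abbreviated sample copied from 'group_0' and 'group_1' of the output above.
sample_grouped = {
    "group_0": {
        "location": "C:\\Pictures\\difPy Test",
        "contents": {
            98709272196713849344848018014435789097: {
                "location": "C:\\Pictures\\difPy Test\\Main_Folder_files (5).png",
                "matches": {
                    71478117174850943214577159598373355637: {
                        "location": "C:\\Pictures\\difPy Test\\Main_Folder_files (6).png",
                        "mse": 0.0,
                    }
                },
            }
        },
    },
    "group_1": {
        "location": "C:\\Pictures\\difPy Test\\Folder1",
        "contents": {
            239884997223852077903937741023120511368: {
                "location": "C:\\Pictures\\difPy Test\\Folder1\\Folder1_Files (1).png",
                "matches": {
                    255164768522949980820784701796222517077: {
                        "location": "C:\\Pictures\\difPy Test\\Folder1\\Folder1_Files (2).png",
                        "mse": 0.0,
                    }
                },
            },
            328291229701187883127704154870577749509: {
                "location": "C:\\Pictures\\difPy Test\\Folder1\\Folder1_Files (3).png",
                "matches": {
                    322539927784269412928170787873655878487: {
                        "location": "C:\\Pictures\\difPy Test\\Folder1\\Folder1_Files (4).png",
                        "mse": 0.0,
                    }
                },
            },
        },
    },
}

for folder, count in matches_per_folder(sample_grouped).items():
    print(f"{folder}: {count} match(es)")
```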