# Sanity checks 

This notebook reports the following: 

- Items with duplicate pdf-files
- Items *without* pdf-files
- Trash status: empty or not
- Standalone items: pdf-files or notes
- Items with no DOI **and** no ISBN
- Duplicate items based on:
  - DOI/ISBN
  - Title

In [None]:
%run config.ipynb

# Retrieve data from server 

In [None]:
%%time
zot, lib_items = retrieve_data()

# Calculation

Generate a list of items with duplicate pdf files
<a id='fetch-duplicate-pdf'></a>

This cell calls the server once per item to retrieve its children, so it can be slow.

So, patience ...

In [None]:
%%time
log.info("Resolving items with duplicate pdf attachments ...")
items_duplicate_attach, pdf_attachments = get_items_with_duplicate_pdf(zot, lib_items)

In [None]:
num_duplicates = len(items_duplicate_attach)

if num_duplicates:
    log.warning(f"Got {num_duplicates} items with duplicate pdf attachments:")
    STATUS_OK = False
    for item in items_duplicate_attach:
        log_title(item)

else:
    log.info("Got 0 items with duplicate pdf attachments.")
    
log.info("Done!")

# Report items with multiple attachments

Multiple attachments are ok.  
We are looking for duplicate pdf files.   
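
As a rough illustration of the distinction, the sketch below flags an item only when the same pdf filename appears more than once among its children. It is not the implementation of `get_items_with_duplicate_pdf` (that lives in `config.ipynb`); the helper name `duplicate_pdf_names` is hypothetical, and the sketch assumes each child is a Zotero API item dict whose `data` carries `itemType`, `contentType` and `filename`.

```python
from collections import Counter


def duplicate_pdf_names(children):
    """Return pdf filenames that occur more than once among an item's children.

    Hypothetical helper: assumes Zotero-style child dicts with a "data" mapping.
    """
    names = Counter(
        child["data"].get("filename", "")
        for child in children
        if child["data"].get("itemType") == "attachment"
        and child["data"].get("contentType") == "application/pdf"
    )
    # Several different pdf files are fine; only repeated names are reported.
    return sorted(name for name, n in names.items() if name and n > 1)
```

An item with `a.pdf`, `a.pdf` and `b.pdf` would be reported for `a.pdf` only. Comparing filenames is one possible criterion; comparing file hashes would be stricter.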


In [None]:
num_duplicates = len(items_duplicate_attach)

if num_duplicates:
    log.warning(f"Found {num_duplicates} items with duplicate pdf files:")
    STATUS_OK = False
    for item in items_duplicate_attach:
        log_title(item)

else:
    log.info("No items with duplicate pdf files found")
    
log.info("Done!")

# Report items without pdfs 

This cell calls the server once per item to retrieve its children, so it can be slow.

So, patience ...

In [None]:
%%time
log.info("Retrieve items without pdf file ...")
items_without_pdf = get_items_with_no_pdf_attachments(zot, lib_items)
if items_without_pdf:
    log.warning(f"Found {len(items_without_pdf)} items\n")
    STATUS_OK = False
else:
    log.info("Found 0 items\n")
    
for item in items_without_pdf:
    log_title(item)

# Check standalone items

- pdf files
- notes
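
For orientation, a minimal sketch of what "standalone" could mean here: an attachment or note that sits at the top level of the library instead of under a parent item. The helper `looks_standalone` is hypothetical and is not the `get_standalone_items` used in this notebook; it assumes Zotero API item dicts in which child items carry a `parentItem` key in `data`.

```python
def looks_standalone(item):
    """Return True for an attachment or note that has no parent item.

    Hypothetical helper: assumes a Zotero-style dict with a "data" mapping.
    """
    data = item["data"]
    is_attachment_or_note = data.get("itemType") in ("attachment", "note")
    # Child items reference their parent; standalone ones do not.
    return is_attachment_or_note and not data.get("parentItem")
```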

In [None]:
log.info("Check standalone items ...")
standalone_items = get_standalone_items(lib_items)    
if standalone_items:
    log.warning(f"Found {len(standalone_items)} items.\n")
    STATUS_OK = False
else:
    log.info(f"Found {len(standalone_items)} items.\n")  

for standalone_item in standalone_items:
    log_item(standalone_item)
    
if STATUS_OK:
    log.info(f"Library is OK!")    

## Report duplicate items

### Based on DOI/ISBN 

- Duplicates that have neither a DOI nor an ISBN are ignored!
- Duplicates whose DOI or ISBN is written differently are missed as well! (e.g. ISBN=0968-090X and ISBN=0968090X)
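
One way to reduce the second kind of miss is to normalise identifiers before grouping. The sketch below is only an illustration and is not part of `get_items_by_doi_or_isbn`; the function name `normalize_identifier` and the list of DOI prefixes are assumptions.

```python
import re

# Assumed set of prefixes that people commonly paste in front of a DOI.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_identifier(value):
    """Return a comparison key for a DOI or ISBN string (hypothetical helper)."""
    lowered = value.strip().lower()
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            lowered = lowered[len(prefix):]
            break
    if lowered.startswith("10."):
        # DOIs are case-insensitive, so the lowercased form is the key.
        return lowered
    # Otherwise treat it as ISBN-like: keep digits and the X check character.
    return re.sub(r"[^0-9X]", "", value.upper())
```

With this key, `0968-090X` and `0968090X` both map to `0968090X`, so the two entries from the example above would land in the same group.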


In [None]:
%%time

log.info("Resolving duplicates based on DOI/ISBN...")
# group items by DOI/ISBN
by_doi = get_items_by_doi_or_isbn(lib_items)        
delete_items = []
update_items = []
for doi, items in by_doi.items():
    if len(items) == 1:
        continue

    log.info(f"doi/isbn: {doi} | number = {len(items)}")    
    # sort by age. oldest first
    items.sort(key=date_added)
    # keep oldest item
    keep = items[0]
    # keep latest attachments
    keep_cs = zot.children(keep["key"])
    duplicates_have_pdf = False
    for item in items[-1:0:-1]:
        cs = zot.children(item["key"])
        if cs:
            for c in cs:
                c["data"]["parentItem"] = keep["key"]
                if attachment_is_pdf(c):
                    duplicates_have_pdf = True
                
            update_items.extend(cs)
            if duplicates_have_pdf:
                delete_items.extend(keep_cs)

            break  # because only the newest attachments are added

    delete_items.extend(items[1:])

if delete_items:
    log.warning(f"{len(delete_items)} duplicate items")
    STATUS_OK = False
else:
    log.info("no duplicates found")
    
for d in delete_items:
    log_title(d)
    

In [None]:
%%time
log.info(f"Retrieving items with no DOI & ISBN")
# for field in ["DOI", "ISBN"]:
#     log.info("Resolving items with no {field} ...")  
#     empty_list = get_items_with_empty_doi_isbn(lib_items, field)
#     if empty_list:
#         log.warning(f"found {len(empty_list)} items with no {field}")
#         for d in empty_list:
#             log.info(f"Title: {d}")
#     else:
#         log.info("all items have a {field}")
        
# now check for items having none of them (doi and isbn)
fields = ["DOI", "ISBN"]
no_doi_and_no_isbn  = get_items_with_empty_doi_and_isbn(lib_items, fields)

if no_doi_and_no_isbn:
    log.warning(f"found {len(no_doi_and_no_isbn)} items with no {fields}")
    STATUS_OK = False
    for no in no_doi_and_no_isbn:
        log.info(f"Title: {no}")


### Based on Title
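
Title matching is sensitive to small differences such as extra spaces. As a hedged sketch (not the check used in this notebook), titles could be compared after collapsing whitespace and case-folding. The helper `find_duplicate_titles` is hypothetical and assumes Zotero-style item dicts with `itemType` and `title` in `data`.

```python
from collections import Counter, defaultdict


def find_duplicate_titles(items):
    """Map each item type to the normalised titles that occur more than once."""
    counts_by_type = defaultdict(Counter)
    for item in items:
        data = item["data"]
        # Collapse runs of whitespace and ignore case before comparing.
        title = " ".join(data.get("title", "").split()).casefold()
        if title:
            counts_by_type[data.get("itemType", "unknown")][title] += 1

    duplicates = {}
    for item_type, counts in counts_by_type.items():
        repeated = sorted(t for t, n in counts.items() if n > 1)
        if repeated:
            duplicates[item_type] = repeated
    return duplicates
```

Two books titled `Deep  Learning` and `deep learning` would be reported together, while a journal article with the same title would not, because titles are only compared within one item type.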

In [None]:
from collections import Counter

log.info("Resolving missed duplicates based on title...")
# titles grouped by item type
duplicate_items_by_title = defaultdict(list)

for item in lib_items:
    if is_standalone(item):
        continue

    item_type = item["data"]["itemType"]
    title = item["data"]["title"]
    duplicate_items_by_title[item_type].append(title.capitalize())

for item_type, titles in duplicate_items_by_title.items():
    # count every title once instead of calling list.count() per element
    counts = Counter(titles)
    num_duplicates_items = len(titles) - len(counts)
    if num_duplicates_items:
        STATUS_OK = False
        log.warning(f"{num_duplicates_items} duplicate items of type <{item_type}>")
        for title, count in counts.items():
            if count > 1:
                log.info(f"Title: {title}")
    else:
        log.info(f"No duplicates of type <{item_type}> found")

# Trash status

- Check if Trash is empty
- Standalone items

In [None]:
if len(zot.trash()) > 0:
    log.warning("Trash is not empty. Consider emptying it!")
    STATUS_OK = False
else:
    log.info("\n----\nTrash is empty!")
    
if STATUS_OK:
    log.info("STATUS: OK")
else:
    log.warning("STATUS: NOT OK")