# Sanity checks 

This notebook reports the following: 

- Items with duplicate PDF files
- Items *without* PDF files
- Trash status: empty or not
- Standalone items: PDF files or notes
- Items with no DOI **and** no ISBN
- Duplicate items based on:
  - DOI/ISBN
  - Title

In [1]:
%run config.ipynb

# Retrieve data from server 

In [2]:
%%time
zot, lib_items = retrieve_data()

INFO - 2022-03-02 15:20:02 - Retrieving Library...
INFO - 2022-03-02 15:21:02 - Got 1149 items
INFO - 2022-03-02 15:21:02 - Done at 15:21:2
CPU times: user 251 ms, sys: 38.7 ms, total: 290 ms
Wall time: 59.6 s


# Calculation

Generate a list of items with duplicate PDF files.
<a id='fetch-duplicate-pdf'></a>

This cell fetches the children of every item with a separate server call, so it can be slow (about 14.5 minutes of wall time for the 1149 items in the run below).
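
What counts as a "duplicate" PDF is decided inside `get_items_with_duplicate_pdf`, which is not shown in this notebook. As a minimal sketch, assuming two PDF attachments are duplicates when their `md5` hashes match, the check on the children of a single item could look like this. The function name `find_duplicate_pdfs` is hypothetical; the field names `itemType`, `contentType`, `md5`, and `filename` are the ones Zotero uses for attachment items.

```python
from collections import defaultdict


def find_duplicate_pdfs(children):
    """Group PDF attachments by md5 and return the groups with more than one file.

    `children` is a list of Zotero-style item dicts (the children of one item).
    Assumption: two PDFs are duplicates when their `md5` values match.
    """
    by_hash = defaultdict(list)
    for child in children:
        data = child.get("data", {})
        if data.get("itemType") != "attachment":
            continue
        if data.get("contentType") != "application/pdf":
            continue
        md5 = data.get("md5")
        if md5:  # linked or unsynced files may have no hash
            by_hash[md5].append(data.get("filename"))
    return {md5: names for md5, names in by_hash.items() if len(names) > 1}
```

An item would then be flagged whenever `find_duplicate_pdfs(zot.children(item["key"]))` returns a non-empty dict.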

In [3]:
%%time
log.info("Resolving items with duplicate pdf attachments ...")
items_duplicate_attach, pdf_attachments = get_items_with_duplicate_pdf(zot, lib_items)

INFO - 2022-03-02 15:21:02 - Resolving items with duplicate pdf attachments ...
CPU times: user 15.5 s, sys: 1.72 s, total: 17.2 s
Wall time: 14min 29s


In [4]:
num_duplicates = len(items_duplicate_attach)

if num_duplicates:
    log.warning(f"Got: {num_duplicates} items:")
    STATUS_OK = False
    for item in items_duplicate_attach:
        log_title(item)

else:
    log.info("No items with duplicate pdf attachments found")
    
log.info("Done!")

INFO - 2022-03-02 15:35:32 - Title: Balance Recovery Prediction with Multiple Strategies for Standing Humans
INFO - 2022-03-02 15:35:32 - Title: Modeling the desired direction in a force-based model for pedestrian dynamics
INFO - 2022-03-02 15:35:32 - Title: Dynamics of social groups’ decision-making in evacuations
INFO - 2022-03-02 15:35:32 - Title: Universal flow-density relation of single-file bicycle, pedestrian and car motion
INFO - 2022-03-02 15:35:32 - Title: Universalities in fundamental diagrams of cars, bicycles and pedestrians
INFO - 2022-03-02 15:35:32 - Title: Dynamic Data–driven simulation of pedestrian movement with automatic validation
INFO - 2022-03-02 15:35:32 - Title: BRVO: Predicting pedestrian trajectories using velocity-space reasoning
INFO - 2022-03-02 15:35:32 - Title: Walk this way: Improving pedestrian agent-based models through scene activity analysis
INFO - 2022-03-02 15:35:32 - Title: Parameter estimation of social forces in crowd dynamics models via a prob

# Report items with multiple attachments

Multiple attachments are ok.  
We are looking for duplicate pdf files.   


In [5]:
num_duplicates = len(items_duplicate_attach)

if num_duplicates:
    log.warning(f"Found {num_duplicates} items with duplicate pdf files:")
    STATUS_OK = False
    for item in items_duplicate_attach:
        log_title(item)

else:
    log.info("no items with duplicate pdf files found")
    
log.info("Done!")

INFO - 2022-03-02 15:35:32 - Title: Balance Recovery Prediction with Multiple Strategies for Standing Humans
INFO - 2022-03-02 15:35:32 - Title: Modeling the desired direction in a force-based model for pedestrian dynamics
INFO - 2022-03-02 15:35:32 - Title: Dynamics of social groups’ decision-making in evacuations
INFO - 2022-03-02 15:35:32 - Title: Universal flow-density relation of single-file bicycle, pedestrian and car motion
INFO - 2022-03-02 15:35:32 - Title: Universalities in fundamental diagrams of cars, bicycles and pedestrians
INFO - 2022-03-02 15:35:32 - Title: Dynamic Data–driven simulation of pedestrian movement with automatic validation
INFO - 2022-03-02 15:35:32 - Title: BRVO: Predicting pedestrian trajectories using velocity-space reasoning
INFO - 2022-03-02 15:35:32 - Title: Walk this way: Improving pedestrian agent-based models through scene activity analysis
INFO - 2022-03-02 15:35:32 - Title: Parameter estimation of social forces in crowd dynamics models via a prob

# Report items without pdfs 

Like the duplicate-PDF check, this cell fetches the children of every item with a separate server call, so it can be slow (about 15 minutes in the run below).
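
Both this cell and the duplicate-PDF cell ask the server for the children of each item, so the same requests are made twice. One possible way to avoid that, sketched here under the assumption that the children do not change during the run, is to cache them per item key. `ChildrenCache` and `has_pdf` are hypothetical helpers and not part of `config.ipynb`.

```python
class ChildrenCache:
    """Fetch the children of each item at most once."""

    def __init__(self, fetch_children):
        # fetch_children: callable that takes an item key and returns a list
        self._fetch = fetch_children
        self._cache = {}

    def get(self, item_key):
        # Only hit the server for keys that have not been seen yet
        if item_key not in self._cache:
            self._cache[item_key] = self._fetch(item_key)
        return self._cache[item_key]


def has_pdf(children):
    """Return True if any child is a PDF attachment."""
    return any(
        child.get("data", {}).get("contentType") == "application/pdf"
        for child in children
    )
```

With `cache = ChildrenCache(zot.children)`, both checks could call `cache.get(item["key"])` and the second pass would no longer touch the network.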

In [6]:
%%time
log.info("Retrieve items without pdf file ...")
items_without_pdf = get_items_with_no_pdf_attachments(zot, lib_items)
if items_without_pdf:
    log.warning(f"Found {len(items_without_pdf)} items\n")
    STATUS_OK = False
else:
    log.info("Found 0 items\n")
    
for item in items_without_pdf:
    log_title(item)

INFO - 2022-03-02 15:35:32 - Retrieve items without pdf file ...

INFO - 2022-03-02 15:50:39 - Title: Body-rotation behavior of pedestrians for collision avoidance in passing and cross flow
INFO - 2022-03-02 15:50:39 - Title: Unified modeling language (UML)
INFO - 2022-03-02 15:50:39 - Title: Visualization ToolKit (VTK)
INFO - 2022-03-02 15:50:39 - Title: Pedigree project
INFO - 2022-03-02 15:50:39 - Title: Jülich supercomputing centre (JSC), forschungszentrum Jülich
INFO - 2022-03-02 15:50:39 - Title: Opening (bildverarbeitung)
INFO - 2022-03-02 15:50:39 - Title: Lambertsches gesetz
INFO - 2022-03-02 15:50:39 - Title: 3D motion tracking technology
INFO - 2022-03-02 15:50:39 - Title: Precision and recall
INFO - 2022-03-02 15:50:39 - Title: Kumbh mela
INFO - 2022-03-02 15:50:39 - Title: Kinect
INFO - 2022-03-02 15:50:39 - Title: HSV-Farbraum
INFO - 2022-03-02 15:50:39 - Title: Haddsch
INFO - 2022-03-02 15:50:39 - Title: Entfernungsmessung
INFO - 2022-03-02 15:50:39 - Title: Eadweard muy

# Check standalone items

- pdf files
- notes
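
The helper `get_standalone_items` comes from `config.ipynb` and is not shown here. A minimal sketch of the idea, assuming a standalone item is an attachment or note whose data has no `parentItem` key (child items in Zotero carry the key of their parent there), might be the following. The name `looks_standalone` is hypothetical.

```python
def looks_standalone(item):
    """Return True for an attachment or note that has no parent item."""
    data = item.get("data", {})
    is_attachment_or_note = data.get("itemType") in ("attachment", "note")
    return is_attachment_or_note and not data.get("parentItem")
```

Filtering the library is then a list comprehension: `[i for i in lib_items if looks_standalone(i)]`.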

In [7]:
log.info("Check standalone items ...")
standalone_items = get_standalone_items(lib_items)    
if standalone_items:
    log.warning(f"Found {len(standalone_items)} items.\n")
    STATUS_OK = False
else:
    log.info(f"Found {len(standalone_items)}.\n")  

for standalone_item in standalone_items:
    log_item(standalone_item)
    
if STATUS_OK:
    log.info("Library is OK!")

INFO - 2022-03-02 15:50:39 - Check standalone items ...

INFO - 2022-03-02 15:50:39 - Anleitung_Türsensoren_JUMPAexperiments_Engstelle.docx (attachment)
INFO - 2022-03-02 15:50:39 - MVN_User_Manual.pdf (attachment)
INFO - 2022-03-02 15:50:39 - IDL-Pressure-Mapping-Sensor-5400N-Datasheet(6).pdf (attachment)
INFO - 2022-03-02 15:50:39 - instruction_TekScan.docx (attachment)
INFO - 2022-03-02 15:50:39 - Xsens Sales Quote 18431.pdf (attachment)
INFO - 2022-03-02 15:50:39 - Xsens Sales Quote 18430.pdf (attachment)
INFO - 2022-03-02 15:50:39 - Anleitung_XsensAnzugAnziehenUndSensorenVorbereiten.docx (attachment)
INFO - 2022-03-02 15:50:39 - LX210255005.PNG (attachment)
INFO - 2022-03-02 15:50:39 - Anleitung_Drucksensoranzug_EngstelleJUMPA.docx (attachment)
INFO - 2022-03-02 15:50:39 - Xsens - Sole Manufacturer and supplier in Germany.pdf (attachment)
INFO - 2022-03-02 15:50:39 - XSensor Technology.pdf (attachment)
INFO - 2022-03-02 15:50:39 - PRO8 Software License Key 2.pdf (attachment)
INFO 

## Report duplicate items

### Based on DOI/ISBN 

- Duplicates that have neither a DOI nor an ISBN are ignored.
- Duplicates whose DOI or ISBN is written differently are missed as well (e.g. ISBN=0968-090X and ISBN=0968090X).
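
The second limitation could be reduced by normalizing identifiers before grouping. The following is only a sketch and not what `get_items_by_doi_or_isbn` does: `normalize_identifier` is a hypothetical helper that lower-cases the value, strips a `doi.org` resolver prefix, and removes hyphens and spaces from anything that does not look like a DOI.

```python
import re


def normalize_identifier(value):
    """Return a canonical form of a DOI or ISBN-like string for comparison."""
    text = value.strip().lower()
    # Drop a resolver prefix such as https://doi.org/ or a leading "doi:"
    text = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", text)
    if text.startswith("10."):
        # DOIs are case-insensitive, but their punctuation is significant
        return text
    # ISBN-like values: hyphens and spaces are only formatting
    return re.sub(r"[\s\-]", "", text)
```

With this, `0968-090X` and `0968090X` map to the same key and would land in the same group.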


In [8]:
%%time

log.info("Resolving duplicates based on DOI/ISBN...")
# group items by DOI or ISBN
by_doi = get_items_by_doi_or_isbn(lib_items)        
delete_items = []
update_items = []
for doi, items in by_doi.items():
    if len(items) == 1:
        continue

    log.info(f"doi/isbn: {doi} | number = {len(items)}")    
    # sort by age. oldest first
    items.sort(key=date_added)
    # keep oldest item
    keep = items[0]
    # children currently attached to the kept (oldest) item
    keep_cs = zot.children(keep["key"])
    duplicates_have_pdf = False
    for item in items[-1:0:-1]:
        cs = zot.children(item["key"])
        if cs:
            for c in cs:
                c["data"]["parentItem"] = keep["key"]
                if attachment_is_pdf(c):
                    duplicates_have_pdf = True
                
            update_items.extend(cs)
            if duplicates_have_pdf:
                delete_items.extend(keep_cs)

            break  # only the newest duplicate's attachments are moved

    delete_items.extend(items[1:])

if delete_items:
    log.warning(f"{len(delete_items)} duplicate items")
    STATUS_OK = False
else:
    log.info("no duplicates found")
    
for d in delete_items:
    log_title(d)
    

INFO - 2022-03-02 15:50:39 - Resolving duplicates based on DOI/ISBN...
INFO - 2022-03-02 15:50:39 - doi/isbn: 10.1016/j.trb.2019.03.008 | number = 2
INFO - 2022-03-02 15:50:40 - doi/isbn: 10.1016/j.physa.2018.09.038 | number = 2
INFO - 2022-03-02 15:50:42 - doi/isbn: 10.1016/j.procs.2016.04.137 | number = 2
INFO - 2022-03-02 15:50:43 - doi/isbn: 10.1016/j.physa.2018.02.021 | number = 3
INFO - 2022-03-02 15:50:45 - doi/isbn: 10.1007/s12544-017-0264-6 | number = 2
INFO - 2022-03-02 15:50:46 - doi/isbn: 10.1371/journal.pone.0166908 | number = 2
INFO - 2022-03-02 15:50:48 - doi/isbn: 10.1155/2019/3457370 | number = 2
INFO - 2022-03-02 15:50:49 - doi/isbn: 978-3-642-27737-5 | number = 2
INFO - 2022-03-02 15:50:51 - doi/isbn: 10.1109/ChiCC.2015.7260353 | number = 2
INFO - 2022-03-02 15:50:53 - doi/isbn: 10.1103/PhysRevE.85.016111 | number = 2
INFO - 2022-03-02 15:50:54 - doi/isbn: 978-1-4614-8482-0 | number = 2
INFO - 2022-03-02 15:50:55 - doi/isbn: 10.1142/S0219525909002209 | number = 2
INF
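
The cell above keeps the oldest item of each group and marks the rest for deletion. The sort key `date_added` is defined in `config.ipynb` and not shown. As a sketch, assuming each item carries an ISO 8601 `dateAdded` string in its data (so that string order equals chronological order), the split could be written like this. `split_keep_and_duplicates` is a hypothetical name.

```python
def split_keep_and_duplicates(items):
    """Return (oldest item, remaining items) ordered by the date they were added.

    Assumption: `dateAdded` values are ISO 8601 strings in the same format
    and time zone, so lexicographic order matches chronological order.
    """
    ordered = sorted(items, key=lambda item: item["data"]["dateAdded"])
    return ordered[0], ordered[1:]
```

Using `sorted` instead of `list.sort` leaves the caller's list untouched, which makes the step easier to test in isolation.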

In [9]:
%%time
log.info("Retrieving items with no DOI & ISBN")
# for field in ["DOI", "ISBN"]:
#     log.info("Resolving items with no {field} ...")  
#     empty_list = get_items_with_empty_doi_isbn(lib_items, field)
#     if empty_list:
#         log.warning(f"found {len(empty_list)} items with no {field}")
#         for d in empty_list:
#             log.info(f"Title: {d}")
#     else:
#         log.info("all items have a {field}")
        
# check for items that have neither a DOI nor an ISBN
fields = ["DOI", "ISBN"]
no_doi_and_no_isbn  = get_items_with_empty_doi_and_isbn(lib_items, fields)

if no_doi_and_no_isbn:
    log.warning(f"found {len(no_doi_and_no_isbn)} items with no {fields}")
    STATUS_OK = False
    for no in no_doi_and_no_isbn:
        log.info(f"Title: {no}")

INFO - 2022-03-02 15:50:57 - Retrieving items with no DOI & ISBN


KeyError: 'title'
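
The `KeyError: 'title'` suggests that the data of at least one item has no `title` field (Zotero notes, for example, store their text in a `note` field instead). Where exactly the lookup fails is not visible in this output, so the following is only a sketch of a defensive lookup. `title_or_placeholder` is a hypothetical helper.

```python
def title_or_placeholder(item):
    """Return the item's title, or a readable placeholder if it has none."""
    data = item.get("data", {})
    title = data.get("title")
    if title:
        return title
    # Fall back to the item type and key so the item can still be located
    return f"<no title: {data.get('itemType', 'unknown')} {data.get('key', '?')}>"
```

Using such a lookup wherever titles are collected would let the report finish and still point to the offending items.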


### Based on Title

In [10]:
log.info("Resolving missed duplicates based on title...")
duplicate_items_by_title = defaultdict(list)

for item in lib_items:
    if is_standalone(item):
        continue 
        
    key = item["data"]["key"]
    iType = item["data"]["itemType"]
    Title = item["data"]["title"]
    duplicate_items_by_title[iType].append(Title.capitalize())

for Type in duplicate_items_by_title.keys():
    num_duplicates_items = len(duplicate_items_by_title[Type]) - len(
        set(duplicate_items_by_title[Type])
    )
    if num_duplicates_items:
        STATUS_OK = False
        log.warning(f"{num_duplicates_items} duplicate items of type <{Type}>")
        duplicates = set([x for x in duplicate_items_by_title[Type] if duplicate_items_by_title[Type].count(x) > 1])
        for d in duplicates:
            log.info(f"Title: {d}")
    else:
        log.info(f"No duplicates of type <{Type}> found")

INFO - 2022-03-02 15:50:57 - Resolving missed duplicates based on title...
INFO - 2022-03-02 15:50:57 - Title: Xsens mvn: consistent tracking of human motion using inertial sensing
INFO - 2022-03-02 15:50:57 - Title: Improvement of pedestrian flow by slow rhythm
INFO - 2022-03-02 15:50:57 - Title: Body-rotation behavior of pedestrians for collision avoidance in passing and cross flow
INFO - 2022-03-02 15:50:57 - Title: Linking pedestrian flow characteristics with stepping locomotion
INFO - 2022-03-02 15:50:57 - Title: Comparison of pedestrian fundamental diagram across cultures
INFO - 2022-03-02 15:50:57 - Title: The effect of stepping on pedestrian trajectories
INFO - 2022-03-02 15:50:57 - Title: Kinects and human kinetics: a new approach for studying crowd behavior
INFO - 2022-03-02 15:50:57 - Title: Experimental study on one-dimensional movement of luggage-laden pedestrian
INFO - 2022-03-02 15:50:57 - Title: Homogeneity and activeness of crowd on aged pedestrian dynamics
INFO - 2022
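
The title check above calls `list.count` once per title, which is quadratic in the number of items of a type. A possible alternative, sketched here with `collections.Counter`, counts every title in a single pass. `duplicate_titles_by_type` is a hypothetical name, and skipping items without a title is an assumption of this sketch.

```python
from collections import Counter, defaultdict


def duplicate_titles_by_type(items):
    """Map each item type to the sorted list of titles that occur more than once."""
    counts_by_type = defaultdict(Counter)
    for item in items:
        data = item["data"]
        title = data.get("title")
        if not title:
            continue  # items without a title cannot be compared by title
        # capitalize() makes the comparison case-insensitive, as in the cell above
        counts_by_type[data["itemType"]][title.capitalize()] += 1
    return {
        item_type: sorted(t for t, n in counts.items() if n > 1)
        for item_type, counts in counts_by_type.items()
        if any(n > 1 for n in counts.values())
    }
```

Item types without duplicates do not appear in the result, so an empty dict means the library passed this check.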

# Trash status

- Check if Trash is empty
- Report the overall status

In [11]:
if len(zot.trash()) > 0:
    log.warning("Trash is not empty. Consider emptying it!")
    STATUS_OK = False
else:
    log.info("\n----\nTrash is empty!")
    
if STATUS_OK:
    log.info("STATUS: OK")
else:
    log.warning("STATUS: NOT OK")

INFO - 2022-03-02 15:50:58 - 
----
Trash is empty!
