# Check LESO Shipments/Cancellations File (DISP_Shipments_Cancellations)

This notebook checks that the file containing shipment/cancellation data matches the structure of previous versions of the file. The following data files are used in this notebook:   
 - A CSV file containing state/territory names followed by their postal abbreviations. Both U.S. states and territories are required. 
   - This file can be populated with data from [US Postal Service Publication 28](https://pe.usps.com/text/pub28/28apb.htm).
   - The postal_file variable in this notebook should be set to the name of this file.
 - An Excel file containing shipments/cancellations data for a quarter specified in the file name (for example: DISP_Shipments_Cancellations_01012020_03312020.xlsx).   
   - To download the file expected by this notebook, click on 'SHIPMENTS(TRANSFERS)-CANCELLATIONS' from the *LESO Information for Shipments (Transfers) and Cancellations of Property* section of the [DLA LESO Public Information](https://www.dla.mil/DispositionServices/Offers/Reutilization/LawEnforcement/PublicInformation/) website.   
   - The LESO_file variable in this notebook should be set to the name of this file.
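
As a rough illustration of the postal abbreviation file described above, the following sketch parses rows in the "full name,postal abbreviation" layout into a lookup keyed by abbreviation. The function name `load_postal_abbreviations` and the three sample rows are hypothetical; a real file would hold every state and territory listed in Publication 28.

```python
import csv
import io

# Hypothetical sample rows in the "full name,postal abbreviation" layout.
sample_csv = "Alabama,AL\nAmerican Samoa,AS\nDistrict of Columbia,DC\n"

def load_postal_abbreviations(handle):
    """Return {abbreviation: full name} from a two-column CSV with no header."""
    lookup = {}
    # A single quote is used as the quote character, mirroring the notebook's read_csv call.
    for row in csv.reader(handle, quotechar="'"):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        name, abbreviation = (field.strip() for field in row)
        lookup[abbreviation] = name
    return lookup

postal_lookup = load_postal_abbreviations(io.StringIO(sample_csv))
print(postal_lookup["AS"])
```

Keying the lookup by abbreviation makes the later membership test on the 'State' column a simple dictionary check.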

This notebook expects the Excel file to have two sheets. One sheet, labelled 'SHIPMENTS', has requests made by agencies in the previous quarter. The other sheet, labelled 'CANCELLATIONS', has information about requests that have been cancelled.   

The 'SHIPMENTS' sheet has the following fields:   

   
| Field | Data Type | Description | Length | Expected Pattern | null? |   
| ----- | ---- | ---- | ---- | ---- |---- |   
| State | string | two-letter postal abbreviation for U.S. state or territory | 2 | \[A-Z\]\[A-Z\] | no |   
| Station Name (LEA) | string | descriptive name of requesting law enforcement agency | varies | varies | no |   
| Requisition ID | string | apparently unique identifier; needs further research | 14 | [A-Za-z0-9]{14} | no |   
| FSC | string | [Federal Supply Classification](https://en.wikipedia.org/wiki/NATO_Stock_Number#Federal_Supply_Classification_Group_(FSCG)) code consisting of the two-digit Federal Supply Group followed by the two-digit class within that group | 4 | \[0-9\]{4} | no |   
| NIIN | string | [National Item Identification Number](https://en.wikipedia.org/wiki/NATO_Stock_Number#National_Item_Identification_Number_(NIIN)), a 2-digit country code followed by a 7-digit item identifier | 9 | \[0-9\]{9} | no |   
| Item Name | string | descriptive name of requested item | varies | varies | no |   
| UI | string | unit of issue, the unit of measure in which the requested item is counted | varies | varies | no |   
| Quantity | integer | number of units requested | varies | [0-9]+ | no |   
| Acquisition Value | float | U.S. dollar amount paid when the item was originally purchased by the government | varies | [0-9]+[.][0-9]{2} | no |   
| Date Shipped | datetime64 | date requested; needs further research | 29 | yyyy-mm-ddT00:00:00.000000000 | no |   
| Justification | string | descriptive text justifying request; needs further research | varies | varies | yes |   

The 'CANCELLATIONS' sheet has the following fields:   

   
| Field | Data Type | Description | Length | Expected Pattern | null? |   
| ----- | ---- | ---- | ---- | ---- |---- |   
| Cancelled By | string | apparently agency that cancelled request; needs further research | varies | varies | yes | 
| RTD Ref | string | apparently unique identifier; needs further research | 6 or 7 | [0-9]{6,7} | no |   
| State | string | two-letter postal abbreviation for U.S. state or territory | 2 | \[A-Z\]\[A-Z\] | no |   
| Station Name (LEA) | string | descriptive name of requesting law enforcement agency | varies | varies | no |   
| FSC | string | [Federal Supply Classification](https://en.wikipedia.org/wiki/NATO_Stock_Number#Federal_Supply_Classification_Group_(FSCG)) code consisting of the two-digit Federal Supply Group followed by the two-digit class within that group | 4 | \[0-9\]{4} | no |   
| NIIN | string | [National Item Identification Number](https://en.wikipedia.org/wiki/NATO_Stock_Number#National_Item_Identification_Number_(NIIN)), a 2-digit country code followed by a 7-digit item identifier | 9 | \[0-9\]{9} | no |   
| Item Name | string | descriptive name of requested item | varies | varies | no |   
| UI | string | unit of issue, the unit of measure in which the requested item is counted | varies | varies | no |   
| Quantity | integer | number of units requested | varies | [0-9]+ | no |   
| Acquisition Value | float | U.S. dollar amount paid when the item was originally purchased by the government | varies | [0-9]+[.][0-9]{2} | no |   
| Date Requested | datetime64 | date request made; needs further research | 29 | yyyy-mm-ddT00:00:00.000000000 | no |   
| Justification | string | descriptive text justifying request; needs further research | varies | varies | yes |   
| Reason Cancelled | string | capitalized code followed by description of why request is cancelled; needs further research | varies | varies | yes |   
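
The "Expected Pattern" column in the tables above can be read as a set of regular expressions. The following is a minimal sketch, not part of the notebook's checks, of how a record could be tested against a few of those patterns. The function `find_pattern_mismatches` and both sample records are hypothetical and invented for illustration.

```python
import re

# Patterns transcribed from the "Expected Pattern" column of the tables above.
# The dot in the Acquisition Value pattern is escaped so it matches only a literal ".".
expected_patterns = {
    "State": r"[A-Z]{2}",
    "FSC": r"[0-9]{4}",
    "NIIN": r"[0-9]{9}",
    "Quantity": r"[0-9]+",
    "Acquisition Value": r"[0-9]+\.[0-9]{2}",
}

def find_pattern_mismatches(record):
    """Return the names of fields whose string value does not fully match its pattern."""
    return [field for field, pattern in expected_patterns.items()
            if field in record and re.fullmatch(pattern, str(record[field])) is None]

# Hypothetical records, invented for illustration only.
good_record = {"State": "AL", "FSC": "1005", "NIIN": "000000001",
               "Quantity": "3", "Acquisition Value": "499.00"}
bad_record = {"State": "al", "FSC": "105", "NIIN": "000000001",
              "Quantity": "3", "Acquisition Value": "499"}

print(find_pattern_mismatches(good_record))  # []
print(find_pattern_mismatches(bad_record))   # ['State', 'FSC', 'Acquisition Value']
```

`re.fullmatch` is used so that a value with extra leading or trailing characters is reported rather than silently accepted. For a check like this to be meaningful on FSC and NIIN, the values would likely need to be read as strings, since numeric parsing drops leading zeros.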

In [None]:
#    Libraries used by this notebook.
import pandas as pd
import re
import sys

#!python --version  #Python 3.8.5
# sys is a standard library
#pd.__version__      #1.1.2 
#re.__version__     #2.2.1

#    Forward slashes work on Windows, macOS, and Linux, and match path_datafiles below.
sys.path.insert(0, "../../scripts/")
from checksumfunctions import get_file_info
from checksumfunctions import get_file_hash
from notebookfunctions import get_unexpected_values

In [None]:
#    VARIABLES THAT CAN BE CUSTOMIZED

#    Enter the path to the folder containing all the data files.
path_datafiles = "../../data/"

#    This notebook expects a comma-separated file consisting of:
#        full name,postal abbreviation
#    The values can be downloaded from U.S. Postal Service Publication 28:
#        https://pe.usps.com/text/pub28/28apb.htm
#    
#    Enter the name of the file containing postal codes.
postal_file = 'USPS_StateAbbreviations.csv'

#    Get the 'LESO Information for Shipments (Transfers) and Cancellations of Property' file from 
#        Defense Logistics Agency Law Enforcement Support Office Public Information
#    The original name of the data file should be in the form:
#        DISP_Shipments_Cancellations_mmddyyyy_mmddyyyy.xlsx
#    
#    Enter the name of the LESO file to be checked.
#LESO_file = "DISP_Shipments_Cancellations_01012020_03312020.xlsx"
#LESO_file = "DISP_Shipments_Cancellations_04012020_06302020.xlsx"
LESO_file = "DISP_Shipments_Cancellations_07012020_09302020.xlsx"
#LESO_file = "DISP_Shipments_Cancellations_10012020_12312020.xlsx"
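
The file name convention noted in the comments above encodes the start and end dates of the quarter. As a hedged sketch, the dates could be recovered from the name before the file is opened, which gives an early warning if the wrong file was selected. The pattern and the function `parse_quarter` are hypothetical; the pattern assumes the `mmddyyyy_mmddyyyy` layout and also tolerates an optional `to_` between the two dates.

```python
import re
from datetime import datetime

# Assumed layout: DISP_Shipments_Cancellations_mmddyyyy_mmddyyyy.xlsx,
# optionally with "to_" between the two dates.
FILENAME_PATTERN = re.compile(
    r"DISP_Shipments_Cancellations_(\d{8})_(?:to_)?(\d{8})\.xlsx")

def parse_quarter(file_name):
    """Return (start_date, end_date) parsed from the file name, or None if it does not match."""
    match = FILENAME_PATTERN.fullmatch(file_name)
    if match is None:
        return None
    try:
        start, end = (datetime.strptime(part, "%m%d%Y").date()
                      for part in match.groups())
    except ValueError:
        # Eight digits that do not form a real calendar date.
        return None
    return start, end

print(parse_quarter("DISP_Shipments_Cancellations_07012020_09302020.xlsx"))
```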

In [None]:
#    VARIABLES THAT SHOULD NOT BE CHANGED

#    Assume the file is good to merge.
flag_file_good_to_merge = True

#    Expected sheets based on sheets from previous files.
expected_sheets = ['SHIPMENTS', 'CANCELLATIONS']

#    Expected columns based on columns from previous files.
expected_columns = {'SHIPMENTS': ['State', 'Station Name (LEA)', 'Requisition ID',
                                  'FSC', 'NIIN', 'Item Name', 'UI', 'Quantity',
                                  'Acquisition Value', 'Date Shipped', 'Justification'],
                    'CANCELLATIONS': ['Cancelled By', 'RTD Ref', 'State', 'Station Name (LEA)',
                                      'FSC', 'NIIN', 'Item Name', 'UI', 'Quantity',
                                      'Acquisition Value', 'Date Requested', 'Justification',
                                      'Reason Cancelled']}

#    Build a dictionary of expected postal abbreviations based on the file
#    named by the 'postal_file' variable.
#        key: state abbreviation
#        value: state name
expected_postal_abbreviations = pd.read_csv(path_datafiles + postal_file, header=None,
                                            quotechar = "'").set_index([1])[0].to_dict() 


In [None]:
#    Read the data from the XLSX file.
ship_canc_dict = pd.read_excel(path_datafiles + LESO_file, sheet_name=None)
#    ship_canc_dict is a dictionary of all sheets in the LESO_file
#        keys are 'SHIPMENTS', 'CANCELLATIONS'
#        values are a single dataframe, columns vary in each dataframe
#    The records are not cumulative from quarter to quarter.

In [None]:
print('This notebook is checking: ')
print('%s\t%s\t%s' % get_file_info(path_datafiles + LESO_file))
print('MD5\t %s' % get_file_hash(path_datafiles + LESO_file, 'md5'))
print('SHA256\t %s' % get_file_hash(path_datafiles + LESO_file, 'sha256'))
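
The `get_file_hash` helper used above is imported from `checksumfunctions`, and its implementation is not shown in this notebook. The following is a minimal sketch of how a comparable digest could be computed with the standard library's `hashlib`. The name `sketch_file_hash` and its parameters are hypothetical, and the demonstration hashes a throwaway temporary file rather than a real LESO workbook.

```python
import hashlib
import os
import tempfile

def sketch_file_hash(file_path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(file_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demonstrate on a small temporary file that is removed afterwards.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
    demo_path = tmp.name
try:
    demo_md5 = sketch_file_hash(demo_path, "md5")
    demo_sha256 = sketch_file_hash(demo_path, "sha256")
finally:
    os.remove(demo_path)

print("MD5\t", demo_md5)
print("SHA256\t", demo_sha256)
```

Recording a digest alongside the file name and size makes it possible to confirm later that the same bytes were checked.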

### THESE QUESTIONS DECIDE IF THIS FILE CAN BE MERGED WITH FILES FROM PREVIOUS QUARTERS

###### QUESTION A: Does the file have the expected sheets?

In [None]:
good_sheets, missing_sheets, unexpected_sheets = [], [], []
found_sheets = list(ship_canc_dict.keys())
#    Compare as sets so that a change in sheet order alone does not fail the check.
if set(found_sheets) == set(expected_sheets):
    print('Only the expected sheets were found.')
    good_sheets = found_sheets
else:
    #    Sheets that are expected but absent, and sheets that are present but not expected.
    missing_sheets = sorted(set(expected_sheets) - set(found_sheets))
    unexpected_sheets = sorted(set(found_sheets) - set(expected_sheets))
    if len(missing_sheets) > 0:
        print('Shipments_Cancellations file has the following missing sheets:\n', missing_sheets)
    if len(unexpected_sheets) > 0:
        print('Shipments_Cancellations file has the following unexpected sheets:\n', unexpected_sheets)
    #    Only sheets that are both found and expected are checked further.
    good_sheets = [sheet for sheet in found_sheets if sheet in expected_sheets]
    flag_file_good_to_merge = False
sheet_discrepancy = [missing_sheets, unexpected_sheets]

###### QUESTION B: Are the values of 'State' valid U.S. postal abbreviations?

In [None]:
unexpected_state_abbreviations = []
for sheet_name in good_sheets:
    unexpected_state_abbreviations_by_sheet = [state_abbr for state_abbr in ship_canc_dict[sheet_name]['State']
                                               if state_abbr not in expected_postal_abbreviations]
    if len(unexpected_state_abbreviations_by_sheet) == 0:
        print('Only valid state and territory abbreviations found in the', sheet_name, 'sheet.')
    else:
        print('These state or territory abbreviations are not valid in the', sheet_name, 'sheet:\n',
              unexpected_state_abbreviations_by_sheet)
        #    Accumulate across sheets so that Question D can report the problem.
        unexpected_state_abbreviations = [*unexpected_state_abbreviations,
                                          *unexpected_state_abbreviations_by_sheet]
        flag_file_good_to_merge = False

###### QUESTION C: Do all sheets have the expected columns? (Each sheet should have a different set of columns.)

In [None]:
missing_columns = []
unexpected_columns = []
for sheet_name in good_sheets:
    found_columns = set(ship_canc_dict[sheet_name].columns)
    #    Columns that are expected but absent, and columns that are present but not expected.
    missing_columns_by_sheet = sorted(set(expected_columns[sheet_name]) - found_columns, key=str)
    unexpected_columns_by_sheet = sorted(found_columns - set(expected_columns[sheet_name]), key=str)
    if (len(missing_columns_by_sheet) == 0) and (len(unexpected_columns_by_sheet) == 0):
        print('Only expected columns were found in the', sheet_name, 'sheet.')
    else:
        if len(missing_columns_by_sheet) > 0:
            print('These columns are missing from the', sheet_name, 'sheet:\n', missing_columns_by_sheet)
            missing_columns = [*missing_columns, *missing_columns_by_sheet]
        if len(unexpected_columns_by_sheet) > 0:
            print('These unexpected columns were found in the', sheet_name, 'sheet:\n',
                  unexpected_columns_by_sheet)
            unexpected_columns = [*unexpected_columns, *unexpected_columns_by_sheet]
        flag_file_good_to_merge = False
column_discrepancy = [missing_columns, unexpected_columns]

###### QUESTION D: Can this file be merged with DLA LESO Public Data files from previous quarters?

In [None]:
if flag_file_good_to_merge:
    print('Yes, this file can be merged with DLA LESO Public Data files from previous quarters.')
else:
    print('No, this file cannot be merged for the following reasons:')
    if (len(sheet_discrepancy[0]) + len(sheet_discrepancy[1]) > 0):
        print('See Question A')
    if len(unexpected_state_abbreviations) > 0:
        print('See Question B')
    if (len(column_discrepancy[0]) + len(column_discrepancy[1]) > 0):
        print('See Question C')

### ADDITIONAL INFORMATION ABOUT THE ORIGINAL DATA

###### QUESTION 1: What is the basic shape of the data in each sheet?

In [None]:
for sheet_name in good_sheets:
    print('The', sheet_name, 'sheet has shape:', ship_canc_dict[sheet_name].shape)

###### QUESTION 2: What fields have null values in the original file?

In [None]:
null_counts = []
for sheet_name in good_sheets:
    #    Series.items() replaces iteritems(), which newer pandas releases no longer provide.
    a_list = [(k, v) for k, v in ship_canc_dict[sheet_name].isna().sum().items()
              if v > 0]
    null_counts.append({sheet_name: a_list})
for a_dict in null_counts:
    for key in a_dict:
        if not a_dict[key]:
            print('The', key, 'sheet has no null values.')
        else:
            print('The', key, 'sheet has null values in the following columns:')
            for a_tuple in a_dict[key]:
                print('\t', a_tuple[0], ' (' + str(a_tuple[1]) + ')')

###### QUESTION 3: How many unique values are in each column of the 'SHIPMENTS' sheet?

In [None]:
if 'SHIPMENTS' not in good_sheets:
    print('Cannot count the unique values in the SHIPMENTS sheet because the sheet is missing from the file.')
else:
    unique_counts = ship_canc_dict['SHIPMENTS'].groupby('State', as_index=False).nunique()
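    #    Fall back to the raw abbreviation when it is not a recognized postal code (see Question B).
    unique_counts['State Name'] = [expected_postal_abbreviations.get(i, i) for i in unique_counts['State']]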
    display(pd.concat([pd.DataFrame([ship_canc_dict['SHIPMENTS'].nunique()], index=['Total']),
               unique_counts.set_index('State Name')]))

###### QUESTION 4: How many unique values are in each column of the 'CANCELLATIONS' sheet?

In [None]:
if 'CANCELLATIONS' not in good_sheets:
    print('Cannot count the unique values in the CANCELLATIONS sheet because the sheet is missing from the file.')
else:
    unique_counts = ship_canc_dict['CANCELLATIONS'].groupby('State', as_index=False).nunique()
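    #    Fall back to the raw abbreviation when it is not a recognized postal code (see Question B).
    unique_counts['State Name'] = [expected_postal_abbreviations.get(i, i) for i in unique_counts['State']]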
    display(pd.concat([pd.DataFrame([ship_canc_dict['CANCELLATIONS'].nunique()], index=['Total']),
               unique_counts.set_index('State Name')]))