# Check LESO Transferred Property File (DISP_AllStatesAndTerritories)

This notebook checks that the file containing transferred inventory data matches the structure of previous versions of the file. The following data files are used in this notebook:   
 - A CSV file containing state/territory names followed by their postal abbreviations. Both U.S. states and territories are required. 
   - This file can be populated with data from [US Postal Service Publication 28](https://pe.usps.com/text/pub28/28apb.htm).
   - The postal_file variable in this notebook should be set to the name of this file.
 - An Excel file containing transferred inventory data up to the quarter specified in the file name (for example: DISP_AllStatesAndTerritories_03312020.xlsx).   
   - To download the file expected by this notebook, click on 'ALASKA - WYOMING AND US TERRITORIES' from the *LESO Property Transferred to Participating Agencies* section of the [DLA LESO Public Information](https://www.dla.mil/DispositionServices/Offers/Reutilization/LawEnforcement/PublicInformation/) website.   
   - The LESO_file variable in this notebook should be set to the name of this file.

This notebook expects the Excel file to have one sheet for each state or territory with agencies that received property through the program. Each sheet has the following fields:   

   
| Field | Data Type | Description | Length | Expected Pattern | null? |   
| ----- | ---- | ---- | ---- | ---- |---- |   
| State | string | two-letter postal abbreviation for U.S. state or territory | 2 | \[A-Z\]\[A-Z\] | no |   
| Station Name (LEA) | string | descriptive name of requesting law enforcement agency | varies | varies | no |   
| NSN | string | [NATO Stock Number](https://en.wikipedia.org/wiki/NATO_Stock_Number), a government-assigned identifier for requested item | 16 | \[0-9\]{4}-\[0-9\]{2}-\[A-Z0-9\]{3}-\[A-Z0-9\]{4} | no |   
| Item Name | string | descriptive name of requested item | varies | varies | no |   
| UI | string | unit of issue for the requested item (for example, each or kit) | varies | varies | no |   
| Quantity | integer | number of units requested | varies | \[0-9\]+ | no |   
| Acquisition Value | float | U.S. dollar amount paid when the item was originally purchased by the government | varies | \[0-9\]+\\.\[0-9\]{2} | no |   
| DEMIL Code | character | [demilitarization code](https://www.dla.mil/HQ/LogisticsOperations/Services/FIC/DEMILCoding/DEMILCodes/) for level of destruction required when the item leaves Department of Defense control | 1 | \[GPFDCEBQA\] | no |   
| DEMIL IC | integer | [demilitarization integrity code](https://www.dla.mil/HQ/LogisticsOperations/Services/FIC/DEMILCoding/DEMILCodes/) indicating the validity of the DEMIL Code (a missing value means the code has not yet been reviewed); see the [FLIS manual](https://www.dla.mil/HQ/LogisticsOperations/TrainingandReference/FLISProcedures/) for more information | 1 | \[0-9\] or blank | yes |   
| Ship Date | datetime64 | date transferred; needs further research | 29 | yyyy-mm-ddT00:00:00.000000000 | no |   
| Station Type | string | level of government associated with requesting agency; needs further research | 5 | 'State' | no |   
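
The expected patterns in the table can be checked with the standard `re` module. The following is a minimal sketch and not one of this notebook's checks; the names `FIELD_PATTERNS` and `matches_expected_pattern` are hypothetical, and the sample values are made up for illustration.

```python
import re

# Hypothetical mapping of field name to the expected pattern from the table above.
FIELD_PATTERNS = {
    "State": re.compile(r"[A-Z]{2}"),
    "NSN": re.compile(r"[0-9]{4}-[0-9]{2}-[A-Z0-9]{3}-[A-Z0-9]{4}"),
    "Quantity": re.compile(r"[0-9]+"),
    "DEMIL Code": re.compile(r"[GPFDCEBQA]"),
}


def matches_expected_pattern(field, value):
    """Return True if the whole value matches the expected pattern for the field."""
    return FIELD_PATTERNS[field].fullmatch(str(value)) is not None


# Made-up sample values, for illustration only.
print(matches_expected_pattern("NSN", "1005-01-128-9936"))  # well-formed
print(matches_expected_pattern("NSN", "1005-01-1289936"))   # missing a hyphen
print(matches_expected_pattern("State", "ak"))              # lowercase
```

`fullmatch` is used so that a value with extra leading or trailing characters is rejected instead of matching on a substring.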

In [None]:
#    Libraries used by this notebook.
import pandas as pd
import re
import sys

#!python --version  #Python 3.8.5
# sys is a standard library
#pd.__version__     #1.1.2 
#re.__version__     #2.2.1

#    Forward slashes work on Windows, macOS, and Linux.
sys.path.insert(0, "../../scripts/")
from checksumfunctions import get_file_info
from checksumfunctions import get_file_hash
from notebookfunctions import get_unique_values
from notebookfunctions import get_unexpected_values

In [None]:
#    VARIABLES THAT CAN BE CUSTOMIZED

#    Enter the path to the folder containing all the data files.
path_datafiles = "../../data/"

#    This notebook expects a comma-separated file consisting of:
#        full name,postal abbreviation
#    The values can be downloaded from U.S. Postal Service Publication 28:
#        https://pe.usps.com/text/pub28/28apb.htm
#    
#    Enter the name of the file containing postal codes.
postal_file = 'USPS_StateAbbreviations.csv'

#    Get the 'LESO Property Transferred to Participating Agencies' file from 
#        Defense Logistics Agency Law Enforcement Support Office Public Information
#    The original name of the data file should be in the form:
#        DISP_AllStatesAndTerritories_mmddyyyy.xlsx  
#    
#    Enter the name of the LESO file to be checked.
#LESO_file = "DISP_AllStatesAndTerritories_03312020.xlsx"
#LESO_file = "DISP_AllStatesAndTerritories_06302020.xlsx"
LESO_file = "DISP_AllStatesAndTerritories_09302020.xlsx"
#LESO_file = "DISP_AllStatesAndTerritories_12312020.xlsx"

In [None]:
#    VARIABLES THAT SHOULD NOT BE CHANGED

#    Assume the file is good to merge.
flag_file_good_to_merge = True

#    Expected columns based on columns from previous files.
expected_columns = ['State', 'Station Name (LEA)',
                    'NSN', 'Item Name', 'Quantity', 'UI', 'Acquisition Value',
                    'DEMIL Code', 'DEMIL IC', 'Ship Date', 'Station Type']

#    Expected 'Station Types' based on values from previous files.
expected_station_types = ['State']

#    Expected 'DEMIL Codes' based on DOD 4160.28 DEMIL Program or
#    DOD 4100.39M FLIS Manual at this website:
#        https://www.dla.mil/HQ/LogisticsOperations/Services/FIC/DEMILCoding/DEMILCodes/
expected_demil_codes = ['G', 'P', 'F', 'D', 'C', 'E', 'B', 'Q', 'A']


#    Expected 'DEMIL IC' values based on DOD 4160.28 DEMIL Program or
#    DOD 4100.39M FLIS Manual at this website:
#        https://www.dla.mil/HQ/LogisticsOperations/Services/FIC/DEMILCoding/DEMILCodes/
expected_demil_integritycodes = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

#    Build a dictionary of expected postal abbreviations based on the file
#    named by the 'postal_file' variable.
#        key: state abbreviation
#        value: state name
expected_postal_abbreviations = pd.read_csv(path_datafiles + postal_file, header=None,
                                            quotechar = "'").set_index([1])[0].to_dict() 
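
To illustrate how the postal file becomes a dictionary keyed by abbreviation, here is a minimal sketch that uses an in-memory CSV instead of the real file. The two rows and the names `sample_csv` and `sample_abbreviations` are illustrative assumptions.

```python
import io

import pandas as pd

# Two illustrative rows in the expected "full name,postal abbreviation" layout.
sample_csv = io.StringIO("Alaska,AK\nGuam,GU\n")

sample_abbreviations = (
    pd.read_csv(sample_csv, header=None, quotechar="'")
    .set_index([1])[0]  # index by abbreviation (column 1), keep names (column 0)
    .to_dict()
)
print(sample_abbreviations)
```

Because the file has no header row, pandas labels the columns `0` and `1`, which is why the code indexes by integer.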

In [None]:
#    Read the data from the XLSX file.
transfer_dict = pd.read_excel(path_datafiles + LESO_file, sheet_name=None)
#    transfer_dict is a dictionary of all sheets in the LESO_file
#         keys are full state/territory names
#         values are a single dataframe of all transfers for that state/territory
#    The records may be cumulative up to this quarter.
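
Because `sheet_name=None` returns a dictionary of dataframes, the sheets can be stacked into one dataframe once the checks in this notebook pass. The sketch below uses a small hand-built dictionary as a stand-in for the real file; the data and the names `sample_transfer_dict` and `combined` are made up for illustration.

```python
import pandas as pd

# Stand-in for the dictionary returned by pd.read_excel(..., sheet_name=None).
sample_transfer_dict = {
    "Alaska": pd.DataFrame({"State": ["AK", "AK"], "Quantity": [1, 5]}),
    "Guam": pd.DataFrame({"State": ["GU"], "Quantity": [2]}),
}

# Stack all sheets; the dictionary keys become the outer index level.
combined = pd.concat(sample_transfer_dict, names=["Sheet Name", "Row"])
print(combined)
```

Keeping the sheet name as an index level preserves which sheet each row came from.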

In [None]:
print('This notebook is checking: ')
print('%s\t%s\t%s' % get_file_info(path_datafiles + LESO_file))
print('MD5\t %s' % get_file_hash(path_datafiles + LESO_file, 'md5'))
print('SHA256\t %s' % get_file_hash(path_datafiles + LESO_file, 'sha256'))
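
The `get_file_hash` helper is imported from `checksumfunctions`, which is not shown in this notebook. As a rough idea of how such a helper could work, here is a sketch built on the standard `hashlib` module; `sketch_file_hash` is a hypothetical name and may differ from the actual implementation.

```python
import hashlib
import os
import tempfile


def sketch_file_hash(file_path, algorithm="sha256", chunk_size=65536):
    """Hash a file in chunks so large files are not read into memory at once."""
    hasher = hashlib.new(algorithm)
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Demonstrate on a small temporary file rather than the real LESO file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example contents")
    tmp_path = tmp.name
try:
    digest = sketch_file_hash(tmp_path, "sha256")
    print(digest)
finally:
    os.remove(tmp_path)
```

Recording a hash alongside the file name makes it possible to confirm later that the same file was checked.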

### THESE QUESTIONS DECIDE IF THIS FILE CAN BE MERGED WITH FILES FROM PREVIOUS QUARTERS

###### QUESTION A: Are the values of 'State' valid U.S. postal abbreviations?

In [None]:
unexpected_postal_abbreviations = get_unexpected_values(set(get_unique_values(transfer_dict,'State')),
                                                        set(expected_postal_abbreviations.keys()))
if len(unexpected_postal_abbreviations) == 0:
    print('Only valid state and territory abbreviations were found.')
else:
    print('These state or territory abbreviations are not valid:\n', list(unexpected_postal_abbreviations))
    flag_file_good_to_merge = False
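
The helpers `get_unique_values` and `get_unexpected_values` come from `notebookfunctions`, which is not shown here. The sketch below is one plausible way they could behave, assuming the first collects a column's unique values across all sheets and the second returns a set difference. The `sketch_` names and the sample data are hypothetical.

```python
import pandas as pd


def sketch_unique_values(sheets, column):
    """Collect the unique values of one column across every sheet."""
    values = []
    for df in sheets.values():
        values.extend(df[column].unique().tolist())
    return values


def sketch_unexpected_values(found, expected):
    """Return the values that were found but are not in the expected set."""
    return set(found) - set(expected)


# Made-up sheets: the second one contains an invalid abbreviation.
sample_sheets = {
    "Alaska": pd.DataFrame({"State": ["AK", "AK"]}),
    "Guam": pd.DataFrame({"State": ["GU", "XX"]}),
}
unexpected = sketch_unexpected_values(
    sketch_unique_values(sample_sheets, "State"), {"AK", "GU"}
)
print(unexpected)
```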

###### QUESTION B: Does each sheet have exactly one value for 'State'?

In [None]:
inconsistant_postal_abbreviations = [state_name for state_name in transfer_dict
                                     if len(transfer_dict[state_name]['State'].unique()) != 1]
if len(inconsistant_postal_abbreviations) == 0:
    print('All sheets have exactly one state/territory abbreviation.')
else:
    print('These sheets do not have exactly one state/territory abbreviation:\n', inconsistant_postal_abbreviations)
    flag_file_good_to_merge = False

###### QUESTION C: Do all sheets have the expected columns? (All sheets should have the same columns.)

In [None]:
column_discrepancy = [state_name for state_name in transfer_dict
                      if (expected_columns != transfer_dict[state_name].columns.tolist())]
if len(column_discrepancy) == 0:
    print('Only expected columns were found.')
else:
    print('Columns need to be checked on these states:\n',column_discrepancy)
    flag_file_good_to_merge = False
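
When a sheet fails the column check, it helps to know whether columns are missing, extra, or only reordered. The following sketch shows one way to report that; `describe_column_discrepancy` is a hypothetical helper and the sample sheet is made up.

```python
import pandas as pd


def describe_column_discrepancy(df, expected):
    """Summarize how a sheet's columns differ from the expected list."""
    found = df.columns.tolist()
    return {
        "missing": [c for c in expected if c not in found],
        "extra": [c for c in found if c not in expected],
        # Same column names, but not in the expected order.
        "order_differs": sorted(found) == sorted(expected) and found != expected,
    }


sample_expected = ["State", "NSN", "Quantity"]
sample_sheet = pd.DataFrame(columns=["State", "Quantity", "NSN"])
print(describe_column_discrepancy(sample_sheet, sample_expected))
```

Note that the check in this notebook compares lists, so a sheet whose columns are merely reordered is also reported as a discrepancy.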

###### QUESTION D: Can this file be merged with DLA LESO Public Data files from previous quarters?

In [None]:
if flag_file_good_to_merge:
    print('Yes, this file can be merged with DLA LESO Public Data files from previous quarters.')
else:
    print('No, this file cannot be merged for the following reasons:')
    if len(unexpected_postal_abbreviations) > 0:
        print('See Question A')
    if len(inconsistant_postal_abbreviations) > 0:
        print('See Question B')
    if len(column_discrepancy) > 0:
        print('See Question C')

### ADDITIONAL INFORMATION ABOUT THE ORIGINAL DATA

###### QUESTION 1: What is the basic shape of the data?

In [None]:
print('Transfers file has', len(transfer_dict), 'states/territories.')
print('Transfers file has', sum([len(x) for x in transfer_dict.values()]), 'rows across all states/territories.')

###### QUESTION 2: Do the state or territory names on all sheets match U.S. postal names?

In [None]:
incorrect_state_names = [state_name for state_name in transfer_dict 
                         if state_name not in expected_postal_abbreviations.values()]
if len(incorrect_state_names) == 0:
    print('All state/territory names match U.S. Postal Service names.')
else:
    for i in incorrect_state_names:
        abbreviations = list(transfer_dict[i]['State'].unique())
        print('Misspelled state/territory name : ', i,' abbreviated as ',abbreviations)
        if flag_file_good_to_merge:
            print('\tBest guess state/territory name: ', 
                  expected_postal_abbreviations[abbreviations[0]])

###### QUESTION 3: How many total null/NaN values in each column?

In [None]:
null_counts = pd.DataFrame(columns=expected_columns)
count = 0
for state_name in transfer_dict:
    #    .items() replaces .iteritems(), which was removed in pandas 2.0.
    for k,v in transfer_dict[state_name].isna().sum().items():
        null_counts.loc[count, k] = v
    null_counts.loc[count, 'State Name'] = state_name
    count+=1
for col,num_null in null_counts[expected_columns].sum().astype(int).items():
    if num_null > 0:
        print('Found', num_null, 'null values in', col, 'across all states/territories.')
print('All other columns had no null values across all states/territories.')
#    Uncomment the following line to see null values by state.
#null_counts.set_index('State Name')
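
As an alternative to filling the table cell by cell, the per-sheet null counts can be assembled in a single expression. This is a sketch on made-up data, not a replacement for the cell above; the names `sample_sheets` and `null_by_sheet` are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up sheets with some missing DEMIL IC values.
sample_sheets = {
    "Alaska": pd.DataFrame({"State": ["AK", "AK"], "DEMIL IC": [1.0, np.nan]}),
    "Guam": pd.DataFrame({"State": ["GU"], "DEMIL IC": [np.nan]}),
}

# One Series of null counts per sheet, transposed so each sheet is a row.
null_by_sheet = pd.DataFrame(
    {name: df.isna().sum() for name, df in sample_sheets.items()}
).T
null_by_sheet.index.name = "State Name"
print(null_by_sheet)
```

The same pattern works for unique counts by swapping `df.isna().sum()` for `df.nunique()`.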

###### QUESTION 4: Are the unique values of 'Station Type' as expected?

In [None]:
unexpected_station_types = get_unexpected_values(set(get_unique_values(transfer_dict,'Station Type')),
                                                 set(expected_station_types))
if len(unexpected_station_types) == 0:
    print('\nOnly expected station types found.')
else:
    print('\nFound these unexpected station types:',list(unexpected_station_types))

###### QUESTION 5: Are the unique values of 'DEMIL Code' as expected?

In [None]:
unexpected_demil_codes = get_unexpected_values(set(get_unique_values(transfer_dict,'DEMIL Code')),
                                               set(expected_demil_codes))
if len(unexpected_demil_codes) == 0:
    print('\nOnly expected DEMIL codes found.')
else:
    print('\nFound these unexpected DEMIL codes:',list(unexpected_demil_codes))

###### QUESTION 6: Are the unique values of 'DEMIL IC' as expected?

In [None]:
unexpected_demil_integritycodes = get_unexpected_values(set(get_unique_values(transfer_dict,'DEMIL IC')),
                                                        set(expected_demil_integritycodes))
#    Keep only the unexpected codes that are not NaN.
non_nan_list = [ic for ic in unexpected_demil_integritycodes if pd.notna(ic)]
if len(non_nan_list) > 0:
    print('Found these unexpected DEMIL integrity codes:',non_nan_list)
else:
    print('Only expected integrity codes found.')
print('Found',len(unexpected_demil_integritycodes) - len(non_nan_list),
      'states with NaN DEMIL integrity code values.\nRecall that a missing DEMIL integrity code means the DEMIL code has not yet been reviewed.')

###### QUESTION 7: How many unique values are in each column of each sheet?

In [None]:
unique_counts = pd.DataFrame(columns=expected_columns)
count = 0
for state_name in transfer_dict:
    #    .items() replaces .iteritems(), which was removed in pandas 2.0.
    for col, num_uniq in transfer_dict[state_name].nunique().items():
        unique_counts.loc[count, col] = num_uniq
    unique_counts.loc[count, 'State Name'] = state_name
    count += 1
unique_counts.set_index('State Name')