# Merge DLA LESO Public Information Files

The files published at the DLA LESO Public Information website do not appear to be cumulative or consistent over time. Therefore, the files may need to be collected from the site each quarter. This repository builds a dataset to meet the following needs:   
 - accumulate the data in a tab-separated format   
 - merge the two kinds of DLA LESO files into one file   
 - allow the original data to be pulled out of the merged data    

BEFORE RUNNING THIS NOTEBOOK, run the following notebooks on the files to be merged:   
__Check_DISP_AllStatesAndTerritories.ipynb__ to check *LESO Property Transferred to Participating Agencies*      
__Check_DISP_Shipments_Cancellations.ipynb__ to check *LESO Information for Shipments (Transfers) and Cancellations of Property*   

This notebook merges one *LESO Property Transferred to Participating Agencies* file with one *LESO Information for Shipments (Transfers) and Cancellations of Property* file. New columns are generated, but the original data is not altered, so it should be possible to recreate the original data from the fields in the merged data.   
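
As a minimal sketch of how the original rows might be pulled back out of the merged data, the example below filters on the constructed `File` and `Sheet` fields and drops the columns that hold only the 'not in file' fill value. The helper name `recover_original`, the file names, and the sample values are hypothetical and used only for illustration.

```python
import pandas as pd

# Toy merged data: one row from each kind of source file (all values are made up).
merged_df = pd.DataFrame({
    'File': ['AllStates.xlsx', 'ShipCanc.xlsx'],
    'Sheet': ['Alabama', 'SHIPMENTS'],
    'StateAbbreviation': ['AL', 'AL'],
    'ItemDescription': ['ITEM A', 'ITEM B'],
    'NSN': ['1005-00-073-9421', 'not in file'],
    'FSC': ['not in file', '1005'],
    'NIIN': ['not in file', '000739421'],
})

def recover_original(df, file_name, sheet_name, fill_value='not in file'):
    """Return the rows from one source sheet without the columns that sheet never had."""
    rows = df[(df['File'] == file_name) & (df['Sheet'] == sheet_name)]
    # A column that holds only the fill value was not present in the source sheet.
    absent = [col for col in rows.columns if (rows[col] == fill_value).all()]
    return rows.drop(columns=absent)

shipments = recover_original(merged_df, 'ShipCanc.xlsx', 'SHIPMENTS')
print(list(shipments.columns))
```

The result still carries the merged column names and the constructed fields, so a full round trip would also need the column dictionaries defined later in this notebook to restore the original headers.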

Guided by the file naming convention used in the [Stanford Open Policing repository](https://github.com/5harad/openpolicing), the merged data is split by two-letter state or territory abbreviation and then exported to tab-separated files named "TWO_LETTER_CODE_leso.tsv". If the tab-separated files already exist, the notebook appends the merged data to those files. This allows DLA LESO Public Information data from different quarters to be consolidated into one set of files.   

The merged data has the following fields:

   
| Field | Data Type | Description | Original Column | Length | Expected Pattern | null? |   
| ----- | ---- | ---- | ---- | ---- |---- | ---- |   
||| __Constructed Fields__ |||||   
| File | string | file that populated this record | same as LESO filename | varies | see LESOfile variables | no |
| Sheet | string | sheet that populated this record | same as sheet name in LESO file | varies | varies | no | 
| Item_FSG | string | supply category the item belongs to; see [Federal Supply Group Number](https://en.wikipedia.org/wiki/List_of_NATO_Supply_Classification_Groups#References) | file dependent, digits 1&2 of \['NSN','FSC'\]| 2 | \[0-9\]{2} | no |   
| Item_FSC | string | supply class the item belongs to; see [Federal Supply Group Number](https://en.wikipedia.org/wiki/List_of_NATO_Supply_Classification_Groups#References) | file dependent, digits 3&4 of \['NSN','FSC'\] | 2 | \[0-9\]{2} | no |   
| Item_CC | string | country code for where final assembly of item occurred (a.k.a. nation code); see [National Codification Bureau](https://en.wikipedia.org/wiki/National_Codification_Bureau) | file dependent, digits 5&6 of \['NSN'\] or digits 1&2 of \['NIIN'\]| 2 | \[0-9\]{2} | no |   
| Item_Code | string | 7-digit identifier of the item that follows the country code; see [National Item Identification Number](https://en.wikipedia.org/wiki/NATO_Stock_Number#National_Item_Identification_Number_(NIIN)) | file dependent, last 7 digits of \['NSN','NIIN'\] | 7 | \[0-9\]{7} | no |   
||| __Fields from Both Files__ |||||   
| StateAbbreviation | string | two-letter postal abbreviation for U.S. state or territory | State | 2 | \[A-Z\]\[A-Z\] | no |   
| RequestingAgency | string | descriptive name of requesting law enforcement agency | Station Name (LEA); renamed Agency Name in the AllStatesAndTerritories file as of 2021-06-30 | varies | varies | no |   
| ItemDescription | string | descriptive name of requested item | Item Name | varies | varies | no |   
| RecordDate | datetime64 | date | file dependent \['Ship Date','Date Shipped','Date Requested'\] | 29 | yyyy-mm-ddT00:00:00.000000000 | no |   
| AcquisitionValue | float | U.S. dollar amount paid when the item was originally purchased by the government | Acquisition Value | varies | [0-9]+.[0-9]{2} | no |   
| Quantity | integer | number of units requested | Quantity | varies | [0-9]+ | no |   
| UnitIncrement | string | units of requested item known as unit increments | UI | varies | varies | no |   
||| __Fields from All Sheets in AllStatesAndTerritories__ | __fill value 'not in file'__ ||||   
| NSN | string | [NATO Stock Number](https://en.wikipedia.org/wiki/NATO_Stock_Number), a government-assigned identifier for requested item | NSN | 16 (13 characters plus 3 hyphens) | \[0-9\]{4}-\[0-9\]{2}-\[A-Z0-9\]{3}-\[A-Z0-9\]{4} | no |   
| DEMILCode | character | [demilitarization code](https://www.dla.mil/HQ/LogisticsOperations/Services/FIC/DEMILCoding/DEMILCodes/) for level of destruction required when the item leaves Department of Defense control | DEMIL Code | 1 | \[GPFDCEBQA\] | no |   
| DEMILIC | integer | [demilitarization integrity code](https://www.dla.mil/HQ/LogisticsOperations/Services/FIC/DEMILCoding/DEMILCodes/) indicating the validity of the DEMIL Code (a missing value means it has not yet been reviewed); see [FLIS manual](https://www.dla.mil/HQ/LogisticsOperations/TrainingandReference/FLISProcedures/) for more information | DEMIL IC | 1 | [0-9] or blank | yes |   
| StationType | string | level of government associated with requesting agency; needs further research | Station Type | 5 | 'State' | no |   
||| __Fields from Both Sheets in Shipments_Cancellations__ | __fill value 'not in file'__ ||||   
| FSC | string | [Federal Supply Number](https://en.wikipedia.org/wiki/NATO_Stock_Number#Federal_Supply_Classification_Group_(FSCG)) consisting of the Federal Supply Group and Federal Supply Classification | FSC | 4 | \[0-9\]{4} | no |   
| NIIN | string | [National Item Identification Number](https://en.wikipedia.org/wiki/NATO_Stock_Number#National_Item_Identification_Number_(NIIN)), a Country Code followed by a 7-digit item identifier string | NIIN | 9 | \[0-9\]{9} | no |   
| Justification | string | descriptive text justifying request; needs further research | Justification | varies | varies | yes |   
||| __Fields from Shipments in Shipments_Cancellations__ | __fill value 'not in file'__ ||||   
| RequisitionID | string | apparently unique identifier; needs further research | Requisition ID | 14 | [A-Za-z0-9]{14} | no |   
||| __Fields from Cancellations in Shipments_Cancellations__ | __fill value 'not in file'__ ||||   
| CancelledBy | string | apparently agency that cancelled request; needs further research | Cancelled By | varies | varies | yes | 
| RTDRef | string | apparently unique identifier; needs further research | RTD Ref | 6 or 7 | [0-9]{6,7} | no |     
| ReasonCancelled | string | why request is cancelled; needs further research | Reason Cancelled | varies | varies | yes |   
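
The constructed `Item_*` fields can be illustrated with a short worked example. The sketch below splits one NSN-formatted string into the four parts described in the table; the helper name `split_nsn` is hypothetical and the NSN is used purely for illustration.

```python
def split_nsn(nsn):
    """Split an NSN such as '1005-00-073-9421' into the four constructed fields."""
    digits = nsn.replace('-', '')        # 13 characters once the hyphens are removed
    return {
        'Item_FSG': digits[:2],          # Federal Supply Group: digits 1 and 2
        'Item_FSC': digits[2:4],         # Federal Supply Class: digits 3 and 4
        'Item_CC': digits[4:6],          # country code: digits 5 and 6
        'Item_Code': digits[6:],         # item identifier: last 7 digits
    }

parts = split_nsn('1005-00-073-9421')
print(parts)
# {'Item_FSG': '10', 'Item_FSC': '05', 'Item_CC': '00', 'Item_Code': '0739421'}
```

Keeping every part as a string preserves leading zeros, which matters for values such as the `00` country code.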

In [None]:
#    Libraries used by this notebook.
import pandas as pd
import re
import sys

from pathlib import Path

#!python --version     #Python 3.8.5
# pathlib standard module
# sys standard module
#pd.__version__       #1.1.2
#re.__version__       #2.2.1

#    Forward slashes keep this relative path usable on Windows, macOS, and Linux.
sys.path.insert(0, "../../scripts/")
from checksumfunctions import get_file_info
from checksumfunctions import get_file_hash
from notebookfunctions import make_dataframe

In [None]:
#    VARIABLES THAT CAN BE CUSTOMIZED

#    Enter the path to the folder containing all the data files.
path_datafiles = "../../data/"

#    This notebook merges data from a DLA LESO Public Data file
#    that has been checked with the following notebook:
#         Check_AllStatesAndTerritories.ipynb
#    Please run that notebook before setting the 'LESOfile_all' variable.
#    
#    Enter 'LESO Property Transferred to Participating Agencies' file to be merged.
#LESOfile_all = "DISP_AllStatesAndTerritories_03312020.xlsx"
#LESOfile_all = "DISP_AllStatesAndTerritories_06302020.xlsx"
#LESOfile_all = "DISP_AllStatesAndTerritories_09302020.xlsx"
#LESOfile_all = "DISP_AllStatesAndTerritories_12312020.xlsx"
#LESOfile_all = "AllStatesAndTerritoriesQTR3FY21.xlsx"   #period ending 20210630
LESOfile_all = "AllStatesAndTerritoriesQTR4FY21.xlsx"   #period ending 20210930

#    This notebook merges data from a DLA LESO Public Data file
#    that has been checked with the following notebook:
#         Check_Shipments_Cancellations.ipynb
#    Please run that notebook before setting the 'LESOfile_shipcanc' variable.
#    
#    Enter 'LESO Information for Shipments (Transfers) and Cancellations of Property'
#    file to be merged.
#LESOfile_shipcanc = "DISP_Shipments_Cancellations_01012020_03312020.xlsx"
#LESOfile_shipcanc = "DISP_Shipments_Cancellations_04012020_06302020.xlsx"
#LESOfile_shipcanc = "DISP_Shipments_Cancellations_07012020_09302020.xlsx"
#LESOfile_shipcanc = "DISP_Shipments_Cancellations_10012020_12312020.xlsx"
#LESOfile_shipcanc = "ShipmentsCancellationsQTR3FY21.xlsx" #period ending 20210630
LESOfile_shipcanc = "ShipmentsCancellationsQTR4FY21.xlsx" #period ending 20210930

#    Enter the path to where the merged data files will be saved.
path_mergedfiles = "../../data/merged/"

#    The merged data can be split based on a column. The merged data will
#    be saved to a series of files based on values in this column.
#    By default, it uses the 'StateAbbreviation' column. It can be changed to
#    a column named 'Quarter', which is generated from 'RecordDate'.
#    If 'split_by_column' is set to None or an empty string, the notebook
#    saves all merged data to one file.
split_by_column = 'StateAbbreviation'
#split_by_column = 'Quarter'
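
As a small sketch of what the 'Quarter' option produces: the merge step later in this notebook builds the column with `pd.PeriodIndex(..., freq='Q')`, which labels calendar quarters. The LESO file names above use fiscal quarters (for example, QTR4FY21 for the period ending 20210930), so the two labels differ. The `freq='Q-SEP'` variant shown here is an assumption about how fiscal labels could be produced; it is not something this notebook does.

```python
import pandas as pd

# Two sample record dates (made up for illustration).
record_dates = pd.Series(pd.to_datetime(['2020-03-31', '2021-09-30']))

# Calendar quarters, as generated by the notebook's freq='Q'.
calendar_quarters = pd.PeriodIndex(record_dates, freq='Q').astype(str)

# Fiscal quarters for a fiscal year ending in September (assumed alternative).
fiscal_quarters = pd.PeriodIndex(record_dates, freq='Q-SEP').astype(str)

print(list(calendar_quarters))   # ['2020Q1', '2021Q3']
print(list(fiscal_quarters))     # ['2020Q2', '2021Q4']
```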

In [None]:
#    VARIABLES THAT SHOULD NOT BE CHANGED

#    The final, ordered list of columns for the merged data.
ordered_columns_list = ['File', 'Sheet', 'StateAbbreviation', 'RequestingAgency',
                        'ItemDescription', 'RecordDate', 'AcquisitionValue', 'Quantity',
                        'UnitIncrement', 'Item_FSG', 'Item_FSC', 'Item_CC',
                        'Item_Code', 'Justification', 'NSN', 'FSC', 'NIIN', 'DEMILCode',
                        'DEMILIC', 'StationType', 'RequisitionID' ,'CancelledBy',
                        'RTDRef', 'ReasonCancelled']

### PREPARE THE DATA FROM LESOfile_all

In [None]:
#    Expected columns based on columns from Check_AllStatesAndTerritories.ipynb
#trans_expected_columns = ['State', 'Station Name (LEA)',
#                          'NSN', 'Item Name', 'Quantity', 'UI', 'Acquisition Value',
#                          'DEMIL Code', 'DEMIL IC', 'Ship Date', 'Station Type']
#    Added 2021-09-01: as of the 2021-06-30 file, 'Station Name (LEA)' has been
#    renamed 'Agency Name'; all other columns are the same.
trans_expected_columns = ['State', 'Agency Name',
                          'NSN', 'Item Name', 'Quantity', 'UI', 'Acquisition Value',
                          'DEMIL Code', 'DEMIL IC', 'Ship Date', 'Station Type']
#    Dictionary mapping original columns to merged data columns.
trans_columns_dictionary = {'State':'StateAbbreviation', 'Agency Name':'RequestingAgency',
                            'NSN':'NSN', 'Item Name':'ItemDescription', 'Quantity':'Quantity',
                            'UI':'UnitIncrement', 'Acquisition Value':'AcquisitionValue',
                            'DEMIL Code':'DEMILCode', 'DEMIL IC':'DEMILIC',
                            'Ship Date':'RecordDate', 'Station Type':'StationType'}

In [None]:
#    Read the data from the XLSX file. 
excel_dict = pd.read_excel("file:" + path_datafiles + LESOfile_all, sheet_name=None)

#    Collect information about original data
total_transfer_records = sum([len(x) for x in excel_dict.values()])
state_to_sheet_dict = {a_df['State'].unique()[0]: sheet for sheet,a_df in excel_dict.items()}

#    Create one dataframe with all the data.
transfer_df = make_dataframe(excel_dict, 'Ship Date').rename(columns=trans_columns_dictionary)
excel_dict.clear()

#    Break 'NSN' into NATO Stock Number units.
transfer_df = transfer_df.assign(Item_FSG=transfer_df['NSN'].str.replace('-','').str[:2].values,
                                 Item_FSC=transfer_df['NSN'].str.replace('-','').str[2:4].values,
                                 Item_CC=transfer_df['NSN'].str.replace('-','').str[4:6].values,
                                 Item_Code=transfer_df['NSN'].str.replace('-','').str[6:].values,)

#    Fill missing columns with 'not in file' value to distinguish them from NaN/null values.
transfer_df['FSC'] = 'not in file'
transfer_df['NIIN'] = 'not in file'
transfer_df['Justification'] = 'not in file'
transfer_df['RequisitionID'] = 'not in file'
transfer_df['CancelledBy'] = 'not in file'
transfer_df['RTDRef'] = 'not in file'
transfer_df['ReasonCancelled'] = 'not in file'

#    Add 'File' column.
transfer_df['File'] = LESOfile_all

#    Add 'Sheet' column
#    Map each row's state to its sheet name so the result does not depend on row order.
transfer_df['Sheet'] = transfer_df['StateAbbreviation'].map(state_to_sheet_dict).values

#    Order the columns in preparation for merging.
transfer_df = transfer_df[ordered_columns_list]
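
The 'not in file' fill value used above can be illustrated on a toy dataframe. This is a minimal sketch with made-up values and a shortened column list; it shows how the fill value keeps "this sheet never had the column" separate from "the value is missing in the source".

```python
import numpy as np
import pandas as pd

FILL_VALUE = 'not in file'
ordered_columns = ['StateAbbreviation', 'NSN', 'DEMILIC', 'Justification']

# Toy transfer data: 'Justification' does not exist in this kind of file,
# while the second 'DEMILIC' value is genuinely missing in the source.
transfers = pd.DataFrame({'StateAbbreviation': ['AL', 'AK'],
                          'NSN': ['1005-00-073-9421', '2320-01-107-7155'],
                          'DEMILIC': [1.0, np.nan]})

# Add every column the file lacks, then apply the shared column order.
for column in ordered_columns:
    if column not in transfers.columns:
        transfers[column] = FILL_VALUE
transfers = transfers[ordered_columns]

truly_missing = int(transfers.isna().sum().sum())
absent_columns = [c for c in transfers.columns if (transfers[c] == FILL_VALUE).all()]
print(truly_missing, absent_columns)   # 1 ['Justification']
```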

In [None]:
print('Pulled the transfer data from: ')
print('%s\t%s\t%s' % get_file_info(path_datafiles + LESOfile_all))
print('MD5\t %s' % get_file_hash(path_datafiles + LESOfile_all, 'md5'))
print('SHA256\t %s' % get_file_hash(path_datafiles + LESOfile_all, 'sha256'))
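
The `get_file_hash` helper is imported from this repository's scripts folder and its implementation is not shown here. As a rough sketch of what such a helper could look like, the hypothetical function `file_hash_sketch` below reads a file in chunks and returns a hex digest using only the standard library.

```python
import hashlib
import tempfile
from pathlib import Path

def file_hash_sketch(file_path, algorithm='md5', chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    hasher = hashlib.new(algorithm)
    with open(file_path, 'rb') as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            hasher.update(chunk)
    return hasher.hexdigest()

# Demonstrate on a small temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / 'sample.bin'
    sample.write_bytes(b'hello')
    md5_digest = file_hash_sketch(sample, 'md5')
    sha256_digest = file_hash_sketch(sample, 'sha256')

print('MD5\t', md5_digest)
print('SHA256\t', sha256_digest)
```

Recording both digests next to the file name makes it possible to confirm later that a downloaded quarterly file is the same one that was merged.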

In [None]:
print('Total records pulled from %s: %s' % (LESOfile_all, str(total_transfer_records)))
print('After prepping the data, the transfer dataframe has %s columns with %s records.'
      % (transfer_df.shape[1], transfer_df.shape[0]))

### PREPARE THE SHIPMENTS DATA FROM LESOfile_shipcanc

In [None]:
#    Expected columns based on 'SHIPMENTS' columns from Check_Shipments_Cancellations.ipynb
current_sheet = 'SHIPMENTS'
ship_expected_columns = ['State', 'Station Name (LEA)', 'Requisition ID', 'FSC', 'NIIN',
                         'Item Name', 'UI', 'Quantity', 'Acquisition Value', 'Date Shipped',
                         'Justification']
#    Dictionary mapping original 'SHIPMENTS' columns to merged data columns.
ship_columns_dictionary = {'State':'StateAbbreviation', 'Station Name (LEA)':'RequestingAgency',
                           'Requisition ID':'RequisitionID', 'FSC':'FSC', 'NIIN':'NIIN',
                           'Item Name':'ItemDescription', 'UI':'UnitIncrement', 'Quantity':'Quantity',
                           'Acquisition Value':'AcquisitionValue', 'Date Shipped':'RecordDate',
                           'Justification':'Justification'}

In [None]:
#    Read the data 'SHIPMENTS' sheet of the XLSX file.
shipment_df = make_dataframe(pd.read_excel("file:" + path_datafiles + LESOfile_shipcanc,
                                            sheet_name=current_sheet), 'Date Shipped').\
                                 rename(columns=ship_columns_dictionary)

#    Collect information about original data
total_shipment_records = shipment_df.shape[0]

#    Break 'FSC' and 'NIIN' into NATO Stock Number units.
shipment_df = shipment_df.assign(Item_FSG=shipment_df['FSC'].astype(str).str[:2],
                                 Item_FSC=shipment_df['FSC'].astype(str).str[2:4],
                                 Item_CC=shipment_df['NIIN'].str[:2].values,
                                 Item_Code=shipment_df['NIIN'].str[2:].values)

#    Fill missing columns with 'not in file' value to distinguish them from NaN/null values.
shipment_df['NSN'] = 'not in file'
shipment_df['DEMILCode'] = 'not in file'
shipment_df['DEMILIC'] = 'not in file'
shipment_df['StationType'] = 'not in file'
shipment_df['CancelledBy'] = 'not in file'
shipment_df['RTDRef'] = 'not in file'
shipment_df['ReasonCancelled'] = 'not in file'

#    Add 'File' column.
shipment_df['File'] = LESOfile_shipcanc

#    Add 'Sheet' column
shipment_df['Sheet'] = current_sheet

#    Order the columns in preparation for merging.
shipment_df = shipment_df[ordered_columns_list]
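
One thing worth checking when splitting 'FSC' and 'NIIN' is whether the spreadsheet reader returned them as numbers, because a numeric NIIN loses its leading zeros. Whether `make_dataframe` already guards against this is not shown in this notebook, so the zero-padding below is an assumption. The sketch also rebuilds an NSN-style string from the two fields, which can help when comparing shipments with the transfer data.

```python
import pandas as pd

# Toy shipments data in which both codes were read as integers (values are made up).
shipments = pd.DataFrame({'FSC': [1005, 2320], 'NIIN': [739421, 11077155]})

# Restore the fixed widths: 4 digits for FSC and 9 digits for NIIN.
fsc = shipments['FSC'].astype(str).str.zfill(4)
niin = shipments['NIIN'].astype(str).str.zfill(9)

shipments = shipments.assign(Item_FSG=fsc.str[:2],
                             Item_FSC=fsc.str[2:4],
                             Item_CC=niin.str[:2],
                             Item_Code=niin.str[2:])

# Rebuild an NSN-style string: FSC, country code, then the 3-4 split of the item code.
nsn_style = fsc + '-' + niin.str[:2] + '-' + niin.str[2:5] + '-' + niin.str[5:]
print(list(nsn_style))   # ['1005-00-073-9421', '2320-01-107-7155']
```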

In [None]:
print('Pulled the shipment data from: ')
print('%s\t%s\t%s' % get_file_info(path_datafiles + LESOfile_shipcanc))
print('MD5\t %s' % get_file_hash(path_datafiles + LESOfile_shipcanc, 'md5'))
print('SHA256\t %s' % get_file_hash(path_datafiles + LESOfile_shipcanc, 'sha256'))

In [None]:
print('Total shipment records pulled from %s: %s' % (LESOfile_shipcanc, str(total_shipment_records)))
print('After prepping the data, the shipment dataframe has %s columns with %s records.' 
      % (shipment_df.shape[1], shipment_df.shape[0]))

### PREPARE THE CANCELLATIONS DATA FROM LESOfile_shipcanc

In [None]:
#    Expected columns based on 'CANCELLATIONS' columns from Check_Shipments_Cancellations.ipynb
current_sheet = 'CANCELLATIONS'
canc_expected_columns = ['Cancelled By', 'RTD Ref', 'State', 'Station Name (LEA)',
                         'FSC', 'NIIN', 'Item Name', 'UI', 'Quantity', 'Acquisition Value',
                         'Date Requested', 'Justification', 'Reason Cancelled']
#    Dictionary mapping original 'CANCELLATIONS' columns to merged data columns.
canc_columns_dictionary = {'Cancelled By':'CancelledBy', 'RTD Ref':'RTDRef', 
                           'State':'StateAbbreviation', 'Station Name (LEA)':'RequestingAgency',
                           'FSC':'FSC', 'NIIN':'NIIN', 'Item Name':'ItemDescription',
                           'UI':'UnitIncrement', 'Quantity':'Quantity', 'Acquisition Value':'AcquisitionValue',
                           'Date Requested':'RecordDate', 'Justification':'Justification',
                           'Reason Cancelled':'ReasonCancelled'}

In [None]:
#    Read the data 'CANCELLATIONS' sheet of the XLSX file.
cancellation_df = make_dataframe(pd.read_excel("file:" + path_datafiles + LESOfile_shipcanc,
                                                sheet_name=current_sheet), 'Date Requested').\
                                     rename(columns=canc_columns_dictionary)

#    Collect information about original data
total_cancellation_records = cancellation_df.shape[0]

#    Break 'FSC' and 'NIIN' into NATO Stock Number units.
cancellation_df = cancellation_df.assign(Item_FSG=cancellation_df['FSC'].astype(str).str[:2],
                                         Item_FSC=cancellation_df['FSC'].astype(str).str[2:4],
                                         Item_CC=cancellation_df['NIIN'].str[:2].values,
                                         Item_Code=cancellation_df['NIIN'].str[2:].values)

#    Fill missing columns with 'not in file' value to distinguish them from NaN/null values.
cancellation_df['NSN'] = 'not in file'
cancellation_df['DEMILCode'] = 'not in file'
cancellation_df['DEMILIC'] = 'not in file'
cancellation_df['StationType'] = 'not in file'
cancellation_df['RequisitionID'] = 'not in file'

#    Add 'File' column.
cancellation_df['File'] = LESOfile_shipcanc

#    Add 'Sheet' column
cancellation_df['Sheet'] = current_sheet

#    Order the columns in preparation for merging.
cancellation_df = cancellation_df[ordered_columns_list]

In [None]:
print('Pulled the cancellation data from: ')
print('%s\t%s\t%s' % get_file_info(path_datafiles + LESOfile_shipcanc))
print('MD5\t %s' % get_file_hash(path_datafiles + LESOfile_shipcanc, 'md5'))
print('SHA256\t %s' % get_file_hash(path_datafiles + LESOfile_shipcanc, 'sha256'))

In [None]:
print('Total cancellation records pulled from %s: %s' % (LESOfile_shipcanc, str(total_cancellation_records)))
print('After prepping the data, the cancellation dataframe has %s columns with %s records.' 
      % (cancellation_df.shape[1], cancellation_df.shape[0]))

### MERGE THE DATA

In [None]:
# unzip military_equipment_distributions_to_law_enforcement_agencies_us.zip in data folder
# Merge all dataframes if the columns match. Stop with an error otherwise, so
# later cells do not fail because 'all_data_df' was never created.
for name, a_df in [('shipments', shipment_df), ('cancellations', cancellation_df)]:
    if list(transfer_df.columns) != list(a_df.columns):
        raise ValueError('Columns in transfer dataframe do not match columns in %s dataframe.' % name)

all_data_df = pd.concat([transfer_df, shipment_df, cancellation_df], axis=0)

In [None]:
print('The merged data has', all_data_df.shape[1],
      'columns with', all_data_df.shape[0], 'records.')
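
Before writing the merged data, it may be worth checking a few columns against the expected patterns listed in the field table. The sketch below is one way to do that with `Series.str.fullmatch`, available in pandas 1.1 and later; the helper name `find_pattern_violations` and the sample rows are hypothetical.

```python
import pandas as pd

# A subset of the expected patterns from the field table.
expected_patterns = {
    'StateAbbreviation': r'[A-Z]{2}',
    'Item_FSG': r'[0-9]{2}',
    'Item_CC': r'[0-9]{2}',
}

# Toy merged data; the second row breaks two of the patterns on purpose.
merged = pd.DataFrame({'StateAbbreviation': ['AL', 'Ak'],
                       'Item_FSG': ['10', '23'],
                       'Item_CC': ['00', '1']})

def find_pattern_violations(df, patterns):
    """Return {column: [index labels]} for values that do not fully match the pattern."""
    violations = {}
    for column, pattern in patterns.items():
        matches = df[column].astype(str).str.fullmatch(pattern)
        bad_rows = df.index[~matches].tolist()
        if bad_rows:
            violations[column] = bad_rows
    return violations

violations = find_pattern_violations(merged, expected_patterns)
print(violations)   # {'StateAbbreviation': [1], 'Item_CC': [1]}
```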

In [None]:
#    Write or append merged data to the TSV file(s).

def append_to_tsv(a_df, a_file):
    #    Append to the file; write the header row only when the file is new.
    a_df.to_csv(a_file, header=not a_file.exists(), index=False, mode='a',
                columns=ordered_columns_list, sep='\t', escapechar="\\")

if split_by_column:
    if split_by_column == 'Quarter':
        all_data_df['Quarter'] = pd.PeriodIndex(all_data_df.RecordDate, freq='Q')

    for i in all_data_df[split_by_column].unique():
        append_to_tsv(all_data_df[all_data_df[split_by_column] == i],
                      Path(path_mergedfiles + str(i) + '_leso.tsv'))
else:
    append_to_tsv(all_data_df, Path(path_mergedfiles + 'all_leso.tsv'))

In [None]:
print('The merged data file(s) have been saved in the folder', path_mergedfiles)
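
When the saved files are read back, pandas infers numeric types by default, which turns codes such as `00` into `0`. A minimal sketch of reading a merged file as text is shown below; the file name and values are made up, and a temporary folder stands in for the merged data folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    tsv_path = Path(tmp_dir) / 'AL_leso.tsv'

    # Write a one-row sample the same way the notebook writes its TSV files.
    pd.DataFrame({'StateAbbreviation': ['AL'], 'Item_FSG': ['10'],
                  'Item_CC': ['00'], 'Item_Code': ['0739421'],
                  'DEMILIC': [None]}).to_csv(tsv_path, sep='\t', index=False,
                                             escapechar='\\')

    # Default read: code columns are inferred as integers and lose leading zeros.
    naive = pd.read_csv(tsv_path, sep='\t')

    # Reading every column as text keeps the codes exactly as they were written.
    as_text = pd.read_csv(tsv_path, sep='\t', dtype=str, escapechar='\\')

print(naive['Item_Code'].iloc[0], as_text['Item_Code'].iloc[0])
```

Empty cells are still read as missing values in both cases, so the distinction between a null and the 'not in file' fill value survives the round trip.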

In [None]:
readme_file = Path(path_mergedfiles + '/about/README.txt')
with open(readme_file, 'a') as a_file:
    a_file.write('\n%s\t\t%s' % (get_file_hash(path_datafiles + LESOfile_all, 'md5'), LESOfile_all))
    a_file.write('\n%s\t\t%s' % (get_file_hash(path_datafiles + LESOfile_shipcanc, 'md5'), LESOfile_shipcanc))