The purpose of this script is to perform an initial load of all Excel data provided by SAMS into a SQLite database for consolidation, cleaning, and translation. 

In [1]:
# Manipulate the file system
import os
import shutil

# Convert stored string representation of a list to a list
import ast

# Recurse through a directory tree and return file names with glob
import glob

# Decode and re-encode mangled Arabic file names
import codecs

# Connect to a SQLite database in a lazy manner.
import dataset

# Enables opening and reading of Excel files
import openpyxl

# Translating variables, sheet names, and workbook names from Arabic
# This is NOT free to use.
from google.cloud import translate

# Set the environment variable for the Google Service Account
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'C:\\Users\\clay\\Documents\\fxb-lcs-2b24f4f8a73a.json'

This script is meant to be run from top to bottom, as a way to reproduce the data. The next cell deletes the existing database so that it can be recreated. Do not run the next cell if you do not want to delete the database. The script takes a long time to run!

In [2]:
try:
    # Try to preserve a copy before deleting
    shutil.copy2("sams_data.sqlite","sams_data.sqlite.bak")
    print("Backed up to sams_data.sqlite.bak")
except FileNotFoundError:
    # No existing database to back up
    pass

try:
    os.remove("sams_data.sqlite")
    print("Removed sams_data.sqlite")
except FileNotFoundError:
    # Nothing to remove
    pass

Connect to the database and create references to the tables we'll use to import file names and create links to tab names. The idea is that tabs with the same or similar names likely contain data that should be consolidated into the same table.

In [3]:
db = dataset.connect("sqlite:///sams_data.sqlite")
tab_files = db['files']

## Create database references for the files

Get the list of Excel files and print out how many there are to process.

In [4]:
file_list = glob.glob("data/**/*.xls*",recursive=True)
len(file_list)

1138
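
As a quick check on what that pattern picks up, the sketch below builds a throwaway directory tree and tallies the matches by extension. The folder and file names are made up for illustration. Note that `*.xls*` matches legacy `.xls` files as well as `.xlsx` and `.xlsm`, and openpyxl does not read the legacy `.xls` format, so such files would be expected to fail when they are opened.

```python
import glob
import os
import tempfile
from collections import Counter


def count_by_extension(root):
    """Tally files matched by the recursive *.xls* pattern, keyed by extension."""
    pattern = os.path.join(root, "**", "*.xls*")
    matches = glob.glob(pattern, recursive=True)
    return Counter(os.path.splitext(m)[1].lower() for m in matches)


# Hypothetical layout: <root>/<country>/<year>/<workbook>
with tempfile.TemporaryDirectory() as root:
    folder = os.path.join(root, "CountryA", "2016")
    os.makedirs(folder)
    for name in ("a.xlsx", "b.xls", "c.xlsm", "notes.txt"):
        open(os.path.join(folder, name), "w").close()
    ext_counts = count_by_extension(root)

print(dict(ext_counts))
```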

Iterate through the file names, loading the files with `openpyxl` and creating records for them. Use the Google Translate API to translate Arabic file names.

Some of the file names were decoded with the wrong code page, which garbles the Arabic text. Names containing telltale characters are re-encoded as `cp437` and decoded as `cp720`, and the corrected value is stored in the 'ungarbled' column of the files table. Solved with help from [the question I asked on Stack Overflow](https://stackoverflow.com/questions/45625029/how-do-i-convert-file-names-in-iso8859-6-to-utf-8/45628279#45628279).
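
The repair can be illustrated in isolation. The sketch below simulates the damage by writing a short Arabic word as `cp720` bytes and reading it back as `cp437`, then reverses the mistake. The helper name `ungarble` and the sample word are illustrative only. A name that is already correct Arabic cannot be encoded as `cp437` and would raise `UnicodeEncodeError`, which is one reason to apply the fix only to names that look garbled.

```python
def ungarble(name):
    """Reinterpret text that was mis-decoded as cp437 using the Arabic cp720 code page."""
    return name.encode("cp437").decode("cp720")


# Simulate the damage: Arabic text stored as cp720 bytes but read as cp437.
original = "ملف"  # Arabic for "file"
garbled = original.encode("cp720").decode("cp437")

repaired = ungarble(garbled)
print(repr(garbled), "->", repaired)
```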


Run this command in the appropriate Python environment before using the API:

`gcloud auth application-default login`

In [5]:
translate_client = translate.Client()
target_lang = 'en'

In [6]:
for f in file_list:
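    try:
        wb = openpyxl.load_workbook(f,read_only=True)
        problem = False
        sheets = wb.sheetnames
        num_sheets = len(sheets)
        # Read-only workbooks keep the file handle open until closed
        wb.close()
    except Exception:
        # openpyxl cannot open legacy .xls files or corrupt workbooks
        print("Unable to load",f)
        problem = True
        sheets = []
        num_sheets = 0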
        
    path = f
    filename = f.split("\\")[-1]
    country = f.split("\\")[1]
    year = f.split("\\")[2]
    skipped = False
    ignore = False
    
    ungarbled = None
    translation = None
    
    if ('Ω' in filename or
        'π' in filename or
        'ó' in filename or
        'Θ' in filename):
        ungarbled = filename.encode('cp437').decode('cp720')
        translation = translate_client.translate(ungarbled,
                                                 target_language=target_lang)
        translation = translation['translatedText']
    
    file_rec = {
        "file_name":filename,
        "ungarbled":ungarbled,
        "translation":translation,
        "path":path,
        "country":country,
        "year":year,
        "num_sheets":num_sheets,
        "sheet_names":str(sheets),
        "info":"",
        "problem_opening":problem,
        "skipped":skipped,
        "ignore":ignore
    }
    
    # Insert the file record into the database
    tab_files.insert(file_rec)
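
The loop above pulls the country and year out of the path by splitting on backslashes, which assumes Windows separators and what appears to be a `data\<country>\<year>\...` folder layout. As a possible alternative, the sketch below does the same split with `pathlib.PureWindowsPath`, which also accepts forward slashes. The function name and the example path are hypothetical.

```python
from pathlib import PureWindowsPath


def parse_data_path(path):
    """Split a data/<country>/<year>/.../<file> path into the fields used above."""
    parts = PureWindowsPath(path).parts
    return {
        "file_name": parts[-1],
        "country": parts[1],
        "year": parts[2],
    }


# Hypothetical path in the layout the loop appears to assume
info = parse_data_path("data\\CountryA\\2016\\clinic_report.xlsx")
print(info)
```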

----

## Create db references for the sheets

This is needed because the worksheet names likely consolidate down to fewer schemas than their current naming suggests. The English names will be used to generate table names.


Considerations:

- Sheet names need to be translated
- Sheets with the same name should share the same schema
- Variables for a sheet schema should be attached to sheet names
- Aggregate sheets are more difficult to import. Handle them later.

In [17]:
# Most of these are reference or aggregate sheets that
# will be handled later

sheet_names_to_skip = [
    "TOTAL",
    "Name",
    "Code",
    "Sheet",
    "Monthly",
    "Injured Info"
]

In [18]:
tab_sheets = db['sheets']
tab_files_sheets_join = db['files_sheets']

Create a set of the unique sheet names so that we can translate them and mark some of them to be ignored.

In [19]:
sheet_set = set()

for rec in tab_files.find():
    sheets = ast.literal_eval(rec['sheet_names'])
    for s in sheets:
        sheet_set.add(s)
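
The step above relies on the sheet names having been stored with `str(list)` and recovered with `ast.literal_eval`. A minimal sketch of that round trip, using made-up rows shaped like the files table:

```python
import ast

# Hypothetical rows shaped like the 'files' table: sheet names stored via str(list)
rows = [
    {"sheet_names": str(["Sheet1", "TOTAL"])},
    {"sheet_names": str(["Sheet1", "بيانات"])},
    {"sheet_names": str([])},  # a workbook that failed to open
]

unique_sheets = set()
for row in rows:
    # literal_eval parses Python literals only; it does not run arbitrary code
    unique_sheets.update(ast.literal_eval(row["sheet_names"]))

print(sorted(unique_sheets))
```

Storing the list with `json.dumps` and reading it with `json.loads` would be a more portable option, but that is a different choice from the one made in this notebook.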

Now create records for each unique sheet name and insert them into the sheets table in the database.

In [21]:
for sheet in sheet_set:
    rec = {"name":sheet}
    translation = translate_client.translate(sheet,
                                             target_language=target_lang)
    translation = translation['translatedText']
    rec["translation"] = translation
    rec["normalized"] = ""
    skip = False

    if any(skipname in sheet for skipname in sheet_names_to_skip):
        skip = True
    rec["skip"] = skip
    tab_sheets.insert(rec)
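
The skip rule above is a case-sensitive substring match against the original sheet name, not its translation. The sketch below isolates that check so its behavior is easier to see. The function name and the example sheet names are illustrative.

```python
sheet_names_to_skip = ["TOTAL", "Name", "Code", "Sheet", "Monthly", "Injured Info"]


def should_skip(sheet, skip_names=sheet_names_to_skip):
    """Return True when any skip name occurs anywhere inside the sheet name."""
    return any(skip_name in sheet for skip_name in skip_names)


# Hypothetical sheet names to show how the match behaves
examples = ["Sheet1", "Monthly TOTAL", "total", "Admissions"]
results = {name: should_skip(name) for name in examples}
print(results)
```

Because the match is case-sensitive, a sheet called `total` is not skipped, while default names such as `Sheet1` are.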

Next, join the files to the sheets in a join table for future reference. That table will be processed later to provide additional metadata about whether a particular sheet from a file was processed or imported.

Create an in-memory reference for the sheets. While this could be done in an earlier step, it's being done later in the event that the join records need to be recreated with additional metadata.

In [22]:
sheet_ref = {}
for rec in tab_sheets.find():
    sheet_ref[rec['name']] = rec['id']

Delete the existing join records.

In [25]:
tab_files_sheets_join.delete()

True

Now rebuild the join records.

In [26]:
join_records = []

for rec in tab_files.find():
    for sheet in ast.literal_eval(rec['sheet_names']):
        sheet_id = sheet_ref[sheet]
        join_rec = {
            "file_id":rec['id'],
            "sheet_id":sheet_id,
            "header_start":None,
            "header_end":None,
            "header_values":None
        }
        join_records.append(join_rec)

# Bulk inserts are faster than individual inserts
tab_files_sheets_join.insert_many(join_records)
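
To make the shape of the join records concrete, here is a small sketch that runs the same logic on plain dictionaries instead of database tables. The ids and sheet names are invented, and `build_join_records` is a hypothetical helper, not part of the notebook.

```python
import ast


def build_join_records(file_records, sheet_ref):
    """Create one join record per (file, sheet) pair."""
    join_records = []
    for rec in file_records:
        for sheet in ast.literal_eval(rec["sheet_names"]):
            join_records.append({
                "file_id": rec["id"],
                "sheet_id": sheet_ref[sheet],
                "header_start": None,
                "header_end": None,
                "header_values": None,
            })
    return join_records


# Hypothetical stand-ins for rows from the files and sheets tables
file_records = [
    {"id": 1, "sheet_names": "['Sheet1', 'TOTAL']"},
    {"id": 2, "sheet_names": "['Sheet1']"},
    {"id": 3, "sheet_names": "[]"},  # a workbook that failed to open
]
sheet_ref = {"Sheet1": 10, "TOTAL": 11}

join_records = build_join_records(file_records, sheet_ref)
print([(r["file_id"], r["sheet_id"]) for r in join_records])
```

A file with no readable sheets contributes no join records, so it appears only in the files table.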

## End of this notebook

To minimize the chance of accidentally deleting data and requiring additional calls to the Google Translate API, this process is going to be cut into a series of notebooks, each of which works with a different database as its primary starting point.


### Copy the database

In [29]:
shutil.copy2("sams_data.sqlite","sams_data_phase02_template.sqlite")

'sams_data_phase02_template.sqlite'