The raw Arabic data (from the processed files) is in the template database. The purpose of this notebook is to take the raw data, scrub it of PII, consolidate it into a polymorphic table for translation, and then send the remaining Arabic data through Google Translate to produce a rough translation of the Arabic.

Note that the translation process is expensive and should be performed only once, on a reduced set of data. Therefore, this notebook should not be rerun unless it is necessary.
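
One way to keep the translation cost down is to translate each distinct string only once and reuse the result. The sketch below is only an illustration of that idea: `translate_unique`, `translate_fn`, and `fake_translate` are hypothetical names, and the stand-in translator does not call Google Translate.

```python
from typing import Callable, Dict, Iterable


def translate_unique(
    strings: Iterable[str],
    translate_fn: Callable[[str], str],
    cache: Dict[str, str],
) -> Dict[str, str]:
    """Translate each distinct string at most once, reusing cached results."""
    for text in strings:
        # Skip empty values and anything already translated
        if text is None or text == "" or text in cache:
            continue
        cache[text] = translate_fn(text)
    return cache


# Stand-in translator for illustration only (assumption: the real
# call would go through the paid translation client instead).
calls = []


def fake_translate(text: str) -> str:
    calls.append(text)
    return f"<{text}>"


cache = {}
translate_unique(["اسم", "عمر", "اسم"], fake_translate, cache)
translate_unique(["عمر", "جنس"], fake_translate, cache)

# Five strings were submitted, but only the distinct ones were translated
print(len(calls))
```

For the cache to survive a rerun it would have to be persisted somewhere, for example in a table of the database; that part is not shown here.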

Also, the raw data may need to be reimported after the variable reduction is corrected. In that case, the template that is loaded into Notebook 4 would need the normalized values in its variables table adjusted to produce a saner consolidation of the data. The risk is that collapsing to too few columns will inevitably merge disparate data points into the same blob of data. One reason to do this would be to fix a prior consolidation that is incorrect because some fields in the data have values entered in only a few records.
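
To make the collapsing risk concrete, the sketch below lists which normalized names receive more than one raw variable. The layout of the variables table is not shown in this notebook, so the mapping here is an entirely hypothetical example and `find_collapsed_columns` is an illustrative helper, not part of the pipeline.

```python
from collections import defaultdict


def find_collapsed_columns(variable_map):
    """Return normalized names that more than one raw variable maps to."""
    groups = defaultdict(list)
    for raw_name, normalized_name in variable_map.items():
        groups[normalized_name].append(raw_name)
    # Keep only the normalized names fed by several raw variables
    return {norm: raws for norm, raws in groups.items() if len(raws) > 1}


# Hypothetical raw-variable -> normalized-name mapping (assumption)
example_map = {
    "patient_name": "info_name",
    "name_of_patient": "info_name",
    "surgeon": "info_name_surgeon",
    "notes_1": "notes",
    "diagnosis_notes": "notes",
}

collapsed = find_collapsed_columns(example_map)
for normalized_name, raw_names in collapsed.items():
    print(normalized_name, "<-", raw_names)
```

Reviewing a listing like this before reimporting would show whether two raw variables that land in the same column really describe the same thing.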

In [1]:
# Manipulate the file system
import os
import shutil

# Copy dictionaries
import copy

# work with dates
import datetime
import arrow

# For scrubbing PII
import hashlib

# Convert stored string representation of a list to a list
import ast

# Recurse through a directory tree and return file names with glob
import glob

# Decode and re-encode mangled Arabic file names
import codecs

# Connect to a SQLite database in a lazy manner.
import sqlalchemy
import dataset

# Enables opening and reading of Excel files
import openpyxl

# Translating variables, sheet names, and workbook names from Arabic
# This is NOT free to use.
from google.cloud import translate

# Set the environment variable for the Google Service Account
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'C:\\Users\\clay\\Documents\\fxb-lcs-2b24f4f8a73a.json'

In [5]:
#If there's an existing db for this sheet, delete it
#so that we can copy from the template for a fresh start

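try:
    os.remove("sams_data_phase05.sqlite")
    print("Removed template clone sams_data_phase05.sqlite")
except FileNotFoundError:
    # Nothing to remove on a first run
    pass

# Work on a clone so the template is preserved in case there is
# a problem and the database has to be restored.
# Any error is allowed to propagate: if the copy failed silently,
# dataset.connect() would create a new, empty database instead.
shutil.copy2("sams_data_phase05_template.sqlite","sams_data_phase05.sqlite")

print("Created database from template: sams_data_phase05.sqlite")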

Removed template clone sams_data_phase05.sqlite
Created database from template: sams_data_phase05.sqlite


In [6]:
db = dataset.connect("sqlite:///sams_data_phase05.sqlite")

### Scrub the PII from the data

Store the scrubbed data in a new table. The full HIPAA requirements for scrubbing still need to be looked into.

Fields that might need to be scrubbed begin with `info_`. For now, only the name fields and the phone/Skype field are scrubbed.
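
As a minimal sketch of the scrubbing approach, each value can be replaced with a salted SHA-256 digest. The helper name `scrub_value` and the demo salts are assumptions made for illustration, and coercing the value with `str()` is an added safeguard rather than something the source specifies.

```python
import hashlib


def scrub_value(value, salt: bytes):
    """Replace a PII value with the hex digest of value + salt."""
    # Leave missing values and '.' entries as they are
    if value is None or value == ".":
        return value
    h = hashlib.sha256()
    h.update(str(value).encode())  # str() is an assumption for non-string values
    h.update(salt)
    return h.hexdigest()


# Demo salts for illustration only (assumption); never reuse these
demo_salt = b"example-salt-not-for-production"

a = scrub_value("Jane Doe", demo_salt)
b = scrub_value("Jane Doe", demo_salt)
c = scrub_value("Jane Doe", b"a-different-salt")

print(a == b)  # same value and salt give the same digest
print(a == c)  # a different salt gives a different digest
```

Because the digest is deterministic for a given salt, records that shared a name before scrubbing still share a value afterwards, so they can still be grouped or joined. The protection depends on the salt staying secret, especially for low-variety values such as phone numbers.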

In [7]:
tab_raw = db['full_raw_data']
tab_raw_scrubbed = db['full_raw_scrubbed']

In [19]:
buffer = []
buffer_size = 10000

# Do not save this value in a source code repository!
salt = 'REDACTED'.encode()

# Note, may need to include residence fields
# depending on their level of specificity
# info_residence
# info_residence_of_relative
# info_residence_original

fields = [
    "info_name",
    "info_name_author",
    "info_name_caregiver",
    "info_name_facility",
    "info_name_group",
    "info_name_of_coach",
    "info_name_processor",
    "info_name_surgeon",
    "info_phone_skype"
]

for rec in tab_raw.find():
    for pii_field in fields:
        value = rec[pii_field]
        # Skip empty values and '.' entries
        if value is None or value == '.':
            continue

        # Hash the value in the field, followed by the salt.
        # str() guards against values stored as numbers.
        h = hashlib.sha256()
        h.update(str(value).encode())
        h.update(salt)
        rec[pii_field] = h.hexdigest()

    buffer.append(rec)

    # Flush to the scrubbed table once the buffer is full
    if len(buffer) >= buffer_size:
        tab_raw_scrubbed.insert_many(buffer)
        buffer.clear()

# Catch the remaining records
if buffer:
    tab_raw_scrubbed.insert_many(buffer)
    buffer.clear()
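
One possible way to keep the salt out of the source code is to read it from an environment variable. This is only a sketch: the variable name `PII_SALT` and the helper `load_salt` are hypothetical and are not part of this notebook.

```python
import os


def load_salt(env_var: str = "PII_SALT") -> bytes:
    """Read the hashing salt from the environment instead of the source."""
    value = os.environ.get(env_var)
    if not value:
        # Fail loudly rather than hash with an empty salt
        raise RuntimeError(f"Set the {env_var} environment variable before scrubbing")
    return value.encode()


# Demo only (assumption): in practice the variable would be set
# outside the notebook, not assigned in a cell.
os.environ["PII_SALT"] = "demo-only-salt"
salt = load_salt()
```

The same salt has to be supplied on every run, otherwise the digests for the same value will not match between runs.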

In [20]:
# Drop the non-scrubbed raw data
tab_raw.drop()

True

In [21]:
# Copy the database over as the template for the next file.
# This Notebook did not include manual editing of the data.

# Do not rerun this cell!
# shutil.copy2('sams_data_phase05.sqlite','sams_data_phase06_template.sqlite')

'sams_data_phase06_template.sqlite'