### Dummy Variable / Flag Generation

This script takes the latest version of the SQLite database, or whatever version you specify, and generates dummy variables based on the criteria you provide in the indicated cell below.

Note that I wrote and run this and other scripts on Windows 10 running in Boot Camp on my Mac, because macOS can introduce issues when dealing with Arabic. There is nothing in this script that bumps up against that problem, however, so you should be able to run it on macOS. Where you may run into trouble is if you write SQL in the notebook that includes Arabic, or if you copy/paste or otherwise insert Arabic into this notebook. There are a few short Arabic strings in the notebook, but as long as you don't edit them, they should not trigger the problem described above.

Please let Clay know if you have questions or run into problems.

Again, I recommend [DB Browser for SQLite](http://sqlitebrowser.org/) for directly navigating the SQLite files. Note that some of these cells will lock the DB and/or require exclusive access to it. Hence, if you open the DB in something other than this notebook, do so when no cell is running in the notebook, and close it before you run another cell. That way you avoid collisions between apps trying to lock the database.
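
One way to check whether another application currently holds a lock before you run a cell is to try to take an exclusive lock yourself with a short timeout. The sketch below uses Python's built-in `sqlite3` module rather than `dataset`; the helper name `is_locked` and the 0.1 second timeout are illustrative choices, not part of this notebook.

```python
import sqlite3


def is_locked(db_path, timeout=0.1):
    """Return True if an exclusive lock on db_path cannot be obtained right now."""
    # isolation_level=None puts the connection in autocommit mode so that
    # we control the transaction explicitly with BEGIN / ROLLBACK.
    conn = sqlite3.connect(db_path, timeout=timeout, isolation_level=None)
    try:
        conn.execute("BEGIN EXCLUSIVE")  # fails if another connection holds a lock
        conn.execute("ROLLBACK")         # release the lock again immediately
        return False
    except sqlite3.OperationalError:
        # Usually "database is locked"; other operational errors also land here.
        return True
    finally:
        conn.close()
```

Calling `is_locked("db_with_flags.sqlite")` before a long-running cell gives a quick indication of whether DB Browser or another tool still has the file open in a transaction. Note that `sqlite3.connect` creates the file if it does not exist yet.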

In [1]:
# For copying files and working with file directories
import os
import shutil

# Regular expressions
# You can use these for pattern matching if you're so inclined
import re

# Connect to a SQLite database in a lazy manner.
import dataset

# Export database table to CSV
import csv

In [2]:
# What do you want to name the database you create after running the script?
# This will delete that if it exists and then create a new copy of the baseline
# database and make alterations to it. This is a method of ensuring that the
# original database is not mistakenly overwritten

# Name format: whatever_name_you_want.sqlite
original_db_name = "sams_data_phase17.sqlite"
new_db_name = "db_with_flags.sqlite"

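try:
    os.remove(new_db_name)
    print("Removed", new_db_name)
except FileNotFoundError:
    # Nothing to remove on a first run
    pass

try:
    # Work on a copy so that the original database is preserved
    # in case there is a problem and it has to be restored
    shutil.copy2(original_db_name, new_db_name)
    print("Created", new_db_name, "from template:", original_db_name)
except OSError as err:
    # Don't continue silently: without the copy, the cells below
    # would run against a missing or empty database
    print("Could not create", new_db_name, "from", original_db_name, "-", err)
    raise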

# All operations will be on the new database, not the original source one

Removed db_with_flags.sqlite
Created db_with_flags.sqlite from template: sams_data_phase17.sqlite


In [3]:
# Create a connection to the database
db = dataset.connect("sqlite:///" + new_db_name)

In [4]:
# Get a reference to the arabic_values table
tab_arabic_values = db['arabic_values']

In the event that we previously generated flags, we want to get rid of them before we generate new ones, so that the new flag set isn't polluted by previously set values that aren't explicitly overwritten. Hence, drop the previously derived tables, if they exist, and drop the flag columns from `arabic_values`. If you are working with the source db I sent you a few weeks ago, that means we want to drop the `full_raw_flags` table and the `full_raw_flags_reduced` table and then the `flag_` columns from the `arabic_values` table.
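
The column-dropping step can be sketched in isolation as follows. This is a minimal illustration using Python's built-in `sqlite3` module on a throwaway in-memory database with a made-up four-column `arabic_values` table; the notebook itself does the same thing through `dataset`. Unlike the notebook, the sketch matches column names with `startswith("flag_")` rather than a substring test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical miniature version of arabic_values with two flag columns
conn.execute(
    "CREATE TABLE arabic_values "
    "(id INTEGER, arabic TEXT, flag_head INTEGER, flag_burn INTEGER)"
)
conn.execute("INSERT INTO arabic_values VALUES (1, 'x', 1, NULL)")

# PRAGMA table_info returns one row per column; the name is in position 1
columns = [row[1] for row in conn.execute("PRAGMA table_info(arabic_values)")]
keep = [c for c in columns if not c.startswith("flag_")]

# Copy the columns we want to keep, drop the old table, rename the new one
quoted = ", ".join('"' + c + '"' for c in keep)
conn.execute("CREATE TABLE new_arabic_values AS SELECT " + quoted + " FROM arabic_values")
conn.execute("DROP TABLE arabic_values")
conn.execute("ALTER TABLE new_arabic_values RENAME TO arabic_values")

remaining = [row[1] for row in conn.execute("PRAGMA table_info(arabic_values)")]
rows = conn.execute("SELECT * FROM arabic_values").fetchall()
```

One caveat with this pattern: `CREATE TABLE ... AS SELECT` copies the data but not primary key constraints or indexes, so any that existed on the original table would need to be recreated.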

In [5]:
try:
    db['full_raw_flags'].drop()
    print("Dropped full_raw_flags")
except:
    pass

try:
    db['full_raw_flags_reduced'].drop()
    print("Dropped full_raw_flags_reduced")
except:
    pass

# SQLite versions before 3.35 cannot drop columns, so we create a new table
# without the flag columns and then rename it.
preserve_fields = [k for k in tab_arabic_values.find_one().keys() if 'flag_' not in k]

# We don't use result but assigning it skips printing some garbage below
result = db.query("""
CREATE TABLE new_arabic_values AS 
    SELECT """ + ",".join(preserve_fields) + """ 
    FROM arabic_values;
""")

# Drop the original arabic_values table
tab_arabic_values.drop()

# Rename new_arabic_values to arabic_values & now we have a table with no flag columns
result = db.query("""
ALTER TABLE new_arabic_values RENAME TO arabic_values;
""")

Dropped full_raw_flags
Dropped full_raw_flags_reduced


In [6]:
# Now because we futzed with the arabic_values table, we have to create a new reference to the database
# and to our arabic_values table. The db object stores some schema information that isn't updated with
# our direct query calls above.

del db
del tab_arabic_values

db = dataset.connect("sqlite:///" + new_db_name)
tab_arabic_values = db['arabic_values']

In [7]:
# Now create an in-memory representation of the arabic_values table
# and store it in variable `data`
data = [x for x in tab_arabic_values]

In [8]:
# It's always good to look at what you have in lists, etc... to guarantee it's what you expect.
data[:2]

[OrderedDict([('id', 1),
              ('arabic', 'قبول عابر'),
              ('google_translate', 'Transient admission'),
              ('human_translate', 'monitoring'),
              ('normalized', None),
              ('appears_in',
               "['acceptance_pattern', 'diagnosis', 'treatment']")]),
 OrderedDict([('id', 2),
              ('arabic', 'حواضن'),
              ('google_translate', 'Cushions'),
              ('human_translate', 'incubators'),
              ('normalized', None),
              ('appears_in',
               "['acceptance_pattern', 'analysis_type', 'diagnosis', 'treatment']")])]

### Flag Generation

The cell below contains the data structures that have to be updated to generate the flags.

In [9]:
# Update this if you want to change what flags you are making on the dataset.
# The logic for creating them is in the following cell.

# Flag a record if it contains the term; the flag is named after the term
flag_terms = [
    "injury",
    "blunt",
    "trauma",
    "shrapnel",
    "traffic",
    "explosive",
    "blast",
    "gunshot",
    "stab",
    "wound",
    "upper extremity",
    "lower extremity",
    "neck",
    "chest",
    "back",
    "spinal",
    "neurologic",
    "nerve",
    "vascular",
    "orthopedic",
    "fracture",
    "suspected",
    "follow-up",
    "complication",
    "history of"
]

# require all terms - not in use at the moment
multiple_flag_terms = [
#     ("burn","fracture")
]

# require any of the terms but name the flag after the first
synonym_flag_terms = [
    ("facial","face"),
    ("pelvic","pelvis"),
    ("head","eye","ear","face","brain","scalp","mouth","nose"),
    ("spine","spinal"),
    ("abdomen","abdominal")
]

# require the first term and the absence of the remaining terms
# name the flag after the first term.
complex_flag_terms = [
    ("urologic","neurologic"),
    ("burn","heartburn"),
    ("extremity","lower extremity","upper extremity")
]

The `arabic_values` table has one record for each unique Arabic string in the raw data that was imported. Since we can map these back to their source in the raw data, we'll generate flags against these strings and then map the flags back by joining on the arabic strings in this table.
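
To make the matching rules concrete before running them on the whole table, here is a small self-contained sketch of the simple, synonym, and complex rules applied to a single translated string. The function names `flags_for` and `has_word` are made up for this illustration; the logic mirrors the substring tests used in the next cell but is not the cell itself.

```python
import re


def flags_for(text, flag_terms, synonym_terms, complex_terms):
    """Return the sorted flag names that a translated string would receive."""
    text = text.lower()
    flags = set()

    # Simple terms: flag if the term appears anywhere in the text
    for term in flag_terms:
        if term in text:
            flags.add("flag_" + "_".join(term.replace("-", "_").split()))

    # Synonyms: any term in the group matches; the flag is named after the first
    for group in synonym_terms:
        if any(t in text for t in group):
            flags.add("flag_" + group[0])

    # Complex: first term must appear and none of the remaining terms may appear
    for group in complex_terms:
        if group[0] in text and not any(t in text for t in group[1:]):
            flags.add("flag_" + group[0].replace(" ", "_").replace("-", "_"))

    return sorted(flags)


def has_word(term, text):
    """Whole-word alternative to the plain substring test (an optional refinement)."""
    return re.search(r"\b" + re.escape(term) + r"\b", text) is not None


example_1 = flags_for("Neurologic follow-up", ["neurologic", "follow-up"], [],
                      [("urologic", "neurologic")])
# -> ['flag_follow_up', 'flag_neurologic']

example_2 = flags_for("Burn of the face", [], [("facial", "face")],
                      [("burn", "heartburn")])
# -> ['flag_burn', 'flag_facial']

example_3 = flags_for("heart failure", [], [("head", "eye", "ear")], [])
# -> ['flag_head'], because "ear" is a substring of "heart"
```

The third example shows a side effect of substring matching worth keeping in mind when reviewing the flags: short terms such as "ear" also match inside longer words. If that turns out to matter for your analysis, a word-boundary test like `has_word` is one possible refinement.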

In [10]:
# Store the rows we change here
# so that we can update the table

update_data = []

# Iterate through the in-memory representation
for rec in data:
    # Create a placeholder update record
    update_rec = {'id':rec['id']}
    
    # A flag we'll use to determine whether the record needs to be updated
    update_record = False
    
    # Get the human_translate value from the record
    ht = rec['human_translate']
    
    # If it is not None, then convert it to lowercase
    if ht:
        ht = ht.lower()
    
    # Get the google_translate value and convert it to lowercase
    # We are not currently using this to generate flags so it is commented out
    # but you could substitute it in below or write additional code if you want
    # to use it for flag generation
    # gt = rec['google_translate'].lower()
    
    # Walk through the different flag types from above and check whether the 
    # human_translate value matches for that flag. If so, create the update record
    # for that flag and then mark our update boolean indicator true so that we know
    # to update the appropriate record in the database. All records that will be updated
    # have their update record put into the update_data list.
    if ht:
        for term in flag_terms:
            if term in ht:
                update_rec["flag_" + "_".join(term.replace("-","_").split())] = 1
                update_record = True
                
        for tup in multiple_flag_terms:
            if all(x in ht for x in tup):
                update_rec["flag_" + "_and_".join(tup)] = 1
                update_record = True
                
        for tup in synonym_flag_terms:
            if any(x in ht for x in tup):
                update_rec["flag_" + tup[0]] = 1
                update_record = True
                
        for tup in complex_flag_terms:
            if tup[0] in ht and not any(x in ht for x in tup[1:]):
                update_rec["flag_" + tup[0].replace(" ","_").replace("-","_")] = 1
                update_record = True
            
        # Handle war-related separately. This very likely can be improved upon
        if 'war-related injury' in ht and 'not war-related injury' not in ht:
            update_rec['flag_conflict_related'] = 1
            update_record = True
        
    # If we created any flags, update_record is true so put this record in the list 
    # of records to update.
    if update_record:
        update_data.append(update_rec)

In [11]:
# How many records are we going to update in the arabic_values table?
len(update_data)

33431

In [12]:
# What do the update records look like? 
update_data[:10]

[{'flag_abdomen': 1, 'id': 606},
 {'flag_head': 1, 'id': 613},
 {'flag_conflict_related': 1, 'flag_injury': 1, 'id': 614},
 {'flag_suspected': 1, 'id': 619},
 {'flag_head': 1, 'id': 628},
 {'flag_chest': 1, 'id': 636},
 {'flag_head': 1, 'id': 642},
 {'flag_suspected': 1, 'id': 643},
 {'flag_suspected': 1, 'id': 665},
 {'flag_suspected': 1, 'id': 667}]

In [13]:
# Update the arabic_values table with the update_records' data
# 1. Create the columns we need
# 2. Bulk update for each column

flag_cols = set()
for rec in update_data:
    for k in rec.keys():
        if k != 'id':
            flag_cols.add(k)
flag_cols = sorted(list(flag_cols))

# The trick here is to get the id from a record in arabic_values and update that
# record with a None value for each of these flags - that will cause dataset to generate the columns
ref_rec = tab_arabic_values.find_one()
ref_rec_update = {'id':ref_rec['id']}
for col in flag_cols:
    ref_rec_update[col] = None
tab_arabic_values.update(ref_rec_update, ['id'])

# At this point maybe open DB Browser for SQLite to make sure the columns were created.
# The 1 that prints below is the number of records updated.

1

In [14]:
# These are the flags generated from the code above.
flag_cols

['flag_abdomen',
 'flag_back',
 'flag_blast',
 'flag_blunt',
 'flag_burn',
 'flag_chest',
 'flag_complication',
 'flag_conflict_related',
 'flag_explosive',
 'flag_extremity',
 'flag_facial',
 'flag_follow_up',
 'flag_fracture',
 'flag_gunshot',
 'flag_head',
 'flag_history_of',
 'flag_injury',
 'flag_lower_extremity',
 'flag_neck',
 'flag_nerve',
 'flag_neurologic',
 'flag_orthopedic',
 'flag_pelvic',
 'flag_shrapnel',
 'flag_spinal',
 'flag_spine',
 'flag_stab',
 'flag_suspected',
 'flag_traffic',
 'flag_trauma',
 'flag_upper_extremity',
 'flag_urologic',
 'flag_vascular',
 'flag_wound']

In [15]:
# Now iterate through the flag cols and create a list of each record that needs to set the value for each
# flag column and then bulk update. It is orders of magnitude faster to do it this way than one by one.

# Note - this is generating and executing some super gnarly long SQL queries with tons of ID numbers

for col in flag_cols:
    recs_to_update = []
    for rec in update_data:
        if col in rec.keys():
            recs_to_update.append(rec['id'])
    recs_to_update = sorted(recs_to_update)

    db.query("""
    UPDATE arabic_values
    SET """ + col + """ = 1 
    WHERE id IN (""" + ",".join([str(a) for a in recs_to_update]) +""");
    """)
    
# After this runs, check the database again to make sure the flags were properly applied.

Ok, that's done so now we want to apply the flags back to the raw data.

1. Pseudo-update the first record to trigger the addition of the flag fields
2. Compare each value in each column to an in-memory arabic_values lookup
3. Apply relevant flags to the record in question
4. Buffer and update the raw table when the buffer is full.
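
Steps 2 through 4 can be sketched without a database, using plain lists and dicts in place of the tables. The function name `build_flag_rows` and the `flush` callback are hypothetical stand-ins for the loop and the `insert_many` call used further down; this is an outline of the idea rather than the notebook's actual cell.

```python
def build_flag_rows(raw_rows, lookup, flag_cols, columns, buffer_size, flush):
    """Look up each column value, collect its flags, and flush rows in batches."""
    buffer = []
    for rec in raw_rows:
        # One output row per raw record, with every flag initialized to None
        row = {"id": rec["id"]}
        row.update({flag: None for flag in flag_cols})

        for col in columns:
            match = lookup.get(rec.get(col))  # None if the value is unknown
            if match is None:
                continue
            for flag in flag_cols:
                if match.get(flag):
                    row[flag] = match[flag]

        buffer.append(row)
        if len(buffer) >= buffer_size:
            flush(list(buffer))  # hand over a copy, then reuse the buffer
            buffer.clear()

    # Flush whatever is left after the last full batch
    if buffer:
        flush(list(buffer))


lookup = {
    "a": {"flag_head": 1, "flag_burn": None},
    "b": {"flag_head": None, "flag_burn": 1},
}
raw_rows = [
    {"id": 1, "diagnosis": "a", "treatment": "b"},
    {"id": 2, "diagnosis": "zzz", "treatment": None},
    {"id": 3, "diagnosis": "b", "treatment": None},
]
batches = []
build_flag_rows(raw_rows, lookup, ["flag_burn", "flag_head"],
                ["diagnosis", "treatment"], 2, batches.append)
```

With a buffer size of 2, the three sample records are flushed as one batch of two rows and one batch of one row. Record 1 picks up both flags because its two columns each match a different lookup entry.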

In [16]:
# Get a new db connection again in case the schema has changed.
# This probably isn't necessary but is a safety measure.

del db
del tab_arabic_values

db = dataset.connect("sqlite:///" + new_db_name)
tab_arabic_values = db['arabic_values']

In [17]:
# Get a reference to the raw Arabic data table
tab_raw_ar = db['full_raw_scrubbed']

In [18]:
# Get the list of variables used in full_raw_scrubbed and full_raw_english
rec_raw = tab_raw_ar.find_one()
variables = list(rec_raw.keys())
print(",".join(variables))

# Due to previous work, there are flag columns in the full_raw_scrubbed table, but we will ignore them
# because they aren't used in this flag-generation methodology. 

id,a_file_id,a_files_sheets_id,a_sheet_id,acceptance_pattern,analysis,analysis_request,analysis_type,anesthesia_type,assign_method,case,category,center,clinic,clinical_case,col_1,col_2,col_3,col_4,col_5,col_6,col_misc,col_moawak,col_none,col_null,col_to,conflict_related,consultations,daily_number,data_validation,date,date_first_exam,death,death_cause,death_certificate,death_date,death_location,death_time,department,diagnosis,diagnosis_confirmed,discharge,discharge_date,discharge_status,discharged_to,disclaimers,disease,displaced,displacement_duration,dose,drug_class,er,events,exam_type,examination_1,examination_2,examination_3,examination_4,examination_type,facility_type,housing,housing_persons_number,image,image_request,image_type,import_status,info_age,info_card_number,info_card_type,info_care_type,info_geo_address,info_geo_area,info_geo_community,info_geo_country_of_origin,info_geo_district,info_geo_governorate,info_geo_injury_city,info_geo_injury_site,info_geo_injury_state,info_geo

In [19]:
# Create the in-memory arabic_values lookup
# This time, since we created the flags, they'll be in the records

arabic_lookup = {}
arabic_values = [x for x in tab_arabic_values.find()]

for v in arabic_values:
    arabic_lookup[v['arabic']] = v

In [20]:
list(arabic_lookup.keys())[:10]

['قبول عابر',
 'حواضن',
 'استشفاء',
 'عناية',
 'اختبار الزمرة الدموية',
 'اختبار كريات الدم البيضاء, اختبار حمض البول, اختبار سرعة النزف, اختبار زمن التخثر, اختبار الزمرة الدموية',
 'اختبار تحليل البول (بول و رواسب)',
 'اختبار خضاب الدم / الهيموجلوبين, اختبار حمض البول, اختبار كرياتينين, اختبار سرعة النزف, اختبار زمن التخثر',
 'اختبار تحليل البول (بول و رواسب), اختبار كريات الدم البيضاء',
 'اختبار الزمرة الدموية, اختبار خضاب الدم / الهيموجلوبين']

In [21]:
# Let's test that a value we pull out of the database has a hit in the lookup table.
test_rec = tab_raw_ar.find_one()
diagnosis = test_rec['diagnosis']
print(diagnosis)
print("---------------------- Lookup result below")
print(arabic_lookup[diagnosis])

التهاب مجاري تنفسية سفلى, التهاب قصبات
---------------------- Lookup result below
OrderedDict([('id', 3863), ('arabic', 'التهاب مجاري تنفسية سفلى, التهاب قصبات'), ('google_translate', 'Inflammation of lower respiratory tracts, bronchitis'), ('human_translate', 'bronchiolitis, bronchitis'), ('normalized', None), ('appears_in', "['diagnosis']"), ('flag_abdomen', None), ('flag_back', None), ('flag_blast', None), ('flag_blunt', None), ('flag_burn', None), ('flag_chest', None), ('flag_complication', None), ('flag_conflict_related', None), ('flag_explosive', None), ('flag_extremity', None), ('flag_facial', None), ('flag_follow_up', None), ('flag_fracture', None), ('flag_gunshot', None), ('flag_head', None), ('flag_history_of', None), ('flag_injury', None), ('flag_lower_extremity', None), ('flag_neck', None), ('flag_nerve', None), ('flag_neurologic', None), ('flag_orthopedic', None), ('flag_pelvic', None), ('flag_shrapnel', None), ('flag_spinal', None), ('flag_spine', None), ('flag_stab', Non

**The following cell will take a while to run because it has to iterate through all raw records and examine every field in each one. Depending on how many flags you create, it might take 30 minutes to execute.**

The idea is to create a parallel record in the flag table for every record in the raw data table. The id numbers between the two tables will be parallel and some other metadata is included in the flag table. Because these flag tables will have far fewer columns and most columns will only have 1 or NULL as values, they are much faster to query.
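
As an illustration of the kind of query the flag table is meant for, the sketch below counts flagged records per file. It uses the built-in `sqlite3` module and a made-up four-row table with only two flag columns; the column names follow the flag table described above, but the data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE full_raw_flags "
    "(id INTEGER, file_id INTEGER, flag_head INTEGER, flag_burn INTEGER)"
)
conn.executemany(
    "INSERT INTO full_raw_flags VALUES (?, ?, ?, ?)",
    [(1, 10, 1, None), (2, 10, None, None), (3, 10, 1, 1), (4, 11, None, 1)],
)

# SUM skips NULLs; COALESCE turns an all-NULL group into 0 instead of NULL
counts = conn.execute("""
SELECT file_id,
       COUNT(*) AS n_records,
       COALESCE(SUM(flag_head), 0) AS n_head,
       COALESCE(SUM(flag_burn), 0) AS n_burn
FROM full_raw_flags
GROUP BY file_id
ORDER BY file_id;
""").fetchall()
```

Because the flags are stored as 1 or NULL, summing a flag column gives the number of flagged records directly, with no join against the raw data table.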

In [22]:
# The insert_many method inserts in chunks of 1000, but this specifies that we don't want
# to start the process until we have this many records to insert.
buffer_size = 50000

flags_to_insert = []

try:
    db['full_raw_flags'].drop()
except:
    pass

tab_raw_flags = db['full_raw_flags']

# Iterate through the raw records one by one
for rec in tab_raw_ar.find():
    
    # Include foreign keys that allow us to query against the flag table instead of 
    # joining with the raw data table, which is slow.
    flag_record = {
        'id':rec['id'],
        'file_id':rec['a_file_id'],
        'files_sheets_id':rec['a_files_sheets_id'],
        'sheet_id':rec['a_sheet_id']
    }
    
    # Initialize each flag_record
    for flag in flag_cols:
        flag_record[flag] = None
        
    # Scan the conflict related column for values, but do this before looking at the
    # corresponding Arabic values so that we don't overwrite the Arabic value setting.
    if rec['conflict_related'] is not None:
        if rec['conflict_related'].strip() == 'كبرى' or rec['conflict_related'].strip() =='كبرى':
            flag_record['flag_conflict_related'] = 1
        elif rec['conflict_related'].strip() == 'لا':
            flag_record['flag_conflict_related'] = 0
        else:
            flag_record['flag_conflict_related'] = None
    else:
        flag_record['flag_conflict_related'] = None
        
    # Loop through the variables for each raw data record
    for v in variables:
        # These are obfuscated PII cols, or the flag columns we're ignoring, so skip them
        if 'info_' in v or 'flag_' in v or v == 'id':
            continue
        
        # Get the value in the column
        to_lookup = rec[v]
        
        if to_lookup is None or to_lookup == '.':
            continue
        else:
            
            # We have a legit value, so look it up and grab the flags
            try:
                # There might be a keyerror on the info_ columns' hashed values, etc.
                # I also manually removed some PII from arabic_values, so that might
                # cause an occasional mismatch.
                arabic_values_rec = arabic_lookup[to_lookup]
                for flag in flag_cols:
                    # Should be None if not flagged, so just check for existence
                    if arabic_values_rec[flag]:
                        flag_record[flag] = arabic_values_rec[flag]
            except:
                pass

    # Store the record
    flags_to_insert.append(flag_record)

    # Check if we need to insert
    if len(flags_to_insert) > buffer_size:
        tab_raw_flags.insert_many(flags_to_insert)
        
        # Clear the buffer
        flags_to_insert.clear()
        
# We've been through all raw records so make sure the buffer is clear
tab_raw_flags.insert_many(flags_to_insert)
flags_to_insert.clear()

### Reduced variable checking for alternative flag set

Only run the code in the next two cells if you also want to create a "reduced" flag set that is derived from only looking at the specified columns.

In [23]:
# For the reduced flag dataset we'll generate, only examine these columns
reduced_variables = ["diagnosis"]

In [25]:
buffer_size = 50000

flags_to_insert_reduced = []

try:
    db['full_raw_flags_reduced'].drop()
except:
    pass

tab_raw_flags_reduced = db['full_raw_flags_reduced']

# Iterate through the raw records one by one
for rec in tab_raw_ar.find():
    
    # Same as above but for reduced search fields
    flag_record_reduced = {
        'id':rec['id'],
        'file_id':rec['a_file_id'],
        'files_sheets_id':rec['a_files_sheets_id'],
        'sheet_id':rec['a_sheet_id']
    }
    
    # Initialize each flag_record
    for flag in flag_cols:
        flag_record_reduced[flag] = None
        
    # Scan the conflict related column for values, but do this before looking at the
    # corresponding Arabic values so that we don't overwrite the Arabic value setting.
    if rec['conflict_related'] is not None:
        if rec['conflict_related'].strip() == 'كبرى' or rec['conflict_related'].strip() =='كبرى':
            flag_record_reduced['flag_conflict_related'] = 1
        elif rec['conflict_related'].strip() == 'لا':
            flag_record_reduced['flag_conflict_related'] = 0
        else:
            flag_record_reduced['flag_conflict_related'] = None
    else:
        flag_record_reduced['flag_conflict_related'] = None
        
    # Loop through the variables for each raw data record in the reduced variable set
    for v in reduced_variables:
        # These are PII cols, so skip them
        if 'info_' in v or v == 'id':
            continue
        
        # Get the value in the column
        to_lookup = rec[v]
        
        if to_lookup is None or to_lookup == '.':
            continue
        else:
            
            # We have a legit value, so look it up and grab the flags
            try:
                # There might be a keyerror on the info_ columns' hashed values, etc.
                arabic_values_rec = arabic_lookup[to_lookup]
                for flag in flag_cols:
                    if arabic_values_rec[flag]:
                        flag_record_reduced[flag] = arabic_values_rec[flag]
            except:
                pass
    
    # Store the record
    flags_to_insert_reduced.append(flag_record_reduced)
    
    # Check if we need to insert
    if len(flags_to_insert_reduced) > buffer_size:
        tab_raw_flags_reduced.insert_many(flags_to_insert_reduced)
        
        # Clear the buffer
        flags_to_insert_reduced.clear()
        
# We've been through all raw records so make sure the buffer is inserted
tab_raw_flags_reduced.insert_many(flags_to_insert_reduced)
flags_to_insert_reduced.clear()

Now you need to get into the SQLite database and start querying to generate the datasets that you want. Below is an example of how to run the query through Python and easily dump the results to CSV files for analysis elsewhere.

### Export full flags table, full_raw_flags, to CSV

This is a big dataset and might take a few minutes. The resulting CSV file will be ~160 MB.

You should be able to pull this into python, R, Tableau, etc... for analysis. It probably has too many records to open in Excel.

In [30]:
# You can change this query to export a different set of data
result = db.query("""
SELECT files.id as files_id,
       files.year,
       files.month,
       facilities.id AS facility_id,
       facilities.facility_parent_id,
       facilities.facilityname,
       facilities.country,
       facilities.governorate,
       facilities.district,
       facilities.subdistrict,
       facilities.facility_type,
       full_raw_flags.flag_abdomen,
       full_raw_flags.flag_back,
       full_raw_flags.flag_blast,
       full_raw_flags.flag_blunt,
       full_raw_flags.flag_burn,
       full_raw_flags.flag_chest,
       full_raw_flags.flag_complication,
       full_raw_flags.flag_conflict_related,
       full_raw_flags.flag_explosive,
       full_raw_flags.flag_extremity,
       full_raw_flags.flag_facial,
       full_raw_flags.flag_follow_up,
       full_raw_flags.flag_fracture,
       full_raw_flags.flag_gunshot,
       full_raw_flags.flag_head,
       full_raw_flags.flag_history_of,
       full_raw_flags.flag_injury,
       full_raw_flags.flag_lower_extremity,
       full_raw_flags.flag_neck,
       full_raw_flags.flag_nerve,
       full_raw_flags.flag_neurologic,
       full_raw_flags.flag_orthopedic,
       full_raw_flags.flag_pelvic,
       full_raw_flags.flag_shrapnel,
       full_raw_flags.flag_spinal,
       full_raw_flags.flag_spine,
       full_raw_flags.flag_stab,
       full_raw_flags.flag_suspected,
       full_raw_flags.flag_traffic,
       full_raw_flags.flag_trauma,
       full_raw_flags.flag_upper_extremity,
       full_raw_flags.flag_urologic,
       full_raw_flags.flag_vascular,
       full_raw_flags.flag_wound

FROM full_raw_flags
JOIN files on files.id = full_raw_flags.file_id
JOIN facilities on files.facility_id = facilities.id

WHERE files.facility_id IS NOT NULL 
AND files.month IS NOT NULL
AND files.skipped = 0
AND files.ignore = 0;
""")
dataset.freeze(result, format='csv', filename='full_raw_flags.csv')
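
In case `dataset.freeze` is unavailable in your installed version of `dataset`, the same kind of export can be sketched with the standard library alone. The helper name `export_query_to_csv` is made up for this example, and the demo runs against a tiny in-memory table rather than the real database.

```python
import csv
import io
import sqlite3


def export_query_to_csv(conn, sql, out_file):
    """Run a query and write the header plus all rows to an open text file."""
    cursor = conn.execute(sql)
    writer = csv.writer(out_file)
    # cursor.description holds one tuple per result column; the name comes first
    writer.writerow([col[0] for col in cursor.description])
    n_rows = 0
    for row in cursor:
        writer.writerow(row)  # NULL values are written as empty fields
        n_rows += 1
    return n_rows


# Demo on a throwaway in-memory database, writing to a string buffer
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE full_raw_flags (id INTEGER, flag_head INTEGER)")
conn.executemany("INSERT INTO full_raw_flags VALUES (?, ?)", [(1, 1), (2, None)])

buf = io.StringIO()
n_written = export_query_to_csv(
    conn, "SELECT id, flag_head FROM full_raw_flags ORDER BY id", buf
)
```

To write to disk instead of a buffer, open the target with `open(filename, "w", newline="", encoding="utf-8")` and pass the file object as `out_file`. Rows are streamed from the cursor one at a time, so the full result set does not have to fit in memory.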

### Export reduced flags table, full_raw_flags_reduced, to CSV

This is a big dataset and might take a few minutes. The resulting CSV file will be ~160 MB.

You should be able to pull this into python, R, Tableau, etc... for analysis. It probably has too many records to open in Excel.

In [31]:
# You can change this query to export a different set of data
# Keep in mind these flags were generated from the subset of raw data columns
# listed above
result = db.query("""
SELECT files.id as files_id,
       files.year,
       files.month,
       facilities.id AS facility_id,
       facilities.facility_parent_id,
       facilities.facilityname,
       facilities.country,
       facilities.governorate,
       facilities.district,
       facilities.subdistrict,
       facilities.facility_type,
       full_raw_flags_reduced.flag_abdomen,
       full_raw_flags_reduced.flag_back,
       full_raw_flags_reduced.flag_blast,
       full_raw_flags_reduced.flag_blunt,
       full_raw_flags_reduced.flag_burn,
       full_raw_flags_reduced.flag_chest,
       full_raw_flags_reduced.flag_complication,
       full_raw_flags_reduced.flag_conflict_related,
       full_raw_flags_reduced.flag_explosive,
       full_raw_flags_reduced.flag_extremity,
       full_raw_flags_reduced.flag_facial,
       full_raw_flags_reduced.flag_follow_up,
       full_raw_flags_reduced.flag_fracture,
       full_raw_flags_reduced.flag_gunshot,
       full_raw_flags_reduced.flag_head,
       full_raw_flags_reduced.flag_history_of,
       full_raw_flags_reduced.flag_injury,
       full_raw_flags_reduced.flag_lower_extremity,
       full_raw_flags_reduced.flag_neck,
       full_raw_flags_reduced.flag_nerve,
       full_raw_flags_reduced.flag_neurologic,
       full_raw_flags_reduced.flag_orthopedic,
       full_raw_flags_reduced.flag_pelvic,
       full_raw_flags_reduced.flag_shrapnel,
       full_raw_flags_reduced.flag_spinal,
       full_raw_flags_reduced.flag_spine,
       full_raw_flags_reduced.flag_stab,
       full_raw_flags_reduced.flag_suspected,
       full_raw_flags_reduced.flag_traffic,
       full_raw_flags_reduced.flag_trauma,
       full_raw_flags_reduced.flag_upper_extremity,
       full_raw_flags_reduced.flag_urologic,
       full_raw_flags_reduced.flag_vascular,
       full_raw_flags_reduced.flag_wound

FROM full_raw_flags_reduced
JOIN files on files.id = full_raw_flags_reduced.file_id
JOIN facilities on files.facility_id = facilities.id

WHERE files.facility_id IS NOT NULL 
AND files.month IS NOT NULL
AND files.skipped = 0
AND files.ignore = 0;
""")
dataset.freeze(result, format='csv', filename='full_raw_flags_reduced.csv')

### Back up the derived SQLite database

In [32]:
# This is optional and will generate a copy of the database that will be gigabytes in size.

shutil.copy2(new_db_name,'backup_' + new_db_name)

'backup_db_with_flags.sqlite'