### Dummy Variable / Flag Generation

This script takes the latest version of the SQLite database, or whatever version you specify, and generates dummy variables based on the criteria you provide in the indicated cell below.

Note that I wrote and run this and other scripts in Windows 10 that is running in Bootcamp on my Mac because MacOS can introduce issues when dealing with Arabic. There in nothing in this script that bumps up against that problem, however, so you should be able to run it in MacOS. Where you may run into trouble is if you try to write SQL in the notebook that includes Arabic or you try to copy/paste or otherwise insert Arabic into this notebook. There are a few short Arabic strings in the notebook, but as long as you don't edit them, they should not run into the problem described above.

Please let Clay know if you have questions or run into problems.

Again, I recommend [DB Browser for SQLite](http://sqlitebrowser.org/) for directly navigating the SQLite files. Note that some of these cells will lock the DB and/or require exclusive access to the DB. Hence, if you open the DB in something other than this notebook, do so when there isn't a cell running in the notebook and close it before you run another cell from the notebook. that way you can avoid collisions between apps trying to lock the database.

In [1]:
# For copying files and working with file directories
import os
import shutil

# Regular expressions
# You can use these for pattern matching if you're so inclined
import re

# Connect to a SQLite database in a lazy manner.
import dataset

# This used to be a part of dataset but was extracted to its own library
# https://github.com/pudo/datafreeze
from datafreeze import freeze

# Export database table to CSV
import csv

In [2]:
# What do you want to name the database you create after running the script?
# This will delete that if it exists and then create a new copy of the baseline
# database and make alterations to it. This is a method of ensuring that the
# original database is not mistakenly overwritten

# Name format: whatever_name_you_want.sqlite
original_db_name = "sams_data_phase20_template.sqlite"
new_db_name = "sams_data_phase20.sqlite"

try:
    os.remove(new_db_name)
    print("Removed", new_db_name)
except:
    pass

try:
    # Try to preserve a copy in case there is a problem and it has to be restored
    shutil.copy2(original_db_name, new_db_name)
    
    print("Created", new_db_name,"from template:", original_db_name)
except:
    pass

# All operations will be on the new database, not the original source one

Removed sams_data_phase20.sqlite
Created sams_data_phase20.sqlite from template: sams_data_phase20_template.sqlite


In [3]:
# Create a connection to the database
db = dataset.connect("sqlite:///" + new_db_name)

In [4]:
# Get a reference to the arabic_values table
tab_arabic_values = db['arabic_values']

In the event that we previously generated flags, we want to get rid of them before we generate new ones to make sure that we aren't polluting the new flag set with values previously set that aren't explicitly overwritten. Hence, drop the tables derived previously, if they exist and drop the flag columns from `arabic_values`. If you are working with the source db I sent you a few weeks ago, that means we want to drop the `full_raw_flags` table and the `full_raw_flags_reduced` table and then the `flag_` columns from the `arabic_values` table.

In [5]:
try:
    db['full_raw_flags'].drop()
    print("Dropped full_raw_flags")
except:
    pass

try:
    db['full_raw_flags_reduced'].drop()
    print("Dropped full_raw_flags_reduced").drop()
except:
    pass

# One SQLite limitation is you cannot drop columns, so you have to create a new table and then rename it.
preserve_fields = [k for k in tab_arabic_values.find_one().keys() if 'flag_' not in k]

# We don't use result but assigning it skips printing some garbage below
result = db.query("""
CREATE TABLE new_arabic_values AS 
    SELECT """ + ",".join(preserve_fields) + """ 
    FROM arabic_values;
""")

# Drop the original arabic_values table
tab_arabic_values.drop()

# Rename new_arabic_values to arabic_values & now we have a table with no flag columns
result = db.query("""
ALTER TABLE new_arabic_values RENAME TO arabic_values;
""")

Dropped full_raw_flags
Dropped full_raw_flags_reduced


In [6]:
# Now because we futzed with the arabic_values table, we have to create a new reference to the database
# and to our arabic_values table. The db object stores some schema information that isn't updated with
# our direct query calls above.

del db
del tab_arabic_values

db = dataset.connect("sqlite:///" + new_db_name)
tab_arabic_values = db['arabic_values']

In [7]:
# Now create an in-memory representation of the arabic_values table
# and store it in variable `data`
data = [x for x in tab_arabic_values]

In [8]:
# It's always good to look at what you have in lists, etc... to guarantee it's what you expect.
data[:2]

[OrderedDict([('id', 1),
              ('arabic', 'قبول عابر'),
              ('google_translate', 'Transient admission'),
              ('human_translate', 'monitoring'),
              ('normalized', None),
              ('appears_in',
               "['acceptance_pattern', 'diagnosis', 'treatment']"),
              ('google_translate_feb', 'Transient admission'),
              ('google_tokens_joined', 'transient admission'),
              ('orig_value', None)]),
 OrderedDict([('id', 2),
              ('arabic', 'حواضن'),
              ('google_translate', 'Cushions'),
              ('human_translate', 'incubators'),
              ('normalized', None),
              ('appears_in',
               "['acceptance_pattern', 'analysis_type', 'diagnosis', 'treatment']"),
              ('google_translate_feb', 'Cushions'),
              ('google_tokens_joined', 'cushions'),
              ('orig_value', None)])]

### Flag Generation

The cell below contains the data structures that have to be updated to generate the flags.

In [9]:
# Update this if you want to change what flags you are making on the dataset.
# The logic for creating them is in the following cell.

# Require and flag term
flag_terms = [
    "blunt",
    "explosive",
    "blast",
    "stab",
    "upper extremity",
    "lower extremity",
    "neck",
    "chest",
    "back",
    "spinal",
    "neurologic",
    "nerve",
    "vascular",
    "orthopedic",
    "fracture",
    "suspected",
    "follow-up",
    "complication",
    "history of",
    "traffic accident"
]

# require all terms - not in use at the moment
multiple_flag_terms = [
#     ("burn","fracture")
]

# require any of the terms but name the flag after the first
synonym_flag_terms = [
    ("allergy", "allergic"),
    ("anemia", "thalassemia"),
    ("cancer", "bcc", "leukemia", "lymphoma", "malignancy", "malignant", "scc"),
    ("cardiovascular"," asd "," vsd ","cholesterol","hypercholesterolemia","hyperlipidemia","hypertriglyceridemia","triglycerides","blood pressure"," bp ","high blood pressure","hypertension","acute coronary syndrome","angina","arrhythmia","atrial fibrillation"," avr ","cardiac ischemia","chest pain","clot","clotting","coronary atery","coronary heart disease","coronary ischemia","dvt","endocarditis","heart attack","heart disease","heart failure","heart valve","hf","hypotension","ihd"," mi ","mitral valve prolapse","mvr","myocardial hypoperfusion","myocardial infarction","palpitations","pericarditis","pulmonary embolism","pvd","svt","thromboembolism","thrombophlebitis","thrombosis","vasculitis"),
    ("congenital", "asd", "vsd"),
    ("dehydration", "dehydration", "hypovolemic shock"),
    ("dental complaint", "dental", "gingivitis", " gum ", "odonitis", "teeth", "tooth", "toothache"),
    ("derm", "acne","alopecia","blisters","cellulitis","dermatitis","dermoid","dry skin","eczema","folliculitis","hair loss","inflammatory papules","intertrigo","itch","lice","pruritis","psoriasis","rash","ringworm","scabies","skin disease","skin disorder","skin eruption","skin infection","skin lesion","tinea","warts"),
    ("diabetes","diabetic","DKA","glucose","hyperglycemia","hypoglycemia","sugar"),
    ("endocrine","hyperthyroid","hyperthyroidism","hypocalcemia","hypothyroid","hypothyroidism","parathyroid","thyroid"," TSH "),
    ("infection","conjunctivitis","eye discharge","eye infection","keratoconjunctivitis","ophthalmic infection"),
    ("pain", "corneal inflammation", "eye sensitivity", "keratitis", "pain in the eye"),
    ("fatigue", "exhaustion", "tired", "tiredness"),
    ("fever", "hyperthermia", "temperature"),
    ("constipation", "intestinal stasis"),
    ("shrapnel", "fragments","sliver","splinter"),
    ("musculoskeletal pain","ankylosing spondylitis","arthralgia","Arthritis","back pain","bruise","bruising","chondritis","contusion","costochondritis","disc herniation","disc herniation","discitis","elbow pain","extremity pain","gout","inflammation of the shoulder","joint","knee degeneration","knee inflammation","knee pain","loin pain","low back pain","lumbar pain","musclar pain","Muscle spasm","muscular pain","myalgia","myositis","neck pain","osteoarthritis","osteomyelitis","osteomylitis","plantar fasciitis","polyarthritis","rheumatism","sacroiliitis","spine degeneration","sprain","strain","synovitis","tendinitis","tendonitis","tendonopathy","tmj"),
    ("headache", "head pain"),
    ("stroke","cerebral accident","cerebral hemorrhage","cerebral infarction","cerebral ischemia","cerebrovascular accident"," cva "),
    ("gunshot", " shot "),
    
    # Prior flags, preserved
    ("facial","face"),
    ("pelvic","pelvis"),
    ("head","eye","ear","face","brain","scalp","mouth","nose"),
    ("spine","spinal"),
    ("abdomen","abdominal")
]

# require the first term and the absence of the remaining terms
# name the flag after the first term.
complex_flag_terms = [
    ("urologic","neurologic"),
    ("burn","heartburn"),
    ("trauma", "psychological trauma")
]

# Look for any of the terms in terms_to_find but only apply if terms in terms_to_avoid are absent.
# Check human or google translation (ht, gt)

complex_set_flag_terms = [
    {
        "flag_name": "hyperlipidemia",
        "terms_to_find": ["blood pressure", "bp", "high blood pressure", "hypertension"],
        "terms_to_avoid": ["hypotension"],
        "check": ["ht","gt"]
    },
    {
        "flag_name": "ENT",
        "terms_to_find": ["adenoiditis","ear congestion","ear discharge","ear infection","ear inflammation","eustachian tube infection","mucositis","mumps","nasal congestion","nose congestion","otitis","otorrhea","pharyngitis","throat ache","tonsillitis","tonsils enlargement","cerumen impaction","dysphagia","earache","epistaxis","hearing impairment","hearing loss","nasal obstruction","pain in the ear","pharyngeal pain","pharynx pain","swallowing pain","vestibulitis"],
        "terms_to_avoid": [],
        "check": ["ht","gt"]
    },
    {
        "flag_name": "infection",
        "terms_to_find": ["adenoiditis","ear congestion","ear discharge","ear infection","ear inflammation","eustachian tube infection","laryngitis","mucositis","mumps","nasal congestion","nose congestion","otitis","otorrhea","pharyngitis","rhinitis","sinusitis","throat ache","tonsillitis","tonsils enlargement"],
        "terms_to_avoid": [],
        "check": ["ht","gt"]
    },
    {
        "flag_name": "eye",
        "terms_to_find": ["conjunctivitis","eye discharge","eye infection","keratoconjunctivitis","ophthalmic infection","corneal inflammation","eye sensitivity","keratitis","pain in the eye","blepharitis","cataract","eye redness","eyelid","eye-redness","glaucoma","left eye","my eye","npdr","pterygium","pupil","redness of the eye","retinal","retinopathy","right eye","swelling of the eye","uveitis","vision"],
        "terms_to_avoid": [],
        "check": ["ht","gt"]
    },
    {
        "flag_name": "gi_complaint",
        "terms_to_find": ["abdominal injury","apendicitis","appendicitis","belly pain","bile duct obstruction","bile stones","cholecystitis","colic","colitis","colon spasm","Crohn","duodenal ulcer","enteritis","epigastric pain","flank pain","gallbladder inflammation","gastric pain","gastric ulcer","gastritis","gastroenteritis","gastrointestinal infection","hiatal hernia","ibd","ibs","indigestion","inflammation of the stomach","intestinal pain","intestinal ulcer","pain in the stomach","pancreatitis","peptic ulcer","peritoneal inflammation","peritonitis","sore stomach","stomach hurts","stomach pain","Digestive bleed","Gastric bleeding","Gastric hemorrhage","Gastrointestinal bleeding","hemorrhoids","Ulcer of the colon","Constipation","intestinal stasis","diarrhea","dysentery","food poisoning","giardia","typhoid","cirrhosis","hapatitis","hep a","hep b","hep c","hepatic","jaundice","nausea","vomiting","vomitting","anal fissure","bloating","celiac disease","esophageal reflux","gastroesophageal reflux","gerd","heartburn","inguinal hernia","malabsorption","umbilical fistula","umbilical hernia"],
        "terms_to_avoid": ["renal colic"],
        "check": ["ht","gt"]
    },
    {
        "flag_name": "abdominal_pain",
        "terms_to_find": ["abdominal injury","apendicitis","appendicitis","belly pain","bile duct obstruction","bile stones","cholecystitis","colic","colitis","colon spasm","Crohn","duodenal ulcer","enteritis","epigastric pain","flank pain","gallbladder inflammation","gastric pain","gastric ulcer","gastritis","gastroenteritis","gastrointestinal infection","hiatal hernia","ibd","ibs","indigestion","inflammation of the stomach","intestinal pain","intestinal ulcer","pain in the stomach","pancreatitis","peptic ulcer","peritoneal inflammation","peritonitis","sore stomach","stomach hurts","stomach pain"],
        "terms_to_avoid": ["renal colic"],
        "check": ["ht","gt"]
    },
    {
        "flag_name": "bleed",
        "terms_to_find": ["Digestive bleed","Gastric bleeding","Gastric hemorrhage","Gastrointestinal bleeding","hemorrhoids","Ulcer of the colon"],
        "terms_to_avoid": [],
        "check": ["ht","gt"]
    },
    {
        "flag_name": "diarrhea_dysentery",
        "terms_to_find": ["diarrhea","dysentery","food poisoning","giardia","typhoid"],
        "terms_to_avoid": [],
        "check": ["ht","gt"]
    },
    {
        "flag_name": "liver_dysfunction",
        "terms_to_find": ["cirrhosis","hapatitis","hep a","hep b","hep c","hepatic","jaundice"],
        "terms_to_avoid": [],
        "check": ["ht","gt"]
    },
    {
        "flag_name": "nausea_vomiting",
        "terms_to_find": ["nausea","vomiting","vomitting"],
        "terms_to_avoid": [],
        "check": ["ht","gt"]
    },
    {
        "flag_name": "gu",
        "terms_to_find": ["cystitis","dysuria","epididymitis","genital infection","herpes","orchitis","sexually transmitted infection","urethritis","urinary infection","Urinary tract infection","urogenital infection","UTI","bladder","hematuria","incontinence","pelvic mass","urinary disorder","urinary retention","urinary symptoms","varicocele"],
        "terms_to_avoid": [],
        "check": ["ht","gt"]
    },
    {
        "flag_name": "infection",
        "terms_to_find": ["cystitis","dysuria","epididymitis","genital infection","herpes","orchitis","sexually transmitted infection","urethritis","urinary infection","Urinary tract infection","urogenital infection","UTI"],
        "terms_to_avoid": [],
        "check": ["ht","gt"]
    },
    {
        "flag_name": "gyn_women",
        "terms_to_find": ["breast","endometriosis","fibroids","gynecological","hot flashes","irregular cycle","mastitis","menopause","menstrual","ovarian","ovary","ovulation","reproductive health","uterine","uterus","vagina","vaginal","vaginitis"],
        "terms_to_avoid": [],
        "check": ["ht","gt"]
    },
    {
        "flag_name": "injury",
        "terms_to_find": ["bite","sting","stinging","cut","wound","injury","blast","burn","fracture","gunshot","shot","hemiplegia","paralysis","paraplegia","quadriplegia","fragments","shrapnel","sliver","splinter","traffic accident","abrasion","bruise","bruising","Concussion","contusion","falling","knee rupture","splint","trauma"],
        "terms_to_avoid": ["psychological trauma","heartburn"],
        "check": ["ht","gt"]
    },
    {
        "flag_name": "injury",
        "terms_to_find": ["ulcer"],
        "terms_to_avoid": ["gastric", "stomach", "peptic", "intestinal", "duodenal"],
        "check": ["ht", "gt"]
    },
    {
        "flag_name": "injury_neuro",
        "terms_to_find": ["hemiplegia","paralysis","paraplegia","quadriplegia"],
        "terms_to_avoid": [],
        "check": ["ht","gt"]
    },
    {
        "flag_name": "malnutrition",
        "terms_to_find": ["delayed growth","growth delay","growth retardation","short stature","malnutrition"],
        "terms_to_avoid": [],
        "check": ["ht","gt"]
    },
    {
        "flag_name": "growth_delay",
        "terms_to_find": ["delayed growth","growth delay","growth retardation","short stature"],
        "terms_to_avoid": [],
        "check": ["ht","gt"]
    },
    {
        "flag_name": "mental_health",
        "terms_to_find": ["anxiety","bipolar","mental illness","personality disorder","post traumatic syndrome","post-traumatic syndrome","psychiatric","psychological","ptsd","schizophrenia"],
        "terms_to_avoid": [],
        "check": ["ht", "gt"]
    },
    {
        "flag_name": "mental_health",
        "terms_to_find": ["depression"],
        "terms_to_avoid": [],
        "check": ["ht"]
    },
    {
        "flag_name": "neuro_complaint",
        "terms_to_find": ["head pain","headache","cerebral accident","cerebral hemorrhage","cerebral infarction","cerebral ischemia","cerebrovascular accident","cva","stroke","benign paroxysmal postitional vertigo","brachial plexus","brain infection","brain tumor","cauda equina","cerebral palsy","cervical root","convulsion","convulsions","dementia","dizziness","encephalitis","epilepsy","epileptic","foot drop","hand drop","meningitis","meningocele","migraine","nerve","neuritis","neurodegenerative","neurological","neuropathy","numbness","nystagmus","polyneuritis","sciatica","seizure","subarachnoid hemorrhage","TIA","tinnitus","Vertigo"],
        "terms_to_avoid": [],
        "check": ["ht", "gt"]
    },
    {
        "flag_name": "other_infection",
        "terms_to_find": ["mediterranean fever","mf","abcess","abscess","sepsis","septic shock","bacteremia","brucellosis","chickenpox","diphtheria","finger infection","foot infection","fungal","hand foot","hand infection","hand mouth","hand-foot","hookworm","infection of blood","intestinal worms","leprosy","lymphadenitis","lymphadenopathy","measles","nemotodes","omphalitis","parasite","pinworm","rheumatic fever","rubella","scarlet fever","thrush","toe infection","worms"],
        "terms_to_avoid": [],
        "check": ["ht", "gt"]
    },
    {
        "flag_name": "other_infection",
        "terms_to_find": ["leishmania","leishmaniasis"],
        "terms_to_avoid": ["excluding leishmaniasis", "excluding leishmania", "except leishmaniasis"],
        "check": ["ht", "gt"]
    },
    {
        "flag_name": "pregnancy",
        "terms_to_find": ["abortion","antenatal","birth","caesarean section","csection","delivery","gestation","miscarriage","placenta","postnatal","postpartum","pregnancy","pregnant","prenatal"],
        "terms_to_avoid": ["not pregnant"],
        "check": ["ht", "gt"]
    },
    {
        "flag_name": "renal",
        "terms_to_find": ["hydronephrosis","kidney cysts","kidney failure","kidney stone","nephritis","nephrolithiasis","nephropathy","pyelonephritis","renal calculi","renal calculus","renal failure","renal impairment","renal insufficiency","renal stones"],
        "terms_to_avoid": [],
        "check": ["ht", "gt"]
    },
    {
        "flag_name": "respiratory",
        "terms_to_find": ["laryngitis","rhinitis","sinusitis","bronchiolitis","bronchitis","cold","congestion","cough","croup","flu","grippe","influenza","penumonia","pneumonia","pneumonitis","pulmonary infection","respiratory infection","respiratory tract infection","rhinorrhea","running nose","runny nose","tuberculosis","urti","asthma","bronchospasm","COPD","difficulty breathing","dyspnea","emphysema","hemoptysis","lung disease","nebulization","nebulizing","pulmonary disease","pulmonary fibrosis","shortness of breath","sneezing"],
        "terms_to_avoid": [],
        "check": ["ht", "gt"]
    },
    {
        "flag_name": "infection",
        "terms_to_find": ["bronchiolitis","bronchitis","cold","congestion","cough","croup","flu","grippe","influenza","penumonia","pneumonia","pneumonitis","pulmonary infection","respiratory infection","respiratory tract infection","rhinorrhea","running nose","runny nose","tuberculosis","urti"],
        "terms_to_avoid": [],
        "check": ["ht", "gt"]
    },
    {
        "flag_name": "wound",
        "terms_to_find": ["dressing change"],
        "terms_to_avoid": [],
        "check": ["ht", "gt"]
    },
    {
        "flag_name": "animal_insect_bite",
        "terms_to_find": ["bite","sting","stinging"],
        "terms_to_avoid": [],
        "check": ["ht", "gt"]
    }
]

The `arabic_values` table has one record for each unique Arabic string in the raw data that was imported. Since we can map these back to their source in the raw data, we'll generate flags against these strings and then map the flags back by joining on the arabic strings in this table.

In [10]:
# Store the rows we change here
# so that we can update the table

update_data = []

# Iterate through the in-memory representation
for rec in data:
    # Create a placeholder update record
    update_rec = {'id':rec['id']}
    
    # A flag we'll use to determine whether the record needs to be updated
    update_record = False
    
    # Get the human_translate value from the record
    ht = rec['human_translate']
    
    # If it is not None, then convert it to lowercase
    if ht:
        ht = ht.lower()
    
    # Get the google_translate value and convert it to lowercase
    # We are not currently using this to generate flags so it is commented out
    # but you could substitute it in below or write additional code if you want
    # to use it for flag generation
    gt = rec['google_translate_feb']
    if gt:
        gt = gt.lower()
    
    # Look at google_tokens_joined field
    gtj = rec['google_tokens_joined']
    if gtj:
        gtj = gtj.lower()
    
    
    # Walk through the different flag types from above and check whether the 
    # human_translate value matches for that flag. If so, create the update record
    # for that flag and then mark our update boolean indicator true so that we know
    # to update the appropriate record in the database. All records that will be updated
    # have their update record put into the update_data list.
    for term in flag_terms:
        if (ht and term in ht) or (gt and term in gt) or (gtj and term in gtj):
            update_rec["flag_" + "_".join(term.replace("-","_").split())] = 1
            update_record = True

    for tup in multiple_flag_terms:
        if (ht and all(x in ht for x in tup)) or (gt and all(x in gt for x in tup)) or (gtj and all(x in gtj for x in tup)):
            update_rec["flag_" + "_and_".join(tup)] = 1
            update_record = True

    for tup in synonym_flag_terms:
        if (ht and any(x in ht for x in tup)) or (gt and any(x in gt for x in tup)) or (gtj and any(x in gtj for x in tup)):
            update_rec["flag_" + "_".join(tup[0].split())] = 1
            update_record = True

    for tup in complex_flag_terms:
        if (ht and tup[0] in ht and not any(x in ht for x in tup[1:])) or (gt and tup[0] in gt and not any(x in gt for x in tup[1:])) or (gtj and tup[0] in gtj and not any(x in gtj for x in tup[1:])):
            update_rec["flag_" + tup[0].replace(" ","_").replace("-","_")] = 1
            update_record = True

    # complex_set_flag_terms
    for rule in complex_set_flag_terms:
        flag_name = "flag_" + "_".join(rule['flag_name'].split())

        # Continue because we already set this flag
        if flag_name in update_rec.keys():
            if update_rec[flag_name] == 1:
                continue

        if "ht" in rule['check']:
            if ht and any(x in ht for x in rule["terms_to_find"]) and not any(x in ht for x in rule["terms_to_avoid"]):
                update_rec[flag_name] = 1
                update_record = True
                # We set the flag so stop searching
                continue

        if "gt" in rule['check']:
            if gt and any(x in gt for x in rule["terms_to_find"]) and not any(x in gt for x in rule["terms_to_avoid"]):
                update_rec[flag_name] = 1
                update_record = True
                # We set the flag so stop searching
                continue

            if gtj and any(x in gtj for x in rule["terms_to_find"]) and not any(x in gtj for x in rule["terms_to_avoid"]):
                update_rec[flag_name] = 1
                update_record = True
                # We set the flag so stop searching
                continue
            
    # Handle war-related separately. This very likely can be improved upon
    if ht and 'war-related injury' in ht and 'not war-related injury' not in ht:
        update_rec['flag_conflict_related'] = 1
        update_record = True
    
    # If we created any flags, update_record is true so put this record in the list 
    # of records to update.
    if update_record:
        # Create comprehensive injury flag per Ranya's request
        keys = update_rec.keys()
        if ('flag_injury' in keys and update_rec['flag_injury'] == 1) or ('flag_wound' in keys and update_rec['flag_wound'] == 1):
            update_rec['flag_comprehensive_injury'] = 1
        else:
            update_rec['flag_comprehensive_injury'] = 0
                
        update_data.append(update_rec)

In [11]:
# How many records are we going to update in the arabic_values table?
len(update_data)

118708

In [12]:
# What do the update records look like? 
update_data[:10]

[{'flag_comprehensive_injury': 0, 'flag_diabetes': 1, 'id': 11},
 {'flag_comprehensive_injury': 0, 'flag_diabetes': 1, 'id': 13},
 {'flag_comprehensive_injury': 0, 'flag_diabetes': 1, 'id': 15},
 {'flag_comprehensive_injury': 0, 'flag_diabetes': 1, 'id': 16},
 {'flag_comprehensive_injury': 0, 'flag_diabetes': 1, 'id': 17},
 {'flag_comprehensive_injury': 0, 'flag_diabetes': 1, 'id': 21},
 {'flag_comprehensive_injury': 0, 'flag_diabetes': 1, 'id': 32},
 {'flag_comprehensive_injury': 0, 'flag_diabetes': 1, 'id': 39},
 {'flag_comprehensive_injury': 0, 'flag_diabetes': 1, 'id': 51},
 {'flag_comprehensive_injury': 0, 'flag_derm': 1, 'id': 61}]

In [13]:
# Update the arabic_values table with the update_records' data
# 1. Create the columns we need
# 2. Bulk update for each column

flag_cols = set()
for rec in update_data:
    for k in rec.keys():
        if k != 'id':
            flag_cols.add(k)
flag_cols = sorted(list(flag_cols))

# The trick here is to get the id from a record in arabic values and update that
# record with a None value for each of these flags - that will cause dataset to generate the columns
ref_rec = tab_arabic_values.find_one()
ref_rec_update = {'id':ref_rec['id']}
for col in flag_cols:
    ref_rec_update[col] = None
tab_arabic_values.update(ref_rec_update, ['id'])

# At this point maybe open DB Browser for SQLite to make sure the columns were created.
# The 1 that prints below is the number of records updated.

1

In [14]:
# These are the flags generated from the code above.
flag_cols

['flag_ENT',
 'flag_abdomen',
 'flag_abdominal_pain',
 'flag_allergy',
 'flag_anemia',
 'flag_animal_insect_bite',
 'flag_back',
 'flag_blast',
 'flag_bleed',
 'flag_blunt',
 'flag_burn',
 'flag_cancer',
 'flag_cardiovascular',
 'flag_chest',
 'flag_complication',
 'flag_comprehensive_injury',
 'flag_conflict_related',
 'flag_congenital',
 'flag_constipation',
 'flag_dehydration',
 'flag_dental_complaint',
 'flag_derm',
 'flag_diabetes',
 'flag_diarrhea_dysentery',
 'flag_endocrine',
 'flag_explosive',
 'flag_eye',
 'flag_facial',
 'flag_fatigue',
 'flag_fever',
 'flag_follow_up',
 'flag_fracture',
 'flag_gi_complaint',
 'flag_growth_delay',
 'flag_gu',
 'flag_gunshot',
 'flag_gyn_women',
 'flag_head',
 'flag_headache',
 'flag_history_of',
 'flag_hyperlipidemia',
 'flag_infection',
 'flag_injury',
 'flag_injury_neuro',
 'flag_liver_dysfunction',
 'flag_lower_extremity',
 'flag_malnutrition',
 'flag_mental_health',
 'flag_musculoskeletal_pain',
 'flag_nausea_vomiting',
 'flag_neck',
 'f

In [15]:
# Now iterate through the flag cols and create a list of each record that needs to set the value for each
# flag column and then bulk update. It is orders of magnitude faster to do it this way than one by one.

# Note - this is generating and executing some super gnarly long SQL queries with tons of ID numbers

for col in flag_cols:
    recs_to_update = []
    for rec in update_data:
        if col in rec.keys():
            recs_to_update.append(rec['id'])
    recs_to_update = sorted(recs_to_update)

    db.query("""
    UPDATE arabic_values
    SET """ + col + """ = 1 
    WHERE id IN (""" + ",".join([str(a) for a in recs_to_update]) +""");
    """)
    
# After this runs, check in the database against to make sure the flags were properly applied.

Ok, that's done so now we want to apply the flags back to the raw data.

1. Pseudo-update the first record to trigger the addition of the flag fields
2. Compare each value in each column to an in-memory arabic_values lookup
3. Apply relevant flags to the record in question
4. Buffer and update the raw table when the buffer is full.

In [16]:
# Get a new db connection again in case the schema has changed.
# This probably isn't necessary but is a safety measure.

del db
del tab_arabic_values

db = dataset.connect("sqlite:///" + new_db_name)
tab_arabic_values = db['arabic_values']

In [17]:
# Get a reference to the raw Arabic data table
tab_raw_ar = db['full_raw_scrubbed']

In [18]:
# Get the list of variables used in full_raw_scrubbed and full_raw_english
rec_raw = tab_raw_ar.find_one()
variables = list(rec_raw.keys())
print(",".join(variables))

# Due to previous work, there are flag columns in the full_raw_scrubbed table, but we will ignore them
# because they aren't used in this flag-generation methodology. 

id,a_file_id,a_files_sheets_id,a_sheet_id,acceptance_pattern,analysis,analysis_request,analysis_type,anesthesia_type,assign_method,case,category,center,clinic,clinical_case,col_1,col_2,col_3,col_4,col_5,col_6,col_misc,col_moawak,col_none,col_null,col_to,conflict_related,consultations,daily_number,data_validation,date,date_first_exam,death,death_cause,death_certificate,death_date,death_location,death_time,department,diagnosis,diagnosis_confirmed,discharge,discharge_date,discharge_status,discharged_to,disclaimers,disease,displaced,displacement_duration,dose,drug_class,er,events,exam_type,examination_1,examination_2,examination_3,examination_4,examination_type,facility_type,housing,housing_persons_number,image,image_request,image_type,import_status,info_age,info_card_number,info_card_type,info_care_type,info_geo_address,info_geo_area,info_geo_community,info_geo_country_of_origin,info_geo_district,info_geo_governorate,info_geo_injury_city,info_geo_injury_site,info_geo_injury_state,info_geo

In [19]:
# Create the in-memory arabic_values lookup
# This time, since we created the flags, they'll be in the records

arabic_lookup = {}
arabic_values = [x for x in tab_arabic_values.find()]

for v in arabic_values:
    arabic_lookup[v['arabic']] = v

In [20]:
list(arabic_lookup.keys())[:10]

['قبول عابر',
 'حواضن',
 'استشفاء',
 'عناية',
 'اختبار الزمرة الدموية',
 'اختبار كريات الدم البيضاء, اختبار حمض البول, اختبار سرعة النزف, اختبار زمن التخثر, اختبار الزمرة الدموية',
 'اختبار تحليل البول (بول و رواسب)',
 'اختبار خضاب الدم / الهيموجلوبين, اختبار حمض البول, اختبار كرياتينين, اختبار سرعة النزف, اختبار زمن التخثر',
 'اختبار تحليل البول (بول و رواسب), اختبار كريات الدم البيضاء',
 'اختبار الزمرة الدموية, اختبار خضاب الدم / الهيموجلوبين']

In [21]:
# Let's test that a value we pull out of the database has a hit in the lookup table.
test_rec = tab_raw_ar.find_one()
diagnosis = test_rec['diagnosis']
print(diagnosis)
print("---------------------- Lookup result below")
print(arabic_lookup[diagnosis])

التهاب مجاري تنفسية سفلى, التهاب قصبات
---------------------- Lookup result below
OrderedDict([('id', 3863), ('arabic', 'التهاب مجاري تنفسية سفلى, التهاب قصبات'), ('google_translate', 'Inflammation of lower respiratory tracts, bronchitis'), ('human_translate', 'bronchiolitis, bronchitis'), ('normalized', None), ('appears_in', "['diagnosis']"), ('google_translate_feb', 'Inflammation of lower respiratory tracts, bronchitis'), ('google_tokens_joined', 'lower respiratory tract infection, bronchitis'), ('orig_value', None), ('flag_ENT', None), ('flag_abdomen', None), ('flag_abdominal_pain', None), ('flag_allergy', None), ('flag_anemia', None), ('flag_animal_insect_bite', None), ('flag_back', None), ('flag_blast', None), ('flag_bleed', None), ('flag_blunt', None), ('flag_burn', None), ('flag_cancer', None), ('flag_cardiovascular', None), ('flag_chest', None), ('flag_complication', None), ('flag_comprehensive_injury', '1'), ('flag_conflict_related', None), ('flag_congenital', None), ('flag_co

**The following cell will take a while to run because it has to iterate through all raw records. Depending on how many flags you create, etc... **

**It might take 30 minutes to execute. This examines all fields in each record.**

The idea is to create a parallel record in the flag table for every record in the raw data table. The id numbers between the two tables will be parallel and some other metadata is included in the flag table. Because these flag tables will have far fewer columns and most columns will only have 1 or NULL as values, they are much faster to query.

In [22]:
# The insert_many method inserts in chunks of 1000, but this specifies that we don't want
# to start the process until we have this many records to insert.
buffer_size = 50000

flags_to_insert = []

try:
    db['full_raw_flags'].drop()
except:
    pass

tab_raw_flags = db['full_raw_flags']

# Insert a dummy record to create the table
dummy_record = {
    'file_id':None,
    'files_sheets_id':None,
    'sheet_id':None
}

for flag in flag_cols:
    dummy_record[flag] = None
    
tab_raw_flags.insert(dummy_record)
print(tab_raw_flags.count())
tab_raw_flags.delete()


# Iterate through the raw records one by one
for rec in tab_raw_ar.find():
    
    # Include foreign keys that allow us to query against the flag table instead of 
    # joining with the raw data table, which is slow.
    flag_record = {
        'id':rec['id'],
        'file_id':rec['a_file_id'],
        'files_sheets_id':rec['a_files_sheets_id'],
        'sheet_id':rec['a_sheet_id']
    }
    
    # Initialize each flag_record
    for flag in flag_cols:
        flag_record[flag] = None
        
    # Scan the conflict related column for values, but do this before looking at the
    # corresponding Arabic values so that we don't overwrite the Arabic value setting.
    if rec['conflict_related'] is not None:
        if rec['conflict_related'].strip() == 'كبرى' or rec['conflict_related'].strip() =='كبرى':
            flag_record['flag_conflict_related'] = 1
        elif rec['conflict_related'].strip() == 'لا':
            flag_record['flag_conflict_related'] = 0
        else:
            flag_record['flag_conflict_related'] = None
    else:
        flag_record['flag_conflict_related'] = None
        
    # Loop through the variables for each raw data record
    for v in variables:
        # These are obfuscated PII cols, or the flag columns we're ignoring, so skip them
        if 'info_' in v or 'flag_' in v or v == 'id':
            continue
        
        # Get the value in the column
        to_lookup = rec[v]
        
        if to_lookup is None or to_lookup == '.':
            continue
        else:
            
            # We have a legit value, so look it up and grab the flags
            try:
                # There might be a keyerror on the info_ columns' hashed values, etc.
                # I also manually removed some PII from arabic_values, so that might
                # cause an occassional mismatch.
                arabic_values_rec = arabic_lookup[to_lookup]
                for flag in flag_cols:
                    # Should be None if not flagged, so just check for existence
                    if arabic_values_rec[flag]:
                        flag_record[flag] = arabic_values_rec[flag]
            except:
                pass
    
    # Store the record
    flags_to_insert.append(flag_record)

    # Check if we need to insert
    if len(flags_to_insert) > buffer_size:
        tab_raw_flags.insert_many(flags_to_insert)
        
        # Clear the buffer
        flags_to_insert.clear()
        
# We've been through all raw records so make sure the buffer is clear
tab_raw_flags.insert_many(flags_to_insert)
flags_to_insert.clear()

1


Now you need to get into the SQLite database and start querying to generate the datasets that you want. Below is an example of how to run the query through Python and easily dump the results to CSV files for analysis elsewhere.

In [23]:
# Fix a few minor spelling errors in the facilities table

result = db.query("""
UPDATE facilities SET district = 'Idlib' WHERE district = 'Idleb';
""")

result = db.query("""
UPDATE facilities SET subdistrict = 'Idlib' WHERE subdistrict = 'Idleb';
""")

result = db.query("""
UPDATE facilities SET district = 'Jisr Ash Shugar' WHERE district = 'Jisr-Ash-Shugur';
""")

result = db.query("""
UPDATE facilities SET subdistrict = 'Jisr Ash Shugar' WHERE subdistrict = 'Jisr-Ash-Shugur';
""")

result = db.query("""
UPDATE facilities SET district = 'Daraa' WHERE district = "Dar'a";
""")

result = db.query("""
UPDATE facilities SET subdistrict = 'Daraa' WHERE subdistrict = "Dar'a";
""")

result = db.query("""
UPDATE facilities SET district = 'Al Mara' WHERE district = "Al Ma'ra";
""")

In [24]:
# Update to location information match on facility_code
new_data = [
    ("SAMS002","Abu Fadel","Syria","Daraa","Daraa","Dael "),
    ("SAMS010","Al Ehsan","Syria","Daraa","Daraa","Kherbet Ghazala"),
    ("SAM011","Al Ehsan Clinic","Syria","Daraa","Daraa","Kherbet Ghazala"),
    ("SAMS010","Al Ehsan RH","Syria","Daraa","Daraa","Kherbet Ghazala"),
    ("SAMS013","Al Hara","Syria","Daraa","As-Sanamayn","As-Sanamayn"),
    ("SAMS015","Al Herak","Syria","Daraa","Izra","Herak"),
    ("SAMS015","Al Herak RH","Syria","Daraa","Izra","Herak"),
    ("SAMS017","Al Jeza","Syria","Daraa","Daraa","Jizeh"),
    ("SAMS017","Al Jeza Clinic","Syria","Daraa","Daraa","Jizeh"),
    ("SAMS024","Al Msaifra","Syria","Daraa","Daraa","msaifra"),
    ("SAMS027","Al Noor","Syria","Daraa","Daraa","Daraa"),
    ("SAMS030","Al Rafed","Syria","Quneitra","Quneitra","Al-Khashniyyeh"),
    ("SAMS030","Al Rafed RH","Syria","Quneitra","Quneitra","Al-Khashniyyeh"),
    ("SAMS030","Al Rafed SAMS Clinic","Lebanon","Quneitra","Quneitra","Al-Khashniyyeh"),
    ("SAMS032","Al Redwan","Syria","Daraa","Izraa","Jasim"),
    ("SAMS032","Al Redwan Clinic","Syria","Daraa","Izraa","Jasim"),
    ("SAMS035","Al Salam Hospital","Syria","Daraa","Daraa","Daraa"),
    ("SAMS036","Al Salam Midwife","Syria","Daraa","Daraa","Daraa"),
    ("SAMS037","Al Yadudeh","Syria","Daraa","Daraa","Mzeireb"),
    ("SAMS038","Al Yaman","Syria","Rural Damascus","Rural Damascus","Kafr Batna"),
    ("SAMS060","Ankhal","Syria","Daraa","As-Sanamayn","As-Sanamayn"),
    ("SAMS062","Artificial Limbs Center (Farha)","Syria","Rural Damascus","Rural Damascus","Kafr Batna"),
    ("SAMS068","Beer Ajam","Syria","Quneitra","Quneitra","Quneitra"),
    ("SAMS073","Douma Dialysis","Syria","Rural Damascus","Duma","Duma"),
    ("SAMS074","Douma FH","Syria","Rural Damascus","Duma","Duma"),
    ("SAMS075","Douma OBGYN","Syria","Rural Damascus","Duma","Duma"),
    ("SAMS076","East Ghouta ICU","Syria","Rural Damascus","Rural Damascus","Kafr Batna"),
    ("SAMS077","Eissa Ajaj","Syria","Daraa","Daraa","Daraa"),
    ("SAMS079","Erbin FH","Syria","Rural Damascus","Rural Damascus","Rural Damascus"),
    ("SAMS087","Hit Med Point","Syria","Daraa","Daraa","Ash-Shajara"),
    ("SAMS090","Ibta RH","Syria","Daraa","Daraa","Dael"),
    ("SAMS097","Ishraqat Amal PSS","Syria","Rural Damascus","Rural Damascus","Rural Damascus"),
    ("SAMS101","Jassim","Syria","Daraa","Izra","Jasim"),
    ("SAMS102","Jbt Khashab","Syria","Quneitra","Quneitra","Khan Arnaba"),
    ("SAMS105","Jlein Med Point","Syria","Daraa","Daraa","Mzeireb"),
    ("SAMS106","Jobar Med Center","Syria","Rural Damascus","Rural Damascus","Kafr Batna"),
    ("SAMS142","Maaraba","Syria","Daraa","Daraa","Busra Esh-Sham"),
    ("SAMS143","Maaraba RH","Syria","Daraa","Daraa","Busra Esh-Sham"),
    ("SAMS153","Muwafeq Dakhl Alla ","Syria","Daraa","Izra","Tassil"),
    ("SAMS154","Muwafeq Dakhl Alla Clinic","Syria","Daraa","Izra","Tassil"),
    ("SAMS155","Nabd Horan ","Syria","Daraa","Daraa","Dael "),
    ("SAMS156","Nabd Horan Clinic","Syria","Daraa","Daraa","Dael "),
    ("SAMS159","Nawa","Syria","Daraa","Izra","Nawa"),
    ("SAMS160","Nawa Clinic","Syria","Daraa","Izra","Nawa"),
    ("SAMS161","Neonatal ICU-hamoria","Syria","Rural Damascus","Rural Damascus","Kafr Batna"),
    ("SAMS162","Nuaima","Syria","Daraa","Daraa","Daraa"),
    ("SAMS173","Rawan Birth Center","Syria","Daraa","Daraa","Ash-Shajara"),
    ("SAMS175","Sahm Al Jolan Clinic","Syria","Daraa","Daraa","Ash-Shajara"),
    ("SAMS176","Saida","Syria","Daraa","Daraa","Sayda"),
    ("SAMS177","Saida PSS","Syria","Daraa","Daraa","Sayda"),
    ("SAMS185","Sham ","Syria","Rural Damascus","Rural Damascus","Arbin"),
    ("SAMS186","Shifa","Jordan","Rural Damascus","Duma","Duma"),
    ("SAMS187","Shifa Mobile hospital","Jordan","Rural Damascus","Duma","Duma"),
    ("SAMS194","Shohadaa Horan","Syria","Daraa","Daraa","Daraa"),
    ("SAMS192","Tafas Clinic","Syria","Daraa","Daraa","Mzeireb"),
    ("SAMS193","Tafas PSS","Syria","Daraa","Daraa","Mzeireb"),
    ("SAMS195","Tal Shehab","Syria","Daraa","Daraa","Mzeireb"),
    ("SAMS198","Tassil RH","Syria","Daraa","Izra","Tassil"),
    ("SAMS201","Um Walad","Syria","Daraa","Daraa","Mseifra"),
    ("SAMS203","Wadi Al Yarmouk","Syria","Daraa","Daraa","Ash-Shajara"),
    ("SAMS204","Wadi Al Yarmouk FH","Syria","Daraa","Daraa","Ash-Shajara"),
    ("SAMS205","White Hands PSS","Syria","Daraa","Daraa","Mzeireb"),
    ("SAMS208","Zayzun","Syria","Daraa","Daraa","Mzeireb"),
    ("SAMS209","Zayzun","Syria","Daraa","Daraa","Mzeireb")
]

def make_query_string(t, field, index):
    s = "UPDATE facilities SET " + field + " ='" + t[index] + "' WHERE facility_code = '" + t[0] + "';"
    return s

query_queue = []

for facility in new_data:
    query_queue.append(make_query_string(facility, "country", 2))
    query_queue.append(make_query_string(facility, "governorate", 3))
    query_queue.append(make_query_string(facility, "district", 4))
    query_queue.append(make_query_string(facility, "subdistrict", 5))
    
for query in query_queue:
    result = db.query(query)

Put in place some missing facility_id numbers for the files table; Data provided in an email by Ranya

```
Record IDs: 594, 599, 747 & 947. Attributable to Bab Al Hawa, SAMS ID 065, located in Idlib, Subdistrict Harim. 
Record ID: 42. Attributable to Sohad Al Mzereb, SAMS ID 190, located in Dar'a. 
Record IDs: 627, 1023. Attributable to Al Batrana, SAMS ID 009, located in Rural Damascus. 
Record IDs: 635, 796, 884, 956, 1102. Attributable to Al Ikhlas, SAMS ID 049, located in Idlib, subdistrict Jisr As Shugur.
Record IDs: 668, 1039, 1041, 1042. Attributable to Hama Mobile Clinic, SAMS ID 083, located in Hama, subdistrict Hama.  
Record IDs: 719, 874, 940. Attributable to Sarmin PHC, SAMS ID 183, located in Idlib, subdistrict Idlib.
Record IDs: 806, 891, 964, 1034. Attributable to Al Salam, SAMS ID 036, located in Idlib.
Record ID: 834. Attributable to Kafrenboudeh, SAMS ID 111, located in Hama, subdistrict Al Mara. 
Record IDs: 1061 & 1120. Attributable to Jabal Al Zawia, SAMS ID 099, located in Idlib.
Record ID: 981. Attributable to Maree, SAMS ID 147, located in Aleppo, subdistrict Azaz. 
```

In [25]:
query_queue = [
    "UPDATE files SET facility_id = 124 WHERE id IN (594, 599, 747, 947);",
    "UPDATE files SET facility_id = 71  WHERE id IN (42);",
    "UPDATE files SET facility_id = 312 WHERE id IN (627, 1023);",
    "UPDATE files SET facility_id = 119 WHERE id IN (635, 796, 884, 956, 1102);",
    "UPDATE files SET facility_id = 88  WHERE id IN (668, 1039, 1041, 1042);",
    "UPDATE files SET facility_id = 146 WHERE id IN (719, 874, 940);",
    "UPDATE files SET facility_id = 118 WHERE id IN (806, 891, 964, 1034);",
    "UPDATE files SET facility_id = 92  WHERE id IN (834);",
    "UPDATE files SET facility_id = 129 WHERE id IN (1061, 1120);",
    "UPDATE files SET facility_id = 3   WHERE id IN (981);"
]

for query in query_queue:
    result = db.query(query)

### Export full flags table, full_raw_flags, to CSV

This is a big data set and might take a few minutes. Resulting CSV will be ~ 160mb.

You should be able to pull this into python, R, Tableau, etc... for analysis. It probably has too many records to open in Excel.

In [26]:
# You can change this query to export a different set of data
result = db.query("""
SELECT  files.id as files_id,
        files.year,
        files.month,
        files.year || '-' || files.month || '-01' AS full_date,
        facilities.id AS facility_id,
        facilities.facility_parent_id,
        facilities.facilityname,
        facilities.country,
        facilities.governorate,
        facilities.district,
        facilities.subdistrict,
        facilities.facility_type,
        full_raw_flags.flag_abdomen,
        full_raw_flags.flag_abdominal_pain,
        full_raw_flags.flag_allergy,
        full_raw_flags.flag_anemia,
        full_raw_flags.flag_animal_insect_bite,
        full_raw_flags.flag_back,
        full_raw_flags.flag_blast,
        full_raw_flags.flag_bleed,
        full_raw_flags.flag_blunt,
        full_raw_flags.flag_burn,
        full_raw_flags.flag_cancer,
        full_raw_flags.flag_cardiovascular,
        full_raw_flags.flag_chest,
        full_raw_flags.flag_complication,
        full_raw_flags.flag_conflict_related,
        full_raw_flags.flag_congenital,
        full_raw_flags.flag_constipation,
        full_raw_flags.flag_dehydration,
        full_raw_flags.flag_dental_complaint,
        full_raw_flags.flag_derm,
        full_raw_flags.flag_diabetes,
        full_raw_flags.flag_diarrhea_dysentery,
        full_raw_flags.flag_endocrine,
        full_raw_flags.flag_ENT,
        full_raw_flags.flag_explosive,
        full_raw_flags.flag_eye,
        full_raw_flags.flag_facial,
        full_raw_flags.flag_fatigue,
        full_raw_flags.flag_fever,
        full_raw_flags.flag_follow_up,
        full_raw_flags.flag_fracture,
        full_raw_flags.flag_gi_complaint,
        full_raw_flags.flag_growth_delay,
        full_raw_flags.flag_gu,
        full_raw_flags.flag_gunshot,
        full_raw_flags.flag_gyn_women,
        full_raw_flags.flag_head,
        full_raw_flags.flag_headache,
        full_raw_flags.flag_history_of,
        full_raw_flags.flag_hyperlipidemia,
        full_raw_flags.flag_infection,
        full_raw_flags.flag_injury,
        full_raw_flags.flag_injury_neuro,
        full_raw_flags.flag_liver_dysfunction,
        full_raw_flags.flag_lower_extremity,
        full_raw_flags.flag_malnutrition,
        full_raw_flags.flag_mental_health,
        full_raw_flags.flag_musculoskeletal_pain,
        full_raw_flags.flag_nausea_vomiting,
        full_raw_flags.flag_neck,
        full_raw_flags.flag_nerve,
        full_raw_flags.flag_neuro_complaint,
        full_raw_flags.flag_neurologic,
        full_raw_flags.flag_orthopedic,
        full_raw_flags.flag_other_infection,
        full_raw_flags.flag_pain,
        full_raw_flags.flag_pelvic,
        full_raw_flags.flag_pregnancy,
        full_raw_flags.flag_renal,
        full_raw_flags.flag_respiratory,
        full_raw_flags.flag_shrapnel,
        full_raw_flags.flag_spinal,
        full_raw_flags.flag_spine,
        full_raw_flags.flag_stab,
        full_raw_flags.flag_stroke,
        full_raw_flags.flag_suspected,
        full_raw_flags.flag_traffic_accident,
        full_raw_flags.flag_trauma,
        full_raw_flags.flag_upper_extremity,
        full_raw_flags.flag_urologic,
        full_raw_flags.flag_vascular,
        full_raw_flags.flag_wound,
        full_raw_flags.flag_comprehensive_injury

FROM full_raw_flags
JOIN files on files.id = full_raw_flags.file_id
JOIN facilities on files.facility_id = facilities.id

WHERE files.facility_id IS NOT NULL 
AND files.month IS NOT NULL
AND files.skipped = 0
AND files.ignore = 0;
""")

# This used to be a part of dataset but was extracted to its own library
# https://github.com/pudo/datafreeze
freeze(result, format='csv', filename='full_raw_flags.csv')

In [27]:
# Dump the arabic_values table and ignore the values that were appended at the end of it

result = db.query("SELECT * FROM arabic_values WHERE orig_value IS NULL ORDER BY id ASC;")
freeze(result, format='csv', filename='arabic_values.csv')

### Back up the derived SQLite database

In [28]:
# This is optional and will generate a copy of the database that will be gigabytes in size.
shutil.copy2(new_db_name,'backup_' + new_db_name)

'backup_sams_data_phase20.sqlite'