### The three source files contain distinct sections that pertain to different datasets. I split them into separate .json files for accessibility.

Here are the sections and the file names they're saved under:
#### Source file: Inpatient DirectCare Utilization 508_20180921v2.pdf
1. Direct Care Patient Population Counts with Respective Mental Health Disorders (saved as DC_pts_by_dx.json)
2. Direct Care Hospitalizations Among Patients with Respective Mental Health Disorders (DC_hospitalizations.json)
3. Direct Care Average Bed Days per Hospitalization (BDPH) with Respective Mental Health Disorders (DC_BDPH.json)


#### Source file: Inpatient PurchasedCare Utilization 508_20180921v2(1).pdf
1. Purchased Care Patient Population Counts with Respective Mental Health Disorders (saved as PC_pts_by_dx.json)
2. Purchased Care Hospitalizations Among Patients with Respective Mental Health Disorders (PC_hospitalizations.json)
3. Purchased Care Average Bed Days per Hospitalization (BDPH) with Respective Mental Health Disorders (PC_BDPH.json)

#### Source file: Hospitalization Rate 508_20180921v2(1).pdf
1. Combined Direct and Purchased Care Hospitalizations per 1,000 Patients Diagnosed with Respective Mental Health Disorders (saved as total_hospitalizations_per_1000.json)
2. Combined Direct and Purchased Care Patient Population Counts with Respective Mental Health Disorders (total_pts_by_dx.json)
3. Combined Direct and Purchased Care Hospitalizations Among Patients with Respective Mental Health Disorders (total_hospitalizations_by_dx.json)


All JSON files share the same nested structure: diagnosis, then year, then group. Each value is stored as a string taken from the text files. The group keys match the ones written by the parsing code below; `Reserve` holds the row marked `GRD` in the text files.

    { Diagnosis :
        { Year : { Air Force : Relevant Stat,
                   Army      : Relevant Stat,
                   Marines   : Relevant Stat,
                   Navy      : Relevant Stat,
                   Active    : Relevant Stat,
                   Reserve   : Relevant Stat,
                   Total     : Relevant Stat },

          Year : { Air Force : Relevant Stat,
                   Army      : Relevant Stat,
                   Marines   : Relevant Stat,
                   Navy      : Relevant Stat,
                   Active    : Relevant Stat,
                   Reserve   : Relevant Stat,
                   Total     : Relevant Stat },

          ...
        }
    }
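
As a quick illustration of reading one of these files, here is a minimal sketch. The diagnosis name, years, and numbers in `sample` are made up for the example, and `load_stats` and `to_number` are hypothetical helpers, not part of the original code. It assumes values may contain thousands separators, which is a guess about the text files rather than something confirmed here.

```python
import json
import tempfile
from pathlib import Path

# Made-up stand-in for one of the JSON files (illustrative values only).
sample = {
    "Example Disorder": {
        "2005": {"Air Force": "10", "Army": "20", "Marines": "5", "Navy": "8",
                 "Active": "40", "Reserve": "3", "Total": "43"},
        "2006": {"Air Force": "12", "Army": "1,234", "Marines": "6", "Navy": "9",
                 "Active": "1,258", "Reserve": "3", "Total": "1,261"},
    }
}


def load_stats(path):
    """Read one of the JSON files into a nested dict."""
    with open(path) as infile:
        return json.load(infile)


def to_number(value):
    """Convert a stored string such as '1,234' or '4.5' to a float."""
    return float(value.replace(",", ""))


# Write the sample to a temporary file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.json"
    path.write_text(json.dumps(sample))
    stats = load_stats(path)

army_2006 = to_number(stats["Example Disorder"]["2006"]["Army"])
```

With a real file, the lookup has the same shape: `stats[diagnosis][year][group]`.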



The methods for turning the pdfs into text files and extracting the data are below, mostly for posterity.


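An important note if we need to run them again: I made minor changes by hand to the raw text files produced by PyPDF2 to make the parsing code simpler, so running all of the cells in order will not reproduce the JSON files from the PDFs alone. The parsing cells read the hand-edited files (`Group1_Purchased_Care.txt` and so on), not the raw output of the conversion cells. The hand-edited text files are saved in "Data as Text Files" in case we need them again.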
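
Before re-running the parsing cells, it may help to confirm that the hand-edited inputs are present. The sketch below is not part of the original workflow: `EXPECTED_INPUTS` and `missing_inputs` are names introduced here, and the file names are the ones opened by the parsing cells further down. The parsing cells open these files from the current working directory, so the directory argument is an assumption about where you keep them.

```python
from pathlib import Path

# File names opened by the nine parsing cells below.
EXPECTED_INPUTS = [
    f"Group{n}_{source}.txt"
    for source in ("Purchased_Care", "Direct_Care", "Combined_Hospitalization")
    for n in (1, 2, 3)
]


def missing_inputs(directory="."):
    """Return the expected text files that are not present in `directory`."""
    folder = Path(directory)
    return [name for name in EXPECTED_INPUTS if not (folder / name).is_file()]
```

For example, `missing_inputs("Data as Text Files")` returns an empty list when all nine files are in that folder.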

## Convert PDFs to .txt files
Assumes PDFs are in current working directory.

In [None]:
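# importing required modules
# Note: PdfFileReader, numPages, getPage and extractText are the older PyPDF2 names.
# PyPDF2 3.0 removed them in favor of PdfReader, len(reader.pages), reader.pages[i]
# and extract_text(), so these cells need PyPDF2 < 3.0 to run unchanged.
import PyPDF2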

# creating a pdf file object
with open('Inpatient PurchasedCare Utilization 508_20180921v2(1).pdf', 'rb') as pdfFileObj:

    # creating a pdf reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # creating a text version of pdf file
    textFile = ""
    for page in range(pdfReader.numPages):
        pageObj = pdfReader.getPage(page)
        textFile += (pageObj.extractText())

with open('RAW TEXT Purchased Care.txt', 'wt') as file2:
    file2.write(textFile)

In [None]:
# creating a pdf file object
with open('Hospitalization Rate 508_20180921v2(1).pdf', 'rb') as pdfFileObj:

    # creating a pdf reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # creating a text version of pdf file
    textFile = ""
    for page in range(pdfReader.numPages):
        pageObj = pdfReader.getPage(page)
        textFile += (pageObj.extractText())

with open('NEW TEXT Hospitalization Rate.txt', 'wt') as file2:
    file2.write(textFile)

In [None]:
# creating a pdf file object
with open('Inpatient DirectCare Utilization 508_20180921v2.pdf', 'rb') as pdfFileObj:

    # creating a pdf reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # creating a text version of pdf file
    textFile = ""
    for page in range(pdfReader.numPages):
        pageObj = pdfReader.getPage(page)
        textFile += (pageObj.extractText())

with open('NEW TEXT Direct Care.txt', 'wt') as file2:
    file2.write(textFile)
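
The three conversion cells above differ only in their file names, so they could be folded into one loop. This is a sketch rather than the code that was run: `PDF_TO_TXT`, `pdf_to_text`, and `convert_all` are names introduced here for illustration, and the extractor is passed in as a parameter so the loop can be tried without PyPDF2 installed. The default extractor uses the same pre-3.0 PyPDF2 calls as the cells above.

```python
# PDF name -> text file name, copied from the three cells above.
PDF_TO_TXT = {
    'Inpatient PurchasedCare Utilization 508_20180921v2(1).pdf': 'RAW TEXT Purchased Care.txt',
    'Hospitalization Rate 508_20180921v2(1).pdf': 'NEW TEXT Hospitalization Rate.txt',
    'Inpatient DirectCare Utilization 508_20180921v2.pdf': 'NEW TEXT Direct Care.txt',
}


def pdf_to_text(pdf_path):
    """Extract the text of every page (requires PyPDF2 < 3.0)."""
    import PyPDF2  # imported lazily so the module loads without PyPDF2

    with open(pdf_path, 'rb') as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        return "".join(
            reader.getPage(page).extractText() for page in range(reader.numPages)
        )


def convert_all(mapping, extract=pdf_to_text):
    """Write the extracted text of each PDF in `mapping` to its text file."""
    for pdf_name, txt_name in mapping.items():
        with open(txt_name, 'wt') as out:
            out.write(extract(pdf_name))
```

Calling `convert_all(PDF_TO_TXT)` in the folder that holds the PDFs would write the same three text files as the cells above.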

## Store data from Inpatient PurchasedCare Utilization 508_20180921v2(1) 

In [None]:
import json
import re

PC_pts_by_dx = {}

with open("Group1_Purchased_Care.txt") as file:
    report = file.read()    
    split_report = re.split("Table ", report)

    tables = []
    for table in split_report:
        table_lines = table.split('\n')
        # skip single-line chunks, such as the text before the first "Table "
        if len(table_lines) == 1:
            continue
        stripped_lines = [x.strip(' ') for x in table_lines]
        tables.append(stripped_lines)
        
    groups = {'Air Force' : None, 'Army': None, 'Marines': None, 'Navy': None, 
            'Active': None, 'Reserve': None, 'Total': None}

    
    # Create dictionary
    for table in tables:
        dx_stop = table[2].index('by')
        diagnosis = table[2][:dx_stop]
        service = table.index('Service')
        years = table[service+1:service+14]
        PC_pts_by_dx[diagnosis] = {year : {group : None for group in groups} 
                                 for year in years}
        
        af_index = table.index('A')
        army_index = table.index('A', af_index+1)
        marines_index = table.index('Marines')
        navy_index = table.index('Navy')
        navy_end = table.index('Component')
        active_index = table.index('ACT')
        reserve_index = table.index('GRD')
        total_index = table.index('TOTAL')
        
        af = table[af_index+1:army_index]
        army = table[army_index+1:marines_index]
        marines = table[marines_index+1:navy_index]
        navy = table[navy_index+1:navy_end]
        active = table[active_index+1:reserve_index]
        reserve = table[reserve_index+1:total_index]
        total = table[total_index+1:]
        if total[-1] == '':
            total.pop()
            
        groups['Air Force'] = af
        groups['Army'] = army
        groups['Marines'] = marines
        groups['Navy'] = navy
        groups['Active'] = active
        groups['Reserve'] = reserve
        groups['Total'] = total
        
        for year in years:
            for branch in groups:
                PC_pts_by_dx[diagnosis][year][branch] = groups[branch][years.index(year)]

with open("PC_pts_by_dx.json", "w") as outfile:
    json.dump(PC_pts_by_dx, outfile)

In [None]:
import json
import re

PC_hospitalizations = {}

with open("Group2_Purchased_Care.txt") as file:
    report = file.read()    
    split_report = re.split("Table ", report)

    tables = []
    for table in split_report:
        table_lines = table.split('\n')
        # skip single-line chunks, such as the text before the first "Table "
        if len(table_lines) == 1:
            continue
        stripped_lines = [x.strip(' ') for x in table_lines]
        tables.append(stripped_lines)
        
    groups = {'Air Force' : None, 'Army': None, 'Marines': None, 'Navy': None, 
            'Active': None, 'Reserve': None, 'Total': None}

    
    # Create dictionary
    for table in tables:
        dx_stop = table[3].index('by')
        diagnosis = table[3][:dx_stop]
        service = table.index('Service')
        years = table[service+1:service+14]
        PC_hospitalizations[diagnosis] = {year : {group : None for group in groups} 
                                 for year in years}
        
        af_index = table.index('A')
        army_index = table.index('A', af_index+1)
        marines_index = table.index('Marines')
        navy_index = table.index('Navy')
        navy_end = table.index('Component')
        active_index = table.index('ACT')
        reserve_index = table.index('GRD')
        total_index = table.index('TOTAL')
        
        af = table[af_index+1:army_index]
        army = table[army_index+1:marines_index]
        marines = table[marines_index+1:navy_index]
        navy = table[navy_index+1:navy_end]
        active = table[active_index+1:reserve_index]
        reserve = table[reserve_index+1:total_index]
        total = table[total_index+1:]
        if total[-1] == '':
            total.pop()
            
        groups['Air Force'] = af
        groups['Army'] = army
        groups['Marines'] = marines
        groups['Navy'] = navy
        groups['Active'] = active
        groups['Reserve'] = reserve
        groups['Total'] = total
        
        for year in years:
            for branch in groups:
                PC_hospitalizations[diagnosis][year][branch] = groups[branch][years.index(year)]

with open("PC_hospitalizations.json", "w") as outfile:
    json.dump(PC_hospitalizations, outfile)

In [None]:
import json
import re

PC_BDPH = {}

with open("Group3_Purchased_Care.txt") as file:
    report = file.read()    
    split_report = re.split("Table ", report)

    tables = []
    for table in split_report:
        table_lines = table.split('\n')
        # skip single-line chunks, such as the text before the first "Table "
        if len(table_lines) == 1:
            continue
        stripped_lines = [x.strip(' ') for x in table_lines]
        tables.append(stripped_lines)
        
    groups = {'Air Force' : None, 'Army': None, 'Marines': None, 'Navy': None, 
            'Active': None, 'Reserve': None, 'Total': None}

    
    # Create dictionary
    for table in tables:
        dx_stop = table[3].index('by')
        diagnosis = table[3][:dx_stop]
        service = table.index('Service')
        years = table[service+1:service+14]
        PC_BDPH[diagnosis] = {year : {group : None for group in groups} 
                                 for year in years}
        
        af_index = table.index('A')
        army_index = table.index('A', af_index+1)
        marines_index = table.index('Marines')
        navy_index = table.index('Navy')
        navy_end = table.index('Component')
        active_index = table.index('ACT')
        reserve_index = table.index('GRD')
        total_index = table.index('TOTAL')
        
        af = table[af_index+1:army_index]
        army = table[army_index+1:marines_index]
        marines = table[marines_index+1:navy_index]
        navy = table[navy_index+1:navy_end]
        active = table[active_index+1:reserve_index]
        reserve = table[reserve_index+1:total_index]
        total = table[total_index+1:]
        if total[-1] == '':
            total.pop()
            
        groups['Air Force'] = af
        groups['Army'] = army
        groups['Marines'] = marines
        groups['Navy'] = navy
        groups['Active'] = active
        groups['Reserve'] = reserve
        groups['Total'] = total
        
        for year in years:
            for branch in groups:
                PC_BDPH[diagnosis][year][branch] = groups[branch][years.index(year)]

with open("PC_BDPH.json", "w") as outfile:
    json.dump(PC_BDPH, outfile)

## Store data from Inpatient DirectCare Utilization 508_20180921v2

In [None]:
import json
import re

DC_pts_by_dx = {}

with open("Group1_Direct_Care.txt") as file:
    report = file.read()    
    split_report = re.split("Table ", report)

    tables = []
    for table in split_report:
        table_lines = table.split('\n')
        # skip single-line chunks, such as the text before the first "Table "
        if len(table_lines) == 1:
            continue
        stripped_lines = [x.strip(' ') for x in table_lines]
        tables.append(stripped_lines)
        
    groups = {'Air Force' : None, 'Army': None, 'Marines': None, 'Navy': None, 
            'Active': None, 'Reserve': None, 'Total': None}

    
    # Create dictionary
    for table in tables:
        dx_stop = table[4].index('by')
        diagnosis = table[4][:dx_stop]
        service = table.index('Service')
        years = table[service+1:service+14]
        DC_pts_by_dx[diagnosis] = {year : {group : None for group in groups} 
                                 for year in years}
        
        af_index = table.index('A')
        army_index = table.index('A', af_index+1)
        marines_index = table.index('Marines')
        navy_index = table.index('Navy')
        navy_end = table.index('Component')
        active_index = table.index('ACT')
        reserve_index = table.index('GRD')
        total_index = table.index('TOTAL')
        
        af = table[af_index+1:army_index]
        army = table[army_index+1:marines_index]
        marines = table[marines_index+1:navy_index]
        navy = table[navy_index+1:navy_end]
        active = table[active_index+1:reserve_index]
        reserve = table[reserve_index+1:total_index]
        total = table[total_index+1:]
        if total[-1] == '':
            total.pop()
            
        groups['Air Force'] = af
        groups['Army'] = army
        groups['Marines'] = marines
        groups['Navy'] = navy
        groups['Active'] = active
        groups['Reserve'] = reserve
        groups['Total'] = total
        
        for year in years:
            for branch in groups:
                DC_pts_by_dx[diagnosis][year][branch] = groups[branch][years.index(year)]

with open("DC_pts_by_dx.json", "w") as outfile:
    json.dump(DC_pts_by_dx, outfile)

In [None]:
import json
import re

DC_hospitalizations = {}

with open("Group2_Direct_Care.txt") as file:
    report = file.read()    
    split_report = re.split("Table ", report)

    tables = []
    for table in split_report:
        table_lines = table.split('\n')
        # skip single-line chunks, such as the text before the first "Table "
        if len(table_lines) == 1:
            continue
        stripped_lines = [x.strip(' ') for x in table_lines]
        tables.append(stripped_lines)
        
    groups = {'Air Force' : None, 'Army': None, 'Marines': None, 'Navy': None, 
            'Active': None, 'Reserve': None, 'Total': None}

    
    # Create dictionary
    for table in tables:
        dx_stop = table[5].index('by')
        diagnosis = table[5][:dx_stop]
        service = table.index('Service')
        years = table[service+1:service+14]
        DC_hospitalizations[diagnosis] = {year : {group : None for group in groups} 
                                 for year in years}
        
        af_index = table.index('A')
        army_index = table.index('A', af_index+1)
        marines_index = table.index('Marines')
        navy_index = table.index('Navy')
        navy_end = table.index('Component')
        active_index = table.index('ACT')
        reserve_index = table.index('GRD')
        total_index = table.index('TOTAL')
        
        af = table[af_index+1:army_index]
        army = table[army_index+1:marines_index]
        marines = table[marines_index+1:navy_index]
        navy = table[navy_index+1:navy_end]
        active = table[active_index+1:reserve_index]
        reserve = table[reserve_index+1:total_index]
        total = table[total_index+1:]
        if total[-1] == '':
            total.pop()
            
        groups['Air Force'] = af
        groups['Army'] = army
        groups['Marines'] = marines
        groups['Navy'] = navy
        groups['Active'] = active
        groups['Reserve'] = reserve
        groups['Total'] = total
        
        for year in years:
            for branch in groups:
                DC_hospitalizations[diagnosis][year][branch] = groups[branch][years.index(year)]

with open("DC_hospitalizations.json", "w") as outfile:
    json.dump(DC_hospitalizations, outfile)

In [None]:
import json
import re

DC_BDPH = {}

with open("Group3_Direct_Care.txt") as file:
    report = file.read()    
    split_report = re.split("Table ", report)

    tables = []
    for table in split_report:
        table_lines = table.split('\n')
        # skip single-line chunks, such as the text before the first "Table "
        if len(table_lines) == 1:
            continue
        stripped_lines = [x.strip(' ') for x in table_lines]
        tables.append(stripped_lines)
        
    groups = {'Air Force' : None, 'Army': None, 'Marines': None, 'Navy': None, 
            'Active': None, 'Reserve': None, 'Total': None}

    
    # Create dictionary
    for table in tables:
        dx_stop = table[6].index('by')
        diagnosis = table[6][:dx_stop]
        service = table.index('Service')
        years = table[service+1:service+14]
        DC_BDPH[diagnosis] = {year : {group : None for group in groups} 
                                 for year in years}
        
        af_index = table.index('A')
        army_index = table.index('A', af_index+1)
        marines_index = table.index('Marines')
        navy_index = table.index('Navy')
        navy_end = table.index('Component')
        active_index = table.index('ACT')
        reserve_index = table.index('GRD')
        total_index = table.index('TOTAL')
        
        af = table[af_index+1:army_index]
        army = table[army_index+1:marines_index]
        marines = table[marines_index+1:navy_index]
        navy = table[navy_index+1:navy_end]
        active = table[active_index+1:reserve_index]
        reserve = table[reserve_index+1:total_index]
        total = table[total_index+1:]
        if total[-1] == '':
            total.pop()
            
        groups['Air Force'] = af
        groups['Army'] = army
        groups['Marines'] = marines
        groups['Navy'] = navy
        groups['Active'] = active
        groups['Reserve'] = reserve
        groups['Total'] = total
        
        for year in years:
            for branch in groups:
                DC_BDPH[diagnosis][year][branch] = groups[branch][years.index(year)]

with open("DC_BDPH.json", "w") as outfile:
    json.dump(DC_BDPH, outfile)

## Store data from Hospitalization Rate 508_20180921v2(1)

In [None]:
import json
import re

hospitalized_per_1000 = {}

with open("Group1_Combined_Hospitalization.txt") as file:
    report = file.read()    
    split_report = re.split("Table ", report)

    tables = []
    for table in split_report:
        table_lines = table.split('\n')
        # skip single-line chunks, such as the text before the first "Table "
        if len(table_lines) == 1:
            continue
        stripped_lines = [x.strip(' ') for x in table_lines]
        tables.append(stripped_lines)
        
    groups = {'Air Force' : None, 'Army': None, 'Marines': None, 'Navy': None, 
            'Active': None, 'Reserve': None, 'Total': None}

    # Create dictionary
    for table in tables:
        dx_stop = table[3].index('by')
        diagnosis = table[3][:dx_stop]
        service = table.index('Service')
        years = table[service+1:service+14]
        hospitalized_per_1000[diagnosis] = {year : {group : None for group in groups} 
                                 for year in years}
        
        af_index = table.index('A')
        army_index = table.index('A', af_index+1)
        marines_index = table.index('Marines')
        navy_index = table.index('Navy')
        navy_end = table.index('Component')
        active_index = table.index('ACT')
        reserve_index = table.index('GRD')
        total_index = table.index('TOTAL')
        
        af = table[af_index+1:army_index]
        army = table[army_index+1:marines_index]
        marines = table[marines_index+1:navy_index]
        navy = table[navy_index+1:navy_end]
        active = table[active_index+1:reserve_index]
        reserve = table[reserve_index+1:total_index]
        total = table[total_index+1:]
        if total[-1] == '':
            total.pop()
            
        groups['Air Force'] = af
        groups['Army'] = army
        groups['Marines'] = marines
        groups['Navy'] = navy
        groups['Active'] = active
        groups['Reserve'] = reserve
        groups['Total'] = total
        
        for year in years:
            for branch in groups:
                hospitalized_per_1000[diagnosis][year][branch] = groups[branch][years.index(year)]

# file name matches the one listed in the summary at the top of this notebook
with open("total_hospitalizations_per_1000.json", "w") as outfile:
    json.dump(hospitalized_per_1000, outfile)

In [None]:
import json
import re

total_pts_by_dx = {}

with open("Group2_Combined_Hospitalization.txt") as file:
    report = file.read()    
    split_report = re.split("Table ", report)

    tables = []
    for table in split_report:
        table_lines = table.split('\n')
        # skip single-line chunks, such as the text before the first "Table "
        if len(table_lines) == 1:
            continue
        stripped_lines = [x.strip(' ') for x in table_lines]
        tables.append(stripped_lines)
        
    groups = {'Air Force' : None, 'Army': None, 'Marines': None, 'Navy': None, 
            'Active': None, 'Reserve': None, 'Total': None}

    # Create dictionary
    for table in tables:
        dx_stop = table[2].index('by')
        diagnosis = table[2][:dx_stop]
        service = table.index('Service')
        years = table[service+1:service+14]
        total_pts_by_dx[diagnosis] = {year : {group : None for group in groups} 
                                 for year in years}
        
        af_index = table.index('A')
        army_index = table.index('A', af_index+1)
        marines_index = table.index('Marines')
        navy_index = table.index('Navy')
        navy_end = table.index('Component')
        active_index = table.index('ACT')
        reserve_index = table.index('GRD')
        total_index = table.index('TOTAL')

        af = table[af_index+1:army_index]
        army = table[army_index+1:marines_index]
        marines = table[marines_index+1:navy_index]
        navy = table[navy_index+1:navy_end]
        active = table[active_index+1:reserve_index]
        reserve = table[reserve_index+1:total_index]
        total = table[total_index+1:]
        if total[-1] == '':
            total.pop()
            
        groups['Air Force'] = af
        groups['Army'] = army
        groups['Marines'] = marines
        groups['Navy'] = navy
        groups['Active'] = active
        groups['Reserve'] = reserve
        groups['Total'] = total

        for year in years:
            for branch in groups:
                total_pts_by_dx[diagnosis][year][branch] = groups[branch][years.index(year)]
        
with open("total_pts_by_dx.json", "w") as outfile:
    json.dump(total_pts_by_dx, outfile)

In [None]:
import json
import re

total_hospitalizations_by_dx = {}

with open("Group3_Combined_Hospitalization.txt") as file:
    report = file.read()    
    split_report = re.split("Table ", report)

    tables = []
    for table in split_report:
        table_lines = table.split('\n')
        # skip single-line chunks, such as the text before the first "Table "
        if len(table_lines) == 1:
            continue
        stripped_lines = [x.strip(' ') for x in table_lines]
        tables.append(stripped_lines)
        
    groups = {'Air Force' : None, 'Army': None, 'Marines': None, 'Navy': None, 
            'Active': None, 'Reserve': None, 'Total': None}
   
    # Create dictionary
    for table in tables:
        dx_stop = table[2].index('by')
        diagnosis = table[2][:dx_stop]
        service = table.index('Service')
        years = table[service+1:service+14]
        total_hospitalizations_by_dx[diagnosis] = {year : {group : None for group in groups} 
                                 for year in years}
        
        af_index = table.index('A')
        army_index = table.index('A', af_index+1)
        marines_index = table.index('Marines')
        navy_index = table.index('Navy')
        navy_end = table.index('Component')
        active_index = table.index('ACT')
        reserve_index = table.index('GRD')
        total_index = table.index('TOTAL')

        af = table[af_index+1:army_index]
        army = table[army_index+1:marines_index]
        marines = table[marines_index+1:navy_index]
        navy = table[navy_index+1:navy_end]
        active = table[active_index+1:reserve_index]
        reserve = table[reserve_index+1:total_index]
        total = table[total_index+1:]
        if total[-1] == '':
            total.pop()
            
        groups['Air Force'] = af
        groups['Army'] = army
        groups['Marines'] = marines
        groups['Navy'] = navy
        groups['Active'] = active
        groups['Reserve'] = reserve
        groups['Total'] = total

        for year in years:
            for branch in groups:
                total_hospitalizations_by_dx[diagnosis][year][branch] = groups[branch][years.index(year)]
        
with open("total_hospitalizations_by_dx.json", "w") as outfile:
    json.dump(total_hospitalizations_by_dx, outfile)
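
The nine parsing cells above share the same logic and differ only in the input file, the output file, and the index of the title line (`table[2]` through `table[6]`). One way to express that shared logic is a single function. This is a sketch, not the code that produced the JSON files: `parse_tables`, `title_line`, `n_years`, and `SAMPLE` are names introduced here, and the sample text is made up to mimic the markers the cells look for (`Service`, `A`, `Marines`, `Navy`, `Component`, `ACT`, `GRD`, `TOTAL`).

```python
import re


def parse_tables(report, title_line, n_years=13):
    """Parse hand-edited report text into {diagnosis: {year: {group: value}}}."""
    result = {}
    for chunk in re.split("Table ", report):
        lines = [line.strip(" ") for line in chunk.split("\n")]
        if len(lines) == 1:
            continue  # text before the first "Table "

        # Diagnosis is whatever precedes "by" in the title line
        # (as in the cells above, this can leave a trailing space).
        title = lines[title_line]
        diagnosis = title[:title.index("by")]

        service = lines.index("Service")
        years = lines[service + 1:service + 1 + n_years]

        af = lines.index("A")
        army = lines.index("A", af + 1)
        marines = lines.index("Marines")
        navy = lines.index("Navy")
        navy_end = lines.index("Component")
        active = lines.index("ACT")
        reserve = lines.index("GRD")
        total = lines.index("TOTAL")

        total_values = lines[total + 1:]
        if total_values and total_values[-1] == "":
            total_values.pop()

        columns = {
            "Air Force": lines[af + 1:army],
            "Army": lines[army + 1:marines],
            "Marines": lines[marines + 1:navy],
            "Navy": lines[navy + 1:navy_end],
            "Active": lines[active + 1:reserve],
            "Reserve": lines[reserve + 1:total],
            "Total": total_values,
        }
        result[diagnosis] = {
            year: {group: values[i] for group, values in columns.items()}
            for i, year in enumerate(years)
        }
    return result


# Made-up two-year table in the layout the cells above expect.
SAMPLE = "Table " + "\n".join([
    "1", "header line", "Example Disorder by Service and Component",
    "Service", "2005", "2006",
    "A", "1", "2",
    "A", "3", "4",
    "Marines", "5", "6",
    "Navy", "7", "8",
    "Component", "2005", "2006",
    "ACT", "9", "10",
    "GRD", "11", "12",
    "TOTAL", "13", "14", "",
])

parsed = parse_tables(SAMPLE, title_line=2, n_years=2)
```

Using `enumerate` instead of `years.index(year)` keeps each year paired with its own position even if a year label were repeated. With the real files, each cell would reduce to reading the text, calling `parse_tables` with that file's title line index, and dumping the result with `json.dump`.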