# Reading More Complex Data
Sometimes data files have a more complex structure than a simple rectangular table or a rectangular table with folded lines.

A common example is a Census file with a hierarchical structure, such as one containing both household and person records. Each type of record has its own set of variables and its own layout, although some variables may be common to both. In the census example, each person record would have both a person identifier (key) and a household identifier indicating the household to which the person belongs. Each record might also have a record type variable.

There are several ways that these sorts of files can be turned into DataFrames. 


# Setup

In [1]:
import pandas as pd
import pprint
import warnings
from pathlib import Path

# capture the path to the data directory

dataDirectory = Path.home() / r'OneDrive - University of Kansas\LAS792_Fall2021_ForStudents\data\reading2'
dataDirectory

WindowsPath('C:/Users/zambrana/OneDrive - University of Kansas/LAS792_Fall2021_ForStudents/data/reading2')

### Specifying Record Formats
First, it is useful to have a method of describing the format of fixed column records in a data structure.

We'll do that with a nested dictionary structure. The outer dictionary has keys that are the variable names. For each variable there is a dictionary that has keys for
  * dataTypeFunction - a function that converts the string to the proper datatype
  * startPosition - the index of the starting position (1 is the first position)
  * endPosition - the index of the ending position
  * fieldWidth - the width of the field

endPosition is preferred. If it is missing then the program will look for fieldWidth. A value for startPosition must be supplied.
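
As a small illustration of how these keys map onto Python's 0-based, end-exclusive slicing, the sketch below converts one field description into a `slice`. It uses the key names that appear in the code cells (`startPosition`, `endPosition`, `fieldWidth`). The helper name `fieldToSlice` is hypothetical and is not part of the notebook's code.

```python
def fieldToSlice(fieldDict):
    """Convert a 1-based field description into a 0-based Python slice.

    fieldToSlice is a hypothetical helper used only for this illustration.
    """
    startIndex = fieldDict['startPosition'] - 1   # 1-based -> 0-based
    endPosition = fieldDict.get('endPosition')
    if endPosition is not None:
        # a 1-based inclusive end equals a 0-based exclusive end
        return slice(startIndex, endPosition)
    return slice(startIndex, startIndex + fieldDict['fieldWidth'])

record = 'HG12310'
byEnd = fieldToSlice({'startPosition': 6, 'endPosition': 7})
byWidth = fieldToSlice({'startPosition': 6, 'fieldWidth': 2})

# both descriptions select columns 6-7 of the record
print(record[byEnd], record[byWidth])
```

Either description of the field yields the same slice, which is why a structure only needs one of endPosition or fieldWidth.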


In [52]:
import pprint
# describe the structure of the Household records
# This description could (should) come from the metadata for the file
# note the mix of specifying endPosition and fieldWidth
householdRecordStructure = {'recordType': {'dataTypeFunction':str,
                                         'startPosition':1,
                                         'endPosition':1                                  
                                        },
                            'householdIdGroup': {'dataTypeFunction':str,
                                               'startPosition':2,
                                               'endPosition':2                                  
                                               }, 
                            'householdId': {'dataTypeFunction':str,
                                          'startPosition':2,
                                          'fieldWidth':4                                  
                                          },                            
                            'numberOfPersons': {'dataTypeFunction':int,
                                              'startPosition':6,
                                              'endPosition':7,
                                              'fieldWidth':2                                  
                                              },                            
                           }

print("\n Plain print of a dictionary\n")
print(householdRecordStructure)
print("\n PPRINT\n")
pprint.pprint(householdRecordStructure)



 Plain print of a dictionary

{'recordType': {'dataTypeFunction': <class 'str'>, 'startPosition': 1, 'endPosition': 1}, 'householdIdGroup': {'dataTypeFunction': <class 'str'>, 'startPosition': 2, 'endPosition': 2}, 'householdId': {'dataTypeFunction': <class 'str'>, 'startPosition': 2, 'fieldWidth': 4}, 'numberOfPersons': {'dataTypeFunction': <class 'int'>, 'startPosition': 6, 'endPosition': 7, 'fieldWidth': 2}}

 PPRINT

{'householdId': {'dataTypeFunction': <class 'str'>,
                 'fieldWidth': 4,
                 'startPosition': 2},
 'householdIdGroup': {'dataTypeFunction': <class 'str'>,
                      'endPosition': 2,
                      'startPosition': 2},
 'numberOfPersons': {'dataTypeFunction': <class 'int'>,
                     'endPosition': 7,
                     'fieldWidth': 2,
                     'startPosition': 6},
 'recordType': {'dataTypeFunction': <class 'str'>,
                'endPosition': 1,
                'startPosition': 1}}


The next cell tests applying the dataTypeFunction to a string to produce an object of the right datatype. It converts the string '123' into an integer.

Note that householdRecordStructure['numberOfPersons']['dataTypeFunction'] returns **int**, *which can be called like a function* (strictly speaking it is a class, and calling it constructs an integer),

so this becomes castedValue = int('123')

In [53]:
print('The function to be applied is ', 
      householdRecordStructure['numberOfPersons']['dataTypeFunction'])
castedValue = householdRecordStructure['numberOfPersons']['dataTypeFunction']('123')
print("castedValue is an ", type(castedValue), " of ", castedValue)

The function to be applied is  <class 'int'>
castedValue is an  <class 'int'>  of  123
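
Because the value stored under dataTypeFunction is simply something that can be called with a string, a user-written function works just as well as a built-in type. The sketch below assumes a hypothetical converter, `intOrNone`, that treats a blank field as missing; it is an illustration, not part of the notebook's code.

```python
def intOrNone(text):
    """Hypothetical converter: blank fields become None, others become int."""
    text = text.strip()
    return int(text) if text else None

# the converter is stored in the field description like any other value
fieldDict = {'dataTypeFunction': intOrNone,
             'startPosition': 6,
             'endPosition': 7}

fromDigits = fieldDict['dataTypeFunction']('07')
fromBlank = fieldDict['dataTypeFunction']('  ')
print(fromDigits, fromBlank)
```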


What happens if we try to get endPosition when it doesn't exist?

In [54]:
householdRecordStructure['householdId']['endPosition']

KeyError: 'endPosition'

Use the dictionary **get** method to retrieve the value without raising an error when there is no such key; it returns a default (None unless another is given) instead. See https://docs.python.org/3/library/stdtypes.html#typesmapping


In [55]:
end = householdRecordStructure['householdId'].get('endPosition', None)
print(end)

None
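
A few more ways of handling a possibly missing key are sketched below, using a stand-alone copy of the householdId description. Deriving the end position from the start and the width is shown only as an illustration of supplying a default.

```python
fieldDict = {'dataTypeFunction': str, 'startPosition': 2, 'fieldWidth': 4}

# with no second argument, get returns None for a missing key
missing = fieldDict.get('endPosition')

# a membership test avoids the lookup altogether
hasEnd = 'endPosition' in fieldDict

# the default can be any value, here the end implied by start and width
# (the default expression is evaluated even when the key is present)
endPosition = fieldDict.get(
    'endPosition',
    fieldDict['startPosition'] + fieldDict['fieldWidth'] - 1)

print(missing, hasEnd, endPosition)
```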


We can use this data structure to parse a string into its component parts. We'll put those into a dictionary.


In [56]:
rawRecord='HG12310'
dataStructure = {}
for fieldName, fieldDict in householdRecordStructure.items():
    # positions are 1-based, Python string indexes are 0-based
    startIndex = fieldDict['startPosition'] - 1
    endIndex = fieldDict.get('endPosition', None)
    width = fieldDict.get('fieldWidth', None)
    if endIndex is not None:
        # a 1-based inclusive end is the same number as a 0-based exclusive end
        dataString = rawRecord[startIndex:endIndex]
    else:
        dataString = rawRecord[startIndex:startIndex + width]
    dataStructure[fieldName] = fieldDict['dataTypeFunction'](dataString)
pprint.pprint(dataStructure)

{'householdId': 'G123',
 'householdIdGroup': 'G',
 'numberOfPersons': 10,
 'recordType': 'H'}


# Creating a Function
This code could clearly be reused many times. Making it a function allows it to be reused without copying it each time.

The function is defined with a *def* statement. Documentation for the function appears in a string literal immediately following the def line.


The parameters for the function appear inside the parentheses. When the function is invoked, the parameters take on the values given in the function call. If an argument is a reference to a mutable object, such as a list or dictionary, code inside the function can change the caller's object. This is called a *side effect* and is often a bad idea. The safest practice is to leave the arguments unchanged and pass information back only through a return statement.
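
To make the idea of a side effect concrete, the sketch below contrasts two hypothetical functions, `addFieldInPlace` and `addFieldCopy`. Neither is part of the notebook; they exist only to show the difference between changing an argument and returning a new object.

```python
def addFieldInPlace(structure):
    """Hypothetical example: changes the caller's dictionary (a side effect)."""
    structure['extra'] = {'dataTypeFunction': str,
                          'startPosition': 1,
                          'endPosition': 1}

def addFieldCopy(structure):
    """Hypothetical example: leaves the argument alone, returns a new dictionary."""
    newStructure = dict(structure)   # shallow copy of the outer dictionary
    newStructure['extra'] = {'dataTypeFunction': str,
                             'startPosition': 1,
                             'endPosition': 1}
    return newStructure

original = {'recordType': {'dataTypeFunction': str,
                           'startPosition': 1,
                           'endPosition': 1}}

copied = addFieldCopy(original)
unchangedAfterCopy = 'extra' not in original   # the caller's object is untouched

addFieldInPlace(original)
changedInPlace = 'extra' in original           # the caller's object was modified

print(unchangedAfterCopy, changedInPlace)
```

Note that `dict(structure)` copies only the outer dictionary; the inner field dictionaries are still shared, so a function that edits those would need a deeper copy.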

See the invocation of the function in the code segment following the function definition to see how the function is used.

This function does not do a good job of checking for valid arguments. What if both end and width are missing? What if width does not equal 1+end-start?  We could write another function that does this checking.
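
The two questions above can be answered by trying the same slicing logic on deliberately bad field descriptions. The sketch below is self-contained and repeats the relevant steps rather than calling the function; the field descriptions are made up for the illustration.

```python
rawRecord = 'HG12310'

# a field description with neither endPosition nor fieldWidth
fieldDict = {'dataTypeFunction': str, 'startPosition': 2}
startIndex = fieldDict['startPosition'] - 1
width = fieldDict.get('fieldWidth')            # None
try:
    dataString = rawRecord[startIndex:startIndex + width]
except TypeError as error:
    # adding None to an integer fails with an unhelpful message
    failure = str(error)
    print('TypeError:', failure)

# a description whose fieldWidth disagrees with its endPosition
fieldDict = {'dataTypeFunction': int, 'startPosition': 6,
             'endPosition': 7, 'fieldWidth': 5}
startIndex = fieldDict['startPosition'] - 1
# endPosition is used when present, so the inconsistent fieldWidth is ignored
value = fieldDict['dataTypeFunction'](rawRecord[startIndex:fieldDict['endPosition']])
print(value)
```

The first case fails with a message that says nothing about the record structure, and the second case does not fail at all, which is the argument for checking the structure separately.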


In [57]:
def parseFwfRecord(rawRecord, recordStructure):
    """
    Return a dictionary with keys as variable names and
    values taken from the designated positions in the
    raw string (the rawRecord).

    rawRecord is the string containing the raw data.
    recordStructure is a nested dictionary.
    The outer dictionary has keys that are the variable names.
    For each variable there is a dictionary that has keys for:
      * dataTypeFunction - a function that converts the string to the
      proper datatype
      * startPosition - the index of the starting position
      (1 is the first position)
      * endPosition - the index of the ending position
      * fieldWidth - the width of the field

    endPosition is preferred. If it is missing then the program will
    look for fieldWidth.
    A value for startPosition must be supplied.
    """
    dataStructure = {}
    for fieldName, fieldDict in recordStructure.items():
        # positions are 1-based, Python string indexes are 0-based
        startIndex = fieldDict['startPosition'] - 1
        endIndex = fieldDict.get('endPosition', None)
        width = fieldDict.get('fieldWidth', None)
        if endIndex is not None:
            dataString = rawRecord[startIndex:endIndex]
        else:
            dataString = rawRecord[startIndex:startIndex + width]
        dataStructure[fieldName] = fieldDict['dataTypeFunction'](dataString)
    return dataStructure

The function is *invoked* by supplying an object (an argument) for each of the function's parameters. The value returned from the function is a dictionary of the parsed values from the raw data.

In [58]:
anotherRecord='HM12302'
parseFwfRecord(rawRecord=anotherRecord, 
               recordStructure=householdRecordStructure)

{'recordType': 'H',
 'householdIdGroup': 'M',
 'householdId': 'M123',
 'numberOfPersons': 2}

In [59]:
# print the function documentation
print(parseFwfRecord.__doc__)


    Return a dictionary with keys as variable names and
    values taken from the designated positions in the
    raw string (the rawRecord).

    rawRecord is the string containing the raw data.
    recordStructure is a nested dictionary.
    The outer dictionary has keys that are the variable names.
    For each variable there is a dictionary that has keys for:
      * dataTypeFunction - a function that converts the string to the
      proper datatype
      * startPosition - the index of the starting position
      (1 is the first position)
      * endPosition - the index of the ending position
      * fieldWidth - the width of the field

    endPosition is preferred. If it is missing then the program will
    look for fieldWidth.
    A value for startPosition must be supplied.
    


### Hierarchical Data
Now suppose there is a data file that looks like this:

```
 HierarchicalData.txt raw records follow this line
HM45605
PM4560001M45
PM4560002F43
PM4560003F09
PM4560004M07
PM4560005M04
HM12302
PM1230007M65
PM1230008F66
HG12903
PG1290007M51
PG1290009F51
PG1290010F10
```

This file has household records like the ones above and person records that have the following structure:


In [60]:
personRecordStructure = {'recordType': {'dataTypeFunction':str,
                                         'startPosition':1,
                                         'endPosition':1                                  
                                        },
                            'householdIdGroup': {'dataTypeFunction':str,
                                               'startPosition':2,
                                               'endPosition':2                                  
                                               }, 
                            'householdId': {'dataTypeFunction':str,
                                          'startPosition':2,
                                          'fieldWidth':4                                  
                                          },                            
                            'personID': {'dataTypeFunction':str,
                                              'startPosition':6,
                                              'endPosition':9,
                                              'fieldWidth':4                                  
                                              },                            
                            'gender': {'dataTypeFunction':str,
                                              'startPosition':10,
                                              'endPosition':10,
                                              'fieldWidth':1                                  
                                              },
                            'age': {'dataTypeFunction':int,
                                              'startPosition':11,
                                              'endPosition':12,
                                              'fieldWidth':2                                  
                                              },                       
                        }
pprint.pprint(personRecordStructure)

{'age': {'dataTypeFunction': <class 'int'>,
         'endPosition': 12,
         'fieldWidth': 2,
         'startPosition': 11},
 'gender': {'dataTypeFunction': <class 'str'>,
            'endPosition': 10,
            'fieldWidth': 1,
            'startPosition': 10},
 'householdId': {'dataTypeFunction': <class 'str'>,
                 'fieldWidth': 4,
                 'startPosition': 2},
 'householdIdGroup': {'dataTypeFunction': <class 'str'>,
                      'endPosition': 2,
                      'startPosition': 2},
 'personID': {'dataTypeFunction': <class 'str'>,
              'endPosition': 9,
              'fieldWidth': 4,
              'startPosition': 6},
 'recordType': {'dataTypeFunction': <class 'str'>,
                'endPosition': 1,
                'startPosition': 1}}


To read the hierarchical file, read each line, look at its record type, and then parse the whole line using the matching structure. Append the parsed record to the appropriate DataFrame.

In [61]:
import pandas as pd
personDf = pd.DataFrame()
householdDf = pd.DataFrame()

# read the hierarchical file; the with statement closes it when the loop ends
# NOTE: DataFrame.append was removed in pandas 2.0, so this cell needs an earlier version
with open(dataDirectory / "HierarchicalData.txt") as hierarchicalFile:
    for line in hierarchicalFile:
        if line[0:1] == "H":
            # this is a household record
            houseDict = parseFwfRecord(rawRecord=line,
                                       recordStructure=householdRecordStructure)
            householdDf = householdDf.append(houseDict, ignore_index=True)
        else:
            # this is a person record
            personDict = parseFwfRecord(rawRecord=line,
                                        recordStructure=personRecordStructure)
            personDf = personDf.append(personDict, ignore_index=True)

print("HOUSEHOLDS\n")
pprint.pprint(householdDf)

print("\nPERSONS\n")
pprint.pprint(personDf)

HOUSEHOLDS

  householdId householdIdGroup  numberOfPersons recordType
0        M456                M              5.0          H
1        M123                M              2.0          H
2        G129                G              3.0          H

PERSONS

    age gender householdId householdIdGroup personID recordType
0  45.0      M        M456                M     0001          P
1  43.0      F        M456                M     0002          P
2   9.0      F        M456                M     0003          P
3   7.0      M        M456                M     0004          P
4   4.0      M        M456                M     0005          P
5  65.0      M        M123                M     0007          P
6  66.0      F        M123                M     0008          P
7  51.0      M        G129                G     0007          P
8  51.0      F        G129                G     0009          P
9  10.0      F        G129                G     0010          P
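
An alternative to growing a DataFrame one row at a time is to collect the parsed dictionaries in plain lists, one list per record type, and build each DataFrame once after the loop. The sketch below shows only the collecting step so that it runs without pandas or the data file. `parseRecord` is a compact, hypothetical stand-in for parseFwfRecord, the structures are abbreviated, and `io.StringIO` stands in for the open file.

```python
import io

def parseRecord(rawRecord, recordStructure):
    """Compact stand-in for parseFwfRecord so this sketch runs on its own."""
    parsed = {}
    for fieldName, fieldDict in recordStructure.items():
        startIndex = fieldDict['startPosition'] - 1
        dataString = rawRecord[startIndex:fieldDict['endPosition']]
        parsed[fieldName] = fieldDict['dataTypeFunction'](dataString)
    return parsed

# abbreviated structures: only a few of the fields described above
householdStructure = {
    'householdId': {'dataTypeFunction': str, 'startPosition': 2, 'endPosition': 5},
    'numberOfPersons': {'dataTypeFunction': int, 'startPosition': 6, 'endPosition': 7},
}
personStructure = {
    'householdId': {'dataTypeFunction': str, 'startPosition': 2, 'endPosition': 5},
    'personID': {'dataTypeFunction': str, 'startPosition': 6, 'endPosition': 9},
    'age': {'dataTypeFunction': int, 'startPosition': 11, 'endPosition': 12},
}

# map each record type code to its structure and to the list collecting its rows
structures = {'H': householdStructure, 'P': personStructure}
rows = {'H': [], 'P': []}

# three lines copied from the sample data shown earlier
sampleFile = io.StringIO("HM12302\nPM1230007M65\nPM1230008F66\n")
for line in sampleFile:
    recordType = line[0:1]
    rows[recordType].append(parseRecord(line, structures[recordType]))

print(rows['H'])
print(rows['P'])
```

Each list can then be passed to the DataFrame constructor once, for example `pd.DataFrame(rows['P'])`. This avoids calling `DataFrame.append` inside the loop, and the dictionary of structures replaces the if/else on the record type.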


### Using these DataFrames together
Later we will see how to combine information from the two DataFrames.

# Robust, Reusable Functions
Even for one use writing a function like parseFwfRecord is useful. A function like this, though, is likely to be useful in many projects. If a function is to be reusable, it should not only return proper results, but also be able to handle bad input gracefully.
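
One reason bad input deserves explicit checks is that Python slicing rarely complains. The sketch below, using a person record copied from the sample data, shows that reversed or out-of-range positions produce an empty string rather than an error, so the problem surfaces only when a converter rejects the empty string.

```python
record = 'PM4560001M45'

# startPosition 2 with endPosition 1 gives an empty slice, not an error
reversedField = record[1:1]

# a field far beyond the end of the record is also silently empty
beyondField = record[999:1000]

# str accepts the empty string, so a bad text field goes unnoticed
asText = str(beyondField)

# int does not, so the problem surfaces only for numeric fields
try:
    int(beyondField)
    intFailed = False
except ValueError as error:
    intFailed = True
    print('ValueError:', error)

print(repr(reversedField), repr(beyondField), repr(asText))
```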

In the record structure below each field has a different kind of error. Writing functions to check the record structure is a good first step.

Note that one of the errors is caught when we try to run the next cell. The function "foo" is not defined.

In [62]:
# this record structure has a different error in each field definition

badPersonRecordStructure = {'recordType': {'dataTypeFunction':str,
                                         'startPosition':2                              
                                        },
                            'householdIdGroup': {'dataTypeFunction':str,
                                               'startPosition':2,
                                               'endPosition':1                                  
                                               }, 
                            'householdId': {'dataTypeFunction':str,
                                          'startPosition':2,
                                          'fieldWidth':0                                  
                                          },                            
                            'personID': {'dataTypeFunction':str,
                                              'startPosition':6,
                                              'endPosition':9,
                                              'fieldWidth':3                                  
                                              },                            
                            'gender': {'dataTypeFunction':str,
                                              'startPosition':100,
                                              'endPosition':100,
                                              'fieldWidth':1                                  
                                              },
                            'age': {'dataTypeFunction':foo,
                                              'startPosition':11,
                                              'endPosition':12,
                                              'fieldWidth':2                                  
                                              },                       
                        }
pprint.pprint(badPersonRecordStructure)

NameError: name 'foo' is not defined

In [63]:
# this record structure has at least one error in each field definition
import pprint
badPersonRecordStructure2 = {'recordType': {'dataTypeFunction':str,
                                         'startPosition':2                              
                                        },
                            'householdIdGroup': {'dataTypeFunction':str,
                                               'startPosition':2,
                                               'endPosition':1                                  
                                               }, 
                            'householdId': {'dataTypeFunction':str,
                                          'startPosition':-2,
                                          'fieldWidth':0                                  
                                          },                            
                            'personID': {'dataTypeFunction':str,
                                              'startPosition':6,
                                              'endPosition':9,
                                              'fieldWidth':3                                  
                                              },                            
                            'gender': {'dataTypeFunction':1,
                                              'startPosition':1000,
                                              'endPosition':1000,
                                              'fieldWidth':1                                  
                                              },
                            'age': {'startPosition':11,
                                              'endPosition':12,
                                              'fieldWidth':2                                  
                                              },                       
                        }
pprint.pprint(badPersonRecordStructure2)

{'age': {'endPosition': 12, 'fieldWidth': 2, 'startPosition': 11},
 'gender': {'dataTypeFunction': 1,
            'endPosition': 1000,
            'fieldWidth': 1,
            'startPosition': 1000},
 'householdId': {'dataTypeFunction': <class 'str'>,
                 'fieldWidth': 0,
                 'startPosition': -2},
 'householdIdGroup': {'dataTypeFunction': <class 'str'>,
                      'endPosition': 1,
                      'startPosition': 2},
 'personID': {'dataTypeFunction': <class 'str'>,
              'endPosition': 9,
              'fieldWidth': 3,
              'startPosition': 6},
 'recordType': {'dataTypeFunction': <class 'str'>, 'startPosition': 2}}


In [64]:
import warnings
def validRecordStructure(recordStructure, verbose=True):
    '''
    Return True if no errors are found in recordStructure, False otherwise.
    When verbose is True (the default) a NOTE is printed for each error found.
    A missing or non-callable dataTypeFunction also raises a warning.
    recordStructure is the structure to check.
    '''
    def note(fieldName, *parts):
        # print a message only when the caller asked for verbose output
        if verbose:
            print("NOTE: ", fieldName, *parts)

    # assume no errors
    hasError = False

    # examine each field
    for fieldName, structureDict in recordStructure.items():
        # get returns None for any missing key
        start = structureDict.get('startPosition')
        end = structureDict.get('endPosition')
        width = structureDict.get('fieldWidth')
        dataTypeFunction = structureDict.get('dataTypeFunction')

        if dataTypeFunction is None:
            warnings.warn(fieldName + " is missing a dataTypeFunction")
            hasError = True
        elif not callable(dataTypeFunction):
            # callable() accepts functions, types such as int and str,
            # and any other object that can be called
            fnType = str(type(dataTypeFunction))
            note(fieldName, " has a dataTypeFunction that is not a function or type: ", fnType)
            warnings.warn(fieldName + " has a dataTypeFunction that is not a function or type: " + fnType)
            hasError = True

        if start is None:
            note(fieldName, " is missing a valid startPosition")
            hasError = True
        elif start <= 0:
            note(fieldName, " startPosition must be greater than 0")
            hasError = True

        # must have at least an end or a width, and it must be greater than 0
        if end is None and width is None:
            note(fieldName, " is missing both an endPosition and a fieldWidth.\n         At least one must be specified.")
            hasError = True
        elif end is not None and end <= 0:
            note(fieldName, " endPosition  must be greater than 0.")
            hasError = True
        elif width is not None and width <= 0:
            note(fieldName, " fieldWidth  must be greater than 0.")
            hasError = True

        # start must not be greater than end
        if start is not None and end is not None and start > end:
            note(fieldName, " start must not be greater than end.")
            hasError = True

        # start, end, and width must be consistent
        if (start is not None and end is not None and width is not None
                and width != 1 + end - start):
            note(fieldName, " start end and width are not consistent ",
                 "\n        ",
                 width,
                 "  != 1 + ", end, " - ", start)
            hasError = True

    return not hasError

In [65]:
pprint.pprint(badPersonRecordStructure2)
print("\n\n\n")
print("\n\n Valid structure: ", validRecordStructure(badPersonRecordStructure2))

{'age': {'endPosition': 12, 'fieldWidth': 2, 'startPosition': 11},
 'gender': {'dataTypeFunction': 1,
            'endPosition': 1000,
            'fieldWidth': 1,
            'startPosition': 1000},
 'householdId': {'dataTypeFunction': <class 'str'>,
                 'fieldWidth': 0,
                 'startPosition': -2},
 'householdIdGroup': {'dataTypeFunction': <class 'str'>,
                      'endPosition': 1,
                      'startPosition': 2},
 'personID': {'dataTypeFunction': <class 'str'>,
              'endPosition': 9,
              'fieldWidth': 3,
              'startPosition': 6},
 'recordType': {'dataTypeFunction': <class 'str'>, 'startPosition': 2}}




NOTE:  recordType  is missing both an endPosition and a fieldWidth.
         At least one must be specified.
NOTE:  householdIdGroup  start must not be greater than end.
NOTE:  householdId  startPosition must be greater than 0
NOTE:  householdId  fieldWidth  must be greater than 0.
NOTE:  personID  start end an
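
One kind of error that a check of the structure alone cannot find is a field that lies beyond the end of the records, such as the gender field at position 1000, because that depends on the record length. A minimal sketch of such a check follows. The helper `fieldsBeyondRecord` and its `recordLength` parameter are hypothetical, and the helper assumes every field has a startPosition and either an endPosition or a fieldWidth.

```python
def fieldsBeyondRecord(recordStructure, recordLength):
    """Return the names of fields whose last column lies past recordLength.

    Hypothetical helper for illustration; it assumes each field has a
    startPosition and either an endPosition or a fieldWidth.
    """
    outOfRange = []
    for fieldName, fieldDict in recordStructure.items():
        start = fieldDict['startPosition']
        end = fieldDict.get('endPosition')
        if end is None:
            # last column implied by the start and the width
            end = start + fieldDict['fieldWidth'] - 1
        if end > recordLength:
            outOfRange.append(fieldName)
    return outOfRange

# abbreviated structure with one field far beyond a 12-column record
structure = {
    'recordType': {'dataTypeFunction': str, 'startPosition': 1, 'endPosition': 1},
    'gender': {'dataTypeFunction': str, 'startPosition': 1000, 'endPosition': 1000},
    'age': {'dataTypeFunction': int, 'startPosition': 11, 'fieldWidth': 2},
}

tooFar = fieldsBeyondRecord(structure, recordLength=len('PM4560001M45'))
print(tooFar)
```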



In [66]:
print("that's all folks")

that's all folks
