# InfoGroup data

> Process and prepare InfoGroup dataset.

## Processing

Processing starts from the original CSV files:

- Convert to Unicode (UTF-8).
- Validate against JSON schema. A few erroneous entries (e.g. text in a numerical column) are erased at this step. The existing implementation uses the datapackage validator and takes several days on a single core.
- Save to disk in parquet format.
- Provide an interface to load a single year of data, with filtering, column selection and a small (optionally random) sample.

In [None]:
#default_exp infogroup

In [None]:
#export
import logging
import json
import gzip
import shutil

import numpy as np
import pandas as pd
import fastparquet
from IPython import display

from rurec import resources
from rurec.resources import Resource
from rurec import util

In [None]:
logging.basicConfig(filename=resources.get('infogroup/schema').path.parent/'processing.log', 
                    filemode='w', level=logging.INFO, format='%(levelname)s:%(message)s')

In [None]:
data_years = range(1997, 2018)
resources.add(Resource('infogroup/schema', '/InfoGroup/data/processed/schema.json', 'Processed InfoGroup data, schema', False))
for y in data_years:
    resources.add(Resource(f'infogroup/csv/{y}', f'/InfoGroup/data/processed/{y}.csv', f'Processed InfoGroup data, {y}, CSV format', False))
    resources.add(Resource(f'infogroup/pq/{y}', f'/InfoGroup/data/processed/{y}.pq', f'Processed InfoGroup data, {y}, parquet format', False))
    resources.add(Resource(f'infogroup/orig/{y}', f'/InfoGroup/data/original/raw/{y}_Business_Academic_QCQ.csv', f'Original unprocessed InfoGroup data, {y}', False))

## Clear and validate raw data

- Change "latin-1" encoding to "utf-8".
- Remove double quotes around every value.
- Rename columns to ALL_CAPS.
- In 2009: pad string fields with zeroes.
- Validate values format, replace errors with missing values.
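
The first three steps above can be sketched with the standard library alone. This is an illustration, not the pipeline's actual code: the function name `clean_raw_csv` is hypothetical, and upper-casing the header is a stand-in for the schema-driven rename used later in this notebook.

```python
import csv
import io


def clean_raw_csv(raw_bytes):
    """Decode latin-1 CSV bytes, drop unnecessary quoting, upper-case the header.

    Returns the cleaned CSV as UTF-8 bytes. (Hypothetical helper.)
    """
    text = raw_bytes.decode('latin-1')
    # newline='' leaves line-ending handling to the csv module
    reader = csv.reader(io.StringIO(text, newline=''))
    out = io.StringIO(newline='')
    # QUOTE_MINIMAL quotes only values containing a delimiter, quote or newline
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator='\n')
    header = next(reader)
    writer.writerow([name.strip().upper() for name in header])
    writer.writerows(reader)
    return out.getvalue().encode('utf-8')


# Made-up two-line input with every value wrapped in double quotes.
raw = '"Company","City"\n"Café Olé","Madison"\n'.encode('latin-1')
cleaned = clean_raw_csv(raw)
print(cleaned.decode('utf-8'))
```

Reading through `csv.reader` rather than stripping quote characters by hand keeps values that contain commas intact.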

### Future work

- Correct 2-digit state part of the FIPS code.
- Correct missing CBSA code and CBSA level, mainly in 2009.
- Add indicator variables for different samples (random, WI, FAI, ...) to be used as parquet partitions, allowing quick reads of data subsets.
- Put meaningful labels on categoricals ("1-5" instead of "A", etc.).
  - This will be tricky for POPULATION_CODE, whose coding changes between 2015 and 2016.
- Make up and add enum constraints for TITLE_CODE and CALL_STATUS_CODE.
- Add logging of errors to a file.
- If categoricals are worthwhile for fields with a large number of unique values, possibly unknown a priori, such as city or NAICS, then they should be applied. Care should be taken because the set of unique values can vary between years, which might create problems when merging.
- Validations:
  - Codes are valid (i.e. can be found in lookup tables) for fields such as SIC, NAICS, FIPS, CBSA_CODE etc.
  - Geo variable consistency: CBSA_LEVEL vs CBSA_CODE, lon-lat, nesting of areas.
- A few variables have many values like "00000"; those should possibly be replaced with np.nan.
  - subsidiary_number, parent_number, site_number, census_tract, csa_code, maybe others
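
As a rough sketch of the last item, all-zero placeholder codes could be blanked out as below. The helper `blank_zero_codes` is hypothetical, and treating every all-zero string as missing is an assumption that would need checking against the data before use.

```python
import re

# Columns named in the list above; the all-zero rule itself is an assumption.
ZERO_PLACEHOLDER_COLS = ['SUBSIDIARY_NUMBER', 'PARENT_NUMBER', 'SITE_NUMBER',
                         'CENSUS_TRACT', 'CSA_CODE']
ALL_ZEROES = re.compile(r'0+')


def blank_zero_codes(row, cols=ZERO_PLACEHOLDER_COLS):
    """Return a copy of `row` where all-zero codes in `cols` become None."""
    cleaned = dict(row)
    for col in cols:
        value = cleaned.get(col)
        if isinstance(value, str) and ALL_ZEROES.fullmatch(value):
            cleaned[col] = None
    return cleaned


# Made-up record: ABI is not in the column list, so its leading zeroes are kept.
row = {'ABI': '000123456', 'PARENT_NUMBER': '000000000', 'CSA_CODE': '357'}
print(blank_zero_codes(row))
```

On a DataFrame, a regex replacement such as `df[col].replace(r'^0+$', np.nan, regex=True)` would presumably play the same role.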

## Geographic information

- ADDRESS: historical address
- CITY: historical address city
- STATE: historical address state
- ZIP: historical address zip code
- ZIP4: historical address zip code zip + 4
- COUNTY_CODE: county code based upon location address/zip4 (postal)
- AREA_CODE: area code of business
- ADDRESS_TYPE: indicates the type of address. "F": "Firm", "G": "General delivery", "H": "High-rise", "M": "Military", "P": "Post office box", "R": "Rural route or hwy contract", "S": "Street", "N": "Unknown", "": "No match to Zip4".
- CENSUS_TRACT: identifies a small geographic area for the purpose of collecting and compiling population and housing data. Census tracts are unique only within a census county, and census counties are unique only within a census state.
- CENSUS_BLOCK: block groups (BGs) are subdivisions of census tracts and are unique only within a specific census tract. Census tracts/block groups are assigned to address records via a geocoding process.
- LATITUDE: parcel level, assigned via point geocoding. Half of a pair of coordinates (the other being longitude). Provided as a formatted value, with decimals or a negative sign. Not available in Puerto Rico and the Virgin Islands.
- LONGITUDE: parcel level, assigned via point geocoding. Note: longitudes are negative values in the western hemisphere. Provided as a formatted value, with decimals or a negative sign. Not available in Puerto Rico and the Virgin Islands.
- MATCH_CODE: parcel level match code of the business location. "0": "Site level", "2": "Zip+2 centroid", "4": "Zip+4 centroid", "P": "Parcel", "X": "Zip centroid".
- CBSA_CODE: core based statistical area (expanded MSA code)
- CBSA_LEVEL: indicates if an area is a micropolitan or metropolitan area. "1": "Micropolitan", "2": "Metropolitan"
- CSA_CODE: adjoining CBSAs; a combination of metro and micro areas
- FIPS_CODE: first 2 bytes = state code, last 3 bytes = county code (location)
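
The FIPS_CODE layout described above can be illustrated with a small sketch. Both function names are hypothetical, and whether COUNTY_CODE (postal) is expected to agree with the county part of FIPS_CODE in every record is an assumption, not something the field descriptions confirm.

```python
def split_fips(fips_code):
    """Split a 5-digit FIPS code into (state, county) string parts."""
    if len(fips_code) != 5 or not fips_code.isdigit():
        raise ValueError(f'expected 5 digits, got {fips_code!r}')
    # first 2 characters = state, last 3 characters = county
    return fips_code[:2], fips_code[2:]


def county_matches_fips(row):
    """True when COUNTY_CODE equals the county part of FIPS_CODE (assumed check)."""
    _, county = split_fips(row['FIPS_CODE'])
    return row['COUNTY_CODE'] == county


state, county = split_fips('55025')
print(state, county)
print(county_matches_fips({'FIPS_CODE': '55025', 'COUNTY_CODE': '025'}))
```

Keeping both parts as strings preserves leading zeroes, which matters for states such as "01" and for the zero-padded columns handled later in this notebook.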

In [None]:
#export

def convert_schema(datapackage_schema_path):
    """Convert old datapackage schema.json into a new file, to be used for data validation."""
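    # use a context manager so the file handle is closed after reading
    with open(datapackage_schema_path) as schema_file:
        sch0 = json.load(schema_file)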

    def get_field_years(field_name):
        years = []
        for fl in sch0['field_lists']:
            if field_name in fl['fields']:
                years += fl['years']
        if 2015 in years:
            years += [2016, 2017]
        if field_name == 'gender':
            years += [2017]
        return sorted(years)


    sch = dict()
    sch['info'] = 'Schema for cleaned InfoGroup data.'
    sch['fields'] = list()

    for f0 in sch0['fields']:
        f = dict()
        name = f0['name']
        f['name'] = name.upper()
        f['years'] = get_field_years(name)
        if 'enum' in f0['constraints']:
            enum = f0['constraints']['enum'].copy()
            if '' in enum:
                enum.pop(enum.index(''))
            f['enum'] = enum
        if 'values' in f0:
            values = f0['values'].copy()
            if name == 'cbsa_level':
                del values['0'] # code never used
            f['enum_labels'] = values

        field_widths = {
            'ZIP': 5,
            'ZIP4': 4,
            'COUNTY_CODE': 3,
            'AREA_CODE': 3,
            'PHONE': 7,
            'SIC': 6,
            'SIC0': 6,
            'SIC1': 6,
            'SIC2': 6,
            'SIC3': 6,
            'SIC4': 6,
            'NAICS': 8,
            'YP_CODE': 5,
            'ABI': 9,
            'SUBSIDIARY_NUMBER': 9,
            'PARENT_NUMBER': 9,
            'SITE_NUMBER': 9,
            'CENSUS_TRACT': 6,
            'CENSUS_BLOCK': 1,
            'CBSA_CODE': 5,
            'CSA_CODE': 3,
            'FIPS_CODE': 5
        }
        if f['name'] in field_widths:
            f['width'] = field_widths[f['name']]

        f['original_name'] = f0['originalName']
        f['original_description'] = f0['originalDescription']
        sch['fields'].append(f)

    # use a context manager so the output is flushed and closed
    with open(resources.get('infogroup/schema').path, 'w') as schema_file:
        json.dump(sch, schema_file, indent=1)
    

def pad_with_zeroes(df, schema_fields):
    """Prepend string column values with zeroes to have constant width."""

    for field in schema_fields:
        if 'width' in field:
            df[field['name']] = df[field['name']].str.zfill(field['width'])

def validate_raw_strings(df, schema_fields):
    """Validate values in raw InfoGroup data according to string constraints.
    Return list of dicts of invalid values.
    """
    
    constraints = {
        'ZIP': {'number': True},
        'ZIP4': {'number': True},
        'COUNTY_CODE': {'number': True},
        'AREA_CODE': {'number': True},
        'PHONE': {'number': True},
        'SIC': {'number': True},
        'SIC0': {'number': True},
        'SIC1': {'number': True},
        'SIC2': {'number': True},
        'SIC3': {'number': True},
        'SIC4': {'number': True},
        'NAICS': {'number': True},
        'YEAR': {'notna': True, 'number': True},
        'YP_CODE': {'number': True},
        'EMPLOYEES': {'number': True},
        'SALES': {'number': True},
        'PARENT_EMPLOYEES': {'number': True},
        'PARENT_SALES': {'number': True},
        'YEAR_EST': {'number': True},
        'ABI': {'unique': True, 'notna': True, 'number': True},
        'SUBSIDIARY_NUMBER': {'number': True},
        'PARENT_NUMBER': {'number': True},
        'SITE_NUMBER': {'number': True},
        'CENSUS_TRACT': {'number': True},
        'CENSUS_BLOCK': {'number': True},
        'LATITUDE': {'number': True},
        'LONGITUDE': {'number': True},
        'CBSA_CODE': {'number': True},
        'CSA_CODE': {'number': True},
        'FIPS_CODE': {'number': True}
    }
    
    # the above hard coded list of constraints must be consistent with field availability in given year
    constraints = {k: v for k, v in constraints.items() if k in df}
    
    
    for field in schema_fields:
        name = field['name']
        if 'enum' in field:
            constraints.setdefault(name, dict())['cats'] = field['enum']
        if 'width' in field:
            # setdefault avoids a KeyError when a width field has no hard-coded constraints
            constraints.setdefault(name, dict())['nchar'] = field['width']
    
    return util.validate_values(df, constraints)


def convert_dtypes(df, schema_fields):
    """Convert string columns to appropriate types in place."""
    
    for col in ['YEAR', 'EMPLOYEES', 'SALES', 'PARENT_EMPLOYEES', 'PARENT_SALES', 'YEAR_EST', 'LATITUDE', 'LONGITUDE']:
        df[col] = pd.to_numeric(df[col])
        
    for field in schema_fields:
        if 'enum' in field:
            df[field['name']] = pd.Categorical(df[field['name']], categories=field['enum'])
    

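def validate_raw_numbers(df):
    """Validate values in raw InfoGroup data according to numerical constraints.
    Return list of dicts of invalid values.

    NOTE: relies on a global `year` variable set by the processing loop;
    `year` is not defined inside this function.
    """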
    
    constraints = {
        'YEAR': {'eq': year},
        'EMPLOYEES': {'ge': 0},
        'SALES': {'ge': 0},
        'PARENT_EMPLOYEES': {'ge': 0},
        'PARENT_SALES': {'ge': 0},
        'YEAR_EST': {'ge': 1000, 'le': year},
        'LATITUDE': {'ge': 0, 'le': 90},
        'LONGITUDE': {'ge': -180, 'le': 0}
    }
    return util.validate_values(df, constraints)

def replace_invalid(df, invalid_list, replacement=np.nan):
    """Replace invalid values."""
    for inv in invalid_list:
        df.loc[inv['idx'], inv['col']] = replacement
        logging.info(f'Replace invalid value `{inv["val"]}` with `{replacement}` at .loc[{inv["idx"]}, \'{inv["col"]}\']')
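
The validate-then-replace flow above depends on `util.validate_values`, which is not shown in this notebook. The following plain-Python sketch only illustrates the idea on a list of dict rows. The function `find_invalid_strings` and its reading of the `number`, `nchar`, `cats` and `notna` keys are assumptions; only the `idx`/`col`/`val` record shape is taken from `replace_invalid`.

```python
def find_invalid_strings(rows, constraints):
    """Return a list of {'idx', 'col', 'val'} dicts for values that break a constraint.

    Hypothetical stand-in for util.validate_values, covering only
    'number', 'nchar', 'cats' and 'notna'.
    """
    invalid = []
    for idx, row in enumerate(rows):
        for col, rules in constraints.items():
            val = row.get(col)
            if val is None:
                # missing values are only an error when 'notna' is requested
                if rules.get('notna'):
                    invalid.append({'idx': idx, 'col': col, 'val': val})
                continue
            bad = False
            if rules.get('number'):
                try:
                    float(val)  # loose check: anything float() accepts counts as a number
                except ValueError:
                    bad = True
            if 'nchar' in rules and len(val) != rules['nchar']:
                bad = True
            if 'cats' in rules and val not in rules['cats']:
                bad = True
            if bad:
                invalid.append({'idx': idx, 'col': col, 'val': val})
    return invalid


# Made-up rows for illustration.
rows = [
    {'ZIP': '53706', 'CBSA_LEVEL': '2'},
    {'ZIP': '5370A', 'CBSA_LEVEL': '2'},  # text in a numeric column
    {'ZIP': '537', 'CBSA_LEVEL': '9'},    # wrong width, unknown category
]
constraints = {
    'ZIP': {'number': True, 'nchar': 5},
    'CBSA_LEVEL': {'cats': ['1', '2']},
}
invalid = find_invalid_strings(rows, constraints)

# Mirror replace_invalid: overwrite each flagged cell with a missing value.
for inv in invalid:
    rows[inv['idx']][inv['col']] = None
```

Returning records rather than mutating during the scan lets the caller count the errors first, which is what the processing loop does before deciding whether to abort a year.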

In [None]:
# for year in data_years:
for year in range(1997, 2018):
    logging.info(f'\nProcessing started for year {year}\n' + '-'*80)
    with resources.get('infogroup/schema').path.open() as schema_file:
        sch = json.load(schema_file)
    sch = [x for x in sch['fields'] if year in x['years']]
    
    # POPULATION_CODE has different values in 2016 and 2017
    if year >= 2016:
        for f in sch:
            if f['name'] == 'POPULATION_CODE':
                f['enum'] = list('0123456789')
                break

    df = pd.read_csv(resources.get(f'infogroup/orig/{year}').path, dtype='str', encoding='latin-1')

    df.rename(columns={x['original_name']: x['name'] for x in sch}, inplace=True)

    if year == 2009:
        pad_with_zeroes(df, sch)

    invalid_str = validate_raw_strings(df, sch)
    if len(invalid_str) < 100:
        replace_invalid(df, invalid_str)
    else:
        logging.error(f'Very many invalid_str values: {len(invalid_str)}, processing aborted')
        logging.error(invalid_str[:5])
        continue # skip to next year

    convert_dtypes(df, sch)

    invalid_num = validate_raw_numbers(df)
    if len(invalid_num) < 100:
        replace_invalid(df, invalid_num)
    else:
        logging.error(f'Very many invalid_num values: {len(invalid_num)}, processing aborted')
        logging.error(invalid_num[:5])
        continue # skip to next year

    df.to_csv(resources.get(f'infogroup/csv/{year}').path, index=False)
    fastparquet.write(resources.get(f'infogroup/pq/{year}').path, df, write_index=False)
    logging.info(f'\nProcessing finished for year {year}\n' + '-'*80)
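
The per-year schema selection at the top of the loop, including the POPULATION_CODE override, can be isolated into a small function. This is a sketch under stated assumptions: `fields_for_year` is a hypothetical name, and the toy schema uses made-up years and enum values.

```python
import copy


def fields_for_year(schema, year):
    """Return deep-copied field definitions available in `year`.

    Copies are returned so the override below does not leak into the loaded schema.
    """
    fields = [copy.deepcopy(f) for f in schema['fields'] if year in f['years']]
    # POPULATION_CODE has different values in 2016 and 2017
    if year >= 2016:
        for field in fields:
            if field['name'] == 'POPULATION_CODE':
                field['enum'] = list('0123456789')
                break
    return fields


# Toy schema: the years and enum values here are made up.
schema = {
    'fields': [
        {'name': 'ABI', 'years': [2015, 2016]},
        {'name': 'POPULATION_CODE', 'years': [2015, 2016], 'enum': ['A', 'B']},
        {'name': 'GENDER', 'years': [2016]},
    ]
}

print([f['name'] for f in fields_for_year(schema, 2015)])
print(fields_for_year(schema, 2016)[1]['enum'])
```

With a helper like this the schema file could be read once, outside the loop, since each call leaves the loaded dictionary unchanged.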

In [None]:
#export
def get_df(year, cols=None):
    """Return one year of InfoGroup data with appropriate data types.
    Subset of columns can be loaded by passing list to `cols`.
    """
    res = resources.get(f'infogroup/pq/{year}')
    return pd.read_parquet(res.path, 'fastparquet', columns=cols)

# Tests