# InfoGroup data

> Process and prepare InfoGroup dataset.

## Processing

Starting from the original CSV files, processing takes the following steps:

- Convert to Unicode.
- Validate against a JSON schema. A few erroneous data entries are erased here (e.g. text in a numerical column). The existing implementation uses the datapackage validator and takes several days on a single core.
- Save to disk in parquet format.
- Provide an interface to load a single year of data. Allow filtering, column selection and a small (optionally random) sample.
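
The loading interface described in the last step could look roughly like the sketch below. It is an illustration only, not the implementation used in this notebook: the helper name `subset`, its parameters and the demo values are all hypothetical, and it works on an in-memory dataframe rather than on the parquet files.

```python
import pandas as pd

def subset(df, cols=None, where=None, n=None, random_state=None):
    """Hypothetical helper: filter rows, select columns, optionally take a small sample."""
    if where is not None:
        # `where` is a callable that returns a boolean mask for the dataframe
        df = df[where(df)]
    if cols is not None:
        df = df[cols]
    if n is not None:
        n = min(n, len(df))
        # random sample when a seed is given, otherwise the first n rows
        df = df.head(n) if random_state is None else df.sample(n, random_state=random_state)
    return df

# made-up demo data, column names follow the ALL_CAPS convention
demo = pd.DataFrame({
    'ABI': ['000000001', '000000002', '000000003', '000000004'],
    'STATE': ['WI', 'WI', 'IL', 'WI'],
    'EMPLOYEES': [5, 12, 7, 30],
})
result = subset(demo, cols=['ABI', 'EMPLOYEES'],
                where=lambda d: d['STATE'] == 'WI', n=2, random_state=0)
```

Filtering is applied before column selection so that the filter can use columns that are not returned.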

In [None]:
#default_exp infogroup

In [None]:
#export
import json
import gzip
import shutil

import numpy as np
import pandas as pd
import fastparquet
from IPython import display

from rurec import resources
from rurec.resources import Resource
from rurec import util

In [None]:
resources.add(Resource('infogroup/schema', '/InfoGroup/data/processed/schema.json', 'Processed InfoGroup data, schema', False))
for y in range(1997, 2018):
    resources.add(Resource(f'infogroup/csv/{y}', f'/InfoGroup/data/processed/{y}.csv', f'Processed InfoGroup data, {y}, CSV format', False))
    resources.add(Resource(f'infogroup/pq/{y}', f'/InfoGroup/data/processed/{y}.pq', f'Processed InfoGroup data, {y}, parquet format', False))
    resources.add(Resource(f'infogroup/orig/{y}', f'/InfoGroup/data/original/raw/{y}_Business_Academic_QCQ.csv', f'Original unprocessed InfoGroup data, {y}', False))

## Clear and validate raw data

- Change "latin-1" encoding to "utf-8".
- Remove double quotes around every value.
- Rename columns to ALL_CAPS.
- In 2009: pad string fields with zeroes.
- Validate values format, replace errors with missing values.
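
The steps above can be sketched on a tiny in-memory example. This is a simplified illustration, not the code used below: the three-column file is made up, and the "digits only" check stands in for the actual validation rules.

```python
import io
import numpy as np
import pandas as pd

# Made-up stand-in for a raw file: latin-1 bytes, quoted values, mixed-case header.
raw = '"Company","Zip","Sales"\n"Café Olé","2139","120"\n"Acme","53706","1 6"\n'.encode('latin-1')

# Decoding latin-1 on read yields ordinary Unicode strings; the CSV parser strips the quotes.
# dtype=str keeps codes such as ZIP as text, so padding with zeroes is possible.
df = pd.read_csv(io.BytesIO(raw), dtype=str, encoding='latin-1')
df.columns = df.columns.str.upper()
df['ZIP'] = df['ZIP'].str.zfill(5)

# Assumed rule for this sketch: a numeric field must consist of digits only.
bad = df['SALES'].notna() & ~df['SALES'].str.fullmatch(r'\d+', na=False)
df.loc[bad, 'SALES'] = np.nan
df['SALES'] = pd.to_numeric(df['SALES'])
```

Writing the result with `df.to_csv(...)` produces UTF-8 output by default, which completes the encoding change.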

### Future work

- Correct 2-digit state part of the FIPS code.
- Correct missing CBSA code and CBSA level, mainly in 2009.
- Add indicator variables for different samples (random, WI, FAI, ...) to be used as parquet partitions, to allow quick reads of data subsets.
- Put meaningful labels on categoricals ("1-5" instead of "A" etc).
  - This will be tricky for POPULATION_CODE, which changes coding between 2015 and 2016.
- Read through the schema file and see if there is anything useful left in notes or elsewhere.
- Make up and add enum constraints for TITLE_CODE and CALL_STATUS_CODE.
- Add logging of errors to a file.
- If categoricals are worthwhile for fields with a large number of unique values, possibly unknown a priori, such as city or NAICS, then they should be applied. Care should be taken because the set of unique values can vary between years, which might create problems when merging.
- Further validations:
  - Codes are valid (i.e. can be found in lookup tables) for fields such as SIC, NAICS, FIPS, CBSA_CODE etc.
  - Geo variable consistency: CBSA_LEVEL vs CBSA_CODE, lon-lat, nesting of areas.
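
The merging concern with categoricals can be illustrated with a small sketch. The city values are made up, and aligning to the union of categories is only one possible approach, not something this notebook currently does.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Made-up city columns from two different years, each with its own set of categories.
y1 = pd.Series(['MADISON', 'CHICAGO'], dtype='category')
y2 = pd.Series(['CHICAGO', 'AUSTIN'], dtype='category')

# Concatenating categoricals with different category sets loses the categorical dtype.
naive = pd.concat([y1, y2], ignore_index=True)

# Re-coding both years to the union of categories keeps the result categorical.
cats = union_categoricals([y1, y2]).categories
aligned = pd.concat(
    [y1.cat.set_categories(cats), y2.cat.set_categories(cats)],
    ignore_index=True,
)
```

The same alignment would be needed for every pair of years being combined, which is part of why this item is left as future work.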

In [None]:
#export

def convert_schema(datapackage_schema_path):
    """Convert old datapackage schema.json into a new file, to be used for data validation."""
    # use a context manager so the file handle is closed after reading
    with open(datapackage_schema_path) as f:
        sch0 = json.load(f)

    def get_field_years(field_name):
        years = []
        for fl in sch0['field_lists']:
            if field_name in fl['fields']:
                years += fl['years']
        if 2015 in years:
            years += [2016, 2017]
        return sorted(years)


    sch = dict()
    sch['info'] = 'Schema for cleaned InfoGroup data.'
    sch['fields'] = list()

    for f0 in sch0['fields']:
        f = dict()
        name = f0['name']
        f['name'] = name.upper()
        f['years'] = get_field_years(name)
        if 'enum' in f0['constraints']:
            enum = f0['constraints']['enum'].copy()
            if '' in enum:
                enum.pop(enum.index(''))
            f['enum'] = enum
        if 'values' in f0:
            values = f0['values'].copy()
            if name == 'cbsa_level':
                del values['0'] # code never used
            f['enum_labels'] = values

        field_widths = {
            'ZIP': 5,
            'ZIP4': 4,
            'COUNTY_CODE': 3,
            'AREA_CODE': 3,
            'PHONE': 7,
            'SIC': 6,
            'SIC0': 6,
            'SIC1': 6,
            'SIC2': 6,
            'SIC3': 6,
            'SIC4': 6,
            'NAICS': 8,
            'YP_CODE': 5,
            'ABI': 9,
            'SUBSIDIARY_NUMBER': 9,
            'PARENT_NUMBER': 9,
            'SITE_NUMBER': 9,
            'CENSUS_TRACT': 6,
            'CENSUS_BLOCK': 1,
            'CBSA_CODE': 5,
            'CSA_CODE': 3,
            'FIPS_CODE': 5
        }
        if f['name'] in field_widths:
            f['width'] = field_widths[f['name']]

        f['original_name'] = f0['originalName']
        f['original_description'] = f0['originalDescription']
        sch['fields'].append(f)

    # use a context manager so the output is flushed and the file is closed
    with open(resources.get('infogroup/schema').path, 'w') as f:
        json.dump(sch, f, indent=1)
    

def pad_with_zeroes(df, schema_fields):
    """Prepend string column values with zeroes to have constant width."""

    for field in schema_fields:
        if 'width' in field:
            df[field['name']] = df[field['name']].str.zfill(field['width'])

def validate_raw_strings(df, schema_fields):
    """Validate values in raw InfoGroup data according to string constraints.
    Return list of dicts of invalid values.
    """
    
    constraints = {
        'ZIP': {'number': True},
        'ZIP4': {'number': True},
        'COUNTY_CODE': {'number': True},
        'AREA_CODE': {'number': True},
        'PHONE': {'number': True},
        'SIC': {'number': True},
        'SIC0': {'number': True},
        'SIC1': {'number': True},
        'SIC2': {'number': True},
        'SIC3': {'number': True},
        'SIC4': {'number': True},
        'NAICS': {'number': True},
        'YEAR': {'notna': True, 'number': True},
        'YP_CODE': {'number': True},
        'EMPLOYEES': {'number': True},
        'SALES': {'number': True},
        'PARENT_EMPLOYEES': {'number': True},
        'PARENT_SALES': {'number': True},
        'YEAR_EST': {'number': True},
        'ABI': {'unique': True, 'notna': True, 'number': True},
        'SUBSIDIARY_NUMBER': {'number': True},
        'PARENT_NUMBER': {'number': True},
        'SITE_NUMBER': {'number': True},
        'CENSUS_TRACT': {'number': True},
        'CENSUS_BLOCK': {'number': True},
        'LATITUDE': {'number': True},
        'LONGITUDE': {'number': True},
        'CBSA_CODE': {'number': True},
        'CSA_CODE': {'number': True},
        'FIPS_CODE': {'number': True}
    }
    
    # the above hard coded list of constraints must be consistent with field availability in given year
    constraints = {k: v for k, v in constraints.items() if k in df}
    
    
    for field in schema_fields:
        name = field['name']
        if 'enum' in field:
            constraints.setdefault(name, dict())['cats'] = field['enum']
        if 'width' in field:
            constraints[name]['nchar'] = field['width']
    
    return util.validate_values(df, constraints)


def convert_dtypes(df, schema_fields):
    """Inplace convert string columns to appropriate types."""
    
    for col in ['YEAR', 'EMPLOYEES', 'SALES', 'PARENT_EMPLOYEES', 'PARENT_SALES', 'YEAR_EST', 'LATITUDE', 'LONGITUDE']:
        df[col] = pd.to_numeric(df[col])
        
    for field in schema_fields:
        if 'enum' in field:
            df[field['name']] = pd.Categorical(df[field['name']], categories=field['enum'])
    

def validate_raw_numbers(df, year):
    """Validate values in raw InfoGroup data according to numerical constraints.
    `year` is the year of the data, used to check YEAR and to cap YEAR_EST.
    Return list of dicts of invalid values.
    """

    constraints = {
        'YEAR': {'eq': year},
        'EMPLOYEES': {'ge': 0},
        'SALES': {'ge': 0},
        'PARENT_EMPLOYEES': {'ge': 0},
        'PARENT_SALES': {'ge': 0},
        'YEAR_EST': {'ge': 1000, 'le': year},
        'LATITUDE': {'ge': 0, 'le': 90},
        'LONGITUDE': {'ge': -180, 'le': 0}
    }
    return util.validate_values(df, constraints)

def replace_invalid(df, invalid_list, replacement=np.nan):
    """Replace invalid values."""
    for inv in invalid_list:
        df.loc[inv['idx'], inv['col']] = replacement
        print(f'Replace invalid value `{inv["val"]}` with `{replacement}` at .loc[{inv["idx"]}, \'{inv["col"]}\']')

In [None]:
for year in [1999, 2002, 2004, 2006]:
    print(year)
    with resources.get('infogroup/schema').path.open() as f:
        sch = json.load(f)
    sch = [x for x in sch['fields'] if year in x['years']]

    df = pd.read_csv(resources.get(f'infogroup/orig/{year}').path, dtype='str', encoding='latin-1')

    df.rename(columns={x['original_name']: x['name'] for x in sch}, inplace=True)

    if year == 2009:
        pad_with_zeroes(df, sch)

    invalid_str = validate_raw_strings(df, sch)
    if len(invalid_str) < 100:
        replace_invalid(df, invalid_str)
    else:
        print(invalid_str[:5])
        raise Exception(f'Very many invalid_str values: {len(invalid_str)}')

    convert_dtypes(df, sch)

    # pass the year explicitly instead of relying on a global variable
    invalid_num = validate_raw_numbers(df, year)
    if len(invalid_num) < 100:
        replace_invalid(df, invalid_num)
    else:
        print(f'Very many invalid_num values: {len(invalid_num)}')
        print(invalid_num[:5])

    df.to_csv(resources.get(f'infogroup/csv/{year}').path, index=False)
    fastparquet.write(resources.get(f'infogroup/pq/{year}').path, df, write_index=False)

```
1997
Replace invalid value `9 1` with `nan` at EMPLOYEES, 3756551.
Replace invalid value `1 6` with `nan` at SALES, 2827962.
Replace invalid value `121  0` with `nan` at SALES, 2835233.
Replace invalid value `1 6 0` with `nan` at SALES, 5601711.
Replace invalid value `2 930` with `nan` at SALES, 5601723.
Replace invalid value `39 6 0` with `nan` at SALES, 5645704.
Replace invalid value `26 0` with `nan` at SALES, 5645707.
Replace invalid value `44 0` with `nan` at SALES, 5652929.
Replace invalid value `A` with `nan` at SALES, 7949970.
Replace invalid value `18 4` with `nan` at SALES, 10777263.
Replace invalid value `/` with `nan` at PARENT_NUMBER, 1970582.

1998
Replace invalid value `F000` with `nan` at EMPLOYEES, 10554007.
Replace invalid value `0` with `nan` at BUSINESS_STATUS, 10554007.
Replace invalid value `0` with `nan` at OFFICE_SIZE_CODE, 10554007.
Replace invalid value `IG4775007` with `nan` at ABI, 10554007.
Replace invalid value `6` with `nan` at ADDRESS_TYPE, 10554007.

2003
Replace invalid value `1 6` with `nan` at SALES, 3179995.
Replace invalid value `121  0` with `nan` at SALES, 3188560.
Replace invalid value `1 6 0` with `nan` at SALES, 6401323.
Replace invalid value `44 0` with `nan` at SALES, 6456105.
Replace invalid value `26 0` with `nan` at SALES, 6516738.
Replace invalid value `18 4` with `nan` at SALES, 12214748.

2004
Replace invalid value `1 6` with `nan` at SALES, 3132298.
Replace invalid value `121  0` with `nan` at SALES, 3140153.
Replace invalid value `1 6 0` with `nan` at SALES, 6330767.
Replace invalid value `44 0` with `nan` at SALES, 6383666.
Replace invalid value `26 0` with `nan` at SALES, 6441673.
Replace invalid value `18 4` with `nan` at SALES, 12093651.

2005
Replace invalid value `1 6` with `nan` at SALES, 3262278.
Replace invalid value `121  0` with `nan` at SALES, 3270294.
Replace invalid value `1 6 0` with `nan` at SALES, 6602040.
Replace invalid value `44 0` with `nan` at SALES, 6656877.
Replace invalid value `26 0` with `nan` at SALES, 6716447.
Replace invalid value `18 4` with `nan` at SALES, 12657797.

2006
Replace invalid value `1 6` with `nan` at SALES, 3348096.
Replace invalid value `121  0` with `nan` at SALES, 3356659.
Replace invalid value `1 6 0` with `nan` at SALES, 6729279.
Replace invalid value `44 0` with `nan` at SALES, 6784843.
Replace invalid value `26 0` with `nan` at SALES, 6848756.
Replace invalid value `18 4` with `nan` at SALES, 12800833.

2007
Replace invalid value `1 6` with `nan` at SALES, 3445276.
Replace invalid value `121  0` with `nan` at SALES, 3452053.
Replace invalid value `1 6 0` with `nan` at SALES, 6929699.
Replace invalid value `1493.0` with `nan` at YEAR_EST, 13571251.


2008
Replace invalid value `C` with `nan` at SALES, 2473525.
Replace invalid value `1 6` with `nan` at SALES, 3527847.
Replace invalid value `121  0` with `nan` at SALES, 3534969.
Replace invalid value `1 6 0` with `nan` at SALES, 7040776.
Replace invalid value `1493.0` with `nan` at YEAR_EST, 13754797. # replaced during numerical validation, constraint was year_est >= 1500

2012
Replace invalid value `INT` with `nan` at SALES, 12383573.

2013
Replace invalid value `INT` with `nan` at SALES, 13419903.


```

In [None]:
#export
def get_df(year, cols=None):
    """Return one year of InfoGroup data with appropriate data types.
    Subset of columns can be loaded by passing list to `cols`.
    """
    res = resources.get(f'infogroup/pq/{year}')
    return pd.read_parquet(res.path, engine='fastparquet', columns=cols)

# Tests