# Overview of original InfoGroup data

> Variable definitions and summary statistics by year.

Missing counts are those captured by the `isna()` method. For columns with string dtype, this includes empty strings in the raw CSV.

Some variables also have a special value such as "000" which is reported separately. These special values are not documented, and so are discovered by inspection.
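
As a minimal sketch of the two conventions above, the following reads a toy CSV with string dtype. The CSV content is invented for illustration and is not actual InfoGroup data. Empty fields become missing values, while a special value such as "000" is an ordinary string that has to be counted explicitly.

```python
import io

import pandas as pd

# Toy CSV, invented for illustration; not actual InfoGroup records.
raw = io.StringIO(
    "YEAR,COMPANY,CSA_CODE\n"
    "2015,Acme,000\n"
    "2015,,122\n"
    "2016,Widgets Inc,000\n"
)

# With dtype=str, empty fields are parsed as missing (NaN) by default.
df = pd.read_csv(raw, dtype=str)

# Missing values per year, as captured by isna().
missing_by_year = df['COMPANY'].isna().groupby(df['YEAR']).sum()

# The special value "000" is a regular string, so isna() does not flag it.
special_by_year = (df['CSA_CODE'] == '000').groupby(df['YEAR']).sum()
```

Here the empty `COMPANY` field is counted as missing, and the "000" codes are counted only by the explicit comparison.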

### Location and Rurality Measures

##### Address Line 1
The street address, name and number.
    
##### City
The name of the city as a text string.
    
##### County Code
The 3-digit FIPS code identifying a county uniquely within a state.
    
##### FIPS Code
The 5-digit FIPS code identifying a county uniquely within the U.S. The first two digits are
corrected by reference to the 'state_fips' dictionary below.
    
##### State
The 2-character postal state name abbreviation. InfoGroup does not include the 2-digit state 
FIPS code.

##### State Code
The FIPS state code as 2-digit number string. This is created from the 'state_fips' dictionary 
below in the process of correcting the 5-digit 'FIPS Code' variable.
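
The correction described above might look like the following sketch. The three-entry `state_fips` is an illustrative subset of the full dictionary, and `corrected_fips` is a hypothetical helper name. The assumption is that the state part is looked up from the postal abbreviation and the last three digits of the raw code are kept as the county part.

```python
# Illustrative subset of a postal-abbreviation -> state FIPS mapping.
state_fips = {'CA': '06', 'IL': '17', 'WI': '55'}


def corrected_fips(state, raw_fips):
    """Return a 5-digit county FIPS code with a corrected state part.

    Hypothetical helper: the first two digits are taken from the
    dictionary, the last three digits are kept from the raw code.
    """
    state_code = state_fips[state]
    county_part = str(raw_fips)[-3:].zfill(3)
    return state_code + county_part


# A raw code with wrong leading digits, and one that lost its leading zero.
fixed_wi = corrected_fips('WI', '00025')
fixed_ca = corrected_fips('CA', '6037')
```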
    
##### ZipCode
The 5-digit postal zip code. <span style="color:#fc3d03">Postal Zip codes can cross county 
and state boundaries.</span>
    
##### Zip4
The 4-digit add-on to the postal zip code.
    
##### Area Code
The 3-digit telephone area code. <span style="color:#fc3d03">Area codes do not cross state 
boundaries.</span>
    
##### Census Tract
##### Census Block
<span style="color:#fc3d03">Tracts are contained within a county, blocks within a tract.
Tracts and blocks do not necessarily respect other political boundaries.</span> 
'Census Tract' is a 6-digit numeric string identifying the tract as unique within a county. 
'Census Block' is a 3-digit numeric string that is unique within a tract.

Census tracts and blocks are subject to revision every ten years.

##### Full Census Tract
This variable is created out of the state, county, and census tract identifiers included
in the InfoGroup file and the 'state_fips' dictionary, resulting in an 11-digit numeric 
string that identifies a census tract uniquely within the country.
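
A sketch of how such an identifier can be assembled with pandas is shown below. The column names follow the variable names used later in this notebook, while the two-entry `state_fips` and the toy rows are assumptions made for illustration.

```python
import pandas as pd

# Illustrative subset of the postal-abbreviation -> state FIPS mapping.
state_fips = {'CA': '06', 'WI': '55'}

# Toy rows, invented for illustration.
df = pd.DataFrame({
    'STATE': ['WI', 'CA'],
    'COUNTY_CODE': ['025', '037'],
    'CENSUS_TRACT': ['000100', '207400'],
})

# 2-digit state + 3-digit county + 6-digit tract = 11-digit identifier.
full_tract = (
    df['STATE'].map(state_fips)
    + df['COUNTY_CODE'].str.zfill(3)
    + df['CENSUS_TRACT'].str.zfill(6)
)
```

Rows whose state abbreviation is not in the dictionary would come out as missing, because `map` returns NaN for unknown keys.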

##### Latitude
##### Longitude
Imputed presumably from a sufficient combination of the locational variables above.

In [None]:
#hide
import pandas as pd
from joblib import Memory

from rurec import resources
from rurec.resources import Resource
from rurec import util

memory = Memory(resources.paths.cache)


In [None]:
#hide
@memory.cache
def describe_variable(col, df=None, distribution=False, count_values=()):
    """Return summary statistics of `col` for all years of data.

    Dataframe is read from disk unless given in `df`.
    Distribution moments are reported by setting `distribution` to True.
    Counts of values specified in `count_values` are reported.
    """
    
    if df is None:
        df = get_df(cols=['YEAR', col])
        
    stats = {}
    stats['Total'] = df.groupby('YEAR').size()
    stats['Total']['All'] = len(df)
    df['__FLAG'] = df[col].notna()
    stats['Not missing'] = df.groupby('YEAR')['__FLAG'].sum().astype(int)
    stats['Not missing']['All'] = stats['Not missing'].sum()
    stats['Missing'] = stats['Total'] - stats['Not missing']
    stats['Unique'] = df.groupby('YEAR')[col].nunique()
    stats['Unique']['All'] = df[col].nunique()

    if distribution:
        stats['Min'] = df.groupby('YEAR')[col].min()
        stats['Min']['All'] = stats['Min'].min()
        stats['Max'] = df.groupby('YEAR')[col].max()
        stats['Max']['All'] = stats['Max'].max()
        stats['Mean'] = df.groupby('YEAR')[col].mean()
        stats['Mean']['All'] = df[col].mean()
        stats['s.d.'] = df.groupby('YEAR')[col].std()
        stats['s.d.']['All'] = df[col].std()
        q = df.groupby('YEAR')[col].quantile([0.25, 0.5, 0.75]).unstack()
        qa = df[col].quantile([0.25, 0.5, 0.75])
        stats['25%'] = q[0.25]
        stats['25%']['All'] = qa[0.25]
        stats['50% (median)'] = q[0.5]
        stats['50% (median)']['All'] = qa[0.5]
        stats['75%'] = q[0.75]
        stats['75%']['All'] = qa[0.75]

    for val in count_values:
        df['__FLAG'] = (df[col] == val)
        stats[f'Value "{val}"'] = df.groupby('YEAR')['__FLAG'].sum().astype(int)
        stats[f'Value "{val}"']['All'] = stats[f'Value "{val}"'].sum()

    return pd.concat(stats, axis=1)


def style(df):
    """Apply formatting style to summary stats dataframe."""

    f = {}
    int_cols = ['Total', 'Not missing', 'Missing', 'Unique']
    int_cols += [c for c in df if c.startswith('Value "')]
    for c in int_cols:
        f[c] = '{:,}'
    for c in ['Min', 'Max', 'Mean', 's.d.', '25%', '50% (median)', '75%']:
        f[c] = '{:,g}'

    return df.style.format(f)


**COMPANY**: Name of business - will have blanks

In [None]:
#collapse_output
style(describe_variable('COMPANY'))

**ADDRESS**: Historical address

In [None]:
#collapse_output
style(describe_variable('ADDRESS'))

**CITY**: Historical address city

In [None]:
#collapse_output
style(describe_variable('CITY'))

**STATE**: Historical address state

In [None]:
#collapse_output
style(describe_variable('STATE', count_values=get_schema('STATE')['enum']))

**ZIP**: Historical address zip code

In [None]:
#collapse_output
style(describe_variable('ZIP'))

**ZIP4**: Historical address ZIP+4 add-on code

In [None]:
#collapse_output
style(describe_variable('ZIP4'))

**COUNTY_CODE**: County code based upon location address/zip4 (postal)

In [None]:
#collapse_output
style(describe_variable('COUNTY_CODE'))

**AREA_CODE**: Area code of business

In [None]:
#collapse_output
style(describe_variable('AREA_CODE'))

**ID_CODE**: The code that identifies whether the yellow page listing is for a business or for an individual. This field helps clients identify whether the record represents a professional individual or a firm.  
Value labels:  
1: Individual  
2: Firm  

In [None]:
#collapse_output
style(describe_variable('ID_CODE', count_values=get_schema('ID_CODE')['enum']))

**EMPLOYEES_CODE**: Code indicating range of employees at that location  
Value labels:  
A: 1-4  
B: 5-9  
C: 10-19  
D: 20-49  
E: 50-99  
F: 100-249  
G: 250-499  
H: 500-999  
I: 1000-4999  
J: 5000-9999  
K: 10000-  
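
One possible use of these labels is to check `EMPLOYEES` against `EMPLOYEES_CODE`. The sketch below derives a range code from a head count with `pd.cut`. The bin edges are transcribed from the labels above, and `employees_to_code` is a hypothetical helper that is not part of this notebook.

```python
import pandas as pd

# Lower bounds of ranges A..K, transcribed from the value labels above.
labels = list('ABCDEFGHIJK')
edges = [1, 5, 10, 20, 50, 100, 250, 500, 1000, 5000, 10000, float('inf')]


def employees_to_code(employees):
    """Map employee counts to range codes.

    Intervals are closed on the left, e.g. B covers 5 to 9.
    Counts below 1 and missing counts map to NaN.
    """
    return pd.cut(employees, bins=edges, labels=labels, right=False)


codes = employees_to_code(pd.Series([1, 4, 5, 249, 10000]))
```

Comparing the derived codes with the reported ones would show whether the two variables agree, under the assumption that both refer to the same location.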

In [None]:
#collapse_output
style(describe_variable('EMPLOYEES_CODE', count_values=get_schema('EMPLOYEES_CODE')['enum']))

**SALES_CODE**: Corporate sales volume code (ranges) represents the total sales company wide  
Value labels:  
A: 1-499  
B: 500-999  
C: 1000-2499  
D: 2500-4999  
E: 5000-9999  
F: 10000-19999  
G: 20000-49999  
H: 50000-99999  
I: 100000-499999  
J: 500000-999999  
K: 1000000+  

In [None]:
#collapse_output
style(describe_variable('SALES_CODE', count_values=get_schema('SALES_CODE')['enum']))

**SIC**: This field contains the 6-digit sic code for the business’s primary activity

In [None]:
#collapse_output
style(describe_variable('SIC'))

**SIC_DESC**: The description for the SIC code

In [None]:
#collapse_output
style(describe_variable('SIC_DESC'))

**NAICS**: The primary NAICS code

In [None]:
#collapse_output
style(describe_variable('NAICS'))

**NAICS_DESC**: The description for the NAICS code

In [None]:
#collapse_output
style(describe_variable('NAICS_DESC'))

**SIC0**: A line of business that company engages in

In [None]:
#collapse_output
style(describe_variable('SIC0'))

**SIC0_DESC**: The description for the SIC code

In [None]:
#collapse_output
style(describe_variable('SIC0_DESC'))

**SIC1**: This field identifies an additional activity of the business. If there is no additional activity, this field will be blank.

In [None]:
#collapse_output
style(describe_variable('SIC1'))

**SIC1_DESC**: The description for the secondary SIC code

In [None]:
#collapse_output
style(describe_variable('SIC1_DESC'))

**SIC2**: This field identifies an additional activity of the business. If there is no additional activity, this field will be blank.

In [None]:
#collapse_output
style(describe_variable('SIC2'))

**SIC2_DESC**: The description for the secondary SIC code

In [None]:
#collapse_output
style(describe_variable('SIC2_DESC'))

**SIC3**: This field identifies an additional activity of the business. If there is no additional activity, this field will be blank.

In [None]:
#collapse_output
style(describe_variable('SIC3'))

**SIC3_DESC**: The description for the secondary SIC code

In [None]:
#collapse_output
style(describe_variable('SIC3_DESC'))

**SIC4**: This field identifies an additional activity of the business. If there is no additional activity, this field will be blank.

In [None]:
#collapse_output
style(describe_variable('SIC4'))

**SIC4_DESC**: The description for the secondary SIC code

In [None]:
#collapse_output
style(describe_variable('SIC4_DESC'))

**YEAR**: Year of data

In [None]:
#collapse_output
style(describe_variable('YEAR'))

**YP_CODE**: A numeric value assigned to yellow page heading for the sic

In [None]:
#collapse_output
style(describe_variable('YP_CODE'))

**EMPLOYEES**: Number of employees at that location, could be modeled

In [None]:
#collapse_output
style(describe_variable('EMPLOYEES', distribution=True, count_values=[1]))

**SALES**: Sales volume at that location (in thousands)

In [None]:
#collapse_output
style(describe_variable('SALES', distribution=True))

**BUSINESS_STATUS**: Indicates if record is hq, sub, or branch  
Value labels:  
1: headquarters  
2: branch  
3: subsidiary  
9: standalone  

In [None]:
#collapse_output
style(describe_variable('BUSINESS_STATUS', count_values=get_schema('BUSINESS_STATUS')['enum']))

**IND_BYTE**: Contains "number of" info.  (# beds for nursing homes, # rooms for hotels)

In [None]:
#collapse_output
style(describe_variable('IND_BYTE'))

**YEAR_EST**: Year the business began operating

In [None]:
#collapse_output
style(describe_variable('YEAR_EST'))

**OFFICE_SIZE_CODE**: Number of professionals per office with the same phone number  
Value labels:  
A: 1  
B: 2  
C: 3  
D: 4  
E: 5-9  
F: 10+  

In [None]:
#collapse_output
style(describe_variable('OFFICE_SIZE_CODE', count_values=get_schema('OFFICE_SIZE_CODE')['enum']))

**HOLDING_STATUS**: Indicates if company is a public company, private company, or a branch

In [None]:
#collapse_output
style(describe_variable('HOLDING_STATUS', count_values=get_schema('HOLDING_STATUS')['enum']))

**ABI**: Also known as the IUSA number, ABI number, InfoGroup number, or location number. Provides a unique identifier for each business in the InfoGroup business database.

In [None]:
#collapse_output
style(describe_variable('ABI'))

**SUBSIDIARY_NUMBER**: The subsidiary parent number identifies the business as a regional or subsidiary headquarters for a corporate family. The subsidiary will always have a parent and may or may not have branches assigned to it.

In [None]:
#collapse_output
style(describe_variable('SUBSIDIARY_NUMBER'))

**PARENT_NUMBER**: The parent number identifies the corporate parent of the business and also serves as the ABI number for the headquarters site of the parent. Since all locations of a business have the same ultimate parent number, this field provides ‘corporate ownership’ linkage information. This information is not collected or maintained for the types of organization for which ownership is ambiguous. Churches and schools, in particular, are not linked in the file for this reason.

In [None]:
#collapse_output
style(describe_variable('PARENT_NUMBER'))

**PARENT_EMPLOYEES**: Parent actual employee size refers to the parent abi record only

In [None]:
#collapse_output
style(describe_variable('PARENT_EMPLOYEES', distribution=True))

**PARENT_SALES**: Parent actual sales refers to the parent abi record only

In [None]:
#collapse_output
style(describe_variable('PARENT_SALES', distribution=True))

**PARENT_EMPLOYEES_CODE**: Parent employee code refers to the parent abi record only  
Value labels:  
A: 1-4  
B: 5-9  
C: 10-19  
D: 20-49  
E: 50-99  
F: 100-249  
G: 250-499  
H: 500-999  
I: 1000-4999  
J: 5000-9999  
K: 10000-  

In [None]:
#collapse_output
style(describe_variable('PARENT_EMPLOYEES_CODE', count_values=get_schema('PARENT_EMPLOYEES_CODE')['enum']))

**PARENT_SALES_CODE**: Parent sales code refers to the parent abi records only  
Value labels:  
A: 1-499  
B: 500-999  
C: 1000-2499  
D: 2500-4999  
E: 5000-9999  
F: 10000-19999  
G: 20000-49999  
H: 50000-99999  
I: 100000-499999  
J: 500000-999999  
K: 1000000+

In [None]:
#collapse_output
style(describe_variable('PARENT_SALES_CODE', count_values=get_schema('PARENT_SALES_CODE')['enum']))

**SITE_NUMBER**: Designates related businesses at one site, identifying the primary business. If ABI# and site# are the same, then the record is the primary business at the site. If ABI# and site# are different, then the record is a secondary business at the site. Determined through relationships between multiple data elements.
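
The rule above can be sketched as a pair of boolean flags. The toy records are invented for illustration. Treating records with a missing site number as neither primary nor secondary is an assumption, not something the variable definition states.

```python
import pandas as pd

# Toy records, invented for illustration.
df = pd.DataFrame({
    'ABI': ['100', '101', '102'],
    'SITE_NUMBER': ['100', '100', None],
})

# Assumption: a missing site number means no site relationship is recorded.
has_site = df['SITE_NUMBER'].notna()

# Primary business at the site: ABI equals the site number.
is_primary = has_site & (df['ABI'] == df['SITE_NUMBER'])

# Secondary business at the site: ABI differs from the site number.
is_secondary = has_site & (df['ABI'] != df['SITE_NUMBER'])
```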

In [None]:
#collapse_output
style(describe_variable('SITE_NUMBER'))

**ADDRESS_TYPE**: Indicates the type of address  
Value labels:  
F: Firm  
G: General delivery  
H: High-rise  
M: Military  
P: Post office box  
R: Rural route or hwy contract  
S: Street  
N: Unknown  
(blank): No match to Zip4

In [None]:
#collapse_output
style(describe_variable('ADDRESS_TYPE', count_values=get_schema('ADDRESS_TYPE')['enum']))

**CENSUS_TRACT**: Identifies a small geographic area for the purpose of collecting and compiling population and housing data. Census tracts are unique only within a census county, and census counties are unique only within a census state.

Special value "000000" - missing?

In [None]:
#collapse_output
style(describe_variable('CENSUS_TRACT', count_values=['000000']))

**CENSUS_BLOCK**: Block groups (BGs) are subdivisions of census tracts and are unique only within a specific census tract. Census tracts and block groups are assigned to address records via a geocoding process.

Special value "0" - missing?

In [None]:
#collapse_output
style(describe_variable('CENSUS_BLOCK', count_values=['0']))

**LATITUDE**: Parcel level, assigned via point geocoding. Half of a pair of coordinates (the other being longitude). Provided in a formatted value, with decimals or a negative sign. Not available in Puerto Rico and the Virgin Islands.

In [None]:
#collapse_output
style(describe_variable('LATITUDE', distribution=True))

**LONGITUDE**: Parcel level, assigned via point geocoding. Note: longitudes are negative values in the western hemisphere. Provided in a formatted value, with decimals or a negative sign. Not available in Puerto Rico and the Virgin Islands.

In [None]:
#collapse_output
style(describe_variable('LONGITUDE', distribution=True))

**MATCH_CODE**: Parcel level match code of the business location.  
Value labels:  
0: Site level  
2: Zip+2 centroid  
4: Zip+4 centroid  
P: Parcel  
X: Zip centroid

In [None]:
#collapse_output
style(describe_variable('MATCH_CODE', count_values=get_schema('MATCH_CODE')['enum']))

**CBSA_CODE**: Core-based statistical area (expanded MSA code)

Special value "00000"

In [None]:
#collapse_output
style(describe_variable('CBSA_CODE', count_values=['00000']))

**CBSA_LEVEL**: Indicates if an area is a micropolitan or metropolitan area  
Value labels:  
1: Micropolitan  
2: Metropolitan

In [None]:
#collapse_output
style(describe_variable('CBSA_LEVEL', count_values=get_schema('CBSA_LEVEL')['enum']))

**CSA_CODE**: Adjoining CBSAs; a combination of metro and micro areas

Special value "000".

In [None]:
#collapse_output
style(describe_variable('CSA_CODE', count_values=['000']))

**FIPS_CODE**: First 2 bytes = state code, last 3 bytes = county code (location)

In [None]:
#collapse_output
style(describe_variable('FIPS_CODE'))

## Nonregular variables

The following variables are only available in select years.

- ...


Population code exists for all years, but coding changes after 2015.

**POPULATION_CODE**: The code for the resident population of the city in which the business is located, according to Rand McNally. Some assignments can vary within a city, such as when cities cross county lines. To maintain this granularity, the actual assignment is done at the zip level.