# Dataset Description
Your challenge in this competition is to forecast *microbusiness activity* across the United States, as measured by the density of microbusinesses in US counties. Microbusinesses are often too small or too new to show up in traditional economic data sources, but microbusiness activity may be correlated with other economic indicators of general interest.

As historic economic data are widely available, this is a forecasting competition. The forecasting phase public leaderboard and final private leaderboard will be determined using data gathered after the submission period closes. You will make static forecasts that can only incorporate information available before the end of the submission period. This means that, while we will rescore submissions during the forecasting period, we will not rerun any notebooks.

## Files
A great deal of data is publicly available about counties and we have not attempted to gather it all here. You are strongly encouraged to use external data sources for features.

### **train.csv**

**row_id** - An ID code for the row.

**cfips** - A unique identifier for each county using the Federal Information Processing Standards (FIPS). The first two digits correspond to the state FIPS code, while the following three identify the county.

**county** - The written name of the county.

**state** - The name of the state.

**first_day_of_month** - The date of the first day of the month.

**microbusiness_density** - Microbusinesses per 100 people over the age of 18 in the given county. This is the target variable. The population figures used to calculate the density are on a two-year lag due to the pace of update provided by the U.S. Census Bureau, which provides the underlying population data annually. 2021 density figures are calculated using 2019 population figures, etc.

**active** - The raw count of microbusinesses in the county. Not provided for the test set.
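
Read literally, the density definition above is 100 × active / adult population. As a quick arithmetic check, the sketch below inverts that formula for two rows of train.csv for cfips 1001 (active 1249 with density 3.007682, and active 1198 with density 2.88487). Both imply the same adult population, which is what a population figure that only changes annually would produce. The function names are illustrative and not part of the competition data.

```python
def microbusiness_density(active: int, adult_population: float) -> float:
    """Microbusinesses per 100 people over 18, per the definition above."""
    return 100.0 * active / adult_population


def implied_adult_population(active: int, density: float) -> float:
    """Invert the density formula to recover the population it was computed from."""
    return 100.0 * active / density


# Two consecutive months for cfips 1001 (values taken from train.csv).
pop_aug = implied_adult_population(1249, 3.007682)
pop_sep = implied_adult_population(1198, 2.88487)
print(round(pop_aug), round(pop_sep))  # both round to 41527
```

Because the population denominator is fixed within a year, month-to-month changes in density within a year track changes in `active`.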

### **sample_submission.csv** 
A valid sample submission. This file will remain unchanged throughout the competition.

**row_id** - An ID code for the row.

**microbusiness_density** - The target variable.

### **test.csv**
Metadata for the submission rows. This file will remain unchanged throughout the competition.

**row_id** - An ID code for the row.

**cfips** - A unique identifier for each county using the Federal Information Processing Standards (FIPS). The first two digits correspond to the state FIPS code, while the following three identify the county.

**first_day_of_month** - The date of the first day of the month.

### **revealed_test.csv**
During the submission period, only the most recent month of data will be used for the public leaderboard. Any test set data older than that will be published in revealed_test.csv, closely following the usual data release cycle for the microbusiness report. We expect to publish one copy of revealed_test.csv in mid-February. This file's schema will match train.csv.

### **census_starter.csv**
Examples of useful columns from the Census Bureau's American Community Survey (ACS) at data.census.gov. The percentage fields were derived from the raw counts provided by the ACS. All fields have a two-year lag to match what information was available at the time a given microbusiness data update was published.

**pct_bb_[year]** - The percentage of households in the county with access to broadband of any type. Derived from ACS table B28002: PRESENCE AND TYPES OF INTERNET SUBSCRIPTIONS IN HOUSEHOLD.

**cfips** - The CFIPS code.

**pct_college_[year]** - The percent of the population in the county over age 25 with a 4-year college degree. Derived from ACS table S1501: EDUCATIONAL ATTAINMENT.

**pct_foreign_born_[year]** - The percent of the population in the county born outside of the United States. Derived from ACS table DP02: SELECTED SOCIAL CHARACTERISTICS IN THE UNITED STATES.

**pct_it_workers_[year]** - The percent of the workforce in the county employed in information-related industries. Derived from ACS table S2405: INDUSTRY BY OCCUPATION FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER.

**median_hh_inc_[year]** - The median household income in the county. Derived from ACS table S1901: INCOME IN THE PAST 12 MONTHS (IN 2021 INFLATION-ADJUSTED DOLLARS).
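
The `[year]` suffix means census_starter.csv is wide, with one column per field and year. A minimal sketch of one way to line those columns up with monthly rows follows. It assumes that the column for ACS year Y is the one to pair with density rows from year Y + 2, which is one reading of the two-year lag described above. The toy values and the helper column names (`acs_year`, `density_year`) are made up for illustration.

```python
import pandas as pd

# Toy stand-in for census_starter.csv; the values are made up.
census = pd.DataFrame({
    "cfips": [1001, 1003],
    "pct_bb_2019": [80.0, 82.0],
    "pct_bb_2020": [81.0, 83.5],
})

# Wide -> long: one row per (cfips, ACS year).
census_long = census.melt(id_vars="cfips", var_name="column", value_name="pct_bb")
census_long["acs_year"] = (
    census_long["column"].str.extract(r"(\d{4})$", expand=False).astype(int)
)

# Assumed alignment: ACS year Y pairs with density rows from year Y + 2.
census_long["density_year"] = census_long["acs_year"] + 2

# Toy stand-in for a few train.csv rows.
train_toy = pd.DataFrame({
    "cfips": [1001, 1001, 1003],
    "first_day_of_month": pd.to_datetime(["2021-03-01", "2022-03-01", "2022-03-01"]),
})
train_toy["density_year"] = train_toy["first_day_of_month"].dt.year

merged = train_toy.merge(
    census_long[["cfips", "density_year", "pct_bb"]],
    on=["cfips", "density_year"],
    how="left",  # keep every monthly row even if no census value matches
)
print(merged)
```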

# Data Ingestion and Exploration

In [21]:
import pandas as pd

## set the path to the input files.
path = "data/godaddy-microbusiness-density-forecasting"
pathVF="data/VF_md_bundle_Q222"
## open the train and test files
train = pd.read_csv(path + "/train.csv")
test = pd.read_csv(path + "/test.csv")
census_starter = pd.read_csv(path + "/census_starter.csv")
sample_submission = pd.read_csv(path + "/sample_submission.csv")

In [22]:
train.head()

Unnamed: 0,row_id,cfips,county,state,first_day_of_month,microbusiness_density,active
0,1001_2019-08-01,1001,Autauga County,Alabama,2019-08-01,3.007682,1249
1,1001_2019-09-01,1001,Autauga County,Alabama,2019-09-01,2.88487,1198
2,1001_2019-10-01,1001,Autauga County,Alabama,2019-10-01,3.055843,1269
3,1001_2019-11-01,1001,Autauga County,Alabama,2019-11-01,2.993233,1243
4,1001_2019-12-01,1001,Autauga County,Alabama,2019-12-01,2.993233,1243


In [23]:
test.head()

Unnamed: 0,row_id,cfips,first_day_of_month
0,1001_2022-11-01,1001,2022-11-01
1,1003_2022-11-01,1003,2022-11-01
2,1005_2022-11-01,1005,2022-11-01
3,1007_2022-11-01,1007,2022-11-01
4,1009_2022-11-01,1009,2022-11-01
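
The `cfips` values above are read as integers, so a state code with a leading zero loses it (1001 rather than 01001). The sketch below zero-pads to five digits before splitting into the two-digit state part and three-digit county part described in the dataset description, and also splits a `row_id` into its two components. Both helper names are hypothetical.

```python
def split_cfips(cfips: int) -> tuple[str, str]:
    """Return (state FIPS, county FIPS) as zero-padded strings."""
    code = f"{cfips:05d}"  # 1001 -> "01001"
    return code[:2], code[2:]


def parse_row_id(row_id: str) -> tuple[int, str]:
    """Split a row_id such as '1001_2022-11-01' into (cfips, date string)."""
    cfips, date = row_id.split("_", 1)
    return int(cfips), date


print(split_cfips(1001))                # ('01', '001')
print(parse_row_id("1001_2022-11-01"))  # (1001, '2022-11-01')
```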


In [24]:
## check the shape of the train and test files
print("Train shape: ", train.shape)
print("Test shape: ", test.shape)

Train shape:  (122265, 7)
Test shape:  (25080, 3)
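
The row counts above divide evenly if the data form a balanced panel, that is, one row per county per month. The sketch below checks that arithmetic under two assumptions: train covers 39 months (August 2019 through October 2022) and test covers 8 months. It also defines a small, hypothetical helper for testing balance on a real frame, demonstrated here on toy rows.

```python
import pandas as pd

# Row counts reported above, divided by the assumed number of months.
n_counties_train, rem_train = divmod(122265, 39)
n_counties_test, rem_test = divmod(25080, 8)
print(n_counties_train, rem_train, n_counties_test, rem_test)  # 3135 0 3135 0


def is_balanced_panel(df: pd.DataFrame, key: str = "cfips",
                      time: str = "first_day_of_month") -> bool:
    """True if every key has the same number of distinct time stamps, with no duplicate rows."""
    per_key = df.groupby(key)[time].nunique()
    return bool(per_key.nunique() == 1 and len(df) == int(per_key.sum()))


toy = pd.DataFrame({
    "cfips": [1001, 1001, 1003, 1003],
    "first_day_of_month": ["2019-08-01", "2019-09-01", "2019-08-01", "2019-09-01"],
})
print(is_balanced_panel(toy))            # balanced toy panel
print(is_balanced_panel(toy.iloc[:-1]))  # one county is missing a month
```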


In [25]:
## check for missing values in the train and test files
print("Train missing values: ", train.isnull().sum().sum())
print("Test missing values: ", test.isnull().sum().sum())


Train missing values:  0
Test missing values:  0


In [26]:
train.columns

Index(['row_id', 'cfips', 'county', 'state', 'first_day_of_month',
       'microbusiness_density', 'active'],
      dtype='object')

In [27]:
test.columns

Index(['row_id', 'cfips', 'first_day_of_month'], dtype='object')

In [28]:
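## check the correlation with microbusiness density
## (numeric_only=True keeps this working on pandas 2.x, which no longer drops non-numeric columns silently)
train.corr(numeric_only=True)['microbusiness_density'].sort_values(ascending=False)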

microbusiness_density    1.000000
active                   0.316981
cfips                   -0.011767
Name: microbusiness_density, dtype: float64

In [29]:
## encode first_day_of_month as datetime   
train['first_day_of_month'] = pd.to_datetime(train['first_day_of_month'])
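
Once the column is a datetime, calendar features can be derived through the `.dt` accessor. A minimal sketch on a toy series follows. The feature names, including `months_since_start`, are illustrative choices rather than anything the competition prescribes.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2019-08-01", "2019-09-01", "2020-01-01"]))
start = dates.min()

features = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    # Whole months elapsed since the first observation.
    "months_since_start": (dates.dt.year - start.year) * 12
                          + (dates.dt.month - start.month),
})
print(features)
```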

In [30]:
train['state'].unique().shape

(51,)

In [31]:
train['county'].unique().shape

(1871,)
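
A count of unique county names can understate the number of counties, because the same name may occur in several states. The sketch below illustrates the difference on toy rows, with illustrative cfips values. Counting by `cfips`, or by the (state, county) pair, avoids the collision.

```python
import pandas as pd

# Toy rows: the same county name in three different states.
toy = pd.DataFrame({
    "cfips": [1129, 5143, 5143, 12133],
    "county": ["Washington County"] * 4,
    "state": ["Alabama", "Arkansas", "Arkansas", "Florida"],
})

n_names = toy["county"].nunique()                         # distinct names only
n_by_cfips = toy["cfips"].nunique()                       # distinct county codes
n_by_pair = len(toy[["state", "county"]].drop_duplicates())

print(n_names, n_by_cfips, n_by_pair)  # 1 3 3
```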

In [32]:
## encode the categorical variables
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
train['state_encoded'] = le.fit_transform(train['state'])
train['county_encoded'] = le.fit_transform(train['county'])




In [33]:
train['state'].unique()

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
       'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
       'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
       'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
       'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
       'New Jersey', 'New Mexico', 'New York', 'North Carolina',
       'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
       'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
       'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'], dtype=object)

In [34]:
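## NOTE: `le` was last fit on the county column in In [32], so the state codes
## below are decoded as county names rather than state names
print(le.inverse_transform(train['state_encoded']))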

['Abbeville County' 'Abbeville County' 'Abbeville County' ...
 'Appling County' 'Appling County' 'Appling County']
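
The output above shows county names because a `LabelEncoder` only remembers the classes from its most recent `fit`. One way to keep both mappings invertible is to hold a separate encoder per column, sketched below on toy rows. The `encoders` dictionary is an illustrative convention, not something the notebook defines.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Wyoming"],
    "county": ["Autauga County", "Baldwin County", "Weston County"],
})

# One fitted encoder per column, so each keeps its own classes_.
encoders = {}
for col in ["state", "county"]:
    enc = LabelEncoder()
    toy[col + "_encoded"] = enc.fit_transform(toy[col])
    encoders[col] = enc

decoded_states = encoders["state"].inverse_transform(toy["state_encoded"])
decoded_counties = encoders["county"].inverse_transform(toy["county_encoded"])
print(list(decoded_states))
print(list(decoded_counties))
```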


In [35]:
train['county'].unique()

array(['Autauga County', 'Baldwin County', 'Barbour County', ...,
       'Uinta County', 'Washakie County', 'Weston County'], dtype=object)

In [36]:
## re-check the correlation now that the encoded columns exist (numeric columns only)
train.corr(numeric_only=True)['microbusiness_density'].sort_values(ascending=False)

microbusiness_density    1.000000
active                   0.316981
county_encoded           0.015986
state_encoded           -0.011107
cfips                   -0.011767
Name: microbusiness_density, dtype: float64
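
The encoded columns are alphabetical ranks of the names, so a near-zero Pearson correlation with them says little about whether state or county matters. As a hedged alternative view, the sketch below compares the mean of the target per group on toy data. Applying the same `groupby` to `train` is a suggestion, not a step the notebook takes.

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "microbusiness_density": [1.0, 3.0, 5.0, 7.0],
})

# Mean density per state, highest first.
state_means = (
    toy.groupby("state")["microbusiness_density"]
       .mean()
       .sort_values(ascending=False)
)
print(state_means)
```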

In [37]:
## print each state matched with its encoded value
state_codes = train[['state', 'state_encoded']].drop_duplicates()
for state, code in state_codes.itertuples(index=False):
    print(state, code)
    

Alabama 0
Alaska 1
Arizona 2
Arkansas 3
California 4
Colorado 5
Connecticut 6
Delaware 7
District of Columbia 8
Florida 9
Georgia 10
Hawaii 11
Idaho 12
Illinois 13
Indiana 14
Iowa 15
Kansas 16
Kentucky 17
Louisiana 18
Maine 19
Maryland 20
Massachusetts 21
Michigan 22
Minnesota 23
Mississippi 24
Missouri 25
Montana 26
Nebraska 27
Nevada 28
New Hampshire 29
New Jersey 30
New Mexico 31
New York 32
North Carolina 33
North Dakota 34
Ohio 35
Oklahoma 36
Oregon 37
Pennsylvania 38
Rhode Island 39
South Carolina 40
South Dakota 41
Tennessee 42
Texas 43
Utah 44
Vermont 45
Virginia 46
Washington 47
West Virginia 48
Wisconsin 49
Wyoming 50


In [38]:
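## show the data for loudoun county
## NOTE: county values carry the " County" suffix (e.g. 'Autauga County'), so this
## exact match on 'Loudoun' returns an empty frame; 'Loudoun County' would be needed
train[train['county'] == 'Loudoun']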


Unnamed: 0,row_id,cfips,county,state,first_day_of_month,microbusiness_density,active,state_encoded,county_encoded
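
The empty result above is consistent with the county values carrying a " County" suffix, as in 'Autauga County'. The sketch below uses toy rows to contrast the exact match with a prefix match. Adding a `state` filter is an extra safeguard against similar names in other states.

```python
import pandas as pd

toy = pd.DataFrame({
    "county": ["Loudoun County", "Autauga County", "Loudon County"],
    "state": ["Virginia", "Alabama", "Tennessee"],
})

exact = toy[toy["county"] == "Loudoun"]                   # no row has this exact value
prefix = toy[toy["county"].str.startswith("Loudoun")]     # matches 'Loudoun County'
in_state = prefix[prefix["state"] == "Virginia"]

print(len(exact), len(prefix), len(in_state))  # 0 1 1
```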


# Defining Microbusiness Density Further

https://www.godaddy.com/ventureforward/microbusiness-datahub/

Microbusiness density measures the number of microbusinesses per 100 residents aged 18 and older. Higher microbusiness density is associated with a range of economic benefits.

In [42]:
## ingest the Venture Forward microbusiness density data
cbas=pd.read_csv(pathVF+"/VF_md_cbsas_Q222.csv")
cities=pd.read_csv(pathVF+"/VF_md_cities_Q222.csv")
counties=pd.read_csv(pathVF+"/VF_md_counties_Q222.csv")
states=pd.read_csv(pathVF+"/VF_md_states_Q222.csv")

In [43]:
cbas.head()

Unnamed: 0,cbsa,city_name,micro_metro,total_pop_20,activeaug19,activesep19,activeoct19,activenov19,activedec19,activejan20,...,mdsep21,mdoct21,mdnov21,mddec21,mdjan22,mdfeb22,mdmar22,mdapr22,mdmay22,mdjun22
0,10100.0,"Aberdeen, SD",Micropolitan Statistical Area,42864.0,931,928,970,971,933,934,...,1.378673,1.449202,1.442485,1.439127,1.434653,1.438005,1.434653,1.541588,1.544946,1.325443
1,10140.0,"Aberdeen, WA",Micropolitan Statistical Area,73769.0,2282,2289,2308,2304,2287,2331,...,2.792293,2.815263,2.790858,2.726255,2.589748,2.585528,2.582714,3.034267,3.096163,2.932984
2,10180.0,"Abilene, TX",Metropolitan Statistical Area,171354.0,4846,4834,4855,4826,4838,5037,...,2.32611,2.319005,2.324216,2.319953,2.32608,2.354674,2.354674,2.553133,2.551703,2.186111
3,10220.0,"Ada, OK",Micropolitan Statistical Area,38385.0,1230,1216,1224,1223,1221,1237,...,1.879739,1.886759,1.895535,1.904311,1.924632,1.935178,1.922874,2.100397,2.109186,1.980877
4,10300.0,"Adrian, MI",Micropolitan Statistical Area,98310.0,5622,5792,5801,5799,5831,5852,...,4.543527,4.534382,4.548516,4.592579,4.562569,4.580694,4.589756,4.944842,4.971206,4.810552


In [44]:
cities.head()

Unnamed: 0,city_id,city,state_abbrev,pop_18over_2020,activeaug19,activesep19,activeoct19,activenov19,activedec19,activejan20,...,mdsep21,mdoct21,mdnov21,mddec21,mdjan22,mdfeb22,mdmar22,mdapr22,mdmay22,mdjun22
0,1,Abbeville,LA,37706,264,269,265,269,277,275,...,0.652416,0.652416,0.65772,0.65772,0.750544,0.750544,0.729327,0.944147,0.944147,0.957407
1,2,Aberdeen,NC,10076,364,361,360,360,356,388,...,3.552997,3.334657,3.423978,3.354506,3.384279,3.354506,3.36443,3.771338,3.900357,4.138547
2,3,Aberdeen,SD,50386,751,746,792,794,758,760,...,1.329734,1.401183,1.393244,1.38729,1.389275,1.395229,1.391259,1.474616,1.474616,1.208669
3,4,Aberdeen,MD,19176,688,667,662,665,665,677,...,3.233208,3.238423,3.196704,3.14977,3.139341,3.113266,3.097622,3.634752,3.660826,3.551314
4,5,Aberdeen,WA,19342,549,550,551,549,550,548,...,1.918106,2.000827,2.052528,2.042188,2.078379,2.078379,2.099059,2.357564,2.404095,1.974977


In [45]:
counties.head()

Unnamed: 0,cfips,county,state,total_pop_20,activeaug19,activesep19,activeoct19,activenov19,activedec19,activejan20,...,mdsep21,mdoct21,mdnov21,mddec21,mdjan22,mdfeb22,mdmar22,mdapr22,mdmay22,mdjun22
0,1001.0,Autauga,AL,55639.0,222,219,220,221,221,225,...,0.46473,0.467101,0.481328,0.497925,2.567301,2.562594,2.564947,3.261483,3.289721,3.748588
1,1003.0,Baldwin,AL,218289.0,11837,12053,12061,12128,12335,12324,...,7.293136,7.310543,7.332753,7.336955,6.991991,7.035191,7.025266,7.804035,8.400663,8.157225
2,1005.0,Barbour,AL,25026.0,92,124,96,96,95,95,...,0.453775,0.453775,0.448788,0.463748,1.060392,1.050293,1.040194,1.161382,1.156332,1.070491
3,1007.0,Bibb,AL,22374.0,41,39,41,40,40,42,...,0.279924,0.268727,0.229538,0.223939,1.163575,1.191681,1.152333,1.247892,1.275998,1.062395
4,1009.0,Blount,AL,57755.0,68,69,69,68,69,68,...,0.124176,0.121918,0.124176,0.126434,1.487056,1.489309,1.448753,1.710114,1.725886,1.712367
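
The Venture Forward files are wide: each month is a pair of columns such as `activeaug19` and `mdaug19`. A minimal sketch of reshaping that layout into one row per county and month follows, assuming every monthly column follows the `<measure><mon><yy>` pattern visible in the headers above. The toy values and the `tidy` frame are illustrative only.

```python
import pandas as pd

# Toy stand-in for VF_md_counties_Q222.csv; the numbers are made up.
wide = pd.DataFrame({
    "cfips": [1001, 1003],
    "activeaug19": [222, 11837],
    "activesep19": [219, 12053],
    "mdaug19": [0.5, 7.0],
    "mdsep19": [0.6, 7.1],
})

# Wide -> long, then split each column name into measure, month, and year.
long_df = wide.melt(id_vars="cfips", var_name="column", value_name="value")
parts = long_df["column"].str.extract(r"^(active|md)([a-z]{3})(\d{2})$")
long_df["measure"] = parts[0]
# "aug" + "19" -> "Aug19" -> 2019-08-01
long_df["month"] = pd.to_datetime(parts[1].str.title() + parts[2], format="%b%y")

# One row per (cfips, month) with 'active' and 'md' as columns.
tidy = (
    long_df.pivot_table(index=["cfips", "month"], columns="measure", values="value")
           .reset_index()
)
print(tidy)
```

In this shape the `month` column is directly comparable to `first_day_of_month` in train.csv.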
