# Competition Information:

## Dataset Description

In this competition, the challenge is to forecast microbusiness activity across the United States, as measured by the density of microbusinesses in US counties. Microbusinesses are often too small or too new to show up in traditional economic data sources, but microbusiness activity may be correlated with other economic indicators of general interest.

As historic economic data are widely available, this is a forecasting competition. The forecasting phase public leaderboard and final private leaderboard will be determined using data gathered after the submission period closes. You will make static forecasts that can only incorporate information available before the end of the submission period. This means that, while we will rescore submissions during the forecasting period, we will not rerun any notebooks.

## File Information:

A great deal of data is publicly available about counties and we have not attempted to gather it all here. You are strongly encouraged to use external data sources for features.

train.csv

    row_id - An ID code for the row.
    cfips - A unique identifier for each county using the Federal Information Processing Standards (FIPS). The first two digits correspond to the state FIPS code, while the following three represent the county.
    county_name - The written name of the county.
    state_name - The name of the state.
    first_day_of_month - The date of the first day of the month.
    microbusiness_density - Microbusinesses per 100 people over the age of 18 in the given county. This is the target variable. The population figures used to calculate the density are on a two-year lag due to the pace of update provided by the U.S. Census Bureau, which provides the underlying population data annually. 2021 density figures are calculated using 2019 population figures, etc.
    active - The raw count of microbusinesses in the county. Not provided for the test set.
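
The relationship between `active` and `microbusiness_density` described above can be sketched as follows. This is a minimal illustration, assuming the density is exactly 100 times the raw count divided by the lagged adult population; the function names and the population figure are hypothetical and do not come from the competition files.

```python
def microbusiness_density(active: int, adult_population: int) -> float:
    """Microbusinesses per 100 people over the age of 18."""
    if adult_population <= 0:
        raise ValueError("adult_population must be positive")
    return 100.0 * active / adult_population


def population_year_for(density_year: int, lag_years: int = 2) -> int:
    """Year of the population figure used for a given density year (two-year lag)."""
    return density_year - lag_years


# Hypothetical county: 1,500 active microbusinesses, 50,000 adults.
example_density = microbusiness_density(1_500, 50_000)
example_population_year = population_year_for(2021)
print(example_density, example_population_year)
```

Because the population denominator changes once a year, a county's density can shift at a year boundary even when `active` stays flat.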

sample_submission.csv A valid sample submission. This file will remain unchanged throughout the competition.

    row_id - An ID code for the row.
    microbusiness_density - The target variable.

test.csv Metadata for the submission rows. This file will remain unchanged throughout the competition.

    row_id - An ID code for the row.
    cfips - A unique identifier for each county using the Federal Information Processing Standards (FIPS). The first two digits correspond to the state FIPS code, while the following three represent the county.
    first_day_of_month - The date of the first day of the month.
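
Since `cfips` is read from CSV as an integer, states with a leading zero (for example Alabama, `01`) lose that digit. The sketch below shows one way to split the code into its state and county parts; `split_cfips` is a hypothetical helper name, and it assumes every code fits in five digits.

```python
def split_cfips(cfips) -> tuple[str, str]:
    """Split a county FIPS code into (state_code, county_code) strings."""
    code = str(int(cfips)).zfill(5)  # restore the leading zero dropped by integer parsing
    if len(code) != 5:
        raise ValueError(f"expected at most five digits, got {cfips!r}")
    return code[:2], code[2:]


print(split_cfips(1001))   # four-digit code as stored in the CSV
print(split_cfips(56045))  # five-digit code
```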

revealed_test.csv During the submission period, only the most recent month of data will be used for the public leaderboard. Any test set data older than that will be published in revealed_test.csv, closely following the usual data release cycle for the microbusiness report. We expect to publish one copy of revealed_test.csv in mid-February. This file's schema will match train.csv.
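
Because revealed_test.csv shares the schema of train.csv, it can be appended to the training data once it is published. The following is a sketch under that assumption only; the rows are made-up stand-ins held in memory, and `extend_training_data` is a hypothetical helper, not part of the competition materials.

```python
import io

import pandas as pd

# Made-up rows standing in for train.csv and revealed_test.csv.
train_csv = io.StringIO(
    "row_id,cfips,first_day_of_month,microbusiness_density,active\n"
    "1001_2022-09-01,1001,2022-09-01,3.40,1450\n"
    "1001_2022-10-01,1001,2022-10-01,3.45,1470\n"
)
revealed_csv = io.StringIO(
    "row_id,cfips,first_day_of_month,microbusiness_density,active\n"
    "1001_2022-11-01,1001,2022-11-01,3.44,1465\n"
)


def extend_training_data(train: pd.DataFrame, revealed: pd.DataFrame) -> pd.DataFrame:
    """Append revealed rows to train, keeping one row per row_id in date order."""
    combined = pd.concat([train, revealed], ignore_index=True)
    combined = combined.drop_duplicates(subset="row_id", keep="last")
    return combined.sort_values(["cfips", "first_day_of_month"]).reset_index(drop=True)


train_df = pd.read_csv(train_csv, parse_dates=["first_day_of_month"])
revealed_df = pd.read_csv(revealed_csv, parse_dates=["first_day_of_month"])
extended = extend_training_data(train_df, revealed_df)
print(extended[["row_id", "microbusiness_density"]])
```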

census_starter.csv Examples of useful columns from the Census Bureau's American Community Survey (ACS) at data.census.gov. The percentage fields were derived from the raw counts provided by the ACS. All fields have a two-year lag to match what information was available at the time a given microbusiness data update was published.

    pct_bb_[year] - The percentage of households in the county with access to broadband of any type. Derived from ACS table B28002: PRESENCE AND TYPES OF INTERNET SUBSCRIPTIONS IN HOUSEHOLD.
    cfips - The CFIPS code.
    pct_college_[year] - The percent of the population in the county over age 25 with a 4-year college degree. Derived from ACS table S1501: EDUCATIONAL ATTAINMENT.
    pct_foreign_born_[year] - The percent of the population in the county born outside of the United States. Derived from ACS table DP02: SELECTED SOCIAL CHARACTERISTICS IN THE UNITED STATES.
    pct_it_workers_[year] - The percent of the workforce in the county employed in information related industries. Derived from ACS table S2405: INDUSTRY BY OCCUPATION FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER.
    median_hh_inc_[year] - The median household income in the county. Derived from ACS table S1901: INCOME IN THE PAST 12 MONTHS (IN 2021 INFLATION-ADJUSTED DOLLARS).
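
The two ideas in this file description, percentages derived from raw counts and a two-year lag, can be sketched as below. Both helpers are hypothetical. In particular, pairing a microbusiness data year with the column whose suffix is two years earlier is an assumption about how the `[year]` suffix should be read, so check it against the actual file before relying on it.

```python
def percent_of_total(count: float, total: float) -> float:
    """Convert a raw ACS count into a percentage of its universe."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total


def lagged_census_column(prefix: str, data_year: int, lag_years: int = 2) -> str:
    """Column name assumed to pair with a microbusiness data year (two-year lag)."""
    return f"{prefix}_{data_year - lag_years}"


# Hypothetical county: 16,000 of 20,000 households report a broadband subscription.
print(percent_of_total(16_000, 20_000))
print(lagged_census_column("pct_bb", 2021))
```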

In [None]:
# Standard Python Modules:
import os
from datetime import datetime


# Data Wrangling:
import numpy as np
import pandas as pd


# Machine Learning:

# Feature/Training Enhancements:
from sklearn.impute import KNNImputer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split


# Models:
from sklearn.ensemble import RandomForestRegressor


# Model Performance Metrics:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import explained_variance_score


# Deep Learning:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
train = pd.read_csv("train.csv")
train.columns = [feature_name.upper() for feature_name in train.columns]
train.FIRST_DAY_OF_MONTH = pd.to_datetime(train.FIRST_DAY_OF_MONTH, format="%Y-%m-%d")
train["YEAR"] = train.FIRST_DAY_OF_MONTH.dt.year
train["MONTH"] = train.FIRST_DAY_OF_MONTH.dt.month
# Zero-pad CFIPS to five digits so four-digit codes (e.g. 1001) split into state "01" and county "001".
train["STATE_FIPS_CODE"] = train.CFIPS.astype(str).str.zfill(5).str[:2]
train["COUNTY_FIPS_CODE"] = train.CFIPS.astype(str).str.zfill(5).str[2:]
del (
    train["COUNTY"],
    train["STATE"],
    train["ROW_ID"],
    train["ACTIVE"],
    train["FIRST_DAY_OF_MONTH"],
)
print(train.shape, "(ROWS, COLUMNS)")
train.head(3)

In [None]:
test = pd.read_csv("test.csv")
test.columns = [feature_name.upper() for feature_name in test.columns]
test.FIRST_DAY_OF_MONTH = pd.to_datetime(test.FIRST_DAY_OF_MONTH, format="%Y-%m-%d")
test["YEAR"] = test.FIRST_DAY_OF_MONTH.dt.year
test["MONTH"] = test.FIRST_DAY_OF_MONTH.dt.month
# Same zero-padded split as for train, so both frames produce identical state dummies.
test["STATE_FIPS_CODE"] = test.CFIPS.astype(str).str.zfill(5).str[:2]
test["COUNTY_FIPS_CODE"] = test.CFIPS.astype(str).str.zfill(5).str[2:]
del test["ROW_ID"], test["FIRST_DAY_OF_MONTH"]
print(test.shape, "(ROWS, COLUMNS)")
test.head(3)

In [None]:
census_starter = pd.read_csv("census_starter.csv")
census_starter.columns = [
    feature_name.upper() for feature_name in census_starter.columns
]
census_starter.head(3)

In [None]:
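# CFIPS is an identifier, not a feature, so keep it out of the KNN distance computation.
census_features = census_starter.drop(columns=["CFIPS"])
imputed_census_starter = pd.DataFrame(
    KNNImputer(n_neighbors=5).fit_transform(census_features),
    columns=list(census_features.columns),
)
imputed_census_starter.insert(0, "CFIPS", census_starter["CFIPS"])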
imputed_census_starter.head(3)

In [None]:
pd.concat(
    [
        census_starter.mean().rename("ORIGINAL_MEAN"),
        imputed_census_starter.mean().rename("IMPUTED_MEAN"),
        census_starter.std().rename("ORIGINAL_SD"),
        imputed_census_starter.std().rename("IMPUTED_SD"),
    ],
    axis=1,
)

In [None]:
del census_starter

In [None]:
print(train.shape)
train = pd.merge(left=train, right=imputed_census_starter, on=["CFIPS"], how="inner")
print(train.shape)
train.head(3)

In [None]:
print(test.shape)
test = pd.merge(left=test, right=imputed_census_starter, on=["CFIPS"], how="inner")
print(test.shape)
test.head(3)

In [None]:
train.STATE_FIPS_CODE.nunique(), test.STATE_FIPS_CODE.nunique()

In [None]:
print(train.shape)
train = pd.concat(
    [
        train,
        pd.get_dummies(
            train["STATE_FIPS_CODE"], prefix="STATE_FIPS_CODE_", drop_first=True
        ),
    ],
    axis=1,
)
del train["STATE_FIPS_CODE"]
print(train.shape)

In [None]:
print(test.shape)
test = pd.concat(
    [
        test,
        pd.get_dummies(
            test["STATE_FIPS_CODE"], prefix="STATE_FIPS_CODE_", drop_first=True
        ),
    ],
    axis=1,
)
del test["STATE_FIPS_CODE"]
print(test.shape)

In [None]:
feature_remove_list = ["CFIPS", "MICROBUSINESS_DENSITY"]

X = train[[f for f in train.columns if f not in feature_remove_list]]
y = train["MICROBUSINESS_DENSITY"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

In [None]:
random_forest_estimator = RandomForestRegressor(
    n_estimators=2,
    criterion="squared_error",
    max_depth=4,
    min_samples_split=0.01,
    min_samples_leaf=0.09,
    min_weight_fraction_leaf=0.0,
    max_features=1.0,
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    bootstrap=True,
    oob_score=False,
    n_jobs=-1,
    random_state=1024,
    verbose=0,
    warm_start=False,
    ccp_alpha=0.0,
    max_samples=None,
).fit(X_train, y_train)

In [None]:
training_metrics = pd.DataFrame({
    'ACTUAL': y_train,
    'PREDICTED': random_forest_estimator.predict(X_train)
})

test_metrics = pd.DataFrame({
    'ACTUAL': y_test,
    'PREDICTED': random_forest_estimator.predict(X_test)
})

In [None]:
print(
    'EXPLAINED_VARIANCE_TRAIN:', 
    explained_variance_score(training_metrics['ACTUAL'], training_metrics['PREDICTED']),
    '\nEXPLAINED_VARIANCE_TEST:', 
    explained_variance_score(test_metrics['ACTUAL'], test_metrics['PREDICTED'])
)

In [None]:
print(
    'MSE_TRAIN:', 
    mean_squared_error(training_metrics['ACTUAL'], training_metrics['PREDICTED']),
    '\nMSE_TEST:', 
    mean_squared_error(test_metrics['ACTUAL'], test_metrics['PREDICTED'])
)

In [24]:
submission = pd.read_csv('test.csv')
# Select the features by name, in the same order used for training (X.columns excludes CFIPS and the target).
submission['microbusiness_density'] = random_forest_estimator.predict(test[X.columns])
del submission['cfips'], submission['first_day_of_month']
submission.to_csv('rf_submission_02.csv', index=False)
submission.head(3)

Unnamed: 0,row_id,microbusiness_density
0,1001_2022-11-01,4.654223
1,1003_2022-11-01,4.654223
2,1005_2022-11-01,4.654223


In [None]:
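from tensorflow.keras import Input

model = Sequential()
# The input width is the number of feature columns (shape[1]), not the number of rows.
model.add(Input(shape=(X_train.shape[1],)))
model.add(Dense(100, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(1, activation='linear'))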
model.summary()

In [None]:
model.compile(loss='mse', optimizer='adam', metrics=['mse','mae'])

In [None]:
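# Keras needs a purely numeric array, so cast the features (string FIPS codes, boolean dummies) to float32.
history = model.fit(X_train.astype('float32'), y_train, epochs=150, batch_size=50, verbose=1, validation_split=0.3)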

In [None]:
import matplotlib.pyplot as plt

print(history.history.keys())
# Plot training and validation loss per epoch.
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

A KAP (Knowledge, Attitude, Practice) survey is a type of survey that is used to assess people's knowledge, attitudes, and practices related to a specific topic or issue. Here is a general overview of the steps involved in analyzing a KAP survey:

    Check the response rate: Look at the response rate to see how many people completed the survey out of the total number of people who were invited to participate. A high response rate lowers the risk of nonresponse bias, although it does not by itself guarantee that the sample is representative of the population.

    Review the survey questions: Review the survey questions to make sure that they are clear and relevant to the topic of the survey.

    Prepare the data: Prepare the data by cleaning and coding it, so that it is ready for analysis.

    Summarize the data: Summarize the data by calculating descriptive statistics such as means, frequencies, and percentages. This will give you an overall picture of the data.

    Analyze the data: Analyze the data using appropriate statistical tests such as chi-square or t-tests to identify any significant differences between subgroups or to test hypotheses.

    Interpret the results: Interpret the results by comparing them to the objectives of the survey and the relevant literature. Also, consider the limitations of the survey and the potential impact of any biases or confounding variables.

    Communicate the findings: Communicate the findings by creating tables, figures, and reports that present the results in a clear and concise way.
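
The summarize and analyze steps above can be sketched with the standard library alone. The responses below are invented for illustration, the helper names are hypothetical, and the test shown is a plain Pearson chi-square on a 2x2 table without continuity correction. With expected cell counts this small the test is not reliable in practice, so the example only shows the mechanics.

```python
from collections import Counter
from math import erfc, sqrt

# Hypothetical coded responses: (subgroup, answered the knowledge item correctly).
responses = [
    ("urban", True), ("urban", True), ("urban", True), ("urban", False),
    ("rural", True), ("rural", False), ("rural", False), ("rural", False),
]


def summarize(rows):
    """Descriptive summary: {group: (number of respondents, percent correct)}."""
    totals = Counter(group for group, _ in rows)
    correct = Counter(group for group, ok in rows if ok)
    return {group: (n, 100.0 * correct[group] / n) for group, n in totals.items()}


def chi_square_2x2(rows):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for two groups."""
    groups = sorted({group for group, _ in rows})
    if len(groups) != 2:
        raise ValueError("exactly two subgroups are required")
    observed = [
        [sum(1 for g, ok in rows if g == group and ok is value) for value in (True, False)]
        for group in groups
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return stat, erfc(sqrt(stat / 2))


summary = summarize(responses)
statistic, p_value = chi_square_2x2(responses)
print(summary)
print(statistic, p_value)
```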