# Exploratory Data Analysis

I will try to answer the following questions:

* Which columns from `train.csv` are going to be used for training?
* What is the distribution of the target variable?
* Which observations are outliers, and what should we do with them?
* Which external datasets can we use?
* How can we use the external datasets to predict the target variable for the test set and submission?

Useful resources:

* [Detailed EDA - problem decomposition & GIS](https://www.kaggle.com/code/datark1/detailed-eda-problem-decomposition-gis) - an exploratory data analysis notebook
* [GoDaddy Better EDA XGB baseline](https://www.kaggle.com/code/eishkaran/godaddy-better-eda-xgb-baseline) - a good baseline for my analysis
* [Outliers - is there anything to learn from them?](https://www.kaggle.com/competitions/godaddy-microbusiness-density-forecasting/discussion/373149#2098794) - a discussion about outliers
* [The Top Notebook - Explained](https://www.kaggle.com/competitions/godaddy-microbusiness-density-forecasting/discussion/375802) - an analysis of [this notebook (Better XGB Baseline)](https://www.kaggle.com/code/titericz/better-xgb-baseline)

## Concepts

### County

According to [Wikipedia - County (United States)](https://en.wikipedia.org/wiki/County_(United_States)): a county or county equivalent is an administrative or political subdivision of a state that consists of a geographic region with specific boundaries and usually some level of governmental authority.

### CFIPS - County FIPS Code

According to [Wikipedia - Federal Information Processing Standard](https://en.wikipedia.org/wiki/Federal_Information_Processing_Standard): FIPS is a family of standards published by the U.S. government. A county FIPS code is a five-digit code that uniquely identifies a county or county equivalent: the first two digits identify the state and the last three identify the county within that state.

## Data

We will first load the dependencies and the data needed for the whole notebook.

In [1]:
import pandas as pd
import plotly.express as px

# Load the data
train_raw = pd.read_csv('io/dataset/train.csv')

Let's see what the data looks like.

In [4]:
train_raw.head()

| | row_id | cfips | county | state | first_day_of_month | microbusiness_density | active |
|---|---|---|---|---|---|---|---|
| 0 | 1001_2019-08-01 | 1001 | Autauga County | Alabama | 2019-08-01 | 3.007682 | 1249 |
| 1 | 1001_2019-09-01 | 1001 | Autauga County | Alabama | 2019-09-01 | 2.88487 | 1198 |
| 2 | 1001_2019-10-01 | 1001 | Autauga County | Alabama | 2019-10-01 | 3.055843 | 1269 |
| 3 | 1001_2019-11-01 | 1001 | Autauga County | Alabama | 2019-11-01 | 2.993233 | 1243 |
| 4 | 1001_2019-12-01 | 1001 | Autauga County | Alabama | 2019-12-01 | 2.993233 | 1243 |


In this dataset we don't have any missing values.
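
One way to verify the claim about missing values is to count nulls per column. The sketch below is only an illustration: it uses a small hand-built frame with the first two rows shown above as a stand-in for `train_raw`, and the helper name `missing_summary` is my own, not something from the dataset or the referenced notebooks.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values in each column."""
    return df.isna().sum()


# Stand-in for train_raw: the first two rows of train.csv shown above.
sample = pd.DataFrame({
    "row_id": ["1001_2019-08-01", "1001_2019-09-01"],
    "cfips": [1001, 1001],
    "county": ["Autauga County", "Autauga County"],
    "state": ["Alabama", "Alabama"],
    "first_day_of_month": ["2019-08-01", "2019-09-01"],
    "microbusiness_density": [3.007682, 2.88487],
    "active": [1249, 1198],
})

summary = missing_summary(sample)
print(summary)

# True if at least one column contains a missing value.
has_missing = bool(summary.any())
print("Any missing values:", has_missing)
```

On the real data the same call would be `missing_summary(train_raw)`.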

The columns we will use for forecasting are:

* `cfips: int` - County FIPS Code
* `first_day_of_month: Date` - First day of the month
* `microbusiness_density: float` - Microbusiness density (value to predict)
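
As a minimal sketch, the three columns listed above could be selected like this, parsing the date column along the way. The names `FORECAST_COLUMNS` and `select_forecast_columns` are hypothetical, and the small frame stands in for `train_raw`.

```python
import pandas as pd

# Columns kept for forecasting, as listed above.
FORECAST_COLUMNS = ["cfips", "first_day_of_month", "microbusiness_density"]


def select_forecast_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the forecasting columns and parse the month as a datetime."""
    out = df[FORECAST_COLUMNS].copy()
    out["first_day_of_month"] = pd.to_datetime(out["first_day_of_month"])
    return out


# Stand-in for train_raw: two rows taken from the table above.
sample = pd.DataFrame({
    "row_id": ["1001_2019-08-01", "1001_2019-09-01"],
    "cfips": [1001, 1001],
    "county": ["Autauga County", "Autauga County"],
    "state": ["Alabama", "Alabama"],
    "first_day_of_month": ["2019-08-01", "2019-09-01"],
    "microbusiness_density": [3.007682, 2.88487],
    "active": [1249, 1198],
})

train = select_forecast_columns(sample)
print(train.dtypes)
```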

Note: `cfips` is a code that identifies a county. In this dataset it is stored as an integer, but some datasets store it as a 5-character string (2 digits for the state and 3 for the county), which preserves leading zeros.
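
When joining with an external dataset that stores the code as a string, the integer can be zero-padded to five characters. This is a sketch under that assumption; `cfips_to_str` is an illustrative helper name, and the second code in the example is just an arbitrary value that already has five digits.

```python
import pandas as pd


def cfips_to_str(cfips: pd.Series) -> pd.Series:
    """Convert integer county FIPS codes to zero-padded 5-character strings."""
    return cfips.astype(int).astype(str).str.zfill(5)


# 1001 comes from the table above; 56045 is an arbitrary five-digit example.
codes = pd.Series([1001, 56045])

padded = cfips_to_str(codes)
state_fips = padded.str[:2]    # first two characters: state part
county_fips = padded.str[2:]   # last three characters: county part

print(pd.DataFrame({"cfips": padded, "state": state_fips, "county": county_fips}))
```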

## Data anomalies

> @start These notes were copied from [this notebook](https://www.kaggle.com/code/datark1/detailed-eda-problem-decomposition-gis)

Two things to note here:

* I used `cfips` instead of county names because county names are not unique in the USA. A region can be identified unambiguously only by combining the state and the county.
* Not all counties are in the database; more about that below.

It seems that over the years the number of counties has changed: some were deleted while others were created. The full list of these changes can be found on the United States Census Bureau website (https://www.census.gov/programs-surveys/geography/technical-documentation/county-changes.2010.html#list-tab-R3TI1GGRL2FELJQQDI). It is also worth noting that `cfips` also covers independent cities, which are treated as counties but appear under a different name.

A list of counties as of 2022 can be found, for example, here: https://public.opendatasoft.com/explore/dataset/georef-united-states-of-america-county/table/?disjunctive.ste_code&disjunctive.ste_name&disjunctive.coty_code&disjunctive.coty_name&sort=year

> @end
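
The point about county names not being unique can be checked by counting, for each county name, the number of distinct states it appears in. The sketch below runs on made-up rows (the names, states, and codes are placeholders, not real data), and `ambiguous_county_names` is a hypothetical helper; on the real data it would be called with `train_raw`.

```python
import pandas as pd


def ambiguous_county_names(df: pd.DataFrame) -> pd.Series:
    """Return county names that appear in more than one state."""
    states_per_name = df.groupby("county")["state"].nunique()
    return states_per_name[states_per_name > 1].sort_values(ascending=False)


# Made-up rows for illustration only: the same county name in two states.
sample = pd.DataFrame({
    "cfips": [99001, 98001, 99003],
    "county": ["Example County", "Example County", "Other County"],
    "state": ["State A", "State B", "State A"],
})

result = ambiguous_county_names(sample)
print(result)

# cfips stays unique even when the county name is repeated.
print("cfips unique:", sample["cfips"].is_unique)
```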



## Outliers

The data contains some outliers. The notebooks and discussions cited above are useful resources on this topic.

