# Exploratory Data Analysis
## Group 27
### January 2022

With your datasets sourced and cleaned, you will now begin exploring the data. This entails performing Exploratory Data Analysis (EDA). EDA is used to understand and summarize the contents of a dataset. In this case, EDA will help you investigate the specific questions behind the business problem proposed for your capstone project. EDA relies heavily on visualizing the data to assess patterns and identify data characteristics. Please use the **Six Steps of Exploratory Data Analysis** outlined below to guide your work. Refer back to your project scoping document, where you outlined the specific business problems you are looking to solve.

Selecting columns of interest and target feature(s)
* Which columns in your data sets will help you answer the questions posed by
your problem statement? 
* Which columns represent the key pieces of information you want to examine (i.e.
your target variables)?
* How many numerical, textual, datetime etc. columns are in your dataset?
* Pick out any similar columns among your disparate data sets for potential linking
later on in the EDA process
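
The column inventory described above can be sketched with pandas alone. This is a minimal illustration, not the project's actual code: the two small frames and their column names (`county`, `households`, `snap_rate`, `survey_date`, `hardship_rate`) are hypothetical stand-ins for the real datasets.

```python
import pandas as pd

# Hypothetical toy frames standing in for two cleaned datasets.
left_df = pd.DataFrame({
    "county": ["Hinds", "Rankin", "Madison"],
    "households": [120, 95, 80],
    "snap_rate": [0.21, 0.12, 0.09],
    "survey_date": pd.to_datetime(["2020-01-15", "2020-02-15", "2020-03-15"]),
})
right_df = pd.DataFrame({
    "county": ["Hinds", "Rankin"],
    "hardship_rate": [0.19, 0.11],
})

# Group columns by broad type: numeric, datetime, and everything else.
numeric_cols = left_df.select_dtypes(include="number").columns.tolist()
datetime_cols = left_df.select_dtypes(include="datetime").columns.tolist()
other_cols = left_df.select_dtypes(exclude=["number", "datetime"]).columns.tolist()

# Columns present in both frames are candidates for linking later on.
shared_cols = sorted(set(left_df.columns) & set(right_df.columns))

print(len(numeric_cols), len(datetime_cols), len(other_cols), shared_cols)
```

A shared column name is only a candidate key. Whether the values actually match across datasets (spelling, codes, granularity) still has to be checked before merging.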

Explore individual columns for preliminary insights
* How many null values are present in your data (what percentage)?
* Plot one-dimensional distributions of numerical columns (e.g. histograms) and
observe the overall shape of the data (i.e. normal distribution, skewed,
multimodal, discontinuous)
* Compute basic statistics of numerical columns
* Calculate subgroup sizes of text/categorical data (for example with the pandas
`Series.value_counts()` method)
* Explore any date/datetime columns for basic trends. How long is the period of
time covered by the dataset? Do any seasonality trends immediately become
apparent?
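
The single-column checks listed above might look like the following sketch. The frame and its columns (`region`, `income`, `date`) are hypothetical. Plotting is left out so the example runs with pandas alone; with matplotlib installed, `df["income"].plot.hist()` would draw the histogram.

```python
import pandas as pd

# Hypothetical toy data with some missing values.
df = pd.DataFrame({
    "region": ["north", "south", "south", None],
    "income": [30.0, 45.0, None, 52.0],
    "date": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-02-10", "2020-03-01"]),
})

# Percentage of null values per column.
null_pct = df.isna().mean() * 100

# Basic statistics and a rough shape indicator for a numeric column.
summary = df["income"].describe()
skewness = df["income"].skew()

# Subgroup sizes for a categorical column, counting missing values too.
group_sizes = df["region"].value_counts(dropna=False)

# Period covered by the data and number of rows per month.
span = df["date"].max() - df["date"].min()
monthly_counts = df["date"].dt.to_period("M").value_counts().sort_index()

print(null_pct, summary, group_sizes, span, monthly_counts, sep="\n\n")
```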

Plot two-dimensional distributions of your variables of interest against your target
variable(s).
* Across different values of your independent variable, how does the dependent
variable change?
* Which interactions of variables provide the most interesting insights?
* What trends do you see in the data? Do they support or contradict the hypothesis
of your problem statement?
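
Before plotting, a quick tabular view of how the target changes across an independent variable can be useful. The sketch below assumes a hypothetical categorical predictor `household_type` and a hypothetical binary target `food_insecure`; neither name comes from the project datasets.

```python
import pandas as pd

# Hypothetical toy data: one categorical predictor, one binary target.
df = pd.DataFrame({
    "household_type": ["single", "single", "family", "family"],
    "food_insecure": [1, 0, 1, 1],
})

# Mean of the target within each group of the independent variable.
rate_by_type = df.groupby("household_type")["food_insecure"].mean()

# Row-normalized cross-tabulation: share of each outcome per group.
outcome_table = pd.crosstab(
    df["household_type"], df["food_insecure"], normalize="index"
)

print(rate_by_type)
print(outcome_table)
```

The same grouped series can be passed to a bar chart, and two numeric columns can be compared with a scatter plot instead.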

Analyze any correlations between your independent and dependent variables
* Understand and resolve surprising correlations between these variables, and use this information to validate your initial hypothesis.
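
One possible way to survey correlations is a correlation matrix, then ranking predictors by the strength of their relationship with the target. The columns `x`, `y`, and `z` below are hypothetical, with `y` playing the role of the dependent variable.

```python
import pandas as pd

# Hypothetical numeric data; "y" plays the role of the target.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()

# Correlation of each predictor with the target, strongest first.
target_corr = corr["y"].drop("y").sort_values(key=abs, ascending=False)

# Rank-based (Spearman) correlation is less sensitive to outliers.
spearman = df.corr(method="spearman")

print(target_corr)
```

A strong coefficient does not show causation, and a weak Pearson coefficient can hide a nonlinear relationship, so surprising values are worth checking against the two-dimensional plots.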

Craft a compelling story from the work you’ve done in the previous steps
* Which charts, graphs, and tables provide the most compelling evidence in
support of your project idea?
* If your data analysis has largely disproved your initial hypothesis, can you craft a
narrative for this alternative?


In [1]:
# Packages
import pandas as pd

#### Reading Datasets

In [20]:
# Read CPS
cps_path = './../../datasets/Mississipi/'
cps_csvfile = 'CPS_2020Data_MS.csv'
cps_df = pd.read_csv(cps_path + cps_csvfile)

In [12]:
# Read Feeding America
fa_path = './../../datasets/Mississipi/'
fa_csvfile = 'FA_2019Data_MS.csv'
fa_df = pd.read_csv(fa_path + fa_csvfile)

In [13]:
# Read hardship 2016 2017
hs_path = './../../datasets/hardship/'
hs_csvfile = '2016_2017_state_food_hardship.csv'
hs_df = pd.read_csv(hs_path + hs_csvfile)

In [15]:
# Read households by type snap 2015 2019
hh_snap_path = './../../datasets/households/'
hh_snap_csvfile = 'households_by_type_snap2015_2019.csv'
hh_snap_df = pd.read_csv(hh_snap_path + hh_snap_csvfile)

In [16]:
# Read households by metropolitan area (food hardship) 2016 2017
hh_hs_path = './../../datasets/households/'
hh_hs_csvfile = 'households_in_metropolitanAreas_food_hardship_2016_2017.csv'
hh_hs_df = pd.read_csv(hh_hs_path + hh_hs_csvfile)

In [17]:
# Read households by individual snap 2019
hh_2019_path = './../../datasets/households/'
hh_2019_csvfile = 'households_individuals_snap_2019.csv'
hh_2019_df = pd.read_csv(hh_2019_path + hh_2019_csvfile)
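
The six read cells above repeat the same folder-plus-filename pattern. As an optional sketch, a small helper could build the path with `pathlib` instead of string concatenation. The helper name `read_dataset` is hypothetical, and the demonstration writes a throwaway CSV to a temporary directory rather than touching the project files.

```python
from pathlib import Path
import tempfile

import pandas as pd


def read_dataset(folder, filename):
    """Read one CSV located at folder/filename into a DataFrame."""
    return pd.read_csv(Path(folder) / filename)


# Demonstration on a throwaway file, not on the project datasets.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.csv"
    # index=False keeps the row index out of the file, so no
    # "Unnamed: 0" column appears when the file is read back.
    pd.DataFrame({"a": [1, 2], "b": [3, 4]}).to_csv(demo_file, index=False)
    demo_df = read_dataset(tmp, "demo.csv")

print(demo_df)
```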

## CPS First exploration

In [23]:
cps_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2535 entries, 0 to 2534
Data columns (total 29 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  2535 non-null   int64
 1   HRYEAR4     2535 non-null   int64
 2   HRHTYPE     2535 non-null   int64
 3   GTCO        2535 non-null   int64
 4   GTCBSASZ    2535 non-null   int64
 5   GTCSA       2535 non-null   int64
 6   PTDTRACE    2535 non-null   int64
 7   PREMPNOT    2535 non-null   int64
 8   HES8B       2535 non-null   int64
 9   HETS8CO     2535 non-null   int64
 10  HETS8DO     2535 non-null   int64
 11  HES9        2535 non-null   int64
 12  HESP1       2535 non-null   int64
 13  HESP6       2535 non-null   int64
 14  HESP7       2535 non-null   int64
 15  HESP7A      2535 non-null   int64
 16  HESP8       2535 non-null   int64
 17  HETSP9      2535 non-null   int64
 18  HESS1       2535 non-null   int64
 19  HESS2       2535 non-null   int64
 20  HESS3       2535 non-null   in

In [24]:
cps_df.head()

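| Unnamed: 0.1 | Unnamed: 0 | HRYEAR4 | HRHTYPE | GTCO | GTCBSASZ | GTCSA | PTDTRACE | PREMPNOT | HES8B | HETS8CO | ... | HESS2 | HESS3 | HESS4 | HESH2 | HESHF2 | HESHM2 | HESS5 | HESS6 | HESH1 | HESSH2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2981 | 2020 | 0 | 0 | 0 | 0 | -1 | -1 | -1 | -1 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 1 | 2982 | 2020 | 7 | 0 | 0 | 0 | 2 | 1 | 3 | -1 | ... | 3 | 3 | 3 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 2 | 2983 | 2020 | 1 | 0 | 0 | 0 | 2 | 4 | 3 | -1 | ... | 2 | 3 | 3 | 2 | -1 | -1 | -1 | -1 | -1 | -1 |
| 3 | 2984 | 2020 | 1 | 0 | 0 | 0 | 2 | 4 | 3 | -1 | ... | 2 | 3 | 3 | 2 | -1 | -1 | -1 | -1 | -1 | -1 |
| 4 | 2985 | 2020 | 0 | 0 | 0 | 0 | -1 | -1 | -1 | -1 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |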
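
`cps_df.info()` reports no null values, yet many cells in the `head()` output hold `-1`. If `-1` is a placeholder code for "no answer" or "not applicable" (an assumption that should be confirmed against the CPS data dictionary), the real share of missing data is hidden. The sketch below uses a small hypothetical sample with a few of the column names shown above; it drops leftover index columns and converts `-1` to `NaN` before measuring missingness.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking a few CPS columns from the output above.
sample = pd.DataFrame({
    "Unnamed: 0": [2981, 2982, 2983],
    "HRYEAR4": [2020, 2020, 2020],
    "HESS2": [-1, 3, 2],
    "HESH2": [-1, -1, 2],
})

# Drop leftover index columns written by an earlier to_csv call.
index_cols = [c for c in sample.columns if c.startswith("Unnamed")]
cleaned = sample.drop(columns=index_cols)

# ASSUMPTION: -1 marks a missing or not-applicable answer.
cleaned = cleaned.replace(-1, np.nan)

# Percentage of missing values per column after recoding.
missing_pct = cleaned.isna().mean() * 100

print(missing_pct.round(1))
```

If some columns use `-1` as a legitimate value, the replacement should be limited to the survey-response columns rather than applied to the whole frame.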
