# Exploratory Data Analysis

### Anecdotal Evidence

**Problem**: reports built on anecdotal evidence rely on *unpublished* & usually *personal* data

Reasons for Failure:
  * **Small Number of Observations**: if a real difference exists, it is *probably small compared to natural variation*, so a handful of observations cannot reveal it
  * **Selection Bias**: process of selecting data would bias the results
  * **Confirmation Bias**: people are more likely to contribute examples that confirm their stance or counter an opposing claim
  * **Inaccuracy**: anecdotes are often personal stories & are often misremembered, misrepresented, repeated inaccurately, etc.
  
Examples: **Birth of First Babies**
People provide data to support *their own claims*.
  * "My 2 friends that have given birth recently to their first babies, BOTH went almost 2 weeks overdue before going into labor or being induced."
  * "My first one came 2 weeks late and now I think the second one is going to be 2 weeks early."


### A Statistical Approach
###### Solution addressing limitations of anecdotes

Tools of Statistics: used to reach justifiable conclusions & avoid pitfalls.  
Goal is to use collected data to generate statistically valid inferences.

  * **Data Collection**
  * **Descriptive Statistics**: generate statistics that summarize the data concisely and evaluate different ways to visualize data
  * **Exploratory Data Analysis**: look for *patterns, differences*, & *other features* that address the questions we're interested in
     * check for inconsistencies
     * identify limitations
  * **Estimation**: use data from a sample to *estimate characteristics* of the general population
  * **Hypothesis Testing**: When we see apparent effects, *like a difference between 2 groups*, we will evaluate whether the effect *might have happened **by chance**.*
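
One common way to check whether a difference between 2 groups "might have happened by chance" is a permutation test. The sketch below is a minimal illustration, not a method prescribed by these notes: the function name `permutation_p_value`, its parameters, and the pregnancy lengths are all made up for the example.

```python
import numpy as np

def permutation_p_value(group_a, group_b, iters=1000, seed=0):
    """Estimate how often a difference in means at least as large as the
    observed one appears when group labels are shuffled at random."""
    rng = np.random.default_rng(seed)
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])

    count = 0
    for _ in range(iters):
        # Shuffling the pooled values simulates "no real group effect".
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        if diff >= observed:
            count += 1
    return count / iters

# Made-up pregnancy lengths in weeks, for illustration only.
first_babies = [39, 41, 40, 38, 42, 39, 40, 41]
other_babies = [39, 38, 40, 39, 38, 40, 39, 37]

p_value = permutation_p_value(first_babies, other_babies)
print(f"estimated p-value: {p_value:.3f}")
```

A small p-value suggests the observed difference would be unusual if the group labels did not matter; a large one means chance alone could plausibly produce it.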
     
**Sampling**
  * **sample**: a subset of a population used to collect data
  * **representative**: a sample is representative if every member of the population has the same chance of being in the sample.
  * **oversampling**: technique of increasing the representation of a sub-population in order to avoid errors due to small sample sizes.

### Study Types

  1. **Cross-Sectional**: snapshot of a group *at a point in time*
    * **Cycles**: each repetition of the study
    * meant to be **representative**, which means that every member of the target population has an equal chance of being in the sample
    * some datasets are not **representative** and deliberately **oversample** certain groups instead
      * **Oversampling**: technique of increasing the representation of a sub-population to avoid errors due to small sample sizes
      * **Ex**: ensuring that the number of respondents in each group is *large enough* to draw valid statistical inferences
      * **Drawback**: not easy to draw conclusions about the general population based on statistics from the survey
  2. **Longitudinal**: observes a group *repeatedly over a period of time*
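
The oversampling drawback noted above can be illustrated with a small simulation. This is a sketch under stated assumptions: the population, the group labels `A` and `B`, the sample size of 200 per group, and the weighting scheme (group size divided by group sample size) are all hypothetical and chosen only to show why a naive average of an oversampled survey misleads.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical population: 95% in group "A", 5% in group "B".
population = pd.DataFrame({"group": ["A"] * 9500 + ["B"] * 500})
population["value"] = np.where(
    population["group"] == "A",
    rng.normal(10.0, 2.0, len(population)),
    rng.normal(14.0, 2.0, len(population)),
)

# Oversample: draw 200 respondents from each group regardless of its size.
sample = population.groupby("group").sample(n=200, random_state=1)

# Each respondent stands in for (group size / group sample size) people.
group_sizes = population["group"].value_counts()
sample_sizes = sample["group"].value_counts()
sample = sample.assign(weight=sample["group"].map(group_sizes / sample_sizes))

naive_mean = sample["value"].mean()
weighted_mean = np.average(sample["value"], weights=sample["weight"])
true_mean = population["value"].mean()

print(f"population mean:      {true_mean:.2f}")
print(f"naive sample mean:    {naive_mean:.2f}")
print(f"weighted sample mean: {weighted_mean:.2f}")
```

The naive mean overstates the influence of the small group, while the weighted mean should land much closer to the population value.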

### Data Transformation

**Data cleaning**: when you import data, you often:
  * check for errors
  * deal with special values
  * convert data into different formats
  * perform calculations
  
Data cleaning covers processes such as validating data and identifying errors.

**NSFG Examples**
  * agepreg: convert units from centiyears to years (divide by 100)
  * birthwgt_lb & birthwgt_oz: change special encodings to np.nan using Pandas' **replace** method
     * special encodings are non-weight (codes/variab
  * totalwgt_lb: create a new column that combines the pounds & ounces columns into a single quantity, in pounds
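
The three transformations listed above can be sketched with pandas as follows. The tiny DataFrame is made up for illustration, and the special codes `97`, `98`, `99` are an assumption here; the actual codes should be taken from the survey codebook.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for the pregnancy table; values are illustrative.
preg = pd.DataFrame({
    "agepreg": [2575.0, 3316.0, 1858.0],  # recorded in centiyears
    "birthwgt_lb": [7, 99, 6],            # 99 plays the role of a special code
    "birthwgt_oz": [8, 98, 0],            # 98 plays the role of a special code
})

# Assumed special codes (not real weights); check the codebook for the real ones.
special_codes = [97, 98, 99]

# Centiyears -> years.
preg["agepreg"] = preg["agepreg"] / 100.0

# Replace special encodings with NaN so they do not distort statistics.
preg["birthwgt_lb"] = preg["birthwgt_lb"].replace(special_codes, np.nan)
preg["birthwgt_oz"] = preg["birthwgt_oz"].replace(special_codes, np.nan)

# Combine pounds and ounces into a single weight in pounds.
preg["totalwgt_lb"] = preg["birthwgt_lb"] + preg["birthwgt_oz"] / 16.0

print(preg)
```

Rows whose pounds or ounces were special codes end up with `NaN` in `totalwgt_lb`, which pandas skips by default in methods such as `mean()`.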
  

### Data Validation

When data is exported from one source & imported into another, *errors can be introduced.* It's important to validate data by:

  * **compute basic statistics & compare to published results**
    * value counts (bins for each possible value)
    
  

#### Value Counts (Pandas)
##### Series.value_counts()
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html



```python
# from code.nsfg import *
import nsfg

# Load the pregnancy data and count how often each outcome code occurs.
preg = nsfg.ReadFemPreg()
preg.outcome.value_counts().sort_index()
```

```
1    9148
2    1862
3     120
4    1921
5     190
6     352
Name: outcome, dtype: int64
```
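
Comparing value counts against published results can be automated with a small helper. The sketch below is hypothetical: `compare_counts` is not part of `nsfg` or pandas, and both the data and the "published" counts are made up to show the idea. It assumes integer codes, as in the `outcome` column.

```python
import pandas as pd

def compare_counts(series, expected):
    """Return {value: (observed, expected)} for every value whose
    observed count differs from the expected count."""
    # Convert to plain ints so the result compares and prints cleanly.
    observed = {int(k): int(v) for k, v in series.value_counts().items()}
    mismatches = {}
    for value in sorted(set(observed) | set(expected)):
        got = observed.get(value, 0)
        want = expected.get(value, 0)
        if got != want:
            mismatches[value] = (got, want)
    return mismatches

# Made-up data and made-up "published" counts, for illustration only.
outcome = pd.Series([1, 1, 1, 2, 2, 4])
published = {1: 3, 2: 2, 4: 2}

print(compare_counts(outcome, published))
```

An empty result means every count matches; any entry points at a value worth investigating before further analysis.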