# Exploratory Data Analysis

In [1]:
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print(f"Downloaded {local}")

download("https://github.com/AllenDowney/Thinkstats/raw/v3/nb/thinkstats.py")

In [2]:
try:
    import empiricaldist
except ImportError:
    %pip install empiricaldist

In [3]:
# Imports 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from IPython.display import HTML
from thinkstats import decorate

# Evidence
Evidence based on data that is unpublished and usually personal is known as **anecdotal evidence**.
Anecdotes are often personal stories, and they might be misremembered, misrepresented, or repeated inaccurately.

To address the limitations of anecdotes, we use statistical tools such as *data collection*, *descriptive statistics*, *exploratory data analysis (EDA)*, *estimation*, and *hypothesis testing*.

# The National Survey of Family Growth
---

- The U.S. Centers for Disease Control and Prevention (**CDC**) conducts the National Survey of Family Growth (***NSFG***), which gathers information on family life, marriage, divorce, pregnancy, and related topics.
- We will use data collected by this survey to investigate whether ***first babies tend to be born late***, among other questions.

## General Terms
- The goal of a statistical study is to draw conclusions about a **population**. In the NSFG, the target population is people in the United States aged 15-44.
- We collect data from a subset of the population called a **Sample**.
- The people who participate in a survey are called **Respondents**. 

## Cross-Sectional Study
- The NSFG is a **cross-sectional** study, which means that it captures a snapshot of a population at a point in time. It has been conducted several times, and each deployment is called a **cycle**.
- Cross-sectional studies are meant to be **representative**, which means that ***the sample is similar to the target population***.
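
To make the idea of a representative sample concrete, here is a minimal sketch using made-up data. The population, the sample sizes, and the "convenience sample" rule are all assumptions for illustration and have nothing to do with how the NSFG actually selects respondents.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical target population: 100,000 people with ages 15-44 (made-up data).
population_ages = rng.integers(15, 45, size=100_000)

# Random sample: every member of the population has the same chance of selection.
random_sample = rng.choice(population_ages, size=1_000, replace=False)

# Convenience sample: only people under 25, as if recruited at a single campus.
convenience_sample = population_ages[population_ages < 25][:1_000]

print(f"population mean age:         {population_ages.mean():.1f}")
print(f"random sample mean age:      {random_sample.mean():.1f}")
print(f"convenience sample mean age: {convenience_sample.mean():.1f}")
```

The random sample's mean age should land close to the population's, while the convenience sample is systematically younger, so conclusions drawn from it would not carry over to the target population.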


# Downloading the NSFG Data set
---

In [5]:
download("https://github.com/AllenDowney/ThinkStats/raw/v3/data/2002FemPreg.dct")
download("https://github.com/AllenDowney/ThinkStats/raw/v3/data/2002FemPreg.dat.gz")

Downloaded 2002FemPreg.dct
Downloaded 2002FemPreg.dat.gz


In [6]:
try:
    import statadict
except ImportError:
    %pip install statadict

Collecting statadict
  Using cached statadict-1.1.0-py3-none-any.whl.metadata (1.7 kB)
Using cached statadict-1.1.0-py3-none-any.whl (9.4 kB)
Installing collected packages: statadict
Successfully installed statadict-1.1.0
Note: you may need to restart the kernel to use updated packages.





- The data is stored in two files:
    1. A "*dictionary*" that describes the format of the data (.dct)
    2. A data file (.dat)
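
The split between a dictionary and a data file can be sketched with a toy example. The two-line dictionary below is only loosely modeled on the Stata style, and its exact layout, the regular expression, and the sample record are assumptions for illustration, not the contents of `2002FemPreg.dct`.

```python
import re

# Hypothetical dictionary: start column, storage type, variable name, display format.
toy_dct = """
_column(1)  str4  caseid    %4s  "RESPONDENT ID"
_column(5)  byte  pregordr  %2f  "PREGNANCY ORDER"
"""

# One hypothetical record from the matching fixed-width data file.
toy_line = "0001 1"

# Capture the start column, the variable name, and the field width.
pattern = re.compile(r"_column\((\d+)\)\s+\S+\s+(\S+)\s+%(\d+)")

names, colspecs = [], []
for start, name, width in pattern.findall(toy_dct):
    begin = int(start) - 1  # dictionary columns are 1-based; Python slices are 0-based
    names.append(name)
    colspecs.append((begin, begin + int(width)))

# Slice the record using the positions taken from the dictionary.
record = {name: toy_line[a:b].strip() for name, (a, b) in zip(names, colspecs)}
print(colspecs)
print(record)
```

The data file on its own is just runs of characters; the dictionary is what says where each variable starts and ends.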

In [7]:
dct_file = "2002FemPreg.dct"
dat_file = "2002FemPreg.dat.gz"

# Reading The Data
---
- `read_stata`: a function that reads the dictionary and data files above into a `DataFrame`.
- It is called `read_stata` because the data is stored in a format compatible with a statistical software package called *Stata*.

In [11]:
from statadict import parse_stata_dict

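def read_stata(dct_file, dat_file):
    """Read a gzipped fixed-width data file using its Stata dictionary."""
    # The dictionary supplies each variable's name and character positions.
    stata_dict = parse_stata_dict(dct_file)
    resp = pd.read_fwf(
        dat_file,
        names=stata_dict.names,
        colspecs=stata_dict.colspecs,
        compression="gzip",
    )
    return resp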
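
To see what the `pd.read_fwf` call does without the NSFG files, the following sketch writes a tiny gzipped fixed-width file to a temporary directory and reads it back with the same keyword arguments. The file name, column names, positions, and values are all hypothetical.

```python
import gzip
import os
import tempfile

import pandas as pd

names = ["caseid", "pregordr"]  # hypothetical column names
colspecs = [(0, 4), (4, 6)]     # half-open [start, end) character positions
rows = "0001 1\n0001 2\n0002 1\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.dat.gz")

    # Write the rows as gzip-compressed text.
    with gzip.open(path, "wt") as f:
        f.write(rows)

    # Same arguments as in read_stata, with hand-written names and colspecs.
    toy = pd.read_fwf(path, names=names, colspecs=colspecs, compression="gzip")

print(toy)
```

Because `names` is passed explicitly, no line of the file is treated as a header, and pandas infers integer types for both columns, so the leading zeros in `caseid` are dropped.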

In [12]:
preg = read_stata(dct_file, dat_file)

In [13]:
preg.head()

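|   | caseid | pregordr | howpreg_n | howpreg_p | moscurrp | nowprgdk | pregend1 | pregend2 | nbrnaliv | multbrth | ... | poverty_i | laborfor_i | religion_i | metro_i | basewgt | adj_mod_basewgt | finalwgt | secu_p | sest | cmintvw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 0 | 3410.389399 | 3869.349602 | 6448.271112 | 2 | 9 | 1231 |
| 1 | 1 | 2 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 0 | 3410.389399 | 3869.349602 | 6448.271112 | 2 | 9 | 1231 |
| 2 | 2 | 1 | NaN | NaN | NaN | NaN | 5.0 | NaN | 3.0 | 5.0 | ... | 0 | 0 | 0 | 0 | 7226.30174 | 8567.54911 | 12999.542264 | 2 | 12 | 1231 |
| 3 | 2 | 2 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 0 | 7226.30174 | 8567.54911 | 12999.542264 | 2 | 12 | 1231 |
| 4 | 2 | 3 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 0 | 7226.30174 | 8567.54911 | 12999.542264 | 2 | 12 | 1231 |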


In [14]:
preg.columns

Index(['caseid', 'pregordr', 'howpreg_n', 'howpreg_p', 'moscurrp', 'nowprgdk',
       'pregend1', 'pregend2', 'nbrnaliv', 'multbrth',
       ...
       'poverty_i', 'laborfor_i', 'religion_i', 'metro_i', 'basewgt',
       'adj_mod_basewgt', 'finalwgt', 'secu_p', 'sest', 'cmintvw'],
      dtype='object', length=243)

# Variables in NSFG Dataset
---
- The NSFG dataset contains 243 variables in total.
- Some of the ones we'll use:

    1. `caseid`: integer ID of the respondent.
    2. `pregordr`: pregnancy serial number.
        - **1** for the respondent's first pregnancy.
        - **2** for the second, and so on.
    3. `prglngth`: integer duration of the pregnancy in weeks.
    4. `outcome`: integer code for the outcome of the pregnancy.
        - Code **1** indicates a live birth.
    5. `birthord`: serial number for live births.
        - The code for a respondent's first child is **1**, and so on.
    6. `birthwgt_lb` and `birthwgt_oz`: the pounds and ounces parts of the baby's birth weight.
    7. `agepreg`: mother's age at the end of the pregnancy.
    8. `finalwgt`: statistical weight associated with the respondent.
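
As a sketch of how these variables might be used together, the example below builds a small `DataFrame` with the same column names and entirely made-up values. The outcome code `2` stands in for any non-live-birth code, and `totalwgt_lb` is a hypothetical helper column, not a variable in the NSFG file.

```python
import pandas as pd

# Made-up records that reuse the NSFG column names; none of these values are real.
toy = pd.DataFrame({
    "caseid": [1, 1, 2, 2, 2],
    "pregordr": [1, 2, 1, 2, 3],
    "prglngth": [39, 39, 9, 40, 38],
    "outcome": [1, 1, 2, 1, 1],  # 1 = live birth; 2 stands in for any other code
    "birthord": [1, 2, None, 1, 2],
    "birthwgt_lb": [8, 7, None, 7, 6],
    "birthwgt_oz": [13, 14, None, 0, 3],
})

# Keep only the pregnancies that ended in a live birth.
live = toy[toy["outcome"] == 1].copy()

# Combine pounds and ounces into one weight in pounds (hypothetical column name).
live["totalwgt_lb"] = live["birthwgt_lb"] + live["birthwgt_oz"] / 16

# Split live births into first babies and all others.
first = live[live["birthord"] == 1]
others = live[live["birthord"] > 1]

mean_first = first["prglngth"].mean()
mean_others = others["prglngth"].mean()
print(f"first babies: {mean_first} weeks, others: {mean_others} weeks")
```

With the real data, the same filtering on `outcome` and `birthord` would be the starting point for comparing pregnancy lengths of first babies and others.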