# Exploratory Data Analysis

In [1]:
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print(f"Downloaded {local}")

download("https://github.com/AllenDowney/Thinkstats/raw/v3/nb/thinkstats.py")

In [2]:
try:
    import empiricaldist
except ImportError:
    %pip install empiricaldist

In [3]:
# Imports 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from IPython.display import HTML
from thinkstats import decorate

# Evidence
Evidence that is based on unpublished, usually personal, data is known as **Anecdotal Evidence**.
Anecdotes are often personal stories, and they may be misremembered, misrepresented, or repeated inaccurately.

To address the limitations of anecdotes, we use statistical tools such as *Data Collection*, *Descriptive Statistics*, *Exploratory Data Analysis (EDA)*, *Estimation*, and *Hypothesis Testing*.

# The National Survey of Family Growth
---

- The U.S. Centers for Disease Control and Prevention (**CDC**) conducts the National Survey of Family Growth (***NSFG***), which gathers information on family life, marriage, divorce, pregnancy, and related topics.
- We will use data collected by this survey to investigate whether ***first babies tend to be born late***, along with other questions.

## General Terms
- The goal of a statistical study is to draw conclusions about a **Population**. In the NSFG, the target population is people in the United States aged 15-44.
- We collect data from a subset of the population called a **Sample**.
- The people who participate in a survey are called **Respondents**. 

## Cross-Sectional Study
- The NSFG is a **cross-sectional** study, which means that it captures a snapshot of a population at a point in time. It has been conducted several times, and each deployment is called a **Cycle**.
- Cross-sectional studies are meant to be **representative**, which means that ***the sample is similar to the target population***.
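
The relationship between a population and a sample can be sketched with synthetic numbers. The snippet below is only an illustration: the "population" is made up rather than taken from the NSFG, and simple random sampling is an assumption of the sketch (real surveys often use more elaborate designs).

```python
import numpy as np

# Hypothetical population: 100,000 synthetic "ages" (not NSFG data)
rng = np.random.default_rng(seed=42)
population = rng.normal(loc=30, scale=8, size=100_000)

# A simple random sample of 1,000 members, drawn without replacement
sample = rng.choice(population, size=1_000, replace=False)

# If the sample is representative, its mean should be close to the population mean
print(f"population mean: {population.mean():.2f}")
print(f"sample mean:     {sample.mean():.2f}")
```

With a sample of this size, the two means usually differ by only a fraction of a year.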


# Downloading the NSFG Dataset
---

In [4]:
download("https://github.com/AllenDowney/ThinkStats/raw/v3/data/2002FemPreg.dct")
download("https://github.com/AllenDowney/ThinkStats/raw/v3/data/2002FemPreg.dat.gz")
download("https://github.com/AllenDowney/ThinkStats/raw/v3/nb/nsfg.py")

Downloaded nsfg.py


In [5]:
try:
    import statadict
except ImportError:
    %pip install statadict

Note: you may need to restart the kernel to use updated packages.





- The data is stored in two files:
    1. A "*dictionary*" that describes the format of the data (.dct)
    2. A gzip-compressed data file (.dat.gz)

In [6]:
dct_file = "2002FemPreg.dct"
dat_file = "2002FemPreg.dat.gz"

# Reading The Data
---
- `read_stata` : a function that reads the two files above.
- It is called `read_stata` because this data format is compatible with a statistical software package called *Stata*.

In [7]:
from statadict import parse_stata_dict

def read_stata(dct_file, dat_file):
    stata_dict = parse_stata_dict(dct_file)
    resp = pd.read_fwf(
        dat_file, 
        names=stata_dict.names,
        colspecs=stata_dict.colspecs,
        compression="gzip",
    )
    return resp
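
To see what `pd.read_fwf` does with `names` and `colspecs`, here is a minimal sketch on a tiny in-memory file. The records and the column layout are made up for illustration; for the NSFG file, the real layout comes from the `.dct` dictionary.

```python
import io
import pandas as pd

# Made-up fixed-width records: each field occupies a fixed range of characters
raw = (
    "  1 1 39\n"
    "  1 2 39\n"
    "  2 1 40\n"
)

# Hypothetical layout; for the NSFG file these come from the .dct dictionary
names = ["caseid", "pregordr", "prglngth"]
colspecs = [(0, 3), (3, 5), (5, 8)]  # half-open [start, end) character ranges

toy = pd.read_fwf(io.StringIO(raw), names=names, colspecs=colspecs)
print(toy)
```

Each tuple in `colspecs` tells pandas which characters of a line belong to the corresponding name in `names`.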

In [8]:
preg = read_stata(dct_file, dat_file)

In [9]:
preg.head()

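| | caseid | pregordr | howpreg_n | howpreg_p | moscurrp | nowprgdk | pregend1 | pregend2 | nbrnaliv | multbrth | ... | poverty_i | laborfor_i | religion_i | metro_i | basewgt | adj_mod_basewgt | finalwgt | secu_p | sest | cmintvw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 0 | 3410.389399 | 3869.349602 | 6448.271112 | 2 | 9 | 1231 |
| 1 | 1 | 2 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 0 | 3410.389399 | 3869.349602 | 6448.271112 | 2 | 9 | 1231 |
| 2 | 2 | 1 | NaN | NaN | NaN | NaN | 5.0 | NaN | 3.0 | 5.0 | ... | 0 | 0 | 0 | 0 | 7226.30174 | 8567.54911 | 12999.542264 | 2 | 12 | 1231 |
| 3 | 2 | 2 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 0 | 7226.30174 | 8567.54911 | 12999.542264 | 2 | 12 | 1231 |
| 4 | 2 | 3 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 0 | 7226.30174 | 8567.54911 | 12999.542264 | 2 | 12 | 1231 |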


In [10]:
preg.columns

Index(['caseid', 'pregordr', 'howpreg_n', 'howpreg_p', 'moscurrp', 'nowprgdk',
       'pregend1', 'pregend2', 'nbrnaliv', 'multbrth',
       ...
       'poverty_i', 'laborfor_i', 'religion_i', 'metro_i', 'basewgt',
       'adj_mod_basewgt', 'finalwgt', 'secu_p', 'sest', 'cmintvw'],
      dtype='object', length=243)

# Variables in NSFG Dataset
---
- The NSFG dataset contains 243 variables in total.
- Some of the ones we'll use:

    1. `caseid` : integer ID of the respondent.
    2. `pregordr` : pregnancy serial number.
        - **1** for the respondent's first pregnancy.
        - **2** for the second, and so on.
    3. `prglngth` : integer duration of the pregnancy in weeks.
    4. `outcome` : integer code for the outcome of the pregnancy.
        - Code **1** indicates a live birth.
    5. `birthord` : serial number for live births.
        - The code for a respondent's first child is **1**, and so on.
    6. `birthwgt_lb` and `birthwgt_oz` : the pounds and ounces parts of the baby's birth weight.
    7. `agepreg` : mother's age at the end of the pregnancy.
    8. `finalwgt` : statistical weight associated with the respondent.

# Validation
---
- When data is exported from one software environment and imported into another, errors might be introduced, and these can mislead us while we are still getting familiar with the dataset.
- We validate the data first so that we save time later and avoid errors and inconsistencies.

> NOTE: Check the columns under investigation against the published tables in `Cycle6Codebook-Pregnancy.pdf`.

## 1. Investigating the outcomes of pregnancy
---
Compare the published **outcome** table with our computed counts.

1. `value_counts()` method : counts the number of times each value appears in a column.
2. `sort_index()` method : sorts the series based on the values of the index. 
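
The two methods above can be tried on a small, made-up series of outcome codes before applying them to the real column. The sketch also shows the `dropna=False` argument, which keeps a separate row for missing values.

```python
import numpy as np
import pandas as pd

# Made-up outcome codes, including one missing value
codes = pd.Series([1, 4, 1, 2, np.nan, 1, 4])

# By default value_counts() sorts by frequency and drops NaN
by_frequency = codes.value_counts()

# sort_index() reorders the result by code; dropna=False keeps a row for NaN
by_code = codes.value_counts(dropna=False).sort_index()
print(by_code)
```

Sorting by index makes the result easier to compare line by line with a published table that lists the codes in order.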

In [11]:
preg["outcome"].value_counts().sort_index()

outcome
1    9148
2    1862
3     120
4    1921
5     190
6     352
Name: count, dtype: int64

## 2. Investigating `birthwgt_lb`
---
Comparing our computed statistics for `birthwgt_lb` with the published table.

In [12]:
counts = preg["birthwgt_lb"].value_counts(dropna=False).sort_index()
counts

birthwgt_lb
0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
15.0       1
51.0       1
97.0       1
98.0       1
99.0      57
NaN     4449
Name: count, dtype: int64

In [13]:
counts.loc[:5]

birthwgt_lb
0.0      8
1.0     40
2.0     53
3.0     98
4.0    229
5.0    697
Name: count, dtype: int64

In [14]:
counts.loc[:5].sum()

np.int64(1125)
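
Note that `.loc` slices by label and includes the end label, which is why the slice `:5` above covers the rows for 0 through 5 pounds. A minimal sketch with made-up counts contrasts this with positional slicing via `.iloc`:

```python
import pandas as pd

# Made-up counts indexed by weight in pounds
toy_counts = pd.Series([8, 40, 53, 98], index=[0, 1, 2, 3])

# .loc slices by label and includes the end label (labels 0, 1, 2)
by_label = toy_counts.loc[:2].sum()

# .iloc slices by position and excludes the end position (positions 0, 1)
by_position = toy_counts.iloc[:2].sum()

print(by_label, by_position)
```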

- When we compare this total with the codebook, the two are consistent.
- The values 97, 98, and 99 represent cases where the birth weight is unknown, and a weight of 51 pounds is implausible and most likely an error. To handle these values, we can replace them with NaN.

In [15]:
preg["birthwgt_lb"] = preg["birthwgt_lb"].replace([51, 97, 98, 99], np.nan)
preg["birthwgt_lb"].value_counts().sort_index()

birthwgt_lb
0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
15.0       1
Name: count, dtype: int64
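
Why replacing these codes matters can be seen on a few made-up values. This is a sketch with toy data, not the NSFG column: if the "unknown" codes stay in place, they are treated as real weights and distort every statistic.

```python
import numpy as np
import pandas as pd

# Made-up birth weights in pounds; 97, 98, 99 stand for "unknown"
toy_lb = pd.Series([7.0, 8.0, 99.0, 6.0, 97.0])

# Leaving the codes in place inflates the mean
print(toy_lb.mean())

# Replacing them with NaN lets pandas skip them in mean(), count(), etc.
cleaned = toy_lb.replace([97, 98, 99], np.nan)
print(cleaned.mean(), cleaned.count())
```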

# Transformation
---
Sometimes we have to convert data into different formats or perform other calculations. This is known as ***Transformation***.

## 1. Converting column data to another format

- **Example** : `agepreg` contains the age of the mother at the end of the pregnancy. We can convert it from <ins>centiyears (hundredths of a year)</ins> to <ins>years</ins> using the code below.

- The mean below is the average age of the mother at the end of pregnancy. Because it is in *centiyears*, it is not quite readable, so we convert it to *years* by dividing each value in the `agepreg` column by 100.

In [16]:
preg["agepreg"].mean()

np.float64(2468.8151197039497)

In [17]:
# to convert it to years
preg["agepreg"] /= 100.0
preg["agepreg"].mean()

np.float64(24.6881511970395)

- Hence, the average age of the mother at the end of the pregnancy is about $24.7$ years.

- **Example :** Combine `birthwgt_lb` and `birthwgt_oz`, the pounds and ounces parts of the birth weight, into a single column.

In [18]:
preg["birthwgt_oz"].value_counts(dropna=False).sort_index()

birthwgt_oz
0.0     1037
1.0      408
2.0      603
3.0      533
4.0      525
5.0      535
6.0      709
7.0      501
8.0      756
9.0      505
10.0     475
11.0     557
12.0     555
13.0     487
14.0     475
15.0     378
97.0       1
98.0       1
99.0      46
NaN     4506
Name: count, dtype: int64

- First, clean the data just like `birthwgt_lb` earlier.

In [19]:
preg["birthwgt_oz"] = preg["birthwgt_oz"].replace([97, 98, 99], np.nan)
preg["birthwgt_oz"].value_counts(dropna=False).sort_index()

birthwgt_oz
0.0     1037
1.0      408
2.0      603
3.0      533
4.0      525
5.0      535
6.0      709
7.0      501
8.0      756
9.0      505
10.0     475
11.0     557
12.0     555
13.0     487
14.0     475
15.0     378
NaN     4554
Name: count, dtype: int64

- Create a new column that combines the cleaned pounds and ounces values into a single quantity.

In [20]:
preg["totalwgt_lb"] = preg["birthwgt_lb"] + preg["birthwgt_oz"] / 16.0
preg["totalwgt_lb"].mean()

np.float64(7.265628457623368)
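
The arithmetic behind the combined column can be checked on a few made-up records. The sketch below also shows that a missing value in either part makes the total missing, so such records drop out of statistics like the mean.

```python
import numpy as np
import pandas as pd

# Made-up records: pounds and ounces parts, with one unknown ounces value
lb = pd.Series([7.0, 8.0, 6.0])
oz = pd.Series([6.0, np.nan, 8.0])

# 16 ounces per pound; NaN in either part makes the total NaN
total_lb = lb + oz / 16.0
print(total_lb)
```

For example, 7 pounds 6 ounces becomes 7 + 6/16 = 7.375 pounds.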

# Summary Statistics
---
A statistic is a number derived from a dataset, usually intended to quantify some aspect of the data.
Examples: mean, mode, variance, and standard deviation.


1. `count()` method : returns the number of values that are not `nan`.

In [21]:
weights = preg["totalwgt_lb"]
n = weights.count()
n

np.int64(9038)

2. `mean()` and `sum()` methods : `mean()` returns the mean of the data and `sum()` adds up the values.

In [22]:
mean = weights.sum() / n
mean

np.float64(7.265628457623368)

In [23]:
weights.mean()

np.float64(7.265628457623368)

3. **Variance** : Variance measures the spread of a set of values around their mean. It is the mean of the squared deviations from the mean $\mu$.

$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \mu)^2$

In [24]:
squared_deviations = (weights - mean) ** 2

In [25]:
var = squared_deviations.sum() / n
var

np.float64(1.983070989750022)

In [26]:
weights.var()

np.float64(1.9832904288326545)
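
The manual result and `weights.var()` differ slightly because pandas `var()` uses `ddof=1` by default: it divides the sum of squared deviations by `n - 1` rather than `n`. A minimal sketch with made-up values shows the relationship between the two.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])  # made-up values
n = x.count()
mu = x.mean()

# Population variance: divide the sum of squared deviations by n
var_pop = ((x - mu) ** 2).sum() / n

# pandas var() defaults to ddof=1, which divides by n - 1 instead
var_sample = x.var()

print(var_pop, x.var(ddof=0))             # these two agree
print(var_sample, var_pop * n / (n - 1))  # and so do these
```

With `n = 9038`, the factor `n / (n - 1)` is very close to 1, so the two variances above agree to about three decimal places. Passing `ddof=0` to `var()` matches the manual computation.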

4. **Standard Deviation** : square root of variance

In [27]:
std = np.sqrt(var)
std

np.float64(1.40821553384062)

In [28]:
weights.std()

np.float64(1.4082934455690173)

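# Interpretation
---
Looking at individual records alongside the summary statistics helps us work with the data effectively. For example, we can select all pregnancies of a single respondent and read off their outcomes.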

In [29]:
subset = preg.query("caseid == 10229")
subset.shape

(7, 244)

In [30]:
subset["outcome"].values

array([4, 4, 4, 4, 4, 4, 1])
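
The same `query` pattern can be tried on a small, made-up table. The respondents and outcome codes below are hypothetical and serve only to show how rows are selected.

```python
import pandas as pd

# Made-up pregnancy records for two hypothetical respondents
toy = pd.DataFrame({
    "caseid": [1, 1, 2, 2, 2],
    "pregordr": [1, 2, 1, 2, 3],
    "outcome": [1, 1, 4, 1, 1],
})

# All records for one respondent, in pregnancy order
one = toy.query("caseid == 2").sort_values("pregordr")
print(one["outcome"].values)

# Conditions can be combined with `and`
first = toy.query("caseid == 2 and pregordr == 1")
print(first["outcome"].values)
```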

---
# Exercise
---

## Exercise 1.1

In [31]:
preg["birthord"].value_counts(dropna=False).sort_index()

birthord
1.0     4413
2.0     2874
3.0     1234
4.0      421
5.0      126
6.0       50
7.0       20
8.0        7
9.0        2
10.0       1
NaN     4445
Name: count, dtype: int64

## Exercise 1.2

In [32]:
preg["totalwgt_kg"] = preg["totalwgt_lb"] / 2.2  # about 2.2 lb per kg

In [33]:
preg["totalwgt_kg"].mean().round(3)

np.float64(3.303)

In [34]:
preg["totalwgt_kg"].std(ddof=0).round(3)

np.float64(0.64)
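
A kilogram is heavier than a pound, so a weight in kilograms is the weight in pounds divided by roughly 2.2. The sketch below uses made-up weights and the approximate factor 2.2 (an assumption of this sketch) to show that the mean and the standard deviation scale by the same factor as the data.

```python
import numpy as np
import pandas as pd

LB_PER_KG = 2.2  # approximate factor, an assumption of this sketch

toy_lb = pd.Series([6.5, 7.375, 8.0])  # made-up weights in pounds
toy_kg = toy_lb / LB_PER_KG

# Mean and standard deviation both scale by the same factor
print(np.isclose(toy_kg.mean(), toy_lb.mean() / LB_PER_KG))
print(np.isclose(toy_kg.std(ddof=0), toy_lb.std(ddof=0) / LB_PER_KG))
```

A quick sanity check on any conversion: a newborn's mean weight should come out at a few kilograms, not more than the value in pounds.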

## Exercise 1.3

In [35]:
# respondent
preg_case = preg.query("caseid == 2298")
preg_case.shape

(4, 245)

In [36]:
# pregnancy lengths of the respondent
preg_case["prglngth"].values

array([40, 36, 30, 40])

In [37]:
first_baby = preg.query("caseid == 5013 and pregordr == 1")

In [38]:
first_baby["totalwgt_lb"].values

array([7.375])