![Exploratory Data Analysis in Python](attachment:image.png)

# Read, clean, and validate

## Exploring the NSFG data

In [9]:
import pandas as pd
import numpy as np

In [2]:
nsfg = pd.read_csv("../datasets/nsfg.csv")
nsfg.head()

Unnamed: 0,caseid,outcome,birthwgt_lb1,birthwgt_oz1,prglngth,nbrnaliv,agecon,agepreg,hpagelb,wgt2013_2015
0,60418,1,5.0,4.0,40,1.0,2000,2075.0,22.0,3554.964843
1,60418,1,4.0,12.0,36,1.0,2291,2358.0,25.0,3554.964843
2,60418,1,5.0,4.0,36,1.0,3241,3308.0,52.0,3554.964843
3,60419,6,,,33,,3650,,,2484.535358
4,60420,1,8.0,13.0,41,1.0,2191,2266.0,24.0,2903.782914


### Number of Rows and Columns

In [3]:
nsfg.shape

(9358, 10)

### Column Names

In [4]:
nsfg.columns

Index(['caseid', 'outcome', 'birthwgt_lb1', 'birthwgt_oz1', 'prglngth',
       'nbrnaliv', 'agecon', 'agepreg', 'hpagelb', 'wgt2013_2015'],
      dtype='object')

In [5]:
ounces = nsfg.birthwgt_oz1
ounces.head()

0     4.0
1    12.0
2     4.0
3     NaN
4    13.0
Name: birthwgt_oz1, dtype: float64
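
As a small aside, attribute access such as `nsfg.birthwgt_oz1` is shorthand for bracket access with the column name. The sketch below illustrates this on a made-up frame named `toy`, whose values copy the first five rows shown above; it is not the full NSFG file.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the first five rows displayed above (illustrative only).
toy = pd.DataFrame({
    "birthwgt_lb1": [5.0, 4.0, 5.0, np.nan, 8.0],
    "birthwgt_oz1": [4.0, 12.0, 4.0, np.nan, 13.0],
})

# Attribute access and bracket access select the same column.
ounces_attr = toy.birthwgt_oz1
ounces_item = toy["birthwgt_oz1"]
same = ounces_attr.equals(ounces_item)

# Count the missing values in the selected column.
n_missing = int(ounces_item.isna().sum())
```

Bracket access is the safer habit when a column name contains spaces or collides with a DataFrame method name.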

## Clean and Validate

### Validate a variable

`'outcome'` encodes the outcome of each pregnancy as shown below:

| value | label |
|-------|-------|
| 1 | Live birth |
| 2 | Induced abortion |
| 3 | Stillbirth |
| 4 | Miscarriage |
| 5 | Ectopic pregnancy |
| 6 | Current pregnancy |

In [7]:
nsfg.outcome.value_counts().sort_index()

1    6489
2     947
3      86
4    1469
5     118
6     249
Name: outcome, dtype: int64
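
One way to make this validation step explicit is to compare the observed codes against the codebook and to attach the labels to the counts. The following is a minimal sketch on toy data; `OUTCOME_LABELS` and `unexpected_codes` are hypothetical names introduced here for illustration, with the labels copied from the table above.

```python
import pandas as pd

# Codebook labels for 'outcome', copied from the table above.
OUTCOME_LABELS = {
    1: "Live birth",
    2: "Induced abortion",
    3: "Stillbirth",
    4: "Miscarriage",
    5: "Ectopic pregnancy",
    6: "Current pregnancy",
}


def unexpected_codes(series, valid_codes):
    """Return the sorted non-missing values that are not in valid_codes."""
    observed = set(series.dropna().unique())
    return sorted(observed - set(valid_codes))


# Toy data standing in for nsfg.outcome (illustrative only).
outcome = pd.Series([1, 1, 1, 6, 1, 4, 2], name="outcome")

# An empty list means every observed code appears in the codebook.
bad = unexpected_codes(outcome, OUTCOME_LABELS)

# Map codes to labels so the counts are easier to read.
labeled_counts = outcome.map(OUTCOME_LABELS).value_counts()
```

With the real data, the same check could be run as `unexpected_codes(nsfg.outcome, OUTCOME_LABELS)`.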

### Clean a variable

`'nbrnaliv'` records the number of babies born alive at the end of a pregnancy.

In [8]:
nsfg.nbrnaliv.value_counts().sort_index()

1.0    6379
2.0     100
3.0       5
8.0       1
Name: nbrnaliv, dtype: int64

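The value `8` appears once; it indicates that the respondent refused to answer the question. We replace it with `np.nan` so that it is treated as missing. The same cell also replaces the values `98` and `99` in `'birthwgt_lb1'` and `'birthwgt_oz1'`, which are treated here as missing-data codes.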

In [10]:
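# Replace the "refused" code 8 with NaN. Assigning the result back avoids
# relying on inplace=True through a selected column, which newer pandas
# versions warn about and which may not update the DataFrame.
nsfg['nbrnaliv'] = nsfg['nbrnaliv'].replace(8, np.nan)
display(nsfg['nbrnaliv'].value_counts().sort_index())

# Replace the codes 98 and 99 with NaN in the pounds column
nsfg['birthwgt_lb1'] = nsfg['birthwgt_lb1'].replace([98, 99], np.nan)
display(nsfg['birthwgt_lb1'].value_counts().sort_index())

# Replace the codes 98 and 99 with NaN in the ounces column
nsfg['birthwgt_oz1'] = nsfg['birthwgt_oz1'].replace([98, 99], np.nan)
display(nsfg['birthwgt_oz1'].value_counts().sort_index())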

1.0    6379
2.0     100
3.0       5
Name: nbrnaliv, dtype: int64

0.0        6
1.0       34
2.0       47
3.0       67
4.0      196
5.0      586
6.0     1666
7.0     2146
8.0     1168
9.0      363
10.0      82
11.0      17
12.0       7
13.0       2
14.0       2
17.0       1
Name: birthwgt_lb1, dtype: int64

0.0     757
1.0     297
2.0     429
3.0     393
4.0     386
5.0     407
6.0     543
7.0     346
8.0     518
9.0     377
10.0    295
11.0    418
12.0    388
13.0    275
14.0    258
15.0    268
Name: birthwgt_oz1, dtype: int64
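
When several columns carry special codes, it can help to keep the codes in one place and apply them in a loop. The sketch below shows that idea on a toy frame; `MISSING_CODES` is a hypothetical mapping introduced for illustration, populated only with the codes handled in the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame containing a few special codes (illustrative only).
toy = pd.DataFrame({
    "nbrnaliv": [1.0, 2.0, 8.0, 1.0],
    "birthwgt_lb1": [7.0, 98.0, 6.0, 99.0],
    "birthwgt_oz1": [4.0, 99.0, 12.0, 98.0],
})

# Hypothetical mapping from column name to the codes treated as missing.
MISSING_CODES = {
    "nbrnaliv": [8],
    "birthwgt_lb1": [98, 99],
    "birthwgt_oz1": [98, 99],
}

# Work on a copy so the original frame stays available for comparison.
cleaned = toy.copy()
for column, codes in MISSING_CODES.items():
    # Assign the result back instead of relying on inplace=True.
    cleaned[column] = cleaned[column].replace(codes, np.nan)

# Number of values that became missing in each column.
missing_per_column = cleaned.isna().sum()
```

Comparing `missing_per_column` with the counts of the special codes before cleaning is a quick way to confirm that nothing else was changed.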

### Compute a variable

For each pregnancy in the NSFG dataset, the variable `'agecon'` encodes the respondent's age at conception, and `'agepreg'` encodes the respondent's age at the end of the pregnancy.

Both variables are recorded as integers with two implicit decimal places, so the value `2575` means that the respondent's age was `25.75`.

In [11]:
# Compute the difference
nsfg['preg_length'] = (nsfg.agepreg/100) - (nsfg.agecon/100)

# Compute summary statistics
nsfg.preg_length.describe()

count    9109.000000
mean        0.552069
std         0.271479
min         0.000000
25%         0.250000
50%         0.670000
75%         0.750000
max         0.920000
Name: preg_length, dtype: float64
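
The result is measured in years. The count in the summary (9109) is lower than the number of rows (9358) because a missing age produces `NaN` in the difference, and `describe()` skips missing values. The sketch below works through the conversion on a toy frame whose values copy the first five rows shown earlier; subtracting before dividing is an alternative arrangement used here, giving the same quantity up to floating-point rounding.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the first five rows displayed earlier (illustrative only).
toy = pd.DataFrame({
    "agecon": [2000, 2291, 3241, 3650, 2191],
    "agepreg": [2075.0, 2358.0, 3308.0, np.nan, 2266.0],
})

# Dividing by 100 turns the stored integers into ages in years,
# for example 2000 becomes 20.0.
age_at_conception = toy["agecon"] / 100

# Difference between the two ages, in years.
toy["preg_length"] = (toy["agepreg"] - toy["agecon"]) / 100

# count() ignores NaN, so the row with a missing 'agepreg' is excluded.
n_valid = int(toy["preg_length"].count())
```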

## Filter and visualize
