# Think Stats 2
### by Allen B. Downey, Copyright 2014 Allen B. Downey

## Chapter 1. Exploratory data analysis 
This notebook contains my notes, practice code, and solutions to the exercises.

In [1]:
import nsfg
import pandas as pd

The layout of the data file is documented in a `.dct` file, which is a Stata dictionary file.<br>
Stata is a statistical software system; a “dictionary” in this
context is a list of variable names, types, and indices that identify where in
each line to find each variable.<br>
The data is converted to a `DataFrame` in `nsfg.py` using `thinkstats2.py`.<br>
The data is cleaned in the `CleanFemPreg` function in `nsfg.py`.
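
To make the idea of a dictionary concrete, here is a minimal sketch of reading fixed-width records with `pd.read_fwf`. The three-column layout and the sample records below are made up for illustration; they are not the real NSFG layout, and this is not necessarily how `thinkstats2.py` does it.

```python
import io
import pandas as pd

# Hypothetical layout in the spirit of a dictionary file:
# (variable name, first column, last column), 1-based and inclusive.
layout = [("caseid", 1, 5), ("pregordr", 6, 7), ("prglngth", 8, 9)]

# Made-up fixed-width records that follow the layout above.
raw = (
    "    1 139\n"
    "    1 239\n"
    "    2 139\n"
)

# read_fwf expects 0-based, half-open (start, end) pairs.
colspecs = [(start - 1, end) for _, start, end in layout]
names = [name for name, _, _ in layout]

sample = pd.read_fwf(io.StringIO(raw), colspecs=colspecs, names=names, header=None)
print(sample)
```

Each variable is found purely by its position in the line, which is why the dictionary has to be read before the data file can be parsed.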

In [2]:
df=nsfg.ReadFemPreg()
df.shape

(13593, 244)

In [3]:
df.head(10)

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,...,laborfor_i,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw,totalwgt_lb
0,1,1,,,,,6.0,,1.0,,...,0,0,0,3410.389399,3869.349602,6448.271112,2,9,,8.8125
1,1,2,,,,,6.0,,1.0,,...,0,0,0,3410.389399,3869.349602,6448.271112,2,9,,7.875
2,2,1,,,,,5.0,,3.0,5.0,...,0,0,0,7226.30174,8567.54911,12999.542264,2,12,,9.125
3,2,2,,,,,6.0,,1.0,,...,0,0,0,7226.30174,8567.54911,12999.542264,2,12,,7.0
4,2,3,,,,,6.0,,1.0,,...,0,0,0,7226.30174,8567.54911,12999.542264,2,12,,6.1875
5,6,1,,,,,6.0,,1.0,,...,0,0,0,4870.926435,5325.196999,8874.440799,1,23,,8.5625
6,6,2,,,,,6.0,,1.0,,...,0,0,0,4870.926435,5325.196999,8874.440799,1,23,,9.5625
7,6,3,,,,,6.0,,1.0,,...,0,0,0,4870.926435,5325.196999,8874.440799,1,23,,8.375
8,7,1,,,,,5.0,,1.0,,...,0,0,0,3409.579565,3787.539,6911.879921,2,14,,7.5625
9,7,2,,,,,5.0,,1.0,,...,0,0,0,3409.579565,3787.539,6911.879921,2,14,,6.625


In [4]:
df.columns

Index(['caseid', 'pregordr', 'howpreg_n', 'howpreg_p', 'moscurrp', 'nowprgdk',
       'pregend1', 'pregend2', 'nbrnaliv', 'multbrth',
       ...
       'laborfor_i', 'religion_i', 'metro_i', 'basewgt', 'adj_mod_basewgt',
       'finalwgt', 'secu_p', 'sest', 'cmintvw', 'totalwgt_lb'],
      dtype='object', length=244)

In [5]:
df.columns[1]

'pregordr'

In [6]:
type(df['pregordr'])

pandas.core.series.Series

In [7]:
df['pregordr'].head(10)

0    1
1    2
2    1
3    2
4    3
5    1
6    2
7    3
8    1
9    2
Name: pregordr, dtype: int64

<b>Validate Data:</b><br>
When data is exported from one software environment and imported into
another, errors might be introduced.<br>
One way to validate data is to compute basic statistics and compare them
with published results.<br>
For example, the NSFG codebook includes tables
that summarize each variable.<br>Here is the table for outcome, which encodes
the outcome of each pregnancy:<br>

| value         | label            | Total|
|:-------------:|:----------------:|:----:|
| 1             | LIVE BIRTH       | 9148 |
| 2             | INDUCED ABORTION | 1862 |
| 3             | STILLBIRTH       | 120  |
| 4             | MISCARRIAGE      | 1921 |
| 5             | ECTOPIC PREGNANCY| 190  |
| 6             | CURRENT PREGNANCY| 352  |


We can build the same table programmatically and check that the counts match the published totals. The same check can be repeated for any other column that the codebook summarizes.

In [8]:
df.outcome.value_counts().sort_index()

1    9148
2    1862
3     120
4    1921
5     190
6     352
Name: outcome, dtype: int64
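
Instead of comparing by eye, the check can be automated. This is a small sketch: `check_counts` is a helper name I made up, and `observed` is a stand-in for `df.outcome.value_counts().sort_index()` built from the values printed above so that the snippet runs on its own.

```python
import pandas as pd

# Totals copied from the codebook table for `outcome`.
expected = pd.Series({1: 9148, 2: 1862, 3: 120, 4: 1921, 5: 190, 6: 352})

def check_counts(observed, expected):
    """Return expected vs. observed counts side by side with a match flag."""
    table = pd.DataFrame({"expected": expected, "observed": observed})
    table["match"] = table["expected"] == table["observed"]
    return table

# Stand-in for df.outcome.value_counts().sort_index() (values printed above).
observed = pd.Series({1: 9148, 2: 1862, 3: 120, 4: 1921, 5: 190, 6: 352})

report = check_counts(observed, expected)
print(report)
print("All counts match:", report["match"].all())
```

A value that is missing on either side shows up as `NaN` and is flagged as a mismatch, so unexpected codes are caught as well.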

`nsfg.py` contains a function `MakePregMap` that builds a map from each `caseid` to the row (record) indices of that respondent's pregnancies.

In [9]:

preg_map=nsfg.MakePregMap(df)
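
A map like this can be built with a `defaultdict` of lists. The sketch below shows the idea on a toy `DataFrame`; `make_preg_map` and `toy` are my own names, and the real `MakePregMap` in `nsfg.py` may differ in its details.

```python
from collections import defaultdict
import pandas as pd

def make_preg_map(frame):
    """Map each caseid to the list of row labels that hold its pregnancies."""
    preg_map = defaultdict(list)
    for index, caseid in frame["caseid"].items():
        preg_map[caseid].append(index)
    return preg_map

# Toy data: respondent 1 has two pregnancies, respondent 2 has three.
toy = pd.DataFrame({"caseid": [1, 1, 2, 2, 2], "outcome": [1, 1, 4, 4, 1]})

toy_map = make_preg_map(toy)
print(toy_map[2])                       # row labels for caseid 2
print(toy.loc[toy_map[2], "outcome"])   # outcomes of those pregnancies
```

The map is built once and can then be reused for any number of lookups without rescanning the `caseid` column.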

<b>Example:</b><br>
Finding pregnancy outcomes for caseid=10229

In [10]:
caseid=10229
df.outcome[preg_map[caseid]]

11093    4
11094    4
11095    4
11096    4
11097    4
11098    4
11099    1
Name: outcome, dtype: int64


From the above result, we learn that the respondent was pregnant 7 times. The first 6 pregnancies ended in miscarriage (outcome 4), and the most recent one ended in a live birth (outcome 1).<br>
From this information alone, we can imagine how much of a relief the most recent pregnancy must have been for her family.

We used data to learn about the emotional circumstances of that respondent!<br>


## Exercises

Select the `birthord` column, print the value counts, and compare to results published in the [codebook](http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611933)

In [11]:
df['birthord'].value_counts()

1.0     4413
2.0     2874
3.0     1234
4.0      421
5.0      126
6.0       50
7.0       20
8.0        7
9.0        2
10.0       1
Name: birthord, dtype: int64

In [12]:
print(f'Number of INAPPLICABLE values: {df["birthord"].isnull().sum()}')

Number of INAPPLICABLE values: 4445
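
By default `value_counts` drops missing values, which is why the INAPPLICABLE rows need a separate count. As a sketch on toy data (the series `toy_birthord` is made up), passing `dropna=False` puts both in one table:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['birthord']: NaN marks pregnancies without a live birth.
toy_birthord = pd.Series([1.0, 2.0, 1.0, np.nan, np.nan, 3.0])

# dropna=False keeps NaN as its own row in the result.
counts = toy_birthord.value_counts(dropna=False).sort_index()
n_missing = int(toy_birthord.isnull().sum())

print(counts)
print(f"Missing values: {n_missing}")
```

With `dropna=False` the counts add up to the number of rows, which makes the comparison with the codebook totals easier.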


Select the `prglngth` column, print the value counts, and compare to results published in the [codebook](http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611931)

In [13]:
bins=[-1,13,26,50]
print(pd.cut(df['prglngth'],bins).value_counts().sort_index())

(-1, 13]    3522
(13, 26]     793
(26, 50]    9278
Name: prglngth, dtype: int64
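
`pd.cut` uses intervals that are open on the left and closed on the right by default, so `(-1, 13]` covers 0 to 13 weeks and `(13, 26]` covers 14 to 26 weeks. The sketch below checks the edge values on a toy series; the label strings are my own choice, not the codebook's wording.

```python
import pandas as pd

# Toy pregnancy lengths chosen to sit on the bin edges.
weeks = pd.Series([0, 13, 14, 26, 27, 40, 50])

bins = [-1, 13, 26, 50]
labels = ["13 weeks or less", "14-26 weeks", "27-50 weeks"]  # hypothetical labels

binned = pd.cut(weeks, bins, labels=labels)
bin_counts = binned.value_counts().sort_index()
print(bin_counts)
```

Values outside the outermost edges become `NaN` and drop out of the counts. In the real data the three counts above sum to 13593, the number of rows in `df`, so no pregnancy length fell outside the bins.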


Mean birth weight in pounds:

In [14]:
df.totalwgt_lb.mean()

7.265628457623368

Create a new column named `totalwgt_kg` that contains birth weight in kilograms. Compute its mean.

In [15]:
df['totalwgt_kg']=df['totalwgt_lb']*0.453592
print(f'Mean in kg is: {df.totalwgt_kg.mean()}')

Mean in kg is: 3.2956309433502984
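
A quick sanity check: multiplying every value by a constant multiplies the mean by the same constant, so the mean in kilograms should equal the mean in pounds times the conversion factor. The sketch below uses the two means printed above as plain numbers; `LB_TO_KG` is just a name for the factor used in the cell.

```python
import math

mean_lb = 7.265628457623368        # mean of totalwgt_lb printed above
reported_kg = 3.2956309433502984   # mean of totalwgt_kg printed above
LB_TO_KG = 0.453592                # conversion factor used in the cell above

mean_kg = mean_lb * LB_TO_KG
print(mean_kg)
print("Consistent:", math.isclose(mean_kg, reported_kg, rel_tol=1e-9))
```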


`nsfg.py` also provides `ReadFemResp`, which reads the female respondents file and returns a `DataFrame`:

In [16]:
df_resp=nsfg.ReadFemResp()
df_resp.head(10)

Unnamed: 0,caseid,rscrinf,rdormres,rostscrn,rscreenhisp,rscreenrace,age_a,age_r,cmbirth,agescrn,...,pubassis_i,basewgt,adj_mod_basewgt,finalwgt,secu_r,sest,cmintvw,cmlstyr,screentime,intvlngth
0,2298,1,5,5,1,5.0,27,27,902,27,...,0,3247.916977,5123.759559,5556.717241,2,18,1234,1222,18:26:36,110.492667
1,5012,1,5,1,5,5.0,42,42,718,42,...,0,2335.279149,2846.79949,4744.19135,2,18,1233,1221,16:30:59,64.294
2,11586,1,5,1,5,5.0,43,43,708,43,...,0,2335.279149,2846.79949,4744.19135,2,18,1234,1222,18:19:09,75.149167
3,6794,5,5,4,1,5.0,15,15,1042,15,...,0,3783.152221,5071.464231,5923.977368,2,18,1234,1222,15:54:43,28.642833
4,616,1,5,4,1,5.0,20,20,991,20,...,0,5341.329968,6437.335772,7229.128072,2,18,1233,1221,14:19:44,69.502667
5,845,1,5,4,1,5.0,42,42,727,42,...,0,2335.279149,3725.796795,4705.681352,2,18,1234,1222,17:10:13,95.488
6,10333,5,5,3,1,5.0,17,17,1029,17,...,0,2335.279149,2687.399758,3139.151658,2,18,1236,1224,14:14:38,61.204333
7,855,5,5,4,5,5.0,22,22,965,22,...,0,4670.558298,7122.614751,10019.38217,2,18,1235,1223,14:42:52,59.756333
8,8656,5,5,4,1,5.0,38,38,780,38,...,0,5198.652195,6027.568848,6520.021223,2,18,1237,1225,15:32:34,56.978833
9,3566,5,5,4,5,5.0,21,21,974,21,...,0,2764.142038,3240.986558,4559.095792,2,18,1231,1219,16:22:25,104.744667


Select the `age_r` column from `df_resp` and print the value counts.  How old are the youngest and oldest respondents?

In [17]:
df_resp.age_r.value_counts().sort_index()

15    217
16    223
17    234
18    235
19    241
20    258
21    267
22    287
23    282
24    269
25    267
26    260
27    255
28    252
29    262
30    292
31    278
32    273
33    257
34    255
35    262
36    266
37    271
38    256
39    215
40    256
41    250
42    215
43    253
44    235
Name: age_r, dtype: int64

In [18]:
print(f"Youngest respondent is {df_resp.age_r.min()} years old\nOldest respondent is {df_resp.age_r.max()} years old")

Youngest respondent is 15 years old
Oldest respondent is 44 years old


Select the record for the respondent with `caseid` 2298.

In [19]:
df_resp[df_resp.caseid==2298]

Unnamed: 0,caseid,rscrinf,rdormres,rostscrn,rscreenhisp,rscreenrace,age_a,age_r,cmbirth,agescrn,...,pubassis_i,basewgt,adj_mod_basewgt,finalwgt,secu_r,sest,cmintvw,cmlstyr,screentime,intvlngth
0,2298,1,5,5,1,5.0,27,27,902,27,...,0,3247.916977,5123.759559,5556.717241,2,18,1234,1222,18:26:36,110.492667


Get the corresponding pregnancy records for `caseid` 2298.

In [20]:
df[df.caseid==2298]

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,...,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw,totalwgt_lb,totalwgt_kg
2610,2298,1,,,,,6.0,,1.0,,...,0,0,3247.916977,5123.759559,5556.717241,2,18,,6.875,3.118445
2611,2298,2,,,,,6.0,,1.0,,...,0,0,3247.916977,5123.759559,5556.717241,2,18,,5.5,2.494756
2612,2298,3,,,,,6.0,,1.0,,...,0,0,3247.916977,5123.759559,5556.717241,2,18,,4.1875,1.899417
2613,2298,4,,,,,6.0,,1.0,,...,0,0,3247.916977,5123.759559,5556.717241,2,18,,6.875,3.118445


How old is the respondent with `caseid` 1?

In [21]:
print(f'caseid 1 respondent is {df_resp.loc[df_resp.caseid==1, "age_r"].iloc[0]} years old')

caseid 1 respondent is 44 years old


What are the pregnancy lengths for the respondent with `caseid` 2298?

In [22]:
df[df.caseid==2298].prglngth

2610    40
2611    36
2612    30
2613    40
Name: prglngth, dtype: int64

What was the birthweight of the first baby born to the respondent with `caseid` 5012?

In [23]:
df[df.caseid==5012].totalwgt_lb.head(1)

5515    6.0
Name: totalwgt_lb, dtype: float64
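
Note that `.head(1)` returns the first stored row for the respondent, which is the first baby only if the rows are in pregnancy order and the first pregnancy ended in a live birth. A more explicit alternative, sketched here on made-up toy data, is to filter on `birthord == 1`:

```python
import numpy as np
import pandas as pd

# Toy data: the first pregnancy did not end in a live birth,
# so the first baby is on the second row.
toy_preg = pd.DataFrame({
    "caseid": [7, 7, 7],
    "pregordr": [1, 2, 3],
    "birthord": [np.nan, 1.0, 2.0],
    "totalwgt_lb": [np.nan, 6.5, 7.25],
})

first_birth = toy_preg[(toy_preg.caseid == 7) & (toy_preg.birthord == 1)]
first_weight = first_birth.totalwgt_lb.iloc[0]
print(first_weight)
```

On this toy data, `.head(1)` would return the `NaN` weight of the first pregnancy instead of the weight of the first baby.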