# Examples and Exercises from Think Stats, 2nd Edition

http://thinkstats2.com

Copyright 2016 Allen B. Downey

MIT License: https://opensource.org/licenses/MIT


In [7]:
from __future__ import print_function, division

import nsfg

ModuleNotFoundError: No module named 'code.nsfg'; 'code' is not a package

## Examples from Chapter 1

Read NSFG data into a Pandas DataFrame.

In [5]:
preg = nsfg.ReadFemPreg()
preg.head()

NameError: name 'nsfg' is not defined

Print the column names.

In [None]:
preg.columns

Select a single column name.

In [None]:
preg.columns[1]

Select a column and check what type it is.

In [None]:
pregordr = preg['pregordr']
type(pregordr)

Print a column.

In [None]:
pregordr

Select a single element from a column.

In [None]:
pregordr[0]

Select a slice from a column.

In [None]:
pregordr[2:5]
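
The two selections above rely on the column having the default integer index, so that label `0` and position `0` are the same row. A minimal sketch with made-up values (not NSFG data) shows the explicitly positional form using `iloc`, which keeps working if the index is ever changed:

```python
import pandas as pd

# Made-up pregnancy order values; the index defaults to 0, 1, 2, ...
s = pd.Series([1, 2, 1, 2, 3], name="pregordr")

first = s.iloc[0]      # element at position 0
middle = s.iloc[2:5]   # positions 2, 3 and 4 (the end of the slice is excluded)

print(first)
print(middle)
```

The slice is itself a Series, and it keeps the original index labels 2 through 4.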

Select a column using dot notation.

In [None]:
pregordr = preg.pregordr

Count the number of times each value occurs.

In [None]:
preg.outcome.value_counts().sort_index()

Check the values of another variable.

In [None]:
preg.birthwgt_lb.value_counts().sort_index()
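
As a small illustration of why `sort_index` is chained after `value_counts`, the sketch below uses made-up outcome codes (not NSFG data). `value_counts` orders its result by frequency, while `sort_index` reorders it by the value being counted, which is easier to compare against a codebook table.

```python
import pandas as pd

# Made-up outcome codes for illustration only.
outcome = pd.Series([1, 4, 1, 2, 1, 4])

counts = outcome.value_counts()   # most frequent value first
by_value = counts.sort_index()    # reordered by the value itself

print(by_value)
```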

Make a dictionary that maps from each respondent's `caseid` to a list of indices into the pregnancy `DataFrame`.  Use it to select the pregnancy outcomes for a single respondent.

In [None]:
caseid = 10229
preg_map = nsfg.MakePregMap(preg)
indices = preg_map[caseid]
preg.outcome[indices].values
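
This notebook does not show how `nsfg.MakePregMap` is implemented. A minimal sketch of the idea, assuming the map is simply a dictionary from each `caseid` to the list of row indices where it appears, might look like the following. The helper `make_index_map` and the sample `caseids` list are hypothetical and are not part of `nsfg.py`.

```python
from collections import defaultdict


def make_index_map(caseids):
    """Map each caseid to the list of positions where it occurs.

    Hypothetical helper for illustration; not the nsfg.py implementation.
    """
    index_map = defaultdict(list)
    for index, caseid in enumerate(caseids):
        index_map[caseid].append(index)
    return index_map


# Made-up caseids, one entry per pregnancy record.
caseids = [1, 1, 2, 2, 2, 7]
index_map = make_index_map(caseids)

print(index_map[2])
```

Because a respondent can have several pregnancies, each key maps to a list rather than a single index.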

## Exercises

Select the `birthord` column, print the value counts, and compare to results published in the [codebook](http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611933).

In [None]:
# Solution goes here
preg.birthord.value_counts().sort_index()

We can also use `isnull` to count the number of NaNs.

In [None]:
preg.birthord.isnull().sum()
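
The missing values matter when comparing against a codebook, because `value_counts` leaves NaN out by default. A short sketch with made-up values (not NSFG data) shows the two counts side by side; passing `dropna=False` is one way to include the missing entries in the tally.

```python
import numpy as np
import pandas as pd

# Made-up birth order values with two missing entries.
birthord = pd.Series([1.0, 2.0, np.nan, 1.0, np.nan])

n_missing = birthord.isnull().sum()                    # number of NaN entries
counts = birthord.value_counts().sort_index()          # NaN excluded
counts_with_nan = birthord.value_counts(dropna=False)  # NaN included

print(n_missing)
print(counts)
print(counts_with_nan)
```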

Select the `prglngth` column, print the value counts, and compare to results published in the [codebook](http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611931).

In [4]:
# Solution goes here
preg.prglngth.value_counts().sort_index()

NameError: name 'preg' is not defined

To compute the mean of a column, you can invoke the `mean` method on a Series.  For example, here is the mean birthweight in pounds:

In [16]:
preg.totalwgt_lb.mean()

7.265628457623368

Create a new column named `totalwgt_kg` that contains birth weight in kilograms.  Compute its mean.  Remember that when you create a new column, you have to use dictionary syntax, not dot notation.

In [25]:
# Solution goes here
preg['totalwgt_kg'] = preg.totalwgt_lb * 0.453592
preg[['caseid','totalwgt_kg']]

Unnamed: 0,caseid,totalwgt_kg
0,1,3.997279
1,1,3.572037
2,2,4.139027
3,2,3.175144
4,2,2.806601
...,...,...
13588,12571,2.806601
13589,12571,
13590,12571,
13591,12571,3.401940
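
The exercise also asks for the mean of the new column. A minimal sketch on a made-up frame (not NSFG data) shows both steps: creating the column with dictionary syntax, then calling `mean`, which skips NaN by default. The conversion factor 0.453592 is the one used in the solution above.

```python
import numpy as np
import pandas as pd

LB_TO_KG = 0.453592  # pounds to kilograms

# Made-up birth weights in pounds, with one missing value.
df = pd.DataFrame({"totalwgt_lb": [8.8125, 7.875, np.nan, 7.0]})

# A new column must be created with dictionary syntax.
df["totalwgt_kg"] = df.totalwgt_lb * LB_TO_KG

# Once the column exists, it can be read with dot notation.
mean_kg = df.totalwgt_kg.mean()  # the NaN row is ignored

print(mean_kg)
```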


`nsfg.py` also provides `ReadFemResp`, which reads the female respondents file and returns a `DataFrame`:

In [26]:
resp = nsfg.ReadFemResp()

`DataFrame` provides a method `head` that displays the first five rows:

In [28]:
resp.head()

Unnamed: 0,caseid,rscrinf,rdormres,rostscrn,rscreenhisp,rscreenrace,age_a,age_r,cmbirth,agescrn,...,pubassis_i,basewgt,adj_mod_basewgt,finalwgt,secu_r,sest,cmintvw,cmlstyr,screentime,intvlngth
0,2298,1,5,5,1,5.0,27,27,902,27,...,0,3247.916977,5123.759559,5556.717241,2,18,1234,1222,18:26:36,110.492667
1,5012,1,5,1,5,5.0,42,42,718,42,...,0,2335.279149,2846.79949,4744.19135,2,18,1233,1221,16:30:59,64.294
2,11586,1,5,1,5,5.0,43,43,708,43,...,0,2335.279149,2846.79949,4744.19135,2,18,1234,1222,18:19:09,75.149167
3,6794,5,5,4,1,5.0,15,15,1042,15,...,0,3783.152221,5071.464231,5923.977368,2,18,1234,1222,15:54:43,28.642833
4,616,1,5,4,1,5.0,20,20,991,20,...,0,5341.329968,6437.335772,7229.128072,2,18,1233,1221,14:19:44,69.502667


Select the `age_r` column from `resp` and print the value counts.  How old are the youngest and oldest respondents?

In [36]:
# Solution goes here
min_age = resp.age_r.min()
max_age = resp.age_r.max()

"Min Age: {0} Max Age: {1}".format(min_age, max_age)

'Min Age: 15 Max Age: 44'

We can use the `caseid` to match up rows from `resp` and `preg`.  For example, we can select the row from `resp` for `caseid` 2298 like this:

In [21]:
resp[resp.caseid==2298]

Unnamed: 0,caseid,rscrinf,rdormres,rostscrn,rscreenhisp,rscreenrace,age_a,age_r,cmbirth,agescrn,...,pubassis_i,basewgt,adj_mod_basewgt,finalwgt,secu_r,sest,cmintvw,cmlstyr,screentime,intvlngth
0,2298,1,5,5,1,5.0,27,27,902,27,...,0,3247.916977,5123.759559,5556.717241,2,18,1234,1222,18:26:36,110.492667


And we can get the corresponding rows from `preg` like this:

In [22]:
preg[preg.caseid==2298]

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,...,laborfor_i,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw,totalwgt_lb
2610,2298,1,,,,,6.0,,1.0,,...,0,0,0,3247.916977,5123.759559,5556.717241,2,18,,6.875
2611,2298,2,,,,,6.0,,1.0,,...,0,0,0,3247.916977,5123.759559,5556.717241,2,18,,5.5
2612,2298,3,,,,,6.0,,1.0,,...,0,0,0,3247.916977,5123.759559,5556.717241,2,18,,4.1875
2613,2298,4,,,,,6.0,,1.0,,...,0,0,0,3247.916977,5123.759559,5556.717241,2,18,,6.875
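
The same matching can be tried on small made-up frames (not NSFG data). The sketch below repeats the boolean-mask selection from the cells above and, as an alternative this notebook does not use, joins the two frames on `caseid` with `merge` so that every pregnancy row carries the respondent's age.

```python
import pandas as pd

# Made-up respondent and pregnancy tables sharing a caseid column.
resp = pd.DataFrame({"caseid": [1, 2, 3], "age_r": [44, 20, 31]})
preg = pd.DataFrame({
    "caseid": [1, 1, 2, 3, 3, 3],
    "prglngth": [39, 39, 40, 38, 35, 41],
})

caseid = 3
resp_row = resp[resp.caseid == caseid]    # one row per respondent
preg_rows = preg[preg.caseid == caseid]   # one row per pregnancy

# Left join: keep every pregnancy row and attach the respondent's columns.
merged = preg.merge(resp, on="caseid", how="left")

print(preg_rows)
print(merged)
```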


How old is the respondent with `caseid` 1?

In [38]:
# Solution goes here
respondent = resp[resp.caseid == 1]
respondent.age_r


1069    44
Name: age_r, dtype: int64

What are the pregnancy lengths for the respondent with `caseid` 2298?

In [43]:
# Solution goes here
case_id = 2298

df = preg[preg.caseid==case_id]

df.prglngth

2610    40
2611    36
2612    30
2613    40
Name: prglngth, dtype: int64

What was the birthweight of the first baby born to the respondent with `caseid` 5012?

In [44]:
preg[preg.caseid==5012].totalwgt_lb

5515    6.0
Name: totalwgt_lb, dtype: float64