# Examples and Exercises from Think Stats, 2nd Edition

http://thinkstats2.com

Copyright 2016 Allen B. Downey

MIT License: https://opensource.org/licenses/MIT


In [None]:
from __future__ import print_function, division

import nsfg
import numpy as np

## Examples from Chapter 1

Read NSFG data into a Pandas DataFrame.

In [None]:
preg = nsfg.ReadFemPreg()
preg.head()

Print the column names.

In [None]:
preg.columns

Select a single column name.

In [None]:
preg.columns[1]

Select a column and check what type it is.

In [None]:
pregordr = preg['pregordr']
type(pregordr)

Print a column.

In [None]:
pregordr

Select a single element from a column.

In [None]:
pregordr[0]

Select a slice from a column.

In [None]:
pregordr[2:5]

Select a column using dot notation.

In [None]:
pregordr = preg.pregordr

Count the number of times each value occurs.

In [None]:
preg.outcome.value_counts().sort_index()
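As a minimal sketch of why `sort_index` is chained on: `value_counts` orders its result by frequency, while `sort_index` reorders it by the value itself, which is easier to compare against a codebook table. The codes below are made up for illustration and are not NSFG data.

```python
import pandas as pd

# Hypothetical outcome codes, not real NSFG data.
outcome = pd.Series([1, 4, 1, 2, 1, 4])

# Default ordering: most frequent value first.
by_frequency = outcome.value_counts()

# Ordered by the value itself, like a codebook table.
by_value = outcome.value_counts().sort_index()

print(by_frequency)
print(by_value)
```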

Check the values of another variable.

In [None]:
preg.birthwgt_lb.value_counts().sort_index()

Make a dictionary that maps from each respondent's `caseid` to a list of indices into the pregnancy `DataFrame`.  Use it to select the pregnancy outcomes for a single respondent.

In [None]:
caseid = 10229
preg_map = nsfg.MakePregMap(preg)
indices = preg_map[caseid]
preg.outcome[indices].values
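The mapping described above can be sketched with a `defaultdict`. This is only an illustration on a made-up table; the helper name `make_case_map` is hypothetical, and the real `nsfg.MakePregMap` may be implemented differently.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical miniature of the pregnancy table (made-up values).
toy_preg = pd.DataFrame({
    'caseid':  [1, 1, 2, 3, 3, 3],
    'outcome': [1, 4, 1, 2, 1, 1],
})


def make_case_map(df):
    """Map each caseid to the list of row labels where it appears."""
    case_map = defaultdict(list)
    for index, caseid in df.caseid.items():
        case_map[caseid].append(index)
    return case_map


toy_map = make_case_map(toy_preg)

# Use the list of row labels to pull one respondent's outcomes.
outcomes_for_3 = toy_preg.outcome[toy_map[3]].values
print(outcomes_for_3)
```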

## Exercises

Select the `birthord` column, print the value counts, and compare them to the results published in the [codebook](http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611933).

In [None]:
# Solution goes here
# Sort by value so the counts line up with the codebook table
preg.birthord.value_counts().sort_index()

We can also use `isnull` to count the number of `NaN` values.

In [None]:
preg.birthord.isnull().sum()
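The separate `isnull` count is needed because `value_counts` leaves out missing values by default. A small sketch on made-up data, also showing the `dropna=False` option that includes them:

```python
import numpy as np
import pandas as pd

# Hypothetical birth-order values with two missing entries.
birthord_toy = pd.Series([1, 2, np.nan, 1, np.nan])

# By default, NaN is excluded from the counts.
counts_default = birthord_toy.value_counts()

# dropna=False adds a row for NaN.
counts_with_nan = birthord_toy.value_counts(dropna=False)

# isnull marks missing entries; summing the booleans counts them.
n_missing = birthord_toy.isnull().sum()

print(counts_with_nan)
print(n_missing)
```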

Select the `prglngth` column, print the value counts, and compare them to the results published in the [codebook](http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611931).

In [None]:
# Solution goes here
print(preg.prglngth.value_counts().sort_index())

# The raw counts do not match the codebook, which groups pregnancy length
# into 0-13, 14-26, and 27-50 weeks. Count the rows in each range directly.
# (Only the last expression in a cell is displayed, so print each count.)
print(preg.loc[preg.prglngth <= 13, 'prglngth'].count())
print(preg.loc[(preg.prglngth >= 14) & (preg.prglngth <= 26), 'prglngth'].count())
print(preg.loc[preg.prglngth >= 27, 'prglngth'].count())

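# A conditional expression in a lambda needs an else value (`pass` is a
# statement and is not allowed there), so this attempt labels only the first
# range and leaves every other row as an empty string.
preg['prglngth_range'] = preg['prglngth'].apply(lambda x: '0-13 weeks' if x <= 13 else '')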

# np.select gives cleaner output by labeling every range, at the cost of
# a few more lines of code.
when = [
    (preg.prglngth <= 13),
    (preg.prglngth >= 14) & (preg.prglngth <= 26),
    (preg.prglngth >= 27)
]
then = ['0-13 weeks', '14-26 weeks', '27+ weeks']
preg['prglngth_rng'] = np.select(when, then, default='other')

preg.prglngth_rng.value_counts()

## New pattern from the Stack Overflow thread

https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
    conditions = [
        (df['Set'] == 'Z') & (df['Type'] == 'A'),
        (df['Set'] == 'Z') & (df['Type'] == 'B'),
        (df['Type'] == 'B')]
    choices = ['yellow', 'blue', 'purple']
    df['color'] = np.select(conditions, choices, default='black')
    print(df)
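For contiguous numeric ranges such as the 0-13, 14-26, and 27+ week groups used earlier, `pd.cut` is a possible alternative to `np.select`. The sketch below uses made-up week values and assumes whole-number weeks, so the bin edge 13 separates "13 or less" from "14 or more".

```python
import pandas as pd

# Hypothetical pregnancy lengths in whole weeks (not NSFG data).
weeks = pd.Series([4, 13, 14, 26, 27, 39, 50])

# Bins are right-inclusive by default: (-1, 13], (13, 26], (26, inf].
week_range = pd.cut(
    weeks,
    bins=[-1, 13, 26, float('inf')],
    labels=['0-13 weeks', '14-26 weeks', '27+ weeks'],
)

range_counts = week_range.value_counts()
print(range_counts)
```

One difference from `np.select` is that `pd.cut` returns a categorical column, and values outside every bin become missing rather than receiving a default label.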

To compute the mean of a column, you can invoke the `mean` method on a Series.  For example, here is the mean birthweight in pounds:

In [None]:
preg.totalwgt_lb.mean()

Create a new column named `totalwgt_kg` that contains birth weight in kilograms.  Compute its mean.  Remember that when you create a new column, you have to use dictionary syntax, not dot notation.

In [None]:
# Solution goes here
preg['totalwgt_kg'] = preg.totalwgt_lb * 0.453592

preg.totalwgt_kg.mean()
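To illustrate why dictionary syntax is required, here is a small sketch on a made-up table. Assigning to a new name with dot notation only sets a Python attribute on the object, and recent pandas versions warn about it; the column `totalwgt_g` is hypothetical and used only for the demonstration.

```python
import warnings

import pandas as pd

# Hypothetical birth weights in pounds (not NSFG data).
toy = pd.DataFrame({'totalwgt_lb': [7.5, 6.0, 8.25]})

# Dictionary (bracket) syntax creates a real column.
toy['totalwgt_kg'] = toy.totalwgt_lb * 0.453592

# Dot notation with a new name does NOT create a column. pandas may emit
# a UserWarning here, which is silenced for this demonstration.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    toy.totalwgt_g = toy.totalwgt_lb * 453.592

column_names = list(toy.columns)
print(column_names)
```

Dot notation still works for reading an existing column, as in `toy.totalwgt_kg.mean()`.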

`nsfg.py` also provides `ReadFemResp`, which reads the female respondents file and returns a `DataFrame`:

In [None]:
resp = nsfg.ReadFemResp()

`DataFrame` provides a method `head` that displays the first five rows:

In [None]:
resp.head()

Select the `age_r` column from `resp` and print the value counts.  How old are the youngest and oldest respondents?

In [None]:
# Solution goes here
# Only the last expression in a cell is displayed, so print the counts.
print(resp.age_r.value_counts().sort_index())

print('The oldest respondent was {} years old and the youngest was {} years old.'.format(
    resp.age_r.max(), resp.age_r.min()))

We can use the `caseid` to match up rows from `resp` and `preg`.  For example, we can select the row from `resp` for `caseid` 2298 like this:

In [None]:
resp[resp.caseid==2298]

And we can get the corresponding rows from `preg` like this:

In [None]:
preg[preg.caseid==2298]
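When many respondents need to be matched at once, one option is to join the two tables on `caseid` with `merge` instead of filtering one `caseid` at a time. The sketch below uses small made-up tables; only the column names `caseid`, `age_r`, and `prglngth` come from the data above.

```python
import pandas as pd

# Hypothetical miniature respondent and pregnancy tables.
toy_resp = pd.DataFrame({'caseid': [1, 2, 3], 'age_r': [44, 20, 27]})
toy_preg = pd.DataFrame({
    'caseid':   [1, 1, 3],
    'prglngth': [39, 40, 38],
})

# Left join: keep every pregnancy row and attach the respondent's columns.
joined = toy_preg.merge(toy_resp, on='caseid', how='left')
ages = joined.age_r.tolist()
print(joined)
```

Respondent 2 has no pregnancies in this toy data, so she does not appear in the joined result.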

How old is the respondent with `caseid` 1?

In [None]:
# Solution goes here
resp.loc[resp.caseid==1, 'age_r']

What are the pregnancy lengths for the respondent with `caseid` 2298?

In [None]:
# Solution goes here
preg.loc[preg.caseid==2298, 'prglngth']

What was the birthweight of the first baby born to the respondent with `caseid` 5012?

In [None]:
# Solution goes here
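# birthord numbers live births, so birthord == 1 selects the first baby born;
# pregordr == 1 would select the first pregnancy, whatever its outcome.
preg.loc[(preg.caseid == 5012) & (preg.birthord == 1), 'totalwgt_lb']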