The question:

Do first time babies tend to arrive late?

Much of the evidence offered on this question is anecdotal, because it is based on data that is unpublished and usually personal. Anecdotal evidence usually fails because of:

- **Small number of observations**: If pregnancy length is longer for first babies, the difference is probably small compared to natural variation. In that case, we might have to compare a large number of pregnancies to be sure that a difference exists.
- **Selection bias**: People who join a discussion of this question might be interested because their first babies were late. In that case the process of selecting data would bias the results.
- **Confirmation bias**: People who believe the claim might be more likely to contribute examples that confirm it. People who doubt the claim are more likely to cite counterexamples.
- **Inaccuracy**: Anecdotes are often personal stories, and often misremembered, misrepresented, repeated inaccurately, etc.
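
To make the first point concrete, here is a minimal simulation sketch. The group means (38.6 and 38.5 weeks) and the standard deviation (2.7 weeks) are made-up values chosen only for illustration; they are not results from any survey.

```python
import random
import statistics

def observed_difference(n, rng, mean_first=38.6, mean_other=38.5, sd=2.7):
    """Difference in sample means between two simulated groups of size n.

    All parameters are illustrative assumptions, not survey estimates.
    """
    first = [rng.gauss(mean_first, sd) for _ in range(n)]
    other = [rng.gauss(mean_other, sd) for _ in range(n)]
    return statistics.mean(first) - statistics.mean(other)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Repeat the comparison many times for a small and a large sample size
# and measure how much the observed difference varies between repetitions.
spread = {}
for n in (10, 2000):
    diffs = [observed_difference(n, rng) for _ in range(100)]
    spread[n] = statistics.stdev(diffs)
    print(n, round(spread[n], 3))
```

With only 10 pregnancies per group, the observed difference varies on the order of a week from one repetition to the next, far more than the 0.1 week gap built into the simulation. With thousands of pregnancies the variation shrinks enough for a small gap to become visible.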

# Statistical Approach

To address the limitations of anecdotes, we will use the tools of statistics, which include:

- **Data collection**: We will use data from a large national survey that was designed explicitly with the goal of generating statistically valid inferences about the U.S. population.
- **Descriptive statistics**: We will generate statistics that summarize the data concisely, and evaluate different ways to visualize data.
- **Exploratory data analysis**: We will look for patterns, differences, and other features that address the questions we are interested in. At the same time we will check for inconsistencies and identify limitations.
- **Estimation**: We will use data from a sample to estimate characteristics of the general population.
- **Hypothesis testing**: Where we see apparent effects, like a difference between two groups, we will evaluate whether the effect might have happened by chance.

# The Data Source

We will be using the National Survey of Family Growth (NSFG).

See http://cdc.gov/nchs/nsfg.htm and explore the different data sets and information.

The NSFG is a **cross-sectional** study, which means that it captures a snapshot of a group at a point in time. The most common alternative is a **longitudinal study**, which observes a group repeatedly over a period of time.

The goal of the survey is to draw conclusions about a **population**; the target population of the NSFG is people in the United States aged 15-44. Ideally surveys would collect data from every member of the population, but that's seldom possible. Instead we collect data from a subset of the population called a **sample**. The people who participate in a survey are called **respondents**.

In general, cross-sectional studies are meant to be **representative**, which means that every member of the target population has an equal chance of participating. That ideal is hard to achieve in practice, but people who conduct surveys come as close as they can.

The NSFG is not representative; instead it is deliberately **oversampled**. The designers of the study recruited three groups (Hispanics, African-Americans and teenagers) at rates higher than their representation in the U.S. population, in order to make sure that the number of respondents in each of these groups is large enough to draw valid statistical inferences.

Of course, the drawback of oversampling is that it is not as easy to draw conclusions about the general population based on statistics from the survey. We will come back to this point later.
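
The usual remedy is to give each respondent a weight that says how many people in the population they stand for. The sketch below uses an invented population and invented weights to show how a plain mean is distorted by oversampling and how a weighted mean corrects it; `weighted_mean` is a hypothetical helper, not part of the NSFG code.

```python
def weighted_mean(values, weights):
    """Mean of values where each value counts in proportion to its weight."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

# Hypothetical population of 1000: 90% in group A (value 10), 10% in group B (value 20).
# The sample deliberately oversamples group B: half of the respondents come from it.
values = [10, 10, 10, 10, 20, 20, 20, 20]
# Each weight says how many people in the population a respondent stands for.
weights = [225, 225, 225, 225, 25, 25, 25, 25]

plain = sum(values) / len(values)          # pulled toward the oversampled group
weighted = weighted_mean(values, weights)  # matches the population mix
print(plain, weighted)
```

In this toy case the population mean is 0.9 * 10 + 0.1 * 20 = 11, which the weighted mean recovers while the plain mean does not.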

The codebook and user's guide for the NSFG data are available from http://www.cdc.gov/nchs/nsfg/nsfgcycle6.htm

## Importing the data

Once you have downloaded the repository, go to the folder ThinkStats2/code in your terminal and run nsfg.py:
> cd ThinkStats2/code

> python nsfg.py

You should get a message like 'All tests passed'

Now explore the data in the folder. What does 2002FemPreg.dct look like?

This is a Stata dictionary file.

thinkstats2.py is a module that includes code to read Stata dictionaries.
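
A dictionary file of this kind records, for each variable, its name and the position it occupies in every line of the fixed-width data file. The sketch below shows the idea with a hand-written column specification. The positions and the sample lines are invented for illustration; they are not the real NSFG layout, and this is not how thinkstats2.py is implemented.

```python
# Hypothetical column layout: (name, start, end), 0-based and end-exclusive.
COLUMNS = [
    ("caseid", 0, 5),
    ("pregordr", 5, 7),
    ("prglngth", 7, 9),
]

def parse_line(line, columns=COLUMNS):
    """Slice one fixed-width line into a dict; blank fields become None."""
    record = {}
    for name, start, end in columns:
        field = line[start:end].strip()
        record[name] = int(field) if field else None
    return record

# Invented sample lines in the layout above.
lines = [
    "    1 139",
    "    1 2  ",
]
records = [parse_line(line) for line in lines]
print(records)
```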

A **module** is a Python object with arbitrarily named attributes that you can bind and reference. Simply, a module is a file consisting of Python code. A module can define functions, classes and variables. A module can also include runnable code.

Explore the nsfg module file in the folder. Find the function ReadFemPreg() and then import it.

You might have to copy the module to the correct directory.

In [1]:
import pandas as pd
from Thinkstats2 import nsfg
import numpy as np

In [2]:
# Code here
preg = nsfg.ReadFemPreg(dct_file='../Think_Stats/Thinkstats2/2002FemPreg.dct',
                      dat_file='Thinkstats2/2002FemPreg.dat.gz')

What are the columns in the Dataframe?

In [3]:
# Code here
preg.columns

Index(['caseid', 'pregordr', 'howpreg_n', 'howpreg_p', 'moscurrp', 'nowprgdk',
       'pregend1', 'pregend2', 'nbrnaliv', 'multbrth',
       ...
       'poverty_i', 'laborfor_i', 'religion_i', 'metro_i', 'basewgt',
       'adj_mod_basewgt', 'finalwgt', 'secu_p', 'sest', 'cmintvw'],
      dtype='object', length=243)

How many columns does it have? Use 2 methods to calculate it.

In [4]:
preg.shape[1]

243

In [5]:
len(preg.columns)

243

Remember that columns is not a **method**; it is an **attribute**, so it is accessed without parentheses.
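
The difference is easy to see on a toy class. `TinyTable` below is a hypothetical stand-in written for this note, not a pandas class: it stores `columns` as data on the object and exposes `head` as a function that has to be called.

```python
class TinyTable:
    """A toy stand-in for a DataFrame (hypothetical, for illustration only)."""

    def __init__(self, columns, rows):
        self.columns = columns  # attribute: data stored on the object
        self.rows = rows

    def head(self, n=5):        # method: a function you call with parentheses
        return self.rows[:n]

table = TinyTable(["caseid", "pregordr"], [[1, 1], [1, 2], [2, 1]])
print(table.columns)   # attribute access, no parentheses
print(table.head(2))   # method call, parentheses required
print(callable(table.head), callable(table.columns))
```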

What is the first column?

In [6]:
# Code here
preg.columns[0]

'caseid'

Access pregordr column. Use 2 different methods.

In [7]:
# Code here
preg.loc[:,'pregordr'].head()

0    1
1    2
2    1
3    2
4    3
Name: pregordr, dtype: int64

In [8]:
preg.pregordr.head()

0    1
1    2
2    1
3    2
4    3
Name: pregordr, dtype: int64

What type is the data in that column? And what type is the column object itself?

In [9]:
# Code here
preg.pregordr.dtype

dtype('int64')

In [10]:
type(preg.pregordr)

pandas.core.series.Series

In [11]:
type(preg)

pandas.core.frame.DataFrame

Get rows 2 to 4 of the DataFrame

In [12]:
# Code here
preg.iloc[1:4]

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,...,poverty_i,laborfor_i,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw
1,1,2,,,,,6.0,,1.0,,...,0,0,0,0,3410.389399,3869.349602,6448.271112,2,9,1231
2,2,1,,,,,5.0,,3.0,5.0,...,0,0,0,0,7226.30174,8567.54911,12999.542264,2,12,1231
3,2,2,,,,,6.0,,1.0,,...,0,0,0,0,7226.30174,8567.54911,12999.542264,2,12,1231


## Variables

Out of the 244 variables in the cleaned DataFrame, we are only going to use:

- *prglngth* is the integer duration of the pregnancy in weeks.
- *outcome* is an integer code for the outcome of the pregnancy. The code 1 indicates a live birth.
- *pregordr* is a pregnancy serial number; for example, the code for a respondent’s first pregnancy is 1, for the second pregnancy is 2, and so on.
- *birthord* is a serial number for live births; the code for a respondent’s first child is 1, and so on. For outcomes other than live birth, this field is blank.
- *birthwgt_lb* and *birthwgt_oz* contain the pounds and ounces parts of the birth weight of the baby.
- *agepreg* is the mother’s age at the end of the pregnancy.
- *finalwgt* is the statistical weight associated with the respondent. It is a floating-point value that indicates the number of people in the U.S. population this respondent represents.

If you read the codebook carefully, you will see that many of the variables are **recodes**, which means that they are not part of the raw data collected by the survey; they are calculated using the **raw data**.

For example, prglngth for live births is equal to the raw variable wksgest (weeks of gestation) if it is available; otherwise it is estimated using mosgest * 4.33 (months of gestation times the average number of weeks in a month).
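
The rule for this recode can be written as a small function. This is a sketch of the rule as described above, not the code used by the survey; using `None` for a missing raw value and rounding the estimate to an integer are assumptions of the sketch, so check the codebook for the exact procedure.

```python
def recode_prglngth(wksgest, mosgest):
    """Return pregnancy length in weeks from the two raw variables.

    None stands for a missing raw value. Rounding to an integer is an
    assumption of this sketch.
    """
    if wksgest is not None:
        return wksgest                 # weeks reported directly
    if mosgest is not None:
        return round(mosgest * 4.33)   # estimate weeks from months
    return None                        # nothing to estimate from

print(recode_prglngth(39, 9))
print(recode_prglngth(None, 8))
print(recode_prglngth(None, None))
```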

## Transformation

When you import data like this, you often have to check for errors, deal with special values, convert data into different formats, and perform calculations. These operations are called **data cleaning**.

First of all, ReadFemPreg() calls a function, CleanFemPreg(), that cleans the data. Open the module again in a text editor and edit ReadFemPreg() so that it takes an argument that decides whether to clean the data or not, <br> such as:
>ReadFemPreg(clean=True)

Load the file again (you might have to restart the kernel to get the new function imported), then code the following data cleaning transformations:

In [13]:
#Load the file here
preg = nsfg.ReadFemPreg(dct_file='../Think_Stats/Thinkstats2/2002FemPreg.dct',
                      dat_file='Thinkstats2/2002FemPreg.dat.gz', clean = True)
preg.describe()

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,...,laborfor_i,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw,totalwgt_lb
count,13593.0,13593.0,352.0,349.0,352.0,3.0,13241.0,18.0,9144.0,163.0,...,13593.0,13593.0,13593.0,13593.0,13593.0,13593.0,13593.0,13593.0,0.0,9038.0
mean,6216.526595,2.34915,15.144886,1.34384,4.647727,3.666667,4.650177,4.055556,1.022419,1.834356,...,0.000809,0.003016,0.0,4216.271164,5383.982581,8196.42228,1.48731,44.083352,,7.265628
std,3645.417341,1.577807,13.922211,0.47567,2.527523,4.618802,1.84979,1.696787,0.190098,1.630208,...,0.028437,0.058727,0.0,3982.680473,5640.499431,9325.918114,0.499857,24.110403,,1.408293
min,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,64.577101,71.201194,118.65679,1.0,1.0,,0.125
25%,3022.0,1.0,5.0,1.0,2.0,1.0,3.0,3.0,1.0,1.0,...,0.0,0.0,0.0,2335.445237,2798.048902,3841.375308,1.0,25.0,,6.5
50%,6161.0,2.0,9.0,1.0,5.0,1.0,6.0,4.0,1.0,1.0,...,0.0,0.0,0.0,3409.648504,4127.220642,6256.592133,1.0,45.0,,7.375
75%,9423.0,3.0,23.0,2.0,7.0,5.0,6.0,6.0,1.0,1.0,...,0.0,0.0,0.0,4869.941451,5795.69288,9432.360931,2.0,65.0,,8.125
max,12571.0,19.0,99.0,2.0,9.0,9.0,9.0,6.0,5.0,5.0,...,1.0,2.0,0.0,99707.832014,157143.686687,261879.953864,2.0,84.0,,15.4375


Now import the unclean dataframe and do the following transformations:

agepreg contains the mother’s age at the end of the pregnancy. In the data file, agepreg is encoded as an integer number of centiyears. So first divide each element of agepreg by 100, yielding a floating-point value in years.

In [14]:
# Code it here
preg = nsfg.ReadFemPreg(dct_file='../Think_Stats/Thinkstats2/2002FemPreg.dct',
                      dat_file='Thinkstats2/2002FemPreg.dat.gz')

preg['agepreg'] = preg['agepreg']/100
preg.agepreg.head()

0    33.16
1    39.25
2    14.33
3    17.83
4    18.33
Name: agepreg, dtype: float64

birthwgt_lb and birthwgt_oz contain the weight of the baby, in pounds and ounces, for pregnancies that end in live birth. In addition it uses several special codes:<br/>
97 NOT ASCERTAINED<br/>
98 REFUSED<br/>
99 DON'T KNOW<br/>

1. Replace those values with nan. 

Special values encoded as numbers are dangerous because if they are not handled properly, they can generate bogus results, like a 99-pound baby. The replace method replaces these values with np.nan, a special floating-point value that represents “not a number.” The inplace flag tells replace to modify the existing Series rather than create a new one.<br/>

In [15]:
# Code it here
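# Pass np.nan as the replacement value; without it, replace() does not set the codes to NaN
preg.loc[:,['birthwgt_lb','birthwgt_oz']] = preg.loc[:,['birthwgt_lb','birthwgt_oz']].replace([97,98,99], np.nan)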
preg.loc[:,['birthwgt_lb','birthwgt_oz']][13:18]

Unnamed: 0,birthwgt_lb,birthwgt_oz
13,,
14,,
15,7.0,11.0
16,7.0,8.0
17,6.0,5.0


Be careful: nan is not a string; it is a special floating-point value that numpy provides as np.nan.

As part of the IEEE floating-point standard, all mathematical operations
return nan if either argument is nan:<br/>
```
>>> import numpy as np
>>> np.nan / 100.0
nan
```
So computations with nan tend to do the right thing, and most pandas functions handle nan appropriately. But dealing with missing data will be a recurring issue.
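
A short sketch in plain Python shows what that means in practice. The weights below are invented. pandas methods such as `mean` skip nan by default, but built-in functions like `sum` do not, so the missing values have to be filtered out by hand.

```python
import math

weights = [7.0, math.nan, 8.5, 6.5]  # invented values, one of them missing

# Any arithmetic that touches nan yields nan.
total = sum(weights)
print(math.isnan(total))

# nan is not equal to anything, including itself, so test with math.isnan.
print(math.nan == math.nan)

# To summarize the valid values, filter the missing ones out first.
valid = [w for w in weights if not math.isnan(w)]
mean_valid = sum(valid) / len(valid)
print(mean_valid)
```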

Create a new column totalwgt_lb that combines pounds and ounces into a single quantity, in pounds.<br>
One important note: when you add a new column to a DataFrame, you must use dictionary syntax (preg['totalwgt_lb']), not dot notation (preg.totalwgt_lb).

In [16]:
# Code it here
preg['totalwgt_lb'] = preg['birthwgt_lb'] + preg['birthwgt_oz']*0.0625
preg['totalwgt_lb'].head(20)

0     8.8125
1     7.8750
2     9.1250
3     7.0000
4     6.1875
5     8.5625
6     9.5625
7     8.3750
8     7.5625
9     6.6250
10    7.8125
11    7.0000
12    4.0000
13       NaN
14       NaN
15    7.6875
16    7.5000
17    6.3125
18       NaN
19    8.7500
Name: totalwgt_lb, dtype: float64

Compare them with the results the function produces when it cleans the data

In [17]:
# Code it here
preg = nsfg.ReadFemPreg(dct_file='../Think_Stats/Thinkstats2/2002FemPreg.dct',
                      dat_file='Thinkstats2/2002FemPreg.dat.gz', clean = True)

preg['totalwgt_lb_clean'] = preg['birthwgt_lb'] + preg['birthwgt_oz']*0.0625
preg['totalwgt_lb_clean'].head(20)

0     8.8125
1     7.8750
2     9.1250
3     7.0000
4     6.1875
5     8.5625
6     9.5625
7     8.3750
8     7.5625
9     6.6250
10    7.8125
11    7.0000
12    4.0000
13       NaN
14       NaN
15    7.6875
16    7.5000
17    6.3125
18       NaN
19    8.7500
Name: totalwgt_lb_clean, dtype: float64

## Validation

When data is exported from one software environment and imported into another, errors might be introduced. And when you are getting familiar with a new dataset, you might interpret data incorrectly or introduce other misunderstandings. If you take time to validate the data, you can save time later and avoid errors. <br>
One way to validate data is to compute basic statistics and compare them with published results. For example, the NSFG codebook includes tables that summarize each variable. Here is the table for outcome, which encodes the outcome of each pregnancy:<br>

![alt text](notebookpics/number_rows_table.png "Title")

The Series class provides a method, value_counts, that counts the number of times each value appears. If we select the outcome Series from the DataFrame, we can use value_counts to compare with the published data:

In [18]:
# Code here
preg = nsfg.ReadFemPreg(dct_file='../Think_Stats/Thinkstats2/2002FemPreg.dct',
                      dat_file='Thinkstats2/2002FemPreg.dat.gz', clean=True)

In [19]:
preg.outcome.value_counts()

1    9148
4    1921
2    1862
6     352
5     190
3     120
Name: outcome, dtype: int64
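
Comparing the counts by eye works for six codes, but the check can also be automated. The helper below is a hypothetical sketch with invented toy data: `values` stands for a column read from the file and `published` for totals copied by hand from a codebook table.

```python
from collections import Counter

def compare_counts(values, published):
    """Return {code: (observed, expected)} for every code whose counts differ."""
    observed = Counter(values)
    codes = set(observed) | set(published)
    return {
        code: (observed.get(code, 0), published.get(code, 0))
        for code in sorted(codes)
        if observed.get(code, 0) != published.get(code, 0)
    }

# Invented toy data, not the NSFG counts.
values = [1, 1, 1, 2, 4, 4]
published = {1: 3, 2: 1, 4: 3}

mismatches = compare_counts(values, published)
print(mismatches)
```

An empty dictionary means every code matches the published table; anything else points at the codes that need a closer look.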

Similarly, here is the published table for birthwgt_lb. Is there anything weird? If so, fix it.<br>
![alt text](notebookpics/number_rows_table2.png "Title")

In [20]:
preg = nsfg.ReadFemPreg(dct_file='../Think_Stats/Thinkstats2/2002FemPreg.dct',
                      dat_file='Thinkstats2/2002FemPreg.dat.gz')

In [21]:
# Solution 1
bins = pd.IntervalIndex.from_tuples([(-1, 5), (5, 6), (6, 7), (7, 8), (8, 950)])
print(pd.cut(preg['birthwgt_lb'], bins).value_counts(dropna = False).sort_index())

NaN         4449
(-1, 5]     1125
(5, 6]      2223
(6, 7]      3049
(7, 8]      1889
(8, 950]     858
Name: birthwgt_lb, dtype: int64
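
When binning with `pd.cut`, the choice of interval edges matters. The sketch below uses a small invented Series to show that intervals built with `IntervalIndex.from_tuples` are closed on the right by default, so a value equal to the left edge of a bin is not counted in it.

```python
import pandas as pd

weights = pd.Series([0, 5, 6, 9, 99])  # invented values

# (-1, 5] contains 0 through 5, while (5, 6] contains 6 but not 5.
bins = pd.IntervalIndex.from_tuples([(-1, 5), (5, 6), (6, 950)])
counts = pd.cut(weights, bins).value_counts().sort_index()
print(counts)

# A first bin that starts at 0 leaves the value 0 unassigned (NaN).
narrow = pd.cut(weights, pd.IntervalIndex.from_tuples([(0, 5), (5, 950)]))
unassigned = int(narrow.isna().sum())
print(unassigned)
```

This is why the first bin in Solution 1 starts at -1 rather than 0.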


In [22]:
# Solution 2
sorted_index = preg.birthwgt_lb.value_counts(dropna = False).sort_index()
sorted_index

 0.0        8
 1.0       40
 2.0       53
 3.0       98
 4.0      229
 5.0      697
 6.0     2223
 7.0     3049
 8.0     1889
 9.0      623
 10.0     132
 11.0      26
 12.0      10
 13.0       3
 14.0       3
 15.0       1
 51.0       1
 97.0       1
 98.0       1
 99.0      57
NaN      4449
Name: birthwgt_lb, dtype: int64

In [23]:
# Solution 2: count directly on the column with boolean masks
lb = preg.birthwgt_lb
under6lb = (lb < 6).sum()
lb6 = (lb == 6).sum()
lb7 = (lb == 7).sum()
lb8 = (lb == 8).sum()
over9lb = ((lb >= 9) & (lb <= 15)).sum() # leaves out the implausible 51 and the special codes 97, 98, 99
inapplicable = lb.isnull().sum() # missing values

print('INAPPLICABLE |', inapplicable)
print('UNDER 6 POUNDS |', under6lb)
print('6 POUNDS |', lb6)
print('7 POUNDS |', lb7)
print('8 POUNDS |', lb8)
print('9 POUNDS OR MORE |', over9lb)

INAPPLICABLE | 4449
UNDER 6 POUNDS | 1125
6 POUNDS | 2223
7 POUNDS | 3049
8 POUNDS | 1889
9 POUNDS OR MORE | 798


In [24]:
preg = preg.loc[preg['birthwgt_lb'] <= 15.0] # More than 15 lbs doesn't make sense; note that this also drops the rows where birthwgt_lb is NaN
preg.birthwgt_lb.value_counts(dropna = False).sort_index()

0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
15.0       1
Name: birthwgt_lb, dtype: int64
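
Selecting rows with a comparison, as the cell above does, is not the only way to deal with implausible values. The sketch below uses a tiny invented DataFrame to contrast two options: filtering, which removes rows (including every row where the value is missing), and masking, which keeps all rows and only blanks out the bad value.

```python
import numpy as np
import pandas as pd

# Invented rows: one pregnancy without a birth weight, one implausible weight.
df = pd.DataFrame({
    "caseid": [1, 1, 2, 3],
    "outcome": [4, 1, 1, 1],
    "birthwgt_lb": [np.nan, 7.0, 51.0, 8.0],
})

# Filtering keeps only rows that pass the test. A comparison with NaN is
# False, so the row without a birth weight disappears as well.
filtered = df.loc[df["birthwgt_lb"] <= 15]
print(len(filtered))

# Masking keeps every row and only replaces the implausible value with NaN.
masked = df.copy()
masked.loc[masked["birthwgt_lb"] > 15, "birthwgt_lb"] = np.nan
print(len(masked), masked["birthwgt_lb"].isna().sum())
```

Which option is right depends on what comes next: any later analysis of outcomes other than live births needs the rows that filtering throws away.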

## Interpretation

To work with data effectively, you have to think on two levels at the same time: the level of statistics and the level of context.<br>
As an example, let’s look at the sequence of outcomes for a few respondents.
Because of the way the data files are organized, we have to do some processing to collect the pregnancy data for each respondent.

Create a dictionary that maps each caseid to the list of outcomes of all the pregnancies reported by that respondent.

The output should look like: {1:[1,1],2:[1,1,1].....}

Don't use pandas DataFrame functions.

In [25]:
preg.head()

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,...,poverty_i,laborfor_i,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw
0,1,1,,,,,6.0,,1.0,,...,0,0,0,0,3410.389399,3869.349602,6448.271112,2,9,1231
1,1,2,,,,,6.0,,1.0,,...,0,0,0,0,3410.389399,3869.349602,6448.271112,2,9,1231
2,2,1,,,,,5.0,,3.0,5.0,...,0,0,0,0,7226.30174,8567.54911,12999.542264,2,12,1231
3,2,2,,,,,6.0,,1.0,,...,0,0,0,0,7226.30174,8567.54911,12999.542264,2,12,1231
4,2,3,,,,,6.0,,1.0,,...,0,0,0,0,7226.30174,8567.54911,12999.542264,2,12,1231


In [26]:
list(preg.loc[preg.caseid == 1, 'outcome'])


[1, 1]

In [27]:
dicto = {}

for i in preg.caseid:
    dicto[i] = list(preg.loc[preg.caseid == i, 'outcome'])
    
dicto

{1: [1, 1],
 2: [1, 1, 1],
 6: [1, 1, 1],
 7: [1, 1],
 12: [1],
 14: [1, 1],
 15: [1, 1],
 18: [1],
 21: [1, 1],
 23: [1],
 24: [1, 1, 1],
 28: [1],
 31: [1, 1, 1],
 36: [1],
 38: [1, 1, 1],
 39: [1],
 44: [1, 1],
 46: [1, 1],
 49: [1, 1],
 51: [1, 1],
 57: [1, 1, 1],
 60: [1, 1],
 63: [1, 1],
 69: [1],
 70: [1, 1],
 71: [1],
 72: [1],
 73: [1, 1],
 77: [1, 1],
 80: [1, 1, 1, 1],
 81: [1],
 86: [1, 1, 1],
 90: [1],
 91: [1, 1, 1, 1],
 92: [1, 1, 1],
 95: [1],
 101: [1, 1],
 106: [1, 1, 1],
 114: [1],
 115: [1],
 118: [1, 1, 1],
 119: [1, 1],
 123: [1],
 132: [1],
 135: [1],
 138: [1],
 139: [1, 1],
 142: [1, 1, 1, 1],
 145: [1],
 149: [1, 1, 1],
 150: [1, 1, 1, 1, 1, 1],
 151: [1],
 152: [1, 1, 1],
 153: [1, 1],
 156: [1],
 159: [1, 1],
 160: [1, 1, 1],
 172: [1, 1, 1],
 173: [1],
 176: [1, 1],
 181: [1, 1],
 183: [1, 1],
 184: [1],
 186: [1, 1],
 190: [1, 1, 1, 1],
 193: [1, 1],
 209: [1],
 210: [1, 1, 1],
 213: [1, 1],
 215: [1, 1],
 218: [1, 1],
 219: [1],
 222: [1],
 227: [1, 1],
 

Now make it into a function

In [28]:
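def todict(preg, column1, column2):
    # Map each distinct value of column1 to the list of column2 values in its rows
    dicto = {}
    for i in preg[column1].unique():
        # Filter on column1 itself rather than on a hard-coded caseid column
        dicto[i] = list(preg.loc[preg[column1] == i, column2])
    
    return dicto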
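
For a large file, rescanning the DataFrame once per key is slow. A possible alternative, sketched below with invented values, builds the same kind of dictionary in a single pass over two parallel sequences and uses no DataFrame methods. `group_outcomes` is a hypothetical name chosen for this example.

```python
def group_outcomes(caseids, outcomes):
    """Map each caseid to the list of its outcomes, in file order."""
    grouped = {}
    for caseid, outcome in zip(caseids, outcomes):
        grouped.setdefault(caseid, []).append(outcome)
    return grouped

# Invented values; with the DataFrame you could pass preg['caseid'] and preg['outcome'].
caseids = [1, 1, 2, 2, 2, 6]
outcomes = [1, 1, 1, 4, 1, 2]
grouped = group_outcomes(caseids, outcomes)
print(grouped)
```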

In [29]:
todict(preg,'caseid','outcome')

{1: [1, 1],
 2: [1, 1, 1],
 6: [1, 1, 1],
 7: [1, 1],
 12: [1],
 14: [1, 1],
 15: [1, 1],
 18: [1],
 21: [1, 1],
 23: [1],
 24: [1, 1, 1],
 28: [1],
 31: [1, 1, 1],
 36: [1],
 38: [1, 1, 1],
 39: [1],
 44: [1, 1],
 46: [1, 1],
 49: [1, 1],
 51: [1, 1],
 57: [1, 1, 1],
 60: [1, 1],
 63: [1, 1],
 69: [1],
 70: [1, 1],
 71: [1],
 72: [1],
 73: [1, 1],
 77: [1, 1],
 80: [1, 1, 1, 1],
 81: [1],
 86: [1, 1, 1],
 90: [1],
 91: [1, 1, 1, 1],
 92: [1, 1, 1],
 95: [1],
 101: [1, 1],
 106: [1, 1, 1],
 114: [1],
 115: [1],
 118: [1, 1, 1],
 119: [1, 1],
 123: [1],
 132: [1],
 135: [1],
 138: [1],
 139: [1, 1],
 142: [1, 1, 1, 1],
 145: [1],
 149: [1, 1, 1],
 150: [1, 1, 1, 1, 1, 1],
 151: [1],
 152: [1, 1, 1],
 153: [1, 1],
 156: [1],
 159: [1, 1],
 160: [1, 1, 1],
 172: [1, 1, 1],
 173: [1],
 176: [1, 1],
 181: [1, 1],
 183: [1, 1],
 184: [1],
 186: [1, 1],
 190: [1, 1, 1, 1],
 193: [1, 1],
 209: [1],
 210: [1, 1, 1],
 213: [1, 1],
 215: [1, 1],
 218: [1, 1],
 219: [1],
 222: [1],
 227: [1, 1],
 

What are all the outcomes observed for caseid = 10229? (Use your calculated dictionary.)

In [30]:
todict(preg,'caseid','outcome')[10229]

[1]

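The outcome code 1 indicates a live birth. Code 4 indicates a miscarriage; that is, a pregnancy that ended spontaneously, usually with no known medical cause. Only the live birth appears in the output above because preg was filtered in the validation step to rows with a recorded birth weight; in the unfiltered pregnancy file this respondent has seven records, with outcomes [4, 4, 4, 4, 4, 4, 1].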

Statistically this respondent is not unusual. Miscarriages are common and there are other respondents who reported as many or more.

But remembering the context, this data tells the story of a woman who was pregnant six times, each time ending in miscarriage. Her seventh and most recent pregnancy ended in a live birth. If we consider this data with empathy, it is natural to be moved by the story it tells.

Each record in the NSFG dataset represents a person who provided honest answers to many personal and difficult questions. We can use this data to answer statistical questions about family life, reproduction, and health. At the same time, we have an obligation to consider the people represented by the data, and to afford them respect and gratitude.<br>

## Exercises

In [31]:
preg = nsfg.ReadFemPreg(dct_file='../Think_Stats/Thinkstats2/2002FemPreg.dct',
                      dat_file='Thinkstats2/2002FemPreg.dat.gz')
preg.head()

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,...,poverty_i,laborfor_i,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw
0,1,1,,,,,6.0,,1.0,,...,0,0,0,0,3410.389399,3869.349602,6448.271112,2,9,1231
1,1,2,,,,,6.0,,1.0,,...,0,0,0,0,3410.389399,3869.349602,6448.271112,2,9,1231
2,2,1,,,,,5.0,,3.0,5.0,...,0,0,0,0,7226.30174,8567.54911,12999.542264,2,12,1231
3,2,2,,,,,6.0,,1.0,,...,0,0,0,0,7226.30174,8567.54911,12999.542264,2,12,1231
4,2,3,,,,,6.0,,1.0,,...,0,0,0,0,7226.30174,8567.54911,12999.542264,2,12,1231


Select the `birthord` column, print the value counts, and compare to results published in the [codebook](http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611933)

In [32]:
# Solution goes here
print(preg['birthord'].value_counts( dropna = False))
print('TOTAL  ', len(preg['birthord']))

# value	label	 	Total
# .	INAPPLICABLE	 	4445
# 1	1ST BIRTH	 	4413
# 2	2ND BIRTH	 	2874
# 3	3RD BIRTH	 	1234
# 4	4TH BIRTH	 	421
# 5	5TH BIRTH	 	126
# 6	6TH BIRTH	 	50
# 7	7TH BIRTH	 	20
# 8	8TH BIRTH	 	7
# 9	9TH BIRTH	 	2
# 10	10TH BIRTH	 	1
#  	Total	 	13593

NaN      4445
 1.0     4413
 2.0     2874
 3.0     1234
 4.0      421
 5.0      126
 6.0       50
 7.0       20
 8.0        7
 9.0        2
 10.0       1
Name: birthord, dtype: int64
TOTAL   13593


We can also use `isnull` to count the number of nans.

In [33]:
preg['birthord'].isnull().head(15)

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13     True
14     True
Name: birthord, dtype: bool

In [34]:
sum(preg['birthord'].isnull()) # True = 1, False = 0

4445

Select the `prglngth` column, print the value counts, and compare to results published in the [codebook](http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611931)

In [35]:
# Solution goes here

# pd.cut intervals are closed on the right, so each bin starts just below its first week
bins1 = pd.IntervalIndex.from_tuples([(-1, 13), (13, 26), (26, 50)])
print(pd.cut(preg['prglngth'], bins1).value_counts(dropna = False).sort_index())
print('TOTAL  ', len(preg['prglngth']))

# value	label	 	Total
# 0-13	13 WEEKS OR LESS	 	3522
# 14-26	14-26 WEEKS	 	793
# 27-50	27 WEEKS OR LONGER	 	9278
#  	Total	 	13593

(-1, 13]    3522
(13, 26]     793
(26, 50]    9278
Name: prglngth, dtype: int64
TOTAL   13593


To compute the mean of a column, you can invoke the `mean` method on a Series.  For example, here is the mean birthweight in pounds:

In [36]:
preg.birthwgt_lb.mean()

7.431321084864392

Create a new column named <tt>totalwgt_kg</tt> that contains birth weight in kilograms.  Compute its mean.  Remember that when you create a new column, you have to use dictionary syntax, not dot notation.

In [37]:
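# Solution goes here
# Note: this converts only the whole pounds in birthwgt_lb; the ounces in birthwgt_oz are not included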
preg['totalwgt_kg'] = preg['birthwgt_lb']*0.453592
print(preg.totalwgt_kg.mean())
print(preg.birthwgt_lb.mean()*0.453592)

3.3707877935258095
3.370787793525809


`nsfg.py` also provides `ReadFemResp`, which reads the female respondents file and returns a `DataFrame`:

In [38]:
resp = nsfg.ReadFemResp(dct_file='../Think_Stats/Thinkstats2/2002FemResp.dct',
                      dat_file='Thinkstats2/2002FemResp.dat.gz')

`DataFrame` provides a method `head` that displays the first five rows:

In [39]:
resp.head()

Unnamed: 0,caseid,rscrinf,rdormres,rostscrn,rscreenhisp,rscreenrace,age_a,age_r,cmbirth,agescrn,...,pubassis_i,basewgt,adj_mod_basewgt,finalwgt,secu_r,sest,cmintvw,cmlstyr,screentime,intvlngth
0,2298,1,5,5,1,5.0,27,27,902,27,...,0,3247.916977,5123.759559,5556.717241,2,18,1234,1222,18:26:36,110.492667
1,5012,1,5,1,5,5.0,42,42,718,42,...,0,2335.279149,2846.79949,4744.19135,2,18,1233,1221,16:30:59,64.294
2,11586,1,5,1,5,5.0,43,43,708,43,...,0,2335.279149,2846.79949,4744.19135,2,18,1234,1222,18:19:09,75.149167
3,6794,5,5,4,1,5.0,15,15,1042,15,...,0,3783.152221,5071.464231,5923.977368,2,18,1234,1222,15:54:43,28.642833
4,616,1,5,4,1,5.0,20,20,991,20,...,0,5341.329968,6437.335772,7229.128072,2,18,1233,1221,14:19:44,69.502667


Select the `age_r` column from `resp` and print the value counts.  How old are the youngest and oldest respondents?

In [40]:
# Solution goes here
counts = resp.age_r.value_counts(dropna = False).sort_index()
print(counts)
print()
print('Youngest:', counts.index[0])
print('Oldest:', counts.index[-1])

15    217
16    223
17    234
18    235
19    241
20    258
21    267
22    287
23    282
24    269
25    267
26    260
27    255
28    252
29    262
30    292
31    278
32    273
33    257
34    255
35    262
36    266
37    271
38    256
39    215
40    256
41    250
42    215
43    253
44    235
Name: age_r, dtype: int64

Youngest: 15
Oldest: 44


We can use the `caseid` to match up rows from `resp` and `preg`.  For example, we can select the row from `resp` for `caseid` 2298 like this:

In [41]:
resp[resp['caseid'] == 2298]

Unnamed: 0,caseid,rscrinf,rdormres,rostscrn,rscreenhisp,rscreenrace,age_a,age_r,cmbirth,agescrn,...,pubassis_i,basewgt,adj_mod_basewgt,finalwgt,secu_r,sest,cmintvw,cmlstyr,screentime,intvlngth
0,2298,1,5,5,1,5.0,27,27,902,27,...,0,3247.916977,5123.759559,5556.717241,2,18,1234,1222,18:26:36,110.492667


And we can get the corresponding rows from `preg` like this:

In [42]:
preg[preg['caseid'] == 2298]

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,...,laborfor_i,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw,totalwgt_kg
2610,2298,1,,,,,6.0,,1.0,,...,0,0,0,3247.916977,5123.759559,5556.717241,2,18,1234,2.721552
2611,2298,2,,,,,6.0,,1.0,,...,0,0,0,3247.916977,5123.759559,5556.717241,2,18,1234,2.26796
2612,2298,3,,,,,6.0,,1.0,,...,0,0,0,3247.916977,5123.759559,5556.717241,2,18,1234,1.814368
2613,2298,4,,,,,6.0,,1.0,,...,0,0,0,3247.916977,5123.759559,5556.717241,2,18,1234,2.721552


How old is the respondent with `caseid` 1?

In [53]:
# Solution goes here
# Take the single matching value with .iloc[0] instead of calling int() on a Series
print('Age:', resp.loc[resp['caseid'] == 1, 'age_r'].iloc[0])

Age: 44


What are the pregnancy lengths for the respondent with `caseid` 2298?

In [58]:
# Solution goes here
print('Pregnancy lengths (weeks):', list(preg.loc[preg['caseid'] == 2298, 'prglngth']))

Pregnancy lengths (weeks): [40, 36, 30, 40]


What was the birthweight of the first baby born to the respondent with `caseid` 2732?

In [63]:
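# Solution goes here
# finalwgt is the statistical weight of the respondent, not a birth weight,
# so select the birth weight columns for the first live birth (birthord == 1)
first_birth = preg.loc[(preg['caseid'] == 2732) & (preg['birthord'] == 1), ['birthwgt_lb', 'birthwgt_oz']]
print(first_birth)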


In the repository you downloaded, you should find a file named chap01ex.py; using this file as a starting place, write a function that reads the respondent file, 2002FemResp.dat.gz.

The variable pregnum is a recode that indicates how many times each respondent has been pregnant. Print the value counts for this variable and compare them to the published results in the NSFG codebook. You can also crossvalidate the respondent and pregnancy files by comparing pregnum for each respondent with the number of records in the pregnancy file.

You can use nsfg.MakePregMap to make a dictionary that maps from each caseid to a list of indices into the pregnancy DataFrame.
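
The cross-validation can be sketched on toy data as follows. `make_preg_map` is a hand-written stand-in for the idea behind nsfg.MakePregMap, not the library's code, and the lists and the `find_mismatches` helper are invented for illustration.

```python
from collections import defaultdict

def make_preg_map(preg_caseids):
    """Map each caseid to the list of row positions where it appears."""
    preg_map = defaultdict(list)
    for index, caseid in enumerate(preg_caseids):
        preg_map[caseid].append(index)
    return preg_map

def find_mismatches(pregnum_by_caseid, preg_map):
    """Return caseids whose pregnum differs from their number of pregnancy records."""
    return [
        caseid
        for caseid, pregnum in pregnum_by_caseid.items()
        if pregnum != len(preg_map.get(caseid, []))
    ]

# Invented toy data standing in for the pregnancy and respondent files.
preg_caseids = [1, 1, 2, 2, 2, 5]
pregnum_by_caseid = {1: 2, 2: 3, 3: 0, 5: 2}

preg_map = make_preg_map(preg_caseids)
mismatched = find_mismatches(pregnum_by_caseid, preg_map)
print(dict(preg_map))
print(mismatched)
```

Respondents with no pregnancies have no records in the pregnancy file, so the lookup falls back to an empty list and a pregnum of 0 counts as a match.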

## Extra exercise

The best way to learn about statistics is to work on a project you are interested in. Is there a question like, “Do first babies arrive late,” that you want to investigate?

Think about questions you find personally interesting, or items of conventional wisdom, or controversial topics, or questions that have political consequences, and see if you can formulate a question that lends itself to statistical inquiry.

Look for data to help you address the question. Governments are good sources because data from public research is often freely available. Good places to start include http://www.data.gov/ and http://www.science.gov/, and in the United Kingdom, http://data.gov.uk/.

Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/, and the European Social Survey at http://www.europeansocialsurvey.org/.

You can also use typical data science datasets from places like https://www.kdnuggets.com/datasets/index.html or https://www.kaggle.com/datasets

If it seems like someone has already answered your question, look closely to see whether the answer is justified. There might be flaws in the data or the analysis that make the conclusion unreliable. In that case you could perform a different analysis of the same data, or look for a better source of data.

If you find a published paper that addresses your question, you should be able to get the raw data. Many authors make their data available on the web, but for sensitive data you might have to write to the authors, provide information about how you plan to use the data, or agree to certain terms of use. Be persistent!

###### Javier Fernández Suárez