The question:

**Do first time babies tend to arrive late?**

Most answers to this question rely on anecdotal evidence, which is based on data that is unpublished and usually personal. Anecdotal evidence usually fails because:

- **Small number of observations**: If pregnancy length is longer for first babies, the difference is probably small compared to natural variation. In that case, we might have to compare a large number of pregnancies to be sure that a difference exists.
- **Selection bias**: People who join a discussion of this question might be interested because their first babies were late. In that case the process of selecting data would bias the results.
- **Confirmation bias**: People who believe the claim might be more likely to contribute examples that confirm it. People who doubt the claim are more likely to cite counterexamples.
- **Inaccuracy**: Anecdotes are often personal stories, and often misremembered, misrepresented, repeated inaccurately, etc.

# Statistical Approach

To address the limitations of anecdotes, we will use the tools of statistics, which include:
- **Data collection**: We will use data from a large national survey that was designed explicitly with the goal of generating statistically valid inferences about the U.S. population.
- **Descriptive statistics**: We will generate statistics that summarize the data concisely, and evaluate different ways to visualize data.
- **Exploratory data analysis**: We will look for patterns, differences, and other features that address the questions we are interested in. At the same time we will check for inconsistencies and identify limitations.
- **Estimation**: We will use data from a sample to estimate characteristics of the general population.
- **Hypothesis testing**: Where we see apparent effects, like a difference between two groups, we will evaluate whether the effect might have happened by chance.

# The Data Source

We will be using the National Survey of Family Growth (NSFG).

See http://cdc.gov/nchs/nsfg.htm and explore the different data sets and information.

The NSFG is a **cross-sectional** study, which means that it captures a snapshot of a group at a point in time. The most common alternative is a **longitudinal study**, which observes a group repeatedly over a period of time.

The goal of the survey is to draw conclusions about a **population**; the target population of the NSFG is people in the United States aged 15-44. Ideally surveys would collect data from every member of the population, but that's seldom possible. Instead we collect data from a subset of the population called a **sample**. The people who participate in a survey are called **respondents**.

In general, cross-sectional studies are meant to be **representative**, which means that every member of the target population has an equal chance of participating. That ideal is hard to achieve in practice, but people who conduct surveys come as close as they can.

The NSFG is not representative; instead it is deliberately **oversampled**. The designers of the study recruited three groups (Hispanics, African-Americans and teenagers) at rates higher than their representation in the U.S. population, in order to make sure that the number of respondents in each of these groups is large enough to draw valid statistical inferences.

Of course, the drawback of oversampling is that it is not as easy to draw conclusions about the general population based on statistics from the survey. We will come back to this point later.
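To see why oversampling complicates conclusions about the general population, here is a minimal sketch with made-up numbers (not NSFG data). It assumes each respondent carries a sampling weight, and that respondents from an oversampled group carry smaller weights because each one represents fewer people. The names `values` and `weights` are hypothetical.

```python
# Hypothetical toy sample: each respondent has a measured value and a weight.
# The last four respondents belong to an oversampled group, so each of them
# represents fewer people in the population.
values = [40.0, 40.0, 38.0, 38.0, 38.0, 38.0]
weights = [3000.0, 3000.0, 500.0, 500.0, 500.0, 500.0]

# The unweighted mean treats every respondent equally.
unweighted_mean = sum(values) / len(values)

# The weighted mean counts each respondent in proportion to the weight.
weighted_mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(unweighted_mean)
print(weighted_mean)
```

The two means differ because the oversampled group pulls the unweighted mean toward its own values.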

The codebook and user's guide for the NSFG data are available from http://www.cdc.gov/nchs/nsfg/nsfgcycle6.htm

## Importing the data

First, go to https://github.com/AllenDowney/ThinkStats2 and clone the book's repository to your computer.
Once you are done, go to the folder ThinkStats2/code in your terminal and run nsfg.py:
> cd ThinkStats2/code

> python nsfg.py

You should get a message like 'All tests passed'

Now explore the data in the folder. What does 2002FemPreg.dct look like?

This is a Stata dictionary file.

thinkstats2.py is a module that provides code for reading Stata dictionaries.

A **module** is a Python object with arbitrarily named attributes that you can bind and reference. Simply, a module is a file consisting of Python code. A module can define functions, classes and variables. A module can also include runnable code.
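As a small illustration of these ideas, the sketch below uses `math` from the standard library as a stand-in; the same patterns should apply to any module, including the ones in the book's repository.

```python
import math  # a standard-library module, used here only as an example

# A module's attributes are reached with dot notation.
print(math.sqrt(16.0))

# Individual names can also be imported directly from a module.
from math import floor
print(floor(2.7))

# dir() lists the names a module defines, which helps when exploring it.
names = dir(math)
print('sqrt' in names)
```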

Explore the nsfg module. Find the function ReadFemPreg() and then import it.

You might have to copy the module to the correct directory.

In [None]:
## Code here

What are the columns in the DataFrame?

In [None]:
## Code here

What is the first column?

In [None]:
## Code here

Access the pregordr column. Use two different methods.

In [None]:
## Code here

What type is the column object? And what is the type of the values it contains?

In [None]:
## Code here

Get rows 2 to 4 of the column.

In [None]:
## Code here

## Variables

Out of the 244 variables, we are only going to use:

- *prglngth* is the integer duration of the pregnancy in weeks.
- *outcome* is an integer code for the outcome of the pregnancy. The code 1 indicates a live birth.
- *pregordr* is a pregnancy serial number; for example, the code for a respondent’s first pregnancy is 1, for the second pregnancy is 2, and so on.
- *birthord* is a serial number for live births; the code for a respondent’s first child is 1, and so on. For outcomes other than live birth, this field is blank.
- *birthwgt_lb* and *birthwgt_oz* contain the pounds and ounces parts of the birth weight of the baby.
- *agepreg* is the mother’s age at the end of the pregnancy.
- *finalwgt* is the statistical weight associated with the respondent. It is a floating-point value that indicates the number of people in the U.S. population this respondent represents.

If you read the codebook carefully, you will see that many of the variables are **recodes**, which means that they are not part of the raw data collected by the survey; they are calculated using the **raw data**.

For example, prglngth for live births is equal to the raw variable wksgest (weeks of gestation) if it is available; otherwise it is estimated using mosgest * 4.33 (months of gestation times the average number of weeks in a month).
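The rule described above can be sketched in pandas with made-up values (not NSFG data). This only illustrates "use wksgest if available, otherwise mosgest * 4.33"; the actual NSFG recode may round the result or handle additional cases, and the DataFrame `raw` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data, not from the NSFG: wksgest is missing in the second and third rows.
raw = pd.DataFrame({
    'wksgest': [39.0, np.nan, np.nan],
    'mosgest': [9.0, 9.0, np.nan],
})

# Prefer the reported weeks; otherwise fall back to months times 4.33.
# If both are missing, the result stays nan.
prglngth = raw['wksgest'].fillna(raw['mosgest'] * 4.33)
print(prglngth)
```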

## Transformation

When you import data like this, you often have to check for errors, deal with special values, convert data into different formats, and perform calculations. These operations are called **data cleaning**.

In [1]:
import nsfg

Code the following data cleaning transformations:

agepreg contains the mother’s age at the end of the pregnancy. In the data file, agepreg is encoded as an integer number of centiyears. So the first step is to divide each element of agepreg by 100, yielding a floating-point value in years.

In [15]:
# Code it here

birthwgt_lb and birthwgt_oz contain the weight of the baby, in pounds and ounces, for pregnancies that end in live birth. In addition, these variables use several special codes:<br/>
97 NOT ASCERTAINED<br/>
98 REFUSED<br/>
99 DON'T KNOW<br/>
Special values encoded as numbers are dangerous because if they are not handled properly, they can generate bogus results, like a 99-pound baby. The replace method of a Series can replace these values with np.nan, a special floating-point value that represents “not a number.” The inplace flag tells replace to modify the existing Series rather than create a new one.<br/>
As part of the IEEE floating-point standard, all mathematical operations return nan if either argument is nan:<br/>
```
>>> import numpy as np
>>> np.nan / 100.0
nan
```
So computations with nan tend to do the right thing, and most pandas functions handle nan appropriately. But dealing with missing data will be a recurring issue.
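To illustrate why special codes matter, here is a minimal sketch on a made-up Series (not NSFG data). It uses `replace` without the inplace flag, so it returns a new Series; the names `pounds` and `cleaned` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy values, not NSFG data: 97 and 99 stand for special codes.
pounds = pd.Series([7, 8, 99, 6, 97], dtype=float)

# Without cleaning, the special codes inflate the mean.
print(pounds.mean())

# Replace the special codes with nan; mean() skips nan by default.
cleaned = pounds.replace([97, 98, 99], np.nan)
print(cleaned.mean())
```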

Replace those values with nan:

In [16]:
# Code it here

Create a new column totalwgt_lb that combines pounds and ounces into a single quantity, in pounds.<br>
One important note: when you add a new column to a DataFrame, you must use dictionary syntax, like df['totalwgt_lb'], not dot notation, like df.totalwgt_lb.
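The difference between the two syntaxes can be seen in a small sketch on made-up data (not NSFG data). The column names `lb`, `oz`, `total_lb` and the attribute name `total_dot` are hypothetical.

```python
import warnings

import pandas as pd

# Toy data, not NSFG data.
df = pd.DataFrame({'lb': [7.0, 8.0], 'oz': [8.0, 4.0]})

# Dictionary syntax creates a new column.
df['total_lb'] = df['lb'] + df['oz'] / 16.0

# Dot notation only sets an attribute on the DataFrame object;
# pandas warns about it, and no column is created.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df.total_dot = df['lb'] + df['oz'] / 16.0

print(list(df.columns))
```

Only `total_lb` appears in the list of columns; the attribute set with dot notation is not part of the DataFrame's data.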

In [None]:
# Code it here

Compare your results with the results of the function nsfg.CleanFemPreg:

In [18]:
nsfg.CleanFemPreg(df)

In [None]:
# Code it here

## Validation

When data is exported from one software environment and imported into another, errors might be introduced. And when you are getting familiar with a new dataset, you might interpret data incorrectly or introduce other
misunderstandings. If you take time to validate the data, you can save time later and avoid errors. <br>
One way to validate data is to compute basic statistics and compare them with published results. For example, the NSFG codebook includes tables that summarize each variable. Here is the table for outcome, which encodes the outcome of each pregnancy:<br>

![alt text](notebookpics/number_rows_table.png "Title")

The Series class provides a method, value_counts, that counts the number of times each value appears.<br>
Select the outcome Series from the DataFrame and use value_counts to compare with the published data:
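As a sketch of how value_counts behaves, here it is applied to a made-up Series of codes (not NSFG data). Sorting by index is an optional step that can make the comparison with a codebook table easier.

```python
import pandas as pd

# Toy outcome codes, not NSFG data.
outcome = pd.Series([1, 1, 4, 1, 2, 4])

# value_counts orders the result by frequency; sort_index orders it by
# code instead, which is closer to how codebook tables are laid out.
counts = outcome.value_counts().sort_index()
print(counts)
```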

In [19]:
# Code here

Similarly, here is the published table for birthwgt_lb. Is there anything weird? If so, fix it.<br>
![alt text](notebookpics/number_rows_table2.png "Title")

In [None]:
# Code here

Compare your results with the results of the function nsfg.CleanFemPreg:

In [None]:
nsfg.CleanFemPreg(df)

# Code it here

## Interpretation

To work with data effectively, you have to think on two levels at the same time: the level of statistics and the level of context.<br>
As an example, let’s look at the sequence of outcomes for a few respondents.
Because of the way the data files are organized, we have to do some processing to collect the pregnancy data for each respondent.

Create a dictionary that maps each caseid to a list of the row indices of that respondent's pregnancies:
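One possible approach, sketched on made-up data (not NSFG data), uses a `defaultdict` of lists. The name `preg_map` and the toy DataFrame are hypothetical, and other approaches (for example, `groupby`) would work as well.

```python
from collections import defaultdict

import pandas as pd

# Toy data, not NSFG data: one row per pregnancy.
df = pd.DataFrame({
    'caseid': [1, 1, 2, 3, 3, 3],
    'outcome': [1, 4, 1, 4, 4, 1],
})

# Map each caseid to the list of row indices of that respondent's pregnancies.
preg_map = defaultdict(list)
for index, caseid in df['caseid'].items():
    preg_map[caseid].append(index)

# Use the indices to look up the outcomes for one respondent.
outcomes = df.loc[preg_map[3], 'outcome'].tolist()
print(outcomes)
```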

Now make it into a function.

What are all the outcomes observed for caseid = 10229?

The outcome code 1 indicates a live birth. Code 4 indicates a miscarriage; that is, a pregnancy that ended spontaneously, usually with no known medical cause.

Statistically this respondent is not unusual. Miscarriages are common and there are other respondents who reported as many or more.

But remembering the context, this data tells the story of a woman who was pregnant six times, each time ending in miscarriage. Her seventh and most recent pregnancy ended in a live birth. If we consider this data with empathy,
it is natural to be moved by the story it tells.

Each record in the NSFG dataset represents a person who provided honest answers to many personal and difficult questions. We can use this data to answer statistical questions about family life, reproduction, and health. At
the same time, we have an obligation to consider the people represented by the data, and to afford them respect and gratitude.<br>

## Exercises

Select the `birthord` column, print the value counts, and compare to results published in the [codebook](http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611933)