The question:

Do first time babies tend to arrive late?

Many answers to this question rely on **anecdotal evidence**: reports based on data that is unpublished and usually personal. Anecdotes usually fail as evidence because of:

- **Small number of observations**: If pregnancy length is longer for first babies, the difference is probably small compared to natural variation. In that case, we might have to compare a large number of pregnancies to be sure that a difference exists.
- **Selection bias**: People who join a discussion of this question might be interested because their first babies were late. In that case the process of selecting data would bias the results.
- **Confirmation bias**: People who believe the claim might be more likely to contribute examples that confirm it. People who doubt the claim are more likely to cite counterexamples.
- **Inaccuracy**: Anecdotes are often personal stories, and often misremembered, misrepresented, repeated inaccurately, etc.

# Statistical Approach

To address the limitations of anecdotes, we will use the tools of statistics, which include:

- **Data collection**: We will use data from a large national survey that was designed explicitly with the goal of generating statistically valid inferences about the U.S. population.
- **Descriptive statistics**: We will generate statistics that summarize the data concisely, and evaluate different ways to visualize data.
- **Exploratory data analysis**: We will look for patterns, differences, and other features that address the questions we are interested in. At the same time we will check for inconsistencies and identify limitations.
- **Estimation**: We will use data from a sample to estimate characteristics of the general population.
- **Hypothesis testing**: Where we see apparent effects, like a difference between two groups, we will evaluate whether the effect might have happened by chance.

# The Data Source

We will be using the National Survey of Family Growth (NSFG).

See http://cdc.gov/nchs/nsfg.htm and explore the different data sets and information.

The NSFG is a **cross-sectional** study, which means that it captures a snapshot of a group at a point in time. The most common alternative is a **longitudinal study**, which observes a group repeatedly over a period of time.

The goal of the survey is to draw conclusions about a **population**; the target population of the NSFG is people in the United States aged 15-44. Ideally surveys would collect data from every member of the population, but that's seldom possible. Instead we collect data from a subset of the population called a **sample**. The people who participate in a survey are called **respondents**.

In general, cross-sectional studies are meant to be **representative**, which means that every member of the target population has an equal chance of participating. That ideal is hard to achieve in practice, but people who conduct surveys come as close as they can.

The NSFG is not representative; instead it is deliberately **oversampled**. The designers of the study recruited three groups (Hispanics, African-Americans and teenagers) at rates higher than their representation in the U.S. population, in order to make sure that the number of respondents in each of these groups is large enough to draw valid statistical inferences.

Of course, the drawback of oversampling is that it is not as easy to draw conclusions about the general population based on statistics from the survey. We will come back to this point later.
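
One common way to correct for oversampling is to weight each respondent by the number of people they represent, which is the role `finalwgt` plays in the NSFG. The sketch below uses made-up numbers (the `age` and `weight` arrays are hypothetical, not NSFG data) to show how far a weighted mean can sit from the naive one.

```python
import numpy as np

# Hypothetical sample: teenagers are oversampled, so each one
# represents fewer people in the population than each adult does.
age = np.array([16, 17, 18, 30, 40])
weight = np.array([1000, 1000, 1000, 6000, 6000])  # people represented

unweighted_mean = age.mean()
weighted_mean = np.average(age, weights=weight)

print(unweighted_mean)  # 24.2
print(weighted_mean)    # 31.4
```

The unweighted mean is pulled toward the oversampled teenagers; the weighted mean gives each respondent influence in proportion to the population they stand for.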

The codebook and user's guide for the NSFG data are available from http://www.cdc.gov/nchs/nsfg/nsfgcycle6.htm

## Importing the data

First, go to https://github.com/AllenDowney/ThinkStats2 and clone the book repo to your computer.
Once you are done, go to the folder ThinkStats2/code in your terminal and run nsfg.py:
> cd ThinkStats2/code

> python nsfg.py

You should get a message like 'All tests passed'.

Now explore the data in the folder. What does 2002FemPreg.dct look like?

This is a Stata dictionary file.

The module thinkstats2.py includes code to read Stata dictionaries.

A **module** is a Python object with arbitrarily named attributes that you can bind and reference. Simply, a module is a file consisting of Python code. A module can define functions, classes and variables. A module can also include runnable code.

Explore the module nsfg. Find the function ReadFemPreg() and then import it.

You might have to copy the module to the correct directory.

In [81]:
import pandas as pd
import nsfg
import numpy as np

In [82]:
## Code here

data = nsfg.ReadFemPreg()

What are the columns in the Dataframe?

In [83]:
## Code here

a= data.columns

How many columns does it have? Use 2 methods to calculate it.

In [84]:
print(len(a))

data.shape[1]

244


244

Remember that columns is not a **method**; it is an **attribute**, so you access it without parentheses.
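
The distinction shows up in the syntax: an attribute is read by name, while a method has to be called. Here is a minimal sketch on a small made-up DataFrame (the values are hypothetical, not NSFG data):

```python
import pandas as pd

# Small made-up DataFrame, used only to illustrate the syntax
df = pd.DataFrame({'caseid': [1, 1, 2], 'pregordr': [1, 2, 1]})

cols = df.columns        # attribute: no parentheses
first_rows = df.head(2)  # method: must be called with parentheses

print(list(cols))            # ['caseid', 'pregordr']
print(callable(df.head))     # True
print(callable(df.columns))  # False
```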

What is the first column?

In [85]:

## Code here
a[0]

'caseid'

Access pregordr column. Use 2 different methods.

In [86]:
## Code here
met1 = data['pregordr']

met2 = data.pregordr

What type are the values in that column? And what type is the column object itself?

In [87]:
## Code here

np.dtype(met1)
    
    

dtype('int64')

Get the rows 2 to 4 of the column

In [88]:
## Code here
met1[1:4]

1    2
2    1
3    2
Name: pregordr, dtype: int64
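
Plain slicing such as `met1[1:4]` selects by position here. When you want to be explicit, `iloc` slices by position (end excluded) and `loc` slices by label (end included). A small sketch on a made-up Series with the default integer index:

```python
import pandas as pd

s = pd.Series([1, 2, 1, 2, 3], name='pregordr')  # made-up values

by_position = s.iloc[1:4]  # positions 1, 2, 3 (end excluded)
by_label = s.loc[1:3]      # labels 1, 2, 3 (end included)

print(by_position.tolist())  # [2, 1, 2]
print(by_label.tolist())     # [2, 1, 2]
```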

## Variables

Out of the 244 columns, we are only going to use:

- *prglngth* is the integer duration of the pregnancy in weeks.
- *outcome* is an integer code for the outcome of the pregnancy. The code 1 indicates a live birth.
- *pregordr* is a pregnancy serial number; for example, the code for a respondent’s first pregnancy is 1, for the second pregnancy is 2, and so on.
- *birthord* is a serial number for live births; the code for a respondent’s first child is 1, and so on. For outcomes other than live birth, this field is blank.
- *birthwgt_lb* and *birthwgt_oz* contain the pounds and ounces parts of the birth weight of the baby.
- *agepreg* is the mother’s age at the end of the pregnancy.
- *finalwgt* is the statistical weight associated with the respondent. It is a floating-point value that indicates the number of people in the U.S. population this respondent represents.

If you read the codebook carefully, you will see that many of the variables are **recodes**, which means that they are not part of the raw data collected by the survey; they are calculated using the **raw data**.

For example, prglngth for live births is equal to the raw variable wksgest (weeks of gestation) if it is available; otherwise it is estimated using mosgest * 4.33 (months of gestation times the average number of weeks in a month).
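
The fallback rule described above can be sketched with `fillna`. The data below is made up, and the codebook's actual recode may round or treat edge cases differently; this only illustrates the "use wksgest if available, otherwise estimate from mosgest" logic.

```python
import numpy as np
import pandas as pd

# Made-up raw values: wksgest is missing for the second and third rows
raw = pd.DataFrame({
    'wksgest': [39.0, np.nan, np.nan],
    'mosgest': [9.0, 9.0, 8.0],
})

# Use wksgest where it is available; otherwise estimate from mosgest
prglngth = raw['wksgest'].fillna(raw['mosgest'] * 4.33)

print(prglngth.round(2).tolist())  # [39.0, 38.97, 34.64]
```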

## Transformation

When you import data like this, you often have to check for errors, deal with special values, convert data into different formats, and perform calculations. These operations are called **data cleaning**.

In [89]:
import nsfg

First of all, ReadFemPreg() calls a function, CleanFemPreg(), that cleans the data. Open the module again in a text editor and edit ReadFemPreg() so that it takes a parameter that decides whether to run the cleaning step, <br> such as:
>ReadFemPreg(clean=True)

Load the uncleaned file again (you might have to restart the kernel to get the new function imported), then code the following data cleaning transformations:

In [90]:
#Load the file here
data = nsfg.ReadFemPreg(clean = False)


agepreg contains the mother’s age at the end of the pregnancy. In the data file, agepreg is encoded as an integer number of centiyears. So first divide each element of agepreg by 100, yielding a floating-point value in years.

In [91]:
# Code it here
data['agepreg'] = data['agepreg']/100

birthwgt_lb and birthwgt_oz contain the weight of the baby, in pounds and ounces, for pregnancies that end in live birth. In addition it uses several special codes:<br/>
97 NOT ASCERTAINED<br/>
98 REFUSED<br/>
99 DON'T KNOW<br/>

1. Replace those values with nan. 

Special values encoded as numbers are dangerous because if they are not handled properly, they can generate bogus results, like a 99-pound baby. The replace method replaces these values with np.nan, a special floating-point value that represents “not a number.” Passing inplace=True tells replace to modify the existing Series rather than create a new one; assigning the result back to the column has the same effect.<br/>

In [92]:
# Code it here

data['birthwgt_lb'] = data['birthwgt_lb'].replace([97,98,99],np.nan)
data['birthwgt_oz'] = data['birthwgt_oz'].replace([97,98,99],np.nan)



Be careful: nan is not a string; it is a special floating-point value, available in NumPy as np.nan.

As part of the IEEE floating-point standard, all mathematical operations
return nan if either argument is nan:<br/>
```
>>> import numpy as np
>>> np.nan / 100.0
nan
```
So computations with nan tend to do the right thing, and most pandas functions handle nan appropriately. But dealing with missing data will be a recurring issue.
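
A few properties of nan are worth checking for yourself. The sketch below uses a made-up Series to show that nan is not equal to itself, how to test for it, and how pandas skips it when computing a mean.

```python
import numpy as np
import pandas as pd

weights = pd.Series([8.0, np.nan, 7.0])  # made-up values

print(np.nan == np.nan)        # False: nan is not equal to itself
print(np.isnan(np.nan))        # True: the reliable way to test for nan
print(weights.mean())          # 7.5: pandas skips nan by default
print(weights.isnull().sum())  # 1: number of missing values
```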

Create a new column totalwgt_lb that combines pounds and ounces into a single quantity, in pounds.<br>
One important note: when you add a new column to a DataFrame, you must use dictionary syntax, not dot notation.

In [93]:
# Code it here
data['total_weight'] = data['birthwgt_lb'] + ((1/16)*data['birthwgt_oz'])

data['total_weight'].head()


0    8.8125
1    7.8750
2    9.1250
3    7.0000
4    6.1875
Name: total_weight, dtype: float64

Compare them with the results of the function when it cleans the data:

In [94]:
# Code it here
data1 = nsfg.ReadFemPreg()

data1.loc[data1['totalwgt_lb']!=data['total_weight'],'totalwgt_lb']
data.loc[data1['totalwgt_lb']!=data['total_weight'],'total_weight'].head()

# They differ at the NaNs because, by definition, NaN is never equal to NaN

13   NaN
14   NaN
18   NaN
22   NaN
30   NaN
Name: total_weight, dtype: float64
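
Because nan never equals nan, a plain `!=` comparison reports missing values as mismatches. One way to compare two columns while treating nan in the same position as a match is `np.isclose(..., equal_nan=True)`, or `Series.equals` for a single yes/no answer. A sketch on made-up values:

```python
import numpy as np
import pandas as pd

a = pd.Series([8.8125, np.nan, 7.0])  # made-up values
b = pd.Series([8.8125, np.nan, 7.0])

# Naive comparison: the nan row counts as different
naive_mismatches = (a != b).sum()

# Comparison that treats nan in the same position as equal
same = np.isclose(a.to_numpy(), b.to_numpy(), equal_nan=True)
real_mismatches = (~same).sum()

print(naive_mismatches)  # 1
print(real_mismatches)   # 0
print(a.equals(b))       # True
```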

## Validation

When data is exported from one software environment and imported into another, errors might be introduced. And when you are getting familiar with a new dataset, you might interpret data incorrectly or introduce other misunderstandings. If you take time to validate the data, you can save time later and avoid errors. <br>
One way to validate data is to compute basic statistics and compare them with published results. For example, the NSFG codebook includes tables that summarize each variable. Here is the table for outcome, which encodes the outcome of each pregnancy:<br>

![alt text](notebookpics/number_rows_table.png "Title")

The Series class provides a method, value_counts, that counts the number of times each value appears.<br>
Select the outcome Series from the DataFrame and use value_counts to compare with the published data:

In [95]:
# Code here
data['outcome'].value_counts()

1    9148
4    1921
2    1862
6     352
5     190
3     120
Name: outcome, dtype: int64

Similarly, here is the published table for birthwgt_lb. Is there anything weird? If so, fix it.<br>
![alt text](notebookpics/number_rows_table2.png "Title")

In [96]:
# Code here
data['birthwgt_lb'].value_counts(sort = True)


7.0     3049
6.0     2223
8.0     1889
5.0      697
9.0      623
4.0      229
10.0     132
3.0       98
2.0       53
1.0       40
11.0      26
12.0      10
0.0        8
13.0       3
14.0       3
51.0       1
15.0       1
Name: birthwgt_lb, dtype: int64

In [97]:
data['birthwgt_lb'] = data['birthwgt_lb'].replace(51.0,np.nan)
data['birthwgt_lb'].value_counts(sort = True)

7.0     3049
6.0     2223
8.0     1889
5.0      697
9.0      623
4.0      229
10.0     132
3.0       98
2.0       53
1.0       40
11.0      26
12.0      10
0.0        8
13.0       3
14.0       3
15.0       1
Name: birthwgt_lb, dtype: int64
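
Replacing the single value 51 works for this file, but a rule based on plausibility carries over better to other data. The sketch below, on made-up values, marks any weight above 20 pounds as missing; the threshold of 20 is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'birthwgt_lb': [7.0, 6.0, 51.0, 8.0]})  # made-up values

# Treat implausibly large weights as missing
df.loc[df['birthwgt_lb'] > 20, 'birthwgt_lb'] = np.nan

print(df['birthwgt_lb'].tolist())  # [7.0, 6.0, nan, 8.0]
```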

## Interpretation

To work with data effectively, you have to think on two levels at the same time: the level of statistics and the level of context.<br>
As an example, let’s look at the sequence of outcomes for a few respondents.
Because of the way the data files are organized, we have to do some processing to collect the pregnancy data for each respondent.

Create a dictionary that maps each caseid to the list of outcomes of that respondent's pregnancies.

The output should look like: {1:[1,1,1,4],2:[1,1,1].....}

Don't use pandas DataFrame functions.

In [60]:
data[['caseid','outcome']].head(10)

Unnamed: 0,caseid,outcome
0,1,1
1,1,1
2,2,1
3,2,1
4,2,1
5,6,1
6,6,1
7,6,1
8,7,1
9,7,1


Now make it into a function

In [141]:
# Input:
data[['caseid','outcome']]

# Algorithm: collect the outcomes of each respondent, keyed by caseid
a = set(data['caseid'].tolist())
outdict = {}
for i in a:
    outdict[i]= data[data['caseid'] == i]['outcome'].tolist()

# Output: outdict


In [146]:
def dictfun(input_):
    a = set(input_['caseid'].tolist())
    outdict = {}
    for i in a:
        outdict[i]= input_[input_['caseid'] == i]['outcome'].tolist()
    return outdict

result = dictfun(data)
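
The function above filters the whole DataFrame once per respondent, which gets slow as the number of respondents grows. A single pass over the two columns does the same job with plain Python containers, which is closer to the spirit of the exercise. The sketch below uses made-up lists in place of the `caseid` and `outcome` columns; `outcomes_by_case` is a hypothetical helper name, not part of nsfg.py.

```python
from collections import defaultdict

def outcomes_by_case(caseids, outcomes):
    """Map each caseid to the list of its outcomes, in file order."""
    result = defaultdict(list)
    for caseid, outcome in zip(caseids, outcomes):
        result[caseid].append(outcome)
    return dict(result)

# Made-up values standing in for data['caseid'] and data['outcome']
caseids = [1, 1, 2, 2, 2]
outcomes = [1, 1, 1, 4, 1]

print(outcomes_by_case(caseids, outcomes))  # {1: [1, 1], 2: [1, 4, 1]}
```

With the real data you could call it as `outcomes_by_case(data['caseid'].tolist(), data['outcome'].tolist())`.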


    

In [147]:
def preg_(int_in):
    return data[data['caseid']== int_in]       

What are all the outcomes observed for caseid = 10229? (Use your calculated dictionary.)

In [150]:
outdict[10229]

[4, 4, 4, 4, 4, 4, 1]

The outcome code 1 indicates a live birth. Code 4 indicates a miscarriage; that is, a pregnancy that ended spontaneously, usually with no known medical cause.

Statistically this respondent is not unusual. Miscarriages are common and there are other respondents who reported as many or more.

But remembering the context, this data tells the story of a woman who was pregnant six times, each time ending in miscarriage. Her seventh and most recent pregnancy ended in a live birth. If we consider this data with empathy, it is natural to be moved by the story it tells.

Each record in the NSFG dataset represents a person who provided honest answers to many personal and difficult questions. We can use this data to answer statistical questions about family life, reproduction, and health. At the same time, we have an obligation to consider the people represented by the data, and to afford them respect and gratitude.<br>

## Exercises

Select the `birthord` column, print the value counts, and compare to results published in the [codebook](http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611933)

In [151]:
# Solution goes here

data['birthord'].value_counts()

1.0     4413
2.0     2874
3.0     1234
4.0      421
5.0      126
6.0       50
7.0       20
8.0        7
9.0        2
10.0       1
Name: birthord, dtype: int64

We can also use `isnull` to count the number of nans.

In [154]:
data['birthord'].isnull().sum()

4445

Select the `prglngth` column, print the value counts, and compare to results published in the [codebook](http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611931)

In [194]:
# Solution goes here

data['prglngth'].value_counts(sort = True).head()

39    4744
40    1120
38     609
9      594
41     591
Name: prglngth, dtype: int64

To compute the mean of a column, you can invoke the `mean` method on a Series.  For example, here is the mean birthweight in pounds:

In [157]:
data['total_weight'].mean()

7.270508352693882

Create a new column named `totalwgt_kg` that contains birth weight in kilograms.  Compute its mean.  Remember that when you create a new column, you have to use dictionary syntax, not dot notation.

In [175]:
# Solution goes here
data['totalwgt_kg'] = data['total_weight'] * 0.453592

In [176]:
data['totalwgt_kg'].mean()

3.297844424715123

`nsfg.py` also provides `ReadFemResp`, which reads the female respondents file and returns a `DataFrame`:

In [177]:
resp = nsfg.ReadFemResp()


`DataFrame` provides a method `head` that displays the first five rows:

In [179]:
resp.head()

Unnamed: 0,caseid,rscrinf,rdormres,rostscrn,rscreenhisp,rscreenrace,age_a,age_r,cmbirth,agescrn,...,pubassis_i,basewgt,adj_mod_basewgt,finalwgt,secu_r,sest,cmintvw,cmlstyr,screentime,intvlngth
0,2298,1,5,5,1,5.0,27,27,902,27,...,0,3247.916977,5123.759559,5556.717241,2,18,1234,1222,18:26:36,110.492667
1,5012,1,5,1,5,5.0,42,42,718,42,...,0,2335.279149,2846.79949,4744.19135,2,18,1233,1221,16:30:59,64.294
2,11586,1,5,1,5,5.0,43,43,708,43,...,0,2335.279149,2846.79949,4744.19135,2,18,1234,1222,18:19:09,75.149167
3,6794,5,5,4,1,5.0,15,15,1042,15,...,0,3783.152221,5071.464231,5923.977368,2,18,1234,1222,15:54:43,28.642833
4,616,1,5,4,1,5.0,20,20,991,20,...,0,5341.329968,6437.335772,7229.128072,2,18,1233,1221,14:19:44,69.502667


Select the `age_r` column from `resp` and print the value counts.  How old are the youngest and oldest respondents?

In [182]:
# Solution goes here

      
# Oldest/youngest?
print(resp['age_r'].max())
print(resp['age_r'].min())

# Value counts?
print(resp['age_r'].value_counts().head())
print(resp['age_r'].value_counts().head())


44
15
30    292
22    287
23    282
31    278
32    273
Name: age_r, dtype: int64


We can use the `caseid` to match up rows from `resp` and the pregnancy table `data`.  For example, we can select the row from `resp` for `caseid` 2298 like this:

In [183]:
resp[resp['caseid']==2298]



Unnamed: 0,caseid,rscrinf,rdormres,rostscrn,rscreenhisp,rscreenrace,age_a,age_r,cmbirth,agescrn,...,pubassis_i,basewgt,adj_mod_basewgt,finalwgt,secu_r,sest,cmintvw,cmlstyr,screentime,intvlngth
0,2298,1,5,5,1,5.0,27,27,902,27,...,0,3247.916977,5123.759559,5556.717241,2,18,1234,1222,18:26:36,110.492667


And we can get the corresponding rows from the pregnancy table `data` like this:

In [185]:
data[data['caseid'] == 2298]

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,...,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw,total_weight,totalwgt_kg
2610,2298,1,,,,,6.0,,1.0,,...,0,0,3247.916977,5123.759559,5556.717241,2,18,1234,6.875,3.118445
2611,2298,2,,,,,6.0,,1.0,,...,0,0,3247.916977,5123.759559,5556.717241,2,18,1234,5.5,2.494756
2612,2298,3,,,,,6.0,,1.0,,...,0,0,3247.916977,5123.759559,5556.717241,2,18,1234,4.1875,1.899417
2613,2298,4,,,,,6.0,,1.0,,...,0,0,3247.916977,5123.759559,5556.717241,2,18,1234,6.875,3.118445


How old is the respondent with `caseid` 1?

In [187]:
# Solution goes here
resp[resp['caseid'] == 1]['age_a']
# Using loc
resp.loc[resp['caseid'] == 1,'age_a']

1069    44
Name: age_a, dtype: int64

What are the pregnancy lengths for the respondent with `caseid` 2298?

In [189]:
# Solution goes here
data[data['caseid'] == 2298]['prglngth']
#using loc
data.loc[data['caseid'] == 2298,'prglngth']


2610    40
2611    36
2612    30
2613    40
Name: prglngth, dtype: int64

What was the birthweight of the first baby born to the respondent with `caseid` 2732?

In [223]:
# Solution goes here
# finalwgt is the respondent's statistical weight, not a birth weight.
# The birth weight in pounds is in total_weight; the first baby has birthord == 1.
data.loc[(data['caseid'] == 2732) & (data['birthord'] == 1), 'total_weight']

In the repository you downloaded, you should find a file named chap01ex.py; using this file as a starting place, write a function that reads the respondent file, 2002FemResp.dat.gz.

The variable pregnum is a recode that indicates how many times each respondent has been pregnant. Print the value counts for this variable and compare them to the published results in the NSFG codebook. You can also cross-validate the respondent and pregnancy files by comparing pregnum for each respondent with the number of records in the pregnancy file.

You can use nsfg.MakePregMap to make a dictionary that maps from each caseid to a list of indices into the pregnancy DataFrame.
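
As an alternative to building the map by hand, the cross-validation can be sketched with `groupby`. The tables below are made-up stand-ins for the respondent and pregnancy files; the approach, not the numbers, is the point.

```python
import pandas as pd

# Made-up stand-ins for the respondent and pregnancy tables
resp = pd.DataFrame({'caseid': [1, 2, 3], 'pregnum': [2, 1, 0]})
preg = pd.DataFrame({'caseid': [1, 1, 2]})

# Number of pregnancy records per respondent (0 if she has none)
records = preg.groupby('caseid').size()
counted = resp['caseid'].map(records).fillna(0).astype(int)

# Respondents whose pregnum disagrees with the pregnancy file
mismatches = resp[counted != resp['pregnum']]
print(len(mismatches))  # 0
```

An empty `mismatches` table means the two files agree for every respondent.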

## Extra exercise

The best way to learn about statistics is to work on a project you are interested in. Is there a question like, “Do first babies arrive late,” that you want to investigate?

Think about questions you find personally interesting, or items of conventional wisdom, or controversial topics, or questions that have political consequences, and see if you can formulate a question that lends itself to statistical inquiry.

Look for data to help you address the question. Governments are good sources because data from public research is often freely available. Good places to start include http://www.data.gov/, and http://www.science.gov/, and in the United Kingdom, http://data.gov.uk/.

Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/, and the European Social Survey at http://www.europeansocialsurvey.org/.

You can also use typical data science datasets from places like https://www.kdnuggets.com/datasets/index.html or https://www.kaggle.com/datasets.

If it seems like someone has already answered your question, look closely to see whether the answer is justified. There might be flaws in the data or the analysis that make the conclusion unreliable. In that case you could perform a different analysis of the same data, or look for a better source of data.

If you find a published paper that addresses your question, you should be able to get the raw data. Many authors make their data available on the web, but for sensitive data you might have to write to the authors, provide information about how you plan to use the data, or agree to certain terms of use. Be persistent!