# V2: Clean and Validate:

## Selecting columns:

    pounds = nsfg['birthwgt_lb1']
    ounces = nsfg['birthwgt_oz1']
    
## Validation:

One part of validation is confirming that we are interpreting the data correctly. 

### 1. value_counts():
We can use value_counts() to see what values appear in pounds and how many times each value appears. 
 
 * By default, the results are sorted with the most frequent value first, so I use sort_index() to sort them by value instead, with the lightest babies first and the heaviest last.
        
            pounds.value_counts().sort_index()
            
 * The values 98 and 99 are special codes that indicate missing data.
 
 
 * We can validate the results by comparing them to the codebook, which lists the values and their frequencies. 
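
The sorting behaviour described above can be sketched on a small made-up Series. Here pounds_demo and its values are hypothetical stand-ins for nsfg['birthwgt_lb1'], not the real data.

```python
import pandas as pd

# Hypothetical toy data standing in for nsfg['birthwgt_lb1'];
# 98 and 99 play the role of the special codes.
pounds_demo = pd.Series([7, 6, 8, 7, 99, 5, 7, 98, 6])

# Default order: most frequent value first.
by_frequency = pounds_demo.value_counts()

# Sorted by value: lightest first, special codes at the end.
by_value = pounds_demo.value_counts().sort_index()

print(by_value)
```

Sorting by value puts the special codes together at the bottom, which makes them easy to spot when comparing against the codebook.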
       
### 2. describe():
Another way to validate the data is with describe().

* It computes summary statistics such as the mean, standard deviation, min, and max.

            pounds.describe()
           
* The mean reported by describe() is 8.09, which does not represent the data well because it includes the special values 98 and 99. We need to replace them before computing statistics.

#### Replace():
        
            import numpy as np

            # Replace the special codes with NaN, then recompute the mean
            pounds = pounds.replace([98, 99], np.nan)
            
            pounds.mean()
            
   * The mean is now 6.70, because mean() skips NaN and the values 98 and 99 are no longer included.
   
   * Instead of making a new Series, we can call replace() with inplace=True, which modifies the existing Series 'in place', that is, without making a copy.
   
           ounces.replace([98, 99], np.nan, inplace=True)
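
As a rough illustration of why the special codes distort the mean, the sketch below compares the mean before and after replacing 98 and 99 with NaN. The Series pounds_demo is made up for this example and is not the NSFG data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 98 and 99 stand for the special codes.
pounds_demo = pd.Series([7, 6, 8, 7, 99, 5, 7, 98, 6])

# Mean with the special codes still included.
mean_with_codes = pounds_demo.mean()

# Replace the codes with NaN; replace() returns a new Series here.
cleaned = pounds_demo.replace([98, 99], np.nan)

# mean() skips NaN by default, so only the 7 real values are averaged.
mean_cleaned = cleaned.mean()

print(mean_with_codes, mean_cleaned)
```

Just two coded values out of nine are enough to pull the mean far away from any plausible birth weight.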
   
### Arithmetic with Series:

* Now, we want to combine pounds and ounces into a single Series that contains total birth weight.
* Arithmetic operators work with Series objects; so, to convert from ounces to pounds, we can divide by 16. 
 
     * 1 pound = 16 ounces
     
            birth_weight = pounds + ounces/16.0
            
            birth_weight.describe()
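
The element-wise arithmetic can be sketched on made-up values. The names pounds_demo and ounces_demo are hypothetical. Note that a missing value in either Series makes the combined weight missing too.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, already cleaned so that missing values are NaN.
pounds_demo = pd.Series([7.0, 6.0, np.nan, 8.0])
ounces_demo = pd.Series([8.0, 0.0, 4.0, np.nan])

# Element-wise arithmetic: convert ounces to pounds and add.
birth_weight_demo = pounds_demo + ounces_demo / 16.0

print(birth_weight_demo)

# describe() only counts the rows where both parts were present.
print(birth_weight_demo.describe())
```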
            

## Example 1: Validate a variable

In the NSFG dataset, the variable 'outcome' encodes the outcome of each pregnancy as shown below:

    value	label
        1	Live birth
        2	Induced abortion
        3	Stillbirth
        4	Miscarriage
        5	Ectopic pregnancy
        6	Current pregnancy
    
The nsfg DataFrame has been pre-loaded for you. Explore it in the IPython Shell and use the methods Allen showed you in the video to answer the following question: How many pregnancies in this dataset ended with a live birth?



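    import pandas as pd

    # Load the NSFG data
    nsfg = pd.read_hdf('nsfg.hdf5')

    # nsfg.columns

    # How many pregnancies ended with a live birth?
    nsfg['outcome'].value_counts()

Output:

    1    6489
    4    1469
    2     947
    6     249
    5     118
    3      86
    Name: outcome, dtype: int64

The value 1 (live birth) appears 6489 times, so 6489 pregnancies in this dataset ended with a live birth.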
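
Instead of reading the answer off the printed table, the count for a single code can be pulled out programmatically. This is a minimal sketch on a made-up column, where outcome_demo is a hypothetical stand-in for nsfg['outcome'].

```python
import pandas as pd

# Hypothetical toy column; 1 = live birth, as in the table above.
outcome_demo = pd.Series([1, 4, 1, 2, 1, 6, 4], name='outcome')

# Option 1: look up the code in the value_counts() result.
counts = outcome_demo.value_counts()
live_births = counts.loc[1]

# Option 2: count the rows where the comparison is True.
live_births_alt = (outcome_demo == 1).sum()

print(live_births, live_births_alt)
```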

## Example 2: Clean a variable
In the NSFG dataset, the variable 'nbrnaliv' records the number of babies born alive at the end of a pregnancy.

If you use .value_counts() to view the responses, you'll see that the value 8 appears once, and if you consult the codebook, you'll see that this value indicates that the respondent refused to answer the question.

Your job in this exercise is to replace this value with np.nan. Recall from the video how Allen replaced the values 98 and 99 in the ounces column using the .replace() method:

    ounces.replace([98, 99], np.nan, inplace=True)
    
### Steps: 
1. In the 'nbrnaliv' column, replace the value 8, in place, with the special value NaN.
2. Confirm that the value 8 no longer appears in this column by printing the values and their frequencies.



    import numpy as np

    print(nsfg.columns)

    # Replace the value 8 with NaN, in place
    nsfg['nbrnaliv'].replace(8, np.nan, inplace=True)

    # Confirm that 8 no longer appears
    print(nsfg['nbrnaliv'].value_counts())

Output:

    Index(['caseid', 'outcome', 'birthwgt_lb1', 'birthwgt_oz1', 'prglngth',
           'nbrnaliv', 'agecon', 'agepreg', 'hpagelb', 'wgt2013_2015'],
          dtype='object')
    1.0    6379
    2.0     100
    3.0       5
    Name: nbrnaliv, dtype: int64
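
One caveat: calling replace() with inplace=True on a column selected as nsfg['nbrnaliv'] is a chained operation. Recent pandas releases may warn about it, and with Copy-on-Write enabled it may leave the DataFrame unchanged. Assigning the result back to the column is a common alternative. The sketch below assumes a made-up DataFrame, df_demo, rather than the real nsfg data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame; 8 stands for "refused to answer".
df_demo = pd.DataFrame({'nbrnaliv': [1, 1, 2, 8, 1, 3]})

# Assign the cleaned column back instead of using inplace=True.
df_demo['nbrnaliv'] = df_demo['nbrnaliv'].replace(8, np.nan)

# value_counts() drops NaN by default, so 8 simply disappears.
counts = df_demo['nbrnaliv'].value_counts()

# dropna=False keeps the missing values visible in the tally.
counts_with_nan = df_demo['nbrnaliv'].value_counts(dropna=False)
n_missing = df_demo['nbrnaliv'].isna().sum()

print(counts_with_nan)
```

The column becomes float because NaN cannot be stored in an integer column, which is why the real output above shows 1.0, 2.0 and 3.0.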


## Example 3: Compute a variable
For each pregnancy in the NSFG dataset, the variable 'agecon' encodes the respondent's age at conception, and 'agepreg' the respondent's age at the end of the pregnancy.

Both variables are recorded as integers with two implicit decimal places, so the value 2575 means that the respondent's age was 25.75.

#### Part 1: Select 'agecon' and 'agepreg', divide them by 100, and assign them to the local variables agecon and agepreg.

#### Part 2: Compute the difference, which is an estimate of the duration of the pregnancy. Keep in mind that for each pregnancy, agepreg will be larger than agecon.

#### Part 3: Use .describe() to compute the mean duration and other summary statistics.

    # Select the columns and divide by 100
    agecon = nsfg['agecon'] / 100
    agepreg = nsfg['agepreg'] / 100

    print('Age at conception: \n', agecon)
    print('\n\n\nAge at end of pregnancy:\n', agepreg)

    # Compute the difference
    preg_length = agepreg - agecon

    print('\n\nPregnancy length:\n', preg_length)

    # Compute summary statistics
    print('\n\n\n', preg_length.describe())

Output:

    Age at conception: 
     0       20.00
    1       22.91
    2       32.41
    3       36.50
    4       21.91
            ...  
    9353    17.58
    9354    17.41
    9355    20.91
    9356    34.50
    9357    36.83
    Name: agecon, Length: 9358, dtype: float64



    Age at end of pregnancy:
     0       20.75
    1       23.58
    2       33.08
    3         NaN
    4       22.66
            ...  
    9353    18.25
    9354    18.16
    9355    21.58
    9356    35.25
    9357    37.58
    Name: agepreg, Length: 9358, dtype: float64


    Pregnancy length:
     0       0.75
    1       0.67
    2       0.67
    3        NaN
    4       0.75
            ... 
    9353    0.67
    9354    0.75
    9355    0.67
    9356    0.75
    9357    0.75
    Length: 9358, dtype: float64



     count    9109.000000
    mean        0.552069
    std         0.271479
    min         0.000000
    25%         0.250000
    50%         0.670000
    75%         0.750000
    max         0.920000
    dtype: float64
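
The count reported by describe() (9109) is smaller than the Series length (9358) because rows with a missing age, such as row 3, produce NaN in the difference and are excluded from the statistics. A minimal sketch of that effect follows. It uses a made-up DataFrame, df_demo, with the same integer encoding, and the factor of 52 weeks per year is an approximation introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data using the same "two implicit decimals" encoding.
df_demo = pd.DataFrame({
    'agecon': [2000, 2291, 3241, 3650],
    'agepreg': [2075, 2358, 3308, np.nan],
})

# Convert to years.
agecon_demo = df_demo['agecon'] / 100
agepreg_demo = df_demo['agepreg'] / 100

# NaN in either operand propagates to the difference.
preg_length_demo = agepreg_demo - agecon_demo

n_total = len(preg_length_demo)            # all rows, including NaN
n_valid = preg_length_demo.count()         # non-missing rows only
n_missing = preg_length_demo.isna().sum()  # rows lost to missing data

# Rough conversion from years to weeks (assumes 52 weeks per year).
preg_length_weeks = preg_length_demo * 52

print(n_total, n_valid, n_missing)
```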
