# Exploring US Births

This Jupyter notebook is based on a guided project from [DataQuest](https://www.dataquest.io), a data analytics tutorial site.

The data comes from a [FiveThirtyEight](https://fivethirtyeight.com/) analysis, [Some People Are Too Superstitious to Have a Baby on Friday the 13th](https://github.com/fivethirtyeight/data/tree/master/births).

## Scope

In this project, I'll analyze CDC and SSA data to determine the frequency of births by

- year,
- month,
- day of the month,
- and day of the week.


# Making a list...

I'll start by reading both datasets into lists, and then creating a list of those lists:

In [5]:
import csv
import requests

CDC_CSV_URL_1 = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv'
SSA_CSV_URL_2 = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv'

seeEssVees = [CDC_CSV_URL_1, SSA_CSV_URL_2]
listOfLists = []

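# Reuse a single HTTP session for both downloads.
with requests.Session() as s:
    for each in seeEssVees:
        download = s.get(each)
        download.raise_for_status()  # stop early if a download fails
        decoded_content = download.content.decode('utf-8')
        # Parse the CSV text into a list of rows (each row is a list of strings).
        data = csv.reader(decoded_content.splitlines(), delimiter=',')
        listOfLists.append(list(data))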
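
The parsing step can also be tried without a network connection. The sketch below is only an illustration: it feeds `csv.reader` a short in-memory string built from a few CDC rows shown in this notebook. The names `sample_text` and `parse_csv_text` are hypothetical and not part of the original project.

```python
import csv

# A few CDC rows copied from this notebook's output (not the full file).
sample_text = """year,month,date_of_month,day_of_week,births
1994,1,1,6,8096
1994,1,2,7,7772
1994,1,3,1,10142
"""


def parse_csv_text(text):
    """Return CSV text as a list of rows, each row a list of strings."""
    return list(csv.reader(text.splitlines(), delimiter=','))


rows = parse_csv_text(sample_text)
header, records = rows[0], rows[1:]
print(header)
print(records[0])
```

Every field comes back as a string, which is why the later cells convert values with `int()` before comparing or summing them.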
        

# Checking it twice...

The first element of each list should be a header row describing the datapoints that follow:

In [6]:
listOfLists[0][0:5]

[['year', 'month', 'date_of_month', 'day_of_week', 'births'],
 ['1994', '1', '1', '6', '8096'],
 ['1994', '1', '2', '7', '7772'],
 ['1994', '1', '3', '1', '10142'],
 ['1994', '1', '4', '2', '11248']]

In [7]:
listOfLists[1][0:5]

[['year', 'month', 'date_of_month', 'day_of_week', 'births'],
 ['2000', '1', '1', '6', '9083'],
 ['2000', '1', '2', '7', '8006'],
 ['2000', '1', '3', '1', '11363'],
 ['2000', '1', '4', '2', '13032']]
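
The visual check above could also be made explicit in code. This is a minimal sketch, assuming the header shown in the output is the one expected for both files. `EXPECTED_HEADER` and `has_expected_header` are hypothetical names, and the two sample lists stand in for `listOfLists[0]` and `listOfLists[1]`.

```python
EXPECTED_HEADER = ['year', 'month', 'date_of_month', 'day_of_week', 'births']


def has_expected_header(rows, expected=EXPECTED_HEADER):
    """Return True if the first row of `rows` matches the expected header."""
    return bool(rows) and rows[0] == expected


# Stand-ins for the two downloaded datasets, using rows shown above.
cdc_sample = [list(EXPECTED_HEADER), ['1994', '1', '1', '6', '8096']]
ssa_sample = [list(EXPECTED_HEADER), ['2000', '1', '1', '6', '9083']]

print(all(has_expected_header(d) for d in (cdc_sample, ssa_sample)))
```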

# Benchmarking...

Both datasets contain datapoints for the years 2000-2003. If each dataset has one row per day, the number of CDC and SSA records for those years should match.

The code below checks whether this is the case:

In [59]:
checkListCDC = []
checkListSSA = []

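# Keep only the CDC rows whose year falls in the overlap (2000-2003).
for each in listOfLists[0]:
    try:
        year = int(each[0])
    except (ValueError, IndexError):
        continue  # skip the header row (and any blank row)
    if 2000 <= year < 2004:
        checkListCDC.append(each)

# Do the same for the SSA rows.
for each in listOfLists[1]:
    try:
        year = int(each[0])
    except (ValueError, IndexError):
        continue  # skip the header row (and any blank row)
    if 2000 <= year < 2004:
        checkListSSA.append(each)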


print("It's",(len(checkListSSA) == len(checkListCDC)),"that both lists have the same number of elements:")
print("The CDC data has",len(checkListCDC), "observations for the years 2000-2003.")
print("The SSA data has",len(checkListSSA), "observations for the years 2000-2003.")
    

It's True that both lists have the same number of elements:
The CDC data has 1461 observations for the years 2000-2003.
The SSA data has 1461 observations for the years 2000-2003.
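
The figure of 1461 is what one would expect if each dataset holds exactly one row per calendar day, which is an assumption here rather than something the files state. Under that assumption, the sketch below counts the days from 2000 through 2003, where 2000 is a leap year.

```python
import calendar
from datetime import date

# Days from 1 January 2000 up to (but not including) 1 January 2004.
days = (date(2004, 1, 1) - date(2000, 1, 1)).days

# The same total, broken down by year.
per_year = {y: 366 if calendar.isleap(y) else 365 for y in range(2000, 2004)}

print(days)
print(per_year)
```

Both approaches give 1461 (366 + 365 + 365 + 365), matching the record counts above.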


# More Benchmarking...

Both datasets contain the same *number* of datapoints for the years 2000-2003. But do those datapoints record the same birth counts?

The code below compares the first five observations:

In [60]:
checkListCDC[0:5]

[['2000', '1', '1', '6', '8843'],
 ['2000', '1', '2', '7', '7816'],
 ['2000', '1', '3', '1', '11123'],
 ['2000', '1', '4', '2', '12703'],
 ['2000', '1', '5', '3', '12240']]

In [62]:
checkListSSA[0:5]

[['2000', '1', '1', '6', '9083'],
 ['2000', '1', '2', '7', '8006'],
 ['2000', '1', '3', '1', '11363'],
 ['2000', '1', '4', '2', '13032'],
 ['2000', '1', '5', '3', '12558']]
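
To put a number on the gap, the two lists can be paired by date and subtracted. The sketch below is one possible way to do that, using only the five rows shown above. `birth_differences` is a hypothetical helper, not part of the original project.

```python
# The first five rows of each filtered list, copied from the output above.
cdc_rows = [['2000', '1', '1', '6', '8843'],
            ['2000', '1', '2', '7', '7816'],
            ['2000', '1', '3', '1', '11123'],
            ['2000', '1', '4', '2', '12703'],
            ['2000', '1', '5', '3', '12240']]
ssa_rows = [['2000', '1', '1', '6', '9083'],
            ['2000', '1', '2', '7', '8006'],
            ['2000', '1', '3', '1', '11363'],
            ['2000', '1', '4', '2', '13032'],
            ['2000', '1', '5', '3', '12558']]


def birth_differences(cdc, ssa):
    """Pair rows by (year, month, day) and return SSA minus CDC births."""
    cdc_by_date = {tuple(row[:3]): int(row[4]) for row in cdc}
    diffs = {}
    for row in ssa:
        key = tuple(row[:3])
        if key in cdc_by_date:
            diffs[key] = int(row[4]) - cdc_by_date[key]
    return diffs


diffs = birth_differences(cdc_rows, ssa_rows)
for key, diff in diffs.items():
    print(key, diff)
```

In these five rows the SSA count is higher than the CDC count on every day, by between 190 and 329 births.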

# Counting Function

The births recorded by each dataset differ, so they should be analyzed separately.

Rather than write separate code for each dataset, the code below defines a function that sums the number of births by year, month, day of the month, or day of the week, depending on the column it is given.

In [80]:
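def calc_counts(data, column):
    # Sum births (column 4) for each distinct value found in `column`.
    totalBirths = {}
    for each in data:
        try:
            period = int(each[column])
            births = int(each[4])
        except (ValueError, IndexError):
            continue  # skip the header row and any malformed row
        totalBirths[period] = totalBirths.get(period, 0) + births
    return totalBirths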

for each1 in listOfLists:
    for i in range(4):
        print(calc_counts(each1,i))
        

{1994: 3952767, 1995: 3899589, 1996: 3891494, 1997: 3880894, 1998: 3941553, 1999: 3959417, 2000: 4058814, 2001: 4025933, 2002: 4021726, 2003: 4089950}
{1: 3232517, 2: 3018140, 3: 3322069, 4: 3185314, 5: 3350907, 6: 3296530, 7: 3498783, 8: 3525858, 9: 3439698, 10: 3378814, 11: 3171647, 12: 3301860}
{1: 1276557, 2: 1288739, 3: 1304499, 4: 1288154, 5: 1299953, 6: 1304474, 7: 1310459, 8: 1312297, 9: 1303292, 10: 1320764, 11: 1314361, 12: 1318437, 13: 1277684, 14: 1320153, 15: 1319171, 16: 1315192, 17: 1324953, 18: 1326855, 19: 1318727, 20: 1324821, 21: 1322897, 22: 1317381, 23: 1293290, 24: 1288083, 25: 1272116, 26: 1284796, 27: 1294395, 28: 1307685, 29: 1223161, 30: 1202095, 31: 746696}
{6: 4562111, 7: 4079723, 1: 5789166, 2: 6446196, 3: 6322855, 4: 6288429, 5: 6233657}
{2000: 4149598, 2001: 4110963, 2002: 4099313, 2003: 4163060, 2004: 4186863, 2005: 4211941, 2006: 4335154, 2007: 4380784, 2008: 4310737, 2009: 4190991, 2010: 4055975, 2011: 4006908, 2012: 4000868, 2013: 3973337, 2014: 40105