# Birth Dates In The United States

This analysis uses the raw data behind the FiveThirtyEight story [**Some People Are Too Superstitious To Have A Baby On Friday The 13th**](http://fivethirtyeight.com/features/some-people-are-too-superstitious-to-have-a-baby-on-friday-the-13th/).

The data set comes from the Centers for Disease Control and Prevention's National Center for Health Statistics.

The data set has the following structure:
- `year` - Year
- `month` - Month
- `date_of_month` - Day number of the month
- `day_of_week` - Day of week, where 1 is Monday and 7 is Sunday
- `births` - Number of births
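
To make the column layout concrete, here is a minimal sketch that turns one raw row into a dictionary keyed by the column names listed above. The `DAY_NAMES` table and the `describe_row` helper are hypothetical names introduced for illustration; they are not part of the data set or of the analysis below.

```python
# Hypothetical lookup table: the data set stores only the numeric code,
# where 1 is Monday and 7 is Sunday.
DAY_NAMES = {1: 'Monday', 2: 'Tuesday', 3: 'Wednesday', 4: 'Thursday',
             5: 'Friday', 6: 'Saturday', 7: 'Sunday'}

def describe_row(row):
    """Turn one raw CSV row (a string) into a dict keyed by column name."""
    year, month, date_of_month, day_of_week, births = (
        int(field) for field in row.split(',')
    )
    return {
        'year': year,
        'month': month,
        'date_of_month': date_of_month,
        'day_of_week': day_of_week,
        'day_name': DAY_NAMES[day_of_week],  # illustrative extra field
        'births': births,
    }

sample = describe_row('1994,1,1,6,8096')
print(sample['day_name'], sample['births'])  # Saturday 8096
```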

## Loading the data

In [8]:
# Read the whole file and split it into lines; `with` closes the file afterwards
with open('US_births_1994-2003_CDC_NCHS.csv', 'r') as f:
    data = f.read().split('\n')
data[:10]

['year,month,date_of_month,day_of_week,births',
 '1994,1,1,6,8096',
 '1994,1,2,7,7772',
 '1994,1,3,1,10142',
 '1994,1,4,2,11248',
 '1994,1,5,3,11053',
 '1994,1,6,4,11406',
 '1994,1,7,5,11251',
 '1994,1,8,6,8653',
 '1994,1,9,7,7910']

In [7]:
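def read_csv(file):
    # Read the whole file and split it into lines
    with open(file, 'r') as f:
        data = f.read().split('\n')
    string_list = data[1:]  # drop the header row
    final_list = []
    for row in string_list:
        if not row:
            continue  # skip blank lines, e.g. after a trailing newline
        string_fields = row.split(',')
        int_fields = [int(field) for field in string_fields]
        final_list.append(int_fields)
    return final_list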

cdc_list = read_csv('US_births_1994-2003_CDC_NCHS.csv')
cdc_list[:10]

[[1994, 1, 1, 6, 8096],
 [1994, 1, 2, 7, 7772],
 [1994, 1, 3, 1, 10142],
 [1994, 1, 4, 2, 11248],
 [1994, 1, 5, 3, 11053],
 [1994, 1, 6, 4, 11406],
 [1994, 1, 7, 5, 11251],
 [1994, 1, 8, 6, 8653],
 [1994, 1, 9, 7, 7910],
 [1994, 1, 10, 1, 10498]]
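
The manual line splitting above could also be handed to the standard library's `csv` module. The sketch below is one possible alternative, not the approach this notebook uses; `read_csv_rows` is a hypothetical name, and an in-memory `io.StringIO` sample stands in for the real file so the example runs on its own.

```python
import csv
import io

def read_csv_rows(handle):
    """Parse an open CSV file into a list of integer lists, skipping the header."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    # csv.reader yields an empty list for a blank line, so filter those out
    return [[int(field) for field in row] for row in reader if row]

# A two-row sample copied from the top of the data set
sample = io.StringIO(
    'year,month,date_of_month,day_of_week,births\n'
    '1994,1,1,6,8096\n'
    '1994,1,2,7,7772\n'
)
rows = read_csv_rows(sample)
print(rows)  # [[1994, 1, 1, 6, 8096], [1994, 1, 2, 7, 7772]]
```

With the real file, the same function would be called as `with open('US_births_1994-2003_CDC_NCHS.csv', newline='') as f: cdc_list = read_csv_rows(f)`.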

## Counting the total number of births in each month

In [9]:
def month_births(data):
    births_per_month = {}
    for item in data:
        month = item[1]
        births = item[-1]
        if month in births_per_month:
            births_per_month[month] += births
        else:
            births_per_month[month] = births
    return births_per_month

cdc_month_births = month_births(cdc_list)
cdc_month_births

{1: 3232517,
 2: 3018140,
 3: 3322069,
 4: 3185314,
 5: 3350907,
 6: 3296530,
 7: 3498783,
 8: 3525858,
 9: 3439698,
 10: 3378814,
 11: 3171647,
 12: 3301860}

## Counting the total number of births on each day of the week

In [12]:
def dow_births(data):
    births_per_day = {}
    for item in data:
        day = item[-2]
        births = item[-1]
        if day in births_per_day:
            births_per_day[day] += births
        else:
            births_per_day[day] = births
    return births_per_day

cdc_day_births = dow_births(cdc_list)
cdc_day_births

{1: 5789166,
 2: 6446196,
 3: 6322855,
 4: 6288429,
 5: 6233657,
 6: 4562111,
 7: 4079723}

## Generalizing the code

In [19]:
def calc_counts(data,column):
    births_per_column = {}
    for item in data:
        if item[column] in births_per_column:
            births_per_column[item[column]] += item[-1]
        else:
            births_per_column[item[column]] = item[-1]
    return births_per_column

In [21]:
cdc_year_births = calc_counts(cdc_list,0)
cdc_year_births

{1994: 3952767,
 1995: 3899589,
 1996: 3891494,
 1997: 3880894,
 1998: 3941553,
 1999: 3959417,
 2000: 4058814,
 2001: 4025933,
 2002: 4021726,
 2003: 4089950}
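
The if/else bookkeeping in `calc_counts` can be shortened with `collections.defaultdict`, which supplies a zero for any key seen for the first time. This is only a sketch of an equivalent formulation; `calc_counts_dd` is a hypothetical name, and the three sample rows are copied from the start of the data set.

```python
from collections import defaultdict

def calc_counts_dd(data, column):
    """Sum births (the last field) grouped by the value in `column`."""
    totals = defaultdict(int)  # missing keys start at 0
    for item in data:
        totals[item[column]] += item[-1]
    return dict(totals)  # return a plain dict, like calc_counts does

sample = [
    [1994, 1, 1, 6, 8096],
    [1994, 1, 2, 7, 7772],
    [1994, 1, 3, 1, 10142],
]
print(calc_counts_dd(sample, 0))  # {1994: 26010}
print(calc_counts_dd(sample, 3))  # {6: 8096, 7: 7772, 1: 10142}
```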

## Statistics

In [22]:
cdc_month_births = calc_counts(cdc_list,1)
cdc_month_births

{1: 3232517,
 2: 3018140,
 3: 3322069,
 4: 3185314,
 5: 3350907,
 6: 3296530,
 7: 3498783,
 8: 3525858,
 9: 3439698,
 10: 3378814,
 11: 3171647,
 12: 3301860}

In [24]:
cdc_dom_births = calc_counts(cdc_list,2)
cdc_dom_births

{1: 1276557,
 2: 1288739,
 3: 1304499,
 4: 1288154,
 5: 1299953,
 6: 1304474,
 7: 1310459,
 8: 1312297,
 9: 1303292,
 10: 1320764,
 11: 1314361,
 12: 1318437,
 13: 1277684,
 14: 1320153,
 15: 1319171,
 16: 1315192,
 17: 1324953,
 18: 1326855,
 19: 1318727,
 20: 1324821,
 21: 1322897,
 22: 1317381,
 23: 1293290,
 24: 1288083,
 25: 1272116,
 26: 1284796,
 27: 1294395,
 28: 1307685,
 29: 1223161,
 30: 1202095,
 31: 746696}

In [25]:
cdc_dow_births = calc_counts(cdc_list,3)
cdc_dow_births

{1: 5789166,
 2: 6446196,
 3: 6322855,
 4: 6288429,
 5: 6233657,
 6: 4562111,
 7: 4079723}
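
Once the totals are in a dictionary, simple summary statistics follow directly. As a sketch, the hypothetical helper `min_max_births` below returns the keys with the fewest and the most births; the `dow` dictionary is copied from the day-of-week output above.

```python
def min_max_births(counts):
    """Return ((key, births) for the minimum, (key, births) for the maximum)."""
    lowest = min(counts, key=counts.get)    # key with the smallest total
    highest = max(counts, key=counts.get)   # key with the largest total
    return (lowest, counts[lowest]), (highest, counts[highest])

# Day-of-week totals from the output above (1 is Monday, 7 is Sunday)
dow = {1: 5789166, 2: 6446196, 3: 6322855, 4: 6288429,
       5: 6233657, 6: 4562111, 7: 4079723}

low, high = min_max_births(dow)
print(low)   # (7, 4079723)
print(high)  # (2, 6446196)
```

In this data, Sunday has the fewest births and Tuesday the most.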

In [26]:
def calc_bpd_by_year(data, column_idx, column_val):
    # Total births per year, counting only rows where
    # the field at column_idx equals column_val
    bpd = {}
    for item in data:
        if item[column_idx] == column_val:
            year = item[0]
            births = item[-1]
            if year in bpd:
                bpd[year] += births
            else:
                bpd[year] = births
    return bpd

In [32]:
# Births on Saturday of each year:
cdc_sat_by_year = calc_bpd_by_year(cdc_list, -2, 6)
cdc_sat_by_year

{1994: 474732,
 1995: 459580,
 1996: 456261,
 1997: 450840,
 1998: 453776,
 1999: 449985,
 2000: 469794,
 2001: 453928,
 2002: 445770,
 2003: 447445}

In [36]:
births_last_year = None
# Iterate in chronological order; plain dict iteration order is not guaranteed here
for year in sorted(cdc_sat_by_year):
    births = cdc_sat_by_year[year]
    if births_last_year is not None:
        # Change relative to the previous year
        print(year-1, '-', year, ': ')
        print(births - births_last_year, '\n')
    births_last_year = births  # remember this year for the next comparison

1994 - 1995 : 
-15152 

1995 - 1996 : 
-3319 

1996 - 1997 : 
-5421 

1997 - 1998 : 
2936 

1998 - 1999 : 
-3791 

1999 - 2000 : 
19809 

2000 - 2001 : 
-15866 

2001 - 2002 : 
-8158 

2002 - 2003 : 
1675 



In [31]:
# Births on Monday of each year:
cdc_mon_by_year = calc_bpd_by_year(cdc_list, -2, 1)
cdc_mon_by_year

{1994: 568672,
 1995: 557396,
 1996: 569343,
 1997: 564782,
 1998: 571822,
 1999: 572958,
 2000: 585312,
 2001: 593186,
 2002: 595554,
 2003: 610141}
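
The same year-over-year comparison can be wrapped in a reusable function and applied to any of the per-year dictionaries. The sketch below uses a hypothetical helper, `yearly_changes`, and only the first three Monday totals from the output above, so the arithmetic is easy to check by hand.

```python
def yearly_changes(counts):
    """Map each year (except the first) to its change from the previous year."""
    years = sorted(counts)  # chronological order, whatever the dict order is
    return {
        year: counts[year] - counts[prev]
        for prev, year in zip(years, years[1:])
    }

# First three Monday totals from the output above
mon_sample = {1994: 568672, 1995: 557396, 1996: 569343}
print(yearly_changes(mon_sample))  # {1995: -11276, 1996: 11947}
```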