## Birth Dates In The United States

This analysis uses the raw data behind the story **Some People Are Too Superstitious To Have A Baby On Friday The 13th**, which you can read [here](http://fivethirtyeight.com/features/some-people-are-too-superstitious-to-have-a-baby-on-friday-the-13th/).

We'll be working with the dataset from the Centers for Disease Control and Prevention's National Center for Health Statistics. The dataset has the following structure:

- `year` - Year
- `month` - Month
- `date_of_month` - Day number of the month
- `day_of_week` - Day of week, where 1 is Monday and 7 is Sunday
- `births` - Number of births that day
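
The column layout above maps directly onto positions in each comma-separated row. Here is a minimal sketch, assuming the file's header follows the order shown in the list. The `COLUMNS` tuple and the `row_to_dict` helper are hypothetical names introduced for illustration only.

```python
# Column names in the order listed above (assumed to match the CSV header)
COLUMNS = ("year", "month", "date_of_month", "day_of_week", "births")

def row_to_dict(line):
    """Turn one comma-separated data line into a dict keyed by column name."""
    values = [int(field) for field in line.split(",")]
    return dict(zip(COLUMNS, values))

# A sample line in the dataset's format; day_of_week 6 means Saturday
row_to_dict("1994,1,1,6,8096")
```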

We will explore the dataset by counting births per year, month, day of the month, and day of the week.

### Preparing The Data

In [1]:
# Open the file with a context manager so the handle is closed automatically
with open("US_births_1994-2003_CDC_NCHS.csv", "r") as fh:
    # Read the whole file into one string. For files too large to fit in
    # memory, iterating over the file line by line is the better approach.
    text = fh.read()

# Split the text into rows on the newline character
words = text.split("\n")

# Display the first 10 rows (the header plus nine data rows)
words[:10]

['year,month,date_of_month,day_of_week,births',
 '1994,1,1,6,8096',
 '1994,1,2,7,7772',
 '1994,1,3,1,10142',
 '1994,1,4,2,11248',
 '1994,1,5,3,11053',
 '1994,1,6,4,11406',
 '1994,1,7,5,11251',
 '1994,1,8,6,8653',
 '1994,1,9,7,7910']
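
As the comment in the cell above notes, reading the whole file at once is not ideal for very large files. Below is a minimal sketch of the line-by-line alternative. `head` is a hypothetical helper name, not part of the original notebook.

```python
from itertools import islice

def head(filename, n=10):
    """Return the first n lines of a text file without loading the rest."""
    with open(filename, "r") as fh:
        # A file object yields one line at a time; islice stops after n lines
        return [line.rstrip("\n") for line in islice(fh, n)]
```

Calling `head("US_births_1994-2003_CDC_NCHS.csv")` should give the same ten rows as `words[:10]`, while reading only the start of the file.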

### Function To Convert The Data Into A List Of Lists

In [2]:
def read_csv(filename):
    """
    filename -> [list of lists of each row]
    
    This function takes in a filename and returns each data row
    (the header is skipped) as a list of integers.
    """
    # Close the file automatically once it has been read
    with open(filename, "r") as fh:
        string_data = fh.read().split("\n")
    string_list = string_data[1:]
    final_list = []
    for row in string_list:
        # Skip blank lines, such as the one left by a trailing newline
        if not row:
            continue
        int_fields = []
        string_fields = row.split(",")
        for val in string_fields:
            int_fields.append(int(val))
        final_list.append(int_fields)
    return final_list

cdc_list = read_csv("US_births_1994-2003_CDC_NCHS.csv")
cdc_list[:10]

[[1994, 1, 1, 6, 8096],
 [1994, 1, 2, 7, 7772],
 [1994, 1, 3, 1, 10142],
 [1994, 1, 4, 2, 11248],
 [1994, 1, 5, 3, 11053],
 [1994, 1, 6, 4, 11406],
 [1994, 1, 7, 5, 11251],
 [1994, 1, 8, 6, 8653],
 [1994, 1, 9, 7, 7910],
 [1994, 1, 10, 1, 10498]]
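
The same conversion can also be written with Python's standard `csv` module, which handles the splitting for us. This is a sketch of an alternative, not the notebook's own approach, and `read_csv_with_module` is a hypothetical name.

```python
import csv

def read_csv_with_module(filename):
    """Return the data rows of a CSV file as lists of ints, skipping the header."""
    with open(filename, "r", newline="") as fh:
        reader = csv.reader(fh)
        next(reader, None)  # discard the header row, if there is one
        # csv.reader yields an empty list for a blank line, so `if row` skips it
        return [[int(val) for val in row] for row in reader if row]
```

For a well-formed file this should return the same list of lists as `read_csv`.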

### Calculating Number Of Births Each Month

In [3]:
def month_births(data):
    """
    list of lists -> dictionary of counts for each month
    
    This function takes in list of lists and returns dictionary 
    of counts for each month.
    """
    births_per_month={}
    for row in data:
        months = row[1]
        births = row[4]
        births_per_month[months] = births_per_month.get(months,0) + births
    return births_per_month

cdc_month_births = month_births(cdc_list)

cdc_month_births  

{1: 3232517,
 2: 3018140,
 3: 3322069,
 4: 3185314,
 5: 3350907,
 6: 3296530,
 7: 3498783,
 8: 3525858,
 9: 3439698,
 10: 3378814,
 11: 3171647,
 12: 3301860}

### Calculating Number Of Births Each Day Of Week

In [4]:
def dow_births(data):
    """
    list of lists -> dictionary of counts for each day of the week
    
    This function takes in list of lists and returns dictionary 
    of counts for each day of the week (1 is Monday, 7 is Sunday).
    """
    births_per_dow = {}
    for row in data:
        week = row[3]
        births = row[4]
        births_per_dow[week] = births_per_dow.get(week,0) + births
    return births_per_dow

cdc_day_births = dow_births(cdc_list)

cdc_day_births

{1: 5789166,
 2: 6446196,
 3: 6322855,
 4: 6288429,
 5: 6233657,
 6: 4562111,
 7: 4079723}
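
The totals above are easier to compare once the extremes are picked out. Here is a small sketch using the built-in `min` and `max` with the dictionary's `get` method as the key function. `extremes` is a hypothetical helper name.

```python
def extremes(counts):
    """Return (key with the smallest total, key with the largest total)."""
    return min(counts, key=counts.get), max(counts, key=counts.get)

# Day-of-week totals copied from the output above
dow_totals = {1: 5789166, 2: 6446196, 3: 6322855, 4: 6288429,
              5: 6233657, 6: 4562111, 7: 4079723}

extremes(dow_totals)  # → (7, 2): fewest births on Sunday, most on Tuesday
```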

### Creating A More General Function

Writing a separate function for each column (year, month, day of the month, and day of the week) is cumbersome. Instead, we will write one generalized function that takes the column index as a parameter.

In [5]:
def calc_counts(data, column):
    """
    (data, column of the data) -> dictionary of counts for each 
                                  unique level in a column
    
    This function takes in list of lists (data) and a column index, and 
    returns a dictionary of births summed for each unique value in that column.
    """
    sum_dict= {}
    for row in data:
        col_value = row[column]
        births = row[4]
        sum_dict[col_value] = sum_dict.get(col_value,0) + births
    return sum_dict

cdc_year_births = calc_counts(cdc_list, 0)
cdc_month_births = calc_counts(cdc_list, 1)
cdc_dom_births = calc_counts(cdc_list, 2)
cdc_dow_births = calc_counts(cdc_list, 3)  
    

In [6]:
cdc_year_births

{1994: 3952767,
 1995: 3899589,
 1996: 3891494,
 1997: 3880894,
 1998: 3941553,
 1999: 3959417,
 2000: 4058814,
 2001: 4025933,
 2002: 4021726,
 2003: 4089950}

In [7]:
cdc_month_births

{1: 3232517,
 2: 3018140,
 3: 3322069,
 4: 3185314,
 5: 3350907,
 6: 3296530,
 7: 3498783,
 8: 3525858,
 9: 3439698,
 10: 3378814,
 11: 3171647,
 12: 3301860}

In [8]:
cdc_dom_births

{1: 1276557,
 2: 1288739,
 3: 1304499,
 4: 1288154,
 5: 1299953,
 6: 1304474,
 7: 1310459,
 8: 1312297,
 9: 1303292,
 10: 1320764,
 11: 1314361,
 12: 1318437,
 13: 1277684,
 14: 1320153,
 15: 1319171,
 16: 1315192,
 17: 1324953,
 18: 1326855,
 19: 1318727,
 20: 1324821,
 21: 1322897,
 22: 1317381,
 23: 1293290,
 24: 1288083,
 25: 1272116,
 26: 1284796,
 27: 1294395,
 28: 1307685,
 29: 1223161,
 30: 1202095,
 31: 746696}
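
Totals by day of the month are not directly comparable: day 31 occurs in only seven months of the year, so its total is bound to be lower. One way to account for this, sketched here under the assumption that each row represents one calendar day, is to divide each total by the number of rows that contributed to it. `calc_averages` is a hypothetical companion to `calc_counts`, and the sample rows are made up for illustration.

```python
def calc_averages(data, column, value_column=4):
    """Return {unique value in `column`: mean of `value_column` over its rows}."""
    totals = {}
    counts = {}
    for row in data:
        key = row[column]
        totals[key] = totals.get(key, 0) + row[value_column]
        counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}

# Made-up rows in the [year, month, date_of_month, day_of_week, births] layout
sample = [
    [2000, 1, 30, 7, 100],
    [2000, 1, 31, 1, 300],
    [2000, 4, 30, 7, 200],
]

calc_averages(sample, 2)  # → {30: 150.0, 31: 300.0}
```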

In [9]:
cdc_dow_births


{1: 5789166,
 2: 6446196,
 3: 6322855,
 4: 6288429,
 5: 6233657,
 6: 4562111,
 7: 4079723}
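
The same row layout makes it possible to look at the question from the original story directly. The sketch below is one possible way to compare Friday the 13th with other Fridays, assuming the column layout described at the top (`day_of_week` 5 is Friday). `friday_13th_means` is a hypothetical helper, and the birth counts in the sample rows are made up, not taken from the dataset.

```python
def friday_13th_means(data):
    """Return (mean births on Friday the 13th, mean births on all other Fridays)."""
    on_13th = [row[4] for row in data if row[3] == 5 and row[2] == 13]
    other = [row[4] for row in data if row[3] == 5 and row[2] != 13]
    # Guard against empty groups so the function never divides by zero
    mean_13th = sum(on_13th) / len(on_13th) if on_13th else None
    mean_other = sum(other) / len(other) if other else None
    return mean_13th, mean_other

# Made-up birth counts in the [year, month, date_of_month, day_of_week, births] layout
sample = [
    [1994, 5, 6, 5, 110],   # Friday the 6th
    [1994, 5, 13, 5, 100],  # Friday the 13th
    [1994, 5, 14, 6, 80],   # Saturday, ignored
    [1994, 5, 20, 5, 120],  # Friday the 20th
]

friday_13th_means(sample)  # → (100.0, 115.0)
```

Running `friday_13th_means(cdc_list)` would apply the same comparison to the full dataset.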