# Introduction

This project was part of the DataQuest Data Scientist specialization course. In it, we define functions to perform some simple analysis of U.S. births from 1994 to 2003. The dataset was compiled by FiveThirtyEight and is available in their GitHub repository.

## The Data Set

In [9]:
with open("US_births_1994-2003_CDC_NCHS.csv", "r") as f:
    data = f.read()
rows = data.split("\n")
rows[0:10]

['year,month,date_of_month,day_of_week,births',
 '1994,1,1,6,8096',
 '1994,1,2,7,7772',
 '1994,1,3,1,10142',
 '1994,1,4,2,11248',
 '1994,1,5,3,11053',
 '1994,1,6,4,11406',
 '1994,1,7,5,11251',
 '1994,1,8,6,8653',
 '1994,1,9,7,7910']

## Converting the data to a list of lists

The data is currently a list of strings, which is not a convenient format for analysis. We will define a function that reads in the data and converts it into a list of lists of integers for easier analysis.
Note: we will also omit the header row.

In [17]:
def read_csv(filename):
    # Read the whole file; the with block closes it automatically
    with open(filename, "r") as f:
        rows = f.read().split("\n")
    final_list = []
    # Skip the header row (rows[0])
    for row in rows[1:]:
        if not row:
            continue  # ignore blank lines, e.g. from a trailing newline
        int_fields = [int(s) for s in row.split(",")]
        final_list.append(int_fields)
    return final_list

In [18]:
cdc_list = read_csv('US_births_1994-2003_CDC_NCHS.csv')

In [19]:
cdc_list[0:10]

[[1994, 1, 1, 6, 8096],
 [1994, 1, 2, 7, 7772],
 [1994, 1, 3, 1, 10142],
 [1994, 1, 4, 2, 11248],
 [1994, 1, 5, 3, 11053],
 [1994, 1, 6, 4, 11406],
 [1994, 1, 7, 5, 11251],
 [1994, 1, 8, 6, 8653],
 [1994, 1, 9, 7, 7910],
 [1994, 1, 10, 1, 10498]]
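
As an alternative to splitting strings by hand, the same conversion could be sketched with Python's standard `csv` module. The function name `read_csv_rows` is hypothetical and not part of the original project; the sketch assumes every field after the header row is an integer, as in this dataset.

```python
import csv

def read_csv_rows(path):
    """Return the data rows of a CSV file as lists of ints, skipping the header."""
    # newline="" is the mode the csv module documentation recommends for reading
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        # `if row` drops completely empty lines, which csv.reader yields as []
        return [[int(field) for field in row] for row in reader if row]
```

Calling `read_csv_rows('US_births_1994-2003_CDC_NCHS.csv')` should produce the same kind of list of lists as `read_csv` above.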

## Calculating Number of Births each Month

We now define a function to calculate the total number of births in each month, across the entire dataset. The function stores the results in a dictionary keyed by month number.

In [21]:
def month_births(lst_of_lst):
    births_per_month = {}
    for lst in lst_of_lst:
        month = lst[1]
        births = lst[4]
        if month in births_per_month:
            births_per_month[month] += births
        else:
            births_per_month[month] = births
    return births_per_month

In [22]:
cdc_month_births = month_births(cdc_list)

In [23]:
cdc_month_births

{1: 3232517,
 2: 3018140,
 3: 3322069,
 4: 3185314,
 5: 3350907,
 6: 3296530,
 7: 3498783,
 8: 3525858,
 9: 3439698,
 10: 3378814,
 11: 3171647,
 12: 3301860}

## Calculating Number of Births for each Day of Week

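Similarly, we now define a function to calculate the number of births for each day of the week. Days are coded 1 (Monday) through 7 (Sunday); for example, January 1, 1994, a Saturday, is coded 6.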

In [24]:
def dow_births(lst_of_lst):
    dow = {}
    for lst in lst_of_lst:
        day = lst[3]
        births = lst[4]
        if day in dow:
            dow[day] += births
        else:
            dow[day] = births
    return dow

In [25]:
cdc_day_births = dow_births(cdc_list)

In [26]:
cdc_day_births

{6: 4562111,
 7: 4079723,
 1: 5789166,
 2: 6446196,
 3: 6322855,
 4: 6288429,
 5: 6233657}

## Creating a more Versatile function

The two functions above differ only in the column they group by. We therefore define a more general function that takes the column index as a parameter, so it can total births by year, month, day of month, or day of week across the entire dataset.

In [27]:
def calc_counts(data, column):
    total = {}
    for d in data:
        col = d[column]
        births = d[4]
        if col in total:
            total[col] += births
        else:
            total[col] = births
    return total

In [29]:
cdc_year_births = calc_counts(cdc_list, 0)
cdc_month_births = calc_counts(cdc_list, 1)
cdc_dom_births = calc_counts(cdc_list, 2)
cdc_dow_births = calc_counts(cdc_list, 3)

In [30]:
cdc_year_births

{1994: 3952767,
 1995: 3899589,
 1996: 3891494,
 1997: 3880894,
 1998: 3941553,
 1999: 3959417,
 2000: 4058814,
 2001: 4025933,
 2002: 4021726,
 2003: 4089950}

In [31]:
cdc_month_births

{1: 3232517,
 2: 3018140,
 3: 3322069,
 4: 3185314,
 5: 3350907,
 6: 3296530,
 7: 3498783,
 8: 3525858,
 9: 3439698,
 10: 3378814,
 11: 3171647,
 12: 3301860}

In [32]:
cdc_dom_births

{1: 1276557,
 2: 1288739,
 3: 1304499,
 4: 1288154,
 5: 1299953,
 6: 1304474,
 7: 1310459,
 8: 1312297,
 9: 1303292,
 10: 1320764,
 11: 1314361,
 12: 1318437,
 13: 1277684,
 14: 1320153,
 15: 1319171,
 16: 1315192,
 17: 1324953,
 18: 1326855,
 19: 1318727,
 20: 1324821,
 21: 1322897,
 22: 1317381,
 23: 1293290,
 24: 1288083,
 25: 1272116,
 26: 1284796,
 27: 1294395,
 28: 1307685,
 29: 1223161,
 30: 1202095,
 31: 746696}

In [33]:
cdc_dow_births

{6: 4562111,
 7: 4079723,
 1: 5789166,
 2: 6446196,
 3: 6322855,
 4: 6288429,
 5: 6233657}
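
The if/else accumulation pattern used in these functions can also be written with `collections.defaultdict`. The following is only a sketch of that alternative: the name `calc_counts_dd` and the `value_column` parameter are assumptions introduced here, and the sample rows are three rows taken from the data shown earlier.

```python
from collections import defaultdict

def calc_counts_dd(data, column, value_column=4):
    """Total the values in value_column, grouped by the values in column."""
    totals = defaultdict(int)  # missing keys start at 0
    for row in data:
        totals[row[column]] += row[value_column]
    return dict(totals)  # return a plain dict, like calc_counts

# Three rows from the dataset: two Saturdays (6) and one Sunday (7)
sample = [
    [1994, 1, 1, 6, 8096],
    [1994, 1, 2, 7, 7772],
    [1994, 1, 8, 6, 8653],
]
print(calc_counts_dd(sample, 3))  # {6: 16749, 7: 7772}
```

Making the births column a parameter with a default of 4 keeps the call the same as `calc_counts` while avoiding a hard-coded index inside the loop.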

## Function for Max and Min of a Dictionary

We define two helper functions that return the key with the largest (or smallest) value in a dictionary, together with that value.

In [40]:
def max_dict(dic):
    max_key = max(dic.keys(), key=(lambda x: dic[x]))
    return max_key, dic[max_key]

In [41]:
max_dict(cdc_dow_births)

(2, 6446196)

In [42]:
def min_dict(dic):
    min_key = min(dic.keys(), key=(lambda x: dic[x]))
    return min_key, dic[min_key]

In [43]:
min_dict(cdc_dow_births)

(7, 4079723)
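
Since the two helpers differ only in calling `max` or `min`, they could be combined. The sketch below uses a hypothetical function `dict_extremes` (not part of the original project) and passes `dic.get` as the key function instead of a lambda; the dictionary literal repeats the day-of-week totals shown above.

```python
def dict_extremes(dic):
    """Return ((max_key, max_value), (min_key, min_value)) for a dictionary."""
    max_key = max(dic, key=dic.get)  # iterating a dict yields its keys
    min_key = min(dic, key=dic.get)
    return (max_key, dic[max_key]), (min_key, dic[min_key])

# Day-of-week totals copied from the output above
cdc_dow_births = {
    6: 4562111, 7: 4079723, 1: 5789166, 2: 6446196,
    3: 6322855, 4: 6288429, 5: 6233657,
}
print(dict_extremes(cdc_dow_births))  # ((2, 6446196), (7, 4079723))
```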

## Function to find trends

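Now we define a function that totals the births in each year for the rows where a chosen column equals a chosen value (for example, month equal to 1 for January). It then calculates the difference between each year's total and the previous year's total to show whether the number of births is increasing or decreasing. The first year has no predecessor, so its difference is set to 0.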

In [44]:
def trends(data, index, val):
    diffs_dict = {}
    temp_dict = dict()
    
    # Total births per year for rows where column `index` equals `val`
    for row in data:
        if row[index] == val:
            if row[0] in temp_dict:
                temp_dict[row[0]] = temp_dict[row[0]] + row[4]
            else:
                temp_dict[row[0]] = row[4]
    
    # Difference from the previous year; 0 when there is no previous year
    for year in temp_dict:
        if year - 1 not in temp_dict:
            diffs_dict[year] = 0
        else:
            diffs_dict[year] = temp_dict[year] - temp_dict[year - 1]
    return diffs_dict

In [46]:
jan_trend = trends(cdc_list, 1, 1)
jan_trend

{1994: 0,
 1995: -4692,
 1996: -1730,
 1997: 2928,
 1998: 2129,
 1999: -158,
 2000: 10926,
 2001: 5090,
 2002: -4524,
 2003: -871}
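
The differencing step can also be applied on its own to any dictionary of yearly totals. The following is a minimal sketch with a hypothetical function `year_over_year`; it sorts the years and pairs each with its predecessor using `zip`. Unlike `trends`, it compares each year with the previous year present in the dictionary, even if years are missing in between. The sample values are the first four yearly totals from `cdc_year_births` above.

```python
def year_over_year(totals):
    """Map each year to its change from the previous available year."""
    years = sorted(totals)
    # The earliest year has nothing to compare against, so it gets 0
    diffs = {years[0]: 0} if years else {}
    for prev, cur in zip(years, years[1:]):
        diffs[cur] = totals[cur] - totals[prev]
    return diffs

# First four yearly totals copied from cdc_year_births
sample = {1994: 3952767, 1995: 3899589, 1996: 3891494, 1997: 3880894}
print(year_over_year(sample))
# {1994: 0, 1995: -53178, 1996: -8095, 1997: -10600}
```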