## Introduction to the Dataset

This dataset contains information about births in the United States from 1994 to 2003. It was compiled by FiveThirtyEight.

The dataset contains the following columns:

- year: Year (1994 to 2003).
- month: Month (1 to 12).
- date_of_month: Day number of the month (1 to 31).
- day_of_week: Day of week (1 to 7, where 1 is Monday and 7 is Sunday).
- births: Number of births that day.
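
The code in this project addresses columns by position rather than by name. As a minimal sketch of how those positions line up with the columns listed above (the constant names are my own, chosen only for illustration):

```python
# Column positions, following the order of the header row.
# These constant names are illustrative and are not used elsewhere in the notebook.
YEAR, MONTH, DATE_OF_MONTH, DAY_OF_WEEK, BIRTHS = range(5)

# One parsed row from the dataset: January 1, 1994.
row = [1994, 1, 1, 6, 8096]

print(row[DAY_OF_WEEK])  # day-of-week code for this row
print(row[BIRTHS])       # births recorded that day
```

So `row[4]` in the functions that follow always refers to the births column, and `row[0]` to the year.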

In this project, I will convert the data into a list of lists, write a general summary function that totals births by any column, write a function that reports the minimum and maximum values in a dictionary, and implement a function that calculates the year-to-year difference in births for a given value of a given column.


In [2]:
#read csv file into a string, split the string into lines, display first 10 values
#splitlines() handles "\r", "\n" and "\r\n" line endings alike
with open("US_births_1994-2003_CDC_NCHS.csv","r") as csv_file:
    f = csv_file.read().splitlines()
print(f[0:10])

['year,month,date_of_month,day_of_week,births', '1994,1,1,6,8096', '1994,1,2,7,7772', '1994,1,3,1,10142', '1994,1,4,2,11248', '1994,1,5,3,11053', '1994,1,6,4,11406', '1994,1,7,5,11251', '1994,1,8,6,8653', '1994,1,9,7,7910']


## Converting Data Into A List Of Lists

While a list of strings helps us get a general picture of the dataset, we need to convert it to a more structured format to be able to analyze it. Specifically, we need to convert the dataset into a list of lists where each nested list contains integer values (not strings). We also need to remove the header row.

In [3]:
def read_csv(file_name):
    #read file in and split it into lines. splitlines() handles "\r", "\n" and "\r\n" endings.
    with open(file_name,"r") as csv_file:
        data = csv_file.read().splitlines()
    
    #remove header row
    string_list = data[1:]
    
    final_list = []
    
    #convert each field in each row to an integer. append the converted row to final_list.
    for s in string_list:
        int_fields = []
        string_fields = s.split(",")
        for string in string_fields: 
            int_fields.append(int(string))
        final_list.append(int_fields)
    return final_list

#Call read_csv() function on data and display first 10 rows. 
cdc_list = read_csv("US_births_1994-2003_CDC_NCHS.csv")
cdc_list[0:10]

[[1994, 1, 1, 6, 8096],
 [1994, 1, 2, 7, 7772],
 [1994, 1, 3, 1, 10142],
 [1994, 1, 4, 2, 11248],
 [1994, 1, 5, 3, 11053],
 [1994, 1, 6, 4, 11406],
 [1994, 1, 7, 5, 11251],
 [1994, 1, 8, 6, 8653],
 [1994, 1, 9, 7, 7910],
 [1994, 1, 10, 1, 10498]]
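
As an alternative to splitting strings by hand, the standard library's `csv` module can do the field splitting. The sketch below is not the approach used in this project; `parse_rows` is a hypothetical helper name, and it works on a few in-memory lines copied from the output above rather than on the file itself.

```python
import csv

# A few lines in the same shape as the file, used here instead of reading from disk.
sample_lines = [
    "year,month,date_of_month,day_of_week,births",
    "1994,1,1,6,8096",
    "1994,1,2,7,7772",
    "1994,1,3,1,10142",
]

def parse_rows(lines):
    # csv.reader accepts any iterable of strings and splits each one on commas
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    # convert every field of every remaining row to an integer
    return [[int(field) for field in row] for row in reader]

sample_rows = parse_rows(sample_lines)
print(sample_rows)
```

The result has the same list-of-lists shape that `read_csv()` returns, so the functions later in this project would accept it unchanged.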

## Calculating Number Of Births Each Month

Now that the data is in a more usable format, we can start to analyze it. Let's calculate the total number of births that occurred in each month, across all of the years in the dataset. We'll create a dictionary where each key is a unique month and each value is the total number of births in that month across all years.

In [4]:
#Calculate births by month. 

def month_births(lst):
    #create empty dictionary to store monthly totals
    births_per_month = {}
    for row in lst:
        #Extract the value in month column
        month = row[1]
        #Extract the value in birth column
        births = row[4]
        #If month value already exists as key in births_per_month dict, then add to existing value
        if month in births_per_month:
            births_per_month[month] = births_per_month[month] + births
        #If month value doesn't exist as key, create key and the associated births value. 
        else:
            births_per_month[month] = births
    return births_per_month

#Call function to determine number of births each month. 
cdc_month_births = month_births(cdc_list)
cdc_month_births

{1: 3232517,
 2: 3018140,
 3: 3322069,
 4: 3185314,
 5: 3350907,
 6: 3296530,
 7: 3498783,
 8: 3525858,
 9: 3439698,
 10: 3378814,
 11: 3171647,
 12: 3301860}

July through September are the top three months for births, with August the highest. Counting back roughly nine months, that suggests conceptions peak in the late fall and early winter.
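
The ranking can also be read off the dictionary programmatically instead of by eye. A small sketch, using the monthly totals copied from the output above (`ranked_months` is a name introduced here for illustration):

```python
# Monthly totals copied from the output above.
cdc_month_births = {
    1: 3232517, 2: 3018140, 3: 3322069, 4: 3185314,
    5: 3350907, 6: 3296530, 7: 3498783, 8: 3525858,
    9: 3439698, 10: 3378814, 11: 3171647, 12: 3301860,
}

# Sort the month numbers by their totals, largest first.
ranked_months = sorted(cdc_month_births, key=cdc_month_births.get, reverse=True)

print(ranked_months[:3])   # the three months with the most births
print(ranked_months[-1])   # the month with the fewest births
```

These are raw totals. Months have different numbers of days, and this ranking does not adjust for that.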

## Creating A More General Function

It would be best to create a single function that works for any column (month, year, day of week, day of month) and calculates the number of births according to that split. 

In [5]:
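def calc_counts(data,column):
    #total births for each unique value found at the given column index
    dictionary = {}
    for row in data:
        column_calculation = row[column]
        #births are always in the last column (index 4)
        births = row[4]
        #add to the running total if the key exists, otherwise start a new total
        if column_calculation in dictionary:
            dictionary[column_calculation] = dictionary[column_calculation] + births
        else:
            dictionary[column_calculation] = births
    return dictionary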

#Return births by year using this more generalized function.
cdc_year_births = calc_counts(cdc_list,0)
cdc_year_births

{1994: 3952767,
 1995: 3899589,
 1996: 3891494,
 1997: 3880894,
 1998: 3941553,
 1999: 3959417,
 2000: 4058814,
 2001: 4025933,
 2002: 4021726,
 2003: 4089950}

In [6]:
#Finding the percent change between 1994 and 2003, printing the outcome. 
births_1994 = cdc_year_births[1994]
births_2003 = cdc_year_births[2003]
year_percent_change = float(births_2003 - births_1994)/float(births_1994) * 100

print("There were {:,} births in 1994 and {:,} births in 2003; births increased {:.2f}% over the 10 year period."
      .format(births_1994,births_2003,year_percent_change))

There were 3,952,767 births in 1994 and 4,089,950 births in 2003; births increased 3.47% over the 10 year period.
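
The percent change calculation is (new - old) / old * 100. If it were needed for other pairs of years, it could be wrapped in a small helper. A sketch, where `percent_change` is a hypothetical name and the totals are copied from the yearly output above:

```python
def percent_change(old, new):
    # (new - old) / old, expressed as a percentage.
    # float() avoids integer division when run under Python 2.
    return float(new - old) / old * 100

# Yearly totals taken from the calc_counts() output above.
change_1994_2003 = percent_change(3952767, 4089950)
print("{:.2f}%".format(change_1994_2003))
```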


## Min/Max Value Function

Next, we write a function that finds the minimum and maximum values in any dictionary that's passed in and reports the key for each.

In [7]:
#Write a function that calculates min and max values for any dictionary that's passed in. 

def min_max(my_dict):
    #find the keys whose values are the largest and the smallest
    key_max = max(my_dict, key=my_dict.get)
    key_min = min(my_dict, key=my_dict.get)
    #build one string per line so the output looks the same in Python 2 and Python 3
    print('Maximum Value: {} key: {}'.format(my_dict[key_max], key_max))
    print('Minimum Value: {} key: {}'.format(my_dict[key_min], key_min))

min_max(cdc_year_births)

Maximum Value: 4089950 key: 2003
Minimum Value: 3880894 key: 1997
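
The function above prints its results and relies on the built-in `min` and `max`. For comparison, here is a sketch of a version that loops over the dictionary itself and returns the results so they can be reused. The name `min_max_pairs` is hypothetical, and the yearly totals are copied from the earlier output.

```python
def min_max_pairs(my_dict):
    # Track the smallest and largest (key, value) pairs seen so far.
    # Both stay None if the dictionary is empty.
    min_pair = None
    max_pair = None
    for key, value in my_dict.items():
        if min_pair is None or value < min_pair[1]:
            min_pair = (key, value)
        if max_pair is None or value > max_pair[1]:
            max_pair = (key, value)
    return min_pair, max_pair

# Yearly totals copied from the calc_counts() output above.
cdc_year_births = {
    1994: 3952767, 1995: 3899589, 1996: 3891494, 1997: 3880894, 1998: 3941553,
    1999: 3959417, 2000: 4058814, 2001: 4025933, 2002: 4021726, 2003: 4089950,
}

min_pair, max_pair = min_max_pairs(cdc_year_births)
print(min_pair)
print(max_pair)
```

Returning the pairs rather than printing them makes the function easier to combine with later steps, such as comparing extremes across several columns.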


## Difference Between Consecutive Values

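Next, we write a function that filters rows by a given value in a given column (for example, every Saturday), totals births per year for those rows, and calculates the difference between each pair of consecutive years.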

In [8]:
def consecutive_values(lst,column,value):
    #total births per year, counting only rows where row[column] == value
    year_dict = {}
    for row in lst:
        year = row[0]
        births = row[4]
        if row[column] == value:
            if year in year_dict:
                year_dict[year] = year_dict[year] + births
            else:
                year_dict[year] = births
    #walk the years in ascending order and subtract each year's total from the next year's.
    #using only years that have matching rows avoids a KeyError when a year has none.
    years = sorted(year_dict)
    year_dict_diff = {}
    for prev_year, next_year in zip(years, years[1:]):
        year_dict_diff[str(prev_year)+"-"+str(next_year)] = year_dict[next_year] - year_dict[prev_year]
    return year_dict_diff

How have Saturday births changed from year to year between 1994 and 2003?

In [9]:
consecutive_saturday = consecutive_values(cdc_list,3,6)
consecutive_saturday

{'1994-1995': -15152,
 '1995-1996': -3319,
 '1996-1997': -5421,
 '1997-1998': 2936,
 '1998-1999': -3791,
 '1999-2000': 19809,
 '2000-2001': -15866,
 '2001-2002': -8158,
 '2002-2003': 1675}

The biggest year-to-year increase in Saturday births occurred from 1999 to 2000 (+19,809), and the biggest decrease occurred from 2000 to 2001 (-15,866), slightly larger than the drop from 1994 to 1995 (-15,152).

The next step would be to further abstract this function to find the largest increases and decreases across all values for a given column (month, day of week, etc.).
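
As a starting point for that next step, here is a minimal sketch that picks the largest increase and decrease out of a single dictionary of differences. The helper name `largest_changes` is hypothetical, and the input is the Saturday result copied from the output above.

```python
# Saturday year-to-year differences copied from the output above.
consecutive_saturday = {
    '1994-1995': -15152, '1995-1996': -3319, '1996-1997': -5421,
    '1997-1998': 2936, '1998-1999': -3791, '1999-2000': 19809,
    '2000-2001': -15866, '2001-2002': -8158, '2002-2003': 1675,
}

def largest_changes(diffs):
    # Label of the most positive difference (largest increase)
    biggest_increase = max(diffs, key=diffs.get)
    # Label of the most negative difference (largest decrease)
    biggest_decrease = min(diffs, key=diffs.get)
    return biggest_increase, biggest_decrease

increase, decrease = largest_changes(consecutive_saturday)
print("Largest increase: {} ({})".format(increase, consecutive_saturday[increase]))
print("Largest decrease: {} ({})".format(decrease, consecutive_saturday[decrease]))
```

To cover every value in a column, one option would be to call `consecutive_values()` once per distinct value and pass each result to this helper.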