# CDC Natality Data Project

![sleeping_baby.jpg](attachment:sleeping_baby.jpg)

## Table of Contents

* [Introduction](#introduction)
* [Ingestion](#injestion)
* [Exploration](#exploration)
* [Analysis and Beyond](#analysis)

## Introduction <a class='anchor' id='introduction' />

**The Centers for Disease Control and Prevention** (the CDC) compiles a yearly record of US births, with measurements of many features surrounding each birth.

These data and their feature descriptions can be found at

https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm#Downloadable

The CDC publishes some basic reports built on these data; see the Birth Data Files section at the following URL: https://www.cdc.gov/nchs/nvss/new_nvss.htm#New_Releases_for_Birth_Data_Publications
(You can find more by searching other locations as well.)


While there are potential opportunities for ML-type tasks here, these data will mainly push you on exploratory data analysis. In particular, these datasets are large: roughly 4 million rows per year, each with many columns, which will require thoughtful treatment.

## Ingestion <a class='anchor' id='injestion' />

A few steps to keep in mind:
- The data are in *fixed-width format* (fwf), with feature names found in separate PDF files. This is usable, but not as easy to deal with as an already curated CSV file.
- Each year has its own fwf file, and the column widths don't agree exactly across years, so the layout needs to be checked for each year. In particular, if you will be joining many years into one schema, consider which features are held in common and how to extract them.
- Each year's natality dataset is large: one row per birth (roughly 4M per year), with many features. So even a single year of data is likely too large to fit *in-core* (in memory) if you blindly read it into a pandas DataFrame, say.

Considering these obstacles, you may choose to use a script to reformat your data into a CSV file and load it into a database (say, a SQL-like one) to hold the data and carry out basic data analysis. It may also be possible to use the WONDER API system offered by the CDC.
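As a minimal sketch of such a script, the example below slices fixed-width records and streams them into SQLite in batches, so that only one batch is held in memory at a time. The column names and positions in `LAYOUT`, the table name `births`, and the helper names are hypothetical; the real offsets must be taken from the CDC description file for each year.

```python
import io
import sqlite3
from itertools import islice

# Hypothetical layout: (name, start, end) with 0-based, end-exclusive offsets.
# The real positions come from the PDF description file for each year.
LAYOUT = [
    ("birth_year", 0, 4),
    ("birth_month", 4, 6),
    ("mother_age", 6, 8),
]

def parse_line(line, layout):
    """Slice one fixed-width record into a tuple of stripped strings."""
    return tuple(line[start:end].strip() for _, start, end in layout)

def load_fwf(lines, conn, table="births", layout=LAYOUT, batch_size=50_000):
    """Stream records into SQLite in batches so memory use stays bounded."""
    cols = ", ".join(f"{name} TEXT" for name, _, _ in layout)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
    placeholders = ", ".join("?" for _ in layout)
    insert_sql = f"INSERT INTO {table} VALUES ({placeholders})"
    lines = iter(lines)
    total = 0
    while True:
        # Pull at most batch_size lines; an empty batch means end of input.
        batch = [parse_line(l.rstrip("\n"), layout) for l in islice(lines, batch_size)]
        if not batch:
            break
        conn.executemany(insert_sql, batch)
        conn.commit()
        total += len(batch)
    return total

# Toy stand-in for a natality file; a real file would be opened with open(path).
sample = io.StringIO("20180129\n20180734\n20181222\n")
conn = sqlite3.connect(":memory:")
n_loaded = load_fwf(sample, conn, batch_size=2)
rows = conn.execute("SELECT * FROM births ORDER BY birth_month").fetchall()
```

Keeping one layout per year and selecting only the features the years share is one way to end up with a common schema. If you prefer pandas, `pandas.read_fwf` accepts `colspecs` and `chunksize` and supports the same batch-at-a-time pattern.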

## Exploration <a class='anchor' id='exploration' />

#### Requirement: You must always explore and become familiar with your data

Every feature is discussed within the PDF description file. You should become familiar with each of these features and, at a minimum, be able to answer the following:
- Which of the features are measured for the mother?
- Which of the features are measured for the father?
- Which of the features are measured for the infant?
- Which of the features are measured/presented before the birth of the infant?
- Which of the features are measured/presented at/after the birth of the infant (or at least depend on direct testing of the infant, potentially in utero)?
- What are the possible values of each feature and what do those values represent? 

Beyond these requisites, you should also take time here to think about:
Which meaningful and interesting descriptive statistics, variate analyses, and visuals are immediately attainable to help you get a better feel for, and interpretation of, these data?
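As one illustration of an immediately attainable summary, the sketch below computes a frequency table and a mean directly in SQL, so the aggregation happens in the database rather than in memory. It assumes the data already sit in a SQLite table; the table name `births`, the columns `delivery_method` and `mother_age`, the helper `value_counts`, and all values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and column names; the rows are invented toy values.
conn.execute("CREATE TABLE births (delivery_method TEXT, mother_age INTEGER)")
conn.executemany(
    "INSERT INTO births VALUES (?, ?)",
    [("1", 29), ("2", 34), ("1", 22), ("9", 31), ("1", 27)],
)

def value_counts(conn, table, column):
    """Return [(value, count), ...] sorted by descending count."""
    # Identifiers cannot be bound as parameters, so only pass trusted names here.
    sql = (
        f"SELECT {column}, COUNT(*) AS n FROM {table} "
        f"GROUP BY {column} ORDER BY n DESC, {column}"
    )
    return conn.execute(sql).fetchall()

counts = value_counts(conn, "births", "delivery_method")
mean_age = conn.execute("SELECT AVG(mother_age) FROM births").fetchone()[0]
```

Running `value_counts` over every coded column is also a quick way to check the observed values against the codes listed in the description file, including any "unknown" codes that should not be averaged as if they were measurements.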

## Analysis and Beyond <a class='anchor' id='analysis' />

Before going further, we need to have an "end-goal" in mind. It's OK to explore and become familiar with a dataset by simply poking around; perhaps a few questions or project ideas surfaced during those explorations. But let us not forget that we want to aim towards insights or products that are not just interesting, but also marketable, and that expose the value our analyses can bring to various organizations.

Of course, at this point you've already pointed out some of your talents and skills: 
- You've likely handled the ingestion of very large datasets, which are becoming more common across industries. ☑️
- You've likely shown off your ability to explore and wrangle these data using your mastery of some database structuring and language. ☑️
- You've hopefully become familiar with the data itself, understanding more deeply what the features represent, and have shown that what follows will be informed by your new-found domain knowledge. ☑️

It's now time to go deeper and show that we can draw valuable insights. In what follows, we give some example questions that can be explored with the data. You are free to answer some of them, and they can also motivate you to ask, and try to answer, your own questions within these data.

#### Direction: Association Rule Mining


Of course, there are many ways to look for dependencies within data, but one attractive method is **association rule mining**, which looks for implications of co-occurrence.
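To make the idea concrete, a rule $A \Rightarrow C$ is usually scored with three standard quantities: support $P(A \cap C)$, confidence $P(C \mid A)$, and lift $P(C \mid A) / P(C)$. The sketch below computes them by direct counting. The function name `rule_metrics`, the flag names, and the records are all invented for illustration and are not taken from the natality files.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Compute support, confidence and lift for antecedent -> consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    # Count records containing the antecedent, the consequent, and both.
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n
    confidence = n_both / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return {"support": support, "confidence": confidence, "lift": lift}

# Made-up records: each set holds the flags that are "on" for one birth.
births = [
    {"smoker", "nicu"},
    {"smoker"},
    {"smoker", "nicu"},
    {"nicu"},
    set(), set(), set(), set(), set(), set(),
]
metrics = rule_metrics(births, {"smoker"}, {"nicu"})
```

A lift above 1 means the consequent is more common when the antecedent is present than it is overall. That is a statement about co-occurrence only and does not establish that the antecedent causes the outcome.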

**1**) Admission to the NICU (neonatal intensive care unit) happens in roughly 10% of cases, and a NICU admission is costly (e.g., see https://www.managedcaremag.com/archives/2010/1/how-plans-can-improve-outcomes-and-cut-costs-preterm-infant-care). Can you draw any rule between "voluntary actions" of the mother and an infant being admitted to the NICU? Such relationships can help identify when interventions may be most effective, and can inform insurance warnings based on risk factors.

**2**) Are there any rules associated with the adverse outcomes experienced by the infant at birth? 

...

#### Direction: Time-Dependent Analysis 

**3**) These data come as a time series across many years. In general, what seasonality (if any) is apparent in total births throughout a year?
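One simple way to start on this question is to compute, for each month, its share of that year's births and then average the shares across years. The sketch below does this with counters only, so it can consume a stream of records without holding them all in memory. The function name `seasonal_index` and the counts are made up for illustration.

```python
from collections import Counter, defaultdict

def seasonal_index(records):
    """records: iterable of (year, month) pairs, one per birth.

    Returns {month: share of a year's births}, averaged over the years
    in which that month appears.
    """
    by_year_month = Counter(records)
    year_totals = Counter()
    for (year, _), n in by_year_month.items():
        year_totals[year] += n
    shares = defaultdict(list)
    for (year, month), n in by_year_month.items():
        shares[month].append(n / year_totals[year])
    return {m: sum(v) / len(v) for m, v in sorted(shares.items())}

# Made-up counts for illustration only.
records = (
    [(2017, 1)] * 2 + [(2017, 8)] * 6
    + [(2018, 1)] * 1 + [(2018, 8)] * 3
)
index = seasonal_index(records)
```

With twelve months of real data, a share noticeably above or below 1/12 hints at seasonality, although months differ in length, so dividing by the number of days in each month gives a fairer comparison.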

**4**) Are these trends common among varying demographics of the mother/parents? Can you forecast these trends? 

**5**) If you consider the relationship between education level and income level, can we find any trends that could help inform the "baby-goods" industry where to allocate resources?

...


#### Direction: More into ML

You will not be able to read all the data (likely not even a single year, on most laptops) into a simple pandas DataFrame without swamping your memory. For the ML questions below, you won't be able to "iris" them by throwing the whole dataset into a sklearn algorithm and sitting back. Instead, you will need to approach it more thoughtfully. For example, you could divide the larger dataset into smaller sets that fit into memory, attack each of those pieces one by one, and recombine your results in a reasonable way; or you could use "out-of-core" techniques that consume the data in "chunks".
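As a rough sketch of the chunked approach, the example below fits a logistic regression by stochastic gradient descent, updating the weights one chunk at a time so that the full dataset never needs to be in memory. The class `OnlineLogisticRegression`, the helper `chunks`, the `pos_weight` parameter (an optional extra weight on the positive class), and the synthetic data are all assumptions made for illustration. scikit-learn estimators that expose `partial_fit`, such as `SGDClassifier`, follow the same pattern.

```python
import math
import random

class OnlineLogisticRegression:
    """Logistic regression fitted by SGD, one chunk at a time."""

    def __init__(self, n_features, lr=0.1, pos_weight=1.0):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr
        self.pos_weight = pos_weight  # extra weight on the positive class

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X_chunk, y_chunk):
        """Update the weights using only the rows in this chunk."""
        for x, y in zip(X_chunk, y_chunk):
            weight = self.pos_weight if y == 1 else 1.0
            grad = weight * (self.predict_proba(x) - y)
            self.w = [wi - self.lr * grad * xi for wi, xi in zip(self.w, x)]
            self.b -= self.lr * grad
        return self

def chunks(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# Synthetic data: the label is 1 when the single feature is positive.
rng = random.Random(0)
data = [(x, 1 if x > 0 else 0) for x in (rng.uniform(-1, 1) for _ in range(2000))]

model = OnlineLogisticRegression(n_features=1)
for chunk in chunks(data, size=250):
    X = [[x] for x, _ in chunk]
    y = [label for _, label in chunk]
    model.partial_fit(X, y)

accuracy = sum(
    (model.predict_proba([x]) > 0.5) == (label == 1) for x, label in data
) / len(data)
```

In practice the rows would come from a file reader or a database cursor rather than a list, and accuracy would be measured on held-out data rather than on the training rows as done here for brevity.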

**Note**: The structure of the data may not lead to a "great" result with regard to predictive power. But it will give you a chance to consider how to deal with unbalanced data, how to think about cost between classes, how to attack large datasets, which questions you ask, and how you approach a solution.

**6**) Considering only features that are present before birth, can you rank the importance of those features as predictors of some of the adverse outcomes of the infant at birth? How well does your predictor work?

...