# Basic Overview 
The primary objective is to visualize and analyze teen birth rate data in the USA for the years 2003-2015.

Comments, criticism, and appreciation are all welcome. Do not be shy: send me an email at babinu@gmail.com!

Source of data : https://data.cdc.gov/api/views/3h58-x6cd/rows.csv?accessType=DOWNLOAD

In [14]:
# Section for importing relevant modules. To improve readability, we import the remaining
# modules only as and when they are needed, rather than all at once.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [15]:
teen_birth_rate_data = pd.read_csv("project_2_teen_birth_rates_usa.csv")

In [16]:
teen_birth_rate_data.columns

Index(['Year', 'State', 'County', 'State FIPS Code', 'County FIPS Code',
       'Combined FIPS Code', 'Birth Rate', 'Lower Confidence Limit',
       'Upper Confidence Limit'],
      dtype='object')

### A quick display of the obtained dataframe


In [17]:
teen_birth_rate_data.head()

Unnamed: 0,Year,State,County,State FIPS Code,County FIPS Code,Combined FIPS Code,Birth Rate,Lower Confidence Limit,Upper Confidence Limit
0,2003,Alabama,Autauga,1,1,1001,46.377215,40.683107,52.508481
1,2004,Alabama,Autauga,1,1,1001,46.050618,41.084735,51.340795
2,2005,Alabama,Autauga,1,1,1001,43.941062,39.513897,48.646647
3,2006,Alabama,Autauga,1,1,1001,43.826654,39.570766,48.345353
4,2007,Alabama,Autauga,1,1,1001,43.757806,39.573448,48.199268


In [18]:
# We do not need the FIPS code related data for this analysis. Hence, let us remove them.
teen_birth_rate_data_v1 = teen_birth_rate_data.drop(columns=['State FIPS Code', 'County FIPS Code', 'Combined FIPS Code'])

In [19]:
teen_birth_rate_data_v1.head()

Unnamed: 0,Year,State,County,Birth Rate,Lower Confidence Limit,Upper Confidence Limit
0,2003,Alabama,Autauga,46.377215,40.683107,52.508481
1,2004,Alabama,Autauga,46.050618,41.084735,51.340795
2,2005,Alabama,Autauga,43.941062,39.513897,48.646647
3,2006,Alabama,Autauga,43.826654,39.570766,48.345353
4,2007,Alabama,Autauga,43.757806,39.573448,48.199268


### Q : What exactly are the lower and upper confidence limits, and do we need them ?

#### A : The birth rates for a year are estimated from the data for July using a Bayesian model (https://catalog.data.gov/dataset/teen-birth-rates-for-age-group-15-19-in-the-united-states-by-county),
#### and hence each estimate also comes with confidence limit values (derived from its standard error).

#### For the sake of our analysis, we will omit these.
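
Before dropping the limits, it can help to see what they convey. The sketch below is only an illustration: it rebuilds the first three Autauga County rows shown earlier by hand, computes the width of each interval, and checks that every estimate lies between its own limits. The column name `Interval Width` is an assumption introduced here and is not part of the dataset.

```python
import pandas as pd

# First three Autauga County rows, copied from the table shown earlier.
sample = pd.DataFrame({
    "Year": [2003, 2004, 2005],
    "Birth Rate": [46.377215, 46.050618, 43.941062],
    "Lower Confidence Limit": [40.683107, 41.084735, 39.513897],
    "Upper Confidence Limit": [52.508481, 51.340795, 48.646647],
})

# Width of each interval: a wider interval means a less certain estimate.
# "Interval Width" is a hypothetical column name used only for this sketch.
sample["Interval Width"] = (
    sample["Upper Confidence Limit"] - sample["Lower Confidence Limit"]
)

# Each estimate should sit between its own lower and upper limit.
inside = sample["Birth Rate"].between(
    sample["Lower Confidence Limit"], sample["Upper Confidence Limit"]
)

print(sample[["Year", "Birth Rate", "Interval Width"]])
print("All estimates inside their limits:", bool(inside.all()))
```

The same two expressions should work on the full `teen_birth_rate_data_v1` frame, since it carries the same three columns.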


In [20]:
teen_birth_rate_data_v2 = teen_birth_rate_data_v1.drop(columns=['Lower Confidence Limit', 'Upper Confidence Limit'])

In [21]:
teen_birth_rate_data_v2.head()

Unnamed: 0,Year,State,County,Birth Rate
0,2003,Alabama,Autauga,46.377215
1,2004,Alabama,Autauga,46.050618
2,2005,Alabama,Autauga,43.941062
3,2006,Alabama,Autauga,43.826654
4,2007,Alabama,Autauga,43.757806


### How do we validate the given data ?

In [22]:
# We do a simple data validation to make sure that birth rates are indeed numbers.
print(teen_birth_rate_data_v2['Birth Rate'].describe())

count    40781.000000
mean        40.319738
std         19.644728
min          2.868646
25%         25.192610
50%         37.541996
75%         52.733487
max        135.231014
Name: Birth Rate, dtype: float64
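
`describe()` confirms the column is numeric because it reports `float64` and numeric summary statistics. A slightly more explicit check might look like the sketch below. The helper name `validate_birth_rates` and the keys of the dictionary it returns are assumptions made for illustration; they are not part of this notebook.

```python
import pandas as pd


def validate_birth_rates(rates: pd.Series) -> dict:
    """Hypothetical helper: basic sanity checks for a column of birth rates."""
    # Values that cannot be parsed as numbers become NaN after coercion.
    coerced = pd.to_numeric(rates, errors="coerce")
    missing = int(rates.isna().sum())
    return {
        "missing": missing,
        "non_numeric": int(coerced.isna().sum()) - missing,
        "negative": int((coerced < 0).sum()),
    }


# Three values taken from the table shown earlier.
good = pd.Series([46.377215, 46.050618, 43.941062], name="Birth Rate")
# A deliberately broken column to show what the checks catch.
bad = pd.Series([1.0, "abc", None, -2.0], name="Birth Rate")

good_checks = validate_birth_rates(good)
bad_checks = validate_birth_rates(bad)
print(good_checks)
print(bad_checks)
```

On the real data one would call `validate_birth_rates(teen_birth_rate_data_v2['Birth Rate'])` and expect all three counts to be zero.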


### Which county had the highest/lowest birth rate across all years ?

In [23]:
# We can find this in a few different ways. The commented-out line below uses a boolean mask instead of idxmax().
#print(teen_birth_rate_data_v2[teen_birth_rate_data_v2['Birth Rate'] == teen_birth_rate_data_v2['Birth Rate'].max()])
print("Highest birth rate  county : \n", teen_birth_rate_data_v2.loc[teen_birth_rate_data_v2['Birth Rate'].idxmax()])
print("Lowest birth rate  county : \n", teen_birth_rate_data_v2.loc[teen_birth_rate_data_v2['Birth Rate'].idxmin()])

Highest birth rate  county : 
 Year             2008
State           Texas
County         Brooks
Birth Rate    135.231
Name: 33038, dtype: object
Lowest birth rate  county : 
 Year                   2015
State         Massachusetts
County            Hampshire
Birth Rate          2.86865
Name: 15859, dtype: object
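
Another way to get the same answer is `nlargest` / `nsmallest`, which return whole rows and extend naturally to a top-N listing. The sketch below uses a three-row frame built by hand: the Brooks and Hampshire rates are the maximum and minimum reported by `describe()` above, and the Autauga row comes from the first table.

```python
import pandas as pd

# Hand-built sample; values are taken from outputs shown earlier in this notebook.
sample = pd.DataFrame({
    "Year": [2008, 2015, 2003],
    "State": ["Texas", "Massachusetts", "Alabama"],
    "County": ["Brooks", "Hampshire", "Autauga"],
    "Birth Rate": [135.231014, 2.868646, 46.377215],
})

# nlargest/nsmallest return DataFrames sorted by the chosen column.
highest = sample.nlargest(1, "Birth Rate")
lowest = sample.nsmallest(1, "Birth Rate")

print("Highest birth rate county:\n", highest)
print("Lowest birth rate county:\n", lowest)
```

Replacing `1` with, say, `5` would list the five highest or lowest county-year rows instead of a single one.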


### Can we plot birth rate data across years for Brooks County, Texas ?

In [24]:
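# Filter first, then set the index on the result. Chaining set_index() avoids
# modifying a filtered slice in place, which can trigger SettingWithCopyWarning.
brooks_county_data = (
    teen_birth_rate_data_v2[
        (teen_birth_rate_data_v2['County'] == 'Brooks') & (teen_birth_rate_data_v2['State'] == 'Texas')]
    .set_index('Year'))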

In [25]:
brooks_county_data.dtypes

State          object
County         object
Birth Rate    float64
dtype: object

In [26]:
from plotting_functions import plot_rel_data_v2
ax = plot_rel_data_v2(brooks_county_data['Birth Rate'], 'Year', 'Birth Rate', 'Birth rate across years for Brooks County (TX)', 1, 5)
max_obj = brooks_county_data.loc[brooks_county_data['Birth Rate'].idxmax()]

print(max_obj)
print(max_obj.name, max_obj.State, max_obj.County,max_obj['Birth Rate'])

# Change the top y limit to make sure that annotations are made properly.
ax.set_ylim(top=brooks_county_data['Birth Rate'].max() + 7)
ax.annotate('Effect of financial crisis ?', 
            xy=(int(max_obj.name), max_obj['Birth Rate'] + 0.5),
            xytext=(int(max_obj.name), max_obj['Birth Rate'] + 3),
            arrowprops=dict(facecolor='green'),
            horizontalalignment='right',
            verticalalignment='top'
            )

NameError: name 'plt' is not defined
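
The helper `plot_rel_data_v2` lives in a separate `plotting_functions` module that is not shown here. A `NameError` for `plt` is what Python raises when a module uses `plt` without importing `matplotlib.pyplot` itself, because imports made in the notebook do not carry over into imported modules. The sketch below is a hypothetical stand-in, not the author's function: the name `plot_series_sketch` is invented, and the two trailing numeric arguments of `plot_rel_data_v2` are left out because their meaning is not documented here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; omit inside a notebook
import matplotlib.pyplot as plt  # the helper module must import this itself
import pandas as pd


def plot_series_sketch(series, x_label, y_label, title):
    """Hypothetical line-plot helper: plots a Series against its index and returns the Axes."""
    fig, ax = plt.subplots()
    ax.plot(series.index, series.values, marker="o")
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.set_title(title)
    return ax


# Made-up toy values, used only to exercise the helper.
toy_rates = pd.Series(
    [10.0, 12.0, 11.0], index=pd.Index([2003, 2004, 2005], name="Year")
)
ax = plot_series_sketch(toy_rates, "Year", "Birth Rate", "Toy series")
```

Returning the `Axes` object is what allows the caller to adjust limits and add annotations afterwards, as the cell above does with `set_ylim` and `annotate`.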

### Can we average the birth rates across counties (for each state) and see which state has the highest/lowest birth rates ?

In [None]:
states_data_agg = \
    teen_birth_rate_data_v2[['Birth Rate']].groupby(
        [teen_birth_rate_data_v2['Year'], teen_birth_rate_data_v2['State']]).mean()
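
The same aggregation can be written by passing column names to `groupby`. With `as_index=False` the group keys stay as ordinary columns, so no later `reset_index` is needed. The sketch below runs on a small frame of made-up numbers (the states, counties, and rates are placeholders, not values from the dataset).

```python
import pandas as pd

# Made-up toy data: two counties in "State X" for 2003, one in "State Y", one row for 2004.
toy = pd.DataFrame({
    "Year": [2003, 2003, 2003, 2004],
    "State": ["State X", "State X", "State Y", "State X"],
    "County": ["A", "B", "C", "A"],
    "Birth Rate": [10.0, 30.0, 50.0, 20.0],
})

# Unweighted mean of county rates per (Year, State); keys remain regular columns.
toy_agg = toy.groupby(["Year", "State"], as_index=False)["Birth Rate"].mean()
print(toy_agg)
```

Note that this is a plain average over counties, so a small county counts as much as a large one.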

In [None]:
states_data_agg.head()

In [None]:
states_data_agg.reset_index(inplace=True)

In [None]:
states_data_agg.head()

In [None]:
print(states_data_agg.loc[states_data_agg['Birth Rate'].idxmax()])
print(states_data_agg[states_data_agg['Birth Rate']  == states_data_agg['Birth Rate'].max()])

print(states_data_agg.loc[states_data_agg['Birth Rate'].idxmin()])

### Can we plot the states with the maximum and minimum birth rates (Mississippi and Connecticut) together in one graph ?

In [None]:
states_data_agg.set_index('Year', inplace=True)

In [None]:
miss_data = states_data_agg[['Birth Rate']][states_data_agg['State'] == 'Mississippi']
conn_data = states_data_agg[['Birth Rate']][states_data_agg['State'] == 'Connecticut']
miss_data.head()

In [None]:
mult_states_data = pd.merge(
    miss_data, 
    conn_data, 
    left_index=True, 
    right_index=True, 
    how='inner', 
    suffixes=['_MISS', '_CONN'])

In [None]:
mult_states_data.head()
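
An alternative to merging two filtered frames is to pivot a long table into one column per state. The sketch below assumes a long table with `Year` as a regular column and uses made-up rates purely for illustration; the name `toy_states` is hypothetical.

```python
import pandas as pd

# Made-up rates; only the shape of the table matters for this sketch.
toy_states = pd.DataFrame({
    "Year": [2003, 2003, 2004, 2004],
    "State": ["Mississippi", "Connecticut", "Mississippi", "Connecticut"],
    "Birth Rate": [60.0, 20.0, 58.0, 19.0],
})

# One row per year, one column per state.
wide = toy_states.pivot(index="Year", columns="State", values="Birth Rate")

# Keep only the states of interest, in the desired order.
wide = wide[["Mississippi", "Connecticut"]]
print(wide)
```

This scales to any number of states without chaining further merges, and the column labels are the state names themselves rather than suffixed copies of `Birth Rate`.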

In [None]:
plot_rel_data_v2(mult_states_data, 'Year', 'Birth Rates', 'Birth Rates for Mississippi and Connecticut', 1, 4)

### Can we plot data for the entire USA ?

In [None]:
# We can aggregate data in the exact same manner as we did for individual states.
entire_usa_data = teen_birth_rate_data_v2[['Birth Rate']].groupby(teen_birth_rate_data_v2['Year']).mean()
#entire_usa_data.set_index('Year', inplace=True)
entire_usa_data

In [None]:
plot_rel_data_v2(entire_usa_data, 'Year', 'Birth Rate', 'Birth Rates for entire USA', 1, 1)

### Can we decipher something statistically significant from this dataset ?

#### The conclusions are similar to those of project 1. The data presented here does not appear sufficient to
#### support a useful, statistically significant result.