# Assignment 2
For this assignment you'll be looking at 2017 data on immunizations from the CDC. Your datafile for this assignment is in [assets/NISPUF17.csv](assets/NISPUF17.csv). You'll need the data user's guide and the codebook to understand the variables and map them to the questions being asked. They are available at [assets/NIS-PUF17-DUG.pdf](assets/NIS-PUF17-DUG.pdf) and [assets/NIS-PUF17-CODEBOOK.pdf](assets/NIS-PUF17-CODEBOOK.pdf).

**Note: For all assignments please write all of your code within the function we define in order to ensure it is run by the autograder correctly**

## Question 1: proportion_of_education()
Write a function called `proportion_of_education` which returns the proportion of children in the dataset whose mother had each of the following education levels: less than high school (<12), high school (12), more than high school but not a college graduate (>12), and college degree.

*This function should return a dictionary in the form of (use the correct numbers, do not round numbers):* 
```
    {"less than high school":0.2,
    "high school":0.4,
    "more than high school but not college":0.2,
    "college":0.2}
```
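
As a rough illustration of the computation being asked for, here is a minimal sketch on made-up data (not the real NIS file). It assumes the education column is called `EDUC1` and is coded 1 to 4 in the order listed above; the `toy` frame, the `labels` mapping and `toy_result` are hypothetical names used only for this example.

```python
import pandas as pd

# Made-up EDUC1 codes for ten children (not real NIS data)
toy = pd.DataFrame({"EDUC1": [1, 2, 2, 3, 4, 4, 4, 2, 3, 1]})

# Assumed code-to-label mapping, following the categories in the question
labels = {
    1: "less than high school",
    2: "high school",
    3: "more than high school but not college",
    4: "college",
}

# Share of rows per code, then swap the numeric codes for readable keys
shares = toy["EDUC1"].value_counts(normalize=True).rename(index=labels)

# Convert to a plain dictionary of label -> proportion
toy_result = {label: float(share) for label, share in shares.items()}
print(toy_result)
```

Renaming through an explicit mapping ties each label to its code, so the result does not depend on the order in which `value_counts` returns the categories.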


In [20]:
def proportion_of_education():
    # your code goes here
    """
    The column EDUC1 in the NIS data has categories for the mother's education levels as follows:
    1: < 12 years (less than high school)
    2: 12 years (high school)
    3: > 12 years, non-college grad (more than high school but not college)
    4: college grad (college)
    
    This is a function to return the proportion of children with mothers belonging to the above categories as a dictionary.
    """
    import pandas as pd
    # load data
    df = pd.read_csv("assets/NISPUF17.csv")
    # proportion of children in each EDUC1 category
    proportion = df["EDUC1"].value_counts(normalize=True)
    # map each numeric code to its label explicitly, so the result
    # does not depend on the order of the value counts
    labels = {1: 'less than high school',
              2: 'high school',
              3: 'more than high school but not college',
              4: 'college'}

    return {label: proportion.loc[code] for code, label in labels.items()}

In [21]:
assert type(proportion_of_education())==type({}), "You must return a dictionary."
assert len(proportion_of_education()) == 4, "You have not returned a dictionary with four items in it."
assert "less than high school" in proportion_of_education().keys(), "You have not returned a dictionary with the correct keys."
assert "high school" in proportion_of_education().keys(), "You have not returned a dictionary with the correct keys."
assert "more than high school but not college" in proportion_of_education().keys(), "You have not returned a dictionary with the correct keys."
assert "college" in proportion_of_education().keys(), "You have not returned a dictionary with the correct keys."


## Question 2: chickenpox_by_sex()
It would be interesting to see if there is any evidence of a link between vaccine effectiveness and sex of the child. Calculate the ratio of the number of children who contracted chickenpox despite being vaccinated against it (at least one varicella dose) to the number who were vaccinated and did not contract chickenpox. Return results by sex.

*This function should return a dictionary in the form of (use the correct numbers):* 
```
    {"male":0.2,
    "female":0.4}
```

Note: To aid in verification, the `chickenpox_by_sex()['female']` value the autograder is looking for starts with the digits `0.00779`.
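
To make the ratio concrete, here is a small sketch on made-up records (not the real NIS data). It assumes the columns are named `SEX` (1 = male, 2 = female), `HAD_CPOX` (1 = yes, 2 = no) and `P_NUMVRC` (number of varicella doses); `toy` and `toy_ratios` are hypothetical names for this example only.

```python
import pandas as pd

# Made-up records (not real NIS data)
toy = pd.DataFrame({
    "SEX":      [1, 1, 1, 1, 2, 2, 2, 2, 2, 1],
    "HAD_CPOX": [1, 2, 2, 2, 1, 2, 2, 2, 2, 1],
    "P_NUMVRC": [1, 2, 1, 1, 2, 1, 1, 2, 1, 0],
})

# Keep only children with at least one varicella dose
vaccinated = toy[toy["P_NUMVRC"] > 0]

# Rows: SEX, columns: HAD_CPOX, cells: number of children
counts = pd.crosstab(vaccinated["SEX"], vaccinated["HAD_CPOX"])

# Vaccinated-and-had-chickenpox divided by vaccinated-and-did-not
ratio = counts[1] / counts[2]
toy_ratios = {"male": float(ratio.loc[1]), "female": float(ratio.loc[2])}
print(toy_ratios)
```

In this toy frame the last row is dropped because that child had no doses, leaving one of four vaccinated boys and one of five vaccinated girls with chickenpox.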

In [22]:
def chickenpox_by_sex():
    # YOUR CODE HERE
    """
    The columns in the NIS data which are of importance are:
    1. SEX: sex of child (1 = male, 2 = female)
    2. HAD_CPOX: did the child ever have chickenpox? (1 = yes, 2 = no, 77 = don't know, 99 = refused)
    3. P_NUMVRC: number of varicella-containing shots (0, 1, 2, 3)
    
    This is a function to return the ratio of children who were vaccinated and contracted chickenpox vs. those who were vaccinated and did not contract chickenpox.
    We shall exclude the children who didn't know or refused to reveal if they had chickenpox.   
    """
    import pandas as pd
    # load data
    df=pd.read_csv("assets/NISPUF17.csv")
    # consider only those children who either had or didn't have chickenpox
    df_cpox=df.loc[df.HAD_CPOX.isin([1,2])]
    # consider only those children who had at least 1 dose of varicella vaccine
    vax=df_cpox.loc[df_cpox.P_NUMVRC.isin([1,2,3])]
    # obtain the number of vaccinated children of both genders who had and didn't have chickenpox
    vax_df=vax.groupby([vax.SEX, vax.HAD_CPOX])['HAD_CPOX'].count()
    
    male_cpox=vax_df.loc[1,1]                  # number of vaccinated males who had chickenpox
    male_no_cpox=vax_df.loc[1,2]               # number of vaccinated males who had no chickenpox
    female_cpox=vax_df.loc[2,1]                # number of vaccinated females who had chickenpox
    female_no_cpox=vax_df.loc[2,2]             # number of vaccinated females who had no chickenpox
    
    male_ratio=male_cpox/male_no_cpox          # ratio of vaccinated males who had chickenpox vs. those who had no chickenpox
    female_ratio=female_cpox/female_no_cpox    # ratio of vaccinated females who had chickenpox vs. those who had no chickenpox
    
    # dict containing vaccinated male and female ratios of chickenpox vs. no chickenpox
    cpox_dict={"male":male_ratio, "female":female_ratio}
    
    return cpox_dict
        

In [23]:
assert len(chickenpox_by_sex())==2, "Return a dictionary with two items, the first for males and the second for females."


## Question 3: corr_chickenpox()
A correlation is a statistical relationship between two variables. If we wanted to know if vaccines work, we might look at the correlation between the use of the vaccine and whether it results in prevention of the infection or disease [1]. In this question, you are to see if there is a correlation between having had chickenpox and the number of chickenpox (varicella) vaccine doses given.

Some notes on interpreting the answer: the `had_chickenpox_column` is either `1` (for yes) or `2` (for no), and the `num_chickenpox_vaccine_column` is the number of doses of the varicella vaccine a child has been given. A positive correlation (i.e., `corr > 0`) means that higher values of `had_chickenpox_column` (more no's) go together with higher values of `num_chickenpox_vaccine_column` (more doses of vaccine). A negative correlation (i.e., `corr < 0`) indicates that having had chickenpox is associated with a higher number of vaccine doses.
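
The sign convention can be checked on a tiny made-up sample. The sketch below uses a hand-rolled Pearson formula rather than `scipy.stats.pearsonr`, purely to stay self-contained; the `pearson` helper and the toy lists are hypothetical and are not part of the assignment data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up data: 1 = had chickenpox, 2 = did not
had_chickenpox = [1, 1, 1, 2, 2, 2]
doses = [0, 0, 1, 1, 2, 2]

# "No" answers (2) go with more doses here, so the coefficient is positive
toy_corr = pearson(had_chickenpox, doses)

# Reversing the doses pairs "yes" answers (1) with more doses: negative
flipped_corr = pearson(had_chickenpox, doses[::-1])
print(toy_corr, flipped_corr)
```

Because "yes" is coded as the smaller number, a positive coefficient here reads as "children without chickenpox tend to have more doses", which is easy to get backwards.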

`pval` is the probability of observing, by chance alone, a correlation between `had_chickenpox_column` and `num_chickenpox_vaccine_column` at least as extreme as the one computed. A small `pval` means that the observed correlation is highly unlikely to have occurred by chance. In this case, `pval` should be very small (it will end in `e-18`).

[1] This isn't really the full picture, since we are not looking at when the dose was given. It's possible that children had chickenpox and then their parents went to get them the vaccine. Does this dataset have the data we would need to investigate the timing of the dose?

In [29]:
def corr_chickenpox():
    import scipy.stats as stats
    import pandas as pd

    # load data
    df = pd.read_csv("assets/NISPUF17.csv")
    # keep only children with a definite yes (1) or no (2) for chickenpox
    df_cpox = df.loc[df.HAD_CPOX.isin([1, 2])]
    # drop rows with no recorded dose count; reassigning instead of using
    # inplace=True avoids modifying a slice of df (SettingWithCopyWarning)
    df_cpox = df_cpox.dropna(subset=["P_NUMVRC"])
    # Pearson correlation coefficient and its p-value
    corr, pval = stats.pearsonr(df_cpox["HAD_CPOX"], df_cpox["P_NUMVRC"])

    # just return the correlation
    return corr

In [30]:
assert -1<=corr_chickenpox()<=1, "You must return a float number between -1.0 and 1.0."
