# Vaccination rate in schools - where can we improve?

<img src="vaccine.jpg" width="900">

**Credit:** [healthline](https://www.healthline.com/health-news/vaccinations-before-new-school-year)

In [1]:
# Load relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
import warnings

warnings.filterwarnings("ignore")  # Suppress all warnings

## Introduction

**Business Context:** Reporters at the Wall Street Journal collected data on school-specific vaccination rates. In total, the WSJ’s dataset covers more than 46,000 schools, of which 42,000 have at least one vaccination rate available. Most states provided data for the 2018–19 school year. Census data are also available, with estimates of population and poverty by district and of median household income by state.

**Analytics Context:**

Questions: 
1. Which states have the highest and lowest vaccination rates?
2. Does socioeconomic status play a role in vaccination rates?

**Goal**: Create a model to predict vaccination compliance at schools in the United States.


## Data Wrangling
### Extracting and cleaning relevant data

Let's start by looking at the datasets.

In [80]:
vaccine_df = pd.read_csv('state-overviews.csv', index_col=0)
vaccine_df = vaccine_df.sort_values(by=['state','county/district'], ascending=True)
vaccine_df.head()


index,state,year,county/district,enroll,mmr,overall,xmed,xper,xrel
1,Alabama,2017-18,Autauga,1817,64.17,96.39,0.04,,0.57
2,Alabama,2017-18,Baldwin,5479,70.89,96.53,0.09,,1.15
3,Alabama,2017-18,Barbour,733,72.17,88.27,0.05,,0.13
4,Alabama,2017-18,Bibb,538,66.54,94.54,0.0,,0.54
5,Alabama,2017-18,Blount,1450,70.69,97.3,0.0,,0.46


In [82]:
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}

vaccine_df['state'] = vaccine_df['state'].map(us_state_abbrev)
vaccine_df

index,state,year,county/district,enroll,mmr,overall,xmed,xper,xrel
1,AL,2017-18,Autauga,1817,64.17,96.39,0.04,,0.57
2,AL,2017-18,Baldwin,5479,70.89,96.53,0.09,,1.15
3,AL,2017-18,Barbour,733,72.17,88.27,0.05,,0.13
4,AL,2017-18,Bibb,538,66.54,94.54,0,,0.54
5,AL,2017-18,Blount,1450,70.69,97.3,0,,0.46
...,...,...,...,...,...,...,...,...,...
19,WY,2018-19,Sweetwater,616,80.00,75,,,
20,WY,2018-19,Teton,218,72.00,65,,,
21,WY,2018-19,Uinta,348,72.00,64,,,
22,WY,2018-19,Washakie,117,94.00,89,,,
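
`Series.map` returns `NaN` for any value that is missing from the dictionary, so a misspelled or unexpected state name would be lost silently when the column is overwritten. The following is a minimal sketch of a check that could run before the assignment. `toy_df` and `abbrev` are hypothetical stand-ins for `vaccine_df` and `us_state_abbrev`, and the misspelled row is invented for illustration.

```python
import pandas as pd

# Toy stand-in for vaccine_df; the last row uses a deliberately misspelled name.
toy_df = pd.DataFrame({
    "state": ["Alabama", "Wyoming", "Wyomin"],
    "county/district": ["Autauga", "Teton", "Uinta"],
})

# Small subset of the full name-to-abbreviation dictionary.
abbrev = {"Alabama": "AL", "Wyoming": "WY"}

mapped = toy_df["state"].map(abbrev)

# Names that are missing from the dictionary come back as NaN after map().
unmapped_names = sorted(set(toy_df.loc[mapped.isna(), "state"]))
print(unmapped_names)
```

If the list is empty, every state name was recognised and the mapped column can safely replace the original.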


In [157]:
counties_for_state = {}

for state in vaccine_df['state'].unique():
    counties_in_state = vaccine_df[vaccine_df['state'] == state]['county/district'].values.tolist()
    counties_for_state[state] = [x.lower() for x in counties_in_state]

# Add an empty list for states that appear in the poverty data but not in the vaccination data
for state in ['AK', 'AR', 'DE', 'DC', 'GA', 'HI', 'ID', 'IL', 'MO', 'MS', 'NH', 'PR', 'VA', 'WV']:
    counties_for_state[state] = []

counties_for_state

{'AL': ['autauga',
  'baldwin',
  'barbour',
  'bibb',
  'blount',
  'bullock',
  'butler',
  'calhoun',
  'chambers',
  'cherokee',
  'chilton',
  'choctaw',
  'clarke',
  'clay',
  'cleburne',
  'coffee',
  'colbert',
  'conecuh',
  'coosa',
  'covington',
  'crenshaw',
  'cullman',
  'dale',
  'dallas',
  'dekalb',
  'elmore',
  'escambia',
  'etowah',
  'fayette',
  'franklin',
  'geneva',
  'greene',
  'hale',
  'henry',
  'houston',
  'jackson',
  'jefferson',
  'lamar',
  'lauderdale',
  'lawrence',
  'lee',
  'limestone',
  'lowndes',
  'macon',
  'madison',
  'marengo',
  'marion',
  'marshall',
  'mobile',
  'monroe',
  'montgomery',
  'morgan',
  'perry',
  'pickens',
  'pike',
  'randolph',
  'russell',
  'shelby',
  'st clair',
  'sumter',
  'talladega',
  'tallapoosa',
  'tuscaloosa',
  'walker',
  'washington',
  'wilcox',
  'winston'],
 'AZ': ['apache',
  'cochise',
  'coconino',
  'gila',
  'graham',
  'greenlee',
  'la paz',
  'maricopa',
  'mohave',
  'navajo',
  'pi

------------------

The dataset below contains the estimated number of children aged 5 to 17 in poverty who are related to the householder. The data are reported at the school-district level.

In [168]:
poverty_df = pd.read_excel('poverty_rate_district18.xls', header=None)

# Drop the two rows that sit above the real header row
poverty_df = poverty_df.drop([0, 1])

new_header = poverty_df.iloc[0]     # the first remaining row holds the column names
poverty_df = poverty_df[1:].copy()  # keep the data rows only; copy so new columns can be added safely
poverty_df.columns = new_header     # use that row as the DataFrame header
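
Because the file is read with `header=None`, the count columns may end up with `object` dtype rather than a numeric one. A minimal sketch of how a child poverty rate could be derived from these columns, assuming the rate of interest is children in poverty divided by the population aged 5 to 17. The two rows are copied from this dataset as displayed in the notebook, and `child_poverty_rate` is a hypothetical column name.

```python
import pandas as pd

POVERTY_COL = (
    "Estimated number of relevant children 5 to 17 years old "
    "in poverty who are related to the householder"
)

# Two rows taken from the notebook's displayed output, stored as strings
# to mimic columns that were not parsed as numbers.
toy = pd.DataFrame({
    "Name": ["Autauga County School District", "Baldwin County School District"],
    "Estimated Population 5-17": ["9799", "35155"],
    POVERTY_COL: ["1891", "4534"],
})

# Convert to numbers; anything that cannot be parsed becomes NaN.
children = pd.to_numeric(toy["Estimated Population 5-17"], errors="coerce")
in_poverty = pd.to_numeric(toy[POVERTY_COL], errors="coerce")

# Share of children aged 5-17 in poverty, in percent.
toy["child_poverty_rate"] = (100 * in_poverty / children).round(2)
print(toy[["Name", "child_poverty_rate"]])
```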



In [167]:
# Flag districts whose name contains any county listed for the same state
poverty_df['matches'] = poverty_df.apply(
    lambda x: any(county in x['Name'].lower()
                  for county in counties_for_state[x['State Postal Code']]),
    axis=1,
)
poverty_df[poverty_df['matches']]

2,State Postal Code,State FIPS Code,District ID,Name,Estimated Total Population,Estimated Population 5-17,Estimated number of relevant children 5 to 17 years old in poverty who are related to the householder,matches
12,AL,01,00240,Autauga County School District,55601,9799,1891,True
13,AL,01,00270,Baldwin County School District,218022,35155,4534,True
14,AL,01,00300,Barbour County School District,12978,1671,639,True
16,AL,01,00360,Bibb County School District,22400,3302,840,True
18,AL,01,00420,Blount County School District,51201,8919,1357,True
...,...,...,...,...,...,...,...,...
13204,WY,56,04260,Uinta County School District 6,3122,730,33,True
13205,WY,56,06240,Washakie County School District 1,7208,1297,183,True
13206,WY,56,05820,Washakie County School District 2,677,90,8,True
13207,WY,56,04830,Weston County School District 1,5497,834,135,True
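
The substring test above flags a district whenever a county name appears anywhere in the district name, so a short name such as `lee` would also match inside a longer word. One possible refinement, sketched here under that assumption, matches whole words only. `matches_county`, `toy_poverty`, and `toy_counties` are hypothetical names, and the district names are invented for illustration.

```python
import re
import pandas as pd


def matches_county(name, counties):
    """Return True if any county occurs in `name` as a whole word or phrase."""
    name = str(name).lower()
    return any(re.search(rf"\b{re.escape(c)}\b", name) for c in counties)


# Toy rows; the second name shows a case that a plain substring test would flag.
toy_poverty = pd.DataFrame({
    "State Postal Code": ["AL", "AL", "AK"],
    "Name": [
        "Lee County School District",
        "Leeds City School District",
        "Anchorage School District",
    ],
})
toy_counties = {"AL": ["lee"]}

# dict.get with a default avoids a KeyError for states without a county list.
toy_poverty["matches"] = toy_poverty.apply(
    lambda row: matches_county(
        row["Name"], toy_counties.get(row["State Postal Code"], [])
    ),
    axis=1,
)
print(toy_poverty)
```

Using `dict.get` with an empty default also removes the need to add empty lists by hand for states that are missing from the vaccination data.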


-------------

In [60]:
file = "Educational Attainment Percent high school graduate or higher by State.csv"
highschoolgrad_df = pd.read_csv(file)

highschoolgrad_df.head()

Unnamed: 0,Educational Attainment: Percent high school graduate or higher by State,Unnamed: 1,Unnamed: 2
0,State,Education,Margin Of Error
1,Alabama,86.2%,+/- 0.2%
2,Alaska,92.8%,+/- 0.2%
3,Arizona,87.1%,+/- 0.1%
4,Arkansas,86.6%,+/- 0.2%
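
In the output above, the real column names (`State`, `Education`, `Margin Of Error`) land in row 0 and the percentages are stored as text. A minimal sketch of one way to tidy this, assuming the file's first line is a title and the second line is the header. The in-memory CSV below reproduces that assumed layout using two rows from the output, and `edu_df` is a hypothetical name.

```python
import io
import pandas as pd

# In-memory stand-in for the CSV file, following the layout assumed above.
raw = io.StringIO(
    "Educational Attainment: Percent high school graduate or higher by State,,\n"
    "State,Education,Margin Of Error\n"
    "Alabama,86.2%,+/- 0.2%\n"
    "Alaska,92.8%,+/- 0.2%\n"
)

# Skip the title line so the second line is used as the header.
edu_df = pd.read_csv(raw, skiprows=1)

# Strip the percent sign and the "+/- " prefix, then convert to floats.
edu_df["Education"] = edu_df["Education"].str.rstrip("%").astype(float)
edu_df["Margin Of Error"] = (
    edu_df["Margin Of Error"]
    .str.replace("+/- ", "", regex=False)
    .str.rstrip("%")
    .astype(float)
)
print(edu_df.dtypes)
```

With numeric columns, the education rate can be joined to the other tables by state and used directly in a model.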
