   # Vaccination rate in schools - where can we improve?
   
   <img src=vaccine.jpg width="900">
   
   **Credit:**  [healthline](https://www.healthline.com/health-news/vaccinations-before-new-school-year) 

In [1]:
# Load relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
import warnings

warnings.filterwarnings("ignore")  # Suppress all warnings

# Introduction

**Business Context:** In the first half of 2019, the U.S. had its worst year for measles in 25 years, with most cases occurring in confined pockets of unvaccinated people. To get a better sense of immunization rates at the local level, The Wall Street Journal compiled kindergarten vaccination rates for individual schools across the country by reaching out to state health departments.

Some states didn’t provide school-level data. A few states track their immunization rates by sampling schools in coordination with the Centers for Disease Control and Prevention and don’t collect data for all schools. Others didn’t publish school-level data due to small class sizes, and a handful of states only collect vaccination exemption forms for the state or county. Most schools have relatively high MMR vaccination rates of 90% or above. But many schools have rates in the 70% to 80% range, and some small, private schools have rates hovering around 50%.

**Methodology:** Addresses for about 16,000 schools in eight states were provided with the vaccination data by the state health departments. The remaining schools were checked against the National Center for Education Statistics’s school directory or that state’s school directory using school name, city, county and/or school district. If there was no unique match, the school’s location was determined with the Google Maps API using school name, state and county, city and/or school district. Duplicate locations were analyzed individually. Schools with only state, school name and no additional identifying information are not displayed on the map.


**Analytics Context:** Reporters at the Wall Street Journal collected data on school-specific vaccination rates. In total, the WSJ’s dataset covers more than 46,000 schools, of which 42,000 have at least one vaccination rate available. Most states provided data for the 2018–19 school year. 

To enrich the analysis, additional datasets were added: poverty by district, median household income per district, political party preference per district, and level of education per state (the percentage of high school graduates or higher).

**Questions:** 
1. Which states have the highest and lowest vaccination rates?
2. Does socioeconomic status play any role in vaccination rate?
3. Does education level have any relationship to vaccination rate in schools? 
4. What are the variables that seem to impact vaccination rate the most?
5. Does political party help to predict vaccination rate?
6. Which variables are most correlated with vaccination rate?

**Goal**: Create a model to predict vaccination compliance at schools in the United States.


_Sources: Wall Street Journal, State education and health departments; National Center for Education Statistics; Google Maps (geocoding); Centers for Disease Control and Prevention (state-level rates, measles cases)_


# Data Wrangling

Data wrangling is the process of transforming and mapping data from a "raw" form into another format to make it more appropriate and valuable for downstream purposes such as analytics.

## Extracting and cleaning relevant data

Let's start looking at the datasets!

**1. Vaccination Dataset:**

Contains school vaccination rates aggregated per county. The columns are: state, year, county, enrollment per county, MMR and overall vaccination rates, and three columns with the exemption rate for each exemption reason (xmed, xper, xrel).

In [2]:
# vaccination dataset containing vaccination rate for schools by county/district
vaccine_df = pd.read_csv('state-overviews.csv', index_col=0)
vaccine_df = vaccine_df.sort_values(by=['state','county/district'], ascending=True)
vaccine_df.head()


index,state,year,county/district,enroll,mmr,overall,xmed,xper,xrel
1,Alabama,2017-18,Autauga,1817,64.17,96.39,0.04,,0.57
2,Alabama,2017-18,Baldwin,5479,70.89,96.53,0.09,,1.15
3,Alabama,2017-18,Barbour,733,72.17,88.27,0.05,,0.13
4,Alabama,2017-18,Bibb,538,66.54,94.54,0.0,,0.54
5,Alabama,2017-18,Blount,1450,70.69,97.3,0.0,,0.46


In [3]:
# updating state long name to abbreviation version
# creating dictionary mapping full name to abbreviation
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}

vaccine_df['state'] = vaccine_df['state'].map(us_state_abbrev) # updating long state name to abbreviation
vaccine_df.rename(columns={'county/district': 'district name'}, inplace=True) # renaming column
vaccine_df

index,state,year,district name,enroll,mmr,overall,xmed,xper,xrel
1,AL,2017-18,Autauga,1817,64.17,96.39,0.04,,0.57
2,AL,2017-18,Baldwin,5479,70.89,96.53,0.09,,1.15
3,AL,2017-18,Barbour,733,72.17,88.27,0.05,,0.13
4,AL,2017-18,Bibb,538,66.54,94.54,0,,0.54
5,AL,2017-18,Blount,1450,70.69,97.3,0,,0.46
...,...,...,...,...,...,...,...,...,...
19,WY,2018-19,Sweetwater,616,80.00,75,,,
20,WY,2018-19,Teton,218,72.00,65,,,
21,WY,2018-19,Uinta,348,72.00,64,,,
22,WY,2018-19,Washakie,117,94.00,89,,,
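
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a state name spelled differently from the dictionary would silently lose its abbreviation. A minimal sketch of a check that could run before overwriting the column, using a toy frame (`toy_df`, `toy_abbrev`, and the misspelled entry are made up for illustration):

```python
import pandas as pd

# Toy stand-in for vaccine_df; "Washington DC" is a deliberately unmapped spelling
toy_df = pd.DataFrame({"state": ["Alabama", "Wyoming", "Washington DC"]})
toy_abbrev = {"Alabama": "AL", "Wyoming": "WY", "District of Columbia": "DC"}

mapped = toy_df["state"].map(toy_abbrev)

# Names that failed to map come back as NaN; list them before replacing the column
unmapped_names = sorted(toy_df.loc[mapped.isna(), "state"].unique())
print(unmapped_names)
```

An empty list would mean every state name was found in the dictionary.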


In [32]:
# Build a dictionary mapping each state to the counties present in the vaccination dataset;
# it is used later to match county names when merging the other datasets
counties_for_state = {}

for state in vaccine_df['state'].unique():
    counties_for_state[state] = vaccine_df.loc[vaccine_df['state'] == state, 'district name'].tolist()

# Add an empty list for states missing from the vaccination dataset so that
# later lookups by state code (e.g. from the poverty data) do not raise a KeyError
for state in ['AK', 'AR', 'DE', 'DC', 'GA', 'HI', 'ID', 'IL', 'MO', 'MS', 'NH', 'PR', 'VA', 'WV']:
    counties_for_state[state] = []

#counties_for_state

------------------

**2. Poverty Dataset:**

Contains the estimated number of relevant children 5 to 17 years old living in poverty who are related to the householder. The data is reported per school district and is matched to counties by name below.

In [5]:
# read Excel file
poverty_df = pd.read_excel('poverty_rate_district18.xls', header=None)

poverty_df.drop(0, inplace=True) #dropping unnamed row 0
poverty_df.drop(1, inplace=True) # dropping row 1

new_header = poverty_df.iloc[0] #grab the first row for the header
poverty_df = poverty_df[1:] #take the data less the header row
poverty_df.columns = new_header #set the header row as the df header
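
The manual header promotion above (drop two rows, then use the next row as column names) can usually be expressed with the `header` argument of the reader. The sketch below illustrates the equivalence on an in-memory CSV with `pd.read_csv`; `pd.read_excel` accepts the same `header` argument, but the layout of the real `.xls` file is an assumption here, and the two preamble lines are invented.

```python
import io
import pandas as pd

# Two made-up preamble lines, then the real header row, then one data row
raw = (
    "Title line,,\n"
    "Notes line,,\n"
    "State Postal Code,Name,Estimated Total Population\n"
    "AL,Autauga County School District,55601\n"
)

# Manual approach, mirroring the cell above
manual = pd.read_csv(io.StringIO(raw), header=None)
manual = manual.drop([0, 1])
manual.columns = manual.iloc[0]
manual = manual[1:]

# Direct approach: tell the reader that row 2 (0-based) holds the column names
direct = pd.read_csv(io.StringIO(raw), header=2)

print(list(direct.columns))
print(direct.dtypes)
```

One practical difference: with `header=2` the numeric columns are parsed as numbers, whereas the manual approach leaves every value as a string because the header text shares a column with the data.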


In [6]:
# Add a column with the short county name so the poverty dataset can be merged with the
# vaccination dataset
def matches_normalized_name(state_postal_code, name):
    """Return the first county of the given state whose name is a substring of `name`, else None."""
    # Note: plain substring matching, so a short county name can also match inside a longer one
    for county in counties_for_state[state_postal_code]:
        if county.lower() in name.lower():
            return county
    return None

poverty_df['district name'] = poverty_df.apply(lambda x: matches_normalized_name(x['State Postal Code'], x['Name']), axis=1)

# filtering only counties that are present in both datasets
poverty_df_clean = poverty_df[poverty_df['district name'].notna()]
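
Because `matches_normalized_name` tests for a plain substring, a short county name can match inside a longer, unrelated name. The following is a sketch of a stricter variant that only accepts whole-word matches; `matches_whole_word` and the toy county list are hypothetical and not part of the original notebook, and names ending in punctuation would need extra handling.

```python
import re

def matches_whole_word(candidates, name):
    """Return the first candidate that occurs in `name` as a whole word or phrase, else None."""
    for county in candidates:
        # \b anchors the match at word boundaries; re.escape protects special characters
        pattern = r"\b" + re.escape(county) + r"\b"
        if re.search(pattern, name, flags=re.IGNORECASE):
            return county
    return None

toy_counties = ["Lee", "Greenlee"]  # illustrative list, not taken from the data
print(matches_whole_word(toy_counties, "Greenlee County School District"))
print(matches_whole_word(toy_counties, "Lee County School District"))
```

With substring matching the first call would return "Lee", since "lee" occurs inside "greenlee"; the word-boundary version skips it and returns "Greenlee".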


In [7]:
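# merging vaccination dataset and poverty dataset
# the join key is 'district name' only, so a county name shared by several states is matched
# across all of them; this version therefore contains duplicated vaccination rows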
merge_df = pd.merge(vaccine_df, poverty_df_clean, on='district name', how='inner')
merge_df.head()


Unnamed: 0,state,year,district name,enroll,mmr,overall,xmed,xper,xrel,State Postal Code,State FIPS Code,District ID,Name,Estimated Total Population,Estimated Population 5-17,Estimated number of relevant children 5 to 17 years old in poverty who are related to the householder
0,AL,2017-18,Autauga,1817,64.17,96.39,0.04,,0.57,AL,1,240,Autauga County School District,55601,9799,1891
1,AL,2017-18,Baldwin,5479,70.89,96.53,0.09,,1.15,AL,1,270,Baldwin County School District,218022,35155,4534
2,AL,2017-18,Barbour,733,72.17,88.27,0.05,,0.13,AL,1,300,Barbour County School District,12978,1671,639
3,AL,2017-18,Bibb,538,66.54,94.54,0.0,,0.54,AL,1,360,Bibb County School District,22400,3302,840
4,AL,2017-18,Blount,1450,70.69,97.3,0.0,,0.46,AL,1,420,Blount County School District,51201,8919,1357
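
The merge above joins on `district name` alone, so a county name that exists in more than one state can be paired with poverty rows from each of those states. This may explain why Blount shows a population of 51201 here but 144308 after the grouping step. A minimal sketch on toy data (all numbers are made up) compares the name-only join with a join on state plus county name:

```python
import pandas as pd

toy_vaccine = pd.DataFrame({
    "state": ["AL", "TN"],
    "district name": ["Blount", "Blount"],
    "mmr": [70.0, 95.0],  # made-up rates
})
toy_poverty = pd.DataFrame({
    "State Postal Code": ["AL", "TN"],
    "district name": ["Blount", "Blount"],
    "Estimated Total Population": [100, 200],  # made-up counts
})

# Name-only join: every "Blount" row pairs with every other "Blount" row
name_only = pd.merge(toy_vaccine, toy_poverty, on="district name", how="inner")

# State-aware join: rows pair up only when the state code also agrees
state_aware = pd.merge(
    toy_vaccine, toy_poverty,
    left_on=["state", "district name"],
    right_on=["State Postal Code", "district name"],
    how="inner",
)
print(len(name_only), len(state_aware))
```

Whether the state-aware join is appropriate for the real data depends on the state codes being consistent across both files, which is assumed here.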


In [8]:
# keep the columns of interest and collapse duplicates per county:
# population columns are summed across the school districts matched to each county,
# all other columns keep the first row's value
merge_df_grouped = merge_df.groupby(['state', 'district name'], as_index=False).agg({
    'District ID': lambda x: x.iloc[0], # Only keep the first row's value
    'enroll': lambda x: x.iloc[0],
    'mmr': lambda x: x.iloc[0],
    'overall': lambda x: x.iloc[0],
    'xmed': lambda x: x.iloc[0],
    'xper': lambda x: x.iloc[0],
    'xrel': lambda x: x.iloc[0],
    'Estimated Total Population': 'sum',
    'Estimated Population 5-17': 'sum',
    'Estimated number of relevant children 5 to 17 years old in poverty who are related to the householder': 'sum'
})
merge_df_grouped

Unnamed: 0,state,district name,District ID,enroll,mmr,overall,xmed,xper,xrel,Estimated Total Population,Estimated Population 5-17,Estimated number of relevant children 5 to 17 years old in poverty who are related to the householder
0,AL,Autauga,00240,1817,64.17,96.39,0.04,,0.57,55601,9799,1891
1,AL,Baldwin,00270,5479,70.89,96.53,0.09,,1.15,218022,35155,4534
2,AL,Barbour,00300,733,72.17,88.27,0.05,,0.13,12978,1671,639
3,AL,Bibb,00360,538,66.54,94.54,0,,0.54,22400,3302,840
4,AL,Blount,00420,1450,70.69,97.3,0,,0.46,144308,22543,3182
...,...,...,...,...,...,...,...,...,...,...,...,...
2120,WY,Sweetwater,05302,616,80.00,75,,,,42947,8329,697
2121,WY,Teton,05830,218,72.00,65,,,,23081,3107,180
2122,WY,Uinta,02760,348,72.00,64,,,,20299,4358,454
2123,WY,Washakie,06240,117,94.00,89,,,,7885,1387,191


-------------

**3. High School Graduation Rate Dataset:** 

Contains, per state, the percentage of high school graduates or higher. The state-level value is applied to every county in that state.

In [9]:
# dataset containing the percentage of high school graduates or higher per state
# read csv file
file = "Educational Attainment Percent high school graduate or higher by State.csv"
highschoolgrad_df = pd.read_csv(file)

new_header1 = highschoolgrad_df.iloc[0] #grab the first row for the header
highschoolgrad_df = highschoolgrad_df[1:] #take the data less the header row
highschoolgrad_df.columns = new_header1 #set the header row as the df header
highschoolgrad_df = highschoolgrad_df.drop(highschoolgrad_df.index[-2:]) # drop last 2 rows

highschoolgrad_df['state'] = highschoolgrad_df['State'].map(us_state_abbrev) # updating long state name to abbreviation form
highschoolgrad_df.rename(columns={'Education': '% HS graduate or higher'}, inplace=True) # renaming column
del highschoolgrad_df['Margin Of Error'] # deleting column margin of error

In [10]:
# merge to main dataset
df = pd.merge(merge_df_grouped, highschoolgrad_df, on='state', how='left')
del df['State'] # deleting column State
df

Unnamed: 0,state,district name,District ID,enroll,mmr,overall,xmed,xper,xrel,Estimated Total Population,Estimated Population 5-17,Estimated number of relevant children 5 to 17 years old in poverty who are related to the householder,% HS graduate or higher
0,AL,Autauga,00240,1817,64.17,96.39,0.04,,0.57,55601,9799,1891,86.2%
1,AL,Baldwin,00270,5479,70.89,96.53,0.09,,1.15,218022,35155,4534,86.2%
2,AL,Barbour,00300,733,72.17,88.27,0.05,,0.13,12978,1671,639,86.2%
3,AL,Bibb,00360,538,66.54,94.54,0,,0.54,22400,3302,840,86.2%
4,AL,Blount,00420,1450,70.69,97.3,0,,0.46,144308,22543,3182,86.2%
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2120,WY,Sweetwater,05302,616,80.00,75,,,,42947,8329,697,93.2%
2121,WY,Teton,05830,218,72.00,65,,,,23081,3107,180,93.2%
2122,WY,Uinta,02760,348,72.00,64,,,,20299,4358,454,93.2%
2123,WY,Washakie,06240,117,94.00,89,,,,7885,1387,191,93.2%
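
The `% HS graduate or higher` column is stored as text such as `86.2%`, so it cannot be used directly in correlations or a regression. A small sketch of converting it to a numeric column follows; the column name `hs_grad_pct` is a hypothetical choice and the toy frame only reuses two values shown above.

```python
import pandas as pd

toy = pd.DataFrame({"% HS graduate or higher": ["86.2%", "93.2%", None]})

# Strip the trailing percent sign, then convert; unparseable or missing entries become NaN
toy["hs_grad_pct"] = pd.to_numeric(
    toy["% HS graduate or higher"].str.rstrip("%"), errors="coerce"
)
print(toy["hs_grad_pct"])
```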


----------

**4. Coordinates dataset:**

Contains latitude and longitude for U.S. cities, together with the county each city belongs to.


In [11]:
# read csv file
coord_party_df = pd.read_csv('coordinates_uscities.csv')

# keep: state_id, 'county_name', lat, lng, population
coord_party_df = coord_party_df[['state_id', 'county_name', 'lat', 'lng', 'population']].sort_values(['state_id', 'county_name'])

# checking what the coordinates dataset looks like
#coord_party_df.head()



In [22]:
coord_party_df['district name'] = coord_party_df.apply(lambda x: matches_normalized_name(x['state_id'], x['county_name']), axis=1)

# filtering only counties that are present in both datasets
coord_party_df_clean = coord_party_df[coord_party_df['district name'].notna()]

coord_party_df_clean

Unnamed: 0,state_id,county_name,lat,lng,population,district name
1360,AL,Autauga,32.4605,-86.4588,35957,Autauga
6938,AL,Autauga,32.5797,-86.4529,4574,Autauga
12090,AL,Autauga,32.6793,-86.4607,1573,Autauga
15500,AL,Autauga,32.4320,-86.6579,871,Autauga
25151,AL,Autauga,32.6610,-86.7087,142,Autauga
...,...,...,...,...,...,...
22657,WY,Washakie,44.0349,-107.4483,250,Washakie
8622,WY,Weston,43.8510,-104.2123,3152,Weston
14329,WY,Weston,44.1025,-104.6374,1056,Weston
26134,WY,Weston,43.9848,-104.4269,104,Weston


In [21]:
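# merging the coordinates dataset with the main dataset (df)
# no `on` is given, so pandas joins on the only shared column, 'district name'
# this version contains one row per city, so vaccination rows are duplicated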
merging_df = pd.merge(coord_party_df_clean, df, how='inner')

#dropping columns
merging_df = merging_df.drop(columns=['state_id', 'county_name'])
#merging_df

In [14]:
# filtering columns of interest
# filtering for duplicates keeping the first row of each county
merged_df = merging_df.groupby(['state', 'district name'], as_index=False).agg({
    'District ID': lambda x: x.iloc[0], # Only keep the first row's value
    'enroll': lambda x: x.iloc[0],
    'mmr': lambda x: x.iloc[0],
    'overall': lambda x: x.iloc[0],
    'xmed': lambda x: x.iloc[0],
    'xper': lambda x: x.iloc[0],
    'xrel': lambda x: x.iloc[0],
    'lat' : lambda x: x.iloc[0],
    'lng' : lambda x: x.iloc[0],
    'population': lambda x: x.iloc[0],
    'Estimated Total Population': lambda x: x.iloc[0],
    'Estimated Population 5-17': lambda x: x.iloc[0],
    'Estimated number of relevant children 5 to 17 years old in poverty who are related to the householder': lambda x: x.iloc[0]
})
merged_df

Unnamed: 0,state,district name,District ID,enroll,mmr,overall,xmed,xper,xrel,lat,lng,population,Estimated Total Population,Estimated Population 5-17,Estimated number of relevant children 5 to 17 years old in poverty who are related to the householder
0,AL,Autauga,00240,1817,64.17,96.39,0.04,,0.57,32.4605,-86.4588,35957,55601,9799,1891
1,AL,Baldwin,00270,5479,70.89,96.53,0.09,,1.15,30.6286,-87.8866,71484,218022,35155,4534
2,AL,Barbour,00300,733,72.17,88.27,0.05,,0.13,31.9102,-85.1505,8484,12978,1671,639
3,AL,Bibb,00360,538,66.54,94.54,0,,0.54,32.9421,-87.1753,5302,22400,3302,840
4,AL,Blount,00420,1450,70.69,97.3,0,,0.46,33.9392,-86.4929,5382,144308,22543,3182
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1725,WY,Sweetwater,05302,616,80.00,75,,,,41.5951,-109.2238,25913,42947,8329,697
1726,WY,Teton,05830,218,72.00,65,,,,43.4721,-110.7745,12576,23081,3107,180
1727,WY,Uinta,02760,348,72.00,64,,,,41.2602,-110.9646,11319,20299,4358,454
1728,WY,Washakie,06240,117,94.00,89,,,,44.0026,-107.9543,5003,7885,1387,191
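
The aggregation above keeps the first row of each county for `lat`, `lng`, and `population`, so the chosen city depends on row order. If the intent is to represent each county by its most populous city, the choice can be made explicit. The sketch below assumes that intent and uses three Autauga rows from the output above in shuffled order:

```python
import pandas as pd

toy_coords = pd.DataFrame({
    "state": ["AL", "AL", "AL"],
    "district name": ["Autauga", "Autauga", "Autauga"],
    "lat": [32.5797, 32.4605, 32.6793],
    "lng": [-86.4529, -86.4588, -86.4607],
    "population": [4574, 35957, 1573],
})

# Index label of the most populous city within each (state, county) group
largest_idx = toy_coords.groupby(["state", "district name"])["population"].idxmax()

# One row per county, independent of the original row order
largest_city = toy_coords.loc[largest_idx].reset_index(drop=True)
print(largest_city)
```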


The coordinates dataset above has not yet been merged into the complete dataframe.

-----------------

**5. Political Party Dataset:**

Contains county-level results for the 2016 presidential election. The data will be narrowed down to Republican and Democratic results per county.

In [16]:
# read csv file
politicalparty_df = pd.read_csv('2016_US_County_Level_Presidential_Results.csv')
politicalparty_df = politicalparty_df.sort_values(['state_abbr', 'county_name']) # sort by values

# keep: state abbreviation, county name, vote counts and vote shares for both parties
politicalparty_df = politicalparty_df[['state_abbr', 'county_name', 'votes_dem', 'votes_gop', 'total_votes', 'per_dem', 'per_gop']]

# checking what the dataset looks like
#politicalparty_df


In [17]:
# applying function to add only counties that match the counties present in the main dataset
politicalparty_df['district name'] = politicalparty_df.apply(lambda x: matches_normalized_name(x['state_abbr'], x['county_name']), axis=1)

# filtering only counties that are present in both datasets
politicalparty_df_clean = politicalparty_df[politicalparty_df['district name'].notna()]

# dropping column
politicalparty_df_clean = politicalparty_df_clean.drop(columns=['county_name'])

# renaming column
politicalparty_df_clean.rename(columns={'state_abbr': 'state'}, inplace=True)

#politicalparty_df_clean

In [18]:
# merging political party dataset and df dataset without coordinates to verify number of rows
# this version contains duplicates for vaccination
main_df = pd.merge(politicalparty_df_clean, df, how='right')
main_df

Unnamed: 0,state,votes_dem,votes_gop,total_votes,per_dem,per_gop,district name,District ID,enroll,mmr,overall,xmed,xper,xrel,Estimated Total Population,Estimated Population 5-17,Estimated number of relevant children 5 to 17 years old in poverty who are related to the householder,% HS graduate or higher
0,AL,5908.0,18110.0,24661.0,0.239569,0.734358,Autauga,00240,1817,64.17,96.39,0.04,,0.57,55601,9799,1891,86.2%
1,AL,18409.0,72780.0,94090.0,0.195653,0.773515,Baldwin,00270,5479,70.89,96.53,0.09,,1.15,218022,35155,4534,86.2%
2,AL,4848.0,5431.0,10390.0,0.466603,0.522714,Barbour,00300,733,72.17,88.27,0.05,,0.13,12978,1671,639,86.2%
3,AL,1874.0,6733.0,8748.0,0.214220,0.769662,Bibb,00360,538,66.54,94.54,0,,0.54,22400,3302,840,86.2%
4,AL,2150.0,22808.0,25384.0,0.084699,0.898519,Blount,00420,1450,70.69,97.3,0,,0.46,144308,22543,3182,86.2%
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2138,WY,3233.0,12153.0,16661.0,0.194046,0.729428,Sweetwater,05302,616,80.00,75,,,,42947,8329,697,93.2%
2139,WY,7313.0,3920.0,12176.0,0.600608,0.321945,Teton,05830,218,72.00,65,,,,23081,3107,180,93.2%
2140,WY,1202.0,6154.0,8053.0,0.149261,0.764187,Uinta,02760,348,72.00,64,,,,20299,4358,454,93.2%
2141,WY,532.0,2911.0,3715.0,0.143203,0.783580,Washakie,06240,117,94.00,89,,,,7885,1387,191,93.2%


-----------

**6. Median Household Income Dataset:**

Contains the median household income per county. It also has 2015 county population.

In [27]:
# opening dataset containing median household income
income = "2015 Median Income by County.csv"
income_df = pd.read_csv(income)

# filter columns of interest
# keep county, population, median household income and state code
income_df = income_df[['State Code', 'County', 'Population', 'Median household income']]

# create a column called district name
income_df['district name'] = income_df.apply(lambda x: matches_normalized_name(x['State Code'], x['County']), axis=1)

# filtering only counties that are present in both datasets
income_df_clean = income_df[income_df['district name'].notna()]

# dropping column
income_df_clean = income_df_clean.drop(columns=['County'])

# renaming column
income_df_clean.rename(columns={'State Code': 'state'}, inplace=True)


-----------

**Final Merge:**

The complete dataframe contains all columns of interest and is ready for initial EDA.

In [30]:
# merge on state code and county name
# no `on` is given, so pandas joins on all shared column names (here 'state' and 'district name')
final_df = pd.merge(income_df_clean, main_df, how='left')

# Final merge organizing columns
complete_df = final_df[['District ID', 'state', 'district name', 'Population', 'Estimated Total Population', 'enroll',
                    'mmr', 'overall', 'xmed', 'xrel', 'xper', 'Median household income',
                    'Estimated Population 5-17', 'Estimated number of relevant children 5 to 17 years old in poverty who are related to the householder',
                    '% HS graduate or higher', 'total_votes', 'votes_dem', 'votes_gop', 'per_dem', 'per_gop']]
complete_df.head(25)

Unnamed: 0,District ID,state,district name,Population,Estimated Total Population,enroll,mmr,overall,xmed,xrel,xper,Median household income,Estimated Population 5-17,Estimated number of relevant children 5 to 17 years old in poverty who are related to the householder,% HS graduate or higher,total_votes,votes_dem,votes_gop,per_dem,per_gop
0,240,AL,Autauga,55221,55601.0,1817,64.17,96.39,0.04,0.57,,51281.0,9799.0,1891.0,86.2%,24661.0,5908.0,18110.0,0.239569,0.734358
1,270,AL,Baldwin,195121,218022.0,5479,70.89,96.53,0.09,1.15,,50254.0,35155.0,4534.0,86.2%,94090.0,18409.0,72780.0,0.195653,0.773515
2,300,AL,Barbour,26932,12978.0,733,72.17,88.27,0.05,0.13,,32964.0,1671.0,639.0,86.2%,10390.0,4848.0,5431.0,0.466603,0.522714
3,360,AL,Bibb,22604,22400.0,538,66.54,94.54,0.0,0.54,,38678.0,3302.0,840.0,86.2%,8748.0,1874.0,6733.0,0.21422,0.769662
4,420,AL,Blount,57710,144308.0,1450,70.69,97.3,0.0,0.46,,45813.0,22543.0,3182.0,86.2%,25384.0,2150.0,22808.0,0.084699,0.898519
5,480,AL,Bullock,10678,10138.0,278,72.66,,,,,31938.0,1537.0,1005.0,86.2%,4701.0,3530.0,1139.0,0.750904,0.242289
6,510,AL,Butler,20354,111193.0,592,73.99,91.0,0.18,0.0,,32229.0,16415.0,2846.0,86.2%,8685.0,3716.0,4891.0,0.427864,0.563155
7,540,AL,Calhoun,116648,111753.0,3319,73.82,76.58,0.03,0.43,,41703.0,18386.0,3889.0,86.2%,47376.0,13197.0,32803.0,0.278559,0.692397
8,600,AL,Chambers,34079,34838.0,803,65.01,92.94,0.0,0.13,,34177.0,5590.0,1193.0,86.2%,13778.0,5763.0,7803.0,0.418276,0.566338
9,630,AL,Cherokee,26008,126545.0,563,74.78,97.67,0.0,0.34,,36296.0,19904.0,4300.0,86.2%,10503.0,1524.0,8809.0,0.145101,0.838713


In [31]:
# checking what the dataset looks like
complete_df.tail(25)

Unnamed: 0,District ID,state,district name,Population,Estimated Total Population,enroll,mmr,overall,xmed,xrel,xper,Median household income,Estimated Population 5-17,Estimated number of relevant children 5 to 17 years old in poverty who are related to the householder,% HS graduate or higher,total_votes,votes_dem,votes_gop,per_dem,per_gop
2163,,WA,WHITMAN,46737,,,,,,,,36631.0,,,,,,,,
2164,5370.0,WA,YAKIMA,247408,96138.0,3918.0,95.53,92.93,0.46,0.05,0.64,44749.0,19111.0,4579.0,91.3%,72306.0,28484.0,39593.0,0.393937,0.547576
2165,730.0,WY,Albany,37565,38601.0,325.0,72.0,69.0,,,,42834.0,4444.0,504.0,93.2%,16420.0,6888.0,7601.0,0.419488,0.462911
2166,1420.0,WY,Big Horn,11895,12006.0,173.0,83.0,76.0,,,,51679.0,2276.0,346.0,93.2%,5079.0,594.0,4067.0,0.116952,0.800748
2167,420.0,WY,Campbell,48013,93093.0,715.0,76.0,69.0,,,,80060.0,16766.0,2614.0,93.2%,17935.0,1324.0,15778.0,0.073822,0.879732
2168,4980.0,WY,Carbon,15739,46231.0,182.0,73.0,67.0,,,,56825.0,8253.0,1366.0,93.2%,6200.0,1279.0,4409.0,0.20629,0.711129
2169,2140.0,WY,Converse,14101,13640.0,202.0,93.0,87.0,,,,62307.0,2481.0,253.0,93.2%,6552.0,668.0,5520.0,0.101954,0.842491
2170,3720.0,WY,Crook,7229,31389.0,99.0,82.0,77.0,,,,60445.0,4687.0,729.0,93.2%,3771.0,271.0,3347.0,0.071864,0.887563
2171,3960.0,WY,Fremont,40755,64351.0,588.0,84.0,78.0,,,,52773.0,10559.0,1880.0,93.2%,16543.0,4200.0,11167.0,0.253884,0.675029
2172,2990.0,WY,Goshen,13544,13291.0,121.0,78.0,73.0,,,,42689.0,1898.0,292.0,93.2%,5708.0,924.0,4418.0,0.161878,0.774001
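
When `pd.merge` is called without `on`, it joins on every column name the two frames share. With `how='left'`, rows from the left frame that find no partner are kept with `NaN` in the right-hand columns, as in the WHITMAN row above. A minimal sketch that makes both points visible on toy frames (values reused from the output above; `indicator=True` adds a `_merge` column):

```python
import pandas as pd

toy_income = pd.DataFrame({
    "state": ["WA", "WA"],
    "district name": ["WHITMAN", "YAKIMA"],
    "Median household income": [36631.0, 44749.0],
})
toy_main = pd.DataFrame({
    "state": ["WA"],
    "district name": ["YAKIMA"],
    "mmr": [95.53],
})

# Columns present in both frames are the implicit join keys
shared_keys = [c for c in toy_income.columns if c in toy_main.columns]

# indicator=True labels each row as 'both', 'left_only', or 'right_only'
checked = pd.merge(toy_income, toy_main, how="left", indicator=True)
print(shared_keys)
print(checked[["district name", "mmr", "_merge"]])
```

Counting the `left_only` rows gives a quick estimate of how many counties have income data but no vaccination data.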


--------------

## Dataset Ready for EDA:

In [None]:
# check for missing data
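
One possible way to carry out the missing-data check is to compute the share of missing values per column. This is only a sketch on a toy frame; on the real data it would be applied to `complete_df`, and `toy_complete` with its values is illustrative.

```python
import numpy as np
import pandas as pd

toy_complete = pd.DataFrame({
    "mmr": [64.17, 72.66, np.nan],
    "overall": [96.39, np.nan, np.nan],
    "xper": [np.nan, np.nan, np.nan],
})

# isna() marks missing cells; the column mean of those booleans is the missing share
missing_share = toy_complete.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Columns with a very high missing share, as `xper` appears to have in the tables above, may need to be dropped or handled separately before modeling.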

In [None]:
# histograms of numerical value