# Assignment 1
## Medical Patient No Show

### Team members: Luay Dajani, Dana Geislinger, Chris Morgan, Caroll Rodriguez
##### Github - https://github.com/cdmorgan103/7331DataMiningNoShow

## Business Understanding
We utilized the Kaggle No Show appointment data that contains the show/no-show status for several clinics across Brazil that are a part of a central healthcare system. The data set has 110,527 records and 14 attributes. This data was collected in 2016 and shows the show/no-show behavior of the patients across the different medical facilities (that are identified as the Neighborhood location).  

No-shows can be a large cost for clinics and health systems. With healthcare institution margins shrinking and an effort to modernize the delivery of healthcare, reducing nonproductive time has become a more common focus. As an institution's capacity becomes more constrained and scheduling lead times grow into weeks, no-shows are often cited as an increasingly important issue, and many medical systems have gone to great lengths to reduce these costly occurrences.

This has led many institutions to take various actions such as appointment confirmation via SMS, consumer engagement via email/web, strict planned overbooking rates, and even predictive modeling (my healthcare employer has done this).

In the case of predictive modeling, we have to think not only about how much yield our model can reasonably contribute, but also about the impact of a predicted no-show. Are we going to overbook for patients predicted to be no-shows? If we predict no-shows too often and those patients do show up, there is the potential for serious ramifications in the form of clinical overload. Therefore, it is critical to consider how the model will be used and to be as conservative as possible when predicting a no-show.
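
As a rough illustration of what "conservative" could mean in practice, the sketch below compares two decision thresholds on made-up labels and made-up predicted probabilities. The helper `precision_recall_at` and all numbers are hypothetical and are not taken from this project; the point is only that raising the threshold flags fewer appointments as no-shows and can reduce the number of patients who are wrongly overbooked against.

```python
import numpy as np

def precision_recall_at(threshold, y_true, y_prob):
    """Precision and recall for the no-show class at one probability threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_prob) >= threshold
    true_pos = np.sum(y_pred & y_true)
    predicted_pos = np.sum(y_pred)
    actual_pos = np.sum(y_true)
    precision = true_pos / predicted_pos if predicted_pos else 1.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return float(precision), float(recall)

# Made-up labels (1 = no-show) and made-up predicted no-show probabilities
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_prob = [0.9, 0.6, 0.2, 0.7, 0.4, 0.3, 0.1, 0.55]

for threshold in (0.5, 0.65):
    precision, recall = precision_recall_at(threshold, y_true, y_prob)
    print("threshold=%.2f precision=%.2f recall=%.2f" % (threshold, precision, recall))
```

On this toy data the higher threshold flags only appointments that really were missed, at no loss of recall; on real data the trade-off would have to be measured on a held-out set.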

As data scientists, it will be critical for us to not only explain model performance, but also coach business leaders on how the model works and how it will be affected by changes. This means establishing an effective communication channel between business leaders and the Data Science team so that the impacts of any operational changes (e.g. changing scheduling practices, adding pre-appointment verification calls, etc.) can be communicated. Furthermore, this may mean helping to coach business leaders on how they can use the model and where to put resources to improve business performance (e.g. following up with high-probability no-show appointments before their appointment).




## Data Meaning Type

### Data Meaning
This dataset contains 110,527 appointment records for clinics located across the coastal city of Vitória in Espírito Santo, Brazil. The dataset includes 11 meaningful predictors relating to each appointment and to the patient that scheduled that appointment. Unique numeric identifiers are provided for each patient as well as for each appointment. The response variable of interest for this data set, *No-show*, is a boolean variable denoting whether or not a patient made it to their scheduled appointment.

Of the 11 predictor variables (the 2 unique identifiers are excluded), there are 2 timepoint, 1 integer, 1 categorical, and 7 boolean variables. Data are provided that generally describe a patient's health problems as well as their age, gender, and enrollment in a social welfare program. In addition to patient information, we are given data pertaining to the location and time of the appointment as well as whether or not the patient was notified about their appointment with an automated SMS reminder.

The following table describes all the variables provided in greater detail. Perhaps the most intriguing variable is *Scholarship*, which is related to the Bolsa Família social program instituted in Brazil. According to the dataset curator, "the best explanation about this variable that indicates if the person receives a scholarship or not" (source: https://www.kaggle.com/joniarroba/noshowappointments/discussion/45899). Based on this description, we assume that this variable defines whether or not a patient is currently receiving financial aid as part of this social program. Participants in Bolsa Família must have an income of less than $170/month, must attend regular medical checkups for all mothers and children in the household, and children must regularly attend school (source: https://www.wilsoncenter.org/article/programa-bolsa-familia).

| Variable Name  | Data Type | Variable Type         | Description                                                             |
| -------------- | --------- | --------------------- | ----------------------------------------------------------------------- |
| PatientId      | Interval  | Identifier            | Unique ID number for each patient.                                      |
| AppointmentID  | Interval  | Identifier            | Unique ID number for each appointment.                                  |
| Gender         | Nominal   | Binary Predictor      | Sex of the patient (Male/Female).                                       |
| ScheduledDay   | Interval  | Date/Time Predictor   | **Date** and **Time** when the patient called to schedule their appointment. Should always be before *AppointmentDay*.                                                                         |
| AppointmentDay | Interval  | Date Predictor        | Scheduled appointment **Date**. Appointment **Times** are not provided. |
| Age            | Ratio     | Integer Predictor     | Age of the patient in years.                                            |
| Neighbourhood  | Nominal   | Categorical Predictor | The neighborhood in which the appointment facility is located.          |
| Scholarship    | Ordinal   | Boolean Predictor     | Whether or not the patient receives Bolsa Família financial aid. To receive this benefit, a patient's income must be under the poverty threshold, all children in the household must be vaccinated and regularly attending school, and mothers and children must receive routine medical care.                                    |
| Hipertension   | Ordinal   | Boolean Predictor     | Whether or not a patient is classified as hypertensive (has high blood pressure).                                                                                                                     |
| Diabetes       | Ordinal   | Boolean Predictor     | Whether or not a patient is diagnosed as a diabetic.                    |
| Alcoholism     | Ordinal   | Boolean Predictor     | Whether or not a patient is classified as an alcoholic.                 |
| Handcap        | Ordinal   | Boolean Predictor     | Whether or not a patient is diagnosed as being handicapped.             |
| SMS_received   | Nominal   | Boolean Predictor     | Whether or not a patient received an SMS (text message) reminder for their appointment. |
| No-show        | Nominal   | Boolean Response      | Whether or not a patient showed up for their appointment. True means they **did not** show up, False means they **did** show up.                                                                         |

#### Created Variables
| Variable Name  | Data Type | Variable Type         | Description                                                             |
| -------------- | --------- | --------------------- | ----------------------------------------------------------------------- |
| DaysInAdvance  | Ratio     | Integer Predictor     | Value for how many days in advance the appointment was scheduled.       |
| ScheduledDOW   | Nominal   | Categorical Predictor | Day of the week for the day the patient scheduled the appointment.      |
| AppointmentDOW | Nominal   | Categorical Predictor | Day of the week for patient appointment.                                |
| ScheduledTime  | Interval  | Time Predictor        | **Time** of day when an appointment was scheduled.                      |

Dataset from: https://www.kaggle.com/joniarroba/noshowappointments

### Verify Data Quality

To verify the data quality, we will first import the raw data as a Pandas DataFrame object. Next, we will determine whether or not the data requires cleaning or modifications before deeper analysis. This process will include changing variable types and names to more practically useful formats. Finally, we will perform basic exploratory data analysis to elucidate any potential patterns or trends within the data.

In [None]:
# Import required modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
from pprint import pprint
from IPython.display import display

# Load the data into variable 'df'
df = pd.read_csv('data/KaggleV2-May-2016.csv')

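# Get an overview of the raw data, including non-null counts per column
#  (show_counts replaces the older null_counts argument, which newer pandas no longer accepts)
df.info(show_counts=True)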

With the raw data imported, we see that there are no null (missing) values for any of the 110,527 observations in the data set. However, most of the categorical or binary variables are incorrectly stored as generic numpy objects or 64-bit integers. Furthermore, the date-time columns are stored as generic numpy objects. These columns should be converted to the correct data types. Finally, *ScheduledDay* includes time of day, but *AppointmentDay* does not, so we will separate the scheduled time of day into a new variable; this way, it will be easy to compare the scheduled/actual **days**, while still retaining the time of day at which each appointment was scheduled.

In [None]:
# Convert categorical variables to the correct datatype
categ_features = ['Gender', 'Neighbourhood', 'Scholarship', 'Hipertension', 'Diabetes',
                  'Alcoholism', 'Handcap', 'SMS_received', 'No-show'
                  ]
df[categ_features] = df[categ_features].astype('category')

# Pull out the scheduled time of day as a new variable (ScheduledTime) and re-insert into df
#  (expand=True returns the date and time parts as two columns)
sched_parts = df.ScheduledDay.str.split('T', n=1, expand=True)
df.ScheduledDay = sched_parts[0]
df.insert(loc=4, column='ScheduledTime', value=sched_parts[1])

# Convert date-time variables to the correct type using the C-style fmt codes
df.ScheduledDay = pd.to_datetime(df.ScheduledDay, format="%Y-%m-%d")
df.ScheduledTime = pd.to_datetime(df.ScheduledTime, format="%H:%M:%SZ").dt.time
df.AppointmentDay = pd.to_datetime(df.AppointmentDay, format="%Y-%m-%dT%H:%M:%SZ")

# Preview the first rows to check the conversions
df.head()
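
As a side note, the same date/time separation can also be sketched by parsing the full timestamp once and deriving both parts from it. The two timestamps below are made up, but follow the same layout as the raw *ScheduledDay* column; this is an alternative illustration, not the approach used in the cell above.

```python
import pandas as pd

# Two made-up timestamps in the same ISO-8601 layout as the raw ScheduledDay column
raw = pd.Series(['2016-04-29T18:38:08Z', '2016-05-03T08:15:00Z'])

# Parse once, then derive the calendar day and the time of day from the result
parsed = pd.to_datetime(raw, format='%Y-%m-%dT%H:%M:%SZ')
scheduled_day = parsed.dt.normalize()   # midnight of the same calendar day
scheduled_time = parsed.dt.time         # datetime.time objects

print(scheduled_day.tolist())
print(scheduled_time.tolist())
```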

The *PatientId* column was cast as a floating point variable, because there are some rows that erroneously contain data after the decimal point. We will convert these values to integers and check whether the ID numbers are unique for each row. It should be noted that it would not be incorrect to have multiple identical patient IDs, as these would describe multiple appointments made for the same patient. However, the appointment ID should be unique for each observation. We will check these values before and after the conversion of *PatientId* to make sure that each patient is correctly identified after the conversion.

In [None]:
# Print the number of unique appointment IDs
print("Unique Appointment IDs: %d" %  len(df.AppointmentID.unique()))
print("Total Appointments:     %d\n" % len(df.AppointmentID))

# Print the number of unique patient IDs
print("Unique Patient IDs Before Conversion: %d" % len(df.PatientId.unique()))

# Cast PatientId as int
df.PatientId = df.PatientId.astype(np.int64)

# Double check the counts
print("Unique Patient IDs After Conversion:  %d" % len(df.PatientId.unique()))

This tells us that each appointment is indeed uniquely identified, but that the 110,527 appointments were made by only 62,299 distinct patients, so many patients appear more than once.
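
The reason for comparing the unique counts before and after the cast can be seen on a toy example. The IDs below are made up; the sketch shows how truncating a fractional ID could merge two patients that were distinct as floats, which is exactly what an unchanged count rules out.

```python
import numpy as np
import pandas as pd

# Made-up IDs: three rows carry a spurious fractional part
ids = pd.Series([29872499824296.0, 93779.52927, 93779.52927, 537615.28476, 537615.0])

has_fraction = ids != np.floor(ids)
as_int = ids.astype(np.int64)           # truncates toward zero

print("Rows with a fractional part:", int(has_fraction.sum()))
print("Unique before: %d, after: %d" % (ids.nunique(), as_int.nunique()))
```

Here the count drops after the cast because 537615.28476 and 537615.0 collapse to the same integer; if the counts match, as in the check above, no such collision occurred.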

#### Cleaning the data
This data set is relatively clean in its raw state, with no missing values in any columns. However, there are some inconsistencies and errors that must be corrected before further analysis can be performed:

Since this dataset originated in a non-English-speaking country, some column names are misspelled. There are also some columns that are named inconsistently with the rest of the data. These names will be changed before further analysis is performed.

In [None]:
# Rename incorrect column names.
df = df.rename(columns={'Hipertension': 'Hypertension', 'Handcap': 'Handicap', 'SMS_received': 'SMSReceived', 'No-show': 'NoShow'})
pprint(list(df.columns))

There is one sub-zero value for age (-1), and there are two unique patients with an age of 115. The negative value might be an unusual way of recording an appointment for the unborn child of a pregnant mother. Since we have no way of knowing why this single data point was recorded as -1, we simply impute it with the median age for the dataset. It is unlikely that the 2 patients are actually 115 years old, but reaching this age is feasible; most likely, their true ages are close to the documented value but slightly off due to poor birth documentation. We therefore leave the values of 115 and other suspiciously high ages unchanged at this point.

In [None]:
# Print observations where age is minimum or maximum
display(df.loc[(df.Age == -1) | (df.Age == 115)])

# Impute the values of sub zero age observations
df.Age=df.Age.replace(-1, int(df.Age.median()))
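
A slightly stricter variant of this imputation, shown here on made-up ages, computes the median from the plausible values only, so the invalid entry cannot influence its own replacement. With a single -1 among more than 110,000 rows the difference is negligible, so this is only a sketch of the idea.

```python
import pandas as pd

ages = pd.Series([-1, 0, 8, 37, 37, 62, 115])   # made-up ages, one invalid value

valid_median = int(ages[ages >= 0].median())    # median of the plausible ages only
cleaned = ages.where(ages >= 0, valid_median)   # keep valid ages, replace the rest

print(valid_median)
print(cleaned.tolist())
```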

#### Defining New Variables
Next, we determine how many days in advance each appointment was scheduled and store the result as a new feature. This lets us check that the values are reasonable and that there are no negative data points. Examination shows that some data points do need to be corrected, since scheduling an appointment for a previous date should not be possible.

In [None]:
# Create a column showing days in advance
df['DaysInAdvance']=(df['AppointmentDay']-df['ScheduledDay']).dt.days

# List appointments with negative days in advance (logically impossible)
df.loc[df.DaysInAdvance<0]

Since scheduling an appointment on a later day than the appointment is scheduled for is impossible, we will impute the *ScheduledDay* for all appointments with *DaysInAdvance* < 0 as the same day as the appointment. This seems to be a logical imputation since many of the existing appointments in the data set were scheduled the same day as the appointment, and appointments with scheduled days in the future are likely data entry errors.

In [None]:
# Make sure no appointment is scheduled after its own appointment day (which would be impossible).
# Where that happens, assume the appointment was scheduled on the appointment day, then recalculate the advance field
df['ScheduledDay'] = np.where(df['ScheduledDay']>df['AppointmentDay'], df['AppointmentDay'], df['ScheduledDay'])
df['DaysInAdvance']=(df['AppointmentDay']-df['ScheduledDay']).dt.days

# Examine again: the bad scheduled appointment data has been corrected (table is now empty)
df.loc[df.DaysInAdvance<0]
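
The same correction can be illustrated on a three-row toy frame (made-up dates) using `Series.where`, which keeps a value where the condition holds and substitutes the alternative elsewhere. This is a sketch of the logic, assuming both columns are already datetimes.

```python
import pandas as pd

appts = pd.DataFrame({
    'ScheduledDay':   pd.to_datetime(['2016-05-10', '2016-05-02', '2016-05-05']),
    'AppointmentDay': pd.to_datetime(['2016-05-09', '2016-05-20', '2016-05-05']),
})

# Keep ScheduledDay where it is on or before the appointment, else use the appointment day
appts['ScheduledDay'] = appts['ScheduledDay'].where(
    appts['ScheduledDay'] <= appts['AppointmentDay'], appts['AppointmentDay'])
appts['DaysInAdvance'] = (appts['AppointmentDay'] - appts['ScheduledDay']).dt.days

print(appts)
```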

Next, we will add variables to the data set that describe the day of the week of an appointment and the day of the week on which an appointment was scheduled. The day of the week could provide important insights into trends in the data that might go unnoticed as raw dates.

In [None]:
# Create a day of week variable for both the scheduled day and the appointment day, which allows us to examine
#  any potential trends related to the day of the week and appointment no-show
#  (dt.day_name() replaces the dt.weekday_name attribute removed from newer pandas)
df['ScheduledDOW'] = df['ScheduledDay'].dt.day_name()
df['AppointmentDOW'] = df['AppointmentDay'].dt.day_name()

#Check the variables
df.info()
df.describe()

With the data cleaned, we print out the unique values for each categorical variable and the ranges for continuous variables to identify patterns and to better understand our data. This also allows us to verify that the contents of the data are as we understand them.

In [None]:
# Print descriptive info for the unique values for each predictor
print('Gender:', list(df.Gender.unique()))
print('Scheduled Dates: %s to %s' % (min(df.ScheduledDay).date(), max(df.ScheduledDay).date()))
print('Appointment Dates: %s to %s' % (min(df.AppointmentDay).date(), max(df.AppointmentDay).date()))
print('Age Range: %d to %d Years Old' % (min(df.Age), max(df.Age)))
print('Number of Distinct Neighbourhoods:', len(df.Neighbourhood.unique()))
print('Scholarship:', list(df.Scholarship.unique()))
print('Hypertension:', list(df.Hypertension.unique()))
print('Diabetes:', list(df.Diabetes.unique()))
print('Alcoholism:', list(df.Alcoholism.unique()))
print('Handicap:', list(df.Handicap.unique()))
print('SMSReceived:', list(df.SMSReceived.unique()))
print('NoShow:', list(df.NoShow.unique()))
print('Range of Scheduled Days in Advance: %d to %d Days' % (min(df.DaysInAdvance), max(df.DaysInAdvance)))
print('ScheduledDOW:', sorted(df.ScheduledDOW.unique()))
print('AppointmentDOW:', sorted(df.AppointmentDOW.unique()))

Viewing the ranges/categories for each variable helps illuminate the data and reveals some interesting trends.

First, the date ranges show that appointments were scheduled as early as November 2015, but this dataset does not include any actual appointments prior to the end of April 2016. The latest appointment takes place on June 8th, 2016 and this is also the final date on which an appointment was scheduled.

It is interesting to note the large age range of patients in the data set, and important to note that the patients in this data set could have visited clinics in any one of 81 possible Brazilian neighborhoods.

It is also interesting to note that patients schedule appointments over a wide range, from the same day to almost half a year in advance. Finally, it is very interesting to notice that no appointments were scheduled and no visits took place on a Sunday anywhere in the dataset. This seems to indicate that the clinics in this dataset are closed on Sundays, both for scheduling appointments and for seeing patients.
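
One way to make a missing weekday stand out is to count appointments per weekday and reindex against the full week, so that absent days show up as an explicit zero. The dates below are made up and the `week` list is introduced only for this sketch.

```python
import pandas as pd

week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Made-up appointment dates
days = pd.Series(pd.to_datetime(['2016-04-29', '2016-04-30', '2016-05-02', '2016-06-08']))

# Count per weekday; reindex so days with no appointments appear with a count of 0
counts = days.dt.day_name().value_counts().reindex(week, fill_value=0)
print(counts)
```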

In [None]:
# Count the number of rows in which handicap takes on each value
display(df.groupby(df.Handicap).Handicap.count())

# Count the number of Handicap > 1 appointments
print('%d appointments have Handicap > 1' % len(df.loc[(df.Handicap == 2) | (df.Handicap == 3) | (df.Handicap == 4)]))

Looking at the variable ranges in this way shows us one final issue with the data that wasn't noticed previously: the *Handicap* variable is meant to be a boolean feature denoting whether a patient is disabled (1) or not (0). However, values of 2, 3, and 4 have also been entered for this variable in 199 rows.

We believe that it is likely that clinics erroneously input a handicap value on an ordinal scale from 0-4 instead of as a boolean 0/1 value. Since these observations comprise only a small portion (199 / 110,527) of the data, it could make sense to impute them. However, since we are not yet at the stage of building a model, we will leave the values as they are for now. Even with such a small portion of the data represented, *Handicap* levels > 1 might turn out to be beneficial predictors.
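
If we later decide to treat *Handicap* as a strict boolean, one possible recoding (an assumption about a future step, not something applied in this notebook) is to collapse every level above 0 into 1. The values below are made up.

```python
import pandas as pd

handicap = pd.Series([0, 1, 0, 2, 4, 1, 3])    # made-up values on the 0-4 scale

handicap_flag = (handicap > 0).astype(int)     # 1 if any handicap level was recorded
n_recoded = int((handicap > 1).sum())          # rows whose value would change

print(handicap_flag.tolist())
print("%d rows recoded" % n_recoded)
```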

Now that the raw data has been cleaned and variables have been investigated, we will save the updated data set to a new csv file.

In [None]:
df.to_csv("data/updated.csv", index=False)

### Simple Statistics and Visualizing Attributes
This part performs the simplest steps of the exploratory data analysis. Because the data contain both continuous and categorical (including binary) variables, which cannot be summarized together, their summaries are displayed separately.

### Explore Joint Attributes

In [None]:
# plot correlation matrix of the numeric columns only
correlations = df.select_dtypes(include='number').corr(method='pearson')
# take tick labels from the matrix itself so they always match its rows and columns
names = list(correlations.columns)
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(len(names))
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names, rotation=90)
ax.set_yticklabels(names)
plt.show()

In [None]:
# Scatterplot Matrix - pairwise relationships
from pandas.plotting import scatter_matrix
scatter_matrix(df)
plt.show()

In [None]:
import seaborn as sns
sns.set(style="dark")
# Pairwise plots of the numeric predictors, colored by the response
sns.pairplot(df, vars=['Age', 'DaysInAdvance'], hue="NoShow")

##### Frequency Tables 

In [None]:
# Recode the response as a true boolean so it can be summed and averaged later
df['NoShow'] = df['NoShow'].map({'Yes': True, 'No': False}).astype(bool)

In [None]:
# Ref: http://songhuiming.github.io/pages/2016/07/12/python-vs-sas/
# Number of appointments per unique PatientId
display(df.PatientId.value_counts(dropna=False).sort_index())
# Gender split within the show and no-show groups (each row sums to 1)
pd.crosstab(df.NoShow, df.Gender).apply(lambda r: r/r.sum(), axis=1)

In [None]:
# Distribution of appointment weekdays within the show and no-show groups (each row sums to 1)

pd.crosstab(df.NoShow,df.AppointmentDOW).apply(lambda r: r/r.sum(), axis = 1)

In [None]:
# Distribution of scheduling weekdays within the show and no-show groups (each row sums to 1)

pd.crosstab(df.NoShow,df.ScheduledDOW).apply(lambda r: r/r.sum(), axis = 1)
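
Note that normalizing each row, as above, answers "on which weekdays do no-shows fall?", while normalizing each column answers "what share of a given weekday's appointments are missed?". The toy frame below (made-up values) contrasts the two using the `normalize` argument of `pd.crosstab`.

```python
import pandas as pd

toy = pd.DataFrame({
    'NoShow': [True, False, False, False, True, True, False, False],
    'AppointmentDOW': ['Monday', 'Monday', 'Monday', 'Monday',
                       'Friday', 'Friday', 'Friday', 'Friday'],
})

# Rows sum to 1: weekday mix within the show / no-show groups
by_row = pd.crosstab(toy.NoShow, toy.AppointmentDOW, normalize='index')

# Columns sum to 1: no-show rate for each weekday
by_col = pd.crosstab(toy.NoShow, toy.AppointmentDOW, normalize='columns')

print(by_row)
print(by_col)
```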

In [None]:
# let's break up the age variable; include_lowest=True keeps patients with Age 0 in the 'child' bin
df['age_range'] = pd.cut(df.Age, bins=[0, 16, 65, 1e6], labels=['child', 'adult', 'senior'], include_lowest=True) # this creates a new variable
df.age_range.describe()

In [None]:
df.age_range
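
The bin edges deserve a quick check, because `pd.cut` uses right-closed intervals by default and therefore leaves out the lowest edge unless `include_lowest=True` is passed. The made-up ages below sit on and around each boundary.

```python
import pandas as pd

ages = pd.Series([0, 5, 16, 17, 65, 66, 115])   # made-up ages around the bin edges
bins = [0, 16, 65, 1e6]
labels = ['child', 'adult', 'senior']

default_bins = pd.cut(ages, bins=bins, labels=labels)                         # (0, 16], (16, 65], ...
inclusive_bins = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)  # first bin also contains 0

print(pd.DataFrame({'age': ages, 'default': default_bins, 'inclusive': inclusive_bins}))
```

With the default settings an age of 0 falls outside every bin and becomes missing, which would silently drop infants from any grouped summary.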

In [None]:
# now let's group with the new variable
# (the mean of a boolean NoShow flag is the share of missed appointments in each group)
no_show_pct = df.groupby('age_range')['NoShow'].apply(lambda s: s.astype(bool).mean() * 100)
print("Percentage of No Shows in each age group:")
print(no_show_pct)

In [None]:
cmap = sns.diverging_palette(220, 10, as_cmap=True) # one of the many color mappings

# plot the correlation matrix using seaborn
sns.set(style="darkgrid") # one of the many styles to plot using

f, ax = plt.subplots(figsize=(9, 9))

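# correlations are only defined for numeric/boolean columns
sns.heatmap(df.select_dtypes(include=['number', 'bool']).corr(), cmap=cmap, annot=True)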

f.tight_layout()

In [None]:
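%matplotlib inline
from matplotlib import pyplot as plt
sns.set()
# color by the response column; 'height' replaces the older 'size' argument
sns.pairplot(df, vars=['Age', 'DaysInAdvance'], hue="NoShow", height=2)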

In [None]:
sns.set(style="white")

# create a plot grid
# cast to float so the KDE and scatter plots receive numeric input
g = sns.PairGrid(df[['Age','NoShow','SMSReceived']].astype(float), diag_sharey=False)
g.map_lower(sns.kdeplot, cmap="Blues_d") # use joint kde on the lower triangle
g.map_upper(plt.scatter) # scatter on the upper
g.map_diag(sns.kdeplot, lw=3) # kde histogram on the diagonal

### Exceptional Work
For part of our effort to provide an example of exceptional work for this project, we decided to create an interactive plot to view the percentage of patients that did not show up for their appointment by each neighborhood location. This plot will exist on a real-scale map of Vitoria - ES, Brazil. This will allow us to make meaningful inferences about distance between clinic neighborhoods since we can view them on a real-life scaled map.

To accomplish this, I leveraged Google Maps Web API and the Bokeh package for creating interactive plots. This requires that a project be registered on Google's API access developer portal using a Google account. Once this was accomplished and an account was properly set up, I received an API key to allow access to Google's Map API from within Python.

Since the data set contained descriptive names for each of the 81 neighborhoods that clinics are located in, I first used Google's API from within Python to search for the location of each neighborhood on Google Maps. This is accomplished by utilizing Google's Geocoding API with the neighborhood name, and specifying that we are only interested in looking for places around Vitoria - ES, Brazil (this is accomplished by appending ', Vitoria - ES, Brazil' to the end of each neighborhood name). The geocode data from Google is stashed in a local pickle file for easy retrieval later without the need to access Google's servers.

Two neighborhoods are excluded from the search; their names are ambiguous (they describe an industrial park and an ocean, respectively), and while Google Maps is able to find a location for each of them, the locations it returns are far away from Vitoria, Brazil and likely not accurate. Therefore, these two neighborhoods have been excluded from the plot map.

In [None]:
import pickle
import os
from collections import OrderedDict
import googlemaps
import pandas as pd

geo_pkl = 'data/geo.pkl'

# Access google maps API with an API token read from the environment
#  (GOOGLE_MAPS_API_KEY is an assumed variable name; the key itself should not be committed)
gmaps = googlemaps.Client(key=os.environ['GOOGLE_MAPS_API_KEY'])

# Read in raw data
df_raw = pd.read_csv('data/KaggleV2-May-2016.csv')

# 2 ambiguously named neighborhoods will be removed (Google Maps returns locations far from Vitoria for them)
df2 = df_raw[df_raw.Neighbourhood != 'PARQUE INDUSTRIAL']
df2 = df2[~df2.Neighbourhood.str.contains('TRINDADE')]

# Get all neighborhood names (sorted alphabetically)
nhoods = sorted(list(df2.Neighbourhood.unique()))

# Get a list of all place names to search for (all are in Vitoria - ES, Brazil)
places = [name + ", Vitoria - ES, Brazil" for name in nhoods]

# Get list of geocodes and save as a pickle object
#  Pickling allows us to limit the number of requests to google API
#  Returns dict of {neighborhood_name: Google_geocode_response (dict)}
print("Getting Geo Data...".ljust(70), end='', flush=True)
if os.path.isfile(geo_pkl):
    with open(geo_pkl, 'rb') as p:
        geocodes = pickle.load(p)
else:
    geocodes = OrderedDict()
    for name in places:
        geocodes[name] = gmaps.geocode(name)
print("Complete!")

# Sometimes API doesn't get a hit (errors out)
#  In this case, it will return list of length 0, so we will re-get any results that are lists with length 0
while 0 in [len(g) for g in geocodes.values() if type(g) == list]:
    for name in geocodes:
        if type(geocodes[name]) == list and len(geocodes[name]) == 0:
            geocodes[name] = gmaps.geocode(name)
            # Check if still missing, and prompt input if it is
            if type(geocodes[name]) == list and len(geocodes[name]) == 0:
                input("%s can't be found. Enter to continue (Ctrl+C to abort)..." % name)

# Some locations have multiple matches in Vitoria - ES, Brazil
#   If that is the case, use only the 1st location
for name in geocodes:
    if type(geocodes[name]) == list:
        geocodes[name] = geocodes[name][0]

# Dump output geocodes to pickle file
with open(geo_pkl, 'wb') as p:
    pickle.dump(geocodes, p)
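
The caching idea in the cell above can be isolated into a small, self-contained sketch. The helper `load_or_build` and the stand-in `fake_geocode` function are hypothetical and exist only for illustration; in the notebook, the real lookups are made with `gmaps.geocode`.

```python
import os
import pickle
import tempfile

def load_or_build(cache_path, keys, fetch):
    """Return {key: fetch(key)}, reusing any non-empty results pickled at cache_path."""
    cache = {}
    if os.path.isfile(cache_path):
        with open(cache_path, 'rb') as fh:
            cache = pickle.load(fh)
    # Only look up keys that are absent or came back empty last time
    missing = [k for k in keys if not cache.get(k)]
    for k in missing:
        cache[k] = fetch(k)
    if missing:
        with open(cache_path, 'wb') as fh:
            pickle.dump(cache, fh)
    return cache

calls = []

def fake_geocode(place):
    """Stand-in for a real geocoding request; records each call."""
    calls.append(place)
    return [{'geometry': {'location': {'lat': -20.3, 'lng': -40.3}}}]

cache_file = os.path.join(tempfile.mkdtemp(), 'geo.pkl')
places = ['CENTRO, Vitoria - ES, Brazil', 'JARDIM CAMBURI, Vitoria - ES, Brazil']

first = load_or_build(cache_file, places, fake_geocode)
second = load_or_build(cache_file, places, fake_geocode)   # served from the pickle
print("Lookups performed:", len(calls))
```

The second call performs no lookups because every place already has a non-empty cached result, which is what keeps the number of API requests low across notebook runs.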

After importing the data, I used Pandas to calculate the percentage of No-Shows out of the total number of appointments at each of the 79 neighborhoods. Then Bokeh was used to plot color-coded dots on an interactive map representing all clinics in each neighborhood.

Since Google Maps is being used to supply the background of the plot, the cartesian coordinates on the plot correspond with latitude/longitude coordinates. The previously scraped data from Google Maps provides the coordinates for each location, which can be used as x/y points to plot onto the map.
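
Because each neighborhood is reduced to a latitude/longitude pair, straight-line distances between neighborhoods can also be estimated directly. The sketch below uses the haversine formula with a mean Earth radius of 6371 km; the function name and the second coordinate pair are made up for illustration, and this calculation is not part of the plot itself.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lng1, lat2, lng2, radius_km=6371.0):
    """Great-circle distance in kilometers between two points given in degrees."""
    dlat = radians(lat2 - lat1)
    dlng = radians(lng2 - lng1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# Map center used for the plot, and a made-up second point nearby
center = (-20.285, -40.315)
other = (-20.320, -40.340)
print("%.2f km" % haversine_km(center[0], center[1], other[0], other[1]))
```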

Bokeh provides the ability to show tooltips with custom information when mousing over points on the plot. This functionality is used to provide the neighborhood name, latitude/longitude coordinates, and the percentage of appointments that are no-shows in that neighborhood when a user mouses over each point on the map.

Finally, Bokeh allows the use of color-scaled palettes to represent the relative magnitudes of custom variables in the plot. I use this functionality to provide a color scale representing no-show percentages, where neighborhood dots become darker orange and red as the no-show percentage becomes higher.

In [None]:
# Plot interactively with Bokeh!
from bokeh.io import output_notebook, show
from bokeh.models import ColumnDataSource, GMapOptions, HoverTool, WheelZoomTool, LogColorMapper
from bokeh.plotting import gmap
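from bokeh.palettes import OrRd6

# Set up the palette that will be used for the color scaling
#  (reversed copy, so higher no-show percentages map to darker colors
#   without modifying Bokeh's shared palette in place)
palette = OrRd6[::-1]
color_mapper = LogColorMapper(palette=palette)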

# Output to this notebook for the interactive JS plots
output_notebook()

# Define Google Maps plotting object
map_ops = GMapOptions(lat=-20.285,
                      lng=-40.315,
                      map_type='roadmap',
                      zoom=13
                      )
p = gmap(os.environ['GOOGLE_MAPS_API_KEY'],  # assumed env var name; os is imported in the geocoding cell
         map_ops,
         title='Vitoria Clinics No-Show Percentage by Neighborhood'
         )

# Get percentage of No-shows for each location, looked up by neighborhood name
#  so that each value lines up with its geocode entry
pct_by_nhood = (df2['No-show'] == 'Yes').groupby(df2['Neighbourhood']).mean() * 100
no_show_pcts = [pct_by_nhood[name[:-22]] for name in geocodes]

# Define coordinate data
dat = ColumnDataSource(data=dict(lat=[g['geometry']['location']['lat'] for g in geocodes.values()],
                                 lng=[g['geometry']['location']['lng'] for g in geocodes.values()],
                                 name=[name[:-22].title() for name in geocodes],
                                 noshows=no_show_pcts
                                 )
                       )

# Paint color-scaled circles at respective coordinates
p.circle(x='lng',
         y='lat',
         size=15,
         fill_color={'field': 'noshows',
                     'transform': color_mapper
                     },
         fill_alpha=0.8,
         source=dat
         )

# Add info tooltips to mouse hover
tips=[("Name", "@name"), ("Coords", "(@lng, @lat)"), ("No Shows", "@noshows%")] 
p.add_tools(HoverTool(tooltips=tips))

# Display the plot
show(p)

As you can see from the interactive plot above, Bokeh allowed us to create a dynamic, interactive map of the clinic neighborhoods in this data set. I believe this can be very useful for EDA in projects like these, because reading a table of 81 neighborhood names and their corresponding responses doesn't facilitate finding patterns in the data nearly as well as an interactive plot does. The Bokeh plot is fully interactive; it can be zoomed and panned with a mouse to navigate the data in whatever way is the most useful and convenient for the user.

This process required extensive trial and error using Pandas and Bokeh to prepare data for plotting and to create the plot, and work was also done to make the plot aesthetically pleasing by researching Bokeh's options for presenting the plotted data. We believe that the creation of this plot is an example of exceptional work that has been done to improve this project. This work will be useful to our team, and could also be useful as an example and template for other students wishing to create interactive location plots in the future.