## Drift tool: usability version

There are a number of versions of the drift tool, each needing a different level of user input to produce the drift visualisations. They range from a mostly empty notebook with some prompts, which lets users learn Python by writing their own code, through to this tool. Other than using the Run button at the top of the page and knowing a little about the CSV files that hold the drift data, this version requires next to no Python knowledge to get the drift visualisations. The key benefit of this tool is ease of use, but this tool, like the others, leaves plenty of scope for confident users to change the code so that the visualisations and outputs are the ones they want or need.

The drift visualisations show how long it takes children to move between different stages of Children's Services provision: for instance, how long children wait between a referral and a CIN or CP plan, how long they wait between a CP plan and a CLA status, and also longer periods such as the wait between an initial referral and a CLA status. These statistics are presented in two ways. The first is a set of histograms showing the spread of wait times between different stages of Children's Services provision, which is a useful way of showing what proportion of children wait what length of time for different levels of provision. The second, and most important, is a set of bar and box plots showing the wait times between different provisions starting in different years. This second set of visualisations lets users see how long children waited to receive different types of provision in any year, over a range of years. It can be used to help determine whether wait times are improving, and to see trends in wait times over time.

## User Guide
The usability version of the drift tool is the easiest to use, but, as for every version of the tool, a video explaining how to use it is in the pipeline and heading towards the D2I website. Until that time, this guide will have to suffice.

Jupyter notebooks run in cells. This text is in a cell. You can tell which cell is selected because it is highlighted on its left-hand side in green or blue. To access the visualisations at the end of this notebook, you need to run every cell in the notebook, one at a time, in order. You run a cell by pressing the 'Run' button in the toolbar at the top, which is easy to recognise as it has a play icon next to it. The cells have to be run one at a time because some require user input to set up the visualisations. This user input is simple. First, upload your CSVs to the notebook using the buttons provided. Then, in the boxes provided, add the header names for the columns containing children's IDs and the referral, CIN plan, CP plan, and CLA status start dates from each CSV. These boxes contain the relevant header names from the Annex A as a guide. The header names are needed to join the tables and extract the appropriate data. Once all of this has been entered, click on the next cell (the one after the cell where you entered your data) and run every remaining cell using the 'Run' button, waiting a couple of seconds between each click. Once you have run every cell, interactive visualisations will appear. These let you choose which visualisations to see, and the year ranges over which the data is calculated and displayed.

Now, to get going, run the first three cells. The last cell to run is the one with all the code following this one. That'll take you to the setup. From there, scroll down to the setup buttons and text boxes, and follow the instructions there.

## Limitations of the model

1) The data for the model builds over time, so there is not much data for the earliest years in the data set. As a result, the further back you look in the visualisations, the less appropriate those years are as indications of drift. For this reason, in the plots which display wait time by end year, only the most recent five years are in full colour; the earlier years are still included to show where the data for the model is drawn from.

2) The model calculates the time between referral and CP plan, and between referral and CLA status, by assuming that the last referral before any CP plan or CLA status is the one that led to that plan or status. This means that for a small number of children who have a referral, then a CP plan, and then a CLA status without another referral in between, the one referral will count as the most recent referral for both the CP plan and the CLA status. This is not necessarily wrong in every case: in some cases it can be argued that if a child had a referral and was then moved from a CP plan to a CLA status, they ought to have had a CLA status initially instead of a CP plan. As the number of times this happens is small relative to the rest of the data, and it happens every year, it is relatively balanced out across the model.

3) A small number of data points are dropped to remove cases that the model would calculate incorrectly, leading to large false outliers. Some children have a referral logged very shortly after a CP plan or CLA status starts, which is due to quirks in the way that children's services case management systems work. If those children then move from CLA status to a CP plan years later, the calculations would count that as a wait of years from that referral to the CP plan. To avoid these large false outliers, children who have a referral within two weeks after the start of a CP plan or CLA status are dropped from the model.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import datetime
import os

import io

import matplotlib.pyplot as plt
import seaborn as sns


#### Interactivity 
import ipywidgets as widgets
from ipywidgets import interact, interact_manual, Layout
from IPython.display import display, clear_output

print('Instructions:')
print('This is the section where you need to upload your files and select your column headers so the code can read your files.')
print('First, upload your CSVs containing Referral data, CIN plan data, CP plan data, and CLA data using these buttons.')

#  Sets style so widgets are wide enough for descriptions.
style = {'description_width': 'initial'}


#  buttons for uploading all the CSVs.
REFs = widgets.FileUpload(accept='*.csv', multiple=False, description = 'Referrals CSV', 
                          layout=Layout(width='250px'), style=style)
display(REFs)



CPs = widgets.FileUpload(accept='*.csv', multiple=False, description = 'CPs plans CSV', 
                          layout=Layout(width='250px'), style=style)
display(CPs)

CLAs = widgets.FileUpload(accept='*.csv', multiple=False, description = 'CLA status CSV', 
                          layout=Layout(width='250px'), style=style)
display(CLAs)





print("When your files are uploaded, and you've submitted and checked that all of your column names are correct and in the right place, click\
 the next cell to run it. Make sure you don't accidentally re-run this cell, or you'll delete all of your entries. \
After running the next cell, you can run every cell until the end. Make sure to scroll down to see the visualisations!")

Instructions:
This is the section where you need to upload your files and select your column headers so the code can read your files.
First, upload your CSVs containing Referral data, CIN plan data, CP plan data, and CLA data using these buttons.


FileUpload(value={}, accept='*.csv', description='Referrals CSV', layout=Layout(width='250px'))

FileUpload(value={}, accept='*.csv', description='Referrals CSV', layout=Layout(width='250px'))


FileUpload(value={}, accept='*.csv', description='CPs plans CSV', layout=Layout(width='250px'))

FileUpload(value={}, accept='*.csv', description='CLA status CSV', layout=Layout(width='250px'))

When your files are uploaded, and you've submitted and checked that all of your column names are correct and in the right place, click the next cell to run it. Make sure you don't accidentally re-run this cell, or you'll delete all of your entries. After running the next cell, you can run every cell until the end. Make sure to scroll down to see the visualisations!


In [4]:
#  Reads the CSVs that were uploaded with the upload button widgets, to be used by functions below.
#  'utf-8-sig' is used so that a byte-order mark at the start of a file, if present, does not end up
#  in the first column name.


input_file = list(REFs.value.values())[0]
content = input_file['content']
content = io.StringIO(content.decode('utf-8-sig'))
REF = pd.read_csv(content)

input_file3 = list(CLAs.value.values())[0]
content3 = input_file3['content']
content3 = io.StringIO(content3.decode('utf-8-sig'))
CLA = pd.read_csv(content3)

input_file4 = list(CPs.value.values())[0]
content4 = input_file4['content']
content4 = io.StringIO(content4.decode('utf-8-sig'))
CP = pd.read_csv(content4)


{'metadata': {'name': '1referral20 yrs.csv', 'type': 'text/csv', 'size': 1775050, 'lastModified': 1660035297630}, 'content': b'\xef\xbb\xbfPerson ID,Referral Date\r\nP4017102,09/08/2002\r\nP4016949,09/08/2002\r\n...
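
The layout of `FileUpload.value` depends on the installed ipywidgets version: version 7 returns a dictionary keyed by file name (the layout the reading cell above relies on), while version 8 returns a tuple of dictionaries. The sketch below shows one way a reader could accept either layout. `read_uploaded_csv` is a hypothetical helper and not part of the tool, the bytes are made up, and the `utf-8-sig` codec is an assumption made so that a leading byte-order mark is stripped.

```python
import io

import pandas as pd


def read_uploaded_csv(upload_value):
    """Return the first uploaded file as a DataFrame (hypothetical helper)."""
    # ipywidgets 7: {filename: {'metadata': ..., 'content': bytes}}
    # ipywidgets 8: ({'name': ..., 'content': memoryview, ...},)
    if isinstance(upload_value, dict):
        files = list(upload_value.values())
    else:
        files = list(upload_value)
    if not files:
        raise ValueError('No file has been uploaded yet.')
    raw = bytes(files[0]['content'])
    # 'utf-8-sig' drops a leading byte-order mark if one is present.
    return pd.read_csv(io.StringIO(raw.decode('utf-8-sig')))


# Made-up bytes standing in for an uploaded file.
fake_bytes = b'\xef\xbb\xbfPerson ID,Referral Date\r\nP1,09/08/2002\r\n'
old_style = {'refs.csv': {'metadata': {'name': 'refs.csv'}, 'content': fake_bytes}}
new_style = ({'name': 'refs.csv', 'content': memoryview(fake_bytes)},)

frame_old = read_uploaded_csv(old_style)
frame_new = read_uploaded_csv(new_style)
print(list(frame_old.columns))
```

In the notebook this would be called as, for example, `read_uploaded_csv(REFs.value)`.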

Now that we have read the data tables into Pandas data frames, we need to find every child who has had a referral that led to a CP plan, a referral that led to a CLA status, or a CP plan that led to a CLA status. The easiest way to do this is to merge the relevant tables together, using Person ID to link them. The result is a table that, for each Person ID, links every row of table 1 to every row of table 2, so we end up with a row for every combination of referral and CP plan or CLA status, for instance. This is a problem because children may have more than one referral, before or after a CP plan or CLA status, and the data frame links every one of these referrals to each of these CP plans and CLA statuses. For example, say a child has had referrals in 2012, 2013, and 2014 and a CP plan starting in 2014: the merge will make rows joining each of the 2012, 2013, and 2014 referrals to the 2014 CP plan. This gives us wait times that do not reflect reality, as we can assume that the 2012 and 2013 referrals did not need to result in a CP plan in the way the 2014 one did.
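
To make the effect of the merge concrete, here is a minimal sketch using made-up data for a single child. The column name `CP Start Date` and the dates are assumptions for illustration only.

```python
import pandas as pd

# Made-up data: one child with three referrals and one CP plan.
referrals = pd.DataFrame({
    'Person ID': ['P1', 'P1', 'P1'],
    'Referral Date': pd.to_datetime(['2012-03-01', '2013-06-15', '2014-01-10']),
})
cp_plans = pd.DataFrame({
    'Person ID': ['P1'],
    'CP Start Date': pd.to_datetime(['2014-02-01']),
})

# The inner join pairs every referral with every CP plan for the same child.
merged = pd.merge(referrals, cp_plans, how='inner', on='Person ID')
merged['delta'] = merged['CP Start Date'] - merged['Referral Date']
print(merged)
```

The merged frame has one row per referral, so the single CP plan appears with three different wait times, only one of which is likely to be meaningful.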

As a result, we need a way to show only the time between referrals and CP plans or CLA statuses where the referral was actually deemed sufficient to lead to that CP plan or CLA status (or the CP plan led to the CLA status). To do this, we have to assume that the most recent referral before a CP plan or CLA status is the one that led to that plan or status. As this is normally true, it is a fair assumption to make.

There are a number of ways we might go about this. One is to hope that our data is already ordered by date for every Person ID and drop all but the most recent row, hoping this gives us the most recent referral that led to a CP plan. This doesn't work, however: the data may not be ordered correctly, and it's best not to assume that it is. It would also mean taking some rows where the CP plan was before the referral, if a child has had a referral after their most recent CP plan. Another option is to sort by Person ID and then date, so that for every child the dates of their referrals are in order. We then drop any rows where referrals were after, say, CP plans, hope that the last row is the one with the most recent referral that led to a CP plan, and drop every other row for every child. This is better, but it still relies on the sort ordering the end dates how we expect, which it may, but we don't want to risk it. The best method is to calculate the wait times between referral and CP plan, and then sort by Person ID and then wait time, taking just the shortest non-negative wait time. In most typical cases, this should give us the most recent referral that led to a CP plan.
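
The "shortest non-negative wait per child" approach described above can be sketched as follows. The data is made up, and the column names `Start`, `End` and `delta` are used here only for illustration.

```python
import pandas as pd

# Made-up referral/CP plan pairs, as they would look after a merge.
pairs = pd.DataFrame({
    'Person ID': ['P1', 'P1', 'P1', 'P2'],
    'Start': pd.to_datetime(['2012-03-01', '2014-01-10', '2014-06-01', '2015-05-01']),
    'End': pd.to_datetime(['2014-02-01', '2014-02-01', '2014-02-01', '2015-05-20']),
})
pairs['delta'] = pairs['End'] - pairs['Start']

# Drop referrals that came after the plan, then keep the shortest wait per child.
non_negative = pairs[pairs['delta'] >= pd.Timedelta(0)]
shortest = (non_negative.sort_values(['Person ID', 'delta'])
            .drop_duplicates(subset='Person ID', keep='first'))
print(shortest)
```

Sorting on the computed wait rather than on the raw dates means the result does not depend on how the input files happen to be ordered.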

There are still problems, however. This actually gives us the shortest time period between any referral and any CP plan for each child. For some children, for instance those with two CP plans, we will only have information about one of those wait times. So we can do even better: rather than keeping only the row with the shortest wait time for each Person ID, we can keep the row with the shortest wait time for each Person ID and each CP plan. We can do this by ordering by Person ID and wait time (delta), and then dropping all but the first row among those which share a Person ID and CP plan start date. This should give us the shortest non-negative time between a referral and every CP plan for every child. Sadly, there are still problems with this method! In some rare instances, a CP plan or CLA status will start slightly before a referral is logged. In these instances, the referral before the one actually related to the CP plan or CLA status will be the one with a row of wait time data. We could potentially fix this by dropping rows with a referral shortly after a CP plan or CLA status as outliers, but the implementation of this may have an unintended impact outside of these rows, and the current method allows these rare instances to be balanced out by the averages.
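
The difference between keeping one row per child and one row per child and CP plan can be seen in a small sketch. The data is made up: one child with two referrals and two CP plans.

```python
import pandas as pd

referrals = pd.DataFrame({
    'Person ID': ['P1', 'P1'],
    'Start': pd.to_datetime(['2014-01-10', '2016-08-01']),
})
cp_plans = pd.DataFrame({
    'Person ID': ['P1', 'P1'],
    'End': pd.to_datetime(['2014-02-01', '2016-09-01']),
})

pairs = pd.merge(referrals, cp_plans, how='inner', on='Person ID')
pairs['delta'] = pairs['End'] - pairs['Start']
pairs = pairs[pairs['delta'] >= pd.Timedelta(0)].sort_values(['Person ID', 'delta'])

# One row per child: the second CP plan is lost.
per_child = pairs.drop_duplicates(subset=['Person ID'], keep='first')
# One row per child and CP plan start date: both plans keep their own shortest wait.
per_plan = pairs.drop_duplicates(subset=['Person ID', 'End'], keep='first')
print(per_plan)
```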

This last problem can be fixed by finding every referral that falls shortly after the start of a CP plan or CLA status; in the code we have chosen a window of two weeks. We can then make a list of the IDs for which this is the case, and drop all the rows from the main data frame where the IDs match. This loses some data points, but avoids some very long wait times that occur as a result of the referral being logged after the CLA status or CP plan.
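
A minimal sketch of the two-week exclusion, using made-up data, might look like this. The variable names are illustrative; the 14-day window matches the one described above.

```python
import pandas as pd

# Made-up pairs: P1 has a referral logged four days after the CP plan started.
pairs = pd.DataFrame({
    'Person ID': ['P1', 'P1', 'P2'],
    'Start': pd.to_datetime(['2010-05-01', '2014-02-05', '2015-05-01']),
    'End': pd.to_datetime(['2014-02-01', '2014-02-01', '2015-05-20']),
})
pairs['delta'] = pairs['End'] - pairs['Start']

zero = pd.Timedelta(0)
window = pd.Timedelta(days=14)

# A small negative delta means the referral came shortly after the plan or status began.
late_referral = (pairs['delta'] < zero) & (pairs['delta'] > -window)
flagged_ids = pairs.loc[late_referral, 'Person ID'].unique()

# Drop every row for the flagged children, not just the offending row.
kept = pairs[~pairs['Person ID'].isin(flagged_ids)]
print(kept)
```

Without the exclusion, P1's 2010 referral would be treated as the one that led to the 2014 CP plan, giving a wait of several years.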

A final issue, and one that is more difficult to solve, is that a child may have a referral that leads to a CLA status and then be downgraded to a CP plan later. The calculations below will count that as a very long wait between the referral and the CP plan!
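
The following sketch illustrates that last issue with a made-up timeline for one child. The check at the end is a hypothetical way of flagging such cases and is not something the tool does.

```python
import pandas as pd

# Made-up timeline for one child: referral, CLA status soon after, CP plan years later.
referral = pd.Timestamp('2012-01-10')
cla_start = pd.Timestamp('2012-02-01')
cp_start = pd.Timestamp('2016-02-01')

# Referral to CLA is a short wait, as expected.
ref_to_cla_days = (cla_start - referral).days

# Referral to CP plan is computed from the same referral, so it looks like a wait of years.
ref_to_cp_days = (cp_start - referral).days

# Hypothetical check, not part of the tool: a CLA status between the referral and the
# CP plan suggests the CP plan was a step down rather than a delayed response.
looks_like_step_down = referral <= cla_start < cp_start

print(ref_to_cla_days, ref_to_cp_days, looks_like_step_down)
```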

In [4]:
def id_merger(Table1, Table2):
    '''This function merges the two input tables together using an inner join on Person ID.
    It is called three times below to make 3 tables: one for people who are referred and
    move to a CP plan, another for people who are referred and have a CLA status, and a
    last one for people who move from a CP plan to CLA status.
    
    It also converts the dates to Pandas datetime objects, drops every child who has a
    start date within two weeks after the end date, drops negative wait times (e.g. where
    someone has a referral after their CP plan, that referral shouldn't count as leading
    to their CP plan generally), and then keeps only the shortest wait for each Person ID
    and end date, making the assumption that generally the most recent referral before CP
    or CLA is the one that led to the change of plan or status.'''
    
    df = pd.merge(Table1, Table2, how='inner', on='Person ID')
    #  Assumes each table holds Person ID plus one date column, so the last two columns are the dates.
    df = df.rename(columns = {df.columns[-2]: 'Start', df.columns[-1]: 'End'})
    
    #  dayfirst=True assumes DD/MM/YYYY dates, as in the sample referral file; change it if yours differ.
    df['Start'] = pd.to_datetime(df['Start'], dayfirst=True)
    df['End'] = pd.to_datetime(df['End'], dayfirst=True)
    df['delta'] = df['End'] - df['Start']
    
    df = df.sort_values(['Person ID', 'delta'])
    
    #  Finds children with a start date up to two weeks after the end date and drops all of their rows.
    df_temp = df[(df['delta'] < datetime.timedelta(days = 0)) & 
                 (df['delta'] > datetime.timedelta(days = -14))]
    IDlist = list(df_temp['Person ID'])
    df = df[~df['Person ID'].isin(IDlist)]
    
    #  Drops negative waits, then keeps the shortest wait for each Person ID and end date.
    df = df[df['delta'] >= datetime.timedelta(days = 0)]
    df = df.drop_duplicates(subset = ['Person ID', 'End'], keep = 'first').copy()
   
    #  Wait time in days as a float.
    df['Time Gap float'] = df['delta'] / pd.Timedelta(days = 1)
    
    return df


#  Important data for making visualisations about each wait period, ref to CP, ref to CLA and CP to CLA.
REFCP_important = id_merger(REF, CP)

REFCLA_important = id_merger(REF, CLA)

CPCLA_important = id_merger(CP, CLA)




In [6]:
def cases_by_year_sort(important_frame):
    '''This function creates a new dataframe with cases and average wait times for cases starting in each year.'''
    important_frame['year'] = important_frame['End'].dt.year
    
    wait_by_year_ = important_frame.groupby(important_frame['year'])['Time Gap float'].mean()
    cases_by_year = important_frame.value_counts('year').rename_axis('year').reset_index(name='Cases starting that year')

    wait_cases_by_year = pd.merge(wait_by_year_, cases_by_year, on='year') 
    wait_cases_by_year.columns = ['year', 'Average wait', 'Cases starting that year']
    return wait_cases_by_year

#REFCP
wait_by_year_REFCP = cases_by_year_sort(REFCP_important)

#REFCLA
wait_by_year_REFCLA = cases_by_year_sort(REFCLA_important)

#CPCLA
wait_by_year_CPCLA = cases_by_year_sort(CPCLA_important)
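
As an aside, the same per-year summary can be produced with a single `groupby`. This is a minimal sketch on made-up data rather than a replacement for the function above; note that `count` ignores missing wait times, whereas `value_counts` on the year counts every row.

```python
import pandas as pd

# Made-up wait times in days for three cases.
waits = pd.DataFrame({
    'End': pd.to_datetime(['2019-03-01', '2019-07-01', '2020-02-01']),
    'Time Gap float': [10.0, 30.0, 50.0],
})
waits['year'] = waits['End'].dt.year

# Mean wait and number of cases for each year in one pass.
summary = waits.groupby('year')['Time Gap float'].agg(['mean', 'count']).reset_index()
summary.columns = ['year', 'Average wait', 'Cases starting that year']
print(summary)
```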


In [7]:
wait_by_start_start_year_df = ()
wait_by_start_year_REFCIN = ()
wait_by_start_year_REFCP = ()
wait_by_start_year_REFCLA = ()
wait_by_start_year_CINCP = ()
wait_by_start_year_CINCLA = ()
wait_by_start_year_CPCLA = ()


def wait_by_start_year(years, wait_by_year_df, important_df, Title, st_1, st_2):

    '''This function makes a bar chart and box plot to represent wait times starting in each year.'''
    wait_by_start_year_df = wait_by_year_df[(wait_by_year_df['year'] >= years[0]) & (wait_by_year_df['year'] <= years[1])]
    wait_by_start_year_important = important_df[(important_df['year'] >= years[0]) & (important_df['year'] <= years[1])]
    
    #  Sets the colour palette to coolwarm_r and its length to the length of the dataframe.
    pal = sns.color_palette("coolwarm_r", len(wait_by_start_year_df))
    #  The first .argsort() returns the indices in the order that would sort the column; the second returns
    #  the indices that would sort the first set of indices, which gives the rank of each value.
    rank = wait_by_start_year_df['Average wait'].argsort().argsort()   
    
    #  Sets seaborn palette style.
    sns.set_style('darkgrid', {"axes.facecolor": ".9"})

    #  Initialises the figure with two sets of axes.
    fig, axes = plt.subplots(2, 1, figsize=(20,20), sharex=True)
    
    #  Title according to function inputs.
    plt.suptitle(Title)

    #  Makes a barplot at axes index 0 with wait_by_start_year_df as the data, the year column as x and the
    #  Average wait column as y. The commented-out line sets the palette according to the pal and rank variables above.
    #sns.barplot(ax=axes[0], data=wait_by_start_year_df, x='year', y='Average wait', palette=np.array(pal[::-1])[rank])
    sns.barplot(ax=axes[0], data=wait_by_start_year_df, x='year', y='Average wait', color = 'darkred')
    
    
    #  Puts a horizontal line at the mean value for wait times.
    axes[0].axhline(wait_by_start_year_important['Time Gap float'].mean(), ls='--', color ='navy')
    #  Adds descriptor text to the mean bar.
    axes[0].text(0,-200, 'Mean wait for all years: ----', color ='navy', fontsize=20)
    
    #  Sets axes titles.
    axes[0].set_title(st_1)
    axes[0].set_xlabel('')
    axes[0].set_ylabel('Average wait (Days)')

    #  Annotates each bar with the number of cases for that year. The filtered data frame is used so that
    #  the counts still line up with the bars when the year range is narrowed.
    for p, cases in zip(axes[0].patches, wait_by_start_year_df['Cases starting that year']):
        #  Gets height and location of top of bar for annotation and centres the text just above it.
        axes[0].annotate(str(cases), (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha='center', va='center', fontsize=15, color='black', xytext=(0, 5),
                 textcoords='offset points')
    

    
    
#    for bar, alpha in zip(axes[0].containers[0], wait_by_start_year_df['alpha']):
#        bar.set_alpha(alpha)

    
    #  Plots the box plot using the same rules as the bar plot.
    #  The palette is left the same as the averages are the same.
    #sns.boxplot(ax=axes[1], data=wait_by_start_year_important, x='year', y='Time Gap float',  palette=np.array(pal[::-1])[rank])
    sns.boxplot(ax=axes[1], data=wait_by_start_year_important, x='year', y='Time Gap float',  color='darkred')
    axes[1].axhline(important_df['Time Gap float'].mean(), ls='--', color='navy')
    axes[1].set_title(st_2)
    axes[1].set_xlabel('Year')
    axes[1].set_ylabel('Wait (Days)')
    axes[1].set_ylim(-10, 4000)

    
    #  Reduces the alpha of all but the last 5 bars to 0.1 for bar and box plots.
    x = len(wait_by_start_year_df['year'])
    alphas = [0.1] * max(x - 5, 0) + [1] * min(x, 5)
    
    for bar, alpha in zip(axes[0].containers[0], alphas):
        bar.set_alpha(alpha)    
    
    #  Depending on the matplotlib version, the boxes are kept in axes.artists (older) or axes.patches (newer).
    boxes = list(axes[1].artists) or list(axes[1].patches)
    for patch, alpha in zip(boxes, alphas):
        r, g, b, a = patch.get_facecolor()
        patch.set_facecolor((r, g, b, alpha))   
    
    plt.show()


In [8]:

def wait_hist_plotter(yrs, refcp=False, refcla=False, cpcla=False):
    '''Allows the use of a widget in the next cell to choose which graphs to display, and the years over which
    to calculate the data for the histogram.'''
    if refcp:
        Data = REFCP_important[(REFCP_important['year'] >= yrs[0]) & (REFCP_important['year'] <= yrs[1])]
        sns.set_style('darkgrid', {"axes.facecolor": ".9"})
        sns.set_palette("flare")
        fig = sns.histplot(data=Data, x='Time Gap float')
        fig.set_title('Referral to CP plan histogram of wait times')
        fig.set_xlabel('Wait time (Days)')
        fig.set_ylabel('Number of Children')
        fig.axvline(Data['Time Gap float'].mean(), ls='--', color = 'navy')
        plt.show()

    if refcla:
        Data = REFCLA_important[(REFCLA_important['year'] >= yrs[0]) & (REFCLA_important['year'] <= yrs[1])]
        sns.set_style('darkgrid', {"axes.facecolor": ".9"})
        sns.set_palette("flare")
        fig = sns.histplot(data=Data, x='Time Gap float')
        fig.set_title('Referral to CLA status histogram of wait times')
        fig.set_xlabel('Wait time (Days)')
        fig.set_ylabel('Number of Children')
        fig.axvline(Data['Time Gap float'].mean(), ls='--', color = 'navy')    
        plt.show()
        
    if cpcla:
        Data = CPCLA_important[(CPCLA_important['year'] >=yrs[0]) & (CPCLA_important['year'] <=yrs[1])]
        sns.set_style('darkgrid', {"axes.facecolor": ".9"})
        sns.set_palette("flare")
        fig = sns.histplot(data=Data, x='Time Gap float')
        fig.set_title('CP plan to CLA status histogram of wait times')
        fig.set_xlabel('Wait time (Days)')
        fig.set_ylabel('Number of Children')
        fig.axvline(Data['Time Gap float'].mean(), ls='--', color = 'navy')
        plt.show()
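
Each branch above applies the same inclusive year filter before plotting. As a sketch only, that filter could be pulled out into a small function; `filter_years` is a hypothetical helper, not part of the tool, and the data is made up.

```python
import pandas as pd


def filter_years(frame, yrs):
    """Return rows whose 'year' lies within yrs, inclusive at both ends (hypothetical helper)."""
    return frame[frame['year'].between(yrs[0], yrs[1])]


# Made-up wait times in days.
waits = pd.DataFrame({'year': [2018, 2019, 2020, 2021],
                      'Time Gap float': [12.0, 40.0, 25.0, 90.0]})
subset = filter_years(waits, (2019, 2020))
print(subset['Time Gap float'].mean())
```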

In [9]:
def graph_maker(years, refcp=False, refcla=False, cpcla=False):
    '''Used by the graph plotting widget below to select bar and box plots to draw, and put that data into the
    function that draws the plots.'''
    if refcp:    
        title = 'Referral to CP plan, wait time by CP plan start year'
        subtitle_1 = 'Mean time from referral to CP plan for CP plans starting in a given year (cases given above bars)'
        subtitle_2 = 'Spread of time from referral to CP plan for CP plans starting in a given year'
        wait_by_start_year(years, wait_by_year_REFCP, REFCP_important, title, subtitle_1, subtitle_2)

    if refcla:    
        title = 'Referral to CLA status, wait time by CLA status start year'
        subtitle_1 = 'Mean time from referral to CLA status for CLA statuses starting in a given year (cases given above bars)'
        subtitle_2 = 'Spread of time from referral to CLA status for CLA statuses starting in a given year'
        wait_by_start_year(years, wait_by_year_REFCLA, REFCLA_important, title, subtitle_1, subtitle_2)

    if cpcla:     
        title = 'CP plan to CLA status, wait time by CLA status start year'
        subtitle_1 = 'Mean time from CP plan to CLA status for CLA statuses starting in a given year (cases given above bars)'
        subtitle_2 = 'Spread of time from CP plan to CLA status for CLA statuses starting in a given year'
        wait_by_start_year(years, wait_by_year_CPCLA, CPCLA_important, title, subtitle_1, subtitle_2)

In [14]:
print('Bar plots showing wait time between stages by second stage year.')

# Widget used to create bar and box plots    
interact(graph_maker, years=widgets.IntRangeSlider(
    value = [2002, 2022],
    min = 2002,
    max = 2022,
    step = 1,
    description = 'Year Range:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout = True,
    readout_format = 'd',
    layout={'width': '500px'}), 
        refcp = widgets.Checkbox(
        value = False,
        description = 'Referral to CP plan',
        disabled=False), 
            refcla=widgets.Checkbox(
            value=False,
            description = 'Referral to CLA status',
            disabled=False), 
                cpcla=widgets.Checkbox(
                value=False,
                description = 'CP plan to CLA status',
                disabled=False))


print('Histograms showing breakdown of wait time between stages, with slider to change years for calculations.')
interact(wait_hist_plotter, yrs=widgets.IntRangeSlider(
    value = [2002, 2022],
    min = 2002,
    max =2022,
    step = 1,
    description = 'Year Range:',
    disabled = False,
    continuous_update = False,
    orientation='horizontal',
    readout = True,
    readout_format='d',
    layout={'width': '500px'}), 
        refcp=widgets.Checkbox(
        value=False,
        description = 'Referral to CP plan',
        disabled=False), 
            refcla=widgets.Checkbox(
            value=False,
            description = 'Referral to CLA status',
            disabled=False), 
                cpcla=widgets.Checkbox(
                value=False,
                description = 'CP plan to CLA status',
                disabled=False))


#interact_manual(hist_selector, graphs=widgets.SelectMultiple(
#    options=GraphDict.keys()
#    ))

Bar plots showing wait time between stages by second stage year.


interactive(children=(IntRangeSlider(value=(2002, 2022), continuous_update=False, description='Year Range:', m…

Histograms showing breakdown of wait time between stages, with slider to change years for calculations.


interactive(children=(IntRangeSlider(value=(2002, 2022), continuous_update=False, description='Year Range:', m…

<function __main__.wait_hist_plotter3(yrs, refcp=False, refcla=False, cpcla=False)>