This Jupyter notebook contains Python code that lets you ingest data from the ChAT into a Python environment, which allows for very quick and powerful analysis and visualisation. A primary goal of this notebook is that you should only need to change a few variables (such as where the ChAT of interest is stored locally) to get the visualisations you want. Another key goal is to provide a framework of Python code that you can manipulate and play with to get a feel for how Python works, and start using it to generate your own insights and visualisations. One thing you may notice when reading through this notebook is just how customisable everything is, with just a little know-how. Almost everywhere in the notebook, the code, and particularly the values given to variables, is placeholder code that works but is meant to be changed based on what you want to do. This is particularly the case with the example variables that are passed to the visualisation functions towards the end of the notebook.

How the notebook proceeds:

First, the sheets of the ChAT needed for ChAT-report-style visualisations are read in. This is done in a way that lets you supply the name and file location of your own ChAT by changing a few lines of code.

The notebook then performs some basic data cleaning, cutting unnecessary rows and columns, which is a very basic illustration of how Python can be used to automate this type of work.

Following this, some very simple calculations are performed. This illustrates how Python can do the work done by the ChAT relatively easily, and very quickly.

After this, functions are defined for some of the key graphs/visualisations in the ChAT. These are written as functions, rather than writing out the visualisation code each time, so that you can easily visualise different bits of data in the ChAT and reproduce a visualisation with different data by adding a single line of code rather than many. Writing the functions the first time may seem long-winded, but being able to re-use them makes creating further visualisations very quick. The functions contain both simple code that produces the most basic possible visualisation of the input data, and some more complex code that is used to reproduce the visualisations in the ChAT report more closely.

Once the functions are defined, example variables are passed to them to produce visualisations. The example variables are taken from the Early Help section of the ChAT. The outputs are similar to the ChAT's without being exactly the same, so that the code doesn't become too unwieldy. So long as your ChAT has been ingested correctly, this should work out of the box. Along with the ingest part of the code, changing the variables passed to the visualisation functions is the simplest way of getting your own output from this notebook. Largely, all you need to know is which sheet (List 1 through 11 of the ChAT) your data is on, which column the key data is in, and what you want to title the visualisation.

At the end of the notebook, ChAT-style visualisations are created which can easily be copied and pasted into a report or PowerPoint.

Following this introduction, the Markdown cells explain what the code in the Code cells that follow them does. The code itself is heavily commented to make it easier to read, and to make it easy to try out or change yourself.


Future work:
The option to read straight from the Annex A, like the Annex A loader, rather than the ChAT. This would speed up ingest. This requires adding some calculations to the visualisation functions to avoid relying on calculations already performed in the ChAT as this notebook sometimes does.

Work out how to automate report generation without additional libraries/packages.

Add print statements for runtimes to demonstrate the speed of Python.

The following code imports the packages necessary for the notebook, and sets the location to read the ChAT from on your local computer.

import os allows the notebook to access files on your operating system.

Following this, os.chdir (operating system DOT change directory), followed by the filepath where the files you want to use are stored, tells Python where to look for files and allows you to read files into the notebook directly. If you look at the URL bar at the top of the screen, it should say something like localhost:8888. That means you're running Python locally, so any files you read in Python are not being uploaded anywhere.

chat = 'xxxx.xlsx' (in this case chat = 'CHaT_6.9.xlsx') creates a variable called chat. This essentially means that every time you write chat in your Python code, Python reads it as 'xxxx.xlsx'. If you give it the filename of the ChAT file in the directory you have told Python to look in, we can use it later to get Python to read the file in. The filename is stored in a variable, rather than written out in full wherever it is used, to improve readability and to make adjusting the code easier. This is common in Python.
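
Before reading anything in, it can help to confirm that Python can actually see the file. The sketch below is not part of the original notebook: it uses the standard-library pathlib module, and the folder and filename are hypothetical placeholders, just like the ones in the code cell that follows.

```python
from pathlib import Path

# Hypothetical placeholders: replace with your own folder and ChAT filename
folder = Path('.')
chat_name = 'YOUR FILENAME HERE.xlsx'

# Joins the folder and the filename into one path
chat_path = folder / chat_name

# exists() is True only if a file is really there
if chat_path.exists():
    print('Found', chat_path)
else:
    print('Could not find', chat_path, '(check the folder and the filename)')
```

If the file cannot be found, the most likely causes are a typo in the filename or a folder path that points somewhere else.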



In [None]:
import os
import pandas as pd
import numpy as np

from pandas.plotting import table 

import matplotlib.pyplot as plt
import seaborn as sns

os.chdir(r'YOUR FILEPATH HERE')
chat = 'YOUR FILENAME HERE.xlsx'

This cell reads in the data from the ChAT for the specified sheets, List 1 to List 11 (change the ListNum variable to read fewer sheets). There are a number of ways of importing the sheets. One is to iterate through the ChAT, pulling one sheet at a time into its own dataframe. The other is to build a list of the names of the sheets you want, and read the Excel file only once to pull out those sheets. Given that the ChAT is so large, this second option is much quicker.

The second option gives us a dictionary of dataframes whose keys are the sheet names in the ChAT, so the dictionary key 'List 1' gives us the dataframe corresponding to the List 1 sheet of the ChAT. This means that when you want to use data from a particular list in the notebook, you'll need to access the dataframe for that list from the dictionary. This dictionary is called df (short for dataframes).

To put it more simply: the code reads every sheet from List 1 to List 11 of the ChAT into its own dataframe. This set of dataframes, one for each list, is stored in a dictionary, just as one Excel workbook contains lots of sheets. In this notebook, to call on a dataframe from the dictionary, say List 1, we use df['List 1']. Every time we write df['List 1'] and then perform some action, the action is performed on that dataframe. For instance, df['List 1']['Child Unique ID'] would access the Child Unique ID column of the List 1 sheet.
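
As a small illustration of the dictionary-of-dataframes idea, the sketch below builds a dictionary by hand instead of reading it from a ChAT. The two tiny dataframes and their IDs are made up for the example.

```python
import pandas as pd

# Made-up stand-ins for two ChAT sheets (illustrative data only)
df = {
    'List 1': pd.DataFrame({'Child Unique ID': ['A1', 'A2', 'A3']}),
    'List 2': pd.DataFrame({'Child Unique ID': ['A2', 'A4']}),
}

# The keys of the dictionary are the sheet names
print(list(df.keys()))

# First pick the dataframe by sheet name, then the column by column name
ids = df['List 1']['Child Unique ID']
print(ids.tolist())
```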

The following code first initialises a variable called ListNum, where you store the number of lists you want to read from the ChAT. Then, using a for loop, we use ListNum to build the names of every sheet in the ChAT we want to read. This might seem unnecessary, but it lets us change the number of lists we read by editing a couple of digits, rather than writing out the name of every list we want, e.g. List 1, List 2, etc.
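
For comparison, the same list of sheet names can be built in one line with a list comprehension. This is only an alternative sketch, not the approach the notebook's code cell uses, and ListNum is set to 3 here purely to keep the output short.

```python
ListNum = 3  # small number chosen for illustration

# range(1, ListNum + 1) counts 1, 2, 3, so no i+1 adjustment is needed
sheets = [f'List {n}' for n in range(1, ListNum + 1)]
print(sheets)
```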

After this, the dictionary of List dataframes, df, is created by reading the Excel data from the ChAT using the list of sheet names built in the previous step. The first row of each sheet is skipped (skiprows=1) because it contains information that messes up the dataframes, such as the name of the list and the date the information covers. Also, the ChAT stores some empty values as dashes, in addition to empty cells, but Python doesn't automatically read these as empty, so na_values is set to '-' so that dashes are stored as NaN (Not a Number) rather than as text. Values passed to na_values are added to pandas' standard list of missing-value markers as long as keep_default_na is True. True is already the default, but it is written out here to make the intention explicit. This way, every empty cell AND every - is read as NaN.
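
The effect of na_values can be seen on a tiny example. The sketch below uses pd.read_csv on made-up text rather than pd.read_excel on a real ChAT, so that it runs without any file. The na_values and keep_default_na arguments are available on both functions.

```python
import io
import pandas as pd

# Made-up data: one real age, one dash, one empty cell
raw = "Child Unique ID,Age\nA1,4\nA2,-\nA3,\n"

demo = pd.read_csv(io.StringIO(raw), na_values='-', keep_default_na=True)

# Both the dash and the empty cell are now missing values
print(demo['Age'].isna().tolist())
```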

Following this, the dictionary of dataframes is iterated through to clean it up a bit. Because of the way the ChAT works, values are calculated for 5000 rows on every sheet, even rows with no data. This means that every dataframe taken from the ChAT has lots of rows with no data, but which still hold the output of various calculations, namely the errors. Also, some columns are empty. Since the empty rows seem to have values in only one column, the Errors column, rows are dropped with a threshold of 2 (thresh=2), meaning that every row which doesn't have at least two bits of data is dropped. In other words, every row that only has an Error value and nothing else is dropped. Columns are also dropped with how='all', meaning that a column is only dropped when it contains no data at all. It is important to get thresh and how right so you don't drop rows or columns with data you want in them.
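
To see what thresh=2 and how='all' do, here is a minimal sketch on a made-up dataframe. The column names are chosen to resemble a ChAT sheet and are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Two real rows, two rows holding only an error value, and one empty column
demo = pd.DataFrame({
    'Child Unique ID': ['A1', 'A2', np.nan, np.nan],
    'Age': [4, 7, np.nan, np.nan],
    'Errors': ['ok', 'ok', 'error', 'error'],
    'Unused': [np.nan, np.nan, np.nan, np.nan],
})

# Keep only rows with at least two non-missing values
cleaned = demo.dropna(thresh=2)

# Drop only columns where every value is missing
cleaned = cleaned.dropna(how='all', axis=1)

print(cleaned.shape)
print(list(cleaned.columns))
```

The two error-only rows and the empty Unused column are removed, and everything else is kept.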

In [None]:

ListNum = 11 #number of lists you want to read from the ChAT
sheets = [] #initialises a list to store the names of the lists in
for i in range(ListNum): #this loop dynamically creates a list of sheet names to read from the ChAT based on the entry in ListNum
    sheets.append('List '+str(i+1)) #sheet names must use i+1 because range(ListNum) starts at 0
    
    
#reads the specified lists into a dictionary of dataframes, treating dashes as NaNs in addition to standard empty cells
df = pd.read_excel(chat, sheet_name=sheets, skiprows=1, na_values='-', keep_default_na=True)


#iterates through the dictionary of dataframes, cleaning them up by dropping useless rows and empty columns  
for i in sheets:
    df[i] = df[i].dropna(thresh=2)
    df[i] = df[i].dropna(how = 'all', axis=1)
    
    

Now it's time to do some of the calculations from the ChAT using the dataframes we've made. You'll need to know which list the data comes from if you want to change it. These calculations are pretty simple and self-explanatory, and are only really meant as examples. The examples include key operations and methods that are often used with dataframes, such as finding their length and slicing them according to set criteria.
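
The same operations can be tried on a small made-up dataframe first, which makes it easier to check the answers by eye. The column names and values below are invented for the illustration and do not come from the ChAT.

```python
import pandas as pd

demo = pd.DataFrame({
    'Child Unique ID': ['A1', 'A1', 'A2', 'A3'],
    'Gender': ['Male', 'Male', 'Female', 'Male'],
    'Contacts': [2, 2, 1, 3],
})

total_rows = len(demo)                              # number of rows
mean_contacts = demo['Contacts'].mean()             # mean of one column
unique_children = demo['Child Unique ID'].nunique() # number of distinct IDs

# Slicing: the comparison gives True/False per row, and only True rows are kept
is_male = demo['Gender'] == 'Male'
boys = demo[is_male]

print(total_rows, mean_contacts, unique_children, len(boys))
```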

In [None]:
#prints the words total contacts followed by the value of total contacts, converted to a string so the two can be concatenated
TotalContacts = len(df['List 1']) #len counts the length of something, in this instance, the number of rows in the List 1 dataframe
print('Total contacts = ' + str(TotalContacts)) #TotalContacts must be converted to a string to concatenate correctly

#applying .mean() to a column of a dataframe outputs the mean value of that column, in this instance, distinct contacts
AverageContacts = df['List 1']['Contacts per child (distinct count)'].mean() #mean number of contacts per child
print('Mean contacts = ' + str(AverageContacts))

#.nunique() gives us the number of unique data-points in some data, in this instance, the number of Unique Child IDs
TotalEH = df['List 2']['Child Unique ID'].nunique() 
print('Total number of children with EH = ' +  str(TotalEH))

#counts the number of males and females with EH contacts by slicing the dataframe by gender and counting their length
#the following code returns the length of the Gender group column of List 2, where the values are a) Male and b) Female
#it does this by slicing according to those rules (e.g. == 'a) Male') and getting the length
EHBoys = len(df['List 2'][df['List 2']['Gender group'] == 'a) Male'])
EHGirls = len(df['List 2'][df['List 2']['Gender group'] == 'b) Female'])

print('Total EH Boys/Girls = ' + str(EHBoys) + '/' + str(EHGirls)) 



Now that we've seen how to do some basic calculations, it's more important to produce the type of visualisations we see in the ChAT report. We'll do this by writing functions that allow us to pass in different variables each time, rather than writing out the code for the visualisation every time. First up are the male/female back-to-back bar charts from the ChAT. Getting the basic counts of males and females is easy, as we can just use seaborn countplots. Formatting the charts back-to-back like in the ChAT is harder.

In [None]:

#back-to-back chart: only rows where the child's gender is recorded as Male or Female are used, so unborn children are excluded

def StackPlot(ListName, Column, Title):
    '''Creates a male/female back-to-back countplot for specified data as seen in the ChAT
    ListName is the list the data is stored in
    Column is the column that needs to be counted'''
    
    #separates male and female data, this also drops rows with unborn children like the ChAT
    #reads in ListName variable from the function so this will plot the graph for any of the 11 ChAT lists
    dataM=df[ListName][df[ListName]['Gender'] == 'Male'] #dataframe of just data for males
    dataF=df[ListName][df[ListName]['Gender'] == 'Female'] #just data for females

    #boy/girl counts for titles to match ChAT
    #counts the length of male/female dataframes to give numbers of each
    CountBoys = len(dataM) 
    CountGirls = len(dataF)
    
    #calculates percentages of boys to girls for the plot titles
    Gpercent = round((CountGirls /(CountBoys+CountGirls))*100)
    Bpercent = round((CountBoys /(CountBoys+CountGirls))*100)
    
    #set-up the figure for the different graph axes to go on, with two columns, specifying size, that they have the same
    #y values, and there is no space between them
    fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 5), sharey=True, gridspec_kw={'wspace': 0})
    
    #creates a countplot of the female dataframe dataF, reading in Column from the function as the y data
    sns.countplot(data=dataF, y=Column, ax=ax2, color='Orange', alpha=0.6) #colour set to orange and alpha set to resemble ChAT
    ax2.yaxis.set_label_position('right') #sets where the axis label is
    ax2.set_title('Female ('+ str(CountGirls) +'/' + str(Gpercent) +'%)', loc='left') #title, including number and percent of girls calculated above

    #male data and remove labels and ticks for formatting, largely same as above
    sns.countplot(data=dataM, y=Column, ax=ax1, color='Orange')
    ax1.invert_xaxis()  # reverse the direction
    ax1.tick_params(labelleft=False, left=False) #removes ticks that the other graph has already added
    ax1.set_ylabel('') #removes y labels we already have
    ax1.set_title('Male ('+ str(CountBoys) +'/' + str(Bpercent)  +'%)', loc='right')

    fig.suptitle(Title) #reads in title from function
    plt.tight_layout() 
    plt.show() #shows the plot
    


Next are the donut charts seen in the ChAT. Like the back-to-back chart, the function lets us pass in the name of the List sheet from the ChAT, the name of the column we are examining, and the title we want. It also calculates the values the chart should display. We turn an ordinary pie chart into a donut chart by adding a white circle to the centre, and we overlay additional information in the centre: here, the counts that correspond to the percentages on the chart.
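
One detail worth knowing when labelling the wedges of a pie or donut chart is that value_counts() and unique() can return the answers in different orders. The sketch below uses made-up Yes/No answers to show this. Labels taken from the index of the counts always line up with the wedge sizes.

```python
import pandas as pd

# Made-up answers: 'No' appears first, but 'Yes' is more common
answers = pd.Series(['No', 'Yes', 'Yes', 'Yes', 'No', 'Yes'])

counts = answers.value_counts()

# value_counts orders by how often each answer appears
print(counts.index.tolist())

# unique orders by first appearance in the data
print(answers.unique().tolist())
```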

In [None]:
def Donut(ListName, Column, Title):
    '''Creates a donut chart with some data points in the middle and percentages on the outside'''
 
    #data for the chart, counts the number of each unique variable in a column of the dataframe using Column and ListName input
    Counts=df[ListName][Column].value_counts()
    
    #Works out the text for the center of the donut, same as above but stores as dataframe for displaying nicely 
    CenterText = df[ListName][Column].value_counts().to_frame() #getting the count values for the centre
    
    #takes the labels from the index of Counts so that each label lines up with its own wedge
    #(.unique() returns values in order of first appearance, which need not match the order of the counts)
    PossibleAnswers = Counts.index #getting the possible answers for the labels

    
    #creates the figure and plots a matplotlib pie chart using specified ChAT-like colours, rounds percentages, uses Possible answers as labels
    fig, ax1 = plt.subplots()
    ax1.pie(Counts, colors=['#D1D1D1','#B874FC'], autopct='%1.0f%%', pctdistance=0.85, labels=PossibleAnswers)
    
    #adds center text to the donut, with ha (horizontal alignment) and va (vertical alignment) center
    ax1.text(0., 0., CenterText.to_string(header=False), horizontalalignment='center', verticalalignment='center')
    
    
    #draws a white circle for the center of the pie chart to make it a donut and then places it in the center
    centre_circle = plt.Circle((0, 0), 0.60, fc='white') #draws the circle
    fig = plt.gcf()
    fig.gca().add_artist(centre_circle) #adds the circle to the figure
    
    
    fig.suptitle(Title) #titles using input title
    plt.show()

#commented out test: 
#EHRefList = Donut('List 2', 'Appears on referral list', 'EH Cases that also appear on the Referrals list')

Now let's create a function for the ChAT bar chart used to display sources of things like assessments and referrals. The function plots the bars, then creates a dataframe giving the percentage of records that come from each source, to match the data given in the ChAT. That dataframe is then used to place the percentages on the bar chart one bar at a time.
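
The percentage step can be tried on its own with a handful of made-up sources. This is a sketch of the same idea on invented data, with value_counts(normalize=True) giving proportions that are then turned into rounded percentages.

```python
import pandas as pd

# Made-up sources (illustrative only)
sources = pd.Series(['School', 'Police', 'School', 'Health',
                     'School', 'Police', 'School', 'School'])

# Proportion of records from each source, largest first
percents = sources.value_counts(normalize=True)

# Turn the result into a two-column dataframe with readable names
percents = percents.rename_axis('Sources').reset_index(name='Percents')

# Convert proportions to percentages rounded to one decimal place
percents['Percents'] = (percents['Percents'] * 100).round(1)

print(percents)
```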

In [None]:
def SourceBar(ListName, Column, Title):
    '''creates a ChAT-like graph that's used to show sources of reports and contacts'''
    
    #initialises the countplot graph using function input values, and orders them according to size
    fig= sns.countplot(data=df[ListName], x=Column, color='#7ED6D2', order=df[ListName][Column].value_counts().index)
    
    
    #removes unnecessary labels by passing them blank strings
    fig.set_xlabel("")
    fig.set_ylabel("")
    
    #because of the jerry-rigged method of adding percentages to the chart, title placement doesn't work right
    #this gives specific xy coordinates for the title, size, and rotation. It also rotates the y-ticks to be readable.
    plt.title(Title, x=-0.1, y=-1, fontsize=16, rotation = 90)
    plt.yticks(rotation=90)
    
    #creates a dataframe of rounded percentages of sources to be added to the bar chart to closer match the ChAT
    #This isn't necessary but includes information the ChAT does
    SourceByPercent=df[ListName][Column].value_counts(normalize=True) #new dataframe with counts of instances of unique variables and normalises
    SourceByPercent = SourceByPercent.rename_axis('Sources').reset_index(name='Percents') #renames index and column for ease
    SourceByPercent['Percents'] = round(SourceByPercent['Percents']*100, 1) #returns normalised values as rounded percentages
    
    
    #jerry-rigged code to place percentages of each source on top of the bars of the chart
    #uses a for loop to annotate each bar (patch) in turn, enumerate gives the position i of each bar, starting at 0
    for i, p in enumerate(fig.patches):
        #accesses SourceByPercent by index location matching i and returns percentage as a string to be annotated
        #gets height and location of top of bar for annotation and places va ha center, rotates
        fig.annotate(str(SourceByPercent['Percents'].iloc[i])+'%', (p.get_x() + p.get_width() / 2., p.get_height()),
                     ha='center', va='center', fontsize=10, color='black', xytext=(0, 15), rotation=90,
                     textcoords='offset points')

    #rotates x tick labels        
    for label in fig.get_xticklabels():
        label.set_rotation(90)
    
    
    plt.show()

#commented out test:
#EHAssSource = SourceBar('List 2', 'Organisation completing assessment', 'Organisation completing assessment')   

The ChAT includes graphs showing how many people have had more than one contact/episode/referral. This would be a very simple countplot if we didn't care how it looked. However, we do. By default, no bar is shown where the number of incidences is zero, but this doesn't match the ChAT and, in some instances, leaves a single big square bar of height 1 and width 1, which looks silly. To fix this, we build a new dataframe with counts for each number of instances, so that bars with zero instances are still shown.
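
As an aside, pandas can also produce the zero counts directly. The sketch below is an alternative to the approach used in the function that follows, not a description of it. It uses made-up numbers, groups everything of 5 or more together with clip, and uses reindex with fill_value=0 so that categories with no cases still appear.

```python
import pandas as pd

# Made-up number of records per child (illustrative only)
per_child = pd.Series([2, 2, 4, 2, 7, 4])

# Anything of 5 or more is counted together as 5
capped = per_child.clip(upper=5)

# Count each value, then force the categories 2, 3, 4 and 5 to exist
counts = capped.value_counts().reindex([2, 3, 4, 5], fill_value=0)

print(counts.tolist())
```

Nobody in the example has exactly three records, but a zero is still reported for that category, so a bar chart built from counts would keep its place on the axis.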

In [None]:
def MultiplesGraph(ListName, Column, Title, YTicks):
    '''sets up a graph that is used to show how many people have had each of a number of incidences'''
    
    
    #setting up the counts of number of cases, each x is an integer count of the cases matching the condition
    x2 = len(df[ListName][df[ListName][Column] == 2])
    x3 = len(df[ListName][df[ListName][Column] == 3])
    x4 = len(df[ListName][df[ListName][Column] == 4])
    x5 = len(df[ListName][df[ListName][Column] >= 5])
    #creates a dictionary matching strings of incident numbers, and counts associated with them
    lst = {'2':x2, '3':x3, '4':x4, '5+':x5} #the last bar covers 5 or more, matching the >= 5 condition above
    #turns that dictionary into a dataframe to more easily make a barplot
    dta = pd.DataFrame(list(lst.items()), columns = ['Number','Count'])

    #the graph
    fig = sns.barplot(data=dta, x='Number', y='Count', color='#7ED6D2')
    
    #allows user to input a suitable number of y-ticks based on data size using function variables
    #change these according to what looks best on the output
    fig.set(yticks=range(YTicks+1))
    
    #graph settings like title taken from function variables
    fig.set_title(Title)
    fig.set_xlabel(Column)
    fig.set_ylabel('')
    
    
    plt.show()

#commented out test:
#MultiplesGraph('List 2', 'EH per child (distinct count)', 'Children with multiple records in period', 4)

The final visualisation on the first few pages of the ChAT is the data about ethnic backgrounds. It's pretty simple: pandas lets us use matplotlib to draw a dataframe as a figure, for easy copying and pasting.
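
If you also want a row for children whose background was not recorded, one possible approach (a sketch on made-up data, not what the function below does) is to label the missing values before counting, so that they get their own share of the total.

```python
import numpy as np
import pandas as pd

# Made-up ethnic group column with two missing entries (illustrative only)
groups = pd.Series(['White', 'White', 'White', 'White', 'White',
                    'Asian', 'Asian', 'Mixed', np.nan, np.nan])

# Label missing values first so value_counts includes them
shares = groups.fillna('Not Recorded').value_counts(normalize=True)

# Convert to whole-number percentages in a one-column dataframe
table_data = (shares * 100).round().to_frame(name='Percent')

print(table_data)
```

Because the missing values are counted along with everything else, the percentages describe all children on the list rather than only those with a recorded background.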

In [None]:
def EthnicBackgrounds(ListName, Title):
    '''creates a dataframe of ethnic backgrounds and plots the dataframe as an image for sharing'''
    
    #creates a dataframe of the different ethnic groups, providing a normalised count of each
    #the column is named explicitly because the default column name differs between pandas versions
    backgrounds = df[ListName]['Ethnic group'].value_counts(normalize=True).to_frame(name='Ethnic group')
    backgrounds['Ethnic group'] = round(backgrounds['Ethnic group']*100) #returns normalised values as rounded percentages
    backgrounds.index.names = ['Background'] #renames the index 
    
    #attempts to provide data on non-reported backgrounds, needs to be fixed
    NoRecord = {'Not Recorded':101-backgrounds['Ethnic group'].sum()}

    #plots the dataframe as a figure for sharing
    ax = plt.subplot(111, frame_on=False) # no visible frame
    ax.xaxis.set_visible(False)  # hide the x axis
    ax.yaxis.set_visible(False)  # hide the y axis
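    ax.set_title(Title) #set_title is a method, so the title is passed in brackets
    table(ax, backgrounds)  
    
    plt.savefig('mytable.png', bbox_inches='tight') #saves the table as a png in the working directory
    plt.show() #shows the table in the notebook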
    
#commented out test:    
#EthnicBackgrounds('List 2', 'Ethnic Backgrounds (EH)')

In the cell below, the visualisation functions defined earlier are given the inputs needed to recreate the information on the Early Help page of the ChAT. The functions can be called again with different inputs to get other key ChAT visualisations. For instance, we could create the contacts report from the ChAT by using the information relevant to List 1.

In [None]:
#Early help cases visualisations
EHCases = StackPlot('List 2', 'Age of Child (Years)', 'Early Help Cases (Total =' + str(TotalEH) +')')
EHRefList = Donut('List 2', 'Appears on referral list', 'EH Cases that also appear on the Referrals list')
EHAssSource = SourceBar('List 2', 'Organisation completing assessment', 'Organisation completing assessment')
MultiplesGraph('List 2', 'EH per child (distinct count)', 'Children with multiple records in period', 4)
EthnicBackgrounds('List 2', 'Ethnic Backgrounds (EH)')




It is entirely possible to use Python, and a notebook like this, to produce an automated report or dashboard where you would change the inputs (e.g. a new ChAT file, or the visualisations you wanted) and a nicely formatted report would come out, just like in the ChAT or Cut the ChAT. However, given that it isn't always easy to get Python/Jupyter/Anaconda installed at all on an LA machine, it's best to work on the assumption, for now, that you won't be able to install non-standard packages like Jupyter Dashboard, Dash, or Panel for this. That leaves a number of options. The first is simply to copy and paste your output visualisations into a Word document or PowerPoint for presentation, if all you want is the visualisations. It's also possible to share the entire notebook, complete with code, to better explain what you did and why. Jupyter also has a presentation/slide tool where you can present code and run it in real time.