In [None]:
%%html

<script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }
  
  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>

# Provincial Achievement Test Scores

## Introduction

Every year, the province of Alberta runs standardized tests in primary courses for grades 6 and 9, under the blanket name of Provincial Achievement Tests, to assess how well students perform. The results of these tests are openly available and can be downloaded from the Alberta Education website. In this notebook we're going to download and manipulate the data directly from Alberta Education and see if we can easily identify under- and over-performing school districts. Time permitting, we might even toss these onto a map using another open data set from Alberta Education, which contains the address of every school in Alberta. Combining that data with the provincial test scores should let us identify which school districts and schools are performing best and worst.

## Wrangling the data

First, let's download the data directly from the Alberta Education website and toss it in a pandas data frame.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numpy import nan as Nan


df_ero = pd.read_excel("https://education.alberta.ca/media/3680591/pat-multiyear-sch-list.xlsx")

That was pretty easy, as the data are hosted as Excel spreadsheets. We don't even have to save the file locally; we can toss it straight into a pandas data frame.

In [None]:
school_results = df_ero.copy()
school_results[0:5]

The data format above is going to be annoying to work with when we want to plot or sort. Instead, let's whip this table into "long form" so that we can manipulate, analyze and plot it more easily. We do this with the code below. Notice that we now have duplicate entries in the "Authority Name" and "School Name" columns, as well as a handy year column for each row.

In [None]:
import re
import time

start = time.time()
schools_reshaped = school_results.copy()

# If there's a year in the column we want to split that bad boy
def splitter(string):
    r = re.compile(r'\d{4}|\S.*$')
    return r.findall(string)

cols = list(schools_reshaped)[0:8]
years = ['2013', '2014', '2015', '2016', '2017']

# Being lazy and creating duplicate columns with a year index. It's the same 
# across the board but we need them for the next step. 
# The key is to not respect your RAM. 
for year in years:
    for names in cols:
        schools_reshaped[str(year +" "+names)] = schools_reshaped[names]


schools_reshaped.columns = pd.MultiIndex.from_tuples([tuple(splitter(c)) for c in schools_reshaped.columns])
schools_reshaped = schools_reshaped.stack(0).reset_index(1)

end = time.time()
print(end - start)
schools_reshaped.rename(columns={'level_1': "Year"}, inplace=True)
#schools_reshaped[["School Name", "Course Name", "Sch Enrol", "Year", "Sch Writing"]].loc[schools_reshaped['Year'] == '2013']

# Sort by school name. 
schools_reshaped=schools_reshaped.sort_values('School Name')

schools_reshaped[0:5]


Excellent. Now that the data have been reshaped into "long form", they'll be a lot easier to work with when it comes to plotting and analysis. So, let's start to get an idea of the score distributions between schools and districts by using this data frame as the back end of an interactive widget.
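
To see what a wide-to-long reshape does in miniature, here is a minimal sketch on a made-up table. The school names and numbers are hypothetical, and it uses `melt` and `pivot` rather than the `MultiIndex`/`stack` approach in the cell above, so treat it as an illustration of the idea, not a copy of the notebook's method.

```python
import pandas as pd

# Hypothetical toy data: one row per school, one column per (year, measure).
wide = pd.DataFrame({
    "School Name": ["A", "B"],
    "2013 Sch Enrol": [30, 45],
    "2014 Sch Enrol": [32, 40],
    "2013 Sch Writing": [28, 44],
    "2014 Sch Writing": [31, 39],
})

# Melt to one row per (school, original column).
melted = wide.melt(id_vars="School Name", var_name="column", value_name="value")

# Split a name like "2013 Sch Enrol" into a year part and a measure part.
parts = melted["column"].str.extract(r"^(?P<Year>\d{4})\s+(?P<Measure>.*)$")
melted = pd.concat([melted, parts], axis=1)

# Pivot the measures back out so each row is one school in one year.
tidy = melted.pivot(index=["School Name", "Year"],
                    columns="Measure", values="value").reset_index()
tidy.columns.name = None
print(tidy)
```

Each school now appears once per year, which is what makes filtering by year or grouping by authority straightforward later on.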

## Interactive Graph

Before we start any more "involved" analysis, let's take a moment to plot these data by year to get an idea of what we're working with. In the widget below, `_type` controls whether we're looking at individual schools or school authorities, `name` is the name of the school/authority, `subject` changes the subject, and `name2` is optional and displays another school/authority to compare with. Note that switching to schools is a little slower, as that data set requires some setup before we can put it nicely into the widget. Also note that not all subjects are offered in each school, so the subject list is filtered down by what was offered in the school/authority under `name`.
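
The subject filtering described above can be sketched as follows. This is a simplified stand-in, not the notebook's exact rule: the data are invented, `offered_courses` is a hypothetical helper, and it keeps a course if it has at least one reported score, whereas the cell below uses a threshold on the number of missing years.

```python
import numpy as np
import pandas as pd

# Hypothetical long-form results: NaN means no result was reported.
results = pd.DataFrame({
    "School Name": ["A", "A", "A", "A", "B", "B"],
    "Course Name": ["Mathematics 6", "Mathematics 6", "Science 6", "Science 6",
                    "Mathematics 6", "Science 6"],
    "Sch % Acc of Writing": [80.0, 85.0, np.nan, np.nan, 70.0, 65.0],
})

def offered_courses(frame, name):
    """Return the courses for which school `name` has at least one score."""
    rows = frame[frame["School Name"] == name]
    # count() ignores NaN, so a course with only missing scores counts as 0.
    n_scores = rows.groupby("Course Name")["Sch % Acc of Writing"].count()
    return sorted(n_scores[n_scores > 0].index)

print(offered_courses(results, "A"))
print(offered_courses(results, "B"))
```

School A only reports Mathematics 6 here, so Science 6 would be left out of its dropdown.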

In [None]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
from ipywidgets import interact
init_notebook_mode(connected=True)
from ipywidgets import Dropdown



In [None]:
# Now let's do the same with school districts
# print(list(schools_reshaped))

def traces(name, subject, school_or_auth):
    # Filter on both the name and the subject *before* computing anything, so
    # the years on the x axis line up with the percentages on the y axis.
    result = schools_reshaped[schools_reshaped[school_or_auth] == name]
    result = result[result['Course Name'] == subject].sort_values('Year')
    x = y = y2 = y3 = None

    if school_or_auth == "Authority Name":
        # Sum the raw counts over every school in the authority, per year.
        # Absent students are excluded from the denominator.
        by_year = result.groupby("Year")
        divisor = by_year['Sch Enrol'].sum() - by_year['Sch Absent'].sum()
        y = 100 * (by_year['Sch Acc'].sum() - by_year['Sch Exc'].sum()) / divisor
        y2 = 100 * by_year['Sch Exc'].sum() / divisor
        y3 = 100 * by_year['Sch Below'].sum() / divisor
        x = y.index  # one bar per year, matching the grouped sums

    if school_or_auth == "School Name":
        # The per-school percentages are already in the data. The excellence
        # share is subtracted from the acceptable share to keep bars distinct.
        x = result['Year']
        y = result['Sch % Acc of Writing'] - result['Sch % Exc of Writing']
        y2 = result['Sch % Exc of Writing']
        y3 = result['Sch % Below of Writing']

    trace1 = go.Bar(x=x, y=y,
                    name=" ".join([name, '% at or above acceptable standard']))
    trace2 = go.Bar(x=x, y=y2,
                    name=" ".join([name, '% achieved a standard of excellence']))
    trace3 = go.Bar(x=x, y=y3,
                    name=" ".join([name, "% below acceptable standard"]))

    return [trace1, trace2, trace3]



def compare_results( _type, name, subject, name2 = []):
    
    print(name, subject, _type)
    
    data = traces(name, subject, _type)
    
    if name2: 
        data2 = traces(name2, subject, _type)
        data = data + data2
    
    layout = go.Layout(title=subject,
                xaxis=dict(title='Year'),
                yaxis=dict(title='Percentage',
                      range = [0,100])
                      )
    
 
    fig = go.Figure(data=data, layout=layout)
    iplot(fig)

    
def course_drop(_type, name):
    courses = list(schools_reshaped['Course Name'].unique())
    filtered_course_list = []
   
    for course in courses:
        result = schools_reshaped[schools_reshaped[_type] == name]
        result = result[result['Course Name'] == course]
        if _type == "School Name":
            y = result['Sch % Acc of Writing']
       
        if _type == "Authority Name":
            y = result.groupby("Year")["Sch Acc"].sum() - result.groupby("Year")["Sch Exc"].sum()

        if y.isnull().sum() > 4 or y.empty:
            # No results for this course at this school/authority: skip it.
            pass
        else:
            # If something exists, we'll count it.
            filtered_course_list.append(course)
            
    if len(filtered_course_list) == 0:
        # TODO: make an empty thing instead of pretending they do math
        filtered_course_list.append("Mathematics 6")
    return filtered_course_list

course_widget = Dropdown()

type_widget = Dropdown(options = ["School Name", "Authority Name"], value = "School Name")

name_widget = Dropdown()
name_widget2 = Dropdown()

    
def update2(*args):
    a = sorted(list(map(str, list(schools_reshaped[type_widget.value].unique()))))
    name_widget.options = a
    name_widget2.options =  a
    name_widget2.value = None
    # course_widget.options = course_drop(type_widget.value, x_widget.value)
    name_widget.value = a[0]
    

    
def update(*args):
    course_widget.options = course_drop(type_widget.value, name_widget.value)

name_widget.observe(update)  
#type_widget.observe(update)
type_widget.observe(update2)



interact(compare_results, 
        _type = type_widget,
         name = name_widget,
         subject =  course_widget,
         name2 = name_widget2
        )




Fantastic. Now we can compare which schools do well and which do poorly, and in what subject. Note that the first school/authority, `name`, is used to filter out subjects that have no data. That means you might not see every subject that is available for the school/authority in `name2`. Also note that if a school/authority has no test scores, the widget defaults to a blank grid for Mathematics 6.

## Makin' A Map

NOTE: This section is time consuming and not that interesting yet. Feel free to skip it entirely, as it only adds a map.

First, we need to get the GPS coordinates of all those schools into the data frames. We can do that first by downloading the location data from Alberta Ed and putting it into a separate data frame, which we do thusly: 

In [None]:
df_loc =  pd.read_excel("https://education.alberta.ca/media/1626669/authority_and_school.xlsx", skiprows=[0,1])
df_loc

Now we create a new frame which contains the school name and the latitude and longitude coordinates, using the `geocoder` Python module. Note that this is rate limited, as we're basically running a bunch of Google queries, and they tend to get unhappy if you do that too fast. So this can take a few minutes to complete. Also note that because this is "big business" you can only get 2,500 locations a day from Google. So this is what I would consider "open-ish" data.

NOTE: This is only for a map and it takes probably about an hour to gather all of the latitude and longitude coordinates of each school. So, if you don't want to gather that data, that's okay. However, pay attention to some merge notes of frames so you don't run into problems.
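
The retry-and-fall-back pattern used for the lookups can be sketched without touching the network. In this sketch `lookup_with_retry` and `fake_lookup` are hypothetical names, the stub stands in for a real geocoder call, and the postal code and coordinates are made up; the idea is only to show the control flow of retrying, then falling back to the first three characters of the postal code.

```python
import time

def lookup_with_retry(lookup, postal_code, max_tries=5, delay=0.0):
    """Call `lookup(query)` until it returns a (lat, lng) tuple.

    `lookup` is a stand-in for a geocoder call; it may raise or return None
    on failure. Halfway through the budget of tries, the query falls back to
    the first three characters of the postal code.
    """
    query = postal_code
    for attempt in range(1, max_tries + 1):
        time.sleep(delay)  # throttle so the service doesn't rate-limit us
        try:
            coords = lookup(query)
        except Exception:
            coords = None
        if coords is not None:
            return coords
        if attempt == max_tries // 2:
            query = postal_code[:3]
    return None, None

# Stub geocoder: only the three-character prefix "resolves".
calls = []

def fake_lookup(code):
    calls.append(code)
    if len(code) == 3:
        return (53.5, -113.5)
    raise ValueError("not found")

coords = lookup_with_retry(fake_lookup, "T5J 0K1", max_tries=4)
print(coords, calls)
```

Bounding the number of tries up front keeps a bad postal code from stalling the whole run, which matters when each lookup is already slow.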



In [None]:
import geocoder 
temp_df = pd.DataFrame()
temp_df["School Name"] = df_loc['School Name']
from IPython.display import clear_output

def get_coords(postal_code):
    count = 0
    clear_output(wait=True)
    print("Looking up coordinates...")
    
    # Missing postal codes arrive from pandas as NaN rather than None.
    if pd.isnull(postal_code):
        return None, None
    
    while True:
        # Need to slow down the loop so we don't get throttled by asking 
        # too many queries. Google also rate-limits you to 50 per second.
        time.sleep(.1)
        count += 1
        
        # If you want to save the data you MUST use the google
        # geocoder. Otherwise you're violating TOS
       #g = geocoder.google(postal_code)

        g = geocoder.arcgis(postal_code)
        
        try:
            to_return =  g.json['lat'], g.json['lng']
            print(g.json['lat'], g.json['lng'])
            if count > 1:
                print("Finally grabbed", postal_code, "on try", count)    
            break 
       
        except:
            
            print("tried", postal_code, count, "times")
            
            if count > 25:
                print ("I don't think", postal_code, "exists -- trying approximate",postal_code[0:3] )
                postal_code = postal_code[0:3]
           
            if count > 50:
                print ("I don't think", postal_code, "exists or you've timed out")
                to_return = None, None 
                break
            continue
            
    return to_return



temp_df["PC"] = df_loc['School Postal Code']
# Here we're applying our function to every entry of the column
temp_df["coords"] = temp_df["PC"].apply(get_coords)




In [None]:
# Split the (lat, lng) tuples into the two columns the map cell expects.
temp_df[['lat', 'long']] = temp_df['coords'].apply(pd.Series)

# Save the results so the slow lookup doesn't have to be repeated.
temp_df.to_csv("Coordinates_and_colors.csv")

In [None]:
list_of_districts = list(schools_reshaped["Authority Name"].unique())
len(list_of_districts)


a = {}
# create dictionary of schools - district
for schools in schools_reshaped["School Name"].unique():
    try:
        a[schools] = list(schools_reshaped[schools_reshaped["School Name"] == schools]["Authority Name"])[0]
    except:
        print(list(schools_reshaped[schools_reshaped["School Name"] == schools]["Authority Name"]), schools)


districts = pd.DataFrame(list(a.items()), columns=['School Name', 'Authority Name'])
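# Create colour dictionary of all districts. endpoint=False keeps the first and
# last districts from sharing a colour, since hues 0 and 360 are the same red.
c = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(0, 360, len(list_of_districts), endpoint=False)]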

b = {}

for i, district in enumerate(list_of_districts):
    b[district] = c[i]
    
for keys in a:
    colour = b[a[keys]]
    a[keys] = str(colour)
    



In [None]:
colors = pd.DataFrame(list(a.items()), columns=['School Name', 'Color'])


temp_df = pd.merge(temp_df, colors,on='School Name', how='left')
temp_df = pd.merge(temp_df, districts, on='School Name', how='left')



In [None]:
from plotly.graph_objs import *
# Again, this is semi-open data. Mapbox is free to use, but you need to create
# an account at mapbox.com and generate an access token.
mapbox_access_token = 'YOUR_API_KEY_HERE'


# temp_df = pd.merge(temp_df,schools_reshaped, on="School Name")

data = Data([
    Scattermapbox(
        lat=temp_df['lat'],
        lon=temp_df['long'],
        mode='markers',
        marker=Marker(
            size=5,
            color= temp_df["Color"], 
            opacity=0.7
        ),
        text=temp_df['School Name']+ "<br>" + temp_df['Authority Name']
  
    ),
    Scattermapbox(
        lat=temp_df['lat'],
        lon=temp_df['long'],
        mode='markers',
        marker=Marker(
            size=8,
            color= temp_df["Color"],
            opacity=0.7
        ),
        text=temp_df['School Name'] + "<br>" + temp_df['Authority Name'],
        hoverinfo= None #temp_df['School Name']
    )]
)

layout = Layout(
    title='Alberta Schools',
    autosize=True,
    hovermode='closest',
    showlegend=False,
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=53.544389,
            lon=-113.4909267
        ),
        pitch=0,
        zoom=3.5,
        style='light'
    ),
)



fig = { 'data':data, 'layout':layout }
iplot(fig)

Feel free to scroll around on this map to your heart's content. Each marker represents the approximate location of a school and is colored according to the school district it falls under. Hover over a marker to see the location, school name, and district name. Some schools are plotted in plain black. This means we didn't have enough data to label them properly; they're likely schools that don't report data very often.
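
As a rough sketch of how the colouring works, the snippet below assigns one evenly spaced hue per district and attaches it to each school with a left merge. The school and district names are invented for illustration. A school with no matching district ends up with a missing colour value, which is presumably why such schools show up black on the map.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: every school we have a location for, and the subset
# of schools for which we know the district.
schools = pd.DataFrame({
    "School Name": ["North School", "South School", "East School"],
})
districts = pd.DataFrame({
    "School Name": ["North School", "South School"],
    "Authority Name": ["District A", "District B"],
})

# One hue per district, spread evenly around the colour wheel.
names = sorted(districts["Authority Name"].unique())
hues = np.linspace(0, 360, len(names), endpoint=False)
colour_of = {n: "hsl({:.0f},50%,50%)".format(h) for n, h in zip(names, hues)}

# A left merge keeps every school, even those without a district.
merged = schools.merge(districts, on="School Name", how="left")
merged["Color"] = merged["Authority Name"].map(colour_of)
print(merged)
```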

NOTE: If you haven't created the map, the merge below won't work by default. See the comment in the first code cell of the next section for the line to uncomment.

## How Do Differences in Funding Affect Student Performance?

The code below assumes you've downloaded all of the PDFs containing each district's funding information from the Alberta Education site. If you don't have them, you can either download those PDFs yourself (not recommended) or get them from our swift container.

TODO: Put them in swift container/if somewhere else you might need them. 
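
For a sense of what pulling numbers out of the PDF text involves, here is a small sketch that extracts dollar amounts from a single line. The sample line is made up and the real PDFs may be laid out differently; `parse_amounts` is a hypothetical helper, and the funding cell below splits on whitespace instead of using a regular expression.

```python
import re

def parse_amounts(line):
    """Return every dollar amount in `line` as an int, in order of appearance."""
    return [int(m.replace(",", "")) for m in re.findall(r"\$\s*([\d,]+)", line)]

# Hypothetical line of extracted PDF text: estimated, then projected funding.
line = "TOTAL FUNDING $12,345,678 $12,900,000"
estimated, projected = parse_amounts(line)
print(estimated, projected)
```

Matching on the `$` sign rather than on word position makes the parse a little less sensitive to extra words appearing in the label.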


Most of the code below is just wrangling data and making plots of that data. What we're doing is gathering all our funding data, combining it with our data frames and then plotting it. What we'll then have is the performance of each district against the provincial average in terms of test scores for each year and subject, as well as a graph of how those test scores were affected by differences in _total_ funding. To do that, we plot the density of funding and performance grades for the entire province, and then fit a line to it in order to judge positive/negative correlation between funding and grade performance.
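
The line-fitting step can be sketched on synthetic data. The numbers below are generated, not taken from the Alberta data, and the negative slope is built in on purpose; the point is only to show how `np.polyfit` with `cov=True` gives both the coefficients and a rough uncertainty on them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: x is a percentage score, y is funding per student.
x = rng.uniform(40, 100, size=200)
y = 12000 - 20 * x + rng.normal(0, 300, size=200)

# Degree-1 fit; cov=True also returns the covariance matrix of the coefficients.
(slope, intercept), cov = np.polyfit(x, y, deg=1, cov=True)
slope_err, intercept_err = np.sqrt(np.diag(cov))

print("slope = {:.1f} +/- {:.1f}".format(slope, slope_err))
# Only call the trend negative if the slope is below zero by two standard errors.
print("negative trend" if slope + 2 * slope_err < 0 else "no clear negative trend")
```

The standard errors are only meaningful if the scatter around the line is roughly normal, which is why the notebook also inspects the residuals.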

In [None]:


# If you don't have the LAT LONG data, uncomment the line below and run this cell .
# temp_df = schools_reshaped.copy()


In [None]:
from tika import parser
import requests
import glob
import re
import pandas as pd

def convert(x):
    try:
        return x.astype(int)
    except:
        return x
   
def get_funding_data(paths = "FundingPdf/*.pdf"):
    data = []
    count = 0
    for file in glob.iglob(paths):
        parsedPDF = parser.from_file(file)
    
        name = file.split("/")[-1]
        name = name.replace(".pdf", "")
        name = name.replace("-", " ").title()
        # Raw string so the backslashes reach the regex engine untouched.
        name = re.sub(r"[\(\[].*?[\)\]]", "", name)
        # str.replace never raises, so no try/except is needed here.
        name = name.replace(" No ", " No. ")
        name = name.replace(" Ltd", " Ltd. ")

        name = name.strip()

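        # Reset per file so a PDF without a "TOTAL FUNDING" line doesn't reuse
        # the previous file's value. The name must match the one assigned below.
        estimate_funding = None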
        projected_funding =  None
        estimated_enroll =  None
        projected_enroll =  None
        year = None
    
        for line in parsedPDF['content'].split('\n')[::-1]:

            if "TOTAL FUNDING" in line:
                estimate_funding = line.split()[2].replace('$',"").replace(",","")
                projected_funding = line.split()[3].replace('$',"").replace(",","")
        
            if "As of " in line:
          
                try:
                    print(int(line.split()[-1]))
                    year = line.split()[-1]
                except:
                    pass
        
            if "Funded Enrolment for Grades 1 - 12" in line:
           
                estimated_enroll = line.split()[7].replace(",", "")
                projected_enroll = line.split()[9].replace(",", "")
            elif "Enrolment for Grades 1 - 12" in line:
           
                estimated_enroll = line.split()[6].replace(",", "")
                projected_enroll = line.split()[8].replace(",", "")



        data.append([name, estimate_funding, projected_funding, estimated_enroll, projected_enroll, year])
   

    df = pd.DataFrame(data, columns = ["Authority Name", "Estimated Funding", "Projected Funding", "Estimated 1-12", "Projected 1-12","Year"])
    df.to_csv("district_funding.csv")

In [None]:


# Add district funding 
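try:
    funding = pd.read_csv("district_funding.csv")
except FileNotFoundError:
    # No cached CSV yet: parse the PDFs (which writes the CSV), then load it.
    get_funding_data()
    funding = pd.read_csv("district_funding.csv")
del funding["Unnamed: 0"]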
    
    

funding["Year"] = np.nan_to_num(funding["Year"]).astype(int)
funding["Estimated 1-12"] = np.nan_to_num(funding["Estimated 1-12"]).astype(int)
funding["Projected 1-12"] = np.nan_to_num(funding["Projected 1-12"]).astype(int)
# Don't need this year's data. 
funding = funding[funding.Year != 2018]
#funding = funding[funding.Year != np.nan]






In [None]:

# testing = pd.merge()

testing = pd.merge(temp_df, funding,how='left', on = ["Authority Name"])
#
# There's a pandas gotcha with NaN values in integer columns, so we have to
# go through all this crap to deal with it. 
testing["Year"] = np.nan_to_num(testing["Year"]).astype(int)
testing["Year"] = np.nan_to_num(testing["Year"]).astype(str)
testing["Year"] = testing["Year"].replace('0', Nan)
combined_frame = pd.merge(schools_reshaped, testing, how='left', on=['Authority Name', "School Name", "Year"])

# create funding per student. 
combined_frame["Est Fund Per Student"] = combined_frame["Estimated Funding"]/combined_frame["Estimated 1-12"]
combined_frame["Proj Fund Per Student"] = combined_frame["Projected Funding"]/combined_frame["Projected 1-12"]


                     

In [None]:
density_plot_frame = combined_frame.sort_values("Year").copy()# .replace(0, np.NaN)

import seaborn as sns
from pylab import *
from matplotlib import animation
import numpy.ma as ma
from scipy.stats import mstats

def reject_outliers(data, m = 2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d/mdev if mdev else 0.
    return data[s<m]


years = list(density_plot_frame["Year"].unique())
def make_density(category, year, subject, Authority=False, filter=False):
    YEARS = list(density_plot_frame["Year"].unique())
    if subject:
        x = density_plot_frame[density_plot_frame["Course Name"] == subject]
        grade = x[[category, "Year"]]
        funding = x[["Est Fund Per Student", "Year"]]
        points = x[["Est Fund Per Student", "Year", category, "Authority Name"]]
            
    else:
        grade = density_plot_frame[[category, "Year"]]
        funding = density_plot_frame[["Est Fund Per Student", "Year"]]
        points = density_plot_frame[["Est Fund Per Student", "Year", category,"Authority Name"]]
    
    if Authority:
        f, ax = plt.subplots(figsize=(7, 7))
        
        for i, year in enumerate(YEARS):
                line = points[points["Year"] == year][category]
                line = line.mean()
                print(line)
                plt.axhline(y=line,xmin= (i+.1)/(len(YEARS)), xmax = (i+1-.1)/(len(YEARS)), c="g")
       
        plt.style.use('ggplot')
        points = points[points["Authority Name"] == Authority]
        grade = grade.dropna()
        dd = pd.melt(points[["Year", category]], id_vars = ["Year"], var_name = [category])
        title = ''.join([Authority, "\n", subject])
        
        
        try:
            sns.boxplot(x="Year", y="value", data=dd, hue=category)
        except:
            title = ''.join([Authority, "\n", subject," No data"])
        plt.title(title)

    else: 
        f, (ax1, ax2) = plt.subplots(2, figsize=(9, 9))
        plt.tight_layout(pad=4)
       # plt.subplot(2,1,1)
        points = points.dropna()
        if year:
            x = points[points["Year"] == year][category]
            y = points[points["Year"] == year]["Est Fund Per Student"]
            if filter:
                t_f = points[points["Year"] == year][[category, "Est Fund Per Student"]]
            
        else:
            x = points[category]
            y = points["Est Fund Per Student"]
            if filter:
                t_f = points[[category, "Est Fund Per Student"]]
            

        x1 = x.quantile(0.25)
        x2 = x.quantile(0.75)
        y1 = y.quantile(0.25)
        y2 = y.quantile(0.75)
        
        ax1.plot([x1,x1], [y1,y2], c ='r', label = "Box contains\n50% of data")
        ax1.plot([x1,x2], [y1,y1], c='r')
        ax1.plot([x1,x2], [y2,y2], c='r')
        ax1.plot([x2,x2], [y1,y2], c='r')
        
        
        try:
            # To get an idea of the trend I'm plotting a line.
            # That said, these errors are likely VERY non-Gaussian.
            # I don't feel like plotting them; too deep in one rabbit hole
            # to go down another. So keep in mind these are "trends"
            # and shouldn't be read into beyond a positive/negative
            # correlation.
            
            #np.linspace(0,100)
            

            
            if filter:
                # Filter outliers beyond one standard deviation (VERY AGGRESSIVE).
                top1 = t_f[category].mean() + t_f[category].std()
                top2 = t_f["Est Fund Per Student"].mean() + t_f["Est Fund Per Student"].std()
                bottom1 = t_f[category].mean() - t_f[category].std()
                bottom2 = t_f["Est Fund Per Student"].mean() - t_f["Est Fund Per Student"].std()
                t_f = t_f[t_f[category] < top1]
                t_f = t_f[t_f[category] > bottom1]
                # Chain the funding filters so the upper bound isn't discarded.
                tf = t_f[t_f["Est Fund Per Student"] < top2]
                tf = tf[tf["Est Fund Per Student"] > bottom2]

                x = tf[category]
                y = tf["Est Fund Per Student"]
           
            limits = x
            fit, V = np.polyfit(x, y, deg=1, cov=True)
            
            # Two standard errors on each coefficient: roughly a 95% interval,
            # but only if the errors are normally distributed (probably not).
            error = 2*np.sqrt(np.diag(V))

            label = ''.join(["Line of best fit\n", 
                            str(round(fit[0],2)), 
                            "±",
                            str(round(error[0], 2)),
                            "x + ",
                            str(round(fit[1],2)), 
                            "±",
                            str(round(error[1],2))])
            
            ax1.plot(limits, fit[0] * limits + fit[1], 
                     color='purple', 
                     label = label)
            ax1.plot(limits, 
                     (fit[0]+ error[0]) * limits + fit[1] + error[1], 
                     color = 'orange',
                    label = "")
            ax1.plot(limits, 
                     (fit[0]- error[0]) * limits + fit[1] - error[1],
                     color = 'orange',
                    label = "")
            
            test = fit[0] * x + fit[1]
            residual = y - test
            
        # Naked exception because I'm a rule breaker. 
        except Exception as e:
            print(e)
            pass
        
        
        if subject:
            pass
        else:
            subject = "All"
        
        title = "".join(["All Districts" ,
                 "\nMean Funding = $", 
                 str(round(y.mean(),2)),
                "\nMean Percent = ",
                str(round(x.mean(),2)),
                " %",
                "\nSubject: ",
                 subject])
        
        ax1.set_title(title)
        ax1.legend()
        
        sns.kdeplot(x, y, shade=True, ax=ax1)
        
        ax2.set_title("Distribution of Residuals of Line of Best Fit")
        
        # Test if the residual is normally distributed to judge our LOBF
        z,pval = mstats.normaltest(residual)
        if pval < 0.05:
            text = "Errors probably not normally distributed\n(Line of best fit shows approximate correlation only)" 
        else:
            text = "Errors probably normally distributed\n(Line of best fit can be used to extrapolate)"
        
        ax2.set_xlabel("Distance from LOBF")
        ax2.set_ylabel("Counts")
        ax2.hist(residual, bins = 20, histtype='bar', ec='black', label = text)
        ax2.legend()
    # plt.show()
    
    
# This is a lazy copy-paste reformat of my filter function. I should probably have
# just written a better function originally.
        
def course_drop2(_type, name):
    
    courses = list(density_plot_frame['Course Name'].unique())
    
    if not name:
        return courses
    
    filtered_course_list = []
   
    for course in courses:
        result = pd.DataFrame()
        y = pd.DataFrame()
        result = density_plot_frame[density_plot_frame[_type] == name].copy()
        result = result[result['Course Name'] == course].copy()
        if _type == "School Name":
            y = result['Sch % Acc of Writing']
       
        if _type == "Authority Name":
            y = result["Sch Acc"].copy()
            
        if y.isnull().sum() >= len(y) - 2 or y.empty == True:
            continue
        else:
            # if something exists, we'll count it
            filtered_course_list.append(course)
            
    if len(filtered_course_list) == 0:
        # TODO: make an empty thing instead of pretending they do math
        filtered_course_list.append("Mathematics 6")
    return filtered_course_list    
        
    
categories = sorted(["Sch % Exc of Writing", "Sch % Acc of Writing", "Sch % Below of Writing"])
Authority = [None] + sorted(map(str,list(density_plot_frame["Authority Name"].unique())))

auth_widget = Dropdown(options= Authority)
sub_widget = Dropdown()
    
def update(*args):
    sub_widget.options = course_drop2("Authority Name", auth_widget.value)
    
auth_widget.observe(update)

years = [None] + years

interact(make_density, 
        category = categories,
        year = years, 
        subject = sub_widget,
        Authority = auth_widget)
    
    
    
    

Using the widget above, you can look at the year-to-year and total performance of every school district as a function of funding in the top graph; below it is a histogram of the residuals of the linear fit. In the unlikely event that those residuals are normally distributed, the line of best fit can be used for extrapolation. If they are not, the line of best fit represents, at best, an approximate correlation between student performance and funding.
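
The residual check can be sketched like this. The data are synthetic, with deliberately skewed noise so that the test has something to reject, and the sketch calls `scipy.stats.normaltest` where the cell above uses the masked-array variant in `scipy.stats.mstats`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

x = rng.uniform(0, 100, size=500)
# Deliberately skewed (exponential) noise, so the residuals are not normal.
y = 3.0 * x + 10 + rng.exponential(scale=20, size=500)

slope, intercept = np.polyfit(x, y, deg=1)
residual = y - (slope * x + intercept)

# D'Agostino-Pearson test: a small p-value is evidence against normality.
statistic, pval = stats.normaltest(residual)
print("p-value:", pval)
print("probably not normal" if pval < 0.05 else "no evidence against normality")
```

A large p-value does not prove the residuals are normal; it only means the test found no evidence against it.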

By selecting an authority you can view the performance of that district year to year against the provincial mean as well. 

A few interesting things to point out about the funding graph, however: excellent and acceptable scores seem to be slightly negatively correlated with funding, i.e. more funding seems to be related to worse grades in some cases. That said, correlation does not imply causation, and there are significant outliers from the main cluster.