![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fcallysto-sample-notebooks&branch=master&subPath=notebooks/Digital_Citizenship/PATScores_No_Map.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

In [None]:
%%html

<script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }
  
  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

# Provincial Achievement Test Scores

## Introduction

TODO: Improve discussion and add quantitative results if there's interest in this notebook to do that. 

Every year, the province of Alberta runs standardized tests for grades 6 and 9 in primary courses, under the blanket name of Provincial Achievement Tests, to assess how well students perform. The results of these tests are openly available and readily downloaded from the Alberta Education website. In this notebook we're going to download and manipulate the data directly from Alberta Education, and see if we can easily identify under- and over-performing school districts. Time permitting, we might even toss these onto a map using another open data set from Alberta Education which contains the addresses of every school in Alberta. Using this data in combination with the provincial testing scores, we should be able to identify which school districts/schools are performing best and worst.

## Wrangling the data

First, let's download the data directly from the Alberta Education website and toss it in a pandas data frame.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numpy import nan as Nan


df_ero = pd.read_excel("https://education.alberta.ca/media/3680591/pat-multiyear-sch-list.xlsx")

That was pretty easy, as those are hosted Excel spreadsheets. We don't even have to save the file locally; we can toss it straight into a pandas data frame.

In [None]:
school_results = df_ero.copy()
school_results.head(1)

In [None]:
# Let's also collect diploma exam results
diploma_results = pd.read_excel('https://education.alberta.ca/media/3680580/diploma-multiyear-sch-list-annual.xlsx')
diploma_results.head(3)

In [None]:
diploma_results = diploma_results.rename(columns = {"Diploma Course":"Course Name"})



The above data format is going to be annoying to work with when we want to plot or sort the data. Instead, let's whip this table into "long form" so that we can manipulate, analyze and plot it more easily. We do this with the code below. Notice how we now have multiple duplicate entries in the "Authority Name" and "School Name" columns, as well as a handy year column for each row.
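To make the wide-to-long idea concrete, here is a minimal sketch on a toy table. The column names and values are made up for illustration, and it uses `melt` plus `pivot_table` rather than the `MultiIndex`/`stack` approach in the cell below, so treat it as an alternative way to reach the same shape, not as the notebook's method.

```python
import pandas as pd

# Hypothetical wide table: one column per (year, measure) pair.
wide = pd.DataFrame({
    "School Name": ["A", "B"],
    "2013 Sch Enrol": [30, 40],
    "2014 Sch Enrol": [32, 41],
    "2013 Sch Writing": [28, 39],
    "2014 Sch Writing": [30, 40],
})

# Step 1: stack every year column into (column label, value) rows.
melted = wide.melt(id_vars=["School Name"], var_name="column", value_name="value")

# Step 2: split "2013 Sch Enrol" into a Year part and a Measure part.
parts = melted["column"].str.extract(r"^(?P<Year>\d{4})\s+(?P<Measure>.*)$")
melted = pd.concat([melted, parts], axis=1)

# Step 3: spread the measures back out, leaving one row per school and year.
long_form = melted.pivot_table(
    index=["School Name", "Year"],
    columns="Measure",
    values="value",
    aggfunc="first",
).reset_index()
long_form.columns.name = None

print(long_form)
```

Each school now appears once per year, with "Year" as an ordinary column that can be filtered or grouped on.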

In [None]:
import re
import time
schools_reshaped = school_results.copy()
start = time.time()

# If there's a year in the column we want to split that bad boy
def splitter(string):
    r = re.compile(r'\d{4}|\S.*$')
    return r.findall(string)

cols = list(schools_reshaped)[0:8]
years = ['2013', '2014', '2015', '2016', '2017']

# Being lazy and creating duplicate columns with a year index. It's the same 
# across the board but we need them for the next step. 
# The key is to not respect your RAM. 
for year in years:
    for names in cols:
        schools_reshaped[str(year +" "+names)] = schools_reshaped[names]


schools_reshaped.columns = pd.MultiIndex.from_tuples([tuple(splitter(c)) for c in schools_reshaped.columns])
schools_reshaped = schools_reshaped.stack(0).reset_index(1)

end = time.time()
print(end - start)
schools_reshaped.rename(columns={'level_1': "Year"}, inplace=True)
#schools_reshaped[["School Name", "Course Name", "Sch Enrol", "Year", "Sch Writing"]].loc[schools_reshaped['Year'] == '2013']

# Sort by school name. 
schools_reshaped=schools_reshaped.sort_values('School Name')

del schools_reshaped["Form"]
del schools_reshaped["Language"]




In [None]:
start = time.time()  # restart the timer so the elapsed time printed below covers this cell only
diploma_reshaped = diploma_results.copy()
cols = list(diploma_results)[0:6]
years = ['2013', '2014', '2015', '2016', '2017']

# Being lazy and creating duplicate columns with a year index. It's the same 
# across the board but we need them for the next step. 
# The key is to not respect your RAM. 
for year in years:
    for names in cols:
        diploma_reshaped[str(year +" "+names)] = diploma_reshaped[names]


diploma_reshaped.columns = pd.MultiIndex.from_tuples([tuple(splitter(c)) for c in diploma_reshaped.columns])
diploma_reshaped = diploma_reshaped.stack(0).reset_index(1)

end = time.time()
print(end - start)
diploma_reshaped.rename(columns={'level_1': "Year"}, inplace=True)
del diploma_reshaped["Sch Exam Mark Acc Sig"]
del diploma_reshaped["Sch Exam Mark Exc Sig"]

print(list(diploma_reshaped))

diploma_reshaped = diploma_reshaped.rename(columns = {"Sch School Mark % Acc":"Sch % Acc of Writing",
                                                      "Sch School Mark % Exc": 'Sch % Exc of Writing',
                                                      "Sch Exam Mark % Exc":"Sch Part 1 % Exc",
                                                     "Sch Exam Mark % Acc":"Sch Part 1 % Acc"})

diploma_reshaped = diploma_reshaped[["Year", 
                                     "Authority Name", 
                                     "Course Name", 
                                     "School Name",
                                     "Sch % Acc of Writing",
                                    "Sch % Exc of Writing",
                                    "Sch Part 1 % Exc",
                                    "Sch Part 1 % Acc"]]
# Sort by school name. 
#diploma_reshaped=diploma_reshaped.sort_values('School Name')
diploma_reshaped.head(1)

Excellent. Now that the data have been reshaped into "long form" they'll be a lot easier to work with when it comes to plotting and analysis. So, let's start to get an idea of the score distributions between schools and districts by using this data frame as a back end to an interactive widget.

## Interactive Graph

Before we start any more "involved" analysis, let's take a moment to plot these data by year to get an idea of what we're working with. In the widget below, `_type` controls whether we're looking at individual schools or the school authority, `name` is the name of the school/authority, `subject` changes the subject, and `name2` is optional and will display another school/authority to compare with. Note that switching to school is a little slower, as that data set requires some setup before we can put it nicely into the widget. Also note that not all subjects are offered in each school, so the subject list is filtered down by what was offered in the school/authority under `name`.

In [None]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
from ipywidgets import interact
init_notebook_mode(connected=True)
from ipywidgets import Dropdown

schools_reshaped = pd.merge(schools_reshaped, diploma_reshaped, on = ["Year", 
                                     "Course Name", 
                                     "School Name",
                                     "Authority Name", 
                                     "Sch % Acc of Writing",
                                     "Sch % Exc of Writing",
                                     "Sch Part 1 % Acc",
                                     "Sch Part 1 % Exc"], how = 'outer')
#schools_reshaped[schools_reshaped["Course Name"] == "Biology 30"]

In [None]:
# Now let's do the same with school districts
# print(list(schools_reshaped))

def traces(name, subject, school_or_auth):
    result = None
    divisor = None
    y = None
    y2 = None
    y3 = None
    if school_or_auth == "Authority Name":
        result = schools_reshaped[schools_reshaped[school_or_auth] == name]
        divisor = result.groupby("Year")['Sch Enrol'].sum() - result.groupby("Year")["Sch Absent"].sum()    
        y =  100 * (result.groupby("Year")["Sch Acc"].sum() - result.groupby("Year")["Sch Exc"].sum() )/divisor
        y2 = 100 * result.groupby("Year")['Sch Exc'].sum()/divisor
        y3 = 100 * result.groupby("Year")['Sch Below'].sum()/divisor
    
    if school_or_auth == "School Name":
        result = schools_reshaped[schools_reshaped[school_or_auth] == name]
        divisor = result['Sch Writing']
        y = (result['Sch % Acc of Writing']-result['Sch % Exc of Writing'])# - result['Sch Exc']) / divisor
        y2 = result['Sch % Exc of Writing'] #/ divisor
        y3 =  result['Sch % Below of Writing']# / divisor 
        
    result = result[result['Course Name'] == subject]

    trace1 = go.Bar(x=result['Year'], y=y,
                    name=" ".join([name, '% at or above acceptable standard']))#, 
               
    trace2 = go.Bar(x=result['Year'],
                    y=y2,
                    name= " ".join([name, '% achieved a standard of excellence']))#,
                
    trace3 = go.Bar(x=result['Year'], 
                    y=y3, 
                    name = " ".join([name,"% below acceptable standard"]))#,
     
    return [trace1, trace2, trace3]



def compare_results( _type, name, subject, name2 = []):
    
    print(name, subject, _type)
    
    data = traces(name, subject, _type)
    
    if name2: 
        data2 = traces(name2, subject, _type)
        data = data + data2
    
    layout = go.Layout(title=subject,
                xaxis=dict(title='Year'),
                yaxis=dict(title='Percentage',
                      range = [0,100])
                      )
    
 
    fig = go.Figure(data=data, layout=layout)
    iplot(fig)

    
def course_drop(_type, name):
    courses = list(schools_reshaped['Course Name'].unique())
    filtered_course_list = []
   
    for course in courses:
        result = schools_reshaped[schools_reshaped[_type] == name]
        result = result[result['Course Name'] == course]
        if _type == "School Name":
            y = result['Sch % Acc of Writing']
       
        if _type == "Authority Name":
           y = result.groupby("Year")["Sch Acc"].sum() - result.groupby("Year")["Sch Exc"].sum()
            
        if y.isnull().sum() > 4 or y.empty == True:
            # No course for school, do nothing
            #filtered_course_list.append(course)
           
            continue
        else:
            # if something exists, we'll count it
            filtered_course_list.append(course)
            
    if len(filtered_course_list) == 0:
        # TODO: make an empty thing instead of pretending they do math
        filtered_course_list.append("Mathematics 6")
    return filtered_course_list

course_widget = Dropdown()

type_widget = Dropdown(options = ["School Name", "Authority Name"], value = "School Name")

name_widget = Dropdown()
name_widget2 = Dropdown()

    
def update2(*args):
    a = sorted(list(map(str, list(schools_reshaped[type_widget.value].unique()))))
    name_widget.options = a
    name_widget2.options =  a
    name_widget2.value = None
    # course_widget.options = course_drop(type_widget.value, x_widget.value)
    name_widget.value = a[0]
    

    
def update(*args):
    course_widget.options = course_drop(type_widget.value, name_widget.value)

name_widget.observe(update)  
#type_widget.observe(update)
type_widget.observe(update2)



interact(compare_results, 
        _type = type_widget,
         name = name_widget,
         subject =  course_widget,
         name2 = name_widget2
        )




Fantastic. Now we can compare which schools do well and which do poorly, and in which subjects. Note that the first school/authority, `name`, is used to filter out subjects it has no data for. That means you might not see every subject available for the school/authority selected under `name2`. Also note that if a school/authority has no test scores, the widget defaults to a blank grid for Mathematics 6.
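For the authority view, the widget sums the school-level counts per year and divides by the number of students who actually wrote. A minimal sketch of that arithmetic on made-up numbers follows. It assumes, as the subtraction in `traces` suggests, that the "Sch Acc" count includes the students who reached the standard of excellence.

```python
import pandas as pd

# Hypothetical counts for two schools in one authority over two years.
toy = pd.DataFrame({
    "Year": ["2016", "2016", "2017", "2017"],
    "Sch Enrol": [50, 30, 60, 40],
    "Sch Absent": [5, 5, 0, 0],
    "Sch Acc": [35, 20, 50, 30],   # assumed to include the "Exc" students
    "Sch Exc": [10, 5, 20, 10],
    "Sch Below": [10, 5, 10, 10],
})

count_cols = ["Sch Enrol", "Sch Absent", "Sch Acc", "Sch Exc", "Sch Below"]
by_year = toy.groupby("Year")[count_cols].sum()

# Students who wrote the test: enrolled minus absent.
writers = by_year["Sch Enrol"] - by_year["Sch Absent"]

pct = pd.DataFrame({
    "acceptable_only": 100 * (by_year["Sch Acc"] - by_year["Sch Exc"]) / writers,
    "excellent": 100 * by_year["Sch Exc"] / writers,
    "below": 100 * by_year["Sch Below"] / writers,
})

print(pct.round(2))
```

Because the three groups do not overlap, the percentages in each year add up to 100 in this toy data.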

## How Do Differences in Funding Affect Student Performance? 

The code below assumes you've downloaded all the PDFs containing funding information for each district from the Alberta Education site. If you don't have them, you can either download those PDFs yourself (not recommended) or use the file `district_funding.csv` from our Swift container `callysto-open-data`. The code below downloads that file directly from Swift for you.


Most of the code below is just wrangling data and making plots of that data. What we're doing is gathering all our funding data, combining it with our data frames and then plotting it. What we'll then have is the performance of each district against the provincial average in terms of test scores for each year and subject, as well as a graph of how those test scores relate to differences in _total_ funding. To do that, we plot the density of funding and performance grades for the entire province, and then fit a line to it in order to judge positive/negative correlation between funding and grade performance.
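As a rough illustration of the line-fitting step, the sketch below fits a straight line with `np.polyfit` and reads two standard errors off the covariance matrix. The data are synthetic (a made-up negative slope with noise), and, as in the plotting code, the score sits on the $x$ axis and funding on the $y$ axis.

```python
import numpy as np

# Synthetic stand-ins for the real columns (values are invented).
rng = np.random.default_rng(0)
score = rng.uniform(40, 100, size=200)                        # percent scores
funding = 12000 - 15 * score + rng.normal(0, 500, size=200)   # dollars per student

# Degree-1 fit; cov=True also returns the covariance of the coefficients.
fit, cov = np.polyfit(score, funding, deg=1, cov=True)
slope, intercept = fit

# Two standard errors on each coefficient. This is only a ~95% interval
# if the residuals are roughly normal, which the real data may not satisfy.
two_sigma = 2 * np.sqrt(np.diag(cov))

print("slope     =", round(slope, 2), "+/-", round(two_sigma[0], 2))
print("intercept =", round(intercept, 2), "+/-", round(two_sigma[1], 2))
```

The sign of the slope is the only thing worth reading from a fit like this; the interval is a guide to how much that sign can be trusted.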

In [None]:


# If you don't have the LAT LONG data, uncomment the line below and run this cell .
temp_df = schools_reshaped.copy()


In [None]:
def convert(x):
    try:
        return x.astype(int)
    except:
        return x
   

def get_funding_data(paths = "FundingPdf/*.pdf"):
    from tika import parser
    import requests
    import glob
    import re
    data = []
    count = 0
    for file in glob.iglob(paths):
        parsedPDF = parser.from_file(file)
    
        name = file.split("/")[-1]
        name = name.replace(".pdf", "")
        name = name.replace("-", " ").title()
        name = re.sub(r"[\(\[].*?[\)\]]", "", name)  # raw string avoids invalid-escape warnings
        try:
            name = name.replace(" No ", " No. ")
        except:
            pass

        try: 
            name = name.replace(" Ltd", " Ltd. ")
        except:
            pass

        name = name.strip()

        estimate_funding = None  # same name the loop assigns and data.append reads below
        projected_funding =  None
        estimated_enroll =  None
        projected_enroll =  None
        year = None
    
        for line in parsedPDF['content'].split('\n')[::-1]:

            if "TOTAL FUNDING" in line:
                estimate_funding = line.split()[2].replace('$',"").replace(",","")
                projected_funding = line.split()[3].replace('$',"").replace(",","")
        
            if "As of " in line:
          
                try:
                    print(int(line.split()[-1]))
                    year = line.split()[-1]
                except:
                    pass
        
            if "Funded Enrolment for Grades 1 - 12" in line:
           
                estimated_enroll = line.split()[7].replace(",", "")
                projected_enroll = line.split()[9].replace(",", "")
            elif "Enrolment for Grades 1 - 12" in line:
           
                estimated_enroll = line.split()[6].replace(",", "")
                projected_enroll = line.split()[8].replace(",", "")



        data.append([name, estimate_funding, projected_funding, estimated_enroll, projected_enroll, year])
   

    df = pd.DataFrame(data, columns = ["Authority Name", "Estimated Funding", "Projected Funding", "Estimated 1-12", "Projected 1-12","Year"])
    df.to_csv("district_funding.csv")

In [None]:


# Add district funding 
try: 
    funding = pd.read_csv("https://swift-yeg.cloud.cybera.ca:8080/v1/AUTH_233e84cd313945c992b4b585f7b9125d/callysto-open-data/district_funding.csv")
    del funding["Unnamed: 0"]
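except Exception:
    # Fall back to parsing the local PDFs, which writes district_funding.csv,
    # then load that file so `funding` is always defined.
    get_funding_data()
    funding = pd.read_csv("district_funding.csv", index_col=0)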
    
    
# This is to fix a pandas "gotcha" concerning integer arrays and NaN types. 
# (as in it doesn't handle it and converts to float) 

funding["Year"] = np.nan_to_num(funding["Year"]).astype(int)
funding["Estimated 1-12"] = np.nan_to_num(funding["Estimated 1-12"]).astype(int)
funding["Projected 1-12"] = np.nan_to_num(funding["Projected 1-12"]).astype(int)
# Don't need this year's data. 
funding = funding[funding.Year != 2018]
#funding = funding[funding.Year != np.nan]
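
The integer/NaN "gotcha" mentioned in the comments above can be seen on a three-element series. This sketch shows the zero-filling approach used in the cell next to pandas' nullable `Int64` dtype, which is offered here only as an alternative to consider; the values are made up.

```python
import numpy as np
import pandas as pd

# A year column with one missing entry is silently stored as float.
years = pd.Series([2016, np.nan, 2017])
print(years.dtype)

# Approach used in this notebook: turn NaN into 0, then cast to int.
as_zero = np.nan_to_num(years.to_numpy()).astype(int)
print(as_zero)

# Alternative: the nullable integer dtype keeps the gap as <NA>.
nullable = years.astype("Int64")
print(nullable)
```

With the zero-filling approach, the placeholder 0 has to be converted back to a missing value later, which is what the merge cell further down does.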


In [None]:

# testing = pd.merge()
temp_df = schools_reshaped[["Authority Name", "School Name"]]
testing = pd.merge(funding, temp_df, how='left', on = ["Authority Name"]).drop_duplicates()



#
# There's a panda's gotcha with NaN types in integer columns so we have to
# go through all this crap to deal with it. 
testing["Year"] = np.nan_to_num(testing["Year"]).astype(int)
testing["Year"] = np.nan_to_num(testing["Year"]).astype(str)
testing["Year"] = testing["Year"].replace('0', Nan)
combined_frame = pd.merge(schools_reshaped, testing, how='left',  on=['Authority Name',"School Name", "Year"])

# create funding per student. 
combined_frame["Est Fund Per Student"] = combined_frame["Estimated Funding"]/combined_frame["Estimated 1-12"]
combined_frame["Proj Fund Per Student"] = combined_frame["Projected Funding"]/combined_frame["Projected 1-12"]

temp_df.drop_duplicates();
                     

In [None]:
density_plot_frame = combined_frame.sort_values("Year").copy()# .replace(0, np.NaN)

import seaborn as sns
from pylab import *
from matplotlib import animation
import numpy.ma as ma
from scipy.stats import mstats

density_plot_frame["Acc Differential Part 1"] = density_plot_frame['Sch % Acc of Writing'] - density_plot_frame['Sch Part 1 % Acc']
density_plot_frame["Exc Differential Part 1"] = density_plot_frame['Sch % Exc of Writing'] - density_plot_frame['Sch Part 1 % Exc']
density_plot_frame["Acc Differential Part 2"] = density_plot_frame['Sch % Acc of Writing'] - density_plot_frame['Sch Part 2 % Acc']
density_plot_frame["Exc Differential Part 2"] = density_plot_frame['Sch % Exc of Writing'] - density_plot_frame['Sch Part 2 % Exc']

density_plot_frame = density_plot_frame.rename(columns={'Sch % Exc of Writing': 'School Grade Percentage Excellent',
                        'Sch % Acc of Writing': 'School Grade Percentage Acceptable',
                        'Sch % Below of Writing': 'School Grade Percentage Unacceptable',
                        'Sch Part 1 % Acc':'Provincial/Diploma Exam Percent Acceptable',
                        'Sch Part 2 % Acc': 'Provincial Exam Percent Acceptable, Part 2',
                        'Sch Part 1 % Exc':'Provincial/Diploma Exam Percent Excellent',
                        'Sch Part 2 % Exc':'Provincial Exam Percent Excellent, Part 2',
                        'Acc Differential Part 1': "Acceptable Grade Differential",
                        'Acc Differential Part 2': "Acceptable Grade Differential, Provincial Part 2",
                        'Exc Differential Part 1': "Excellent Grade Differential",
                        'Exc Differential Part 2': "Excellent Grade Differential, Provincial Part 2"})    



def make_density(category, year, subject, Authority=False, filter=False):
    YEARS = list(density_plot_frame["Year"].unique())

    x = density_plot_frame[density_plot_frame["Course Name"] == subject]
    grade = x[[category, "Year"]]
    funding = x[["Est Fund Per Student", "Year"]]
    points = x[["Est Fund Per Student", "Year", category, "Authority Name"]]
    
    if Authority:
        f, ax = plt.subplots(figsize=(7, 7))
        
        for i, year in enumerate(YEARS):
                line = points[points["Year"] == year][category]
                downline = line.mean() - line.std()
                upline = line.mean() + line.std()
                line = line.mean()
                if i == 0:
                    label = "Provincial Mean"
                    label1= "Provincial 1sd"
                    
                else:
                    label = ""
                    label1 = ""
                plt.axhline(y=line,xmin= (i+.1)/(len(YEARS)),
                            xmax = (i+1-.1)/(len(YEARS)), 
                            c="g", 
                            label = label)
                plt.axhline(y=downline,xmin= (i+.1)/(len(YEARS)), 
                            xmax = (i+1-.1)/(len(YEARS)), 
                            c="purple",
                            label = label1)
                plt.axhline(y=upline,
                            xmin= (i+.1)/(len(YEARS)),
                            xmax = (i+1-.1)/(len(YEARS)), 
                            c="purple", 
                            label = "")
       
        plt.style.use('ggplot')
        points = points[points["Authority Name"] == Authority]
        grade = grade.dropna()
        dd = pd.melt(points[["Year", category]], id_vars = ["Year"], var_name = [category])
        title = ''.join([Authority, "\n", subject])
        
        try:
            sns.boxplot(x="Year", y="value", data=dd, hue=category)
            sns.swarmplot(x="Year", y="value", data=dd, color="0.25")
        except:
            title = ''.join([Authority, "\n", subject," No data"])
        plt.title(title)

    else: 
        f, (ax1, ax2) = plt.subplots(2, figsize=(9, 9))
        plt.tight_layout(pad=4)
       # plt.subplot(2,1,1)
        points = points.dropna()
        if year:
            x = points[points["Year"] == year][category]
            y = points[points["Year"] == year]["Est Fund Per Student"]
            if filter:
                t_f = points[points["Year"] == year][[category, "Est Fund Per Student"]]
            
        else:
            x = points[category]
            y = points["Est Fund Per Student"]
            if filter:
                t_f = points[[category, "Est Fund Per Student"]]
            

        x1 = x.quantile(0.25)
        x2 = x.quantile(0.75)
        y1 = y.quantile(0.25)
        y2 = y.quantile(0.75)
        
        ax1.plot([x1,x1], [y1,y2], c ='r', label = "Box contains\n50% of data")
        ax1.plot([x1,x2], [y1,y1], c='r')
        ax1.plot([x1,x2], [y2,y2], c='r')
        ax1.plot([x2,x2], [y1,y2], c='r')
        
        
        try:
            # To get an idea for the trend I'm plotting a line. 
            # That said these errors are likely VERY non-Gaussian
            # I don't feel like plotting them -- too deep in rabbit hole
            # to go down another. SO keep in mind these are "trends"
            # and shouldn't be read into beyond a positive/negative 
            # correlation. 
       
            if filter:
                # Filter outliers by one standard dev. (VERY AGGRESSIVE) 
                top1 = t_f[category].mean() + t_f[category].std()
                top2 = t_f["Est Fund Per Student"].mean() + t_f["Est Fund Per Student"].std()
                bottom1 = t_f[category].mean() - t_f[category].std()
                bottom2 = t_f["Est Fund Per Student"].mean() - t_f["Est Fund Per Student"].std()
                t_f = t_f[t_f[category] < top1]
                t_f = t_f[t_f[category] > bottom1]
                tf = t_f[t_f["Est Fund Per Student"] < top2]
                tf = tf[tf["Est Fund Per Student"] > bottom2]  # filter tf, so the upper funding cut above is kept

                x = tf[category]
                y = tf["Est Fund Per Student"]
           
            limits = x
            fit, V = np.polyfit(x, y, deg=1, cov=True)
            
            # Two standard errors, i.e. roughly a 95% interval. Though probably not really 
            # as this calculation requires the errors to be normally distributed.
            error = 2*np.sqrt(np.diag(V))

            label = ''.join(["Line of best fit\n", 
                            str(round(fit[0],2)), 
                            "±",
                            str(round(error[0],2)),
                            "x + ",
                            str(round(fit[1],2)), 
                            "±",
                            str(round(error[1],2))])
            
            ax1.plot(limits, fit[0] * limits + fit[1], 
                     color='purple', 
                     label = label)
            ax1.plot(limits, 
                     (fit[0]+ error[0]) * limits + fit[1] + error[1], 
                     color = 'orange',
                    label = "")
            ax1.plot(limits, 
                     (fit[0]- error[0]) * limits + fit[1] - error[1],
                     color = 'orange',
                    label = "")
            
            test = fit[0] * x + fit[1]
            residual = y - test
            
        # Naked exception because I'm a rule breaker. 
        except Exception as e:
            print("No data available for", subject, category)
            return
        
        
        if subject:
            pass
        else:
            subject = "All"
        
        title = "".join(["All Districts",
                         "\nMean Funding = \\$",
                         str(round(y.mean(), 2)),
                         r" $\pm$ ",
                         str(round(y.std(), 2)),
                         " (1sd)",
                         "\nMean Percent = ",
                         str(round(x.mean(), 2)),
                         r" $\pm$ ",
                         str(round(x.std(), 2)),
                         " % (1sd)",
                         "\nSubject: ",
                         subject,
                         '\n', category])
        
        ax1.set_title(title)
        ax1.legend()
        
        # Keyword form: recent seaborn releases no longer accept positional data or shade=.
        sns.kdeplot(x=x, y=y, fill=True, ax=ax1)
        
        ax2.set_title("Distribution of Residuals of Line of Best Fit")
        
        # Test if the residual is normally distributed to judge our LOBF
        z,pval = mstats.normaltest(residual)
        if pval < 0.05:
            text = "Errors probably not normally distributed\n(Line of best fit shows approximate correlation only)" 
        else:
            # don't think this will ever happen
            text = "Errors probably normally distributed\n(Line of best fit can be used to extrapolate)"
        
        ax2.set_xlabel("Distance from LOBF")
        ax2.set_ylabel("Counts")
        ax2.hist(residual, bins = 20, histtype='bar', ec='black', label = text)
        ax2.legend()
    # plt.show()
    
    
# This is a lazy copy-paste reformat of my filter function. I should probably have
# just written a better function originally.
        
def course_drop2(_type, name):
    
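    courses = list(density_plot_frame['Course Name'].unique())
    # Build a new list rather than removing items while iterating over it,
    # which can skip entries.
    courses = [course for course in courses if "\n" not in str(course)]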
    
    if not name:
        return courses
    
    filtered_course_list = []
   
    for course in courses:
        result = pd.DataFrame()
        y = pd.DataFrame()
        result = density_plot_frame[density_plot_frame[_type] == name].copy()
        result = result[result['Course Name'] == course].copy()
        if _type == "School Name":
            y = result['School Grade Percentage Acceptable']
       
        if _type == "Authority Name":
            y = result["School Grade Percentage Acceptable"].copy()
            
        if y.isnull().sum() >= len(y) - 2 or y.empty == True:
            continue
        else:
            # if something exists, we'll count it
            filtered_course_list.append(course)
            
    if len(filtered_course_list) == 0:
        # TODO: make an empty thing instead of pretending they do math
        filtered_course_list.append("Mathematics 6")
    return sorted(filtered_course_list)    
        
    
categories = ['School Grade Percentage Excellent',
              'School Grade Percentage Acceptable',
              'School Grade Percentage Unacceptable',
              'Provincial/Diploma Exam Percent Acceptable',
              'Provincial Exam Percent Acceptable, Part 2',
              'Provincial/Diploma Exam Percent Excellent',
              'Provincial Exam Percent Excellent, Part 2',
              "Acceptable Grade Differential",
              "Acceptable Grade Differential, Provincial Part 2",
              "Excellent Grade Differential",
              "Excellent Grade Differential, Provincial Part 2",
              ]
Authority = [None] + sorted(map(str,list(density_plot_frame["Authority Name"].unique())))

auth_widget = Dropdown(options= Authority)
sub_widget = Dropdown()
    
def update(*args):
    sub_widget.options = course_drop2("Authority Name", auth_widget.value)
    
auth_widget.observe(update)

years = [None] + years



interact(make_density, 
        category = categories,
        year = years, 
        subject = sub_widget,
        Authority = auth_widget)
    
    

Using the widget above you can look at the year-to-year and total performance of every school district as a function of funding in the top graph; below it is a histogram of the linear fit residuals. In the small chance that those residuals are normally distributed, the line of best fit can be used for extrapolation. If they are not, the line of best fit represents, at best, an approximate correlation between student performance and funding. The differential is defined as

\begin{equation}
\Delta \text{Score} = S_{grade} - E_{grade}
\end{equation}

where $S$ is the overall grade awarded by the school, and $E$ is the grade students achieved on the exam. 
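The residual check described above can be sketched as follows. The data are synthetic, with deliberately skewed noise, and the sketch calls `scipy.stats.normaltest`, the non-masked counterpart of the `mstats.normaltest` function used in the notebook.

```python
import numpy as np
from scipy import stats

# Synthetic data: a straight line plus skewed (exponential) noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=500)
y = 2 * x + 1 + rng.exponential(3.0, size=500)

# Fit the line and compute the residuals (observed minus fitted).
slope, intercept = np.polyfit(x, y, deg=1)
residual = y - (slope * x + intercept)

# D'Agostino-Pearson test: a small p-value suggests the residuals
# are not normally distributed.
stat, pval = stats.normaltest(residual)

if pval < 0.05:
    print("Residuals probably not normal: read the fit as a trend only.")
else:
    print("No evidence against normal residuals.")
```

With noise this skewed the test rejects normality, which mirrors the situation the notebook expects for the real funding data.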

By selecting an authority you can also view the performance of that district year to year against the provincial mean. Note that not all authorities have exam or school marks for all courses in all years; in that case, an empty plot will be created. The provincial mean and standard deviation are overlaid on the plot, along with individual points for each school with grades recorded. This makes it far easier to judge how well a school division performed relative to the province, and to decide whether variations from the provincial mean are meaningful.

A few interesting things to point out about the funding graph, however: excellent and acceptable scores seem to be slightly negatively correlated with funding, i.e. more funding seems to be related to worse grades in some cases. That said, correlation does not imply causation, and there are significant outliers away from the main cluster that may be skewing the fit. You can aggressively remove outliers by clicking the filter button, which removes all points (in $x$ and $y$) that are more than one standard deviation away from the mean of the data. Note that this feature is only available on the density plot regarding funding information.
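A minimal sketch of the one-standard-deviation filter is shown below. The helper `within_one_sd` and the toy values are hypothetical; the point is only that a row survives when it lies within one standard deviation of the mean in every listed column.

```python
import pandas as pd

# Made-up points: row 5 has an unusually low score, row 6 unusually high funding.
pts = pd.DataFrame({
    "score": [50, 55, 60, 65, 70, 10, 62],
    "funding": [10.0, 10.5, 11.0, 11.5, 12.0, 11.0, 30.0],
})

def within_one_sd(frame, columns):
    """Keep rows that lie within one standard deviation of the mean in all columns."""
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        mean, sd = frame[col].mean(), frame[col].std()
        mask &= (frame[col] > mean - sd) & (frame[col] < mean + sd)
    return frame[mask]

kept = within_one_sd(pts, ["score", "funding"])
print(kept)
```

Means and standard deviations are computed once on the unfiltered data, so the cut in one column does not shift the bounds used for the other.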

Regardless, the trend is the same, and funding doesn't seem to really matter in terms of performance. If anything, more funding seems to imply that the students do worse. However, the uncertainty is large and the residuals are far from normal, so at best I will cautiously state that funding amount does not seem to affect overall student performance. This is likely a good sign. It may be interesting to take into account the geographic coordinates of each school and compare performance as a function of location.

## Question Level Precision for Math 30-1 Wild Rose School Div. 66

The province also reports the per-question performance of students on diploma exams. In this case we have the data set of the Wild Rose School Division 66 Math 30-1 scores for 2018. Below we plot the percentage of students who got each question correct for both the province and the Wild Rose School Division. We also plot the differential, defined as

\begin{equation}
\Delta \text{Score (\%)} = \text{Score (Wild Rose) (\%)} - \text{Score (Province) (\%)}
\end{equation}

Based on the above definition, a positive differential implies that the students of the Wild Rose School Division outperformed the province, and a negative differential implies that they underperformed relative to the province.
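A small sketch of the differential and its one-standard-deviation band follows. The column names match the CSV used in the next cell, but the five rows of values are invented for illustration.

```python
import pandas as pd

# Hypothetical percent-correct values for five questions.
toy = pd.DataFrame({
    "prov_ms_correct_pct": [70, 65, 80, 40, 75],
    "ms_correct_pct": [72, 60, 81, 20, 76],
})

# Differential: division score minus provincial score, per question.
diff = toy["ms_correct_pct"] - toy["prov_ms_correct_pct"]

# One-standard-deviation band around the mean differential.
lower = diff.mean() - diff.std()
upper = diff.mean() + diff.std()

# Questions whose differential falls outside the band.
flagged = toy.index[(diff < lower) | (diff > upper)].tolist()

print(diff.tolist())
print(flagged)
```

Questions outside the band are only candidates for a closer look; with so few points the band is not a significance test.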


In [None]:
# Load the 2018 Math 30-1 per-question results and drop the unused group id column
mathdf = pd.read_csv("math_2018_scores.csv")
del mathdf["instl_grp_id"]

division = mathdf["ms_correct_pct"]        # Wild Rose School Division, % correct
province = mathdf["prov_ms_correct_pct"]   # Province, % correct
diff = division - province                 # differential, division minus province

f, xarr = plt.subplots(2, 1, figsize=(20, 10))

# Top panel: per-question scores for the province and the division
xarr[0].plot(mathdf.index, province, label="Province")
xarr[0].plot(mathdf.index, division, label="Wild Rose")
string = "Correlation = {} %".format(round(division.corr(province) * 100, 3))

# x values spanning the plot, used to draw the horizontal bands
x = np.linspace(0, 39, 10)

# Shade mean +/- one standard deviation for each series
xarr[0].fill_between(x, division.mean() + division.std(), division.mean() - division.std(),
                     alpha=0.2, label="Wild Rose 1sd Range", color='r')
xarr[0].fill_between(x, province.mean() + province.std(), province.mean() - province.std(),
                     alpha=0.2, label="Provincial 1sd Range", color='b')

xarr[0].set_xlim(0, 39)
xarr[0].text(1, 20, string, size=16)
xarr[0].legend()
xarr[0].set_xlabel("Question", size=20)
xarr[0].set_ylabel("Correct (%)", size=20)

# Bottom panel: differential with its own mean +/- one standard deviation band
xarr[1].fill_between(x, diff.mean() + diff.std(), diff.mean() - diff.std(),
                     alpha=0.4, label="1sd Range")
xarr[1].plot(mathdf.index, diff, label="Wild Rose - Province")

xarr[1].set_xlabel("Question", size=20)
xarr[1].set_ylabel("Correct Differential (%)", size=20)
xarr[1].legend()
xarr[1].set_xlim(0, 39)

plt.show()



The plot above shows the per-question performance of Wild Rose School Division No. 66 compared to the per-question performance of the province on the Mathematics 30-1 diploma exam. I note that without access to the non-aggregated provincial score data it is impossible to tell whether any of the variations in performance between the school division and the province are statistically significant. As a rough substitute, the range of one standard deviation about the mean of each series is plotted as a translucent band to help identify potential outliers and to suggest whether any variations between the school division and the province might be significant. However, with this small a data set, I would be hard pressed to believe that there are any variations of particular significance. The only data point that jumps out is the poor performance on question 12 by the division. However, the province did poorly on that question as well.
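The band on the differential can also be used numerically to list the questions that fall outside it. The sketch below assumes the same column names as the data set (`ms_correct_pct` and `prov_ms_correct_pct`) but uses a small, hypothetical table of values rather than the real scores.

```python
import pandas as pd

# Hypothetical per-question percentages; only the column names follow the notebook's data set
toy = pd.DataFrame({
    "ms_correct_pct":      [80.0, 65.0, 30.0, 72.0, 55.0, 90.0],
    "prov_ms_correct_pct": [78.0, 66.0, 50.0, 70.0, 57.0, 75.0],
})

# Differential: division minus province, in percentage points
diff = toy["ms_correct_pct"] - toy["prov_ms_correct_pct"]

# Band of one sample standard deviation about the mean differential
lower = diff.mean() - diff.std()
upper = diff.mean() + diff.std()

# Questions falling outside the band are candidates for a closer look
flagged = diff[(diff < lower) | (diff > upper)]
print(flagged)
```

Falling outside the band only marks a question as unusual relative to the others; it is not a significance test.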


What can be stated with certainty, based on the correlation between the Wild Rose School Division and the province, is that students in the division seemed to have trouble with, and do well on, similar questions to the rest of the province. Beyond that, besides providing some insight into the performance of students on each question, I'm not convinced there are any broad, sweeping conclusions that can be made from this data set alone.

## Performance By Question Type

This can also be broken down by question type. The key to the $x$ axis of each plot is as follows:

| Symbol  | C | P | PS | RF | TRIG | PCBT |
|---------|---|---|----|----|------|------|
| **Meaning** | Conceptual  |  Procedural |  Problem Solving  | Relations and Functions   |  Trigonometry   |    Permutations, Combinations and Binomial Theorem   |

The plots below show, for each question type, the performance of the Wild Rose School Division, the performance of the province, and the differential between the two.

In [None]:
f, xarr = plt.subplots(2, 3, figsize=(20, 10))

# Per-question differential: division minus province
diff = mathdf["ms_correct_pct"] - mathdf["prov_ms_correct_pct"]

# Each column shows one quantity: (values, y-axis label, y-axis limits).
# ms_correct_pct is the division's score, so it carries the "District" label.
columns = [
    (mathdf["ms_correct_pct"], "District Question Score (%)", (0, 100)),
    (mathdf["prov_ms_correct_pct"], "Provincial Question Score (%)", (0, 100)),
    (diff, "Differential Question Score (%)", (-40, 40)),
]

# Top row groups questions by cognitive level, bottom row groups them by topic
for row, category in enumerate(["Cognitive Level", "Topic"]):
    for col, (values, ylabel, ylim) in enumerate(columns):
        ax = xarr[row, col]
        sns.boxplot(x=mathdf[category], y=values, ax=ax)
        sns.swarmplot(x=mathdf[category], y=values, ax=ax, color=".25")
        ax.set_ylim(*ylim)
        ax.set_xlabel(category, fontsize=16)
        ax.set_ylabel(ylabel, fontsize=16)

plt.show()



The plots above show the performance of the province and the division for each question type. There are potentially more interesting conclusions here than in the previous data set: the entire province, as well as the division, seems to do poorly at both problem solving and permutations, combinations and binomial theorem compared to the other categories. If there is any conclusion to take away from the above, it is that the Wild Rose School Division sees its most negative differentials on relations and functions, as well as on conceptual questions.
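The same comparison can be summarized in a table instead of read off the box plots. The following sketch groups a small, hypothetical set of questions by `Topic` and ranks the topics by mean differential. The column names follow the data set, while the values and the added `differential` column are illustrative assumptions.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the notebook's data set
toy = pd.DataFrame({
    "Topic":               ["RF", "RF", "TRIG", "TRIG", "PCBT", "PCBT"],
    "ms_correct_pct":      [60.0, 64.0, 70.0, 74.0, 50.0, 48.0],
    "prov_ms_correct_pct": [66.0, 68.0, 71.0, 73.0, 51.0, 49.0],
})
toy["differential"] = toy["ms_correct_pct"] - toy["prov_ms_correct_pct"]

# Mean and median differential per topic, most negative first
summary = (
    toy.groupby("Topic")["differential"]
       .agg(["mean", "median", "count"])
       .sort_values("mean")
)
print(summary)
```

Replacing `"Topic"` with `"Cognitive Level"` gives the equivalent summary for the top row of plots. The `count` column is a reminder of how few questions sit behind each average.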

## Conclusion

Unfortunately, there are not many concrete conclusions to take away from this data without specialized insight into the differences between divisions and individual schools. However, we did see that per-student funding does not seem to influence student performance in any significant manner. If there is any correlation between funding and test scores, it is that students with more funding tend to do more poorly on exams. Beyond that, the question-level resolution of the Math 30-1 diploma scores for the Wild Rose School Division shows that the division is, more or less, on par with the province, with a few outliers. However, students in the Wild Rose School Division seemed to have the greatest trouble with relations and functions, and with conceptual questions.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)