![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)



<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fdata-viz-of-the-week&branch=main&subPath=government-spending/government-spending.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"></a>

# Callysto’s Weekly Data Visualization

## Government Spending

### Recommended Grade levels: 6-9
<br>

### Instructions

Click "Cell" and select "Run All".

This will import the data and run all the code, so you can see this week's data visualization. Scroll back to the top after you’ve run the cells.

![instructions](https://github.com/callysto/data-viz-of-the-week/blob/main/images/instructions.png?raw=true)

**You don't need to do any coding to view the visualizations**.

The plots generated in this notebook are interactive. You can hover over and click on elements to see more information. 

Email contact@callysto.ca if you experience issues.

### About this Notebook

Callysto's Weekly Data Visualization is a learning resource that aims to develop data literacy skills. We provide Grades 5-12 teachers and students with a data visualization, like a graph, to interpret. This companion resource walks learners through how the data visualization is created and interpreted by a data scientist. 

The steps of the data analysis process are listed below and applied to each weekly topic.

1. Question - What are we trying to answer?
2. Gather - Find the data source(s) you will need. 
3. Organize - Arrange the data, so that you can easily explore it. 
4. Explore - Examine the data to look for evidence to answer the question. This includes creating visualizations. 
5. Interpret - Describe what's happening in the data visualization. 
6. Communicate - Explain how the evidence answers the question. 

## Question

Which sectors receive funding from the Canadian government, and what is the allocation of the government's budget to specific categories? 

### Goal

Our mission is to look at the different sectors of the Canadian government and identify which ones spend the most and the least of their allocated budget. We also want to look at the distribution of expenses and find any trends in how the budget is allocated among important categories such as health and defence.

### Background

Analyzing the Canadian government's budget, expenses, and future spending habits is important as it promotes transparency and accountability, enabling citizens to assess government priorities and decisions. We can also identify areas for optimization, specifically looking at areas of high spending.

## Gather

All of the data sources used in this notebook come from [Statistics Canada](https://www.statcan.gc.ca/en/start) and the Government of Canada's [open government](https://search.open.canada.ca/opendata/) portal. 

### Code: 

Run the code cell below to import the libraries we need for this project. Libraries are collections of pre-made code that make it easier to analyze our data.

In [None]:
%pip install -r requirements.txt
import pyodide_http
pyodide_http.patch_all()
import pandas as pd
import plotly.express as px
import folium
import geopandas as gpd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
import ipywidgets
from IPython.display import display
warnings.filterwarnings("ignore")

print("Libraries imported.")

### Data

To begin, we'll load the datasets needed for this notebook using the cell below. For some datasets, the column names are changed to make it clearer what each column means.

### Import the data

In [None]:
# Renaming columns
expenses_cols = ['Year','Social protection','Health','Education','General public services','Economic affairs','Other functions']
share_of_expenses_cols = ['Province','Health','Education','General public services','Social protection','Economic affairs','Other functions']

fte = pd.read_csv("https://raw.githubusercontent.com/callysto/data-files/main/data-viz-of-the-week/government-spending/FTE.csv")
expenditures = pd.read_csv("https://raw.githubusercontent.com/callysto/data-files/main/data-viz-of-the-week/government-spending/expenditures.csv")
overall_expenses = pd.read_csv("https://raw.githubusercontent.com/callysto/data-files/main/data-viz-of-the-week/government-spending/overall_expenses.csv", header=1, names=expenses_cols)
share_of_expenses = pd.read_csv("https://raw.githubusercontent.com/callysto/data-files/main/data-viz-of-the-week/government-spending/share_of_expenses.csv", header=1, names=share_of_expenses_cols)
federal_provincial = pd.read_csv("https://raw.githubusercontent.com/callysto/data-files/main/data-viz-of-the-week/government-spending/federal_provincial.csv")
print("Datasets imported.")

### Comment on the data

Now that we've obtained our data, we can take a look at what each dataset represents and try to draw meaning from it. Throughout this notebook, when a new dataset is used, the first cell will display its contents. This makes it clear when a new dataset is being analyzed and what the visualizations are based on. 

To begin, let's analyze our `fte` dataset.

In [None]:
fte

Moving forward, we will be calling our datasets *dataframes*. A dataframe is like a digital spreadsheet or table that contains rows and columns of data. Each row in a dataframe represents a different piece of information or a record, while each column represents a specific attribute or characteristic of that information.
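
To make the idea of rows and columns concrete, here is a minimal sketch that builds a tiny dataframe by hand. The organization names and numbers are hypothetical and are not taken from the notebook's datasets.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy_df = pd.DataFrame({
    "Organization": ["Agency A", "Agency B", "Agency C"],
    "2021-22 Actual FTEs": [120.0, 45.5, 300.0],
})

# Each row is one record; each column is one attribute of that record
n_rows, n_cols = toy_df.shape
print(f"{n_rows} rows and {n_cols} columns")

# Select one column (an attribute shared by every record)
print(toy_df["2021-22 Actual FTEs"])

# Select one row (a single record) by its position
print(toy_df.iloc[0])
```

The real dataframes in this notebook work the same way; they simply have many more rows and columns.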

In the dataframe above, we see that the columns `Organization` and `Program` identify the government organization and the different programs it has. The columns `2017-18 Actual FTEs`-`2021-22 Actual FTEs` identify the organization's FTEs. The definition of an FTE (Full-Time Equivalent), from the [Government of Canada](https://open.canada.ca/data/en/dataset/e2e60f18-95fe-487b-9edd-d1f7bcdd9f9f), is:

"a measure of the extent to which an employee represents a full person-year charge against the departmental budget for future spending years."

In simplified terms, it's like asking, "if one person worked full-time for a year, how much of our budget would they use?". The final columns are `2023-24 Planned FTEs`-`2025-26 Planned FTEs`, which indicate how many FTEs are planned for this particular organization's future. 
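
As a rough illustration of the arithmetic, FTEs are commonly worked out by dividing the hours someone works by the hours in a full-time schedule. The sketch below assumes a hypothetical full-time week of 37.5 hours and three made-up employees; it is a simplification and not the government's official calculation.

```python
# Assumption for illustration only: a full-time schedule of 37.5 hours per week
FULL_TIME_HOURS_PER_WEEK = 37.5


def fte_for(hours_per_week):
    """Return the fraction of a full-time position that the given hours represent."""
    return hours_per_week / FULL_TIME_HOURS_PER_WEEK


# Three hypothetical employees: one full-time and two half-time
weekly_hours = [37.5, 18.75, 18.75]
total_fte = sum(fte_for(hours) for hours in weekly_hours)

print(f"{len(weekly_hours)} employees count as {total_fte} FTEs")
```

Under these assumptions, three people can add up to fewer than three FTEs, which is why FTE totals are often not whole numbers.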

## Explore

To begin, let's see how many different organizations are supported under the Government of Canada. 
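
Before running this on the real data, here is a small sketch of how unique values can be counted and listed in pandas. The `toy_fte` dataframe and its contents are hypothetical stand-ins for the `fte` dataframe.

```python
import pandas as pd

# Hypothetical stand-in for the fte dataframe: one row per program
toy_fte = pd.DataFrame({
    "Organization": ["Agency A", "Agency A", "Agency B"],
    "Program": ["Program 1", "Program 2", "Program 1"],
})

# nunique() counts the distinct values in a column
n_orgs = toy_fte["Organization"].nunique()

# unique() returns the distinct values themselves; sorting makes the list easier to read
org_names = sorted(toy_fte["Organization"].unique())

print(n_orgs, org_names)
```

An organization appears once per program in the data, so counting rows would overcount; counting unique names avoids that.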

In [None]:
# Find unique organizations
unique_organizations = set(fte['Organization'])

# Print unique organization
for org in unique_organizations:
    print(org)

Looking at the output above, there appear to be many different organizations supported by the Government of Canada, reflecting the country's commitment to a wide array of sectors and initiatives. Using this list, let's create a visualization for any one of these organizations. Read the commented lines in the code cell below (the lines that start with #).

In [None]:
# Change this to the organization you'd like to look at 
# Example: "Atlantic Canada Opportunities Agency" can be changed to "Canadian Grain Commission"
organization_to_find = "Atlantic Canada Opportunities Agency"

# Keep only the rows that belong to the chosen organization
searched_df = fte[fte['Organization'] == organization_to_find]

# Every column other than the two label columns holds the FTEs for one year
year_cols = [col for col in searched_df.columns if col not in ('Organization', 'Program')]

# Total FTEs per year, added up across all of the organization's programs
melted_df = pd.melt(searched_df, id_vars=['Organization'], value_vars=year_cols, var_name='Year', value_name='Value')
org_df = melted_df.groupby(['Year', 'Organization'])['Value'].sum().reset_index()

# FTEs per year for each individual program
df_prog = searched_df.groupby('Program')[year_cols].sum().reset_index()
program_df = pd.melt(df_prog, id_vars=['Program'], value_vars=year_cols, var_name='Year', value_name='Value')

fte_fig = make_subplots(rows=1, cols=2, subplot_titles=("Total FTEs", "Program FTEs"))

for org in org_df['Organization'].unique():
    org_data = org_df[org_df['Organization'] == org]
    fte_fig.add_trace(go.Scatter(x=org_data['Year'], y=org_data['Value'], mode='lines',
                             name=f'{org}'), 
                             row=1, col=1)

for program in program_df['Program'].unique():
    prog_data = program_df[program_df['Program'] == program]
    fte_fig.add_trace(go.Scatter(x=prog_data['Year'], y=prog_data['Value'], mode='lines',
                             name=f'{program}'), row=1, col=2,
                             )
   
    
fte_fig.update_layout(title=f'Progression of FTEs for: {organization_to_find}',
                  xaxis_title='Year', yaxis_title='FTEs',
                  xaxis2_title='Year', yaxis2_title='FTEs')

fte_fig.show()

After viewing the different Full-Time Equivalents (FTEs) of various Canadian organizations, have you gained a different sense of perspective? How might this newfound perspective influence your views on government support for various sectors?

Now that we've done some exploratory analysis on FTE data, let's move on to something more tangible, such as *budgeting*.

In [None]:
expenditures

## Organize

Unlike with our `fte` dataframe, we will be adding new columns to our next dataframe, `expenditures`. Specifically, we'll calculate how far each organization was over or under its allocated budget. 
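
The calculation itself is a subtraction: the budget available minus the amount actually spent. Here is a minimal sketch on hypothetical numbers; the column names `Budget` and `Expenditures` are shortened stand-ins for the longer names in the real dataframe.

```python
import pandas as pd

# Hypothetical budgets and spending in dollars, for illustration only
toy_budget = pd.DataFrame({
    "Organization": ["Agency A", "Agency B", "Agency C"],
    "Budget": [1000, 500, 750],
    "Expenditures": [900, 650, 750],
})

# Delta = budget available minus money actually spent
toy_budget["Delta"] = toy_budget["Budget"] - toy_budget["Expenditures"]

print(toy_budget)
```

A positive delta means money was left over (under budget), a negative delta means more was spent than was available (over budget), and zero means the budget was spent exactly.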

In [None]:
expenditures = expenditures.set_index('Organization')

# List of years
years = ['2017-18', '2018-19', '2019-20', '2020-21', '2021-22']

for year in years:
    year_col = year + ' - '
    budget_col = year_col + 'Total budgetary authority available for use'
    expenditure_col = year_col + 'Expenditures'
    
    expenditures[f'{year} Delta'] = expenditures[budget_col] - expenditures[expenditure_col]

expenditures = expenditures.reset_index()
display(expenditures)

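Perfect! Now we can use the new columns `2017-18 Delta`-`2021-22 Delta`. A positive delta means an organization spent less than its budget (under budget), while a negative delta means it spent more (over budget). First, let's find the highest and lowest delta (i.e., difference) in each year. 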
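
The search for the highest and lowest values can be sketched with `idxmax` and `idxmin`, which return the row label of the largest and smallest value in a column. The data below is hypothetical.

```python
import pandas as pd

# Hypothetical deltas in dollars, for illustration only
toy_deltas = pd.DataFrame({
    "Organization": ["Agency A", "Agency B", "Agency C"],
    "Delta": [100, -150, 0],
})

# idxmax/idxmin give the row label of the largest/smallest delta
most_under_row = toy_deltas["Delta"].idxmax()
most_over_row = toy_deltas["Delta"].idxmin()

# Use those row labels to look up which organization each one belongs to
most_under = toy_deltas.loc[most_under_row, "Organization"]
most_over = toy_deltas.loc[most_over_row, "Organization"]

print(f"Most under budget: {most_under}")
print(f"Most over budget: {most_over}")
```

The cell that follows applies the same idea to every delta column at once.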

In [None]:
# Select the delta columns by name rather than by position
cols_to_check = [f'{year} Delta' for year in years]
max_indices = expenditures[cols_to_check].idxmax()

for col in max_indices.index:
    max_index = max_indices[col]
    max_value = expenditures.at[max_index, col]
    organization = expenditures.at[max_index, 'Organization']
    program = expenditures.at[max_index, 'Vote 2021-22 / Statutory - Description']  
    
    print(f"Highest {col}:")
    print(f"Organization: {organization}")
    print(f"Description: {program}")
    print(f"Under-Budget by: ${max_value}\n")

# cols_to_check was already defined above, so we can reuse it here
min_indices = expenditures[cols_to_check].idxmin()
print('-----------------------------\n')

for col in min_indices.index:
    min_index = min_indices[col]
    min_value = expenditures.at[min_index, col]
    organization = expenditures.at[min_index, 'Organization']
    program = expenditures.at[min_index, 'Vote 2021-22 / Statutory - Description']  
    
    print(f"Lowest {col}:")
    print(f"Organization: {organization}")
    print(f"Description: {program}")
    print(f"Over-Budget by: ${abs(min_value)}\n")


Looking at the output above, the results are surprising. 

In every year, the organization with the highest under-budget delta under-spent its budget by at least a billion dollars. Consistent under-spending on this scale can be viewed positively as a sign of fiscal responsibility and resource efficiency, but it also raises concerns about potential missed opportunities, impact on services, and the need for transparent accountability in budget management.

On the flip side, the over-budget deltas show that most organizations did not overspend their allocated budget! The only major case of overspending appears in the `2019-20 Delta`, by Correctional Service Canada. 

To make this clearer, we can also visualize the highest deltas in each year, for both overspending and underspending. 

In [None]:
budget_analysis = make_subplots(rows=2, cols=1, subplot_titles=("Under-Budget", "Over-Budget"))

for col in max_indices.index:
    max_index = max_indices[col]
    max_value = expenditures.at[max_index, col]
    organization = expenditures.at[max_index, 'Organization']
    program = expenditures.at[max_index, 'Vote 2021-22 / Statutory - Description']
    
    budget_analysis.add_trace(go.Bar(
        x=[organization],
        y=[max_value],
        marker=dict(color='green'),
        name=f"Under-Budget Fig - {col}",
    ), row=1, col=1)

for col in min_indices.index:
    min_index = min_indices[col]
    min_value = expenditures.at[min_index, col]
    organization = expenditures.at[min_index, 'Organization']
    program = expenditures.at[min_index, 'Vote 2021-22 / Statutory - Description']
    
    budget_analysis.add_trace(go.Bar(
        x=[organization],
        y=[min_value],
        marker=dict(color='red'),
        name=f"Over-Budget Fig - {col}",
    ), row=2, col=1)

budget_analysis.update_layout(
    title_text="Budget Analysis of Canadian organizations, 2017-2022",
    barmode='group',
)

budget_analysis.show()


Looking at the visualization, two organizations do not appear to have any bars: *Veterans Affairs Canada* and *Atlantic Canada Opportunities Agency*. This is because the former was over budget by only one dollar, and the latter did not go over budget at all. What other conclusions can you draw from the visualization? 

We can also look at the percentage of government expenses that goes to particular categories, using the dataframe `overall_expenses`.

In [None]:
overall_expenses

We can first create a visualization representing the general progression of government expenses.

In [None]:
columns = overall_expenses.columns[1:7]

stacked_categories = px.bar(overall_expenses, x='Year', y=columns, title="Stacked Bar Graph of Government Expenses, 2008-2021",
             labels={'variable': f"Category", 'index': "Year", 'value': 'Percentage of Budget'})

stacked_categories.update_layout(barmode='stack').show()

Looking at the visualization, we can get a general sense of which categories are prioritized by the government. *Social protection* and *health* appear to be the highest priorities, without major changes over the years. The other categories also appear to change very little. However, even small changes in percentage can represent large amounts of money. 

Let's try to identify the changes over the years in more detail. We can calculate each change by taking a year's percentage and subtracting the previous year's percentage from it. The results will be stored in new columns.
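
Here is a small sketch of that subtraction on hypothetical percentages. `shift(1)` moves every value down by one row, so each year lines up with the year before it.

```python
import pandas as pd

# Hypothetical share of the budget spent on health, in percent
toy_share = pd.DataFrame({
    "Year": [2019, 2020, 2021],
    "Health": [24.0, 22.5, 23.0],
})

# shift(1) lines up each year with the previous year's value
previous_year = toy_share["Health"].shift(1)

# Current year minus previous year; the first year has nothing to compare against
toy_share["Health_change"] = (toy_share["Health"] - previous_year).fillna(0)

print(toy_share)
```

The first row has no previous year, so its change would be missing; filling it with 0 keeps the column fully numeric.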

In [None]:
for col in columns:
    overall_expenses[col + '_change'] = overall_expenses[col] - overall_expenses[col].shift(1)

# Replace the first row values with 0
overall_expenses.fillna(0, inplace=True)
display(overall_expenses)

Now let's plot each category's percentage over time in the visualization below. Hover over a point to see the change from the previous year.

In [None]:
percentage_fig = go.Figure()

for col in columns:
    color = 'red' if overall_expenses[col + '_change'].iloc[1:].mean() < 0 else 'green'
    percentage_fig.add_trace(go.Scatter(
        x=overall_expenses['Year'],
        y=overall_expenses[col],
        mode='lines+markers',
        line=dict(dash='dot'),
        name=col,
        marker=dict(color=color),
        text=[f'{col}: {y:.2f}%<br>Change: {change:.2f}%' for y, change in zip(overall_expenses[col], overall_expenses[col + '_change'])],
        hoverinfo='text',
    ))

percentage_fig.update_layout(
    title="Progression of Expenses 2008-2021",
    xaxis=dict(title="Year"),
    yaxis=dict(title="Percentage of Budget"),  # the y-values are percentages; changes appear on hover
).show()

It appears that half of the categories have *decreased* in overall percentage while the other half *increased* between 2008 and 2021. An interesting thing to note is that the largest increase was in 2019-2020, with a 7% increase in social protection. A possible reason is that this was the beginning of COVID-19. Unemployment soared during this time, and the government may have had to provide more assistance due to rising poverty rates. Many would assume that *health* would also have been prioritized in this case, but surprisingly its share of the budget decreased. 

Overall, upon closer examination of the specific percentage changes, it becomes apparent that the government has opted for relatively modest adjustments to the budget. This approach is generally positive, as it promotes stability and flexibility to address changing priorities and challenges.

We can also look at the percentages of budget based on province using the `share_of_expenses` dataframe.

In [None]:
share_of_expenses

## Extended Organization

As an extension to organizing data, *data-cleaning* is an essential step in the data preparation process. Generally, it involves identifying, correcting, and handling errors, inconsistencies, and inaccuracies within a dataset. In our particular case, we will be translating the province names from French to English and renaming the column so that it matches our other dataframe in later code cells.  

We will also be reading in a *geojson* file, which contains the latitude/longitude coordinates of the borders of the provinces in Canada. 
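
To give a sense of what a geojson file looks like inside, here is a heavily simplified sketch with a single made-up square "province". The property name `prov_name_en` mirrors the one used later in this notebook, but the rest of the structure and the coordinates are hypothetical, and the real file contains far more detail.

```python
import json

# A hypothetical, heavily simplified GeoJSON document with one square "province"
toy_geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"prov_name_en": "Toyland"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [[-100.0, 50.0], [-90.0, 50.0], [-90.0, 55.0], [-100.0, 55.0], [-100.0, 50.0]]
        ]
      }
    }
  ]
}
"""

toy_geojson = json.loads(toy_geojson_text)

# Each feature pairs some properties (like a name) with a geometry (the border)
feature = toy_geojson["features"][0]
name = feature["properties"]["prov_name_en"]

# The border is a list of [longitude, latitude] points that ends where it started
border = feature["geometry"]["coordinates"][0]

print(name, "has a border made of", len(border), "points")
```

A map library can match each feature to a row of data through a property such as the name, which is why the names in both sources need to be spelled the same way.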

In [None]:
prov_data = gpd.read_file('https://raw.githubusercontent.com/callysto/data-files/main/Science/ClimateAcrossProvinces/geopandas.geojson')

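# Translate the French province names into English so they match share_of_expenses.
# The result is assigned back to the column instead of replacing in place on a selected column.
prov_data['prov_name_fr'] = prov_data['prov_name_fr'].replace(
    {
        'Alberta': 'Alberta',
        'Manitoba': 'Manitoba',
        'Yukon': 'Yukon',
        'Terre-Neuve-et-Labrador': 'Newfoundland and Labrador',
        'Nouvelle-Écosse': 'Nova Scotia',
        'Nouveau-Brunswick': 'New Brunswick',
        'Territoires du Nord-Ouest': 'Northwest Territories',
        'Île-du-Prince-Édouard': 'Prince Edward Island',
        'Nunavut': 'Nunavut',
        'Québec': 'Quebec',
        'Ontario': 'Ontario',
        'Colombie-Britannique': 'British Columbia'
    }
)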

prov_data.rename(columns={'prov_name_fr': "Province"}, inplace=True)  
prov_data

Now that we've properly cleaned our dataframe, we will merge our two dataframes. Don't worry about the particular details on why this is being done; it's primarily for coding purposes to prepare for a future visualization.
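
For the curious, here is a minimal sketch of what a left merge does, using two hypothetical tables. The `indicator=True` option is not used in this notebook; it is included here only because it adds a `_merge` column that shows which rows found a match.

```python
import pandas as pd

# Hypothetical spending table (left) and shape table (right)
toy_spending = pd.DataFrame({
    "Province": ["Alberta", "Quebec", "Yukon"],
    "Health": [30.0, 28.0, 25.0],
})
toy_shapes = pd.DataFrame({
    "Province": ["Alberta", "Quebec"],
    "shape_id": [1, 2],
})

# how="left" keeps every row of the left table, matched or not
toy_merged = toy_spending.merge(toy_shapes, on="Province", how="left", indicator=True)

# Rows marked "left_only" had no partner in the right table
unmatched = toy_merged.loc[toy_merged["_merge"] == "left_only", "Province"].tolist()

print(toy_merged)
print("No match found for:", unmatched)
```

Rows without a match are kept but filled with missing values, which is why consistent spelling of the province names matters before merging.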

In [None]:
prov_data['prov_name_en'] = prov_data['prov_name_en'].apply(lambda x: ''.join(map(str, x)))

merged_data = share_of_expenses.merge(prov_data, left_on='Province', right_on='Province', how='left')
merged_data

Now that we've obtained a merged dataframe, we can visualize the different provincial spending habits with a folium map! Use the dropdown menu labelled *Column* above the map to select a specific column to visualize. Provinces in deeper green have a higher allocation of government budget, while lighter colours represent the opposite. 

In [None]:
spendingbyprov = ipywidgets.Output(layout={'border': '1px solid black'})

column_names = merged_data.columns[1:7].tolist()
dropdown_options = ipywidgets.Dropdown(
    options=column_names,
    value=column_names[0],
    description='Column:',
    disabled=False
)

def update_choropleth(change):
    spendingbyprov.clear_output()
    with spendingbyprov:
        m = folium.Map(location=[50, -65], zoom_start=3)
        folium.Choropleth(
            geo_data=prov_data,
            data=merged_data,
            columns=['prov_name_en', dropdown_options.value],  
            key_on='feature.properties.prov_name_en',  
            fill_color='YlGn',
            fill_opacity=0.7,
            line_opacity=0.2,
            legend_name=f'Spending on {dropdown_options.value} by Province',
        ).add_to(m)
        display(m)

dropdown_options.observe(update_choropleth, names='value')
display(dropdown_options)
update_choropleth({'new': column_names[0]})

spendingbyprov

In the final section of our notebook, we'll be visualizing the differences between the dollars allocated by the federal and provincial governments. We'll also be performing another form of *data-cleaning* on our dataframe called `federal_provincial` by converting all our dollar amounts into valid numbers for analysis later.
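
The conversion can be sketched on a few hypothetical values. Amounts written with commas are stored as text, so the commas have to be removed before the text can be turned into numbers.

```python
import pandas as pd

# Hypothetical dollar amounts stored as text with thousands separators
toy_amounts = pd.Series(["1,234", "56,789,000", "42"])

# Remove the commas (treated as plain text, not a pattern), then convert to numbers
cleaned = toy_amounts.str.replace(",", "", regex=False).astype(float)

print(cleaned)
print("Total:", cleaned.sum())
```

Once the values are numeric, they can be added, compared, and plotted, none of which works reliably on text.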

In [None]:
federal_provincial = federal_provincial.drop(federal_provincial.index[20:], axis=0).reset_index(drop=True)
federal_provincial['Canadian Classification of Functions of Government (CCOFOG) '] = federal_provincial['Canadian Classification of Functions of Government (CCOFOG) '].str.strip()

columns_to_convert = federal_provincial.columns[1:15]
for col in columns_to_convert:
    # The comma is plain text rather than a pattern, so regex=False is enough
    federal_provincial[col] = federal_provincial[col].str.replace(',', '', regex=False).astype(float)
federal_provincial

Now that we have valid numeric values, let's create a visualization of each classification, comparing federal and provincial budgets. 

*Note*: Since many of the classifications have long names, we'll be using an abbreviation system. The abbreviated names will also be printed below the visualization.
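
Here is a small sketch of how such an abbreviation system can work in pandas, using a hypothetical mapping with only two entries. Names that are missing from the mapping are left unchanged.

```python
import pandas as pd

# Hypothetical mapping with only two entries
short_names = {"Economic affairs": "EA", "Social protection": "SP"}

toy_labels = pd.Series(["Economic affairs", "Health", "Social protection"])

# map() looks each label up in the dictionary; labels that are not found become missing,
# and fillna() puts the original label back in those spots
abbreviated = toy_labels.map(short_names).fillna(toy_labels)

print(abbreviated.tolist())
```

This is why short classification names such as *Health* can be left out of the mapping entirely.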

In [None]:
classification_name_mapping = {
    'General public services': 'GPS',
    'Public order and safety': 'POS',
    'Economic affairs': 'EA',
    'Environmental protection': 'EP',
    'Housing and community amenities': 'HCA',
    'Recreation, culture and religion': 'RCR',
    'Social protection': 'SP'
}

federal_provincial['Canadian Classification of Functions of Government (CCOFOG) '] = federal_provincial['Canadian Classification of Functions of Government (CCOFOG) '].map(classification_name_mapping).fillna(federal_provincial['Canadian Classification of Functions of Government (CCOFOG) '])
classifications = federal_provincial['Canadian Classification of Functions of Government (CCOFOG) '].unique()
num_columns = len(classifications)

all_classifications_fig = make_subplots(rows=1, cols=num_columns, subplot_titles=classifications)

col_num = 1

for classification in classifications:
    temp_df = federal_provincial[federal_provincial['Canadian Classification of Functions of Government (CCOFOG) '] == classification]
    temp_df = temp_df.melt(id_vars=['Canadian Classification of Functions of Government (CCOFOG) ', 'Public sector components'], var_name='Year', value_name='Value')

    traces = []
    for component in temp_df['Public sector components'].unique():
        trace = go.Scatter(
            x=temp_df[temp_df['Public sector components'] == component]['Year'],
            y=temp_df[temp_df['Public sector components'] == component]['Value'],
            mode='lines+markers',
            name=component
        )
        traces.append(trace)

    for trace in traces:
        all_classifications_fig.add_trace(trace, row=1, col=col_num)

    all_classifications_fig.update_xaxes(title_text='Year', row=1, col=col_num)

    all_classifications_fig.update_yaxes(title_text='', row=1, col=col_num)

    col_num += 1

all_classifications_fig.update_layout(
    title='Spending Over the Years by Classification',
    showlegend=False  
).show()

print("Classification Name Mapping:")
for full_name, short_name in classification_name_mapping.items():
    print(f"{full_name} => {short_name}")

Looking at our final visualization, we can definitively say that federal government budgets are *much* larger than their respective provincial counterparts. The only classification that has a similar federal and provincial budget is *Health*. Interestingly, the classification *Defence* has no budget allocated in the provincial sector. Can you think of reasons why this might be?

## Interpret

### Reflect on What You See

Think about the following questions.

1. How have shifts in government spending patterns impacted your perception of public service accessibility and quality in your area?
2. What strategies or policies do you believe should be implemented to ensure fair and equitable distribution of government resources across various sectors and regions?
3. What insights can be gained from historical instances of government budget adjustments, and how can they inform proactive measures to address future challenges related to government resource allocation and spending decisions?

## Communicate

Below are some writing prompts to help you reflect on the new information presented in the data. When you look at the evidence, think about how you perceive the information. Is this perception based on what the evidence shows? If others were to view it, what perceptions might they have?

- I used to think ____________________ but now I know ____________________. 
- I wish I knew more about ____________________. 
- This visualization reminds me of ____________________. 
- I really like ____________________.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)