![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Callysto’s Weekly Data Visualization

## Disability Prevalence in Canada
### Recommended Grade levels: 5-9
<br>

### Instructions

Click "Cell" and select "Run All".

This will import the data and run all the code, so you can see this week's data visualization. Scroll back to the top after you’ve run the cells.

![instructions](https://github.com/callysto/data-viz-of-the-week/blob/main/images/instructions.png?raw=true)

**You don't need to do any coding to view the visualizations**.

The plots generated in this notebook are interactive. You can hover over and click on elements to see more information. 

Email contact@callysto.ca if you experience issues.

### About this Notebook

Callysto's Weekly Data Visualization is a learning resource that aims to develop data literacy skills. We provide Grades 5-12 teachers and students with a data visualization, like a graph, to interpret. This companion resource walks learners through how the data visualization is created and interpreted by a data scientist. 

The steps of the data analysis process are listed below and applied to each weekly topic.

1. Question - What are we trying to answer?
2. Gather - Find the data source(s) you will need. 
3. Organize - Arrange the data, so that you can easily explore it. 
4. Explore - Examine the data to look for evidence to answer the question. This includes creating visualizations. 
5. Interpret - Describe what's happening in the data visualization. 
6. Communicate - Explain how the evidence answers the question. 

# Question

What is the proportion of people who experience disability compared to those who do not in the Canadian population?


### Goal

Our goal with this notebook is to inspire you with visualizations of the proportion of the population that identifies as having some type of disability compared to the proportion of the population that does not identify as having a disability. The data sets are taken from Statistics Canada.



# Gather

### Code: 

Run the code cells below to import the libraries we need for this project. Libraries are collections of pre-made code that make it easier to analyze our data. Pandas is a library that helps us analyze data, and plotly.express is a library that lets us make visualizations.

In [1]:
import pandas as pd
import plotly.express as px
import re
import plotly.graph_objects as go
print("Libraries imported.")

Libraries imported.


### Data
The data comes from Statistics Canada. The specific sources are:
- [Labour force status for persons with disabilities aged 25 to 64 years, by disability type (grouped)](https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1310073001)
- [Potential to work for persons with disabilities aged 25 to 64 years, by sex](https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1310074001)
- [Persons with and without disabilities aged 15 years and over, census metropolitan areas](https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1310075001)
- [A demographic, employment and income profile of Canadians with disabilities aged 15 years and over, 2017](https://www150.statcan.gc.ca/n1/pub/89-654-x/89-654-x2018002-eng.htm)

Note: Data was collected in 2016.

As a brief explanation of the libraries imported: 

| Library               | Description                                                                                            |
|-----------------------|--------------------------------------------------------------------------------------------------------|
| Pandas| Lets you work with structured data easily, perform data analysis, and prepare data for visualization.|
| Plotly Express / Plotly Graph Objects| Simplifies creating interactive visualizations |
| re (Regular Expressions) | Helps with pattern matching and manipulation of text. |

Without importing these libraries we would have to use much more code to analyze our data and generate visualizations. We import the libraries with abbreviations, or aliases, so that we have less typing to do in each line of our code below.

### Import the data

In [2]:
by_pop = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/data-viz-of-the-week/disabilities/by_pop.csv')
by_type = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/data-viz-of-the-week/disabilities/by_type.csv')
male_female = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/data-viz-of-the-week/disabilities/male_female_disabilities.csv')
employment = pd.read_csv("https://raw.githubusercontent.com/callysto/data-files/main/data-viz-of-the-week/disabilities/employment.csv")

To see what is in each data set, we can use the code below, which shows the first five rows of each data frame. By investigating the data frames we can tell that there is information about the following:

* Populations with and without disabilities in different geographic locations.
* Numbers of individuals with a certain type of disability.
* Numbers of individuals with disabilities based on potential to work or inability to work. 
* Employment percentages by age group, gender, and severity of disability.

In [3]:
display(by_pop.head(), by_type.head(), male_female.head(), employment.head())

Unnamed: 0,Geography,Disability,Number,Percent
0,"St. John's, Newfoundland and Labrador","Total population, with and without disabilities 7",167550,100.0
1,"St. John's, Newfoundland and Labrador",Persons with disabilities,37350,22.3
2,"St. John's, Newfoundland and Labrador",Persons without disabilities,130250,77.7
3,"Halifax, Nova Scotia","Total population, with and without disabilities 7",331300,100.0
4,"Halifax, Nova Scotia",Persons with disabilities,94350,28.5


Unnamed: 0,Disability type (grouped),Number
0,Total population with disabilities 7 8,3727920
1,Sensory disability 9,1364120
2,Physical disability 10,1958570
3,Pain-related disability,2512090
4,Mental health-related disability,1421270


Unnamed: 0,Sex,Potential to work,Number,Percent
0,Both Sexes,"Total, with or without work potential 6",1648460,100.0
1,Both Sexes,With work potential 7,644640,39.1
2,Both Sexes,Without work potential 8,1003820,60.9
3,Males,"Total, with or without work potential 6",700120,100.0
4,Males,With work potential 7,294440,42.1


Unnamed: 0,Age Group,Disabilities,Milder,Severe,Gender,Employment Percent
0,25-34 years,0,0,0,Women,77.3
1,25-34 years,0,0,0,Men,65.0
2,25-34 years,0,0,0,Both,81.8
3,35-44 years,0,0,0,Women,81.7
4,35-44 years,0,0,0,Men,89.5


# Organize

First, we split the `Geography` column into separate `City` and `Province` columns and strip any extra spaces from each.

In [4]:
by_pop[["City", "Province"]] = by_pop['Geography'].str.split(",", n=1, expand=True)
by_pop['City'] = by_pop['City'].str.strip()
by_pop['Province'] = by_pop['Province'].str.strip()
by_pop

Unnamed: 0,Geography,Disability,Number,Percent,City,Province
0,"St. John's, Newfoundland and Labrador","Total population, with and without disabilities 7",167550,100,St. John's,Newfoundland and Labrador
1,"St. John's, Newfoundland and Labrador",Persons with disabilities,37350,22.3,St. John's,Newfoundland and Labrador
2,"St. John's, Newfoundland and Labrador",Persons without disabilities,130250,77.7,St. John's,Newfoundland and Labrador
3,"Halifax, Nova Scotia","Total population, with and without disabilities 7",331300,100,Halifax,Nova Scotia
4,"Halifax, Nova Scotia",Persons with disabilities,94350,28.5,Halifax,Nova Scotia
...,...,...,...,...,...,...
100,"Vancouver, British Columbia",Persons with disabilities,410510,20.5,Vancouver,British Columbia
101,"Vancouver, British Columbia",Persons without disabilities,1591390,79.5,Vancouver,British Columbia
102,"Victoria, British Columbia","Total population, with and without disabilities 7",307700,100,Victoria,British Columbia
103,"Victoria, British Columbia",Persons with disabilities,89250,29,Victoria,British Columbia


Next, let's tackle a common problem in our dataframes: inconsistencies and inaccuracies. These may include blank values in rows, or stray numbers and letters in fields where they don't belong, such as the number at the end of `Sensory disability 9`. By addressing these issues, we can make the data more reliable and suitable for analysis.

To do this, we're going to define two functions, each of which solves a different issue present in our dataframes.

1. **remove_integers**:
   
The remove_integers function takes a string as input and returns a new string with all the digits (integers) removed, leaving only non-numeric characters.

2. **remove_commas_and_letters**:
   
The remove_commas_and_letters function takes a string representing a numerical value, possibly with commas or non-digit characters, and returns the cleaned and converted integer value by removing commas and any non-digit characters.

In [5]:
def remove_integers(string):
    return ''.join(i for i in string if not i.isdigit())

def remove_commas_and_letters(value):
    value = value.replace(',', '')  # Remove commas
    value = re.sub('[^0-9]', '', value)  # Remove non-digit characters using regex
    return int(value)
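Before applying the functions to whole columns, it can help to try them on a couple of sample strings. The sketch below repeats the two functions so it can run by itself. The comma-formatted number `"1,364,120"` is an assumed example of the kind of input the second function is meant to handle.

```python
import re

def remove_integers(string):
    return ''.join(i for i in string if not i.isdigit())

def remove_commas_and_letters(value):
    value = value.replace(',', '')       # Remove commas
    value = re.sub('[^0-9]', '', value)  # Remove anything that is not a digit
    return int(value)

# A label copied from the by_type table above
label = remove_integers("Sensory disability 9")
print(repr(label))          # the digit is gone, but the space before it remains
print(repr(label.strip()))  # .strip() removes the leftover space

# Assumed example of a number written with commas
print(remove_commas_and_letters("1,364,120"))
```

Notice that `remove_integers` leaves a trailing space behind when the number sits at the end of the label. An exact comparison such as `== "Sensory disability"` would not match that label unless the space is stripped first.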

Now, we're going to apply these functions to the appropriate columns.

In [6]:
# r'\d+' matches one or more digits in a row, so any numbers in the string are removed
by_pop['Province'] = by_pop['Province'].str.replace(r'\d+', '', regex=True).str.strip()

In [7]:
by_pop['Disability'] = by_pop["Disability"].apply(remove_integers)
by_pop['Number'] = by_pop["Number"].apply(remove_commas_and_letters)


by_type["Disability type (grouped)"] = by_type["Disability type (grouped)"].apply(remove_integers)
by_type['Number'] = by_type["Number"].apply(remove_commas_and_letters)

male_female['Potential to work'] = male_female["Potential to work"].apply(remove_integers)
male_female['Number'] = male_female["Number"].apply(remove_commas_and_letters)

Let's take a look at our dataframes after being cleaned.

In [8]:
display(by_pop.head(), by_type.head(), male_female.head(), employment.head())

Unnamed: 0,Geography,Disability,Number,Percent,City,Province
0,"St. John's, Newfoundland and Labrador","Total population, with and without disabilities",167550,100.0,St. John's,Newfoundland and Labrador
1,"St. John's, Newfoundland and Labrador",Persons with disabilities,37350,22.3,St. John's,Newfoundland and Labrador
2,"St. John's, Newfoundland and Labrador",Persons without disabilities,130250,77.7,St. John's,Newfoundland and Labrador
3,"Halifax, Nova Scotia","Total population, with and without disabilities",331300,100.0,Halifax,Nova Scotia
4,"Halifax, Nova Scotia",Persons with disabilities,94350,28.5,Halifax,Nova Scotia


Unnamed: 0,Disability type (grouped),Number
0,Total population with disabilities,3727920
1,Sensory disability,1364120
2,Physical disability,1958570
3,Pain-related disability,2512090
4,Mental health-related disability,1421270


Unnamed: 0,Sex,Potential to work,Number,Percent
0,Both Sexes,"Total, with or without work potential",1648460,100.0
1,Both Sexes,With work potential,644640,39.1
2,Both Sexes,Without work potential,1003820,60.9
3,Males,"Total, with or without work potential",700120,100.0
4,Males,With work potential,294440,42.1


Unnamed: 0,Age Group,Disabilities,Milder,Severe,Gender,Employment Percent
0,25-34 years,0,0,0,Women,77.3
1,25-34 years,0,0,0,Men,65.0
2,25-34 years,0,0,0,Both,81.8
3,35-44 years,0,0,0,Women,81.7
4,35-44 years,0,0,0,Men,89.5


Perfect, it seems that the main issues in the dataframes have been fixed. Let's begin exploring the cleaned data and making observations.

# Explore

Before we get started in exploring visualizations, it's essential to approach the topic of disabilities with sensitivity and avoid generalizations, as each person's experiences and condition are unique. 

With this in mind, let's get a better understanding of the distribution of disabilities in Canada, specifically looking at the distribution of *types* of disabilities. 

In [9]:
by_type_fig = px.histogram(by_type, x="Disability type (grouped)", y="Number", color="Number")
by_type_fig.update_traces(showlegend=False).show()

Looking at the figure above, the most common disabilities are pain-related or physical. Pinpointing why so many disabilities are linked with pain or physical limitations is difficult, but there are several possible reasons:

1. **Genetic/Biological Factors**: Many disabilities stem from genetic mutations that affect the development of the body, leading to physical impairments or pain.
   
2. **Acquired Disabilities**: Certain disabilities may be the result of accidents, injuries, or medical conditions that lead to physical limitations.
   
3. **Age-Related Disabilities**: As people age, they are more prone to physical disabilities due to natural wear and tear on the body. Disabilities like these include arthritis or mobility issues. 
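To put numbers on the comparison between disability types, here is a small sketch that turns the counts into percentages of the total population with disabilities. It uses only the four types visible in the first rows of the **by_type** table above, so any types further down the table are left out.

```python
# Counts copied from the first rows of the by_type table shown earlier
total_with_disabilities = 3727920
counts = {
    "Sensory disability": 1364120,
    "Physical disability": 1958570,
    "Pain-related disability": 2512090,
    "Mental health-related disability": 1421270,
}

# Share of the total for each type, rounded to one decimal place
shares = {name: round(100 * n / total_with_disabilities, 1) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share}%")
```

The shares add up to more than 100%, which suggests that one person can be counted under more than one type of disability.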

# Interpret

Now that we've gotten a better sense of the distribution of the different *types* of disabilities within Canada, we can look at where people with disabilities live before turning to the main topic at hand, employment rates for disabled people in Canada.

From viewing our dataframe after cleaning, we see that **by_pop** lists the number and percentage of people with and without disabilities in each city and province. We can use the `Percent` column to find which cities have the highest and lowest percentages of people with disabilities.

In [10]:
# Keep only the rows for each disability status
filtered_df_with = by_pop.loc[(by_pop['Percent'] != 100) & (by_pop['Disability'] == 'Persons with disabilities')]
filtered_df_without = by_pop.loc[(by_pop['Percent'] != 100) & (by_pop['Disability'] == 'Persons without disabilities')]

# Find the rows with the highest and lowest Percent for each status
maximum_with = filtered_df_with[filtered_df_with.Percent == filtered_df_with.Percent.max()].reset_index(drop=True)
maximum_without = filtered_df_without[filtered_df_without.Percent == filtered_df_without.Percent.max()].reset_index(drop=True)
minimum_without = filtered_df_without[filtered_df_without.Percent == filtered_df_without.Percent.min()].reset_index(drop=True)
minimum_with = filtered_df_with[filtered_df_with.Percent == filtered_df_with.Percent.min()].reset_index(drop=True)

print("Highest Percent with Disabilities: ")
display(maximum_with)

print("Highest Percent without Disabilities: ")
display(maximum_without)

print("Lowest Percent with Disabilities: ")
display(minimum_with)

print("Lowest Percent without Disabilities: ")
display(minimum_without)

Highest Percent with Disabilities: 


Unnamed: 0,Geography,Disability,Number,Percent,City,Province
0,"Belleville, Ontario",Persons with disabilities,36750,43.5,Belleville,Ontario


Highest Percent without Disabilities: 


Unnamed: 0,Geography,Disability,Number,Percent,City,Province
0,"Trois-Rivières, Quebec",Persons without disabilities,103750,88,Trois-Rivières,Quebec


Lowest Percent with Disabilities: 


Unnamed: 0,Geography,Disability,Number,Percent,City,Province
0,"Trois-Rivières, Quebec",Persons with disabilities,14200,12.0E,Trois-Rivières,Quebec


Lowest Percent without Disabilities: 


Unnamed: 0,Geography,Disability,Number,Percent,City,Province
0,"Belleville, Ontario",Persons without disabilities,47800,56.5,Belleville,Ontario


It appears that *Belleville, Ontario* has the highest percentage of people with disabilities, at a striking 43.5% of its population. *Trois-Rivières, Quebec* has the lowest percentage of people with disabilities, at 12.0%.
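One thing to watch for: a value like `12.0E` in the output above suggests that the `Percent` column may be stored as text rather than as numbers. Text is compared character by character, so `"100"` would sort before `"20.5"`. The sketch below is one possible way to convert such values to numbers before looking for the highest and lowest. It uses a few values copied from the outputs above, and the column name `Percent_num` is made up for this example.

```python
import pandas as pd

# A few Percent values copied from the outputs above, stored as text
toy = pd.DataFrame({
    "City": ["Belleville", "Trois-Rivières", "Halifax", "Vancouver"],
    "Percent": ["43.5", "12.0E", "28.5", "20.5"],
})

# Keep only digits and the decimal point, then convert the text to numbers
cleaned = toy["Percent"].str.replace(r"[^0-9.]", "", regex=True)
toy["Percent_num"] = pd.to_numeric(cleaned)

# idxmax / idxmin give the row labels of the largest and smallest values
highest = toy.loc[toy["Percent_num"].idxmax(), "City"]
lowest = toy.loc[toy["Percent_num"].idxmin(), "City"]

print("Highest:", highest)
print("Lowest:", lowest)
```

On this small sample the answer matches the result above, but converting to numbers first makes the comparison safer on the full table.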

We can also look at all provinces and cities on a larger scale by visualizing our data in a treemap. A treemap is particularly useful in this scenario because it displays hierarchical data in a tree-like structure. As a result, we can see which province each city belongs to.

Note: Data with smaller numbers can be harder to see. To get more information about a particular city, hover over it or click on it.

In [11]:
provinces = px.treemap(by_pop, path=[px.Constant("Canada"), 'Province', 'City', 'Disability'], values='Number')
provinces.update_traces(root_color="lightgrey")
provinces.update_layout(margin = dict(t=50, l=35, r=35, b=35))
provinces.show()

Upon examining the visualization, a striking pattern emerges: numerous cities in Ontario predominate in this dataset, regardless of disability status. Notably, major cities such as Toronto, Montréal, Vancouver, Calgary, Edmonton, and Ottawa exhibit a consistent trend, with *persons with disabilities* making up around 20-30% of the population, while the remaining 70-80% are *persons without disabilities*. Because people with disabilities make up such a large share of each city's population, it matters whether they can access opportunities such as employment.

We can take this a step further by viewing employment through the lens of age. Let's take the **employment** dataframe, group it by `Age Group`, `Disabilities`, and `Gender`, and calculate the mean of `Employment Percent` for each group. We'll store the result in **grouped_df**.
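If grouping and averaging is new to you, here is a minimal sketch on a small table. The numbers are made up purely to show the mechanics and are not taken from the real data.

```python
import pandas as pd

# Made-up numbers, only to show how groupby and mean work together
toy = pd.DataFrame({
    "Age Group": ["25-34 years", "25-34 years", "25-34 years", "35-44 years"],
    "Gender": ["Women", "Women", "Men", "Men"],
    "Employment Percent": [70.0, 80.0, 60.0, 90.0],
})

# Rows that share the same Age Group and Gender are combined into one row,
# and their Employment Percent values are averaged
toy_grouped = (
    toy.groupby(["Age Group", "Gender"])["Employment Percent"]
    .mean()
    .reset_index()
)

print(toy_grouped)
```

The two rows for women aged 25-34 (70.0 and 80.0) become a single row with a mean of 75.0, while groups with only one row keep their original value.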

In [12]:
# Group the DataFrame by Age Group, Disabilities, and Gender and calculate the mean of Employment Percent
grouped_df = employment.groupby(['Age Group', 'Disabilities', 'Gender'])['Employment Percent'].mean().reset_index()

with_disabilities = grouped_df.loc[grouped_df['Disabilities'] == 1]
without_disabilities = grouped_df.loc[grouped_df['Disabilities'] == 0]

# Build each figure first, then show it, so the variables hold the figures (.show() returns None)
with_disabilities_fig = px.bar(with_disabilities, x='Age Group', y='Employment Percent', color='Gender', barmode='group', title='Mean Employment Percent by Age Group and Gender of Disabled People')
with_disabilities_fig.show()
without_disabilities_fig = px.bar(without_disabilities, x='Age Group', y='Employment Percent', color='Gender', barmode='group', title='Mean Employment Percent by Age Group and Gender of Non-Disabled People')
without_disabilities_fig.show()

Comparing our two visualizations, it appears that regardless of disability status, both show a similar trend: the 25-34, 35-44, and 45-54 age groups all have similar employment percentages, and the pattern changes for the 55-64 age group. This is likely due to several factors, such as health and physical limitations, skills becoming less relevant because of rapid technological change, and people retiring from the workforce.

Unfortunately, it also appears that the mean employment percent for non-disabled individuals is significantly higher in all age groups compared to disabled individuals. This can be attributed to a variety of reasons:

1. **Disability Stigma and Discrimination**: Disability stigma and discrimination are pervasive in society. Many employers may hold negative stereotypes about disabled individuals, assuming they are less capable, less productive, or more expensive to accommodate. This bias can lead to discriminatory hiring practices, limiting job opportunities for disabled people.
   
2. **Limited Training and Skill Development**: Disabled individuals may face fewer opportunities for training and skill development, particularly if educational institutions and training programs are not designed to accommodate their needs.

3. **Lack of Representation**: The underrepresentation of disabled individuals in the workforce can contribute to a self-reinforcing cycle of limited role models and opportunities for career advancement.

4. **Unsupportive Work Environment**: Some work environments may not be supportive or understanding of the needs and accommodations required by disabled employees, leading to discomfort or difficulty in the workplace.

We can also visualize mean employment percent based on the severity of disability. In this particular case, disabilities are labelled as either **Milder** or **Severe**. 

In [13]:
# Calculate the mean employment percentage for each group
no_disabilities_mean = employment.loc[employment['Disabilities'] == 0, 'Employment Percent'].mean()
milder_disabilities_mean = employment.loc[employment['Milder'] == 1, 'Employment Percent'].mean()
severe_disabilities_mean = employment.loc[employment['Severe'] == 1, 'Employment Percent'].mean()

data = {
    'Disabilities': ['No Disabilities', 'Milder Disabilities', 'Severe Disabilities'],
    'Mean Employment Percent': [no_disabilities_mean, milder_disabilities_mean, severe_disabilities_mean]
}
df = pd.DataFrame(data)

mean_fig_total = go.Figure(data=go.Bar(
    x=df['Disabilities'],
    y=df['Mean Employment Percent'],
    marker=dict(color=['rgb(31,119,180)', 'rgb(255,127,14)', 'rgb(44,160,44)'])
))

mean_fig_total.update_layout(
    xaxis=dict(title='Disabilities'),
    yaxis=dict(title='Mean Employment Percent'),
    title='Mean Employment Percent by Disabilities'
).show()

As expected, individuals *without disabilities* have the highest mean employment rate, reaching 78.1%. Surprisingly, those with *milder disabilities* do not lag too far behind, with a mean employment rate of 73.6%. However, it is concerning to note that individuals with *severe disabilities* face significant challenges, with the lowest mean employment rate at 44.2%, falling far behind the other two categories.
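To make the size of these gaps concrete, the short sketch below subtracts each group's rate from the rate for people without disabilities. It uses only the three rounded percentages quoted above.

```python
# Mean employment rates quoted in the paragraph above (percent)
rates = {
    "No disabilities": 78.1,
    "Milder disabilities": 73.6,
    "Severe disabilities": 44.2,
}

# Gap, in percentage points, between each group and people without disabilities
baseline = rates["No disabilities"]
gaps = {group: round(baseline - rate, 1) for group, rate in rates.items()}

for group, gap in gaps.items():
    print(f"{group}: {gap} percentage points below the baseline")
```

The gap for severe disabilities (33.9 percentage points) is several times larger than the gap for milder disabilities (4.5 percentage points).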

The statistics highlight the importance of providing better employment opportunities for individuals with severe disabilities. These individuals often encounter substantial barriers to entry due to the severity of their impairments, which can result in limited access to suitable jobs and necessary workplace accommodations. Despite being individuals who often need financial stability the most, they find themselves grappling with limited employment options.

In our final visualization, we use the *male_female* dataframe. Earlier in the notebook we had a visualization that also highlighted gender as a factor in employment, but it separated the gender data by age group. The data in this dataframe looks only at the gender variable to compare work potential between genders.

In [14]:
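# Drop the "Total" rows so only the with/without work potential rows are plotted
male_female_queried = male_female[~male_female['Potential to work'].str.contains('Total, with or without work potential')]
# Build the figure first, then show it, so the variable holds the figure (.show() returns None)
grouped_male_female = px.bar(male_female_queried, x='Sex', y='Number', color='Potential to work', barmode='group')
grouped_male_female.show()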

Looking at the visualization, the first thing that stands out is that for both sexes combined, roughly 360,000 more people are without work potential than with it. When separating the sexes, females have slightly more people with work potential than males do, but a much higher number of people without work potential. For males, the difference is smaller, but the number of people without work potential is still greater than the number with.
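The differences can be checked with some quick arithmetic on the counts from the first rows of the **male_female** table shown earlier. The male "without work potential" count is not in those rows, so the sketch derives it under the assumption that the total is the sum of the "with" and "without" counts (this holds for the both-sexes rows).

```python
# Counts copied from the first rows of the male_female table shown earlier
both_total, both_with, both_without = 1648460, 644640, 1003820
males_total, males_with = 700120, 294440

# For both sexes, the total is the sum of the two groups
print(both_with + both_without == both_total)

# How many more people are without work potential than with it (both sexes)
gap_both = both_without - both_with

# Assumption: the same "total = with + without" rule applies to the males rows
males_without = males_total - males_with
gap_males = males_without - males_with

print("Both sexes gap:", gap_both)
print("Males without work potential:", males_without)
print("Males gap:", gap_males)
```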

# Communicate

Below are some writing prompts to help you reflect on the new information that is presented from the data. When we look at the evidence, think about what you perceive about the information. Is this perception based on what the evidence shows? If others were to view it, what perceptions might they have?

- I used to think ____________________but now I know____________________. 
- I wish I knew more about ____________________. 
- This visualization reminds me of ____________________. 
- I really like ____________________.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)