![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Languages spoken in Canada: Part 2<br><br>Appreciation of the diversity of multilingualism<br><br>


Welcome to this Jupyter notebook. This notebook is a free resource provided by the Callysto project, which aims to promote data science in grade 5 to 12 classrooms.

In this notebook, we will use Python to work through the steps of data analysis, visualization and interpretation.

## Steps to analysing data
1. **ask a question** - formulate a question to research.
2. **select** - find a suitable dataset and the Python libraries needed to answer the question.
3. **organize** - clean and organize the data to prepare for the analysis.
4. **explore** - analyse the data and create visualizations to represent it.
5. **interpret** - explain the observations made through the visualizations.
6. **communicate** - form a conclusion to your initial question based on your observations.

## Question

**How do we understand multilingualism in Canada?**

## Selection

### Loading the libraries

In [None]:
# To create tables, import the pandas library under the label pd
import pandas as pd

# To manipulate numbers, import the numpy library under the label np
import numpy as np

# For the visualizations, import the plotly express library under the label px and the matplotlib library under the label plt
import plotly.express as px
import matplotlib.pyplot as plt

print("Libraries have been imported.")

### Data sources

📕 

[Open Government Licence - Canada](https://ouvert.canada.ca/fr/licence-du-gouvernement-ouvert-canada)

Statistics Canada. (2023). Population by knowledge of official languages and geography, 1951 to 2021. [Series issue ID 15100004]. https://open.canada.ca/data/en/dataset/ca075a79-5962-4fc0-9a51-7439f659ea62
<br><br>

### Loading in the data

The dataset we will be using is the file `1951-2021-mother-language.csv` in the folder `data`. This dataset gives the mother languages in Canada as a whole, and in each province and territory, between 1951 and 2021.

We will load the dataset in the variable `mother_language_data`.

In [None]:
mother_language_data = pd.read_csv("data/1951-2021-mother-language.csv")
mother_language_data

The dataset `1951-2021-mother-language` is made of 2920 rows and 17 columns, which comes to about 49 000 cells. Several cells, and even entire columns, are filled with 'NaN'. 'NaN' means *Not a Number* and indicates that the cell is empty. We will need to clean this data in the next step: **organize**.
<br><br>
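
As a rough sketch of how the size of a table and its empty cells can be checked, the example below uses a tiny made-up DataFrame named `toy` (the name and all of its values are assumptions for illustration, not the real census data).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table: the values are invented for illustration only
toy = pd.DataFrame({
    "REF_DATE": [1951, 1951, 2021],
    "GEO": ["Canada", "Canada", "Canada"],
    "VALUE": [67.0, 20.0, np.nan],
    "SYMBOL": [np.nan, np.nan, np.nan],
})

n_rows, n_cols = toy.shape          # (number of rows, number of columns)
total_cells = n_rows * n_cols       # every cell, empty or not
nan_per_column = toy.isna().sum()   # count of NaN cells in each column

print(f"{n_rows} rows x {n_cols} columns = {total_cells} cells")
print(nan_per_column)
```

Applied to the real dataset, the same multiplication gives 2920 x 17 = 49 640 cells, which is where the "about 49 000" figure comes from.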

## Organize

Let's return to our research question: how do we understand multilingualism? We will have to identify and understand the relevant variables in order to organize and explore the data.

To get more information on the variables related to language, check the [language reference guide, census of population, 2021](https://www12.statcan.gc.ca/census-recensement/2021/ref/98-500/003/98-500-x2021003-eng.cfm).

### A note on the language data and the dataset

Here are the questions relevant to our research that were asked in the 2021 census:

>The question regarding **mother tongue** is the following:
>
>What is the language that this person first learned at home in childhood and still understands?
>
>The question regarding the languages spoken **at home** was divided into two parts:
>
>a) What language(s) does this person speak on a regular basis at home?
>b) Of these languages, which one does this person speak most often at home?
>
>The question regarding the language spoken **at work** was divided into two parts:
>
>a) In this job, what language(s) did this person use on a regular basis?
>b) Of these languages, which one did this person use most often in this job?

### The mother tongue dataset `mother_language_data`

Let's remove the columns we do not need, which also removes many of the "NaN" values in our dataset.

In [None]:
mother_language_data = mother_language_data.drop(columns=["DGUID", "UOM", "UOM_ID", "SCALAR_FACTOR", "SCALAR_ID", "VECTOR", "COORDINATE", "STATUS", "SYMBOL", "TERMINATED", "DECIMALS"])
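
Before dropping columns by name, it can help to check which columns hold nothing but "NaN". The sketch below shows one possible way to do this on a small made-up table; `toy`, `empty_columns` and `cleaned` are hypothetical names, and which columns of the real dataset are completely empty is not assumed here.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table: the values are invented for illustration only
toy = pd.DataFrame({
    "GEO": ["Canada", "Quebec"],
    "VALUE": [10.0, np.nan],
    "SYMBOL": [np.nan, np.nan],
    "TERMINATED": [np.nan, np.nan],
})

# isna().all() is True for a column only when every one of its cells is NaN
empty_columns = toy.columns[toy.isna().all()].tolist()

# Drop the fully empty columns; partially filled columns such as VALUE are kept
cleaned = toy.drop(columns=empty_columns)

print(empty_columns)
print(cleaned)
```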

Let's also keep only the rows where the `Statistics` column is "Percentage", since we will use these values in our visualizations.

In [None]:
mother_language_data_percent = mother_language_data.loc[mother_language_data["Statistics"] == "Percentage"]

We can first look at the different values in our `Knowledge of official languages` column.

In [None]:
mother_language_data['Knowledge of official languages'].unique()

Looking at the output above, we have *'Total, knowledge of official languages'*, which is not a language category. Most likely, it is the total across all categories rather than a category of its own. Let's remove this value.

In [None]:
mother_language_data_reorg = mother_language_data_percent[mother_language_data_percent["Knowledge of official languages"].str.contains("Total") == False]
mother_language_data_reorg = mother_language_data_reorg.reset_index(drop=True)
mother_language_data_reorg
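
The same kind of filter can also be written with the `~` (not) operator. The sketch below is one possible alternative, shown on a small made-up table; `toy`, `is_total` and `without_total` are hypothetical names, and the `na=False` argument is an added assumption so that empty cells count as "not a total".

```python
import pandas as pd

# Hypothetical miniature table: the values are invented for illustration only
toy = pd.DataFrame({
    "Knowledge of official languages": [
        "Total, knowledge of official languages",
        "English only",
        "French only",
    ],
    "VALUE": [100.0, 60.0, 40.0],
})

# True for rows whose label contains the word "Total"
is_total = toy["Knowledge of official languages"].str.contains("Total", na=False)

# ~ flips True and False, so only the rows that are not totals are kept
without_total = toy[~is_total].reset_index(drop=True)

print(without_total)
```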

## Explore

We will see how the use of language depends on the context. The line graph below shows the percentage of each mother tongue category declared in each census year between 1951 and 2021.

The categories in the legend are:
- French only
- English only
- English and French
- Neither English nor French

The y-axis represents the percentage and the x-axis represents the year.

In [None]:
mother_language_data_reorg

In [None]:
df_geo = mother_language_data_reorg.loc[lambda df: (df["GEO"]=="Canada")]

mother_language_fig = px.line(df_geo, x="REF_DATE", y="VALUE", color="Knowledge of official languages", color_discrete_sequence=px.colors.diverging.Portland, title="Progression of people who know their mother language in Canada from 1951-2021, via percentage")
mother_language_fig.show()

🔎 Which languages began to be represented in 1991, when more choices were added to the census questions?

We can also look at our dataset through another lens. Instead of filtering on *Percentage*, let's take a look at the **number** of people who speak a particular language in Canada.

In [None]:
mother_language_data_number = mother_language_data.loc[mother_language_data["Statistics"] == "Number"]

mother_language_number_reorg = mother_language_data_number[mother_language_data_number["Knowledge of official languages"].str.contains("Total") == False]
mother_language_number_reorg = mother_language_number_reorg.reset_index(drop=True)
df_geo = mother_language_number_reorg.loc[lambda df: (df["GEO"]=="Canada")]

mother_language_fig = px.line(df_geo, x="REF_DATE", y="VALUE", color="Knowledge of official languages", color_discrete_sequence=px.colors.diverging.Portland, title="Progression of people who know their mother language in Canada from 1951-2021, via numerical value")
mother_language_fig.show()

🔎 Which languages appear to grow faster or slower in the numerical visualization compared to the percentage visualization?
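
To see why the two graphs can tell different stories, the sketch below converts counts into percentages of each year's total. The table `counts` and every number in it are made up for illustration, and the `PERCENT` column is a hypothetical name that is not part of the dataset.

```python
import pandas as pd

# Hypothetical counts: the values are invented for illustration only
counts = pd.DataFrame({
    "REF_DATE": [1951, 1951, 2021, 2021],
    "Knowledge of official languages": ["English only", "French only"] * 2,
    "VALUE": [600, 400, 1500, 500],
})

# Total number of people in each year, repeated on every row of that year
year_total = counts.groupby("REF_DATE")["VALUE"].transform("sum")

# Share of each category within its own year
counts["PERCENT"] = 100 * counts["VALUE"] / year_total

print(counts)
```

In this invented example, "French only" grows in number (400 to 500) while its share shrinks (40% to 25%), because the total grew faster than that category did.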

We can also take a look at how the picture changes within particular provinces and areas of Canada. Let's see what values we have within the `GEO` column.

In [None]:
mother_language_data['GEO'].unique()

Looking at our Python output, we see that we have the following provinces/areas:

- 'Canada'
- 'Newfoundland and Labrador'
- 'Prince Edward Island'
- 'Nova Scotia'
- 'New Brunswick'
- 'Quebec'
- 'Ontario'
- 'Manitoba'
- 'Saskatchewan'
- 'Alberta'
- 'British Columbia'
- 'Yukon'
- 'Northwest Territories including Nunavut'
- 'Canada outside Quebec'
- 'Northwest Territories'
- 'Nunavut'
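
Some of these `GEO` values, such as 'Canada' and 'Canada outside Quebec', cover several provinces at once. If you want to compare individual provinces only, one possible approach is to filter those labels out with `isin`. The sketch below uses a small made-up table; `toy`, `aggregates` and `provinces_only` are hypothetical names, and treating exactly these two labels as aggregates is an assumption.

```python
import pandas as pd

# Labels assumed to cover more than one province (an assumption for this sketch)
aggregates = ["Canada", "Canada outside Quebec"]

# Hypothetical miniature table: the values are invented for illustration only
toy = pd.DataFrame({
    "GEO": ["Canada", "Quebec", "Ontario", "Canada outside Quebec"],
    "VALUE": [100, 30, 50, 70],
})

# isin() marks the aggregate rows; ~ keeps everything else
provinces_only = toy[~toy["GEO"].isin(aggregates)].reset_index(drop=True)

print(provinces_only)
```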

Let's take a look at which provinces have the highest number of purely English speakers. Afterwards, we can also take a look at which province has the highest number of purely French speakers.

Since this dataset goes until 2021, let's use this year as our baseline to compare. 

We'll first filter for the year **2021**.

In [None]:
# Build the year filter from the same dataframe that is being filtered
mother_language_2021 = mother_language_data_number[mother_language_data_number['REF_DATE'] == 2021]
mother_language_2021 = mother_language_2021[mother_language_2021["Knowledge of official languages"].str.contains("Total") == False]

mother_language_2021 = mother_language_2021.reset_index(drop=True)
mother_language_2021

Now, we can compare the **VALUE** of each category across provinces in a bar chart, and find the highest in each one.

In [None]:
province_fig = px.bar(mother_language_2021, x='GEO', y='VALUE', color='Knowledge of official languages',title='Number of people who know their mother language in Provinces in Canada (2021)', labels={'VALUE': 'Value', 'GEO': 'Province'},category_orders={'Knowledge of official languages': ['French only', 'English only', 'English and French', 'Neither English nor French']})
province_fig.show()
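
The largest category in each province can also be read off directly instead of from the chart. The sketch below shows one way to do it with `groupby` and `idxmax` on a small made-up table; `toy` and `top_rows` are hypothetical names and the numbers are not real census values.

```python
import pandas as pd

# Hypothetical miniature table: the values are invented for illustration only
toy = pd.DataFrame({
    "GEO": ["Quebec", "Quebec", "Ontario", "Ontario"],
    "Knowledge of official languages": [
        "English only", "French only", "English only", "French only",
    ],
    "VALUE": [5, 50, 80, 1],
})

# For each province, idxmax() gives the row label holding the largest VALUE
largest_row_labels = toy.groupby("GEO")["VALUE"].idxmax()

# Use those labels to pull out the full rows
top_rows = toy.loc[largest_row_labels]

print(top_rows)
```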

🔎 Are there particular provinces that seem to go against the trend of other provinces? Can you identify reasons why this could be?

## Interpret

🔎 Which category of mother tongue language has an increasing trend even with the addition of the extra responses in 1991?

## Communicate

🔎 How does the exploration of the census data help you in discovering what bilingualism and multilingualism mean? 

🔎 In your opinion, which results have the most impact? Explain.

🔎 Do you have your own questions about multilingualism? What are your questions? What kind of data could help you answer your questions?

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)