![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Indigenous Populations in Canada

## A Current Data Analysis

In this notebook we are looking at population data for Indigenous people across Canada.

We will scrape some data from a Wikipedia webpage and then visualize the data.

Our data source is the Wikipedia page [Indigenous peoples in Canada](https://en.wikipedia.org/wiki/Indigenous_peoples_in_Canada), which is largely based on the [2016 Canada Census](https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/dt-td/Lp-eng.cfm?LANG=E&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=0&GID=0&GK=0&GRP=1&PID=0&PRID=10&PTYPE=109445&S=0&SHOWALL=0&SUB=0&Temporal=2017&THEME=122&VID=0&VNAMEE=&VNAMEF=).

### Step 1: Libraries

First, we import some useful Python libraries: [pandas](https://pandas.pydata.org/) for data analysis and [Plotly Express](https://plotly.com/python/plotly-express/) for creating visualizations.

In [None]:
import pandas as pd
import plotly.express as px
print('Libraries imported')

### Step 2: Getting Data

We then read the tables on the Wikipedia page and store the fourth one (index 3) in a dataframe called `df`. We've also saved the data to a CSV file that we can import instead if the Wikipedia page has changed.

In [None]:
df = pd.read_html('https://en.wikipedia.org/wiki/Indigenous_peoples_in_Canada')[3]
if df.columns[1] != 'Number':
    print('reading data from CSV file')
    df = pd.read_csv('data/indigenous-populations-present.csv')
df
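The cell above only falls back to the CSV file when the table's second column has a different name. If the page cannot be reached, or it has fewer than four tables, the first line raises an error before the check runs. Here is a minimal sketch of a more defensive loader. The function name `load_table` and its parameters are made up for this illustration, and catching every exception is a deliberate simplification.

```python
import io
import pandas as pd

def load_table(url, table_index, fallback_csv, expected_column='Number'):
    """Return one table scraped from url, or the CSV backup if scraping fails."""
    try:
        tables = pd.read_html(url)
        df = tables[table_index]
        # Same layout check as in the cell above
        if df.columns[1] != expected_column:
            raise ValueError('table layout has changed')
    except Exception as error:
        # Broad on purpose: covers network, parser, and layout problems
        print(f'reading data from CSV file ({error})')
        df = pd.read_csv(fallback_csv)
    return df

# Demonstration with a page that does not exist and a tiny placeholder backup
backup = io.StringIO('Province/Territory,Number\nExample,1\n')
demo = load_table('no-such-page.html', 3, backup)
print(demo)
```

In the notebook itself, the call would use the Wikipedia address, index `3`, and the path `data/indigenous-populations-present.csv`.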

### Step 3: Cleaning Up

Let's clean up the dataframe.

We will start by removing the last row (sources) and the last column (`Unnamed: 8`) as they don't contain useful information for our purposes.

In [None]:
df = df.drop(14)
df = df.drop('Unnamed: 8',axis=1)
df

It turns out the column names have some problems as they include some additional characters, including some that are invisible. We can see this by listing the column names.

In [None]:
df.columns
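If you are curious about what the invisible characters are, `repr` shows them as escape codes. The sketch below uses made-up column names (a footnote marker, a zero-width space, and stray spaces) rather than the real ones from the page, and cleans them with pandas string methods instead of retyping every name.

```python
import pandas as pd

# Hypothetical headers: a footnote marker, a zero-width space (\u200b), stray spaces
raw = pd.DataFrame(columns=['Number[1]', 'First\u200b Nations', ' Inuit '])

# repr() prints invisible characters as escape sequences such as '\u200b'
for name in raw.columns:
    print(repr(name))

# Remove footnote markers, zero-width spaces, and leading/trailing whitespace
cleaned = (raw.columns
           .str.replace(r'\[.*?\]', '', regex=True)
           .str.replace('\u200b', '', regex=False)
           .str.strip())
print(list(cleaned))
```

Typing the names out by hand, as in the next cell, is simpler when there are only a few columns.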

Let's rename the columns.

In [None]:
df.columns = ['Province/Territory','Number','%','First Nations','Métis','Inuit','Multiple','Other']
df

Let's also rename the `NaN` at the bottom of the `Province/Territory` column. This row is really the total for Canada.

In [None]:
df.loc[13, 'Province/Territory'] = 'Canada'  # .loc avoids chained assignment, which may not update df
df

### Step 4: Plotting the Data

Let's use Plotly Express to create a bar chart of the number of Indigenous people in each Province or Territory.

In [None]:
px.bar(df, x='Province/Territory', y='Number', title='A: Indigenous Populations by location')

### Step 5: What Went Wrong?

You may have noticed that something is wrong with the chart. What went wrong and how do we fix it?

Discuss before going to the next stage.

### Step 6: Fixing the Data

The problem is that Python thinks those numbers are just text strings, so we have to force them to be numbers. The pandas function `to_numeric` converts the text in a column into numbers.

In [None]:
df['Number'] = pd.to_numeric(df['Number'], errors='coerce')
px.bar(df, x='Province/Territory', y='Number', title='B: Indigenous Populations by Location')
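To see what `to_numeric` does on its own, here is a small sketch with made-up values (not census figures). Text values are compared character by character, so they have no numeric size for a chart to use. With `errors='coerce'`, anything that cannot be converted becomes `NaN` instead of raising an error.

```python
import pandas as pd

# Made-up values for illustration, not census figures
text_values = pd.Series(['900', '1000', '25', 'n/a'])

# As strings, '1000' sorts before '25' because '1' comes before '2'
print(sorted(text_values[:3]))

# Convert to numbers; 'n/a' cannot be converted, so it becomes NaN
numbers = pd.to_numeric(text_values, errors='coerce')
print(numbers.tolist())
print(numbers.isna().sum(), 'value(s) could not be converted')
```

Counting the `NaN` values after a conversion is a quick way to check that `errors='coerce'` has not silently discarded data you expected to keep.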

### Step 7: A Better Plot

Let's drop row 13, which is the total for Canada. It is so big that it obscures the other bars.

The result shows each province and territory in more detail.

In [None]:
px.bar(df.drop(13), x='Province/Territory', y='Number', title='C: Indigenous Populations by Province/Territory')
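Dropping row 13 works only as long as the total stays in that position. A possible alternative, sketched here on a tiny placeholder table, is to filter by the row's label instead of its index.

```python
import pandas as pd

# Placeholder values, not census figures
sample = pd.DataFrame({
    'Province/Territory': ['Alberta', 'Yukon', 'Canada'],
    'Number': [2, 1, 3],
}, index=[11, 12, 13])

# Keep every row whose label is not 'Canada'
provinces_only = sample[sample['Province/Territory'] != 'Canada']
print(provinces_only)
```

The same filter applied to `df` could be passed to `px.bar` in place of `df.drop(13)`.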

### Step 8: Plot More Data

Finally, let's do a multi-bar plot to show the breakdown of the different categories of Indigenous people in this data. Plotly has an easy way to do this, using the "wide format" for data frames. 

First, though, we need to convert the data in each column into numerical values. Then we call the Plotly bar command.

In [None]:
df['First Nations'] = pd.to_numeric(df['First Nations'], errors='coerce')
df['Métis'] = pd.to_numeric(df['Métis'], errors='coerce')
df['Inuit'] = pd.to_numeric(df['Inuit'], errors='coerce')
df['Multiple'] = pd.to_numeric(df['Multiple'], errors='coerce')
df['Other'] = pd.to_numeric(df['Other'], errors='coerce')
px.bar(df, x='Province/Territory', y=['First Nations','Métis','Inuit','Multiple','Other'], title='D: Indigenous Populations in Canada', barmode='group')

Oops. Let's get rid of the `Canada` bar again, to make the other values clearer.

In [None]:
px.bar(df.drop(13), x='Province/Territory', y=['First Nations','Métis','Inuit','Multiple','Other'], title='E: Indigenous populations in Canada', barmode='group')
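Two ideas from this step can be sketched on a small placeholder table: converting several columns in one go, and what "wide format" means. In wide format each group has its own column. In long format each row holds one place, one group, and one count. The values and the names `wide`, `long`, `Group`, and `Count` are made up for this illustration.

```python
import pandas as pd

# Placeholder values stored as text, not census figures
wide = pd.DataFrame({
    'Province/Territory': ['Alberta', 'Yukon'],
    'First Nations': ['3', '2'],
    'Métis': ['2', '1'],
    'Inuit': ['1', '1'],
})
groups = ['First Nations', 'Métis', 'Inuit']

# Convert every group column in one step instead of one line per column
wide[groups] = wide[groups].apply(pd.to_numeric, errors='coerce')

# Reshape from wide (one column per group) to long (one row per place and group)
long = wide.melt(id_vars='Province/Territory', value_vars=groups,
                 var_name='Group', value_name='Count')
print(long)
```

Plotly Express accepts either shape. With the long table, a call along the lines of `px.bar(long, x='Province/Territory', y='Count', color='Group', barmode='group')` should give a grouped chart similar to the one above.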

### Notes on data sources

This data came from Wikipedia, based on the 2016 Canadian census.

This data may be updated in the future, as the census is completed every 5 years, and often released in the fall following the year it was collected. See if you can update the data with the most current available census.

## Further work

- What else can you do with this data? Perhaps you want to plot information about percentages. How would you do that?

- What other sources of data can you get? Can you access Wikipedia data about Indigenous populations in other countries that interest you? How about the United States? Or Australia?

### A little help

Let's see if we can plot the percentages, by `Province/Territory`. Try it yourself first, and then check your work here.

First, remove the percentage sign by replacing it with just `''`. Then convert the percentage numbers (which are strings) to numerical values.

In [None]:
df['%'] = df['%'].str.replace('%','')
df['%'] = pd.to_numeric(df['%'], errors='coerce')
px.bar(df, x='Province/Territory', y='%', title='F: Indigenous Population Percentage by Province or Territory')

## Conclusions

We have shown how to scrape some data from Wikipedia, place it into a pandas dataframe, and produce informative charts using Plotly. Important steps include cleaning up the data to get it into a form that the program can understand and putting it into good shape for plotting.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)