![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

## Indigenous population in Canada, present day

In this short notebook, we show how to scrape some data from a Wikipedia webpage and then visualize the data in a nice graphic.

We are looking at population data for Indigenous people across Canada. 

Our data source is this Wikipedia page: https://en.wikipedia.org/wiki/Indigenous_peoples_in_Canada

### Step 1. Libraries


First, we import some useful Python libraries: Pandas for data analysis, requests for reading the webpage, and warnings to silence any warnings raised when we read the webpage.

We also import a plotting library, Plotly Express.

In [None]:
import pandas as pd
import requests, warnings
import plotly.express as px

### Step 2. Fetching the webpage

We then point a variable **url** to the Wikipedia webpage we want to use, fetch the page, and store the response in **res**.

In [None]:
url = 'https://en.wikipedia.org/wiki/Indigenous_peoples_in_Canada'

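# Silence the warning that requests raises when certificate verification is off
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # verify=False skips TLS certificate checks; only use it if your environment needs it
    res = requests.get(url, verify=False)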


### Step 3. Reading the tables

The variable **res** is actually a response object. We could look at all the text in it using the attribute **res.text**, but instead, let's have Pandas parse the tables in that text and look at the result.

In [None]:
pd.read_html(res.text)

### Step 3 results, interpretation

The result of Step 3 is a list of tables, one dataframe for each table found on the webpage. We can select them using an index. Experiment a bit, and we discover that the table we want is at index 3. Let's put this into a dataframe, which we call **df**.

In [None]:
df = pd.read_html(res.text)[3]
df
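
Picking a table by position works until the page is edited and the tables shift. Here is a minimal sketch of a sturdier approach, assuming we know part of a column name that the wanted table must contain. The helper `find_table` and the toy dataframes are hypothetical stand-ins for the list that `pd.read_html` returns; the values are made up.

```python
import pandas as pd

# Toy stand-ins for the list of dataframes returned by pd.read_html (made-up values)
tables = [
    pd.DataFrame({"Year": [2006, 2016]}),
    pd.DataFrame({"Province": ["A", "B"], "Number": ["10", "20"]}),
]

def find_table(tables, required_column):
    """Return the first dataframe with a column name containing required_column."""
    for table in tables:
        # Column labels are not always strings, so convert before comparing
        if any(required_column in str(col) for col in table.columns):
            return table
    raise ValueError(f"No table has a column containing {required_column!r}")

wanted = find_table(tables, "Number")
print(list(wanted.columns))
```

`pd.read_html` also accepts a `match` argument that keeps only the tables containing a given piece of text, which may be enough on its own for some pages.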

### Step 4. Clean up

Let's clean up the dataframe. 

We will remove the last row (index 14) and the last column ('Unnamed: 8') as they don't contain useful information.

In [None]:
df = df.drop(14)
df = df.drop('Unnamed: 8',axis=1)
df
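
The labels `14` and `'Unnamed: 8'` are specific to the table as it was scraped. As a hedged alternative, the sketch below drops the last row and last column by position with `iloc`, so it does not depend on those exact labels. The toy dataframe is made up for illustration.

```python
import pandas as pd

# Toy table with a junk last row and a junk last column (made-up values)
toy = pd.DataFrame({
    "Region": ["A", "B", "Notes"],
    "Number": ["10", "20", "Notes"],
    "Unnamed: 2": [None, None, None],
})

# Keep every row except the last and every column except the last
trimmed = toy.iloc[:-1, :-1]
print(trimmed)
```

This only makes sense when the unwanted row and column really are the last ones, so it is worth displaying the dataframe first to check.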

### Step 4+

Let's also get rid of the "NaN" in the second column. This really should be the total for Canada. 

It's a bit tricky, as this "NaN" is not a string, but a floating point value that represents "Not a Number." So in our Python code, we call it float("nan").

In [None]:
df = df.replace([float("nan")],'Canada')

In [None]:
df
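
To see why NaN needs special handling, here is a small sketch on made-up data. It shows that NaN does not compare equal to itself, and it uses `fillna` on a single column (an alternative to the `replace` call above, assuming the missing value sits in a label column). The column names are hypothetical.

```python
import math
import pandas as pd

nan = float("nan")
print(nan == nan)        # NaN never compares equal to itself
print(math.isnan(nan))   # so we test for it with isnan (or pd.isna)

# Toy dataframe with a missing label in the last row (made-up values)
toy = pd.DataFrame({"Region": ["A", nan], "Number": ["10", "30"]})

# Fill the missing label in one column only, leaving the other columns untouched
toy["Region"] = toy["Region"].fillna("Total")
print(toy["Region"].tolist())
```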

### Step 5. More cleanup 

It turns out the column names have some problems: they include some additional characters, including some that are invisible (such as the non-breaking space `\xa0`). We can see this by listing the column names. We then use this information to clean up the column names.

In [None]:
list(df)

In [None]:
## We use the "inplace" option to update the current dataframe
df.rename(columns = {'Province\xa0/ Territory':'Province/Territory', '%[a]':'%', 'Other[b]': 'Other'}, inplace = True)
df.rename(columns = {'First Nations(Indian)': 'First Nations'}, inplace = True)
df
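
Renaming each column by hand is fine for a few columns. As a sketch of a more general approach, the hypothetical helper `clean_name` below replaces non-breaking spaces and strips footnote markers such as `[a]` from every header. The toy dataframe only mimics the headers listed above; its values are made up.

```python
import re
import pandas as pd

# Toy dataframe whose headers mimic the problems above (values are made up)
toy = pd.DataFrame({"Province\xa0/ Territory": ["A"], "%[a]": ["1%"], "Other[b]": ["5"]})

def clean_name(name):
    name = name.replace("\xa0", " ")        # non-breaking space -> ordinary space
    name = re.sub(r"\[[^\]]*\]", "", name)  # drop footnote markers such as [a]
    return name.strip()

# rename accepts a function and applies it to every column label
toy = toy.rename(columns=clean_name)
print(list(toy))
```

Note that this keeps the spaces around the slash, so the first column becomes `'Province / Territory'` rather than the `'Province/Territory'` used in this notebook.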

### Step 6. Plotting the data

Let's use Plotly to create a simple bar chart. We will attempt to plot the number of Indigenous people in each province and territory.

In [None]:
fig = px.bar(df, x='Province/Territory', y='Number')
fig.show()

### Step 6 - what went wrong?

You may have noticed the plot is wrong. What went wrong? And how do we fix it?

Discuss before going to the next stage.


### Step 7. Fixing the data.

The problem is that Pandas read those numbers in as text strings, so we have to convert them to numbers.
There is a simple command to do this, called **to_numeric**, which will convert the text in a column into numbers.

In [None]:
df['Number'] = pd.to_numeric(df['Number'], errors='coerce')
px.bar(df, x='Province/Territory', y='Number')
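
One way to see the difference between text and numbers is to try it on a tiny, made-up series. The sketch below shows that text values sort character by character, and that `errors='coerce'` turns anything that cannot be parsed into NaN instead of raising an error.

```python
import pandas as pd

# Made-up values stored as text, the way they can arrive from a scraped table
counts = pd.Series(["9", "100", "25", "n/a"])
print(counts.dtype)  # a text dtype, not a numeric one

# As text, "100" sorts before "25" and "9" because comparison is character by character
print(sorted(counts[:3]))

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
numbers = pd.to_numeric(counts, errors="coerce")
print(numbers.tolist())
```

Because coerced values silently become NaN, it can be worth counting them with `numbers.isna().sum()` to check that nothing important was lost.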

### Step 7+  A better plot

Let's drop the row with index 13, which holds the Canada total, as it is so much bigger than the other values. The result shows each province and territory in more detail.

In [None]:
px.bar(df.drop(13), x='Province/Territory', y='Number')
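
Dropping by index works as long as the total stays in row 13. As a hedged alternative, the sketch below filters the total row out by its label instead. The toy dataframe and the label `"Total"` are made up; in this notebook the label would be `'Canada'`.

```python
import pandas as pd

# Toy data with a total row at the end (made-up values)
toy = pd.DataFrame({"Region": ["A", "B", "Total"], "Number": [10, 20, 30]})

# Keep only the rows whose label is not the total row
regions_only = toy[toy["Region"] != "Total"]
print(regions_only)
```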

### Step 8. More data in the plot.

Finally, let's do a multi-bar plot to show the breakdown of the different categories of people in this data. Plotly has an easy way to do this, using the "wide format" for data frames. 

First, though, we need to convert the data in each column into numerical values. Then we call the Plotly bar command.


In [None]:
df['First Nations'] = pd.to_numeric(df['First Nations'], errors='coerce')
df['Métis'] = pd.to_numeric(df['Métis'], errors='coerce')
df['Inuit'] = pd.to_numeric(df['Inuit'], errors='coerce')
df['Multiple'] = pd.to_numeric(df['Multiple'], errors='coerce')
df['Other'] = pd.to_numeric(df['Other'], errors='coerce')
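
The five conversions above can also be written as one step. This is a minimal sketch on a made-up dataframe with hypothetical column names: `apply` runs `pd.to_numeric` on each listed column and passes `errors="coerce"` through to it.

```python
import pandas as pd

# Toy data with numbers stored as text (made-up values)
toy = pd.DataFrame({
    "Region": ["A", "B"],
    "Group 1": ["10", "20"],
    "Group 2": ["3", "x"],
})

cols = ["Group 1", "Group 2"]
# Apply pd.to_numeric to each listed column; extra keyword arguments are passed through
toy[cols] = toy[cols].apply(pd.to_numeric, errors="coerce")
print(toy.dtypes)
```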

In [None]:
px.bar(df, x='Province/Territory', 
           y=['First Nations','Métis','Inuit','Multiple','Other'], title="Indigenous populations in Canada")

### Oops.

Let's get rid of Canada again!

In [None]:
px.bar(df.drop(13), x='Province/Territory', 
           y=['First Nations','Métis','Inuit','Multiple','Other'], title="Indigenous populations in Canada")
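
For readers wondering what "wide format" means: the dataframe has one column per category. The same data can be stored in "long format", with one row per region and category. A small sketch with made-up values and hypothetical column names, using `melt`:

```python
import pandas as pd

# Wide format: one column per category (made-up values)
wide = pd.DataFrame({"Region": ["A", "B"], "Group 1": [10, 20], "Group 2": [3, 4]})

# Long format: one row per (Region, Category) pair
long = wide.melt(id_vars="Region", var_name="Category", value_name="Count")
print(long)
```

Plotly Express can plot either form; with long data you would typically pass the category column as `color` rather than listing several `y` columns.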

## Notes on data sources

This data came from Wikipedia, based on the 2016 Canadian census.

We expect similar data to appear in September 2022, based on the 2021 Canadian census. A link to the data sources at Statistics Canada is given here: https://www12.statcan.gc.ca/census-recensement/2021/dp-pd/dt-td/index-eng.cfm

## Further work

- What else can you do with this data? Perhaps you want to plot information about percentages. How would you do that?

- What other sources of data can you get? Can you access Wikipedia data about Indigenous populations in other countries that interest you? How about the United States? Or Australia?

### A little help

Let's see if we can plot the percentages, by Province. Try it yourself first, and then check your work here.


In [None]:
## First, remove the percentage sign by removing the last character in each string.
## Then convert the percentage numbers (which are strings) to numerical values.

df['%'] = df['%'].map(lambda x: x[:-1])
df['%'] = pd.to_numeric(df['%'])

In [None]:
px.bar(df, x='Province/Territory', y='%', title="Indigenous population, by percentage")
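
Slicing off the last character assumes every entry ends with a percent sign. As a hedged variation, `str.rstrip("%")` removes the sign only where it is present, and `errors="coerce"` turns unparseable entries into NaN. The values below are made up for illustration.

```python
import pandas as pd

# Made-up percentage strings, including one entry that is not a number
pct = pd.Series(["4.9%", "16.7%", "n/a"])

# Remove a trailing percent sign if present, then convert; unparseable entries become NaN
values = pd.to_numeric(pct.str.rstrip("%"), errors="coerce")
print(values.tolist())
```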


## Conclusions

We have shown how to scrape some data from Wikipedia, place it into a Pandas dataframe, and produce informative charts using Plotly. Important steps include cleaning up the data to get it into a form that can be understood by the program and putting it into good shape for plotting.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)