![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

## Indigenous population before first contact with Europeans

In this short notebook, we show how to scrape some data from a Wikipedia webpage and then visualize the data in a nice graphic.

We are looking at population data for Indigenous peoples before contact with Europeans. 

Our data source is this Wikipedia page: https://en.wikipedia.org/wiki/Population_history_of_Indigenous_peoples_of_the_Americas

Note that this is a delicate topic, as the population estimates vary widely, often reflecting biases of various authors. 

### Step 1. Libraries


First, we import some useful Python libraries: Pandas for data analysis, requests for reading the webpage, and warnings to suppress any warnings raised when we read the webpage.

We also import Plotly Express, a plotting library.

In [None]:
import pandas as pd
import requests, warnings
import plotly.express as px

### Step 2. Reading the data

We then point a variable **url** to the Wikipedia page we want to use, store the response in **res**, and read the first table on the page into a dataframe **df**. We've copied the code below to show how this is done, but because the page may change in the future, we've saved the data to a CSV file and load the data from that file instead.

<code class="python">
url = 'https://en.wikipedia.org/wiki/Population_history_of_Indigenous_peoples_of_the_Americas'

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    res = requests.get(url, verify=False)
    df = pd.read_html(res.text)[0]
</code>

In [None]:
df = pd.read_csv('data/indigenous-populations-precontact.csv')
df
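
For reference, a scraped table can be written to a CSV file with `DataFrame.to_csv`. The sketch below is an assumption about how that step might look, not the notebook's actual export code. The small `scraped` dataframe contains made-up placeholder rows, and the temporary file path stands in for `data/indigenous-populations-precontact.csv`.

```python
import os
import tempfile

import pandas as pd

# Placeholder standing in for the table returned by pd.read_html (made-up rows)
scraped = pd.DataFrame({
    "Author": ["Author A[1]", "Author B[2]"],
    "Date": [1900, 1950],
    "USA and Canada": ["1 million", "2 million-3 million"],
})

# Write to a temporary folder so the example does not touch the real data file
csv_path = os.path.join(tempfile.mkdtemp(), "example-populations.csv")
scraped.to_csv(csv_path, index=False)  # index=False avoids an extra unnamed column

# Reading the file back gives the same table
reloaded = pd.read_csv(csv_path)
print(reloaded)
```

Saving a snapshot this way means the rest of the notebook keeps working even if the Wikipedia table is later edited.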

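### Step 3. Converting text to numbers

The columns contain population estimates written as text, often as ranges of numbers, which cannot be used directly in calculations or plots. Instead, we replace each text value with a single number, using the approximate midpoint when the value is a range.

As we'll be focusing on the USA and Canada, we'll primarily replace values that appear in that column. We then convert the column to numbers; any value that still contains text becomes a missing value (NaN).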

In [None]:
df.replace(['2 million-3 million'],'2500000',inplace=True)
df.replace(['0.9 million'],'900000',inplace=True)
df.replace(['1 million'],'1000000',inplace=True)
df.replace(['9.8-12.25 million'],'11000000',inplace=True)
df.replace(['1.213-2.639 million'],'2000000',inplace=True)
df.replace(['3.79 million'],'3790000',inplace=True)
df.replace(['3.44 million'],'3440000',inplace=True)
df.replace(['3.5 million'],'3500000',inplace=True)
df.replace(['7 million'],'7000000',inplace=True)
df['USA and Canada'] = pd.to_numeric(df['USA and Canada'], errors='coerce')
df
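
Typing each replacement by hand works for a small table, but the same idea can be written as a helper function. The sketch below uses a hypothetical function, `range_to_midpoint`, which is not part of the original notebook. It assumes each estimate is written as one or two plain numbers, optionally followed by the word "million". It returns the exact midpoint, so a range such as '1.213-2.639 million' becomes 1926000 rather than the rounded 2000000 used above.

```python
import re


def range_to_midpoint(text):
    """Convert an estimate such as '9.8-12.25 million' to a single number.

    Hypothetical helper: returns the mean of all numbers found in the text,
    or None when the input is not a string or contains no number.
    """
    if not isinstance(text, str):
        return None
    # Find every number, with or without a decimal part
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if not numbers:
        return None
    # Scale by one million only when the text says "million"
    scale = 1_000_000 if "million" in text.lower() else 1
    return round(sum(numbers) / len(numbers) * scale)


for estimate in ["2 million-3 million", "0.9 million", "9.8-12.25 million"]:
    print(estimate, "->", range_to_midpoint(estimate))
```

With the dataframe from this notebook, the function could be applied with `df['USA and Canada'].map(range_to_midpoint)`. Numbers written with thousands separators, such as "1,000,000", are not handled by this sketch.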

### Step 4. Clean up

Let's clean up the dataframe. 

Wikipedia is an excellent source of information, but one of the drawbacks of being so well sourced is that the editors of Wikipedia often add tags to pieces of information to show where the data came from. In our dataframe, this shows up in the Author column as \[#\], with various numbers depending on the source.

We can remove those with the code below, which simply removes the last four characters from each string. This assumes every tag is exactly four characters long, such as \[12\].

In [None]:
df['Author'] = df['Author'].str[:-4]
df
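
Slicing off a fixed number of characters only works when every tag has the same length. As an alternative, the sketch below uses a regular expression to remove numeric tags of any length. The function name `strip_citations` and the sample author names are hypothetical, and the pattern assumes the tags contain digits only.

```python
import re


def strip_citations(name):
    """Remove Wikipedia-style numeric citation tags such as [7] or [123]."""
    # \[\d+\] matches an opening bracket, one or more digits, and a closing bracket
    return re.sub(r"\[\d+\]", "", name).strip()


# Made-up names for illustration; the last one has no tag and is left unchanged
examples = ["Author A[7]", "Author B[42]", "Author C[123]", "Author D"]
for raw in examples:
    print(repr(raw), "->", repr(strip_citations(raw)))
```

In Pandas, the same pattern can be applied to a whole column with `df['Author'].str.replace(r'\[\d+\]', '', regex=True)`.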

### Step 5. Plotting the data

Let's use Plotly to create a simple bar chart. We will plot the number of Indigenous people in the USA and Canada prior to European contact, as estimated by each author.

In [None]:
px.bar(df, x=df.agg(lambda x: f'{x["Author"]} ({x["Date"]})', axis=1), # Combines author and year
       y='USA and Canada', labels = {'x': 'Author (Year)'}, title="Different estimates for population, by author",)
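
The label for each bar can also be built as a regular column before plotting, which makes it easier to inspect. The sketch below is one possible way to do this, using made-up rows rather than the real estimates; the names `sample`, `plot_df`, and `Label` are hypothetical. It also drops rows with no numeric estimate, since those cannot be drawn as bars.

```python
import pandas as pd

# Made-up rows for illustration only; they are not real estimates
sample = pd.DataFrame({
    "Author": ["Author A", "Author B", "Author C"],
    "Date": [1900, 1950, 2000],
    "USA and Canada": [1000000.0, None, 2500000.0],
})

# Keep only rows that have a numeric estimate to plot
plot_df = sample.dropna(subset=["USA and Canada"]).copy()

# Build the "Author (Year)" label with vectorized string concatenation
plot_df["Label"] = plot_df["Author"] + " (" + plot_df["Date"].astype(str) + ")"
print(plot_df[["Label", "USA and Canada"]])
```

The resulting dataframe could then be passed to Plotly with `px.bar(plot_df, x='Label', y='USA and Canada')`.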


## Further work

- What else can you do with this data? Perhaps clean up the data in the other columns and plot those as well.

- What other sources of data can you use? Is there data for particular regions, such as B.C. or the Pacific West Coast, that might be more accurate? Look into the work of anthropologist Robert Boyd.

## Conclusions

We have shown how to scrape some data from Wikipedia, place it into a Pandas dataframe, and produce informative charts using Plotly. Important steps include cleaning up the data so that it can be understood by the program and is in good shape for plotting.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)