![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Exploring Indigenous Populations in the Americas (Canada and United States)

## Prior to European Contact 

In this short notebook, we show how to scrape some data from a Wikipedia webpage and then visualize the data in a bar chart.

We are looking at population data for Indigenous peoples before contact with Europeans. 

Our data source is the Wikipedia page [Population history of the Indigenous peoples of the Americas](https://en.wikipedia.org/wiki/Population_history_of_Indigenous_peoples_of_the_Americas). Note that population estimates vary widely, often reflecting biases of various authors. 

### Step 1: Libraries

First, we import some useful Python libraries: [pandas](https://pandas.pydata.org/) for data analysis and [Plotly Express](https://plotly.com/python/plotly-express/) for creating visualizations.

In [None]:
import pandas as pd
import plotly.express as px
print('Libraries imported')

### Step 2: Getting Data

We then read the tables on the Wikipedia page with `pd.read_html`, which returns a list of dataframes, and store the first one in a dataframe called `df`. We've also saved the data to a CSV file that we can import instead if the Wikipedia page has changed.

In [None]:
df = pd.read_html('https://en.wikipedia.org/wiki/Population_history_of_Indigenous_peoples_of_the_Americas')[0]
if df.columns[0] != 'Author':
    print('reading data from CSV file')
    df = pd.read_csv('data/indigenous-populations-precontact.csv')
df
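
The cell above falls back to the CSV file only when the first column name is unexpected. If the page cannot be read at all (for example, with no network connection), `pd.read_html` raises an error before that check runs. Below is a minimal sketch of a more defensive pattern. The helper `load_with_fallback` and its three arguments are hypothetical names introduced here for illustration; they are not part of pandas.

```python
def load_with_fallback(load_primary, load_backup, looks_right):
    """Return load_primary() if it succeeds and passes the check, else load_backup().

    All three arguments are functions (hypothetical names for this sketch):
    load_primary and load_backup take no arguments and return the data,
    looks_right takes the primary result and returns True or False.
    """
    try:
        result = load_primary()
        if looks_right(result):
            return result
        print('primary source looked different than expected')
    except Exception as error:  # broad on purpose: any failure triggers the fallback
        print('primary source failed:', error)
    return load_backup()


# Possible use in this notebook (left commented out because it needs network access):
# url = 'https://en.wikipedia.org/wiki/Population_history_of_Indigenous_peoples_of_the_Americas'
# df = load_with_fallback(
#     lambda: pd.read_html(url)[0],
#     lambda: pd.read_csv('data/indigenous-populations-precontact.csv'),
#     lambda table: table.columns[0] == 'Author',
# )
```

Passing the loaders in as functions keeps the fallback logic separate from the specific URL and file path.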

### Step 3: Cleaning Data

The columns contain text descriptions of numbers and ranges, such as `2 million-3 million`, which can't be plotted directly. Instead, we'll replace each text value with a single number: the value itself, or roughly the midpoint of a range.

As we'll be focusing on the USA and Canada, we'll primarily replace values in that column.

In [None]:
df.replace(['2 million-3 million'],'2500000',inplace=True)
df.replace(['0.9 million'],'900000',inplace=True)
df.replace(['1 million'],'1000000',inplace=True)
df.replace(['9.8-12.25 million'],'11000000',inplace=True)
df.replace(['1.213-2.639 million'],'2000000',inplace=True)
df.replace(['3.79 million'],'3790000',inplace=True)
df.replace(['3.44 million'],'3440000',inplace=True)
df.replace(['3.5 million'],'3500000',inplace=True)
df.replace(['7 million'],'7000000',inplace=True)
df['USA and Canada'] = pd.to_numeric(df['USA and Canada'], errors='coerce')
df
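
Typing each replacement by hand works for a small table, but the same idea can be sketched as a function. The example below is one possible approach, not the method used in this notebook: the helper name `range_to_midpoint` is hypothetical, and it assumes every value is written in millions, as in the strings above. It computes exact midpoints, so a range like `9.8-12.25 million` gives 11025000 rather than the rounded 11000000 used in the cell above.

```python
import re


def range_to_midpoint(text):
    """Convert text such as '9.8-12.25 million' to a whole number.

    Assumes the value is expressed in millions (a simplification for this
    sketch). Returns None when the text contains no number or no 'million'.
    """
    if 'million' not in text:
        return None
    # Find every number in the text, with or without a decimal part.
    numbers = [float(n) for n in re.findall(r'\d+(?:\.\d+)?', text)]
    if not numbers:
        return None
    # One number gives the number itself; two give the midpoint of the range.
    return round(sum(numbers) / len(numbers) * 1_000_000)


print(range_to_midpoint('2 million-3 million'))
print(range_to_midpoint('0.9 million'))
print(range_to_midpoint('9.8-12.25 million'))
```

A function like this could be applied to a column of text with `.map(range_to_midpoint)` before any numeric conversion.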

Wikipedia is an excellent source of information, but one of the drawbacks of being so well sourced is that the editors of Wikipedia often add tags to pieces of information to show where the data came from. In our dataframe, this shows up in the Author column as \[#\], with various numbers depending on the source.

We can remove those with the code below, which simply removes the last four characters from each string.

In [None]:
df['Author'] = df['Author'].str[:-4]
df
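
Slicing off a fixed number of characters assumes every reference tag has the same length, and it also trims names that have no tag at all. As an alternative sketch, a regular expression can remove only the bracketed numbers. The helper name `strip_citations` and the sample names are hypothetical, chosen just for illustration.

```python
import re


def strip_citations(name):
    """Remove bracketed reference markers such as [5] or [12] from a string."""
    return re.sub(r'\[\d+\]', '', name).strip()


# Hypothetical sample values, not taken from the Wikipedia table.
print(strip_citations('Example Author[7]'))
print(strip_citations('Another Author[12]'))
print(strip_citations('No Tag Here'))

# In pandas, the same pattern can be applied to a whole column:
# df['Author'] = df['Author'].str.replace(r'\[\d+\]', '', regex=True).str.strip()
```

Because the pattern only matches digits inside square brackets, names without a tag are left unchanged.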

It will make things easier if we combine the `Author` and `Date` columns. We'll do this in the next code cell.

In [None]:
df['Author (Year)'] = df['Author'] + ' (' + df['Date'].astype(str) + ')'
df

### Step 4: Plotting Data

Let's use Plotly Express to create a bar chart. We will plot the estimated number of Indigenous people in the USA and Canada prior to European contact.

In [None]:

px.bar(df, x='Author (Year)', y='USA and Canada', title='Precontact Indigenous Population Estimates by Author')
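
Because `pd.to_numeric(..., errors='coerce')` turns any text it cannot convert into a missing value (`NaN`), some rows may have no estimate to plot. One optional step, sketched below, is to drop those rows and sort the rest so the bars appear from smallest to largest. The rows here are hypothetical stand-ins, not real estimates; in the notebook the same two method calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in rows; the real dataframe comes from the cells above.
sample = pd.DataFrame({
    'Author (Year)': ['Author A (1900)', 'Author B (1950)', 'Author C (2000)'],
    'USA and Canada': [3000000.0, None, 1000000.0],
})

# Drop rows with no numeric estimate, then order from smallest to largest.
plot_ready = (
    sample.dropna(subset=['USA and Canada'])
          .sort_values('USA and Canada')
          .reset_index(drop=True)
)
print(plot_ready)
```

The resulting dataframe can be passed to `px.bar` with the same `x` and `y` arguments as above.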


## Further Work

- What else can you do with this data? Perhaps clean up the data in other columns to make other plots.

- What other sources of data can you use? Is there data for particular regions, such as B.C. or the Pacific West Coast, that might have more accurate data? Look into the work of anthropologist [Robert Boyd](https://en.wikipedia.org/wiki/Robert_Boyd_(anthropologist)).

## Conclusions

This notebook showed how to scrape data from Wikipedia into a pandas dataframe and produce informative charts. The data needed to be cleaned in order to be understood and plotted using Plotly Express.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)