# Canadian Population

This project uses Canadian population data from a [Wikipedia page](https://en.wikipedia.org/wiki/Population_of_Canada_by_province_and_territory). It shows how to read in the data, clean it up for plotting, and then plot it.

We look at two data cases:

* population by province or territory
* population density by province or territory

We will start by importing the data, which was copied from Wikipedia into a CSV file.

In [None]:
import pandas as pd
df = pd.read_csv('population-canada.csv')
df

## Data Cleaning

### Fix the Header Rows

If you look at the output above you'll see that there are two header rows, which makes plotting more difficult. We can fix this by joining the column names with the first row, then dropping the first row. The resulting names aren't ideal, but they can be cleaned up later by renaming the columns.

In [None]:
df.columns = df.columns + '_' + df.iloc[0]
df = df.drop(0)
df
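If the `Index + Series` addition above looks opaque, the same join can be spelled out pair by pair. This is only a sketch on a made-up two-column frame; `toy` and its values are hypothetical and are not the Wikipedia data.

```python
import pandas as pd

# Hypothetical frame whose first data row is really a second header row
toy = pd.DataFrame(
    [["2021", "km2"], ["100", "5"], ["200", "8"]],
    columns=["Population", "Land area"],
)

# Pair each column name with the value beneath it in row 0
toy.columns = [f"{top}_{sub}" for top, sub in zip(toy.columns, toy.iloc[0])]

# Drop the row that has now been folded into the column names
toy = toy.drop(0)
print(toy.columns.tolist())
```

The list comprehension does a little more typing than the one-line addition, but it makes the pairing of each name with its sub-header explicit.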

Let's now rename the column headers to clean it up.

You'll notice we use underscores ("`_`") between the words. It's not necessary, but it makes it easier for some of the programming we'll be doing later.

In [None]:
df.columns = ['Population_Rank', 
              'Name', 
              'Population_2021', 
              'Population_Proportion', 
              'Growth_2016_21', 
              'Land_area_km2', 
              'Population_density_per_km2',
              'Commons_house_seats', 
              'Commons_seats_Proportion', 
              'Senate_seats',
              'Senate_seats_Proportion']
df
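Assigning a full list relies on the new names being in exactly the same order as the existing columns. As a possible alternative, `DataFrame.rename` takes a mapping from old names to new ones, so each rename is stated explicitly. The frame and the old column names below are invented for illustration.

```python
import pandas as pd

# Hypothetical frame with awkward joined column names
toy = pd.DataFrame({"Name_Name": ["A", "B"], "Population_2021 census": [100, 200]})

# Map each old name to its replacement; unlisted columns are left alone
toy = toy.rename(
    columns={"Name_Name": "Name", "Population_2021 census": "Population_2021"}
)
print(toy.columns.tolist())
```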

We can look at the column names like this:

In [None]:
df.columns

### Deleting Rows We Don't Need

If we look at the dataframe we see that one of the rows is for "Canada" as a whole. Since we only want to compare provinces and territories, we need to delete that row.

We can find that row by looking for the item in the `'Name'` column that equals `'Canada'`.

In [None]:
df[df['Name'] == 'Canada']

We delete rows using the index number (or value), so let's modify the above code accordingly. You'll see that we add `.index` to the search term.

In [None]:
df[df['Name'] == 'Canada'].index

That will return the index number we need to delete. Putting it all together we get:

In [None]:
df = df.drop(df[df['Name'] == 'Canada'].index)
df
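Another common way to get the same result is to keep every row that does not match, rather than dropping the one that does. A minimal sketch on a hypothetical frame, where the `"Total"` row stands in for the row being removed:

```python
import pandas as pd

# Hypothetical frame with a summary row mixed in with the data rows
toy = pd.DataFrame({"Name": ["A", "B", "Total"], "Population_2021": [100, 200, 300]})

# The boolean mask is True for rows to keep; the original frame is not modified
kept = toy[toy["Name"] != "Total"]
print(kept["Name"].tolist())
```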

## Data Sorting

We need to sort the data by the way we want to plot it. In our first case, we want to sort it by population, which is the `'Population_2021'` column.

In [None]:
df.sort_values(by='Population_2021')

That doesn't look right, though. It seems to be sorting alphabetically rather than numerically. When we imported the data, the first row contained text (the second header row) rather than numbers, so the values were read in as strings and need to be converted to a numeric data type.

In [None]:
df['Population_2021'] = pd.to_numeric(df['Population_2021'])
df.sort_values(by='Population_2021')
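The difference between the two sort orders is easy to see on a small made-up series. Strings are compared character by character, so `"10"` sorts before `"9"`; after conversion the values sort by size. The values here are illustrative only.

```python
import pandas as pd

# Numbers stored as text sort character by character
as_text = pd.Series(["9", "10", "100"])
print(as_text.sort_values().tolist())

# After conversion the same values sort by magnitude
as_numbers = pd.to_numeric(as_text)
print(as_numbers.sort_values().tolist())
```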

We should probably try converting all of the columns in the dataframe to numbers, if possible. The code cell below will `try` to convert to numeric, but if there's an error it will just pass on to the next column.

In [None]:
for column in df.columns:
    try:
        df[column] = pd.to_numeric(df[column])
    except (ValueError, TypeError):
        # this column holds non-numeric text (such as names), so leave it as is
        pass
df
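To check which columns a loop like this converted, one option is to inspect the column data types afterwards. The sketch below runs the same idea on a hypothetical two-column frame and uses `is_numeric_dtype` from `pandas.api.types`; on the real dataframe, `df.dtypes` gives a similar overview.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical frame: one text column, one column of numbers stored as text
toy = pd.DataFrame({"Name": ["A", "B"], "Population_2021": ["100", "200"]})

for column in toy.columns:
    try:
        toy[column] = pd.to_numeric(toy[column])
    except (ValueError, TypeError):
        # conversion failed, so this column keeps its original values
        pass

# Report, per column, whether it ended up numeric
print({column: is_numeric_dtype(toy[column]) for column in toy.columns})
```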

You will notice that the dataframe is still in the original order, since we didn't assign the result back to our `df` variable when we sorted by `'Population_2021'`.

We can sort it by other columns, and try sorting in the opposite order by using `ascending=False`.

In [None]:
df.sort_values(by='Population_density_per_km2', ascending=False)
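As noted above, `sort_values` returns a new dataframe and leaves the original untouched unless the result is assigned. A small sketch with made-up values (the names `toy` and `toy_sorted` are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["A", "B", "C"], "Population_2021": [300, 100, 200]})

# The sorted copy is discarded here, so toy keeps its original order
toy.sort_values(by="Population_2021")
print(toy["Name"].tolist())

# Assigning the result keeps the sorted order for later use
toy_sorted = toy.sort_values(by="Population_2021")
print(toy_sorted["Name"].tolist())
```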

## Data Visualization

### Populations

We will use [Plotly Express](https://plotly.com/python/plotly-express) to create visualizations of our data.

In [None]:
import plotly.express as px
px.bar(df, x='Name', y='Population_2021')

We can also add some more options to make it nicer.

In [None]:
px.bar(df, x='Name', y='Population_2021', title='Populations of Provinces and Territories (2021)', height=800).update_yaxes(title='Population')

### Population Densities

Let's try a bar plot of `'Population_density_per_km2'`.

In [None]:
df1 = df.sort_values(by='Population_density_per_km2', ascending=False)
px.bar(df1, x='Name', y='Population_density_per_km2', title='Population Density of Provinces and Territories (2021)').update_yaxes(title='Population Density (per km^2)')

### Pie Chart

A pie chart can show the proportions of the Canadian population that live in each province or territory.

In [None]:
px.pie(df, values='Population_2021', names='Name', title='Population Proportion of Provinces and Territories (2021)')
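Each slice of a pie chart represents a value divided by the total of all values. To see those shares as numbers, they can be computed directly with pandas. This is a sketch on hypothetical values, and `Share_percent` is a column name made up for this example.

```python
import pandas as pd

# Hypothetical populations chosen so the shares are easy to read
toy = pd.DataFrame({"Name": ["A", "B"], "Population_2021": [1, 3]})

# Each row's share of the total, expressed as a percentage
toy["Share_percent"] = toy["Population_2021"] / toy["Population_2021"].sum() * 100
print(toy)
```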

## Conclusion

In this project we used data from a [Wikipedia page](https://en.wikipedia.org/wiki/Population_of_Canada_by_province_and_territory) to look at Canadian population by province or territory and population density by province or territory.

In the future, it would be interesting to explore other data in this dataset, such as House of Commons seats and Senate seats.