# Module 6 - Data Analysis with Pandas

Pandas is a Python library that facilitates data analysis. Pandas DataFrames are extremely useful for data analysis because they provide a user-friendly way to visualize data and run statistical tests on it. Their interface combines commands in intuitive ways that resemble SQL queries on relational databases.

Reference on Pandas: https://www.w3schools.com/python/pandas/default.asp

- Importing `pandas` as `pd`, similarly to how `numpy` is commonly aliased as `np`.

In [1]:
import pandas as pd
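
To make the SQL comparison concrete, here is a minimal sketch on a made-up DataFrame. The column names `name`, `group` and `value` are hypothetical and are not part of the datasets used later; the SQL in the comments is only an approximate equivalent.

```python
import pandas as pd

# Hypothetical toy data, not taken from the datasets used in this notebook
df = pd.DataFrame(
    {
        "name": ["a", "b", "c"],
        "group": ["x", "y", "x"],
        "value": [10, 20, 30],
    }
)

# Roughly: SELECT name, value FROM df WHERE value > 15
selected = df.loc[df["value"] > 15, ["name", "value"]]

# Roughly: SELECT "group", SUM(value) FROM df GROUP BY "group"
totals = df.groupby("group")["value"].sum()

print(selected)
print(totals)
```

Row filtering plays the role of `WHERE`, picking columns plays the role of the `SELECT` list, and `groupby` followed by an aggregation plays the role of `GROUP BY`.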

## DataFrames: Analysing a dataset

**The dataset**

We will use the *Canada Cities Database* for this notebook. You can find the dataset here: https://simplemaps.com/data/canada-cities

This dataset has data on Canadian cities, such as city name, province, postal code, population, etc.

We also have the *Country List* dataset, which contains a collection of countries and their capitals. More info here: https://github.com/icyrockcom/country-capitals/blob/master/data/country-list.csv


- loading the datasets

In [2]:
# reading the datasets from CSV files using pandas
cities = pd.read_csv("canadacities.csv")
countries = pd.read_csv("country-list.csv")

In [3]:
cities.head()

Unnamed: 0,city,city_ascii,province_id,province_name,lat,lng,population,density,timezone,ranking,postal,id
0,Toronto,Toronto,ON,Ontario,43.7417,-79.3733,5429524.0,4334.4,America/Toronto,1,M5T M5V M5P M5S M5R M5E M5G M5A M5C M5B M5M M5...,1124279679
1,Montréal,Montreal,QC,Quebec,45.5089,-73.5617,3519595.0,3889.0,America/Montreal,1,H1X H1Y H1Z H1P H1R H1S H1T H1V H1W H1H H1J H1...,1124586170
2,Vancouver,Vancouver,BC,British Columbia,49.25,-123.1,2264823.0,5492.6,America/Vancouver,1,V6Z V6S V6R V6P V6N V6M V6L V6K V6J V6H V6G V6...,1124825478
3,Calgary,Calgary,AB,Alberta,51.05,-114.0667,1239220.0,1501.1,America/Edmonton,1,T1Y T2H T2K T2J T2L T2N T2A T2C T2B T2E T2G T2...,1124690423
4,Edmonton,Edmonton,AB,Alberta,53.5344,-113.4903,1062643.0,1360.9,America/Edmonton,1,T5X T5Y T5Z T5P T5R T5S T5T T5V T5W T5H T5J T5...,1124290735
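
If the CSV files are not at hand, the loading step can still be tried out on an in-memory string. The sketch below assumes a three-row excerpt with only three of the columns shown above; `csv_text` and `sample` are illustrative names.

```python
import io

import pandas as pd

# A three-row excerpt standing in for canadacities.csv (only three columns kept)
csv_text = """city,province_id,population
Toronto,ON,5429524
Montréal,QC,3519595
Vancouver,BC,2264823
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # the type pandas inferred for each column
print(sample.head(2))  # the first two rows
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the file was parsed the way you expected.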


- Similarly to `numpy`, we can apply different aggregation functions to DataFrames

In [4]:
# e.g. the total population across all cities in the dataset
cities['population'].sum()

40161685.0

In [5]:
cities[['lat', 'lng']].mean()

lat    47.846277
lng   -83.943544
dtype: float64
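
Several aggregations can also be requested in one call. The sketch below rebuilds the five rows from the `head()` output above as a small DataFrame named `sample` (an illustrative name), so it runs without the CSV file.

```python
import pandas as pd

# The five largest cities, copied from the head() output above
sample = pd.DataFrame(
    {
        "city": ["Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton"],
        "population": [5429524.0, 3519595.0, 2264823.0, 1239220.0, 1062643.0],
        "density": [4334.4, 3889.0, 5492.6, 1501.1, 1360.9],
    }
)

# Several aggregations of one column at once; the result is a Series
# indexed by the names of the aggregation functions
stats = sample["population"].agg(["sum", "mean", "max"])

# describe() reports count, mean, std, min, quartiles and max per column
summary = sample[["population", "density"]].describe()

print(stats)
print(summary)
```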

In [6]:
# how many postal codes are listed in total (a code listed under two cities is counted twice)
cities['postal'].apply(lambda postal_code: len(postal_code.split())).sum()

2972
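
The cell above adds up the number of codes listed for each city, so a code that appears under more than one city is counted more than once. If the goal is the number of distinct codes, one possible approach is to split the strings, flatten them with `explode()`, and count unique values. The postal strings below are hypothetical.

```python
import pandas as pd

# Hypothetical postal strings; the second row repeats a code from the first
postal = pd.Series(["M5T M5V M5P", "M5T H1X", "V6Z"])

# Split each string on whitespace, then put every code on its own row
codes = postal.str.split().explode()

total_listed = len(codes)   # counts repeated codes every time they appear
distinct = codes.nunique()  # counts each code once

print(total_listed, distinct)
```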

### Finding the population per province
- aggregating, grouping, and projecting

In [7]:
#
# Project to columns, group by, aggregate
#

cities[['province_id', 'population']].groupby(['province_id']).sum()

province_id,population
AB,3488355.0
BC,6101869.0
MB,1196667.0
NB,743629.0
NL,389968.0
NS,678507.0
NT,31958.0
NU,29274.0
ON,16672868.0
PE,80445.0


In [8]:
#
# Groupby, project to columns, aggregate
#
cities.groupby(['province_id'])['population'].sum()

province_id
AB     3488355.0
BC     6101869.0
MB     1196667.0
NB      743629.0
NL      389968.0
NS      678507.0
NT       31958.0
NU       29274.0
ON    16672868.0
PE       80445.0
QC     9865133.0
SK      856552.0
YT       26460.0
Name: population, dtype: float64

In [9]:
#
# Groupby, aggregate, project to columns
#
# numeric_only=True skips text columns such as city and timezone
cities.groupby(['province_id']).sum(numeric_only=True)[['population']]

province_id,population
AB,3488355.0
BC,6101869.0
MB,1196667.0
NB,743629.0
NL,389968.0
NS,678507.0
NT,31958.0
NU,29274.0
ON,16672868.0
PE,80445.0
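
The three orderings above produce the same numbers; only the shape of the result differs (a DataFrame or a Series). The sketch below checks this on the five rows shown by `head()` and then shows named aggregation as one more option. `sample`, `total_population` and `n_cities` are illustrative names.

```python
import pandas as pd

# The five largest cities, copied from the head() output above
sample = pd.DataFrame(
    {
        "city": ["Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton"],
        "province_id": ["ON", "QC", "BC", "AB", "AB"],
        "population": [5429524.0, 3519595.0, 2264823.0, 1239220.0, 1062643.0],
    }
)

# Project, group, aggregate -> DataFrame
a = sample[["province_id", "population"]].groupby("province_id").sum()
# Group, project, aggregate -> Series
b = sample.groupby("province_id")["population"].sum()
# Group, aggregate, project -> DataFrame
c = sample.groupby("province_id").sum(numeric_only=True)[["population"]]

# Named aggregation: each keyword becomes an output column,
# given as (input column, aggregation function)
per_province = sample.groupby("province_id").agg(
    total_population=("population", "sum"),
    n_cities=("city", "count"),
)

print(per_province)
```

Projecting before aggregating avoids computing sums for columns that are thrown away afterwards, which can matter on wide tables.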


### Find the number of different timezones for each province

In [11]:
cities.head(2)

Unnamed: 0,city,city_ascii,province_id,province_name,lat,lng,population,density,timezone,ranking,postal,id
0,Toronto,Toronto,ON,Ontario,43.7417,-79.3733,5429524.0,4334.4,America/Toronto,1,M5T M5V M5P M5S M5R M5E M5G M5A M5C M5B M5M M5...,1124279679
1,Montréal,Montreal,QC,Quebec,45.5089,-73.5617,3519595.0,3889.0,America/Montreal,1,H1X H1Y H1Z H1P H1R H1S H1T H1V H1W H1H H1J H1...,1124586170


In [12]:
cities.groupby('province_name')['timezone'].nunique()

province_name
Alberta                      1
British Columbia             5
Manitoba                     1
New Brunswick                3
Newfoundland and Labrador    2
Northwest Territories        2
Nova Scotia                  2
Nunavut                      4
Ontario                      7
Prince Edward Island         1
Quebec                       5
Saskatchewan                 4
Yukon                        1
Name: timezone, dtype: int64

In [13]:
#
# What are the different timezones in each province
#
cities[['province_name','timezone']].drop_duplicates()

Unnamed: 0,province_name,timezone
0,Ontario,America/Toronto
1,Quebec,America/Montreal
2,British Columbia,America/Vancouver
3,Alberta,America/Edmonton
5,Ontario,America/Montreal
7,Manitoba,America/Winnipeg
14,Nova Scotia,America/Halifax
23,Saskatchewan,America/Regina
48,Newfoundland and Labrador,America/St_Johns
49,New Brunswick,America/Moncton
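
`drop_duplicates()` returns one row per (province, timezone) pair. If one row per province is preferred, the timezones can be collected inside a `groupby`. This is a minimal sketch on a few pairs taken from the output above, with one repeated row added on purpose; `sample`, `zones` and `counts` are illustrative names.

```python
import pandas as pd

# A few (province, timezone) pairs from the output above, plus one repeat
sample = pd.DataFrame(
    {
        "province_name": ["Ontario", "Quebec", "Ontario", "Ontario", "Alberta"],
        "timezone": [
            "America/Toronto",
            "America/Montreal",
            "America/Montreal",
            "America/Toronto",
            "America/Edmonton",
        ],
    }
)

# One row per province: the distinct timezones joined into a single string
zones = sample.groupby("province_name")["timezone"].agg(
    lambda tz: ", ".join(sorted(tz.unique()))
)

# The matching counts, as computed with nunique() earlier
counts = sample.groupby("province_name")["timezone"].nunique()

print(zones)
print(counts)
```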


### How many Quebec residents live in the Halifax timezone?

In [14]:
cities.groupby(['province_name', 'timezone'])[['population']].sum()

province_name,timezone,population
Alberta,America/Edmonton,3488355.0
British Columbia,America/Creston,5351.0
British Columbia,America/Dawson_Creek,40140.0
British Columbia,America/Edmonton,46407.0
British Columbia,America/Fort_Nelson,5393.0
British Columbia,America/Vancouver,6004578.0
Manitoba,America/Winnipeg,1196667.0
New Brunswick,America/Moncton,736088.0
New Brunswick,America/Montreal,3126.0
New Brunswick,America/New_York,4415.0


In [15]:
# If we don't want hierarchical indexes, just
# use DataFrame.reset_index()

cities.groupby(['province_name', 'timezone'])[['population']].sum().reset_index()[:5]

Unnamed: 0,province_name,timezone,population
0,Alberta,America/Edmonton,3488355.0
1,British Columbia,America/Creston,5351.0
2,British Columbia,America/Dawson_Creek,40140.0
3,British Columbia,America/Edmonton,46407.0
4,British Columbia,America/Fort_Nelson,5393.0
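
The grouped table holds the population for every (province, timezone) pair, so one value still has to be picked out of it. Two possible ways are sketched below. The populations in `sample` are made up, so the printed number is not the real answer for the dataset.

```python
import pandas as pd

# Hypothetical rows: the populations are made up for illustration
sample = pd.DataFrame(
    {
        "province_name": ["Quebec", "Quebec", "Quebec", "Nova Scotia"],
        "timezone": [
            "America/Montreal",
            "America/Halifax",
            "America/Halifax",
            "America/Halifax",
        ],
        "population": [1000.0, 200.0, 50.0, 700.0],
    }
)

# Option 1: filter the rows with a boolean mask, then aggregate
mask = (sample["province_name"] == "Quebec") & (
    sample["timezone"] == "America/Halifax"
)
by_mask = sample.loc[mask, "population"].sum()

# Option 2: group first, then look up the (province, timezone) pair
# in the hierarchical index
grouped = sample.groupby(["province_name", "timezone"])["population"].sum()
by_index = grouped.loc[("Quebec", "America/Halifax")]

print(by_mask, by_index)
```

The mask is convenient for a single question; the grouped Series is convenient when many pairs will be looked up.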


### Merging data

Let's use our `countries` dataset

In [16]:
countries.head()

Unnamed: 0,country,capital,type
0,Abkhazia,Sukhumi,countryCapital
1,Afghanistan,Kabul,countryCapital
2,Akrotiri and Dhekelia,Episkopi Cantonment,countryCapital
3,Albania,Tirana,countryCapital
4,Algeria,Algiers,countryCapital


In [17]:
# let's set the index to the country name
countries.set_index('country', inplace=True)
countries.head(3)

country,capital,type
Abkhazia,Sukhumi,countryCapital
Afghanistan,Kabul,countryCapital
Akrotiri and Dhekelia,Episkopi Cantonment,countryCapital


In [18]:
# Now we can find the data of a capital city using our `cities` dataset
cities.set_index('city').loc['London']

city_ascii                                                  London
province_id                                                     ON
province_name                                              Ontario
lat                                                        42.9836
lng                                                       -81.2497
population                                                383822.0
density                                                      913.1
timezone                                           America/Toronto
ranking                                                          2
postal           N5Z N5X N5Y N5V N5W N6A N6P N6G N6E N6C N6N N6...
id                                                      1124469960
Name: London, dtype: object
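
Two details of this lookup are easy to miss: `set_index` without `inplace=True` returns a new DataFrame and leaves the original untouched, and `.loc` returns a Series for a label that occurs once but a DataFrame for a label that occurs several times. A minimal sketch, using an illustrative `sample` built from city names that appear in this notebook:

```python
import pandas as pd

# "Victoria" appears in more than one province in the cities dataset
sample = pd.DataFrame(
    {
        "city": ["London", "Victoria", "Victoria"],
        "province_name": [
            "Ontario",
            "British Columbia",
            "Newfoundland and Labrador",
        ],
    }
)

# Returns a new DataFrame; `sample` still has its default integer index
by_city = sample.set_index("city")

unique_hit = by_city.loc["London"]      # one matching row  -> Series
repeated_hit = by_city.loc["Victoria"]  # several matching rows -> DataFrame

print(type(unique_hit).__name__, type(repeated_hit).__name__)
```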

### The first way is to merge by index

In [19]:
cities.set_index('city_ascii', inplace=True)

In [20]:
cities.head(2)

city_ascii,city,province_id,province_name,lat,lng,population,density,timezone,ranking,postal,id
Toronto,Toronto,ON,Ontario,43.7417,-79.3733,5429524.0,4334.4,America/Toronto,1,M5T M5V M5P M5S M5R M5E M5G M5A M5C M5B M5M M5...,1124279679
Montreal,Montréal,QC,Quebec,45.5089,-73.5617,3519595.0,3889.0,America/Montreal,1,H1X H1Y H1Z H1P H1R H1S H1T H1V H1W H1H H1J H1...,1124586170


In [21]:
#
# Have countries indexed by capital
#
countries = countries.reset_index().set_index('capital')

In [22]:
countries.head(2)

capital,country,type
Sukhumi,Abkhazia,countryCapital
Kabul,Afghanistan,countryCapital


In [23]:
pd.merge(cities, countries, left_index=True, right_index=True)[['province_name', 'population', 'country']]

Unnamed: 0,province_name,population,country
Athens,Ontario,3013.0,Greece
Douglas,New Brunswick,6154.0,Isle of Man
Hamilton,Ontario,693645.0,Bermuda
Kingston,New Brunswick,2913.0,Jamaica
Kingston,New Brunswick,2913.0,Norfolk Island
London,Ontario,383822.0,United Kingdom; England
Ottawa,Ontario,989567.0,Canada
St. George's,Newfoundland and Labrador,1203.0,Grenada
St. John's,Newfoundland and Labrador,108860.0,Antigua and Barbuda
Stanley,Manitoba,9038.0,Falkland Islands
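
By default `pd.merge` performs an inner join, which is why only cities whose names match a capital appear in the result. The sketch below contrasts that with a left join written with `DataFrame.join`, which also merges on the index. `cities_s` and `countries_s` are small illustrative stand-ins for the two indexed DataFrames.

```python
import pandas as pd

# Small stand-ins for the indexed DataFrames used above
cities_s = pd.DataFrame(
    {"province_name": ["Ontario", "Ontario", "Ontario"]},
    index=pd.Index(["Ottawa", "London", "Toronto"], name="city_ascii"),
)
countries_s = pd.DataFrame(
    {"country": ["Canada", "United Kingdom; England", "Afghanistan"]},
    index=pd.Index(["Ottawa", "London", "Kabul"], name="capital"),
)

# Inner join (the default): only labels present in both indexes survive
inner = pd.merge(cities_s, countries_s, left_index=True, right_index=True)

# Left join: every city is kept; cities without a match get NaN
left = cities_s.join(countries_s, how="left")

print(inner)
print(left)
```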


### Another way to merge is on columns instead of indexes

In [24]:
# reloading the datasets
cities = pd.read_csv("canadacities.csv")
countries = pd.read_csv("country-list.csv")

In [25]:
cities.head(2)

Unnamed: 0,city,city_ascii,province_id,province_name,lat,lng,population,density,timezone,ranking,postal,id
0,Toronto,Toronto,ON,Ontario,43.7417,-79.3733,5429524.0,4334.4,America/Toronto,1,M5T M5V M5P M5S M5R M5E M5G M5A M5C M5B M5M M5...,1124279679
1,Montréal,Montreal,QC,Quebec,45.5089,-73.5617,3519595.0,3889.0,America/Montreal,1,H1X H1Y H1Z H1P H1R H1S H1T H1V H1W H1H H1J H1...,1124586170


In [26]:
countries.head(2)

Unnamed: 0,country,capital,type
0,Abkhazia,Sukhumi,countryCapital
1,Afghanistan,Kabul,countryCapital


In [27]:
pd.merge(cities, countries, left_on=['city_ascii'], right_on=['capital'])[['city', 'province_name', 'country']]

Unnamed: 0,city,province_name,country
0,Ottawa,Ontario,Canada
1,Hamilton,Ontario,Bermuda
2,London,Ontario,United Kingdom; England
3,Victoria,British Columbia,Seychelles
4,Victoria,Newfoundland and Labrador,Seychelles
5,Victoria,Manitoba,Seychelles
6,St. John's,Newfoundland and Labrador,Antigua and Barbuda
7,Stanley,Manitoba,Falkland Islands
8,Douglas,New Brunswick,Isle of Man
9,Wellington,New Brunswick,New Zealand
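
When merging on columns, the `how` and `indicator` parameters of `pd.merge` help show which rows found a match. The sketch below uses small illustrative stand-ins, `cities_s` and `countries_s`, built from names that appear in the outputs above.

```python
import pandas as pd

# Small stand-ins for the reloaded DataFrames
cities_s = pd.DataFrame(
    {
        "city_ascii": ["Ottawa", "Toronto", "Victoria"],
        "province_name": ["Ontario", "Ontario", "British Columbia"],
    }
)
countries_s = pd.DataFrame(
    {
        "country": ["Canada", "Seychelles", "Afghanistan"],
        "capital": ["Ottawa", "Victoria", "Kabul"],
    }
)

# how="left" keeps every city; indicator=True adds a `_merge` column
# saying whether each row matched ("both") or not ("left_only")
merged = pd.merge(
    cities_s,
    countries_s,
    left_on="city_ascii",
    right_on="capital",
    how="left",
    indicator=True,
)

# Cities whose name is not the name of any capital in countries_s
unmatched = merged.loc[merged["_merge"] == "left_only", "city_ascii"].tolist()

print(merged[["city_ascii", "country", "_merge"]])
print(unmatched)
```

Note that the match is purely on the name, so a city such as Victoria is paired with a country even though it is not that country's capital.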
