<img src="CDSI_Fac.of.Sc_logo.png" alt="CDSI Logo" height="100"/>
<img src="mcgill_ccr_approval_croppedforblock_0.png" alt="CCR Approved Logo" height="100"/>

# Data Science with Python (Pandas)


Do you want to get into the state-of-the-art way of doing data science in academia? 

Register for this workshop where you will learn the first steps of importing, manipulating and cleaning data sets in Python using the pandas package.

At the end of this workshop, you will be able to
- Import data sets of different file formats
- Understand the structure of pandas DataFrames
- Apply common data wrangling functions (subset, reshape, merge, etc.)
- Understand method chaining (AKA piping)
- Apply different methods to treat missing values

Pre-requisites? Introductory understanding of Python.


## Introductory Remarks

This workshop assumes minimal working knowledge of Python. 

Pandas is a great place to start using Python for data science
- Similar feel to other stats software like R or Stata
- Works well on its own but also integrates well with the Python data science ecosystem

`Pandas` is excellent for [Data Wrangling](https://en.wikipedia.org/wiki/Data_wrangling), our main topic. 

This is the process of making raw data ready for statistical analysis and/or modeling.

## Contents

1. Python Review
2. Installing and Importing Pandas
3. Pandas Datastructures
    - Series
    - DataFrames
    - Indexing & Subsetting
    - Creating, Dropping & Renaming Columns
4. Basic Data Wrangling
    - Reading and Writing Data
    - Viewing Data
    - Adjusting Datatypes
    - Subsetting the Data
    - Handling Missing Values
    - Calculating New Variables
        - Transforming & Filtering Data by Groups
        - Method Chaining (AKA Piping)
5. More Data Wrangling
    - Aggregating Data by Groups
    - Concatenation
    - Merging
    - Reshaping
6. Data Exploration
    - Summary Statistics
    - Basic Plotting 
    - Exporting Tables
7. Advanced Topics
    - A peek at `statsmodels`
    - A peek at `numpy`

## Useful Resources

The [Pandas Cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) is an excellent two-page summary of essential `pandas` features.

The [Official Pandas Docs](https://pandas.pydata.org/docs/) are the single best resource for information on `pandas` short of the source code.

If you are looking to learn about a specific function or feature, go into the API section.

The docs also offer tutorials and other helpful material.

## Python Review

In [1]:
lists = []

def func(__1, __2, __3, four, five, *more, **keywords):
    pass

## Installing and Importing `Pandas`

`pandas` is fairly easy to install if you have Python, and it comes pre-packaged
in Anaconda.

If you have the `conda` package manager but no `pandas` installation, you can
simply run the following code in your console, and follow any prompts that pop
up. 

```conda install pandas```

If you don't have Anaconda but you do have a Python installation, you can
run the following code instead.

```pip install pandas```

Base `pandas` has only a few dependencies. You may also have to install optional
dependencies depending on your work and your setup.

For a full list of optional dependencies see the
[Installation Documentation](https://pandas.pydata.org/docs/getting_started/install.html)

You import `pandas` like any other Python package, as below.


In [2]:
# Use 'as' to define a short name for pandas 
# Otherwise you'll need to spell out 'pandas' every time
# `pd` is a common abbreviation for `pandas`

import pandas as pd

## Essential `pandas` Datastructures

In [3]:
# df.index # Returns the row index
# df.columns # Returns the column index, handy for getting variable names.
# dtypes
# loc
# iloc
# etc.
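
The attributes listed above can be tried out on a small, made-up table. This is only a minimal sketch: the names `votes` and `toy` and all of the values are invented for illustration and are not part of the workshop data.

```python
import pandas as pd

# A Series is a one-dimensional labelled array.
votes = pd.Series([120, 85, 40], index=["a", "b", "c"], name="votes")

# A DataFrame is a table whose columns are Series sharing one row index.
toy = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "votes": [120, 85, 40]},
    index=["a", "b", "c"],
)

print(toy.index)    # the row index
print(toy.columns)  # the column index (variable names)
print(toy.dtypes)   # one dtype per column

# .loc selects by label, .iloc selects by integer position.
by_label = toy.loc["b", "votes"]
by_position = toy.iloc[1, 1]
```

Both `by_label` and `by_position` point at the same cell here, because row `"b"` is the second row and `votes` is the second column.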

## Basic Data Wrangling

You can divide data wrangling into three broad activities.
1. Inspection - Get familiar with the data and check for potential problems
2. Preparation - Creating or cleaning variables, dealing with missing values
3. Exploration - Computing descriptive statistics and identifying basic patterns in the data

The ordering above is typical, but not absolute. The different activities often tend to blend together. 

### Reading and Writing Data

`pandas` can import a wide variety of common data formats. 

Some supported file types include the following.
- CSV
- Fixed Width
- Excel
- HTML tables
- SAS
- SPSS
- STATA

For a full list, see the [I/O documentation](https://pandas.pydata.org/docs/reference/io.html).

In [4]:
# To load data, call the appropriate 'read' function, like below 
df = pd.read_csv("federal-candidates-2021-10-20.csv")


Writing to these formats is also generally supported.

In [5]:
# To write a dataframe, call the write method for the chosen format
df.to_csv("federal-candidates-copy.csv")

Read and write functions have many optional parameters.

You can use these parameters to control formatting, metadata, and more. 

These parameters can save you a lot of time, so be sure to check the documentation.

In [6]:
# Here, we are exporting the data to excel. 
# We specify the sheet name as well as the representation for missing values. 
# Be warned, this code may not behave well depending on your Python environment. Read on.
# df.to_excel("federal-candidates.xlsx", sheet_name="Candidate Data", na_rep="")
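
The same idea applies to CSV files. The sketch below writes a made-up table to an in-memory buffer instead of a real file, so it runs without touching the disk. The table `toy`, its values, and the `"missing"` marker are assumptions chosen for illustration; `index`, `na_rep`, `usecols`, and `na_values` are a few of the optional parameters referred to above.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {"name": ["Ann", "Bob"], "birth_year": [1950.0, None], "votes": [120, 85]}
)

# index=False leaves the row index out of the output;
# na_rep controls how missing values are written.
buffer = io.StringIO()
toy.to_csv(buffer, index=False, na_rep="missing")
csv_text = buffer.getvalue()
print(csv_text)

# usecols restricts which columns are read back;
# na_values lists extra strings to treat as missing.
buffer.seek(0)
subset = pd.read_csv(
    buffer, usecols=["name", "birth_year"], na_values=["missing"]
)
print(subset)
```

Replacing the buffer with a file name gives the usual on-disk behaviour.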

In some cases, read/write functions may throw a `ModuleNotFoundError`.

If so, that's because you are missing one of the optional dependencies.

It might be alarming at first, but it's an easy fix, if a bit tedious. 

First, install the missing dependency into the current Python environment.

There are a few ways to do this:

- Using conda: 
    1. In the following code snippet, replace `<dependency>` with the name of the dependency.   
    ```conda install <dependency>```
    2.  Run the resulting code in your console (PowerShell on Windows, Terminal on macOS/Linux).
- Using pip: Do the same as above, but with the following code snippet instead.  
```pip install <dependency>```

The methods above are not fail-safe but should work in nearly all cases.

Then, restart the notebook and run everything again.

If you don't restart the notebook, the new package will not be available.
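
If you would rather check for a dependency before calling a read/write function, one possible approach uses only the standard library. The helper `is_installed` is a hypothetical name introduced here, and `openpyxl` is just an example of an optional dependency (it is one of the engines `pandas` can use for `.xlsx` files); the package you are missing may be a different one.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Report which of these packages the current environment can see.
for name in ["pandas", "openpyxl"]:
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

The check only looks at the environment the notebook is running in, which is why a package installed elsewhere can still show up as missing.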

### Viewing the Data

The first step in analyzing data is to get familiar with it.

This way, you can get a feel for how much cleaning will be necessary and give yourself a chance to catch anything weird.

`pandas` provides several utilities for this.

First, take a look at the metadata.

In [7]:
df.info() # General metadata: Information on indices, variable datatypes, and memory usage.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46526 entries, 0 to 46525
Data columns (total 31 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  46526 non-null  int64  
 1   parliament          46526 non-null  int64  
 2   year                46526 non-null  int64  
 3   type_elxn           46526 non-null  object 
 4   elected             46526 non-null  object 
 5   candidate_name      46526 non-null  object 
 6   edate               46526 non-null  object 
 7   incumbent           46458 non-null  object 
 8   gender              46525 non-null  object 
 9   birth_year          12276 non-null  float64
 10  country_birth       543 non-null    object 
 11  lgbtq2_out          2137 non-null   object 
 12  indigenousorigins   46526 non-null  object 
 13  occupation          42405 non-null  object 
 14  lawyer              42135 non-null  object 
 15  censuscategory      40930 non-null  object 
 16  ridi

A good next step is to inspect the data directly and continue to look for
anything weird.

In [8]:
df.head(10) # Peek at the first n rows. See anything weird?

Unnamed: 0,id,parliament,year,type_elxn,elected,candidate_name,edate,incumbent,gender,birth_year,...,acclaimed,switcher,multiple_candidacy,party_raw,party_minor_group,party_major_group,gov_party_raw,gov_minor_group,gov_major_group,num_candidates
0,26093,1,1867,General,Elected,"POWER,",1867-08-07,Not incumbent,M,1815.0,...,Not acclaimed,Switcher,Single,Anti-Confederate,Third_Party,Third_Party,Conservative,Conservative,Conservative,4
1,13011,1,1867,General,Elected,"JONES,",1867-08-07,Not incumbent,M,1824.0,...,Not acclaimed,Switcher,Single,Labour,Labour,Third_Party,Conservative,Conservative,Conservative,4
2,27974,1,1867,General,Not elected,"SHANNON, S.L.",1867-08-07,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Unknown,Independent,Independent,Conservative,Conservative,Conservative,4
3,18040,1,1867,General,Elected,"KIRKPATRICK, Thomas",1867-08-07,Not incumbent,M,1805.0,...,Not acclaimed,Not switcher,Single,Conservative,Conservative,Conservative,Conservative,Conservative,Conservative,2
4,1798,1,1867,General,Elected,"BLANCHET, Hon. J.G.",1867-08-07,Not incumbent,M,1829.0,...,Acclaimed,Not switcher,Single,Liberal-Conservative,Conservative,Conservative,Conservative,Conservative,Conservative,1
5,28652,1,1867,General,Elected,"SPROAT, A.",1867-08-07,Not incumbent,M,1834.0,...,Not acclaimed,Not switcher,Single,Conservative,Conservative,Conservative,Conservative,Conservative,Conservative,2
6,6004,1,1867,General,Not elected,"DUFRESNE, A.",1867-08-07,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Unknown,Independent,Independent,Conservative,Conservative,Conservative,2
7,9180,1,1867,General,Elected,"HEATH, Edmund",1867-08-07,Not incumbent,M,1813.0,...,Acclaimed,Not switcher,Single,Conservative,Conservative,Conservative,Conservative,Conservative,Conservative,1
8,26033,1,1867,General,Not elected,"POULIN,",1867-08-07,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Unknown,Independent,Independent,Conservative,Conservative,Conservative,2
9,20986,1,1867,General,Not elected,"LEEMING,",1867-08-07,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Unknown,Independent,Independent,Conservative,Conservative,Conservative,2


There are other functions that let us look at the bottom of the data set or
randomly sample rows from it. These are shown below.

In [9]:
df.tail(10) # Peek at the last n rows.

Unnamed: 0,id,parliament,year,type_elxn,elected,candidate_name,edate,incumbent,gender,birth_year,...,acclaimed,switcher,multiple_candidacy,party_raw,party_minor_group,party_major_group,gov_party_raw,gov_minor_group,gov_major_group,num_candidates
46516,26759,44,2021,General,Not elected,"Richardson, Sheila G.",2021-09-20,Not incumbent,F,,...,Not acclaimed,Not switcher,Single,Green Party of Canada,Green,Third_Party,Liberal Party of Canada,Liberal,Liberal,5
46517,1423,44,2021,General,Elected,"Bergeron, Stéphane",2021-09-20,Incumbent,M,1965.0,...,Not acclaimed,Not switcher,Single,Bloc Québécois,Bloc,Bloc,Liberal Party of Canada,Liberal,Liberal,5
46518,36076,44,2021,General,Not elected,"Sanderson, Jody",2021-09-20,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Conservative Party of Canada,Conservative,Conservative,Liberal Party of Canada,Liberal,Liberal,5
46519,36062,44,2021,General,Not elected,"Hickey, Jason",2021-09-20,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Liberal Party of Canada,Liberal,Liberal,Liberal Party of Canada,Liberal,Liberal,5
46520,31671,44,2021,General,Not elected,"Baetz, Elaine",2021-09-20,Not incumbent,F,,...,Not acclaimed,Not switcher,Single,Marxist-Leninist Party of Canada,Marxist_Lennist,Third_Party,Liberal Party of Canada,Liberal,Liberal,8
46521,36038,44,2021,General,Not elected,"Bandou, Rachid",2021-09-20,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Bloc Québécois,Bloc,Bloc,Liberal Party of Canada,Liberal,Liberal,5
46522,33663,44,2021,General,Elected,"Zahid, Salma",2021-09-20,Incumbent,F,1970.0,...,Not acclaimed,Not switcher,Single,Liberal Party of Canada,Liberal,Liberal,Liberal Party of Canada,Liberal,Liberal,5
46523,36037,44,2021,General,Not elected,"Jemmah, Rachid",2021-09-20,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Green Party of Canada,Green,Third_Party,Liberal Party of Canada,Liberal,Liberal,9
46524,32856,44,2021,General,Elected,"McLeod, Michael",2021-09-20,Incumbent,M,1959.0,...,Not acclaimed,Not switcher,Single,Liberal Party of Canada,Liberal,Liberal,Liberal Party of Canada,Liberal,Liberal,5
46525,35991,44,2021,General,Not elected,"Joshi, Medha",2021-09-20,Not incumbent,F,,...,Not acclaimed,Not switcher,Single,Conservative Party of Canada,Conservative,Conservative,Liberal Party of Canada,Liberal,Liberal,3


In [10]:
df.sample(10) # Peek at a random sample of rows.

Unnamed: 0,id,parliament,year,type_elxn,elected,candidate_name,edate,incumbent,gender,birth_year,...,acclaimed,switcher,multiple_candidacy,party_raw,party_minor_group,party_major_group,gov_party_raw,gov_minor_group,gov_major_group,num_candidates
42333,35779,42,2019,By-election,Not elected,"Lund, Mathew",2019-02-25,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Green Party of Canada,Green,Third_Party,Liberal Party of Canada,Liberal,Liberal,9
1071,24255,3,1874,General,Elected,"MOUSSEAU, Joseph-Alfred",1874-01-22,Not incumbent,M,1838.0,...,Not acclaimed,Not switcher,Single,Conservative,Conservative,Conservative,Liberal Party of Canada,Liberal,Liberal,2
44015,34875,43,2019,General,Not elected,"Gower, Christina",2019-10-21,Not incumbent,F,,...,Not acclaimed,Not switcher,Single,New Democratic Party,NDP,CCF_NDP,Liberal Party of Canada,Liberal,Liberal,6
23829,30993,32,1980,General,Not elected,"WIEBE, David",1980-02-18,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Liberal,Liberal,Liberal,Liberal Party of Canada,Liberal,Liberal,6
44994,32126,44,2021,General,Not elected,"Dookeran, Nira",2021-09-20,Not incumbent,F,,...,Not acclaimed,Not switcher,Single,Green Party of Canada,Green,Third_Party,Liberal Party of Canada,Liberal,Liberal,5
15809,2049,25,1962,General,Not elected,"BOUCHARD, Claude",1962-06-18,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Liberal,Liberal,Liberal,Progressive Conservative Party,Conservative,Conservative,3
33088,22240,37,2000,General,Not elected,"MANNING, Bernadette",2000-11-27,Not incumbent,F,,...,Not acclaimed,Not switcher,Single,Green Party of Canada,Green,Third_Party,Liberal Party of Canada,Liberal,Liberal,7
37957,8137,40,2008,General,Not elected,"GORDON, Brian G.",2008-10-14,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Green Party of Canada,Green,Third_Party,Conservative Party of Canada,Conservative,Conservative,6
24032,898,32,1980,General,Not elected,"BARNEY, Ann Margaret",1980-02-18,Not incumbent,F,,...,Not acclaimed,Not switcher,Single,Independent,Independent,Independent,Liberal Party of Canada,Liberal,Liberal,5
28285,7417,35,1993,General,Not elected,"GAGNON, Henri",1993-10-25,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Abolitionist Party of Canada,Social_Credit,Third_Party,Liberal Party of Canada,Liberal,Liberal,7
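
Because `sample()` draws rows at random, each run shows different rows. If you need the same rows every time (for example, to discuss them with a colleague), you can pass `random_state`. The sketch below uses a made-up table called `toy`; the seed value 42 is arbitrary.

```python
import pandas as pd

toy = pd.DataFrame({"id": range(100), "year": [1867 + i for i in range(100)]})

print(toy.shape)  # (number of rows, number of columns)

# random_state fixes the seed, so both draws return the same rows.
first_draw = toy.sample(5, random_state=42)
second_draw = toy.sample(5, random_state=42)

print(first_draw)
```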


### Adjusting Datatypes

Sometimes pandas fails to assign the right dtype when constructing a variable.
So, our first step is to clean up the data types. 

String data are typically assigned the 'object' type.

In [11]:
object_cols = df.select_dtypes(include="object").columns
print(f"Object-type columns: {object_cols}.")

Object-type columns: Index(['type_elxn', 'elected', 'candidate_name', 'edate', 'incumbent',
       'gender', 'country_birth', 'lgbtq2_out', 'indigenousorigins',
       'occupation', 'lawyer', 'censuscategory', 'riding', 'province',
       'acclaimed', 'switcher', 'multiple_candidacy', 'party_raw',
       'party_minor_group', 'party_major_group', 'gov_party_raw',
       'gov_minor_group', 'gov_major_group'],
      dtype='object').


In this dataset, almost all columns with string data represent categorical
variables. 

One exception to this rule is edate, which represents a date. Furthermore,
candidate_name and occupation are both free-form strings.

We can cast the affected variables to the correct dtypes as follows.

In [12]:
df.edate = pd.to_datetime(df.edate, yearfirst=True) 

for col in df.select_dtypes(include="object").columns:
    if col not in ["candidate_name", "occupation"]: 
        df[col] = df[col].astype("category")

df[object_cols].dtypes # Check what happened to the object-type columns

type_elxn                   category
elected                     category
candidate_name                object
edate                 datetime64[ns]
incumbent                   category
gender                      category
country_birth               category
lgbtq2_out                  category
indigenousorigins           category
occupation                    object
lawyer                      category
censuscategory              category
riding                      category
province                    category
acclaimed                   category
switcher                    category
multiple_candidacy          category
party_raw                   category
party_minor_group           category
party_major_group           category
gov_party_raw               category
gov_minor_group             category
gov_major_group             category
dtype: object
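
To see what these casts buy you, here is a minimal sketch on a made-up table. The column names mirror the data set, but the table `toy` and its values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "edate": ["1867-08-07", "2021-09-20", "2021-09-20"],
        "gender": ["M", "F", "M"],
    }
)

toy["edate"] = pd.to_datetime(toy["edate"], yearfirst=True)
toy["gender"] = toy["gender"].astype("category")

# Datetime columns expose date parts through the .dt accessor.
years = toy["edate"].dt.year.tolist()

# Categorical columns store each distinct value once, plus one integer code per row.
categories = toy["gender"].cat.categories.tolist()
codes = toy["gender"].cat.codes.tolist()

print(years)
print(categories)
print(codes)
```

Storing integer codes instead of repeated strings is also why categorical columns usually take less memory when a column has few distinct values.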

Finally, let's take a look at the numerical data.

Do you notice anything weird?

In [13]:
float_cols = df.select_dtypes("float").columns
print(f"Float columns: {float_cols}")

int_cols = df.select_dtypes("int").columns
print(f"Int columns: {int_cols}")

Float columns: Index(['birth_year', 'riding_id', 'votes', 'percent_votes'], dtype='object')
Int columns: Index(['id', 'parliament', 'year', 'num_candidates'], dtype='object')


Of the four float columns, it is only natural to represent percent_votes as a 
float type variable. 

For riding_id, we are better off using a categorical datatype, even though the 
values look like integers. Likewise, the id column should also be viewed as a
categorical. In both cases, the ordering of the values is not meaningful.

In [14]:
df.riding_id = df.riding_id.astype("category")
df.id = df.id.astype("category")

For the other two float columns, we should be using an integer dtype.

But why did three variables get 'mis-coded' as float in the first place? 

It's because the default integer datatype in pandas does not support missing
values, and these variables have missing values.

Luckily, `pandas` has another integer datatype that supports missing values. We
can just cast to that.

In [15]:
# Don't mix up 'int64' with 'Int64'! In pandas, they are different: 
#   - 'int64' refers to the regular int type that has no missing value support
#   - 'Int64' refers to the int type with missing value support

df.birth_year = df.birth_year.astype("Int64")
df.votes = df.votes.astype("Int64")
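
The difference between the two integer dtypes is easy to see on a tiny, made-up Series. This is a sketch with invented values, not workshop data.

```python
import pandas as pd

# A single missing value forces an integer-looking column to float64.
birth_year = pd.Series([1815, None, 1824])
print(birth_year.dtype)

# Casting to the nullable 'Int64' dtype restores integers and keeps the gap as <NA>.
birth_year_int = birth_year.astype("Int64")
print(birth_year_int)

# The regular 'int64' dtype cannot hold the missing value, so the cast fails.
try:
    birth_year.astype("int64")
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

print(f"Cast to 'int64' failed: {cast_failed}")
```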

It's good to occasionally check if we did things right.

In [16]:
# pd.Index.union() allows us to combine the two indices we isolated earlier.
# You could also just do float_cols.union(int_cols), but I wanted to be explicit
df[pd.Index.union(float_cols, int_cols)].dtypes

birth_year           Int64
id                category
num_candidates       int64
parliament           int64
percent_votes      float64
riding_id         category
votes                Int64
year                 int64
dtype: object

Before moving on, let's take a moment to admire our work. Notice that a lot of
the weird stuff is gone.

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46526 entries, 0 to 46525
Data columns (total 31 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   id                  46526 non-null  category      
 1   parliament          46526 non-null  int64         
 2   year                46526 non-null  int64         
 3   type_elxn           46526 non-null  category      
 4   elected             46526 non-null  category      
 5   candidate_name      46526 non-null  object        
 6   edate               46526 non-null  datetime64[ns]
 7   incumbent           46458 non-null  category      
 8   gender              46525 non-null  category      
 9   birth_year          12276 non-null  Int64         
 10  country_birth       543 non-null    category      
 11  lgbtq2_out          2137 non-null   category      
 12  indigenousorigins   46526 non-null  category      
 13  occupation          42405 non-null  object    

In [18]:
df.head(10)

Unnamed: 0,id,parliament,year,type_elxn,elected,candidate_name,edate,incumbent,gender,birth_year,...,acclaimed,switcher,multiple_candidacy,party_raw,party_minor_group,party_major_group,gov_party_raw,gov_minor_group,gov_major_group,num_candidates
0,26093,1,1867,General,Elected,"POWER,",1867-08-07,Not incumbent,M,1815.0,...,Not acclaimed,Switcher,Single,Anti-Confederate,Third_Party,Third_Party,Conservative,Conservative,Conservative,4
1,13011,1,1867,General,Elected,"JONES,",1867-08-07,Not incumbent,M,1824.0,...,Not acclaimed,Switcher,Single,Labour,Labour,Third_Party,Conservative,Conservative,Conservative,4
2,27974,1,1867,General,Not elected,"SHANNON, S.L.",1867-08-07,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Unknown,Independent,Independent,Conservative,Conservative,Conservative,4
3,18040,1,1867,General,Elected,"KIRKPATRICK, Thomas",1867-08-07,Not incumbent,M,1805.0,...,Not acclaimed,Not switcher,Single,Conservative,Conservative,Conservative,Conservative,Conservative,Conservative,2
4,1798,1,1867,General,Elected,"BLANCHET, Hon. J.G.",1867-08-07,Not incumbent,M,1829.0,...,Acclaimed,Not switcher,Single,Liberal-Conservative,Conservative,Conservative,Conservative,Conservative,Conservative,1
5,28652,1,1867,General,Elected,"SPROAT, A.",1867-08-07,Not incumbent,M,1834.0,...,Not acclaimed,Not switcher,Single,Conservative,Conservative,Conservative,Conservative,Conservative,Conservative,2
6,6004,1,1867,General,Not elected,"DUFRESNE, A.",1867-08-07,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Unknown,Independent,Independent,Conservative,Conservative,Conservative,2
7,9180,1,1867,General,Elected,"HEATH, Edmund",1867-08-07,Not incumbent,M,1813.0,...,Acclaimed,Not switcher,Single,Conservative,Conservative,Conservative,Conservative,Conservative,Conservative,1
8,26033,1,1867,General,Not elected,"POULIN,",1867-08-07,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Unknown,Independent,Independent,Conservative,Conservative,Conservative,2
9,20986,1,1867,General,Not elected,"LEEMING,",1867-08-07,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Unknown,Independent,Independent,Conservative,Conservative,Conservative,2


### Subsetting Data

Often, we are not interested in the entirety of the data. 

Perhaps we care about only certain variables, or we care about only certain
rows, and we may want to restrict our data to only those cases.

There are several ways to do this, depending on the circumstance.
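
As a preview, here is a minimal sketch of a few common options on a made-up table. The table `toy` and its values are invented for illustration, and these are not necessarily the only approaches, or the ones covered next.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "candidate_name": ["Ann", "Bob", "Cy", "Di"],
        "year": [1867, 1867, 2021, 2021],
        "elected": ["Elected", "Not elected", "Elected", "Not elected"],
    }
)

# Keep certain columns: pass a list of column names.
names_and_years = toy[["candidate_name", "year"]]

# Keep certain rows: pass a boolean mask.
recent = toy[toy["year"] == 2021]

# Do both at once with .loc[rows, columns].
recent_winners = toy.loc[
    (toy["year"] == 2021) & (toy["elected"] == "Elected"), ["candidate_name"]
]

# .query() expresses the same row filter as a string.
recent_query = toy.query("year == 2021")

print(recent_winners)
```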