# Data Science with Python (Pandas)

Can Şerif Mekik

December 6, 2021

<table align="left">
<tr>
<td><img src=CDSI_Fac.of.Sc_logo.png alt="CDSI Logo" width="300"/></td>
<td><img src=mcgill_ccr_approval_croppedforblock_0.png alt="CCR Approved Logo" width="300"/></td>
</tr>
</table>

## Introductory Remarks

This workshop assumes minimal working knowledge of Python. 

Pandas is a great place to start using Python for data science:
- Similar feel to other stats software like R or Stata
- Works well on its own but also integrates well with the Python data science ecosystem

`Pandas` is excellent for [Data Wrangling](https://en.wikipedia.org/wiki/Data_wrangling), our main topic. 

This is the process of making raw data ready for statistical analysis and/or modeling.

## Useful Resources

The [Pandas Cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) is an excellent two-page summary of essential `pandas` features.

The [Official Pandas Docs](https://pandas.pydata.org/docs/) are the single best resource for information on `pandas` short of the source code.

If you are looking to learn about a specific function or feature, go into the API section.

The docs also offer tutorials and other helpful material.

## Contents

1. Setup
2. Main Data Structures
3. Basic Data Wrangling: Loading, Viewing, Cleaning, and Enriching Data
4. More Data Wrangling: Aggregating, Merging and Concatenating Data
5. Data Exploration
6. Advanced Topics (if time allows)

## Setup

We will walk through the first steps of analyzing a sample dataset.

To follow the workshop on your own machine, you should have Anaconda already installed.

This will automatically include the necessary dependencies.

Our data set is Semra Sevi's Canadian Federal Elections dataset.

You can find a copy of the dataset and this Jupyter notebook at the following address.

```https://link.to.materials```

### Getting Ready to Code

`Jupyter` is a Python tool for rich interactive coding.

In fact, this presentation is itself a `Jupyter` notebook!

To get set up, create a new folder in which you will work and copy the materials into it.

I would call it `CDSI_DSwP`, or something like that.

Then launch your machine's console, activate your conda environment, and run the following.

```jupyter notebook```

This should launch Jupyter notebook in your browser. When it does, you can open the notebook.

### Installing and Importing pandas

`pandas` comes pre-packaged in Anaconda.

Base `pandas` has only a few dependencies, but you may have to install optional dependencies depending on your work and your setup.

For a full list of optional dependencies, see the
[Installation Documentation](https://pandas.pydata.org/docs/getting_started/install.html).


In [2]:
# import the pandas library!

import pandas as pd

## Main Data Structures

There are two main data structures in `pandas`. 
- A `Series` represents a single column of data.
- A `DataFrame` represents a two-dimensional table of data.

These are *array-like* data structures. 

- They have a fixed number of dimensions (one for a `Series`, two for a `DataFrame`).
- Entries within a `Series`, or within a single `DataFrame` column, share one datatype.
- They are associated with indices, which help with data access.
- They support a variety of mathematical operations.
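
As a small illustration of the last two properties, the sketch below uses two made-up Series (`a` and `b` are not part of the workshop data) to show element-wise arithmetic and label-based alignment.

```python
import pandas as pd

# Two small Series with partly different labels (invented for illustration)
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

doubled = a * 2   # arithmetic applies to every entry at once
total = a + b     # entries are matched by index label, not by position

print(doubled)
print(total)      # labels found in only one Series come out as NaN
```

Only `y` and `z` appear in both Series, so those are the only labels with a numeric sum.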

### Pandas Series

> Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. 

In [3]:
s = pd.Series([1, 2, 3, 4, 5], index=["Label1", "Label2", "Label3", "Label4", "Label5"])
s  # display the Series

Label1    1
Label2    2
Label3    3
Label4    4
Label5    5
dtype: int64

In [4]:
s.index

Index(['Label1', 'Label2', 'Label3', 'Label4', 'Label5'], dtype='object')

In [7]:
s.dtype

dtype('int64')

### Series Data Access

There are *many* ways to access data with Series objects.

Which one you use will depend on the situation.

Let's look at some common patterns.

See the docs on [Indexing and Selecting Data](https://pandas.pydata.org/docs/user_guide/indexing.html) for more details.

#### Using `loc` and `iloc`

`loc` and `iloc` are the basic data access methods.

In [26]:
s.loc["Label1"]  # access data by index label

1

In [None]:
s.iloc[0] # access data by position in index
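
Both accessors also take slices, with one difference worth knowing. The sketch below rebuilds the same `s` so that it runs on its own; `by_label` and `by_position` are names chosen for this example.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5],
              index=["Label1", "Label2", "Label3", "Label4", "Label5"])

by_label = s.loc["Label2":"Label4"]  # label slices include the end label
by_position = s.iloc[1:3]            # position slices exclude the end, like Python lists

print(by_label)      # three entries: Label2, Label3, Label4
print(by_position)   # two entries: Label2, Label3
```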

#### Using Regular Subscripts

Subscripts try to behave intelligently depending on the data type you pass in.

In [27]:
s["Label1"] 
# direct indexing; multifunction, tries to be smart

1

In [28]:
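s[0]  # an integer is treated as a position here, since the labels are strings
# Recent pandas versions deprecate this fallback; s.iloc[0] is the explicit form.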

1

#### Subsetting Data

We can use boolean/logical expressions to select data.

See the cheat sheet for a quick list of different operators.

In [48]:
2 < s # Constructing a boolean series

Label1    False
Label2    False
Label3     True
Label4     True
Label5     True
dtype: bool

In [49]:
s[2 < s] # Simple boolean query

Label3    3
Label4    4
Label5    5
dtype: int64

In [47]:
s[(2 < s) & (s < 5)] # More complex boolean; watch operator precedence!

Label3    3
Label4    4
dtype: int64
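
To see why the parentheses matter, here is a minimal sketch on the same `s`. In Python, `&` binds more tightly than `<`, so the unparenthesized version is parsed as a chained comparison and pandas refuses to evaluate it. The variable names are chosen for this example.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5],
              index=["Label1", "Label2", "Label3", "Label4", "Label5"])

raised = False
try:
    s[2 < s & s < 5]          # parsed as 2 < (s & s) < 5
except ValueError as err:
    raised = True
    print("Without parentheses:", err)

either = s[(s < 2) | (s > 4)]  # | means "or"
not_small = s[~(s < 3)]        # ~ means "not"

print(either)
print(not_small)
```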

Look at what happens to the index with a boolean selection.

In [50]:
s[2 < s] 

Label3    3
Label4    4
Label5    5
dtype: int64

We lose some index entries! 

But, sometimes, we want to keep the index intact. Here is how to do that.

In [30]:
s.where(s > 2) # boolean indexing again, but preserving the original index

Label1    NaN
Label2    NaN
Label3    3.0
Label4    4.0
Label5    5.0
dtype: float64
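
`where` can also substitute a value of your choice instead of `NaN`, and `mask` does the opposite selection. A short sketch on the same `s` (the replacement value `0` is arbitrary):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5],
              index=["Label1", "Label2", "Label3", "Label4", "Label5"])

filled = s.where(s > 2, 0)  # keep entries where the condition holds, else use 0
masked = s.mask(s > 2)      # hide entries where the condition holds

print(filled)
print(masked)
```

In both cases the result keeps all five index labels.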

### Pandas DataFrames

> DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.


In [57]:
df = pd.DataFrame({"another_s": [6, 7, 8, 9, 10],
                   "yet another": [None, None, 11, 12, 13]},
    index=["Label1", "Label2", "Label3", "Label4", "Label5"])
df

| | another_s | yet another |
|---|---|---|
| Label1 | 6 | NaN |
| Label2 | 7 | NaN |
| Label3 | 8 | 11.0 |
| Label4 | 9 | 12.0 |
| Label5 | 10 | 13.0 |


In [59]:
df.columns # dfs have column indices!

Index(['another_s', 'yet another'], dtype='object')

### DataFrame Data Access

You can use the same methods as Series to access and subset DataFrame data.

The behaviour is slightly different in some cases, and there is some additional functionality that allows you to work with columns.

For further details, see 

#### Accessing Rows and Columns

In [63]:
df.another_s

Label1     6
Label2     7
Label3     8
Label4     9
Label5    10
Name: another_s, dtype: int64

In [65]:
# 'df.yet another' invalid!
df["yet another"]

Label1     NaN
Label2     NaN
Label3    11.0
Label4    12.0
Label5    13.0
Name: yet another, dtype: float64

To select rows, use `loc` and `iloc`.

In [68]:
df.loc["Label1"]

another_s      6.0
yet another    NaN
Name: Label1, dtype: float64

In [69]:
df.iloc[0]

another_s      6.0
yet another    NaN
Name: Label1, dtype: float64
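
With a DataFrame, `loc` and `iloc` also accept a second argument for the column. The sketch below rebuilds the small `df` from above so that it runs on its own; `cell`, `same_cell` and `block` are names chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({"another_s": [6, 7, 8, 9, 10],
                   "yet another": [None, None, 11, 12, 13]},
                  index=["Label1", "Label2", "Label3", "Label4", "Label5"])

cell = df.loc["Label3", "yet another"]               # row label, column label
same_cell = df.iloc[2, 1]                            # row position, column position
block = df.loc[["Label1", "Label2"], ["another_s"]]  # lists of labels give a DataFrame

print(cell, same_cell)
print(block)
```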

#### Adding, Selecting and Reordering Columns

In [72]:
df["s"] = s # Remember s?
df

| | another_s | yet another | s |
|---|---|---|---|
| Label1 | 6 | NaN | 1 |
| Label2 | 7 | NaN | 2 |
| Label3 | 8 | 11.0 | 3 |
| Label4 | 9 | 12.0 | 4 |
| Label5 | 10 | 13.0 | 5 |


In [73]:
df[["s", "another_s"]]

| | s | another_s |
|---|---|---|
| Label1 | 1 | 6 |
| Label2 | 2 | 7 |
| Label3 | 3 | 8 |
| Label4 | 4 | 9 |
| Label5 | 5 | 10 |
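
One detail worth knowing when adding a Series as a column: pandas matches entries by index label rather than by position. The sketch below uses a made-up Series called `shuffled`, which is not part of the workshop data.

```python
import pandas as pd

df = pd.DataFrame({"another_s": [6, 7, 8]},
                  index=["Label1", "Label2", "Label3"])

# The new Series is out of order and has no entry for Label2
shuffled = pd.Series([30, 10], index=["Label3", "Label1"])

df["shuffled"] = shuffled  # values land on matching labels; Label2 gets NaN
print(df)
```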


### Basic DataFrame Subsetting

You can use boolean expressions as with Series, and the conditions can involve one or several columns!

In [76]:
df[df.s < 2]

| | another_s | yet another | s |
|---|---|---|---|
| Label1 | 6 | NaN | 1 |


In [77]:
df[df["yet another"] > df.s]

| | another_s | yet another | s |
|---|---|---|---|
| Label3 | 8 | 11.0 | 3 |
| Label4 | 9 | 12.0 | 4 |
| Label5 | 10 | 13.0 | 5 |
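
Conditions on several columns combine with `&` and `|`, just as for Series. As an alternative, `DataFrame.query` takes the condition as a string. The sketch below rebuilds the small `df` so that it runs on its own; note that column names containing spaces need backticks inside `query`.

```python
import pandas as pd

df = pd.DataFrame({"another_s": [6, 7, 8, 9, 10],
                   "yet another": [None, None, 11, 12, 13],
                   "s": [1, 2, 3, 4, 5]},
                  index=["Label1", "Label2", "Label3", "Label4", "Label5"])

both = df[(df.s > 2) & (df.another_s < 10)]   # boolean indexing
same = df.query("s > 2 and another_s < 10")   # the same selection as a query string
spaced = df.query("`yet another` > s")        # backticks for names with spaces

print(both)
print(spaced)
```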


## Basic Data Wrangling: Loading, Viewing, Cleaning and Enriching Data

You can divide data wrangling into three broad activities.
1. Inspection - Get familiar with the data and check for potential problems
2. Preparation - Create or clean variables, deal with missing values
3. Exploration - Compute descriptive statistics and identify basic patterns in the data

The ordering above is typical, but not absolute. 

The different activities often tend to blend together. 

### Reading and Writing Data

`pandas` can import a wide variety of common data formats. 

Some supported file types include the following.
- CSV
- Excel
- SPSS
- Stata

For a full list, see the [I/O documentation](https://pandas.pydata.org/docs/reference/io.html).

In [111]:
# For this workshop, we will work with the following data
load_path = "federal-candidates-2021-10-20.csv"

# Load the data 
df = pd.read_csv(load_path)




Writing to these formats is also generally supported.

In [80]:
# Let's save a copy of the data to this file
save_path = "federal-candidates-copy.csv"

# To write a dataframe, call the write method for the chosen format
df.to_csv(save_path)
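
By default, `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows this on a made-up two-row table, using in-memory buffers in place of real files; pass `index=False` if you do not want the index saved.

```python
import io
import pandas as pd

toy = pd.DataFrame({"name": ["A", "B"], "votes": [10, 20]})  # invented data

# Default: the index is written as an extra, unnamed column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)

# index=False leaves the index out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
clean = pd.read_csv(without_index)

print(list(reloaded.columns))
print(list(clean.columns))
```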


### Viewing the Data

The first step in analyzing data is to get familiar with it.

This way, we get a feel for how much cleaning will be necessary.

We also give ourselves a chance to catch anything weird.

To get our bearings, let's take a look at the metadata.

In [21]:
df[df.columns[:10]].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46526 entries, 0 to 46525
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              46526 non-null  int64  
 1   parliament      46526 non-null  int64  
 2   year            46526 non-null  int64  
 3   type_elxn       46526 non-null  object 
 4   elected         46526 non-null  object 
 5   candidate_name  46526 non-null  object 
 6   edate           46526 non-null  object 
 7   incumbent       46458 non-null  object 
 8   gender          46525 non-null  object 
 9   birth_year      12276 non-null  float64
dtypes: float64(1), int64(3), object(6)
memory usage: 3.5+ MB


A good next step is to inspect the data directly and continue to look for
anything weird.

We can look at the top or bottom of the dataset, or randomly sample rows from it.

In [16]:
df.head(5) # Peek at the first n rows. See anything weird?


In [None]:
df.tail(5) # Peek at the last n rows.

In [None]:
df.sample(5) # Peek at a random sample of rows.

### Data Cleaning

Let's get familiar with some `pandas` data cleaning tools.

Our first data cleaning step is to clean up the data types.

Sometimes `pandas` fails to assign the right dtype when constructing a variable. 

#### Adjusting Datatypes

String data are typically assigned the 'object' type.

This is `pandas`'s way of signaling it doesn't really know how to interpret the data.

In this dataset, almost all columns with string data represent categorical
variables. 

However, `edate` is a date, and `candidate_name` and `occupation` are free-form strings.

In [81]:
object_cols = df.select_dtypes(include="object").columns
print(f"Object-type columns: {object_cols}.")

Object-type columns: Index(['type_elxn', 'elected', 'candidate_name', 'edate', 'incumbent',
       'gender', 'country_birth', 'lgbtq2_out', 'indigenousorigins',
       'occupation', 'lawyer', 'censuscategory', 'riding', 'province',
       'acclaimed', 'switcher', 'multiple_candidacy', 'party_raw',
       'party_minor_group', 'party_major_group', 'gov_party_raw',
       'gov_minor_group', 'gov_major_group'],
      dtype='object').


We can cast the affected variables to the correct dtypes as follows.

In [113]:
# Cast edate as a datetime variable
df.edate = pd.to_datetime(df.edate, yearfirst=True) 

# Cast object variables as categorical
for col in df.select_dtypes(include="object").columns:
    if col not in ["candidate_name", "occupation"]: 
        df[col] = df[col].astype("category")

Let's check what happened to the object-type columns.

In [83]:
df[object_cols].dtypes

type_elxn                   category
elected                     category
candidate_name                object
edate                 datetime64[ns]
incumbent                   category
gender                      category
country_birth               category
lgbtq2_out                  category
indigenousorigins           category
occupation                    object
lawyer                      category
censuscategory              category
riding                      category
province                    category
acclaimed                   category
switcher                    category
multiple_candidacy          category
party_raw                   category
party_minor_group           category
party_major_group           category
gov_party_raw               category
gov_minor_group             category
gov_major_group             category
dtype: object
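
To see what the new dtypes buy us, here is a minimal sketch on a made-up three-row table (`toy` is not the workshop data). Datetime columns expose date parts through `.dt`, and categorical columns expose their labels through `.cat`.

```python
import pandas as pd

toy = pd.DataFrame({
    "edate": ["1867-08-07", "2021-09-20", "2021-09-20"],
    "province": ["Quebec", "Ontario", "Quebec"],
})

toy["edate"] = pd.to_datetime(toy["edate"])
toy["province"] = toy["province"].astype("category")

print(toy.dtypes)
print(toy["edate"].dt.year)             # extract the year from each date
print(toy["province"].cat.categories)   # each distinct label is stored once
```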

Finally, let's take a look at the numerical data.

Do you notice anything weird?

In [84]:
float_cols = df.select_dtypes("float").columns
print(f"Float columns: {float_cols}")

int_cols = df.select_dtypes("int").columns
print(f"Int columns: {int_cols}")

Float columns: Index(['birth_year', 'riding_id', 'votes', 'percent_votes'], dtype='object')
Int columns: Index(['id', 'parliament', 'year', 'num_candidates'], dtype='object')


Of the four float columns, it is only natural to represent `percent_votes` as a 
float type variable. 

For `riding_id`, we are better off using a categorical datatype, even though the 
values look like integers. 

Likewise, the `id` column should also be viewed as a
categorical. In both cases, the ordering of the values is not meaningful.

Here is how we convert the datatype of `riding_id` and `id`.

In [114]:
df.riding_id = df.riding_id.astype("category")
df.id = df.id.astype("category")

The `birth_year` and `votes` columns should be using an integer dtype.

But why did three variables get 'mis-coded' as float in the first place? 

Perhaps we can get a clue by comparing them to correctly coded int variables.

In [86]:
df[["birth_year", "votes", "year", "num_candidates"]].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46526 entries, 0 to 46525
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   birth_year      12276 non-null  float64
 1   votes           45835 non-null  float64
 2   year            46526 non-null  int64  
 3   num_candidates  46526 non-null  int64  
dtypes: float64(2), int64(2)
memory usage: 1.4 MB


It's because the default integer datatype does not support missing
values! 

Luckily, `pandas` has another integer datatype that supports missing values. We
can just cast to that.

In [115]:
df.birth_year = df.birth_year.astype("Int64")
df.votes = df.votes.astype("Int64")

Don't mix up 'int64' with 'Int64'! In pandas, they are different: 
- `int64` refers to the regular int type that has no missing value support
- `Int64` refers to the int type with missing value support
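
The difference is easy to see on a tiny example. The sketch below uses a made-up `votes` Series with one missing entry.

```python
import pandas as pd

# A missing value forces pandas to fall back to float64
votes = pd.Series([2367.0, None, 1242.0])
print(votes.dtype)

# The nullable integer type keeps whole numbers and marks the gap as <NA>
nullable = votes.astype("Int64")
print(nullable)
```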

#### Cleaning Categorical Variables

Categorical variables tend to require some extra attention, even in the most
well-curated datasets.

Take a look at the values of the `gender` variable. 

Keep in mind what the data's README file says about this variable:
> gender is a binary factor variable encoding candidate gender.

In [88]:
df.gender.cat.categories # This is how we access category names

Index(['2', 'F', 'M'], dtype='object')

We were told `gender` is supposed to be a binary variable, but we got three values?

So what's going on here? 

To get a better idea, let's tabulate the variable.

In [89]:
df.gender.value_counts()

M    39938
F     6585
2        2
Name: gender, dtype: int64

Let's look up these cases by subsetting the data. 

In [90]:
df[df.gender == "2"]

Unnamed: 0,id,parliament,year,type_elxn,elected,candidate_name,edate,incumbent,gender,birth_year,...,acclaimed,switcher,multiple_candidacy,party_raw,party_minor_group,party_major_group,gov_party_raw,gov_minor_group,gov_major_group,num_candidates
45911,34718,44,2021,General,Not elected,"Walker, Gillian",2021-09-20,Not incumbent,2,,...,Not acclaimed,Not switcher,Single,Green Party of Canada,Green,Third_Party,Liberal Party of Canada,Liberal,Liberal,5
46045,35663,44,2021,General,Not elected,"Woodmass, Rowan",2021-09-20,Not incumbent,2,,...,Not acclaimed,Not switcher,Single,New Democratic Party,NDP,CCF_NDP,Liberal Party of Canada,Liberal,Liberal,6


These candidates are from the most recent election and, if you look them up, you'll find that they both have a non-binary gender identity. 

We can recode the data to give those two cases more explicit labels. 

In [116]:
df.gender = df.gender.replace({"2": "NB"})
df.gender.cat.categories

Index(['NB', 'F', 'M'], dtype='object')

`replace` is very powerful and works with other variable types as well!

It is particularly handy when you want to collapse multiple categories.
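
As a sketch of collapsing categories, the example below maps two labels onto one in a made-up Series of plain strings. The `party` values are invented for illustration; only the label `CCF_NDP` is borrowed from the dataset.

```python
import pandas as pd

party = pd.Series(["Liberal", "NDP", "CCF", "Green", "NDP"])

# Map several old labels onto a single new one
collapsed = party.replace({"CCF": "CCF_NDP", "NDP": "CCF_NDP"})

print(collapsed.value_counts())
```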

Another adjustment we can make is to rename categories. 

Why? Because it is confusing and difficult to work with
poorly named categories.

Take a look at the category names in the `censuscategory` variable. 

In [92]:
df.censuscategory.cat.categories

Index(['Business, finance and administration occupations',
       'Health occupations', 'Management occupations', 'Members of Parliament',
       'Natural and applied sciences and related occupations',
       'Natural resources, agriculture and related production occupations',
       'Occupations in art, culture, recreation and sport',
       'Occupations in education, law and social, community and government services',
       'Occupations in manufacturing and utilities',
       'Sales and service occupations',
       'Trades, transport and equipment operators and related occupations'],
      dtype='object')

`censuscategory` gives candidates' occupation according to the Census Canada taxonomy.

These names are precise, but wordy. Let's abbreviate them. 

In [117]:
df.censuscategory = df.censuscategory.cat.rename_categories([
    "Business", "Health", "Management", "MP", "Science", "Resources", "Culture",
    "Social", "Manufacturing", "Sales", "Trades"
])
df.censuscategory.head(10)

0       Sales
1       Sales
2      Social
3      Social
4      Health
5    Business
6         NaN
7      Social
8      Health
9         NaN
Name: censuscategory, dtype: category
Categories (11, object): ['Business', 'Health', 'Management', 'MP', ..., 'Social', 'Manufacturing', 'Sales', 'Trades']
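
Passing a list to `rename_categories` relies on the new names being in the same order as the existing categories. When only some categories need new names, a dictionary is a less error-prone option, since any category not listed is left unchanged. A minimal sketch on a made-up `gender` Series:

```python
import pandas as pd

gender = pd.Series(["M", "F", "2", "M"], dtype="category")

# Rename by explicit mapping; "F" and "M" are left untouched
renamed = gender.cat.rename_categories({"2": "NB"})

print(renamed)
print(renamed.cat.categories)
```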

### Subsetting Data

Often, we are not interested in the entirety of the data. It then makes sense to subset the data. 

The exact timing of this step can change depending on the case. We do it here to reduce clutter. 

In [131]:
# Keep data from 1990 on. The cutoff is written as a date string because
# pd.to_datetime(1990) would be read as 1990 nanoseconds after 1970-01-01.
# The subset gets its own name so the full data stays available to later cells.
df_recent = df[df.edate >= pd.to_datetime("1990-01-01")]
df_recent = df_recent[["parliament", "edate", "province", "riding", "type_elxn", "id", "candidate_name",
                       "gender", "birth_year", "party_major_group", "gov_major_group", "acclaimed", "votes", "percent_votes"]]
df_recent.head(5)
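
Date cutoffs are easy to get wrong, so here is a minimal sketch on a made-up four-row table (`toy` and its vote counts are invented). It also shows what pandas makes of a bare integer passed to `pd.to_datetime`.

```python
import pandas as pd

toy = pd.DataFrame({
    "edate": pd.to_datetime(["1867-08-07", "1988-11-21", "1993-10-25", "2021-09-20"]),
    "votes": [2367, 100, 200, 300],
})

cutoff = pd.Timestamp("1990-01-01")
recent = toy[toy.edate >= cutoff]   # rows dated 1990-01-01 or later

print(recent)
print(pd.to_datetime(1990))         # an integer is read as nanoseconds since 1970-01-01
```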


### Handling Missing Data

There will almost always be some missing values.

In [120]:
na_counts = df.isna().sum() # .sum() is an aggregation function!
na_counts[na_counts > 0]

incumbent            67
birth_year        11885
country_birth     19338
lgbtq2_out        19338
occupation         2172
lawyer             2172
censuscategory     2289
riding_id         19338
votes               669
percent_votes        51
acclaimed             1
dtype: int64

It's good to first understand why there are missing values. 

We then have to decide whether we want to keep, drop, or impute (i.e., fill in) these values.

In [130]:
df[df.lawyer.isna()]

Unnamed: 0,id,parliament,year,type_elxn,elected,candidate_name,edate,incumbent,gender,birth_year,...,acclaimed,switcher,multiple_candidacy,party_raw,party_minor_group,party_major_group,gov_party_raw,gov_minor_group,gov_major_group,num_candidates
6,6004,1,1867,General,Not elected,"DUFRESNE, A.",1867-08-07,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Unknown,Independent,Independent,Conservative,Conservative,Conservative,2
9,20986,1,1867,General,Not elected,"LEEMING,",1867-08-07,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Unknown,Independent,Independent,Conservative,Conservative,Conservative,2
10,27272,1,1867,General,Not elected,"ROUSSEAU,",1867-08-07,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Unknown,Independent,Independent,Conservative,Conservative,Conservative,2
17,8133,1,1867,General,Not elected,"GORDON,",1867-08-07,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Unknown,Independent,Independent,Conservative,Conservative,Conservative,2
18,5349,1,1867,General,Not elected,"DES BRISAY,",1867-08-07,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Unknown,Independent,Independent,Conservative,Conservative,Conservative,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16857,8203,26,1963,General,Not elected,"GOUGH, Grace",1963-04-08,Not incumbent,F,,...,Not acclaimed,Not switcher,Single,Social Credit Party of Canada,Social_Credit,Third_Party,Liberal Party of Canada,Liberal,Liberal,4
17858,28750,27,1965,General,Elected,"STANBURY, Robert",1965-11-08,Not incumbent,M,1929,...,Not acclaimed,Not switcher,Single,Liberal,Liberal,Liberal,Liberal Party of Canada,Liberal,Liberal,4
18064,3710,27,1965,General,Not elected,"CHEFURKA, Pat",1965-11-08,Not incumbent,F,,...,Not acclaimed,Not switcher,Single,New Democratic Party,NDP,CCF_NDP,Liberal Party of Canada,Liberal,Liberal,3
18642,29469,28,1968,General,Not elected,"THÉBERGE, Gaétan",1968-06-25,Not incumbent,M,,...,Not acclaimed,Not switcher,Single,Liberal,Liberal,Liberal,Liberal Party of Canada,Liberal,Liberal,4
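
The drop and impute options can be sketched as follows on a made-up table. Filling with the column mean is only one possible imputation strategy, chosen here for brevity; whether it is appropriate depends on why the values are missing.

```python
import pandas as pd

toy = pd.DataFrame({"votes": [2367, None, 1242],
                    "percent_votes": [26.1, 100.0, None]})

# Drop: remove rows where votes is missing
dropped = toy.dropna(subset=["votes"])

# Impute: fill missing percent_votes with the column mean
imputed = toy.fillna({"percent_votes": toy["percent_votes"].mean()})

print(dropped)
print(imputed)
```

Both calls return new DataFrames and leave `toy` unchanged.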


### Creating New Columns



In [44]:
df["age"] = df.year - df.birth_year

In [50]:
df.age.describe()

count    12276.000000
mean        49.195422
std         10.273206
min        -54.000000
25%         42.000000
50%         49.000000
75%         56.000000
max         84.000000
Name: age, dtype: float64

In [51]:
df[df.year < df.birth_year]

Unnamed: 0,id,parliament,year,type_elxn,elected,candidate_name,edate,incumbent,gender,birth_year,...,switcher,multiple_candidacy,party_raw,party_minor_group,party_major_group,gov_party_raw,gov_minor_group,gov_major_group,num_candidates,age
1200,24412,3,1874,By-election,Elected,"MURRAY, Thomas",1874-11-04,Not incumbent,M,1898,...,Not switcher,Single,Liberal,Liberal,Liberal,Liberal Party of Canada,Liberal,Liberal,2,-24
1291,24412,3,1876,By-election,Not elected,"MURRAY, William",1876-01-21,Incumbent,M,1898,...,Not switcher,Single,Liberal,Liberal,Liberal,Liberal Party of Canada,Liberal,Liberal,2,-22
2923,52,7,1891,General,Elected,"ADAMS, Michael",1891-03-05,Not incumbent,M,1945,...,Not switcher,Single,Conservative,Conservative,Conservative,Conservative,Conservative,Conservative,2,-54
3337,24392,7,1892,By-election,Not elected,"MURRAY, Thomas",1892-06-26,Incumbent,M,1898,...,Not switcher,Single,Liberal,Liberal,Liberal,Conservative,Conservative,Conservative,2,-6
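
One possible way to handle implausible values like these is to blank them out rather than drop the rows. The sketch below does this on a made-up table; the cutoff of 18 years is an assumption made for this example, not something taken from the dataset's documentation.

```python
import pandas as pd

toy = pd.DataFrame({"year": [1874, 1891, 2021],
                    "birth_year": [1898, 1945, 1975]})

toy["age"] = toy.year - toy.birth_year

# Flag ages below an assumed minimum and replace them with missing values
implausible = toy.age < 18
toy["age_clean"] = toy.age.mask(implausible)

print(toy)
```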

