<img src="CDSI_Fac.of.Sc_logo.png" alt="CDSI Logo" height="100"/>
<img src="mcgill_ccr_approval_croppedforblock_0.png" alt="CCR Approved Logo" height="100"/>

# Data Science with Python (Pandas)


Do you want to get into the state-of-the-art way of doing data science in academia? 

Register for this workshop where you will learn the first steps of importing, manipulating and cleaning data sets in Python using the pandas package.

At the end of this workshop, you will be able to:
- Import data sets of different file formats
- Understand the structure of pandas DataFrames
- Apply common data wrangling functions (subset, reshape, merge, etc.)
- Understand method chaining (AKA piping)
- Apply different methods to treat missing values

Pre-requisites? Introductory understanding of Python.


## Introductory Remarks

This workshop assumes minimal working knowledge of Python. 

Pandas is a great place to start using Python for data science:
- Similar feel to other stats software like R or Stata
- Works well on its own but also integrates well with the Python data science ecosystem

`Pandas` is excellent for [Data Wrangling](https://en.wikipedia.org/wiki/Data_wrangling), our main topic. 

This is the process of making raw data ready for statistical analysis and/or modeling.

## Contents

1. Python Review
2. Installing and Importing Pandas
3. Pandas Datastructures
    - Series
    - DataFrames
    - Indexing & Subsetting
    - Creating, Dropping & Renaming Columns
4. Basic Data Wrangling
    - Reading and Writing Data
    - Viewing Data
    - Adjusting Datatypes
    - Cleaning Up Categorical Variables
    - Calculating New Variables
        - Transforming & Filtering Data by Groups
        - Method Chaining (AKA Piping)
    - Handling Missing Values
    - Subsetting the Data
5. More Data Wrangling
    - Aggregating Data by Groups
    - Concatenation
    - Merging
    - Reshaping
6. Data Exploration
    - Summary Statistics
    - Basic Plotting 
    - Exporting Tables
7. Advanced Topics
    - A peek at `statsmodels`
    - A peek at `numpy`

## Useful Resources

The [Pandas Cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) is an excellent two-page summary of essential `pandas` features.

The [Official Pandas Docs](https://pandas.pydata.org/docs/) are the single best resource for information on `pandas` short of the source code.

If you are looking to learn about a specific function or feature, go into the API section.

The docs also offer tutorials and other helpful material.

## Python Review

In [None]:
lists = []

def func(__1, __2, __3, four, five, *more, **keywords):
    pass

## Installing and Importing `Pandas`

`pandas` is fairly easy to install if you have Python, and it comes pre-packaged
in Anaconda.

If you have the `conda` package manager but no `pandas` installation, you can
simply run the following code in your console, and follow any prompts that pop
up. 

```conda install pandas```

If you don't have Anaconda, but you have a Python installation, you can also
try the following code. 

```pip install pandas```

Base `pandas` has only a few dependencies. You may also have to install optional
dependencies depending on your work and your setup.

For a full list of optional dependencies, see the
[Installation Documentation](https://pandas.pydata.org/docs/getting_started/install.html).

You import `pandas` like any Python package, as below.


In [None]:
# Use 'as' to define a short name for pandas 
# Otherwise you'll need to spell out 'pandas' every time
# `pd` is a common abbreviation for `pandas`

import pandas as pd

## Essential `pandas` Datastructures

In [None]:
# Series.cat
# df.index # Returns the row index
# df.columns # Returns the column index, handy for getting variable names.
# dtypes
# loc
# iloc
# etc.
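
The attributes listed above can be tried out on a small example. This is a minimal sketch on made-up data (the `toy` DataFrame and its column names are assumptions for illustration, not part of the workshop dataset).

```python
import pandas as pd

# Toy data for illustration only; not taken from the workshop dataset.
toy = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cleo"],
        "age": [34, 51, 29],
        "party": ["A", "B", "A"],
    },
    index=["x", "y", "z"],
)

ages = toy["age"]            # A single column is a Series
print(type(ages).__name__)

print(toy.index)             # Row labels
print(toy.columns)           # Column labels (variable names)
print(toy.dtypes)            # One dtype per column

# .loc selects by label, .iloc selects by integer position
by_label = toy.loc["y", "age"]
by_position = toy.iloc[1, 1]

# Categorical columns expose extra tools through the .cat accessor
party = toy["party"].astype("category")
print(party.cat.categories)
```

Here `by_label` and `by_position` point at the same cell, once addressed by its labels and once by its row and column position.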

## Basic Data Wrangling

You can divide data wrangling into three broad activities.
1. Inspection - Get familiar with the data and check for potential problems
2. Preparation - Creating or cleaning variables, dealing with missing values
3. Exploration - Computing descriptive statistics and identifying basic patterns in the data

The ordering above is typical, but not absolute. The different activities often tend to blend together. 

### Reading and Writing Data

`pandas` can import a wide variety of common data formats. 

Some supported file types include the following.
- CSV
- Fixed Width
- Excel
- HTML tables
- SAS
- SPSS
- Stata

For a full list, see the [I/O documentation](https://pandas.pydata.org/docs/reference/io.html).

In [None]:
# To load data, call the appropriate 'read' function, like below 
df = pd.read_csv("federal-candidates-2021-10-20.csv")

Writing to these formats is also generally supported.

In [None]:
# To write a dataframe, call the write method for the chosen format
df.to_csv("federal-candidates-copy.csv")

Read and write functions have many optional parameters.

You can use these parameters to control formatting, metadata, and more. 

These parameters can save you a lot of time, so be sure to check the documentation.
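
As a rough illustration of a few such parameters, the sketch below reads a small in-memory CSV instead of a real file. The data and column names are made up, and the parameters shown (`sep`, `na_values`, `parse_dates`, `index_col`, and `index` on the write side) are only a small selection of what is available.

```python
import io
import pandas as pd

# A small in-memory CSV standing in for a file on disk (made-up data).
raw = io.StringIO(
    "id;name;votes;edate\n"
    "1;Ana;1200;2021-09-20\n"
    "2;Ben;missing;2021-09-20\n"
)

toy = pd.read_csv(
    raw,
    sep=";",                 # Field delimiter (the default is ",")
    na_values=["missing"],   # Extra strings to treat as missing values
    parse_dates=["edate"],   # Parse this column as dates
    index_col="id",          # Use this column as the row index
)
print(toy.dtypes)

# With no path given, to_csv returns the CSV text as a string.
# index=False leaves the row index out of the output.
csv_text = toy.to_csv(index=False)
print(csv_text)
```

Setting these options at read time tends to be easier than repairing delimiters, dates, and missing-value codes afterwards.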

In [None]:
# Here, we are exporting the data to excel. 
# We specify the sheet name as well as the representation for missing values. 
# Be warned, this code may not behave well depending on your Python environment. Read on.
# df.to_excel("federal-candidates.xlsx", sheet_name="Candidate Data", na_rep="")

In some cases, read/write functions may throw a `ModuleNotFoundError`.

If so, that's because you are missing one of the optional dependencies.

It might be alarming at first, but it's an easy fix, if a bit tedious. 

First, install the missing dependency into the current Python environment.

There are a few ways to do this:

- Using conda: 
    1. In the following code snippet, replace `<dependency>` with the name of the dependency.   
    ```conda install <dependency>```
    2.  Run the resulting code in your console (PowerShell for Windows, Terminal for macOS/Linux).
- Using pip: Do the same as above, but with the following code snippet instead.  
```pip install <dependency>```

The methods above are not fail-safe but should work in nearly all cases.

Then, restart the notebook and run everything again.

If you don't restart the notebook, the new package will not be available.
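
If you would rather check for a dependency before calling a read/write function, something like the sketch below may help. The helper name `has_module` is hypothetical (it is not part of `pandas`), and `openpyxl` is used as the example because it is the package `pandas` relies on by default to write `.xlsx` files.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the top-level module `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


# Report which of these packages the current environment can import
for module in ["pandas", "openpyxl"]:
    status = "installed" if has_module(module) else "missing"
    print(f"{module}: {status}")
```

Anything reported as missing can then be installed with `conda` or `pip` as described above.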

### Viewing the Data

The first step in analyzing data is to get familiar with it.

This way, you can get a feel for how much cleaning will be necessary and give yourself a chance to catch anything weird.

`pandas` provides several utilities for this.

First, take a look at the metadata.

In [None]:
df.info() # General metadata: Information on indices, variable datatypes, and memory usage.

A good next step is to inspect the data directly and continue to look for
anything weird.

In [None]:
df.head(10) # Peek at the first n rows. See anything weird?

There are other functions that let us look at the bottom of the dataset or
randomly sample rows from it. These are shown below.

In [None]:
df.tail(10) # Peek at the last n rows.

In [None]:
df.sample(10) # Peek at a random sample of rows.
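
Because `sample` draws rows at random, each run shows different rows. If you want the same peek every time, one option is to fix the `random_state` parameter. A minimal sketch on made-up data (the `toy` DataFrame is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"value": range(100)})  # Made-up data

# Fixing random_state makes the "random" peek repeatable across runs
first = toy.sample(5, random_state=42)
second = toy.sample(5, random_state=42)

print(first.equals(second))
```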

### Adjusting Datatypes

Sometimes pandas fails to assign the right dtype when constructing a variable.
So, our first step is to clean up the data types. 

String data are typically assigned the 'object' type.

In [None]:
object_cols = df.select_dtypes(include="object").columns
print(f"Object-type columns: {object_cols}.")

In this dataset, almost all columns with string data represent categorical
variables. 

One exception to this rule is edate, which represents a date. Furthermore,
candidate_name and occupation are both free-form strings.

We can cast the affected variables to the correct dtypes as follows.

In [None]:
df.edate = pd.to_datetime(df.edate, yearfirst=True) 

for col in df.select_dtypes(include="object").columns:
    if col not in ["candidate_name", "occupation"]: 
        df[col] = df[col].astype("category")

df[object_cols].dtypes # Check what happened to the object-type columns

Finally, let's take a look at the numerical data.

Do you notice anything weird?

In [None]:
float_cols = df.select_dtypes("float").columns
print(f"Float columns: {float_cols}")

int_cols = df.select_dtypes("int").columns
print(f"Int columns: {int_cols}")

Of the four float columns, only percent_votes is naturally represented as a 
float type variable. 

For riding_id, we are better off using a categorical datatype, even though the 
values look like integers. Likewise, the id column should also be viewed as a
categorical. In both cases, the ordering of the values is not meaningful.

In [None]:
df.riding_id = df.riding_id.astype("category")
df.id = df.id.astype("category")

For the other two float columns, we should be using an integer dtype.

But why did three variables get 'mis-coded' as float in the first place? 

It's because the default integer datatype in pandas does not support missing
values, and these variables have missing values.

Luckily, `pandas` has another integer datatype that supports missing values. We
can just cast to that.

In [None]:
# Don't mix up 'int64' with 'Int64'! In pandas, they are different: 
#   - 'int64' refers to the regular int type that has no missing value support
#   - 'Int64' refers to the int type with missing value support

df.birth_year = df.birth_year.astype("Int64")
df.votes = df.votes.astype("Int64")
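
To see the difference between the two integer types in isolation, here is a small sketch on made-up values (the `years` Series is an assumption for illustration, not the workshop data).

```python
import numpy as np
import pandas as pd

# Made-up values; NaN marks a missing birth year
years = pd.Series([1960.0, np.nan, 1975.0])
print(years.dtype)            # Stored as float because of the missing value

# The nullable 'Int64' type keeps whole numbers and marks the gap as <NA>
nullable = years.astype("Int64")
print(nullable)

# Casting to the regular 'int64' fails because NaN has no integer equivalent
try:
    years.astype("int64")
except (ValueError, TypeError) as err:
    print(f"Cast failed: {err}")
```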

It's good to occasionally check if we did things right.

In [None]:
# pd.Index.union() allows us to combine the two indices we isolated earlier.
# You could also just do float_cols.union(int_cols), but I wanted to be explicit
df[pd.Index.union(float_cols, int_cols)].dtypes

Before moving on, let's take a moment to admire our work. Notice that a lot of
the weird stuff is gone.

In [None]:
df.info()

In [None]:
df.head(10)

### Cleaning Up Categorical Variables

Categorical variables tend to require some extra attention, even in the most
well-curated datasets.

We often want to make adjustments.

Let's take a look, for instance, at the values of the `gender` variable, keeping
in mind what the data's README file says about this variable:
> gender is a binary factor variable encoding candidate gender.

In [30]:
df.gender.cat.categories # This is how we access category names

Index(['2', 'F', 'M'], dtype='object')

We were told this is supposed to be a binary variable, but we got three values?

So what's going on here? To get a better idea, let's tabulate the variable 
using the `Series.value_counts()` method.

In [31]:
df.gender.value_counts()

M    39938
F     6585
2        2
Name: gender, dtype: int64

Phew! It looks like only a couple of entries were somehow miscoded. 

We can easily fix something like this by recoding the data using the 
`Series.replace()` method.

In this case, we'll just set those two cases to a missing value, but we could
assign any value we want to them using essentially the same technique.

In [42]:
df.gender = df.gender.replace({"2": None})
df.gender.cat.categories

Index(['F', 'M'], dtype='object')
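
For categorical data specifically, another way to reach the same result is `Series.cat.remove_categories()`, which sets every value in a removed category to missing. The sketch below uses a made-up `gender` Series rather than the workshop data, and is offered as an alternative, not as the approach this notebook relies on.

```python
import pandas as pd

# Made-up data with one miscoded entry
gender = pd.Series(["M", "F", "2", "M"], dtype="category")

# Dropping a category sets every value that used it to missing
cleaned = gender.cat.remove_categories(["2"])

print(cleaned.cat.categories)
print(cleaned.isna().sum())
```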

Another adjustment we can make is to rename categories. 

The main reason to do this is that it is confusing and difficult to work with
poorly named data values. But, it can also come in handy when producing reports.

Take a look at the category names in the `censuscategory` variable, which gives
candidates' occupation according to the Census Canada taxonomy.

In [46]:
df.censuscategory.cat.categories

Index(['Business, finance and administration occupations',
       'Health occupations', 'Management occupations', 'Members of Parliament',
       'Natural and applied sciences and related occupations',
       'Natural resources, agriculture and related production occupations',
       'Occupations in art, culture, recreation and sport',
       'Occupations in education, law and social, community and government services',
       'Occupations in manufacturing and utilities',
       'Sales and service occupations',
       'Trades, transport and equipment operators and related occupations'],
      dtype='object')

These names are precise, but wordy. We can abbreviate them using the 
`Series.cat.rename_categories()` method. 

In [50]:
# We could pass in a dictionary to be extra explicit, but that would involve a
# lot of copying. Since the categories are stored in a fixed order (shown above),
# we can just pass in a list with the abbreviations in that same order.
df.censuscategory = df.censuscategory.cat.rename_categories([
    "Business", "Health", "Management", "MP", "Science", "Resources", "Culture",
    "Social", "Manufacturing", "Sales", "Trades"
])
df.censuscategory.head(10)

0       Sales
1       Sales
2      Social
3      Social
4      Health
5    Business
6         NaN
7      Social
8      Health
9         NaN
Name: censuscategory, dtype: category
Categories (11, object): ['Business', 'Health', 'Management', 'MP', ..., 'Social', 'Manufacturing', 'Sales', 'Trades']
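
For comparison, here is what the dictionary form mentioned in the code comment might look like. This is a sketch on a made-up Series with only two categories; the variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up data using two of the category names shown above
occupation = pd.Series(
    ["Health occupations", "Management occupations", "Health occupations"],
    dtype="category",
)

# Keys are the existing category names, values are their replacements
short = occupation.cat.rename_categories({
    "Health occupations": "Health",
    "Management occupations": "Management",
})

print(short.cat.categories)
```

The dictionary form is more verbose, but it does not depend on listing the new names in the right order.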

### Handling Missing Data



### Subsetting Data

Often, we are not interested in the entirety of the data. 

Perhaps we care about only certain variables, or we care about only certain
rows, and we may want to restrict our data to only those cases.

There are several ways to do this, depending on the circumstance.
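
A few common options are sketched here on made-up data (the `toy` DataFrame and its columns are assumptions for illustration, not the workshop dataset).

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({
    "name": ["Ana", "Ben", "Cleo", "Dev"],
    "party": ["A", "B", "A", "C"],
    "votes": [1200, 300, 4500, 800],
})

# Keep only certain columns by passing a list of names
cols = toy[["name", "votes"]]

# Keep only certain rows with a boolean condition
big = toy[toy["votes"] > 1000]

# Select rows and columns in one step with .loc
both = toy.loc[toy["party"] == "A", ["name", "votes"]]

# Express the row condition as a string with .query
queried = toy.query("votes > 1000 and party == 'A'")

print(both)
```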