# Data Science with Python (Pandas)

Can Şerif Mekik

PhD Candidate <br/>
Department of Cognitive Science <br/>
Rensselaer Polytechnic Institute

December 6, 2021

<table align="left">
<tr>
<td><img src=CDSI_Fac.of.Sc_logo.png alt="CDSI Logo" width="300"/></td>
<td><img src=mcgill_ccr_approval_croppedforblock_0.png alt="CCR Approved Logo" width="300"/></td>
</tr>
</table>

## Introductory Remarks

This workshop assumes minimal working knowledge of Python. 

Pandas is a great place to start using Python for data science:
- Similar feel to other stats software like R or Stata
- Works well on its own but also integrates well with the Python data science ecosystem

`Pandas` is excellent for [Data Wrangling](https://en.wikipedia.org/wiki/Data_wrangling), our main topic. 

This is the process of making raw data ready for statistical analysis and/or modeling.

## Useful Resources

The [Pandas Cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) is an excellent two-page summary of essential `pandas` features.

The [Official Pandas Docs](https://pandas.pydata.org/docs/) are the single best resource for information on `pandas` short of the source code.

If you are looking to learn about a specific function or feature, go into the API section.

The docs also offer tutorials and other helpful material.

## Contents

1. Setup
2. Main Data Structures
3. Basic Data Wrangling: Loading, Viewing, Cleaning, and Enriching Data
4. More Data Wrangling: Aggregating, Reshaping, Merging and Concatenating Data
5. Conclusion

## Setup

We will walk through the first steps of analyzing a sample dataset.

To follow the workshop on your own machine, you should have Anaconda already installed.

https://www.anaconda.com/products/individual

This will automatically include the necessary dependencies.

Our data set is Semra Sevi's Canadian Federal Elections dataset.

You can find copies of the dataset and code at the following addresses.

- Code: https://github.com/cmekik/CDSI_DSwP
- Data: https://doi.org/10.7910/DVN/ABFNSQ 

### Getting Ready to Code

`Jupyter` is a Python tool for rich interactive coding that ships with Anaconda.

This presentation is itself a `Jupyter` notebook, in fact!

To get set, create a new folder in which you will work and copy the materials into it.

Then launch your machine's console, navigate to your folder, activate your conda environment, and run the following.

```jupyter notebook```

This should launch Jupyter notebook in your browser. When it does, you can open the notebook.

### Installing and Importing pandas

`pandas` comes pre-packaged in Anaconda.

You can always install it using the following pip command: ```pip install pandas```

If you have conda, but not pandas, you can also do: ```conda install pandas```

Base `pandas` has only a few dependencies, but you may have to install optional dependencies depending on your work and your setup.

For a full list of optional dependencies, see the
[Installation Documentation](https://pandas.pydata.org/docs/getting_started/install.html).


In [None]:
# import the pandas library!

import pandas as pd

## Main Data Structures

There are two main data structures in `pandas`. 
- `Series` represents a single column of data
- `DataFrame` represents a two-dimensional array of data

These are *array-like* data structures. 

- They have fixed dimensions.
- Their entries are of a homogeneous datatype (for a `DataFrame`, within each column).
- They are associated with indices, which help with data access.
- They support a variety of mathematical operations.
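
As a minimal sketch of the properties listed above, the following uses two small Series with made-up values (`prices` and `quantities` are illustrative names, not part of the workshop data). It shows that a Series carries a single dtype and that arithmetic is applied element-wise, with entries matched by index label rather than by position.

```python
import pandas as pd

# Two small Series with made-up values; note the different label order
prices = pd.Series([2.0, 3.5, 1.0], index=["a", "b", "c"])
quantities = pd.Series([10, 4, 6], index=["c", "a", "b"])

# Element-wise multiplication; entries are aligned on their labels
totals = prices * quantities

print(prices.dtype)  # a single dtype for the whole Series
print(totals)
```

Label alignment is what makes the index more than a row counter: it decides which entries get combined.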

### Pandas Series

> Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. 

In [None]:
s = pd.Series([1, 2, 3, 4, 5], index=["Lbl1", "Lbl2", "Lbl3", "Lbl4", "Lbl5"])
s

In [None]:
s.shape, s.size, s.dtype

In [None]:
s.index

### Series Data Access

There are *many* ways to access data with Series objects.

Which one you use will depend on the situation.

Let's look at some common patterns.

See the docs on [Indexing and Selecting Data](https://pandas.pydata.org/docs/user_guide/indexing.html) for more details.

#### Using `loc` and `iloc`

These are the basic data access indexers.

In [None]:
s.loc["Lbl1"]  # access data by index label

In [None]:
s.iloc[0] # access data by position in index

#### Using Regular Subscripts

Subscripts try to behave intelligently depending on the data type you pass in.

In [None]:
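s["Lbl1"], s[0] 
# direct indexing; multifunction, tries to be smart
# Note: with a non-integer index, s[0] falls back to position. Recent pandas
# versions deprecate this fallback, so prefer s.iloc[0] for positional access.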

#### Subsetting Data

We can use boolean/logical expressions to select data.

See the cheat sheet for a quick list of different operators.

In [None]:
2 < s # Constructing a boolean series

In [None]:
s[2 < s] # Simple boolean query

In [None]:
s[(2 < s) & (s < 5)] # More complex boolean; watch operator precedence!
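
To make the precedence warning concrete, here is a small sketch on a fresh Series. In Python, `&` binds more tightly than `<`, so dropping the parentheses changes the meaning of the expression and, in this case, makes it fail.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# With parentheses, each comparison is evaluated first, then combined with &
both = s[(2 < s) & (s < 5)]

# Without parentheses, Python reads the expression as the chained
# comparison 2 < (s & s) < 5, which needs a single True/False and fails
try:
    s[2 < s & s < 5]
    failed = False
except ValueError:
    failed = True

print(both.tolist(), failed)
```

The same applies to `|` (or) and `~` (not): wrap each comparison in its own parentheses.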

Look at what happens to the index with a boolean selection.

In [None]:
s[2 < s] 

We lose some index entries! 

But, sometimes, we want to keep the index intact. Here is how to do that.

In [None]:
s.where(s > 2) # boolean indexing again, but preserving the original index
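
A short sketch of how `where` relates to plain boolean selection, rebuilding the same Series so the snippet runs on its own. Entries that fail the test are kept as missing values, can be replaced through the `other` argument, or can be dropped afterwards.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["Lbl1", "Lbl2", "Lbl3", "Lbl4", "Lbl5"])

kept = s.where(s > 2)               # entries failing the test become NaN
filled = s.where(s > 2, other=0)    # ... or a replacement value of our choice
dropped = s.where(s > 2).dropna()   # dropping NaN gives the shorter selection

print(kept)
print(filled)
print(dropped)
```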

### Pandas DataFrames

> DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.


In [None]:
df = pd.DataFrame({"another_s": [6, 7, 8, 9, 10],
                   "yet another": [11, None, None, 14, 15]},
    index=["Lbl1", "Lbl2", "Lbl3", "Lbl4", "Lbl5"])
df

In [None]:
df.columns # dfs have column indices!

### DataFrame Data Access

You can use the same methods as Series to access and subset DataFrame data.

The behaviour is slightly different in some cases, and there is some additional functionality that allows you to work with columns.

For further details, see 

#### Accessing Rows and Columns

In [None]:
df.another_s

In [None]:
# 'df.yet another' invalid!
df["yet another"]

To select individual rows, use `loc` and `iloc`.

In [None]:
df.loc["Lbl1"]  # the label must match an entry of the index exactly

In [None]:
df.iloc[0]

#### Adding, Selecting and Reordering Columns

In [None]:
df["s"] = s # Remember s?
df

In [None]:
df[["s", "another_s"]] # Subset and reorder columns

### DataFrame Subsetting

You can use boolean expressions as with series, but with different columns too!

In [None]:
df[df.s < 2]

In [None]:
df[df["yet another"] > df.s]

### Working with Missing Data

Take another look at the DataFrame.

In [None]:
df

The `NaN` values indicate missing data. 

Let's briefly review their behavior. We'll look into how to deal with them later.

We can check for missing values.

In [None]:
df["yet another na"] = df["yet another"].isna()
df[["another_s", "yet another", "yet another na"]]

Missing values get propagated in mathematical operations and ignored in aggregation functions.

In [None]:
df["another_s"] + df["yet another"]

In [None]:
df["yet another"].sum()
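
The skipping behaviour in aggregations is a default rather than a fixed rule. The sketch below rebuilds the values of the `"yet another"` column as a standalone Series (named `values` here for illustration) and shows the `skipna` argument, which switches the skipping off.

```python
import pandas as pd

values = pd.Series([11, None, None, 14, 15])  # same values as "yet another"

print(values.sum())                 # missing entries are skipped
print(values.sum(skipna=False))     # missing entries propagate to the result
print(values.count(), len(values))  # non-missing entries vs. all entries
```

Comparing `count()` with `len()` is a quick way to see how many entries an aggregate was actually computed from.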

### Special Datatypes

Pandas has special datatypes for categorical and time data.

These datatypes support a number of sophisticated operations.

We will see some of them later on, but for reference see the following:
- [Categorical data](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html)
- [Timeseries data](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html)
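
As a quick preview, here is a sketch with made-up values showing the two accessors these datatypes expose: `.dt` for datetime data and `.cat` for categorical data.

```python
import pandas as pd

# Datetime data: components are available through the .dt accessor
dates = pd.to_datetime(pd.Series(["2019-10-21", "2021-09-20"]))
print(dates.dt.year.tolist())
print(dates.dt.day_name().tolist())

# Categorical data: labels and integer codes live behind the .cat accessor
sizes = pd.Series(["small", "large", "small"]).astype("category")
print(list(sizes.cat.categories))
print(sizes.cat.codes.tolist())
```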

## Basic Data Wrangling: Loading, Viewing, Cleaning and Enriching Data

You can divide data wrangling into three broad activities.
1. Inspection - Get familiar with the data and check for potential problems
2. Preparation - Create or clean variables, deal with missing values
3. Exploration - Compute descriptive statistics and identify basic patterns in the data

The ordering above is typical, but not absolute. 

The different activities often tend to blend together. 

### Reading and Writing Data

`pandas` can import a wide variety of common data formats. 

Some supported file types include the following.
- CSV
- Excel
- SPSS
- Stata

For a full list, see the [I/O documentation](https://pandas.pydata.org/docs/reference/io.html).

We can load datasets by calling the appropriate read function.

In [None]:
# For this workshop, we will work with the following data
load_path = "federal-candidates-2021-10-20.csv"

# Load the data 
df = pd.read_csv(load_path)

Writing to these formats is also generally supported.

In [None]:
# Let's save a copy of the data to this file
save_path = "federal-candidates-copy.csv"

# To write a dataframe, call the write method for the chosen format
df.to_csv(save_path)
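
A small sketch of a write/read round trip, using an in-memory buffer (`io.StringIO`) in place of a real file and a made-up two-row table named `small`. It illustrates one detail worth knowing: by default `to_csv` also writes the row index, which comes back as an extra column unless you pass `index=False`.

```python
import io
import pandas as pd

# A tiny made-up table standing in for a real dataset
small = pd.DataFrame({"name": ["A", "B"], "votes": [120, 95]})

buffer = io.StringIO()
small.to_csv(buffer)                 # default: the row index is written too
with_index = pd.read_csv(io.StringIO(buffer.getvalue()))

buffer = io.StringIO()
small.to_csv(buffer, index=False)    # skip the row index
without_index = pd.read_csv(io.StringIO(buffer.getvalue()))

print(list(with_index.columns))      # includes an extra unnamed column
print(list(without_index.columns))
```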

### Viewing the Data

The first step in analyzing data is to get familiar with it.

This way, we get a feel for how much cleaning will be necessary.

We also give ourselves a chance to catch anything weird.

To get our bearings, let's take a look at the metadata.

In [None]:
df[df.columns[:10]].info()

A good next step is to inspect the data directly and continue to look for
anything weird.

We can look at the top or bottom of the dataset, or randomly sample rows from it.

In [None]:
df.head(5) # Peek at the first n rows. See anything weird?

In [None]:
df.tail(5) # Peek at the last n rows.

In [None]:
df.sample(5) # Peek at a random sample of rows.

### Data Cleaning

Let's get familiar with some `pandas` data cleaning tools.

Our first data cleaning step is to clean up the data types.

Sometimes `pandas` fails to assign the right dtype when constructing a variable. 

#### Adjusting Datatypes

String data are typically assigned the 'object' type.

This is `pandas`'s way of signaling it doesn't really know how to interpret the data.

In this dataset, almost all columns with string data represent categorical
variables. 

However, `edate` is a date, and `candidate_name` and `occupation` are free-form strings.

In [None]:
object_cols = df.select_dtypes(include="object").columns
print(f"Object-type columns: {object_cols}.")

We can cast the affected variables to the correct dtypes as follows.

In [None]:
# Cast edate as a datetime variable
df.edate = pd.to_datetime(df.edate, yearfirst=True) 

# Cast object variables as categorical
for col in df.select_dtypes(include="object").columns:
    if col not in ["candidate_name", "occupation"]: 
        df[col] = df[col].astype("category")

Let's check what happened to the object-type columns.

In [None]:
df[object_cols].dtypes

Finally, let's take a look at the numerical data.

Do you notice anything weird?

In [None]:
float_cols = df.select_dtypes("float").columns
print(f"Float columns: {float_cols}")

int_cols = df.select_dtypes("int").columns
print(f"Int columns: {int_cols}")

Of the four float columns, it is only natural to represent `percent_votes` as a 
float type variable. 

For `riding_id`, we are better off using a categorical datatype, even though the 
values look like integers. 

Likewise, the `id` column should also be viewed as a
categorical. In both cases, the ordering of the values is not meaningful.

Here is how we convert the datatype of `riding_id` and `id`.

In [None]:
df.riding_id = df.riding_id.astype("category")
df.id = df.id.astype("category")

The `birth_year` and `votes` columns should be using an integer dtype.

But why did three variables get 'mis-coded' as float in the first place? 

Perhaps we can get a clue by comparing them to correctly coded int variables.

In [None]:
df[["birth_year", "votes", "year", "num_candidates"]].info()

It's because the default integer datatype does not support missing
values! 

Luckily, `pandas` has another integer datatype that supports missing values. We
can just cast to that.

In [None]:
df.birth_year = df.birth_year.astype("Int64")
df.votes = df.votes.astype("Int64")

Don't mix up 'int64' with 'Int64'! In pandas, they are different: 
- `int64` refers to the regular int type that has no missing value support
- `Int64` refers to the int type with missing value support
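
The difference can be seen on a three-element Series with made-up birth years (`raw` and `nullable` are illustrative names). This is only a sketch of the behaviour described above.

```python
import pandas as pd

raw = pd.Series([1950, None, 1972])  # the missing entry forces a float dtype
print(raw.dtype)

nullable = raw.astype("Int64")       # nullable integer dtype keeps the gap
print(nullable.dtype)
print(nullable.isna().tolist())

try:
    raw.astype("int64")              # plain int64 cannot hold the missing entry
    cast_ok = True
except (ValueError, TypeError):
    cast_ok = False
print(cast_ok)
```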

#### Cleaning Categorical Variables

Categorical variables tend to require some extra attention, even in the most
well-curated datasets.

Take a look at the values of the `gender` variable. 

Keep in mind what the data's README file says about this variable:
> gender is a binary factor variable encoding candidate gender.

In [None]:
df.gender.cat.categories # This is how we access category names

We were told `gender` is supposed to be a binary variable, but we got three values?

So what's going on here? 

To get a better idea, let's tabulate the variable.

In [None]:
df.gender.value_counts()

Let's look up these cases by subsetting the data. 

In [None]:
df[df.gender == "2"]

These candidates are from the most recent election and, if you look them up, you'll find that both have a non-binary gender identity. 

We can recode the data to give those two cases more explicit labels. 

In [None]:
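df.gender = df.gender.replace({"2": "NB"})
# On recent pandas versions, replace() on a categorical column may emit a
# deprecation warning; df.gender.cat.rename_categories({"2": "NB"}) is an alternative
df.gender.cat.categories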

Replace is very powerful and works with other variable types as well!

It is particularly handy when you want to collapse multiple categories.
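
As a sketch of collapsing categories, the following maps several labels onto one coarser label. The party labels are made up for illustration, and a plain string Series is used rather than a categorical one.

```python
import pandas as pd

# Made-up party labels, for illustration only
party = pd.Series(["Liberal", "Green", "Bloc", "Liberal", "Independent"])

# Map several detailed labels onto one coarser label
collapsed = party.replace({"Green": "Other", "Bloc": "Other", "Independent": "Other"})

print(collapsed.value_counts())
```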

Another adjustment we can make is to rename categories. 

Why? Because it is confusing and difficult to work with
poorly named categories.

Take a look at the category names in the `censuscategory` variable. 

In [None]:
df.censuscategory.cat.categories

`censuscategory` gives candidates' occupation according to the Census Canada taxonomy.

These names are precise, but wordy. Let's abbreviate them. 

In [None]:
df.censuscategory = df.censuscategory.cat.rename_categories([
    "Business", "Health", "Management", "MP", "Science", "Resources", "Culture",
    "Social", "Manufacturing", "Sales", "Trades"
])
df.censuscategory.head(10)

#### Subsetting the Data

Often, we are not interested in the entirety of the data. It then makes sense to subset the data using the techniques we saw above. 

Here, we'll focus on a subset of the available variables in elections after 1990.

In [None]:
df = df[(df.year > 1990) & (df.type_elxn == "General")] # keep general election data from after 1990
df = df[["parliament", "edate", "year", "province", "riding", "id", "candidate_name", "birth_year", 
         "censuscategory", "party_major_group", "votes", "percent_votes", "elected"]].copy() # copy, so adding columns later does not warn
df.head(5)

### Handling Missing Data

Let's try to identify where we have missing values.

In [None]:
na_counts = df.isna().sum() # .sum() is an aggregation function!
na_counts[na_counts > 0]

`pandas` gives us several options to deal with missing values.

Our options are basically to do nothing, drop the missing values, or impute them (i.e., fill them in).

To keep things simple, let's focus on a single candidate, Elizabeth May (`id == 22644`).

In [None]:
elizabeth_may = df[df.id == 22644][["edate", "birth_year", "censuscategory"]]
elizabeth_may

Dropping is the simplest solution, but is not ideal because of how much data is lost.

In [None]:
elizabeth_may.dropna() # Drop all rows with missing data

In [None]:
elizabeth_may.dropna(axis=1) # Drop columns with missing data

Another strategy is to fill in the missing data with some reasonable values.

In [None]:
elizabeth_may.birth_year # Original data with missing values

In [None]:
elizabeth_may.birth_year.fillna(1954) # Fill birth year by fixed value

It's best not to do manual imputation. We could use some automation instead.

A simple automation is to carry existing values forwards or backwards through the data. But, we have to make sure the data is sorted properly.

In [None]:
em_sorted = elizabeth_may.sort_values("edate") # Make sure data is sorted first.

In [None]:
em_sorted.birth_year.bfill() # Carry birth year back over the index; fillna(method=...) is deprecated in recent pandas

In [None]:
em_sorted.censuscategory.ffill() # Carry occupation forward in time; fillna(method=...) is deprecated in recent pandas

You can do more sophisticated imputation as well, using the `interpolate()` method.
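
A minimal sketch of `interpolate()` on made-up numeric readings. With no arguments it fills gaps linearly between the neighbouring known values.

```python
import pandas as pd

# Made-up measurements with two gaps
readings = pd.Series([10.0, None, None, 16.0])

# Default method is linear: gaps are filled along a straight line
smoothed = readings.interpolate()
print(smoothed.tolist())
```

Linear interpolation only makes sense for quantities that vary smoothly, so it would not be a good fit for something like a birth year.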

A detailed discussion of missing values and basic imputation is available in [Working with Missing Data](https://pandas.pydata.org/docs/user_guide/missing_data.html).

Finally, we can combine the imputation functions with techniques discussed in the next section to handle grouped data.
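
As a preview, here is a sketch of group-wise imputation on made-up candidate records (`records`, `cand_id`, and `birth_year_filled` are illustrative names). Values are carried forward and then backward, but only within each candidate's own rows, so one candidate's birth year never leaks into another's.

```python
import pandas as pd

# Made-up candidate records, for illustration only
records = pd.DataFrame({
    "cand_id":    [1, 1, 1, 2, 2],
    "year":       [2011, 2015, 2019, 2015, 2019],
    "birth_year": [None, 1954, None, None, 1980],
})

# Sort first, so that "forward" and "backward" mean forward and backward in time
records = records.sort_values(["cand_id", "year"])

# Fill forward, then backward, separately for each candidate
records["birth_year_filled"] = (
    records.groupby("cand_id")["birth_year"]
    .transform(lambda col: col.ffill().bfill())
)
print(records)
```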

### Enriching the Data with New Variables

Once we have sufficiently cleaned our data, we usually want to calculate new variables from it.

We can simply calculate the value and add a new variable as shown below.

In [None]:
df["age"] = df.year - df.birth_year
df.age.describe()

#### Grouping

Often, we want to calculate statistics by group.

For this, we can use `groupby()`, which will automatically split the dataset by group. 

We can then [aggregate, filter, or transform the data by group](https://pandas.pydata.org/docs/user_guide/groupby.html).
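
The three kinds of group operation differ mainly in the shape of their result. The sketch below uses a made-up table (`toy`) to show one of each: aggregation returns one row per group, transformation returns one value per original row, and filtering keeps or drops whole groups.

```python
import pandas as pd

# Made-up data: two groups of different sizes
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1, 3, 10, 20, 30],
})
grouped = toy.groupby("group")["value"]

agg = grouped.mean()                                     # one row per group
centered = toy["value"] - grouped.transform("mean")      # same length as the input
big = toy.groupby("group").filter(lambda g: len(g) > 2)  # keeps whole groups only

print(agg)
print(centered.tolist())
print(big)
```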

As a starting example, let's try to find the age of the oldest MP in each parliament.

In [None]:
grouped = df.groupby("parliament") # Split into groups
grouped_age = grouped.age # Select the age variable
df["oldest_mp_age"] = grouped_age.transform("max") # Find max age within each group; the string name avoids a warning about the built-in max
df.oldest_mp_age.describe() # Describe the result

To give a more complex example, let's try calculating the margin of victory in each contest.

In [None]:
grouped = df.groupby(["edate", "province", "riding"]) # group by contest
grouped_pct_votes = grouped.percent_votes # select grouped percent votes variable

def compute_margin(data): # define margin computation
    if data.size > 1:
        runner_up = data.sort_values(ascending=False).iloc[1]
        return data - runner_up
    else:
        return None

df["margin"] = grouped_pct_votes.transform(compute_margin) # calculate margin
df.margin.describe()

#### Method Chaining and Piping

Did you notice how tedious it was to calculate the previous variables? 

We had to set a lot of intermediate variables. 

We can avoid this using a technique called *method chaining* or *piping*.

Here is how it goes for the margin of victory:

In [None]:
df["margin"] = (df
    .groupby(["edate", "province", "riding"])
    .percent_votes
    .transform(compute_margin)
)

This works because each call in the sequence invokes a method of the result returned by the previous call.

This technique is good for grouping operations together into logical chunks. But try not to abuse it.

What happens if we want to use piping with functions that are not methods of the result?

For instance, what happens if we want to pipe some custom functions?

We just use the special method `pipe()`, which allows us to pass in arbitrary functions and arguments.

- [DataFrame pipe method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pipe.html)
- Also available for `Series` and grouped data
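
A sketch of `pipe()` with two custom functions. The table and the helpers `add_vote_share` and `keep_above` are made up for illustration. `pipe()` passes the object it is called on as the first argument of the function, followed by any extra arguments.

```python
import pandas as pd

# Made-up results table
results = pd.DataFrame({
    "riding": ["A", "A", "B", "B"],
    "votes":  [120, 80, 50, 150],
})

def add_vote_share(data, group_col):
    """Return a copy of `data` with each row's share of its group's votes."""
    totals = data.groupby(group_col)["votes"].transform("sum")
    return data.assign(share=data["votes"] / totals)

def keep_above(data, column, threshold):
    """Keep only the rows where `column` exceeds `threshold`."""
    return data[data[column] > threshold]

winners = (results
    .pipe(add_vote_share, "riding")
    .pipe(keep_above, "share", 0.5)
)
print(winners)
```

Without `pipe()`, the same computation would read inside-out as `keep_above(add_vote_share(results, "riding"), "share", 0.5)`.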

## More Data Wrangling: Aggregating, Reshaping, Merging and Concatenating Data

Often, we need to modify the structure of our dataset.

There are a few reasons we might want to do this:
- We might want to look at aggregate statistics
- Our data might need to be formatted a certain way to facilitate or enable analysis
- We might need to incorporate some supplemental data to calculate some variable of interest
- Our data may be spread over many files and/or dataframes

### Aggregating Data

We often need to calculate aggregate statistics when trying to provide descriptives for our data. 

Let's calculate the number of candidates in each election for each party and occupation type.

In [None]:
count = (df
    .groupby(["edate", "party_major_group", "censuscategory"], as_index=False,
             observed=False) # explicit default: also count unobserved category combinations
    .size())

### Reshaping Data

Suppose now that we want the average number of candidates in each occupation by party.

For this, one easy approach is to pivot the party and censuscategory variables to be column indices and then take an average.

Refer to [Reshaping and Pivot Tables](https://pandas.pydata.org/docs/user_guide/reshaping.html) for more information.

In [None]:
count_p = count.pivot(index="edate", columns=["party_major_group", "censuscategory"], values="size")
count_p.head(4)

In [None]:
count_p.mean()

We can massage the dataset back to its original shape.

In [None]:
count_p.unstack().reset_index(name="size") # name the restored value column
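
The same round trip is easier to follow on a toy table. In this sketch (all names and values are made up), `pivot` turns a long table into a wide one and `stack` turns it back.

```python
import pandas as pd

# Long format: one row per (year, party) pair
long = pd.DataFrame({
    "year":  [2019, 2019, 2021, 2021],
    "party": ["X", "Y", "X", "Y"],
    "n":     [3, 5, 4, 6],
})

# Long to wide: one row per year, one column per party
wide = long.pivot(index="year", columns="party", values="n")
print(wide)
print(wide.mean())  # column-wise averages, one per party

# Wide back to long: stack the party columns into rows again
back = wide.stack().reset_index(name="n")
print(back)
```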

### Merging Data

To get a handle on merging, let's say we are interested in analyzing candidate occupation by sector.

In [None]:
sector = pd.DataFrame({
            "sector": ["Tertiary", "Tertiary", "Tertiary", "Tertiary", "Tertiary", "Primary", 
                       "Tertiary", "Tertiary", "Secondary", "Tertiary", "Secondary"],
    "censuscategory": ["Business", "Health", "Management", "MP", "Science", "Resources", "Culture", 
                       "Social", "Manufacturing", "Sales", "Trades"]
})
sector

Using the merge function, we can specify how we want this DataFrame to be combined with the main data set.

In [None]:
merged = pd.merge(df, sector, on="censuscategory", how="left")
merged[["year", "candidate_name", "censuscategory", "sector"]]

Look more closely at the code:
```python
pd.merge(df, sector, on="censuscategory", how="left")
```

We usually talk about left-hand and right-hand arguments. These are the first two arguments to merge.

The `on` argument tells us what variable(s) to merge on. 

The `how` argument specifies how the data gets merged.

To read more, see the [guide on database-style joins](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging).
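
To see what `how` changes, here is a sketch on two made-up tables (`left` and `right`), where one key on each side has no partner. The optional `indicator=True` argument adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Made-up tables: "Pilot" has no match on the right, "Trades" none on the left
left = pd.DataFrame({"name": ["Ann", "Bob", "Cy"],
                     "job":  ["Health", "Sales", "Pilot"]})
right = pd.DataFrame({"job":    ["Health", "Sales", "Trades"],
                      "sector": ["Tertiary", "Tertiary", "Secondary"]})

# how="left": keep every row of `left`; unmatched rows get a missing sector
left_join = pd.merge(left, right, on="job", how="left", indicator=True)

# how="inner": keep only rows whose key appears in both tables
inner_join = pd.merge(left, right, on="job", how="inner")

print(left_join)
print(inner_join)
```

A left join is the safer default when enriching a main dataset, since it never drops rows from the left-hand table.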

### Concatenating Data

Finally, sometimes our data is spread over multiple dataframes, and we just want to combine it.

This is what the concatenation operation does.

See [Concatenating Objects](https://pandas.pydata.org/docs/user_guide/merging.html#concatenating-objects) for further details.

In [None]:
# Separate counts by election; note iteration over groupby
separate_dataframes = [obj for name, obj in count.groupby("edate")]
separate_dataframes[:2]

In [None]:
pd.concat(separate_dataframes) # back together!
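
One thing to watch when concatenating is what happens to the row labels. The sketch below uses two made-up tables (`first` and `second`) to show the default behaviour and two common alternatives.

```python
import pandas as pd

first = pd.DataFrame({"party": ["X", "Y"], "size": [3, 5]})
second = pd.DataFrame({"party": ["X", "Z"], "size": [4, 1]})

stacked = pd.concat([first, second])                        # original row labels are kept
renumbered = pd.concat([first, second], ignore_index=True)  # fresh labels 0..n-1
labelled = pd.concat([first, second], keys=["e1", "e2"])    # outer label records the source

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(labelled)
```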

## Conclusion

We have only scratched the surface of what you can do with `pandas`.

Always remember that the documentation is your best friend.

Let me leave you with pointers to a few other libraries:

- `numpy` is the standard Python library for mathematical tasks. It is very powerful and you can use it almost seamlessly in conjunction with `pandas`.
- If you want to run statistical analyses with your `pandas` data, check out Python's `statsmodels` package.
- Finally, if you want to make beautiful graphs, please come back next week, when we will look at `matplotlib`, Python's de facto standard plotting library.
