# Data Science with Python (Pandas)

Can Şerif Mekik

December 6, 2021

<table align="left">
<tr>
<td><img src=CDSI_Fac.of.Sc_logo.png alt="CDSI Logo" width="300"/></td>
<td><img src=mcgill_ccr_approval_croppedforblock_0.png alt="CCR Approved Logo" width="300"/></td>
</tr>
</table>

## Introductory Remarks

This workshop assumes minimal working knowledge of Python. 

Pandas is a great place to start using Python for data science
- Similar feel to other stats software like R or Stata
- Works well on its own but also integrates well with the Python data science ecosystem

`Pandas` is excellent for [Data Wrangling](https://en.wikipedia.org/wiki/Data_wrangling), our main topic. 

This is the process of making raw data ready for statistical analysis and/or modeling.

## Useful Resources

The [Pandas Cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) is an excellent two-page summary of essential `pandas` features.

The [Official Pandas Docs](https://pandas.pydata.org/docs/) are the single best resource for information on `pandas` short of the source code.

If you are looking to learn about a specific function or feature, go into the API section.

The docs also offer tutorials and other helpful material.

## Contents

1. Setup
2. Pandas Datastructures
3. Basic Data Wrangling
4. More Data Wrangling
5. Data Exploration
6. Advanced Topics (if time allows)

## Setup

We will walk through the first steps of analyzing a sample dataset.

Our data set is Semra Sevi's Canadian Federal Elections dataset.

You can find a copy of the dataset and this Jupyter notebook at the following address.

```https://link.to.materials```

### Installing and Importing pandas

`pandas` is fairly easy to install as it comes pre-packaged
in Anaconda.

Base `pandas` has only a few dependencies. 

You may have to install optional dependencies depending on your work and your setup.

For a full list of optional dependencies, see the
[Installation Documentation](https://pandas.pydata.org/docs/getting_started/install.html).


In [None]:
# import the pandas library!

import pandas as pd

## Essential `pandas` Datastructures

In [None]:
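# Series.cat  # Accessor for categorical-specific methods and properties
# df.index    # Returns the row index
# df.columns  # Returns the column index, handy for getting variable names
# df.dtypes   # Returns the datatype of each column
# df.loc      # Label-based selection of rows and columns
# df.iloc     # Position-based selection of rows and columns
# etc.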
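
As a minimal sketch of the attributes listed above, the following builds a small made-up table (the `toy` frame and its values are illustrative assumptions, not part of the workshop dataset) and shows a `Series`, a `DataFrame`, and the two main indexers.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
votes = pd.Series([120, 95, 310], index=["a", "b", "c"], name="votes")

# A DataFrame is a table whose columns are Series sharing one row index.
# The values here are invented for illustration.
toy = pd.DataFrame(
    {"party": ["X", "Y", "X"], "votes": [120, 95, 310]},
    index=["a", "b", "c"],
)

print(toy.index)    # row labels
print(toy.columns)  # column labels (variable names)
print(toy.dtypes)   # datatype of each column

# loc selects by label, iloc selects by integer position.
by_label = toy.loc["b", "votes"]
by_position = toy.iloc[1, 1]
```

Both lookups point at the same cell: row `"b"` is at position 1 and column `"votes"` is at position 1.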

## Basic Data Wrangling

You can divide data wrangling into three broad activities.
1. Inspection - Get familiar with the data and check for potential problems
2. Preparation - Create or clean variables, deal with missing values
3. Exploration - Compute descriptive statistics and identify basic patterns in the data

The ordering above is typical, but not absolute. 

The different activities often tend to blend together. 

### Reading and Writing Data

`pandas` can import a wide variety of common data formats. 

Some supported file types include the following.
- CSV
- Excel
- SPSS
- STATA

For a full list, see the [I/O documentation](https://pandas.pydata.org/docs/reference/io.html).

In [None]:
# For this workshop, we will work with the following data
load_path = "federal-candidates-2021-10-20.csv"

# Load the data 
df = pd.read_csv(load_path)


Writing to these formats is also generally supported.

In [None]:
# Let's save a copy of the data to this file
save_path = "federal-candidates-copy.csv"

# To write a dataframe, call the write method for the chosen format
df.to_csv(save_path)


### Viewing the Data

The first step in analyzing data is to get familiar with it.

This way, we get a feel for how much cleaning will be necessary.

We also give ourselves a chance to catch anything weird.

In [None]:
# Get general metadata
# E.g. Info on indices, datatypes, memory etc.
df.info()


A good next step is to inspect the data directly and continue to look for
anything weird.

We can look at the top or bottom of the dataset, or randomly sample rows from it.

In [None]:
df.head(10) # Peek at the first n rows. See anything weird?

In [None]:
df.tail(10) # Peek at the last n rows.

In [None]:
df.sample(10) # Peek at a random sample of rows.
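
If you want the random peek to be repeatable, for example when sharing a notebook, one option is the `random_state` parameter of `sample`. The sketch below uses a made-up single-column frame (`toy`) rather than the workshop data.

```python
import pandas as pd

# Invented data: 100 rows numbered 0..99.
toy = pd.DataFrame({"x": range(100)})

# Fixing random_state makes the "random" sample the same on every run.
first = toy.sample(5, random_state=42)
second = toy.sample(5, random_state=42)

print(first)
print(toy.shape)  # (number of rows, number of columns)
```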

### Data Cleaning

Let's get familiar with some `pandas` data cleaning tools.

Our first data cleaning step is to clean up the data types.

Sometimes `pandas` fails to assign the right dtype when constructing a variable. 

#### Adjusting Datatypes

String data are typically assigned the 'object' type.

This is `pandas`'s way of signaling it doesn't really know how to interpret the data.

In this dataset, almost all columns with string data represent categorical
variables. 

However, `edate` is a date, and `candidate_name` and `occupation` are free-form strings.

In [28]:
object_cols = df.select_dtypes(include="object").columns
print(f"Object-type columns: {object_cols}.")

Object-type columns: Index(['candidate_name', 'occupation'], dtype='object').


We can cast the affected variables to the correct dtypes as follows.

In [None]:
df.edate = pd.to_datetime(df.edate, yearfirst=True) 

for col in df.select_dtypes(include="object").columns:
    if col not in ["candidate_name", "occupation"]: 
        df[col] = df[col].astype("category")
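
To see the same two casts in isolation, here is a minimal sketch on a small made-up frame. The column names mirror the dataset, but the values are illustrative assumptions.

```python
import pandas as pd

# Invented values; both columns start out holding plain strings.
toy = pd.DataFrame({
    "edate": ["2015-10-19", "2019-10-21", "2021-09-20"],
    "province": ["Ontario", "Quebec", "Ontario"],
})

# Parse the date strings into proper datetime values.
toy["edate"] = pd.to_datetime(toy["edate"], yearfirst=True)

# Store the repeated labels as a categorical variable.
toy["province"] = toy["province"].astype("category")

print(toy.dtypes)
```

After the casts, date components are available through the `.dt` accessor (e.g. `toy["edate"].dt.year`) and category names through `.cat.categories`.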

Let's check what happened to the object-type columns.

In [19]:
df[object_cols].dtypes

candidate_name    object
occupation        object
dtype: object

Finally, let's take a look at the numerical data.

Do you notice anything weird?

In [14]:
float_cols = df.select_dtypes("float").columns
print(f"Float columns: {float_cols}")

int_cols = df.select_dtypes("int").columns
print(f"Int columns: {int_cols}")

Float columns: Index(['birth_year', 'riding_id', 'votes', 'percent_votes'], dtype='object')
Int columns: Index(['id', 'parliament', 'year', 'num_candidates'], dtype='object')


Of the four float columns, only `percent_votes` is naturally represented as a
float-type variable.

For `riding_id`, we are better off using a categorical datatype, even though the 
values look like integers. 

Likewise, the `id` column should be treated as categorical. In both cases, the
ordering of the values is not meaningful.

Here is how we convert the datatype of `riding_id` and `id`.

In [None]:
df.riding_id = df.riding_id.astype("category")
df.id = df.id.astype("category")

The `birth_year` and `votes` columns should be using an integer dtype.

But why did three variables get 'mis-coded' as float in the first place?

Perhaps we can get a clue by comparing them to correctly coded int variables.

In [18]:
df[["birth_year", "votes", "year", "num_candidates"]].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46526 entries, 0 to 46525
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   birth_year      12276 non-null  Int64
 1   votes           45835 non-null  Int64
 2   year            46526 non-null  int64
 3   num_candidates  46526 non-null  int64
dtypes: Int64(2), int64(2)
memory usage: 1.5 MB


It's because the default integer datatype does not support missing
values! 

Luckily, `pandas` has another integer datatype that supports missing values. We
can just cast to that.

In [17]:
df.birth_year = df.birth_year.astype("Int64")
df.votes = df.votes.astype("Int64")

Don't mix up 'int64' with 'Int64'! In pandas, they are different: 
- `int64` refers to the regular int type that has no missing value support
- `Int64` refers to the int type with missing value support
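
The difference is easy to see on a tiny example. This sketch uses made-up birth years with one gap; it is an illustration, not output from the workshop dataset.

```python
import pandas as pd

# A missing value forces the default conversion to float64,
# because the regular int64 dtype cannot represent a gap.
birth_year = pd.Series([1945, None, 1960])
print(birth_year.dtype)

# The nullable Int64 dtype keeps whole numbers and marks the gap as <NA>.
nullable = birth_year.astype("Int64")
print(nullable)
```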

#### Cleaning Categorical Variables

Categorical variables tend to require some extra attention, even in the most
well-curated datasets.

Take a look at the values of the `gender` variable. 

Keep in mind what the data's README file says about this variable:
> gender is a binary factor variable encoding candidate gender.

In [24]:
df.gender.cat.categories # This is how we access category names

Index(['2', 'F', 'M'], dtype='object')

We were told `gender` is supposed to be a binary variable, but we got three values?

So what's going on here? 

To get a better idea, let's tabulate the variable.

In [25]:
df.gender.value_counts()

M    39938
F     6585
2        2
Name: gender, dtype: int64

Phew! It just looks like a couple of entries were somehow miscoded. 

We can easily fix something like this by recoding the data.

In this case, we'll just set those cases to a missing value.

In [None]:
df.gender = df.gender.replace({"2": None})
df.gender.cat.categories

`replace` works with other variable types as well!
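
For categorical columns specifically, another option is to drop the unwanted category with `cat.remove_categories`, which turns every value in that category into a missing value. The sketch below uses a made-up four-row series, not the real `gender` column.

```python
import pandas as pd

# Invented data with one miscoded entry, "2".
gender = pd.Series(["M", "F", "2", "M"], dtype="category")

# Dropping a category sets every value in it to missing.
cleaned = gender.cat.remove_categories(["2"])

print(cleaned.cat.categories)
print(cleaned.isna().sum())
```

This keeps the column categorical and leaves only the valid categories behind.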

Another adjustment we can make is to rename categories. 

Why? Because it is confusing and difficult to work with
poorly named categories.

Take a look at the category names in the `censuscategory` variable. 

In [26]:
df.censuscategory.cat.categories

Index(['Business, finance and administration occupations',
       'Health occupations', 'Management occupations', 'Members of Parliament',
       'Natural and applied sciences and related occupations',
       'Natural resources, agriculture and related production occupations',
       'Occupations in art, culture, recreation and sport',
       'Occupations in education, law and social, community and government services',
       'Occupations in manufacturing and utilities',
       'Sales and service occupations',
       'Trades, transport and equipment operators and related occupations'],
      dtype='object')

`censuscategory` gives candidates' occupation according to the Census Canada taxonomy.

These names are precise, but wordy. Let's abbreviate them.

In [None]:
# We could pass in a dictionary to be extra explicit, but that would involve a
# lot of copying. Since the categories are stored in a fixed order (shown
# above), we can just pass in a list with abbreviations in that same order.
df.censuscategory = df.censuscategory.cat.rename_categories([
    "Business", "Health", "Management", "MP", "Science", "Resources", "Culture",
    "Social", "Manufacturing", "Sales", "Trades"
])
df.censuscategory.head(10)
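
For comparison, the dictionary form of `rename_categories` might look like the sketch below. It runs on a small made-up series and renames only one category; categories that are not listed in the mapping keep their original names.

```python
import pandas as pd

# Invented data using two of the category names from the dataset.
occupation = pd.Series(
    ["Health occupations", "Management occupations", "Health occupations"],
    dtype="category",
)

# Keys are old names, values are new names.
short = occupation.cat.rename_categories({"Health occupations": "Health"})

print(short.cat.categories)
```

The dictionary form is more verbose, but it does not depend on getting the order of the list right.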

### Handling Missing Data



### Subsetting Data

Often, we are not interested in the entirety of the data. 

Perhaps we care about only certain variables, or we care about only certain
rows, and we may want to restrict our data to only those cases.

There are several ways to do this, depending on the circumstance.
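
As a rough sketch of the most common approaches, the following selects columns, filters rows, and does both at once. The `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Invented data.
toy = pd.DataFrame({
    "candidate": ["A", "B", "C", "D"],
    "year": [2015, 2019, 2019, 2021],
    "votes": [1200, 450, 3100, 980],
})

# Keep only some columns by passing a list of names.
cols = toy[["candidate", "votes"]]

# Keep only some rows with a boolean mask.
recent = toy[toy["year"] >= 2019]

# Do both at once with loc: rows first, then columns.
both = toy.loc[toy["votes"] > 1000, ["candidate", "year"]]

print(both)
```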

Read and write functions have many optional parameters.

You can use these parameters to control formatting, metadata, and more. 

These parameters can save you a lot of time, so be sure to check the documentation.
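
As one illustration, `read_csv` can restrict what gets loaded and `to_csv` can leave out the row index. The sketch below reads from an in-memory string so that it runs without any files; the CSV text is made up.

```python
import io

import pandas as pd

# Invented CSV content standing in for a file on disk.
csv_text = "id,name,votes\n1,A,100\n2,B,250\n3,C,75\n"

# usecols limits the columns read; nrows limits the number of rows.
small = pd.read_csv(io.StringIO(csv_text), usecols=["name", "votes"], nrows=2)

# With no path, to_csv returns the text; index=False omits the row index.
out = small.to_csv(index=False)
print(out)
```

Without `index=False`, the written file gains an extra unnamed column holding the row index, which then shows up as a stray column when the file is read back.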