# Lecture 4: Introduction to `pandas` and Data Wrangling

- Introduction to working with (tabular) data using `pandas`

- Learn about the two key data structures in pandas:
   + `Series` (1-dimensional)
   + `DataFrame` (2-dimensional)

- Import a csv file into a `pandas` `DataFrame`

- Selecting rows of a DataFrame

- Selecting columns of a DataFrame

- Computing summary statistics on a DataFrame


## Reading CSV files

Last week we looked at reading CSV files using a loop and the `.split()` method. This gave us a dictionary where each district is a key and the value is its population. That method wasn't too hard, but now suppose we had a table with many columns. We could create a dictionary of lists, or even a dictionary of dictionaries, depending on how we want to access the data, but it doesn't take a very large data set before this gets challenging to manage in basic Python.

This is where pandas comes in. 

In [None]:
# A reminder of how we read a csv file with two columns into a dictionary.

district_data = open("ED-Canada_2016.csv", encoding="utf-8").readlines()

district_populations_dict = {}
for line in district_data:
    entries = line.split(",")
    
    district_name = entries[0].strip()
    population_entry = entries[1].strip()
    population_int = int(population_entry)
    
    district_populations_dict[district_name] = population_int

print(district_populations_dict)
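To see why this gets awkward with more columns, here is a rough sketch of the "dictionary of lists" idea for a three-column table. The data is made up and typed inline rather than read from `ED-Canada_2016.csv`, and the column names (`name`, `population`, `area`) are assumptions for illustration only.

```python
# Hypothetical three-column data, typed inline so the example runs on its own.
csv_text = """Alpha,100,12.5
Beta,250,40.0
Gamma,75,8.25"""

# One list per column; position i in every list describes the same row.
names = []
populations = []
areas = []
for line in csv_text.splitlines():
    entries = line.split(",")
    names.append(entries[0].strip())
    populations.append(int(entries[1].strip()))
    areas.append(float(entries[2].strip()))

table = {"name": names, "population": populations, "area": areas}

# To look up one row we have to find its position, then index every list.
i = table["name"].index("Beta")
beta_row = {column: values[i] for column, values in table.items()}
print(beta_row)
```

Every new column means another list to create, fill, and keep lined up with the others, which is the bookkeeping we would rather not do by hand.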

## From lists to data frames

Python lists are very good at storing *one-dimensional data* (one "column" of data).

But in practice, we don't just want to store or compute on one column; we want to compute on an entire table, containing multiple columns!

While it is possible to do this in Python lists, it's more cumbersome to do so. It's also possible to do this with Python dictionaries, but we'd have to do all the work.

Fortunately, there is a library called pandas that was implemented specifically to support the kind of data processing we want to do. So, instead of writing all the code ourselves, we will rely on the pandas library and the new data types that it implements.

The main new data type that we get from pandas is a `DataFrame`, which is used to store (two-dimensional) tabular data.

### Importing `pandas`

To use a library in Python, we need to **import** it into our code.

You are going to start to see a lot of conventions. For example, you will almost always see pandas imported as below. There is nothing magical about `pd`, but when we all use it as a short-hand for `pandas` it is easy to remember what it means.

In [None]:
import pandas as pd

This line of code loads the `pandas` library, calling it `pd`.

Now we can use the `pd.read_csv` function to read in this data!  (I told you this was easier.)

In [None]:
district_df = pd.read_csv("ED-Canada_2016.csv", header=None)
district_df

For the record, we could import pandas as you see below, but the above convention is so common, I would strongly recommend that you follow it.

In [None]:
import pandas
district_df = pandas.read_csv("ED-Canada_2016.csv", header=None)
district_df
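The `header=None` argument used above tells `read_csv` that the file has no header row, so pandas labels the columns `0`, `1`, and so on. As a small sketch, the optional `names` argument lets us supply our own labels. The data below is made up and read from a string (via `io.StringIO`) so the example runs without the course file, and the labels `district` and `population` are my own choice.

```python
import io

import pandas as pd

# Made-up two-column data with no header row.
csv_text = "Alpha,100\nBeta,250\nGamma,75\n"

# Without names, the columns are labelled with integers.
no_names_df = pd.read_csv(io.StringIO(csv_text), header=None)
print(list(no_names_df))

# With names, we choose the column labels ourselves.
named_df = pd.read_csv(
    io.StringIO(csv_text), header=None, names=["district", "population"]
)
print(list(named_df))
print(named_df)
```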

## Creating your own data frame

Some of the examples we will look at will involve creating small data frames "by hand", since they're easier to conceptualize than large data frames.

Creating a data frame manually consists of two steps.

### Step 1: Create a dictionary of your data

A **dictionary** is another type of Python collection that lets you create associated pairs of data.
For us, a dictionary can map *column names* (strings) to *column values* (lists) in a table.

For example:

In [None]:
snow_data = {
    "Location": ["Downtown Toronto", "Toronto Pearson", "Ottawa"],
    "Snow Jan 15": [21, 22, 18],
    "Snow Jan 25": [56, 42, 2]
}

snow_data

### Step 2: Turn the dictionary into a data frame

Now, we can turn that dictionary into a pandas `DataFrame`. Note that the keys in the dictionary become the column labels and the values in the lists become the column values, so we are effectively converting parallel lists into a table.

In [None]:
snow_data_frame = pd.DataFrame(snow_data)

snow_data_frame

Of course, if we wanted to we could combine the two steps into one larger statement. That's more convenient, but a bit less explicit about what's going on. Make sure you understand the previous two steps separately, and then study how they're combined below!

In [None]:
snow_data_frame2 = pd.DataFrame({'Location': ['Downtown Toronto', 'Toronto Pearson', 'Ottawa'],
 'Snow Jan 15': [21, 22, 18],
 'Snow Jan 25': [56, 42, 2]})

snow_data_frame2

### Summarizing a dataframe

It is often convenient to get a quick look at a data frame and some summary statistics using the `.describe()` method.

Let's try to interpret what we see. Which summary statistics are useful in this case?

In [None]:
snow_data_frame.describe()
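As a small sketch of how to read that output: by default, `.describe()` on a data frame with both text and numeric columns summarizes only the numeric ones, and the result is itself a `DataFrame` whose rows are the statistics. The example below rebuilds `snow_data_frame` so it runs on its own.

```python
import pandas as pd

snow_data_frame = pd.DataFrame({
    "Location": ["Downtown Toronto", "Toronto Pearson", "Ottawa"],
    "Snow Jan 15": [21, 22, 18],
    "Snow Jan 25": [56, 42, 2],
})

summary = snow_data_frame.describe()

# "Location" holds text, so it is left out of the summary.
print(list(summary))

# The row labels are the names of the statistics.
print(list(summary.index))

# Pick one statistic for one column and compare with computing it directly.
print(summary["Snow Jan 15"]["mean"])
print(snow_data_frame["Snow Jan 15"].mean())
```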

## Create a pandas Series from a List

The `Series` is the other major data type that we get from pandas. It is a one-dimensional data structure, and is often used to hold a column of a `DataFrame`. It can also be used like a list.

In [None]:
location = ['Downtown Toronto', 'Toronto Pearson', 'Ottawa']

location_series = pd.Series(location)

location_series

In [None]:
snow1 = [56, 42, 2]

snow1_series = pd.Series(snow1)

snow1_series

In [None]:
# A pd.Series supports some of the same operations as a list, such as len() and indexing by position.
print(len(snow1_series))

print(snow1_series[0])
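A `Series` is not just a list with a different name, though. The sketch below contrasts the two: arithmetic on a `Series` is applied to each element, while the same operator on a list does something quite different. It also shows two of the built-in summary methods.

```python
import pandas as pd

snow1 = [56, 42, 2]
snow1_series = pd.Series(snow1)

# Multiplying a list repeats it.
doubled_list = snow1 * 2
print(doubled_list)

# Multiplying a Series multiplies each element.
doubled_series = snow1_series * 2
print(doubled_series)

# A Series also has summary methods built in.
print(snow1_series.sum())
print(snow1_series.mean())
```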

## Creating a Boolean Series based on a Condition

Create a Series where each element is `True` if the corresponding element of `snow1_series` is greater than 40 and `False` otherwise.

In [None]:
snow1 = [56, 42, 2]

snow1_series = pd.Series(snow1)
print(snow1_series)
snow1_series > 40

Create a Series where each element is `True` if the corresponding element of `snow1_series` is greater than 20 **AND** less than 50 and `False` otherwise.

In [None]:
print(snow1_series)

(snow1_series > 20) & (snow1_series < 50)


## Boolean logic with pandas `Series`

An extremely common operation on a data frame is to extract rows with specific characteristics. We are going to see an example later where we want to use census data to count the number of people in Toronto who own their own home. To do this we need to be able to select all the rows for people living in Toronto. We can accomplish this with Boolean Series in pandas.

When combining Boolean Series in pandas we use different logical operators. Note that we are using `&` and `|` instead of `and` and `or`.

`Series1 = pd.Series([True, False, True])`

`Series2 = pd.Series([False, False, True])`


Operation | Description | Result (shown as a list)
----------|-------------|-------------------------
`Series1 & Series2` | `Series1` and `Series2` | `[False, False, True]`
`Series1 \| Series2` | `Series1` or `Series2` | `[True, False, True]`
`Series1 != Series2` | `Series1` not equal to `Series2` | `[True, False, False]`
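The table can be checked with a few lines of code. This sketch also shows what happens if we try the ordinary Python keyword `and` on two Series: pandas raises a `ValueError` because it cannot reduce a whole Series to a single `True` or `False`.

```python
import pandas as pd

Series1 = pd.Series([True, False, True])
Series2 = pd.Series([False, False, True])

# Element-wise "and", "or", and "not equal".
and_result = (Series1 & Series2).tolist()
or_result = (Series1 | Series2).tolist()
not_equal_result = (Series1 != Series2).tolist()

print(and_result)
print(or_result)
print(not_equal_result)

# The keyword `and` does not work element by element on a Series.
keyword_and_failed = False
try:
    Series1 and Series2
except ValueError as error:
    keyword_and_failed = True
    print("ValueError:", error)
```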



## Create a pandas `DataFrame` using a dictionary

- A dictionary stores data in key-value pairs.

- A popular way to create a dictionary is to use curly braces `{}`, with colons `:` separating keys from values (`key: value`).

```
candy_dict = {"candy": ["red licorice", "caramel chocolate", "cherry sours"]}

```

- the *key* of `candy_dict` is `"candy"`
- the *value* stored under `"candy"` is the list `["red licorice", "caramel chocolate", "cherry sours"]`

In [None]:
candy_dict = {"candy": ["red licorice", "caramel chocolate", "cherry sours"]}
print(candy_dict)

We can create a `dict` of GGR274 course faculty.

In [None]:
data = {"academic department" : ["STA", "CSC", "GGR"], 
        "faculty": ["Michael Moon", "Karen Reid", "Alex Ramiller"],
        "favourite candy": ["red licorice", "caramel chocolate", "cherry sours"],
        "name length": [len("Michael Moon"), len("Karen Reid"), len("Alex Ramiller")]}

data

Let's store `data` in a pandas `DataFrame`.

In [None]:
pd.DataFrame(data)

Now, let's store the pandas `DataFrame` above in a variable called `GGR274fac_df`.

In [None]:
GGR274fac_df = pd.DataFrame(data)

GGR274fac_df

## Select rows of a `DataFrame` using a list of `True` & `False` values (a.k.a. Boolean values)

Let's remove the second row. 

In [None]:
print(GGR274fac_df)

GGR274fac_df[[True, False, True]]

- What happened?

- How can I remove the first row?

In [None]:
GGR274fac_df[[False, True, True]]
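Typing out `True` and `False` by hand only works for tiny tables. As a sketch of where this is heading, a comparison on a column produces a Boolean `Series` that can be used in the same position as the hand-written list. The example rebuilds `GGR274fac_df` so it runs on its own; the threshold of 10 is an arbitrary choice for illustration.

```python
import pandas as pd

GGR274fac_df = pd.DataFrame({
    "academic department": ["STA", "CSC", "GGR"],
    "faculty": ["Michael Moon", "Karen Reid", "Alex Ramiller"],
    "favourite candy": ["red licorice", "caramel chocolate", "cherry sours"],
    "name length": [len("Michael Moon"), len("Karen Reid"), len("Alex Ramiller")],
})

# True where the name has more than 10 characters, False otherwise.
long_name = GGR274fac_df["name length"] > 10
print(long_name)

# Use the Boolean Series to keep only the rows marked True.
long_name_rows = GGR274fac_df[long_name]
print(long_name_rows)
```

Here the condition happens to produce the same pattern as `[True, False, True]`, so the second row is dropped.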

## Select columns of a `DataFrame` using a list of Column Names

- The column names in the `DataFrame` `GGR274fac_df` can be obtained using `list()`.

- There are other ways to get the column names, but we will focus on this for now.


In [None]:
list(GGR274fac_df)
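For reference, one of the other ways is the `.columns` attribute, which holds the same labels in a pandas `Index` object. A minimal sketch, using a cut-down version of the faculty table so it runs on its own:

```python
import pandas as pd

fac_df = pd.DataFrame({
    "academic department": ["STA", "CSC", "GGR"],
    "faculty": ["Michael Moon", "Karen Reid", "Alex Ramiller"],
})

# list() on a DataFrame gives the column names.
names_from_list = list(fac_df)
print(names_from_list)

# .columns gives the same labels as a pandas Index.
print(fac_df.columns)

# .tolist() converts that Index into a plain Python list.
names_from_columns = fac_df.columns.tolist()
print(names_from_columns)
```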

- To select the column `favourite candy` we can add it in quotation marks inside the square brackets `[]` at the end of the `DataFrame` name.

- For example:

In [None]:
GGR274fac_df["favourite candy"]


In [None]:
# What type do you think this will be?

type(GGR274fac_df["favourite candy"])

However, if you want to select more than one column you need to put your selection in a list.

In [None]:
my_list_of_column_names = ["favourite candy", "name length"]

GGR274fac_df[my_list_of_column_names]

In [None]:
# What type will this be?
type(GGR274fac_df[my_list_of_column_names])

To be clear, when you select one column you get a Series. When you select a list of columns you get a DataFrame.

`GGR274fac_df[my_list_of_column_names]` or

`GGR274fac_df[["favourite candy", "name length"]]`

is NOT the same as 

`GGR274fac_df["favourite candy", "name length"]`

In [None]:
GGR274fac_df["favourite candy", "name length"] # raises a KeyError
# you need to pass a list of column names, not multiple strings

In [None]:
# a single-element list is still a list, so this returns a DataFrame! (This is very handy!)
GGR274fac_df[["favourite candy"]] 

In [None]:
GGR274fac_df_column_names = list(GGR274fac_df)

print(f"The list of column names is: {GGR274fac_df_column_names}")

In [None]:
GGR274fac_df_column_names[0]

In [None]:
GGR274fac_df[GGR274fac_df_column_names[0]]

You can select a column by its position in the list of column names using:

- `GGR274fac_df[GGR274fac_df_column_names[0]]`

- `GGR274fac_df[GGR274fac_df_column_names[3]]`

## Select rows of a `DataFrame`

- Rows can be selected from a `DataFrame` using a list of Boolean values.

`GGR274fac_df[[True, False, True]]`

   + selects the first and third rows of the `DataFrame` since the first and third values are `True`.  The second row is not selected since the second element of the list is `False`.

In [None]:
my_important_condition = [True, False, True]

my_important_condition

In [None]:
GGR274fac_cond = GGR274fac_df[my_important_condition] # select rows

my_important_columns = ["academic department", "faculty"]

GGR274fac_cond[my_important_columns] # select columns

## Select rows and columns of a `DataFrame`

We can get extra fancy and combine these two lines of code:

```
GGR274fac_cond = GGR274fac_df[my_important_condition] # select rows

GGR274fac_cond[my_important_columns] # select columns
```

to select rows and columns.

NOTE: You don't have to combine lines like this. While you are learning, it may be better to split these operations into separate lines so you can look at each one.

In [None]:
GGR274fac_df[my_important_condition][my_important_columns]
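As an aside that goes slightly beyond this lecture, pandas also provides the `.loc` indexer, which takes the row selection and the column selection together in one pair of square brackets. The sketch below rebuilds the faculty table and checks that both approaches give the same result; treat it as a preview rather than something you need yet.

```python
import pandas as pd

GGR274fac_df = pd.DataFrame({
    "academic department": ["STA", "CSC", "GGR"],
    "faculty": ["Michael Moon", "Karen Reid", "Alex Ramiller"],
    "favourite candy": ["red licorice", "caramel chocolate", "cherry sours"],
    "name length": [len("Michael Moon"), len("Karen Reid"), len("Alex Ramiller")],
})

my_important_condition = [True, False, True]
my_important_columns = ["academic department", "faculty"]

# Rows first, then columns, as two chained selections.
selected_chained = GGR274fac_df[my_important_condition][my_important_columns]

# Rows and columns together with .loc[rows, columns].
selected_loc = GGR274fac_df.loc[my_important_condition, my_important_columns]

print(selected_loc)
print(selected_chained.equals(selected_loc))
```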

## Exercise

Create a pandas `DataFrame` with three columns: 

1. Your first name and two people sitting close to you -- your (new) friends.

2. The distance from home to the U of T St. George campus for you and your two (new) friends.

3. Your favourite ??? Pick some category that you want to choose a favourite from. 

In [None]:
# create your DataFrame here.
my_peers = pd.DataFrame({
    "name": ["Michael Moon", "Karen Reid", "Alex Ramiller"],
    "distance to home": [11, 5, 3],
    "favorite ice cream": ["chocolate", "cherry", "pistachio"]
})
my_peers