# Lecture: Pandas I

[Acknowledgments Page](https://ds100.org/sp24/acks/)

A high-level overview of the [`pandas`](https://pandas.pydata.org) library to accompany Lecture 2.

In [1]:
# Import the pandas library
# `pd` is the conventional alias for Pandas, as `np` is for NumPy
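The code for this cell is not shown in this export. Based on the comments above, it was presumably the standard import:

```python
# Import pandas under its conventional alias
import pandas as pd
```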

## Series, DataFrames, and Indices 

Series, DataFrames, and Indices are fundamental `pandas` data structures for storing tabular data and processing the data using vectorized operations.

### Series

A `Series` is a 1-D labeled array of data. We can think of it as columnar data. 

#### Creating a new `Series` object
Below, we create a `Series` object and look at its two components: 1) values and 2) index.

In [2]:
# Create a Series

In [3]:
# Look at the values

In [4]:
# Look at the index
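The three cells above can be sketched as follows. The numeric values are illustrative assumptions, not necessarily the ones used in lecture.

```python
import pandas as pd

# Create a Series from a list (illustrative values)
s = pd.Series([4, -2, 0, 6])

# The values of this numeric Series are held in a NumPy array
print(s.values)

# No index was given, so pandas generated integer labels 0, 1, 2, 3
print(s.index)
```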

In the example above, `pandas` automatically generated an `Index` of integer labels. We can also create a `Series` object by providing a custom `Index`.

In [5]:
# Create a Series with a custom Index

In [6]:
# Look at the values

In [7]:
# Look at the index

After a `Series` has been created, we can replace its `Index` with a new one.

In [8]:
# Reassign the Index to a new Index
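A minimal sketch of creating a `Series` with a custom `Index` and then reassigning it. The values and labels are made up for illustration.

```python
import pandas as pd

# Create a Series with a custom Index (illustrative values and labels)
s = pd.Series([-1, 10, 2], index=["a", "b", "c"])
print(s.values)
print(s.index)

# Reassign the Index; the new Index must have the same length as the Series
s.index = ["first", "second", "third"]
print(s)
```

The values are unchanged by the reassignment; only the labels attached to them differ.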

#### Selection in Series
We can select a single value or a set of values in a `Series` using:
- A single label
- A list of labels
- A filtering condition

In [9]:
# Create a Series with a custom Index

**Selection using one or more label(s)**

In [10]:
# Selection using a single label
# Notice how the return value is a single array element

In [11]:
# Selection using a list of labels
# Notice how the return value is another Series

**Selection using a filter condition**

In [12]:
# Filter condition: select all elements greater than 0

In [13]:
# Use the Boolean filter to select data from the original Series
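The selection cells above might look like the following sketch, which uses an illustrative `Series` rather than the lecture's actual values.

```python
import pandas as pd

# Create a Series with a custom Index (illustrative values)
s = pd.Series([4, -2, 0, 6], index=["a", "b", "c", "d"])

# A single label returns a single element
single = s["a"]

# A list of labels returns another Series
subset = s[["a", "c"]]

# A comparison yields a Boolean Series aligned with the original Index
mask = s > 0

# Indexing with the Boolean Series keeps only the entries marked True
positive = s[mask]
print(positive)
```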

### DataFrame

A `DataFrame` is a 2-D tabular data structure with both row and column labels. In this lecture, we will see how a `DataFrame` can be created from scratch or loaded from a file. 

### Creating a new `DataFrame` object
We can create a `DataFrame` in a variety of ways. Here, we cover the following:
1. From a CSV file
2. Using a list and column names
3. From a dictionary
4. From a `Series`


#### Creating a `DataFrame` from a CSV file
For loading data into a `DataFrame`, `pandas` has a number of very useful file reading tools. We'll be using `read_csv` today to load data from a CSV file into a `DataFrame` object. 

In [14]:
# Load data from data/elections.csv into a DataFrame

By passing a column name to the `index_col` parameter of `read_csv`, we can set the `Index` at load time.

In [15]:
# Load data from data/elections.csv into a DataFrame and define the Index
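Since `data/elections.csv` is not included here, the sketch below reads a small made-up stand-in from an in-memory string. The rows are placeholders, and every column name other than `Candidate` and `Popular vote` is an assumption. Using `Candidate` as the index column is likewise an assumption.

```python
import io
import pandas as pd

# Made-up stand-in for data/elections.csv (not real election results)
csv_text = """Year,Candidate,Party,Popular vote,Result
2000,Candidate A,Party X,100,win
2000,Candidate B,Party Y,90,loss
2004,Candidate C,Party X,120,loss
2004,Candidate D,Party Y,130,win
"""

# With the real file this would be pd.read_csv("data/elections.csv")
elections = pd.read_csv(io.StringIO(csv_text))

# index_col makes the named column the Index instead of a regular column
elections_indexed = pd.read_csv(io.StringIO(csv_text), index_col="Candidate")
print(elections_indexed)
```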

#### Creating a `DataFrame` using a list and column names

In [16]:
# Creating a single-column DataFrame using a list

In [17]:
# Creating a multi-column DataFrame using a list of lists
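A sketch of both list-based constructions, with illustrative values and column names:

```python
import pandas as pd

# A flat list becomes a single column; `columns` supplies its name
df_one = pd.DataFrame([1, 2, 3], columns=["Numbers"])

# A list of lists is read row by row; `columns` names each position
df_two = pd.DataFrame(
    [[1, "one"], [2, "two"]],
    columns=["Number", "Description"],
)
print(df_two)
```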

#### Creating a `DataFrame` from a dictionary

In [18]:
# Creating a DataFrame from a dictionary of columns

In [19]:
# Creating a DataFrame from a list of row dictionaries
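The two dictionary-based forms can be sketched as below. The data is made up, and both calls build the same table.

```python
import pandas as pd

# Dictionary of columns: each key is a column label, each value a column
df_cols = pd.DataFrame({
    "Fruit": ["Strawberry", "Orange"],
    "Price": [5.49, 3.99],
})

# List of row dictionaries: each dictionary maps column labels to one row's values
df_rows = pd.DataFrame([
    {"Fruit": "Strawberry", "Price": 5.49},
    {"Fruit": "Orange", "Price": 3.99},
])
print(df_cols.equals(df_rows))
```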

#### Creating a `DataFrame` from a `Series`

In [20]:
# In the examples below, we create a DataFrame from a Series

In [21]:
# Passing Series objects for columns

In [22]:
# Passing a Series to the DataFrame constructor to make a one-column dataframe

In [23]:
# Using to_frame() to convert a Series to DataFrame
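A minimal sketch of the `Series`-based constructions in the cells above. The `Series` contents and the column names are illustrative assumptions.

```python
import pandas as pd

# Two Series sharing the same Index (illustrative values)
s_a = pd.Series(["a1", "a2", "a3"], index=["r1", "r2", "r3"])
s_b = pd.Series(["b1", "b2", "b3"], index=["r1", "r2", "r3"])

# Passing Series objects as columns; rows are aligned on the shared Index
df_two_cols = pd.DataFrame({"A-column": s_a, "B-column": s_b})

# Passing a single Series to the constructor gives a one-column DataFrame
df_single = pd.DataFrame(s_a)

# to_frame() performs the same conversion as a method on the Series
df_to_frame = s_a.to_frame()
print(df_two_cols)
```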

In [24]:
# Creating a DataFrame from a CSV file and specifying the Index column

In [25]:
# Need to reset the index to keep 'Candidate' as one of the DataFrame columns
# Set the index to the "Candidate" column

### `DataFrame` attributes: `index`, `columns`, and `shape`

In [26]:
# Look at the index

In [27]:
# Look at the columns

The `Index` can be restored to the default integer labels by calling `reset_index()` on a `DataFrame`. The old `Index` is moved back into the table as a regular column.

In [28]:
# Revert the index back to its default numeric labeling

In [29]:
# Look at the shape
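The attribute cells above can be sketched with a small made-up table standing in for the elections data. The rows are placeholders, and the `Year` and `Party` columns are assumptions.

```python
import pandas as pd

# Made-up stand-in for the elections data (not real results)
elections = pd.DataFrame({
    "Year": [2000, 2000, 2004, 2004],
    "Candidate": ["Candidate A", "Candidate B", "Candidate C", "Candidate D"],
    "Party": ["Party X", "Party Y", "Party X", "Party Y"],
    "Popular vote": [100, 90, 120, 130],
})

# Use the "Candidate" column as the Index
by_candidate = elections.set_index("Candidate")
print(by_candidate.index)    # row labels
print(by_candidate.columns)  # column labels; "Candidate" is no longer among them

# reset_index() returns a new DataFrame with default integer labels;
# the old Index comes back as the first column
restored = by_candidate.reset_index()
print(restored.shape)        # (number of rows, number of columns)
```

Note that `Candidate` ends up as the first column of `restored`, so the column order differs from the original table.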

### Slicing in `DataFrame`s

We can use `.head` to return only the first few rows of a `DataFrame`.

In [30]:
# Loading DataFrame again to keep the original ordering of columns

In [31]:
# Show the first 3 rows

We can also use `.tail` to get the last few rows.

In [32]:
# Show the last 5 rows
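A sketch of `.head` and `.tail` on a six-row, made-up stand-in for the elections table:

```python
import pandas as pd

# Made-up stand-in for the elections data (not real results)
elections = pd.DataFrame({
    "Year": [2000, 2000, 2004, 2004, 2008, 2008],
    "Candidate": [f"Candidate {c}" for c in "ABCDEF"],
    "Popular vote": [100, 90, 120, 130, 150, 140],
})

# First 3 rows
first_three = elections.head(3)

# Last 5 rows
last_five = elections.tail(5)
print(first_three)
print(last_five)
```

Both methods keep the original row labels, so `last_five` starts at label 1 rather than 0.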

#### Label-Based Extraction Using `loc`

Arguments to `.loc` can be:
1. A list.
2. A slice (syntax is inclusive of the right-hand side of the slice).
3. A single value.


`loc` selects items by row and column *label*.

In [33]:
# Selection by a list

In [34]:
# Selection by a list and a slice of columns

In [35]:
# Extracting all rows using a colon

In [36]:
# Extracting all columns using a colon

In [37]:
# Selection by a list and a single-column label

In [38]:
# Note that if we pass "Popular vote" in a list, the output will be a DataFrame

In [39]:
# Selection by a row label and a column label
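The `loc` cells above can be sketched on a made-up table with the default integer row labels. The rows are placeholders, and the `Year` and `Party` columns are assumptions.

```python
import pandas as pd

# Made-up stand-in for the elections data (not real results)
elections = pd.DataFrame({
    "Year": [2000, 2000, 2004, 2004],
    "Candidate": ["Candidate A", "Candidate B", "Candidate C", "Candidate D"],
    "Party": ["Party X", "Party Y", "Party X", "Party Y"],
    "Popular vote": [100, 90, 120, 130],
})

# List of row labels and a slice of column labels (the slice includes "Party")
block = elections.loc[[0, 1, 3], "Year":"Party"]

# A bare colon selects all rows, or all columns
all_rows = elections.loc[:, ["Year", "Candidate"]]
all_cols = elections.loc[[0, 3], :]

# A single column label returns a Series ...
votes_series = elections.loc[[0, 3], "Popular vote"]

# ... while the same label inside a list returns a DataFrame
votes_frame = elections.loc[[0, 3], ["Popular vote"]]

# A single row label and a single column label return one value
value = elections.loc[0, "Candidate"]
print(block)
```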

#### Integer-Based Extraction Using `iloc`

`iloc` selects items by row and column *integer* position.

Arguments to `.iloc` can be:
1. A list.
2. A slice (syntax is exclusive of the right-hand side of the slice).
3. A single value.


In [40]:
# Select the rows at positions 1, 2, and 3.
# Select the columns at positions 0, 1, and 2.
# Remember that Python indexing begins at position 0!

In [41]:
# Index-based extraction using a list of rows and a slice of column indices

In [42]:
# Selecting all rows using a colon

In [43]:
# Index-based extraction and a single column

In [44]:
# Extracting the value at row 0 and the second column
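A sketch of the `iloc` cells on the same kind of made-up table. All data is placeholder, and the column layout is an assumption.

```python
import pandas as pd

# Made-up stand-in for the elections data (not real results)
elections = pd.DataFrame({
    "Year": [2000, 2000, 2004, 2004],
    "Candidate": ["Candidate A", "Candidate B", "Candidate C", "Candidate D"],
    "Party": ["Party X", "Party Y", "Party X", "Party Y"],
    "Popular vote": [100, 90, 120, 130],
})

# Rows at positions 1, 2, 3 and columns at positions 0, 1, 2
block = elections.iloc[[1, 2, 3], [0, 1, 2]]

# The slice 0:3 stops before position 3, so it picks the same columns
sliced = elections.iloc[[1, 2, 3], 0:3]

# A bare colon selects all rows
all_rows = elections.iloc[:, 0:3]

# A single column position returns a Series
column = elections.iloc[[1, 2, 3], 1]

# Row position 0 and the second column (position 1) give one value
value = elections.iloc[0, 1]
print(block)
```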

#### Context-Dependent Extraction Using `[]`

We could technically do anything we want using `loc` or `iloc`. However, in practice, the `[]` operator is often used instead to yield more concise code.

`[]` is a bit trickier to understand than `loc` or `iloc`, but it achieves essentially the same functionality. The difference is that `[]` is *context-dependent*.

`[]` only takes one argument, which may be:
1. A slice of row integers.
2. A list of column labels.
3. A single column label.


If we provide a slice of row numbers, we get the numbered rows.

In [45]:
# Provide a slice of row numbers

If we provide a list of column names, we get the listed columns.

In [46]:
# Provide a list of column names

And if we provide a single column name, we get back just that column, stored as a `Series`.

In [47]:
# Provide a single column name
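The three uses of `[]` can be sketched side by side on a made-up table. The rows are placeholders, and the `Year` and `Party` columns are assumptions.

```python
import pandas as pd

# Made-up stand-in for the elections data (not real results)
elections = pd.DataFrame({
    "Year": [2000, 2000, 2004, 2004],
    "Candidate": ["Candidate A", "Candidate B", "Candidate C", "Candidate D"],
    "Party": ["Party X", "Party Y", "Party X", "Party Y"],
    "Popular vote": [100, 90, 120, 130],
})

# A slice of row numbers: rows at positions 1 and 2 (the end is exclusive)
rows = elections[1:3]

# A list of column names: a DataFrame with just those columns
cols = elections[["Year", "Candidate"]]

# A single column name: that column as a Series
col = elections["Candidate"]
print(type(rows), type(cols), type(col))
```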