# Pandas II: Selection, Filtering and Dropping

Previously, we studied the three basic data structures in Pandas: `Series`, `DataFrame`, and `Index`. We also learned how to create these data structures from scratch. In this section, we will learn how to extract data from these structures. The two primary operations of data extraction are: 

1. **Selection**: Extracting a subset of columns.
2. **Filtering**: Extracting a subset of rows.

Finally, we will learn how to drop rows and columns in `pandas` as well as go over some useful methods.

### Extracting values from a Series

We can select a single value or a set of values from a Series. There are three primary ways to specify what to select:

1. A single index label
2. A list of index labels
3. A filtering condition

To demonstrate this, let’s define the Series `ser`:

In [1]:
import pandas as pd 
ser = pd.Series([4, -2, 0, 6], index = ["a", "b", "c", "d"])
ser

a    4
b   -2
c    0
d    6
dtype: int64

#### A Single Index Label

In [2]:
ser["a"]

4

#### A List of Index Labels

In [3]:
ser[["a", "c"]]

a    4
c    0
dtype: int64

#### A Filtering Condition

Perhaps the most interesting (and useful) method of selecting data from a Series is with a filtering condition.

First, we apply a boolean condition to the Series. This creates **a new Series of boolean values**.

In [4]:
ser > 0

a     True
b    False
c    False
d     True
dtype: bool

``` {image} file:///Users/fsultan/Downloads/datascience_ml/_build/html/_images/filter.png
:width: 50%
:align: center
```

We then use this boolean condition to index into our original Series. pandas will select only the entries in the original Series that satisfy the condition.

In [5]:
ser[ser > 0]

a    4
d    6
dtype: int64
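
The two steps can also be kept separate, which makes the mask easy to inspect or reuse. The following is a minimal sketch on the same `ser`; the names `mask`, `positives`, and `n_positive` are illustrative and not part of the original notes.

```python
import pandas as pd

ser = pd.Series([4, -2, 0, 6], index=["a", "b", "c", "d"])

# Step 1: the comparison produces a boolean Series that shares ser's index
mask = ser > 0

# Step 2: indexing with the mask keeps only the entries where mask is True
positives = ser[mask]

# Summing a boolean Series counts its True values
n_positive = int(mask.sum())

print(positives)
print(n_positive)
```

Because the mask is an ordinary Series, it can be stored in a variable and combined with other masks before it is used for indexing.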

## Extracting data from DataFrames

Now that we’ve learned how to create DataFrames, let’s dive more deeply into their capabilities.

The API (application programming interface) for the DataFrame class is enormous. In this section, we’ll discuss several methods of the DataFrame API that allow us to extract subsets of data.

The simplest way to manipulate a DataFrame is to extract a subset of rows and columns, known as **slicing**. We will do so with four primary methods of the DataFrame class:

`.head` and `.tail`

`.loc`

`.iloc`

`[]`


### Filtering rows with `.head` and `.tail`

The simplest scenario in which we want to extract data is when we simply want to select the first or last few rows of the DataFrame.

To extract the first n rows of a DataFrame `df`, we use the syntax `df.head(n)`.

In [6]:
import pandas as pd

url = "https://raw.githubusercontent.com/fahadsultan/datascience_ml/main/data/elections.csv"
elections = pd.read_csv(url)
# Extract the first 5 rows of the DataFrame
elections.head(5)

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789


Similarly, calling `df.tail(n)` allows us to extract the last n rows of the DataFrame.

In [7]:
# Extract the last 5 rows of the DataFrame
elections.tail(5)

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
177,2016,Jill Stein,Green,1457226,loss,1.073699
178,2020,Joseph Biden,Democratic,81268924,win,51.311515
179,2020,Donald Trump,Republican,74216154,loss,46.858542
180,2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979
181,2020,Howard Hawkins,Green,405035,loss,0.255731


### Filtering rows with `.iloc`

```{warning}
If you find yourself needing to use `.iloc` then stop and think if you are about to implement a loop. If so, there is probably a better way to do it.
```

Slicing with `.iloc` works similarly to `.loc`. However, `.iloc` uses the integer positions of rows and columns rather than their labels (think to yourself: **`l`**`oc` uses **l**abels; **`i`**`loc` uses **i**ndices). The arguments to `.iloc` also behave similarly: single values, lists, slices, and any combination of these are permitted.

Let’s see how position-based selection compares with label-based selection. We’ll begin by selecting the first presidential candidate in our `elections` DataFrame:

In [8]:
# Label-based equivalent: elections.loc[0, "Candidate"]
elections.iloc[0, 1]

'Andrew Jackson'

Notice how the first argument to both `.loc` and `.iloc` is the same. This is because the row with a label of 0 is conveniently in the 0th index (equivalently, the first position) of the `elections` DataFrame. Generally, this is true of any DataFrame where the row labels are incremented in ascending order from 0.

However, when we select the first four rows and columns using `.iloc`, we notice something.

In [9]:
# Label-based equivalent: elections.loc[0:3, 'Year':'Popular vote']
elections.iloc[0:4, 0:4]

Unnamed: 0,Year,Candidate,Party,Popular vote
0,1824,Andrew Jackson,Democratic-Republican,151271
1,1824,John Quincy Adams,Democratic-Republican,113142
2,1828,Andrew Jackson,Democratic,642806
3,1828,John Quincy Adams,National Republican,500897


Slicing is no longer inclusive in `.iloc`: it’s exclusive. In other words, the right end of a slice is not included when using `.iloc`, just as with ordinary Python list slicing. This is one of the subtleties of `pandas` syntax; you will get used to it with practice.



In [10]:
# Label-based equivalent: elections.loc[[0, 1, 2, 3], ['Year', 'Candidate', 'Party', 'Popular vote']]
elections.iloc[[0, 1, 2, 3], [0, 1, 2, 3]]

Unnamed: 0,Year,Candidate,Party,Popular vote
0,1824,Andrew Jackson,Democratic-Republican,151271
1,1824,John Quincy Adams,Democratic-Republican,113142
2,1828,Andrew Jackson,Democratic,642806
3,1828,John Quincy Adams,National Republican,500897


This discussion raises the question: when should we use `.loc` vs `.iloc`? In most cases, `.loc` is safer to use. `.iloc` may silently return different rows when applied to a dataset where the ordering of the data can change, whereas labels stay attached to their rows.
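
To make the difference concrete, here is a small sketch that reorders a few rows copied from the `elections` table. The DataFrame `df` and the variable names are illustrative stand-ins, not part of the original notes.

```python
import pandas as pd

# A few rows copied from the elections table
df = pd.DataFrame({
    "Year": [1824, 1824, 1828],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
    "Popular vote": [151271, 113142, 642806],
})

# Reorder the rows; each row keeps its original label
by_votes = df.sort_values("Popular vote", ascending=False)

# .loc finds the row *labeled* 0, wherever it now sits
label_based = by_votes.loc[0, "Popular vote"]

# .iloc finds whichever row is now in *position* 0
position_based = by_votes.iloc[0, 2]

print(label_based, position_based)
```

After sorting, the two lookups no longer agree: the label `0` still refers to the 1824 Andrew Jackson row, while position `0` now holds the row with the most votes.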

### Extracting with `.loc`

The `.loc` operator selects rows and columns in a DataFrame by their row and column label(s), respectively. The **row labels** (commonly referred to as the **indices**) are the bold text on the far _left_ of a DataFrame, while the **column labels** are the column names found at the _top_ of a DataFrame.

To grab data with `.loc`, we must specify the row and column label(s) where the data exists. The row labels are the first argument to `.loc`; the column labels are the second. For example, we can select the row labeled `0` and the column labeled `Candidate` from the `elections` DataFrame.

In [11]:
elections.loc[0, 'Candidate']

'Andrew Jackson'

To select _multiple_ rows and columns, we can use Python slice notation. Here, we select the rows from labels `0` to `3` and the columns from labels `"Year"` to `"Popular vote"`.



In [12]:
elections.loc[0:3, 'Year':'Popular vote']


Unnamed: 0,Year,Candidate,Party,Popular vote
0,1824,Andrew Jackson,Democratic-Republican,151271
1,1824,John Quincy Adams,Democratic-Republican,113142
2,1828,Andrew Jackson,Democratic,642806
3,1828,John Quincy Adams,National Republican,500897


Suppose that instead, we wanted _every_ column value for the first four rows in the `elections` DataFrame. The shorthand `:` is useful for this.

In [13]:
elections.loc[0:3, :]


Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073


There are a couple of things we should note. Firstly, unlike conventional Python sequences, `pandas` lets us slice by label, including string labels (in our example, the column labels). Secondly, slicing with `.loc` is _inclusive_. Notice how our resulting DataFrame includes every row and column between and including the slice labels we specified.

Equivalently, we can use a list to obtain multiple rows and columns in our `elections` DataFrame.

In [14]:
elections.loc[[0, 1, 2, 3], ['Year', 'Candidate', 'Party', 'Popular vote']]


Unnamed: 0,Year,Candidate,Party,Popular vote
0,1824,Andrew Jackson,Democratic-Republican,151271
1,1824,John Quincy Adams,Democratic-Republican,113142
2,1828,Andrew Jackson,Democratic,642806
3,1828,John Quincy Adams,National Republican,500897


Lastly, we can interchange list and slicing notation.

In [15]:
elections.loc[[0, 1, 2, 3], :]

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
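
The inclusive behavior of `.loc` slices and the exclusive behavior of `.iloc` slices are easy to mix up, so a side-by-side comparison may help. This is a minimal sketch on a four-row stand-in for `elections`; `df`, `by_label`, and `by_position` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1824, 1824, 1828, 1828],
    "Candidate": ["Andrew Jackson", "John Quincy Adams",
                  "Andrew Jackson", "John Quincy Adams"],
    "Party": ["Democratic-Republican", "Democratic-Republican",
              "Democratic", "National Republican"],
    "Popular vote": [151271, 113142, 642806, 500897],
})

# Label slices include both endpoints: rows 0, 1, 2 and columns Year through Party
by_label = df.loc[0:2, "Year":"Party"]

# Position slices exclude the right endpoint: rows 0, 1 and columns 0, 1
by_position = df.iloc[0:2, 0:2]

print(by_label.shape)
print(by_position.shape)
```

The same numbers `0:2` therefore select three rows with `.loc` but only two rows with `.iloc`.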


### Extracting with `[]`

The `[]` selection operator is the most baffling of all, yet the most commonly used. It only takes a single argument, which may be one of the following:

1. A slice of row numbers
2. A list of column labels
3. A single column label

That is, `[]` is _context dependent_. Let’s see some examples.

#### A slice of row numbers

Say we wanted the first four rows of our `elections` DataFrame.

In [16]:
elections[0:4]

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073


#### A list of column labels

Suppose we now want the first four columns.

In [17]:
elections[["Year", "Candidate", "Party", "Popular vote"]]

Unnamed: 0,Year,Candidate,Party,Popular vote
0,1824,Andrew Jackson,Democratic-Republican,151271
1,1824,John Quincy Adams,Democratic-Republican,113142
2,1828,Andrew Jackson,Democratic,642806
3,1828,John Quincy Adams,National Republican,500897
4,1832,Andrew Jackson,Democratic,702735
...,...,...,...,...
177,2016,Jill Stein,Green,1457226
178,2020,Joseph Biden,Democratic,81268924
179,2020,Donald Trump,Republican,74216154
180,2020,Jo Jorgensen,Libertarian,1865724


#### A single column label

Lastly, `[ ]` allows us to extract only the `Candidate` column.

In [18]:
elections["Candidate"]

0         Andrew Jackson
1      John Quincy Adams
2         Andrew Jackson
3      John Quincy Adams
4         Andrew Jackson
             ...        
177           Jill Stein
178         Joseph Biden
179         Donald Trump
180         Jo Jorgensen
181       Howard Hawkins
Name: Candidate, Length: 182, dtype: object

The output is a Series! In this course, we’ll become very comfortable with `[]`, especially for selecting columns. In practice, `[]` is much more common than `.loc`.
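
Whether `[]` returns a `Series` or a `DataFrame` depends on whether it receives a single label or a list of labels, even if that list has only one element. The sketch below illustrates this on a small stand-in DataFrame (`df` is illustrative, not the full `elections` table).

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1824, 1824, 1828],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
})

# A single label returns a one-dimensional Series
single = df["Candidate"]

# A list of labels returns a DataFrame, even with just one label in the list
as_frame = df[["Candidate"]]

print(type(single).__name__, single.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The double brackets are not special syntax: the outer pair is the `[]` operator and the inner pair is an ordinary Python list.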



## Conditional Extraction

Conditional selection allows us to select a subset of rows in a `DataFrame` if they follow some specified condition.

To understand how to use conditional selection, we must look at another possible input to `.loc` and `[]`: a boolean array, which is simply an array or `Series` where each element is either `True` or `False`. This boolean array must have a length equal to the number of rows in the `DataFrame`. The selection returns all rows that correspond to a value of `True` in the array. We used a very similar technique when performing conditional extraction from a `Series` earlier in this section.

To see this in action, let’s select all even-indexed rows in the first 10 rows of our `DataFrame`.

In [19]:
# Ask yourself: why is :9 the correct slice to select the first 10 rows?
elections_first_10_rows = elections.loc[:9, :]

# Notice how we have exactly 10 elements in our boolean array argument
elections_first_10_rows[[True, False, True, False, True, False, True, False, True, False]]

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
6,1832,William Wirt,Anti-Masonic,100715,loss,7.821583
8,1836,Martin Van Buren,Democratic,763291,win,52.272472


In [20]:
elections_first_10_rows.loc[[True, False, True, False, True, False, True, False, True, False], :]

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
6,1832,William Wirt,Anti-Masonic,100715,loss,7.821583
8,1836,Martin Van Buren,Democratic,763291,win,52.272472


These techniques worked well in this example, but you can imagine how tedious it might be to list out `True`s and `False`s for every row in a larger `DataFrame`. To make things easier, we can instead provide a logical condition as an input to `.loc` or `[]` that returns a boolean array with the necessary length.

For example, to return all rows for candidates who won their election:

In [21]:
# First, use a logical condition to generate a boolean array
logical_operator = (elections["Result"] == "win")

# Then, use this boolean array to filter the DataFrame
elections[logical_operator].head()

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
8,1836,Martin Van Buren,Democratic,763291,win,52.272472
11,1840,William Henry Harrison,Whig,1275583,win,53.051213


As we saw above, `.head()` returns only the first few rows of a `DataFrame`. In reality, `elections[logical_operator]` contains as many rows as there are entries in the original `elections` `DataFrame` with `Result == "win"`.

Here, `logical_operator` evaluates to a `Series` of boolean values with length 182, one per row of `elections`:

In [22]:
len(logical_operator)

182

Rows where `Result` equals `"win"` evaluate to `True` and are thus returned in the filtered `DataFrame`. All other rows evaluate to `False` and are omitted from the output. We can spot-check a few entries:

In [23]:
print("The 0th item in this 'logical_operator' is: {}".format(logical_operator.iloc[0]))
print("The 100-th item in this 'logical_operator' is: {}".format(logical_operator.iloc[100]))
print("The 181-th item in this 'logical_operator' is: {}".format(logical_operator.iloc[181]))

The 0th item in this 'logical_operator' is: False
The 100-th item in this 'logical_operator' is: True
The 181-th item in this 'logical_operator' is: False


Passing a `Series` as an argument to `elections[]` has the same effect as using a boolean array. In fact, the `[]` selection operator can take a boolean `Series`, array, or list as its argument. These three are used interchangeably throughout the course.
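
As a quick check of this interchangeability, the sketch below builds the same mask in all three forms and confirms that each selects the same rows. The small `df` is an illustrative stand-in for `elections`.

```python
import pandas as pd

df = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams",
                  "Andrew Jackson", "John Quincy Adams"],
    "Result": ["loss", "win", "win", "loss"],
})

# The same mask as a boolean Series, a NumPy array, and a plain Python list
mask_series = df["Result"] == "win"
mask_array = mask_series.to_numpy()
mask_list = mask_series.tolist()

from_series = df[mask_series]
from_array = df[mask_array]
from_list = df[mask_list]

print(from_series.equals(from_array), from_series.equals(from_list))
```

One difference worth keeping in mind: a boolean `Series` is matched to the `DataFrame` by index label, whereas an array or list is matched purely by position.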

We can also use `.loc` to achieve similar results.

In [24]:
elections.loc[elections["Result"] == "win"].head()

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
8,1836,Martin Van Buren,Democratic,763291,win,52.272472
11,1840,William Henry Harrison,Whig,1275583,win,53.051213


Boolean conditions can be combined using various bitwise operators that allow us to filter results by multiple conditions.

Symbol | Usage | Meaning
------ | ----- | -------
`~` | `~p` | Returns negation of p
`\|` | `p \| q` | p OR q
`&` | `p & q` | p AND q
`^` | `p ^ q` | p XOR q (exclusive or)

When combining multiple conditions with these operators, we surround each individual condition with a set of parentheses `()`. In Python, `&`, `|`, and `^` bind more tightly than comparison operators such as `==` and `>`, so without parentheses the expression is grouped in an unintended way and typically raises an error.
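
The sketch below shows what goes wrong when the parentheses are left out. It uses a tiny illustrative DataFrame with a single `Year` column; the variable names are not part of the original notes.

```python
import pandas as pd

df = pd.DataFrame({"Year": [1996, 2004, 2012, 2020]})

# With parentheses: each comparison is evaluated first, then combined with &
recent = df[(df["Year"] > 2000) & (df["Year"] < 2016)]

# Without parentheses, & is applied before > and <, so Python reads this as
# the chained comparison  df["Year"] > (2000 & df["Year"]) < 2016,
# which requires the truth value of a whole Series and fails
try:
    df[df["Year"] > 2000 & df["Year"] < 2016]
    raised = False
except ValueError:
    raised = True

print(recent["Year"].tolist())
print(raised)
```

Other combinations of types can fail with a different exception, but the cause is the same: the bitwise operator grabbed its operands before the comparisons were evaluated.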

For example, if we want to return data on all winning candidates from elections after the year 2000, we can write:

In [25]:
elections[(elections["Result"] == "win") & (elections["Year"] > 2000)].head()

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
157,2004,George W. Bush,Republican,62040610,win,50.771824
162,2008,Barack Obama,Democratic,69498516,win,53.02351
168,2012,Barack Obama,Democratic,65915795,win,51.258484
173,2016,Donald Trump,Republican,62984828,win,46.407862
178,2020,Joseph Biden,Democratic,81268924,win,51.311515


Boolean array selection is a useful tool, but can lead to overly verbose code for complex conditions. In the example below, our boolean condition is long enough to extend for several lines of code.

In [26]:
elections

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
...,...,...,...,...,...,...
177,2016,Jill Stein,Green,1457226,loss,1.073699
178,2020,Joseph Biden,Democratic,81268924,win,51.311515
179,2020,Donald Trump,Republican,74216154,loss,46.858542
180,2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979


In [27]:
# Note: The parentheses surrounding the code make it possible to break the code on to multiple lines for readability
(
    elections[(elections["Party"] == "Green") | 
              (elections["Party"] == "Libertarian")]
)

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
125,1976,Roger MacBride,Libertarian,172557,loss,0.212451
128,1980,Ed Clark,Libertarian,921128,loss,1.067883
132,1984,David Bergland,Libertarian,228111,loss,0.247245
138,1988,Ron Paul,Libertarian,431750,loss,0.47266
139,1992,Andre Marrou,Libertarian,290087,loss,0.278516
146,1996,Harry Browne,Libertarian,485759,loss,0.505198
149,1996,Ralph Nader,Green,685297,loss,0.712721
153,2000,Harry Browne,Libertarian,384431,loss,0.365525
155,2000,Ralph Nader,Green,2882955,loss,2.741176
156,2004,David Cobb,Green,119859,loss,0.098088


Fortunately, `pandas` provides many alternative methods for constructing boolean filters.
 
The `.isin` method is one such example. It evaluates whether each value in a `Series` is contained in a different sequence (list, array, or `Series`) of values, which replaces a chain of `==` comparisons joined with `|`. In the cell below, we use it to select every candidate from the two major parties with far more concise code.


In [28]:
names = ["Democratic", "Republican"]
elections[elections["Party"].isin(names)]

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
8,1836,Martin Van Buren,Democratic,763291,win,52.272472
10,1840,Martin Van Buren,Democratic,1128854,loss,46.948787
13,1844,James Polk,Democratic,1339570,win,50.749477
...,...,...,...,...,...,...
171,2012,Mitt Romney,Republican,60933504,loss,47.384076
173,2016,Donald Trump,Republican,62984828,win,46.407862
176,2016,Hillary Clinton,Democratic,65853514,loss,48.521539
178,2020,Joseph Biden,Democratic,81268924,win,51.311515


The method `.str.startswith` can be used to define a filter based on string values in a `Series` object. It checks whether each string value in a `Series` starts with a particular prefix.

In [29]:
# Find the candidates whose names begin with the letter "M"
elections[elections["Candidate"].str.startswith("M")]

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
8,1836,Martin Van Buren,Democratic,763291,win,52.272472
10,1840,Martin Van Buren,Democratic,1128854,loss,46.948787
15,1848,Martin Van Buren,Free Soil,291501,loss,10.138474
22,1856,Millard Fillmore,American,873053,loss,21.554001
137,1988,Michael Dukakis,Democratic,41809074,loss,45.770691
159,2004,Michael Badnarik,Libertarian,397265,loss,0.325108
160,2004,Michael Peroutka,Constitution,143630,loss,0.117542
171,2012,Mitt Romney,Republican,60933504,loss,47.384076


## Adding, Removing, and Modifying Columns

In many data science tasks, we may need to change the columns contained in our `DataFrame` in some way. Fortunately, the syntax to do so is fairly straightforward.

To add a new column to a `DataFrame`, we use a syntax similar to that used when accessing an existing column. Specify the name of the new column by writing `df["column"]`, then assign this to a `Series` or array containing the values that will populate this column.


In [30]:
# Create a Series of the length of each name. We'll discuss `str` methods next week.
candidate_name_lengths = elections["Candidate"].str.len()

# Add a column named "name_lengths" that includes the length of each name
elections["name_lengths"] = candidate_name_lengths
elections.head(5)

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%,name_lengths
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122,14
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878,17
2,1828,Andrew Jackson,Democratic,642806,win,56.203927,14
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073,17
4,1832,Andrew Jackson,Democratic,702735,win,54.574789,14


If we need to later modify an existing column, we can do so by referencing this column again with the syntax `df["column"]`, then re-assigning it to a new `Series` or array.

In [31]:
# Modify the “name_lengths” column to be one less than its original value
elections["name_lengths"] = elections["name_lengths"]-1
elections.head()

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%,name_lengths
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122,13
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878,16
2,1828,Andrew Jackson,Democratic,642806,win,56.203927,13
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073,16
4,1832,Andrew Jackson,Democratic,702735,win,54.574789,13


We can rename a column using the `.rename()` method. `.rename()` takes in a dictionary that maps old column names to their new ones.

In [32]:
# Rename “name_lengths” to “Length”
elections = elections.rename(columns={"name_lengths":"Length"})
elections.head()

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%,Length
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122,13
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878,16
2,1828,Andrew Jackson,Democratic,642806,win,56.203927,13
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073,16
4,1832,Andrew Jackson,Democratic,702735,win,54.574789,13


If we want to remove a column or row of a `DataFrame`, we can call the [`.drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) method. Use the `axis` parameter to specify whether a column or row should be dropped. By default, `pandas` assumes that we are dropping a row.

In [33]:
# Drop our new "Length" column from the DataFrame
elections = elections.drop("Length", axis="columns")
elections.head(5)

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789


Notice that we reassigned `elections` to the result of `elections.drop(...)`. This is a subtle but important point: by default, `pandas` table operations such as `.drop` **do not occur in-place**. Calling `df.drop(...)` will output a *copy* of `df` with the row/column of interest removed, without modifying the original `df` table.

In other words, if we simply call:

In [35]:
# This creates a copy of `elections` and removes the column "Candidate"...
elections.drop("Candidate", axis="columns")

# ...but the original `elections` is unchanged! 
# Notice that the "Candidate" column is still present
elections.head(5)

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
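
The examples above drop columns; dropping rows works the same way, using row labels instead of column names. Here is a minimal sketch on a three-row stand-in for `elections` (`df` and `without_rows` are illustrative names).

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1824, 1824, 1828],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
    "Result": ["loss", "win", "win"],
})

# Drop the rows labeled 0 and 2; axis="index" is the default, spelled out for clarity
without_rows = df.drop([0, 2], axis="index")

# df itself is untouched because .drop returned a new DataFrame
print(len(df), len(without_rows))
print(without_rows)
```

As with columns, the result must be assigned to a variable if we want to keep it.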


## Handy Utility Functions

`pandas` contains an extensive library of functions that can help shorten the process of setting and getting information from its data structures. In the following section, we will give an overview of each of the main utility functions that will help us in this course.

Discussing all functionality offered by `pandas` could take an entire semester! We will walk you through the most commonly-used functions, and encourage you to explore and experiment on your own. 

- `.shape`
- `.size`
- `.dtypes`
- `.astype()`
- `.describe()`
- `.sample()`
- `.value_counts()`
- `.unique()`
- `.sort_values()`
- `.apply()`

The `pandas` [documentation](https://pandas.pydata.org/docs/reference/index.html) will be a valuable resource.

### `.shape` and `.size`

`.shape` and `.size` are attributes of `Series` and `DataFrame`s that measure the "amount" of data stored in the structure. `.shape` is a tuple of dimensions: `(num_rows, num_columns)` for a `DataFrame`, or `(num_rows,)` for a `Series`. `.size` is the total number of elements in a structure; for a `DataFrame`, this equals the number of rows times the number of columns.

Many functions strictly require the dimensions of their arguments to match along certain axes. Checking these attributes is much faster than counting all of the items by hand.


In [36]:
# Return the shape of the DataFrame, in the format (num_rows, num_columns)
elections.shape

(182, 6)

In [37]:
# Return the size of the DataFrame, equal to num_rows * num_columns
elections.size

1092
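
The difference between the `.shape` of a `DataFrame` and that of a `Series` is easiest to see side by side. This is a small sketch with an illustrative three-row, two-column `df`.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1824, 1824, 1828],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
})

frame_shape = df.shape            # two dimensions: rows and columns
series_shape = df["Year"].shape   # one dimension: rows only

print(frame_shape, df.size)
print(series_shape, df["Year"].size)
```

Note that both are attributes rather than methods, so they are written without parentheses.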

### `.dtypes`

`.dtypes` returns a Series with the data type of each column.
The result's index is the original DataFrame's columns. Columns with mixed types are stored with the `object` dtype.



In [47]:
elections.dtypes

Year              int64
Candidate        object
Party            object
Popular vote      int64
Result           object
%               float64
dtype: object

### `.astype()`

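`.astype()` casts a `pandas` object to a specified dtype. Note that casting floats to `int` truncates the fractional part rather than rounding, as the output below shows.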

In [50]:
elections['%'].astype(int)

0      57
1      42
2      56
3      43
4      54
       ..
177     1
178    51
179    46
180     1
181     0
Name: %, Length: 182, dtype: int64
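
If rounded percentages are wanted instead of truncated ones, one option is to call `.round()` before casting. The sketch below compares the two on three values taken from the `%` column; the Series `pct` is an illustrative stand-in.

```python
import pandas as pd

pct = pd.Series([57.210122, 42.789878, 0.255731], name="%")

# Casting directly discards everything after the decimal point
truncated = pct.astype(int)

# Rounding first moves each value to the nearest whole number before the cast
rounded = pct.round().astype(int)

print(truncated.tolist())
print(rounded.tolist())
```

The second value shows the difference: 42.789878 becomes 42 when truncated but 43 when rounded.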

### `.describe()`

If many statistics are required from a `DataFrame` (minimum value, maximum value, mean value, etc.), then [`.describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) can be used to compute all of them at once. 


In [38]:
elections.describe()

Unnamed: 0,Year,Popular vote,%
count,182.0,182.0,182.0
mean,1934.087912,12353640.0,27.47035
std,57.048908,19077150.0,22.968034
min,1824.0,100715.0,0.098088
25%,1889.0,387639.5,1.219996
50%,1936.0,1709375.0,37.677893
75%,1988.0,18977750.0,48.354977
max,2020.0,81268920.0,61.344703


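When `.describe()` is called on a `Series`, the statistics reported depend on its dtype: a non-numeric `Series` reports the count, the number of unique values, the most common value, and its frequency, while a numeric `Series` reports the same statistics as above.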

In [40]:
elections["Party"].describe()

count            182
unique            36
top       Democratic
freq              47
Name: Party, dtype: object

In [43]:
elections["Popular vote"].describe().astype(int)

count         182
mean     12353635
std      19077149
min        100715
25%        387639
50%       1709375
75%      18977751
max      81268924
Name: Popular vote, dtype: int64

### `.sample()`

As we will see later in the semester, random processes are at the heart of many data science techniques (for example, train-test splits, bootstrapping, and cross-validation). [`.sample()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) lets us quickly select random entries (a row if called from a DataFrame, or a value if called from a Series).

By default, `.sample()` selects entries *without* replacement. Pass in the argument `replace=True` to sample with replacement.

In [51]:
# Sample a single row
elections.sample()

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
135,1988,George H. W. Bush,Republican,48886597,win,53.518845


In [52]:
# Sample 5 random rows
elections.sample(5)

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
155,2000,Ralph Nader,Green,2882955,loss,2.741176
134,1984,Walter Mondale,Democratic,37577352,loss,40.729429
39,1884,Grover Cleveland,Democratic,4914482,win,48.884933
84,1928,Herbert Hoover,Republican,21427123,win,58.368524
177,2016,Jill Stein,Green,1457226,loss,1.073699


In [53]:
# Randomly sample 4 winning candidates, with replacement
elections[elections["Result"] == "win"].sample(4, replace = True)

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
53,1896,William McKinley,Republican,7112138,win,51.213817
131,1980,Ronald Reagan,Republican,43903230,win,50.897944
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
168,2012,Barack Obama,Democratic,65915795,win,51.258484


### `.value_counts()`

The [`Series.value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) method counts the number of occurrences of each unique value in a `Series`. In other words, it *counts* the number of times each unique *value* appears. This is often useful for determining the most or least common entries in a `Series`.

In the example below, we can determine which parties have fielded the most candidates by counting the number of times each party appears in the `"Party"` column of `elections`.

In [57]:
elections["Party"].value_counts().head()

Democratic     47
Republican     41
Libertarian    12
Prohibition    11
Socialist      10
Name: Party, dtype: int64
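
When proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. The sketch below uses a short illustrative Series of party names rather than the full `Party` column.

```python
import pandas as pd

parties = pd.Series(["Democratic", "Republican", "Democratic", "Green"], name="Party")

# Raw counts, sorted from most to least common
counts = parties.value_counts()

# Fractions of the total instead of counts
shares = parties.value_counts(normalize=True)

print(counts)
print(shares)
```

The normalized values always sum to 1, which makes them convenient for comparing columns of different lengths.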

### `.unique()`

If we have a Series with many repeated values, then [`.unique()`](https://pandas.pydata.org/docs/reference/api/pandas.unique.html) can be used to identify only the *unique* values. Here we return an array of all the distinct values in the `"Result"` column of `elections`.

In [58]:
elections["Result"].unique()

array(['loss', 'win'], dtype=object)

### `.sort_values()`

Ordering a `DataFrame` can be useful for isolating extreme values. For example, the first 5 entries of a column sorted in descending order (that is, from highest to lowest) are the largest 5 values. [`.sort_values`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) allows us to order a `DataFrame` or `Series` by a specified column. Rows are returned in ascending order by default; pass `ascending=False` for descending order.

In [59]:
# Sort by the "%" column from highest to lowest
elections.sort_values(by = "%", ascending=False).head()

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
114,1964,Lyndon Johnson,Democratic,43127041,win,61.344703
91,1936,Franklin Roosevelt,Democratic,27752648,win,60.978107
120,1972,Richard Nixon,Republican,47168710,win,60.907806
79,1920,Warren Harding,Republican,16144093,win,60.574501
133,1984,Ronald Reagan,Republican,54455472,win,59.023326


We do not need to explicitly specify the column used for sorting when calling `.sort_values()` on a `Series`. We can still specify the ordering: whether values are sorted in ascending or descending order.

In [60]:
# Sort the "Candidate" Series alphabetically
elections["Candidate"].sort_values(ascending=True).head()

75     Aaron S. Watkins
27      Abraham Lincoln
23      Abraham Lincoln
108     Adlai Stevenson
105     Adlai Stevenson
Name: Candidate, dtype: object
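
`.sort_values` can also sort a `DataFrame` by more than one column by passing lists to `by` and `ascending`. The sketch below orders a small illustrative table by year, and within each year by vote share from highest to lowest.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1828, 1824, 1828, 1824],
    "Candidate": ["John Quincy Adams", "Andrew Jackson",
                  "Andrew Jackson", "John Quincy Adams"],
    "%": [43.796073, 57.210122, 56.203927, 42.789878],
})

# Earlier years first; within a year, the larger vote share first
ordered = df.sort_values(by=["Year", "%"], ascending=[True, False])

print(ordered)
```

The original row labels are preserved in the output, which makes it easy to trace each sorted row back to its source.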

### `.apply`

A frequent operation in `pandas` is applying a function to each column or each row of a DataFrame.

DataFrame’s `apply` method does exactly this. 

```{image} https://fahadsultan.com/datascience_ml/_images/vectorized2.png
:width: 30% 
:align: center
```


Let's say we wanted to count the number of unique values that each column takes on. We can use `.apply` to answer that question: 

In [61]:
def count_unique(col):
    return len(set(col))

elections.apply(count_unique, axis="index") # function is passed an individual column

Year             50
Candidate       132
Party            36
Popular vote    182
Result            2
%               182
dtype: int64

```{note}
* `axis="index"` passes each COLUMN to the function, one at a time (default behavior)
* `axis="columns"` passes each ROW to the function, one at a time
```

Similarly, let's say we wanted to count the total number of voters in an election. 

We can use `.apply` to answer that question using the following formula: 

$\text{total} \times \frac{\%}{100} = \text{Popular vote}$, which rearranges to $\text{total} = \text{Popular vote} \times \frac{100}{\%}$.

In [62]:
def compute_total(row):
    return int(row['Popular vote']*100/row['%'])

elections.apply(compute_total, axis="columns") # function is passed an individual row

0         264413
1         264412
2        1143702
3        1143703
4        1287655
         ...    
177    135720167
178    158383403
179    158383403
180    158383401
181    158383402
Length: 182, dtype: int64
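
For simple arithmetic like this, the same result can usually be computed directly on whole columns, without `.apply`. The sketch below compares both approaches on three rows taken from `elections`; `df`, `with_apply`, and `vectorized` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Popular vote": [151271, 113142, 642806],
    "%": [57.210122, 42.789878, 56.203927],
})

def compute_total(row):
    # Same formula as above, evaluated one row at a time
    return int(row["Popular vote"] * 100 / row["%"])

# Row-by-row: the function is called once per row
with_apply = df.apply(compute_total, axis="columns")

# Column arithmetic: the formula is evaluated on entire columns at once
vectorized = (df["Popular vote"] * 100 / df["%"]).astype(int)

print(with_apply.tolist())
print(vectorized.tolist())
```

Column arithmetic is typically faster on large tables, so `.apply` with `axis="columns"` is best kept for logic that cannot easily be written as operations on whole columns.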